This is a collection of my open source contributions to scikit-learn, a Python module for machine learning. Its code base is maintained on GitHub by over 2,500 contributors.
I am a core developer of scikit-learn and a member of its documentation team. I have contributed 88 merged pull requests, ranking #43 among its contributors. Note that throughout this post, statements about versions do not take backporting into consideration. For instance, "a bug existed in scikit-learn 1.3.1" and "fixed in scikit-learn 1.3.2" only mean that the bug was fixed after the release of scikit-learn 1.3.1; they do not guarantee that one would still see the bug with scikit-learn 1.3.1 today, since the fix for scikit-learn 1.3.2 may have been backported to earlier versions, especially the 1.3.x series.
Code Contributions
Items in each section are sorted in reverse chronological order by the time of merge.
Cluster
FIX AffinityPropagation assigning multiple clusters for equal points #28121
In scikit-learn 1.4.0, cluster.AffinityPropagation assigned different clusters to completely equal points. For instance,
I made a simple fix to assign the same cluster to equal points, so from scikit-learn 1.4.1 the previous example produces all zeros.
Cross Decomposition
ENH ravel prediction of PLSRegression when fitted on 1d y #26602
In scikit-learn 1.3.x, cross_decomposition.PLSRegression always predicted a 2D result, regardless of whether the input y was 1D or 2D. This was inconsistent with other regressors such as linear_model.LinearRegression and linear_model.Ridge. For instance,
I made a simple fix to determine whether to automatically ravel the output based on the shape of the input y. From scikit-learn 1.4.0, PLSRegression behaves consistently with other regressors, returning a 1D prediction when fitted with a 1D y.
This function is written in Cython for efficiency, but the signature of X in one of the helper functions was int_or_float[:, :] X, i.e., a mutable Cython memoryview, which rejects read-only input. I made a simple fix by adding const to the signature. From scikit-learn 1.4.1, datasets.dump_svmlight_file works correctly with read-only X.
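A quick sketch of the fixed behavior, writing a read-only array to an in-memory buffer (the data is made up for illustration):

```python
import numpy as np
from io import BytesIO
from sklearn.datasets import dump_svmlight_file

X = np.array([[1.0, 0.0], [0.0, 2.0]])
X.setflags(write=False)  # simulate read-only data, e.g. a memory-mapped array
y = np.array([0, 1])

buf = BytesIO()
dump_svmlight_file(X, y, buf)  # raised an error on read-only X before 1.4.1
print(buf.getvalue().decode())
```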
ENH make_sparse_spd_matrix use sparse memory layout #27438
In scikit-learn 1.3.x, the function datasets.make_sparse_spd_matrix used dense numpy arrays throughout the computation and output a dense array, even though the result is sparse (the sparsity increases with the parameter alpha). This was inefficient in terms of both memory usage and computational cost. I refactored the code to use a sparse memory layout from the very beginning, greatly reducing memory consumption, especially when alpha is large. I also added a new keyword argument sparse_format which allows outputting a sparse matrix directly. By default, sparse_format is None, meaning that the output is converted to a dense numpy array; otherwise it should be a string such as "csr" or "csc", specifying the sparse format to return. This improvement is included from scikit-learn 1.4.0.
FIX KernelPCA inverse transform when gamma is not given #26337
In scikit-learn 1.2.x, decomposition.KernelPCA could produce incorrect results through the inverse_transform method when gamma=None. With gamma=None, the value of gamma should be automatically chosen as 1/n_features, where n_features is the number of features of X, so gamma=None and gamma=1/n_features should theoretically give the same results, which was not the case in scikit-learn 1.2.x. For instance,
The reason was that when gamma=None, the value 1/n_features was recomputed at inverse transform time, where n_features could differ from its value at fit time, leading to different gamma values for the kernel and thus different results. When gamma is explicitly set, the value is always consistent. I made a simple fix to ensure that n_features at fit time is used consistently, so from scikit-learn 1.3.0, kpca1 and kpca2 produce the same results.
Ensemble
FIX Remove spurious feature names warning in IsolationForest #25931
In scikit-learn 1.2.x, when fitting an ensemble.IsolationForest on a pandas DataFrame with contamination not set to "auto", there would be a spurious warning about missing feature names. For instance,
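A minimal reproduction sketch with a made-up DataFrame; from scikit-learn 1.3.0, no feature-names warning is emitted:

```python
import warnings
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.DataFrame({"a": [-1.1, 0.3, 0.5, 100.0], "b": [0.1, 0.2, 0.3, 0.4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    IsolationForest(contamination=0.1, random_state=0).fit(X)

# In 1.2.x this listed a spurious warning that X does not have valid feature names
print([str(w.message) for w in caught])
```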
The reason was that the fit method was called with a DataFrame, so feature names were seen at fit time; but at the end of fit, when contamination is not "auto", the score_samples method is called on input data X that has already been validated and converted into a numpy array, hence the warning. I created a private method _score_samples that skips validation and is called at the end of fit, so the warning goes away. This fix is included from scikit-learn 1.3.0.
Feature Selection
FIX mutual_info_regression when X is of integer dtype #26748
In scikit-learn 1.3.x, feature_selection.mutual_info_regression would return inaccurate or incorrect results when the input data X is of integer dtype. As an illustration, we compare its output on two inputs that differ only in their dtypes:
The reason was that continuous features of the input data were scaled and assigned back to the input data X, but if X is of integer dtype, such an in-place assignment forces the scaled values to be rounded, largely losing precision. I made a simple fix to convert X to np.float64 dtype in advance, so from scikit-learn 1.4.0, res and res_float are the same.
FIX SequentialFeatureSelector throws IndexError when cv is a generator #25973
When fitting a feature_selection.SequentialFeatureSelector, the parameter cv should accept any iterable according to the documentation. However in scikit-learn 1.2.x, a confusing IndexError would occur when cv is passed as a generator. For instance,
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.feature_selection import SequentialFeatureSelector
>>> from sklearn.model_selection import KFold
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X, y = make_classification(random_state=0)
>>> knc = KNeighborsClassifier(n_neighbors=5)
>>> cv = KFold().split(X, y)  # This is a generator
>>> sfs = SequentialFeatureSelector(knc, n_features_to_select=5, cv=cv)
>>> sfs.fit(X, y)
IndexError: list index out of range
This was because cv was passed around and consumed multiple times, so the generator was exhausted after the first use. I made a simple fix by calling model_selection.check_cv, which converts the generator into a list that is safe to pass around. This fix is included from scikit-learn 1.3.0.
Metrics
ENH PrecisionRecallDisplay add option to plot chance level #26019
PrecisionRecallDisplay plots the Precision-Recall (PR) curve. The PR curve is constructed by plotting the precision against the recall, where precision is the proportion of true positives among all predicted positives, and recall is the proportion of true positives among all actual positives. The baseline, called the chance level, is the PR curve of a predictor that predicts all examples as positive. Therefore, I added a keyword plot_chance_level for plotting the chance level. I also provided a keyword chance_level_kw that accepts a dictionary of matplotlib keywords for customizing the rendering of the chance level line.
ENH RocCurveDisplay add option to plot chance level #25987
RocCurveDisplay plots the Receiver Operating Characteristic (ROC) curve, which measures the diagnostic ability of binary classifiers. The ROC curve is constructed by plotting the true positive rate against the false positive rate. The baseline, called the chance level, is the diagonal; the farther the ROC curve is from the baseline, the better the classifier performs. Therefore, I added a keyword plot_chance_level for plotting the chance level. I also provided a keyword chance_level_kw that accepts a dictionary of matplotlib keywords for customizing the rendering of the chance level line.
Neighbors
FIX KNeighborsClassifier raise when all neighbors of some sample have zero weights #26410
In scikit-learn 1.4.0, neighbors.KNeighborsClassifier behaved unreasonably when the weights of all neighbors of some sample were zero. This is possible when using a user-defined weight function; for instance, all neighbors of a sample could be fairly far away, and the weight function could assign zero weight to all points beyond a certain threshold. For simplicity of illustration, consider the following example where we directly assign zero weights to all points:
As shown in the example above, the predict method predicted the first class and the predict_proba method returned all-zero probabilities. After discussions, maintainers were convinced that it was worth breaking backward compatibility to raise an error in this case, instead of returning essentially arbitrary results that hide potential bugs. Therefore, I made a simple fix to check whether there is any all-zero row (i.e., a sample whose neighbors all have zero weight). To avoid creating large temporary boolean arrays in memory during this check, I also implemented a small Cython helper. From scikit-learn 1.4.1, the aforementioned ill case leads to an informative error message with a suggestion to fix the problem:
FIX PowerTransformer raise when "box-cox" has nan column #26400
In scikit-learn 1.2.x, preprocessing.PowerTransformer would raise a confusing error when using the box-cox transformation on data containing an all-nan column. As an illustration,
This was because scikit-learn internally used scipy.stats.boxcox, which returns an empty array when the input array is empty (all nan values are masked out), making the result impossible to unpack. After discussions, maintainers agreed that it is best to raise a more informative error in this case rather than letting it pass. Therefore, I made a fix to check the input array in advance, and the following error message is raised from scikit-learn 1.3.0:
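A minimal reproduction sketch with made-up data; the informative error text comes from the library, so it is printed rather than quoted here:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0, np.nan], [2.0, np.nan], [3.0, np.nan]])  # second column is all nan

try:
    PowerTransformer(method="box-cox").fit(X)  # informative ValueError from 1.3.0
except ValueError as e:
    print(e)
```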
FIX export_text and export_graphviz accepts feature and class names as array-like #26289
In scikit-learn 1.2.x, tree.export_text and tree.export_graphviz only accepted feature_names and class_names as lists of strings; passing an array-like raised a confusing error. For instance,
However, under many circumstances users get feature names and class names from arrays or dataframes, so I extended support to all array-like inputs for feature_names and class_names. From scikit-learn 1.3.0, the above example works directly without conversion in advance.
FIX improve error message in check_array when getting a Series and expecting a 2D container #28090
In scikit-learn 1.4.0, when calling utils.check_array with a Series-like object (e.g., pandas Series or polars Series) and expecting a 2D container, the error message would be confusing. For instance,
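A minimal sketch of the triggering call with a made-up pandas Series; the error text comes from the library, so it is printed rather than quoted here:

```python
import pandas as pd
from sklearn.utils import check_array

ser = pd.Series([1.0, 2.0, 3.0])

try:
    check_array(ser, ensure_2d=True)  # a Series is not a valid 2D container
except ValueError as e:
    print(e)
```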
This was because the error message was generated after converting the input, and it did not distinguish Series-like objects from other one-dimensional array-like objects. I made a simple fix to the error message by explicitly stating the type of the input and using wording customized for Series-like objects. This fix is included from scikit-learn 1.4.1, and the improved error message is shown below:
>>> check_array(ser, ensure_2d=True)
ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
Maintenance Contributions
Items are sorted in reverse chronological order by the time of merge.
MAINT fix update_environments_and_lock_files for non-posix systems #28133
MNT Work-around sphinx-gallery UnicodeDecodeError in recommender system #27969