This is a collection of my open source contributions to scikit-learn, a Python module for machine learning. Its code base is maintained on GitHub by over 2,500 contributors.
I am a core developer of scikit-learn and a member of its documentation team. I have contributed 88 merged pull requests, ranking #43 among its contributors. Note that throughout this post, statements about versions do not take backporting into consideration. For instance, "a bug existed in scikit-learn 1.3.1" and "fixed in scikit-learn 1.3.2" only mean that the bug was fixed after the release of scikit-learn 1.3.1; they do not guarantee that one would still see the bug with scikit-learn 1.3.1 today, since the fix for scikit-learn 1.3.2 may have been backported to earlier versions, especially the 1.3.x series.
Code Contributions
Items in each section are sorted in reverse chronological order by the time of merge.
Cluster
FIX AffinityPropagation assigning multiple clusters for equal points #28121
In scikit-learn 1.4.0, cluster.AffinityPropagation assigned different clusters to completely equal points. For instance,
I made a simple fix to assign the same cluster to equal points, so from scikit-learn 1.4.1 the previous example produces all zeros.
Cross Decomposition
ENH ravel prediction of PLSRegression when fitted on 1d y #26602
In scikit-learn 1.3.x, cross_decomposition.PLSRegression always predicted a 2D result, regardless of whether the input y was 1D or 2D. This was inconsistent with other regressors such as linear_model.LinearRegression and linear_model.Ridge. For instance,
I made a simple fix to determine whether to automatically ravel the output based on the shape of the input y. From scikit-learn 1.4.0, PLSRegression behaves consistently with other regressors, returning a 1D prediction when fitted with a 1D y.
This function is written in Cython for efficiency, but the signature of X in one of the helper functions was int_or_float[:, :] X, i.e., a mutable Cython memoryview, which rejects read-only input. I made a simple fix by adding const to the signature. From scikit-learn 1.4.1, datasets.dump_svmlight_file works correctly with read-only X.
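A quick sketch of the fixed behavior, writing a read-only array to an in-memory buffer (the data is made up for illustration):

```python
import numpy as np
from io import BytesIO
from sklearn.datasets import dump_svmlight_file

X = np.array([[1.0, 0.0], [0.0, 2.0]])
X.setflags(write=False)  # simulate read-only data, e.g. a memory-mapped array
y = np.array([0, 1])

buf = BytesIO()
dump_svmlight_file(X, y, buf)  # raised an error on read-only X before 1.4.1
print(buf.getvalue().decode())
```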
ENH make_sparse_spd_matrix use sparse memory layout #27438
In scikit-learn 1.3.x, the function datasets.make_sparse_spd_matrix used dense numpy arrays throughout the computation and output a dense array, even though the result is sparse (the sparsity increases with the parameter alpha). This was inefficient in terms of both memory usage and computational cost. I refactored the code to use a sparse memory layout from the very beginning, greatly reducing memory consumption, especially when alpha is large. I also added a new keyword argument sparse_format which allows outputting a sparse matrix directly. By default, sparse_format is None, meaning that the output is converted to a dense numpy array; otherwise it should be a string such as "csr" or "csc", specifying the sparse format to return. This improvement is included from scikit-learn 1.4.0.
FIX KernelPCA inverse transform when gamma is not given #26337
In scikit-learn 1.2.x, decomposition.KernelPCA could produce incorrect results through the inverse_transform method when gamma=None. With gamma=None, the value of gamma should be automatically chosen as 1/n_features, where n_features is the number of features of X, so gamma=None and gamma=1/n_features should theoretically give the same results, which was not the case in scikit-learn 1.2.x. For instance,
The reason was that when gamma=None, the value 1/n_features was recomputed at inverse transform time, where n_features could differ from its value at fit time, leading to different gamma values for the kernel and thus different results. When gamma is explicitly set, the value is always consistent. I made a simple fix to ensure that n_features at fit time is used consistently, so from scikit-learn 1.3.0, kpca1 and kpca2 produce the same results.
Ensemble
FIX Remove spurious feature names warning in IsolationForest #25931
In scikit-learn 1.2.x, when fitting an ensemble.IsolationForest on a pandas DataFrame with contamination not set to "auto", there would be a spurious warning about missing feature names. For instance,
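A minimal reproduction sketch with a made-up DataFrame; from scikit-learn 1.3.0, no feature-names warning is emitted:

```python
import warnings
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.DataFrame({"a": [-1.1, 0.3, 0.5, 100.0], "b": [0.1, 0.2, 0.3, 0.4]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    IsolationForest(contamination=0.1, random_state=0).fit(X)

# In 1.2.x this listed a spurious warning that X does not have valid feature names
print([str(w.message) for w in caught])
```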
The reason was that the fit method was called with a DataFrame, so feature names were seen at fit time; but at the end of fit, when contamination is not "auto", the score_samples method is called on input data X that has already been validated and converted into a numpy array, hence the warning. I created a private method _score_samples that skips validation and is called at the end of fit, so the warning goes away. This fix is included from scikit-learn 1.3.0.
Feature Selection
FIX mutual_info_regression when X is of integer dtype #26748
In scikit-learn 1.3.x, feature_selection.mutual_info_regression would return inaccurate or incorrect results when the input data X is of integer dtype. As an illustration, we compare its output on two inputs that differ only in their dtypes:
The reason was that continuous features of the input data were scaled and assigned back to the input data X, but if X is of integer dtype, such an in-place assignment forces the scaled values to be rounded, largely losing precision. I made a simple fix to convert X to np.float64 dtype in advance, so from scikit-learn 1.4.0, res and res_float are the same.
FIX SequentialFeatureSelector throws IndexError when cv is a generator #25973
When fitting a feature_selection.SequentialFeatureSelector, the parameter cv should accept any iterable according to the documentation. However in scikit-learn 1.2.x, a confusing IndexError would occur when cv is passed as a generator. For instance,
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.feature_selection import SequentialFeatureSelector
>>> from sklearn.model_selection import KFold
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X, y = make_classification(random_state=0)
>>> knc = KNeighborsClassifier(n_neighbors=5)
>>> cv = KFold().split(X, y)  # This is a generator
>>> sfs = SequentialFeatureSelector(knc, n_features_to_select=5, cv=cv)
>>> sfs.fit(X, y)
IndexError: list index out of range
This was because cv was passed around and consumed multiple times, so the generator was exhausted after the first use. I made a simple fix by calling model_selection.check_cv, which converts the generator into a list that is safe to pass around. This fix is included from scikit-learn 1.3.0.
Metrics
ENH PrecisionRecallDisplay add option to plot chance level #26019
PrecisionRecallDisplay plots the Precision-Recall (PR) curve. The PR curve is constructed by plotting the precision against the recall, where precision is the proportion of true positives among all predicted positives, and recall is the proportion of true positives among all actual positives. The baseline, called the chance level, is the PR curve of a predictor that predicts all examples as positive. Therefore, I added a keyword plot_chance_level for plotting the chance level. I also provided a keyword chance_level_kw that accepts a dictionary of matplotlib keywords for customizing the rendering of the chance level line.
ENH RocCurveDisplay add option to plot chance level #25987
RocCurveDisplay plots the Receiver Operating Characteristic (ROC) curve, which measures the diagnostic ability of binary classifiers. The ROC curve is constructed by plotting the true positive rate against the false positive rate. The baseline, called the chance level, is the diagonal; the farther the ROC curve is from the baseline, the better the classifier performs. Therefore, I added a keyword plot_chance_level for plotting the chance level. I also provided a keyword chance_level_kw that accepts a dictionary of matplotlib keywords for customizing the rendering of the chance level line.
Neighbors
FIX KNeighborsClassifier raise when all neighbors of some sample have zero weights #26410
In scikit-learn 1.4.0, neighbors.KNeighborsClassifier behaved unreasonably when the weights of all neighbors of some sample were zero. This is possible when using a user-defined weight function; for instance, all neighbors of a sample could be fairly far away, and the weight function could assign zero weight to all points beyond a certain threshold. For simplicity of illustration, consider the following example where we directly assign zero weights to all points:
As shown in the example above, the predict method predicted the first class and the predict_proba method returned all-zero probabilities. After discussions, maintainers were convinced that it was worth breaking backward compatibility to raise an error in this case, instead of returning essentially arbitrary results that hide potential bugs. Therefore, I made a simple fix to check whether there is any all-zero row (i.e., a sample whose neighbors all have zero weight). To avoid creating large temporary boolean arrays in memory during this check, I also implemented a small Cython helper. From scikit-learn 1.4.1, the aforementioned ill case leads to an informative error message with a suggestion to fix the problem:
FIX PowerTransformer raise when "box-cox" has nan column #26400
In scikit-learn 1.2.x, preprocessing.PowerTransformer would raise a confusing error when using the box-cox transformation on data containing an all-nan column. As an illustration,
This was because scikit-learn internally used scipy.stats.boxcox, which returns an empty array when the input array is empty (all nan values are masked out), making the result impossible to unpack. After discussions, maintainers agreed that it is best to raise a more informative error in this case rather than letting it pass. Therefore, I made a fix to check the input array in advance, and the following error message is raised from scikit-learn 1.3.0:
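A minimal reproduction sketch with made-up data; the informative error text comes from the library, so it is printed rather than quoted here:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0, np.nan], [2.0, np.nan], [3.0, np.nan]])  # second column is all nan

try:
    PowerTransformer(method="box-cox").fit(X)  # informative ValueError from 1.3.0
except ValueError as e:
    print(e)
```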
FIX export_text and export_graphviz accepts feature and class names as array-like #26289
In scikit-learn 1.2.x, tree.export_text and tree.export_graphviz only accepted feature_names and class_names as lists of strings; passing an array-like raised a confusing error. For instance,
However, under many circumstances users get feature names and class names from arrays or dataframes, so I extended support to all array-like inputs for feature_names and class_names. From scikit-learn 1.3.0, the above example works directly without conversion in advance.
FIX improve error message in check_array when getting a Series and expecting a 2D container #28090
In scikit-learn 1.4.0, when calling utils.check_array with a Series-like object (e.g., pandas Series or polars Series) and expecting a 2D container, the error message would be confusing. For instance,
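A minimal sketch of the triggering call with a made-up pandas Series; the error text comes from the library, so it is printed rather than quoted here:

```python
import pandas as pd
from sklearn.utils import check_array

ser = pd.Series([1.0, 2.0, 3.0])

try:
    check_array(ser, ensure_2d=True)  # a Series is not a valid 2D container
except ValueError as e:
    print(e)
```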
This was because the error message was generated after converting the input, and it did not distinguish Series-like objects from other one-dimensional array-like objects. I made a simple fix to the error message by explicitly stating the type of the input and using wording customized for Series-like objects. This fix is included from scikit-learn 1.4.1, and the improved error message is shown below:
>>> check_array(ser, ensure_2d=True)
ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
Maintenance Contributions
Items are sorted in reverse chronological order by the time of merge.
MAINT fix update_environments_and_lock_files for non-posix systems #28133
MNT Work-around sphinx-gallery UnicodeDecodeError in recommender system #27969