Version 1.4#
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.4.
Legend for changelogs
Major Feature something big that you couldn’t do before.
Feature something that you couldn’t do before.
Efficiency an existing feature now may not require as much computation or memory.
Enhancement a miscellaneous minor improvement.
Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.
API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.4.1#
In Development
Changes impacting all modules#
Partial revert of #28191 to avoid a performance regression for estimators relying on euclidean pairwise computation with sparse matrices. The impacted estimators are:
Changelog#
sklearn.calibration#
Fix calibration.CalibratedClassifierCV supports predict_proba with float32 output from the inner estimator. #28247 by Thomas Fan.
sklearn.cluster#
Fix cluster.AffinityPropagation now avoids assigning multiple different clusters for equal points. #28121 by Pietro Peterlongo and Yao Xiao.
Fix Avoid infinite loop in cluster.KMeans when the number of clusters is larger than the number of non-duplicate samples. #28165 by Jérémie du Boisberranger.
Enhancement Pandas and Polars dataframes are validated directly without ducktyping checks. #28195 by Thomas Fan.
sklearn.preprocessing#
Fix preprocessing.FunctionTransformer now also warns when set_output is called with transform="polars" and func does not return a Polars dataframe or feature_names_out is not specified. #28263 by Guillaume Lemaitre.
sklearn.tree#
Fix tree.DecisionTreeClassifier and tree.DecisionTreeRegressor now handle missing values properly. The internal criterion was not initialized when no missing values were present in the data, leading to potentially wrong criterion values. #28295 by Guillaume Lemaitre.
sklearn.utils#
Fix utils._safe_indexing now raises a ValueError when X is a Python list and axis=1, as documented in the docstring. #28222 by Guillaume Lemaitre.
Version 1.4.0#
January 2024
Changed models#
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
Efficiency linear_model.LogisticRegression and linear_model.LogisticRegressionCV now have much better convergence for solvers "lbfgs" and "newton-cg". Both solvers can now reach much higher precision for the coefficients depending on the specified tol. Additionally, lbfgs can make better use of tol, i.e., stop sooner or reach higher precision. Note: lbfgs is the default solver, so this change might affect many models. This change also means that with this new version of scikit-learn, the resulting coefficients coef_ and intercept_ of your models will change for these two solvers (when fit on the same data again). The amount of change depends on the specified tol; for small values you will get more precise results. #26721 by Christian Lorentzen.
Fix Fixes a memory leak seen in PyPy for estimators using the Cython loss functions. #27670 by Guillaume Lemaitre.
Changes impacting all modules#
Major Feature Transformers now support polars output with set_output(transform="polars"). #27315 by Thomas Fan.
Enhancement All estimators now recognize the column names from any dataframe that adopts the DataFrame Interchange Protocol. Dataframes that return a correct representation through np.asarray(df) are expected to work with our estimators and functions. #26464 by Thomas Fan.
Enhancement The HTML representation of estimators now includes a link to the documentation and is color-coded to denote whether the estimator is fitted or not (unfitted estimators are orange, fitted estimators are blue). #26616 by Riccardo Cappuzzo, Ines Ibnukhsein, Gael Varoquaux, Joel Nothman and Lilian Boulard.
Fix Fixed a bug in most estimators and functions where setting a parameter to a large integer would cause a TypeError. #26648 by Naoise Holohan.
Metadata Routing#
The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.
Feature LarsCV and LassoLarsCV now support metadata routing in their fit method and route metadata to the CV splitter. #27538 by Omar Salman.
Feature multiclass.OneVsRestClassifier, multiclass.OneVsOneClassifier and multiclass.OutputCodeClassifier now support metadata routing in their fit and partial_fit, and route metadata to the underlying estimator’s fit and partial_fit. #27308 by Stefanie Senger.
Feature pipeline.Pipeline now supports metadata routing according to the metadata routing user guide. #26789 by Adrin Jalali.
Feature cross_validate, cross_val_score, and cross_val_predict now support metadata routing. The metadata are routed to the estimator’s fit, the scorer, and the CV splitter’s split. The metadata is accepted via the new params parameter. fit_params is deprecated and will be removed in version 1.6. The groups parameter is also not accepted as a separate argument when metadata routing is enabled and should be passed via the params parameter. #26896 by Adrin Jalali.
Feature GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, and HalvingRandomSearchCV now support metadata routing in their fit and score, and route metadata to the underlying estimator’s fit, the CV splitter, and the scorer. #27058 by Adrin Jalali.
Feature ColumnTransformer now supports metadata routing according to the metadata routing user guide. #27005 by Adrin Jalali.
Feature linear_model.LogisticRegressionCV now supports metadata routing. linear_model.LogisticRegressionCV.fit now accepts **params which are passed to the underlying splitter and scorer. linear_model.LogisticRegressionCV.score now accepts **score_params which are passed to the underlying scorer. #26525 by Omar Salman.
Feature feature_selection.SelectFromModel now supports metadata routing in fit and partial_fit. #27490 by Stefanie Senger.
Feature linear_model.OrthogonalMatchingPursuitCV now supports metadata routing. Its fit now accepts **fit_params, which are passed to the underlying splitter. #27500 by Stefanie Senger.
Feature ElasticNetCV, LassoCV, MultiTaskElasticNetCV and MultiTaskLassoCV now support metadata routing and route metadata to the CV splitter. #27478 by Omar Salman.
Fix All meta-estimators for which metadata routing is not yet implemented now raise a NotImplementedError on get_metadata_routing and on fit if metadata routing is enabled and any metadata is passed to them. #27389 by Adrin Jalali.
Support for SciPy sparse arrays#
Several estimators now support SciPy sparse arrays. The following functions and classes are impacted:
Functions:
cluster.compute_optics_graph in #27104 by Maren Westermann and in #27250 by Yao Xiao;
decomposition.non_negative_factorization in #27100 by Isaac Virshup;
feature_selection.f_regression in #27239 by Yaroslav Korobko;
feature_selection.r_regression in #27239 by Yaroslav Korobko.
Classes:
cluster.HDBSCAN in #27250 by Yao Xiao;
cluster.KMeans in #27179 by Nurseit Kamchyev;
cluster.OPTICS in #27104 by Maren Westermann and in #27250 by Yao Xiao;
decomposition.NMF in #27100 by Isaac Virshup;
feature_extraction.text.TfidfTransformer in #27219 by Yao Xiao;
manifold.Isomap in #27250 by Yao Xiao;
manifold.TSNE in #27250 by Yao Xiao;
impute.SimpleImputer in #27277 by Yao Xiao;
impute.KNNImputer in #27277 by Yao Xiao;
kernel_approximation.PolynomialCountSketch in #27301 by Lohit SundaramahaLingam;
random_projection.GaussianRandomProjection in #27314 by Stefanie Senger;
random_projection.SparseRandomProjection in #27314 by Stefanie Senger.
Support for Array API#
Several estimators and functions support the Array API. Such changes allow using the estimators and functions with other libraries such as JAX, CuPy, and PyTorch, which enables some GPU-accelerated computations.
See Array API support (experimental) for more details.
Functions:
sklearn.metrics.accuracy_score and sklearn.metrics.zero_one_loss in #27137 by Edoardo Abati;
sklearn.model_selection.train_test_split in #26855 by Tim Head;
is_multilabel in #27601 by Yaroslav Korobko.
Classes:
decomposition.PCA for the full and randomized solvers (with QR power iterations) in #26315, #27098 and #27431 by Mateusz Sokół, Olivier Grisel and Edoardo Abati.
Private Loss Function Module#
Fix The gradient computation of the binomial log loss is now numerically more stable for raw predictions that are very large in absolute value. Before, it could result in np.nan. Among the models that profit from this change are ensemble.GradientBoostingClassifier, ensemble.HistGradientBoostingClassifier and linear_model.LogisticRegression. #28048 by Christian Lorentzen.
Changelog#
sklearn.base#
Enhancement base.ClusterMixin.fit_predict and base.OutlierMixin.fit_predict now accept **kwargs which are passed to the fit method of the estimator. #26506 by Adrin Jalali.
Enhancement base.TransformerMixin.fit_transform and base.OutlierMixin.fit_predict now raise a warning if transform / predict consume metadata, but no custom fit_transform / fit_predict is defined in the class inheriting from them correspondingly. #26831 by Adrin Jalali.
Enhancement base.clone now supports dict as input and creates a copy. #26786 by Adrin Jalali.
API Change process_routing now has a different signature. The first two arguments (the object and the method) are positional only, and all metadata are passed as keyword arguments. #26909 by Adrin Jalali.
sklearn.calibration#
Enhancement The internal objective and gradient of the sigmoid method of calibration.CalibratedClassifierCV have been replaced by the private loss module. #27185 by Omar Salman.
sklearn.cluster#
Fix The degree parameter in the cluster.SpectralClustering constructor now accepts real values instead of only integral values in accordance with the degree parameter of the sklearn.metrics.pairwise.polynomial_kernel. #27668 by Nolan McMahon.
Fix Fixes a bug in cluster.OPTICS where the cluster correction based on predecessor was not using the right indexing. It would lead to inconsistent results dependent on the order of the data. #26459 by Haoying Zhang and Guillaume Lemaitre.
Fix Improve error message when checking the number of connected components in the fit method of cluster.HDBSCAN. #27678 by Ganesh Tata.
Fix Create copy of precomputed sparse matrix within the fit method of cluster.DBSCAN to avoid in-place modification of the sparse matrix. #27651 by Ganesh Tata.
Fix Raises a proper ValueError when metric="precomputed" and storing centers is requested via the parameter store_centers. #27898 by Guillaume Lemaitre.
API Change kdtree and balltree values are now deprecated and are renamed as kd_tree and ball_tree respectively for the algorithm parameter of cluster.HDBSCAN, ensuring consistency in naming convention. kdtree and balltree values will be removed in 1.6. #26744 by Shreesha Kumar Bhat.
API Change The option metric=None in cluster.AgglomerativeClustering and cluster.FeatureAgglomeration is deprecated in version 1.4 and will be removed in version 1.6. Use the default value instead. #27828 by Guillaume Lemaitre.
sklearn.compose#
Major Feature Adds polars input support to compose.ColumnTransformer through the DataFrame Interchange Protocol. The minimum supported version for polars is 0.19.12. #26683 by Thomas Fan.
Fix cluster.spectral_clustering and cluster.SpectralClustering now raise an explicit error message indicating that sparse matrices and arrays with np.int64 indices are not supported. #27240 by Yao Xiao.
API Change Outputs that use pandas extension dtypes and contain pd.NA in ColumnTransformer now result in a FutureWarning and will cause a ValueError in version 1.6, unless the output container has been configured as “pandas” with set_output(transform="pandas"). Before, such outputs resulted in numpy arrays of dtype object containing pd.NA which could not be converted to numpy floats and caused errors when passed to other scikit-learn estimators. #27734 by Jérôme Dockès.
sklearn.covariance#
Enhancement Allow covariance.shrunk_covariance to process multiple covariance matrices at once by handling nd-arrays. #25275 by Quentin Barthélemy.
API Change Fix ColumnTransformer now replaces "passthrough" with a corresponding FunctionTransformer in the fitted transformers_ attribute. #27204 by Adrin Jalali.
sklearn.datasets#
Enhancement datasets.make_sparse_spd_matrix now uses a more memory-efficient sparse layout. It also accepts a new keyword sparse_format that allows specifying the output format of the sparse matrix. By default sparse_format=None, which returns a dense numpy ndarray as before. #27438 by Yao Xiao.
Fix datasets.dump_svmlight_file no longer raises a ValueError when X is read-only, e.g., a numpy.memmap instance. #28111 by Yao Xiao.
API Change datasets.make_sparse_spd_matrix deprecated the keyword argument dim in favor of n_dim. dim will be removed in version 1.6. #27718 by Adam Li.
sklearn.decomposition#
Feature decomposition.PCA now supports scipy.sparse.sparray and scipy.sparse.spmatrix inputs when using the arpack solver. When used on sparse data like datasets.fetch_20newsgroups_vectorized this can lead to speed-ups of 100x (single threaded) and 70x lower memory usage. Based on Alexander Tarashansky’s implementation in scanpy. #18689 by Isaac Virshup and Andrey Portnoy.
Enhancement An “auto” option was added to the n_components parameter of decomposition.non_negative_factorization, decomposition.NMF and decomposition.MiniBatchNMF to automatically infer the number of components from W or H shapes when using a custom initialization. The default value of this parameter will change from None to auto in version 1.6. #26634 by Alexandre Landeau and Alexandre Vigny.
Fix decomposition.dict_learning_online no longer ignores the parameter max_iter. #27834 by Guillaume Lemaitre.
Fix The degree parameter in the decomposition.KernelPCA constructor now accepts real values instead of only integral values in accordance with the degree parameter of the sklearn.metrics.pairwise.polynomial_kernel. #27668 by Nolan McMahon.
API Change The option max_iter=None in decomposition.MiniBatchDictionaryLearning, decomposition.MiniBatchSparsePCA, and decomposition.dict_learning_online is deprecated and will be removed in version 1.6. Use the default value instead. #27834 by Guillaume Lemaitre.
sklearn.ensemble#
Major Feature ensemble.RandomForestClassifier and ensemble.RandomForestRegressor support missing values when the criterion is gini, entropy, or log_loss for classification, or squared_error, friedman_mse, or poisson for regression. #26391 by Thomas Fan.
Major Feature ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor support categorical_features="from_dtype", which treats columns with Pandas or Polars Categorical dtype as categories in the algorithm. categorical_features="from_dtype" will become the default in v1.6. Categorical features no longer need to be encoded with numbers. When categorical features are numbers, the maximum value no longer needs to be smaller than max_bins; only the number of (unique) categories must be smaller than max_bins. #26411 by Thomas Fan and #27835 by Jérôme Dockès.
Major Feature ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor got the new parameter max_features to specify the proportion of randomly chosen features considered in each split. #27139 by Christian Lorentzen.
Feature ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor now support monotonic constraints, useful when features are supposed to have a positive/negative effect on the target. Missing values in the train data and multi-output targets are not supported. #13649 by Samuel Ronsin, initiated by Patrick O’Reilly.
Efficiency ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor are now a bit faster by reusing the parent node’s histogram as a child node’s histogram in the subtraction trick. In effect, less memory has to be allocated and deallocated. #27865 by Christian Lorentzen.
Efficiency ensemble.GradientBoostingClassifier is faster, for binary and in particular for multiclass problems, thanks to the private loss function module. #26278 and #28095 by Christian Lorentzen.
Efficiency Improves runtime and memory usage for ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor when trained on sparse data. #26957 by Thomas Fan.
Efficiency ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor are now faster when scoring is a predefined metric listed in metrics.get_scorer_names and early stopping is enabled. #26163 by Thomas Fan.
Enhancement A fitted property, estimators_samples_, was added to all Forest methods, including ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor, which allows retrieving the training sample indices used for each tree estimator. #26736 by Adam Li.
Fix Fixes ensemble.IsolationForest when the input is a sparse matrix and contamination is set to a float value. #27645 by Guillaume Lemaitre.
Fix Raises a ValueError in ensemble.RandomForestRegressor and ensemble.ExtraTreesRegressor when requesting OOB score with multioutput model for the targets being all rounded to integer. It was recognized as a multiclass problem. #27817 by Daniele Ongari.
Fix Changes estimator tags to acknowledge that ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, ensemble.StackingRegressor support missing values if all estimators support missing values. #27710 by Guillaume Lemaitre.
Fix Support loading pickles of ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor when the pickle has been generated on a platform with a different bitness. A typical example is to train and pickle the model on a 64 bit machine and load the model on a 32 bit machine for prediction. #28074 by Christian Lorentzen and Loïc Estève.
API Change In ensemble.AdaBoostClassifier, the algorithm argument SAMME.R was deprecated and will be removed in 1.6. #26830 by Stefanie Senger.
sklearn.feature_extraction#
API Change Changed error type from AttributeError to exceptions.NotFittedError in unfitted instances of feature_extraction.DictVectorizer for the following methods: feature_extraction.DictVectorizer.inverse_transform, feature_extraction.DictVectorizer.restrict, feature_extraction.DictVectorizer.transform. #24838 by Lorenz Hertel.
sklearn.feature_selection#
Enhancement feature_selection.SelectKBest, feature_selection.SelectPercentile, and feature_selection.GenericUnivariateSelect now support unsupervised feature selection by providing a score_func taking X and y=None. #27721 by Guillaume Lemaitre.
Enhancement feature_selection.SelectKBest and feature_selection.GenericUnivariateSelect with mode='k_best' now show a warning when k is greater than the number of features. #27841 by Thomas Fan.
Fix feature_selection.RFE and feature_selection.RFECV do not check for nans during input validation. #21807 by Thomas Fan.
sklearn.inspection#
Enhancement inspection.DecisionBoundaryDisplay now accepts a parameter class_of_interest to select the class of interest when plotting the response provided by response_method="predict_proba" or response_method="decision_function". This allows plotting the decision boundary for both binary and multiclass classifiers. #27291 by Guillaume Lemaitre.
Fix inspection.DecisionBoundaryDisplay.from_estimator and inspection.PartialDependenceDisplay.from_estimator now return the correct type for subclasses. #27675 by John Cant.
API Change inspection.DecisionBoundaryDisplay raises an AttributeError instead of a ValueError when an estimator does not implement the requested response method. #27291 by Guillaume Lemaitre.
sklearn.kernel_ridge#
Fix The degree parameter in the kernel_ridge.KernelRidge constructor now accepts real values instead of only integral values in accordance with the degree parameter of the sklearn.metrics.pairwise.polynomial_kernel. #27668 by Nolan McMahon.
sklearn.linear_model#
Efficiency linear_model.LogisticRegression and linear_model.LogisticRegressionCV now have much better convergence for solvers "lbfgs" and "newton-cg". Both solvers can now reach much higher precision for the coefficients depending on the specified tol. Additionally, lbfgs can make better use of tol, i.e., stop sooner or reach higher precision. This is accomplished by better scaling of the objective function, i.e., using average per sample losses instead of sum of per sample losses. #26721 by Christian Lorentzen.
Efficiency linear_model.LogisticRegression and linear_model.LogisticRegressionCV with solver "newton-cg" can now be considerably faster for some data and parameter settings. This is accomplished by a better line search convergence check for negligible loss improvements that takes into account gradient information. #26721 by Christian Lorentzen.
Efficiency Solver "newton-cg" in linear_model.LogisticRegression and linear_model.LogisticRegressionCV uses a little less memory. The effect is proportional to the number of coefficients (n_features * n_classes). #27417 by Christian Lorentzen.
Fix Ensure that the sigma_ attribute of linear_model.ARDRegression and linear_model.BayesianRidge always has a float32 dtype when fitted on float32 data, even with the type promotion rules of NumPy 2. #27899 by Olivier Grisel.
API Change The attribute loss_function_ of linear_model.SGDClassifier and linear_model.SGDOneClassSVM has been deprecated and will be removed in version 1.6. #27979 by Christian Lorentzen.
sklearn.metrics#
Efficiency Computing pairwise distances via metrics.DistanceMetric for CSR x CSR, Dense x CSR, and CSR x Dense datasets is now 1.5x faster. #26765 by Meekail Zain.
Efficiency Computing distances via metrics.DistanceMetric for CSR x CSR, Dense x CSR, and CSR x Dense now uses ~50% less memory, and outputs distances in the same dtype as the provided data. #27006 by Meekail Zain.
Enhancement Improve the rendering of the plot obtained with the metrics.PrecisionRecallDisplay and metrics.RocCurveDisplay classes. The x- and y-axis limits are set to [0, 1] and the aspect ratio between both axes is set to 1 to get a square plot. #26366 by Mojdeh Rastgoo.
Enhancement Added neg_root_mean_squared_log_error_scorer as a scorer. #26734 by Alejandro Martin Gil.
Enhancement metrics.confusion_matrix now warns when only one label was found in y_true and y_pred. #27650 by Lucy Liu.
Fix Computing pairwise distances with metrics.pairwise.euclidean_distances no longer raises an exception when X is provided as a float64 array and X_norm_squared as a float32 array. #27624 by Jérôme Dockès.
Fix f1_score now provides correct values when handling various cases in which division by zero occurs by using a formulation that does not depend on the precision and recall values. #27577 by Omar Salman and Guillaume Lemaitre.
Fix metrics.make_scorer now raises an error when using a regressor on a scorer requesting a non-thresholded decision function (from decision_function or predict_proba). Such scorers are specific to classification. #26840 by Guillaume Lemaitre.
Fix metrics.DetCurveDisplay.from_predictions, metrics.PrecisionRecallDisplay.from_predictions, metrics.PredictionErrorDisplay.from_predictions, and metrics.RocCurveDisplay.from_predictions now return the correct type for subclasses. #27675 by John Cant.
API Change Deprecated needs_threshold and needs_proba from metrics.make_scorer. These parameters will be removed in version 1.6. Instead, use response_method that accepts "predict", "predict_proba" or "decision_function" or a list of such values. needs_proba=True is equivalent to response_method="predict_proba" and needs_threshold=True is equivalent to response_method=("decision_function", "predict_proba"). #26840 by Guillaume Lemaitre.
API Change The squared parameter of metrics.mean_squared_error and metrics.mean_squared_log_error is deprecated and will be removed in 1.6. Use the new functions metrics.root_mean_squared_error and metrics.root_mean_squared_log_error instead. #26734 by Alejandro Martin Gil.
sklearn.model_selection#
Enhancement model_selection.learning_curve raises a warning when every cross validation fold fails. #26299 by Rahil Parikh.
Fix model_selection.GridSearchCV, model_selection.RandomizedSearchCV, and model_selection.HalvingGridSearchCV no longer change the given object in the parameter grid if it’s an estimator. #26786 by Adrin Jalali.
sklearn.multioutput#
Enhancement Add method predict_log_proba to multioutput.ClassifierChain. #27720 by Guillaume Lemaitre.
sklearn.neighbors#
Efficiency sklearn.neighbors.KNeighborsRegressor.predict and sklearn.neighbors.KNeighborsClassifier.predict_proba now efficiently support pairs of dense and sparse datasets. #27018 by Julien Jerphanion.
Efficiency The performance of neighbors.RadiusNeighborsClassifier.predict and of neighbors.RadiusNeighborsClassifier.predict_proba has been improved when radius is large and algorithm="brute" with non-Euclidean metrics. #26828 by Omar Salman.
Fix Improve error message for neighbors.LocalOutlierFactor when it is invoked with n_samples=n_neighbors. #23317 by Bharat Raghunathan.
Fix neighbors.KNeighborsClassifier.predict and neighbors.KNeighborsClassifier.predict_proba now raise an error when the weights of all neighbors of some sample are zero. This can happen when weights is a user-defined function. #26410 by Yao Xiao.
API Change neighbors.KNeighborsRegressor now accepts metrics.DistanceMetric objects directly via the metric keyword argument, allowing for the use of accelerated third-party metrics.DistanceMetric objects. #26267 by Meekail Zain.
sklearn.preprocessing#
Efficiency preprocessing.OrdinalEncoder avoids calculating missing indices twice to improve efficiency. #27017 by Xuefeng Xu.
Efficiency Improves efficiency in preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder in checking nan. #27760 by Xuefeng Xu.
Enhancement Improves warnings in preprocessing.FunctionTransformer when func returns a pandas dataframe and the output is configured to be pandas. #26944 by Thomas Fan.
Enhancement preprocessing.TargetEncoder now supports target_type ‘multiclass’. #26674 by Lucy Liu.
Fix preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder raise an exception when nan is a category and is not the last in the user’s provided categories. #27309 by Xuefeng Xu.
Fix preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder raise an exception if the user-provided categories contain duplicates. #27328 by Xuefeng Xu.
Fix preprocessing.FunctionTransformer raises an error at transform if the output of get_feature_names_out is not consistent with the column names of the output container if those are defined. #27801 by Guillaume Lemaitre.
Fix Raise a NotFittedError in preprocessing.OrdinalEncoder when calling transform without calling fit since categories always needs to be checked. #27821 by Guillaume Lemaitre.
sklearn.tree#
Feature tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor now support monotonic constraints, useful when features are supposed to have a positive/negative effect on the target. Missing values in the train data and multi-output targets are not supported. #13649 by Samuel Ronsin, initiated by Patrick O’Reilly.
sklearn.utils#
Enhancement sklearn.utils.estimator_html_repr dynamically adapts diagram colors based on the browser’s prefers-color-scheme, providing improved adaptability to dark mode environments. #26862 by Andrew Goh Yisheng, Thomas Fan, Adrin Jalali.
Enhancement MetadataRequest and MetadataRouter now have a consumes method which can be used to check whether a given set of parameters would be consumed. #26831 by Adrin Jalali.
Enhancement Make sklearn.utils.check_array attempt to output int32-indexed CSR and COO arrays when converting from DIA arrays if the number of non-zero entries is small enough. This ensures that estimators implemented in Cython that do not accept int64-indexed sparse data structures now consistently accept the same sparse input formats for SciPy sparse matrices and arrays. #27372 by Guillaume Lemaitre.
Fix sklearn.utils.check_array should accept both matrix and array from the sparse SciPy module. The previous implementation would fail if copy=True by calling the NumPy-specific np.may_share_memory, which does not work with SciPy sparse arrays and does not return the correct result for SciPy sparse matrices. #27336 by Guillaume Lemaitre.
Fix check_estimators_pickle with readonly_memmap=True now relies on joblib’s own capability to allocate aligned memory mapped arrays when loading a serialized estimator instead of calling a dedicated private function that would crash when OpenBLAS misdetects the CPU architecture. #27614 by Olivier Grisel.
Fix Error message in check_array when a sparse matrix was passed but accept_sparse is False now suggests using .toarray() rather than X.toarray(). #27757 by Lucy Liu.
Fix Fix the function check_array to output the right error message when the input is a Series instead of a DataFrame. #28090 by Stan Furrer and Yao Xiao.
API Change sklearn.utils.extmath.log_logistic is deprecated and will be removed in 1.6. Use -np.logaddexp(0, -x) instead. #27544 by Christian Lorentzen.
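The suggested replacement for log_logistic can be sketched with plain NumPy. The log_sigmoid helper name is hypothetical and chosen for this example; it wraps the expression -np.logaddexp(0, -x) quoted above and compares it with the naive formula on moderate inputs.

```python
import numpy as np


def log_sigmoid(x):
    """Hypothetical helper: log(1 / (1 + exp(-x))) without overflow."""
    return -np.logaddexp(0, -x)


x = np.array([-1000.0, -2.0, 0.0, 3.0])
stable = log_sigmoid(x)

# The naive formula agrees on moderate inputs; it is skipped for -1000,
# where exp(1000) would overflow.
naive = np.log(1.0 / (1.0 + np.exp(-x[1:])))
```

For very negative inputs the stable form returns approximately x itself, whereas the naive expression would evaluate to -inf.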
Code and documentation contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.3, including:
101AlexMartin, Abhishek Singh Kushwah, Adam Li, Adarsh Wase, Adrin Jalali, Advik Sinha, Alex, Alexander Al-Feghali, Alexis IMBERT, AlexL, Alex Molas, Anam Fatima, Andrew Goh, andyscanzio, Aniket Patil, Artem Kislovskiy, Arturo Amor, ashah002, avm19, Ben Holmes, Ben Mares, Benoit Chevallier-Mames, Bharat Raghunathan, Binesh Bannerjee, Brendan Lu, Brevin Kunde, Camille Troillard, Carlo Lemos, Chad Parmet, Christian Clauss, Christian Lorentzen, Christian Veenhuis, Christos Aridas, Cindy Liang, Claudio Salvatore Arcidiacono, Connor Boyle, cynthias13w, DaminK, Daniele Ongari, Daniel Schmitz, Daniel Tinoco, David Brochart, Deborah L. Haar, DevanshKyada27, Dimitri Papadopoulos Orfanos, Dmitry Nesterov, DUONG, Edoardo Abati, Eitan Hemed, Elabonga Atuo, Elisabeth Günther, Emma Carballal, Emmanuel Ferdman, epimorphic, Erwan Le Floch, Fabian Egli, Filip Karlo Došilović, Florian Idelberger, Franck Charras, Gael Varoquaux, Ganesh Tata, Gleb Levitski, Guillaume Lemaitre, Haoying Zhang, Harmanan Kohli, Ily, ioangatop, IsaacTrost, Isaac Virshup, Iwona Zdzieblo, Jakub Kaczmarzyk, James McDermott, Jarrod Millman, JB Mountford, Jérémie du Boisberranger, Jérôme Dockès, Jiawei Zhang, Joel Nothman, John Cant, John Hopfensperger, Jona Sassenhagen, Jon Nordby, Julien Jerphanion, Kennedy Waweru, kevin moore, Kian Eliasi, Kishan Ved, Konstantinos Pitas, Koustav Ghosh, Kushan Sharma, ldwy4, Linus, Lohit SundaramahaLingam, Loic Esteve, Lorenz, Louis Fouquet, Lucy Liu, Luis Silvestrin, Lukáš Folwarczný, Lukas Geiger, Malte Londschien, Marcus Fraaß, Marek Hanuš, Maren Westermann, Mark Elliot, Martin Larralde, Mateusz Sokół, mathurinm, mecopur, Meekail Zain, Michael Higgins, Miki Watanabe, Milton Gomez, MN193, Mohammed Hamdy, Mohit Joshi, mrastgoo, Naman Dhingra, Naoise Holohan, Narendra Singh dangi, Noa Malem-Shinitski, Nolan, Nurseit Kamchyev, Oleksii Kachaiev, Olivier Grisel, Omar Salman, partev, Peter Hull, Peter Steinbach, Pierre de Fréminville, Pooja Subramaniam, Puneeth K, qmarcou, 
Quentin Barthélemy, Rahil Parikh, Rahul Mahajan, Raj Pulapakura, Raphael, Ricardo Peres, Riccardo Cappuzzo, Roman Lutz, Salim Dohri, Samuel O. Ronsin, Sandip Dutta, Sayed Qaiser Ali, scaja, scikit-learn-bot, Sebastian Berg, Shreesha Kumar Bhat, Shubhal Gupta, Søren Fuglede Jørgensen, Stefanie Senger, Tamara, Tanjina Afroj, THARAK HEGDE, thebabush, Thomas J. Fan, Thomas Roehr, Tialo, Tim Head, tongyu, Venkatachalam N, Vijeth Moudgalya, Vincent M, Vivek Reddy P, Vladimir Fokow, Xiao Yuan, Xuefeng Xu, Yang Tao, Yao Xiao, Yuchen Zhou, Yuusuke Hiramatsu