sklearn.preprocessing
.power_transform#
- sklearn.preprocessing.power_transform(X, method='yeo-johnson', *, standardize=True, copy=True)[source]#
Parametric, monotonic transformation to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, power_transform supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
Read more in the User Guide.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
The data to be transformed using a power transformation.
- method{‘yeo-johnson’, ‘box-cox’}, default=’yeo-johnson’
The power transform method. Available methods are:
‘yeo-johnson’ [1], works with positive and negative values
‘box-cox’ [2], only works with strictly positive values
Changed in version 0.23: The default value of the
method
parameter changed from ‘box-cox’ to ‘yeo-johnson’ in 0.23.- standardizebool, default=True
Set to True to apply zero-mean, unit-variance normalization to the transformed output.
- copybool, default=True
If False, try to avoid a copy and transform in place. This is not guaranteed to always work in place; e.g. if the data is a numpy array with an int dtype, a copy will be returned even with copy=False.
- Returns:
- X_transndarray of shape (n_samples, n_features)
The transformed data.
See also
PowerTransformer
Equivalent transformation with the Transformer API (e.g. as part of a preprocessing
Pipeline
).quantile_transform
Maps data to a standard normal distribution with the parameter
output_distribution='normal'
.
Notes
NaNs are treated as missing values: disregarded in
fit
, and maintained intransform
.For a comparison of the different scalers, transformers, and normalizers, see: Compare the effect of different scalers on data with outliers.
References
[1]I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).
[2]G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).
Examples
>>> import numpy as np >>> from sklearn.preprocessing import power_transform >>> data = [[1, 2], [3, 2], [4, 5]] >>> print(power_transform(data, method='box-cox')) [[-1.332... -0.707...] [ 0.256... -0.707...] [ 1.076... 1.414...]]
Warning
Risk of data leak. Do not use
power_transform
unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend usingPowerTransformer
within a Pipeline in order to prevent most risks of data leaking, e.g.:pipe = make_pipeline(PowerTransformer(), LogisticRegression())
.