sklearn.feature_selection.chi2

sklearn.feature_selection.chi2(X, y)

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the features with the highest chi-squared test statistic from X, relative to the classes. X must contain only non-negative values such as booleans or frequencies (e.g., term counts in document classification).

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
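As a minimal sketch of that selection step, chi2 can be passed as the score function to SelectKBest. The toy data and the choice of k=2 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative count data (illustrative only): 6 samples, 3 features.
X = np.array([[1, 1, 3],
              [0, 1, 5],
              [5, 4, 1],
              [6, 6, 2],
              [1, 4, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 2, 2])

# Keep the k features with the highest chi-squared statistic.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Column indices of the retained features.
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```

The fitted selector also exposes the per-feature statistics and p-values as selector.scores_ and selector.pvalues_.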

Read more in the User Guide.

Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)

Sample vectors.

y : array-like of shape (n_samples,)

Target vector (class labels).

Returns:
chi2 : ndarray of shape (n_features,)

Chi2 statistics for each feature.

p_values : ndarray of shape (n_features,)

P-values for each feature.

See also

f_classif

ANOVA F-value between label/feature for classification tasks.

f_regression

F-value between label/feature for regression tasks.

Notes

Complexity of this algorithm is O(n_classes * n_features).
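To make the statistic concrete, here is a minimal NumPy sketch of one way to compute it. It assumes the observed table is the per-class sum of each feature and the expected table is the class frequency times the feature total; both tables have n_classes * n_features entries. The helper name chi2_sketch is hypothetical, and the code is not claimed to match the library's implementation line for line.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist


def chi2_sketch(X, y):
    """Illustrative chi-squared statistic per feature (dense input only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)

    # Observed: summed feature values per class, shape (n_classes, n_features).
    observed = Y.T @ X

    # Expected under independence: class frequency times feature total.
    feature_total = X.sum(axis=0, keepdims=True)  # (1, n_features)
    class_prob = Y.mean(axis=0, keepdims=True)    # (1, n_classes)
    expected = class_prob.T @ feature_total       # (n_classes, n_features)

    # Sum of (O - E)^2 / E over classes; a feature with zero total gives nan.
    stats = ((observed - expected) ** 2 / expected).sum(axis=0)

    # Upper-tail probability with n_classes - 1 degrees of freedom.
    p_values = chi2_dist.sf(stats, df=len(classes) - 1)
    return stats, p_values


X = np.array([[1, 1, 3], [0, 1, 5], [5, 4, 1],
              [6, 6, 2], [1, 4, 0], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 2, 2])
stats, p_values = chi2_sketch(X, y)
print(stats, p_values)
```

On this toy data the sketch gives the same three statistics as the doctest in the Examples section.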

Examples

>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 1, 3],
...               [0, 1, 5],
...               [5, 4, 1],
...               [6, 6, 2],
...               [1, 4, 0],
...               [0, 0, 0]])
>>> y = np.array([1, 1, 0, 0, 2, 2])
>>> chi2_stats, p_values = chi2(X, y)
>>> chi2_stats
array([15.3...,  6.5       ,  8.9...])
>>> p_values
array([0.0004..., 0.0387..., 0.0116... ])

Examples using sklearn.feature_selection.chi2

Column Transformer with Mixed Types
