`sklearn.feature_extraction.text`.HashingVectorizer#

class sklearn.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)[source]#

Convert a collection of text documents to a matrix of token occurrences.

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory.
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.

For an efficiency comparison of the different feature extractors, see FeatureHasher and DictVectorizer Comparison.

See also

CountVectorizer: Convert a collection of text documents to a matrix of token counts.
TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features.

Notes

This estimator is stateless and does not need to be fitted. However, we recommend to call fit_transform instead of transform, as parameter validation is only performed in fit.

Examples

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)

Methods

`build_analyzer`()	Return a callable to process input data.
`build_preprocessor`()	Return a function to preprocess the text before tokenization.
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens.
`decode`(doc)	Decode the input into a string of unicode symbols.
`fit`(X[, y])	Only validates estimator's parameters.
`fit_transform`(X[, y])	Transform a sequence of documents to a document-term matrix.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`get_stop_words`()	Build or fetch the effective stop words list.
`partial_fit`(X[, y])	Only validates estimator's parameters.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform a sequence of documents to a document-term matrix.

build_analyzer()[source]#

Return a callable to process input data.

The callable handles preprocessing, tokenization, and n-grams generation.

Returns:

analyzer: callable: A function to handle preprocessing, tokenization and n-grams generation.

build_preprocessor()[source]#

Return a function to preprocess the text before tokenization.

Returns:

preprocessor: callable: A function to preprocess the text before tokenization.

build_tokenizer()[source]#

Return a function that splits a string into a sequence of tokens.

Returns:

tokenizer: callable: A function to split a string into a sequence of tokens.

decode(doc)[source]#

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters:

docbytes or str: The string to decode.

Returns:

doc: str: A string of unicode symbols.

fit(X, y=None)[source]#

Only validates estimator’s parameters.

This method allows to: (i) validate the estimator’s parameters and (ii) be consistent with the scikit-learn transformer API.

Parameters:

Xndarray of shape [n_samples, n_features]: Training data.
yIgnored: Not used, present for API consistency by convention.

Returns:

selfobject: HashingVectorizer instance.

fit_transform(X, y=None)[source]#

Transform a sequence of documents to a document-term matrix.

Parameters:

Xiterable over raw text documents, length = n_samples: Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.
yany: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:

Xsparse matrix of shape (n_samples, n_features): Document-term matrix.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routingMetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

paramsdict: Parameter names mapped to their values.

get_stop_words()[source]#

Build or fetch the effective stop words list.

Returns:

stop_words: list or None: A list of stop words.

partial_fit(X, y=None)[source]#

Only validates estimator’s parameters.

This method allows to: (i) validate the estimator’s parameters and (ii) be consistent with the scikit-learn transformer API.

Parameters:

Xndarray of shape [n_samples, n_features]: Training data.
yIgnored: Not used, present for API consistency by convention.

Returns:

selfobject: HashingVectorizer instance.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform{“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

New in version 1.4: "polars" option was added.

Returns:

selfestimator instance: Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**paramsdict: Estimator parameters.

Returns:

selfestimator instance: Estimator instance.

transform(X)[source]#

Transform a sequence of documents to a document-term matrix.

Parameters:

Xiterable over raw text documents, length = n_samples: Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

Returns:

Xsparse matrix of shape (n_samples, n_features): Document-term matrix.

Examples using `sklearn.feature_extraction.text.HashingVectorizer`#

Out-of-core classification of text documents

Clustering text documents using k-means

FeatureHasher and DictVectorizer Comparison

sklearn.feature_extraction.text.HashingVectorizer#

Examples using sklearn.feature_extraction.text.HashingVectorizer#

`sklearn.feature_extraction.text`.HashingVectorizer#

Examples using `sklearn.feature_extraction.text.HashingVectorizer`#