sklearn.datasets.dump_svmlight_file
- sklearn.datasets.dump_svmlight_file(X, y, f, *, zero_based=True, comment=None, query_id=None, multilabel=False)
Dump the dataset in svmlight / libsvm file format.
This is a text-based format with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.
The first element of each line can be used to store a target variable to predict.
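As a minimal sketch of the layout described above, the snippet below dumps a tiny dense array into an in-memory buffer and prints the resulting lines. The toy values and the variable names (`buffer`, `lines`) are illustrative assumptions, not part of the original documentation.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file

# Toy data (assumed for illustration): two samples, three features.
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, 0])

# A BytesIO object acts as a file-like object opened in binary mode.
buffer = BytesIO()
dump_svmlight_file(X, y, buffer)

# One text line per sample: the target first, then index:value pairs
# for the non-zero features only.
lines = buffer.getvalue().decode("utf-8").splitlines()
for line in lines:
    print(line)
```

Each printed line should start with the target value, followed by one `index:value` pair per non-zero feature.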
- Parameters:
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : {array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_labels)
Target values. Class labels must be integers or floats, or array-like objects of integers or floats for multilabel classification.
- f : str or file-like in binary mode
If string, specifies the path that will contain the data. If file-like, data will be written to f. f should be opened in binary mode.
- zero_based : bool, default=True
Whether column indices should be written zero-based (True) or one-based (False).
- comment : str or bytes, default=None
Comment to insert at the top of the file. This should be either a Unicode string, which will be encoded as UTF-8, or an ASCII byte string. If a comment is given, it will be preceded by one that identifies the file as having been dumped by scikit-learn. Note that not all tools understand comments in SVMlight files.
- query_id : array-like of shape (n_samples,), default=None
Array containing pairwise preference constraints (qid in svmlight format).
- multilabel : bool, default=False
Samples may have several labels each (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).
New in version 0.17: parameter multilabel to support multilabel datasets.
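The effect of `zero_based`, `comment`, and `query_id` can be explored with a small sketch like the one below. The helper `dump_to_lines`, the toy arrays, and the query id value 7 are hypothetical names and values chosen for illustration only.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file

# Toy data (assumed for illustration).
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, 0])
qid = np.array([7, 7])  # both samples belong to the same query


def dump_to_lines(**kwargs):
    """Hypothetical helper: dump X, y to memory and return the text lines."""
    buffer = BytesIO()
    dump_svmlight_file(X, y, buffer, **kwargs)
    return buffer.getvalue().decode("utf-8").splitlines()


zero_lines = dump_to_lines(zero_based=True)     # column indices start at 0
one_lines = dump_to_lines(zero_based=False)     # column indices start at 1
qid_lines = dump_to_lines(query_id=qid)         # adds a qid field per line
comment_lines = dump_to_lines(comment="toy data")  # adds header comments

for name, lines in [("zero-based", zero_lines), ("one-based", one_lines),
                    ("with qid", qid_lines), ("with comment", comment_lines)]:
    print(name)
    for line in lines:
        print("   ", line)
```

Comparing the printed variants side by side shows how each parameter changes the file contents without touching the data itself.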
Examples
>>> from sklearn.datasets import dump_svmlight_file, make_classification
>>> X, y = make_classification(random_state=0)
>>> output_file = "my_dataset.svmlight"
>>> dump_svmlight_file(X, y, output_file)
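For the multilabel parameter described above, a round trip through `load_svmlight_file` gives a quick sanity check. This is a sketch under assumptions: the label indicator matrix `Y` (one column per label, non-zero meaning the label is present) is an interpretation of the (n_samples, n_labels) shape, and the toy values are illustrative.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy data (assumed for illustration).
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
# Assumed indicator layout: row i marks which labels apply to sample i.
Y = np.array([[1, 0, 1],
              [0, 1, 0]])

buffer = BytesIO()
dump_svmlight_file(X, Y, buffer, multilabel=True)

# Rewind and load the same bytes back, again in multilabel mode.
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(
    buffer, n_features=3, multilabel=True, zero_based=True
)

# In multilabel mode the loader returns one tuple of labels per sample.
label_sets = [tuple(int(v) for v in labels) for labels in y_loaded]
print(label_sets)
print(X_loaded.toarray())
```

The loaded feature matrix is sparse, so `toarray()` is used here only to make the comparison with the original dense `X` easy to read.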