dask_ml.feature_extraction.text.CountVectorizer


class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

See also

sklearn.feature_extraction.text.CountVectorizer

Notes

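When no vocabulary is provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If you call fit or transform without providing a vocabulary, consider persisting the data first, provided it fits in (distributed) memory.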

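Additionally, this implementation benefits from an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.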
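The two passes described in the notes can be illustrated with a pure-Python sketch. This is not the Dask-ML implementation: the helper names (`tokenize`, `learn_vocabulary`, `count_tokens`) and the nested-list "partitions" are assumptions made for illustration, and only the default `token_pattern` and `lowercase=True` behaviour from the signature are mimicked.

```python
import re
from collections import Counter

# Default token pattern from the class signature: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def learn_vocabulary(partitions):
    """Pass 1: collect every token from every partition, then index them."""
    tokens = set()
    for partition in partitions:
        for doc in partition:
            tokens.update(tokenize(doc))
    # Sorted tokens get consecutive column indices.
    return {token: index for index, token in enumerate(sorted(tokens))}


def count_tokens(partitions, vocabulary):
    """Pass 2: re-read the documents and build one count row per document."""
    rows = []
    for partition in partitions:
        for doc in partition:
            row = [0] * len(vocabulary)
            for token, count in Counter(tokenize(doc)).items():
                if token in vocabulary:
                    row[vocabulary[token]] = count
            rows.append(row)
    return rows


# Two illustrative partitions holding the four-document corpus used on this page.
partitions = [
    ["This is the first document.", "This document is the second document."],
    ["And this is the third one.", "Is this the first document?"],
]

vocabulary = learn_vocabulary(partitions)      # first pass over the data
matrix = count_tokens(partitions, vocabulary)  # second pass over the data
```

Because the documents are read twice, any expensive upstream work (loading, decoding, cleaning) would also run twice unless the data is persisted or a vocabulary is supplied up front, in which case only the second function is needed.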

Examples

The Dask-ML implementation currently requires that raw_documents be a dask.bag.Bag of documents (lists of strings).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

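Methods

`build_analyzer`()	Return a callable to process input data.
`build_preprocessor`()	Return a function to preprocess the text before tokenization.
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens.
`decode`(doc)	Decode the input into a string of Unicode symbols.
`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return the document-term matrix.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`get_stop_words`()	Build or fetch the effective stop words list.
`inverse_transform`(X)	Return terms per document with nonzero entries in X.
`set_fit_request`(*[, raw_documents])	Request metadata passed to the `fit` method.
`set_params`(**params)	Set the parameters of this estimator.
`set_transform_request`(*[, raw_documents])	Request metadata passed to the `transform` method.
`transform`(raw_documents)	Transform documents to a document-term matrix.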
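
The text-processing helpers in the table above can be tried on a single string. The sketch below uses scikit-learn's `CountVectorizer` directly (the class referenced under "See also"), on the assumption that the Dask-ML class exposes these helpers with the same behaviour. The parameter values (`ngram_range=(1, 2)` and a two-word stop list) are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative settings: unigrams and bigrams, with a small custom stop list.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=["is", "the"])

# build_tokenizer() only applies token_pattern; it does not lowercase.
tokenize = vectorizer.build_tokenizer()
tokens = tokenize("This is the first document.")

# build_preprocessor() applies lowercasing (and accent stripping, if enabled).
preprocess = vectorizer.build_preprocessor()
cleaned = preprocess("This is the first document.")

# build_analyzer() chains preprocessing, tokenization, stop-word removal,
# and n-gram generation into one callable.
analyze = vectorizer.build_analyzer()
terms = analyze("This is the first document.")

# get_stop_words() returns the effective stop list as a frozenset.
stop_words = vectorizer.get_stop_words()

# decode() turns bytes into str using the configured encoding (utf-8 here).
decoded = vectorizer.decode(b"caf\xc3\xa9")
```

None of these calls requires a fitted vectorizer, so they are a cheap way to check tokenization settings on a few sample strings before running fit_transform on a large bag.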

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
