TfidfTransformer#

class sklearn.feature_extraction.text.TfidfTransformer(*, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)#

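Transform a count matrix to a normalized tf or tf-idf representation.

Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.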

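The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.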
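As a worked illustration of the two idf formulas, the sketch below recomputes the idf vector by hand for the count matrix used in the Examples section of this page. It uses only the standard library; the helper name idf_vector is hypothetical and is not part of scikit-learn.

```python
import math

# Document-term counts from the Examples section of this page
# (rows are documents, columns are vocabulary terms).
counts = [
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 2, 0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 0, 0],
]


def idf_vector(counts, smooth_idf=True):
    """Hypothetical helper: idf(t) for every column, per the formulas above."""
    n = len(counts)
    n_terms = len(counts[0])
    # df(t): number of documents in which term t occurs at least once
    df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]
    if smooth_idf:
        return [math.log((1 + n) / (1 + d)) + 1 for d in df]
    # Without smoothing, a term with df(t) == 0 would divide by zero.
    return [math.log(n / d) + 1 for d in df]


smoothed = idf_vector(counts, smooth_idf=True)
unsmoothed = idf_vector(counts, smooth_idf=False)
print([round(v, 8) for v in smoothed])
```

The smoothed values agree with the idf_ array printed in the Examples section; terms that occur in every document get an idf of exactly 1 under both variants.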

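Furthermore, the formulas used to compute tf and idf depend on parameter settings that correspond to the SMART notation used in IR as follows:

Tf is "n" (natural) by default, "l" (logarithmic) when sublinear_tf=True. Idf is "t" when use_idf is given, "n" (none) otherwise. Normalization is "c" (cosine) when norm='l2', "n" (none) when norm=None.

Read more in the User Guide.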

Parameters:

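norm : {'l1', 'l2'} or None, default='l2'

Each output row will have unit norm, either:

'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
'l1': Sum of absolute values of vector elements is 1. See normalize.
None: No normalization.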
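The three norm settings can be compared side by side. The following is a minimal sketch on a small made-up count matrix (the data is an assumption for illustration); it checks the unit-norm property and the statement that, under l2 normalization, the dot product of two rows equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Small illustrative count matrix (made-up data, every row is non-empty).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

tfidf_l2 = TfidfTransformer(norm="l2").fit_transform(counts).toarray()
tfidf_l1 = TfidfTransformer(norm="l1").fit_transform(counts).toarray()
tfidf_raw = TfidfTransformer(norm=None).fit_transform(counts).toarray()

# norm='l2': each row has Euclidean length 1.
row_l2_norms = np.linalg.norm(tfidf_l2, axis=1)

# Dot product of two l2-normalized rows ...
dot = tfidf_l2[0] @ tfidf_l2[4]
# ... equals the cosine similarity computed from the unnormalized rows.
cosine = (tfidf_raw[0] @ tfidf_raw[4]) / (
    np.linalg.norm(tfidf_raw[0]) * np.linalg.norm(tfidf_raw[4])
)

# norm='l1': absolute values in each row sum to 1.
row_l1_sums = np.abs(tfidf_l1).sum(axis=1)
```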

use_idf : bool, default=True

Enable inverse-document-frequency reweighting. If False, idf(t) = 1.
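To see what idf(t) = 1 means in practice, the sketch below disables both idf reweighting and normalization, so the output should simply reproduce the input counts. The count matrix is made-up illustrative data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts for illustration.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

# No idf reweighting and no normalization: plain term frequencies remain.
tf_only = TfidfTransformer(use_idf=False, norm=None).fit(counts)
unchanged = tf_only.transform(counts).toarray()

# The idf_ attribute is only defined when use_idf=True.
has_idf = hasattr(tf_only, "idf_")
```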

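smooth_idf : bool, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

sublinear_tf : bool, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).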
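The effect of sublinear tf scaling is easiest to see in isolation. This sketch (made-up counts, with idf and normalization switched off) compares the transformer output against 1 + log(tf) computed directly with NumPy; zero counts are expected to stay zero because only stored non-zero entries are rescaled.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts spanning several orders of magnitude.
counts = np.array([[1, 10, 0],
                   [100, 0, 1]])

transformer = TfidfTransformer(sublinear_tf=True, use_idf=False, norm=None)
scaled = transformer.fit_transform(counts).toarray()

# Reference computation: 1 + ln(tf) for non-zero counts, 0 otherwise.
# np.maximum avoids taking log(0) for the zero entries.
expected = np.where(counts > 0, 1 + np.log(np.maximum(counts, 1)), 0.0)
```

A count of 100 is thus weighted about 5.6 times as much as a count of 1, rather than 100 times.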

Attributes:

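idf_ : array of shape (n_features)
    The inverse document frequency (IDF) vector; only defined if use_idf is True.

    Added in version 0.20.
n_features_in_ : int
    Number of features seen during fit.

    Added in version 1.0.
feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Defined only when X has feature names that are all strings.

    Added in version 1.0.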

See also

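CountVectorizer: Transforms text into a sparse matrix of n-gram counts.
TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features.
HashingVectorizer: Convert a collection of text documents to a matrix of token occurrences.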

References

[Yates2011]

R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 68-74.

[MRS2008]

C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 118-120.

Examples

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> corpus = ['this is the first document',
...           'this document is the second document',
...           'and this is the third one',
...           'is this the first document']
>>> vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
...               'and', 'one']
>>> pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
...                  ('tfid', TfidfTransformer())]).fit(corpus)
>>> pipe['count'].transform(corpus).toarray()
array([[1, 1, 1, 1, 0, 1, 0, 0],
       [1, 2, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 0, 0]])
>>> pipe['tfid'].idf_
array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.        , 1.91629073, 1.91629073])
>>> pipe.transform(corpus).shape
(4, 8)

fit(X, y=None)#

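Learn the idf vector (global term weights).

Parameters:

X : sparse matrix of shape (n_samples, n_features)
    A matrix of term/token counts.
y : None
    This parameter is not needed to compute tf-idf.

Returns:

self : object
    Fitted transformer.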

fit_transform(X, y=None, **fit_params)#

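Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters.

Returns:

X_new : ndarray array of shape (n_samples, n_features_new)
    Transformed array.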
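As a quick check of the "fit, then transform" description, the sketch below compares fit_transform with separate fit and transform calls on the same made-up sparse count matrix. For this transformer the result is a sparse matrix, as described under transform, so it is densified before comparison.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up sparse count matrix for illustration.
counts = csr_matrix(np.array([[3, 0, 1],
                              [2, 0, 0],
                              [3, 2, 0]]))

# One call that fits and transforms ...
one_step = TfidfTransformer().fit_transform(counts)
# ... versus fitting first and transforming afterwards.
two_step = TfidfTransformer().fit(counts).transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
```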

get_feature_names_out(input_features=None)#

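Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out : ndarray of str objects
    Same as input features.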

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest
    A MetadataRequest encapsulating routing information.

get_params(deep=True)#

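Get parameters for this estimator.

Parameters:

deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict
    Parameter names mapped to their values.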

set_output(*, transform=None)#

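Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self : estimator instance
    Estimator instance.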

set_params(**params)#

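Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters:

**params : dict
    Estimator parameters.

Returns:

self : estimator instance
    Estimator instance.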
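A short sketch of get_params and set_params follows, first on a standalone transformer and then on a Pipeline using the <component>__<parameter> form. The step names "count" and "tfid" follow the Examples section of this page.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are set by their plain names.
transformer = TfidfTransformer()
transformer.set_params(norm="l1", smooth_idf=False)
params = transformer.get_params()

# Nested object: address a step's parameter as <step name>__<parameter>.
pipe = Pipeline([("count", CountVectorizer()),
                 ("tfid", TfidfTransformer())])
pipe.set_params(tfid__sublinear_tf=True)
nested_value = pipe.get_params()["tfid__sublinear_tf"]
```

Because set_params returns the estimator itself, calls can be chained, e.g. TfidfTransformer().set_params(norm=None).fit(X).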

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → TfidfTransformer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to transform .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

copy : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
    Metadata routing for copy parameter in transform.

Returns:

self : object
    The updated object.

transform(X, copy=True)#

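Transform a count matrix to a tf or tf-idf representation.

Parameters:

X : sparse matrix of shape (n_samples, n_features)
    A matrix of term/token counts.
copy : bool, default=True
    Whether to copy X and operate on the copy or perform in-place operations. copy=False will only be effective with CSR sparse matrix.

Returns:

vectors : sparse matrix of shape (n_samples, n_features)
    Tf-idf-weighted document-term matrix.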
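The sketch below applies a fitted transformer to previously unseen counts and reproduces the result by hand from idf_, assuming the default settings (use_idf=True, norm='l2', sublinear_tf=False). The matrices are made-up illustrative data. It also checks that the input is left untouched when copy=True.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up training counts and one new document, both as CSR float matrices.
train_counts = csr_matrix(np.array([[2, 1, 0],
                                    [1, 0, 3],
                                    [0, 1, 1]], dtype=np.float64))
new_counts = csr_matrix(np.array([[1, 1, 1]], dtype=np.float64))

transformer = TfidfTransformer().fit(train_counts)

before = new_counts.toarray().copy()
vectors = transformer.transform(new_counts, copy=True)

# Manual reference: multiply counts by the learned idf, then l2-normalize.
expected = before[0] * transformer.idf_
expected = expected / np.linalg.norm(expected)
```

The idf weights come from the training data only; transform does not update them.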

Gallery examples#

Semi-supervised Classification on a Text Dataset

Clustering text documents using k-means

FeatureHasher and DictVectorizer Comparison