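FeatureHasher and DictVectorizer Comparison#

In this example we illustrate text vectorization, which is the process of representing non-numerical input data (such as dictionaries or text documents) as vectors of real numbers.

We first compare FeatureHasher and DictVectorizer by using both methods to vectorize text documents that are preprocessed (tokenized) with the help of a custom Python function.

Later we introduce and analyze the text-specific vectorizers HashingVectorizer, CountVectorizer and TfidfVectorizer, which handle both the tokenization and the assembling of the feature matrix within a single class.

The objective of the example is to demonstrate the usage of the text vectorization API and to compare the processing time of the vectorizers. For examples that actually learn from text documents, see the scripts plot_document_classification_20newsgroups.py and plot_document_clustering.py.

# Authors: The scikit-learn developers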
# SPDX-License-Identifier: BSD-3-Clause

Load Data#

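We load data from The 20 newsgroups text dataset, which comprises around 18000 newsgroups posts on 20 topics split into two subsets: one for training and one for testing. To keep the example simple and reduce the computational cost, we select a subset of 7 topics and use the training set only.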

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "comp.graphics",
    "comp.sys.ibm.pc.hardware",
    "misc.forsale",
    "rec.autos",
    "sci.space",
    "talk.religion.misc",
]

print("Loading 20 newsgroups training data")
raw_data, _ = fetch_20newsgroups(subset="train", categories=categories, return_X_y=True)
data_size_mb = sum(len(s.encode("utf-8")) for s in raw_data) / 1e6
print(f"{len(raw_data)} documents - {data_size_mb:.3f}MB")

Loading 20 newsgroups training data
3803 documents - 6.245MB

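Define preprocessing functions#

A token may be a word, part of a word, or anything comprised between spaces or symbols in a string. Here we define a function that extracts the tokens using a simple regular expression (regex) that matches Unicode word characters. This includes most characters that can be part of a word in any language, as well as numbers and the underscore: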

import re

def tokenize(doc):
    """Extract tokens from doc.

    This uses a simple regex that matches word characters to break strings
    into tokens. For a more principled approach, see CountVectorizer or
    TfidfVectorizer.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))

list(tokenize("This is a simple example, isn't it?"))

['this', 'is', 'a', 'simple', 'example', 'isn', 't', 'it']
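
Because `\w` matches Unicode word characters when the pattern is a Python 3 `str`, the same regular expression also handles accented letters, digits and underscores. The following is a small sketch of that behaviour; the sample string is made up for illustration.

```python
import re


def tokenize(doc):
    """Lowercased tokens made of Unicode word characters (same regex as above)."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))


# Hypothetical sample text: accented letters, an underscore and digits.
sample = "Caf\u00e9 \u00dcbung snake_case 42!"
tokens = list(tokenize(sample))
print(tokens)
```

Punctuation such as the trailing "!" is never part of a token, while the underscore keeps "snake_case" together.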

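We define an additional function that counts the frequency of occurrence of each token in a given document. It returns a frequency dictionary to be used by the vectorizers.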

from collections import defaultdict


def token_freqs(doc):
    """Extract a dict mapping tokens from doc to their occurrences."""

    freq = defaultdict(int)
    for tok in tokenize(doc):
        freq[tok] += 1
    return freq


token_freqs("That is one example, but this is another one")

defaultdict(<class 'int'>, {'that': 1, 'is': 2, 'one': 2, 'example': 1, 'but': 1, 'this': 1, 'another': 1})

In particular, observe that the repeated tokens "is" and "one" are each counted twice in this example.

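Breaking a text document into word tokens, potentially losing the order information between the words in a sentence, is often called a Bag of Words representation.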
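
To make the loss of order information concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two sentences are hypothetical and not part of the original example.

```python
import re
from collections import Counter


def bag_of_words(doc):
    """Count lowercased word tokens; word order is discarded."""
    return Counter(tok.lower() for tok in re.findall(r"\w+", doc))


# Two made-up sentences with opposite meanings but identical tokens.
a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")
print(a == b)
```

Both sentences map to the same token counts, so any model that sees only these counts cannot tell them apart.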

DictVectorizer#

First we benchmark the DictVectorizer, then we compare it to FeatureHasher, as both of them receive dictionaries as input.

from time import time

from sklearn.feature_extraction import DictVectorizer

dict_count_vectorizers = defaultdict(list)

t0 = time()
vectorizer = DictVectorizer()
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(
    vectorizer.__class__.__name__ + "\non freq dicts"
)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")

done in 0.622 s at 10.0 MB/s
Found 47928 unique terms

The actual mapping from text token to column index is explicitly stored in the .vocabulary_ attribute, which is a potentially very large Python dictionary:

type(vectorizer.vocabulary_)

len(vectorizer.vocabulary_)

vectorizer.vocabulary_["example"]
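
On a toy input the stored mapping is easy to inspect. The sketch below uses made-up dictionaries in place of the output of `token_freqs`; the names `toy_dicts` and `toy_vectorizer` are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical token-count dictionaries standing in for token_freqs output.
toy_dicts = [{"apple": 1, "banana": 2}, {"banana": 1, "cherry": 3}]

toy_vectorizer = DictVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_dicts)

# vocabulary_ maps each feature name to its column index.
print(toy_vectorizer.vocabulary_)
# Because the mapping is stored, the transform can be inverted.
print(toy_vectorizer.inverse_transform(X_toy))
```

The stored vocabulary is what makes the transformation invertible, at the cost of keeping one dictionary entry per distinct token.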

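FeatureHasher#

Dictionaries take up a large amount of storage space and grow in size as the training set grows. Instead of growing the vectors along with a dictionary, feature hashing builds a vector of pre-defined length by applying a hash function h to the features (e.g., tokens), then using the hash values directly as feature indices and updating the resulting vector at those indices. When the feature space is not large enough, hashing functions tend to map distinct values to the same hash code (hash collisions). As a result, it is impossible to determine what object generated any particular hash code.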
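
The idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: `zlib.crc32` stands in for the hash function h (FeatureHasher itself relies on a signed 32-bit MurmurHash3), and `hash_vector` is a hypothetical helper name.

```python
import zlib


def hash_vector(tokens, n_features=16):
    """Accumulate token counts into a fixed-length list via hashing.

    Assumption: zlib.crc32 stands in for the hash function h; it is not
    the hash function used by scikit-learn.
    """
    vec = [0] * n_features
    for tok in tokens:
        # The hash value, reduced modulo the vector length, is the column index.
        index = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[index] += 1
    return vec


tokens = ["this", "is", "one", "example", "is", "one"]
vec = hash_vector(tokens)
print(vec)
```

The vector length never depends on the tokens seen, and no token-to-index dictionary is kept. With `n_features=1` every token lands in the same column, which is the extreme case of a hash collision.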

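Because collisions make the mapping non-invertible, it is impossible to recover the original tokens from the feature matrix, and the best approach to estimate the number of unique terms in the original dictionary is to count the number of active columns in the encoded feature matrix. For that purpose we define the following function: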

import numpy as np


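def n_nonzero_columns(X):
    """Number of columns with at least one non-zero value in a CSR matrix.

    This is useful to count the number of feature columns that are effectively
    active when using the FeatureHasher.
    """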
    return len(np.unique(X.nonzero()[1]))
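
As a quick sanity check, the function can be applied to a small hand-written matrix. The matrix below is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def n_nonzero_columns(X):
    """Number of columns with at least one non-zero value in a CSR matrix."""
    return len(np.unique(X.nonzero()[1]))


# Made-up 3x5 matrix: columns 0, 2 and 4 are active, columns 1 and 3 are empty.
dense = np.array([[1, 0, 0, 0, 2], [0, 0, 3, 0, 0], [4, 0, 0, 0, 0]])
X_small = csr_matrix(dense)
print(n_nonzero_columns(X_small))
```

Column 0 holds two non-zero entries but is counted once, since `np.unique` deduplicates the column indices.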

The default number of features for the FeatureHasher is 2**20. Here we set n_features = 2**18 to illustrate hash collisions.

FeatureHasher on frequency dictionaries

from sklearn.feature_extraction import FeatureHasher

t0 = time()
hasher = FeatureHasher(n_features=2**18)
X = hasher.transform(token_freqs(d) for d in raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(
    hasher.__class__.__name__ + "\non freq dicts"
)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {n_nonzero_columns(X)} unique tokens")

done in 0.359 s at 17.4 MB/s
Found 43873 unique tokens

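The number of unique tokens found with the FeatureHasher is lower than the number obtained with the DictVectorizer. This is due to hash collisions.

The number of collisions can be reduced by increasing the size of the feature space. Notice that the speed of the vectorizer does not change significantly when setting a large number of features, though a larger feature space leads to larger coefficient dimensions in downstream models, which then require more memory to store, even if the majority of the coefficients are inactive.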

t0 = time()
hasher = FeatureHasher(n_features=2**22)
X = hasher.transform(token_freqs(d) for d in raw_data)
duration = time() - t0

print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {n_nonzero_columns(X)} unique tokens")

done in 0.351 s at 17.8 MB/s
Found 47668 unique tokens

We confirm that the number of unique tokens gets closer to the number of unique terms found by the DictVectorizer.
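
The observed counts can be compared with a back-of-the-envelope estimate. The sketch below assumes that the hash function spreads tokens uniformly and independently over the columns, and it ignores the rare case where signed values cancel out; `expected_active_columns` is a hypothetical helper, not a scikit-learn function.

```python
import math


def expected_active_columns(n_tokens, n_features):
    """Expected number of non-empty buckets for uniform, independent hashing."""
    # P(bucket empty) = (1 - 1/n_features) ** n_tokens, computed stably.
    p_empty = math.exp(n_tokens * math.log1p(-1.0 / n_features))
    return n_features * (1.0 - p_empty)


n_terms = 47928  # unique terms reported by DictVectorizer above
for exponent in (18, 22):
    estimate = expected_active_columns(n_terms, 2**exponent)
    print(f"n_features=2**{exponent}: about {estimate:.0f} active columns expected")
```

Under these assumptions the estimates come out close to the 43873 and 47668 active columns reported above, which supports the explanation that the missing columns are due to collisions.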

FeatureHasher on raw tokens

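Alternatively, one can set input_type="string" in the FeatureHasher to vectorize the strings output directly by the custom tokenize function. This is equivalent to passing a dictionary with an implied frequency of 1 for each feature name.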

t0 = time()
hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform(tokenize(d) for d in raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(
    hasher.__class__.__name__ + "\non raw tokens"
)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {n_nonzero_columns(X)} unique tokens")

done in 0.323 s at 19.4 MB/s
Found 43873 unique tokens
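
The stated equivalence between string input and a dictionary with implied values of 1 can be checked on a toy sample. The token list below is made up and contains only distinct tokens.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up sample with distinct tokens only.
toy_tokens = [["red", "green", "blue"]]
toy_dicts = [{"red": 1, "green": 1, "blue": 1}]

X_str = FeatureHasher(n_features=2**4, input_type="string").transform(toy_tokens)
X_dict = FeatureHasher(n_features=2**4, input_type="dict").transform(toy_dicts)

# The two sparse matrices hold the same entries.
print(abs(X_str - X_dict).sum() == 0)
```

Both calls hash the same feature names with the same values, so the resulting rows are identical.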

We now plot the speed of the above methods for vectorizing.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

y_pos = np.arange(len(dict_count_vectorizers["vectorizer"]))
ax.barh(y_pos, dict_count_vectorizers["speed"], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels(dict_count_vectorizers["vectorizer"])
ax.invert_yaxis()
_ = ax.set_xlabel("speed (MB/s)")

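In both cases FeatureHasher is approximately twice as fast as DictVectorizer. This is handy when dealing with large amounts of data, with the downside of losing the invertibility of the transformation, which in turn makes the interpretation of a model a more complex task.

The FeatureHasher with input_type="string" is slightly faster than the variant that works on frequency dictionaries because it does not count repeated tokens: each token is implicitly counted once, even if it was repeated. Depending on the downstream machine learning task, this may or may not be a limitation.

Comparison with special purpose text vectorizers#

CountVectorizer accepts raw data because it internally implements tokenization and occurrence counting. In that respect it is similar to the DictVectorizer used together with the custom function token_freqs, as done in the previous section. The difference is that CountVectorizer is more flexible. In particular, it accepts various regex patterns through the token_pattern parameter.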

from sklearn.feature_extraction.text import CountVectorizer

t0 = time()
vectorizer = CountVectorizer()
vectorizer.fit_transform(raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(vectorizer.__class__.__name__)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")

done in 0.366 s at 17.1 MB/s
Found 47885 unique terms

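We see that using the CountVectorizer implementation is approximately twice as fast as using the DictVectorizer along with the simple function we defined for mapping the tokens. The reason is that CountVectorizer is optimized by reusing a compiled regular expression for the full training set instead of creating one per document, as done in our naive tokenize function.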
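
A variant of the custom tokenizer that follows the same idea might look as follows. This is only a sketch: `TOKEN_PATTERN` and `tokenize_precompiled` are hypothetical names, and no timing is claimed here because the gain depends on the Python version and on the corpus.

```python
import re

# Compile the pattern once, outside the per-document loop.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize_precompiled(doc):
    """Same tokens as the tokenize function, using a shared compiled pattern."""
    return (tok.lower() for tok in TOKEN_PATTERN.findall(doc))


docs = ["This is a simple example, isn't it?", "That is another one"]
tokenized = [list(tokenize_precompiled(d)) for d in docs]
print(tokenized)
```

The output is identical to that of the original tokenize function; only the place where the pattern is compiled changes.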

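Now we run a similar experiment with the HashingVectorizer, which is equivalent to combining the "hashing trick" implemented by the FeatureHasher class with the text preprocessing and tokenization of the CountVectorizer.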

from sklearn.feature_extraction.text import HashingVectorizer

t0 = time()
vectorizer = HashingVectorizer(n_features=2**18)
vectorizer.fit_transform(raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(vectorizer.__class__.__name__)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")

done in 0.273 s at 22.8 MB/s

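We can observe that this is the fastest text tokenization strategy so far, assuming that the downstream machine learning task can tolerate a few collisions.

TfidfVectorizer

In a large text corpus, some words appear with higher frequency (e.g. "the", "a", "is" in English) and do not carry meaningful information about the actual contents of a document. If we were to feed the word count data directly to a classifier, those very common terms would shadow the frequencies of rarer yet more informative terms. In order to re-weight the count features into floating point values suitable for usage by a classifier, it is very common to use the tf-idf transform as implemented by the TfidfTransformer. TF stands for "term-frequency" while "tf-idf" means term-frequency times inverse document-frequency.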
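
The re-weighting can be worked through by hand on a tiny corpus. The sketch below assumes the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is how the default settings of TfidfTransformer are documented; the corpus and the helper `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Made-up, pre-tokenized corpus of three documents.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
n_docs = len(corpus)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in corpus for term in set(doc))

# Smoothed idf, assumed to match TfidfTransformer's default smooth_idf=True.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc):
    """tf * idf per term, followed by L2 normalization of the document vector."""
    weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(tfidf(corpus[0]))
```

The term "the" occurs in every document and receives the lowest weight, while "cat", which occurs in a single document, receives the highest.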

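We now benchmark the TfidfVectorizer, which is equivalent to combining the tokenization and occurrence counting of the CountVectorizer with the normalizing and weighting of a TfidfTransformer.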

from sklearn.feature_extraction.text import TfidfVectorizer

t0 = time()
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(raw_data)
duration = time() - t0
dict_count_vectorizers["vectorizer"].append(vectorizer.__class__.__name__)
dict_count_vectorizers["speed"].append(data_size_mb / duration)
print(f"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s")
print(f"Found {len(vectorizer.get_feature_names_out())} unique terms")

done in 0.344 s at 18.2 MB/s
Found 47885 unique terms
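
The stated equivalence can be checked on a small made-up corpus by running the two-step pipeline explicitly, with default parameters everywhere.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up mini corpus.
docs = ["the cat sat", "the dog sat", "the bird flew away"]

# One step: tokenization, counting and tf-idf weighting in a single class.
X_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw counts first, then tf-idf re-weighting.
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(X_direct.toarray(), X_two_step.toarray()))
```

Both routes produce the same matrix, so the choice between them is mostly a matter of convenience.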

Summary#

Let's conclude this notebook by summarizing all the recorded processing speeds in a single plot:

fig, ax = plt.subplots(figsize=(12, 6))

y_pos = np.arange(len(dict_count_vectorizers["vectorizer"]))
ax.barh(y_pos, dict_count_vectorizers["speed"], align="center")
ax.set_yticks(y_pos)
ax.set_yticklabels(dict_count_vectorizers["vectorizer"])
ax.invert_yaxis()
_ = ax.set_xlabel("speed (MB/s)")

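TfidfVectorizer is expected to be slightly slower than CountVectorizer because of the extra operation induced by the TfidfTransformer. In the run recorded above the two are within measurement noise of each other (17.1 MB/s for CountVectorizer and 18.2 MB/s for TfidfVectorizer).

Notice also that, by setting the number of features to n_features = 2**18, the HashingVectorizer performs better than the CountVectorizer, at the expense of the invertibility of the transformation due to hash collisions.

We emphasize that CountVectorizer and HashingVectorizer perform better than their equivalent DictVectorizer and FeatureHasher on manually tokenized documents, since the internal tokenization step of the former vectorizers compiles a regular expression once and then reuses it for all the documents.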

Total running time of the script: (0 minutes 2.969 seconds)
