
Text classification from scratch

Authors: Mark Omernick, Francois Chollet
Date created: 2019/11/06
Last modified: 2020/05/17
Description: Text sentiment classification starting from raw text files.



Introduction

This example shows how to do text classification starting from raw text (as a set of text files on disk). We demonstrate the workflow on the IMDB sentiment classification dataset (unprocessed version). We use the TextVectorization layer for word splitting and indexing.


Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
import numpy as np
from keras import layers

Load the data: IMDB movie review sentiment classification

Let's download the data and inspect its structure.

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  87.7M      0 --:--:-- --:--:-- --:--:-- 87.7M

The aclImdb folder contains a train and a test subfolder:

!ls aclImdb
!ls aclImdb/test
!ls aclImdb/train
imdbEr.txt  imdb.vocab  README  test  train

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt

labeledBow.feat  pos    unsupBow.feat  urls_pos.txt
neg              unsup  urls_neg.txt   urls_unsup.txt

The aclImdb/train/pos and aclImdb/train/neg folders contain text files, each of which represents one review (either positive or negative):

!cat aclImdb/train/pos/6248_7.txt
Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the director's intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated unfiltered swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than able to relate.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive in a world of redundancy and boredom. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" on the line, an old couple who has an almost mathematical daily cycle (she is the "official replacement" of his ex-wife), a couple that has just divorced, with the ex-husband obviously suffering from the relationship between his former wife and her masseuse, and finally a crazy hitchhiker who asks her drivers the most unusual questions and stretches their nerves by just being super-annoying.<br /><br />After having seen it you feel almost nothing. You're not even shocked, sad, depressed or feel like doing anything... Maybe that's why I gave it 7 points, it made me react in a way I never reacted before. If that's good or bad is up to you!

We are only interested in the pos and neg subfolders, so let's delete the other subfolders that contain text files:

!rm -r aclImdb/train/unsup

You can use the utility keras.utils.text_dataset_from_directory to generate a labeled tf.data.Dataset object from a set of text files on disk filed into class-specific folders.

Let's use it to generate the training, validation, and test datasets. The validation and training datasets are generated from two subsets of the train directory, with 20% of samples going to the validation dataset and 80% going to the training dataset.

Having a validation dataset in addition to the test dataset is useful for tuning hyperparameters, such as the model architecture, for which the test dataset should not be used.

Before putting the model out into the real world, however, it should be retrained using all available training data (without creating a validation dataset), so its performance is maximized; a minimal sketch of this appears after the dataset creation code below. When using the validation_split and subset arguments, make sure to either specify a random seed or pass shuffle=False, so that the validation and training splits you get have no overlap.

batch_size = 32
raw_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
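
As noted above, once hyperparameters are settled you would retrain on all available training data before deployment. A minimal sketch of that setup (not executed in this tutorial) could look like:

# Hypothetical final-training dataset: no validation_split/subset arguments,
# so all 25,000 training reviews are used for the final fit.
raw_full_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)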

Let's preview a few samples:

# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])
b'I\'ve seen tons of science fiction from the 70s; some horrendously bad, and others thought provoking and truly frightening. Soylent Green fits into the latter category. Yes, at times it\'s a little campy, and yes, the furniture is good for a giggle or two, but some of the film seems awfully prescient. Here we have a film, 9 years before Blade Runner, that dares to imagine the future as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G. Robinson fare far better in this than The Ten Commandments, and Robinson\'s assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of the attitudes are dated (can you imagine a filmmaker getting away with the "women as furniture" concept in our oh-so-politically-correct-90s?), but it\'s rare to find a film from the Me Decade that actually can make you think. This is one I\'d love to see on the big screen, because even in a widescreen presentation, I don\'t think the overall scope of this film would receive its due. Check it out.'
1
b'First than anything, I\'m not going to praise I\xc3\xb1arritu\'s short film, even I\'m Mexican and proud of his success in mainstream Hollywood.<br /><br />In another hand, I see most of the reviews focuses on their favorite (and not so) short films; but we are forgetting that there is a subtle bottom line that circles the whole compilation, and maybe it will not be so pleasant for American people. (Even if that was not the main purpose of the producers) <br /><br />What i\'m talking about is that most of the short films does not show the suffering that WASP people went through because the terrorist attack on September 11th, but the suffering of the Other people.<br /><br />Do you need proofs about what i\'m saying? Look, in the Bosnia short film, the message is: "You cry because of the people who died in the Towers, but we (The Others = East Europeans) are crying long ago for the crimes committed against our women and nobody pay attention to us like the whole world has done to you".<br /><br />Even though the Burkina Fasso story is more in comedy, there is a the same thought: "You are angry because Osama Bin Laden punched you in an evil way, but we (The Others = Africans) should be more angry, because our people is dying of hunger, poverty and AIDS long time ago, and nobody pay attention to us like the whole world has done to you".<br /><br />Look now at the Sean Penn short: The fall of the Twin Towers makes happy to a lonely (and alienated) man. So the message is that the Power and the Greed (symbolized by the Towers) must fall for letting the people see the sun rise and the flowers blossom? It is remarkable that this terrible bottom line has been proposed by an American. There is so much irony in this short film that it is close to be subversive.<br /><br />Well, the Ken Loach (very know because his anti-capitalism ideology) is much more clearly and shameless in going straight to the point: "You are angry because your country has been attacked by evil forces, but we (The Others = Latin Americans) suffered at a similar date something worst, and nobody remembers our grief as the whole world has done to you".<br /><br />It is like if the creative of this project wanted to say to Americans: "You see now, America? You are not the only that have become victim of the world violence, you are not alone in your pain and by the way, we (the Others = the Non Americans) have been suffering a lot more than you from long time ago; so, we are in solidarity with you in your pain... and by the way, we are sorry because you have had some taste of your own medicine" Only the Mexican and the French short films showed some compassion and sympathy for American people; the others are like a slap on the face for the American State, that is not equal to American People.'
1
b'Blood Castle (aka Scream of the Demon Lover, Altar of Blood, Ivanna--the best, but least exploitation cinema-sounding title, and so on) is a very traditional Gothic Romance film. That means that it has big, creepy castles, a headstrong young woman, a mysterious older man, hints of horror and the supernatural, and romance elements in the contemporary sense of that genre term. It also means that it is very deliberately paced, and that the film will work best for horror mavens who are big fans of understatement. If you love films like Robert Wise\'s The Haunting (1963), but you also have a taste for late 1960s/early 1970s Spanish and Italian horror, you may love Blood Castle, as well.<br /><br />Baron Janos Dalmar (Carlos Quiney) lives in a large castle on the outskirts of a traditional, unspecified European village. The locals fear him because legend has it that whenever he beds a woman, she soon after ends up dead--the consensus is that he sets his ferocious dogs on them. This is quite a problem because the Baron has a very healthy appetite for women. At the beginning of the film, yet another woman has turned up dead and mutilated.<br /><br />Meanwhile, Dr. Ivanna Rakowsky (Erna Sch\xc3\xbcrer) has appeared in the center of the village, asking to be taken to Baron Dalmar\'s castle. She\'s an out-of-towner who has been hired by the Baron for her expertise in chemistry. Of course, no one wants to go near the castle. Finally, Ivanna finds a shady individual (who becomes even shadier) to take her. Once there, an odd woman who lives in the castle, Olga (Cristiana Galloni), rejects Ivanna and says that she shouldn\'t be there since she\'s a woman. Baron Dalmar vacillates over whether she should stay. She ends up staying, but somewhat reluctantly. The Baron has hired her to try to reverse the effects of severe burns, which the Baron\'s brother, Igor, is suffering from.<br /><br />Unfortunately, the Baron\'s brother appears to be just a lump of decomposing flesh in a vat of bizarre, blackish liquid. And furthermore, Ivanna is having bizarre, hallucinatory dreams. Just what is going on at the castle? Is the Baron responsible for the crimes? Is he insane? <br /><br />I wanted to like Blood Castle more than I did. As I mentioned, the film is very deliberate in its pacing, and most of it is very understated. I can go either way on material like that. I don\'t care for The Haunting (yes, I\'m in a very small minority there), but I\'m a big fan of 1960s and 1970s European horror. One of my favorite directors is Mario Bava. I also love Dario Argento\'s work from that period. But occasionally, Blood Castle moved a bit too slow for me at times. There are large chunks that amount to scenes of not very exciting talking alternated with scenes of Ivanna slowly walking the corridors of the castle.<br /><br />But the atmosphere of the film is decent. Director Jos\xc3\xa9 Luis Merino managed more than passable sets and locations, and they\'re shot fairly well by Emanuele Di Cola. However, Blood Castle feels relatively low budget, and this is a Roger Corman-produced film, after all (which usually means a low-budget, though often surprisingly high quality "quickie"). So while there is a hint of the lushness of Bava\'s colors and complex set decoration, everything is much more minimalist. Of course, it doesn\'t help that the Retromedia print I watched looks like a 30-year old photograph that\'s been left out in the sun too long. 
It appears "washed out", with compromised contrast.<br /><br />Still, Merino and Di Cola occasionally set up fantastic visuals. For example, a scene of Ivanna walking in a darkened hallway that\'s shot from an exaggerated angle, and where an important plot element is revealed through shadows on a wall only. There are also a couple Ingmar Bergmanesque shots, where actors are exquisitely blocked to imply complex relationships, besides just being visually attractive and pulling your eye deep into the frame.<br /><br />The performances are fairly good, and the women--especially Sch\xc3\xbcrer--are very attractive. Merino exploits this fact by incorporating a decent amount of nudity. Sch\xc3\xbcrer went on to do a number of films that were as much soft corn porn as they were other genres, with English titles such as Sex Life in a Woman\'s Prison (1974), Naked and Lustful (1974), Strip Nude for Your Killer (1975) and Erotic Exploits of a Sexy Seducer (1977). Blood Castle is much tamer, but in addition to the nudity, there are still mild scenes suggesting rape and bondage, and of course the scenes mixing sex and death.<br /><br />The primary attraction here, though, is probably the story, which is much a slow-burning romance as anything else. The horror elements, the mystery elements, and a somewhat unexpected twist near the end are bonuses, but in the end, Blood Castle is a love story, about a couple overcoming various difficulties and antagonisms (often with physical threats or harms) to be together.'
1
b"I was talked into watching this movie by a friend who blubbered on about what a cute story this was.<br /><br />Yuck.<br /><br />I want my two hours back, as I could have done SO many more productive things with my time...like, for instance, twiddling my thumbs. I see nothing redeeming about this film at all, save for the eye-candy aspect of it...<br /><br />3/10 (and that's being generous)"
0
b"Michelle Rodriguez is the defining actress who could be the charging force for other actresses to look out for. She has the audacity to place herself in a rarely seen tough-girl role very early in her career (and pull it off), which is a feat that should be recognized. Although her later films pigeonhole her to that same role, this film was made for her ruggedness.<br /><br />Her character is a romanticized student/fighter/lover, struggling to overcome her disenchanted existence in the projects, which is a little overdone in film...but not by a girl. That aspect of this film isn't very original, but the story goes in depth when the heated relationships that this girl has to deal with come to a boil and her primal rage takes over.<br /><br />I haven't seen an actress take such an aggressive stance in movie-making yet, and I'm glad that she's getting that original twist out there in Hollywood. This film got a 7 from me because of the average story of ghetto youth, but it has such a great actress portraying a rarely-seen role in a minimal budget movie. Great work."
1

Prepare the data

In particular, we remove <br /> tags.

import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
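

# Quick illustrative sanity check: the standardizer should lowercase the text,
# strip '<br />' tags, and drop punctuation.
# Expected output: b'fantastic movie loved it'
print(custom_standardization(tf.constant("Fantastic movie!<br />Loved it.")).numpy())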


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)
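
Once adapted, the layer holds the learned vocabulary. A quick illustrative look at what it contains:

vocab = vectorize_layer.get_vocabulary()
# The vocabulary is sorted by token frequency; index 0 is reserved for
# padding ("") and index 1 for out-of-vocabulary tokens ("[UNK]").
print(len(vocab))  # at most max_features (20000)
print(vocab[:5])   # the most frequent tokens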

Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

Option 1: Make it part of the model, so as to obtain a model that processes raw strings, like this:

text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...

Option 2: Apply it to the text dataset to obtain a dataset of word indices, then feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So if you're training the model on GPU, you probably want to go with this option to get the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw strings as input, like in the code snippet for option 1 above. This can be done after training. We do this in the last section.

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
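
Before building the model, it can help to confirm what the vectorized data looks like, for example:

# Each batch is now a dense int64 tensor of shape (batch_size, sequence_length).
for text_batch, label_batch in train_ds.take(1):
    print(text_batch.shape)    # (32, 500)
    print(text_batch[0][:10])  # first 10 token indices of the first review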

Build a model

We choose a simple 1D convnet starting with an Embedding layer.

# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
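
Optionally, you can print a summary to inspect the layer stack and parameter counts:

model.summary()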

Train the model

epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)
Epoch 1/3
 625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.6082 - loss: 0.6121 - val_accuracy: 0.8589 - val_loss: 0.3313
Epoch 2/3
 625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8855 - loss: 0.2748 - val_accuracy: 0.8662 - val_loss: 0.3499
Epoch 3/3
 625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9463 - loss: 0.1432 - val_accuracy: 0.8758 - val_loss: 0.3789

<keras.src.callbacks.history.History at 0x7ff434de94b0>

Evaluate the model on the test set

model.evaluate(test_ds)
 782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8634 - loss: 0.3848

[0.3857516348361969, 0.8642103672027588]

Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply create a new model (using the weights we just trained):

# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end-to-end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
 782/782 ━━━━━━━━━━━━━━━━━━━━ 5s 5ms/step - accuracy: 0.8636 - loss: 0.3829

[0.38630548119544983, 0.8639705777168274]
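
Finally, a brief illustration of running inference on new raw strings with the end-to-end model (the example sentences below are hypothetical, not drawn from the dataset):

# Each inner list is one sample, matching the model's (1,)-shaped string input.
examples = tf.constant(
    [
        ["This movie was fantastic! I loved every minute of it."],
        ["Terribly boring. I walked out halfway through."],
    ]
)
probabilities = end_to_end_model.predict(examples)
print(probabilities)  # sigmoid outputs; values near 1 mean predicted positive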