Wikitext 数据教程

隐藏
! [ -e /content ] && pip install -Uqq fastai  # 在Colab上升级fastai
from fastai.basics import *
from fastai.callback.all import *
from fastai.text.all import *

在文本中使用 DatasetsPipelineTfmdListsTransform

在本教程中,我们将探索文本应用中数据收集的中级API。我们将使用在pets教程中介绍的基础知识,因此您应该已经对TransformPipelineTfmdListsDatasets等概念熟悉。

数据

path = untar_data(URLs.WIKITEXT_TINY)

该数据集包含两个csv文件中的文章,因此我们将其读取并合并到一个数据框中。

df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_all = pd.concat([df_train, df_valid])
df_all.head()
0
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....

我们可以基于空格对其进行分词以进行比较(通常是这样做的),但在这里我们将使用标准的fastai分词器。

splits = [list(range_of(df_train)), list(range(len(df_train), len(df_all)))]
tfms = [attrgetter("text"), Tokenizer.from_df(0), Numericalize()]
dsets = Datasets(df_all, [tfms], splits=splits, dl_type=LMDataLoader)
/home/jhoward/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
bs,sl = 104,72
dls = dsets.dataloaders(bs=bs, seq_len=sl)
dls.show_batch(max_n=3)
text text_
0 xxbos = xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \n▁\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic = xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \n▁\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic xxmaj
1 , who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for the
2 jewish xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \n▁\n▁ = = = xxmaj articles and chapters = = = \n▁\n▁ " xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \n▁\n▁ = = = xxmaj articles and chapters = = = \n▁\n▁ " xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj immemorial

模型

config = awd_lstm_lm_config.copy()
config.update({'input_p': 0.6, 'output_p': 0.4, 'weight_p': 0.5, 'embed_p': 0.1, 'hidden_p': 0.2})
model = get_language_model(AWD_LSTM, len(dls.vocab), config=config)
opt_func = partial(Adam, wd=0.1, eps=1e-7)
cbs = [MixedPrecision(), GradientClip(0.1)] + rnn_cbs(alpha=2, beta=1)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), opt_func=opt_func, cbs=cbs, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 5e-3, moms=(0.8,0.7,0.8), div=10)
epoch train_loss valid_loss accuracy perplexity time
0 5.503713 5.095897 0.237340 163.350342 02:07
#learn.fit_one_cycle(90, 5e-3, 动量=(0.8,0.7,0.8), 分段=10)