Tabular training

! [ -e /content ] && pip install -Uqq fastai  # upgrade fastai on Colab

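How to use the tabular application in fastai

To illustrate the tabular application, we will use the Adult dataset, where the task is to predict from some general data whether a person earns more or less than $50,000 per year.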

from fastai.tabular.all import *

We can download a sample of this dataset with the usual untar_data command:

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()

(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]

Then we can have a look at how the data is structured:

df = pd.read_csv(path/'adult.csv')
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

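Some of the columns are continuous (like age); we treat them as floating-point numbers that can be fed to the model directly. Others are categorical (like workclass or education); we convert each of them to a unique index that is fed to an embedding layer. We can specify the categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods: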

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

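The last argument is the list of pre-processors we apply to our data:

Categorify takes every categorical variable, builds a mapping between integers and its unique categories, and replaces each value with the corresponding index.
FillMissing fills missing values in the continuous variables with the median of the existing values (you can choose a specific value instead if you prefer).
Normalize normalizes the continuous variables (subtracts the mean and divides by the standard deviation).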
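To make the three steps concrete, here is a minimal pure-Python sketch of what each pre-processor does conceptually. It is not fastai's implementation: the helper names (categorify, fill_missing, normalize), the toy columns, the reserved "#na#" slot at index 0, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each distinct category to an integer index; index 0 is reserved for missing."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(vocab)}
    return [index.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace missing entries with the median of the observed ones and flag them."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns; None stands in for a missing value.
workclass = ["Private", "Self-emp-inc", None, "Private"]
education_num = [12.0, 14.0, None, 15.0]

codes, vocab = categorify(workclass)
filled, was_missing = fill_missing(education_num)
scaled = normalize(filled)

print(codes, vocab)         # [1, 2, 0, 1] ['#na#', 'Private', 'Self-emp-inc']
print(filled, was_missing)  # [12.0, 14.0, 14.0, 15.0] [False, False, True, False]
```

The was_missing flags play the same role as the extra education-num_na column that appears in the processed data: the model can still see which values were filled in.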

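To expose more of what happens under the surface, we will now rewrite this example using fastai's TabularPandas class. The one adjustment we need to make is to define how we want to split the data. By default, the factory method above uses a random 80/20 split, so we will do the same: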

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
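As a rough illustration of what a random 80/20 split produces, the sketch below shuffles the row indices and cuts off a validation share. The function name random_split and its seed parameter are hypothetical; this is a stand-in written with the standard library, not the RandomSplitter source.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and reserve the first valid_pct share for validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # (train indices, validation indices)

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # 8 2
```

The result is a pair of index lists, which is the same shape of object that the splits argument receives in the TabularPandas call.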

to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)

Once we have built our TabularPandas object, our data is completely preprocessed, as seen below:

to.xs.iloc[:2]

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
15780	2	16	1	5	2	5	1	0.984037	2.210372	-0.033692
17442	5	12	5	8	2	5	1	-1.509555	-0.319624	-0.425324

Now we can build our DataLoaders again:

dls = to.dataloaders(bs=64)

Later we will explore why preprocessing with TabularPandas is valuable.

The show_batch method works as it does for every other application:

dls.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	State-gov	Bachelors	Married-civ-spouse	Prof-specialty	Wife	White	False	41.000000	75409.001182	13.0	>=50k
1	Private	Some-college	Never-married	Craft-repair	Not-in-family	White	False	24.000000	38455.005013	10.0	<50k
2	Private	Assoc-acdm	Married-civ-spouse	Prof-specialty	Husband	White	False	48.000000	101299.003093	12.0	<50k
3	Private	HS-grad	Never-married	Other-service	Other-relative	Black	False	42.000000	227465.999281	9.0	<50k
4	State-gov	Some-college	Never-married	Prof-specialty	Not-in-family	White	False	20.999999	258489.997130	10.0	<50k
5	Local-gov	12th	Married-civ-spouse	Tech-support	Husband	White	False	39.000000	207853.000067	8.0	<50k
6	Private	Assoc-voc	Married-civ-spouse	Sales	Husband	White	False	36.000000	238414.998930	11.0	>=50k
7	Private	HS-grad	Never-married	Craft-repair	Not-in-family	White	False	19.000000	445727.998937	9.0	<50k
8	Local-gov	Bachelors	Married-civ-spouse	#na#	Husband	White	True	59.000000	196013.000174	10.0	>=50k
9	Private	HS-grad	Married-civ-spouse	Prof-specialty	Wife	Black	False	39.000000	147500.000403	9.0	<50k

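We can define a model using the tabular_learner method. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your y values may already be encoded (for example as 0 and 1). In that case you should explicitly pass y_block = CategoryBlock in your constructor so that fastai does not assume you are doing regression.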

learn = tabular_learner(dls, metrics=accuracy)

We can train that model with the fit_one_cycle method (the fine_tune method is not useful here since we do not have a pretrained model).

learn.fit_one_cycle(1)

epoch	train_loss	valid_loss	accuracy	time
0	0.369360	0.348096	0.840756	00:05

We can then have a look at some predictions:

learn.show_results()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary	salary_pred
0	5.0	12.0	3.0	8.0	1.0	5.0	1.0	0.324868	-1.138177	-0.424022	0.0	0.0
1	5.0	10.0	5.0	2.0	2.0	5.0	1.0	-0.482055	-1.351911	1.148438	0.0	0.0
2	5.0	12.0	6.0	12.0	3.0	5.0	1.0	-0.775482	0.138709	-0.424022	0.0	0.0
3	5.0	16.0	5.0	2.0	4.0	4.0	1.0	-1.362335	-0.227515	-0.030907	0.0	0.0
4	5.0	2.0	5.0	0.0	4.0	5.0	1.0	-1.509048	-0.191191	-1.210252	0.0	0.0
5	5.0	16.0	3.0	13.0	1.0	5.0	1.0	1.498575	-0.051096	-0.030907	1.0	1.0
6	5.0	12.0	3.0	15.0	1.0	5.0	1.0	-0.555412	0.039167	-0.424022	0.0	0.0
7	5.0	1.0	5.0	6.0	4.0	5.0	1.0	-1.582405	-1.396391	-1.603367	0.0	0.0
8	5.0	3.0	5.0	13.0	2.0	5.0	1.0	-1.362335	0.158354	-0.817137	0.0	0.0

Or use the predict method on a single row:

row, clas, probs = learn.predict(df.iloc[0])

row.show()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Private	Assoc-acdm	Married-civ-spouse	#na#	Wife	White	False	49.0	101319.99788	12.0	>=50k

clas, probs

(tensor(1), tensor([0.4995, 0.5005]))

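To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to contain the dependent variable column.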

test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)

Then Learner.get_preds will give you the predictions:

learn.get_preds(dl=dl)

(tensor([[0.4995, 0.5005],
         [0.4882, 0.5118],
         [0.9824, 0.0176],
         ...,
         [0.5324, 0.4676],
         [0.7628, 0.2372],
         [0.5934, 0.4066]]), None)
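Each row of the output holds one probability per class, so turning it into a label is an argmax over the row. The sketch below uses plain lists copied from the first three rows above; the class order ('<50k' at index 0, '>=50k' at index 1) is an assumption, consistent with the earlier predict call returning class 1 for a row labelled >=50k.

```python
# Assumed class order for illustration: index 0 is '<50k', index 1 is '>=50k'.
vocab = ["<50k", ">=50k"]

# Probability rows copied from the output above.
probs = [
    [0.4995, 0.5005],
    [0.4882, 0.5118],
    [0.9824, 0.0176],
]

def to_label(row):
    """Return the label whose probability is highest in this row."""
    best = max(range(len(row)), key=row.__getitem__)
    return vocab[best]

labels = [to_label(row) for row in probs]
print(labels)  # ['>=50k', '>=50k', '<50k']
```

Note how close the first two rows are to 0.5: after a single epoch the model is barely committed on those examples.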

Note

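Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If your test data has missing values in places where the training data did not, you should address this before training.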
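One way to picture the problem: the category-to-index mapping is fixed when the training data is processed, so anything outside it has nowhere specific to go. The sketch below sends unseen and missing values to a reserved slot, mirroring the "#na#" token visible in the show_batch output; that unseen categories are treated this way is an assumption of the sketch, and the column values are hypothetical.

```python
NA_INDEX = 0  # slot reserved for missing or unseen categories in this sketch

# The vocabulary is built once, from the training column only.
train_workclass = ["Private", "State-gov", "Local-gov", "Private"]
vocab = ["#na#"] + sorted(set(train_workclass))
index = {cat: i for i, cat in enumerate(vocab)}

def encode(values):
    """Encode values with the training vocabulary; anything unknown falls back to NA_INDEX."""
    return [index.get(v, NA_INDEX) for v in values]

# 'Never-worked' was not seen in training and None is a missing value.
test_workclass = ["Private", "Never-worked", None]
encoded = encode(test_workclass)
print(encoded)  # [2, 0, 0]
```

Both the unseen category and the missing value collapse to the same index, so the model cannot tell them apart.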

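`fastai` with Other Libraries

As mentioned earlier, TabularPandas is a powerful and easy-to-use preprocessing tool for tabular data. Integrating with random forests or with libraries such as XGBoost requires only one extra step, which the .dataloaders call did for us. Let's look at our to again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want to: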

to.xs[:3]

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
25387	5	16	3	5	1	5	1	0.471582	-1.467756	-0.030907
16872	1	16	5	1	4	5	1	-1.215622	-0.649792	-0.030907
25852	5	16	3	5	1	5	1	1.865358	-0.218915	-0.030907

Now that everything is encoded, you can send it off to XGBoost or a random forest by extracting the training and validation sets and their values:

X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

And now we can send these in directly!
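As a hedged sketch of that last step, the example below fits a scikit-learn RandomForestClassifier on arrays shaped like the ones extracted above. The data is synthetic (random features with a made-up target) so that the block runs on its own; with the real objects you would pass the X_train, y_train, X_test and y_test built from to instead. The choice of scikit-learn and the hyperparameters are assumptions, not something the tutorial prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the encoded features and targets.
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Fit on the training split and score on the held-out split.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"validation accuracy: {accuracy:.3f}")
```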
