Pandas Dataframe¶
Setup¶
In [ ]:
pip install ydf pandas -U
In [8]:
import ydf
import pandas as pd
import numpy as np

# Create a small dataframe with columns of different types.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,  # A numerical feature
    "feature_2": ["X", "X", "Y", "Y"] * 20,  # A categorical feature
    "feature_3": [True, False, True, False] * 20,  # A boolean feature
    "label": [True, True, False, False] * 20,  # The labels
})
df.head()
Out[8]:
|   | feature_1 | feature_2 | feature_3 | label |
|---|-----------|-----------|-----------|-------|
| 0 | 1         | X         | True      | True  |
| 1 | 2         | X         | False     | True  |
| 2 | 3         | Y         | True      | False |
| 3 | 1         | Y         | False     | False |
| 4 | 1         | X         | True      | True  |
We can train a model directly on this dataframe.
In [4]:
# Train a model.
model = ydf.RandomForestLearner(label="label").train(df)
Train model on 80 examples Model trained in 0:00:00.003959
In [5]:
model.describe()
Out[5]:
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (3) : feature_1 feature_2 feature_3
Weights : None
Trained with tuner : No
Model size : 257 kB
Number of records: 80
Number of columns: 4

Number of columns by type:
    CATEGORICAL: 2 (50%)
    BOOLEAN: 1 (25%)
    NUMERICAL: 1 (25%)

Columns:

CATEGORICAL: 2 (50%)
    0: "label" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"false" 40 (50%) dtype:DTYPE_BOOL
    2: "feature_2" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"X" 40 (50%) dtype:DTYPE_BYTES

BOOLEAN: 1 (25%)
    3: "feature_3" BOOLEAN true_count:40 false_count:40 dtype:DTYPE_BOOL

NUMERICAL: 1 (25%)
    1: "feature_1" NUMERICAL mean:1.75 min:1 max:3 sd:0.829156 dtype:DTYPE_FLOAT64

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 80
Number of predictions (with weights): 80
Task: CLASSIFICATION
Label: label

Accuracy: 1  CI95[W][0.963246 1]
LogLoss: 0
ErrorRate: 0

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false     40     0
            true      0    40
Total: 80
Variable importances measure the importance of an input feature for a model.
1. "feature_2" 1.000000
1. "feature_2" 300.000000
1. "feature_2" 300.000000
1. "feature_2" 16479.940276
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Num trees : 300
Only printing the first tree.
Tree #0:
    "feature_2" is in [BITMAP] {X} [s:0.692835 n:80 np:39 miss:0] ; val:"false" prob:[0.5125, 0.4875]
        ├─(pos)─ val:"true" prob:[0, 1]
        └─(neg)─ val:"false" prob:[1, 0]
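The variable importances listed by describe() can also be retrieved programmatically, and analyzing the model on a dataset computes additional ones (as mentioned above). The following is a minimal sketch, assuming the variable_importances() and analyze() methods of YDF models; the training dataframe is reused here purely for illustration, where a held-out test dataframe would normally be used.

# Sketch: programmatic access to the training-time variable importances.
print(model.variable_importances())

# Sketch: analyze the model on a dataset to compute additional variable
# importances. In practice, pass a separate test dataframe; df is reused here
# only to keep the example self-contained.
analysis = model.analyze(df)
analysis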
By default, all the columns except the label are used as input features. The "features" argument of the learner restricts the model to an explicit subset of columns:
In [7]:
model = ydf.RandomForestLearner(
    label="label",
    features=["feature_1", "feature_2"],
).train(df)

print("Model input features:", model.input_feature_names())
Train model on 80 examples
Model trained in 0:00:00.003908
Model input features: ['feature_1', 'feature_2']
Overriding feature semantics¶
For the model to consume a feature, it needs to know how to interpret it. This is called the feature "semantic". YDF supports four types of feature semantics:
- Numerical: For quantities or measurements.
- Categorical: For categories or enumerations.
- Boolean: A special kind of categorical feature with only two categories: true and false.
- Categorical-set: For sets of categories, tags, or bags of words.
YDF automatically determines the semantic of a feature from its representation. For instance, float and integer values are automatically detected as numerical.
For example, here are the semantics of the model trained above:
In [9]:
model.input_features()
Out[9]:
[InputFeature(name='feature_1', semantic=<Semantic.NUMERICAL: 1>, column_idx=0), InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=1)]
In some cases, it is useful to force a specific semantic. For example, if an enumeration is encoded with integers, it is important to force the feature to be treated as categorical:
In [11]:
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,  # Use all the features; not just the ones in "features".
).train(df)

model.input_features()
Train model on 80 examples Model trained in 0:00:00.004236
Out[11]:
[InputFeature(name='feature_1', semantic=<Semantic.CATEGORICAL: 2>, column_idx=0), InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=2), InputFeature(name='feature_3', semantic=<Semantic.BOOLEAN: 5>, column_idx=3)]
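As an alternative to overriding the semantic on the YDF side, the column representation can be changed on the Pandas side before training. The snippet below is only a sketch, under the assumption that string-valued (object dtype) columns are detected as categorical, as seen for feature_2 above; df_cast and model_cast are names introduced here for illustration.

# Sketch: cast the integer-encoded enum to strings so that YDF detects it as
# categorical automatically, instead of forcing the semantic with ydf.Feature.
df_cast = df.copy()
df_cast["feature_1"] = df_cast["feature_1"].astype(str)

model_cast = ydf.RandomForestLearner(label="label").train(df_cast)
model_cast.input_features()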