Pandas Dataframe¶
Setup¶
In [ ]:
pip install ydf pandas -U
In [8]:
import ydf
import pandas as pd
import numpy as np

# Create a small dataframe with columns of different types.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 1] * 20,  # A numerical feature
    "feature_2": ["X", "X", "Y", "Y"] * 20,  # A categorical feature
    "feature_3": [True, False, True, False] * 20,  # A boolean feature
    "label": [True, True, False, False] * 20,  # The labels
})
df.head()
Out[8]:
|   | feature_1 | feature_2 | feature_3 | label |
|---|-----------|-----------|-----------|-------|
| 0 | 1         | X         | True      | True  |
| 1 | 2         | X         | False     | True  |
| 2 | 3         | Y         | True      | False |
| 3 | 1         | Y         | False     | False |
| 4 | 1         | X         | True      | True  |
We can train a model directly on this dataframe.
In [4]:
# Train a model.
model = ydf.RandomForestLearner(label="label").train(df)
Train model on 80 examples Model trained in 0:00:00.003959
In [5]:
model.describe()
Out[5]:
Name : RANDOM_FOREST
Task : CLASSIFICATION
Label : label
Features (3) : feature_1 feature_2 feature_3
Weights : None
Trained with tuner : No
Model size : 257 kB
Number of records: 80
Number of columns: 4

Number of columns by type:
    CATEGORICAL: 2 (50%)
    BOOLEAN: 1 (25%)
    NUMERICAL: 1 (25%)

Columns:

CATEGORICAL: 2 (50%)
    0: "label" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"false" 40 (50%) dtype:DTYPE_BOOL
    2: "feature_2" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"X" 40 (50%) dtype:DTYPE_BYTES

BOOLEAN: 1 (25%)
    3: "feature_3" BOOLEAN true_count:40 false_count:40 dtype:DTYPE_BOOL

NUMERICAL: 1 (25%)
    1: "feature_1" NUMERICAL mean:1.75 min:1 max:3 sd:0.829156 dtype:DTYPE_FLOAT64

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 80
Number of predictions (with weights): 80
Task: CLASSIFICATION
Label: label

Accuracy: 1  CI95[W][0.963246 1]
LogLoss: 0
ErrorRate: 0

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false     40     0
            true      0    40
Total: 80
Variable importances measure the importance of an input feature for a model.
1. "feature_2" 1.000000
1. "feature_2" 300.000000
1. "feature_2" 300.000000
1. "feature_2" 16479.940276
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Num trees : 300
Only printing the first tree.
Tree #0:
    "feature_2" is in [BITMAP] {X} [s:0.692835 n:80 np:39 miss:0] ; val:"false" prob:[0.5125, 0.4875]
        ├─(pos)─ val:"true" prob:[0, 1]
        └─(neg)─ val:"false" prob:[1, 0]
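The variable importances listed by describe() can also be retrieved programmatically, and analyzing the model on a dataset computes additional ones (as mentioned above). The following is a minimal sketch, assuming the variable_importances() and analyze() methods of YDF models; the training dataframe is reused here purely for illustration, where a held-out test dataframe would normally be used.

# Sketch: programmatic access to the training-time variable importances.
print(model.variable_importances())

# Sketch: analyze the model on a dataset to compute additional variable
# importances. In practice, pass a separate test dataframe; df is reused here
# only to keep the example self-contained.
analysis = model.analyze(df)
analysis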
By default, all the columns except the label are used as input features. The "features" argument of the learner restricts the model to an explicit subset of columns:
In [7]:
model = ydf.RandomForestLearner(
    label="label",
    features=["feature_1", "feature_2"],
).train(df)

print("Model input features:", model.input_feature_names())
Train model on 80 examples
Model trained in 0:00:00.003908
Model input features: ['feature_1', 'feature_2']
Overriding feature semantics¶
For the model to consume a feature, it needs to know how to interpret it. This is called the feature "semantic". YDF supports four types of feature semantics:
- Numerical: For quantities or measurements.
- Categorical: For categories or enumerations.
- Boolean: A special kind of categorical feature with only two categories: true and false.
- Categorical-set: For sets of categories, tags, or bags of words.
YDF automatically determines the semantic of a feature from its representation. For instance, float and integer values are automatically detected as numerical.
For example, here are the semantics of the model trained above:
In [9]:
model.input_features()
Out[9]:
[InputFeature(name='feature_1', semantic=<Semantic.NUMERICAL: 1>, column_idx=0), InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=1)]
In some cases, it is useful to force a specific semantic. For example, if an enumeration is encoded with integers, it is important to force the feature to be treated as categorical:
In [11]:
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,  # Use all the features; not just the ones in "features".
).train(df)

model.input_features()
Train model on 80 examples Model trained in 0:00:00.004236
Out[11]:
[InputFeature(name='feature_1', semantic=<Semantic.CATEGORICAL: 2>, column_idx=0), InputFeature(name='feature_2', semantic=<Semantic.CATEGORICAL: 2>, column_idx=2), InputFeature(name='feature_3', semantic=<Semantic.BOOLEAN: 5>, column_idx=3)]
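As an alternative to overriding the semantic on the YDF side, the column representation can be changed on the Pandas side before training. The snippet below is only a sketch, under the assumption that string-valued (object dtype) columns are detected as categorical, as seen for feature_2 above; df_cast and model_cast are names introduced here for illustration.

# Sketch: cast the integer-encoded enum to strings so that YDF detects it as
# categorical automatically, instead of forcing the semantic with ydf.Feature.
df_cast = df.copy()
df_cast["feature_1"] = df_cast["feature_1"].astype(str)

model_cast = ydf.RandomForestLearner(label="label").train(df_cast)
model_cast.input_features()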