import ydf
import pandas as pd
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": ["red", "red", "blue", "green"],
"feature_2": ["hot", "hot", "cold", ""],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.008941
We can see in the dataspec tab that the features were detected as categorical.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
	CATEGORICAL: 3 (100%)

Columns:

CATEGORICAL: 3 (100%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
	1: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
	2: "feature_2" CATEGORICAL num-nas:1 (25%) has-dict vocab-size:1 num-oods:3 (100%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57374
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
false                 0     2
true                  2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
Sometimes, you may want to force the semantic of a feature to be categorical.
In the next example, "feature_1" and "feature_2" are integers, so they are automatically detected as numerical. However, we want "feature_1" to be treated as categorical.
In the model description, note that "feature_1" is categorical while "feature_2" is numerical.
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [5, 6, 7, 6],
})
model = ydf.RandomForestLearner(label="label",
features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True lets the model use all the dataset's
# columns as features, not just the ones listed in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004352
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
	CATEGORICAL: 2 (66.6667%)
	NUMERICAL: 1 (33.3333%)

Columns:

CATEGORICAL: 2 (66.6667%)
	0: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
	1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
	2: "feature_2" NUMERICAL mean:6 min:5 max:7 sd:0.707107

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57805
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
false                 0     2
true                  2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]