import ydf
import pandas as pd
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": ["red", "red", "blue", "green"],
"feature_2": ["hot", "hot", "cold", ""],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.008941
We can see in the dataspec tab that the features were detected as categorical.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
	CATEGORICAL: 3 (100%)

Columns:

CATEGORICAL: 3 (100%)
	0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
	1: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
	2: "feature_2" CATEGORICAL num-nas:1 (25%) has-dict vocab-size:1 num-oods:3 (100%)

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57374
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
false                 0     2
true                  2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
Sometimes, you may want to force the semantic of a feature to be categorical.
In the next example, "feature_1" and "feature_2" are integers, so they are automatically detected as numerical. However, we want "feature_1" to be treated as categorical.
In the model description, note that "feature_1" is categorical while "feature_2" is numerical.
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [5, 6, 7, 6],
})
model = ydf.RandomForestLearner(label="label",
features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True lets the model use all the dataset's
# columns as features, not just the ones listed in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004352
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
	CATEGORICAL: 2 (66.6667%)
	NUMERICAL: 1 (33.3333%)

Columns:

CATEGORICAL: 2 (66.6667%)
	0: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
	1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
	2: "feature_2" NUMERICAL mean:6 min:5 max:7 sd:0.707107

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57805
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
false                 0     2
true                  2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]