Classification

Setup

In [ ]:

pip install ydf -U

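What is classification?

Classification is the task of predicting a categorical value, such as an enum, type, or class, from a finite set of possible values. For instance, predicting a color from the set of possible colors (e.g., red, blue, green) is a classification task. The output of a classification model is a probability distribution over the possible classes. The predicted class is the one with the highest probability.

When there are only two classes, the task is called binary classification. In this case, the model returns a single probability.

Classification labels can be strings, integers, or boolean values.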
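The relation between a probability distribution and the predicted class can be sketched in plain Python. This is a minimal illustration, not YDF code: the helper names `predicted_class` and `binary_distribution` are hypothetical, and the sketch assumes that the single probability returned for binary classification is the probability of the positive class.

```python
def predicted_class(distribution):
    """Return the class with the highest probability."""
    return max(distribution, key=distribution.get)


def binary_distribution(p_positive, negative_class, positive_class):
    """Expand a single positive-class probability into a two-class distribution.

    Assumption: the single value is the probability of the positive class.
    """
    return {negative_class: 1.0 - p_positive, positive_class: p_positive}


# Multi-class case: one probability per possible class.
colors = {"red": 0.2, "blue": 0.7, "green": 0.1}
print(predicted_class(colors))  # blue

# Binary case: one probability is enough to recover both.
income = binary_distribution(0.8, negative_class="<=50K", positive_class=">50K")
print(predicted_class(income))  # >50K
```

In the binary case, taking the most probable class is equivalent to thresholding the single probability at 0.5.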

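Train a classification model

The task of a model (e.g., classification, regression) is determined by the task learner argument. The default value of this argument is ydf.Task.CLASSIFICATION, which means that by default, YDF trains classification models.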

In [2]:

# Load the libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets.

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

Out[2]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	44	Private	228057	7th-8th	4	Married-civ-spouse	Machine-op-inspct	Wife	White	Female	0	40	Dominican-Republic	<=50K
1	20	Private	299047	Some-college	10	Never-married	Other-service	Not-in-family	White	Female	0	20	United-States	<=50K
2	40	Private	342164	HS-grad	9	Separated	Adm-clerical	Unmarried	White	Female	0	37	United-States	<=50K
3	30	Private	361742	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	50	United-States	<=50K
4	67	Self-emp-inc	171564	HS-grad	9	Married-civ-spouse	Prof-specialty	Wife	White	Female	20051	30	England	>50K

The label column is:

In [3]:

train_ds["income"]

Out[3]:

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4         >50K
         ...  
22787    <=50K
22788     >50K
22789    <=50K
22790    <=50K
22791    <=50K
Name: income, Length: 22792, dtype: object

We can train a classification model:

In [4]:

model = ydf.RandomForestLearner(label="income",
                                task=ydf.Task.CLASSIFICATION).train(train_ds)
# Note: ydf.Task.CLASSIFICATION is the default value of "task".

assert model.task() == ydf.Task.CLASSIFICATION

Train model on 22792 examples
Model trained in 0:00:01.179527

Classification models are evaluated using accuracy, the confusion matrix, ROC-AUC, and PR-AUC.

In [5]:

evaluation = model.evaluate(test_ds)

print(evaluation)

accuracy: 0.866005
confusion matrix:
    label (row) \ prediction (col)
    +-------+-------+-------+
    |       | <=50K |  >50K |
    +-------+-------+-------+
    | <=50K |  6976 |   873 |
    +-------+-------+-------+
    |  >50K |   436 |  1484 |
    +-------+-------+-------+
characteristics:
    name: '>50K' vs others
    ROC AUC: 0.908676
    PR AUC: 0.790029
    Num thresholds: 302
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
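As a sanity check, the reported accuracy can be recomputed by hand from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of examples. The sketch below is plain Python with the counts copied from the printed evaluation; it does not call YDF, and the variable names are illustrative only.

```python
# Confusion matrix counts copied from the printed evaluation.
# Accuracy depends only on the diagonal and the grand total.
confusion = [
    [6976, 873],
    [436, 1484],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(total)               # 9769
print(round(accuracy, 6))  # 0.866005
```

The total matches the reported number of examples, and the ratio matches the reported accuracy.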

In a notebook, displaying the evaluation object renders a rich report that includes ROC and PR plots.

In [6]:

evaluation

Out[6]:

accuracy: 0.866005
AUC: '>50K' vs others: 0.908676
PR-AUC: '>50K' vs others: 0.790029
loss: 0.394958
num examples: 9769
num examples (weighted): 9769

Confusion matrix

Label \ Pred	<=50K	>50K
<=50K	6976	436
>50K	873	1484