Classification
Setup
In [ ]:
pip install ydf -U
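What is classification?
Classification is the task of predicting a categorical value, such as an enum, a type, or a class, from a finite set of possible values. For instance, predicting a color from a set of possible colors (e.g. red, blue, green) is a classification task. The output of a classification model is a probability distribution over the possible classes. The predicted class is the one with the highest probability.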
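As a minimal sketch of how a predicted class follows from a probability distribution, the snippet below picks the class with the highest probability. The class names and probability values are made up for illustration and do not come from a trained model.

```python
# Hypothetical classes and probabilities, for illustration only.
classes = ["red", "blue", "green"]
probabilities = [0.2, 0.7, 0.1]

# A probability distribution over the classes sums to 1 (up to rounding).
assert abs(sum(probabilities) - 1.0) < 1e-9

# The predicted class is the one with the highest probability.
best_index = max(range(len(classes)), key=lambda i: probabilities[i])
predicted_class = classes[best_index]
print(predicted_class)
```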
When there are only two classes, the task is called binary classification. In this case, the model returns a single probability.
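The sketch below shows one way a single probability can be expanded into a full two-class distribution and a predicted class. It assumes the probability refers to the second of the two classes and uses a 0.5 decision threshold; both are assumptions made for illustration, and binary_prediction is a hypothetical helper, not part of YDF.

```python
def binary_prediction(p_positive, classes=("<=50K", ">50K"), threshold=0.5):
    """Expands one probability into a two-class distribution and a predicted class.

    Assumption: p_positive is the probability of the second class in `classes`.
    """
    negative, positive = classes
    distribution = {negative: 1.0 - p_positive, positive: p_positive}
    predicted = positive if p_positive >= threshold else negative
    return distribution, predicted


distribution, predicted = binary_prediction(0.8)
print(distribution, predicted)
```

Because the two probabilities sum to 1, returning only one of them loses no information.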
Classification labels can be strings, integers, or boolean values.
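Training a classification model
The task of a model (e.g. classification, regression) is determined by the "task" learner argument. The default value of this argument is ydf.Task.CLASSIFICATION, so by default YDF trains classification models.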
In [2]:
# Load the libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use pandas to load the small dataset.

# Download a classification dataset and load it as a pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)
Out[2]:
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
The label column is:
In [3]:
train_ds["income"]
Out[3]:
0        <=50K
1        <=50K
2        <=50K
3        <=50K
4         >50K
         ...
22787    <=50K
22788     >50K
22789    <=50K
22790    <=50K
22791    <=50K
Name: income, Length: 22792, dtype: object
We can train a classification model:
In [4]:
model = ydf.RandomForestLearner(label="income",
                                task=ydf.Task.CLASSIFICATION).train(train_ds)
# Note: ydf.Task.CLASSIFICATION is the default value of "task"
assert model.task() == ydf.Task.CLASSIFICATION
Train model on 22792 examples
Model trained in 0:00:01.179527
Classification models are evaluated using accuracy, the confusion matrix, ROC-AUC, and PR-AUC.
In [5]:
evaluation = model.evaluate(test_ds)
print(evaluation)
accuracy: 0.866005
confusion matrix:
    label (row) \ prediction (col)
    +-------+-------+-------+
    |       | <=50K |  >50K |
    +-------+-------+-------+
    | <=50K |  6976 |   873 |
    +-------+-------+-------+
    |  >50K |   436 |  1484 |
    +-------+-------+-------+
characteristics:
    name: '>50K' vs others
    ROC AUC: 0.908676
    PR AUC: 0.790029
    Num thresholds: 302
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
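The reported accuracy can be checked by hand from the confusion matrix: it is the share of examples on the diagonal, i.e. those whose predicted class matches the label. The sketch below uses only the counts printed in the output above and plain Python, not the YDF API.

```python
# Counts copied from the confusion matrix printed above.
confusion = [
    [6976, 873],
    [436, 1484],
]

# Total number of evaluated examples.
num_examples = sum(sum(row) for row in confusion)

# Diagonal cells are the examples where prediction and label agree.
num_correct = sum(confusion[i][i] for i in range(len(confusion)))

accuracy = num_correct / num_examples
print(num_examples, round(accuracy, 6))
```

The result agrees with the "accuracy" and "num examples" values in the report.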
You can display a rich evaluation report with ROC and PR plots.
In [6]:
evaluation
Out[6]:
accuracy: 0.866005
AUC: '>50K' vs others: 0.908676
PR-AUC: '>50K' vs others: 0.790029
loss: 0.394958
num examples: 9769
num examples (weighted): 9769

| Label \ Pred | <=50K | >50K |
|---|---|---|
| <=50K | 6976 | 436 |
| >50K | 873 | 1484 |
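To give some intuition for the ROC AUC value in the report, the sketch below computes it on a tiny made-up dataset as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This is the standard pairwise definition, not necessarily how YDF computes the metric, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # A tie counts as half a correctly ranked pair.
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (True stands for ">50K") and predicted probabilities.
labels = [True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 1.0 means every positive example is ranked above every negative one, while 0.5 corresponds to a random ranking.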