Cross-validation
Setup
In [ ]:
pip install ydf -U
In [2]:
import ydf
import pandas as pd
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
dataset = pd.read_csv(f"{ds_path}/adult.csv")
# Print the first 5 examples.
dataset.head(5)
Out[2]:
 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
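The next cell evaluates a Random Forest with 10-fold cross-validation. As a rough illustration of what "folds" means, the sketch below splits example indices into k disjoint test sets and uses the remaining examples for training in each round. The helper `k_fold_indices` is hypothetical and written only for this illustration; it is not part of YDF, and YDF's internal fold assignment may differ.

```python
import random


def k_fold_indices(num_examples, folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Hypothetical helper for illustration only; not a YDF function.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for fold in range(folds):
        # Every `folds`-th shuffled index, starting at `fold`, is held out.
        test = indices[fold::folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


splits = list(k_fold_indices(num_examples=20, folds=10))
for fold, (train, test) in enumerate(splits):
    print(f"fold {fold}: {len(train)} training examples, test indices {sorted(test)}")
```

Each example is held out exactly once, so the metrics aggregated over all folds cover the whole dataset.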
In [9]:
learner = ydf.RandomForestLearner(label="income")
evaluation = learner.cross_validation(dataset, folds=10)
evaluation
[INFO 23-11-01 14:14:30.9654 CET dataset.cc:440] max_vocab_count = -1 for column income, the dictionary will not be pruned by size.
Out[9]:
accuracy: 0.866681
AUC: '>50K' vs others: 0.904458
PR-AUC: '>50K' vs others: 0.786608
loss: 0.459967
num examples: 32561
num examples (weighted): 32561
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 23385 | 1335 |
>50K | 3006 | 4835 |
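As a sanity check, the reported accuracy can be recomputed from the confusion matrix: the diagonal holds the correctly classified examples, and the four cells sum to the number of examples. The sketch below uses only the counts printed above; the precision and recall for the '>50K' class are extra quantities derived from the same counts and are not part of the original output.

```python
# Confusion matrix counts copied from the output above
# (outer key: true label, inner key: predicted label).
confusion = {
    "<=50K": {"<=50K": 23385, ">50K": 1335},
    ">50K": {"<=50K": 3006, ">50K": 4835},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Treat '>50K' as the positive class.
true_pos = confusion[">50K"][">50K"]
false_pos = confusion["<=50K"][">50K"]
false_neg = confusion[">50K"]["<=50K"]
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(f"examples: {total}")
print(f"accuracy: {accuracy:.6f}")
print(f"precision ('>50K'): {precision:.4f}")
print(f"recall ('>50K'): {recall:.4f}")
```

The recomputed accuracy agrees with the value reported by the evaluation, which confirms that the matrix aggregates predictions over all folds.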