pip install ydf -U
What is model tuning?¶
Model tuning, also known as automated hyperparameter optimization or AutoML, is the search for the learner hyperparameters that maximize the quality of the model. YDF supports model tuning out of the box.
YDF offers two tuning modes: you can manually specify the hyperparameters to optimize and their candidate values, or use a pre-configured tuner. The second option is simpler, while the first gives you more control. We demonstrate both options in this tutorial.
Tuning can run on a single machine or, with distributed training, on multiple machines. This tutorial focuses on tuning on a single machine: local tuning is simple to set up and produces excellent results on small datasets.
Distributed model tuning¶
Distributed tuning can be beneficial for models with long training times or large hyperparameter search spaces. Distributed tuning requires configuring worker machines and passing them to the learner's workers constructor argument. Once the workers are set up, the tuning workflow is the same as on a local machine. See the distributed training tutorial for more information.
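Conceptually, distributed tuning farms independent trials out to workers and collects their scores. The following is a minimal plain-Python sketch of that idea only, not the YDF API: SEARCH_SPACE, run_trial, and distributed_random_search are hypothetical stand-ins, with run_trial replacing the training of a real model on a worker.

```python
import concurrent.futures
import random

# Hypothetical search space; in real distributed tuning each trial trains
# a YDF model on a remote worker and returns its validation score.
SEARCH_SPACE = {"shrinkage": [0.2, 0.1, 0.05], "max_depth": [3, 4, 5, 6]}

def run_trial(params):
    # Toy deterministic score standing in for a model's validation quality.
    return -abs(params["shrinkage"] - 0.1) - 0.01 * params["max_depth"]

def distributed_random_search(num_trials, num_workers, seed=42):
    rng = random.Random(seed)
    trials = [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(num_trials)
    ]
    # Workers evaluate trials in parallel; YDF's worker machines play this role.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        scores = list(pool.map(run_trial, trials))
    # Return the best (score, hyperparameters) pair.
    return max(zip(scores, trials), key=lambda pair: pair[0])

best_score, best_params = distributed_random_search(num_trials=20, num_workers=4)
```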
Download the dataset¶
We use the Adult dataset.
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets.
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
Local tuning with manually configured hyperparameters¶
We first create a random-search tuner and manually specify the hyperparameters to optimize and their candidate values:
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])
<ydf.learner.tuner.SearchSpace at 0x7f3eb4372310>
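The three choice calls above define a search space of 3 × 3 × 4 = 36 unique combinations, fewer than num_trials=50, which is why the tuning logs further below list at most 36 trials. A plain-Python sketch of the enumeration (for illustration only; the tuner does this sampling internally):

```python
import itertools

# The same search space as above, written out as plain Python.
search_space = {
    "shrinkage": [0.2, 0.1, 0.05],
    "subsample": [1.0, 0.9, 0.8],
    "max_depth": [3, 4, 5, 6],
}

# Every unique hyperparameter combination a trial can draw.
combinations = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
num_combinations = len(combinations)  # 3 * 3 * 4 = 36 < num_trials=50
```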
We create a learner with this tuner and train a model:
Note: Hyperparameters that are not tuned can be specified directly on the learner.
Note: To print the tuning logs during tuning, enable logging with ydf.verbose(2).
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,  # Used for all trials.
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:03.998356
The model description contains the tuning logs, i.e., the list of tested hyperparameters and their scores, shown in the tuning tab.
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 543 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration | shrinkage | subsample | max_depth |
---|---|---|---|---|---|
16 | -0.574861 | 2.49348 | 0.2 | 1 | 5 |
31 | -0.576405 | 3.53616 | 0.2 | 1 | 6 |
15 | -0.577211 | 2.4727 | 0.1 | 1 | 5 |
33 | -0.578941 | 3.69053 | 0.2 | 0.9 | 5 |
32 | -0.579071 | 3.54803 | 0.2 | 0.9 | 6 |
35 | -0.579637 | 3.99118 | 0.1 | 1 | 6 |
19 | -0.581703 | 2.68832 | 0.2 | 0.8 | 6 |
34 | -0.582941 | 3.90171 | 0.1 | 0.8 | 6 |
14 | -0.583348 | 2.46785 | 0.2 | 0.8 | 5 |
27 | -0.583466 | 3.23896 | 0.2 | 0.9 | 4 |
10 | -0.58463 | 2.14364 | 0.2 | 1 | 4 |
22 | -0.584824 | 2.97681 | 0.1 | 0.9 | 6 |
13 | -0.585809 | 2.46436 | 0.1 | 0.9 | 5 |
12 | -0.587067 | 2.29765 | 0.1 | 0.8 | 5 |
8 | -0.590813 | 1.97632 | 0.2 | 0.8 | 4 |
24 | -0.593991 | 3.0293 | 0.05 | 1 | 6 |
9 | -0.595175 | 2.14037 | 0.1 | 1 | 4 |
21 | -0.596592 | 2.91333 | 0.05 | 0.8 | 6 |
28 | -0.597159 | 3.2767 | 0.1 | 0.9 | 4 |
20 | -0.597244 | 2.90384 | 0.05 | 0.9 | 6 |
6 | -0.597766 | 1.96352 | 0.1 | 0.8 | 4 |
5 | -0.603554 | 1.71404 | 0.2 | 1 | 3 |
23 | -0.60517 | 3.01335 | 0.2 | 0.9 | 3 |
18 | -0.605849 | 2.54463 | 0.05 | 0.9 | 5 |
0 | -0.606706 | 1.49037 | 0.2 | 0.8 | 3 |
17 | -0.607283 | 2.511 | 0.05 | 0.8 | 5 |
30 | -0.608091 | 3.47695 | 0.05 | 1 | 5 |
25 | -0.619956 | 3.17843 | 0.1 | 0.9 | 3 |
3 | -0.620752 | 1.63833 | 0.1 | 0.8 | 3 |
4 | -0.621349 | 1.70712 | 0.1 | 1 | 3 |
7 | -0.625488 | 1.96705 | 0.05 | 0.8 | 4 |
29 | -0.626953 | 3.43528 | 0.05 | 0.9 | 4 |
11 | -0.62982 | 2.16092 | 0.05 | 1 | 4 |
1 | -0.656424 | 1.57613 | 0.05 | 0.8 | 3 |
26 | -0.656732 | 3.20212 | 0.05 | 1 | 3 |
2 | -0.656747 | 1.62633 | 0.05 | 0.9 | 3 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.574861 Accuracy: 0.87251 CI95[W][0 1] ErrorRate: : 0.12749 Confusion Table: truth\prediction <=50K >50K <=50K 1570 94 >50K 194 401 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "age" 0.257622 ################ 2. "capital_gain" 0.249047 ############# 3. "relationship" 0.244032 ########### 4. "occupation" 0.242881 ########### 5. "hours_per_week" 0.238530 ########## 6. "education" 0.237441 ######### 7. "marital_status" 0.234935 ######## 8. "capital_loss" 0.231145 ####### 9. "fnlwgt" 0.226059 ###### 10. "native_country" 0.225767 ###### 11. "workclass" 0.220718 #### 12. "education_num" 0.219033 #### 13. "sex" 0.211384 # 14. "race" 0.206124
1. "capital_gain" 11.000000 ################ 2. "age" 10.000000 ############## 3. "hours_per_week" 10.000000 ############## 4. "relationship" 9.000000 ############ 5. "marital_status" 7.000000 ######### 6. "education" 6.000000 ######## 7. "capital_loss" 6.000000 ######## 8. "fnlwgt" 5.000000 ###### 9. "workclass" 3.000000 ### 10. "education_num" 3.000000 ### 11. "sex" 3.000000 ### 12. "occupation" 1.000000 13. "race" 1.000000
1. "occupation" 144.000000 ################ 2. "age" 121.000000 ############# 3. "education" 113.000000 ############ 4. "capital_gain" 111.000000 ############ 5. "capital_loss" 90.000000 ######### 6. "native_country" 87.000000 ######### 7. "fnlwgt" 84.000000 ######### 8. "relationship" 73.000000 ####### 9. "marital_status" 68.000000 ####### 10. "hours_per_week" 64.000000 ###### 11. "workclass" 49.000000 ##### 12. "education_num" 28.000000 ## 13. "sex" 14.000000 # 14. "race" 5.000000
1. "relationship" 1675.422986 ################ 2. "capital_gain" 1040.150118 ######### 3. "education_num" 687.196583 ###### 4. "occupation" 526.056194 ##### 5. "marital_status" 469.469421 #### 6. "age" 289.979275 ## 7. "capital_loss" 281.277707 ## 8. "education" 259.256109 ## 9. "hours_per_week" 181.939375 # 10. "native_country" 108.750643 # 11. "workclass" 64.136268 12. "fnlwgt" 46.873309 13. "sex" 30.074515 14. "race" 2.153583
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-8.31766e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.233866 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.545366 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.832346 | | | ├─(pos)─ pred:0.834828 | | | └─(neg)─ pred:0.619473 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.492116 | | ├─(pos)─ pred:0.813402 | | └─(neg)─ pred:0.453839 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0997371 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.810859 | | ├─(pos)─ pred:0.634856 | | └─(neg)─ pred:0.839967 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0646271 | ├─(pos)─ pred:0.205598 | └─(neg)─ pred:-0.0218904 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.190336 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.795647 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.811553 | | ├─(pos)─ pred:0.833976 | | └─(neg)─ pred:0.398979 | └─(neg)─ pred:0.178485 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.207979 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.10157 | ├─(pos)─ pred:-0.0207104 | └─(neg)─ pred:-0.210678 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.234206 ├─(pos)─ pred:0.14084 └─(neg)─ pred:-0.235938
The model can be evaluated as usual.
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6974 | 438 |
>50K | 781 | 1576 |
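As a sanity check, the test accuracy can be recovered from the confusion matrix above (numbers copied from the output; rows are the true labels):

```python
# Diagonal cells: <=50K and >50K examples predicted correctly.
correct = 6974 + 1576
total = 6974 + 438 + 781 + 1576
accuracy = correct / total  # approximately 0.8752
```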
Configuring conditional hyperparameters¶
Some hyperparameters are only relevant when other hyperparameters are configured in a specific way. For example, optimizing max_depth makes sense when growing_strategy=LOCAL, whereas optimizing max_num_nodes is more appropriate when growing_strategy=BEST_FIRST_GLOBAL. We can configure a tuner to take these conditional dependencies into account.
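A plain-Python sketch of the sampling logic such a conditional search space implies (illustration only, not YDF's implementation; sample_conditional_trial is a hypothetical helper):

```python
import random

def sample_conditional_trial(rng):
    # Shared hyperparameters are sampled unconditionally.
    params = {
        "shrinkage": rng.choice([0.2, 0.1, 0.05]),
        "subsample": rng.choice([1.0, 0.9, 0.8]),
        "growing_strategy": rng.choice(["LOCAL", "BEST_FIRST_GLOBAL"]),
    }
    # Which additional hyperparameter exists depends on growing_strategy.
    if params["growing_strategy"] == "LOCAL":
        params["max_depth"] = rng.choice([3, 4, 5, 6])
    else:
        params["max_num_nodes"] = rng.choice([32, 64, 128, 256])
    return params

trial = sample_conditional_trial(random.Random(0))
```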
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
local_subspace = tuner.choice("growing_strategy", ["LOCAL"])
local_subspace.choice("max_depth", [3, 4, 5, 6])
global_subspace = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_subspace.choice("max_num_nodes", [32, 64, 128, 256])
<ydf.learner.tuner.SearchSpace at 0x7f3f10549e50>
Let's tune the model and show the results.
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:06.789261
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 543 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration | shrinkage | subsample | growing_strategy | max_depth | max_num_nodes |
---|---|---|---|---|---|---|---|
31 | -0.574861 | 5.4128 | 0.2 | 1 | LOCAL | 5 | |
10 | -0.576405 | 2.72618 | 0.2 | 1 | LOCAL | 6 | |
18 | -0.578031 | 3.67246 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
25 | -0.578941 | 4.434 | 0.2 | 0.9 | LOCAL | 5 | |
11 | -0.579071 | 2.97415 | 0.2 | 0.9 | LOCAL | 6 | |
21 | -0.579482 | 4.04769 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 64 | |
39 | -0.579482 | 5.72021 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
44 | -0.579637 | 6.08383 | 0.1 | 1 | LOCAL | 6 | |
16 | -0.580548 | 3.50807 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
8 | -0.582698 | 2.65852 | 0.2 | 1 | BEST_FIRST_GLOBAL | 64 | |
28 | -0.582941 | 5.26744 | 0.1 | 0.8 | LOCAL | 6 | |
6 | -0.583348 | 2.62349 | 0.2 | 0.8 | LOCAL | 5 | |
4 | -0.583466 | 2.33348 | 0.2 | 0.9 | LOCAL | 4 | |
14 | -0.583824 | 3.30352 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 128 | |
15 | -0.583824 | 3.32547 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 64 | |
3 | -0.584435 | 2.30352 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
42 | -0.584518 | 5.98935 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 256 | |
49 | -0.584518 | 6.78263 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 128 | |
33 | -0.58463 | 5.52032 | 0.2 | 1 | LOCAL | 4 | |
12 | -0.584824 | 3.29028 | 0.1 | 0.9 | LOCAL | 6 | |
41 | -0.587067 | 5.83872 | 0.1 | 0.8 | LOCAL | 5 | |
32 | -0.589069 | 5.51251 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
47 | -0.590361 | 6.48934 | 0.05 | 1 | BEST_FIRST_GLOBAL | 64 | |
23 | -0.590361 | 4.2553 | 0.05 | 1 | BEST_FIRST_GLOBAL | 256 | |
43 | -0.590541 | 6.02143 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
37 | -0.590813 | 5.66041 | 0.2 | 0.8 | LOCAL | 4 | |
45 | -0.592258 | 6.24542 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 64 | |
34 | -0.592258 | 5.6012 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 256 | |
9 | -0.592258 | 2.72076 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
22 | -0.59235 | 4.12727 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
48 | -0.59235 | 6.76277 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 256 | |
46 | -0.59389 | 6.38848 | 0.05 | 1 | BEST_FIRST_GLOBAL | 32 | |
35 | -0.593991 | 5.64836 | 0.05 | 1 | LOCAL | 6 | |
17 | -0.594588 | 3.5419 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
19 | -0.595605 | 3.78829 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 64 | |
20 | -0.595605 | 3.82693 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 256 | |
36 | -0.597159 | 5.65092 | 0.1 | 0.9 | LOCAL | 4 | |
13 | -0.597244 | 3.29495 | 0.05 | 0.9 | LOCAL | 6 | |
2 | -0.597766 | 2.23067 | 0.1 | 0.8 | LOCAL | 4 | |
1 | -0.603554 | 1.84102 | 0.2 | 1 | LOCAL | 3 | |
29 | -0.60517 | 5.35164 | 0.2 | 0.9 | LOCAL | 3 | |
7 | -0.605849 | 2.63557 | 0.05 | 0.9 | LOCAL | 5 | |
0 | -0.606706 | 1.81158 | 0.2 | 0.8 | LOCAL | 3 | |
5 | -0.607283 | 2.58145 | 0.05 | 0.8 | LOCAL | 5 | |
24 | -0.619956 | 4.39896 | 0.1 | 0.9 | LOCAL | 3 | |
40 | -0.621349 | 5.80321 | 0.1 | 1 | LOCAL | 3 | |
30 | -0.626953 | 5.38874 | 0.05 | 0.9 | LOCAL | 4 | |
27 | -0.62982 | 5.01653 | 0.05 | 1 | LOCAL | 4 | |
38 | -0.656732 | 5.66151 | 0.05 | 1 | LOCAL | 3 | |
26 | -0.656747 | 4.62038 | 0.05 | 0.9 | LOCAL | 3 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.574861 Accuracy: 0.87251 CI95[W][0 1] ErrorRate: : 0.12749 Confusion Table: truth\prediction <=50K >50K <=50K 1570 94 >50K 194 401 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "age" 0.257622 ################ 2. "capital_gain" 0.249047 ############# 3. "relationship" 0.244032 ########### 4. "occupation" 0.242881 ########### 5. "hours_per_week" 0.238530 ########## 6. "education" 0.237441 ######### 7. "marital_status" 0.234935 ######## 8. "capital_loss" 0.231145 ####### 9. "fnlwgt" 0.226059 ###### 10. "native_country" 0.225767 ###### 11. "workclass" 0.220718 #### 12. "education_num" 0.219033 #### 13. "sex" 0.211384 # 14. "race" 0.206124
1. "capital_gain" 11.000000 ################ 2. "age" 10.000000 ############## 3. "hours_per_week" 10.000000 ############## 4. "relationship" 9.000000 ############ 5. "marital_status" 7.000000 ######### 6. "education" 6.000000 ######## 7. "capital_loss" 6.000000 ######## 8. "fnlwgt" 5.000000 ###### 9. "workclass" 3.000000 ### 10. "education_num" 3.000000 ### 11. "sex" 3.000000 ### 12. "occupation" 1.000000 13. "race" 1.000000
1. "occupation" 144.000000 ################ 2. "age" 121.000000 ############# 3. "education" 113.000000 ############ 4. "capital_gain" 111.000000 ############ 5. "capital_loss" 90.000000 ######### 6. "native_country" 87.000000 ######### 7. "fnlwgt" 84.000000 ######### 8. "relationship" 73.000000 ####### 9. "marital_status" 68.000000 ####### 10. "hours_per_week" 64.000000 ###### 11. "workclass" 49.000000 ##### 12. "education_num" 28.000000 ## 13. "sex" 14.000000 # 14. "race" 5.000000
1. "relationship" 1675.422986 ################ 2. "capital_gain" 1040.150118 ######### 3. "education_num" 687.196583 ###### 4. "occupation" 526.056194 ##### 5. "marital_status" 469.469421 #### 6. "age" 289.979275 ## 7. "capital_loss" 281.277707 ## 8. "education" 259.256109 ## 9. "hours_per_week" 181.939375 # 10. "native_country" 108.750643 # 11. "workclass" 64.136268 12. "fnlwgt" 46.873309 13. "sex" 30.074515 14. "race" 2.153583
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-8.31766e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.233866 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.545366 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.832346 | | | ├─(pos)─ pred:0.834828 | | | └─(neg)─ pred:0.619473 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.492116 | | ├─(pos)─ pred:0.813402 | | └─(neg)─ pred:0.453839 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0997371 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.810859 | | ├─(pos)─ pred:0.634856 | | └─(neg)─ pred:0.839967 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0646271 | ├─(pos)─ pred:0.205598 | └─(neg)─ pred:-0.0218904 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.190336 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.795647 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.811553 | | ├─(pos)─ pred:0.833976 | | └─(neg)─ pred:0.398979 | └─(neg)─ pred:0.178485 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.207979 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.10157 | ├─(pos)─ pred:-0.0207104 | └─(neg)─ pred:-0.210678 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.234206 ├─(pos)─ pred:0.14084 └─(neg)─ pred:-0.235938
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6974 | 438 |
>50K | 781 | 1576 |
Local tuning with automatically configured hyperparameters¶
If you do not want to configure the hyperparameters to optimize yourself, you can use a pre-configured tuner.
tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
Model training proceeds as before:
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:01.745021
We then look at the model:
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 1374 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration |
---|---|---|
0 | -0.579637 | 1.74332 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.579637 Accuracy: 0.868083 CI95[W][0 1] ErrorRate: : 0.131917 Confusion Table: truth\prediction <=50K >50K <=50K 1564 100 >50K 198 397 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.234685 ################ 2. "age" 0.231226 ############### 3. "marital_status" 0.225030 ############# 4. "occupation" 0.216504 ########### 5. "education" 0.212171 ######### 6. "relationship" 0.203987 ####### 7. "hours_per_week" 0.203680 ####### 8. "capital_loss" 0.199160 ###### 9. "fnlwgt" 0.188297 ### 10. "native_country" 0.187899 ### 11. "education_num" 0.185984 ## 12. "workclass" 0.184872 ## 13. "race" 0.177978 14. "sex" 0.176098
1. "capital_gain" 19.000000 ################ 2. "marital_status" 18.000000 ############### 3. "age" 15.000000 ############ 4. "relationship" 10.000000 ####### 5. "capital_loss" 8.000000 ##### 6. "hours_per_week" 8.000000 ##### 7. "education" 6.000000 ### 8. "education_num" 5.000000 ## 9. "race" 5.000000 ## 10. "fnlwgt" 2.000000 11. "occupation" 2.000000 12. "sex" 2.000000
1. "occupation" 437.000000 ################ 2. "age" 331.000000 ############ 3. "education" 285.000000 ########## 4. "capital_gain" 257.000000 ######### 5. "capital_loss" 230.000000 ######## 6. "native_country" 221.000000 ####### 7. "fnlwgt" 210.000000 ####### 8. "hours_per_week" 207.000000 ####### 9. "relationship" 172.000000 ###### 10. "workclass" 140.000000 #### 11. "marital_status" 139.000000 #### 12. "education_num" 63.000000 ## 13. "sex" 23.000000 14. "race" 8.000000
1. "relationship" 2993.930793 ################ 2. "capital_gain" 2048.254640 ########## 3. "marital_status" 1095.321390 ##### 4. "education" 1094.118075 ##### 5. "occupation" 1009.400363 ##### 6. "education_num" 794.643186 #### 7. "capital_loss" 571.858684 ### 8. "age" 545.766716 ## 9. "hours_per_week" 336.939387 # 10. "native_country" 241.147622 # 11. "workclass" 164.564834 12. "fnlwgt" 115.319824 13. "sex" 43.401514 14. "race" 2.559291
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-4.15883e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.116933 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.272683 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.416173 | | | ├─(pos)─ "age">=79.5 [s:0.000449964 n:429 np:5 miss:0] ; pred:0.417414 | | | | ├─(pos)─ pred:0.309737 | | | | └─(neg)─ pred:0.418684 | | | └─(neg)─ pred:0.309737 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.246058 | | ├─(pos)─ "capital_loss">=1989.5 [s:0.00201289 n:249 np:39 miss:0] ; pred:0.406701 | | | ├─(pos)─ pred:0.349312 | | | └─(neg)─ pred:0.417359 | | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Sales, Tech-support, Protective-serv} [s:0.0097175 n:2090 np:1688 miss:0] ; pred:0.226919 | | ├─(pos)─ pred:0.253437 | | └─(neg)─ pred:0.11557 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0498685 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.40543 | | ├─(pos)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Machine-op-inspct, Transport-moving, Handlers-cleaners} [s:0.0296244 n:43 np:25 miss:0] ; pred:0.317428 | | | ├─(pos)─ pred:0.397934 | | | └─(neg)─ pred:0.205614 | | └─(neg)─ "fnlwgt">=36212.5 [s:1.36643e-16 n:260 np:250 miss:1] ; pred:0.419984 | | ├─(pos)─ pred:0.419984 | | └─(neg)─ pred:0.419984 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0323136 | ├─(pos)─ "age">=33.5 [s:0.00939348 n:2334 np:1769 miss:1] ; pred:0.102799 | | ├─(pos)─ pred:0.132992 | | 
└─(neg)─ pred:0.00826457 | └─(neg)─ "education" is in [BITMAP] {<OOD>, HS-grad, Some-college, Bachelors, Masters, Assoc-voc, Assoc-acdm, Prof-school, Doctorate} [s:0.00478423 n:3803 np:2941 miss:1] ; pred:-0.0109452 | ├─(pos)─ pred:0.00969668 | └─(neg)─ pred:-0.0813718 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.0951681 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.397823 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.405777 | | ├─(pos)─ "capital_gain">=30961.5 [s:0.000242202 n:184 np:20 miss:0] ; pred:0.416988 | | | ├─(pos)─ pred:0.392422 | | | └─(neg)─ pred:0.419984 | | └─(neg)─ "education" is in [BITMAP] {Bachelors, Masters, Assoc-voc, Prof-school} [s:0.16 n:10 np:5 miss:0] ; pred:0.19949 | | ├─(pos)─ pred:0.419984 | | └─(neg)─ pred:-0.0210046 | └─(neg)─ pred:0.0892425 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.10399 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.0507848 | ├─(pos)─ "education" is in [BITMAP] {<OOD>, HS-grad, Some-college, Assoc-voc, 11th, Assoc-acdm, 10th, 7th-8th, Prof-school, 9th, ...[5 left]} [s:0.0110157 n:1263 np:125 miss:1] ; pred:-0.0103552 | | ├─(pos)─ pred:0.16421 | | └─(neg)─ pred:-0.0295298 | └─(neg)─ "capital_loss">=1977 [s:0.00164232 n:936 np:5 miss:0] ; pred:-0.105339 | ├─(pos)─ pred:0.19949 | └─(neg)─ pred:-0.106976 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.117103 ├─(pos)─ "fnlwgt">=125450 [s:0.0755454 n:41 np:28 miss:1] ; pred:0.0704198 | ├─(pos)─ pred:-0.0328167 | └─(neg)─ pred:0.292776 └─(neg)─ "hours_per_week">=40.5 [s:0.000447024 n:8881 np:1559 miss:0] ; pred:-0.117969 ├─(pos)─ pred:-0.0927111 └─(neg)─ pred:-0.123347
And evaluate the model:
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6985 | 427 |
>50K | 796 | 1561 |
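For comparison, the test accuracies of the three tuned models can be computed directly from the confusion matrices reported above (numbers copied from the outputs; rows are the true labels). All three approaches land at roughly 87.5% accuracy on this dataset:

```python
def accuracy(confusion):
    """Accuracy from a 2x2 confusion matrix [[a, b], [c, d]] (rows: truth)."""
    (a, b), (c, d) = confusion
    return (a + d) / (a + b + c + d)

# Confusion matrices copied from the three evaluations above.
manual_search = accuracy([[6974, 438], [781, 1576]])
conditional_search = accuracy([[6974, 438], [781, 1576]])
automatic_search = accuracy([[6985, 427], [796, 1561]])
```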