Monotonicity
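Monotonic constraints force a monotonic relationship between a feature and the model's prediction. For example, we may want the model output to never decrease when the value of a specific feature increases. Monotonicity is imposed through the features argument.
Note: Not all learners support monotonic constraints.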
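Stated a little more formally, as a sketch in notation that is not part of the original notebook ($f$ is the model output, $x_j$ the constrained feature, $x_{-j}$ all remaining features), and assuming the constraint is read in the usual weak (non-decreasing) sense, a monotonically increasing constraint on feature $j$ asks that:

$$
x_j \le x_j' \;\implies\; f(x_j, x_{-j}) \le f(x_j', x_{-j}) \quad \text{for every fixed } x_{-j}.
$$

A monotonically decreasing constraint reverses the second inequality.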
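Let's train a model on the Adult Census dataset with a monotonically increasing constraint on the features age and hours_per_week. In other words, we force the model to predict a higher income as age and the number of hours worked per week increase, all other features being equal.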
In [1]:
import ydf
import pandas as pd
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples.
train_ds.head(5)
Out[1]:
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
In [2]:
model = ydf.GradientBoostedTreesLearner(
    label="income",
    features=[
        ydf.Feature("age", monotonic=+1),
        ydf.Feature("hours_per_week", monotonic=+1),
    ],
    include_all_columns=True,
    use_hessian_gain=True,
).train(train_ds)
Train model on 22792 examples
Model trained in 0:00:02.326853
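To verify that a model is monotonic with respect to a feature, we can inspect its partial dependence plots (PDP). In the PDP tab, the curves for age and hours_per_week are monotonically increasing.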
In [3]:
model.analyze(test_ds, sampling=0.1)
Out[3]:
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.051592 ################
2. "marital_status" 0.043403 #############
3. "education_num" 0.030505 #########
4. "age" 0.017197 #####
5. "capital_loss" 0.013512 ####
6. "occupation" 0.010953 ###
7. "hours_per_week" 0.006551 ##
8. "workclass" 0.003173 #
9. "race" 0.001024
10. "fnlwgt" 0.000921
11. "relationship" 0.000921
12. "native_country" 0.000921
13. "education" -0.001433
14. "sex" -0.001638
1. "capital_gain" 0.234571 ################
2. "marital_status" 0.126207 ########
3. "age" 0.062733 ####
4. "education_num" 0.061874 ####
5. "capital_loss" 0.045245 ###
6. "occupation" 0.025731 #
7. "hours_per_week" 0.017928 #
8. "relationship" 0.007549
9. "workclass" 0.005649
10. "sex" 0.002152
11. "fnlwgt" 0.001415
12. "race" 0.001161
13. "education" 0.000641
14. "native_country" 0.000496
1. "marital_status" 0.078686 ################
2. "capital_gain" 0.060498 ############
3. "age" 0.056163 ###########
4. "education_num" 0.030999 ######
5. "capital_loss" 0.014430 ##
6. "hours_per_week" 0.012948 ##
7. "occupation" 0.012506 ##
8. "relationship" 0.005170
9. "workclass" 0.002254
10. "sex" 0.001692
11. "race" 0.000560
12. "native_country" 0.000364
13. "education" 0.000331
14. "fnlwgt" 0.000270
1. "capital_gain" 0.234376 ################
2. "marital_status" 0.126181 ########
3. "age" 0.062723 ####
4. "education_num" 0.061858 ####
5. "capital_loss" 0.045227 ###
6. "occupation" 0.025721 #
7. "hours_per_week" 0.017925 #
8. "relationship" 0.007542
9. "workclass" 0.005649
10. "sex" 0.002149
11. "fnlwgt" 0.001409
12. "race" 0.001158
13. "education" 0.000638
14. "native_country" 0.000495
1. "capital_gain" 0.240106 ################
2. "fnlwgt" 0.228937 #############
3. "education_num" 0.221243 ###########
4. "occupation" 0.204376 #######
5. "age" 0.202566 ######
6. "marital_status" 0.201923 ######
7. "capital_loss" 0.199952 ######
8. "relationship" 0.191252 ###
9. "native_country" 0.189980 ###
10. "hours_per_week" 0.189106 ###
11. "workclass" 0.186076 ##
12. "race" 0.179428
13. "sex" 0.176670
14. "education" 0.175650
1. "capital_gain" 34.000000 ################
2. "marital_status" 27.000000 ############
3. "occupation" 20.000000 ########
4. "capital_loss" 19.000000 #######
5. "relationship" 15.000000 #####
6. "age" 14.000000 ####
7. "fnlwgt" 14.000000 ####
8. "education_num" 13.000000 ####
9. "native_country" 11.000000 ###
10. "workclass" 9.000000 ##
11. "hours_per_week" 6.000000
12. "race" 5.000000
1. "fnlwgt" 1283.000000 ################
2. "capital_gain" 705.000000 ########
3. "education_num" 618.000000 #######
4. "capital_loss" 426.000000 #####
5. "age" 384.000000 ####
6. "hours_per_week" 283.000000 ###
7. "occupation" 243.000000 ##
8. "workclass" 128.000000 #
9. "relationship" 114.000000 #
10. "marital_status" 106.000000 #
11. "native_country" 103.000000
12. "education" 61.000000
13. "sex" 37.000000
14. "race" 27.000000
1. "marital_status" 487568319.598255 ################
2. "education_num" 343622892.392935 ###########
3. "capital_gain" 320171507.632499 ##########
4. "hours_per_week" 153983141.770836 #####
5. "capital_loss" 146585694.670481 ####
6. "age" 98337013.334848 ###
7. "occupation" 52018124.395105 #
8. "fnlwgt" 14683666.270136
9. "sex" 7973337.944597
10. "workclass" 7065720.846374
11. "relationship" 6549757.592968
12. "native_country" 3384842.107984
13. "race" 1127991.045659
14. "education" 1047237.942327
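The partial dependence plots give a visual check. A numeric spot check is also possible: sweep one feature over a grid while holding the others fixed, and confirm that the prediction never drops. The sketch below is only an illustration under stated assumptions. `is_monotonic_increasing` and `toy_predict` are hypothetical names, and `toy_predict` is a hand-written stand-in for a trained model, so the block runs without YDF or the dataset.

```python
from typing import Callable, Dict, List, Sequence


def is_monotonic_increasing(
    predict: Callable[[Dict[str, float]], float],
    base_example: Dict[str, float],
    feature: str,
    grid: Sequence[float],
    tolerance: float = 1e-9,
) -> bool:
    """Returns True if the prediction never decreases as `feature` sweeps `grid`."""
    predictions: List[float] = []
    for value in sorted(grid):
        example = dict(base_example)  # Copy so the other features stay fixed.
        example[feature] = value
        predictions.append(predict(example))
    # Compare each prediction with the next one along the sorted grid.
    return all(b >= a - tolerance for a, b in zip(predictions, predictions[1:]))


def toy_predict(example: Dict[str, float]) -> float:
    """Hypothetical stand-in for a model: NOT the model trained above."""
    age_effect = 0.01 * min(example["age"], 60)  # Rises until 60, then flat.
    hours_effect = -0.0002 * (example["hours_per_week"] - 45) ** 2  # Peaks at 45.
    return age_effect + hours_effect


base = {"age": 40.0, "hours_per_week": 40.0}
age_ok = is_monotonic_increasing(toy_predict, base, "age", range(20, 91, 5))
hours_ok = is_monotonic_increasing(
    toy_predict, base, "hours_per_week", range(10, 81, 5)
)
print(age_ok, hours_ok)  # True False
```

With a real model, `predict` could wrap a call to `model.predict` on a one-row dataset. That wrapper is left out here because its exact shape depends on the input format. Note that this checks a single base example, whereas a partial dependence plot averages over many examples.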
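For comparison, let's train the same model without monotonic constraints and compare the partial dependence plots.
Here, the partial dependence plots of age and hours_per_week are not monotonic. For example, for ages above 60, the model output decreases as age increases.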
In [4]:
model = ydf.GradientBoostedTreesLearner(
    label="income",
    use_hessian_gain=True,
).train(train_ds)
model.analyze(test_ds, sampling=0.1)
Train model on 22792 examples
Model trained in 0:00:02.118608
Out[4]:
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.055789 ################
2. "marital_status" 0.044733 ############
3. "education_num" 0.028150 ########
4. "occupation" 0.019040 #####
5. "age" 0.015867 ####
6. "capital_loss" 0.015457 ####
7. "hours_per_week" 0.003276 #
8. "workclass" 0.003173 #
9. "fnlwgt" 0.001331
10. "education" 0.000819
11. "relationship" 0.000512
12. "native_country" 0.000102
13. "sex" -0.000102
14. "race" -0.001126
1. "capital_gain" 0.250679 ################
2. "marital_status" 0.132571 ########
3. "age" 0.062733 ####
4. "education_num" 0.059362 ###
5. "capital_loss" 0.045224 ##
6. "occupation" 0.031333 ##
7. "hours_per_week" 0.016096 #
8. "relationship" 0.004431
9. "workclass" 0.004119
10. "sex" 0.002572
11. "fnlwgt" 0.001708
12. "education" 0.000919
13. "native_country" 0.000418
14. "race" -0.000084
1. "marital_status" 0.087207 ################
2. "capital_gain" 0.064457 ###########
3. "age" 0.058184 ##########
4. "education_num" 0.027934 #####
5. "occupation" 0.015009 ##
6. "capital_loss" 0.014695 ##
7. "hours_per_week" 0.010852 #
8. "relationship" 0.003060
9. "sex" 0.002069
10. "workclass" 0.001361
11. "fnlwgt" 0.000469
12. "education" 0.000332
13. "native_country" 0.000159
14. "race" 0.000068
1. "capital_gain" 0.250493 ################
2. "marital_status" 0.132549 ########
3. "age" 0.062720 ####
4. "education_num" 0.059350 ###
5. "capital_loss" 0.045208 ##
6. "occupation" 0.031326 ##
7. "hours_per_week" 0.016094 #
8. "relationship" 0.004429
9. "workclass" 0.004118
10. "sex" 0.002570
11. "fnlwgt" 0.001708
12. "education" 0.000918
13. "native_country" 0.000417
14. "race" -0.000085
1. "age" 0.262588 ################
2. "capital_gain" 0.226122 #########
3. "education_num" 0.217170 #######
4. "fnlwgt" 0.210966 ######
5. "hours_per_week" 0.208267 ######
6. "marital_status" 0.202346 #####
7. "occupation" 0.201522 #####
8. "capital_loss" 0.198465 ####
9. "native_country" 0.181994 #
10. "relationship" 0.181866 #
11. "workclass" 0.179705 #
12. "race" 0.176878
13. "sex" 0.173925
14. "education" 0.172549
1. "age" 38.000000 ################
2. "capital_gain" 23.000000 #########
3. "marital_status" 21.000000 ########
4. "capital_loss" 18.000000 #######
5. "occupation" 17.000000 ######
6. "education_num" 16.000000 ######
7. "hours_per_week" 11.000000 ####
8. "fnlwgt" 6.000000 #
9. "native_country" 6.000000 #
10. "relationship" 5.000000 #
11. "race" 3.000000
12. "workclass" 2.000000
1. "fnlwgt" 920.000000 ################
2. "age" 850.000000 ##############
3. "hours_per_week" 539.000000 #########
4. "capital_gain" 493.000000 ########
5. "education_num" 439.000000 #######
6. "capital_loss" 374.000000 ######
7. "occupation" 186.000000 ##
8. "marital_status" 129.000000 #
9. "workclass" 110.000000 #
10. "relationship" 107.000000 #
11. "native_country" 81.000000
12. "education" 34.000000
13. "sex" 34.000000
14. "race" 28.000000
1. "marital_status" 484346054.435797 ################
2. "education_num" 342841906.363874 ###########
3. "capital_gain" 314297945.114876 ##########
4. "hours_per_week" 156279266.529111 #####
5. "capital_loss" 136981976.118004 ####
6. "age" 124340810.030382 ####
7. "occupation" 47180431.518591 #
8. "fnlwgt" 11123014.673219
9. "workclass" 6312540.338696
10. "relationship" 6128030.124364
11. "sex" 5356973.172271
12. "native_country" 3569206.429949
13. "race" 1483551.225928
14. "education" 828077.075970
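To pinpoint where an unconstrained curve stops being monotonic, such as the drop for ages above 60 described earlier, one option is to scan the curve for decreasing runs. The sketch below assumes the curve is available as two plain lists of x and y values. `decreasing_segments` is a hypothetical helper, and the numbers in `curve` are made up for illustration. They are not read from the plots of this notebook.

```python
from typing import List, Sequence, Tuple


def decreasing_segments(
    xs: Sequence[float], ys: Sequence[float], tolerance: float = 1e-9
) -> List[Tuple[float, float]]:
    """Returns (x_start, x_end) for each maximal run where ys decreases."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    segments: List[Tuple[float, float]] = []
    start = None  # x value where the current decreasing run began, if any.
    for i in range(1, len(ys)):
        if ys[i] < ys[i - 1] - tolerance:
            if start is None:
                start = xs[i - 1]
        elif start is not None:
            # The run ended at the previous point.
            segments.append((start, xs[i - 1]))
            start = None
    if start is not None:
        # The curve is still decreasing at its last point.
        segments.append((start, xs[-1]))
    return segments


# Made-up curve values, for illustration only.
ages = [20, 30, 40, 50, 60, 70, 80]
curve = [0.10, 0.20, 0.28, 0.31, 0.32, 0.27, 0.22]
print(decreasing_segments(ages, curve))  # [(60, 80)]
```

An empty result means that no decrease was found on the sampled points, which is what a curve under an increasing constraint should produce.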