Monotonicity
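Monotonic constraints force a monotonic relationship between a feature and the model's predictions. For example, we might want the model output to always increase when the value of a particular feature increases. Monotonicity is imposed through the features parameter.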
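One common way to write down an increasing constraint is sketched below. The symbols are introduced here for illustration only and are not taken from the library: $f$ is the model output, $x_j$ is the constrained feature, and $d$ is the number of features.

$$
x_j \le x_j' \;\Longrightarrow\; f(x_1, \dots, x_j, \dots, x_d) \le f(x_1, \dots, x_j', \dots, x_d)
$$

All other features are held fixed on both sides. A decreasing constraint reverses the second inequality.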
Note: Not all learners support monotonic constraints.
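Let's train a model on the Adult Census dataset with monotonically increasing constraints on the features age and hours_per_week. In other words, the model is forced to predict a higher income as age and weekly working hours increase (all other features being held constant).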
In [1]:
import ydf
import pandas as pd
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples.
train_ds.head(5)
Out[1]:
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
In [2]:
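model = ydf.GradientBoostedTreesLearner(
    label="income",
    # Constrain age and hours_per_week to be monotonically increasing (+1).
    features=[
        ydf.Feature("age", monotonic=+1),
        ydf.Feature("hours_per_week", monotonic=+1),
    ],
    # Also use the dataset's other columns as (unconstrained) features.
    include_all_columns=True,
    use_hessian_gain=True,
).train(train_ds)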
Train model on 22792 examples
Model trained in 0:00:02.326853
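To verify that a model is monotonic with respect to a feature, we can inspect its partial dependence plot (PDP). In the PDP tab, the curves for age and hours_per_week are monotonically increasing.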
In [3]:
model.analyze(test_ds, sampling=0.1)
Out[3]:
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.051592 ################
2. "marital_status" 0.043403 #############
3. "education_num" 0.030505 #########
4. "age" 0.017197 #####
5. "capital_loss" 0.013512 ####
6. "occupation" 0.010953 ###
7. "hours_per_week" 0.006551 ##
8. "workclass" 0.003173 #
9. "race" 0.001024
10. "fnlwgt" 0.000921
11. "relationship" 0.000921
12. "native_country" 0.000921
13. "education" -0.001433
14. "sex" -0.001638

1. "capital_gain" 0.234571 ################
2. "marital_status" 0.126207 ########
3. "age" 0.062733 ####
4. "education_num" 0.061874 ####
5. "capital_loss" 0.045245 ###
6. "occupation" 0.025731 #
7. "hours_per_week" 0.017928 #
8. "relationship" 0.007549
9. "workclass" 0.005649
10. "sex" 0.002152
11. "fnlwgt" 0.001415
12. "race" 0.001161
13. "education" 0.000641
14. "native_country" 0.000496

1. "marital_status" 0.078686 ################
2. "capital_gain" 0.060498 ############
3. "age" 0.056163 ###########
4. "education_num" 0.030999 ######
5. "capital_loss" 0.014430 ##
6. "hours_per_week" 0.012948 ##
7. "occupation" 0.012506 ##
8. "relationship" 0.005170
9. "workclass" 0.002254
10. "sex" 0.001692
11. "race" 0.000560
12. "native_country" 0.000364
13. "education" 0.000331
14. "fnlwgt" 0.000270

1. "capital_gain" 0.234376 ################
2. "marital_status" 0.126181 ########
3. "age" 0.062723 ####
4. "education_num" 0.061858 ####
5. "capital_loss" 0.045227 ###
6. "occupation" 0.025721 #
7. "hours_per_week" 0.017925 #
8. "relationship" 0.007542
9. "workclass" 0.005649
10. "sex" 0.002149
11. "fnlwgt" 0.001409
12. "race" 0.001158
13. "education" 0.000638
14. "native_country" 0.000495

1. "capital_gain" 0.240106 ################
2. "fnlwgt" 0.228937 #############
3. "education_num" 0.221243 ###########
4. "occupation" 0.204376 #######
5. "age" 0.202566 ######
6. "marital_status" 0.201923 ######
7. "capital_loss" 0.199952 ######
8. "relationship" 0.191252 ###
9. "native_country" 0.189980 ###
10. "hours_per_week" 0.189106 ###
11. "workclass" 0.186076 ##
12. "race" 0.179428
13. "sex" 0.176670
14. "education" 0.175650

1. "capital_gain" 34.000000 ################
2. "marital_status" 27.000000 ############
3. "occupation" 20.000000 ########
4. "capital_loss" 19.000000 #######
5. "relationship" 15.000000 #####
6. "age" 14.000000 ####
7. "fnlwgt" 14.000000 ####
8. "education_num" 13.000000 ####
9. "native_country" 11.000000 ###
10. "workclass" 9.000000 ##
11. "hours_per_week" 6.000000
12. "race" 5.000000

1. "fnlwgt" 1283.000000 ################
2. "capital_gain" 705.000000 ########
3. "education_num" 618.000000 #######
4. "capital_loss" 426.000000 #####
5. "age" 384.000000 ####
6. "hours_per_week" 283.000000 ###
7. "occupation" 243.000000 ##
8. "workclass" 128.000000 #
9. "relationship" 114.000000 #
10. "marital_status" 106.000000 #
11. "native_country" 103.000000
12. "education" 61.000000
13. "sex" 37.000000
14. "race" 27.000000

1. "marital_status" 487568319.598255 ################
2. "education_num" 343622892.392935 ###########
3. "capital_gain" 320171507.632499 ##########
4. "hours_per_week" 153983141.770836 #####
5. "capital_loss" 146585694.670481 ####
6. "age" 98337013.334848 ###
7. "occupation" 52018124.395105 #
8. "fnlwgt" 14683666.270136
9. "sex" 7973337.944597
10. "workclass" 7065720.846374
11. "relationship" 6549757.592968
12. "native_country" 3384842.107984
13. "race" 1127991.045659
14. "education" 1047237.942327
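The visual PDP check can also be approximated numerically. The following is a minimal sketch under stated assumptions, not YDF's own implementation: `predict_fn` is a hypothetical callable that maps a dict of columns to one score per example (a stand-in for something like a trained model's prediction method), and `toy_predict` and `toy_examples` are made up for illustration. The sketch forces the studied feature to each grid value, averages the predictions, and tests whether the resulting curve ever goes down.

```python
from typing import Callable, Dict, List, Sequence

Columns = Dict[str, list]  # column name -> one value per example


def partial_dependence(
    predict_fn: Callable[[Columns], Sequence[float]],
    examples: Columns,
    feature: str,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction when `feature` is forced to each value in `grid`."""
    num_examples = len(examples[feature])
    curve = []
    for value in grid:
        # Copy the columns and overwrite the studied feature for every example.
        modified = dict(examples)
        modified[feature] = [value] * num_examples
        predictions = predict_fn(modified)
        curve.append(sum(predictions) / num_examples)
    return curve


def is_non_decreasing(curve: Sequence[float], tolerance: float = 1e-9) -> bool:
    """True if no point of the curve is lower than the previous one."""
    return all(b >= a - tolerance for a, b in zip(curve, curve[1:]))


# Hypothetical stand-in for a trained model (NOT a YDF model).
def toy_predict(columns: Columns) -> List[float]:
    return [
        0.01 * age + 0.005 * hours
        for age, hours in zip(columns["age"], columns["hours_per_week"])
    ]


# Made-up examples and grid, for illustration only.
toy_examples = {"age": [44, 20, 40], "hours_per_week": [40, 20, 37]}
age_grid = [20, 30, 40, 50, 60, 70]

age_curve = partial_dependence(toy_predict, toy_examples, "age", age_grid)
print(is_non_decreasing(age_curve))
```

The small tolerance avoids flagging floating-point noise on flat parts of a curve. Because the curve is an average over examples, a non-decreasing curve is consistent with a monotonic constraint but does not prove it for every individual example.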
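For comparison, let's train the same model without monotonic constraints and compare the partial dependence plots.
Here, the partial dependence plots for age and hours_per_week are not monotonic. For example, for ages above 60, the model output decreases as age increases.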
In [4]:
model = ydf.GradientBoostedTreesLearner(
    label="income",
    use_hessian_gain=True,
).train(train_ds)
model.analyze(test_ds, sampling=0.1)
Train model on 22792 examples
Model trained in 0:00:02.118608
Out[4]:
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.055789 ################
2. "marital_status" 0.044733 ############
3. "education_num" 0.028150 ########
4. "occupation" 0.019040 #####
5. "age" 0.015867 ####
6. "capital_loss" 0.015457 ####
7. "hours_per_week" 0.003276 #
8. "workclass" 0.003173 #
9. "fnlwgt" 0.001331
10. "education" 0.000819
11. "relationship" 0.000512
12. "native_country" 0.000102
13. "sex" -0.000102
14. "race" -0.001126

1. "capital_gain" 0.250679 ################
2. "marital_status" 0.132571 ########
3. "age" 0.062733 ####
4. "education_num" 0.059362 ###
5. "capital_loss" 0.045224 ##
6. "occupation" 0.031333 ##
7. "hours_per_week" 0.016096 #
8. "relationship" 0.004431
9. "workclass" 0.004119
10. "sex" 0.002572
11. "fnlwgt" 0.001708
12. "education" 0.000919
13. "native_country" 0.000418
14. "race" -0.000084

1. "marital_status" 0.087207 ################
2. "capital_gain" 0.064457 ###########
3. "age" 0.058184 ##########
4. "education_num" 0.027934 #####
5. "occupation" 0.015009 ##
6. "capital_loss" 0.014695 ##
7. "hours_per_week" 0.010852 #
8. "relationship" 0.003060
9. "sex" 0.002069
10. "workclass" 0.001361
11. "fnlwgt" 0.000469
12. "education" 0.000332
13. "native_country" 0.000159
14. "race" 0.000068

1. "capital_gain" 0.250493 ################
2. "marital_status" 0.132549 ########
3. "age" 0.062720 ####
4. "education_num" 0.059350 ###
5. "capital_loss" 0.045208 ##
6. "occupation" 0.031326 ##
7. "hours_per_week" 0.016094 #
8. "relationship" 0.004429
9. "workclass" 0.004118
10. "sex" 0.002570
11. "fnlwgt" 0.001708
12. "education" 0.000918
13. "native_country" 0.000417
14. "race" -0.000085

1. "age" 0.262588 ################
2. "capital_gain" 0.226122 #########
3. "education_num" 0.217170 #######
4. "fnlwgt" 0.210966 ######
5. "hours_per_week" 0.208267 ######
6. "marital_status" 0.202346 #####
7. "occupation" 0.201522 #####
8. "capital_loss" 0.198465 ####
9. "native_country" 0.181994 #
10. "relationship" 0.181866 #
11. "workclass" 0.179705 #
12. "race" 0.176878
13. "sex" 0.173925
14. "education" 0.172549

1. "age" 38.000000 ################
2. "capital_gain" 23.000000 #########
3. "marital_status" 21.000000 ########
4. "capital_loss" 18.000000 #######
5. "occupation" 17.000000 ######
6. "education_num" 16.000000 ######
7. "hours_per_week" 11.000000 ####
8. "fnlwgt" 6.000000 #
9. "native_country" 6.000000 #
10. "relationship" 5.000000 #
11. "race" 3.000000
12. "workclass" 2.000000

1. "fnlwgt" 920.000000 ################
2. "age" 850.000000 ##############
3. "hours_per_week" 539.000000 #########
4. "capital_gain" 493.000000 ########
5. "education_num" 439.000000 #######
6. "capital_loss" 374.000000 ######
7. "occupation" 186.000000 ##
8. "marital_status" 129.000000 #
9. "workclass" 110.000000 #
10. "relationship" 107.000000 #
11. "native_country" 81.000000
12. "education" 34.000000
13. "sex" 34.000000
14. "race" 28.000000

1. "marital_status" 484346054.435797 ################
2. "education_num" 342841906.363874 ###########
3. "capital_gain" 314297945.114876 ##########
4. "hours_per_week" 156279266.529111 #####
5. "capital_loss" 136981976.118004 ####
6. "age" 124340810.030382 ####
7. "occupation" 47180431.518591 #
8. "fnlwgt" 11123014.673219
9. "workclass" 6312540.338696
10. "relationship" 6128030.124364
11. "sex" 5356973.172271
12. "native_country" 3569206.429949
13. "race" 1483551.225928
14. "education" 828077.075970
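When comparing a constrained and an unconstrained model, it can help to locate where a partial dependence curve goes down instead of reading it off the plot. The sketch below is illustrative only: `decreasing_segments` is a hypothetical helper, and both curves are made-up numbers, not values produced by the models trained above.

```python
from typing import List, Sequence, Tuple


def decreasing_segments(
    grid: Sequence[float], curve: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return the (start, end) grid intervals on which the curve goes down."""
    segments = []
    for i in range(len(curve) - 1):
        if curve[i + 1] < curve[i]:
            segments.append((grid[i], grid[i + 1]))
    return segments


# Made-up curves for illustration only; NOT outputs of the models above.
age_grid = [20, 30, 40, 50, 60, 70, 80]
constrained_curve = [0.10, 0.18, 0.24, 0.27, 0.28, 0.28, 0.28]
unconstrained_curve = [0.08, 0.17, 0.25, 0.29, 0.30, 0.26, 0.21]

print(decreasing_segments(age_grid, constrained_curve))    # []
print(decreasing_segments(age_grid, unconstrained_curve))  # [(60, 70), (70, 80)]
```

Flat stretches are not reported, since an increasing constraint allows the output to stay constant.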