pip install ydf -U
What is model tuning?¶
Model tuning, also known as automated hyperparameter optimization or AutoML, is the search for the learner hyperparameters that maximize the quality of the model. YDF supports model tuning out of the box.
YDF offers two tuning modes: you can manually specify the hyperparameters to optimize and their candidate values, or use a pre-configured tuner. The second option is simpler, while the first gives you more control. We demonstrate both options in this tutorial.
Tuning can run on a single machine or, with distributed training, on multiple machines. This tutorial focuses on tuning on a single machine: local tuning is simple to set up and produces excellent results on small datasets.
Distributed model tuning¶
Distributed tuning can be beneficial for models with long training times or large hyperparameter search spaces. Distributed tuning requires configuring worker machines and passing them to the learner's workers constructor argument. Once the workers are set up, the tuning workflow is the same as on a local machine. See the distributed training tutorial for more information.
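Conceptually, distributed tuning farms independent trials out to workers and collects their scores. The following is a minimal plain-Python sketch of that idea only, not the YDF API: SEARCH_SPACE, run_trial, and distributed_random_search are hypothetical stand-ins, with run_trial replacing the training of a real model on a worker.

```python
import concurrent.futures
import random

# Hypothetical search space; in real distributed tuning each trial trains
# a YDF model on a remote worker and returns its validation score.
SEARCH_SPACE = {"shrinkage": [0.2, 0.1, 0.05], "max_depth": [3, 4, 5, 6]}

def run_trial(params):
    # Toy deterministic score standing in for a model's validation quality.
    return -abs(params["shrinkage"] - 0.1) - 0.01 * params["max_depth"]

def distributed_random_search(num_trials, num_workers, seed=42):
    rng = random.Random(seed)
    trials = [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(num_trials)
    ]
    # Workers evaluate trials in parallel; YDF's worker machines play this role.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        scores = list(pool.map(run_trial, trials))
    # Return the best (score, hyperparameters) pair.
    return max(zip(scores, trials), key=lambda pair: pair[0])

best_score, best_params = distributed_random_search(num_trials=20, num_workers=4)
```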
Download the dataset¶
We use the Adult dataset.
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets.
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
Local tuning with manually configured hyperparameters¶
We first create a random-search tuner and manually specify the hyperparameters to optimize and their candidate values:
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])
<ydf.learner.tuner.SearchSpace at 0x7f3eb4372310>
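The three choice calls above define a search space of 3 × 3 × 4 = 36 unique combinations, fewer than num_trials=50, which is why the tuning logs further below list at most 36 trials. A plain-Python sketch of the enumeration (for illustration only; the tuner does this sampling internally):

```python
import itertools

# The same search space as above, written out as plain Python.
search_space = {
    "shrinkage": [0.2, 0.1, 0.05],
    "subsample": [1.0, 0.9, 0.8],
    "max_depth": [3, 4, 5, 6],
}

# Every unique hyperparameter combination a trial can draw.
combinations = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
num_combinations = len(combinations)  # 3 * 3 * 4 = 36 < num_trials=50
```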
We create a learner with this tuner and train a model:
Note: Hyperparameters that are not tuned can be specified directly on the learner.
Note: To print the tuning logs during tuning, enable logging with ydf.verbose(2).
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,  # Used for all trials.
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:03.998356
The model description contains the tuning logs, i.e., the list of tested hyperparameters and their scores, shown in the tuning tab.
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 543 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration | shrinkage | subsample | max_depth |
---|---|---|---|---|---|
16 | -0.574861 | 2.49348 | 0.2 | 1 | 5 |
31 | -0.576405 | 3.53616 | 0.2 | 1 | 6 |
15 | -0.577211 | 2.4727 | 0.1 | 1 | 5 |
33 | -0.578941 | 3.69053 | 0.2 | 0.9 | 5 |
32 | -0.579071 | 3.54803 | 0.2 | 0.9 | 6 |
35 | -0.579637 | 3.99118 | 0.1 | 1 | 6 |
19 | -0.581703 | 2.68832 | 0.2 | 0.8 | 6 |
34 | -0.582941 | 3.90171 | 0.1 | 0.8 | 6 |
14 | -0.583348 | 2.46785 | 0.2 | 0.8 | 5 |
27 | -0.583466 | 3.23896 | 0.2 | 0.9 | 4 |
10 | -0.58463 | 2.14364 | 0.2 | 1 | 4 |
22 | -0.584824 | 2.97681 | 0.1 | 0.9 | 6 |
13 | -0.585809 | 2.46436 | 0.1 | 0.9 | 5 |
12 | -0.587067 | 2.29765 | 0.1 | 0.8 | 5 |
8 | -0.590813 | 1.97632 | 0.2 | 0.8 | 4 |
24 | -0.593991 | 3.0293 | 0.05 | 1 | 6 |
9 | -0.595175 | 2.14037 | 0.1 | 1 | 4 |
21 | -0.596592 | 2.91333 | 0.05 | 0.8 | 6 |
28 | -0.597159 | 3.2767 | 0.1 | 0.9 | 4 |
20 | -0.597244 | 2.90384 | 0.05 | 0.9 | 6 |
6 | -0.597766 | 1.96352 | 0.1 | 0.8 | 4 |
5 | -0.603554 | 1.71404 | 0.2 | 1 | 3 |
23 | -0.60517 | 3.01335 | 0.2 | 0.9 | 3 |
18 | -0.605849 | 2.54463 | 0.05 | 0.9 | 5 |
0 | -0.606706 | 1.49037 | 0.2 | 0.8 | 3 |
17 | -0.607283 | 2.511 | 0.05 | 0.8 | 5 |
30 | -0.608091 | 3.47695 | 0.05 | 1 | 5 |
25 | -0.619956 | 3.17843 | 0.1 | 0.9 | 3 |
3 | -0.620752 | 1.63833 | 0.1 | 0.8 | 3 |
4 | -0.621349 | 1.70712 | 0.1 | 1 | 3 |
7 | -0.625488 | 1.96705 | 0.05 | 0.8 | 4 |
29 | -0.626953 | 3.43528 | 0.05 | 0.9 | 4 |
11 | -0.62982 | 2.16092 | 0.05 | 1 | 4 |
1 | -0.656424 | 1.57613 | 0.05 | 0.8 | 3 |
26 | -0.656732 | 3.20212 | 0.05 | 1 | 3 |
2 | -0.656747 | 1.62633 | 0.05 | 0.9 | 3 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.574861 Accuracy: 0.87251 CI95[W][0 1] ErrorRate: : 0.12749 Confusion Table: truth\prediction <=50K >50K <=50K 1570 94 >50K 194 401 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "age" 0.257622 ################ 2. "capital_gain" 0.249047 ############# 3. "relationship" 0.244032 ########### 4. "occupation" 0.242881 ########### 5. "hours_per_week" 0.238530 ########## 6. "education" 0.237441 ######### 7. "marital_status" 0.234935 ######## 8. "capital_loss" 0.231145 ####### 9. "fnlwgt" 0.226059 ###### 10. "native_country" 0.225767 ###### 11. "workclass" 0.220718 #### 12. "education_num" 0.219033 #### 13. "sex" 0.211384 # 14. "race" 0.206124
1. "capital_gain" 11.000000 ################ 2. "age" 10.000000 ############## 3. "hours_per_week" 10.000000 ############## 4. "relationship" 9.000000 ############ 5. "marital_status" 7.000000 ######### 6. "education" 6.000000 ######## 7. "capital_loss" 6.000000 ######## 8. "fnlwgt" 5.000000 ###### 9. "workclass" 3.000000 ### 10. "education_num" 3.000000 ### 11. "sex" 3.000000 ### 12. "occupation" 1.000000 13. "race" 1.000000
1. "occupation" 144.000000 ################ 2. "age" 121.000000 ############# 3. "education" 113.000000 ############ 4. "capital_gain" 111.000000 ############ 5. "capital_loss" 90.000000 ######### 6. "native_country" 87.000000 ######### 7. "fnlwgt" 84.000000 ######### 8. "relationship" 73.000000 ####### 9. "marital_status" 68.000000 ####### 10. "hours_per_week" 64.000000 ###### 11. "workclass" 49.000000 ##### 12. "education_num" 28.000000 ## 13. "sex" 14.000000 # 14. "race" 5.000000
1. "relationship" 1675.422986 ################ 2. "capital_gain" 1040.150118 ######### 3. "education_num" 687.196583 ###### 4. "occupation" 526.056194 ##### 5. "marital_status" 469.469421 #### 6. "age" 289.979275 ## 7. "capital_loss" 281.277707 ## 8. "education" 259.256109 ## 9. "hours_per_week" 181.939375 # 10. "native_country" 108.750643 # 11. "workclass" 64.136268 12. "fnlwgt" 46.873309 13. "sex" 30.074515 14. "race" 2.153583
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-8.31766e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.233866 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.545366 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.832346 | | | ├─(pos)─ pred:0.834828 | | | └─(neg)─ pred:0.619473 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.492116 | | ├─(pos)─ pred:0.813402 | | └─(neg)─ pred:0.453839 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0997371 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.810859 | | ├─(pos)─ pred:0.634856 | | └─(neg)─ pred:0.839967 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0646271 | ├─(pos)─ pred:0.205598 | └─(neg)─ pred:-0.0218904 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.190336 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.795647 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.811553 | | ├─(pos)─ pred:0.833976 | | └─(neg)─ pred:0.398979 | └─(neg)─ pred:0.178485 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.207979 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.10157 | ├─(pos)─ pred:-0.0207104 | └─(neg)─ pred:-0.210678 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.234206 ├─(pos)─ pred:0.14084 └─(neg)─ pred:-0.235938
The model can be evaluated as usual.
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6974 | 438 |
>50K | 781 | 1576 |
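As a sanity check, the test accuracy can be recovered from the confusion matrix above (numbers copied from the output; rows are the true labels):

```python
# Diagonal cells: <=50K and >50K examples predicted correctly.
correct = 6974 + 1576
total = 6974 + 438 + 781 + 1576
accuracy = correct / total  # approximately 0.8752
```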
Configuring conditional hyperparameters¶
Some hyperparameters are only relevant when other hyperparameters are configured in a specific way. For example, optimizing max_depth makes sense when growing_strategy=LOCAL, whereas optimizing max_num_nodes is more appropriate when growing_strategy=BEST_FIRST_GLOBAL. We can configure a tuner to take these conditional dependencies into account.
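A plain-Python sketch of the sampling logic such a conditional search space implies (illustration only, not YDF's implementation; sample_conditional_trial is a hypothetical helper):

```python
import random

def sample_conditional_trial(rng):
    # Shared hyperparameters are sampled unconditionally.
    params = {
        "shrinkage": rng.choice([0.2, 0.1, 0.05]),
        "subsample": rng.choice([1.0, 0.9, 0.8]),
        "growing_strategy": rng.choice(["LOCAL", "BEST_FIRST_GLOBAL"]),
    }
    # Which additional hyperparameter exists depends on growing_strategy.
    if params["growing_strategy"] == "LOCAL":
        params["max_depth"] = rng.choice([3, 4, 5, 6])
    else:
        params["max_num_nodes"] = rng.choice([32, 64, 128, 256])
    return params

trial = sample_conditional_trial(random.Random(0))
```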
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
local_subspace = tuner.choice("growing_strategy", ["LOCAL"])
local_subspace.choice("max_depth", [3, 4, 5, 6])
global_subspace = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_subspace.choice("max_num_nodes", [32, 64, 128, 256])
<ydf.learner.tuner.SearchSpace at 0x7f3f10549e50>
Let's tune the model and show the results.
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:06.789261
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 543 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration | shrinkage | subsample | growing_strategy | max_depth | max_num_nodes |
---|---|---|---|---|---|---|---|
31 | -0.574861 | 5.4128 | 0.2 | 1 | LOCAL | 5 | |
10 | -0.576405 | 2.72618 | 0.2 | 1 | LOCAL | 6 | |
18 | -0.578031 | 3.67246 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
25 | -0.578941 | 4.434 | 0.2 | 0.9 | LOCAL | 5 | |
11 | -0.579071 | 2.97415 | 0.2 | 0.9 | LOCAL | 6 | |
21 | -0.579482 | 4.04769 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 64 | |
39 | -0.579482 | 5.72021 | 0.1 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
44 | -0.579637 | 6.08383 | 0.1 | 1 | LOCAL | 6 | |
16 | -0.580548 | 3.50807 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
8 | -0.582698 | 2.65852 | 0.2 | 1 | BEST_FIRST_GLOBAL | 64 | |
28 | -0.582941 | 5.26744 | 0.1 | 0.8 | LOCAL | 6 | |
6 | -0.583348 | 2.62349 | 0.2 | 0.8 | LOCAL | 5 | |
4 | -0.583466 | 2.33348 | 0.2 | 0.9 | LOCAL | 4 | |
14 | -0.583824 | 3.30352 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 128 | |
15 | -0.583824 | 3.32547 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 64 | |
3 | -0.584435 | 2.30352 | 0.2 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
42 | -0.584518 | 5.98935 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 256 | |
49 | -0.584518 | 6.78263 | 0.1 | 0.8 | BEST_FIRST_GLOBAL | 128 | |
33 | -0.58463 | 5.52032 | 0.2 | 1 | LOCAL | 4 | |
12 | -0.584824 | 3.29028 | 0.1 | 0.9 | LOCAL | 6 | |
41 | -0.587067 | 5.83872 | 0.1 | 0.8 | LOCAL | 5 | |
32 | -0.589069 | 5.51251 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
47 | -0.590361 | 6.48934 | 0.05 | 1 | BEST_FIRST_GLOBAL | 64 | |
23 | -0.590361 | 4.2553 | 0.05 | 1 | BEST_FIRST_GLOBAL | 256 | |
43 | -0.590541 | 6.02143 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 32 | |
37 | -0.590813 | 5.66041 | 0.2 | 0.8 | LOCAL | 4 | |
45 | -0.592258 | 6.24542 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 64 | |
34 | -0.592258 | 5.6012 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 256 | |
9 | -0.592258 | 2.72076 | 0.2 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
22 | -0.59235 | 4.12727 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 128 | |
48 | -0.59235 | 6.76277 | 0.05 | 0.9 | BEST_FIRST_GLOBAL | 256 | |
46 | -0.59389 | 6.38848 | 0.05 | 1 | BEST_FIRST_GLOBAL | 32 | |
35 | -0.593991 | 5.64836 | 0.05 | 1 | LOCAL | 6 | |
17 | -0.594588 | 3.5419 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 32 | |
19 | -0.595605 | 3.78829 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 64 | |
20 | -0.595605 | 3.82693 | 0.05 | 0.8 | BEST_FIRST_GLOBAL | 256 | |
36 | -0.597159 | 5.65092 | 0.1 | 0.9 | LOCAL | 4 | |
13 | -0.597244 | 3.29495 | 0.05 | 0.9 | LOCAL | 6 | |
2 | -0.597766 | 2.23067 | 0.1 | 0.8 | LOCAL | 4 | |
1 | -0.603554 | 1.84102 | 0.2 | 1 | LOCAL | 3 | |
29 | -0.60517 | 5.35164 | 0.2 | 0.9 | LOCAL | 3 | |
7 | -0.605849 | 2.63557 | 0.05 | 0.9 | LOCAL | 5 | |
0 | -0.606706 | 1.81158 | 0.2 | 0.8 | LOCAL | 3 | |
5 | -0.607283 | 2.58145 | 0.05 | 0.8 | LOCAL | 5 | |
24 | -0.619956 | 4.39896 | 0.1 | 0.9 | LOCAL | 3 | |
40 | -0.621349 | 5.80321 | 0.1 | 1 | LOCAL | 3 | |
30 | -0.626953 | 5.38874 | 0.05 | 0.9 | LOCAL | 4 | |
27 | -0.62982 | 5.01653 | 0.05 | 1 | LOCAL | 4 | |
38 | -0.656732 | 5.66151 | 0.05 | 1 | LOCAL | 3 | |
26 | -0.656747 | 4.62038 | 0.05 | 0.9 | LOCAL | 3 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.574861 Accuracy: 0.87251 CI95[W][0 1] ErrorRate: : 0.12749 Confusion Table: truth\prediction <=50K >50K <=50K 1570 94 >50K 194 401 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "age" 0.257622 ################ 2. "capital_gain" 0.249047 ############# 3. "relationship" 0.244032 ########### 4. "occupation" 0.242881 ########### 5. "hours_per_week" 0.238530 ########## 6. "education" 0.237441 ######### 7. "marital_status" 0.234935 ######## 8. "capital_loss" 0.231145 ####### 9. "fnlwgt" 0.226059 ###### 10. "native_country" 0.225767 ###### 11. "workclass" 0.220718 #### 12. "education_num" 0.219033 #### 13. "sex" 0.211384 # 14. "race" 0.206124
1. "capital_gain" 11.000000 ################ 2. "age" 10.000000 ############## 3. "hours_per_week" 10.000000 ############## 4. "relationship" 9.000000 ############ 5. "marital_status" 7.000000 ######### 6. "education" 6.000000 ######## 7. "capital_loss" 6.000000 ######## 8. "fnlwgt" 5.000000 ###### 9. "workclass" 3.000000 ### 10. "education_num" 3.000000 ### 11. "sex" 3.000000 ### 12. "occupation" 1.000000 13. "race" 1.000000
1. "occupation" 144.000000 ################ 2. "age" 121.000000 ############# 3. "education" 113.000000 ############ 4. "capital_gain" 111.000000 ############ 5. "capital_loss" 90.000000 ######### 6. "native_country" 87.000000 ######### 7. "fnlwgt" 84.000000 ######### 8. "relationship" 73.000000 ####### 9. "marital_status" 68.000000 ####### 10. "hours_per_week" 64.000000 ###### 11. "workclass" 49.000000 ##### 12. "education_num" 28.000000 ## 13. "sex" 14.000000 # 14. "race" 5.000000
1. "relationship" 1675.422986 ################ 2. "capital_gain" 1040.150118 ######### 3. "education_num" 687.196583 ###### 4. "occupation" 526.056194 ##### 5. "marital_status" 469.469421 #### 6. "age" 289.979275 ## 7. "capital_loss" 281.277707 ## 8. "education" 259.256109 ## 9. "hours_per_week" 181.939375 # 10. "native_country" 108.750643 # 11. "workclass" 64.136268 12. "fnlwgt" 46.873309 13. "sex" 30.074515 14. "race" 2.153583
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-8.31766e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.233866 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.545366 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.832346 | | | ├─(pos)─ pred:0.834828 | | | └─(neg)─ pred:0.619473 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.492116 | | ├─(pos)─ pred:0.813402 | | └─(neg)─ pred:0.453839 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0997371 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.810859 | | ├─(pos)─ pred:0.634856 | | └─(neg)─ pred:0.839967 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0646271 | ├─(pos)─ pred:0.205598 | └─(neg)─ pred:-0.0218904 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.190336 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.795647 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.811553 | | ├─(pos)─ pred:0.833976 | | └─(neg)─ pred:0.398979 | └─(neg)─ pred:0.178485 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.207979 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.10157 | ├─(pos)─ pred:-0.0207104 | └─(neg)─ pred:-0.210678 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.234206 ├─(pos)─ pred:0.14084 └─(neg)─ pred:-0.235938
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6974 | 438 |
>50K | 781 | 1576 |
Local tuning with automatically configured hyperparameters¶
If you do not want to configure the hyperparameters to optimize yourself, you can use a pre-configured tuner.
tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
Model training proceeds as before:
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples Model trained in 0:00:01.745021
We then look at the model:
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 1374 kB
Number of records: 22792 Number of columns: 15 Number of columns by type: CATEGORICAL: 9 (60%) NUMERICAL: 6 (40%) Columns: CATEGORICAL: 9 (60%) 0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%) 2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%) 4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%) 6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%) 7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%) 8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%) 9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%) 10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%) 14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%) NUMERICAL: 6 (40%) 1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661 3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423 5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427 11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48 12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01 13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249 Terminology: nas: Number of non-available (i.e. missing) values. ood: Out of dictionary. manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred. tokenized: The attribute value is obtained through tokenization. has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string. 
vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration |
---|---|---|
0 | -0.579637 | 1.74332 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION Label: income Loss (BINOMIAL_LOG_LIKELIHOOD): 0.579637 Accuracy: 0.868083 CI95[W][0 1] ErrorRate: : 0.131917 Confusion Table: truth\prediction <=50K >50K <=50K 1564 100 >50K 198 397 Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "capital_gain" 0.234685 ################ 2. "age" 0.231226 ############### 3. "marital_status" 0.225030 ############# 4. "occupation" 0.216504 ########### 5. "education" 0.212171 ######### 6. "relationship" 0.203987 ####### 7. "hours_per_week" 0.203680 ####### 8. "capital_loss" 0.199160 ###### 9. "fnlwgt" 0.188297 ### 10. "native_country" 0.187899 ### 11. "education_num" 0.185984 ## 12. "workclass" 0.184872 ## 13. "race" 0.177978 14. "sex" 0.176098
1. "capital_gain" 19.000000 ################ 2. "marital_status" 18.000000 ############### 3. "age" 15.000000 ############ 4. "relationship" 10.000000 ####### 5. "capital_loss" 8.000000 ##### 6. "hours_per_week" 8.000000 ##### 7. "education" 6.000000 ### 8. "education_num" 5.000000 ## 9. "race" 5.000000 ## 10. "fnlwgt" 2.000000 11. "occupation" 2.000000 12. "sex" 2.000000
1. "occupation" 437.000000 ################ 2. "age" 331.000000 ############ 3. "education" 285.000000 ########## 4. "capital_gain" 257.000000 ######### 5. "capital_loss" 230.000000 ######## 6. "native_country" 221.000000 ####### 7. "fnlwgt" 210.000000 ####### 8. "hours_per_week" 207.000000 ####### 9. "relationship" 172.000000 ###### 10. "workclass" 140.000000 #### 11. "marital_status" 139.000000 #### 12. "education_num" 63.000000 ## 13. "sex" 23.000000 14. "race" 8.000000
1. "relationship" 2993.930793 ################ 2. "capital_gain" 2048.254640 ########## 3. "marital_status" 1095.321390 ##### 4. "education" 1094.118075 ##### 5. "occupation" 1009.400363 ##### 6. "education_num" 794.643186 #### 7. "capital_loss" 571.858684 ### 8. "age" 545.766716 ## 9. "hours_per_week" 336.939387 # 10. "native_country" 241.147622 # 11. "workclass" 164.564834 12. "fnlwgt" 115.319824 13. "sex" 43.401514 14. "race" 2.559291
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-4.15883e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.116933 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.272683 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.416173 | | | ├─(pos)─ "age">=79.5 [s:0.000449964 n:429 np:5 miss:0] ; pred:0.417414 | | | | ├─(pos)─ pred:0.309737 | | | | └─(neg)─ pred:0.418684 | | | └─(neg)─ pred:0.309737 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.246058 | | ├─(pos)─ "capital_loss">=1989.5 [s:0.00201289 n:249 np:39 miss:0] ; pred:0.406701 | | | ├─(pos)─ pred:0.349312 | | | └─(neg)─ pred:0.417359 | | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Sales, Tech-support, Protective-serv} [s:0.0097175 n:2090 np:1688 miss:0] ; pred:0.226919 | | ├─(pos)─ pred:0.253437 | | └─(neg)─ pred:0.11557 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0498685 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.40543 | | ├─(pos)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Machine-op-inspct, Transport-moving, Handlers-cleaners} [s:0.0296244 n:43 np:25 miss:0] ; pred:0.317428 | | | ├─(pos)─ pred:0.397934 | | | └─(neg)─ pred:0.205614 | | └─(neg)─ "fnlwgt">=36212.5 [s:1.36643e-16 n:260 np:250 miss:1] ; pred:0.419984 | | ├─(pos)─ pred:0.419984 | | └─(neg)─ pred:0.419984 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0323136 | ├─(pos)─ "age">=33.5 [s:0.00939348 n:2334 np:1769 miss:1] ; pred:0.102799 | | ├─(pos)─ pred:0.132992 | | 
└─(neg)─ pred:0.00826457 | └─(neg)─ "education" is in [BITMAP] {<OOD>, HS-grad, Some-college, Bachelors, Masters, Assoc-voc, Assoc-acdm, Prof-school, Doctorate} [s:0.00478423 n:3803 np:2941 miss:1] ; pred:-0.0109452 | ├─(pos)─ pred:0.00969668 | └─(neg)─ pred:-0.0813718 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.0951681 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.397823 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.405777 | | ├─(pos)─ "capital_gain">=30961.5 [s:0.000242202 n:184 np:20 miss:0] ; pred:0.416988 | | | ├─(pos)─ pred:0.392422 | | | └─(neg)─ pred:0.419984 | | └─(neg)─ "education" is in [BITMAP] {Bachelors, Masters, Assoc-voc, Prof-school} [s:0.16 n:10 np:5 miss:0] ; pred:0.19949 | | ├─(pos)─ pred:0.419984 | | └─(neg)─ pred:-0.0210046 | └─(neg)─ pred:0.0892425 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.10399 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.0507848 | ├─(pos)─ "education" is in [BITMAP] {<OOD>, HS-grad, Some-college, Assoc-voc, 11th, Assoc-acdm, 10th, 7th-8th, Prof-school, 9th, ...[5 left]} [s:0.0110157 n:1263 np:125 miss:1] ; pred:-0.0103552 | | ├─(pos)─ pred:0.16421 | | └─(neg)─ pred:-0.0295298 | └─(neg)─ "capital_loss">=1977 [s:0.00164232 n:936 np:5 miss:0] ; pred:-0.105339 | ├─(pos)─ pred:0.19949 | └─(neg)─ pred:-0.106976 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.117103 ├─(pos)─ "fnlwgt">=125450 [s:0.0755454 n:41 np:28 miss:1] ; pred:0.0704198 | ├─(pos)─ pred:-0.0328167 | └─(neg)─ pred:0.292776 └─(neg)─ "hours_per_week">=40.5 [s:0.000447024 n:8881 np:1559 miss:0] ; pred:-0.117969 ├─(pos)─ pred:-0.0927111 └─(neg)─ pred:-0.123347
And evaluate the model:
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6985 | 427 |
>50K | 796 | 1561 |
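For comparison, the test accuracies of the three tuned models can be computed directly from the confusion matrices reported above (numbers copied from the outputs; rows are the true labels). All three approaches land at roughly 87.5% accuracy on this dataset:

```python
def accuracy(confusion):
    """Accuracy from a 2x2 confusion matrix [[a, b], [c, d]] (rows: truth)."""
    (a, b), (c, d) = confusion
    return (a + d) / (a + b + c + d)

# Confusion matrices copied from the three evaluations above.
manual_search = accuracy([[6974, 438], [781, 1576]])
conditional_search = accuracy([[6974, 438], [781, 1576]])
automatic_search = accuracy([[6985, 427], [796, 1561]])
```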