In [ ]:
pip install ydf -U
In [ ]:
import ydf
import numpy as np
What are multi-dimensional features?¶
Multi-dimensional features are model inputs with more than one dimension. For example, the multiple timestamps of a time series or the values of the different pixels of an image are multi-dimensional features. They differ from single-dimensional features, which have only one dimension. Each dimension of a multi-dimensional feature is treated as a separate single-dimensional feature.
Multi-dimensional features are fed as multi-dimensional arrays, such as NumPy arrays or TensorFlow tensors. The next example shows a simple case of feeding multi-dimensional features to a model.
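For intuition, the distinction is just the array shape: a single-dimensional feature holds one value per example, while a multi-dimensional feature holds several. A minimal NumPy sketch (standalone, not part of the tutorial dataset):
In [ ]:
import numpy as np

num_examples = 3

# A single-dimensional feature: one value per example, shape (num_examples,).
single = np.random.uniform(size=(num_examples,))

# A multi-dimensional feature: four values per example, shape (num_examples, 4).
multi = np.random.uniform(size=(num_examples, 4))

print(single.shape)  # (3,)
print(multi.shape)   # (3, 4)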
Creating a multi-dimensional dataset¶
The easiest way to create a multi-dimensional dataset is with a dictionary of multi-dimensional NumPy arrays.
In [ ]:
def create_dataset(num_examples):
    # Generate random feature values.
    dataset = {
        # f1 is a 4-dimensional feature.
        "f1": np.random.uniform(size=(num_examples, 4)),
        # f2 is a single-dimensional feature.
        "f2": np.random.uniform(size=(num_examples)),
    }
    # Add a synthetic label.
    noise = np.random.uniform(size=num_examples)
    dataset["label"] = (
        np.sum(dataset["f1"], axis=1) + dataset["f2"] * 0.2 + noise
    ) >= 2.0
    return dataset


print("A dataset with 5 examples:")
create_dataset(num_examples=5)
A dataset with 5 examples:
Out[ ]:
{'f1': array([[0.5373759 , 0.18098291, 0.74489824, 0.27706572],
[0.4517745 , 0.37578001, 0.45156836, 0.05413219],
[0.77036813, 0.1640734 , 0.47994649, 0.06315383],
[0.44115416, 0.95749836, 0.80662146, 0.78114808],
[0.40393628, 0.22786682, 0.32477702, 0.18309577]]),
'f2': array([0.02058218, 0.94332705, 0.25678716, 0.02122367, 0.04498769]),
'label': array([False, True, False, True, False])}
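Every array in the dictionary must share the same leading dimension: the number of examples. A quick shape check (a small sanity-check aside, not part of the original tutorial) shows how the dimensions line up:
In [ ]:
ds = create_dataset(num_examples=5)
# f1 is 2-dimensional (5, 4); f2 and label are 1-dimensional (5,).
for name, values in ds.items():
    print(name, values.shape)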
Training a model¶
Training a model on multi-dimensional features works the same way as training on single-dimensional features.
In [ ]:
train_ds = create_dataset(num_examples=10000)
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
Train model on 10000 examples
Model trained in 0:00:02.789326
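Once trained, the model accepts any dataset with the same structure. As a short sketch (the name preview_ds is ours), predictions work exactly as with single-dimensional features; for this binary classifier, predict() returns the probability of the positive class per example:
In [ ]:
preview_ds = create_dataset(num_examples=5)
# One probability of the positive label ("true") per example.
print(model.predict(preview_ds))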
Model understanding¶
When interpreting a model, each dimension of a multi-dimensional feature is treated separately. For example, the model description lists each dimension individually.
In [ ]:
model.describe()
Out[ ]:
Name : GRADIENT_BOOSTED_TREES
Task : CLASSIFICATION
Label : label
Features (5) : f1.0_of_4 f1.1_of_4 f1.2_of_4 f1.3_of_4 f2
Weights : None
Trained with tuner : No
Model size : 767 kB
Number of records: 10000
Number of columns: 6

Number of columns by type:
    NUMERICAL: 5 (83.3333%)
    CATEGORICAL: 1 (16.6667%)

Columns:

NUMERICAL: 5 (83.3333%)
    1: "f1.0_of_4" NUMERICAL mean:0.49459 min:4.63251e-05 max:0.999917 sd:0.289597
    2: "f1.1_of_4" NUMERICAL mean:0.498703 min:5.8423e-06 max:0.999997 sd:0.289197
    3: "f1.2_of_4" NUMERICAL mean:0.498227 min:7.85791e-05 max:0.999943 sd:0.288629
    4: "f1.3_of_4" NUMERICAL mean:0.496773 min:9.6696e-05 max:0.99987 sd:0.28987
    5: "f2" NUMERICAL mean:0.504066 min:3.89178e-05 max:0.999976 sd:0.289052

CATEGORICAL: 1 (16.6667%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"true" 8140 (81.4%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: label
Loss (BINOMIAL_LOG_LIKELIHOOD): 0.48424
Accuracy: 0.88592 CI95[W][0 1]
ErrorRate: 0.11408
Confusion Table:
truth\prediction  false  true
false               114    65
true                 46   748
Total: 973
Variable importances measure the importance of an input feature for a model.

Variable importance: INV_MEAN_MIN_DEPTH
    1. "f1.3_of_4" 0.381709 ################
    2. "f1.0_of_4" 0.364288 ##############
    3. "f1.1_of_4" 0.347260 #############
    4. "f1.2_of_4" 0.310757 ##########
    5. "f2" 0.187976

Variable importance: NUM_AS_ROOT
    1. "f1.3_of_4" 20.000000 ################
    2. "f1.0_of_4" 16.000000 #########
    3. "f1.1_of_4" 10.000000
    4. "f1.2_of_4" 10.000000

Variable importance: NUM_NODES
    1. "f1.2_of_4" 368.000000 ################
    2. "f1.3_of_4" 367.000000 ###############
    3. "f1.0_of_4" 340.000000 #############
    4. "f1.1_of_4" 318.000000 ############
    5. "f2" 164.000000

Variable importance: SUM_SCORE
    1. "f1.2_of_4" 1184.639552 ################
    2. "f1.1_of_4" 1180.490537 ###############
    3. "f1.0_of_4" 1087.300719 ##############
    4. "f1.3_of_4" 1061.770106 ##############
    5. "f2" 124.009355

Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Num trees : 56
Only printing the first tree.
Tree #0:
"f1.3_of_4">=0.340523 [s:0.0125718 n:9027 np:5931 miss:1] ; pred:-1.59908e-08
├─(pos)─ "f1.0_of_4">=0.355412 [s:0.00663164 n:5931 np:3784 miss:1] ; pred:0.0534567
| ├─(pos)─ "f1.1_of_4">=0.285883 [s:0.00223373 n:3784 np:2645 miss:1] ; pred:0.0939348
| | ├─(pos)─ "f1.2_of_4">=0.211332 [s:0.000410131 n:2645 np:2088 miss:1] ; pred:0.114401
| | | ├─(pos)─ "f1.3_of_4">=0.343045 [s:7.77674e-05 n:2088 np:2082 miss:1] ; pred:0.121303
| | | | ├─(pos)─ pred:0.121615
| | | | └─(neg)─ pred:0.0129024
| | | └─(neg)─ "f1.1_of_4">=0.404073 [s:0.00320274 n:557 np:479 miss:1] ; pred:0.0885265
| | | ├─(pos)─ pred:0.103596
| | | └─(neg)─ pred:-0.00401777
| | └─(neg)─ "f1.2_of_4">=0.186632 [s:0.0112601 n:1139 np:915 miss:1] ; pred:0.0464084
| | ├─(pos)─ "f1.2_of_4">=0.528798 [s:0.00319255 n:915 np:556 miss:0] ; pred:0.0810544
| | | ├─(pos)─ pred:0.111015
| | | └─(neg)─ pred:0.0346534
| | └─(neg)─ "f1.0_of_4">=0.702914 [s:0.0383716 n:224 np:96 miss:0] ; pred:-0.0951145
| | ├─(pos)─ pred:0.0541452
| | └─(neg)─ pred:-0.207059
| └─(neg)─ "f1.1_of_4">=0.412053 [s:0.0223679 n:2147 np:1253 miss:1] ; pred:-0.0178841
| ├─(pos)─ "f1.2_of_4">=0.204555 [s:0.0101297 n:1253 np:1010 miss:1] ; pred:0.065479
| | ├─(pos)─ "f1.2_of_4">=0.404707 [s:0.00241061 n:1010 np:772 miss:1] ; pred:0.0980558
| | | ├─(pos)─ pred:0.116045
| | | └─(neg)─ pred:0.0397044
| | └─(neg)─ "f1.3_of_4">=0.667282 [s:0.0338417 n:243 np:114 miss:0] ; pred:-0.0699227
| | ├─(pos)─ pred:0.0592101
| | └─(neg)─ pred:-0.18404
| └─(neg)─ "f1.2_of_4">=0.494196 [s:0.0422598 n:894 np:448 miss:1] ; pred:-0.134723
| ├─(pos)─ "f1.3_of_4">=0.561545 [s:0.0132409 n:448 np:285 miss:0] ; pred:0.000627708
| | ├─(pos)─ pred:0.0580524
| | └─(neg)─ pred:-0.0997774
| └─(neg)─ "f1.3_of_4">=0.702899 [s:0.0247338 n:446 np:213 miss:0] ; pred:-0.270681
| ├─(pos)─ pred:-0.162138
| └─(neg)─ pred:-0.369906
└─(neg)─ "f1.1_of_4">=0.456287 [s:0.0326619 n:3096 np:1725 miss:1] ; pred:-0.102407
├─(pos)─ "f1.0_of_4">=0.465293 [s:0.0172008 n:1725 np:920 miss:1] ; pred:0.00391262
| ├─(pos)─ "f1.2_of_4">=0.150376 [s:0.00671146 n:920 np:781 miss:1] ; pred:0.0848681
| | ├─(pos)─ "f1.1_of_4">=0.675697 [s:0.000847067 n:781 np:480 miss:0] ; pred:0.107675
| | | ├─(pos)─ pred:0.122883
| | | └─(neg)─ pred:0.0834216
| | └─(neg)─ "f1.3_of_4">=0.0952032 [s:0.0266206 n:139 np:103 miss:1] ; pred:-0.0432749
| | ├─(pos)─ pred:0.0203768
| | └─(neg)─ pred:-0.225389
| └─(neg)─ "f1.2_of_4">=0.588246 [s:0.0368965 n:805 np:331 miss:0] ; pred:-0.0886079
| ├─(pos)─ "f1.2_of_4">=0.705104 [s:0.00581962 n:331 np:244 miss:0] ; pred:0.0630749
| | ├─(pos)─ pred:0.0931343
| | └─(neg)─ pred:-0.0212296
| └─(neg)─ "f1.1_of_4">=0.640417 [s:0.0161828 n:474 np:313 miss:0] ; pred:-0.19453
| ├─(pos)─ pred:-0.134324
| └─(neg)─ pred:-0.311575
└─(neg)─ "f1.2_of_4">=0.519391 [s:0.0405007 n:1371 np:637 miss:0] ; pred:-0.236179
├─(pos)─ "f1.0_of_4">=0.316183 [s:0.0395178 n:637 np:418 miss:1] ; pred:-0.0936254
| ├─(pos)─ "f1.0_of_4">=0.686852 [s:0.0172247 n:418 np:186 miss:0] ; pred:0.00132543
| | ├─(pos)─ pred:0.0980488
| | └─(neg)─ pred:-0.0762201
| └─(neg)─ "f1.2_of_4">=0.893097 [s:0.03348 n:219 np:43 miss:0] ; pred:-0.274856
| ├─(pos)─ pred:-0.0305784
| └─(neg)─ pred:-0.334537
└─(neg)─ "f1.0_of_4">=0.667598 [s:0.0436245 n:734 np:222 miss:0] ; pred:-0.359894
├─(pos)─ "f1.1_of_4">=0.19785 [s:0.0336715 n:222 np:119 miss:1] ; pred:-0.150583
| ├─(pos)─ pred:-0.0379291
| └─(neg)─ pred:-0.280736
└─(neg)─ "f1.0_of_4">=0.402493 [s:0.00914017 n:512 np:213 miss:1] ; pred:-0.45065
├─(pos)─ pred:-0.375903
└─(neg)─ pred:-0.503897
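The per-dimension importances in the description can also be read programmatically; a minimal sketch, assuming the variable_importances() accessor of YDF models:
In [ ]:
# Maps each importance name to a ranked list of (value, feature) entries,
# with one entry per unrolled dimension (e.g. "f1.0_of_4").
for name, entries in model.variable_importances().items():
    print(name, entries[0])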
Analyzing the model and its predictions also shows each dimension separately.
In [ ]:
test_ds = create_dataset(num_examples=10000)
model.analyze(test_ds)
Out[ ]:
Variable importances measure the importance of an input feature for a model.

    1. "f1.1_of_4" 0.064800 ################
    2. "f1.3_of_4" 0.064300 ###############
    3. "f1.2_of_4" 0.062700 ###############
    4. "f1.0_of_4" 0.058700 ##############
    5. "f2" 0.004000

    1. "f1.0_of_4" 0.032397 ################
    2. "f1.3_of_4" 0.032241 ###############
    3. "f1.1_of_4" 0.031047 ###############
    4. "f1.2_of_4" 0.030587 ###############
    5. "f2" 0.001307

    1. "f1.3_of_4" 0.113808 ################
    2. "f1.0_of_4" 0.113546 ###############
    3. "f1.1_of_4" 0.112715 ###############
    4. "f1.2_of_4" 0.110428 ###############
    5. "f2" 0.005334

    1. "f1.0_of_4" 0.032394 ################
    2. "f1.3_of_4" 0.032237 ###############
    3. "f1.1_of_4" 0.031045 ###############
    4. "f1.2_of_4" 0.030584 ###############
    5. "f2" 0.001307

Variable importance: INV_MEAN_MIN_DEPTH
    1. "f1.3_of_4" 0.381709 ################
    2. "f1.0_of_4" 0.364288 ##############
    3. "f1.1_of_4" 0.347260 #############
    4. "f1.2_of_4" 0.310757 ##########
    5. "f2" 0.187976

Variable importance: NUM_AS_ROOT
    1. "f1.3_of_4" 20.000000 ################
    2. "f1.0_of_4" 16.000000 #########
    3. "f1.1_of_4" 10.000000
    4. "f1.2_of_4" 10.000000

Variable importance: NUM_NODES
    1. "f1.2_of_4" 368.000000 ################
    2. "f1.3_of_4" 367.000000 ###############
    3. "f1.0_of_4" 340.000000 #############
    4. "f1.1_of_4" 318.000000 ############
    5. "f2" 164.000000

Variable importance: SUM_SCORE
    1. "f1.2_of_4" 1184.639552 ################
    2. "f1.1_of_4" 1180.490537 ###############
    3. "f1.0_of_4" 1087.300719 ##############
    4. "f1.3_of_4" 1061.770106 ##############
    5. "f2" 124.009355