Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

from ngboost import NGBRegressor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)
[iter 0] loss=3.6486 val_loss=0.0000 scale=0.5000 norm=3.4791
[iter 100] loss=3.1043 val_loss=0.0000 scale=1.0000 norm=3.9358
[iter 200] loss=2.4762 val_loss=0.0000 scale=2.0000 norm=4.1521
[iter 300] loss=2.0484 val_loss=0.0000 scale=1.0000 norm=1.6249
[iter 400] loss=1.8610 val_loss=0.0000 scale=1.0000 norm=1.4547
Test MSE 7.719871354323341
Test NLL 2.8867507325340243

It is easy to obtain the estimated distributional parameters at a set of points. This returns the predicted mean and standard deviation for the first five observations in the test set:

Y_dists[0:5].params
{'loc': array([15.71909047, 19.51384116, 19.24509285, 17.8645122 , 24.31325397]),
 'scale': array([1.48748154, 1.37673424, 1.67090687, 1.63854999, 1.52513887])}
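
Since these parameters are just the loc and scale of a Normal, other distributional quantities can be computed from them directly with scipy.stats. A minimal sketch building 95% central prediction intervals (using scipy.stats.norm is an assumption here, appropriate for the default Normal distribution):

from scipy.stats import norm

# 95% central prediction intervals for the first five test observations,
# built from the predicted Normal parameters
params = Y_dists[0:5].params
lower, upper = norm.interval(0.95, loc=params['loc'], scale=params['scale'])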

Distributions

NGBoost can be used with a variety of distributions, which break down into those for regression (support on an infinite set) and those for classification (support on a finite set).

Regression Distributions

Distribution | Parameters | Implemented Scores  | Reference
Normal       | loc, scale | LogScore, CRPScore  | scipy.stats normal
LogNormal    | s, scale   | LogScore, CRPScore  | scipy.stats lognormal
Exponential  | scale      | LogScore, CRPScore  | scipy.stats exponential

Regression distributions can be used by passing the appropriate class as the Dist argument to the NGBRegressor() constructor. Normal is the default.

from ngboost.distns import Exponential, Normal

X, Y = load_boston(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

NGBRegressor objects have two prediction methods: predict(), which returns point predictions as one would expect from a standard regressor, and pred_dist(), which returns a distribution object representing the conditional distribution $Y|X=x_i$ at each point $x_i$ in the test set.

ngb_norm.predict(X_reg_test)[0:5]
array([21.25837828,  9.88964092, 23.01338315, 10.89576892, 16.12806237])
ngb_exp.predict(X_reg_test)[0:5]
array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])
ngb_exp.pred_dist(X_reg_test)[0:5].params
{'scale': array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])}
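
Distribution objects returned by pred_dist() expose logpdf() (used above for the Normal), so the two fits can be compared on held-out negative log-likelihood. A quick sketch, assuming logpdf() is also implemented for the Exponential:

# compare held-out NLL of the Normal and Exponential fits
nll_norm = -ngb_norm.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
nll_exp = -ngb_exp.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
print('Normal NLL:', nll_norm, 'Exponential NLL:', nll_exp)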

Survival Regression

NGBoost supports the analysis of right-censored data. In principle, any distribution that can be used for regression in NGBoost can also be used for survival analysis, but this requires that a right-censored version of the appropriate score be implemented. Currently, LogNormal and Exponential have these scores implemented. To do survival analysis, use NGBSurvival and pass both the event times (or censoring times) and an event indicator vector to fit():

import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X, Y = load_boston(return_X_y=True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train <= 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)
[iter 0] loss=1.2960 val_loss=0.0000 scale=8.0000 norm=4.8495
[iter 100] loss=0.6339 val_loss=0.0000 scale=2.0000 norm=0.7358
[iter 200] loss=0.3803 val_loss=0.0000 scale=4.0000 norm=0.9619
[iter 300] loss=0.2276 val_loss=0.0000 scale=8.0000 norm=0.9190
[iter 400] loss=0.1178 val_loss=0.0000 scale=4.0000 norm=0.3496
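
The fitted NGBSurvival model predicts full conditional distributions in the same way as NGBRegressor. A minimal sketch, assuming pred_dist() is available on NGBSurvival as it is on the other NGBoost estimators:

# predicted LogNormal parameters for the first five test observations
surv_dists = ngb.pred_dist(X_surv_test)
surv_dists[0:5].params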

The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

Classification Distributions

Distribution     | Parameters       | Implemented Scores | Reference
k_categorical(K) | p0, p1... p{K-1} | LogScore           | Categorical distribution on Wikipedia
Bernoulli        | p                | LogScore           | Bernoulli distribution on Wikipedia

Classification distributions can be used by passing the appropriate class as the Dist argument to the NGBClassifier() constructor. Bernoulli is the default and is equivalent to k_categorical(2).

from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

When using NGBoost for classification, the outcome vector Y must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification conventions in sklearn.
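
If your labels are not already coded this way (e.g. they are strings), sklearn's LabelEncoder produces a conforming encoding; a minimal sketch:

from sklearn.preprocessing import LabelEncoder

# map arbitrary labels onto the integers 0..K-1 that NGBoost expects
le = LabelEncoder()
y_encoded = le.fit_transform(['benign', 'malignant', 'benign', 'other'])
# y_encoded == array([0, 1, 0, 2]); le.inverse_transform recovers the originals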

NGBClassifier objects have three prediction methods: predict() returns the most likely class, predict_proba() returns the class probabilities, and pred_dist() returns the distribution object.

ngb_cat.predict(X_cls_test)[0:5]
array([1, 1, 1, 0, 1])
ngb_cat.predict_proba(X_cls_test)[0:5]
array([[3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
       [6.59565268e-03, 9.93168490e-01, 2.35857004e-04],
       [3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
       [9.92981053e-01, 6.07012737e-03, 9.48819937e-04],
       [3.53080012e-03, 9.96242905e-01, 2.26294536e-04]])
ngb_cat.pred_dist(X_cls_test)[0:5].params
{'p0': array([0.0035308 , 0.00659565, 0.0035308 , 0.99298105, 0.0035308 ]),
 'p1': array([0.99624291, 0.99316849, 0.99624291, 0.00607013, 0.99624291]),
 'p2': array([0.00022629, 0.00023586, 0.00022629, 0.00094882, 0.00022629])}
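
The columns of predict_proba() are ordered by class index, so taking the argmax over them recovers the most likely class returned by predict():

import numpy as np

# argmax over class probabilities matches the output of predict()
np.argmax(ngb_cat.predict_proba(X_cls_test), axis=1)[0:5]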

Scores

NGBoost supports the log score (LogScore, also known as negative log-likelihood) and CRPS (CRPScore), although each score may not be implemented for every distribution. The score is specified by the Score argument in the constructor.

from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)
NGBClassifier(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                         max_features=None, max_leaf_nodes=None,
                                         min_impurity_decrease=0.0,
                                         min_impurity_split=None,
                                         min_samples_leaf=1,
                                         min_samples_split=2,
                                         min_weight_fraction_leaf=0.0,
                                         presort=False, random_state=None,
                                         splitter='best'),
              Dist=<class 'ngboost.distns.categorical.k_categorical.<locals>.Categorical'>,
              Score=<class 'ngboost.scores.LogScore'>, col_sample=1.0,
              learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
              natural_gradient=True,
              random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
              verbose=False, verbose_eval=100)

Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified via the Base argument. The default is a depth-3 regression tree.

from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)
NGBSurvival(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=5,
                                       max_features=None, max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       presort=False, random_state=None,
                                       splitter='best'),
            Dist=<class 'ngboost.api.NGBSurvival.__init__.<locals>.SurvivalDistn'>,
            Score=<class 'ngboost.scores.CRPScore'>, col_sample=1.0,
            learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
            natural_gradient=True,
            random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
            verbose=False, verbose_eval=100)
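
Since any sklearn regressor can serve as the base learner, non-tree learners work the same way. A sketch using Ridge regression as the base (Ridge is an illustrative choice here, not something the original example uses):

from sklearn.linear_model import Ridge

# a linear base learner yields an additive linear model for each distribution parameter
ngb_linear = NGBRegressor(Base=Ridge(alpha=1.0), verbose=False).fit(X_reg_train, Y_reg_train)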

Other Parameters

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)
[iter 0] loss=3.6328 val_loss=0.0000 scale=0.5000 norm=3.3554
NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'),
             Dist=<class 'ngboost.distns.normal.Normal'>,
             Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
             learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
             natural_gradient=True,
             random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
             verbose=True, verbose_eval=100)
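
The val_loss column in the logs is populated when a validation set is passed to fit(). A sketch, assuming your version of NGBoost supports the X_val, Y_val, and early_stopping_rounds arguments to fit():

# monitor validation loss and stop if it fails to improve for 50 rounds
ngb = NGBRegressor(n_estimators=500, learning_rate=0.01, verbose=False)
ngb.fit(X_reg_train, Y_reg_train,
        X_val=X_reg_test, Y_val=Y_reg_test,
        early_stopping_rounds=50)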

Sample weights (for training) are set using the sample_weight argument to fit().

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)
[iter 0] loss=3.6257 val_loss=0.0000 scale=1.0000 norm=6.6551
NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'),
             Dist=<class 'ngboost.distns.normal.Normal'>,
             Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
             learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
             natural_gradient=True,
             random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
             verbose=True, verbose_eval=100)