Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

from ngboost import NGBRegressor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)
[iter 0] loss=3.6486 val_loss=0.0000 scale=0.5000 norm=3.4791
[iter 100] loss=3.1043 val_loss=0.0000 scale=1.0000 norm=3.9358
[iter 200] loss=2.4762 val_loss=0.0000 scale=2.0000 norm=4.1521
[iter 300] loss=2.0484 val_loss=0.0000 scale=1.0000 norm=1.6249
[iter 400] loss=1.8610 val_loss=0.0000 scale=1.0000 norm=1.4547
Test MSE 7.719871354323341
Test NLL 2.8867507325340243

It is easy to obtain the estimated distributional parameters at a set of points. This returns the predicted mean and standard deviation for the first five observations in the test set:

Y_dists[0:5].params
{'loc': array([15.71909047, 19.51384116, 19.24509285, 17.8645122 , 24.31325397]),
 'scale': array([1.48748154, 1.37673424, 1.67090687, 1.63854999, 1.52513887])}
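
Since these parameters are just the loc and scale of a Normal, other distributional quantities can be computed from them directly with scipy.stats. A minimal sketch building 95% central prediction intervals (using scipy.stats.norm is an assumption here, appropriate for the default Normal distribution):

from scipy.stats import norm

# 95% central prediction intervals for the first five test observations,
# built from the predicted Normal parameters
params = Y_dists[0:5].params
lower, upper = norm.interval(0.95, loc=params['loc'], scale=params['scale'])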

Distributions

NGBoost can be used with a variety of distributions, which break down into those for regression (support on an infinite set) and those for classification (support on a finite set).

Regression Distributions

Distribution | Parameters | Implemented Scores  | Reference
Normal       | loc, scale | LogScore, CRPScore  | scipy.stats normal
LogNormal    | s, scale   | LogScore, CRPScore  | scipy.stats lognormal
Exponential  | scale      | LogScore, CRPScore  | scipy.stats exponential

Regression distributions can be used by passing the appropriate class as the Dist argument to the NGBRegressor() constructor. Normal is the default.

from ngboost.distns import Exponential, Normal

X, Y = load_boston(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

NGBRegressor objects have two prediction methods: predict(), which returns point predictions as one would expect from a standard regressor, and pred_dist(), which returns a distribution object representing the conditional distribution $Y|X=x_i$ at each point $x_i$ in the test set.

ngb_norm.predict(X_reg_test)[0:5]
array([21.25837828,  9.88964092, 23.01338315, 10.89576892, 16.12806237])
ngb_exp.predict(X_reg_test)[0:5]
array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])
ngb_exp.pred_dist(X_reg_test)[0:5].params
{'scale': array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])}
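
Distribution objects returned by pred_dist() expose logpdf() (used above for the Normal), so the two fits can be compared on held-out negative log-likelihood. A quick sketch, assuming logpdf() is also implemented for the Exponential:

# compare held-out NLL of the Normal and Exponential fits
nll_norm = -ngb_norm.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
nll_exp = -ngb_exp.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
print('Normal NLL:', nll_norm, 'Exponential NLL:', nll_exp)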

Survival Regression

NGBoost supports the analysis of right-censored data. In principle, any distribution that can be used for regression in NGBoost can also be used for survival analysis, but this requires that a right-censored version of the appropriate score be implemented. Currently, LogNormal and Exponential have these scores implemented. To do survival analysis, use NGBSurvival and pass both the event times (or censoring times) and an event indicator vector to fit():

import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X, Y = load_boston(return_X_y=True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train <= 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)
[iter 0] loss=1.2960 val_loss=0.0000 scale=8.0000 norm=4.8495
[iter 100] loss=0.6339 val_loss=0.0000 scale=2.0000 norm=0.7358
[iter 200] loss=0.3803 val_loss=0.0000 scale=4.0000 norm=0.9619
[iter 300] loss=0.2276 val_loss=0.0000 scale=8.0000 norm=0.9190
[iter 400] loss=0.1178 val_loss=0.0000 scale=4.0000 norm=0.3496
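
The fitted NGBSurvival model predicts full conditional distributions in the same way as NGBRegressor. A minimal sketch, assuming pred_dist() is available on NGBSurvival as it is on the other NGBoost estimators:

# predicted LogNormal parameters for the first five test observations
surv_dists = ngb.pred_dist(X_surv_test)
surv_dists[0:5].params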

The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

Classification Distributions

Distribution     | Parameters       | Implemented Scores | Reference
k_categorical(K) | p0, p1... p{K-1} | LogScore           | Categorical distribution on Wikipedia
Bernoulli        | p                | LogScore           | Bernoulli distribution on Wikipedia

Classification distributions can be used by passing the appropriate class as the Dist argument to the NGBClassifier() constructor. Bernoulli is the default and is equivalent to k_categorical(2).

from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

When using NGBoost for classification, the outcome vector Y must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification conventions in sklearn.
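
If your labels are not already coded this way (e.g. they are strings), sklearn's LabelEncoder produces a conforming encoding; a minimal sketch:

from sklearn.preprocessing import LabelEncoder

# map arbitrary labels onto the integers 0..K-1 that NGBoost expects
le = LabelEncoder()
y_encoded = le.fit_transform(['benign', 'malignant', 'benign', 'other'])
# y_encoded == array([0, 1, 0, 2]); le.inverse_transform recovers the originals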

NGBClassifier objects have three prediction methods: predict() returns the most likely class, predict_proba() returns the class probabilities, and pred_dist() returns the distribution object.

ngb_cat.predict(X_cls_test)[0:5]
array([1, 1, 1, 0, 1])
ngb_cat.predict_proba(X_cls_test)[0:5]
array([[3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
       [6.59565268e-03, 9.93168490e-01, 2.35857004e-04],
       [3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
       [9.92981053e-01, 6.07012737e-03, 9.48819937e-04],
       [3.53080012e-03, 9.96242905e-01, 2.26294536e-04]])
ngb_cat.pred_dist(X_cls_test)[0:5].params
{'p0': array([0.0035308 , 0.00659565, 0.0035308 , 0.99298105, 0.0035308 ]),
 'p1': array([0.99624291, 0.99316849, 0.99624291, 0.00607013, 0.99624291]),
 'p2': array([0.00022629, 0.00023586, 0.00022629, 0.00094882, 0.00022629])}
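
The columns of predict_proba() are ordered by class index, so taking the argmax over them recovers the most likely class returned by predict():

import numpy as np

# argmax over class probabilities matches the output of predict()
np.argmax(ngb_cat.predict_proba(X_cls_test), axis=1)[0:5]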

Scores

NGBoost supports the log score (LogScore, also known as negative log-likelihood) and CRPS (CRPScore), although each score may not be implemented for every distribution. The score is specified by the Score argument in the constructor.

from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)
NGBClassifier(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                         max_features=None, max_leaf_nodes=None,
                                         min_impurity_decrease=0.0,
                                         min_impurity_split=None,
                                         min_samples_leaf=1,
                                         min_samples_split=2,
                                         min_weight_fraction_leaf=0.0,
                                         presort=False, random_state=None,
                                         splitter='best'),
              Dist=<class 'ngboost.distns.categorical.k_categorical.<locals>.Categorical'>,
              Score=<class 'ngboost.scores.LogScore'>, col_sample=1.0,
              learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
              natural_gradient=True,
              random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
              verbose=False, verbose_eval=100)

Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified via the Base argument. The default is a depth-3 regression tree.

from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)
NGBSurvival(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=5,
                                       max_features=None, max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       presort=False, random_state=None,
                                       splitter='best'),
            Dist=<class 'ngboost.api.NGBSurvival.__init__.<locals>.SurvivalDistn'>,
            Score=<class 'ngboost.scores.CRPScore'>, col_sample=1.0,
            learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
            natural_gradient=True,
            random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
            verbose=False, verbose_eval=100)
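
Since any sklearn regressor can serve as the base learner, non-tree learners work the same way. A sketch using Ridge regression as the base (Ridge is an illustrative choice here, not something the original example uses):

from sklearn.linear_model import Ridge

# a linear base learner yields an additive linear model for each distribution parameter
ngb_linear = NGBRegressor(Base=Ridge(alpha=1.0), verbose=False).fit(X_reg_train, Y_reg_train)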

Other Parameters

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)
[iter 0] loss=3.6328 val_loss=0.0000 scale=0.5000 norm=3.3554
NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'),
             Dist=<class 'ngboost.distns.normal.Normal'>,
             Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
             learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
             natural_gradient=True,
             random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
             verbose=True, verbose_eval=100)
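
The val_loss column in the logs is populated when a validation set is passed to fit(). A sketch, assuming your version of NGBoost supports the X_val, Y_val, and early_stopping_rounds arguments to fit():

# monitor validation loss and stop if it fails to improve for 50 rounds
ngb = NGBRegressor(n_estimators=500, learning_rate=0.01, verbose=False)
ngb.fit(X_reg_train, Y_reg_train,
        X_val=X_reg_test, Y_val=Y_reg_test,
        early_stopping_rounds=50)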

Sample weights (for training) are set using the sample_weight argument to fit().

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)
[iter 0] loss=3.6257 val_loss=0.0000 scale=1.0000 norm=6.6551
NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'),
             Dist=<class 'ngboost.distns.normal.Normal'>,
             Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
             learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
             natural_gradient=True,
             random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
             verbose=True, verbose_eval=100)