StackingCVRegressor:用于回归的堆叠交叉验证

一个用于堆叠回归的集成学习元回归器

> 从 mlxtend.regressor 导入 StackingCVRegressor

概述

堆叠是一种集成学习技术,通过元回归器将多个回归模型结合起来。StackingCVRegressor 扩展了标准堆叠算法(被实现为 StackingRegressor),使用超出折叠预测来准备第二级回归器的输入数据。

在标准堆叠过程中,第一层回归器被拟合到与准备第二级回归器输入相同的训练集,这可能导致过拟合。然而,StackingCVRegressor 使用了超出折叠预测的概念:数据集被分成 k 个折叠,在 k 个连续的轮次中,使用 k-1 个折叠来拟合第一层回归器。在每个轮次中,第一层回归器应用于在每次迭代中未用于模型拟合的剩余 1 个子集。生成的预测结果随后被堆叠并提供作为第二级回归器的输入数据。在 StackingCVRegressor 训练完成后,第一层回归器被拟合到整个数据集,以获得最佳预测。

参考文献

示例 1:波士顿住房数据预测

在这个示例中,我们评估了一些基本预测模型在波士顿房价数据集上的表现,并查看$R^2$和均方误差(MSE)分数如何受到与StackingCVRegressor结合的模型的影响。下面的代码输出展示了堆叠模型在该数据集上的表现最佳 - 略优于最佳的单一回归模型。

from mlxtend.regressor import StackingCVRegressor
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

RANDOM_SEED = 42

X, y = load_boston(return_X_y=True)

svr = SVR(kernel='linear')
lasso = Lasso()
rf = RandomForestRegressor(n_estimators=5, 
                           random_state=RANDOM_SEED)

# 从v0.16.0版本开始,StackingCVRegressor支持
# `random_state` 以获得确定性结果。
stack = StackingCVRegressor(regressors=(svr, lasso, rf),
                            meta_regressor=lasso,
                            random_state=RANDOM_SEED)

print('5-fold cross validation scores:\n')

for clf, label in zip([svr, lasso, rf, stack], ['SVM', 'Lasso', 
                                                'Random Forest', 
                                                'StackingCVRegressor']):
    scores = cross_val_score(clf, X, y, cv=5)
    print("R^2 Score: %0.2f (+/- %0.2f) [%s]" % (
        scores.mean(), scores.std(), label))

5-fold cross validation scores:

R^2 Score: 0.46 (+/- 0.29) [SVM]
R^2 Score: 0.43 (+/- 0.14) [Lasso]
R^2 Score: 0.53 (+/- 0.28) [Random Forest]
R^2 Score: 0.57 (+/- 0.24) [StackingCVRegressor]
stack = StackingCVRegressor(regressors=(svr, lasso, rf),
                            meta_regressor=lasso)

print('5-fold cross validation scores:\n')

for clf, label in zip([svr, lasso, rf, stack], ['SVM', 'Lasso', 
                                                'Random Forest', 
                                                'StackingCVRegressor']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='neg_mean_squared_error')
    print("Neg. MSE Score: %0.2f (+/- %0.2f) [%s]" % (
        scores.mean(), scores.std(), label))

5-fold cross validation scores:

Neg. MSE Score: -33.33 (+/- 22.36) [SVM]
Neg. MSE Score: -35.53 (+/- 16.99) [Lasso]
Neg. MSE Score: -27.08 (+/- 15.67) [Random Forest]
Neg. MSE Score: -25.85 (+/- 17.22) [StackingCVRegressor]

示例 2:使用 Stacking 的 GridSearchCV

在这个第二个例子中,我们演示了 StackingCVRegressor 如何与 GridSearchCV 结合使用。堆叠仍然允许调整基础模型和元模型的超参数!

例如,我们可以使用 estimator.get_params().keys() 获取可调参数的完整列表。

from mlxtend.regressor import StackingCVRegressor
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_boston(return_X_y=True)

ridge = Ridge(random_state=RANDOM_SEED)
lasso = Lasso(random_state=RANDOM_SEED)
rf = RandomForestRegressor(random_state=RANDOM_SEED)

stack = StackingCVRegressor(regressors=(lasso, ridge),
                            meta_regressor=rf, 
                            random_state=RANDOM_SEED,
                            use_features_in_secondary=True)

params = {'lasso__alpha': [0.1, 1.0, 10.0],
          'ridge__alpha': [0.1, 1.0, 10.0]}

grid = GridSearchCV(
    estimator=stack, 
    param_grid={
        'lasso__alpha': [x/5.0 for x in range(1, 10)],
        'ridge__alpha': [x/20.0 for x in range(1, 10)],
        'meta_regressor__n_estimators': [10, 100]
    }, 
    cv=5,
    refit=True
)

grid.fit(X, y)

print("Best: %f using %s" % (grid.best_score_, grid.best_params_))

Best: 0.679151 using {'lasso__alpha': 1.6, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
    if r > 10:
        break
print('...')

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

0.632 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.05}
0.645 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.1}
0.641 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.15}
0.653 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.2}
0.622 +/- 0.10 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.25}
0.630 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.3}
0.630 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.35}
0.642 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}
0.654 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.45}
0.642 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.05}
0.645 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.1}
0.648 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.15}
...
Best parameters: {'lasso__alpha': 1.6, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}
Accuracy: 0.68

注意

StackingCVRegressor 还支持对 regressors 以及单个基础回归器进行网格搜索。当存在层级混合的超参数时,GridSearchCV 将尝试以自上而下的顺序替换超参数,即 regressors -> 单个基础回归器 -> 回归器超参数。例如,给定一个超参数网格如下:

params = {'randomforestregressor__n_estimators': [1, 100],
'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}

它将首先使用 (regr1, regr2, regr3)(regr2, regr3) 的实例设置。然后,它将根据 'randomforestregressor__n_estimators': [1, 100] 替换匹配回归器的 'n_estimators' 设置。

API

StackingCVRegressor(regressors, meta_regressor, cv=5, shuffle=True, random_state=None, verbose=0, refit=True, use_features_in_secondary=False, store_train_meta_features=False, n_jobs=None, pre_dispatch='2n_jobs', multi_output=False)*

A 'Stacking Cross-Validation' regressor for scikit-learn estimators.

Parameters

Attributes

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/

Methods


fit(X, y, groups=None, sample_weight=None)

Fit ensemble regressors and the meta-regressor.

Parameters

Returns


fit_transform(X, y=None, fit_params)

Fit to data, then transform it.

Fits transformer to `X` and `y` with optional parameters `fit_params`
and returns a transformed version of `X`.

Parameters

Returns


get_params(deep=True)

Get parameters for this estimator.

Parameters

Returns


predict(X)

Predict target values for X.

Parameters

Returns


predict_meta_features(X)

Get meta-features of test-data.

Parameters

Returns


score(X, y, sample_weight=None)

Return the coefficient of determination :math:R^2 of the prediction.

The coefficient :math:`R^2` is defined as :math:`(1 - \frac{u}{v})`,
where :math:`u` is the residual sum of squares ``((y_true - y_pred)

2).sum()and :math:`v` is the total sum of squares((y_true - y_true.mean()) 2).sum()``. The best possible score is 1.0 and it

can be negative (because the model can be arbitrarily worse). A

constant model that always predicts the expected value of y, disregarding the input features, would get a :math:R^2 score of 0.0.

Parameters

Returns

Notes

The :math:R^2 score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of :func:~sklearn.metrics.r2_score. This influences the score method of all the multioutput regressors (except for :class:~sklearn.multioutput.MultiOutputRegressor).


set_params(params)

Set the parameters of this estimator.

Valid parameter keys can be listed with ``get_params()``.

Returns

self

Properties


named_regressors

Returns

List of named estimator tuples, like [('svc', SVC(...))]