StackingCVClassifier: Stacking with Cross-Validation
An ensemble-learning meta-classifier for stacking that uses cross-validation to prepare the inputs for the level-2 classifier in order to prevent overfitting.
from mlxtend.classifier import StackingCVClassifier
Overview
Stacking is an ensemble learning technique that combines multiple classification models via a meta-classifier. The StackingCVClassifier extends the standard stacking algorithm (implemented as StackingClassifier) using cross-validation to prepare the input data for the level-2 classifier.
In the standard stacking procedure, the first-level classifiers are fit to the same training set that is used to prepare the inputs for the second-level classifier, which may lead to overfitting. The StackingCVClassifier, however, uses the concept of cross-validation: the dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit the first-level classifiers; in each round, the first-level classifiers are then applied to the remaining 1 subset that was not used for model fitting in that iteration. The resulting predictions are then stacked and provided as input data to the second-level classifier. After the training of the StackingCVClassifier, the first-level classifiers are fit to the entire dataset, as illustrated in the figure below.
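The following minimal sketch illustrates this out-of-fold procedure using scikit-learn's cross_val_predict; it is an illustration of the concept only, not mlxtend's actual implementation:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_iris(return_X_y=True)
level1 = [KNeighborsClassifier(n_neighbors=1), GaussianNB()]

# Each column holds the out-of-fold predictions of one level-1 classifier,
# so no classifier ever predicts a sample it was trained on.
meta_features = np.column_stack(
    [cross_val_predict(clf, X, y, cv=5) for clf in level1])

# The level-2 (meta) classifier is trained on these out-of-fold predictions.
meta_clf = LogisticRegression().fit(meta_features, y)

# Finally, the level-1 classifiers are refit to the entire dataset.
for clf in level1:
    clf.fit(X, y)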
More formally, the Stacking Cross-Validation algorithm can be summarized as follows (source: [1]):
References
- [1] Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.
- [2] Wolpert, David H. "Stacked generalization." Neural Networks 5.2 (1992): 241-259.
- [3] Marios Michailidis (2017), StackNet, StackNet Meta-Modelling Framework, https://github.com/kaz-Anova/StackNet
Example 1 - Simple Stacking CV Classification
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
import numpy as np
import warnings
warnings.simplefilter('ignore')
RANDOM_SEED = 42
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
# Starting from v0.16.0, StackingCVClassifier supports
# `random_state` to get deterministic results.
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingCVClassifier']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
3-fold cross validation:
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.93 (+/- 0.02) [StackingCVClassifier]
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
for clf, lab, grd in zip([clf1, clf2, clf3, sclf],
                         ['KNN',
                          'Random Forest',
                          'Naive Bayes',
                          'StackingCVClassifier'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()
Example 2 - Using Probabilities as Meta-Features
Alternatively, the class probabilities of the first-level classifiers can be used to train the meta-classifier (second-level classifier) by setting use_probas=True. For example, in a setting with 3 classes and 2 first-level classifiers, these classifiers may make the following "probability" predictions for 1 training sample:
- classifier 1: [0.2, 0.5, 0.3]
- classifier 2: [0.3, 0.4, 0.4]
This results in k features, where k = [n_classes * n_classifiers], obtained by stacking these first-level probabilities, as sketched below:
- [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]
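A minimal numpy sketch of how these probability vectors are concatenated into a single row of meta-features (illustration only):
import numpy as np

# Hypothetical level-1 "probability" outputs for one training sample
p_clf1 = np.array([0.2, 0.5, 0.3])
p_clf2 = np.array([0.3, 0.4, 0.4])

# Stacked side by side: n_classes * n_classifiers = 3 * 2 = 6 meta-features
meta_row = np.hstack([p_clf1, p_clf2])
print(meta_row)  # [0.2 0.5 0.3 0.3 0.4 0.4]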
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            use_probas=True,
                            meta_classifier=lr,
                            random_state=42)
print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingCVClassifier']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
3-fold cross validation:
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingCVClassifier]
Example 3 - Stacked CV Classification and GridSearch
Stacking allows tuning the hyperparameters of the base models as well as the meta-model! The complete list of tunable parameters can be obtained via estimator.get_params().keys(), as the short sketch below shows.
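For example (a minimal sketch; the exact keys depend on which classifiers are stacked):
from mlxtend.classifier import StackingCVClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

sclf = StackingCVClassifier(classifiers=[KNeighborsClassifier()],
                            meta_classifier=LogisticRegression())
# Prints keys such as 'kneighborsclassifier__n_neighbors'
# and 'meta_classifier__C'
print(sorted(sclf.get_params().keys()))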
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingCVClassifier
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=42)
params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}
grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
0.947 +/- 0.03 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.933 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.95
In case we plan to use the same algorithm multiple times, all we need to do is add an additional number suffix in the parameter grid, as shown below:
from sklearn.model_selection import GridSearchCV
# Initializing models
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)
params = {'kneighborsclassifier-1__n_neighbors': [1, 5],
          'kneighborsclassifier-2__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}
grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.960 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.960 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.953 +/- 0.02 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.96
Note
The StackingCVClassifier also enables grid search over the classifiers argument. When there are level-mixed hyperparameters, GridSearchCV will try to replace hyperparameters in a top-down order, i.e., classifiers -> single base classifier -> classifier hyperparameters. For instance, given a hyperparameter grid such as
params = {'randomforestclassifier__n_estimators': [1, 100],
          'classifiers': [(clf1, clf1, clf1), (clf2, clf3)]}
it will first use the instance settings of either (clf1, clf1, clf1) or (clf2, clf3). Then it will replace the 'n_estimators' setting of the matching classifier based on 'randomforestclassifier__n_estimators': [1, 100], as the short sketch below illustrates.
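A minimal sketch of this top-down replacement, applied manually via set_params and assuming clf1, clf2, clf3, and lr are defined as in the examples above (an illustration, not a full grid search):
from mlxtend.classifier import StackingCVClassifier

sclf = StackingCVClassifier(classifiers=[clf1, clf1, clf1],
                            meta_classifier=lr)
# 'classifiers' is replaced first, so 'randomforestclassifier__n_estimators'
# can then be matched against the new level-1 classifier set (clf2, clf3).
sclf.set_params(classifiers=(clf2, clf3),
                randomforestclassifier__n_estimators=100)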
Example 4 - Stacking of Classifiers that Operate on Different Feature Subsets
The different level-1 classifiers can be fit to different subsets of features in the training dataset. The following example illustrates how this can be done on a technical level using scikit-learn pipelines and the ColumnSelector:
from sklearn.datasets import load_iris
from mlxtend.classifier import StackingCVClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X = iris.data
y = iris.target
pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      LogisticRegression())
sclf = StackingCVClassifier(classifiers=[pipe1, pipe2],
                            meta_classifier=LogisticRegression(),
                            random_state=42)
sclf.fit(X, y)
StackingCVClassifier(classifiers=[Pipeline(memory=None,
                                           steps=[('columnselector',
                                                   ColumnSelector(cols=(0, 2),
                                                                  drop_axis=False)),
                                                  ('logisticregression',
                                                   LogisticRegression(C=1.0,
                                                                      class_weight=None,
                                                                      dual=False,
                                                                      fit_intercept=True,
                                                                      intercept_scaling=1,
                                                                      l1_ratio=None,
                                                                      max_iter=100,
                                                                      multi_class='auto',
                                                                      n_jobs=None,
                                                                      penalty='l2',
                                                                      random_state=None,
                                                                      solver='lbfgs',
                                                                      tol=0.0...
                                          fit_intercept=True,
                                          intercept_scaling=1,
                                          l1_ratio=None,
                                          max_iter=100,
                                          multi_class='auto',
                                          n_jobs=None,
                                          penalty='l2',
                                          random_state=None,
                                          solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
                     n_jobs=None, pre_dispatch='2*n_jobs', random_state=42,
                     shuffle=True, store_train_meta_features=False,
                     stratify=True, use_clones=True,
                     use_features_in_secondary=False, use_probas=False,
                     verbose=0)
Example 5 - Using the decision_function for ROC Curves
Like other scikit-learn classifiers, the StackingCVClassifier has a decision_function method that can be used for plotting ROC curves. Note that decision_function expects and requires the meta-classifier to implement a decision_function.
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.metrics import roc_curve, auc
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
RANDOM_SEED = 42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_SEED)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = SVC(random_state=RANDOM_SEED)
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(sclf)
Using predict_proba()
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
Using decision_function()
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
API
StackingCVClassifier(classifiers, meta_classifier, use_probas=False, drop_proba_col=None, cv=2, shuffle=True, random_state=None, stratify=True, verbose=0, use_features_in_secondary=False, store_train_meta_features=False, use_clones=True, n_jobs=None, pre_dispatch='2*n_jobs')
A 'Stacking Cross-Validation' classifier for scikit-learn estimators.
New in mlxtend v0.4.3
Parameters
- classifiers : array-like, shape = [n_classifiers]
  A list of classifiers. Invoking the fit method on the StackingCVClassifier will fit clones of these original classifiers that will be stored in the class attribute self.clfs_.
- meta_classifier : object
  The meta-classifier to be fitted on the ensemble of classifiers.
- use_probas : bool (default: False)
  If True, trains the meta-classifier based on predicted probabilities instead of class labels.
- drop_proba_col : string (default: None)
  Drops extra "probability" column in the feature set, because it is redundant: p(y_c) = 1 - p(y_1) - p(y_2) - ... - p(y_{c-1}). This can be useful for meta-classifiers that are sensitive to perfectly collinear features. If 'last', drops the last probability column. If 'first', drops the first probability column. Only relevant if use_probas=True.
- cv : int, cross-validation generator, or an iterable, optional (default: 2)
  Determines the cross-validation splitting strategy. Possible inputs for cv are:
  - None, to use the default 2-fold cross-validation
  - integer, to specify the number of folds in a (Stratified)KFold
  - an object to be used as a cross-validation generator
  - an iterable yielding train/test splits
  For integer/None inputs, it will use either a KFold or StratifiedKFold cross-validation depending on the value of the stratify argument.
- shuffle : bool (default: True)
  If True, and the cv argument is an integer, the training data will be shuffled at the fitting stage prior to cross-validation. If the cv argument is a specific cross-validation technique, this argument is omitted.
- random_state : int, RandomState instance or None, optional (default: None)
  Controls the randomness of the cv splitter. Used when cv is an integer and shuffle=True. New in v0.16.0.
- stratify : bool (default: True)
  If True, and the cv argument is an integer, it will follow a stratified K-Fold cross-validation technique. If the cv argument is a specific cross-validation technique, this argument is omitted.
- verbose : int, optional (default=0)
  Controls the verbosity of the building process.
  - verbose=0 (default): Prints nothing
  - verbose=1: Prints the number & name of the classifier being fitted and which fold is currently being used for fitting
  - verbose=2: Prints info about the parameters of the classifier being fitted
  - verbose>2: Changes the verbose param of the underlying classifier to self.verbose - 2
- use_features_in_secondary : bool (default: False)
  If True, the meta-classifier will be trained both on the predictions of the original classifiers and the original dataset. If False, the meta-classifier will be trained only on the predictions of the original classifiers.
- store_train_meta_features : bool (default: False)
  If True, the meta-features computed from the training data used for fitting the meta-classifier are stored in the self.train_meta_features_ array, which can be accessed after calling fit.
- use_clones : bool (default: True)
  Clones the classifiers for stacking classification if True (default), or else uses the original ones, which will be refitted on the dataset upon calling the fit method. Hence, if use_clones=True, the original input classifiers will remain unmodified upon using the StackingCVClassifier's fit method. Setting use_clones=False is recommended if you are working with estimators that support the scikit-learn fit/predict API interface but are not compatible with scikit-learn's clone function.
- n_jobs : int or None, optional (default=None)
  The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the scikit-learn glossary entry for n_jobs for more details. New in v0.16.0.
- pre_dispatch : int or string, optional
  Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
  - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  - an int, giving the exact number of total jobs that are spawned
  - a string, giving an expression as a function of n_jobs, as in '2*n_jobs'
  New in v0.16.0.
Attributes
- clfs_ : list, shape = [n_classifiers]
  Fitted classifiers (clones of the original classifiers).
- meta_clf_ : estimator
  Fitted meta-classifier (clone of the original meta-estimator).
- train_meta_features : numpy array, shape = [n_samples, n_classifiers]
  Meta-features for training data, where n_samples is the number of samples in the training data and n_classifiers is the number of classifiers.
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/
Methods
decision_function(X)
Predict class confidence scores for X.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns
- scores : shape = (n_samples,) if n_classes == 2 else (n_samples, n_classes)
  Confidence scores per (sample, class) combination. In the binary case, confidence score for self.classes_[1] where > 0 means this class would be predicted.
fit(X, y, groups=None, sample_weight=None)
Fit ensemble classifiers and the meta-classifier.
Parameters
- X : numpy array, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : numpy array, shape = [n_samples]
  Target values.
- groups : numpy array/None, shape = [n_samples]
  The group that each sample belongs to. This is used by specific folding strategies such as GroupKFold().
- sample_weight : array-like, shape = [n_samples], optional
  Sample weights passed as sample_weights to each classifier in the classifiers list as well as to the meta_classifier. Raises an error if some classifier does not support sample_weight in the fit() method.
Returns
self : object
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
- X : numpy array of shape [n_samples, n_features]
  Training set.
- y : numpy array of shape [n_samples]
  Target values.
- **fit_params : dict
  Additional fit parameters.
Returns
- X_new : numpy array of shape [n_samples, n_features_new]
  Transformed array.
get_params(deep=True)
Return estimator parameter names for GridSearch support.
predict(X)
Predict target values for X.
Parameters
- X : numpy array, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns
- labels : array-like, shape = [n_samples]
  Predicted class labels.
predict_meta_features(X)
Get meta-features of test-data.
Parameters
- X : numpy array, shape = [n_samples, n_features]
  Test vectors, where n_samples is the number of samples and n_features is the number of features.
Returns
- meta-features : numpy array, shape = [n_samples, n_classifiers]
  Returns the meta-features for test data.
predict_proba(X)
Predict class probabilities for X.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns
- proba : array-like, shape = [n_samples, n_classes], or a list of n_outputs of such arrays if n_outputs > 1
  Probability for each class per sample.
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters
- X : array-like of shape (n_samples, n_features)
  Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
  True labels for X.
- sample_weight : array-like of shape (n_samples,), default=None
  Sample weights.
Returns
- score : float
  Mean accuracy of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params().
Returns
self
Properties
named_classifiers
None