EnsembleVoteClassifier: Majority Voting Classifier

Implementation of a majority voting EnsembleVoteClassifier for classification.
# EnsembleVoteClassifier

from mlxtend.classifier import EnsembleVoteClassifier
## Overview

The EnsembleVoteClassifier is a meta-classifier for combining similar or conceptually different machine learning classifiers for classification via majority or plurality voting. (For simplicity, we will refer to both majority and plurality voting as majority voting.)

The EnsembleVoteClassifier implements "hard" and "soft" voting. In hard voting, we predict the final class label as the class label that has been predicted most frequently by the classification models. In soft voting, we predict the class labels by averaging the class probabilities (only recommended if the classifiers are well calibrated).
Note: If you are interested in using the EnsembleVoteClassifier, please note that it is now also available through scikit-learn (>0.17) as VotingClassifier.
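For comparison, here is a minimal sketch of the scikit-learn counterpart mentioned above (this assumes a reasonably recent scikit-learn release; the estimator names 'lr' and 'gnb' are arbitrary):

```python
# Minimal sketch of the scikit-learn VotingClassifier equivalent (assumption: recent scikit-learn)
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# VotingClassifier takes (name, estimator) tuples instead of a plain list of clfs
vclf = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                    ('gnb', GaussianNB())],
                        voting='hard')
vclf.fit(X, y)
print(vclf.predict(X[:3]))
```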
### Majority Voting / Hard Voting

Hard voting is the simplest case of majority voting. Here, we predict the class label $\hat{y}$ via majority (plurality) voting of each classifier $C_j$:

$$\hat{y} = \text{mode}\{C_1(\mathbf{x}), C_2(\mathbf{x}), \dots, C_m(\mathbf{x})\}$$
Assume that we combine three classifiers that classify a training sample as follows:

- Classifier 1 -> class 0
- Classifier 2 -> class 0
- Classifier 3 -> class 1

$$\hat{y} = \text{mode}\{0, 0, 1\} = 0$$

Via majority vote, we would classify the sample as "class 0."
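As a quick illustration of this arithmetic (a plain NumPy sketch, not part of mlxtend), the mode of the votes can be computed as follows:

```python
import numpy as np

# predicted labels from classifiers 1, 2, and 3 for one sample
predictions = np.array([0, 0, 1])

# mode of the votes: count votes per class label and take the argmax
y_hat = np.bincount(predictions).argmax()
print(y_hat)  # 0
```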
### Weighted Majority Vote

In addition to the simple majority vote (hard voting) described in the previous section, we can compute a weighted majority vote by associating a weight $w_j$ with classifier $C_j$:

$$\hat{y} = \arg \max_i \sum^{m}_{j=1} w_j \chi_A \big(C_j(\mathbf{x})=i\big),$$

where $\chi_A$ is the characteristic function $[C_j(\mathbf{x}) = i \; \in A]$, and $A$ is the set of unique class labels.
Continuing with the example from the previous section

- Classifier 1 -> class 0
- Classifier 2 -> class 0
- Classifier 3 -> class 1

assigning the weights {0.2, 0.2, 0.6} would yield the prediction $\hat{y} = 1$:
$$\arg \max_i [0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1] = 1$$
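The same weighted vote can be reproduced with a few lines of NumPy (an illustrative sketch, not the mlxtend implementation):

```python
import numpy as np

predictions = np.array([0, 0, 1])        # labels from classifiers C_1, C_2, C_3
weights = np.array([0.2, 0.2, 0.6])

# sum the weights of the classifiers voting for each class, then take the argmax
weighted_votes = np.bincount(predictions, weights=weights, minlength=2)
print(weighted_votes)           # [0.4 0.6]
print(weighted_votes.argmax())  # 1
```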
### Soft Voting

In soft voting, we predict the class labels based on the predicted probabilities $p$ of the classifiers; this approach is only recommended if the classifiers are well calibrated.

$$\hat{y} = \arg \max_i \sum^{m}_{j=1} w_j p_{ij},$$

where $w_j$ is the weight that can be assigned to the $j$th classifier.

Assuming that the example in the previous section was a binary classification task with class labels $i \in \{0, 1\}$, our ensemble could make the following prediction:
- $C_1(\mathbf{x}) \rightarrow [0.9, 0.1]$
- $C_2(\mathbf{x}) \rightarrow [0.8, 0.2]$
- $C_3(\mathbf{x}) \rightarrow [0.4, 0.6]$
Using uniform weights, we compute the average probabilities:
$$p(i_0 \mid \mathbf{x}) = \frac{0.9 + 0.8 + 0.4}{3} = 0.7 \\ p(i_1 \mid \mathbf{x}) = \frac{0.1 + 0.2 + 0.6}{3} = 0.3$$
$$\hat{y} = \arg \max_i \big[p(i_0 \mid \mathbf{x}), p(i_1 \mid \mathbf{x}) \big] = 0$$
However, assigning the weights {0.1, 0.1, 0.8} would yield the prediction $\hat{y} = 1$:

$$p(i_0 \mid \mathbf{x}) = 0.1 \times 0.9 + 0.1 \times 0.8 + 0.8 \times 0.4 = 0.49 \\ p(i_1 \mid \mathbf{x}) = 0.1 \times 0.1 + 0.1 \times 0.2 + 0.8 \times 0.6 = 0.51$$
$$\hat{y} = \arg \max_i \big[p(i_0 \mid \mathbf{x}), p(i_1 \mid \mathbf{x}) \big] = 1$$
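This weighted soft vote can also be checked with a short NumPy sketch (illustrative only, not the mlxtend implementation):

```python
import numpy as np

probas = np.array([[0.9, 0.1],    # C_1(x)
                   [0.8, 0.2],    # C_2(x)
                   [0.4, 0.6]])   # C_3(x)
weights = np.array([0.1, 0.1, 0.8])

# weighted average of the class probabilities, then argmax
avg = np.average(probas, axis=0, weights=weights)
print(avg)           # [0.49 0.51]
print(avg.argmax())  # 1
```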
### References

- [1] S. Raschka. Python Machine Learning. Packt Publishing Ltd., 2015.
## Example 1 - Classifying Iris Flowers Using Different Classification Models
```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
```

```python
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import numpy as np

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

print('5-fold cross validation:\n')

labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes']

for clf, label in zip([clf1, clf2, clf3], labels):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=5,
                                             scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
```

```
5-fold cross validation:

Accuracy: 0.95 (+/- 0.04) [Logistic Regression]
Accuracy: 0.94 (+/- 0.04) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [Naive Bayes]
```
```python
from mlxtend.classifier import EnsembleVoteClassifier

eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[1, 1, 1])

labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']

for clf, label in zip([clf1, clf2, clf3, eclf], labels):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=5,
                                             scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
```

```
Accuracy: 0.95 (+/- 0.04) [Logistic Regression]
Accuracy: 0.94 (+/- 0.04) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [Naive Bayes]
Accuracy: 0.95 (+/- 0.04) [Ensemble]
```
### Plotting Decision Regions
```python
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools

gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))

labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']

for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         labels,
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
```
## Example 2 - Grid Search
```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
```

```python
%%capture --no-display

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting='soft')

params = {'logisticregression__C': [1.0, 100.0],
          'randomforestclassifier__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid.fit(iris.data, iris.target)
```

```python
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
```
Note: If the EnsembleClassifier is initialized with multiple similar estimator objects, the estimator names are modified with consecutive integer indices, for example:
```python
%%capture --no-display

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf1, clf2],
                              voting='soft')

params = {'logisticregression-1__C': [1.0, 100.0],
          'logisticregression-2__C': [1.0, 100.0],
          'randomforestclassifier__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(iris.data, iris.target)
```
Note: The EnsembleVoteClassifier also enables grid search over the clfs argument. However, due to the current implementation of GridSearchCV in scikit-learn, it is not possible to search over both different classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works,
```python
params = {'randomforestclassifier__n_estimators': [1, 100],
          'clfs': [(clf1, clf1, clf1), (clf2, clf3)]}
```
it will use the instance settings of clf1, clf2, and clf3 and not overwrite them with the 'randomforestclassifier__n_estimators': [1, 100] settings.
## Example 3 - Majority Voting with Classifiers Trained on Different Feature Subsets
Feature selection algorithms implemented in scikit-learn, as well as the SequentialFeatureSelector, implement a transform method that passes the reduced feature subset to the next item in a Pipeline.

For example, the method
```python
def transform(self, X):
    return X[:, self.k_feature_idx_]
```
returns the best feature columns, k_feature_idx_, for a given dataset X.

Thus, we simply need to construct a Pipeline consisting of the feature selector and the classifier in order to select different feature subsets for different algorithms. During fitting, the optimal feature subsets are automatically determined via the GridSearchCV object, and by calling predict, the fitted feature selector in the pipeline only passes along those columns, which yielded the best performance for the respective classifier.
```python
%%capture --no-display

from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, :], iris.target

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Creating a feature-selection-classifier pipeline
sfs1 = SequentialFeatureSelector(clf1,
                                 k_features=4,
                                 forward=True,
                                 floating=False,
                                 scoring='accuracy',
                                 verbose=0,
                                 cv=0)

clf1_pipe = Pipeline([('sfs', sfs1),
                      ('logreg', clf1)])

eclf = EnsembleVoteClassifier(clfs=[clf1_pipe, clf2, clf3],
                              voting='soft')

params = {'pipeline__sfs__k_features': [1, 2, 3],
          'pipeline__logreg__C': [1.0, 100.0],
          'randomforestclassifier__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid.fit(iris.data, iris.target)
```
```python
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
```
The best parameters determined via GridSearch are:
```python
grid.best_params_
```

```
{'pipeline__logreg__C': 1.0,
 'pipeline__sfs__k_features': 2,
 'randomforestclassifier__n_estimators': 200}
```
Now, we assign these parameters to the ensemble voting classifier, fit the model on the complete training set, and perform a prediction on 3 samples from the Iris dataset.
```python
eclf = eclf.set_params(**grid.best_params_)
eclf.fit(X, y).predict(X[[1, 51, 149]])
```

```
array([0, 1, 2])
```
### Manual Approach

Alternatively, we can select different columns "manually" using the ColumnSelector object. In this example, we select only the first (sepal length) and third (petal length) column as input for the logistic regression classifier (clf1).
```python
from mlxtend.feature_selection import ColumnSelector

col_sel = ColumnSelector(cols=[0, 2])

clf1_pipe = Pipeline([('sel', col_sel),
                      ('logreg', clf1)])

eclf = EnsembleVoteClassifier(clfs=[clf1_pipe, clf2, clf3],
                              voting='soft')
eclf.fit(X, y).predict(X[[1, 51, 149]])
```

```
array([0, 1, 2])
```
Furthermore, we can fit the SequentialFeatureSelector separately, outside the grid search hyperparameter optimization pipeline. Here, we determine the best features first, and then we construct a pipeline using these "fixed," best features as a seed for the ColumnSelector:
```python
sfs1 = SequentialFeatureSelector(clf1,
                                 k_features=2,
                                 forward=True,
                                 floating=False,
                                 scoring='accuracy',
                                 verbose=1,
                                 cv=0)

sfs1.fit(X, y)
print('Best features', sfs1.k_feature_idx_)

col_sel = ColumnSelector(cols=sfs1.k_feature_idx_)

clf1_pipe = Pipeline([('sel', col_sel),
                      ('logreg', clf1)])
```

```
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
Features: 1/2[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
Features: 2/2

Best features (2, 3)
```
```python
eclf = EnsembleVoteClassifier(clfs=[clf1_pipe, clf2, clf3],
                              voting='soft')
eclf.fit(X, y).predict(X[[1, 51, 149]])
```

```
array([0, 1, 2])
```
## Example 5 - Using Pre-fitted Classifiers
```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
```
Assume that we previously fitted our classifiers:
```python
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import numpy as np

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

for clf in (clf1, clf2, clf3):
    clf.fit(X, y)
```
By setting fit_base_estimators=False, which will enforce use_clones to be False, the EnsembleVoteClassifier will not refit these classifiers, which saves computational time:
```python
from mlxtend.classifier import EnsembleVoteClassifier
import copy

eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
                              weights=[1, 1, 1],
                              fit_base_estimators=False)

labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']

eclf.fit(X, y)
print('accuracy:', np.mean(y == eclf.predict(X)))
```

```
accuracy: 0.96

/Users/sebastian/miniforge3/lib/python3.9/site-packages/mlxtend/classifier/ensemble_vote.py:166: UserWarning: fit_base_estimators=False enforces use_clones to be `False`
  warnings.warn("fit_base_estimators=False "
```
However, please note that fit_base_estimators=False is incompatible with any form of cross-validation that is done in, for example, model_selection.cross_val_score or model_selection.GridSearchCV, since this would require the classifiers to be refit to the training folds. Thus, only use fit_base_estimators=False if you want to make a prediction directly without cross-validation.
## Example 6 - Ensembles of Classifiers that Operate on Different Feature Subsets

If desired, the different classifiers can be fit to different subsets of features in the training dataset. The following example illustrates how this can be done on a technical level using scikit-learn pipelines and the ColumnSelector:
```python
from sklearn.datasets import load_iris
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
y = iris.target

pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      LogisticRegression())

eclf = EnsembleVoteClassifier(clfs=[pipe1, pipe2])
eclf.fit(X, y)
```

```
EnsembleVoteClassifier(clfs=[Pipeline(steps=[('columnselector',
                                              ColumnSelector(cols=(0, 2))),
                                             ('logisticregression',
                                              LogisticRegression())]),
                             Pipeline(steps=[('columnselector',
                                              ColumnSelector(cols=(1, 2, 3))),
                                             ('logisticregression',
                                              LogisticRegression())])])
```
## Example 7 - A Note about Scikit-Learn SVMs and Soft Voting

This section provides some additional technical insight into how the probabilities are used when voting='soft'.

Note that scikit-learn estimates the probabilities for SVMs (more info here: https://scikit-learn.org/stable/modules/svm.html#scores-probabilities) in a way that these may not be consistent with the class labels that the SVM predicts. This is an extreme example, but let's say we have a dataset with 3 class labels, 0, 1, and 2. For a given training example, the SVM classifier may predict class 2. However, the class-membership probabilities may look as follows:
- Class 0: 99%
- Class 1: 0.5%
- Class 2: 0.5%

A practical example of this scenario is shown below:
```python
import numpy as np
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

clf2 = SVC(probability=True, random_state=4)
clf2.fit(X, y)

eclf = EnsembleVoteClassifier(clfs=[clf2], voting='soft', fit_base_estimators=False)
eclf.fit(X, y)

for svm_class, e_class, svm_prob, e_prob in zip(clf2.predict(X),
                                                eclf.predict(X),
                                                clf2.predict_proba(X),
                                                eclf.predict_proba(X)):
    if svm_class != e_class:
        print('============')
        print('Probas from SVM            :', svm_prob)
        print('Class from SVM             :', svm_class)
        print('Probas from SVM in Ensemble:', e_prob)
        print('Class from SVM in Ensemble :', e_class)
        print('============')
```
```
============
Probas from SVM            : [0.00921708 0.49415165 0.49663127]
Class from SVM             : 1
Probas from SVM in Ensemble: [0.00921708 0.49415165 0.49663127]
Class from SVM in Ensemble : 2
============

/Users/sebastian/miniforge3/lib/python3.9/site-packages/mlxtend/classifier/ensemble_vote.py:166: UserWarning: fit_base_estimators=False enforces use_clones to be `False`
  warnings.warn("fit_base_estimators=False "
```
Based on the probabilities, we would expect the SVM to predict class 2, because it has the highest probability. Since the EnsembleVoteClassifier uses the argmax function internally if voting='soft', it would indeed predict class 2 in this case, even if the ensemble consists of only a single SVM model.

Note that, in practice, this minor technical detail does not need to concern you, but it is useful to keep in mind in case you are wondering about results from a 1-model SVM ensemble compared to that SVM alone; this is not an error.
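To make this concrete, the following sketch (an assumption about the internals, reusing the clf2, eclf, and X objects fitted above) checks that the single-SVM ensemble prediction equals the argmax over the SVM's predict_proba output:

```python
# Sketch: soft voting with a single, unit-weight classifier reduces to the argmax
# over that classifier's predict_proba output, so the ensemble follows the
# probabilities rather than SVC.predict.
import numpy as np

manual_soft_vote = np.argmax(clf2.predict_proba(X), axis=1)
print(np.array_equal(manual_soft_vote, eclf.predict(X)))  # expected: True
```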
## Example 8 - Optimizing Ensemble Weights with Nelder-Mead

In this section, we will see how we can optimize the ensemble weights using a heuristic search method such as Nelder-Mead.

Suppose we have the following example scenario, where we fit 3 individual classifiers on different subsets of the training dataset:
```python
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import mnist_data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = mnist_data()
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=1
)

clf1 = GaussianNB()
clf2 = LogisticRegression(random_state=123, solver='newton-cg')
clf3 = DecisionTreeClassifier(random_state=123, max_depth=2)

clf1.fit(X_train[500:1000], y_train[500:1000])
clf2.fit(X_train[750:1250], y_train[750:1250])
clf3.fit(X_train[1250:2000], y_train[1250:2000]);
```
Then, we construct an ensemble classifier from these 3 classifiers, where each classifier contributes equally via a weight of 1:
```python
eclf = EnsembleVoteClassifier(
    clfs=(clf1, clf2, clf3),
    voting="soft",  # the same would also work with "hard" voting
    weights=(1, 1, 1),
    use_clones=False,
    fit_base_estimators=False,
)

eclf.fit(X_train, y_train)
eclf.score(X_val, y_val)
```

```
/Users/sebastian/miniforge3/lib/python3.9/site-packages/mlxtend/classifier/ensemble_vote.py:166: UserWarning: fit_base_estimators=False enforces use_clones to be `False`
  warnings.warn("fit_base_estimators=False "

0.8012
```
We can see that we achieve 80% accuracy on the validation set. Can we do better? Maybe these individual classifiers shouldn't contribute equally. Perhaps we can use an optimization algorithm from scipy.optimize to find better relative weights for these individual classifiers.

Let's set up an objective function that we want to minimize via SciPy's minimize:
```python
from scipy.optimize import minimize


def function_to_minimize(weights, fitted_clfs):

    w1, w2 = weights  # these are the new weights!

    newclf = EnsembleVoteClassifier(
        voting="soft",
        use_clones=False,
        fit_base_estimators=False,
        clfs=fitted_clfs,
        weights=(w1, w2, 1.),  # use the new weights
    )
    newclf.fit(X_train, y_train)

    score = newclf.score(X_val, y_val)
    # convert accuracy to error so that lower values are better
    score_to_minimize = 1 - score

    return score_to_minimize
```
Note a few things:

- We only optimize 2 of the 3 classifier weights. This is because the weights are relative; optimizing the 3rd weight as well would add unnecessary complexity (and too many degrees of freedom). A small sanity check of this relativity is sketched after this list.
- We set use_clones=False & fit_base_estimators=False as in the previous section, which ensures that the ensemble classifier uses the pre-fitted classifiers.
- Instead of the accuracy, we optimize the misclassification error, i.e., score_to_minimize = 1 - score. This is because we are using the minimize function, where lower values are better.
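As a quick sanity check of the first point (an illustrative sketch using the classifiers fitted above, not part of the original example), scaling all weights by the same factor should leave the predictions unchanged:

```python
# Weights are relative: multiplying all of them by the same constant does not
# change the argmax of the (weighted) vote, so the predictions stay the same.
eclf_a = EnsembleVoteClassifier(clfs=(clf1, clf2, clf3), voting="soft",
                                weights=(1, 1, 1), use_clones=False,
                                fit_base_estimators=False).fit(X_train, y_train)
eclf_b = EnsembleVoteClassifier(clfs=(clf1, clf2, clf3), voting="soft",
                                weights=(2, 2, 2), use_clones=False,
                                fit_base_estimators=False).fit(X_train, y_train)
print((eclf_a.predict(X_val) == eclf_b.predict(X_val)).all())  # expected: True
```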
Next, let's select some initial weight values and run the optimization. Via bounds, we specify the range (lower and upper bound) of each weight so that the search doesn't go haywire:
```python
%%capture --no-display

init_weights = [1., 1.]

results = minimize(
    function_to_minimize,
    init_weights,
    args=((clf1, clf2, clf3),),
    bounds=((0, 5), (0, 5)),
    method="nelder-mead",
)
```
Let's look at the results!
```python
print(results)
```

```
final_simplex: (array([[0.575     , 1.40625   ],
                       [0.57500153, 1.40622215],
                       [0.57508965, 1.40617647]]), array([0.1324, 0.1324, 0.1324]))
          fun: 0.13239999999999996
      message: 'Optimization terminated successfully.'
         nfev: 60
          nit: 21
       status: 0
      success: True
            x: array([0.575  , 1.40625])
```
It looks like the search was successful and returned the following weights:
```python
solution = results["x"]
print(solution)
```

```
[0.575   1.40625]
```
Let's use these new weights in our ensemble classifier:
```python
eclf = EnsembleVoteClassifier(
    clfs=(clf1, clf2, clf3),
    voting="soft",
    weights=(solution[0], solution[1], 1),
    use_clones=False,
    fit_base_estimators=False,
)

eclf.fit(X_train, y_train)
eclf.score(X_val, y_val)
```

```
/Users/sebastian/miniforge3/lib/python3.9/site-packages/mlxtend/classifier/ensemble_vote.py:166: UserWarning: fit_base_estimators=False enforces use_clones to be `False`
  warnings.warn("fit_base_estimators=False "

0.8676
```
As we can see, the result on the validation set (0.8676) improved compared to the original result (0.8012). Nice!
## API
EnsembleVoteClassifier(clfs, voting='hard', weights=None, verbose=0, use_clones=True, fit_base_estimators=True)
Soft Voting/Majority Rule classifier for scikit-learn estimators.
Parameters

- `clfs` : array-like, shape = [n_classifiers]

  A list of classifiers. Invoking the `fit` method on the `VotingClassifier` will fit clones of those original classifiers that will be stored in the class attribute if `use_clones=True` (default) and `fit_base_estimators=True` (default).

- `voting` : str, {'hard', 'soft'} (default='hard')

  If 'hard', uses predicted class labels for majority rule voting. Else if 'soft', predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

- `weights` : array-like, shape = [n_classifiers], optional (default=`None`)

  Sequence of weights (`float` or `int`) to weight the occurrences of predicted class labels (`hard` voting) or class probabilities before averaging (`soft` voting). Uses uniform weights if `None`.

- `verbose` : int, optional (default=0)

  Controls the verbosity of the building process.
  - `verbose=0` (default): Prints nothing
  - `verbose=1`: Prints the number & name of the clf being fitted
  - `verbose=2`: Prints info about the parameters of the clf being fitted
  - `verbose>2`: Changes the `verbose` param of the underlying clf to self.verbose - 2

- `use_clones` : bool (default: True)

  Clones the classifiers for stacking classification if True (default), or else uses the original ones, which will be refitted on the dataset upon calling the `fit` method. Hence, if use_clones=True, the original input classifiers will remain unmodified upon using the StackingClassifier's `fit` method. Setting `use_clones=False` is recommended if you are working with estimators that support the scikit-learn fit/predict API interface but are not compatible with scikit-learn's `clone` function.

- `fit_base_estimators` : bool (default: True)

  Refits classifiers in `clfs` if True; uses references to the `clfs` otherwise (assumes that the classifiers were already fit). Note: fit_base_estimators=False will enforce use_clones to be False and is incompatible with most scikit-learn wrappers! For instance, if any form of cross-validation is performed, this would require the classifiers to be refit to the training folds, which would raise a NotFittedError if fit_base_estimators=False. (New in mlxtend v0.6.)
Attributes

- `classes_` : array-like, shape = [n_predictions]

- `clf` : array-like, shape = [n_predictions]

  The input classifiers; may be overwritten if `use_clones=False`

- `clf_` : array-like, shape = [n_predictions]

  Fitted input classifiers; clones if `use_clones=True`
Examples
```
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier
>>> from mlxtend.classifier import EnsembleVoteClassifier
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
... voting='hard', verbose=1)
>>> eclf1 = eclf1.fit(X, y)
>>> print(eclf1.predict(X))
[1 1 1 2 2 2]
>>> eclf2 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting='soft')
>>> eclf2 = eclf2.fit(X, y)
>>> print(eclf2.predict(X))
[1 1 1 2 2 2]
>>> eclf3 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
... voting='soft', weights=[2,1,1])
>>> eclf3 = eclf3.fit(X, y)
>>> print(eclf3.predict(X))
[1 1 1 2 2 2]
>>>
For more usage examples, please see
https://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
```
Methods
fit(X, y, sample_weight=None)
Learn weight coefficients from training data for each classifier.
Parameters

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

  Training vectors, where n_samples is the number of samples and n_features is the number of features.

- `y` : array-like, shape = [n_samples]

  Target values.

- `sample_weight` : array-like, shape = [n_samples], optional

  Sample weights passed as sample_weights to each regressor in the regressors list as well as the meta_regressor. Raises error if some regressor does not support sample_weight in the fit() method.

Returns

- `self` : object
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to `X` and `y` with optional parameters `fit_params`
and returns a transformed version of `X`.
Parameters

- `X` : array-like of shape (n_samples, n_features)

  Input samples.

- `y` : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

  Target values (None for unsupervised transformations).

- `**fit_params` : dict

  Additional fit parameters.

Returns

- `X_new` : ndarray array of shape (n_samples, n_features_new)

  Transformed array.
get_params(deep=True)
Return estimator parameter names for GridSearch support.
predict(X)
Predict class labels for X.
Parameters

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

  Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

- `maj` : array-like, shape = [n_samples]

  Predicted class labels.
predict_proba(X)
Predict class probabilities for X.
Parameters

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

  Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

- `avg` : array-like, shape = [n_samples, n_classes]

  Weighted average probability for each class per sample.
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters

- `X` : array-like of shape (n_samples, n_features)

  Test samples.

- `y` : array-like of shape (n_samples,) or (n_samples, n_outputs)

  True labels for `X`.

- `sample_weight` : array-like of shape (n_samples,), default=None

  Sample weights.

Returns

- `score` : float

  Mean accuracy of `self.predict(X)` wrt. `y`.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as :class:`~sklearn.pipeline.Pipeline`). The latter have
parameters of the form ``<component>__<parameter>`` so that it's
possible to update each component of a nested object.
Parameters

- `**params` : dict

  Estimator parameters.

Returns

- `self` : estimator instance

  Estimator instance.
transform(X)
Return class labels or probabilities for X for each estimator.
Parameters

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

  Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

- If `voting='soft'` : array-like = [n_classifiers, n_samples, n_classes]

  Class probabilities calculated by each classifier.

- If `voting='hard'` : array-like = [n_classifiers, n_samples]

  Class labels predicted by each classifier.