combined_ftest_5x2cv: 5x2cv combined F test for classifier comparisons

The 5x2cv combined F test procedure for comparing the performance of two models.

> `from mlxtend.evaluate import combined_ftest_5x2cv`

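Overview

The 5x2cv combined F test is a procedure for comparing the performance of two models (classifiers or regressors). It was proposed by Alpaydin [1] as a more robust alternative to Dietterich's 5x2cv paired t test procedure [2] (see paired_ttest_5x2cv). Dietterich's 5x2cv method was in turn designed to address shortcomings of other methods such as the resampled paired t test (see paired_ttest_resampled) and the k-fold cross-validated paired t test (see paired_ttest_kfold_cv).

To explain how this method works, consider two estimators (e.g., classifiers) A and B and a labeled dataset D. In the common hold-out method, we typically split the dataset into two parts: a training set and a test set. As in the 5x2cv paired t test, we repeat the split (50% training and 50% test data) 5 times.

In each of the 5 iterations, we fit A and B to the training split and evaluate their performance ($p_A$ and $p_B$) on the test split. Then, we rotate the training and test sets (the training set becomes the test set and vice versa) and compute the performance again, which results in two performance difference measures: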

$$p^{(1)} = p^{(1)}_A - p^{(1)}_B$$

and

$$p^{(2)} = p^{(2)}_A - p^{(2)}_B.$$
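The splitting scheme described above can be sketched in plain Python. This is a simplified illustration, not mlxtend's implementation: the function name `five_by_two_splits` is hypothetical, and the sketch only produces index lists (fitting and scoring the two estimators on each split is left out).

```python
import random


def five_by_two_splits(n_samples, seed=None):
    """Yield (iteration, fold, train_idx, test_idx) for a 5x2cv scheme.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    for iteration in range(1, 6):
        # A fresh random 50/50 split in every iteration.
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        # Fold 1: train on the first half, test on the second half.
        yield iteration, 1, first, second
        # Fold 2: rotate the roles of the two halves.
        yield iteration, 2, second, first


splits = list(five_by_two_splits(10, seed=1))
for iteration, fold, train_idx, test_idx in splits[:2]:
    print(iteration, fold, sorted(train_idx), sorted(test_idx))
```

Each iteration contributes two train/test pairs, so the two estimators are each fitted and scored ten times in total.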

Then, we estimate the mean and variance of the two differences for that iteration:

$\overline{p} = \frac{p^{(1)} + p^{(2)}}{2}$

and

$s^2 = (p^{(1)} - \overline{p})^2 + (p^{(2)} - \overline{p})^2.$
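As a small worked example of the two formulas above, the following sketch computes the mean and variance for a single iteration. The function name `mean_and_variance` and the input differences are made up for illustration.

```python
def mean_and_variance(p1, p2):
    """Mean and variance of the two performance differences of one iteration."""
    p_bar = (p1 + p2) / 2
    s2 = (p1 - p_bar) ** 2 + (p2 - p_bar) ** 2
    return p_bar, s2


# Made-up differences: A beats B by 4 points on fold 1, loses by 2 on fold 2.
p_bar, s2 = mean_and_variance(0.04, -0.02)
print(round(p_bar, 6), round(s2, 6))  # roughly 0.01 and 0.0018
```

Because there are only two values per iteration, the variance simplifies to $(p^{(1)} - p^{(2)})^2 / 2$.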

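Collecting the differences $p_i^{(j)}$ and the variances $s_i^2$ from all five iterations $i$, the F statistic proposed by Alpaydin (see the paper for the justification) is then computed as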

$$ f = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^2}{2 \sum_{i=1}^{5} s_i^2}, $$

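which is approximately F distributed with 10 and 5 degrees of freedom.

Using the f statistic, the p value can be computed and compared with a previously chosen significance level, e.g., $\alpha=0.05$. If the p value is smaller than $\alpha$, we reject the null hypothesis and accept that there is a significant difference between the two models.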
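The computation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not mlxtend's code: both function names are hypothetical, the input differences are made up, and the p value is taken to be the upper-tail probability of the F distribution with 10 and 5 degrees of freedom. For these particular degrees of freedom the tail probability has a finite closed form, which avoids a SciPy dependency here; in practice one would more likely call `scipy.stats.f.sf(f, 10, 5)`.

```python
def combined_f_statistic(differences):
    """F statistic from five (p1, p2) difference pairs, one per iteration."""
    if len(differences) != 5:
        raise ValueError("expected the differences of exactly 5 iterations")
    numerator = 0.0
    variance_sum = 0.0
    for p1, p2 in differences:
        p_bar = (p1 + p2) / 2
        variance_sum += (p1 - p_bar) ** 2 + (p2 - p_bar) ** 2
        numerator += p1 ** 2 + p2 ** 2
    return numerator / (2 * variance_sum)


def f_10_5_upper_tail(f_stat):
    """P(F > f_stat) for an F(10, 5) variable.

    Uses the regularized incomplete beta function I_x(5/2, 5) with
    x = 5 / (5 + 10 * f_stat), which is a finite sum because the
    second argument is an integer.
    """
    x = 1.0 / (1.0 + 2.0 * f_stat)
    a = 2.5
    term = 1.0
    total = 1.0
    for j in range(1, 5):
        term *= (a + j - 1) / j * (1.0 - x)
        total += term
    return x ** a * total


# Made-up differences for the five iterations.
diffs = [(0.04, -0.02), (0.01, 0.03), (0.0, 0.02), (-0.01, 0.01), (0.02, 0.02)]
f_stat = combined_f_statistic(diffs)
p_value = f_10_5_upper_tail(f_stat)
print('F statistic: %.3f' % f_stat)
print('p value: %.3f' % p_value)
print('reject null hypothesis at alpha=0.05:', p_value < 0.05)
```

As a sanity check on the assumption, an F statistic of 1.053 gives an upper-tail probability of about 0.509 with this sketch.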

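References

[1] Alpaydin, E. (1999). Combined 5×2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8), 1885-1892.
[2] Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.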

Example 1 - 5x2cv combined F test

Assume we want to compare two classification algorithms, a logistic regression and a decision tree algorithm:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
clf1 = LogisticRegression(random_state=1, solver='liblinear', multi_class='ovr')
clf2 = DecisionTreeClassifier(random_state=1)

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25,
                     random_state=123)

score1 = clf1.fit(X_train, y_train).score(X_test, y_test)
score2 = clf2.fit(X_train, y_train).score(X_test, y_test)

print('Logistic regression accuracy: %.2f%%' % (score1*100))
print('Decision tree accuracy: %.2f%%' % (score2*100))

Logistic regression accuracy: 97.37%
Decision tree accuracy: 94.74%

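Note that these accuracy values are not used in the 5x2cv combined F test procedure, because new test/train splits are generated during resampling; the values above only serve to build intuition.

Now, let us assume a significance threshold of $\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset, and conduct the 5x2cv combined F test: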

from mlxtend.evaluate import combined_ftest_5x2cv


f, p = combined_ftest_5x2cv(estimator1=clf1,
                            estimator2=clf2,
                            X=X, y=y,
                            random_seed=1)

print('F statistic: %.3f' % f)
print('p value: %.3f' % p)

F statistic: 1.053
p value: 0.509

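Since $p > \alpha$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.

While it is generally not recommended to apply statistical tests multiple times without correcting for multiple hypothesis testing, let us look at an example where the decision tree algorithm is limited to producing a very simple decision boundary, which results in relatively poor performance: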

clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)

score2 = clf2.fit(X_train, y_train).score(X_test, y_test)
print('Decision tree accuracy: %.2f%%' % (score2*100))


f, p = combined_ftest_5x2cv(estimator1=clf1,
                            estimator2=clf2,
                            X=X, y=y,
                            random_seed=1)

print('F statistic: %.3f' % f)
print('p value: %.3f' % p)

Decision tree accuracy: 63.16%
F statistic: 34.934
p value: 0.001

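Assuming that we also conducted this test with a significance level of $\alpha=0.05$, we can reject the null hypothesis that both models perform equally well on this dataset, since the p value ($p < 0.001$) is smaller than $\alpha$.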

API

combined_ftest_5x2cv(estimator1, estimator2, X, y, scoring=None, random_seed=None)

Implements the 5x2cv combined F test proposed by Alpaydin 1999, to compare the performance of two models.

Parameters

estimator1 : scikit-learn classifier or regressor
estimator2 : scikit-learn classifier or regressor
X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]

Target values.
scoring : str, callable, or None (default: None)

If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {'accuracy', 'f1', 'precision', 'recall', 'roc_auc'} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it must conform to sklearn's signature scorer(estimator, X, y); see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
random_seed : int or None (default: None)

Random seed for creating the test/train splits.
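
As an illustration of the callable form of the `scoring` parameter described above, the sketch below defines a function with the `scorer(estimator, X, y)` signature. The names `accuracy_scorer` and `MajorityClassStub` are hypothetical; the stub only stands in for a fitted estimator so that the example runs without scikit-learn.

```python
def accuracy_scorer(estimator, X, y):
    """Scorer with the scorer(estimator, X, y) signature; higher is better."""
    predictions = estimator.predict(X)
    correct = sum(1 for pred, target in zip(predictions, y) if pred == target)
    return correct / len(y)


class MajorityClassStub:
    """Hypothetical stand-in for a fitted estimator: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


X_demo = [[0], [1], [2], [3]]
y_demo = [1, 1, 0, 1]
score = accuracy_scorer(MajorityClassStub(1), X_demo, y_demo)
print(score)  # 0.75: three of the four labels equal 1
```

A function like this could then be passed as `scoring=accuracy_scorer` when calling `combined_ftest_5x2cv`.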

Returns

f : float

The F-statistic
pvalue : float

p-value of the F test. If the chosen significance level is larger than the p-value, we reject the null hypothesis and accept that there are significant differences between the two compared models.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/combined_ftest_5x2cv/