paired_ttest_resampled: Resampled paired t test

Resampled paired t test procedure to compare the performance of two models

from mlxtend.evaluate import paired_ttest_resampled

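Overview

The resampled paired t test procedure (also called k-hold-out paired t test) is a popular method for comparing the performance of two models (classifiers or regressors). However, this method has many drawbacks and is not recommended in practice [1]; techniques such as paired_ttest_5x2cv should be used instead.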

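To explain how this method works, let us assume we have two estimators (e.g., classifiers) A and B and a labeled dataset D. In the common hold-out method, we typically split the dataset into two parts: a training set and a test set. In the resampled paired t test procedure, we repeat this splitting procedure (typically with 2/3 training data and 1/3 test data) k times (usually 30 times). In each iteration, we train A and B on the training set and evaluate both of them on the test set. Then we compute the difference in performance between A and B in each iteration, which yields k difference measures. Now, assuming that these k differences were drawn independently and follow an approximately normal distribution, we can compute the following t statistic with k-1 degrees of freedom according to Student's t test, under the null hypothesis that models A and B have equal performance: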

$$t = \frac{\overline{p} \sqrt{k}}{\sqrt{\sum_{i=1}^{k}(p^{(i)} - \overline{p})^2 / (k-1)}}.$$

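Here, $p^{(i)}$ is the difference between the model performances in the $i$-th iteration, $p^{(i)} = p^{(i)}_A - p^{(i)}_B$, and $\overline{p}$ is the average difference between the model performances, $\overline{p} = \frac{1}{k} \sum_{i=1}^{k} p^{(i)}$.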
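
As a rough illustration of the formula above, the following sketch computes the t statistic from a list of per-iteration performance differences using only the Python standard library. The function name resampled_t_statistic and the five sample differences are made up for this illustration and are not part of mlxtend.

```python
import math
import statistics


def resampled_t_statistic(differences):
    """Return mean(p) * sqrt(k) / sample standard deviation of p."""
    k = len(differences)
    p_bar = statistics.mean(differences)
    # statistics.stdev divides by (k - 1), matching the denominator above
    s = statistics.stdev(differences)
    return p_bar * math.sqrt(k) / s


# Hypothetical accuracy differences (model A minus model B) from k = 5 splits
diffs = [0.02, 0.01, 0.03, 0.00, 0.04]
t_stat = resampled_t_statistic(diffs)
print("t statistic: %.3f" % t_stat)  # prints: t statistic: 2.828
```

Note that the sketch raises a ZeroDivisionError if all differences are identical, since the sample standard deviation is then zero.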

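Once we have computed the t statistic, we can compute the p value and compare it to our chosen significance level, e.g., $\alpha=0.05$. If the p value is smaller than $\alpha$, we reject the null hypothesis and accept that there is a significant difference between the two models.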
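
The step from the t statistic to a two-sided p value might be sketched with SciPy as follows. The helper name two_sided_p_value and the example value t = 2.1 are assumptions chosen for illustration only.

```python
from scipy import stats


def two_sided_p_value(t_stat, k):
    """Two-sided p value of a t statistic with k - 1 degrees of freedom."""
    # sf is the survival function (1 - CDF); doubling it covers both tails
    return 2.0 * stats.t.sf(abs(t_stat), df=k - 1)


alpha = 0.05
p_value = two_sided_p_value(t_stat=2.1, k=30)  # hypothetical t from 30 splits
print("p value: %.3f" % p_value)
print("reject null hypothesis:", p_value < alpha)
```

Because the test is two-sided, the sign of the t statistic does not affect the p value; it only indicates which model scored higher on average.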

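To summarize the procedure:

i := 0
while i < k:
    split the dataset into a training subset and a test subset
    fit models A and B to the training set
    compute the performances of A and B on the test set
    record the performance difference between A and B
    i := i + 1
compute the t-statistic
compute the p value from the t-statistic with k-1 degrees of freedom
compare the p value to the chosen significance threshold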
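
The procedure can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the mlxtend implementation: the function name resampled_paired_ttest_sketch, the use of scikit-learn's load_iris, and the choice of max_iter=1000 are all made up for this example.

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def resampled_paired_ttest_sketch(est_a, est_b, X, y, k=30,
                                  test_size=1/3, seed=1):
    rng = np.random.RandomState(seed)
    diffs = []
    for _ in range(k):
        # Draw a new random train/test split in every iteration
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size,
            random_state=rng.randint(low=0, high=32767))
        # Fit fresh copies of both models on the same training subset
        score_a = clone(est_a).fit(X_tr, y_tr).score(X_te, y_te)
        score_b = clone(est_b).fit(X_tr, y_tr).score(X_te, y_te)
        diffs.append(score_a - score_b)
    diffs = np.asarray(diffs)
    # t statistic with k - 1 degrees of freedom
    # (not finite if all differences happen to be identical)
    t_stat = diffs.mean() * np.sqrt(k) / diffs.std(ddof=1)
    p_value = 2.0 * stats.t.sf(np.abs(t_stat), df=k - 1)
    return float(t_stat), float(p_value)


X, y = load_iris(return_X_y=True)
t_stat, p_value = resampled_paired_ttest_sketch(
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=1), X, y)
print("t statistic: %.3f, p value: %.3f" % (t_stat, p_value))
```

Both models are always scored on the same test subset within an iteration, which is what makes the comparison paired.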

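The problem with this method, and the reason it is not recommended in practice, is that it violates the assumptions of Student's t test [1]:

the differences between the model performances ($p^{(i)} = p^{(i)}_A - p^{(i)}_B$) are not normally distributed, because $p^{(i)}_A$ and $p^{(i)}_B$ are not independent
the $p^{(i)}$ themselves are not independent, because the test sets overlap; in addition, the test sets and training sets overlap as well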
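
To make the overlap concrete, the following standard-library sketch draws 30 random test sets of 50 out of 150 samples and counts how many samples they share. The sizes and the seed are arbitrary choices for illustration.

```python
import itertools
import random

n_samples, n_test, k = 150, 50, 30  # e.g. 1/3 of 150 samples held out
rng = random.Random(1)

# Draw k random test sets, as repeated hold-out splitting would
test_sets = [set(rng.sample(range(n_samples), n_test)) for _ in range(k)]

# Number of samples shared by each pair of different test sets
pair_overlaps = [len(a & b) for a, b in itertools.combinations(test_sets, 2)]
mean_overlap = sum(pair_overlaps) / len(pair_overlaps)

# Test samples of the first split that land in the second split's training set
train_second = set(range(n_samples)) - test_sets[1]
test_in_other_train = len(test_sets[0] & train_second)

print("mean test-set overlap: %.1f of %d samples" % (mean_overlap, n_test))
print("test samples reused for training elsewhere:", test_in_other_train)
```

With these sizes, two independently drawn test sets are expected to share about 50 * 50 / 150, i.e. roughly 17 samples, so the k performance differences are computed on strongly overlapping data.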

References

[1] Dietterich TG (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 10:1895–1923.

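Example 1 - Resampled paired t test

Assume we want to compare two classification algorithms, logistic regression and a decision tree algorithm: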

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
clf1 = LogisticRegression(random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25,
                     random_state=123)

score1 = clf1.fit(X_train, y_train).score(X_test, y_test)
score2 = clf2.fit(X_train, y_train).score(X_test, y_test)

print('Logistic regression accuracy: %.2f%%' % (score1*100))
print('Decision tree accuracy: %.2f%%' % (score2*100))

Logistic regression accuracy: 97.37%
Decision tree accuracy: 94.74%

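Note that these accuracy values are not used in the paired t test procedure, because new test/training splits are generated during resampling; the values above are only meant to provide intuition.

Now, let us assume a significance threshold of $\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset, and conduct the resampled paired t test: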

from mlxtend.evaluate import paired_ttest_resampled


t, p = paired_ttest_resampled(estimator1=clf1,
                              estimator2=clf2,
                              X=X, y=y,
                              random_seed=1)

print('t statistic: %.3f' % t)
print('p value: %.3f' % p)

t statistic: -1.809
p value: 0.081

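Since $p > \alpha$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.

While it is generally not recommended to apply statistical tests multiple times without correcting for multiple hypothesis testing, let us look at an example where the decision tree algorithm is limited to producing a very simple decision boundary, which results in relatively poor performance: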

clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)

score2 = clf2.fit(X_train, y_train).score(X_test, y_test)
print('Decision tree accuracy: %.2f%%' % (score2*100))


t, p = paired_ttest_resampled(estimator1=clf1,
                              estimator2=clf2,
                              X=X, y=y,
                              random_seed=1)

print('t statistic: %.3f' % t)
print('p value: %.3f' % p)

Decision tree accuracy: 63.16%
t statistic: 39.214
p value: 0.000

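Assuming that we conducted this test at a significance level of $\alpha=0.05$, we can reject the null hypothesis that both models perform equally well on this dataset, since the p value ($p < 0.001$) is smaller than $\alpha$.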

API

paired_ttest_resampled(estimator1, estimator2, X, y, num_rounds=30, test_size=0.3, scoring=None, random_seed=None)

Implements the resampled paired t test procedure to compare the performance of two models (also called k-hold-out paired t test).

Parameters

estimator1 : scikit-learn classifier or regressor
estimator2 : scikit-learn classifier or regressor
X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]

Target values.
num_rounds : int (default: 30)

Number of resampling iterations (i.e., train/test splits)
test_size : float or int (default: 0.3)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to use as a test set. If int, represents the absolute number of test examples.
scoring : str, callable, or None (default: None)

If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to conform to sklearn's signature scorer(estimator, X, y); see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
random_seed : int or None (default: None)

Random seed for creating the test/train splits.

Returns

t : float

The t-statistic
pvalue : float

Two-tailed p-value. If the chosen significance level is larger than the p-value, we reject the null hypothesis and accept that there are significant differences in the two compared models.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/paired_ttest_resampled/