paired_ttest_kfold_cv: K-fold cross-validated paired t test

K-fold paired t test procedure to compare the performance of two models

> `from mlxtend.evaluate import paired_ttest_kfold_cv`

Overview

The k-fold cross-validated paired t test procedure is a common method for comparing the performance of two models (classifiers or regressors), and it addresses some of the drawbacks of the resampled t test procedure. However, the training sets still overlap across folds, so the method is not recommended for use in practice [1]; techniques such as `paired_ttest_5x2cv` should be used instead.

To explain how this method works, let's consider two estimators (e.g., classifiers) A and B. Further, we have a labeled dataset D. In the common hold-out method, we typically split the dataset into 2 parts: a training set and a test set. In the k-fold cross-validated paired t test procedure, we instead split the dataset D into k parts of equal size; each of these parts is then used for testing while the remaining k-1 parts (joined together) are used for training a classifier or regressor (i.e., the standard k-fold cross-validation procedure).
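The splitting step can be sketched in plain Python. This is a minimal illustration that assumes contiguous, unshuffled folds; the helper name `kfold_indices` is hypothetical and is not part of mlxtend or scikit-learn.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # The remaining k-1 parts, joined together, form the training set.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Toy example: 10 samples split into 5 folds.
splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("test:", test_idx, "train:", train_idx)
```

Each sample appears in exactly one test fold, and both estimators would be fit and scored on the same split in every iteration, which is what makes the resulting differences paired.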

In each k-fold cross-validation iteration, we then compute the difference in performance between A and B, so that we obtain k difference measures. Now, by making the assumption that these k differences were independently drawn and follow an approximately normal distribution, we can compute the following t statistic with k-1 degrees of freedom according to Student's t test, under the null hypothesis that the models A and B have equal performance:

$$t = \frac{\overline{p} \sqrt{k}}{\sqrt{\sum_{i=1}^{k}(p^{(i)} - \overline{p})^2 / (k-1)}}.$$

Here, $p^{(i)}$ denotes the difference between the model performances in the $i$th iteration, $p^{(i)} = p^{(i)}_A - p^{(i)}_B$, and $\overline{p}$ represents the average difference between the model performances, $\overline{p} = \frac{1}{k} \sum^k_{i=1} p^{(i)}$.
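As a rough illustration of the formula, the t statistic can be computed from a list of per-fold differences using only the standard library. The function name `paired_t_statistic` and the example differences are made up for this sketch; they are not mlxtend output.

```python
import math
import statistics


def paired_t_statistic(diffs):
    """Return the paired t statistic for k per-fold performance differences."""
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two folds")
    mean_diff = statistics.fmean(diffs)
    # statistics.stdev divides by (k - 1), matching the denominator above.
    std_diff = statistics.stdev(diffs)
    if std_diff == 0.0:
        raise ZeroDivisionError("all differences are identical")
    return mean_diff * math.sqrt(k) / std_diff


# Hypothetical accuracy differences (model A minus model B) over k = 5 folds.
example_diffs = [0.02, -0.01, 0.03, 0.00, 0.01]
t_stat = paired_t_statistic(example_diffs)
print("t statistic: %.3f" % t_stat)
```

For these values the mean difference is 0.01 and the sample standard deviation is about 0.0158, which gives a t statistic of about 1.414 with 4 degrees of freedom.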

Once we have computed the t statistic, we can compute the p value and compare it to our chosen significance level, e.g., $\alpha=0.05$. If the p value is smaller than $\alpha$, we reject the null hypothesis and accept that there is a significant difference between the two models.
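The decision step can be sketched as follows. To stay dependency-free, this sketch integrates the density of Student's t distribution numerically with Simpson's rule; in practice one would more likely call `scipy.stats.t.sf`. The helper names `t_pdf` and `two_sided_p_value` are assumptions made for this illustration, and the choice of a two-sided test is also an assumption.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, n_intervals=2000):
    """Two-sided p value; n_intervals must be even for Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / n_intervals
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Probability mass of |T| <= |t|, using the symmetry of the density.
    central_mass = 2 * (h / 3) * total
    return 1.0 - central_mass


alpha = 0.05
p_value = two_sided_p_value(-1.861, df=9)  # k = 10 folds -> 9 degrees of freedom
reject_null = p_value < alpha
print("p value: %.3f, reject null: %s" % (p_value, reject_null))
```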

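The problem with this method, and the reason why it is not recommended for use in practice, is that it violates an assumption of Student's t test [1]: the k differences $p^{(i)}$ are not independent of each other, because the training sets of the individual folds overlap.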
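To make the overlap concrete, the following sketch counts how many training samples two different folds share. It assumes 150 samples (the size of the iris dataset), k = 10, and contiguous folds of equal size; the variable names are illustrative only.

```python
n_samples, k = 150, 10
fold_size = n_samples // k

# Test folds: k disjoint blocks of indices.
folds = [set(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
all_idx = set(range(n_samples))

# Training set of fold i: everything except test fold i.
train_sets = [all_idx - fold for fold in folds]

# Samples shared by the training sets of the first two folds.
shared = len(train_sets[0] & train_sets[1])
overlap_fraction = shared / len(train_sets[0])
print("shared: %d of %d (%.1f%%)"
      % (shared, len(train_sets[0]), 100 * overlap_fraction))
```

Under these assumptions any two training sets share 120 of their 135 samples, so the models fit in different folds are trained on largely the same data.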

References

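Example 1 - K-fold cross-validated paired t test

Assume we want to compare two classification algorithms, logistic regression and a decision tree algorithm: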

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
clf1 = LogisticRegression(random_state=1)
clf2 = DecisionTreeClassifier(random_state=1)

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25,
                     random_state=123)

score1 = clf1.fit(X_train, y_train).score(X_test, y_test)
score2 = clf2.fit(X_train, y_train).score(X_test, y_test)

print('Logistic regression accuracy: %.2f%%' % (score1*100))
print('Decision tree accuracy: %.2f%%' % (score2*100))

Logistic regression accuracy: 97.37%
Decision tree accuracy: 94.74%

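Note that these accuracy values are not used in the paired t test procedure, because new test/training splits are generated during the resampling procedure; the values above are only meant to build intuition.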

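Now, let's assume a significance threshold of $\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset, and conduct the k-fold cross-validated t test: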

from mlxtend.evaluate import paired_ttest_kfold_cv


t, p = paired_ttest_kfold_cv(estimator1=clf1,
                              estimator2=clf2,
                              X=X, y=y,
                              random_seed=1)

print('t statistic: %.3f' % t)
print('p value: %.3f' % p)

t statistic: -1.861
p value: 0.096

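Since $p > \alpha$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.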

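While it is generally not recommended to apply statistical tests multiple times without correcting for multiple hypothesis testing, let us look at an example where the decision tree algorithm is restricted to producing a very simple decision boundary, which results in relatively poor performance: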

clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)

score2 = clf2.fit(X_train, y_train).score(X_test, y_test)
print('Decision tree accuracy: %.2f%%' % (score2*100))


t, p = paired_ttest_kfold_cv(estimator1=clf1,
                             estimator2=clf2,
                             X=X, y=y,
                             random_seed=1)

print('t statistic: %.3f' % t)
print('p value: %.3f' % p)

Decision tree accuracy: 63.16%
t statistic: 13.491
p value: 0.000

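Assuming that we also conducted this test at a significance level of $\alpha=0.05$, we can reject the null hypothesis that both models perform equally well on this dataset, since the p value ($p < 0.001$) is smaller than $\alpha$.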
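The caveat about repeated testing can be illustrated with a Bonferroni correction. This is one common choice and an assumption of this sketch; the text above does not prescribe a particular correction, and the helper `bonferroni_reject` is hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject decision per test, each made at level alpha / m."""
    m = len(p_values)
    adjusted_alpha = alpha / m
    return [p < adjusted_alpha for p in p_values]


# p values of the two tests in this example; 0.001 stands in as an
# upper bound for the reported "p < 0.001".
decisions = bonferroni_reject([0.096, 0.001])
print(decisions)
```

With two tests the per-test threshold drops to 0.025, so the conclusions in this example stay the same: the first null hypothesis is not rejected and the second is.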

API

paired_ttest_kfold_cv(estimator1, estimator2, X, y, cv=10, scoring=None, shuffle=False, random_seed=None)

Implements the k-fold paired t test procedure to compare the performance of two models.

Parameters

Returns

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/paired_ttest_kfold_cv/