mcnemar: McNemar's test for classifier comparisons

McNemar's test for paired nominal data

> `from mlxtend.evaluate import mcnemar`

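Overview

McNemar's test [1] (sometimes also called the "within-subjects chi-squared test") is a statistical test for paired nominal data. In the context of machine learning (or statistical) models, we can use McNemar's test to compare the predictive accuracy of two models. The test is based on a 2×2 contingency table of the two models' predictions.

McNemar's Test Statistic

Let $b$ and $c$ denote the two off-diagonal (discordant) cells of the contingency table, that is, the numbers of samples that exactly one of the two models predicted correctly. In McNemar's test, we formulate the null hypothesis that the probabilities $p(b)$ and $p(c)$ are the same, or in simpler terms: neither of the two models performs better than the other. The alternative hypothesis is therefore that the performances of the two models are not equal.

The McNemar test statistic ("chi-squared") can be computed as follows: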

$$\chi^2 = \frac{(b - c)^2}{(b + c)},$$

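If the sum of cells $b$ and $c$ is sufficiently large, the $\chi^2$ value follows a chi-squared distribution with one degree of freedom. After setting a significance threshold, e.g. $\alpha=0.05$, we can compute the p-value: assuming the null hypothesis is true, the p-value is the probability of observing this empirical (or a larger) chi-squared value. If the p-value is lower than our chosen significance level, we can reject the null hypothesis that the two models' performances are equal.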
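As a minimal sketch of the calculation above, the following snippet computes the uncorrected statistic and its p-value using only the Python standard library. The function name `mcnemar_chi2` is hypothetical (it is not part of mlxtend). The p-value relies on the identity that, for one degree of freedom, the chi-squared survival function at $x$ equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic and p-value.

    Hypothetical helper for illustration; not part of the mlxtend API.
    """
    if b + c == 0:
        raise ValueError("b + c must be positive")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


chi2, p = mcnemar_chi2(b=25, c=15)
print("chi-squared:", chi2)  # (25 - 15)^2 / 40 = 2.5
print("p-value:", p)
```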

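Continuity Correction

Approximately one year after Quinn McNemar published his test [1], Edwards [2] proposed a continuity-corrected version, which is the more commonly used variant today: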

$$\chi^2 = \frac{( \mid b - c \mid - 1)^2}{(b + c)}.$$
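The corrected formula can be sketched in a few lines of standard-library Python. The helper name `mcnemar_chi2_corrected` is an assumption made for this illustration and is not mlxtend's implementation; the counts $b = 25$ and $c = 15$ are those of scenario B in this guide.

```python
import math


def mcnemar_chi2_corrected(b, c):
    """Continuity-corrected McNemar statistic and p-value.

    Hypothetical helper for illustration; not part of the mlxtend API.
    """
    if b + c == 0:
        raise ValueError("b + c must be positive")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom, the chi-squared survival function
    # at x equals erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


chi2, p = mcnemar_chi2_corrected(b=25, c=15)
print("chi-squared:", chi2)  # (|25 - 15| - 1)^2 / 40 = 2.025
print("p-value:", p)
```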

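Exact p-values

As noted above, the chi-squared approximation requires $b + c$ to be sufficiently large. For small sample sizes ($b + c < 25$), an exact binomial test is recommended [3], since the chi-squared value may not be well approximated by the chi-squared distribution. The exact p-value can be computed as follows: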

$$p = 2 \sum^{n}_{i=b} \binom{n}{i} 0.5^i (1 - 0.5)^{n-i},$$

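where $n = b + c$, and the factor $2$ is used to compute the two-sided p-value. As written, the sum covers the upper tail starting at $b$, so the formula assumes $b \geq c$; if $c$ is the larger count, the roles of $b$ and $c$ are swapped.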
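A minimal sketch of this exact test, using only `math.comb` from the standard library, might look as follows. The function name `mcnemar_exact_p` and the counts $b = 9$, $c = 2$ are made up for illustration. Starting the sum at the larger of the two discordant counts and capping the result at 1 are assumptions of this sketch, not a description of mlxtend's internals.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact binomial p-value for McNemar's test.

    Hypothetical helper for illustration; not part of the mlxtend API.
    """
    n = b + c
    k = max(b, c)  # larger discordant count, so the sum covers the extreme tail
    # P(X >= k) for X ~ Binomial(n, 0.5); note 0.5^i * 0.5^(n - i) = 0.5^n
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    # Doubling the tail can exceed 1 when b and c are close, so cap it
    return min(2.0 * tail, 1.0)


p = mcnemar_exact_p(b=9, c=2)
print("p-value:", p)  # 2 * (55 + 11 + 1) / 2^11 = 0.0654296875
```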

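Example

For instance, given two models with accuracies of 99.7% and 99.6%, a 2x2 contingency table can provide further insights for model selection.

In both subfigures A and B, the predictive accuracies of the two models are as follows:

Model 1 accuracy: 9,960 / 10,000 = 99.6%
Model 2 accuracy: 9,970 / 10,000 = 99.7%

Now, in subfigure A, we can see that model 2 got 11 predictions right that model 1 got wrong. Vice versa, model 1 got 1 prediction right that model 2 got wrong. Thus, based on this 11:1 ratio, we may conclude that model 2 performs substantially better than model 1. In subfigure B, however, the ratio is 25:15, which is less conclusive about which model is the better choice.

In the following code examples, we will use these two scenarios, A and B, to illustrate McNemar's test.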

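References

[1] McNemar, Quinn, 1947. "Note on the sampling error of the difference between correlated proportions or percentages". Psychometrika. 12 (2): 153–157.
[2] Edwards, A. L., 1948. "Note on the 'correction for continuity' in testing the significance of the difference between correlated proportions". Psychometrika. 13 (3): 185–187. doi:10.1007/BF02289261.
[3] https://en.wikipedia.org/wiki/McNemar%27s_test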

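Example 1 - Creating 2x2 Contingency Tables

The mcnemar function expects a 2x2 contingency table as a NumPy array, with the cells a, b, c, and d arranged as described in the API section below.

Such a contingency matrix can be created with the mcnemar_table function from mlxtend.evaluate. For example: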

import numpy as np
from mlxtend.evaluate import mcnemar_table

# The correct target (class) labels
y_target = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Class labels predicted by model 1
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0])

# Class labels predicted by model 2
y_model2 = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 0])

tb = mcnemar_table(y_target=y_target, 
                   y_model1=y_model1, 
                   y_model2=y_model2)

print(tb)

[[4 1]
 [2 3]]

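Example 2 - McNemar's Test for Scenario B

Now, let us continue with the example from the overview section and assume that we have already computed the 2x2 contingency table: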

import numpy as np

tb_b = np.array([[9945, 25],
                 [15, 15]])

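To test the null hypothesis that the predictive performance of the two models is equal (using a significance level of $\alpha=0.05$), we can conduct a corrected McNemar test to compute the chi-squared value and the p-value as follows: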

from mlxtend.evaluate import mcnemar

chi2, p = mcnemar(ary=tb_b, corrected=True)
print('chi-squared:', chi2)
print('p-value:', p)

chi-squared: 2.025
p-value: 0.154728923485

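Since the p-value is larger than our chosen significance threshold ($\alpha=0.05$), we cannot reject the null hypothesis; the data do not show a significant difference between the two predictive models.

Example 3 - McNemar's Test for Scenario A

In contrast to scenario B (Example 2), the sample size in scenario A is relatively small (b + c = 11 + 1 = 12) and below the recommended 25 [3], so the computed chi-squared value is not well approximated by the chi-squared distribution.

In this case, we need to compute the exact p-value from the binomial distribution: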

from mlxtend.evaluate import mcnemar
import numpy as np

tb_a = np.array([[9959, 11],
                 [1, 29]])

chi2, p = mcnemar(ary=tb_a, exact=True)

print('chi-squared:', chi2)
print('p-value:', p)

chi-squared: None
p-value: 0.005859375

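Assuming we conduct this test at a significance level of $\alpha=0.05$, we can reject the null hypothesis that both models perform equally well on this dataset, since the p-value ($p \approx 0.006$) is smaller than $\alpha$.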
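The choice between the corrected test (Example 2) and the exact test (Example 3) depends only on the number of discordant pairs, so it can be automated. The following is a minimal sketch: the helper name `mcnemar_kwargs` and its `threshold` parameter are hypothetical, and the cutoff of 25 is the rule of thumb cited in [3].

```python
def mcnemar_kwargs(table, threshold=25):
    """Pick keyword arguments for mcnemar from the discordant-pair count.

    Hypothetical helper for illustration; not part of the mlxtend API.
    """
    b, c = table[0][1], table[1][0]
    if b + c < threshold:
        # Too few discordant pairs for the chi-squared approximation
        return {"exact": True}
    return {"corrected": True}


tb_a = [[9959, 11], [1, 29]]   # scenario A: b + c = 12
tb_b = [[9945, 25], [15, 15]]  # scenario B: b + c = 40

print(mcnemar_kwargs(tb_a))  # {'exact': True}
print(mcnemar_kwargs(tb_b))  # {'corrected': True}
```

The returned dictionary could then be unpacked into the call, for example `mcnemar(ary=tb, **mcnemar_kwargs(tb))`.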

API

mcnemar(ary, corrected=True, exact=False)

McNemar test for paired nominal data

Parameters

ary : array-like, shape=[2, 2]

2 x 2 contingency table (as returned by evaluate.mcnemar_table), where
a: ary[0, 0]: # of samples that both models predicted correctly
b: ary[0, 1]: # of samples that model 1 got right and model 2 got wrong
c: ary[1, 0]: # of samples that model 2 got right and model 1 got wrong
d: ary[1, 1]: # of samples that both models predicted incorrectly

corrected : bool (default: True)

Uses Edwards' continuity correction for chi-squared if True

exact : bool (default: False)

If True, uses an exact binomial test comparing b to a binomial distribution with n = b + c and p = 0.5. It is highly recommended to use exact=True for small sample sizes (b + c < 25), since the chi-squared value is then not well approximated by the chi-squared distribution.

Returns

chi2, p : float or None, float

Returns the chi-squared value and the p-value; if exact=True (default: False), chi2 is None
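Because chi2 is None when exact=True, code that reports the result has to handle both cases. A minimal sketch, assuming a hypothetical reporting helper `format_mcnemar_result` (not part of mlxtend) and using the values shown in Examples 2 and 3:

```python
def format_mcnemar_result(chi2, p, alpha=0.05):
    """Format a (chi2, p) pair as returned by mcnemar.

    Hypothetical helper for illustration; not part of the mlxtend API.
    """
    # chi2 is None when the exact binomial test was used
    stat = "n/a (exact test)" if chi2 is None else f"{chi2:.3f}"
    decision = "reject H0" if p < alpha else "fail to reject H0"
    return f"chi-squared: {stat}, p-value: {p:.4f} -> {decision}"


print(format_mcnemar_result(None, 0.005859375))
print(format_mcnemar_result(2.025, 0.154728923485))
```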

Examples

For usage examples, please see
[https://rasbt.github.io/mlxtend/user_guide/evaluate/mcnemar/](https://rasbt.github.io/mlxtend/user_guide/evaluate/mcnemar/)
