Glossary

Bootstrapping

A way to estimate confidence intervals and statistical significance using randomization. The confidence intervals and statistical significance of metrics computed using Bootstrapping are tagged by [B]. Unless specified otherwise, bootstrapping is non-parametric and runs at the "example/prediction" level.
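As an illustration, here is a minimal sketch of example-level percentile bootstrapping; the function and parameter names (`bootstrap_ci`, `metric_fn`, `num_samples`) are hypothetical, not part of this glossary:

```python
import numpy as np

def bootstrap_ci(labels, predictions, metric_fn, num_samples=1000,
                 confidence=0.95, seed=0):
    """Non-parametric percentile bootstrap CI, resampling at the example level."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    for _ in range(num_samples):
        idx = rng.integers(0, n, size=n)  # resample n examples with replacement
        stats.append(metric_fn(labels[idx], predictions[idx]))
    alpha = 1.0 - confidence
    return np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
```

For example, `bootstrap_ci(labels, predictions, lambda y, p: (y == p).mean())` estimates a 95% CI of the accuracy.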

Default metrics

A default metric (e.g., default accuracy) is the maximum possible value of a metric for a model always outputting the same value. For example, in a balanced binary classification dataset, the default accuracy is 0.5.
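For example, assuming integer class labels, the default accuracy is the share of the most frequent class:

```python
import numpy as np

labels = np.array([0, 0, 1, 1])  # balanced binary labels
# A constant predictor is at best always right on the majority class.
default_accuracy = np.bincount(labels).max() / len(labels)
print(default_accuracy)  # 0.5
```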

Classification

ACC (Accuracy)

The accuracy (Acc) is the ratio of the number of correct predictions to the total number of predictions:

\[ Accuracy = \frac{\textrm{Number of correct predictions}}{\textrm{Total number of predictions}} \]

If not specified, the accuracy is reported for the threshold value(s) that maximize it.

Confidence intervals of the accuracy are computed using the Wilson Score Interval (Acc CI [W]) and non-parametric percentile bootstrapping (Acc CI [B]).
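As a sketch, the Wilson score interval only needs the number of correct predictions and the total count; the function below is illustrative:

```python
import math
from scipy.stats import norm

def wilson_score_interval(num_correct, n, confidence=0.95):
    """Wilson score interval for a binomial proportion such as the accuracy."""
    z = norm.ppf(1.0 - (1.0 - confidence) / 2.0)
    p = num_correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

print(wilson_score_interval(90, 100))  # ≈ (0.826, 0.945)
```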

Confusion Matrix

The confusion matrix shows the relation between predictions and ground truth. The columns of the matrix represent the predictions and the rows represent the ground truth: \(M_{i,j}\) is the number of predictions of class \(j\) which are in reality of class \(i\).

In the case of weighted evaluation, the confusion matrix is a weighted confusion matrix.
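A minimal sketch of this convention, including the optional example weights:

```python
import numpy as np

def confusion_matrix(ground_truth, predictions, num_classes, weights=None):
    """M[i, j]: number (or total weight) of examples of true class i
    predicted as class j, following the row/column convention above."""
    if weights is None:
        weights = np.ones(len(ground_truth))
    m = np.zeros((num_classes, num_classes))
    for y, p, w in zip(ground_truth, predictions, weights):
        m[y, p] += w
    return m

print(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
# [[1. 1.]
#  [0. 2.]]
```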

LogLoss

The logloss is defined as:

\[ logloss = \frac{\sum_{i=1}^{n} - \log{ p_{i,y_i} } }{n} \]

with \(\{y_i\}_{i \in [1,n]}\) the labels, and \(p_{i,j}\) the predicted probability for the class \(j\) in the observation \(i\). Note: \(\forall i, \sum_{j=1}^{c} p_{i,j} = 1\).

Not all machine learning algorithms output calibrated probabilities, therefore not all of them minimize the logloss. The default predictor minimizes the logloss: the default logloss is equal to the Shannon entropy of the labels.
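A small check of the last claim, assuming integer labels and one row of class probabilities per example:

```python
import numpy as np

def logloss(labels, probabilities):
    """Mean negative log of the probability assigned to the true class."""
    n = len(labels)
    return -np.mean(np.log(probabilities[np.arange(n), labels]))

# The default predictor always outputs the label frequencies, so its
# logloss is the Shannon entropy of the labels.
labels = np.array([0, 0, 0, 1])
freq = np.bincount(labels) / len(labels)          # [0.75, 0.25]
default_probs = np.tile(freq, (len(labels), 1))
entropy = -np.sum(freq * np.log(freq))
print(logloss(labels, default_probs), entropy)    # both ≈ 0.5623
```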

ROC (Receiver Operating Characteristic)

The ROC curve shows the relation between Recall (also known as True Positive Rate) and the False Positive Rate.

The ROC is computed without the convex hull rule (see "Technical Note: PAV and the ROC Convex Hull").

AUC (Area Under the Curve of the ROC)

The AUC is the integral of the ROC curve.

The AUC is computed using the trapezoidal rule and without the convex hull rule.

The confidence intervals of the AUC are computed using the method proposed by Hanley et al. (AUC CI [H]) and the non-parametric percentile bootstrapping method (AUC CI [B]).
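A sketch of the trapezoidal AUC together with the Hanley & McNeil interval; counting ties as 0.5 is an assumption of this sketch:

```python
import math
import numpy as np
from scipy.stats import norm

def auc_with_hanley_ci(labels, scores, confidence=0.95):
    """Trapezoidal AUC with the Hanley & McNeil confidence interval."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    n1, n2 = len(pos), len(neg)
    # Probability that a random positive outranks a random negative
    # (ties counted as 0.5), which equals the trapezoidal ROC area.
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    auc = wins / (n1 * n2)
    q1, q2 = auc / (2.0 - auc), 2.0 * auc**2 / (1.0 + auc)
    se = math.sqrt((auc * (1 - auc) + (n1 - 1) * (q1 - auc**2)
                    + (n2 - 1) * (q2 - auc**2)) / (n1 * n2))
    z = norm.ppf(1.0 - (1.0 - confidence) / 2.0)
    return auc, (auc - z * se, auc + z * se)

labels = np.array([1, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
print(auc_with_hanley_ci(labels, scores))  # AUC ≈ 0.667
```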

PR (Precision-Recall Curve)

The PR curve shows the relation between Precision and Recall.

The PR curve is computed without the convex hull rule.

PR-AUC (Area Under the Precision-Recall Curve)

The PR-AUC is the integral of the PR curve.

The PR-AUC is computed using the lower trapezoid rule (PR-AUC). A presentation and comparison of various approaches for computing the PR-AUC was done by Boyd et al. This work indicates that estimating the PR-AUC with the lower trapezoid rule has a lower bias than the commonly used Average Precision rule (AP, the rule used by scikit-learn).

The confidence intervals of the PR-AUC are computed using the logistic interval (PR-AUC CI [L]) and the non-parametric percentile bootstrapping method (PR-AUC CI [B]). Boyd et al. show these two methods to have better coverage than the cross-validation method.
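For intuition, here is a sketch that integrates the empirical PR points with the plain trapezoidal rule; the lower trapezoid estimator of Boyd et al. replaces this linear interpolation with a more pessimistic one, so the value below is an optimistic approximation:

```python
import numpy as np

def pr_auc_trapezoid(labels, scores):
    """Trapezoidal integration of the empirical precision-recall points."""
    order = np.argsort(-scores)                    # descending score
    labels = labels[order]
    tp = np.cumsum(labels)                         # true positives per threshold
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return np.trapz(precision, recall)
```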

X@Y Metrics

X@Y metrics (e.g. Precision at a given Recall) are computed conservatively and without interpolation. Depending on the metric pair, being conservative means taking a lower or an upper bound:

  • Precision @ Recall: Precision at the highest threshold such that the recall is greater than or equal to the limit. Note: Precision is not monotonic with the threshold value. A sketch of this metric follows the list.
  • Precision @ Volume: Precision at the highest threshold such that the volume is greater than or equal to the limit.
  • Recall @ Precision: Highest recall with a precision greater than or equal to the limit. Note: Recall is monotonic with the threshold value.
  • Recall @ False Positive Rate: Highest recall with a false positive rate less than or equal to the limit. Note: Recall and the FPR are positively monotonic with each other.
  • False Positive Rate @ Recall: Smallest (best) false positive rate with a recall greater than or equal to the limit.
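A minimal sketch of the first of these metrics, Precision @ Recall, assuming the recall target is achievable:

```python
import numpy as np

def precision_at_recall(labels, scores, recall_target):
    """Precision at the highest threshold whose recall >= recall_target."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    # First index (i.e. highest threshold) where the recall constraint holds;
    # assumes the target is achievable (argmax returns 0 otherwise).
    idx = np.argmax(recall >= recall_target)
    return precision[idx]

labels = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.1])
print(precision_at_recall(labels, scores, 1.0))  # ≈ 0.667
```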

Confidence intervals of X@Y metrics are computed using non-parametric percentile bootstrapping.

One-sided McNemar test

The McNemar test returns the p-value of the null hypothesis that the accuracy of "model_1" at threshold "threshold_1" is not higher than the accuracy of "model_2" at threshold "threshold_2".

A Mathworks page presents how to compute the McNemar test.

Several resources describe how to compute the p-value of the McNemar test (using the binomial distribution, the Gaussian CDF, or the chi-squared CDF). Following offline simulations, we determined the binomial distribution to be the best suited for our purpose.
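A sketch of the exact one-sided test with the binomial distribution, taking per-example correctness vectors for the two models at their respective thresholds:

```python
from scipy.stats import binom

def one_sided_mcnemar(model1_correct, model2_correct):
    """Exact one-sided McNemar p-value under the null hypothesis that
    model 1's accuracy is not higher than model 2's."""
    # Discordant pairs: examples where exactly one model is correct.
    b = sum(m1 and not m2 for m1, m2 in zip(model1_correct, model2_correct))
    c = sum(m2 and not m1 for m1, m2 in zip(model1_correct, model2_correct))
    # Under the null, each discordant pair favors either model with p=0.5.
    return binom.sf(b - 1, b + c, 0.5)  # P(X >= b)

print(one_sided_mcnemar([1, 1, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.25
```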

Regression

We recommend reading the Wikipedia page on the evaluation of regression models.

The default predictor outputs the mean value of the labels estimated on the test dataset (a default predictor always outputs the same value).

RMSE (Root Mean Squared Error)

The RMSE is defined as:

\[ RMSE = \sqrt{ \frac{\sum_{i=1}^{n} (\hat{y_i} - y_i)^2 }{n}} \]

with \(\{y_i\}_{i \in [1,n]}\) the labels and \(\{\hat{y}_i\}_{i \in [1,n]}\) the predictions.

A small RMSE indicates a model with accurate predictions, while a large RMSE indicates a poor model. The RMSE is expressed in the unit of the labels (e.g., if you are predicting the number of apples in a basket, the RMSE is expressed in a number of apples).

Assuming the residuals (i.e. \(y_i - \hat{y}_i\)) are sampled from a centered normal distribution, a closed-form confidence interval of the RMSE, noted RMSE CI [X2], can be computed. This assumption should be checked with the standardized normal quantile-quantile plot available in the HTML evaluation report and defined below.

The RMSE CI [X2] confidence interval is computed as:

\[ \left[ \sqrt{\frac{n}{ \chi^2_{1 - (1 - \beta) / 2,n}}} RMSE , \sqrt{\frac{n}{\chi^2_{(1 - \beta) / 2,n}}} RMSE \right] \]

with RMSE the estimated RMSE, \(\beta\) the confidence level (e.g. 95%), \(n\) the number of examples, and \(\chi^2\) the quantile function of the chi-squared distribution.

See the "Chi-Square Test for the Variance" section of the Engineering Statistics Handbook for more details. Note: The RMSE is the standard deviation of the residuals.
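A direct transcription of this interval (95% confidence by default):

```python
import numpy as np
from scipy.stats import chi2

def rmse_chi2_ci(residuals, confidence=0.95):
    """Closed-form RMSE CI, assuming centered normally distributed residuals."""
    n = len(residuals)
    rmse = np.sqrt(np.mean(np.square(residuals)))
    alpha = 1.0 - confidence
    lower = np.sqrt(n / chi2.ppf(1.0 - alpha / 2.0, df=n)) * rmse
    upper = np.sqrt(n / chi2.ppf(alpha / 2.0, df=n)) * rmse
    return lower, upper
```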

Confidence intervals of the RMSE are also computed using bootstrapping (RMSE CI [B]).

Residual normal probability plot

The residual normal probability plot is the quantile-quantile plot between the residuals (standardized on the variance) and the unit normal distribution.

A normal probability plot with a straight diagonal line indicates normally distributed residuals. If the plot is not diagonal, its shape can be used (together with the histogram of the residuals) to qualify the nature of the residual distribution.
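Such a plot can be reproduced with scipy, here on synthetic residuals standing in for \(y_i - \hat{y}_i\):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=500)            # stand-in for labels - predictions
standardized = residuals / residuals.std()  # standardize on the variance
# Quantile-quantile plot against the unit normal: points on the diagonal
# indicate normally distributed residuals.
stats.probplot(standardized, dist="norm", plot=plt)
plt.show()
```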

Following is an example of residual normal probability plots. The residuals of model 2 are more or less normal, while the residuals of model 1 are not.

Conditional {ground truth, prediction, recall} plots

The conditional plots show the relation between the ground truth, the predictions, and the recall. These plots help understand where a model performs best and worst.

Following is an example of three conditional plots. Model 1 performs best for low ground truth values, while model 2 looks random (it is a random predictor).

These plots should be read together with the histogram of the ground truth values.

Ranking

We recommend reading the Wikipedia page on the learning-to-rank task.

NDCG (Normalized Discounted Cumulative Gain)

The NDCG is defined as:

\[ NDCG@T = \frac{DCG@T}{maxDCG@T} \]

with:

\[ DCG@T = \sum_{i=1}^{T} \frac{G(r_i)}{\log(1+i)} \]
\[ maxDCG@T = \sum_{i=1}^{T} \frac{G(\hat{r}_i)}{\log(1+i)} \]

with \(T\) the truncation (e.g. 5), \(r_i\) the relevance of the example with the \(i\)-th largest prediction, and \(\hat{r}_i\) the \(i\)-th largest relevance (i.e. \(\hat{r}_1 \geq \hat{r}_2 \geq \cdots\)).

A popular convention is for the relevance to be a value between 0 and 4, with the gain function \(G(r) = 2^{r} - 1\).

NDCG values range between 0 (worst) and 1 (perfect).

In the case of ties in the predictions (i.e. the model predicts the same value for two examples), the gain is averaged over the tied elements (see "Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores"). The default NDCG is computed by averaging the gain over all the examples.

See section 3 of "From RankNet to LambdaRank to LambdaMART: An Overview" for more details.
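A minimal NDCG@T sketch following the convention above (natural logarithm, gain \(2^r - 1\)); it does not implement the averaged-gain handling of ties:

```python
import numpy as np

def ndcg_at_t(relevances, predictions, t=5):
    """NDCG@T with gain 2**r - 1 and log(1 + rank) discounts."""
    relevances = np.asarray(relevances, dtype=float)
    k = min(t, len(relevances))
    discounts = 1.0 / np.log(1.0 + np.arange(1, k + 1))
    gains = 2.0 ** relevances - 1.0
    # DCG: gains ordered by decreasing prediction, truncated at k.
    dcg = np.sum(gains[np.argsort(-np.asarray(predictions))[:k]] * discounts)
    # maxDCG: gains in the best possible (decreasing relevance) order.
    max_dcg = np.sum(np.sort(gains)[::-1][:k] * discounts)
    return dcg / max_dcg

print(ndcg_at_t([3, 2, 0, 1], [0.9, 0.2, 0.8, 0.4]))  # ≈ 0.94
```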