# lift_score: Lift Score for Classification and Association Rule Mining
Scoring function to compute the LIFT metric, the ratio of correctly predicted positive examples to the actual positive examples in the test dataset.
> `from mlxtend.evaluate import lift_score`
## Overview
In the context of classification, *lift* [1] compares model predictions to randomly generated predictions. Lift is often used in conjunction with *gain and lift* charts as a visual aid [2]. For example, assuming a 10% customer response rate as baseline, a lift value of 3 would correspond to a 30% customer response when the predictive model is used. Note that *lift* has the range $\lbrack 0, \infty \rbrack$.
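To make that customer-response example concrete, lift here is simply the ratio of the model's response rate to the baseline response rate:

$$ \text{lift} = \frac{P(\text{response} \mid \text{model})}{P(\text{response} \mid \text{baseline})} = \frac{0.30}{0.10} = 3 $$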
There are several strategies for computing lift; below, we illustrate the computation of the lift score using a classic confusion matrix. For instance, let's assume the following predictions and target labels, where "1" is the positive class:
- $\text{true labels}: [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]$
- $\text{prediction}: [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]$
Then, our confusion matrix would look as follows:

|              | predicted 1 | predicted 0 |
|--------------|-------------|-------------|
| **actual 1** | TP = 2      | FN = 4      |
| **actual 0** | FP = 1      | TN = 3      |
Based on the confusion matrix above, with "1" as positive label, we compute lift as follows:
$$ \text{lift} = \frac{TP/(TP+FP)}{(TP+FN)/(TP+TN+FP+FN)} $$
Plugging in the actual values from the example above, we arrive at the following lift value:
$$ \frac{2/(2+1)}{(2+4)/(2+3+1+4)} = 1.1111111111111112 $$
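As a quick sanity check, the following sketch derives the four confusion-matrix counts with scikit-learn's `confusion_matrix` and plugs them into the formula above; scikit-learn is not required by `lift_score` itself and is used here only for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_target = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1])
y_predicted = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts
# in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_target, y_predicted).ravel()

# lift = precision / prevalence of the positive class
lift = (tp / (tp + fp)) / ((tp + fn) / (tp + tn + fp + fn))

print(tp, fp, fn, tn)  # 2 1 4 3
print(lift)            # ≈ 1.1111
```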
An alternative way of computing lift is via the support metric [3]:
$$ \text{lift} = \frac{\text{support}(\text{true labels} \cap \text{prediction})}{\text{support}(\text{true labels}) \times \text{support}(\text{prediction})}, $$
where support is $x / N$, with $x$ being the number of incidences of an observation and $N$ the total number of samples in the dataset. $\text{true labels} \cap \text{prediction}$ are the true positives, $\text{true labels}$ are the true positives plus the false negatives, and $\text{prediction}$ are the true positives plus the false positives. Plugging the values from our example into the equation above, we arrive at:
$$ \frac{2/10}{(6/10 \times 3/10)} = 1.1111111111111112 $$
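The same number can be reproduced directly from the support fractions; this minimal sketch uses plain NumPy and assumes the same example labels as above:

```python
import numpy as np

y_target = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1])
y_predicted = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

# support(true labels ∩ prediction): fraction of samples where
# both the true label and the prediction are the positive class
support_both = np.mean((y_target == 1) & (y_predicted == 1))

# support(true labels) and support(prediction)
support_target = np.mean(y_target == 1)
support_prediction = np.mean(y_predicted == 1)

# (2/10) / (6/10 * 3/10) ≈ 1.1111
print(support_both / (support_target * support_prediction))
```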
## References
- [1] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD '97), pages 265-276, 1997.
- [2] https://www3.nd.edu/~busiforc/Lift_chart.html
- [3] https://en.wikipedia.org/wiki/Association_rule_learning#Support
## Example 1 - Computing Lift
This example demonstrates the basic use of the `lift_score` function with the example from the Overview section.
```python
import numpy as np
from mlxtend.evaluate import lift_score

y_target = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1])
y_predicted = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

lift_score(y_target, y_predicted)
```

    1.1111111111111112
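Since `lift_score` accepts the `binary` and `positive_label` arguments (see the API section below), a multi-class problem can be mapped onto a binary one. The following sketch uses made-up 3-class labels and assumes, per the parameter documentation, that `binary=True` maps `positive_label` to class 1 and all other labels to class 0:

```python
import numpy as np
from mlxtend.evaluate import lift_score

# Hypothetical 3-class labels; class 2 is treated as the
# positive class, classes 0 and 1 as the negative class
y_target = np.array([0, 1, 2, 2, 1, 0])
y_predicted = np.array([0, 2, 2, 1, 1, 0])

lift_score(y_target, y_predicted, binary=True, positive_label=2)
```

Under this mapping, TP = 1, FP = 1, FN = 1, and TN = 3, so the score should come out to $(1/2)\,/\,(2/6) = 1.5$.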
## Example 2 - Using lift_score in GridSearch

The `lift_score` function can also be used with scikit-learn objects, such as `GridSearchCV`:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import make_scorer
from mlxtend.evaluate import lift_score

# make a custom scorer from the lift_score function
lift_scorer = make_scorer(lift_score)

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

hyperparameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                    'C': [1, 10, 100, 1000]},
                   {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(), hyperparameters, cv=10,
                   scoring=lift_scorer)
clf.fit(X_train, y_train)

print(clf.best_score_)
print(clf.best_params_)
```

    3.0
    {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
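As a follow-up, the held-out test split created above can be scored with the same custom scorer. This is a minimal sketch, assuming the fitted `clf`, `lift_scorer`, `X_test`, and `y_test` from the previous cell are still in scope:

```python
# score the tuned model on the held-out test data;
# a scorer created via make_scorer is called as scorer(estimator, X, y)
print(lift_scorer(clf, X_test, y_test))
```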
## API
`lift_score(y_target, y_predicted, binary=True, positive_label=1)`
Lift measures the degree to which the predictions of a classification model are better than randomly-generated predictions.
In terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), the lift score is computed as: [ TP/(TP+FN) ] / [ (TP+FP) / (TP+TN+FP+FN) ]. This is algebraically equivalent to the formula in the Overview, since both reduce to TP(TP+TN+FP+FN) / ((TP+FP)(TP+FN)).
**Parameters**

- `y_target` : array-like, shape=[n_samples]

    True class labels.

- `y_predicted` : array-like, shape=[n_samples]

    Predicted class labels.

- `binary` : bool (default: True)

    Maps a multi-class problem onto a binary problem, where the positive class is 1 and all other classes are 0.

- `positive_label` : int (default: 1)

    Class label of the positive class.

**Returns**

- `score` : float

    Lift score in the range $[0, \infty]$.
**Examples**
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/lift_score/