Association Rules: Generating Association Rules from Frequent Itemsets

Function to generate association rules from frequent itemsets

> from mlxtend.frequent_patterns import association_rules

Overview
Rule generation is a common task in the mining of frequent patterns. An association rule is an implication expression of the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets [1]. A more concrete example based on consumer behavior would be $\{Diapers\} \rightarrow \{Beer\}$, suggesting that people who buy diapers are also likely to buy beer. To evaluate the "interest" of such an association rule, different metrics have been developed. The current implementation makes use of the confidence and lift metrics.
Metrics

The currently supported metrics for evaluating association rules and setting selection thresholds are listed below. Given a rule "A -> C", A stands for antecedent and C stands for consequent.
'support':

$$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$$

- introduced in [3]

The support metric is defined for itemsets, not association rules. The table produced by the association rule mining algorithm contains three different support metrics: 'antecedent support', 'consequent support', and 'support'. Here, 'antecedent support' computes the proportion of transactions that contain the antecedent A, and 'consequent support' computes the support for the itemset of the consequent C. The 'support' metric then computes the support of the combined itemset A $\cup$ C.

Typically, support is used to measure the abundance or frequency (often interpreted as significance or importance) of an itemset in a database. We refer to an itemset as a 'frequent itemset' if its support is larger than a specified minimum-support threshold. Note that, in general, due to the downward closure property, all subsets of a frequent itemset are also frequent.
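To make the definition concrete, support can be computed directly from raw transactions. A minimal sketch, not mlxtend code; the toy transactions below are invented for illustration:

```python
# Toy transactions, made up for this illustration.
transactions = [
    {'Milk', 'Bread', 'Butter'},
    {'Bread', 'Butter'},
    {'Milk', 'Bread'},
    {'Milk'},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'Bread'}, transactions))            # 0.75
print(support({'Bread', 'Butter'}, transactions))  # 0.5 -- never larger than
                                                   # the support of any subset
```

The last comment reflects downward closure: adding items to an itemset can only keep its support the same or shrink it.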
'confidence':

$$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$$

- introduced in [3]

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that the transaction also contains the antecedent. Note that this metric is not symmetric or directed; for instance, the confidence of A->C is different from the confidence of C->A. The confidence of a rule A->C is 1 (maximal) if the consequent and antecedent always occur together.
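A minimal sketch of the confidence computation on invented toy transactions, illustrating that the metric is directed:

```python
transactions = [
    {'Milk', 'Bread'}, {'Milk', 'Bread'},
    {'Milk'}, {'Bread'}, {'Bread'},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, C, transactions):
    # confidence(A -> C) = support(A ∪ C) / support(A)
    return support(set(A) | set(C), transactions) / support(A, transactions)

# Directed: swapping antecedent and consequent changes the score.
print(confidence({'Milk'}, {'Bread'}, transactions))   # 0.4 / 0.6 ≈ 0.667
print(confidence({'Bread'}, {'Milk'}, transactions))   # 0.4 / 0.8 = 0.5
```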
'lift':

$$\text{lift}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C)}{\text{support}(C)}, \;\;\; \text{range: } [0, \infty]$$

- introduced in [4]

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the lift score is exactly 1.
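The independence property can be checked with a small sketch (toy data invented for illustration):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(A, C, transactions):
    # lift(A -> C) = confidence(A -> C) / support(C)
    conf = support(set(A) | set(C), transactions) / support(A, transactions)
    return conf / support(C, transactions)

# A and C statistically independent -> lift is exactly 1
independent = [{'A', 'C'}, {'A'}, {'C'}, set()]
print(lift({'A'}, {'C'}, independent))  # 1.0

# A and C always co-occur -> lift > 1
correlated = [{'A', 'C'}, {'A', 'C'}, {'B'}, {'B'}]
print(lift({'A'}, {'C'}, correlated))   # 2.0
```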
'leverage':

$$\text{leverage}(A\rightarrow C) = \text{support}(A\rightarrow C) - \text{support}(A) \times \text{support}(C), \;\;\; \text{range: } [-1, 1]$$

- introduced in [5]

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.
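Analogously, a sketch of the leverage computation on invented toy data:

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def leverage(A, C, transactions):
    # observed co-occurrence frequency minus the frequency
    # expected under independence
    return (support(set(A) | set(C), transactions)
            - support(A, transactions) * support(C, transactions))

independent = [{'A', 'C'}, {'A'}, {'C'}, set()]
print(leverage({'A'}, {'C'}, independent))  # 0.0 (independence)

correlated = [{'A', 'C'}, {'A', 'C'}, {'B'}, {'B'}]
print(leverage({'A'}, {'C'}, correlated))   # 0.5 - 0.25 = 0.25
```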
'conviction':

$$\text{conviction}(A\rightarrow C) = \frac{1 - \text{support}(C)}{1 - \text{confidence}(A\rightarrow C)}, \;\;\; \text{range: } [0, \infty]$$

- introduced in [6]

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if the items are independent, the conviction is 1.
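A sketch of conviction on invented toy data, including the 'inf' case for perfect confidence:

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conviction(A, C, transactions):
    sC = support(C, transactions)
    conf = support(set(A) | set(C), transactions) / support(A, transactions)
    if conf == 1.0:          # perfect confidence: denominator would be 0
        return float('inf')
    return (1 - sC) / (1 - conf)

# Independent items -> conviction is 1 (like lift)
independent = [{'A', 'C'}, {'A'}, {'C'}, set()]
print(conviction({'A'}, {'C'}, independent))  # 1.0

# A always implies C -> perfect confidence -> conviction is inf
perfect = [{'A', 'C'}, {'A', 'C'}, {'C'}, set()]
print(conviction({'A'}, {'C'}, perfect))      # inf
```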
'zhangs metric':

$$\text{zhangs metric}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C) - \text{confidence}(A'\rightarrow C)}{Max[ \text{confidence}(A\rightarrow C) , \text{confidence}(A'\rightarrow C)]}, \;\;\; \text{range: } [-1, 1]$$

- introduced in [7]

Measures both association and dissociation. Values range between -1 and 1. A positive value (>0) indicates association, and a negative value indicates dissociation.
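A sketch of Zhang's metric on invented toy data, where A' denotes transactions in which A is absent; the two extreme cases below show perfect association (+1) and perfect dissociation (-1):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def zhangs_metric(A, C, transactions):
    sA = support(A, transactions)
    sC = support(C, transactions)
    sAC = support(set(A) | set(C), transactions)
    conf_A = sAC / sA                  # confidence(A -> C)
    conf_notA = (sC - sAC) / (1 - sA)  # confidence(A' -> C), A' = "A absent"
    return (conf_A - conf_notA) / max(conf_A, conf_notA)

# Perfect association: A and C always co-occur
assoc = [{'A', 'C'}, {'A', 'C'}, {'B'}, {'B'}]
print(zhangs_metric({'A'}, {'C'}, assoc))   # 1.0

# Perfect dissociation: C appears only when A is absent
dissoc = [{'A'}, {'A'}, {'C'}, {'C'}]
print(zhangs_metric({'A'}, {'C'}, dissoc))  # -1.0
```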
References

[1] Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414)

[2] Michael Hahsler, https://michael.hahsler.net/research/association_rules/measures.html

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data.

[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: pp. 229-248.

[6] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997.

[7] Xiaowei Yan, Chengqi Zhang & Shichao Zhang (2009) Confidence Metrics for Association Rule Mining, Applied Artificial Intelligence, 23:8, 713-737, https://www.tandfonline.com/doi/pdf/10.1080/08839510903208062
Example 1 -- Generating Association Rules from Frequent Itemsets

The generate_rules function takes DataFrames of frequent itemsets as produced by the apriori, fpgrowth, or fpmax functions in mlxtend.frequent_patterns. To demonstrate the usage of the generate_rules method, we first create a pandas DataFrame of frequent itemsets as generated by the fpgrowth function:
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)

### alternatively:
# frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
# frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)

frequent_itemsets
```
  | support | itemsets |
---|---|---
0 | 1.0 | (Kidney Beans) |
1 | 0.8 | (Eggs) |
2 | 0.6 | (Yogurt) |
3 | 0.6 | (Onion) |
4 | 0.6 | (Milk) |
5 | 0.8 | (Kidney Beans, Eggs) |
6 | 0.6 | (Kidney Beans, Yogurt) |
7 | 0.6 | (Eggs, Onion) |
8 | 0.6 | (Kidney Beans, Onion) |
9 | 0.6 | (Eggs, Kidney Beans, Onion) |
10 | 0.6 | (Kidney Beans, Milk) |
The generate_rules() function allows you to (1) specify your metric of interest and (2) the according threshold. Currently implemented measures are confidence and lift. Let's say you are interested only in rules derived from the frequent itemsets whose level of confidence is above the 70 percent threshold (min_threshold=0.7):

```python
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
---|---|---|---|---|---|---|---|---|---|---
0 | (Kidney Beans) | (Eggs) | 1.0 | 0.8 | 0.8 | 0.80 | 1.00 | 0.00 | 1.0 | 0.0 |
1 | (Eggs) | (Kidney Beans) | 0.8 | 1.0 | 0.8 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
2 | (Yogurt) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
3 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
4 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
5 | (Onion) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
6 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
7 | (Onion, Eggs) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
8 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
9 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
10 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
11 | (Milk) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
Example 2 -- Rule Generation and Selection Criteria

If you are interested in rules according to a different metric of interest, you can simply adjust the metric and min_threshold arguments. E.g., if you are only interested in rules that have a lift score of >= 1.2, you would do the following:

```python
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
---|---|---|---|---|---|---|---|---|---|---
0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

- at least 2 antecedents
- a confidence > 0.75
- a lift score > 1.2

We could compute the antecedent length as follows:

```python
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---
0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 1 |
1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 1 |
2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 2 |
3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 2 |
4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 1 |
5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 1 |
Then, we can use pandas' selection syntax as shown below:

```python
rules[(rules['antecedent_len'] >= 2) &
      (rules['confidence'] > 0.75) &
      (rules['lift'] > 1.2)]
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---
3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf | 0.5 | 2 |
Similarly, using the Pandas API, we can select entries based on the 'antecedents' or 'consequents' columns:

```python
rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---
2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 2 |
Frozensets

Note that the entries in the 'itemsets' column are of type frozenset, a built-in Python type that is similar to a Python set but immutable, which makes it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. That is, the query

```python
rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]
```

is equivalent to any of the following three:

```python
rules[rules['antecedents'] == {'Kidney Beans', 'Eggs'}]
rules[rules['antecedents'] == frozenset(('Eggs', 'Kidney Beans'))]
rules[rules['antecedents'] == frozenset(('Kidney Beans', 'Eggs'))]
```
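This order-independence can be verified directly:

```python
a = frozenset(('Eggs', 'Kidney Beans'))
b = frozenset(('Kidney Beans', 'Eggs'))

print(a == b)                         # True: item order does not matter
print(a == {'Eggs', 'Kidney Beans'})  # True: a frozenset compares equal to a
                                      # plain set with the same members

# Unlike a set, a frozenset is hashable, so equal frozensets hash identically
# and can serve as dict keys.
print(hash(a) == hash(b))             # True
```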
Example 3 -- Frequent Itemsets with Incomplete Antecedent and Consequent Information

Most metrics computed by association_rules depend on the consequent and antecedent support scores of a given rule, which are provided in the frequent itemsets input DataFrame. Consider the following example:

```python
import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in type.
data = {'itemsets': [['177', '176'], ['177', '179'],
                     ['176', '178'], ['176', '179'],
                     ['93', '100'], ['177', '178'],
                     ['177', '176', '178']],
        'support': [0.253623, 0.253623, 0.217391,
                    0.217391, 0.181159, 0.108696, 0.108696]}

freq_itemsets = pd.DataFrame(data)
freq_itemsets
```
  | itemsets | support |
---|---|---
0 | [177, 176] | 0.253623 |
1 | [177, 179] | 0.253623 |
2 | [176, 178] | 0.217391 |
3 | [176, 179] | 0.217391 |
4 | [93, 100] | 0.181159 |
5 | [177, 178] | 0.108696 |
6 | [177, 176, 178] | 0.108696 |
Note that this is a "cropped" DataFrame that doesn't contain the support values of the item subsets. This can create problems if we want to compute the association rule metrics for, e.g., 176 => 177.

For example, confidence is computed as

$$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$$

But we do not have $\text{support}(A)$. All we know about the support of "A" is that it is at least 0.253623.

In these scenarios, where not all metrics can be computed due to incomplete input DataFrames, you can use the support_only=True option, which will only compute the support column of a given rule and requires less information:

$$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$$

"NaN"s will be assigned to all other metric columns:

```python
from mlxtend.frequent_patterns import association_rules

res = association_rules(freq_itemsets, support_only=True, min_threshold=0.1)
res
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
---|---|---|---|---|---|---|---|---|---|---
0 | (176) | (177) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
1 | (177) | (176) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
2 | (179) | (177) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
3 | (177) | (179) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
4 | (178) | (176) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
5 | (176) | (178) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
6 | (179) | (176) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
7 | (176) | (179) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
8 | (100) | (93) | NaN | NaN | 0.181159 | NaN | NaN | NaN | NaN | NaN |
9 | (93) | (100) | NaN | NaN | 0.181159 | NaN | NaN | NaN | NaN | NaN |
10 | (178) | (177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
11 | (177) | (178) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
12 | (178, 176) | (177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
13 | (178, 177) | (176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
14 | (177, 176) | (178) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
15 | (178) | (177, 176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
16 | (176) | (178, 177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
17 | (177) | (178, 176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
To clean up the representation, you may want to do the following:

```python
res = res[['antecedents', 'consequents', 'support']]
res
```
  | antecedents | consequents | support |
---|---|---|---
0 | (176) | (177) | 0.253623 |
1 | (177) | (176) | 0.253623 |
2 | (179) | (177) | 0.253623 |
3 | (177) | (179) | 0.253623 |
4 | (178) | (176) | 0.217391 |
5 | (176) | (178) | 0.217391 |
6 | (179) | (176) | 0.217391 |
7 | (176) | (179) | 0.217391 |
8 | (100) | (93) | 0.181159 |
9 | (93) | (100) | 0.181159 |
10 | (178) | (177) | 0.108696 |
11 | (177) | (178) | 0.108696 |
12 | (178, 176) | (177) | 0.108696 |
13 | (178, 177) | (176) | 0.108696 |
14 | (177, 176) | (178) | 0.108696 |
15 | (178) | (177, 176) | 0.108696 |
16 | (176) | (178, 177) | 0.108696 |
17 | (177) | (178, 176) | 0.108696 |
Example 4 -- Pruning Association Rules

There is no specific API for pruning. Instead, the pandas API can be used on the resulting DataFrame to remove individual rows. E.g., suppose we have the following rules:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
---|---|---|---|---|---|---|---|---|---|---
0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
and we want to remove the rule "(Onion, Kidney Beans) -> (Eggs)". In order to do this, we can define selection masks and remove this row as follows:

```python
antecedent_sele = rules['antecedents'] == frozenset({'Onion', 'Kidney Beans'})  # or frozenset({'Kidney Beans', 'Onion'})
consequent_sele = rules['consequents'] == frozenset({'Eggs'})
final_sele = (antecedent_sele & consequent_sele)

rules.loc[~final_sele]
```
  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
---|---|---|---|---|---|---|---|---|---|---
0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
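As a side note (an alternative not shown above), the same row removal can also be expressed with pandas' DataFrame.drop. A minimal sketch on a small hypothetical rules frame with made-up support values:

```python
import pandas as pd

# Hypothetical rules frame mirroring the column layout used above.
rules = pd.DataFrame({
    'antecedents': [frozenset({'Eggs'}),
                    frozenset({'Onion'}),
                    frozenset({'Kidney Beans', 'Onion'})],
    'consequents': [frozenset({'Onion'}),
                    frozenset({'Eggs'}),
                    frozenset({'Eggs'})],
    'support': [0.6, 0.6, 0.6],
})

mask = ((rules['antecedents'] == frozenset({'Kidney Beans', 'Onion'}))
        & (rules['consequents'] == frozenset({'Eggs'})))

# DataFrame.drop removes the matching rows by index label.
pruned = rules.drop(index=rules.index[mask])
print(len(pruned))  # 2
```

Both approaches leave the original DataFrame untouched and return a new one; assign the result back if you want to keep the pruned version.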