association_rules: Association rules generation from frequent itemsets

Function to generate association rules from frequent itemsets.

> from mlxtend.frequent_patterns import association_rules

Overview

Rule generation is a common task in the mining of frequent patterns. An association rule is an implication expression of the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets [1]. A more concrete example based on consumer behaviour would be $\{Diapers\} \rightarrow \{Beer\}$, suggesting that people who buy diapers are also likely to buy beer. To evaluate the "interest" of such an association rule, different metrics have been developed. The current implementation makes use of the confidence and lift metrics.

Metrics

The currently supported metrics for evaluating association rules and setting selection thresholds are listed below. Given the rule "A -> C", A stands for antecedent and C stands for consequent.

'support':

$$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$$

The support metric is defined for itemsets, not association rules. The table produced by the association rule mining algorithm contains three different support metrics: 'antecedent support', 'consequent support', and 'support'. Here, 'antecedent support' computes the proportion of transactions that contain the antecedent A, and 'consequent support' computes the support for the itemset of the consequent C. The 'support' metric then computes the support of the combined itemset A $\cup$ C.

Typically, support is used to measure the abundance or frequency (often interpreted as significance or importance) of an itemset in a database. We refer to an itemset as a 'frequent itemset' if its support is larger than a specified minimum-support threshold. Note that, in general, due to the downward closure property, all subsets of a frequent itemset are also frequent.
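As a minimal sketch of how these three support values relate to raw transactions (using the toy dataset from Example 1 below; the `support` helper is an illustration, not part of the mlxtend API), consider the rule (Eggs) -> (Onion):

# Toy transactions from Example 1 below
transactions = [
    {'Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Milk', 'Apple', 'Kidney Beans', 'Eggs'},
    {'Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'},
    {'Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'Eggs'}))           # antecedent support = 0.8
print(support({'Onion'}))          # consequent support = 0.6
print(support({'Eggs', 'Onion'}))  # support(A ∪ C) = 0.6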

'confidence':

$$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$$

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that this metric is not symmetric or directed; for instance, the confidence for A->C is different from the confidence for C->A. The confidence is 1 (maximal) for a rule A->C if the consequent and antecedent always occur together.
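For example, plugging in the support values from the toy dataset of Example 1 below ($\text{support}(\text{Eggs} \cup \text{Onion}) = 0.6$, $\text{support}(\text{Eggs}) = 0.8$):

$$\text{confidence}(\text{Eggs}\rightarrow \text{Onion}) = \frac{0.6}{0.8} = 0.75$$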

'lift':

$$\text{lift}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C)}{\text{support}(C)}, \;\;\; \text{range: } [0, \infty]$$

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the lift score will be exactly 1.
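Continuing the worked example for (Eggs) -> (Onion) (confidence = 0.75, $\text{support}(\text{Onion}) = 0.6$):

$$\text{lift}(\text{Eggs}\rightarrow \text{Onion}) = \frac{0.75}{0.6} = 1.25$$

That is, Eggs and Onion co-occur 1.25 times more often than expected under independence.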

'leverage':

$$\text{leverage}(A\rightarrow C) = \text{support}(A\rightarrow C) - \text{support}(A) \times \text{support}(C), \;\;\; \text{range: } [-1, 1]$$

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.
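For the same example rule ($\text{support}(\text{Eggs} \cup \text{Onion}) = 0.6$, $\text{support}(\text{Eggs}) = 0.8$, $\text{support}(\text{Onion}) = 0.6$):

$$\text{leverage}(\text{Eggs}\rightarrow \text{Onion}) = 0.6 - 0.8 \times 0.6 = 0.12$$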

'conviction':

$$\text{conviction}(A\rightarrow C) = \frac{1 - \text{support}(C)}{1 - \text{confidence}(A\rightarrow C)}, \;\;\; \text{range: } [0, \infty]$$

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
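Again for (Eggs) -> (Onion) ($\text{support}(\text{Onion}) = 0.6$, confidence = 0.75):

$$\text{conviction}(\text{Eggs}\rightarrow \text{Onion}) = \frac{1 - 0.6}{1 - 0.75} = 1.6$$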

'zhangs metric':

$$\text{zhangs metric}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C) - \text{confidence}(A'\rightarrow C)}{\text{Max}[\text{confidence}(A\rightarrow C), \ \text{confidence}(A'\rightarrow C)]}, \;\;\; \text{range: } [-1, 1]$$

Zhang's metric measures both association and dissociation. Values range between -1 and 1: a positive value (> 0) indicates association, and a negative value indicates dissociation.
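As a worked check against the toy dataset of Example 1 below: every transaction containing Onion also contains Eggs, so $\text{confidence}(\text{Eggs}'\rightarrow \text{Onion}) = 0$ and

$$\text{zhangs metric}(\text{Eggs}\rightarrow \text{Onion}) = \frac{0.75 - 0}{\text{Max}[0.75, \ 0]} = 1.0$$

which matches the zhangs_metric column in the rule tables below.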

References

[1] Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414).

[2] Michael Hahsler, https://michael.hahsler.net/research/association_rules/measures.html

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data.

[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: pp. 229-248.

[6] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997.

[7] Xiaowei Yan, Chengqi Zhang & Shichao Zhang (2009) Confidence Metrics for Association Rule Mining, Applied Artificial Intelligence, 23:8, 713-737. https://www.tandfonline.com/doi/pdf/10.1080/08839510903208062

Example 1 -- Generating Association Rules from Frequent Itemsets

The association_rules function takes a DataFrame of frequent itemsets, as produced by the apriori, fpgrowth, or fpmax functions in mlxtend.frequent_patterns. To demonstrate its usage, we first create a pandas DataFrame of frequent itemsets as generated by the fpgrowth function:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
### alternatively:
# frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
# frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)

frequent_itemsets

|    | support | itemsets |
|----|---------|----------|
| 0  | 1.0 | (Kidney Beans) |
| 1  | 0.8 | (Eggs) |
| 2  | 0.6 | (Yogurt) |
| 3  | 0.6 | (Onion) |
| 4  | 0.6 | (Milk) |
| 5  | 0.8 | (Kidney Beans, Eggs) |
| 6  | 0.6 | (Kidney Beans, Yogurt) |
| 7  | 0.6 | (Eggs, Onion) |
| 8  | 0.6 | (Kidney Beans, Onion) |
| 9  | 0.6 | (Eggs, Kidney Beans, Onion) |
| 10 | 0.6 | (Kidney Beans, Milk) |

The association_rules() function allows you to (1) specify your metric of interest and (2) the according threshold. Currently implemented measures are confidence and lift. Let's say you are only interested in rules derived from the frequent itemsets if the level of confidence is above the 70 percent threshold (min_threshold=0.7):

from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

|    | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|----|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|
| 0  | (Kidney Beans) | (Eggs) | 1.0 | 0.8 | 0.8 | 0.80 | 1.00 | 0.00 | 1.0 | 0.0 |
| 1  | (Eggs) | (Kidney Beans) | 0.8 | 1.0 | 0.8 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
| 2  | (Yogurt) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
| 3  | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 4  | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 5  | (Onion) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
| 6  | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 7  | (Onion, Eggs) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |
| 8  | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 9  | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 10 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 11 | (Milk) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.00 | 1.00 | 0.00 | inf | 0.0 |

Example 2 -- Rule Generation and Selection Criteria

If you are interested in rules according to a different metric of interest, you can simply adjust the metric and min_threshold arguments. E.g., if you are only interested in rules that have a lift score of >= 1.2, you would do the following:

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|
| 0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |

Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

  1. at least 2 antecedents
  2. a confidence > 0.75
  3. a lift score > 1.2

We could compute the antecedent length as follows:

rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|----------------|
| 0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 1 |
| 1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 1 |
| 2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 2 |
| 3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 2 |
| 4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 1 |
| 5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 | 1 |

Then, we can use pandas' selection syntax as shown below:

rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|----------------|
| 3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf | 0.5 | 2 |

Similarly, using the pandas API, we can select entries based on the 'antecedents' or 'consequents' columns:

rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|----------------|
| 2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 | 2 |

Frozensets

Note that the entries in the "itemsets" column are of type frozenset, a built-in Python type that behaves like a Python set but is immutable, which makes it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. That is, the query

rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]

is equivalent to any of the following three:

rules[rules['antecedents'] == frozenset(('Eggs', 'Kidney Beans'))]
rules[rules['antecedents'] == frozenset(('Kidney Beans', 'Eggs'))]
rules[rules['antecedents'] == {'Kidney Beans', 'Eggs'}]

Example 3 -- Frequent Itemsets with Incomplete Antecedent and Consequent Information

Most metrics computed by association_rules depend on the consequent and antecedent support scores of a given rule, which are provided by the frequent itemsets input DataFrame. Consider the following example:

import pandas as pd

# Note: the variable is named `data` to avoid shadowing the built-in `dict`
data = {'itemsets': [['177', '176'], ['177', '179'],
                     ['176', '178'], ['176', '179'],
                     ['93', '100'], ['177', '178'],
                     ['177', '176', '178']],
        'support': [0.253623, 0.253623, 0.217391,
                    0.217391, 0.181159, 0.108696, 0.108696]}

freq_itemsets = pd.DataFrame(data)
freq_itemsets

|   | itemsets | support |
|---|----------|---------|
| 0 | [177, 176] | 0.253623 |
| 1 | [177, 179] | 0.253623 |
| 2 | [176, 178] | 0.217391 |
| 3 | [176, 179] | 0.217391 |
| 4 | [93, 100] | 0.181159 |
| 5 | [177, 178] | 0.108696 |
| 6 | [177, 176, 178] | 0.108696 |

Note that this is a "cropped" DataFrame that does not contain the support values of the item subsets. This can create problems if we want to compute the association rule metrics for, e.g., 176 => 177.

For example, confidence is computed as

$$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$$

But we do not have $\text{support}(A)$. All we know about the support of "A" is that it is at least 0.253623.

In these scenarios, where not all metrics can be computed because the input DataFrame is incomplete, you can use the support_only=True option, which will only compute the support column of a given rule and requires less information:

$$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$$

"NaN"s will be assigned to all other metric columns:

from mlxtend.frequent_patterns import association_rules

res = association_rules(freq_itemsets, support_only=True, min_threshold=0.1)
res

|    | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|----|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|
| 0  | (176) | (177) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
| 1  | (177) | (176) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
| 2  | (179) | (177) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
| 3  | (177) | (179) | NaN | NaN | 0.253623 | NaN | NaN | NaN | NaN | NaN |
| 4  | (178) | (176) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
| 5  | (176) | (178) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
| 6  | (179) | (176) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
| 7  | (176) | (179) | NaN | NaN | 0.217391 | NaN | NaN | NaN | NaN | NaN |
| 8  | (100) | (93) | NaN | NaN | 0.181159 | NaN | NaN | NaN | NaN | NaN |
| 9  | (93) | (100) | NaN | NaN | 0.181159 | NaN | NaN | NaN | NaN | NaN |
| 10 | (178) | (177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 11 | (177) | (178) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 12 | (178, 176) | (177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 13 | (178, 177) | (176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 14 | (177, 176) | (178) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 15 | (178) | (177, 176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 16 | (176) | (178, 177) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |
| 17 | (177) | (178, 176) | NaN | NaN | 0.108696 | NaN | NaN | NaN | NaN | NaN |

To clean up the representation, you may want to do the following:

res = res[['antecedents', 'consequents', 'support']]
res

|    | antecedents | consequents | support |
|----|-------------|-------------|---------|
| 0  | (176) | (177) | 0.253623 |
| 1  | (177) | (176) | 0.253623 |
| 2  | (179) | (177) | 0.253623 |
| 3  | (177) | (179) | 0.253623 |
| 4  | (178) | (176) | 0.217391 |
| 5  | (176) | (178) | 0.217391 |
| 6  | (179) | (176) | 0.217391 |
| 7  | (176) | (179) | 0.217391 |
| 8  | (100) | (93) | 0.181159 |
| 9  | (93) | (100) | 0.181159 |
| 10 | (178) | (177) | 0.108696 |
| 11 | (177) | (178) | 0.108696 |
| 12 | (178, 176) | (177) | 0.108696 |
| 13 | (178, 177) | (176) | 0.108696 |
| 14 | (177, 176) | (178) | 0.108696 |
| 15 | (178) | (177, 176) | 0.108696 |
| 16 | (176) | (178, 177) | 0.108696 |
| 17 | (177) | (178, 176) | 0.108696 |

Example 4 -- Pruning Association Rules

There is no specific API for pruning. Instead, the pandas API can be used on the resulting DataFrame to remove individual rows. E.g., suppose we have the following rules:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|
| 0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 3 | (Kidney Beans, Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |

and we want to remove the rule "(Onion, Kidney Beans) -> (Eggs)". In order to do so, we can define a selection mask and remove this row as follows:

antecedent_sele = rules['antecedents'] == frozenset({'Onion', 'Kidney Beans'}) # or  frozenset({'Kidney Beans', 'Onion'})
consequent_sele = rules['consequents'] == frozenset({'Eggs'})
final_sele = (antecedent_sele & consequent_sele)

rules.loc[ ~final_sele ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|---------------|
| 0 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 1 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
| 2 | (Kidney Beans, Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 4 | (Eggs) | (Kidney Beans, Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 5 | (Onion) | (Kidney Beans, Eggs) | 0.6 | 0.8 | 0.6 | 1.00 | 1.25 | 0.12 | inf | 0.5 |
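Note that rules.loc[~final_sele] returns a new, filtered DataFrame rather than modifying rules in place. A minimal follow-up sketch to keep the pruned result (the reset_index call is optional and merely renumbers the rows):

# Assign the filtered result back to persist the pruning
rules = rules.loc[~final_sele].reset_index(drop=True)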

API