apriori: Frequent itemsets via the Apriori algorithm


> from mlxtend.frequent_patterns import apriori




[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.


Example 1 -- Generating Frequent Itemsets

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

   Apple   Corn   Dill   Eggs  Ice cream  Kidney Beans   Milk  Nutmeg  Onion  Unicorn  Yogurt
0  False  False  False   True      False          True   True    True   True    False    True
1  False  False   True   True      False          True  False    True   True    False    True
2   True  False  False   True      False          True   True   False  False    False   False
3  False   True  False  False      False          True   True   False  False     True    True
4  False   True  False   True       True          True  False   False   True    False   False
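To make the encoded format concrete, the following is a minimal pure-Python sketch of one-hot encoding a list of transactions. The helper name `one_hot_encode` is hypothetical and this is not how TransactionEncoder is implemented; it only reproduces the same column order and boolean rows for this small dataset.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def one_hot_encode(transactions):
    """Hypothetical helper: return (columns, rows) for a list of transactions.

    columns is the sorted list of unique items; each row holds one boolean
    per column, True if the transaction contains that item.
    """
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)  # duplicates such as 'Onion', 'Onion' collapse here
        rows.append([item in present for item in columns])
    return columns, rows


columns, rows = one_hot_encode(dataset)
print(columns)
print(rows[0])
```

Note that a repeated item within one transaction (the second 'Onion' in the last transaction) still yields a single True value, since one-hot encoding records presence rather than counts.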

Now, let us return the items and itemsets with at least 60% support:

from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.6)

support itemsets
0 0.8 (3)
1 1.0 (5)
2 0.6 (6)
3 0.6 (8)
4 0.6 (10)
5 0.8 (3, 5)
6 0.6 (8, 3)
7 0.6 (5, 6)
8 0.6 (8, 5)
9 0.6 (10, 5)
10 0.6 (8, 3, 5)
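For readers who want to see where these support values come from, here is a small level-wise sketch in pure Python. It is an illustrative assumption of how an Apriori-style search can be written, not mlxtend's implementation; the names `support` and `apriori_sketch` are hypothetical, and it works on item names rather than column indices.

```python
from itertools import combinations

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
transactions = [frozenset(t) for t in dataset]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori_sketch(min_support):
    """Level-wise search: size k+1 candidates come only from frequent size k itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        size = len(level[0]) + 1
        current = set(level)
        # Join step: union pairs of frequent itemsets that differ by one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, size - 1))}
        level = [c for c in candidates if support(c) >= min_support]
    return frequent


result = apriori_sketch(0.6)
for itemset, value in result.items():
    print(value, sorted(itemset))
```

With min_support=0.6 the sketch finds the same 11 itemsets as the table above: five single items, five pairs, and one triple.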

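By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names: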

apriori(df, min_support=0.6, use_colnames=True)

support itemsets
0 0.8 (Eggs)
1 1.0 (Kidney Beans)
2 0.6 (Milk)
3 0.6 (Onion)
4 0.6 (Yogurt)
5 0.8 (Eggs, Kidney Beans)
6 0.6 (Eggs, Onion)
7 0.6 (Kidney Beans, Milk)
8 0.6 (Kidney Beans, Onion)
9 0.6 (Yogurt, Kidney Beans)
10 0.6 (Kidney Beans, Eggs, Onion)
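The relationship between the two outputs can be sketched as a plain index lookup. The helper `to_names` below is hypothetical and only illustrates the mapping; the column list is copied from the encoded DataFrame shown earlier.

```python
columns = ['Apple', 'Corn', 'Dill', 'Eggs', 'Ice cream', 'Kidney Beans',
           'Milk', 'Nutmeg', 'Onion', 'Unicorn', 'Yogurt']


def to_names(index_itemset, columns):
    """Hypothetical helper: map a set of column indices to a frozenset of item names."""
    return frozenset(columns[i] for i in index_itemset)


named = to_names(frozenset({8, 3, 5}), columns)
print(sorted(named))  # ['Eggs', 'Kidney Beans', 'Onion']
```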

Example 2 -- Selecting and Filtering Results

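An advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, suppose we are only interested in itemsets of length 2 that have a support of at least 80%. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset: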

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

support itemsets length
0 0.8 (Eggs) 1
1 1.0 (Kidney Beans) 1
2 0.6 (Milk) 1
3 0.6 (Onion) 1
4 0.6 (Yogurt) 1
5 0.8 (Eggs, Kidney Beans) 2
6 0.6 (Eggs, Onion) 2
7 0.6 (Kidney Beans, Milk) 2
8 0.6 (Kidney Beans, Onion) 2
9 0.6 (Yogurt, Kidney Beans) 2
10 0.6 (Kidney Beans, Eggs, Onion) 3


frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.8) ]

support itemsets length
5 0.8 (Eggs, Kidney Beans) 2

Similarly, using the pandas API, we can select entries based on the "itemsets" column:

frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]

support itemsets length
6 0.6 (Eggs, Onion) 2
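The selection above works regardless of the order in which the items are written, because the comparison follows Python's set semantics. A short standalone illustration, using only built-in types (the variable name `stored` is just for this example):

```python
# An itemset as it might appear in the "itemsets" column.
stored = frozenset({'Eggs', 'Onion'})

print(stored == {'Onion', 'Eggs'})             # True: a set and a frozenset compare by members
print(stored == {'Eggs', 'Onion'})             # True: order of the literal does not matter
print(stored == frozenset(('Onion', 'Eggs')))  # True
print(stored == ('Eggs', 'Onion'))             # False: a tuple is not a set
```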





Example 3 -- Working with Sparse Representations


oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)

   Apple   Corn   Dill   Eggs  Ice cream  Kidney Beans   Milk  Nutmeg  Onion  Unicorn  Yogurt
0  False  False  False   True      False          True   True    True   True    False    True
1  False  False   True   True      False          True  False    True   True    False    True
2   True  False  False   True      False          True   True   False  False    False   False
3  False   True  False  False      False          True   True   False  False     True    True
4  False   True  False   True       True          True  False   False   True    False   False

apriori(sparse_df, min_support=0.6, use_colnames=True, verbose=1)

Processing 21 combinations | Sampling itemset size 3
support itemsets
0 0.8 (Eggs)
1 1.0 (Kidney Beans)
2 0.6 (Milk)
3 0.6 (Onion)
4 0.6 (Yogurt)
5 0.8 (Eggs, Kidney Beans)
6 0.6 (Eggs, Onion)
7 0.6 (Kidney Beans, Milk)
8 0.6 (Kidney Beans, Onion)
9 0.6 (Yogurt, Kidney Beans)
10 0.6 (Kidney Beans, Eggs, Onion)
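As a rough intuition for why a sparse representation can help, the sketch below stores only the column indices of the True entries for each transaction and computes support from that. This is an illustrative assumption, not the storage format used by pandas or SciPy (which rely on dedicated sparse formats), and the names `sparse_rows` and `sparse_support` are hypothetical.

```python
columns = ['Apple', 'Corn', 'Dill', 'Eggs', 'Ice cream', 'Kidney Beans',
           'Milk', 'Nutmeg', 'Onion', 'Unicorn', 'Yogurt']
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

col_index = {name: i for i, name in enumerate(columns)}

# Sparse form: per transaction, keep only the indices of the columns that are True.
sparse_rows = [frozenset(col_index[item] for item in t) for t in dataset]

dense_cells = len(dataset) * len(columns)           # every cell stored explicitly
stored_cells = sum(len(row) for row in sparse_rows)  # only the True cells stored
print(dense_cells, stored_cells)  # 55 26


def sparse_support(index_itemset):
    """Fraction of transactions whose stored indices include every index in the itemset."""
    wanted = frozenset(index_itemset)
    return sum(wanted <= row for row in sparse_rows) / len(sparse_rows)


print(sparse_support({3, 5}))  # 0.8, i.e. (Eggs, Kidney Beans)
```

The saving is modest here, but it grows when there are many distinct items and each transaction contains only a few of them.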


apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)

Get frequent itemsets from a one-hot DataFrame


    Apple  Bananas   Beer  Chicken   Milk   Rice
    0     True    False   True     True  False   True
    1     True    False   True    False  False   True
    2     True    False   True    False  False  False
    3     True     True  False    False  False  False
    4    False    False   True     True   True   True
    5    False    False   True    False   True   True
    6    False    False   True    False   True  False
    7     True     True  False    False  False  False


pandas DataFrame with columns ['support', 'itemsets'] of all itemsets whose support is >= min_support and whose length is <= max_len (if max_len is not None). Each itemset in the 'itemsets' column is of type frozenset, a Python built-in type that behaves like a set except that it is immutable (for more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).
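The practical consequences of the frozenset type can be shown with built-in Python alone. The values below are taken from the examples in this guide; the variable names are only for illustration.

```python
itemset = frozenset({'Eggs', 'Kidney Beans'})

# Immutable, hence hashable: a frozenset can serve as a dictionary key.
support_by_itemset = {itemset: 0.8}
lookup = support_by_itemset[frozenset({'Kidney Beans', 'Eggs'})]
print(lookup)  # 0.8

# Immutable: there is no in-place add(), unlike a regular set.
try:
    itemset.add('Milk')
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False

# Set operations still work and return a new frozenset.
extended = itemset | {'Onion'}
print(type(extended).__name__, len(extended))  # frozenset 3
```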


For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/