apriori: Frequent Itemsets via the Apriori Algorithm
Apriori function to extract frequent itemsets for association rule mining
> from mlxtend.frequent_patterns import apriori
Overview
Apriori is a popular algorithm [1] for extracting frequent itemsets, with applications in association rule learning. The apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
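For instance, the support of a single itemset can be computed directly from a list of transactions. The following is a minimal sketch in plain Python (using the grocery transactions from Example 1 below), independent of the apriori function itself:

```python
# Minimal sketch: support of {'Kidney Beans', 'Eggs'} as the fraction of
# transactions that contain both items.
transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]

itemset = {'Kidney Beans', 'Eggs'}
support = sum(itemset <= set(t) for t in transactions) / len(transactions)
print(support)  # 0.8 -- the itemset appears in 4 of the 5 transactions
```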
References
[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. on Very Large Data Bases, VLDB. Vol. 1215. 1994.
Example 1 -- Generating Frequent Itemsets
The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
| | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6)
| | support | itemsets |
|---|---|---|
| 0 | 0.8 | (3) |
| 1 | 1.0 | (5) |
| 2 | 0.6 | (6) |
| 3 | 0.6 | (8) |
| 4 | 0.6 | (10) |
| 5 | 0.8 | (3, 5) |
| 6 | 0.6 | (8, 3) |
| 7 | 0.6 | (5, 6) |
| 8 | 0.6 | (8, 5) |
| 9 | 0.6 | (10, 5) |
| 10 | 0.6 | (8, 3, 5) |
By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:
apriori(df, min_support=0.6, use_colnames=True)
| | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
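If you prefer to keep the default integer indices, the mapping back to item names can be recovered from the columns of the input DataFrame. A small sketch (this helper is just one possible way to do it, not part of the apriori API):

```python
# Map the integer column indices returned by apriori back to item names
frequent_by_index = apriori(df, min_support=0.6)
frequent_by_index['itemsets'] = frequent_by_index['itemsets'].apply(
    lambda idx: frozenset(df.columns[i] for i in idx)
)
frequent_by_index
```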
Example 2 -- Selecting and Filtering Results
The advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.8 | (Eggs) | 1 |
| 1 | 1.0 | (Kidney Beans) | 1 |
| 2 | 0.6 | (Milk) | 1 |
| 3 | 0.6 | (Onion) | 1 |
| 4 | 0.6 | (Yogurt) | 1 |
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
| 6 | 0.6 | (Eggs, Onion) | 2 |
| 7 | 0.6 | (Kidney Beans, Milk) | 2 |
| 8 | 0.6 | (Kidney Beans, Onion) | 2 |
| 9 | 0.6 | (Yogurt, Kidney Beans) | 2 |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) | 3 |
Then, we can select the results that satisfy our desired criteria as follows:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
(frequent_itemsets['support'] >= 0.8) ]
| | support | itemsets | length |
|---|---|---|---|
| 5 | 0.8 | (Eggs, Kidney Beans) | 2 |
Similarly, using the pandas API, we can select entries based on the "itemsets" column:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
| | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Eggs, Onion) | 2 |
Frozensets
Note that the entries in the "itemsets" column are of type frozenset, a built-in Python type that is similar to a Python set but immutable, which makes it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. That is, the query
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]
is equivalent to any of the following three:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Eggs', 'Onion'} ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Eggs', 'Onion')) ]
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Onion', 'Eggs')) ]
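The same order-insensitivity can be verified directly in plain Python; a quick illustration:

```python
# frozensets compare equal to sets containing the same elements,
# regardless of the order in which the elements are written
assert frozenset(('Eggs', 'Onion')) == {'Onion', 'Eggs'}

# unlike a regular set, a frozenset is hashable and can be used as a dict key
support_lookup = {frozenset(('Eggs', 'Onion')): 0.6}
print(support_lookup[frozenset(('Onion', 'Eggs'))])  # 0.6
```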
Example 3 -- Working with Sparse Representations
To save memory, you may want to represent your transaction data in the sparse format. This is especially useful if you have lots of products and small transactions.
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df
| | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
apriori(sparse_df, min_support=0.6, use_colnames=True, verbose=1)
Processing 21 combinations | Sampling itemset size 3
| | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |
| 10 | 0.6 | (Kidney Beans, Eggs, Onion) |
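How much memory the sparse representation saves depends on the data. One way to inspect this (a small sketch using standard pandas functionality, not specific to mlxtend) is to compare the sparsity and memory footprint of the two DataFrames:

```python
# Fraction of explicitly stored (non-fill) values in the sparse DataFrame
print(sparse_df.sparse.density)

# Memory footprint of the dense vs. the sparse representation (in bytes)
print(df.memory_usage(deep=True).sum())
print(sparse_df.memory_usage(deep=True).sum())
```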
API
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
Get frequent itemsets from a one-hot DataFrame
Parameters
- df : pandas DataFrame

  pandas DataFrame in the one-hot encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures

  Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False. For example,

         Apple  Bananas  Beer  Chicken  Milk  Rice
      0   True    False  True     True False  True
      1   True    False  True    False False  True
      2   True    False  True    False False False
      3   True     True False    False False False
      4  False    False  True     True  True  True
      5  False    False  True    False  True  True
      6  False    False  True    False  True False
      7   True     True False    False False False

- min_support : float (default: 0.5)

  A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.

- use_colnames : bool (default: False)

  If True, uses the DataFrame's column names in the returned DataFrame instead of column indices.

- max_len : int (default: None)

  Maximum length of the itemsets generated. If None (default), all possible itemset lengths (under the apriori condition) are evaluated.

- verbose : int (default: 0)

  Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and low_memory is False, shows the number of combinations.

- low_memory : bool (default: False)

  If True, uses an iterator to search for combinations above min_support. Note that low_memory=True should only be used for large datasets when memory resources are limited, because this implementation is approx. 3-6x slower than the default.
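For instance, on a large one-hot encoded DataFrame these options might be combined as follows (a sketch; df stands for any encoded DataFrame as described above):

```python
# Iterator-based candidate search with progress output; uses less memory
# but is roughly 3-6x slower than the default implementation
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True,
                            low_memory=True, verbose=1)
```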
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets that are >= min_support and < max_len (if max_len is not None). Each itemset in the 'itemsets' column is of type frozenset, which is a Python built-in type that behaves similarly to a set except that it is immutable (for more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/