fpgrowth:通过FP-growth算法获取频繁项集
实现FP-Growth的函数,用于提取关联规则挖掘中的频繁项集
> from mlxtend.frequent_patterns import fpgrowth
概述
FP-Growth [1] 是一种用于提取频繁项集的算法,在关联规则学习中应用广泛,作为传统的 Apriori 算法 [2] 的一种流行替代方案。
一般来说,该算法旨在处理包含事务的数据库,例如商店客户的购买记录。如果一个项集满足用户指定的支持阈值,则该项集被视为“频繁”。例如,如果支持阈值设置为 0.5(50%),则频繁项集被定义为在数据库中至少出现 50% 的所有事务中共同出现的项集。
特别地,FP-Growth 与 Apriori 频繁模式挖掘算法的不同之处在于,FP-Growth 是一种不需要候选项生成的频繁模式挖掘算法。它内部使用一种称为 FP-tree(频繁模式树)的数据结构,而不显式生成候选集,这使得它对大型数据集特别有吸引力。
参考文献
[1] Han, Jiawei, Jian Pei, Yiwen Yin, 和 Runying Mao. "无候选生成的频繁模式挖掘。" 频繁模式树方法。数据挖掘与知识发现 8, no. 1 (2004): 53-87。
[2] Agrawal, Rakesh, 和 Ramakrishnan Srikant. "快速挖掘关联规则的算法。" 第20届国际会议,大型数据库会议,VLDB。第1215卷。1994。
相关
示例 1 -- 生成频繁项集
fpgrowth
函数期望接收一个经过独热编码的 pandas DataFrame 数据。
假设我们有以下交易数据:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
我们可以通过 TransactionEncoder
将其转换为正确的格式,具体如下:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | True | False | True | True | True | True | False | True |
1 | False | False | True | True | False | True | False | True | True | False | True |
2 | True | False | False | True | False | True | True | False | False | False | False |
3 | False | True | False | False | False | True | True | False | False | True | True |
4 | False | True | False | True | True | True | False | False | True | False | False |
现在,让我们返回支持度至少为60%的项和项集:
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.6)
support | itemsets | |
---|---|---|
0 | 1.0 | (5) |
1 | 0.8 | (3) |
2 | 0.6 | (10) |
3 | 0.6 | (8) |
4 | 0.6 | (6) |
5 | 0.8 | (3, 5) |
6 | 0.6 | (10, 5) |
7 | 0.6 | (8, 3) |
8 | 0.6 | (8, 5) |
9 | 0.6 | (8, 3, 5) |
10 | 0.6 | (5, 6) |
默认情况下,fpgrowth
返回项目的列索引,这在后续操作如关联规则挖掘中可能会很有用。为了更好地阅读,我们可以将 use_colnames=True
设置为将这些整数值转换为相应的项目名称:
fpgrowth(df, min_support=0.6, use_colnames=True)
support | itemsets | |
---|---|---|
0 | 1.0 | (Kidney Beans) |
1 | 0.8 | (Eggs) |
2 | 0.6 | (Yogurt) |
3 | 0.6 | (Onion) |
4 | 0.6 | (Milk) |
5 | 0.8 | (Eggs, Kidney Beans) |
6 | 0.6 | (Yogurt, Kidney Beans) |
7 | 0.6 | (Eggs, Onion) |
8 | 0.6 | (Onion, Kidney Beans) |
9 | 0.6 | (Eggs, Onion, Kidney Beans) |
10 | 0.6 | (Milk, Kidney Beans) |
示例 2 -- Apriori 与 FPGrowth 对比
由于FP-Growth不需要显式地创建候选集,因此它的速度可以比替代的Apriori算法快几个数量级。例如,以下单元格比较了Apriori算法和FP-Growth的性能——即使在这个非常简单的玩具数据集场景中,FP-Growth的速度也快了大约5倍。
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import apriori
%timeit -n 100 -r 10 apriori(df, min_support=0.6)
850 µs ± 39.3 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 apriori(df, min_support=0.6, low_memory=True)
941 µs ± 30.6 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
from mlxtend.frequent_patterns import fpgrowth
%timeit -n 100 -r 10 fpgrowth(df, min_support=0.6)
320 µs ± 9.21 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
更多示例
请注意,由于 fpgrowth
函数是 apriori
的直接替代品,因此它具有相同的一组函数参数和返回参数。因此,如需更多示例,请参见 apriori
文档。
API
fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
Get frequent itemsets from a one-hot DataFrame
Parameters
-
df
: pandas DataFramepandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.
Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.
The allowed values are either 0/1 or True/False. For example,
Apple Bananas Beer Chicken Milk Rice 0 True False True True False True 1 True False True False False True 2 True False True False False False 3 True True False False False False 4 False False True True True True 5 False False True False True True 6 False False True False True False 7 True True False False False False
-
min_support
: float (default: 0.5)A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
-
use_colnames
: bool (default: False)If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
-
max_len
: int (default: None)Maximum length of the itemsets generated. If
None
(default) all possible itemsets lengths are evaluated. -
verbose
: int (default: 0)Shows the stages of conditional tree generation.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= min_support
and < than max_len
(if max_len
is not None).
Each itemset in the 'itemsets' column is of type frozenset
,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/