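fpmax: Maximal itemsets via the FP-Max algorithm
Function implementing FP-Max to extract maximal itemsets for association rule mining.
# Using fpmax
In this example, we use the `fpmax` function from the `mlxtend` library.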
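Overview
Apriori is one of the earliest and most popular algorithms for frequent itemset generation (frequent itemsets are then used for association rule mining). However, the runtime of Apriori can be quite large on datasets with many unique items, because it grows exponentially with the number of unique items.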
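In contrast to Apriori, FP-Growth is a frequent pattern generation algorithm that inserts items into a pattern search tree, which allows its runtime to increase linearly with the number of unique items or entries.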
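To make the idea of a pattern search tree more concrete, here is a minimal sketch of building a prefix tree in the FP-tree style. It is an illustration under simplifying assumptions, not the implementation used by mlxtend: the names `FPNode` and `build_fp_tree` are hypothetical, and the mining step that follows tree construction is omitted.

```python
from collections import Counter


class FPNode:
    """One node of a simplified FP-tree (hypothetical helper, not mlxtend's class)."""

    def __init__(self, item=None):
        self.item = item      # item label; None for the root
        self.count = 0        # number of transactions passing through this node
        self.children = {}    # item label -> child FPNode


def build_fp_tree(transactions, min_count):
    # First pass: count the number of transactions containing each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode()
    # Second pass: insert the frequent items of each transaction,
    # most frequent first, so that common prefixes share nodes.
    for t in transactions:
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "d"],
]
tree = build_fp_tree(transactions, min_count=2)
print(tree.children["a"].count)  # transactions whose ordered prefix starts with "a"
```

Each transaction is read only twice (once for counting, once for insertion), and transactions that share a prefix share tree nodes, which is what keeps the structure compact.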
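FP-Max is a variant of FP-Growth that focuses on obtaining maximal itemsets. An itemset X is said to be maximal if X is frequent and there exists no frequent super-pattern containing X. In other words, to qualify as a maximal itemset, a frequent pattern X cannot be a sub-pattern of a larger frequent pattern.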
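The definition can be checked directly on a small collection of frequent itemsets. The following sketch uses a hypothetical helper, `maximal_itemsets`, that keeps every itemset with no proper superset in the collection. It is a brute-force illustration of the definition, not the FP-Max algorithm itself, and it assumes that the input already contains all frequent itemsets.

```python
def maximal_itemsets(frequent_itemsets):
    """Keep the itemsets that have no proper superset among the frequent ones."""
    itemsets = [frozenset(s) for s in frequent_itemsets]
    # `x < y` is the proper-subset test for (frozen)sets.
    return [x for x in itemsets if not any(x < y for y in itemsets)]


# Toy input: assume these are all the frequent itemsets of some dataset.
frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}]
maximal = maximal_itemsets(frequent)
print(maximal)
```

Here `{"a"}`, `{"b"}`, and `{"c"}` are dropped because each is contained in a larger frequent itemset, while `{"a", "b"}` and `{"a", "c"}` remain.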
References
- [1] Grahne, G., & Zhu, J. (2003, November). Efficiently using prefix-trees in mining frequent itemsets. In FIMI (Vol. 90).
Example 1 -- Maximal Itemsets
The `fpmax` function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the `TransactionEncoder` as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
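For readers who want to see what the one-hot encoding amounts to, the following sketch reproduces the table with plain Python. The helper `one_hot_encode` is hypothetical and only illustrates the idea; it is not how `TransactionEncoder` is implemented.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    # Duplicate items within a transaction (e.g. 'Onion' twice) collapse to one True.
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

columns, rows = one_hot_encode(dataset)
print(columns)
print(rows[0])
```

The column order is alphabetical, which is why `Kidney Beans` ends up at column index 5 and `Milk` at index 6.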
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import fpmax
fpmax(df, min_support=0.6)
|   | support | itemsets |
|---|---|---|
| 0 | 0.6 | (5, 6) |
| 1 | 0.6 | (8, 3, 5) |
| 2 | 0.6 | (10, 5) |
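The result can be verified by hand. The sketch below computes support as the fraction of transactions containing an itemset and then confirms that extending any of the three itemsets by one more item drops the support below 0.6. The helper `support` is hypothetical and works on the raw transaction lists rather than the DataFrame. Checking single-item extensions is sufficient, because any superset of an infrequent itemset is itself infrequent.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

all_items = sorted({item for t in dataset for item in t})
results = [{'Kidney Beans', 'Milk'},
           {'Onion', 'Eggs', 'Kidney Beans'},
           {'Kidney Beans', 'Yogurt'}]

for itemset in results:
    own = support(itemset, dataset)
    # Best support reachable by adding exactly one more item.
    best_extension = max(support(itemset | {item}, dataset)
                         for item in all_items if item not in itemset)
    print(sorted(itemset), own, best_extension)
```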
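By default, `fpmax` returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set `use_colnames=True` to convert these integer values into the respective item names: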
fpmax(df, min_support=0.6, use_colnames=True)
|   | support | itemsets |
|---|---|---|
| 0 | 0.6 | (Kidney Beans, Milk) |
| 1 | 0.6 | (Onion, Eggs, Kidney Beans) |
| 2 | 0.6 | (Kidney Beans, Yogurt) |
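The relationship between the two outputs is a simple lookup of column positions. The following sketch maps index-based itemsets to item names by hand, assuming the alphabetical column order from the encoded DataFrame above; with a real DataFrame, `df.columns` would play the role of the `columns` list.

```python
# Column order of the one-hot encoded data (assumed to match the DataFrame above).
columns = ['Apple', 'Corn', 'Dill', 'Eggs', 'Ice cream', 'Kidney Beans',
           'Milk', 'Nutmeg', 'Onion', 'Unicorn', 'Yogurt']

# Itemsets as returned with the default use_colnames=False.
index_itemsets = [frozenset({5, 6}), frozenset({8, 3, 5}), frozenset({10, 5})]

# Replace every column index with the corresponding column name.
named_itemsets = [frozenset(columns[i] for i in itemset)
                  for itemset in index_itemsets]
print(named_itemsets)
```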
More Examples
Please note that since the `fpmax` function is a drop-in replacement for `fpgrowth` and `apriori`, it comes with the same set of function arguments and return arguments. Thus, for more examples, please see the `apriori` documentation.
API
fpmax(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
Get maximal frequent itemsets from a one-hot DataFrame
Parameters
- `df` : pandas DataFrame

  pandas DataFrame in the one-hot encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures

  Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False. For example,

  |   | Apple | Bananas | Beer | Chicken | Milk | Rice |
  |---|---|---|---|---|---|---|
  | 0 | True | False | True | True | False | True |
  | 1 | True | False | True | False | False | True |
  | 2 | True | False | True | False | False | False |
  | 3 | True | True | False | False | False | False |
  | 4 | False | False | True | True | True | True |
  | 5 | False | False | True | False | True | True |
  | 6 | False | False | True | False | True | False |
  | 7 | True | True | False | False | False | False |

- `min_support` : float (default: 0.5)

  A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction `transactions_where_item(s)_occur / total_transactions`.

- `use_colnames` : bool (default: False)

  If true, uses the DataFrame's column names in the returned DataFrame instead of column indices.

- `max_len` : int (default: None)

  Given the set of all maximal itemsets, return those that are less than `max_len`. If `None` (default), all possible itemset lengths are evaluated.

- `verbose` : int (default: 0)

  Shows the stages of conditional tree generation.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all maximal itemsets that are >= `min_support` and < `max_len` (if `max_len` is not None). Each itemset in the 'itemsets' column is of type `frozenset`, which is a Python built-in type that behaves similarly to sets except that it is immutable (for more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).
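Because the itemsets are `frozenset` objects, the usual set operations apply, but in-place modification does not. The sketch below uses hand-written itemsets that mirror the example output above (they are not produced by calling `fpmax` here) to show membership tests, subset tests, use as dictionary keys, and immutability.

```python
itemsets = [
    frozenset({"Kidney Beans", "Milk"}),
    frozenset({"Onion", "Eggs", "Kidney Beans"}),
    frozenset({"Kidney Beans", "Yogurt"}),
]

# Membership and subset tests work as for ordinary sets.
with_eggs = [s for s in itemsets if "Eggs" in s]
covers_onion_eggs = [s for s in itemsets if {"Onion", "Eggs"} <= s]

# frozensets are hashable, so they can serve as dictionary keys.
support_by_itemset = {s: 0.6 for s in itemsets}

# They are immutable: there is no add() method.
try:
    itemsets[0].add("Eggs")
    mutated = True
except AttributeError:
    mutated = False

print(with_eggs, covers_onion_eggs, mutated)
```

Element order inside a `frozenset` does not matter, so `frozenset({"Milk", "Kidney Beans"})` looks up the same dictionary entry as `frozenset({"Kidney Beans", "Milk"})`.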
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/