Fpgrowth - mlxtend

fpgrowth：通过FP-growth算法获取频繁项集

实现FP-Growth的函数，用于提取关联规则挖掘中的频繁项集

> from mlxtend.frequent_patterns import fpgrowth

概述

FP-Growth [1] 是一种用于提取频繁项集的算法，在关联规则学习中应用广泛，作为传统的 Apriori 算法 [2] 的一种流行替代方案。

一般来说，该算法旨在处理包含事务的数据库，例如商店客户的购买记录。如果一个项集满足用户指定的支持阈值，则该项集被视为“频繁”。例如，如果支持阈值设置为 0.5（50%），则频繁项集被定义为在数据库中至少出现 50% 的所有事务中共同出现的项集。

特别地，FP-Growth 与 Apriori 频繁模式挖掘算法的不同之处在于，FP-Growth 是一种不需要候选项生成的频繁模式挖掘算法。它内部使用一种称为 FP-tree（频繁模式树）的数据结构，而不显式生成候选集，这使得它对大型数据集特别有吸引力。

参考文献

[1] Han, Jiawei, Jian Pei, Yiwen Yin, 和 Runying Mao. "无候选生成的频繁模式挖掘。" 频繁模式树方法。数据挖掘与知识发现 8, no. 1 (2004): 53-87。

[2] Agrawal, Rakesh, 和 Ramakrishnan Srikant. "快速挖掘关联规则的算法。" 第20届国际会议，大型数据库会议，VLDB。第1215卷。1994。

示例 1 -- 生成频繁项集

fpgrowth 函数期望接收一个经过独热编码的 pandas DataFrame 数据。假设我们有以下交易数据：

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

我们可以通过 TransactionEncoder 将其转换为正确的格式，具体如下：

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

	Apple	Corn	Dill	Eggs	Ice cream	Kidney Beans	Milk	Nutmeg	Onion	Unicorn	Yogurt
0	False	False	False	True	False	True	True	True	True	False	True
1	False	False	True	True	False	True	False	True	True	False	True
2	True	False	False	True	False	True	True	False	False	False	False
3	False	True	False	False	False	True	True	False	False	True	True
4	False	True	False	True	True	True	False	False	True	False	False

现在，让我们返回支持度至少为60%的项和项集：

from mlxtend.frequent_patterns import fpgrowth

fpgrowth(df, min_support=0.6)

	support	itemsets
0	1.0	(5)
1	0.8	(3)
2	0.6	(10)
3	0.6	(8)
4	0.6	(6)
5	0.8	(3, 5)
6	0.6	(10, 5)
7	0.6	(8, 3)
8	0.6	(8, 5)
9	0.6	(8, 3, 5)
10	0.6	(5, 6)

默认情况下，fpgrowth 返回项目的列索引，这在后续操作如关联规则挖掘中可能会很有用。为了更好地阅读，我们可以将 use_colnames=True 设置为将这些整数值转换为相应的项目名称：

fpgrowth(df, min_support=0.6, use_colnames=True)

	support	itemsets
0	1.0	(Kidney Beans)
1	0.8	(Eggs)
2	0.6	(Yogurt)
3	0.6	(Onion)
4	0.6	(Milk)
5	0.8	(Eggs, Kidney Beans)
6	0.6	(Yogurt, Kidney Beans)
7	0.6	(Eggs, Onion)
8	0.6	(Onion, Kidney Beans)
9	0.6	(Eggs, Onion, Kidney Beans)
10	0.6	(Milk, Kidney Beans)

示例 2 -- Apriori 与 FPGrowth 对比

由于FP-Growth不需要显式地创建候选集，因此它的速度可以比替代的Apriori算法快几个数量级。例如，以下单元格比较了Apriori算法和FP-Growth的性能——即使在这个非常简单的玩具数据集场景中，FP-Growth的速度也快了大约5倍。

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

from mlxtend.frequent_patterns import apriori

%timeit -n 100 -r 10 apriori(df, min_support=0.6)

850 µs ± 39.3 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit -n 100 -r 10 apriori(df, min_support=0.6, low_memory=True)

941 µs ± 30.6 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

from mlxtend.frequent_patterns import fpgrowth

%timeit -n 100 -r 10 fpgrowth(df, min_support=0.6)

320 µs ± 9.21 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

API

fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)

Get frequent itemsets from a one-hot DataFrame

Parameters

df : pandas DataFrame

pandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.

Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

The allowed values are either 0/1 or True/False. For example,

Apple Bananas Beer Chicken Milk Rice 0 True False True True False True 1 True False True False False True 2 True False True False False False 3 True True False False False False 4 False False True True True True 5 False False True False True True 6 False False True False True False 7 True True False False False False
min_support : float (default: 0.5)

A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
use_colnames : bool (default: False)

If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
max_len : int (default: None)

Maximum length of the itemsets generated. If None (default) all possible itemsets lengths are evaluated.
verbose : int (default: 0)

Shows the stages of conditional tree generation.

Returns

pandas DataFrame with columns ['support', 'itemsets'] of all itemsets that are >= min_support and < than max_len (if max_len is not None). Each itemset in the 'itemsets' column is of type frozenset, which is a Python built-in type that behaves similarly to sets except that it is immutable (For more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/

fpgrowth：通过FP-growth算法获取频繁项集

概述

参考文献

相关

示例 1 -- 生成频繁项集

示例 2 -- Apriori 与 FPGrowth 对比

更多示例

API