事务编码器:将商品列表转换为频繁商品集挖掘的事务数据

在Python列表中用于交易数据的编码器类

> from mlxtend.preprocessing import TransactionEncoder

概述

将数据库事务数据以Python列表的形式编码为NumPy数组。

示例 1

假设我们有以下交易数据:

from mlxtend.preprocessing import TransactionEncoder

dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

使用 TransactionEncoder 对象,我们可以将该数据集转换为适合典型机器学习 API 的数组格式。通过 fit 方法,TransactionEncoder 学习数据集中的唯一标签,通过 transform 方法,它将输入数据集(一个 Python 列表的列表)转换为一个独热编码的 NumPy 布尔数组:

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
te_ary

array([[ True, False,  True,  True, False,  True],
       [ True, False,  True, False, False,  True],
       [ True, False,  True, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False,  True,  True,  True,  True],
       [False, False,  True, False,  True,  True],
       [False, False,  True, False,  True, False],
       [ True,  True, False, False, False, False]], dtype=bool)

NumPy数组是布尔值类型,以便在处理大型数据集时提高内存效率。如果想要经典的整数表示,可以将数组转换为适当的类型:

te_ary.astype("int")

array([[1, 0, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0]])

拟合后,可以通过 columns_ 属性访问与上面显示的数据数组对应的独特列名:

te.columns_

['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']

为了方便我们,可以将编码后的数组转换为一个 pandas DataFrame

import pandas as pd

pd.DataFrame(te_ary, columns=te.columns_)

Apple Bananas Beer Chicken Milk Rice
0 True False True True False True
1 True False True False False True
2 True False True False False False
3 True True False False False False
4 False False True True True True
5 False False True False True True
6 False False True False True False
7 True True False False False False

如果我们愿意,可以通过inverse_transform函数将独热编码数组转换回交易列表的列表:

first4 = te_ary[:4]
te.inverse_transform(first4)

[['Apple', 'Beer', 'Chicken', 'Rice'],
 ['Apple', 'Beer', 'Rice'],
 ['Apple', 'Beer'],
 ['Apple', 'Bananas']]

API

TransactionEncoder()

Encoder class for transaction data in Python lists

Parameters

None

Attributes

columns_: list List of unique names in the X input list of lists

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/

Methods


fit(X)

Learn unique column names from transaction DataFrame

Parameters


fit_transform(X, sparse=False)

Fit a TransactionEncoder encoder and transform a dataset.


get_params(deep=True)

Get parameters for this estimator.

Parameters

Returns


inverse_transform(array)

Transforms an encoded NumPy array back into transactions.

Parameters

    array([[True , False, True , True , False, True ],
    [True , False, True , False, False, True ],
    [True , False, True , False, False, False],
    [True , True , False, False, False, False],
    [False, False, True , True , True , True ],
    [False, False, True , False, True , True ],
    [False, False, True , False, True , False],
    [True , True , False, False, False, False]])
The corresponding column labels are available as self.columns_,
e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']

Returns

    [['Apple', 'Beer', 'Rice', 'Chicken'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Bananas'],
    ['Milk', 'Beer', 'Rice', 'Chicken'],
    ['Milk', 'Beer', 'Rice'],
    ['Milk', 'Beer'],
    ['Apple', 'Bananas']]

set_params(params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self


transform(X, sparse=False)

Transform transactions into a one-hot encoded NumPy array.

Parameters

Returns

ython