TransactionEncoder

TransactionEncoder: Convert item lists into transaction data for frequent itemset mining

Encoder class for transaction data in Python lists

> from mlxtend.preprocessing import TransactionEncoder

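Overview

Encodes database transaction data, given as a Python list of lists, into a NumPy array.

Example 1

Suppose we have the following transaction data: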

from mlxtend.preprocessing import TransactionEncoder

dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

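Using a TransactionEncoder object, we can transform this dataset into an array format suitable for typical machine learning APIs. Via the fit method, the TransactionEncoder learns the unique labels in the dataset, and via the transform method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy boolean array: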

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
te_ary

array([[ True, False,  True,  True, False,  True],
       [ True, False,  True, False, False,  True],
       [ True, False,  True, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False,  True,  True,  True,  True],
       [False, False,  True, False,  True,  True],
       [False, False,  True, False,  True, False],
       [ True,  True, False, False, False, False]], dtype=bool)
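To make the two steps concrete, here is a minimal conceptual sketch of what fit and transform do, written with plain Python and NumPy. It is an illustration only, not mlxtend's actual implementation; the names columns, col_index, and encoded are hypothetical.

```python
import numpy as np

dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

# "fit" step: collect the unique item labels and sort them alphabetically
columns = sorted({item for transaction in dataset for item in transaction})
# Map each label to its column index
col_index = {item: idx for idx, item in enumerate(columns)}

# "transform" step: one row per transaction, one boolean column per item
encoded = np.zeros((len(dataset), len(columns)), dtype=bool)
for row, transaction in enumerate(dataset):
    for item in transaction:
        encoded[row, col_index[item]] = True

print(columns)
print(encoded)
```

Each row marks with True the columns whose items occur in the corresponding transaction, which is the layout of the array shown above.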

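The NumPy array is boolean for the sake of memory efficiency when working with large datasets. If a classic integer representation is desired instead, we can convert the array to the appropriate type: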

te_ary.astype("int")

array([[1, 0, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0]])
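The memory difference can be checked directly with the nbytes attribute of a NumPy array. The sketch below rebuilds the 8 x 6 example array by hand and assumes a 64-bit integer type for the comparison; the integer width that astype("int") actually produces depends on the platform and NumPy version.

```python
import numpy as np

# The example array from above, written out as 64-bit integers
int_ary = np.array([[1, 0, 1, 1, 0, 1],
                    [1, 0, 1, 0, 0, 1],
                    [1, 0, 1, 0, 0, 0],
                    [1, 1, 0, 0, 0, 0],
                    [0, 0, 1, 1, 1, 1],
                    [0, 0, 1, 0, 1, 1],
                    [0, 0, 1, 0, 1, 0],
                    [1, 1, 0, 0, 0, 0]], dtype=np.int64)

# The boolean version stores one byte per cell instead of eight
bool_ary = int_ary.astype(bool)

print(bool_ary.nbytes)  # 48 cells * 1 byte
print(int_ary.nbytes)   # 48 cells * 8 bytes
```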

After fitting, the unique column names that correspond to the data array shown above can be accessed via the columns_ attribute:

te.columns_

['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']

For convenience, we can turn the encoded array into a pandas DataFrame:

import pandas as pd

pd.DataFrame(te_ary, columns=te.columns_)

	Apple	Bananas	Beer	Chicken	Milk	Rice
0	True	False	True	True	False	True
1	True	False	True	False	False	True
2	True	False	True	False	False	False
3	True	True	False	False	False	False
4	False	False	True	True	True	True
5	False	False	True	False	True	True
6	False	False	True	False	True	False
7	True	True	False	False	False	False

If desired, we can turn the one-hot encoded array back into a list of transaction lists via the inverse_transform method:

first4 = te_ary[:4]
te.inverse_transform(first4)

[['Apple', 'Beer', 'Chicken', 'Rice'],
 ['Apple', 'Beer', 'Rice'],
 ['Apple', 'Beer'],
 ['Apple', 'Bananas']]

API

TransactionEncoder()

Encoder class for transaction data in Python lists

Parameters

None

Attributes

columns_ : list

List of unique names in the X input list of lists

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/

Methods

fit(X)

Learn unique column names from the transaction data.

Parameters

X : list of lists

A python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction.

For example,

    [['Apple', 'Beer', 'Rice', 'Chicken'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Bananas'],
    ['Milk', 'Beer', 'Rice', 'Chicken'],
    ['Milk', 'Beer', 'Rice'],
    ['Milk', 'Beer'],
    ['Apple', 'Bananas']]

fit_transform(X, sparse=False)

Fit a TransactionEncoder encoder and transform a dataset.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(array)

Transforms an encoded NumPy array back into transactions.

Parameters

array : NumPy array [n_transactions, n_unique_items]

The NumPy one-hot encoded boolean array of the input transactions, where the columns represent the unique items found in the input array in alphabetic order

For example,

    array([[True , False, True , True , False, True ],
    [True , False, True , False, False, True ],
    [True , False, True , False, False, False],
    [True , True , False, False, False, False],
    [False, False, True , True , True , True ],
    [False, False, True , False, True , True ],
    [False, False, True , False, True , False],
    [True , True , False, False, False, False]])

The corresponding column labels are available as self.columns_,
e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']

Returns

X : list of lists

A python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction.

For example (items within each transaction follow the column order),

    [['Apple', 'Beer', 'Chicken', 'Rice'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Bananas'],
    ['Beer', 'Chicken', 'Milk', 'Rice'],
    ['Beer', 'Milk', 'Rice'],
    ['Beer', 'Milk'],
    ['Apple', 'Bananas']]
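Conceptually, decoding amounts to picking, for each row, the column labels whose cells are True. The following is a minimal sketch of that idea in plain Python, not mlxtend's actual implementation; the names columns, encoded, and decoded are hypothetical.

```python
# Column labels, as they would appear in columns_ after fitting
columns = ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']

# First four rows of the one-hot encoded example array
encoded = [[True, False, True, True, False, True],
           [True, False, True, False, False, True],
           [True, False, True, False, False, False],
           [True, True, False, False, False, False]]

# For each row, keep the labels of the columns that are set to True
decoded = [[name for name, flag in zip(columns, row) if flag]
           for row in encoded]

print(decoded)
```

Because the labels are read off left to right, the items in each decoded transaction appear in column order.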

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self

transform(X, sparse=False)

Transform transactions into a one-hot encoded NumPy array.

Parameters

X : list of lists

A python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction.

For example,

    [['Apple', 'Beer', 'Rice', 'Chicken'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Bananas'],
    ['Milk', 'Beer', 'Rice', 'Chicken'],
    ['Milk', 'Beer', 'Rice'],
    ['Milk', 'Beer'],
    ['Apple', 'Bananas']]

sparse : bool (default=False)

If True, transform will return a Compressed Sparse Row matrix instead of the regular NumPy array.

Returns

array : NumPy array [n_transactions, n_unique_items] if sparse=False (default), Compressed Sparse Row matrix otherwise

The one-hot encoded boolean array of the input transactions, where the columns represent the unique items found in the input array in alphabetic order. The exact representation depends on the sparse argument.

For example,

    array([[True , False, True , True , False, True ],
    [True , False, True , False, False, True ],
    [True , False, True , False, False, False],
    [True , True , False, False, False, False],
    [False, False, True , True , True , True ],
    [False, False, True , False, True , True ],
    [False, False, True , False, True , False],
    [True , True , False, False, False, False]])

The corresponding column labels are available as self.columns_,
e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
