Representing Data with EntitySets

An EntitySet is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet so you can more easily manipulate your data as needed.
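
Many Featuretools entry points, such as ft.dfs, also accept dataframes and relationships directly. A minimal sketch of that form, built on the same mock data this page loads below:

import featuretools as ft

# Each dataframe is keyed by name and paired with its index column and,
# optionally, its time index column.
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]

dataframes = {
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
    "products": (products_df, "product_id"),
}
relationships = [("products", "product_id", "transactions", "product_id")]

feature_matrix, feature_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="products",
)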

The Raw Data

Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:

[1]:
import featuretools as ft


data = ft.demo.load_mock_customer()

transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])


transactions_df.sample(10)

[1]:
transaction_id session_id transaction_time product_id amount customer_id device session_start zip_code join_date birthday
264 40 20 2014-01-01 04:46:00 5 53.22 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
19 370 2 2014-01-01 00:20:35 1 106.99 5 mobile 2014-01-01 00:17:20 60091 2010-07-17 05:27:50 1984-07-28
314 186 23 2014-01-01 05:40:10 5 128.26 3 desktop 2014-01-01 05:32:35 13244 2011-08-13 15:42:34 2003-11-21
290 380 21 2014-01-01 05:14:10 5 57.09 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
379 261 28 2014-01-01 06:50:35 1 133.71 5 mobile 2014-01-01 06:50:35 60091 2010-07-17 05:27:50 1984-07-28
335 68 25 2014-01-01 06:02:55 1 26.30 3 desktop 2014-01-01 05:59:40 13244 2011-08-13 15:42:34 2003-11-21
293 236 21 2014-01-01 05:17:25 5 69.62 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
271 303 20 2014-01-01 04:53:35 3 78.87 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
404 147 29 2014-01-01 07:17:40 4 11.62 1 mobile 2014-01-01 07:10:05 60091 2011-04-17 10:48:33 1994-07-18
179 176 12 2014-01-01 03:13:55 2 143.96 4 desktop 2014-01-01 03:04:10 60091 2011-04-08 20:08:14 2006-08-15

The second dataframe is a list of products involved in those transactions.

[2]:
products_df = data["products"]
products_df

[2]:
product_id brand
0 1 B
1 2 B
2 3 B
3 4 B
4 5 A

Creating an EntitySet

First, we initialize an EntitySet. If you'd like to give it a name, you can optionally provide an id to its constructor.

[3]:
es = ft.EntitySet(id="customer_data")

Adding dataframes

To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:

  • The index parameter specifies the column that uniquely identifies rows in the dataframe.

  • The time_index parameter tells Featuretools when the data was created.

  • The logical_types parameter indicates that "product_id" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.

[4]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
    },
)

es

[4]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships

Note

You can also use a setter on the EntitySet object to add dataframes

es["transactions"] = transactions_df

Note that this will use the default implementation of add_dataframe, notably the following:

  • if the DataFrame does not have Woodwork initialized, the first column will be the index column

  • if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork

  • if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe, as in the sketch below
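
A minimal sketch of that last case, assuming a fresh copy of the raw transactions table from the mock dataset:

import featuretools as ft
from woodwork.logical_types import Categorical

# Initialize Woodwork first so the setter keeps our index, time index,
# and logical types instead of falling back to the defaults above.
df = ft.demo.load_mock_customer()["transactions"]
df.ww.init(
    index="transaction_id",
    time_index="transaction_time",
    logical_types={"product_id": Categorical},
)

es2 = ft.EntitySet(id="setter_example")  # hypothetical name for illustration
es2["transactions"] = df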

Note

You can also display your EntitySet structure graphically by calling EntitySet.plot().
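
For example (a minimal sketch; plotting requires the optional graphviz dependency):

# Render the dataframes and relationships as a diagram; to_file writes
# the image to disk instead of displaying it inline.
es.plot()
es.plot(to_file="entityset.png")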

This method associates each column in the dataframe to a Woodwork logical type. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you don't specify a logical type for a column, it gets inferred from the underlying data. Logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the Woodwork documentation.

[5]:
es["transactions"].ww.schema

[5]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category']
amount Double ['numeric']
customer_id Integer ['numeric']
device Categorical ['category']
session_start Datetime []
zip_code PostalCode ['category']
join_date Datetime []
birthday Datetime []

Now, we can do the same for our products dataframe.

[6]:
es = es.add_dataframe(
    dataframe_name="products", dataframe=products_df, index="product_id"
)

es

[6]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

With two dataframes in our EntitySet, we can add a relationship between them.

Adding a Relationship

We want to relate these two dataframes by the columns called "product_id" in each one. Each product has multiple transactions associated with it, so it is called the parent dataframe, while the transactions dataframe is known as the child dataframe. When specifying a relationship, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a one-to-one or many-to-many relationship.

[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es

[7]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

Now we see the relationship has been added to our EntitySet.

Creating a dataframe from an existing table

When working with raw data, it is common to have enough information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we "normalize" the transactions dataframe.

[8]:
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index="session_start",
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)
es

[8]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

From the output above, we can see that this method performed two operations:

  1. It created a new dataframe called "sessions" based on the "session_id" and "session_start" columns in "transactions"

  2. It added a relationship connecting "transactions" and "sessions"

If we look at the schema from the "transactions" dataframe and the new "sessions" dataframe, we see two more operations that were automatically performed:

[9]:
es["transactions"].ww.schema

[9]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['foreign_key', 'numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category', 'foreign_key']
amount Double ['numeric']
birthday Datetime []
[10]:
es["sessions"].ww.schema

[10]:
Logical Type Semantic Tag(s)
Column
session_id Integer ['index']
device Categorical ['category']
customer_id Integer ['numeric']
zip_code PostalCode ['category']
session_start Datetime ['time_index']
join_date Datetime []
  1. It removed "device", "customer_id", "zip_code" and "join_date" from "transactions" and created new columns in the sessions dataframe. This reduces redundant information, as these properties of a session do not change between transactions.

  2. It copied and marked "session_start" as a time index column in the new sessions dataframe to indicate the beginning of a session. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe will create a time index for the new dataframe. In this case it would create a new time index called "first_transactions_time" using the time of the first transaction of each session. If we don't want this time index to be created, we can set make_time_index=False, as in the sketch below.
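
A minimal sketch of that option, assuming a fresh EntitySet built from the same mock data (so that the base dataframe still contains every column):

import featuretools as ft

data = ft.demo.load_mock_customer()
df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

es_alt = ft.EntitySet(id="no_time_index_example")  # hypothetical name
es_alt = es_alt.add_dataframe(
    dataframe_name="transactions",
    dataframe=df,
    index="transaction_id",
    time_index="transaction_time",
)
# With make_time_index=False, "sessions" gets no time index and no
# "first_transactions_time" column is created.
es_alt = es_alt.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index=False,
    additional_columns=["device", "customer_id", "zip_code", "session_start", "join_date"],
)

If we look at the dataframes, we can see what normalize_dataframe did to the actual data.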

[11]:
es["sessions"].head(5)

[11]:
session_id device customer_id zip_code session_start join_date
1 1 desktop 2 13244 2014-01-01 00:00:00 2012-04-15 23:31:04
2 2 mobile 5 60091 2014-01-01 00:17:20 2010-07-17 05:27:50
3 3 mobile 4 60091 2014-01-01 00:28:10 2011-04-08 20:08:14
4 4 mobile 1 60091 2014-01-01 00:44:25 2011-04-17 10:48:33
5 5 mobile 4 60091 2014-01-01 01:11:30 2011-04-08 20:08:14
[12]:
es["transactions"].head(5)

[12]:
transaction_id session_id transaction_time product_id amount birthday
298 298 1 2014-01-01 00:00:00 5 127.64 1986-08-18
2 2 1 2014-01-01 00:01:05 2 109.48 1986-08-18
308 308 1 2014-01-01 00:02:10 3 95.06 1986-08-18
116 116 1 2014-01-01 00:03:15 4 78.92 1986-08-18
371 371 1 2014-01-01 00:04:20 3 31.54 1986-08-18

To finish preparing this dataset, create a dataframe called "customers" with the same method call.

[13]:
es = es.normalize_dataframe(
    base_dataframe_name="sessions",
    new_dataframe_name="customers",
    index="customer_id",
    make_time_index="join_date",
    additional_columns=["zip_code", "join_date"],
)

es

[13]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
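
We can also verify the structure programmatically; the relationships attribute lists the Relationship objects added so far:

# Each Relationship links a child (many) column to its parent (one) column.
for relationship in es.relationships:
    print(relationship)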

Using the EntitySet

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.

[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")

feature_matrix

[14]:
COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) MODE(transactions.DAY(birthday)) MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(birthday)) ... MODE(transactions.sessions.device) NUM_UNIQUE(transactions.DAY(birthday)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(birthday)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(birthday)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(birthday)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.sessions.device)
product_id
1 102 149.56 73.429314 6.84 0.125525 42.479989 7489.79 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
2 92 149.95 76.319891 5.73 0.151934 46.336308 7021.43 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
3 96 148.31 73.001250 5.89 0.223938 38.871405 7008.12 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
4 106 146.46 76.311038 5.81 -0.132077 42.492501 8088.97 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
5 104 149.02 76.264904 5.91 0.098248 42.131902 7931.55 18 1 7 ... mobile 4 1 3 1 4 1 5 1 3

5 rows × 25 columns

As we can see, the features from DFS use the relational structure of our EntitySet. Therefore it is important to think carefully about the dataframes that we create.
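
The same EntitySet can target any of its dataframes. For instance, a sketch that builds customer-level features instead, with aggregations stacking across both the transactions -> sessions and sessions -> customers relationships:

feature_matrix_customers, feature_defs_customers = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
)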