Representing Data with EntitySets

An EntitySet is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet so you can more easily manipulate your data as needed.
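
Many Featuretools entry points, such as ft.dfs, also accept dataframes and relationships directly. A minimal sketch of that form, built on the same mock data this page loads below:

import featuretools as ft

# Each dataframe is keyed by name and paired with its index column and,
# optionally, its time index column.
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]

dataframes = {
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
    "products": (products_df, "product_id"),
}
relationships = [("products", "product_id", "transactions", "product_id")]

feature_matrix, feature_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="products",
)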

The Raw Data

Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:

[1]:
import featuretools as ft


data = ft.demo.load_mock_customer()

transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])


transactions_df.sample(10)

[1]:
transaction_id session_id transaction_time product_id amount customer_id device session_start zip_code join_date birthday
264 40 20 2014-01-01 04:46:00 5 53.22 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
19 370 2 2014-01-01 00:20:35 1 106.99 5 mobile 2014-01-01 00:17:20 60091 2010-07-17 05:27:50 1984-07-28
314 186 23 2014-01-01 05:40:10 5 128.26 3 desktop 2014-01-01 05:32:35 13244 2011-08-13 15:42:34 2003-11-21
290 380 21 2014-01-01 05:14:10 5 57.09 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
379 261 28 2014-01-01 06:50:35 1 133.71 5 mobile 2014-01-01 06:50:35 60091 2010-07-17 05:27:50 1984-07-28
335 68 25 2014-01-01 06:02:55 1 26.30 3 desktop 2014-01-01 05:59:40 13244 2011-08-13 15:42:34 2003-11-21
293 236 21 2014-01-01 05:17:25 5 69.62 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
271 303 20 2014-01-01 04:53:35 3 78.87 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
404 147 29 2014-01-01 07:17:40 4 11.62 1 mobile 2014-01-01 07:10:05 60091 2011-04-17 10:48:33 1994-07-18
179 176 12 2014-01-01 03:13:55 2 143.96 4 desktop 2014-01-01 03:04:10 60091 2011-04-08 20:08:14 2006-08-15

The second dataframe is a list of products involved in those transactions.

[2]:
products_df = data["products"]
products_df

[2]:
product_id brand
0 1 B
1 2 B
2 3 B
3 4 B
4 5 A

Creating an EntitySet

First, we initialize an EntitySet. If you'd like to give it a name, you can optionally provide an id to its constructor.

[3]:
es = ft.EntitySet(id="customer_data")

Adding dataframes

To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:

  • The index parameter specifies the column that uniquely identifies rows in the dataframe.

  • The time_index parameter tells Featuretools when the data was created.

  • The logical_types parameter indicates that "product_id" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.

[4]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
    },
)

es

[4]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships

Note

You can also use a setter on the EntitySet object to add dataframes

es["transactions"] = transactions_df

Note that this will use the default implementation of add_dataframe, notably the following:

  • if the DataFrame does not have Woodwork initialized, the first column will be the index column

  • if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork

  • if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe, as in the sketch below
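
A minimal sketch of that last case, assuming a fresh copy of the raw transactions table from the mock dataset:

import featuretools as ft
from woodwork.logical_types import Categorical

# Initialize Woodwork first so the setter keeps our index, time index,
# and logical types instead of falling back to the defaults above.
df = ft.demo.load_mock_customer()["transactions"]
df.ww.init(
    index="transaction_id",
    time_index="transaction_time",
    logical_types={"product_id": Categorical},
)

es2 = ft.EntitySet(id="setter_example")  # hypothetical name for illustration
es2["transactions"] = df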

Note

You can also display your EntitySet structure graphically by calling EntitySet.plot().
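
For example (a minimal sketch; plotting requires the optional graphviz dependency):

# Render the dataframes and relationships as a diagram; to_file writes
# the image to disk instead of displaying it inline.
es.plot()
es.plot(to_file="entityset.png")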

This method associates each column in the dataframe to a Woodwork logical type. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you don't specify a logical type for a column, it gets inferred from the underlying data. Logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the Woodwork documentation.

[5]:
es["transactions"].ww.schema

[5]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category']
amount Double ['numeric']
customer_id Integer ['numeric']
device Categorical ['category']
session_start Datetime []
zip_code PostalCode ['category']
join_date Datetime []
birthday Datetime []

Now, we can do the same for our products dataframe.

[6]:
es = es.add_dataframe(
    dataframe_name="products", dataframe=products_df, index="product_id"
)

es

[6]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

With two dataframes in our EntitySet, we can add a relationship between them.

Adding a Relationship

We want to relate these two dataframes by the columns called "product_id" in each one. Each product has multiple transactions associated with it, so it is called the parent dataframe, while the transactions dataframe is known as the child dataframe. When specifying a relationship, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a one-to-one or many-to-many relationship.

[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es

[7]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

Now we see the relationship has been added to our EntitySet.

Creating a dataframe from an existing table

When working with raw data, it is common to have enough information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we "normalize" the transactions dataframe.

[8]:
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index="session_start",
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)
es

[8]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

From the output above, we can see that this method performed two operations:

  1. It created a new dataframe called "sessions" based on the "session_id" and "session_start" columns in "transactions"

  2. It added a relationship connecting "transactions" and "sessions"

If we look at the schema from the "transactions" dataframe and the new "sessions" dataframe, we see two more operations that were automatically performed:

[9]:
es["transactions"].ww.schema

[9]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['foreign_key', 'numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category', 'foreign_key']
amount Double ['numeric']
birthday Datetime []
[10]:
es["sessions"].ww.schema

[10]:
Logical Type Semantic Tag(s)
Column
session_id Integer ['index']
device Categorical ['category']
customer_id Integer ['numeric']
zip_code PostalCode ['category']
session_start Datetime ['time_index']
join_date Datetime []
  1. It removed "device", "customer_id", "zip_code" and "join_date" from "transactions" and created new columns in the sessions dataframe. This reduces redundant information, as these properties of a session do not change between transactions.

  2. It copied and marked "session_start" as a time index column in the new sessions dataframe to indicate the beginning of a session. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe will create a time index for the new dataframe. In this case it would create a new time index called "first_transactions_time" using the time of the first transaction of each session. If we don't want this time index to be created, we can set make_time_index=False, as in the sketch below.
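
A minimal sketch of that option, assuming a fresh EntitySet built from the same mock data (so that the base dataframe still contains every column):

import featuretools as ft

data = ft.demo.load_mock_customer()
df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

es_alt = ft.EntitySet(id="no_time_index_example")  # hypothetical name
es_alt = es_alt.add_dataframe(
    dataframe_name="transactions",
    dataframe=df,
    index="transaction_id",
    time_index="transaction_time",
)
# With make_time_index=False, "sessions" gets no time index and no
# "first_transactions_time" column is created.
es_alt = es_alt.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index=False,
    additional_columns=["device", "customer_id", "zip_code", "session_start", "join_date"],
)

If we look at the dataframes, we can see what normalize_dataframe did to the actual data.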

[11]:
es["sessions"].head(5)

[11]:
session_id device customer_id zip_code session_start join_date
1 1 desktop 2 13244 2014-01-01 00:00:00 2012-04-15 23:31:04
2 2 mobile 5 60091 2014-01-01 00:17:20 2010-07-17 05:27:50
3 3 mobile 4 60091 2014-01-01 00:28:10 2011-04-08 20:08:14
4 4 mobile 1 60091 2014-01-01 00:44:25 2011-04-17 10:48:33
5 5 mobile 4 60091 2014-01-01 01:11:30 2011-04-08 20:08:14
[12]:
es["transactions"].head(5)

[12]:
transaction_id session_id transaction_time product_id amount birthday
298 298 1 2014-01-01 00:00:00 5 127.64 1986-08-18
2 2 1 2014-01-01 00:01:05 2 109.48 1986-08-18
308 308 1 2014-01-01 00:02:10 3 95.06 1986-08-18
116 116 1 2014-01-01 00:03:15 4 78.92 1986-08-18
371 371 1 2014-01-01 00:04:20 3 31.54 1986-08-18

To finish preparing this dataset, create a dataframe called "customers" with the same method call.

[13]:
es = es.normalize_dataframe(
    base_dataframe_name="sessions",
    new_dataframe_name="customers",
    index="customer_id",
    make_time_index="join_date",
    additional_columns=["zip_code", "join_date"],
)

es

[13]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
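
We can also verify the structure programmatically; the relationships attribute lists the Relationship objects added so far:

# Each Relationship links a child (many) column to its parent (one) column.
for relationship in es.relationships:
    print(relationship)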

Using the EntitySet

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.

[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")

feature_matrix

[14]:
COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) MODE(transactions.DAY(birthday)) MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(birthday)) ... MODE(transactions.sessions.device) NUM_UNIQUE(transactions.DAY(birthday)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(birthday)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(birthday)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(birthday)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.sessions.device)
product_id
1 102 149.56 73.429314 6.84 0.125525 42.479989 7489.79 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
2 92 149.95 76.319891 5.73 0.151934 46.336308 7021.43 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
3 96 148.31 73.001250 5.89 0.223938 38.871405 7008.12 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
4 106 146.46 76.311038 5.81 -0.132077 42.492501 8088.97 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
5 104 149.02 76.264904 5.91 0.098248 42.131902 7931.55 18 1 7 ... mobile 4 1 3 1 4 1 5 1 3

5 rows × 25 columns

As we can see, the features from DFS use the relational structure of our EntitySet. Therefore it is important to think carefully about the dataframes that we create.
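
The same EntitySet can target any of its dataframes. For instance, a sketch that builds customer-level features instead, with aggregations stacking across both the transactions -> sessions and sessions -> customers relationships:

feature_matrix_customers, feature_defs_customers = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
)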