Representing Data with EntitySets#
An EntitySet is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet, so you can more easily manipulate your data as needed.
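To see the difference between the two calling conventions, here is a rough sketch (the variable names transactions_df and products_df refer to the dataframes loaded in the next section, and the output is not shown here):

import featuretools as ft

# Option 1: pass dataframes and relationships as separate arguments.
# Each dataframe entry is (dataframe, index column[, time index column]).
dataframes = {
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
    "products": (products_df, "product_id"),
}
relationships = [("products", "product_id", "transactions", "product_id")]
fm, defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="products",
)

# Option 2 (used in the rest of this guide): bundle everything into an
# EntitySet first, then hand that single object to ft.dfs.
# fm, defs = ft.dfs(entityset=es, target_dataframe_name="products")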
The Raw Data#
Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers, so that the result looks like something you might see in a log file:
[1]:
import featuretools as ft
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
transactions_df.sample(10)
[1]:
| | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | birthday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 264 | 40 | 20 | 2014-01-01 04:46:00 | 5 | 53.22 | 5 | desktop | 2014-01-01 04:46:00 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
| 19 | 370 | 2 | 2014-01-01 00:20:35 | 1 | 106.99 | 5 | mobile | 2014-01-01 00:17:20 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
| 314 | 186 | 23 | 2014-01-01 05:40:10 | 5 | 128.26 | 3 | desktop | 2014-01-01 05:32:35 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
| 290 | 380 | 21 | 2014-01-01 05:14:10 | 5 | 57.09 | 4 | desktop | 2014-01-01 05:02:15 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 379 | 261 | 28 | 2014-01-01 06:50:35 | 1 | 133.71 | 5 | mobile | 2014-01-01 06:50:35 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
| 335 | 68 | 25 | 2014-01-01 06:02:55 | 1 | 26.30 | 3 | desktop | 2014-01-01 05:59:40 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
| 293 | 236 | 21 | 2014-01-01 05:17:25 | 5 | 69.62 | 4 | desktop | 2014-01-01 05:02:15 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 271 | 303 | 20 | 2014-01-01 04:53:35 | 3 | 78.87 | 5 | desktop | 2014-01-01 04:46:00 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
| 404 | 147 | 29 | 2014-01-01 07:17:40 | 4 | 11.62 | 1 | mobile | 2014-01-01 07:10:05 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 179 | 176 | 12 | 2014-01-01 03:13:55 | 2 | 143.96 | 4 | desktop | 2014-01-01 03:04:10 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
The second dataframe is a list of the products involved in those transactions.
[2]:
products_df = data["products"]
products_df
[2]:
| | product_id | brand |
|---|---|---|
| 0 | 1 | B |
| 1 | 2 | B |
| 2 | 3 | B |
| 3 | 4 | B |
| 4 | 5 | A |
Creating an EntitySet#
First, we initialize an EntitySet. If you'd like to give it a name, you can optionally provide an id to the constructor.
[3]:
es = ft.EntitySet(id="customer_data")
Adding dataframes#
To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:

* The index parameter specifies the column that uniquely identifies rows in the dataframe.
* The time_index parameter tells Featuretools when the data was created.
* The logical_types parameter indicates that "product_id" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.
[4]:
from woodwork.logical_types import Categorical, PostalCode
es = es.add_dataframe(
dataframe_name="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
logical_types={
"product_id": Categorical,
"zip_code": PostalCode,
},
)
es
[4]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
Relationships:
No relationships
Note

You can also use a setter on the EntitySet object to add dataframes:

es["transactions"] = transactions_df

Note that this will use the default implementation of add_dataframe, notably the following:

* if the DataFrame does not have Woodwork initialized, the first column will be the index column
* if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork
* if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe (see the sketch below)
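For instance, a minimal sketch of that last point, initializing Woodwork directly on the dataframe before using the setter (same column choices as the add_dataframe call above):

from woodwork.logical_types import Categorical, PostalCode

# Initialize Woodwork first to control the index, time index,
# and logical types, then add the dataframe via the setter.
transactions_df.ww.init(
    index="transaction_id",
    time_index="transaction_time",
    logical_types={"product_id": Categorical, "zip_code": PostalCode},
)
es["transactions"] = transactions_df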
Note
You can also display your EntitySet structure graphically by calling EntitySet.plot().
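For example (a one-line sketch; rendering the diagram relies on the optional graphviz dependency being installed):

es.plot()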
This method associates each column in the dataframe with a logical type from Woodwork. Each logical type can have an associated standard semantic tag that helps define the column's data type. If you don't specify a logical type for a column, it gets inferred from the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the Woodwork documentation.
[5]:
es["transactions"].ww.schema
[5]:
| Column | Logical Type | Semantic Tag(s) |
|---|---|---|
| transaction_id | Integer | ['index'] |
| session_id | Integer | ['numeric'] |
| transaction_time | Datetime | ['time_index'] |
| product_id | Categorical | ['category'] |
| amount | Double | ['numeric'] |
| customer_id | Integer | ['numeric'] |
| device | Categorical | ['category'] |
| session_start | Datetime | [] |
| zip_code | PostalCode | ['category'] |
| join_date | Datetime | [] |
| birthday | Datetime | [] |
Now, we can do the same for our products dataframe.
[6]:
es = es.add_dataframe(
dataframe_name="products", dataframe=products_df, index="product_id"
)
es
[6]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
products [Rows: 5, Columns: 2]
Relationships:
No relationships
With two dataframes in our EntitySet, we can add a relationship between them.
Adding a Relationship#
We want to relate these two dataframes by the column called "product_id" in each one. Because each product has multiple transactions associated with it, products is called the parent dataframe, while transactions is known as the child dataframe. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship, not a one-to-one or many-to-many relationship.
[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es
[7]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
products [Rows: 5, Columns: 2]
Relationships:
transactions.product_id -> products.product_id
Creating a dataframe from an existing table#

When working with raw data, it is common to have enough information to justify the creation of new dataframes. In order to create a new dataframe and a relationship for sessions, we "normalize" the transactions dataframe and add the new dataframe to the EntitySet.

[8]:
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
make_time_index="session_start",
additional_columns=[
"device",
"customer_id",
"zip_code",
"session_start",
"join_date",
],
)
es
[8]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 6]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
From the output above, we can see that this method performed two operations:

1. Created a new dataframe called "sessions" based on the "session_id" and "session_start" columns in "transactions"
2. Added a relationship connecting "transactions" and "sessions"

If we look at the schemas of the "transactions" dataframe and the new "sessions" dataframe, we can see two more operations that were performed automatically:
[9]:
es["transactions"].ww.schema
[9]:
| Column | Logical Type | Semantic Tag(s) |
|---|---|---|
| transaction_id | Integer | ['index'] |
| session_id | Integer | ['foreign_key', 'numeric'] |
| transaction_time | Datetime | ['time_index'] |
| product_id | Categorical | ['category', 'foreign_key'] |
| amount | Double | ['numeric'] |
| birthday | Datetime | [] |
[10]:
es["sessions"].ww.schema
[10]:
| Column | Logical Type | Semantic Tag(s) |
|---|---|---|
| session_id | Integer | ['index'] |
| device | Categorical | ['category'] |
| customer_id | Integer | ['numeric'] |
| zip_code | PostalCode | ['category'] |
| session_start | Datetime | ['time_index'] |
| join_date | Datetime | [] |
3. Removed "device", "customer_id", "zip_code", and "join_date" from "transactions" and created new columns in the sessions dataframe. This reduces redundant information, as these properties of a session do not change between transactions.
4. Copied and marked "session_start" as a time index column in the new sessions dataframe to indicate the beginning of a session. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe will create a time index for the new dataframe. In this case, it would create a new time index called "first_transactions_time" using the time of the first transaction of each session. If we don't want this time index to be created, we can set make_time_index=False, as sketched after the next two cells.

If we look at the dataframes, we can see what normalize_dataframe actually did to the data.
[11]:
es["sessions"].head(5)
[11]:
| | session_id | device | customer_id | zip_code | session_start | join_date |
|---|---|---|---|---|---|---|
| 1 | 1 | desktop | 2 | 13244 | 2014-01-01 00:00:00 | 2012-04-15 23:31:04 |
| 2 | 2 | mobile | 5 | 60091 | 2014-01-01 00:17:20 | 2010-07-17 05:27:50 |
| 3 | 3 | mobile | 4 | 60091 | 2014-01-01 00:28:10 | 2011-04-08 20:08:14 |
| 4 | 4 | mobile | 1 | 60091 | 2014-01-01 00:44:25 | 2011-04-17 10:48:33 |
| 5 | 5 | mobile | 4 | 60091 | 2014-01-01 01:11:30 | 2011-04-08 20:08:14 |
[12]:
es["transactions"].head(5)
[12]:
| | transaction_id | session_id | transaction_time | product_id | amount | birthday |
|---|---|---|---|---|---|---|
| 298 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 1986-08-18 |
| 2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 1986-08-18 |
| 308 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 1986-08-18 |
| 116 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 1986-08-18 |
| 371 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 1986-08-18 |
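As mentioned above, if the automatic time index is not wanted, the normalization can pass make_time_index=False. A hypothetical variant of the earlier call (shown only as a sketch; it is not executed in this guide):

es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index=False,  # do not create a time index on the new dataframe
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)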
To finish preparing this dataset, create a "customers" dataframe using the same method call.
[13]:
es = es.normalize_dataframe(
base_dataframe_name="sessions",
new_dataframe_name="customers",
index="customer_id",
make_time_index="join_date",
additional_columns=["zip_code", "join_date"],
)
es
[13]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 4]
customers [Rows: 5, Columns: 3]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
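At this point the EntitySet contains three relationships, which can also be inspected programmatically (a small sketch using the EntitySet's relationships attribute):

# Print each parent -> child relationship in the EntitySet.
for relationship in es.relationships:
    print(relationship)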
Using the EntitySet#
Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.
[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")
feature_matrix
[14]:
| product_id | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | SKEW(transactions.amount) | STD(transactions.amount) | SUM(transactions.amount) | MODE(transactions.DAY(birthday)) | MODE(transactions.DAY(transaction_time)) | MODE(transactions.MONTH(birthday)) | ... | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.DAY(birthday)) | NUM_UNIQUE(transactions.DAY(transaction_time)) | NUM_UNIQUE(transactions.MONTH(birthday)) | NUM_UNIQUE(transactions.MONTH(transaction_time)) | NUM_UNIQUE(transactions.WEEKDAY(birthday)) | NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) | NUM_UNIQUE(transactions.YEAR(birthday)) | NUM_UNIQUE(transactions.YEAR(transaction_time)) | NUM_UNIQUE(transactions.sessions.device) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 102 | 149.56 | 73.429314 | 6.84 | 0.125525 | 42.479989 | 7489.79 | 18 | 1 | 7 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
| 2 | 92 | 149.95 | 76.319891 | 5.73 | 0.151934 | 46.336308 | 7021.43 | 18 | 1 | 8 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
| 3 | 96 | 148.31 | 73.001250 | 5.89 | 0.223938 | 38.871405 | 7008.12 | 18 | 1 | 8 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
| 4 | 106 | 146.46 | 76.311038 | 5.81 | -0.132077 | 42.492501 | 8088.97 | 18 | 1 | 7 | ... | desktop | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
| 5 | 104 | 149.02 | 76.264904 | 5.91 | 0.098248 | 42.131902 | 7931.55 | 18 | 1 | 7 | ... | mobile | 4 | 1 | 3 | 1 | 4 | 1 | 5 | 1 | 3 |
5 rows × 25 columns
As we can see, the features from DFS use the relational structure of our EntitySet. Therefore it is important to think carefully about the dataframes that we create.
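Because DFS follows these relationships when stacking primitives, both the target dataframe and the primitive selection shape the resulting features. As a rough sketch (the primitive lists here are illustrative choices, not from the run above), the same EntitySet could be used to build customer-level features instead:

# Aggregate transaction and session information up to the customer level,
# restricting DFS to a small set of primitives.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "mean"],  # aggregations across child dataframes
    trans_primitives=["month"],        # transforms within a dataframe
    max_depth=2,                       # how far to stack across relationships
)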