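Deep Feature Synthesis#
Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.
Input Data#
Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.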
Note
Before using DFS, it is recommended that you prepare your data as an EntitySet. See Representing Data with EntitySets to learn how.
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es
[1]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 3]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Once the data is prepared as an EntitySet, we are ready to automatically generate features for a target dataframe (e.g., customers).
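Running DFS#
Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer and apply different statistical functions, resulting in features that quantify the customer's behavior. In this example, an expert might be interested in features such as the total number of sessions or the month the customer signed up. DFS can generate these features when we specify the target dataframe as customers and "count" and "month" as primitives.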
[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count"],
    trans_primitives=["month"],
    max_depth=1,
)
feature_matrix
[2]:
customer_id | zip_code | COUNT(sessions) | MONTH(birthday) | MONTH(join_date) |
---|---|---|---|---|
5 | 60091 | 6 | 7 | 7 |
4 | 60091 | 8 | 8 | 4 |
1 | 60091 | 8 | 7 | 4 |
3 | 13244 | 6 | 11 | 8 |
2 | 13244 | 7 | 8 | 4 |
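In the example above, "count" is an aggregation primitive because it computes a single value based on many sessions related to one customer. "month" is called a transform primitive because it takes one value for a customer and transforms it into another.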
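To make the distinction concrete, here is a minimal hand-written sketch of what the two primitives compute. It uses only the Python standard library and a small hypothetical dataset (the `customers` and `sessions` values below are invented for illustration and are not the mock customer data). It is not how Featuretools implements these primitives internally.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data, not the mock customer dataset used above.
customers = {
    1: {"join_date": date(2011, 4, 17)},
    2: {"join_date": date(2012, 8, 13)},
}
sessions = [
    {"session_id": 10, "customer_id": 1},
    {"session_id": 11, "customer_id": 1},
    {"session_id": 12, "customer_id": 2},
]

# Aggregation: many child rows (sessions) collapse into one value per customer.
session_counts = Counter(s["customer_id"] for s in sessions)

# Transform: one value per customer is mapped to one new value per customer.
features = {
    customer_id: {
        "COUNT(sessions)": session_counts.get(customer_id, 0),
        "MONTH(join_date)": row["join_date"].month,
    }
    for customer_id, row in customers.items()
}

print(features)
```

The aggregation needs the relationship between sessions and customers, while the transform only looks at a column of the customers data itself.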
Note
Feature primitives are a fundamental component of Featuretools. To learn more, read Feature primitives.
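Creating “Deep Features”#
The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive, we increase the “depth” of a feature. The max_depth parameter controls the maximum depth of the features returned by DFS. Let's try running DFS with max_depth=2.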
[3]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix
[3]:
customer_id | zip_code | MODE(sessions.device) | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(birthday) | HOUR(join_date) | MONTH(birthday) | MONTH(join_date) | MEAN(sessions.MEAN(transactions.amount)) | MEAN(sessions.SUM(transactions.amount)) | MODE(sessions.HOUR(session_start)) | MODE(sessions.MODE(transactions.product_id)) | MODE(sessions.MONTH(session_start)) | SUM(sessions.MEAN(transactions.amount)) | MODE(transactions.sessions.device) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 60091 | mobile | 80.375443 | 5 | 6349.66 | 0 | 5 | 7 | 7 | 78.705187 | 1058.276667 | 0 | 3 | 1 | 472.231119 | mobile |
4 | 60091 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4 | 81.207189 | 1090.960000 | 1 | 1 | 1 | 649.657515 | mobile |
1 | 60091 | mobile | 71.631905 | 4 | 9025.62 | 0 | 10 | 7 | 4 | 72.774140 | 1128.202500 | 6 | 4 | 1 | 582.193117 | mobile |
3 | 13244 | desktop | 67.060430 | 1 | 6236.62 | 0 | 15 | 11 | 8 | 67.539577 | 1039.436667 | 5 | 1 | 1 | 405.237462 | desktop |
2 | 13244 | desktop | 77.422366 | 4 | 7200.28 | 0 | 23 | 8 | 4 | 78.415122 | 1028.611429 | 3 | 3 | 1 | 548.905851 | desktop |
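With a depth of 2, a number of features are generated using the supplied primitives. The algorithm that synthesizes these definitions is described in this paper. Let's take a closer look at one depth 2 feature in the returned feature matrix.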
[4]:
feature_matrix[["MEAN(sessions.SUM(transactions.amount))"]]
[4]:
customer_id | MEAN(sessions.SUM(transactions.amount)) |
---|---|
5 | 1058.276667 |
4 | 1090.960000 |
1 | 1128.202500 |
3 | 1039.436667 |
2 | 1028.611429 |
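For each customer, this feature
1. calculates the sum of all transaction amounts per session to get the total amount per session,
2. then applies the mean to the total amounts across multiple sessions to identify the average amount spent per session.
We call this feature a “deep feature” with a depth of 2. Let's look at another depth 2 feature, which calculates for every customer the most common hour of the day in which they start a session.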
[5]:
feature_matrix[["MODE(sessions.HOUR(session_start))"]]
[5]:
customer_id | MODE(sessions.HOUR(session_start)) |
---|---|
5 | 0 |
4 | 1 |
1 | 6 |
3 | 5 |
2 | 3 |
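For each customer, this feature calculates
1. the hour of the day in which each of their sessions started, and then
2. uses the statistical function mode to identify the most common hour in which they started a session.
Stacking results in features that are more expressive than the individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.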
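The two depth 2 features discussed above can be reproduced by hand to see what stacking means. The sketch below uses only the standard library and a hypothetical set of transaction rows (invented for illustration, not the mock customer data). It computes a per-session value first (depth 1) and then aggregates those values per customer (depth 2). It illustrates the idea only and is not the Featuretools implementation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, mode

# Hypothetical toy rows: (customer_id, session_id, session_start, amount).
transactions = [
    (1, 10, datetime(2014, 1, 1, 6, 5), 20.0),
    (1, 10, datetime(2014, 1, 1, 6, 5), 30.0),
    (1, 11, datetime(2014, 1, 2, 6, 40), 10.0),
    (1, 12, datetime(2014, 1, 3, 9, 15), 30.0),
    (2, 13, datetime(2014, 1, 1, 3, 0), 15.0),
]

# Depth 1: one value per session (SUM of amounts, HOUR of session start).
session_total = defaultdict(float)
session_hour = {}
session_owner = {}
for customer_id, session_id, start, amount in transactions:
    session_total[session_id] += amount
    session_hour[session_id] = start.hour
    session_owner[session_id] = customer_id

# Depth 2: aggregate the per-session values up to each customer.
totals_by_customer = defaultdict(list)
hours_by_customer = defaultdict(list)
for session_id, customer_id in session_owner.items():
    totals_by_customer[customer_id].append(session_total[session_id])
    hours_by_customer[customer_id].append(session_hour[session_id])

mean_of_session_sums = {c: mean(v) for c, v in totals_by_customer.items()}
mode_of_session_hours = {c: mode(v) for c, v in hours_by_customer.items()}

print(mean_of_session_sums)   # analogue of MEAN(sessions.SUM(transactions.amount))
print(mode_of_session_hours)  # analogue of MODE(sessions.HOUR(session_start))
```

Each stacked primitive adds one step to this pipeline, which is why the depth of a feature grows with the number of primitives applied.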
Note
You can graphically visualize the lineage of a feature by calling featuretools.graph_feature() on it. You can also generate an English description of the feature with featuretools.describe_feature(). See Generating Feature Descriptions for more details.
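Changing Target DataFrame#
DFS is powerful because we can create a feature matrix for any dataframe in our dataset. If we switch our target dataframe to “sessions”, we can synthesize features for each session instead of each customer. Now, we can use these features to predict the outcome of a session.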
[6]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix.head(5)
[6]:
session_id | customer_id | device | MEAN(transactions.amount) | MODE(transactions.product_id) | SUM(transactions.amount) | HOUR(session_start) | MONTH(session_start) | customers.zip_code | MODE(transactions.HOUR(transaction_time)) | MODE(transactions.MONTH(transaction_time)) | customers.MODE(sessions.device) | customers.MEAN(transactions.amount) | customers.MODE(transactions.product_id) | customers.SUM(transactions.amount) | customers.HOUR(birthday) | customers.HOUR(join_date) | customers.MONTH(birthday) | customers.MONTH(join_date) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | desktop | 76.813125 | 3 | 1229.01 | 0 | 1 | 13244 | 0 | 1 | desktop | 77.422366 | 4 | 7200.28 | 0 | 23 | 8 | 4 |
2 | 5 | mobile | 74.696000 | 5 | 746.96 | 0 | 1 | 60091 | 0 | 1 | mobile | 80.375443 | 5 | 6349.66 | 0 | 5 | 7 | 7 |
3 | 4 | mobile | 88.600000 | 1 | 1329.00 | 0 | 1 | 60091 | 0 | 1 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4 |
4 | 1 | mobile | 64.557200 | 5 | 1613.93 | 0 | 1 | 60091 | 0 | 1 | mobile | 71.631905 | 4 | 9025.62 | 0 | 10 | 7 | 4 |
5 | 4 | mobile | 70.638182 | 5 | 777.02 | 1 | 1 | 60091 | 1 | 1 | mobile | 80.070459 | 2 | 8727.68 | 0 | 20 | 8 | 4 |
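As we can see, DFS also builds deep features based on a parent dataframe, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the customer of the session.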
[7]:
feature_matrix[["customers.MEAN(transactions.amount)"]].head(5)
[7]:
session_id | customers.MEAN(transactions.amount) |
---|---|
1 | 77.422366 |
2 | 80.375443 |
3 | 80.070459 |
4 | 71.631905 |
5 | 80.070459 |
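A feature built on a parent dataframe can be read as "compute the value for the parent, then copy it down to every child row". The sketch below illustrates this reading with the standard library and hypothetical toy data (the `sessions` and `transactions` values are invented for illustration). It is an assumption about how to interpret the feature, not the Featuretools implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data.
sessions = {10: 1, 11: 1, 12: 2}  # session_id -> customer_id
transactions = [  # (session_id, amount)
    (10, 20.0),
    (10, 40.0),
    (11, 60.0),
    (12, 5.0),
    (12, 15.0),
]

# Step 1: aggregate all transactions of each customer, across all sessions.
amounts_by_customer = defaultdict(list)
for session_id, amount in transactions:
    amounts_by_customer[sessions[session_id]].append(amount)
customer_mean = {c: mean(v) for c, v in amounts_by_customer.items()}

# Step 2: every session inherits the value of its parent customer.
session_feature = {
    session_id: customer_mean[customer_id]
    for session_id, customer_id in sessions.items()
}

print(session_feature)  # analogue of customers.MEAN(transactions.amount)
```

Sessions that belong to the same customer share the same value, which matches the table above, where sessions 3 and 5 both belong to customer 4 and both show 80.070459.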
Improve feature output#
To learn about the parameters that can be changed in DFS, read Tuning Deep Feature Synthesis.