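Tuning Deep Feature Synthesis#
Several parameters can be adjusted to change the output of DFS. We will use the following transactions EntitySet to explore them.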
[1]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es
[1]:
Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
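Using "Seed Features"#
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of them whenever it can. By using seed features, we can include domain-specific knowledge in automated feature engineering. For the seed feature below, the domain knowledge might be that, for a particular retailer, a transaction above $125 is considered an expensive purchase.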
[2]:
expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["percent_true"],
    seed_features=[expensive_purchase],
)
feature_matrix[["PERCENT_TRUE(transactions.amount > 125)"]]
[2]:
customer_id | PERCENT_TRUE(transactions.amount > 125) |
---|---|
5 | 0.227848 |
4 | 0.220183 |
1 | 0.119048 |
3 | 0.182796 |
2 | 0.129032 |
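We can now see that the PERCENT_TRUE primitive was automatically applied to the boolean expensive_purchase feature from the transactions table. The resulting feature can be understood as the percentage of a customer's transactions that are considered expensive.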
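To make the meaning of this feature concrete, the following plain-Python sketch reproduces the same two steps by hand: a boolean flag per transaction (the seed feature), then the share of true flags per customer (the aggregation). It does not call Featuretools; the toy `transactions` list and the helper name `percent_true_by_customer` are made up for illustration and are not part of the demo dataset or the library.

```python
from collections import defaultdict

# Hypothetical toy data: (customer_id, amount) pairs, not the demo dataset.
transactions = [
    (1, 140.0),
    (1, 20.0),
    (1, 80.0),
    (1, 130.0),
    (2, 50.0),
    (2, 126.0),
    (2, 10.0),
    (2, 125.0),  # exactly 125 is not "above 125"
]


def percent_true_by_customer(rows, threshold=125):
    """Return the fraction of each customer's transactions above `threshold`."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # Seed feature: one boolean per transaction.
        flags[customer_id].append(amount > threshold)
    # Aggregation: share of True values among each customer's transactions.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

The point of the seed feature is that DFS only has to be told about the boolean flag; stacking the aggregation on top of it is done automatically.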
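Adding "Interesting" Values to Columns#
Sometimes we want to create features that are conditioned on a second value before the calculation is performed. We call this extra filter a "where clause". In Deep Feature Synthesis, where clauses are used by passing primitives in the where_primitives parameter to DFS. By default, where clauses are built from the "interesting_values" of a column. Interesting values can be determined and added automatically for each DataFrame in a pandas EntitySet by calling es.add_interesting_values(). The cell below instead sets them explicitly for the device column of the sessions DataFrame.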
[3]:
values_dict = {"device": ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name="sessions", values=values_dict)
Interesting values are stored in the DataFrame's Woodwork typing information.
[4]:
es["sessions"].ww.columns["device"].metadata
[4]:
{'dataframe_name': 'sessions',
'entityset_id': 'transactions',
'interesting_values': ['desktop', 'mobile', 'tablet']}
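Now that interesting values are set for the device column in the sessions table, we can use the where_primitives parameter to specify which aggregation primitives should be given where clauses.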
[5]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "avg_time_between"],
    where_primitives=["count", "avg_time_between"],
    trans_primitives=[],
)
feature_matrix
[5]:
customer_id | zip_code | AVG_TIME_BETWEEN(sessions.session_start) | COUNT(sessions) | AVG_TIME_BETWEEN(transactions.transaction_time) | COUNT(transactions) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) | COUNT(sessions WHERE device = mobile) | COUNT(sessions WHERE device = tablet) | ... | AVG_TIME_BETWEEN(transactions.sessions.session_start) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) | AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) | AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet) | COUNT(transactions WHERE sessions.device = mobile) | COUNT(transactions WHERE sessions.device = desktop) | COUNT(transactions WHERE sessions.device = tablet) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 60091 | 5577.000000 | 6 | 363.333333 | 79 | 13942.500000 | NaN | 9685.0 | 3 | 1 | ... | 357.500000 | 796.714286 | 345.892857 | 0.000000 | 809.714286 | 376.071429 | 65.000000 | 36 | 29 | 14 |
4 | 60091 | 2516.428571 | 8 | 168.518519 | 109 | 3336.666667 | NaN | 4127.5 | 4 | 1 | ... | 163.101852 | 192.500000 | 223.108108 | 0.000000 | 206.250000 | 238.918919 | 65.000000 | 53 | 38 | 18 |
1 | 60091 | 3305.714286 | 8 | 192.920000 | 126 | 11570.000000 | 8807.5 | 7150.0 | 3 | 3 | ... | 185.120000 | 420.727273 | 275.000000 | 419.404762 | 438.454545 | 302.500000 | 442.619048 | 56 | 27 | 43 |
3 | 13244 | 5096.000000 | 6 | 287.554348 | 93 | NaN | NaN | 4745.0 | 1 | 1 | ... | 276.956522 | 0.000000 | 233.360656 | 0.000000 | 65.000000 | 251.475410 | 65.000000 | 16 | 62 | 15 |
2 | 13244 | 4907.500000 | 7 | 328.532609 | 93 | 1690.000000 | 5330.0 | 6890.0 | 2 | 2 | ... | 320.054348 | 56.333333 | 417.575758 | 197.407407 | 82.333333 | 435.303030 | 226.296296 | 31 | 34 | 28 |
5 rows × 21 columns
We now have several new, potentially useful features. Here are two of them, both built on the where clause "the device used was a tablet":
[6]:
feature_matrix[
    [
        "COUNT(sessions WHERE device = tablet)",
        "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)",
    ]
]
[6]:
customer_id | COUNT(sessions WHERE device = tablet) | AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) |
---|---|---|
5 | 1 | NaN |
4 | 1 | NaN |
1 | 3 | 8807.5 |
3 | 1 | NaN |
2 | 2 | 5330.0 |
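The first feature, COUNT(sessions WHERE device = tablet), indicates how many sessions a customer completed on a tablet. The second feature, AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet), calculates the average time between those sessions. Customers who completed only 0 or 1 sessions on a tablet have NaN for this feature, since there is no pair of sessions to measure a gap between.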
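The two where-clause features above can be sketched in plain Python as a filter followed by an aggregation. This is an illustration only, not the Featuretools implementation: the toy `sessions` list and the helper names `count_where` and `avg_time_between_where` are hypothetical, and the sketch assumes that the average is taken over gaps between consecutive session starts and reported in seconds.

```python
import math
from datetime import datetime

# Hypothetical sessions for a single customer: (device, session_start).
sessions = [
    ("tablet", datetime(2014, 1, 1, 0, 0, 0)),
    ("mobile", datetime(2014, 1, 1, 0, 30, 0)),
    ("tablet", datetime(2014, 1, 1, 1, 0, 0)),
    ("tablet", datetime(2014, 1, 1, 3, 0, 0)),
    ("desktop", datetime(2014, 1, 1, 4, 0, 0)),
]


def count_where(rows, device):
    """COUNT restricted to rows whose device matches the where clause."""
    return sum(1 for d, _ in rows if d == device)


def avg_time_between_where(rows, device):
    """Mean gap in seconds between consecutive matching sessions.

    Returns NaN when fewer than two sessions match, because no gap exists.
    """
    times = sorted(t for d, t in rows if d == device)
    if len(times) < 2:
        return math.nan
    # The mean of consecutive gaps equals (last - first) / (n - 1).
    return (times[-1] - times[0]).total_seconds() / (len(times) - 1)


print(count_where(sessions, "tablet"))
print(avg_time_between_where(sessions, "tablet"))
print(avg_time_between_where(sessions, "desktop"))
```

The desktop case shows the same behavior as the NaN entries in the table: a single matching session yields a count of 1 but no average gap.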
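Encoding Categorical Features#
Machine learning algorithms typically expect all data to be numeric, or to have a well-defined numeric representation, such as boolean values corresponding to 0 and 1. When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools.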
[7]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["time_since"],
    max_depth=1,
)
feature_matrix
[7]:
customer_id | zip_code | MODE(sessions.device) | TIME_SINCE(birthday) | TIME_SINCE(join_date) |
---|---|---|---|---|
5 | 60091 | mobile | 1.268837e+09 | 4.493137e+08 |
4 | 60091 | mobile | 5.730582e+08 | 4.263649e+08 |
1 | 60091 | mobile | 9.541686e+08 | 4.256209e+08 |
3 | 13244 | desktop | 6.592854e+08 | 4.154081e+08 |
2 | 13244 | desktop | 1.203951e+09 | 3.941255e+08 |
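This feature matrix contains two categorical columns, zip_code and MODE(sessions.device). We can use the feature matrix and the feature definitions to encode these categorical values as boolean values. Featuretools provides functionality to apply one-hot encoding to the output of DFS.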
[8]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc
[8]:
customer_id | TIME_SINCE(birthday) | TIME_SINCE(join_date) | zip_code = 60091 | zip_code = 13244 | zip_code is unknown | MODE(sessions.device) = mobile | MODE(sessions.device) = desktop | MODE(sessions.device) is unknown |
---|---|---|---|---|---|---|---|---|
5 | 1.268837e+09 | 4.493137e+08 | True | False | False | True | False | False |
4 | 5.730582e+08 | 4.263649e+08 | True | False | False | True | False | False |
1 | 9.541686e+08 | 4.256209e+08 | True | False | False | True | False | False |
3 | 6.592854e+08 | 4.154081e+08 | False | True | False | False | True | False |
2 | 1.203951e+09 | 3.941255e+08 | False | True | False | False | True | False |
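The returned feature matrix is now encoded in a way that machine learning algorithms can interpret. Note that the columns that did not need encoding are still included. In addition, we get a new set of feature definitions that contain the encoded values.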
[9]:
features_enc
[9]:
[<Feature: zip_code = 60091>,
<Feature: zip_code = 13244>,
<Feature: zip_code is unknown>,
<Feature: MODE(sessions.device) = mobile>,
<Feature: MODE(sessions.device) = desktop>,
<Feature: MODE(sessions.device) is unknown>,
<Feature: TIME_SINCE(birthday)>,
<Feature: TIME_SINCE(join_date)>]
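These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the Deployment guide.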
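As a rough illustration of why a fixed set of encoded features is useful on new data, the plain-Python sketch below applies a category list chosen earlier to values seen later. It is not the Featuretools implementation: the helper `one_hot` and the sample values are hypothetical, and the sketch assumes that any value outside the fixed categories is flagged by the "is unknown" column.

```python
def one_hot(values, categories, name="device"):
    """Encode each value as booleans over fixed `categories` plus an unknown flag."""
    rows = []
    for value in values:
        # One boolean column per known category.
        row = {f"{name} = {c}": value == c for c in categories}
        # Catch-all column for values not in the fixed category list.
        row[f"{name} is unknown"] = value not in categories
        rows.append(row)
    return rows


# Categories fixed when the encoding was first defined (assumed for illustration).
categories = ["mobile", "desktop"]

# New data containing one known value and one value never seen before.
new_values = ["desktop", "tablet"]

encoded = one_hot(new_values, categories)
for row in encoded:
    print(row)
```

Because the column set is determined by `categories` rather than by the new data, the encoded output keeps the same columns no matter which values appear later.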