Frequently Asked Questions#
Here we try to answer some common questions that come up often on GitHub and Stack Overflow.
[1]:
import pandas as pd
import woodwork as ww
import featuretools as ft
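EntitySet#
An EntitySet is a core concept in Featuretools: it represents the multiple tables of a dataset. Each table in an EntitySet is a DataFrame (called an "entity" in older versions of Featuretools), and the links between DataFrames are called relationships. An EntitySet gives you a convenient way to organize those tables and their relationships for automated feature engineering.
How do I get a list of column names and types in an EntitySet?#
After you create your EntitySet, you may want to view the column names. An EntitySet contains multiple DataFrames, one for each table in the EntitySet.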
[2]:
es = ft.demo.load_mock_customer(return_entityset=True)
es
[2]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 3]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
If you want to view the underlying DataFrame, you can do the following:
[3]:
es["transactions"].head()
[3]:
| | transaction_id | session_id | transaction_time | product_id | amount | _ft_last_time |
---|---|---|---|---|---|---|
298 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2014-01-01 00:00:00 |
2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 2014-01-01 00:01:05 |
308 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 2014-01-01 00:02:10 |
116 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 2014-01-01 00:03:15 |
371 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 2014-01-01 00:04:20 |
If you want to view the columns and types of the "transactions" DataFrame, you can do the following:
[4]:
es["transactions"].ww
[4]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
transaction_id | int64 | Integer | ['index'] |
session_id | int64 | Integer | ['foreign_key', 'numeric'] |
transaction_time | datetime64[ns] | Datetime | ['time_index'] |
product_id | category | Categorical | ['category', 'foreign_key'] |
amount | float64 | Double | ['numeric'] |
_ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
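What is the difference between copy_columns and additional_columns?#
The normalize_dataframe function creates a new DataFrame from the unique values of a column in an existing DataFrame, along with a relationship between the two. It takes two similar arguments:
- additional_columns removes columns from the base DataFrame and moves them to the new DataFrame.
- copy_columns keeps the given columns in the base DataFrame and also copies them to the new DataFrame.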
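The move-versus-copy distinction can be sketched without Featuretools at all. The snippet below is a plain-Python illustration of the column handling described above; the helper normalize_rows and the sample rows are hypothetical and are not part of the Featuretools API.

```python
def normalize_rows(rows, index, additional_columns=(), copy_columns=()):
    """Split `rows` (a list of dicts) into a base table and a new table.

    Hypothetical helper that mimics only the column handling of
    normalize_dataframe: `additional_columns` are moved, `copy_columns`
    are kept in the base table and duplicated in the new one.
    """
    new_table = {}
    base_table = []
    for row in rows:
        key = row[index]
        # Keep the first occurrence of each unique index value.
        if key not in new_table:
            new_row = {index: key}
            for col in list(additional_columns) + list(copy_columns):
                new_row[col] = row[col]
            new_table[key] = new_row
        # Moved columns disappear from the base table; copied ones stay.
        base_table.append(
            {col: val for col, val in row.items() if col not in additional_columns}
        )
    return base_table, list(new_table.values())


rows = [
    {"transaction_id": 298, "session_id": 1, "device": "desktop", "join_date": "2012-04-15"},
    {"transaction_id": 2, "session_id": 1, "device": "desktop", "join_date": "2012-04-15"},
    {"transaction_id": 10, "session_id": 2, "device": "mobile", "join_date": "2010-07-17"},
]

base, sessions = normalize_rows(
    rows, index="session_id", additional_columns=["join_date"], copy_columns=["device"]
)
print(sorted(base[0]))  # join_date is gone, device is still there
print(sessions)         # one row per unique session_id
```

The base rows keep device but lose join_date, while the new table holds one row per session_id with both columns.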
[5]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(
dataframe_name="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
)
es = es.add_dataframe(
dataframe_name="products", dataframe=products_df, index="product_id"
)
es = es.add_relationship("products", "product_id", "transactions", "product_id")
Before we normalize to create a new DataFrame, let's look at the base DataFrame.
[6]:
es["transactions"].head()
[6]:
| | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | birthday |
---|---|---|---|---|---|---|---|---|---|---|---|
298 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
308 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
116 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
371 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
Notice the columns session_id, session_start, join_date, device, customer_id, and zip_code.
[7]:
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
make_time_index="session_start",
additional_columns=["join_date"],
copy_columns=["device", "customer_id", "zip_code", "session_start"],
)
Above, we normalized the columns to create a new DataFrame.
- For additional_columns, the column ['join_date'] is removed from the transactions DataFrame and moved to the new sessions DataFrame.
- For copy_columns, the columns ['device', 'customer_id', 'zip_code', 'session_start'] are copied from the transactions DataFrame to the new sessions DataFrame.
Let's see this in the actual EntitySet.
[8]:
es["transactions"].head()
[8]:
| | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | birthday |
---|---|---|---|---|---|---|---|---|---|---|
298 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop | 2014-01-01 | 13244 | 1986-08-18 |
2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 2 | desktop | 2014-01-01 | 13244 | 1986-08-18 |
308 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 2 | desktop | 2014-01-01 | 13244 | 1986-08-18 |
116 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 2 | desktop | 2014-01-01 | 13244 | 1986-08-18 |
371 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 2 | desktop | 2014-01-01 | 13244 | 1986-08-18 |
Notice that ['device', 'customer_id', 'zip_code', 'session_start'] are still in the transactions DataFrame, while ['join_date'] is not. All of these columns are now present in the sessions DataFrame, as seen below.
[9]:
es["sessions"].head()
[9]:
| | session_id | join_date | device | customer_id | zip_code | session_start |
---|---|---|---|---|---|---|
1 | 1 | 2012-04-15 23:31:04 | desktop | 2 | 13244 | 2014-01-01 00:00:00 |
2 | 2 | 2010-07-17 05:27:50 | mobile | 5 | 60091 | 2014-01-01 00:17:20 |
3 | 3 | 2011-04-08 20:08:14 | mobile | 4 | 60091 | 2014-01-01 00:28:10 |
4 | 4 | 2011-04-17 10:48:33 | mobile | 1 | 60091 | 2014-01-01 00:44:25 |
5 | 5 | 2011-04-08 20:08:14 | mobile | 4 | 60091 | 2014-01-01 01:11:30 |
Why did my columns get new semantic tags?#
While creating your EntitySet, you might wonder why the semantic tags of your columns changed.
[10]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(
dataframe_name="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
)
es.plot()
[10]:
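If a column has semantic tags, they appear to the right of the semicolon in the diagram above. Notice that session_id and session_start do not currently have any semantic tags associated with them.
Now, let's normalize the transactions DataFrame to create a new DataFrame.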
[11]:
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
make_time_index="session_start",
additional_columns=["session_start"],
)
es.plot()
[11]:
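session_id now has the semantic tag foreign_key in the transactions DataFrame and index in the new sessions DataFrame. This is because normalizing the DataFrame created a new relationship between transactions and sessions: a one-to-many relationship between the parent DataFrame, sessions, and the child DataFrame, transactions.
Therefore, session_id has the semantic tag foreign_key in transactions because it represents an index in another DataFrame. You would see a similar effect if you added another DataFrame with add_dataframe and add_relationship.
In addition, when we created the new DataFrame, we set session_start as the time_index. This adds the semantic tag time_index to the session_start column in the new sessions DataFrame, because it now represents a time index.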
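How do I update a column's description or metadata?#
You can update the description or metadata attributes of a column schema directly. However, you must use the column schema returned by DataFrame.ww.columns['col_name'], not the one from DataFrame.ww['col_name'].ww.schema. The column schema from DataFrame.ww.columns['col_name'] is still tied to the EntitySet, so any attribute updates propagate; the other one is not. For example, this is how to update a column's description or metadata:
column_schema = df.ww.columns['col_name']
column_schema.description = 'my description'
column_schema.metadata.update(key='value')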
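How do I combine two or more interesting values?#
You might want to create features that are conditioned on multiple values before they are calculated. This requires the use of interesting_values. However, because we want features with multiple conditions, we need to modify the DataFrame before we create the EntitySet.
Let's look at how you might accomplish this.
First, let's create our DataFrames.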
[12]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
[13]:
transactions_df.head()
[13]:
| | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | birthday |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
1 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
2 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
3 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
4 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
[14]:
products_df.head()
[14]:
| | product_id | brand |
---|---|---|
0 | 1 | B |
1 | 2 | B |
2 | 3 | B |
3 | 4 | B |
4 | 5 | A |
Now, let's modify our transactions DataFrame to add a column that represents the combination of multiple conditions.
[15]:
transactions_df["product_id_device"] = (
transactions_df["product_id"].astype(str) + " and " + transactions_df["device"]
)
Here, we created a new column called product_id_device, which simply concatenates the product_id column and the device column.
Now let's create our EntitySet.
[16]:
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(
dataframe_name="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
logical_types={
"product_id": ww.logical_types.Categorical,
"product_id_device": ww.logical_types.Categorical,
"zip_code": ww.logical_types.PostalCode,
},
)
es = es.add_dataframe(
dataframe_name="products", dataframe=products_df, index="product_id"
)
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
additional_columns=["device", "product_id_device", "customer_id"],
)
es = es.normalize_dataframe(
base_dataframe_name="sessions", new_dataframe_name="customers", index="customer_id"
)
es
[16]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 9]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 2]
Relationships:
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Now, we are ready to add our interesting values.
First, let's take a look at the values that are available.
[17]:
interesting_values = transactions_df["product_id_device"].unique().tolist()
interesting_values
[17]:
['5 and desktop',
'2 and desktop',
'3 and desktop',
'4 and desktop',
'1 and desktop',
'4 and mobile',
'5 and mobile',
'1 and mobile',
'3 and mobile',
'2 and mobile',
'4 and tablet',
'3 and tablet',
'2 and tablet',
'1 and tablet',
'5 and tablet']
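If you like, you can select a subset of these values, and the where features will be created using only those conditions. In our example, we will use all of the possible interesting values.
Here, we set all of these values as interesting values for this specific DataFrame and column. We could set interesting values for multiple columns in the same way, but we will use just this one for this example.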
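Picking a subset is ordinary list filtering. A minimal sketch, assuming we only care about tablet sessions; the list literal repeats a few of the combined values from the output above so that the snippet runs on its own, and the name tablet_values is only illustrative.

```python
# A few of the combined values from the output above.
interesting_values = [
    "5 and desktop",
    "4 and mobile",
    "4 and tablet",
    "1 and tablet",
    "5 and tablet",
]

# Keep only the combinations recorded on a tablet.
tablet_values = [v for v in interesting_values if v.endswith(" and tablet")]

# Same shape as the `values` mapping passed to add_interesting_values.
values = {"product_id_device": tablet_values}
print(values)
```

With this mapping, where features would be conditioned only on the three tablet combinations.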
[18]:
values = {"product_id_device": interesting_values}
es.add_interesting_values(dataframe_name="sessions", values=values)
Now we can run Deep Feature Synthesis (DFS).
[19]:
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=["count"],
where_primitives=["count"],
trans_primitives=[],
)
feature_matrix.head()
[19]:
customer_id | COUNT(sessions) | COUNT(transactions) | COUNT(sessions WHERE product_id_device = 4 and desktop) | COUNT(sessions WHERE product_id_device = 1 and tablet) | COUNT(sessions WHERE product_id_device = 3 and desktop) | COUNT(sessions WHERE product_id_device = 4 and mobile) | COUNT(sessions WHERE product_id_device = 2 and mobile) | COUNT(sessions WHERE product_id_device = 5 and tablet) | COUNT(sessions WHERE product_id_device = 5 and mobile) | COUNT(sessions WHERE product_id_device = 3 and mobile) | ... | COUNT(transactions WHERE sessions.product_id_device = 4 and mobile) | COUNT(transactions WHERE sessions.product_id_device = 2 and mobile) | COUNT(transactions WHERE sessions.product_id_device = 5 and mobile) | COUNT(transactions WHERE sessions.product_id_device = 4 and desktop) | COUNT(transactions WHERE sessions.product_id_device = 3 and mobile) | COUNT(transactions WHERE sessions.product_id_device = 4 and tablet) | COUNT(transactions WHERE sessions.product_id_device = 5 and tablet) | COUNT(transactions WHERE sessions.product_id_device = 1 and tablet) | COUNT(transactions WHERE sessions.product_id_device = 1 and desktop) | COUNT(transactions WHERE sessions.product_id_device = 2 and desktop) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 7 | 93 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 18 | 13 | 0 | 10 | 0 | 0 | 13 | 15 | 8 | 0 |
5 | 6 | 79 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 10 | 0 | 0 | 14 | 8 | 14 | 0 | 0 | 0 | 0 |
4 | 8 | 109 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | ... | 0 | 23 | 0 | 18 | 15 | 0 | 18 | 0 | 0 | 10 |
1 | 8 | 126 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 56 | 0 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 15 |
3 | 6 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 33 | 0 |
5 rows × 32 columns
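To better understand the where clause features, let's examine one of them. The feature COUNT(sessions WHERE product_id_device = 5 and tablet) tells us the number of sessions in which the customer purchased product_id 5 while on a tablet. Notice that the feature depends on multiple conditions (product_id = 5 & device = tablet).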
[20]:
feature_matrix[["COUNT(sessions WHERE product_id_device = 5 and tablet)"]]
[20]:
customer_id | COUNT(sessions WHERE product_id_device = 5 and tablet) |
---|---|
2 | 1 |
5 | 0 |
4 | 1 |
1 | 0 |
3 | 0 |
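Conceptually, a where feature like this one is a conditional count. The sketch below reproduces the idea in plain Python on made-up session rows; it is not how Featuretools computes the feature internally, and the data is hypothetical.

```python
from collections import Counter

# Hypothetical session rows: (customer_id, product_id_device)
sessions = [
    (2, "5 and tablet"),
    (2, "4 and desktop"),
    (4, "5 and tablet"),
    (4, "2 and mobile"),
    (1, "4 and mobile"),
]

condition = "5 and tablet"

# Count, per customer, only the sessions whose combined value matches.
where_count = Counter(cust for cust, value in sessions if value == condition)

for customer_id in sorted({cust for cust, _ in sessions}):
    # A Counter returns 0 for customers with no matching session.
    print(customer_id, where_count[customer_id])
```

Because the two conditions were merged into a single column beforehand, one equality test is enough to express "product 5 and tablet".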
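Deep Feature Synthesis (DFS)#
Why is DFS not creating aggregation features?#
You may have created your EntitySet and then applied DFS to create features, only to find that no aggregation features were created.
This is most likely because your EntitySet contains a single DataFrame. DFS cannot create aggregation features with fewer than two DataFrames: Featuretools looks for a relationship and aggregates based on that relationship.
Let's look at a simple example.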
[21]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(
dataframe_name="transactions", dataframe=transactions_df, index="transaction_id"
)
es
[21]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
Relationships:
No relationships
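Notice that we have only one DataFrame in our EntitySet. If we try to create aggregation features on this EntitySet, it will not be possible, because DFS needs two DataFrames to generate aggregation features.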
[22]:
feature_matrix, feature_defs = ft.dfs(
entityset=es, target_dataframe_name="transactions"
)
feature_defs
/Users/code/fin_tool/github/featuretools/featuretools/synthesis/deep_feature_synthesis.py:154: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
warnings.warn(
[22]:
[<Feature: session_id>,
<Feature: product_id>,
<Feature: amount>,
<Feature: customer_id>,
<Feature: device>,
<Feature: zip_code>,
<Feature: DAY(birthday)>,
<Feature: DAY(join_date)>,
<Feature: DAY(session_start)>,
<Feature: DAY(transaction_time)>,
<Feature: MONTH(birthday)>,
<Feature: MONTH(join_date)>,
<Feature: MONTH(session_start)>,
<Feature: MONTH(transaction_time)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(join_date)>,
<Feature: WEEKDAY(session_start)>,
<Feature: WEEKDAY(transaction_time)>,
<Feature: YEAR(birthday)>,
<Feature: YEAR(join_date)>,
<Feature: YEAR(session_start)>,
<Feature: YEAR(transaction_time)>]
None of the features above are aggregation features. To fix this, you can add another DataFrame to your EntitySet.
Solution #1 - You can add a new DataFrame if you have additional data.
[23]:
products_df = data["products"]
es = es.add_dataframe(
dataframe_name="products", dataframe=products_df, index="product_id"
)
es
[23]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 11]
products [Rows: 5, Columns: 2]
Relationships:
No relationships
Notice that we now have an additional DataFrame in our EntitySet, called products.
Solution #2 - You can normalize an existing DataFrame.
[24]:
es = es.normalize_dataframe(
base_dataframe_name="transactions",
new_dataframe_name="sessions",
index="session_id",
make_time_index="session_start",
additional_columns=["device", "customer_id", "zip_code", "join_date"],
copy_columns=["session_start"],
)
es
[24]:
Entityset: customer_data
DataFrames:
transactions [Rows: 500, Columns: 7]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 6]
Relationships:
transactions.session_id -> sessions.session_id
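Notice that we now have an additional DataFrame in our EntitySet, called sessions. Here, normalization created a relationship between transactions and sessions. Alternatively, if we had used only Solution #1, we could have defined a relationship between transactions and products with add_relationship.
Now, we can generate aggregation features.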
[25]:
feature_matrix, feature_defs = ft.dfs(
entityset=es, target_dataframe_name="transactions"
)
feature_defs[:-10]
[25]:
[<Feature: session_id>,
<Feature: product_id>,
<Feature: amount>,
<Feature: DAY(birthday)>,
<Feature: DAY(session_start)>,
<Feature: DAY(transaction_time)>,
<Feature: MONTH(birthday)>,
<Feature: MONTH(session_start)>,
<Feature: MONTH(transaction_time)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(session_start)>,
<Feature: WEEKDAY(transaction_time)>,
<Feature: YEAR(birthday)>,
<Feature: YEAR(session_start)>,
<Feature: YEAR(transaction_time)>,
<Feature: sessions.device>,
<Feature: sessions.customer_id>,
<Feature: sessions.zip_code>,
<Feature: sessions.COUNT(transactions)>,
<Feature: sessions.MAX(transactions.amount)>,
<Feature: sessions.MEAN(transactions.amount)>,
<Feature: sessions.MIN(transactions.amount)>,
<Feature: sessions.MODE(transactions.product_id)>,
<Feature: sessions.NUM_UNIQUE(transactions.product_id)>,
<Feature: sessions.SKEW(transactions.amount)>]
Some of the aggregation features are:
- <Feature: sessions.MAX(transactions.amount)>
- <Feature: sessions.SKEW(transactions.amount)>
- <Feature: sessions.MIN(transactions.amount)>
- <Feature: sessions.MEAN(transactions.amount)>
- <Feature: sessions.COUNT(transactions)>
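To see why a relationship is required, it helps to write out what such features mean: child rows are grouped by the key that points at the parent, and each group is reduced to one value. The following plain-Python sketch uses hypothetical transaction rows and only illustrates the idea; it is not the Featuretools implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child rows: (transaction_id, session_id, amount)
transactions = [
    (298, 1, 127.64),
    (2, 1, 109.48),
    (308, 1, 95.06),
    (10, 2, 57.39),
    (11, 2, 20.00),
]

# Group the child rows by the key that points at the parent table.
amounts_by_session = defaultdict(list)
for _, session_id, amount in transactions:
    amounts_by_session[session_id].append(amount)

# One value per parent row, analogous to COUNT, MAX and MEAN above.
session_features = {
    session_id: {
        "COUNT(transactions)": len(amounts),
        "MAX(transactions.amount)": max(amounts),
        "MEAN(transactions.amount)": mean(amounts),
    }
    for session_id, amounts in amounts_by_session.items()
}
print(session_features[1])
```

Without a second table and a key linking the two, there is nothing to group by, which is why a single-DataFrame EntitySet yields only transform features.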
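How do I speed up the runtime of DFS?#
One issue you may encounter while running ft.dfs is slow performance. While Featuretools generally has good default settings for calculating features, you may want to improve performance when calculating a large number of features.
One quick way to speed things up is to adjust the n_jobs setting of ft.dfs or ft.calculate_feature_matrix.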
# Setting n_jobs to -1 uses all available cores
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      n_jobs=-1)
# calculate_feature_matrix returns only the feature matrix
feature_matrix = ft.calculate_feature_matrix(entityset=es,
                                             features=feature_defs,
                                             n_jobs=-1)
For more ways to improve performance, see:
Improving Computational Performance
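How do I include only certain features when running DFS?#
When using DFS to generate features, you may want to include only certain features. There are multiple ways to do this:
- Use ignore_columns to specify columns in a DataFrame that should not be used to create features. It is a dictionary mapping a DataFrame name to a list of column names to ignore.
- Use drop_contains to drop features that contain any of the strings listed in this parameter.
- Use drop_exact to drop features that exactly match any of the strings listed in this parameter.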
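The difference between the last two options is substring matching versus whole-name matching. The sketch below shows that distinction on a list of feature-name strings; filter_feature_names is a hypothetical helper written for this illustration, not a Featuretools function.

```python
def filter_feature_names(names, drop_contains=(), drop_exact=()):
    """Hypothetical helper: keep the names that survive both filters."""
    kept = []
    for name in names:
        if name in drop_exact:
            continue  # the whole name matches an entry
        if any(fragment in name for fragment in drop_contains):
            continue  # the name contains one of the substrings
        kept.append(name)
    return kept


names = [
    "COUNT(sessions)",
    "SUM(transactions.amount)",
    "STD(transactions.amount)",
    "MODE(sessions.device)",
]

print(filter_feature_names(names, drop_contains=["SUM("]))
print(filter_feature_names(names, drop_exact=["STD(transactions.amount)"]))
# A partial string never matches exactly, so nothing is dropped here.
print(filter_feature_names(names, drop_exact=["STD("]))
```

In practice this means a misspelled or partial name passed to drop_exact silently drops nothing, so it is worth checking the resulting feature list.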
Here is an example of using all three parameters:
[26]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
ignore_columns={
"transactions": ["amount"],
"customers": ["age", "gender", "birthday"],
}, # ignore these columns
drop_contains=["customers.SUM("], # drop features that contain these strings
drop_exact=["STD(transactions.quantity)"], # drop features that exactly match
)
如何在每列或每个DataFrame基础上指定原语?#
在使用DFS生成特征时,您可能希望仅针对特定原语使用特定的特征或DataFrame。这可以通过primitive_options
参数来实现。primitive_options
参数是一个字典,将一个原语或原语元组映射到包含原语选项的字典中。如果原语需要多个输入,则原语或原语元组也可以映射到选项字典列表。原语键可以是原语的字符串名称、原语类或原语的特定实例。每个字典为其各自的输入列提供选项。通过这些选项,有多种控制原语应用方式的方法:
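Use ignore_dataframes to specify DataFrames that should not be used when creating features for that primitive. This is a list of DataFrame names to ignore.
Use include_dataframes to specify the only DataFrames that may be used when creating features for that primitive. This is a list of DataFrame names to include.
Use ignore_columns to specify columns in a DataFrame that should not be used when creating features for that primitive. This is a dictionary mapping a DataFrame name to a list of column names to ignore.
Use include_columns to specify the only columns in a DataFrame that may be used when creating features for that primitive. This is a dictionary mapping a DataFrame name to a list of column names to include.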
You can also use primitive_options to specify which DataFrames or columns may be used as groupbys for groupby transform primitives:
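Use ignore_groupby_dataframes to specify DataFrames that should not be used to get groupbys for that primitive. This is a list of DataFrame names to ignore.
Use include_groupby_dataframes to specify the only DataFrames that may be used to get groupbys for that primitive. This is a list of DataFrame names to include.
Use ignore_groupby_columns to specify columns in a DataFrame that should not be used as groupbys for that primitive. This is a dictionary mapping a DataFrame name to a list of column names to ignore.
Use include_groupby_columns to specify the only columns in a DataFrame that may be used as groupbys for that primitive. This is a dictionary mapping a DataFrame name to a list of column names to include.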
Here is an example that uses some of these options:
[27]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    primitive_options={
        # For mode, ignore the "sessions" DataFrame and only include "brand" in the
        # "products" DataFrame and "product_id" in the "transactions" DataFrame
        "mode": {
            "ignore_dataframes": ["sessions"],
            "include_columns": {"products": ["brand"], "transactions": ["product_id"]},
        },
        # For count and mean, only include the DataFrames "sessions" and "transactions"
        ("count", "mean"): {"include_dataframes": ["sessions", "transactions"]},
    },
)
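The groupby options listed earlier follow the same pattern. The following is a minimal sketch that is not part of the original FAQ: it assumes the mock customer EntitySet from ft.demo.load_mock_customer, whose transactions DataFrame has a session_id column, and it restricts the cum_sum groupby transform primitive to group only by that column. The helper name build_groupby_features is hypothetical.

```python
# Options for a groupby transform primitive: "cum_sum" may only be grouped
# by the "session_id" column of the "transactions" DataFrame.
groupby_options = {
    "cum_sum": {
        "include_groupby_columns": {"transactions": ["session_id"]},
    },
}


def build_groupby_features(es):
    """Return feature definitions only (hypothetical helper; no matrix is computed)."""
    # Imported here so the options dictionary above can be inspected on its own.
    import featuretools as ft

    return ft.dfs(
        entityset=es,
        target_dataframe_name="transactions",
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
        primitive_options=groupby_options,
        features_only=True,
    )
```

Under these assumptions, the CUM_SUM features in the list returned by build_groupby_features(es) should be grouped by session_id only.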
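Note that if options are given both for a specific instance of a primitive and for the primitive in general (by string name or class), the instance with its own options will not use the generic options. For example, in this case: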
from featuretools.primitives import Mean

special_mean = Mean()
options = {
    special_mean: {"include_dataframes": ["customers"]},
    "mean": {"include_dataframes": ["sessions"]},
}
The primitive special_mean will not use the DataFrame sessions, because its options include only customers. Every other instance of the Mean primitive will use the 'mean' options.
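The precedence rule can be pictured with a small standalone sketch. It does not reproduce Featuretools internals: the Mean stand-in class and the resolve_options helper are hypothetical and exist only to show that an instance key wins over the generic string key.

```python
class Mean:
    """Stand-in for a primitive class (hypothetical, not featuretools.primitives.Mean)."""

    name = "mean"


def resolve_options(primitive, options):
    """Pick the options for one primitive instance.

    Instance-specific options take precedence; otherwise fall back to
    the options keyed by the primitive's string name.
    """
    if primitive in options:
        return options[primitive]
    return options.get(primitive.name, {})


special_mean = Mean()
other_mean = Mean()
options = {
    special_mean: {"include_dataframes": ["customers"]},
    "mean": {"include_dataframes": ["sessions"]},
}

print(resolve_options(special_mean, options))  # {'include_dataframes': ['customers']}
print(resolve_options(other_mean, options))    # {'include_dataframes': ['sessions']}
```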
For more examples of specifying options for DFS, visit:
If I don't specify a cutoff_time, which date is used for the feature calculations?#
The feature calculations use the current time as the cutoff time, that is, cutoff_time = datetime.now().
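In other words, leaving out cutoff_time behaves like passing the current timestamp. The sketch below only illustrates that default; effective_cutoff is a hypothetical helper, not a Featuretools function.

```python
from datetime import datetime


def effective_cutoff(cutoff_time=None):
    """Return the given cutoff time, or the current time when none is given."""
    return cutoff_time if cutoff_time is not None else datetime.now()


explicit = datetime(2014, 1, 1, 4, 0)
print(effective_cutoff(explicit))  # 2014-01-01 04:00:00
print(effective_cutoff())          # the time at which this line runs
```

Passing an explicit cutoff time keeps the result independent of when the code is run.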
How do I select a certain amount of past data when calculating features?#
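When calculating features, you may want to use only a certain amount of past data to make a prediction. You can do this with the training_window parameter of ft.dfs. When you set training_window, Featuretools uses only the historical data between cutoff_time - training_window and cutoff_time.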
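As a rough illustration of that window, the sketch below filters timestamps with plain datetime arithmetic. It is not how Featuretools implements training_window; the in_training_window helper is hypothetical, and treating the window as excluding its start and including the cutoff is an assumption made here for the example.

```python
from datetime import datetime, timedelta


def in_training_window(event_time, cutoff_time, training_window):
    """True if event_time falls in (cutoff_time - training_window, cutoff_time]."""
    window_start = cutoff_time - training_window
    return window_start < event_time <= cutoff_time


cutoff = datetime(2014, 1, 1, 4, 0)
window = timedelta(hours=1)
events = [
    datetime(2014, 1, 1, 2, 30),  # older than the window
    datetime(2014, 1, 1, 3, 15),  # inside the window
    datetime(2014, 1, 1, 4, 0),   # exactly at the cutoff
    datetime(2014, 1, 1, 4, 45),  # after the cutoff
]

kept = [t for t in events if in_training_window(t, cutoff, window)]
print([t.strftime("%H:%M") for t in kept])  # ['03:15', '04:00']
```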
To make this calculation, Featuretools checks the times in the time_index column of the target_dataframe.
[28]:
es = ft.demo.load_mock_customer(return_entityset=True)
es["customers"].ww.time_index
[28]:
'join_date'
Our target_dataframe has a time_index, which is required for the training_window calculation. Here, we create a DataFrame of cutoff times so that each customer gets its own training window.
[29]:
cutoff_times = pd.DataFrame()
cutoff_times["customer_id"] = [1, 2, 3, 1]
cutoff_times["time"] = pd.to_datetime(
    ["2014-1-1 04:00", "2014-1-1 05:00", "2014-1-1 06:00", "2014-1-1 08:00"]
)
cutoff_times["label"] = [True, True, False, True]
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
    training_window="1 hour",
)
feature_matrix.head()
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
[29]:
customer_id | time | zip_code | COUNT(sessions) | MODE(sessions.device) | NUM_UNIQUE(sessions.device) | COUNT(transactions) | MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | MODE(transactions.product_id) | NUM_UNIQUE(transactions.product_id) | ... | STD(sessions.SUM(transactions.amount)) | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.NUM_UNIQUE(transactions.product_id)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | MODE(transactions.sessions.device) | NUM_UNIQUE(transactions.sessions.device) | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2014-01-01 04:00:00 | 60091 | 1 | tablet | 1 | 12 | 139.09 | 85.469167 | 6.78 | 4 | 5 | ... | NaN | 139.09 | 85.469167 | 6.78 | 5.0 | -0.830975 | 39.825249 | tablet | 1 | True |
2 | 2014-01-01 05:00:00 | 13244 | 1 | tablet | 1 | 13 | 118.85 | 77.304615 | 21.82 | 1 | 5 | ... | NaN | 118.85 | 77.304615 | 21.82 | 5.0 | -0.314918 | 33.725036 | tablet | 1 | True |
3 | 2014-01-01 06:00:00 | 13244 | 2 | desktop | 1 | 12 | 128.26 | 81.747500 | 20.06 | 3 | 5 | ... | 563.882303 | 220.02 | 172.597273 | 111.82 | 6.0 | -0.289466 | 35.704680 | desktop | 1 | False |
1 | 2014-01-01 08:00:00 | 60091 | 1 | mobile | 1 | 16 | 126.11 | 88.755625 | 11.62 | 4 | 5 | ... | NaN | 126.11 | 88.755625 | 11.62 | 5.0 | -1.038434 | 32.324534 | mobile | 1 | True |
4 rows × 76 columns
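In the code above, we ran DFS with the training_window argument set to 1 hour, which creates features using only the customer data collected during the hour before each cutoff time we provided.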
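To make the effect of a training window concrete, here is a minimal pandas-only sketch of the filtering idea. It does not call Featuretools. The helper rows_in_window, the sample data, and the choice to treat the window as the interval (cutoff - window, cutoff] are all assumptions made for illustration.

```python
import pandas as pd


def rows_in_window(df, time_col, cutoff, window):
    """Keep rows whose timestamp falls in (cutoff - window, cutoff].

    Hypothetical helper: the boundary handling is an assumption of this
    sketch, not a statement about Featuretools internals.
    """
    start = cutoff - window
    mask = (df[time_col] > start) & (df[time_col] <= cutoff)
    return df[mask]


# Made-up transactions for two customers.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "transaction_time": pd.to_datetime(
            [
                "2014-01-01 02:30:00",
                "2014-01-01 03:15:00",
                "2014-01-01 03:50:00",
                "2014-01-01 03:55:00",
            ]
        ),
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)

cutoff = pd.Timestamp("2014-01-01 04:00:00")
customer_1 = transactions[transactions["customer_id"] == 1]

# Only the rows from the last hour before the cutoff remain.
recent = rows_in_window(
    customer_1, "transaction_time", cutoff, pd.Timedelta(hours=1)
)
window_mean = recent["amount"].mean()
```

In this sketch the 02:30 transaction is older than one hour at the 04:00 cutoff, so it does not contribute to the mean.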
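Can I run DFS on a single table?#
It is possible, but running DFS on a single table does not take full advantage of what DFS can do. First, DFS cannot use any aggregation primitives, because aggregation requires at least two tables; only transform primitives are available. This limits the complexity of the features DFS can build through feature stacking. In addition, running single-table DFS on data with time columns can, in some cases, lead to label leakage. When the data is split across multiple tables, Featuretools can filter it by cutoff time instead of assuming it has already been flattened appropriately; with only one table, that filtering is not possible.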
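The distinction between the two kinds of primitives can be sketched in plain pandas. This is only an analogy, not Featuretools code, and the table and column names are made up for illustration. A transform produces one value per row of the same table, while an aggregation collapses many child rows into one value per parent, which is why it needs a second table to aggregate into.

```python
import pandas as pd

# Hypothetical single table: one row per transaction.
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "customer_id": [1, 1, 2, 2],
        "amount": [10.0, 30.0, 5.0, 15.0],
    }
)

# Transform-style feature: one output value for every input row.
transactions["amount_percentile"] = transactions["amount"].rank(pct=True)

# Aggregation-style feature: rows are grouped by a parent key and
# collapsed into a single value per parent (here, per customer).
mean_amount_per_customer = transactions.groupby("customer_id")["amount"].mean()
```

The transform keeps the four transaction rows, whereas the aggregation returns only two rows, one per customer.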
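If you only have a single table, DFS can still be useful. There are two main ways to pass a single table to DFS.
The first is to create an EntitySet that contains just that one table.
For example: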
[30]:
transactions_df = ft.demo.load_mock_customer(return_single_table=True)
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="transactions",
    trans_primitives=[
        "time_since",
        "day",
        "is_weekend",
        "cum_min",
        "minute",
        "weekday",
        "percentile",
        "year",
        "week",
        "cum_mean",
    ],
)
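The second way is to put the dataframe in a dictionary that maps its name to a tuple holding the information about that dataframe. We then pass the dictionary to the dataframes parameter of DFS.
In this case, the dictionary value is a tuple containing the dataframe, its index column, and its time index. More information about the possible arguments can be found in the DFS documentation.
For example: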
[31]:
transactions_df = ft.demo.load_mock_customer(return_single_table=True)
dataframes = {"transactions": (transactions_df, "transaction_id", "transaction_time")}
feature_matrix, feature_defs = ft.dfs(
    dataframes=dataframes,
    target_dataframe_name="transactions",
    trans_primitives=[
        "time_since",
        "day",
        "is_weekend",
        "cum_min",
        "minute",
        "weekday",
        "percentile",
        "year",
        "week",
        "cum_mean",
    ],
)
Before we examine the output, let's take a look at the original single table.
[32]:
transactions_df.head()
[32]:
transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | birthday | brand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
298 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 | A |
2 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 | B |
308 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 | B |
116 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 | B |
371 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 | 2 | desktop | 2014-01-01 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 | B |
Now we can look at the transformations that Featuretools was able to apply to this single DataFrame to create the feature matrix.
[33]:
feature_matrix.head()
[33]:
transaction_id | session_id | product_id | amount | customer_id | device | zip_code | brand | CUM_MEAN(amount) | CUM_MEAN(customer_id) | CUM_MEAN(session_id) | ... | WEEK(session_start) | WEEK(transaction_time) | WEEKDAY(birthday) | WEEKDAY(join_date) | WEEKDAY(session_start) | WEEKDAY(transaction_time) | YEAR(birthday) | YEAR(join_date) | YEAR(session_start) | YEAR(transaction_time) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
298 | 1 | 5 | 127.64 | 2 | desktop | 13244 | A | 127.640000 | 2.0 | 1.0 | ... | 1 | 1 | 0 | 6 | 2 | 2 | 1986 | 2012 | 2014 | 2014 |
2 | 1 | 2 | 109.48 | 2 | desktop | 13244 | B | 118.560000 | 2.0 | 1.0 | ... | 1 | 1 | 0 | 6 | 2 | 2 | 1986 | 2012 | 2014 | 2014 |
308 | 1 | 3 | 95.06 | 2 | desktop | 13244 | B | 110.726667 | 2.0 | 1.0 | ... | 1 | 1 | 0 | 6 | 2 | 2 | 1986 | 2012 | 2014 | 2014 |
116 | 1 | 4 | 78.92 | 2 | desktop | 13244 | B | 102.775000 | 2.0 | 1.0 | ... | 1 | 1 | 0 | 6 | 2 | 2 | 1986 | 2012 | 2014 | 2014 |
371 | 1 | 3 | 31.54 | 2 | desktop | 13244 | B | 88.528000 | 2.0 | 1.0 | ... | 1 | 1 | 0 | 6 | 2 | 2 | 1986 | 2012 | 2014 | 2014 |
5 rows × 44 columns
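As a sanity check on one of these transform features, the CUM_MEAN(amount) values in the first five rows can be reproduced with plain pandas. This is a sketch that assumes the cumulative mean is a running mean over the rows in the order shown; it uses only the five amount values displayed above and does not call Featuretools.

```python
import pandas as pd

# The five amounts from the table above, indexed by transaction_id.
amounts = pd.Series(
    [127.64, 109.48, 95.06, 78.92, 31.54],
    index=[298, 2, 308, 116, 371],
    name="amount",
)

# Running mean of all values seen so far, one output per input row.
cum_mean = amounts.expanding().mean()
```

Rounded to six decimals, the result matches the CUM_MEAN(amount) column for these rows: 127.64, 118.56, 110.726667, 102.775, and 88.528.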
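How do I prevent label leakage with DFS?#
One problem you may run into when using DFS is label leakage. You want to make sure that the labels in your data are not mistakenly used to create features and the feature matrix.
Featuretools pays particular attention to helping users avoid label leakage.
There are two ways to prevent label leakage, depending on whether your data has timestamps.
1. Data without timestamps#
Without timestamps, you can create an EntitySet that contains only the training data and then run ft.dfs. This creates a feature matrix from the training data alone and also returns a list of feature definitions. Next, you can create an EntitySet from the test data and recompute the same features by calling ft.calculate_feature_matrix with the list of feature definitions obtained earlier.
Here is an example of that workflow:
First, let's create our training data.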
[34]:
train_data = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5],
        "age": [40, 50, 10, 20, 30],
        "gender": ["m", "f", "m", "f", "f"],
        "signup_date": pd.date_range("2014-01-01 01:41:50", periods=5, freq="25min"),
        "labels": [True, False, True, False, True],
    }
)
train_data.head()
[34]:
customer_id | age | gender | signup_date | labels | |
---|---|---|---|---|---|
0 | 1 | 40 | m | 2014-01-01 01:41:50 | True |
1 | 2 | 50 | f | 2014-01-01 02:06:50 | False |
2 | 3 | 10 | m | 2014-01-01 02:31:50 | True |
3 | 4 | 20 | f | 2014-01-01 02:56:50 | False |
4 | 5 | 30 | f | 2014-01-01 03:21:50 | True |
Now, we can create an EntitySet for our training data.
[35]:
es_train_data = ft.EntitySet(id="customer_train_data")
es_train_data = es_train_data.add_dataframe(
    dataframe_name="customers", dataframe=train_data, index="customer_id"
)
es_train_data
[35]:
Entityset: customer_train_data
DataFrames:
customers [Rows: 5, Columns: 5]
Relationships:
No relationships
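Next, we are ready to create the features and the feature matrix for the training data. We do not want Featuretools to use the label column to build new features, so we exclude it with the ignore_columns option. Doing so also removes the label column from the feature matrix, so we tell DFS to include it as a seed feature.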
[36]:
labels_feature = ft.Feature(es_train_data["customers"].ww["labels"])
feature_matrix_train, feature_defs = ft.dfs(
    entityset=es_train_data,
    target_dataframe_name="customers",
    ignore_columns={"customers": ["labels"]},
    seed_features=[labels_feature],
)
feature_matrix_train
[36]:
customer_id | age | labels | DAY(signup_date) | MONTH(signup_date) | WEEKDAY(signup_date) | YEAR(signup_date) |
---|---|---|---|---|---|---|
1 | 40 | True | 1 | 1 | 2 | 2014 |
2 | 50 | False | 1 | 1 | 2 | 2014 |
3 | 10 | True | 1 | 1 | 2 | 2014 |
4 | 20 | False | 1 | 1 | 2 | 2014 |
5 | 30 | True | 1 | 1 | 2 | 2014 |
We will also encode the feature matrix to make it compatible with machine learning.
[37]:
feature_matrix_train_enc, features_enc = ft.encode_features(
    feature_matrix_train, feature_defs
)
feature_matrix_train_enc.head()
[37]:
customer_id | age | labels | DAY(signup_date) = 1 | DAY(signup_date) is unknown | MONTH(signup_date) = 1 | MONTH(signup_date) is unknown | WEEKDAY(signup_date) = 2 | WEEKDAY(signup_date) is unknown | YEAR(signup_date) = 2014 | YEAR(signup_date) is unknown |
---|---|---|---|---|---|---|---|---|---|---|
1 | 40 | True | True | False | True | False | True | False | True | False |
2 | 50 | False | True | False | True | False | True | False | True | False |
3 | 10 | True | True | False | True | False | True | False | True | False |
4 | 20 | False | True | False | True | False | True | False | True | False |
5 | 30 | True | True | False | True | False | True | False | True | False |
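Notice that the entire feature matrix now contains only numeric and boolean values.
Now we can use the feature definitions to compute a feature matrix for the test data and avoid label leakage.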
[38]:
test_train = pd.DataFrame(
    {
        "customer_id": [6, 7, 8, 9, 10],
        "age": [20, 25, 55, 22, 35],
        "gender": ["f", "m", "m", "m", "m"],
        "signup_date": pd.date_range("2014-01-01 01:41:50", periods=5, freq="25min"),
        "labels": [True, False, False, True, True],
    }
)
es_test_data = ft.EntitySet(id="customer_test_data")
es_test_data = es_test_data.add_dataframe(
    dataframe_name="customers",
    dataframe=test_train,
    index="customer_id",
    time_index="signup_date",
)
# Reuse the encoded feature definitions computed on the training data
feature_matrix_enc_test = ft.calculate_feature_matrix(
    features=features_enc, entityset=es_test_data
)
feature_matrix_enc_test.head()
[38]:
customer_id | age | labels | DAY(signup_date) = 1 | DAY(signup_date) is unknown | MONTH(signup_date) = 1 | MONTH(signup_date) is unknown | WEEKDAY(signup_date) = 2 | WEEKDAY(signup_date) is unknown | YEAR(signup_date) = 2014 | YEAR(signup_date) is unknown |
---|---|---|---|---|---|---|---|---|---|---|
6 | 20 | True | True | False | True | False | True | False | True | False |
7 | 25 | False | True | False | True | False | True | False | True | False |
8 | 55 | False | True | False | True | False | True | False | True | False |
9 | 22 | True | True | False | True | False | True | False | True | False |
10 | 35 | True | True | False | True | False | True | False | True | False |
See the modeling section for an example of using the encoded matrix with sklearn.
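The reason for reusing definitions derived from the training data, instead of deriving them again from the test data, can be illustrated with a pandas-only sketch. This is an analogy and not how ft.encode_features is implemented; the sample data and the use of pd.get_dummies with reindex are assumptions chosen for illustration.

```python
import pandas as pd

train = pd.DataFrame({"gender": ["m", "f", "m"]})
test = pd.DataFrame({"gender": ["f", "x"]})  # "x" never appears in training

# The set of encoded columns is decided by the training data only.
train_enc = pd.get_dummies(train, columns=["gender"])

# Encode the test data, then force it onto the training columns:
# categories unseen in training are dropped, and training columns that
# are missing from the test encoding are filled with False.
test_enc = pd.get_dummies(test, columns=["gender"]).reindex(
    columns=train_enc.columns, fill_value=False
)
```

Both matrices end up with the same columns, so a model fitted on the training matrix can consume the test matrix without the test data influencing which features exist.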
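2. Data with timestamps#
If your data has timestamps, the best way to prevent label leakage is to use a list of cutoff times, which specify the last point in time at which data may be used for each row of the resulting feature matrix. To use cutoff times, you need to set a time index for each time-sensitive DataFrame in the EntitySet.
Tip: Even if your data does not have timestamps, you can add a column of dummy timestamps that Featuretools can use as a time index.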
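A minimal sketch of the tip, using pandas only. The column name dummy_time and the constant date are arbitrary assumptions; any timestamp that makes sense for your problem could be used instead.

```python
import pandas as pd

# Hypothetical table without any time information.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [40, 50, 10]})

# Assumption: every row is treated as known from the same arbitrary instant.
customers["dummy_time"] = pd.Timestamp("2000-01-01")
```

The new column could then be named as the time_index when the dataframe is added to an EntitySet.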
When you call ft.dfs, you can provide a DataFrame of cutoff times like this:
[39]:
cutoff_times = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5],
        "time": pd.date_range("2014-01-01 01:41:50", periods=5, freq="25min"),
    }
)
cutoff_times.head()
[39]:
customer_id | time | |
---|---|---|
0 | 1 | 2014-01-01 01:41:50 |
1 | 2 | 2014-01-01 02:06:50 |
2 | 3 | 2014-01-01 02:31:50 |
3 | 4 | 2014-01-01 02:56:50 |
4 | 5 | 2014-01-01 03:21:50 |
[40]:
train_test_data = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5],
        "age": [20, 25, 55, 22, 35],
        "gender": ["f", "m", "m", "m", "m"],
        "signup_date": pd.date_range("2010-01-01 01:41:50", periods=5, freq="25min"),
    }
)
es_train_test_data = ft.EntitySet(id="customer_train_test_data")
es_train_test_data = es_train_test_data.add_dataframe(
    dataframe_name="customers",
    dataframe=train_test_data,
    index="customer_id",
    time_index="signup_date",
)
feature_matrix_train_test, features = ft.dfs(
    entityset=es_train_test_data,
    target_dataframe_name="customers",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
)
feature_matrix_train_test.head()
[40]:
customer_id | time | age | DAY(signup_date) | MONTH(signup_date) | WEEKDAY(signup_date) | YEAR(signup_date) |
---|---|---|---|---|---|---|
1 | 2014-01-01 01:41:50 | 20 | 1 | 1 | 4 | 2010 |
2 | 2014-01-01 02:06:50 | 25 | 1 | 1 | 4 | 2010 |
3 | 2014-01-01 02:31:50 | 55 | 1 | 1 | 4 | 2010 |
4 | 2014-01-01 02:56:50 | 22 | 1 | 1 | 4 | 2010 |
5 | 2014-01-01 03:21:50 | 35 | 1 | 1 | 4 | 2010 |
Above, we created a feature matrix that uses cutoff times to avoid label leakage. We could also encode this feature matrix using ft.encode_features.
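The idea behind cutoff times can also be sketched without Featuretools: for each row of the cutoff table, only data recorded at or before that row's time is allowed to contribute. In the pandas sketch below, the event log is hypothetical, the two cutoff times are borrowed from the example above, and including the cutoff instant itself is a choice made for illustration.

```python
import pandas as pd

# Hypothetical event log for two customers.
events = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2],
        "event_time": pd.to_datetime(
            [
                "2014-01-01 01:00:00",
                "2014-01-01 02:00:00",
                "2014-01-01 02:00:00",
                "2014-01-01 02:06:50",
            ]
        ),
        "amount": [5.0, 7.0, 3.0, 4.0],
    }
)

# One cutoff time per customer.
cutoff_times = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "time": pd.to_datetime(["2014-01-01 01:41:50", "2014-01-01 02:06:50"]),
    }
)

# Attach each customer's cutoff to its events, then drop anything later.
merged = events.merge(cutoff_times, on="customer_id")
visible = merged[merged["event_time"] <= merged["time"]]

# Features are computed only from the rows that survived the filter.
total_amount = visible.groupby("customer_id")["amount"].sum()
```

Customer 1's 02:00 event happens after that customer's cutoff, so it is excluded from the total, which is the behaviour that prevents information from the future leaking into a feature.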
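What is the difference between passing a primitive object and a string to DFS?#
There are two ways to pass a primitive to DFS: as the primitive object itself, or as a string of the primitive's name.
We will use the transform primitive TimeSincePrevious to illustrate the difference between the two.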
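Before looking at Featuretools itself, the general pattern can be shown with a toy, library-independent sketch. Everything here is hypothetical (the class TimeSincePreviousSketch, the REGISTRY dictionary, and the resolve helper) and none of it reflects Featuretools internals. It only illustrates that a name can select a default configuration, while an object can carry its own settings.

```python
class TimeSincePreviousSketch:
    """Toy stand-in for a parameterized primitive (not Featuretools code)."""

    def __init__(self, unit="seconds"):
        self.unit = unit

    def __call__(self, timestamps):
        # Input values are plain numbers of seconds; the first row has no
        # previous value, so its result is None.
        divisor = {"seconds": 1, "hours": 3600}[self.unit]
        result = [None]
        for previous, current in zip(timestamps, timestamps[1:]):
            result.append((current - previous) / divisor)
        return result


# Hypothetical lookup table from string names to classes.
REGISTRY = {"time_since_previous": TimeSincePreviousSketch}


def resolve(primitive):
    """Turn a name, a class, or an instance into a usable instance."""
    if isinstance(primitive, str):
        return REGISTRY[primitive]()  # a name can only mean the defaults
    if isinstance(primitive, type):
        return primitive()  # a bare class also means the defaults
    return primitive  # an instance keeps whatever settings it was given


times = [0, 3600, 10800]
by_name = resolve("time_since_previous")(times)
by_class = resolve(TimeSincePreviousSketch)(times)
in_hours = resolve(TimeSincePreviousSketch(unit="hours"))(times)
```

In this toy version the string and the bare class give identical results, and only the configured instance changes the output.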
First, let's use the string of the primitive's name.
[41]:
es = ft.demo.load_mock_customer(return_entityset=True)
[42]:
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=[],
trans_primitives=["time_since_previous"],
)
feature_matrix
[42]:
zip_code | TIME_SINCE_PREVIOUS(join_date) | |
---|---|---|
customer_id | ||
5 | 60091 | NaN |
4 | 60091 | 22948824.0 |
1 | 60091 | 744019.0 |
3 | 13244 | 10212841.0 |
2 | 13244 | 21282510.0 |
Now, let's use the primitive object.
[43]:
from featuretools.primitives import TimeSincePrevious
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=[],
trans_primitives=[TimeSincePrevious],
)
feature_matrix
[43]:
zip_code | TIME_SINCE_PREVIOUS(join_date) | |
---|---|---|
customer_id | ||
5 | 60091 | NaN |
4 | 60091 | 22948824.0 |
1 | 60091 | 744019.0 |
3 | 13244 | 10212841.0 |
2 | 13244 | 21282510.0 |
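As we can see above, the two feature matrices are identical.
However, if we need to change a controllable parameter of a primitive, we should use the primitive object.
For example, let's make TimeSincePrevious return its values in hours (the default is seconds).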
[44]:
from featuretools.primitives import TimeSincePrevious
time_since_previous_in_hours = TimeSincePrevious(unit="hours")
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=[],
trans_primitives=[time_since_previous_in_hours],
)
feature_matrix
[44]:
zip_code | TIME_SINCE_PREVIOUS(join_date, unit=hours) | |
---|---|---|
customer_id | ||
5 | 60091 | NaN |
4 | 60091 | 6374.673333 |
1 | 60091 | 206.671944 |
3 | 13244 | 2836.900278 |
2 | 13244 | 5911.808333 |
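The hour values above are the second values from the earlier output divided by 3600. The following is a quick check in plain Python, using numbers copied from the two tables; it does not call Featuretools, and the names `seconds`, `hours`, and `SECONDS_PER_HOUR` are only illustrative.

```python
# Seconds reported by TIME_SINCE_PREVIOUS(join_date) in the default unit,
# copied from the earlier output (the first customer has no previous row).
seconds = [22948824.0, 744019.0, 10212841.0, 21282510.0]

SECONDS_PER_HOUR = 60 * 60

# Converting by hand reproduces the unit="hours" column.
hours = [s / SECONDS_PER_HOUR for s in seconds]

for s, h in zip(seconds, hours):
    print(f"{s:>12.1f} s = {h:.6f} h")
```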
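Features#
How can I select features based on some attributes (a specific string, an explicit primitive type, a return type, a given depth)?#
You may want to select a subset of your features based on some attributes.
Suppose you want to select the features whose names contain the string amount. You can check for this by calling the get_name function on each feature definition.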
[45]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_defs = ft.dfs(
entityset=es, target_dataframe_name="customers", features_only=True
)
features_with_amount = []
for x in feature_defs:
    if "amount" in x.get_name():
        features_with_amount.append(x)
features_with_amount[0:5]
[45]:
[<Feature: MAX(transactions.amount)>,
<Feature: MEAN(transactions.amount)>,
<Feature: MIN(transactions.amount)>,
<Feature: SKEW(transactions.amount)>,
<Feature: STD(transactions.amount)>]
You may also want to select only aggregation features.
[46]:
from featuretools import AggregationFeature
features_only_aggregations = []
for x in feature_defs:
    if type(x) == AggregationFeature:
        features_only_aggregations.append(x)
features_only_aggregations[0:5]
[46]:
[<Feature: COUNT(sessions)>,
<Feature: MODE(sessions.device)>,
<Feature: NUM_UNIQUE(sessions.device)>,
<Feature: COUNT(transactions)>,
<Feature: MAX(transactions.amount)>]
You may also want to select only the features calculated at a certain depth. You can do this with the get_depth function.
[47]:
features_only_depth_2 = []
for x in feature_defs:
    if x.get_depth() == 2:
        features_only_depth_2.append(x)
features_only_depth_2[0:5]
[47]:
[<Feature: MAX(sessions.COUNT(transactions))>,
<Feature: MAX(sessions.MEAN(transactions.amount))>,
<Feature: MAX(sessions.MIN(transactions.amount))>,
<Feature: MAX(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: MAX(sessions.SKEW(transactions.amount))>]
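Finally, you may want to select only features that return a certain type. You can do this with the column_schema attribute. For more information on working with column schemas, see the guide on transitioning from Variables to Woodwork.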
[48]:
features_only_numeric = []
for x in feature_defs:
    if "numeric" in x.column_schema.semantic_tags:
        features_only_numeric.append(x)
features_only_numeric[0:5]
[48]:
[<Feature: COUNT(sessions)>,
<Feature: NUM_UNIQUE(sessions.device)>,
<Feature: COUNT(transactions)>,
<Feature: MAX(transactions.amount)>,
<Feature: MEAN(transactions.amount)>]
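Once you have your specific feature list, you can use ft.calculate_feature_matrix to generate a feature matrix for only those features.
For our example, let's use the features whose names contain the string amount.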
[49]:
feature_matrix = ft.calculate_feature_matrix(
entityset=es, features=features_with_amount
) # change to your specific feature list
feature_matrix.head()
[49]:
MAX(transactions.amount) | MEAN(transactions.amount) | MIN(transactions.amount) | SKEW(transactions.amount) | STD(transactions.amount) | SUM(transactions.amount) | MAX(sessions.MEAN(transactions.amount)) | MAX(sessions.MIN(transactions.amount)) | MAX(sessions.SKEW(transactions.amount)) | MAX(sessions.STD(transactions.amount)) | ... | STD(sessions.MAX(transactions.amount)) | STD(sessions.MEAN(transactions.amount)) | STD(sessions.MIN(transactions.amount)) | STD(sessions.SKEW(transactions.amount)) | STD(sessions.SUM(transactions.amount)) | SUM(sessions.MAX(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | SUM(sessions.MIN(transactions.amount)) | SUM(sessions.SKEW(transactions.amount)) | SUM(sessions.STD(transactions.amount)) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | |||||||||||||||||||||
5 | 149.02 | 80.375443 | 7.55 | -0.025941 | 44.095630 | 6349.66 | 94.481667 | 20.65 | 0.602209 | 51.149250 | ... | 7.928001 | 11.007471 | 4.961414 | 0.415426 | 402.775486 | 839.76 | 472.231119 | 86.49 | 0.014384 | 259.873954 |
4 | 149.95 | 80.070459 | 5.73 | -0.036348 | 45.068765 | 8727.68 | 110.450000 | 54.83 | 0.382868 | 54.293903 | ... | 3.514421 | 13.027258 | 16.960575 | 0.387884 | 235.992478 | 1157.99 | 649.657515 | 131.51 | 0.002764 | 356.125829 |
1 | 139.43 | 71.631905 | 5.81 | 0.019698 | 40.442059 | 9025.62 | 88.755625 | 26.36 | 0.640252 | 46.905665 | ... | 7.322191 | 13.759314 | 6.954507 | 0.589386 | 279.510713 | 1057.97 | 582.193117 | 78.59 | -0.476122 | 312.745952 |
3 | 149.15 | 67.060430 | 5.89 | 0.418230 | 43.683296 | 6236.62 | 82.109444 | 20.06 | 0.854976 | 50.110120 | ... | 10.724241 | 11.174282 | 5.424407 | 0.429374 | 219.021420 | 847.63 | 405.237462 | 66.21 | 2.286086 | 257.299895 |
2 | 146.81 | 77.422366 | 8.73 | 0.098259 | 37.705178 | 7200.28 | 96.581000 | 56.46 | 0.755711 | 47.935920 | ... | 17.221593 | 11.477071 | 15.874374 | 0.509798 | 251.609234 | 931.63 | 548.905851 | 154.60 | -0.277640 | 258.700528 |
5 rows × 37 columns
Note that every column name in the feature matrix above contains the string amount.
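The four selection loops above all follow the same pattern: keep a feature when some test on it is true. The sketch below folds that pattern into one helper. `select_features` and `StubFeature` are hypothetical names introduced here for illustration; the stub only mimics the `get_name()` and `get_depth()` methods used above so the example runs without Featuretools.

```python
from typing import Callable, Iterable, List


def select_features(features: Iterable, predicate: Callable) -> List:
    """Return the features for which predicate(feature) is truthy."""
    return [f for f in features if predicate(f)]


class StubFeature:
    """Hypothetical stand-in exposing only get_name() and get_depth()."""

    def __init__(self, name: str, depth: int):
        self._name = name
        self._depth = depth

    def get_name(self) -> str:
        return self._name

    def get_depth(self) -> int:
        return self._depth


stub_defs = [
    StubFeature("COUNT(sessions)", 1),
    StubFeature("MAX(transactions.amount)", 1),
    StubFeature("MAX(sessions.MEAN(transactions.amount))", 2),
]

# Same tests as in the loops above, expressed as predicates.
with_amount = select_features(stub_defs, lambda f: "amount" in f.get_name())
depth_2 = select_features(stub_defs, lambda f: f.get_depth() == 2)

print([f.get_name() for f in with_amount])
print([f.get_name() for f in depth_2])
```

With real feature definitions, the same helper could be called as `select_features(feature_defs, lambda f: "amount" in f.get_name())`.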
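How do I create where features?#
Sometimes you may want to create features that are conditioned on a second value before they are calculated. This extra filter is called a "where clause". You can create these features using the interesting_values of a column.
If your EntitySet contains categorical columns, you can use add_interesting_values. This function finds interesting values for the categorical columns, which can then be used to generate "where" clauses.
First, let's create our EntitySet.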
[50]:
es = ft.demo.load_mock_customer(return_entityset=True)
es
[50]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 3]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Now we can add interesting values for the categorical columns.
[51]:
es.add_interesting_values()
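Now we can run DFS with the where_primitives argument, which defines the primitives to apply with where clauses. In this case, let's use the primitive count. For this to work, count must be present in both agg_primitives and where_primitives.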
[52]:
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=["count"],
where_primitives=["count"],
trans_primitives=[],
)
feature_matrix.head()
[52]:
zip_code | COUNT(sessions) | COUNT(transactions) | COUNT(sessions WHERE device = mobile) | COUNT(sessions WHERE device = desktop) | COUNT(sessions WHERE device = tablet) | COUNT(sessions WHERE customers.zip_code = 13244) | COUNT(sessions WHERE customers.zip_code = 60091) | COUNT(transactions WHERE sessions.device = mobile) | COUNT(transactions WHERE sessions.device = tablet) | COUNT(transactions WHERE sessions.device = desktop) | |
---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | |||||||||||
5 | 60091 | 6 | 79 | 3 | 2 | 1 | 0 | 6 | 36 | 14 | 29 |
4 | 60091 | 8 | 109 | 4 | 3 | 1 | 0 | 8 | 53 | 18 | 38 |
1 | 60091 | 8 | 126 | 3 | 2 | 3 | 0 | 8 | 56 | 43 | 27 |
3 | 13244 | 6 | 93 | 1 | 4 | 1 | 6 | 0 | 16 | 15 | 62 |
2 | 13244 | 7 | 93 | 2 | 3 | 2 | 7 | 0 | 31 | 28 | 34 |
We have now created some useful features. One example is COUNT(sessions WHERE device = tablet), which tells us how many sessions a customer completed on a tablet.
[53]:
feature_matrix[["COUNT(sessions WHERE device = tablet)"]]
[53]:
COUNT(sessions WHERE device = tablet) | |
---|---|
customer_id | |
5 | 1 |
4 | 1 |
1 | 3 |
3 | 1 |
2 | 2 |
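To make the meaning of a where clause concrete, here is a plain-Python sketch of what a conditional count computes: filter the child rows first, then count per parent. The session rows are hypothetical and the code does not use Featuretools.

```python
from collections import Counter

# Hypothetical session rows: (customer_id, device).
sessions = [
    (1, "tablet"),
    (1, "mobile"),
    (1, "tablet"),
    (2, "desktop"),
    (2, "tablet"),
    (3, "mobile"),
]

# COUNT(sessions): every session counts toward its customer.
count_all = Counter(customer for customer, _ in sessions)

# COUNT(sessions WHERE device = tablet): filter first, then count.
count_tablet = Counter(
    customer for customer, device in sessions if device == "tablet"
)

for customer in sorted(count_all):
    # Counter returns 0 for customers with no matching sessions.
    print(customer, count_all[customer], count_tablet[customer])
```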
Primitives#
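What is the difference between the primitive types (Transform, GroupBy Transform, and Aggregation)?#
You may be wondering how the primitive types differ.
Let's look at the differences between transform, groupby transform, and aggregation primitives.
First, let's create a simple EntitySet.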
[54]:
import pandas as pd
import featuretools as ft
df = pd.DataFrame(
{
"id": [1, 2, 3, 4, 5, 6],
"time_index": pd.date_range("1/1/2019", periods=6, freq="D"),
"group": ["a", "a", "a", "a", "a", "a"],
"val": [5, 1, 10, 20, 6, 23],
}
)
es = ft.EntitySet()
es = es.add_dataframe(
dataframe_name="observations", dataframe=df, index="id", time_index="time_index"
)
es = es.normalize_dataframe(
base_dataframe_name="observations", new_dataframe_name="groups", index="group"
)
es.plot()
[54]:
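After calling normalize_dataframe, the column "group" has the semantic tag "foreign_key" because it identifies another DataFrame. Alternatively, this could be set with the semantic_tags parameter when we first call es.add_dataframe().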
Transform Primitive#
The cum_sum primitive calculates the running sum of a list of numbers.
[55]:
from featuretools.primitives import CumSum
cum_sum = CumSum()
cum_sum([1, 2, 3, 4, 5]).tolist()
[55]:
[1, 3, 6, 10, 15]
If we apply it through the trans_primitives argument, it is calculated over the entire observations DataFrame, like this:
[56]:
feature_matrix, feature_defs = ft.dfs(
target_dataframe_name="observations",
entityset=es,
agg_primitives=[],
trans_primitives=["cum_sum"],
groupby_trans_primitives=[],
)
feature_matrix
[56]:
group | val | CUM_SUM(val) | |
---|---|---|---|
id | |||
1 | a | 5 | 5.0 |
2 | a | 1 | 6.0 |
3 | a | 10 | 16.0 |
4 | a | 20 | 36.0 |
5 | a | 6 | 42.0 |
6 | a | 23 | 65.0 |
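Groupby Transform Primitive#
If we apply it through groupby_trans_primitives instead, DFS first groups by any foreign key columns and then applies the transform primitive. As a result, we get the cumulative sum by group.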
[57]:
feature_matrix, feature_defs = ft.dfs(
target_dataframe_name="observations",
entityset=es,
agg_primitives=[],
trans_primitives=[],
groupby_trans_primitives=["cum_sum"],
)
feature_matrix
[57]:
group | val | CUM_SUM(val) by group | |
---|---|---|---|
id | |||
1 | a | 5 | 5.0 |
2 | a | 1 | 6.0 |
3 | a | 10 | 16.0 |
4 | a | 20 | 36.0 |
5 | a | 6 | 42.0 |
6 | a | 23 | 65.0 |
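In the example above every row belongs to group a, so the grouped output matches the plain cumulative sum. The difference only appears when there is more than one group. The plain-Python sketch below reuses the same values but assigns them to two hypothetical groups, a and b, to show both running totals side by side; it is an illustration of the idea, not Featuretools code.

```python
from collections import defaultdict

# Hypothetical rows (group, val); the values reuse the example above,
# but the group labels are changed so that two groups exist.
rows = [("a", 5), ("b", 1), ("a", 10), ("b", 20), ("a", 6), ("b", 23)]

running = 0
per_group = defaultdict(int)
plain_cum_sum = []
grouped_cum_sum = []

for group, val in rows:
    running += val           # transform: one running total for all rows
    per_group[group] += val  # groupby transform: one running total per group
    plain_cum_sum.append(running)
    grouped_cum_sum.append(per_group[group])

print(plain_cum_sum)
print(grouped_cum_sum)
```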
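Aggregation Primitive#
Finally, there is the aggregation primitive "sum". If we use sum, it calculates the sum for each group at the cutoff time of each row. Because we did not specify a cutoff time, it uses all of the data for each group in every row.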
[58]:
feature_matrix, feature_defs = ft.dfs(
target_dataframe_name="observations",
entityset=es,
agg_primitives=["sum"],
trans_primitives=[],
cutoff_time_in_index=True,
groupby_trans_primitives=[],
)
feature_matrix
[58]:
group | val | groups.SUM(observations.val) | ||
---|---|---|---|---|
id | time | |||
1 | 2024-10-11 14:50:23.002623 | a | 5 | 65.0 |
2 | 2024-10-11 14:50:23.002623 | a | 1 | 65.0 |
3 | 2024-10-11 14:50:23.002623 | a | 10 | 65.0 |
4 | 2024-10-11 14:50:23.002623 | a | 20 | 65.0 |
5 | 2024-10-11 14:50:23.002623 | a | 6 | 65.0 |
6 | 2024-10-11 14:50:23.002623 | a | 23 | 65.0 |
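If we set the cutoff time of each row to its time index and then use sum as an aggregation primitive, the result is the same as cum_sum (although the row order in the displayed DataFrame may differ).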
[59]:
cutoff_time = df[["id", "time_index"]]
cutoff_time
[59]:
id | time_index | |
---|---|---|
1 | 1 | 2019-01-01 |
2 | 2 | 2019-01-02 |
3 | 3 | 2019-01-03 |
4 | 4 | 2019-01-04 |
5 | 5 | 2019-01-05 |
6 | 6 | 2019-01-06 |
[60]:
feature_matrix, feature_defs = ft.dfs(
target_dataframe_name="observations",
entityset=es,
agg_primitives=["sum"],
trans_primitives=[],
groupby_trans_primitives=[],
cutoff_time_in_index=True,
cutoff_time=cutoff_time,
)
feature_matrix
[60]:
group | val | groups.SUM(observations.val) | ||
---|---|---|---|---|
id | time | |||
1 | 2019-01-01 | a | 5 | 5.0 |
2 | 2019-01-02 | a | 1 | 6.0 |
3 | 2019-01-03 | a | 10 | 16.0 |
4 | 2019-01-04 | a | 20 | 36.0 |
5 | 2019-01-05 | a | 6 | 42.0 |
6 | 2019-01-06 | a | 23 | 65.0 |
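The equivalence shown above can be checked by hand. The sketch below, in plain Python and without Featuretools, sums every observation at or before each row's cutoff and compares the result with a running total. It assumes that data stamped exactly at the cutoff is included, which matches the output above (the first row already shows 5.0).

```python
from datetime import date, timedelta
from itertools import accumulate

# The observations from the example: one value per day, all in group "a".
start = date(2019, 1, 1)
observations = [
    (start + timedelta(days=i), val)
    for i, val in enumerate([5, 1, 10, 20, 6, 23])
]

# Aggregation with a per-row cutoff: sum every value at or before the cutoff.
sum_at_cutoff = [
    sum(val for ts, val in observations if ts <= cutoff)
    for cutoff, _ in observations
]

# Transform: running total over the rows in time order.
cum_sum = list(accumulate(val for _, val in observations))

print(sum_at_cutoff)
print(cum_sum)
```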
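How do I get a list of all aggregation and transform primitives?#
You can use featuretools.list_primitives() to get all of the primitives in Featuretools. It returns a DataFrame containing each primitive's name, type, and description, along with its valid inputs and return type.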
[61]:
df_primitives = ft.list_primitives()
df_primitives.head()
[61]:
name | type | description | valid_inputs | return_type | |
---|---|---|---|---|---|
0 | is_monotonically_increasing | aggregation | Determines if a series is monotonically increasing. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = BooleanNullable)> |
1 | max_consecutive_positives | aggregation | Determines the maximum number of consecutive positive values in the input | <ColumnSchema (Logical Type = Double)>, <Colum... | <ColumnSchema (Logical Type = Integer) (Semant... |
2 | count_outside_nth_std | aggregation | Determines the number of observations that lie outside the first N standard deviations. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = Integer) (Semant... |
3 | num_peaks | aggregation | Determines the number of peaks in a list of numbers. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = Integer) (Semant... |
4 | first | aggregation | Determines the first value in a list. | <ColumnSchema> | None |
[62]:
df_primitives.tail()
[62]:
name | type | description | valid_inputs | return_type | |
---|---|---|---|---|---|
220 | upper_case_word_count | transform | Determines the number of words in a string that are entirely capitalized. | <ColumnSchema (Logical Type = NaturalLanguage)> | <ColumnSchema (Logical Type = IntegerNullable)... |
221 | days_in_month | transform | Determines the number of days in the month of a given datetime. | <ColumnSchema (Logical Type = Datetime)> | <ColumnSchema (Logical Type = Ordinal: [1, 2, ... |
222 | is_null | transform | Determines if a value is null. | <ColumnSchema> | <ColumnSchema (Logical Type = Boolean)> |
223 | add_numeric | transform | Performs element-wise addition of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Semantic Tags = ['numeric'])> |
224 | expanding_min | transform | Computes the expanding minimum of events over a given window. | <ColumnSchema (Semantic Tags = ['numeric'])>, ... | <ColumnSchema (Semantic Tags = ['numeric'])> |
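Because the result is an ordinary DataFrame, it can be filtered by the type column to list only aggregation or only transform primitives. The sketch below uses a small hypothetical stand-in with the same name and type columns so that it runs without Featuretools; with the library installed, the frame returned by ft.list_primitives() would take its place.

```python
import pandas as pd

# Hypothetical stand-in with the same "name" and "type" columns as the
# DataFrame shown above; it is not the real primitive list.
df_primitives = pd.DataFrame(
    {
        "name": ["first", "num_peaks", "is_null", "add_numeric"],
        "type": ["aggregation", "aggregation", "transform", "transform"],
    }
)

# Boolean masks select one primitive type at a time.
is_agg = df_primitives["type"] == "aggregation"
is_trans = df_primitives["type"] == "transform"

agg_names = df_primitives.loc[is_agg, "name"].tolist()
trans_names = df_primitives.loc[is_trans, "name"].tolist()

print(agg_names)
print(trans_names)
```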
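How do I change the units of a TimeSince primitive?#
Several primitives in Featuretools perform time-based calculations. These include TimeSince, TimeSincePrevious, TimeSinceLast, and TimeSinceFirst.
You can change the units from the default of seconds to any valid time unit as follows: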
[63]:
from featuretools.primitives import (
TimeSince,
TimeSinceFirst,
TimeSinceLast,
TimeSincePrevious,
)
time_since = TimeSince(unit="minutes")
time_since_previous = TimeSincePrevious(unit="hours")
time_since_last = TimeSinceLast(unit="days")
time_since_first = TimeSinceFirst(unit="years")
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
agg_primitives=[time_since_last, time_since_first],
trans_primitives=[time_since, time_since_previous],
)
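Above, we changed the units to the following:
TimeSince is in minutes
TimeSincePrevious is in hours
TimeSinceLast is in days
TimeSinceFirst is in years
We can now see that our feature matrix contains several features in which the units of the TimeSince primitives have changed.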
[64]:
feature_matrix.head()
[64]:
zip_code | TIME_SINCE_FIRST(sessions.session_start, unit=years) | TIME_SINCE_LAST(sessions.session_start, unit=days) | TIME_SINCE_FIRST(transactions.transaction_time, unit=years) | TIME_SINCE_LAST(transactions.transaction_time, unit=days) | TIME_SINCE(birthday, unit=minutes) | TIME_SINCE(join_date, unit=minutes) | TIME_SINCE_PREVIOUS(join_date, unit=hours) | TIME_SINCE_FIRST(transactions.sessions.session_start, unit=years) | TIME_SINCE_LAST(transactions.sessions.session_start, unit=days) | |
---|---|---|---|---|---|---|---|---|---|---|
customer_id | ||||||||||
5 | 60091 | 10.783855 | 3936.283543 | 10.783855 | 3936.278277 | 2.114729e+07 | 7.488563e+06 | NaN | 10.783855 | 3936.283543 |
4 | 60091 | 10.783834 | 3936.394885 | 10.783834 | 3936.388114 | 9.550970e+06 | 7.106082e+06 | 6374.673333 | 10.783834 | 3936.394885 |
1 | 60091 | 10.783803 | 3936.319654 | 10.783803 | 3936.308369 | 1.590281e+07 | 7.093682e+06 | 206.671944 | 10.783803 | 3936.319654 |
3 | 13244 | 10.783698 | 3936.254202 | 10.783698 | 3936.242918 | 1.098809e+07 | 6.923468e+06 | 2836.900278 | 10.783698 | 3936.254202 |
2 | 13244 | 10.783888 | 3936.277524 | 10.783888 | 3936.268496 | 2.006585e+07 | 6.568759e+06 | 5911.808333 | 10.783888 | 3936.277524 |
There are now features whose time units differ from the default of seconds, such as TIME_SINCE_LAST(sessions.session_start, unit=days) and TIME_SINCE_FIRST(sessions.session_start, unit=years).
Modeling#
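How do I use my train and test data with Featuretools and sklearn's train_test_split?#
You may be wondering how to use your train and test data correctly with Featuretools and sklearn's train_test_split. There are a few steps to follow to keep this workflow accurate.
Let's assume we have a DataFrame of training data that includes the labels.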
[65]:
train_data = pd.DataFrame(
{
"customer_id": [1, 2, 3, 4, 5],
"age": [20, 25, 55, 22, 35],
"gender": ["f", "m", "m", "m", "m"],
"signup_date": pd.date_range("2010-01-01 01:41:50", periods=5, freq="25min"),
"labels": [False, True, True, False, False],
}
)
train_data.head()
[65]:
customer_id | age | gender | signup_date | labels | |
---|---|---|---|---|---|
0 | 1 | 20 | f | 2010-01-01 01:41:50 | False |
1 | 2 | 25 | m | 2010-01-01 02:06:50 | True |
2 | 3 | 55 | m | 2010-01-01 02:31:50 | True |
3 | 4 | 22 | m | 2010-01-01 02:56:50 | False |
4 | 5 | 35 | m | 2010-01-01 03:21:50 | False |
Now we can create our EntitySet for the training data and create our features. To prevent label leakage, we will use cutoff times (see the earlier question).
[66]:
es_train_data = ft.EntitySet(id="customer_data")
es_train_data = es_train_data.add_dataframe(
dataframe_name="customers", dataframe=train_data, index="customer_id"
)
cutoff_times = pd.DataFrame(
{
"customer_id": [1, 2, 3, 4, 5],
"time": pd.date_range("2014-01-01 01:41:50", periods=5, freq="25min"),
}
)
feature_matrix_train, features = ft.dfs(
entityset=es_train_data,
target_dataframe_name="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True,
)
feature_matrix_train.head()
/Users/code/fin_tool/github/featuretools/featuretools/synthesis/deep_feature_synthesis.py:154: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
warnings.warn(
[66]:
age | labels | DAY(signup_date) | MONTH(signup_date) | WEEKDAY(signup_date) | YEAR(signup_date) | ||
---|---|---|---|---|---|---|---|
customer_id | time | ||||||
1 | 2014-01-01 01:41:50 | 20 | False | 1 | 1 | 4 | 2010 |
2 | 2014-01-01 02:06:50 | 25 | True | 1 | 1 | 4 | 2010 |
3 | 2014-01-01 02:31:50 | 55 | True | 1 | 1 | 4 | 2010 |
4 | 2014-01-01 02:56:50 | 22 | False | 1 | 1 | 4 | 2010 |
5 | 2014-01-01 03:21:50 | 35 | False | 1 | 1 | 4 | 2010 |
We will also encode the feature matrix so that it is compatible with machine learning algorithms.
[67]:
feature_matrix_train_enc, feature_enc = ft.encode_features(
feature_matrix_train, features
)
feature_matrix_train_enc.head()
[67]:
age | labels | DAY(signup_date) = 1 | DAY(signup_date) is unknown | MONTH(signup_date) = 1 | MONTH(signup_date) is unknown | WEEKDAY(signup_date) = 4 | WEEKDAY(signup_date) is unknown | YEAR(signup_date) = 2010 | YEAR(signup_date) is unknown | ||
---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | time | ||||||||||
1 | 2014-01-01 01:41:50 | 20 | False | True | False | True | False | True | False | True | False |
2 | 2014-01-01 02:06:50 | 25 | True | True | False | True | False | True | False | True | False |
3 | 2014-01-01 02:31:50 | 55 | True | True | False | True | False | True | False | True | False |
4 | 2014-01-01 02:56:50 | 22 | False | True | False | True | False | True | False | True | False |
5 | 2014-01-01 03:21:50 | 35 | False | True | False | True | False | True | False | True | False |
[68]:
from sklearn.model_selection import train_test_split
X = feature_matrix_train_enc.drop(["labels"], axis=1)
y = feature_matrix_train_enc["labels"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
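Now you can use the encoded feature matrix with sklearn's train_test_split. This lets you train your model and tune its parameters.
How are categorical columns encoded when splitting training and test data?#
You might be wondering what happens when the training and test data are encoded. In particular, what happens if a categorical value that appears in the training data does not appear in the test data?
Let's walk through a simple example to see what happens during encoding.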
[69]:
train_data = pd.DataFrame(
{
"customer_id": [1, 2, 3, 4, 5],
"product_purchased": ["coke zero", "car", "toothpaste", "coke zero", "car"],
}
)
es_train = ft.EntitySet(id="customer_data")
es_train = es_train.add_dataframe(
dataframe_name="customers",
dataframe=train_data,
index="customer_id",
logical_types={"product_purchased": ww.logical_types.Categorical},
)
feature_matrix_train, features = ft.dfs(
entityset=es_train, target_dataframe_name="customers"
)
feature_matrix_train
/Users/code/fin_tool/github/featuretools/featuretools/synthesis/deep_feature_synthesis.py:154: UserWarning: Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created
warnings.warn(
[69]:
product_purchased | |
---|---|
customer_id | |
1 | coke zero |
2 | car |
3 | toothpaste |
4 | coke zero |
5 | car |
We will use ft.encode_features to properly encode the product_purchased column.
[70]:
feature_matrix_train_encoded, features_encoded = ft.encode_features(
feature_matrix_train, features
)
feature_matrix_train_encoded.head()
[70]:
product_purchased = coke zero | product_purchased = car | product_purchased = toothpaste | product_purchased is unknown | |
---|---|---|---|---|
customer_id | ||||
1 | True | False | False | False |
2 | False | True | False | False |
3 | False | False | True | False |
4 | True | False | False | False |
5 | False | True | False | False |
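Now imagine we have some test data in which one of the categorical values (toothpaste) does not appear. In addition, the test data contains a value (water) that was not present in the training data.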
[71]:
test_data = pd.DataFrame(
{
"customer_id": [6, 7, 8, 9, 10],
"product_purchased": ["coke zero", "car", "coke zero", "coke zero", "water"],
}
)
es_test = ft.EntitySet(id="customer_data")
es_test = es_test.add_dataframe(
dataframe_name="customers", dataframe=test_data, index="customer_id"
)
feature_matrix_test = ft.calculate_feature_matrix(
entityset=es_test, features=features_encoded
)
feature_matrix_test.head()
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/woodwork/type_sys/utils.py:40: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
[71]:
product_purchased = coke zero | product_purchased = car | product_purchased = toothpaste | product_purchased is unknown | |
---|---|---|---|---|
customer_id | ||||
6 | True | False | False | False |
7 | False | True | False | False |
8 | True | False | False | False |
9 | True | False | False | False |
10 | False | False | False | True |
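As seen above, the encoding succeeded and handled the following complications:
toothpaste was present in the training data but not in the test data
water was present in the test data but not in the training data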
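The behavior shown above can be pictured in plain pandas. The following is only a minimal sketch of the idea, not how Featuretools implements it: the helper name encode_with_known_categories is hypothetical, and the column naming simply mirrors the output above. The key point is that the list of categories is learned once from the training data and then reused unchanged on the test data.

```python
import pandas as pd


def encode_with_known_categories(values, known_categories, name):
    """One-hot encode `values` against a fixed list of categories.

    Hypothetical helper for illustration; it is not part of Featuretools.
    """
    encoded = pd.DataFrame(index=values.index)
    for category in known_categories:
        encoded[f"{name} = {category}"] = values == category
    # Anything outside the known categories falls into one catch-all column.
    encoded[f"{name} is unknown"] = ~values.isin(known_categories)
    return encoded


train_values = pd.Series(
    ["coke zero", "car", "toothpaste", "coke zero", "car"], index=[1, 2, 3, 4, 5]
)
test_values = pd.Series(
    ["coke zero", "car", "coke zero", "coke zero", "water"], index=[6, 7, 8, 9, 10]
)

# The category list comes from the training data only (first-seen order).
known_categories = list(dict.fromkeys(train_values))

test_encoded = encode_with_known_categories(
    test_values, known_categories, "product_purchased"
)
print(test_encoded)
```

Because the columns are fixed by the training data, the toothpaste column still exists for the test rows (all False), and water ends up in the "is unknown" column instead of creating a new column the model has never seen.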
Errors and Warnings#
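Why am I getting the error 'Index is not unique on dataframe'?#
You may be trying to create your EntitySet and running into this error.
IndexError: Index column must be unique
This happens because every dataframe in your EntitySet needs a unique index.
Let's look at a simple example.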
[72]:
product_df = pd.DataFrame({"id": [1, 2, 3, 4, 4], "rating": [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df
[72]:
id | rating | |
---|---|---|
0 | 1 | 3.5 |
1 | 2 | 4.0 |
2 | 3 | 4.5 |
3 | 4 | 1.5 |
4 | 4 | 5.0 |
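Notice that the id column contains the duplicate index value 4. If you try to add this dataframe to an EntitySet, you will run into the following error.
es = ft.EntitySet(id="product_data")
es = es.add_dataframe(dataframe_name="products",
                      dataframe=product_df,
                      index="id")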
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-78-854fbaf207f8> in <module>
1 es = ft.EntitySet(id="product_data")
----> 2 es = es.add_dataframe(dataframe_name="products",
3 dataframe=product_df,
4 index="id")
~/Code/featuretools/featuretools/entityset/entityset.py in add_dataframe(self, dataframe, dataframe_name, index, logical_types, semantic_tags, make_index, time_index, secondary_time_index, already_sorted)
625 index_was_created, index, dataframe = _get_or_create_index(index, make_index, dataframe)
626
--> 627 dataframe.ww.init(name=dataframe_name,
628 index=index,
629 time_index=time_index,
/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in init(self, index, time_index, logical_types, already_sorted, schema, validate, use_standard_tags, **kwargs)
94 """
95 if validate:
---> 96 _validate_accessor_params(self._dataframe, index, time_index, logical_types, schema, use_standard_tags)
97 if schema is not None:
98 self._schema = schema
/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _validate_accessor_params(dataframe, index, time_index, logical_types, schema, use_standard_tags)
877 # We ignore these parameters if a schema is passed
878 if index is not None:
--> 879 _check_index(dataframe, index)
880 if logical_types:
881 _check_logical_types(dataframe.columns, logical_types)
/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _check_index(dataframe, index)
903 # User specifies an index that is in the dataframe but not unique
--> 904 raise IndexError('Index column must be unique')
905
906
IndexError: Index column must be unique
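Before choosing a fix, it can help to see which values are duplicated. The snippet below is a small sketch in plain pandas (no Featuretools involved) that checks the intended index column of the example dataframe; the variable names index_is_unique and duplicated_rows are just illustrative.

```python
import pandas as pd

product_df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 4], "rating": [3.5, 4.0, 4.5, 1.5, 5.0]}
)

# True only when every value in the intended index column is distinct.
index_is_unique = product_df["id"].is_unique

# keep=False marks every row that shares its id with another row.
duplicated_rows = product_df[product_df["id"].duplicated(keep=False)]

print(index_is_unique)
print(duplicated_rows)
```

Here both rows with id 4 are reported, which makes it easier to decide whether the data should be corrected or whether a new index column is the better option.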
To fix the error above, you can use one of the following solutions:
Solution #1 - You can create a unique index on your dataframe.
[73]:
product_df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "rating": [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df
[73]:
id | rating | |
---|---|---|
0 | 1 | 3.5 |
1 | 2 | 4.0 |
2 | 3 | 4.5 |
3 | 4 | 1.5 |
4 | 5 | 5.0 |
Notice that we now have a unique index column named id.
[74]:
es = es.add_dataframe(dataframe_name="products", dataframe=product_df, index="id")
es
[74]:
Entityset: transactions
DataFrames:
transactions [Rows: 500, Columns: 6]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 5]
customers [Rows: 5, Columns: 5]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
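As seen above, once the DataFrame has a unique index, we can add it to our EntitySet without an error.
Solution #2 - Set make_index to True in your call to add_dataframe to create a new index on that data
make_index creates a unique index for each row based on the row's position among all other rows.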
[75]:
product_df = pd.DataFrame({"id": [1, 2, 3, 4, 4], "rating": [3.5, 4.0, 4.5, 1.5, 5.0]})
es = ft.EntitySet(id="product_data")
es = es.add_dataframe(
dataframe_name="products", dataframe=product_df, index="product_id", make_index=True
)
es["products"]
[75]:
product_id | id | rating | |
---|---|---|---|
0 | 0 | 1 | 3.5 |
1 | 1 | 2 | 4.0 |
2 | 2 | 3 | 4.5 |
3 | 3 | 4 | 1.5 |
4 | 4 | 4 | 5.0 |
As seen above, we created our EntitySet using the make_index argument without an error.
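The effect of make_index on the example data can be pictured in plain pandas. This is only a sketch of the resulting column under the assumption that positions start at 0 (as in the output above); it is not the code that Featuretools runs internally, and indexed_df is an illustrative name.

```python
import pandas as pd

product_df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 4], "rating": [3.5, 4.0, 4.5, 1.5, 5.0]}
)

# Add a positional counter as the first column; it is unique by construction,
# even though the original id column still contains the duplicate value 4.
indexed_df = product_df.copy()
indexed_df.insert(0, "product_id", list(range(len(indexed_df))))

print(indexed_df)
```

The original id column is left untouched, so the duplicate remains visible in the data while product_id serves as the index.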
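Why am I getting the following warning 'Using training_window but last_time_index is not set'?#
If you are using a training window and your dataframes do not have last_time_index set, you will get this warning.
The training window attribute in Featuretools limits the amount of past data that can be used while calculating a particular feature vector.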
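To make the idea of a training window concrete, here is a minimal, library-free sketch. The helper rows_in_training_window and the toy transactions are hypothetical, and the exact boundary handling (start excluded, cutoff included) is an assumption of this sketch rather than a statement about Featuretools internals.

```python
from datetime import datetime, timedelta


def rows_in_training_window(rows, cutoff_time, training_window):
    """Keep rows whose timestamp lies in (cutoff_time - training_window, cutoff_time].

    Hypothetical helper; the boundary handling is an assumption of this sketch.
    """
    window_start = cutoff_time - training_window
    return [row for row in rows if window_start < row["time"] <= cutoff_time]


transactions = [
    {"transaction_id": 1, "time": datetime(2014, 1, 1, 2, 30)},
    {"transaction_id": 2, "time": datetime(2014, 1, 1, 3, 15)},
    {"transaction_id": 3, "time": datetime(2014, 1, 1, 3, 50)},
    {"transaction_id": 4, "time": datetime(2014, 1, 1, 4, 10)},
]

usable = rows_in_training_window(
    transactions,
    cutoff_time=datetime(2014, 1, 1, 4, 0),
    training_window=timedelta(hours=1),
)
print([row["transaction_id"] for row in usable])  # [2, 3]
```

Transaction 1 is too old for a one-hour window, and transaction 4 happens after the cutoff time, so only transactions 2 and 3 would contribute to the features for this cutoff.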
After you create your EntitySet, you can call your_entityset.add_last_time_indexes() to add a last_time_index to all of your dataframes automatically. This will remove the warning.
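Roughly speaking, a last time index records the latest time at which a row is still referenced by related data, which is what allows a training window to decide whether a parent row (such as a customer) is still relevant. The sketch below illustrates that idea in plain pandas with a hypothetical toy sessions table; it is an assumption-laden picture of the concept, not what add_last_time_indexes does internally.

```python
import pandas as pd

# Hypothetical toy child table: each session belongs to one customer.
sessions = pd.DataFrame(
    {
        "session_id": [1, 2, 3, 4],
        "customer_id": [1, 1, 2, 2],
        "session_start": pd.to_datetime(
            [
                "2014-01-01 01:00",
                "2014-01-01 05:30",
                "2014-01-01 02:00",
                "2014-01-01 03:00",
            ]
        ),
    }
)

# Latest child timestamp per parent row: a stand-in for a last time index.
last_seen = sessions.groupby("customer_id")["session_start"].max()
print(last_seen)
```

With a value like this available, a customer whose own time index is old but who was active recently can still be recognized as falling inside the training window.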
[76]:
es = ft.demo.load_mock_customer(return_entityset=True)
es.add_last_time_indexes()
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/woodwork/type_sys/utils.py:40: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
Now we can run Deep Feature Synthesis (DFS) without getting the warning.
[77]:
cutoff_times = pd.DataFrame()
cutoff_times["customer_id"] = [1, 2, 3, 1]
cutoff_times["time"] = pd.to_datetime(
["2014-1-1 04:00", "2014-1-1 05:00", "2014-1-1 06:00", "2014-1-1 08:00"]
)
cutoff_times["label"] = [True, True, False, True]
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True,
training_window="1 hour",
)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function min at 0x1112a5260> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function max at 0x1112a5120> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function sum at 0x1112a4a40> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function std at 0x1112a5c60> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
).agg(to_agg)
/Users/code/fin_tool/github/featuretools/featuretools/computational_backends/feature_set_calculator.py:756: FutureWarning: The provided callable <function mean at 0x1112a5b20> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
).agg(to_agg)
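last_time_index vs. time_index#
The time_index is the time at which an instance first becomes known. The last_time_index is the time at which an instance appears for the last time. For example, a customer's session may contain multiple transactions that occur at different points in time. If we want to count the number of sessions a user has in a given time period, we usually want to count every session that had any transaction during the training window. To do that, we need to know not only when a session starts (time_index) but also when it ends (last_time_index). The last time an instance appears in the data is stored as the last_time_index of its dataframe. Once the last_time_index has been set, Featuretools checks whether it falls after the start of the training window. That check, combined with the cutoff time, allows DFS to discover which data is relevant to a given training window.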
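The window check described above can be illustrated with a minimal, standard-library-only sketch. It is not Featuretools' internal implementation: the session data, the `relevant_sessions` helper, and the choice of inclusive and exclusive boundaries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sessions: (time_index, last_time_index), i.e. the first and
# last transaction time observed in each session.
sessions = {
    "s1": (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    "s2": (datetime(2024, 1, 1, 11, 50), datetime(2024, 1, 1, 12, 20)),
    "s3": (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 13, 10)),
    "s4": (datetime(2024, 1, 1, 15, 0), datetime(2024, 1, 1, 15, 5)),
}


def relevant_sessions(sessions, cutoff_time, training_window):
    """Return the ids of sessions that overlap the training window.

    A session is kept when it was already known at the cutoff time
    (time_index <= cutoff) and was still active after the window started
    (last_time_index > cutoff - training_window). The boundary handling
    here is an assumption, not a statement about Featuretools' behavior.
    """
    window_start = cutoff_time - training_window
    return sorted(
        session_id
        for session_id, (first_seen, last_seen) in sessions.items()
        if first_seen <= cutoff_time and last_seen > window_start
    )


cutoff = datetime(2024, 1, 1, 14, 0)
window = timedelta(hours=2)
relevant = relevant_sessions(sessions, cutoff, window)
print(relevant)  # ['s2', 's3']
```

Session s2 starts at 11:50, before the window opens at 12:00, so a filter on time_index alone would drop it. Its last transaction at 12:20 falls inside the window, which is why the last_time_index is needed to keep it.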
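Why am I getting errors when using Featuretools on Google Colab?#
By default, Google Colab comes with Featuretools 0.4.1 installed. An older version of Featuretools may not work with our latest guides or documentation, so we recommend upgrading to the latest version by running the following in your Google Colab notebook: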
!pip install -U featuretools
You may need to restart the runtime by selecting Runtime -> Restart Runtime.
You can then check which version of Featuretools is installed:
import featuretools as ft
print(ft.__version__)
You should see a version number greater than 0.4.1.
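If you would rather check the version programmatically than read it by eye, the comparison has to be numeric, because plain string comparison misorders versions such as 0.10.0 and 0.4.1. The sketch below uses only the standard library; `version_tuple` and `is_newer` are hypothetical helpers written for this illustration and are not part of Featuretools.

```python
import re


def version_tuple(version):
    """Turn '1.31.0' into (1, 31, 0); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(installed, baseline="0.4.1"):
    """Return True when `installed` is strictly newer than `baseline`."""
    return version_tuple(installed) > version_tuple(baseline)


print(is_newer("0.10.0"))  # True, although "0.10.0" < "0.4.1" as plain strings
print(is_newer("0.4.1"))   # False
```

In a notebook where Featuretools is installed, calling `is_newer(ft.__version__)` would apply the same check to the installed version.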