Migration to Featuretools Version 1.0#
Featuretools version 1.0 contains many significant changes that affect how EntitySets are created, how primitives are defined, and the feature matrices that are created in some cases. This document will outline these major changes to help existing Featuretools users transition to version 1.0.
Background and Introduction#
Why were these changes made?#
The lack of a unified type system across libraries has made it more difficult to share information between them. This issue led to the development of Woodwork. Updating Featuretools to use Woodwork for managing column type information allows feature matrix column type information to be shared easily with other libraries, without expensive conversions between custom type systems. For example, EvalML has also adopted Woodwork, allowing it to use Woodwork type information on feature matrices directly to create machine learning models, without first inferring or redefining column types.
Other benefits of using Woodwork to manage types in Featuretools include:
- Simplified code: custom type management code has been removed.
- Seamless integration of new types and improvements in type integration as Woodwork evolves.
- Easy and flexible storage of additional information about columns. For instance, we can now store whether a feature was engineered by Featuretools or existed in the original data.
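What has changed?#
- Featuretools' legacy custom typing system has been replaced by Woodwork for managing column types
- The Entity and Variable classes have both been removed from Featuretools
- Several key Featuretools methods have been moved or updated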
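Comparison between the legacy typing system and the Woodwork typing system#
Featuretools < 1.0 | Featuretools 1.0 | Description |
---|---|---|
Entity | Woodwork DataFrame | Stores typing information for all columns |
Variable | ColumnSchema | Stores typing information for a single column |
Variable subclass | LogicalType and semantic_tags | Elements used to define a column's type |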
Summary of significant method changes#
The table below outlines the most significant changes. In some cases the method arguments have also changed; those changes are described in more detail throughout this document.
Older versions | Featuretools 1.0 |
---|---|
EntitySet.entity_from_dataframe | EntitySet.add_dataframe |
EntitySet.normalize_entity | EntitySet.normalize_dataframe |
Entity.update_data | EntitySet.replace_dataframe |
Entity.variable_types | es['dataframe_name'].ww |
es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |
Entity.convert_variable_type | es['dataframe_name'].ww.set_types |
Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', …) |
Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', …) |
Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |
dfs(target_entity='entity_id', …) | dfs(target_dataframe_name='dataframe_name', …) |
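For more information on how Woodwork manages typing information, refer to the Woodwork Understanding Types and Tags guide.
What do these changes mean for users?#
Removing these classes required moving several methods from the Entity object to the EntitySet object. This change also affects how relationships, features, and primitives are defined, requiring different parameters than before. In addition, because the Woodwork typing system is not identical to the old Featuretools typing system, the returned feature matrix can differ slightly in some cases, since columns may be identified as different types. All of these changes, and more, are reviewed in detail throughout this document, with examples of both the old and the new API where possible.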
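Removal of the Entity class and updates to EntitySet#
In previous versions of Featuretools, an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are created by adding multiple dataframes and defining relationships between columns in those dataframes. While conceptually similar, there are some minor differences in the process.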
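Adding dataframes to an EntitySet#
When adding a dataframe to an EntitySet, users can pass in either a Woodwork dataframe or a regular dataframe without Woodwork typing information. If the user supplies a dataframe that has Woodwork typing information initialized, Featuretools will use this typing information directly. If the user supplies a dataframe that does not have Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified. The examples below illustrate this process. First, we create two small dataframes to use in the examples.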
[1]:
import pandas as pd
import featuretools as ft
[2]:
orders_df = pd.DataFrame(
{"order_id": [0, 1, 2], "order_date": ["2021-01-02", "2021-01-03", "2021-01-04"]}
)
items_df = pd.DataFrame(
{
"id": [0, 1, 2, 3, 4],
"order_id": [0, 1, 1, 2, 2],
"item_price": [29.95, 4.99, 10.25, 20.50, 15.99],
"on_sale": [False, True, False, True, False],
}
)
In older versions of Featuretools, users would first create an EntitySet object and then add dataframes to it by calling the entity_from_dataframe method, as shown below.
es = ft.EntitySet('old_es')
es.entity_from_dataframe(dataframe=orders_df,
entity_id='orders',
index='order_id',
time_index='order_date')
es.entity_from_dataframe(dataframe=items_df,
entity_id='items',
index='id')
Entityset: old_es
Entities:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 3]
Relationships:
No relationships
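With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same as before, but some of the details have changed. First, create an EntitySet as before. To add a dataframe, call EntitySet.add_dataframe in place of the previous call to EntitySet.entity_from_dataframe. Note that the name of the dataframe is specified in the dataframe_name argument, which was previously called entity_id.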
[3]:
es = ft.EntitySet("new_es")
es.add_dataframe(
dataframe=orders_df,
dataframe_name="orders",
index="order_id",
time_index="order_date",
)
[3]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
Relationships:
No relationships
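You can also define the name, index, and time index by first initializing Woodwork on the dataframe and then passing the Woodwork-initialized dataframe directly to the add_dataframe call. In this example we initialize Woodwork on items_df, setting the dataframe name to items and specifying that the index should be the id column.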
[4]:
items_df.ww.init(name="items", index="id")
items_df.ww
[4]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | int64 | Integer | ['numeric'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
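With Woodwork initialized, we no longer need to specify values for the dataframe_name or index arguments when calling add_dataframe, because Featuretools will use the values that were already specified when Woodwork was initialized.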
[5]:
es.add_dataframe(dataframe=items_df)
[5]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
Relationships:
No relationships
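Accessing column typing information#
Previously, column variable type information could be accessed for an entire Entity through Entity.variable_types, or for an individual column by first selecting that column through es['entity_id']['col_id'].
es['items'].variable_types
{'id': featuretools.variable_types.variable.Index,
'order_id': featuretools.variable_types.variable.Numeric,
'item_price': featuretools.variable_types.variable.Numeric}
es['items']['item_price']
<Variable: item_price (dtype = numeric)>
With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the .ww namespace on the dataframe. First, select the dataframe from the EntitySet with es['dataframe_name'], then access the typing information by chaining a .ww call on the end, as shown below.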
[6]:
es["items"].ww
[6]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | int64 | Integer | ['numeric'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
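The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, which returns a Woodwork ColumnSchema object that stores the typing information: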
[7]:
es["items"].ww.columns["item_price"]
[7]:
<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>
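Type inference and updating column types#
Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine which variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the Entity.variable_types dictionary. Starting in Featuretools 1.0, column type inference is handled by Woodwork. Any column that does not have a logical type assigned by the user when a dataframe is added to an EntitySet will have its logical type inferred by Woodwork. As before, type inference can be skipped for any column in a dataframe by passing the appropriate logical types in a dictionary when calling EntitySet.add_dataframe. As an example, we can create a new dataframe and add it to an EntitySet, specifying that the user's full name should use Woodwork's PersonFullName logical type.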
[8]:
users_df = pd.DataFrame(
{"id": [0, 1, 2], "name": ["John Doe", "Rita Book", "Teri Dactyl"]}
)
[9]:
es.add_dataframe(
dataframe=users_df,
dataframe_name="users",
index="id",
logical_types={"name": "PersonFullName"},
)
es["users"].ww
[9]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
name | string | PersonFullName | [] |
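Looking at the typing information above, we can see that the logical type for the name column was set to PersonFullName, just as we specified. There will be situations in which type inference identifies a column as having the wrong logical type. In these cases, the logical type can be updated using the Woodwork set_types method. Let's say we want the order_id column of the items dataframe to have a Categorical logical type instead of the Integer type that was inferred. Previously, this would have been accomplished through the Entity.convert_variable_type method.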
from featuretools.variable_types import Categorical
es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)
Now, we can perform the same update using Woodwork:
[10]:
es["items"].ww.set_types(logical_types={"order_id": "Categorical"})
es["items"].ww
[10]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
id | int64 | Integer | ['index'] |
order_id | category | Categorical | ['category'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
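For additional information on Woodwork typing and how it is used in Featuretools, refer to Woodwork Typing in Featuretools.
Adding interesting values#
Interesting values can be added to all dataframes in an EntitySet, to a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet. To add interesting values for all of the dataframes in an EntitySet, call EntitySet.add_interesting_values, optionally specifying the maximum number of values to add for each column. This is unchanged between older versions of Featuretools and the 1.0 release.
Adding values for a single dataframe or for a single column has changed. Previously, to add interesting values for an Entity, users would call Entity.add_interesting_values():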
es['items'].add_interesting_values()
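Now, to specify interesting values for a single dataframe, call add_interesting_values on the EntitySet and pass the name of the dataframe for which you want interesting values added: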
[11]:
es.add_interesting_values(dataframe_name="items")
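Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:
es['items']['order_id'].interesting_values = [1, 2]
Now, this is done through EntitySet.add_interesting_values, passing in the name of the dataframe and a dictionary that maps column names to the interesting values to assign for each column. For example, to assign the interesting values [1, 2] to the order_id column of the items dataframe, use the following approach: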
[12]:
es.add_interesting_values(dataframe_name="items", values={"order_id": [1, 2]})
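Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the values parameter.
Accessing interesting values has changed as well. Previously, interesting values could be viewed from the variable:
es['items']['order_id'].interesting_values
Interesting values are now stored in the Woodwork metadata for the columns in a dataframe: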
[13]:
es["items"].ww.columns["order_id"].metadata["interesting_values"]
[13]:
[1, 2]
Setting a secondary time index#
In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling Entity.set_secondary_time_index.
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
'national_airspace_delay', 'security_delay',
'late_aircraft_delay', 'canceled', 'diverted',
'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})
Since the Entity class has been removed in Featuretools 1.0, this now needs to be done through the EntitySet instead:
[14]:
es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = [
"arr_delay",
"dep_delay",
"carrier_delay",
"weather_delay",
"national_airspace_delay",
"security_delay",
"late_aircraft_delay",
"canceled",
"diverted",
"taxi_in",
"taxi_out",
"air_time",
"dep_time",
]
es_flight.set_secondary_time_index(
dataframe_name="trip_logs", secondary_time_index={"arr_time": arr_time_columns}
)
Downloading data ...
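Previously, the secondary time index could be accessed directly from the Entity with es_flight['trip_logs'].secondary_time_index. Starting in Featuretools 1.0, the secondary time index and its associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.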
[15]:
es_flight["trip_logs"].ww.metadata["secondary_time_index"]
[15]:
{'arr_time': ['arr_delay',
'dep_delay',
'carrier_delay',
'weather_delay',
'national_airspace_delay',
'security_delay',
'late_aircraft_delay',
'canceled',
'diverted',
'taxi_in',
'taxi_out',
'air_time',
'dep_time',
'arr_time']}
Normalizing Entities/DataFrames#
EntitySet.normalize_entity has been renamed to EntitySet.normalize_dataframe in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, use the new parameter names.
Old parameter name | New parameter name |
---|---|
base_entity_id | base_dataframe_name |
new_entity_id | new_dataframe_name |
additional_variables | additional_columns |
copy_variables | copy_columns |
new_entity_time_index | new_dataframe_time_index |
new_entity_secondary_time_index | new_dataframe_secondary_time_index |
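When migrating many existing calls, the renames in the table above can be applied mechanically. The sketch below uses only the Python standard library; rename_normalize_kwargs and NORMALIZE_RENAMES are hypothetical names introduced here for illustration and are not part of Featuretools.

```python
# Old -> new keyword names, copied from the table above.
NORMALIZE_RENAMES = {
    "base_entity_id": "base_dataframe_name",
    "new_entity_id": "new_dataframe_name",
    "additional_variables": "additional_columns",
    "copy_variables": "copy_columns",
    "new_entity_time_index": "new_dataframe_time_index",
    "new_entity_secondary_time_index": "new_dataframe_secondary_time_index",
}


def rename_normalize_kwargs(old_kwargs):
    """Return a copy of old_kwargs with pre-1.0 keyword names replaced.

    Keywords that were not renamed (for example "index") pass through unchanged.
    """
    return {NORMALIZE_RENAMES.get(key, key): value for key, value in old_kwargs.items()}


# Keyword arguments as they might have been passed to normalize_entity
old_call = {
    "base_entity_id": "items",
    "new_entity_id": "orders",
    "index": "order_id",
    "additional_variables": ["order_date"],
}
new_call = rename_normalize_kwargs(old_call)
print(new_call)
```

The resulting dictionary could then be unpacked into a call such as es.normalize_dataframe(**new_call).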
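Defining and adding relationships#
In earlier versions of Featuretools, relationships were defined by creating a Relationship object, which took two Variables as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a Relationship and then add it to the EntitySet: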
relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])
es.add_relationship(relationship)
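In Featuretools 1.0, the process is similar, but there are two different ways to add a relationship to an EntitySet. One way is to pass the dataframe and column names to EntitySet.add_relationship; the other is to pass a previously created Relationship object to the relationship keyword argument. Both approaches are demonstrated below.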
[16]:
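# Undo the change from above and set the child column logical type to match the parent, to prevent a warning
# NOTE: This cell is hidden in the docs build
es["items"].ww.set_types(logical_types={"order_id": "Integer"})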
[17]:
es.add_relationship(
parent_dataframe_name="orders",
parent_column_name="order_id",
child_dataframe_name="items",
child_column_name="order_id",
)
/Users/code/fin_tool/github/featuretools/featuretools/entityset/entityset.py:383: UserWarning: Logical type Categorical for child column order_id does not match parent column order_id logical type Integer. Changing child logical type to match parent.
warnings.warn(
[17]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
users [Rows: 3, Columns: 2]
Relationships:
items.order_id -> orders.order_id
[18]:
# Reset the relationships so we can add the relationship again
es.relationships = []
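Alternatively, we can first create a Relationship and then pass it to EntitySet.add_relationship. When defining a Relationship, we need to pass in the EntitySet to which it belongs, along with the names of the parent dataframe and parent column and the names of the child dataframe and child column.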
[19]:
relationship = ft.Relationship(
entityset=es,
parent_dataframe_name="orders",
parent_column_name="order_id",
child_dataframe_name="items",
child_column_name="order_id",
)
es.add_relationship(relationship=relationship)
[19]:
Entityset: new_es
DataFrames:
orders [Rows: 3, Columns: 2]
items [Rows: 5, Columns: 4]
users [Rows: 3, Columns: 2]
Relationships:
items.order_id -> orders.order_id
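Updating data for a dataframe in an EntitySet#
Previously, to update (replace) the data associated with an Entity, users could call Entity.update_data and pass in the new dataframe. As an example, let's update the data in our users Entity: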
new_users_df = pd.DataFrame({
'id': [3, 4],
'name': ['Anne Teak', 'Art Decco']
})
es['users'].update_data(df=new_users_df)
To accomplish this task in Featuretools 1.0, we use the EntitySet.replace_dataframe method instead:
[20]:
new_users_df = pd.DataFrame({"id": [0, 1], "name": ["Anne Teak", "Art Decco"]})
es.replace_dataframe(dataframe_name="users", df=new_users_df)
es["users"]
[20]:
id | name | |
---|---|---|
0 | 0 | Anne Teak |
1 | 1 | Art Decco |
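Defining features#
The syntax for defining features has changed slightly in Featuretools 1.0. Previously, an identity feature could be defined by passing in the variable that should be used to build the feature.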
feature = ft.Feature(es['items']['item_price'])
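Starting with Featuretools 1.0, a similar syntax can be used, but because es['items'] now returns a Woodwork dataframe instead of an Entity, the syntax needs a slight update to access the Woodwork column. To make the update, add .ww between the dataframe name selector and the column selector, as shown below.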
[21]:
feature = ft.Feature(es["items"].ww["item_price"])
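Defining primitives#
In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate Variable class. Starting in version 1.0, the input and return types are defined by Woodwork ColumnSchema objects. To illustrate this change, let's take a closer look at the Age transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the DateOfBirth variable type, and the return type was defined by specifying the Numeric variable type: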
input_types = [DateOfBirth]
return_type = Numeric
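Woodwork does not have a specific DateOfBirth logical type. Instead, a column is identified as a date of birth column by specifying the logical type as Datetime with a semantic tag of date_of_birth. There is also no Numeric logical type in Woodwork. Instead, Woodwork identifies all columns that can be used for numeric operations with the semantic tag numeric. Furthermore, we know the Age primitive returns a floating point number, which corresponds to the Woodwork logical type Double. With this in mind, we can redefine the Age input types and return type with ColumnSchema objects as follows: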
input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})
Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.
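Mapping from old Featuretools variable types to Woodwork ColumnSchemas#
Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is no direct mapping from the old variable types to the new Woodwork types defined by ColumnSchema objects, the table below shows an approximate mapping.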
Featuretools Variable | Woodwork Column Schema |
---|---|
Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |
Categorical | ColumnSchema(logical_type=Categorical) |
CountryCode | ColumnSchema(logical_type=CountryCode) |
Datetime | ColumnSchema(logical_type=Datetime) |
DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |
DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |
Discrete | ColumnSchema(semantic_tags={'category'}) |
EmailAddress | ColumnSchema(logical_type=EmailAddress) |
FilePath | ColumnSchema(logical_type=Filepath) |
FullName | ColumnSchema(logical_type=PersonFullName) |
Id | ColumnSchema(semantic_tags={'foreign_key'}) |
Index | ColumnSchema(semantic_tags={'index'}) |
IPAddress | ColumnSchema(logical_type=IPAddress) |
LatLong | ColumnSchema(logical_type=LatLong) |
NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |
Numeric | ColumnSchema(semantic_tags={'numeric'}) |
NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |
Ordinal | ColumnSchema(logical_type=Ordinal) |
PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |
SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |
Timedelta | ColumnSchema(logical_type=Timedelta) |
TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |
URL | ColumnSchema(logical_type=URL) |
Unknown | ColumnSchema(logical_type=Unknown) |
ZIPCode | ColumnSchema(logical_type=PostalCode) |
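Changes to Deep Feature Synthesis#
The argument names for both featuretools.dfs and featuretools.calculate_feature_matrix have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this: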
features = ft.dfs(entityset=es,
target_entity='items',
features_only=True)
In Featuretools 1.0, the target_entity argument has been renamed to target_dataframe_name, but otherwise this basic call remains the same.
[22]:
features = ft.dfs(entityset=es, target_dataframe_name="items", features_only=True)
features
[22]:
[<Feature: order_id>,
<Feature: item_price>,
<Feature: on_sale>,
<Feature: orders.COUNT(items)>,
<Feature: orders.MAX(items.item_price)>,
<Feature: orders.MEAN(items.item_price)>,
<Feature: orders.MIN(items.item_price)>,
<Feature: orders.PERCENT_TRUE(items.on_sale)>,
<Feature: orders.SKEW(items.item_price)>,
<Feature: orders.STD(items.item_price)>,
<Feature: orders.SUM(items.item_price)>,
<Feature: orders.DAY(order_date)>,
<Feature: orders.MONTH(order_date)>,
<Feature: orders.WEEKDAY(order_date)>,
<Feature: orders.YEAR(order_date)>]
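In addition, the dfs argument ignore_entities has been renamed to ignore_dataframes, and ignore_variables has been renamed to ignore_columns. Similarly, when specifying primitive options, all references to entities should be replaced with dataframes and all references to variables should be replaced with columns. For example, the primitive option include_groupby_entities is now include_groupby_dataframes, and include_variables is now include_columns.
The basic call to featuretools.calculate_feature_matrix remains unchanged when passing in an EntitySet along with a list of features to calculate. However, users who call calculate_feature_matrix by passing in a list of entities and relationships should note that the entities argument has been renamed to dataframes, and that the values in the dictionary should now contain Woodwork logical types instead of Featuretools Variable classes.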
[23]:
feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)
feature_matrix
[23]:
order_id | item_price | on_sale | orders.COUNT(items) | orders.MAX(items.item_price) | orders.MEAN(items.item_price) | orders.MIN(items.item_price) | orders.PERCENT_TRUE(items.on_sale) | orders.SKEW(items.item_price) | orders.STD(items.item_price) | orders.SUM(items.item_price) | orders.DAY(order_date) | orders.MONTH(order_date) | orders.WEEKDAY(order_date) | orders.YEAR(order_date) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||
0 | 0 | 29.95 | False | 1 | 29.95 | 29.950 | 29.95 | 0.0 | NaN | NaN | 29.95 | 2 | 1 | 5 | 2021 |
1 | 1 | 4.99 | True | 2 | 10.25 | 7.620 | 4.99 | 0.5 | NaN | 3.719382 | 15.24 | 3 | 1 | 6 | 2021 |
2 | 1 | 10.25 | False | 2 | 10.25 | 7.620 | 4.99 | 0.5 | NaN | 3.719382 | 15.24 | 3 | 1 | 6 | 2021 |
3 | 2 | 20.50 | True | 2 | 20.50 | 18.245 | 15.99 | 0.5 | NaN | 3.189052 | 36.49 | 4 | 1 | 0 | 2021 |
4 | 2 | 15.99 | False | 2 | 20.50 | 18.245 | 15.99 | 0.5 | NaN | 3.189052 | 36.49 | 4 | 1 | 0 | 2021 |
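In addition to the changes in parameter names, users should be aware of a few other changes in the returned feature matrix. First, because Woodwork defines column types slightly differently than the prior Featuretools implementation did, there can be some differences in the features generated by old and new versions. The most notable impact is in how foreign key columns are handled. Previously, Featuretools treated all foreign key (previously Id) columns as categorical columns and would generate the appropriate features from them. Starting in version 1.0, foreign key columns are no longer restricted to being categorical, and if they are another type, such as Integer, features will not be generated from these columns. Manually converting foreign key columns to Categorical, as shown above, will produce features much closer to those obtained with previous versions.
Also, because Woodwork's type inference process differs from the previous Featuretools type inference process, an EntitySet may have different column types identified. This difference in column types can affect the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types where needed.
Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace, as follows.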
[24]:
feature_matrix.ww
[24]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_id | int64 | Integer | ['numeric', 'foreign_key'] |
item_price | float64 | Double | ['numeric'] |
on_sale | bool | Boolean | [] |
orders.COUNT(items) | Int64 | IntegerNullable | ['numeric'] |
orders.MAX(items.item_price) | float64 | Double | ['numeric'] |
orders.MEAN(items.item_price) | float64 | Double | ['numeric'] |
orders.MIN(items.item_price) | float64 | Double | ['numeric'] |
orders.PERCENT_TRUE(items.on_sale) | float64 | Double | ['numeric'] |
orders.SKEW(items.item_price) | float64 | Double | ['numeric'] |
orders.STD(items.item_price) | float64 | Double | ['numeric'] |
orders.SUM(items.item_price) | float64 | Double | ['numeric'] |
orders.DAY(order_date) | category | Ordinal: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] | ['category'] |
orders.MONTH(order_date) | category | Ordinal: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] | ['category'] |
orders.WEEKDAY(order_date) | category | Ordinal: [0, 1, 2, 3, 4, 5, 6] | ['category'] |
orders.YEAR(order_date) | category | Ordinal| ['category'] |
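Featuretools now labels features according to whether they were originally in the dataframes or were created by Featuretools. This information is stored in the Woodwork origin attribute of the column. Columns that were in the original data are labeled base, and features that were created by Featuretools are labeled engineered. As a demonstration of how to access this information, let's compare two features in the feature matrix: item_price and orders.MEAN(items.item_price). item_price was present in the original data, and orders.MEAN(items.item_price) was created by Featuretools.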
[25]:
feature_matrix.ww["item_price"].ww.origin
[25]:
'base'
[26]:
feature_matrix.ww["orders.MEAN(items.item_price)"].ww.origin
[26]:
'engineered'
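Other changes#
In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.
- Column ordering of a dataframe in an EntitySet may differ from before. Previously, Featuretools would reorder the columns so that the index column was always the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column. The index column now stays in the position it had when the dataframe was added to the EntitySet.
- For LatLong columns, older versions of Featuretools would replace single nan values in the column with a tuple (nan, nan). This is no longer the case, and single nan values now remain in the LatLong column. In line with Woodwork's behavior, any (nan, nan) values in a LatLong column are replaced with a single nan value.
- Since Featuretools no longer defines Variable objects with relationships between them, the featuretools.variable_types.graph_variable_types function has been removed.
- The featuretools.variable_types.list_variable_types utility function has been removed and replaced with two corresponding Woodwork functions: woodwork.list_logical_types and woodwork.list_semantic_tags. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns.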