Migration to Featuretools Version 1.0#

Featuretools version 1.0 contains many significant changes that affect how EntitySets are created, how primitives are defined, and the feature matrices that are created in some cases. This document will outline these major changes to help existing Featuretools users transition to version 1.0.

Background and Introduction#

Why were these changes made?#

The lack of a unified type system across libraries has made it more difficult to share information between them. This issue led to the development of Woodwork. Updating Featuretools to use Woodwork for managing column type information allows for easy sharing of feature matrix column type information with other libraries without the need for expensive conversions of custom type systems. For example, EvalML has also adopted Woodwork, allowing it to directly use Woodwork type information on feature matrices to create machine learning models without first inferring or redefining column types.

Other benefits of using Woodwork to manage types in Featuretools include:

  • Simplified code: custom type management code has been removed.

  • Seamless integration of new types and improvements in type integration as Woodwork evolves.

  • Easy and flexible storage of additional information about columns. For instance, we can now store whether a feature was engineered by Featuretools or existed in the original data.

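What has changed?#

  • The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types

  • Both the Entity and Variable classes have been removed from Featuretools

  • Several key Featuretools methods have been moved or updated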

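Comparison between the legacy typing system and the Woodwork typing system#

| Featuretools < 1.0 | Featuretools 1.0 | Description |
| --- | --- | --- |
| Entity | Woodwork DataFrame | Stores typing information for all columns |
| Variable | ColumnSchema | Stores typing information for a single column |
| Variable subclass | LogicalType and semantic_tags | Elements used to define a column type |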

Summary of significant method changes#

The table below outlines the most significant changes. In some cases the method arguments have also changed; those changes are described in more detail throughout this document.

| Older versions | Featuretools 1.0 |
| --- | --- |
| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |
| EntitySet.normalize_entity | EntitySet.normalize_dataframe |
| EntitySet.update_data | EntitySet.replace_dataframe |
| Entity.variable_types | es['dataframe_name'].ww |
| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |
| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |
| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', …) |
| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', …) |
| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |
| dfs(target_entity='entity_id', …) | dfs(target_dataframe_name='dataframe_name', …) |
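As a rough aid when auditing an existing codebase, the renames in the table above can be turned into a simple text scan. The sketch below is a hypothetical helper, not part of Featuretools: `LEGACY_TO_NEW` and `find_legacy_calls` are illustrative names, the mapping covers only a subset of the table, and a plain regular-expression match may flag unrelated identifiers that happen to share a name (for example another library's `update_data`).

```python
import re

# Legacy (pre-1.0) names from the table above, mapped to a hint about
# their Featuretools 1.0 replacement. This subset is illustrative only.
LEGACY_TO_NEW = {
    "entity_from_dataframe": "add_dataframe",
    "normalize_entity": "normalize_dataframe",
    "update_data": "replace_dataframe",
    "variable_types": ".ww namespace",
    "convert_variable_type": ".ww.set_types",
    "target_entity": "target_dataframe_name",
}

# A single pattern that matches any legacy name as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LEGACY_TO_NEW)) + r")\b")


def find_legacy_calls(source):
    """Return (line_number, legacy_name, suggestion) for each legacy name found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, LEGACY_TO_NEW[name]))
    return hits


old_script = """es = ft.EntitySet('old_es')
es.entity_from_dataframe(dataframe=orders_df, entity_id='orders', index='order_id')
features = ft.dfs(entityset=es, target_entity='orders', features_only=True)
"""

for lineno, name, suggestion in find_legacy_calls(old_script):
    print(f"line {lineno}: {name} -> {suggestion}")
```

The scan only reports candidates; each hit still needs a manual check against the argument changes described later in this document.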

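For more information on how Woodwork manages typing information, refer to the Woodwork Understanding Types and Tags guide.

What do these changes mean for users?#

Removing these classes required moving several methods from the Entity to the EntitySet object. This change also affects the way relationships, features, and primitives are defined, which now require different parameters than before. In addition, because the Woodwork typing system is not identical to the old Featuretools typing system, the returned feature matrix can in some cases be slightly different as a result of columns being identified as different types. All of these changes, and more, are reviewed in detail throughout this document, with examples of both the old and new API where possible.

Removal of the Entity class and updates to EntitySet#

In previous versions of Featuretools, an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.

Adding dataframes to an EntitySet#

When adding a dataframe to an EntitySet, users can pass in either a dataframe with Woodwork typing information or a regular dataframe without it. If the user supplies a dataframe that has Woodwork typing information initialized, Featuretools will use this typing information directly. If the user supplies a dataframe that does not have Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified. Below are some examples to illustrate this process. First, we will create two small dataframes to use in the examples.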

[1]:
import pandas as pd


import featuretools as ft

[2]:
orders_df = pd.DataFrame(
    {"order_id": [0, 1, 2], "order_date": ["2021-01-02", "2021-01-03", "2021-01-04"]}
)
items_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "order_id": [0, 1, 1, 2, 2],
        "item_price": [29.95, 4.99, 10.25, 20.50, 15.99],
        "on_sale": [False, True, False, True, False],
    }
)

In older versions of Featuretools, users would first create an EntitySet object and then add dataframes to the EntitySet by calling the entity_from_dataframe method, as shown below.

es = ft.EntitySet('old_es')
es.entity_from_dataframe(dataframe=orders_df,
                         entity_id='orders',
                         index='order_id',
                         time_index='order_date')
es.entity_from_dataframe(dataframe=items_df,
                         entity_id='items',
                         index='id')
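Entityset: old_es
Entities:
  orders [Rows: 3, Columns: 2]
  items [Rows: 5, Columns: 3]
Relationships:
  No relationships

With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe, call EntitySet.add_dataframe in place of the previous EntitySet.entity_from_dataframe call. Note that the dataframe name is specified in the dataframe_name argument, which was previously called entity_id.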

[3]:
es = ft.EntitySet("new_es")

es.add_dataframe(
    dataframe=orders_df,
    dataframe_name="orders",
    index="order_id",
    time_index="order_date",
)

[3]:
Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
  Relationships:
    No relationships

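You can also define the name, index, and time index by first initializing Woodwork on the dataframe and then passing the Woodwork-initialized dataframe directly to the add_dataframe call. In this example, we will initialize Woodwork on items_df, setting the dataframe name to items and specifying that the index should be the id column.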

[4]:
items_df.ww.init(name="items", index="id")
items_df.ww

[4]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['index']
order_id int64 Integer ['numeric']
item_price float64 Double ['numeric']
on_sale bool Boolean []

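With Woodwork initialized, we no longer need to specify values for the dataframe_name or index arguments when calling add_dataframe, because Featuretools will use the values that were already specified when Woodwork was initialized.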

[5]:
es.add_dataframe(dataframe=items_df)

[5]:
Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
  Relationships:
    No relationships

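Accessing column typing information#

Previously, column variable type information could be accessed for an entire Entity through Entity.variable_types, or for an individual column by first selecting the column with es['entity_id']['col_id'].

es['items'].variable_types
{'id': featuretools.variable_types.variable.Index,
 'order_id': featuretools.variable_types.variable.Numeric,
 'item_price': featuretools.variable_types.variable.Numeric}

es['items']['item_price']
<Variable: item_price (dtype = numeric)>

With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the .ww namespace on the dataframe. First, select the dataframe from the EntitySet with es['dataframe_name'], and then access the typing information by chaining .ww to the end, as shown below.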

[6]:
es["items"].ww

[6]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['index']
order_id int64 Integer ['numeric']
item_price float64 Double ['numeric']
on_sale bool Boolean []

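The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, which returns a Woodwork.ColumnSchema object that stores the typing information: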

[7]:
es["items"].ww.columns["item_price"]

[7]:
<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>

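Type inference and updating column types#

Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine which variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the Entity.variable_types dictionary. Starting in Featuretools 1.0, column type inference is handled by Woodwork. Any columns that do not have a logical type assigned by the user when a dataframe is added to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing a dictionary of the appropriate logical types when calling EntitySet.add_dataframe. As an example, we can create a new dataframe and add it to the EntitySet, specifying the Woodwork PersonFullName logical type for the user's full name.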

[8]:
users_df = pd.DataFrame(
    {"id": [0, 1, 2], "name": ["John Doe", "Rita Book", "Teri Dactyl"]}
)

[9]:
es.add_dataframe(
    dataframe=users_df,
    dataframe_name="users",
    index="id",
    logical_types={"name": "PersonFullName"},
)

es["users"].ww

[9]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['index']
name string PersonFullName []

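Looking at the typing information above, we can see that the logical type for the name column was set to PersonFullName, as we specified. Situations will occur where type inference identifies a column as having the wrong logical type. In these situations, the logical type can be updated using the Woodwork set_types method. Suppose we want the order_id column of the items dataframe to have a Categorical logical type instead of the Integer type that was inferred. Previously, this would have been accomplished through the Entity.convert_variable_type method.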

from featuretools.variable_types import Categorical

es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)

Now, we can perform the same update using Woodwork:

[10]:
es["items"].ww.set_types(logical_types={"order_id": "Categorical"})
es["items"].ww

[10]:
Physical Type Logical Type Semantic Tag(s)
Column
id int64 Integer ['index']
order_id category Categorical ['category']
item_price float64 Double ['numeric']
on_sale bool Boolean []

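For additional information on Woodwork typing and how it is used in Featuretools, refer to Woodwork Typing in Featuretools.

Adding interesting values#

Interesting values can be added to all dataframes in an EntitySet, to a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet. To add interesting values for all of the dataframes in an EntitySet, simply call EntitySet.add_interesting_values, optionally specifying the maximum number of values to add for each column. This is unchanged between older versions of Featuretools and the 1.0 release.

Adding values for a single dataframe or for a single column has changed. Previously, to add interesting values for an Entity, users would call Entity.add_interesting_values():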

es['items'].add_interesting_values()

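Now, to specify interesting values for a single dataframe, call add_interesting_values on the EntitySet and pass the name of the dataframe for which you want interesting values added: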

[11]:
es.add_interesting_values(dataframe_name="items")

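Previously, to manually add interesting values for a column, you would simply assign them to the interesting_values attribute of the variable:

es['items']['order_id'].interesting_values = [1, 2]

Now, this is done through EntitySet.add_interesting_values, passing in the name of the dataframe and a dictionary that maps column names to the interesting values to assign to each column. For example, to assign the interesting values [1, 2] to the order_id column of the items dataframe, use the following approach: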

[12]:
es.add_interesting_values(dataframe_name="items", values={"order_id": [1, 2]})

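Interesting values can be assigned to multiple columns in the same dataframe by adding more entries to the dictionary passed to the values parameter.

Accessing interesting values has changed as well. Previously, interesting values could be viewed from the variable:

es['items']['order_id'].interesting_values

Interesting values are now stored in the Woodwork metadata for the columns in a dataframe: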

[13]:
es["items"].ww.columns["order_id"].metadata["interesting_values"]

[13]:
[1, 2]
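When migrating code that assigned interesting values variable by variable, it can help to collect those assignments into one dictionary per dataframe, which is the shape the values parameter expects. The sketch below is a hypothetical, Featuretools-free helper (`group_interesting_values` and the sample `on_sale` assignment are illustrative assumptions, not taken from the library or the original examples).

```python
from collections import defaultdict


def group_interesting_values(assignments):
    """Group (dataframe_name, column_name, values) triples into one
    {column_name: values} dictionary per dataframe."""
    grouped = defaultdict(dict)
    for dataframe_name, column_name, values in assignments:
        grouped[dataframe_name][column_name] = list(values)
    return dict(grouped)


# Legacy-style assignments, one per variable (the second one is made up
# purely to show multiple columns in the same dataframe).
legacy_assignments = [
    ("items", "order_id", [1, 2]),
    ("items", "on_sale", [True]),
]

calls = group_interesting_values(legacy_assignments)
for dataframe_name, values in calls.items():
    # With a real EntitySet `es`, each entry would become:
    #   es.add_interesting_values(dataframe_name=dataframe_name, values=values)
    print(dataframe_name, values)
```

Each resulting entry corresponds to a single add_interesting_values call on the EntitySet.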

Setting a secondary time index#

In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling Entity.set_secondary_time_index.

es_flight = ft.demo.load_flight(nrows=100)
arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',
                    'national_airspace_delay', 'security_delay',
                    'late_aircraft_delay', 'canceled', 'diverted',
                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']
es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})

Because the Entity class has been removed in Featuretools 1.0, this now needs to be done through the EntitySet instead:

[14]:
es_flight = ft.demo.load_flight(nrows=100)

arr_time_columns = [
    "arr_delay",
    "dep_delay",
    "carrier_delay",
    "weather_delay",
    "national_airspace_delay",
    "security_delay",
    "late_aircraft_delay",
    "canceled",
    "diverted",
    "taxi_in",
    "taxi_out",
    "air_time",
    "dep_time",
]
es_flight.set_secondary_time_index(
    dataframe_name="trip_logs", secondary_time_index={"arr_time": arr_time_columns}
)

Downloading data ...

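Previously, the secondary time index could be accessed directly from the Entity with es_flight['trip_logs'].secondary_time_index. Starting in Featuretools 1.0, the secondary time index and its associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.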

[15]:
es_flight["trip_logs"].ww.metadata["secondary_time_index"]

[15]:
{'arr_time': ['arr_delay',
  'dep_delay',
  'carrier_delay',
  'weather_delay',
  'national_airspace_delay',
  'security_delay',
  'late_aircraft_delay',
  'canceled',
  'diverted',
  'taxi_in',
  'taxi_out',
  'air_time',
  'dep_time',
  'arr_time']}

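Normalizing Entities/DataFrames#

EntitySet.normalize_entity has been renamed to EntitySet.normalize_dataframe in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names must be used.

| Old parameter name | New parameter name |
| --- | --- |
| base_entity_id | base_dataframe_name |
| new_entity_id | new_dataframe_name |
| additional_variables | additional_columns |
| copy_variables | copy_columns |
| new_entity_time_index | new_dataframe_time_index |
| new_entity_secondary_time_index | new_dataframe_secondary_time_index |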
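Because only the parameter names changed, existing keyword arguments can be translated mechanically. The following is a minimal sketch under that assumption; `NORMALIZE_RENAMES` and `rename_normalize_kwargs` are hypothetical names introduced here for illustration and are not part of Featuretools.

```python
# Old -> new keyword names, copied from the table above.
NORMALIZE_RENAMES = {
    "base_entity_id": "base_dataframe_name",
    "new_entity_id": "new_dataframe_name",
    "additional_variables": "additional_columns",
    "copy_variables": "copy_columns",
    "new_entity_time_index": "new_dataframe_time_index",
    "new_entity_secondary_time_index": "new_dataframe_secondary_time_index",
}


def rename_normalize_kwargs(old_kwargs):
    """Return a copy of old_kwargs with legacy normalize_entity names replaced.

    Keys that are not in the rename table (such as ``index``) pass through
    unchanged. The input dictionary is not modified.
    """
    new_kwargs = {}
    for key, value in old_kwargs.items():
        new_key = NORMALIZE_RENAMES.get(key, key)
        if new_key in new_kwargs:
            raise ValueError(f"both old and new names given for {new_key!r}")
        new_kwargs[new_key] = value
    return new_kwargs


legacy_call = {
    "base_entity_id": "items",
    "new_entity_id": "orders",
    "index": "order_id",
    "additional_variables": ["order_date"],
}
print(rename_normalize_kwargs(legacy_call))
```

With a real EntitySet `es`, the translated dictionary could then be unpacked into the new method, for example `es.normalize_dataframe(**rename_normalize_kwargs(legacy_call))`.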

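Defining and adding relationships#

In earlier versions of Featuretools, relationships were defined by creating a Relationship object, which took two Variables as inputs. To define a relationship between the orders entity and the items entity, we would first create a Relationship and then add it to the EntitySet: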

relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])
es.add_relationship(relationship)

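With Featuretools 1.0, the process is similar, but there are two different ways to add a relationship to the EntitySet. One way is to pass the dataframe and column names to EntitySet.add_relationship; the other is to pass a previously created Relationship object to the relationship keyword argument. Both approaches are demonstrated below.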

[16]:
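# Undo the change above and set the child column logical type to match the parent, to prevent a warning
# NOTE: This cell is hidden in the docs build
es["items"].ww.set_types(logical_types={"order_id": "Integer"})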

[17]:
es.add_relationship(
    parent_dataframe_name="orders",
    parent_column_name="order_id",
    child_dataframe_name="items",
    child_column_name="order_id",
)

[17]:
Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
    users [Rows: 3, Columns: 2]
  Relationships:
    items.order_id -> orders.order_id
[18]:
# Reset the relationships so we can add the relationship again
es.relationships = []

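Alternatively, we can first create a Relationship and then pass it to EntitySet.add_relationship. When defining a Relationship, we need to pass in the EntitySet to which it belongs, along with the names of the parent dataframe and parent column and the names of the child dataframe and child column.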

[19]:
relationship = ft.Relationship(
    entityset=es,
    parent_dataframe_name="orders",
    parent_column_name="order_id",
    child_dataframe_name="items",
    child_column_name="order_id",
)
es.add_relationship(relationship=relationship)

[19]:
Entityset: new_es
  DataFrames:
    orders [Rows: 3, Columns: 2]
    items [Rows: 5, Columns: 4]
    users [Rows: 3, Columns: 2]
  Relationships:
    items.order_id -> orders.order_id

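Updating data for a dataframe in an EntitySet#

Previously, to update (replace) the data associated with an Entity, users could call Entity.update_data and pass in the new dataframe. As an example, let's update the data in our users entity: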

new_users_df = pd.DataFrame({
    'id': [3, 4],
    'name': ['Anne Teak', 'Art Decco']
})
es['users'].update_data(df=new_users_df)

To accomplish this task in Featuretools 1.0, we use the EntitySet.replace_dataframe method instead:

[20]:
new_users_df = pd.DataFrame({"id": [0, 1], "name": ["Anne Teak", "Art Decco"]})

es.replace_dataframe(dataframe_name="users", df=new_users_df)
es["users"]

[20]:
id name
0 0 Anne Teak
1 1 Art Decco

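Defining features#

The syntax for defining features has changed slightly in Featuretools 1.0. Previously, an identity feature could be defined by passing in the variable that should be used to build the feature.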

feature = ft.Feature(es['items']['item_price'])

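Starting with Featuretools 1.0, a similar syntax can be used, but because es['items'] now returns a Woodwork dataframe instead of an Entity, the syntax needs a small update to access the Woodwork column. To make the update, simply add .ww between the dataframe name selector and the column selector, as shown below.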

[21]:
feature = ft.Feature(es["items"].ww["item_price"])

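Defining primitives#

In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate Variable class. Starting in version 1.0, the input and return types are defined by Woodwork ColumnSchema objects. To illustrate this change, let's take a closer look at the Age transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the DateOfBirth variable type, and the return type was defined by specifying the Numeric variable type: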

input_types = [DateOfBirth]
return_type = Numeric

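Woodwork does not have a specific DateOfBirth logical type. Instead, a column is identified as a date of birth column by specifying the Datetime logical type together with the date_of_birth semantic tag. There is also no Numeric logical type in Woodwork; rather, Woodwork identifies all columns that can be used for numeric operations with the numeric semantic tag. Furthermore, we know the Age primitive returns a floating point number, which corresponds to the Woodwork Double logical type. With these points in mind, we can redefine the Age input types and return type with ColumnSchema objects as follows: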

input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]
return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})

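Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.

Mapping from old Featuretools variable types to Woodwork ColumnSchemas#

The types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is no direct mapping from the old variable types to the new Woodwork types defined by ColumnSchema objects, the table below shows an approximate mapping.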

| Featuretools Variable | Woodwork Column Schema |
| --- | --- |
| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |
| Categorical | ColumnSchema(logical_type=Categorical) |
| CountryCode | ColumnSchema(logical_type=CountryCode) |
| Datetime | ColumnSchema(logical_type=Datetime) |
| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |
| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |
| Discrete | ColumnSchema(semantic_tags={'category'}) |
| EmailAddress | ColumnSchema(logical_type=EmailAddress) |
| FilePath | ColumnSchema(logical_type=Filepath) |
| FullName | ColumnSchema(logical_type=PersonFullName) |
| Id | ColumnSchema(semantic_tags={'foreign_key'}) |
| Index | ColumnSchema(semantic_tags={'index'}) |
| IPAddress | ColumnSchema(logical_type=IPAddress) |
| LatLong | ColumnSchema(logical_type=LatLong) |
| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |
| Numeric | ColumnSchema(semantic_tags={'numeric'}) |
| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |
| Ordinal | ColumnSchema(logical_type=Ordinal) |
| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |
| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |
| Timedelta | ColumnSchema(logical_type=Timedelta) |
| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |
| URL | ColumnSchema(logical_type=URL) |
| Unknown | ColumnSchema(logical_type=Unknown) |
| ZIPCode | ColumnSchema(logical_type=PostalCode) |
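When porting many custom primitives, a small lookup built from the table above can serve as a reminder of which ColumnSchema to write. The sketch below is a hypothetical helper that deliberately avoids importing Woodwork: logical types are kept as plain strings, only a few rows of the table are included, and `APPROXIMATE_SCHEMAS` and `describe_schema` are illustrative names rather than library APIs.

```python
# Approximate old-variable-type -> (logical type, semantic tags) pairs taken
# from a few rows of the table above. Logical types are plain strings so the
# sketch runs without Woodwork; None means the row specifies tags only.
APPROXIMATE_SCHEMAS = {
    "Categorical": ("Categorical", set()),
    "DateOfBirth": ("Datetime", {"date_of_birth"}),
    "Id": (None, {"foreign_key"}),
    "Index": (None, {"index"}),
    "Numeric": (None, {"numeric"}),
}


def describe_schema(old_variable_type):
    """Render the approximate ColumnSchema call for an old variable type name."""
    logical_type, tags = APPROXIMATE_SCHEMAS[old_variable_type]
    parts = []
    if logical_type is not None:
        parts.append(f"logical_type={logical_type}")
    if tags:
        # Sort the tags so the rendered text is deterministic.
        rendered = ", ".join(repr(tag) for tag in sorted(tags))
        parts.append("semantic_tags={" + rendered + "}")
    return "ColumnSchema(" + ", ".join(parts) + ")"


for name in ("DateOfBirth", "Numeric", "Id"):
    print(name, "->", describe_schema(name))
```

Since the mapping is only approximate, the rendered text is a starting point to review, not a guaranteed equivalent of the old type.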

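Changes to Deep Feature Synthesis#

The argument names for both featuretools.dfs and featuretools.calculate_feature_matrix have changed slightly in Featuretools 1.0. In previous versions, users could generate a list of features using the default primitives and options like this: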

features = ft.dfs(entityset=es,
                  target_entity='items',
                  features_only=True)

In Featuretools 1.0, the target_entity argument has been renamed to target_dataframe_name, but otherwise this basic call remains the same.

[22]:
features = ft.dfs(entityset=es, target_dataframe_name="items", features_only=True)
features

[22]:
[<Feature: order_id>,
 <Feature: item_price>,
 <Feature: on_sale>,
 <Feature: orders.COUNT(items)>,
 <Feature: orders.MAX(items.item_price)>,
 <Feature: orders.MEAN(items.item_price)>,
 <Feature: orders.MIN(items.item_price)>,
 <Feature: orders.PERCENT_TRUE(items.on_sale)>,
 <Feature: orders.SKEW(items.item_price)>,
 <Feature: orders.STD(items.item_price)>,
 <Feature: orders.SUM(items.item_price)>,
 <Feature: orders.DAY(order_date)>,
 <Feature: orders.MONTH(order_date)>,
 <Feature: orders.WEEKDAY(order_date)>,
 <Feature: orders.YEAR(order_date)>]

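Additionally, the dfs argument ignore_entities has been renamed to ignore_dataframes, and ignore_variables has been renamed to ignore_columns. Similarly, when specifying primitive options, all references to entities should be replaced with dataframes, and all references to variables should be replaced with columns. For example, the primitive option include_groupby_entities is now include_groupby_dataframes, and include_variables is now include_columns.

The basic call to featuretools.calculate_feature_matrix remains unchanged when passing in an EntitySet along with a list of features to calculate. However, users who call calculate_feature_matrix by passing in a list of entities and relationships should note that the entities argument has been renamed to dataframes, and the values in the dictionary should now include Woodwork logical types instead of Featuretools Variable classes.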

[23]:
feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)
feature_matrix

[23]:
order_id item_price on_sale orders.COUNT(items) orders.MAX(items.item_price) orders.MEAN(items.item_price) orders.MIN(items.item_price) orders.PERCENT_TRUE(items.on_sale) orders.SKEW(items.item_price) orders.STD(items.item_price) orders.SUM(items.item_price) orders.DAY(order_date) orders.MONTH(order_date) orders.WEEKDAY(order_date) orders.YEAR(order_date)
id
0 0 29.95 False 1 29.95 29.950 29.95 0.0 NaN NaN 29.95 2 1 5 2021
1 1 4.99 True 2 10.25 7.620 4.99 0.5 NaN 3.719382 15.24 3 1 6 2021
2 1 10.25 False 2 10.25 7.620 4.99 0.5 NaN 3.719382 15.24 3 1 6 2021
3 2 20.50 True 2 20.50 18.245 15.99 0.5 NaN 3.189052 36.49 4 1 0 2021
4 2 15.99 False 2 20.50 18.245 15.99 0.5 NaN 3.189052 36.49 4 1 0 2021

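In addition to the changes in argument names, users should be aware of a few other changes in the returned feature matrix.

First, because Woodwork defines column types slightly differently than the previous Featuretools implementation, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously Id) columns as categorical columns and would generate the appropriate features from those columns. Starting in version 1.0, foreign key columns are no longer constrained to be categorical, and if they are another type, such as Integer, features will not be generated from them. Manually converting foreign key columns to Categorical, as shown above, will produce features very close to those generated by previous versions.

Also, because the Woodwork type inference process differs from the previous Featuretools type inference process, an EntitySet may have different column types identified. This difference in column types can affect the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types where needed.

Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace, as shown below.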

[24]:
feature_matrix.ww

[24]:
Physical Type Logical Type Semantic Tag(s)
Column
order_id int64 Integer ['numeric', 'foreign_key']
item_price float64 Double ['numeric']
on_sale bool Boolean []
orders.COUNT(items) Int64 IntegerNullable ['numeric']
orders.MAX(items.item_price) float64 Double ['numeric']
orders.MEAN(items.item_price) float64 Double ['numeric']
orders.MIN(items.item_price) float64 Double ['numeric']
orders.PERCENT_TRUE(items.on_sale) float64 Double ['numeric']
orders.SKEW(items.item_price) float64 Double ['numeric']
orders.STD(items.item_price) float64 Double ['numeric']
orders.SUM(items.item_price) float64 Double ['numeric']
orders.DAY(order_date) category Ordinal: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] ['category']
orders.MONTH(order_date) category Ordinal: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] ['category']
orders.WEEKDAY(order_date) category Ordinal: [0, 1, 2, 3, 4, 5, 6] ['category']
orders.YEAR(order_date) category Ordinal: [...] ['category']

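Featuretools now labels features according to whether they were present in the original dataframes or were created by Featuretools. This information is stored in the Woodwork origin attribute for the column. Columns that were in the original data are labeled base, and features that were created by Featuretools are labeled engineered. To demonstrate how to access this information, let's compare two features in the feature matrix: item_price and orders.MEAN(items.item_price). item_price was present in the original data, while orders.MEAN(items.item_price) was created by Featuretools.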

[25]:
feature_matrix.ww["item_price"].ww.origin

[25]:
'base'
[26]:
feature_matrix.ww["orders.MEAN(items.item_price)"].ww.origin

[26]:
'engineered'

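Other changes#

In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 that existing users should be aware of.

  • Column ordering of a dataframe in an EntitySet might be different from before. Previously, Featuretools would reorder the columns so that the index column was always the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column. The index column now remains in the position it had when the dataframe was added to the EntitySet.

  • For LatLong columns, older versions of Featuretools would replace single nan values in the column with a tuple (nan, nan). This is no longer the case, and single nan values now remain in LatLong columns. Following Woodwork's behavior, any (nan, nan) values in a LatLong column are replaced with a single nan value.

  • Because Featuretools no longer defines Variable objects with relationships between them, the featuretools.variable_types.graph_variable_types function has been removed.

  • The featuretools.variable_types.list_variable_types utility function has been removed and replaced with two corresponding Woodwork functions: woodwork.list_logical_types and woodwork.list_semantic_tags. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information about the logical types and semantic tags that can be applied to dataframe columns.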