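Woodwork Typing in Featuretools#
Featuretools relies on consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system, which contained objects called Variables. Now and going forward, Featuretools uses an external data typing library for its typing: Woodwork. Understanding which Woodwork types exist and how Featuretools uses Woodwork's type system will allow users to:
- build EntitySets that best represent their data
- understand the possible input and return types of Featuretools' Primitives
- understand which features will be generated from a given set of data and Primitives
Read the Understanding Woodwork Logical Types and Semantic Tags guide for an in-depth walkthrough of the available Woodwork types outlined below. For users who are familiar with the old Variable objects, the Transitioning to Featuretools Version 1.0 guide will help with converting Variable types to Woodwork types.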
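Physical Types#
Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You may also see a column's physical type referred to as the column's dtype. Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. Each Woodwork LogicalType class has a single physical type associated with it.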
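As a rough illustration of what a physical type is, the sketch below builds a small pandas DataFrame and reads back each column's dtype. The column names and values are made up for the example; only the dtypes matter.

```python
import pandas as pd

# Hypothetical toy data; each column is created with an explicit pandas dtype.
df = pd.DataFrame(
    {
        "quantity": pd.Series([6, 8, 6], dtype="int64"),
        "quantity_nullable": pd.Series([6, None, 6], dtype="Int64"),
        "product_id": pd.Series(["85123A", "71053", "84406B"], dtype="category"),
    }
)

# The dtype pandas reports for a column is what Woodwork calls its physical type.
physical_types = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(physical_types)
```

Note that `int64` and `Int64` are different physical types: the capitalized one is the pandas nullable integer dtype, which matches the `Integer` and `IntegerNullable` rows in the logical type table.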
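Logical Types#
Logical types add information about how data should be interpreted or parsed beyond what a physical type can contain. In fact, multiple logical types share the same physical type, each imparting a different meaning that is not contained in the physical type alone. In Featuretools, a column's logical type informs how data is read into an EntitySet and how it is used later in Deep Feature Synthesis. Woodwork provides many different logical types, which can be viewed with the list_logical_types function.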
[1]:
import featuretools as ft
ft.list_logical_types()
[1]:
name | type_string | description | physical_type | standard_tags | is_default_type | is_registered | parent_type | |
---|---|---|---|---|---|---|---|---|
0 | Address | address | Represents Logical Types that contain address ... | string | {} | True | True | None |
1 | Age | age | Represents Logical Types that contain whole nu... | int64 | {numeric} | True | True | Integer |
2 | AgeFractional | age_fractional | Represents Logical Types that contain non-nega... | float64 | {numeric} | True | True | Double |
3 | AgeNullable | age_nullable | Represents Logical Types that contain whole nu... | Int64 | {numeric} | True | True | IntegerNullable |
4 | Boolean | boolean | Represents Logical Types that contain binary v... | bool | {} | True | True | BooleanNullable |
5 | BooleanNullable | boolean_nullable | Represents Logical Types that contain binary v... | boolean | {} | True | True | None |
6 | Categorical | categorical | Represents Logical Types that contain unordere... | category | {category} | True | True | None |
7 | CountryCode | country_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
8 | CurrencyCode | currency_code | Represents Logical Types that use the ISO-4217... | category | {category} | True | True | Categorical |
9 | Datetime | datetime | Represents Logical Types that contain date and... | datetime64[ns] | {} | True | True | None |
10 | Double | double | Represents Logical Types that contain positive... | float64 | {numeric} | True | True | None |
11 | EmailAddress | email_address | Represents Logical Types that contain email ad... | string | {} | True | True | Unknown |
12 | Filepath | filepath | Represents Logical Types that specify location... | string | {} | True | True | None |
13 | IPAddress | ip_address | Represents Logical Types that contain IP addre... | string | {} | True | True | Unknown |
14 | Integer | integer | Represents Logical Types that contain positive... | int64 | {numeric} | True | True | IntegerNullable |
15 | IntegerNullable | integer_nullable | Represents Logical Types that contain positive... | Int64 | {numeric} | True | True | None |
16 | LatLong | lat_long | Represents Logical Types that contain latitude... | object | {} | True | True | None |
17 | NaturalLanguage | natural_language | Represents Logical Types that contain text or ... | string | {} | True | True | None |
18 | Ordinal | ordinal | Represents Logical Types that contain ordered ... | category | {category} | True | True | Categorical |
19 | PersonFullName | person_full_name | Represents Logical Types that may contain firs... | string | {} | True | True | None |
20 | PhoneNumber | phone_number | Represents Logical Types that contain numeric ... | string | {} | True | True | Unknown |
21 | PostalCode | postal_code | Represents Logical Types that contain a series... | category | {category} | True | True | Categorical |
22 | SubRegionCode | sub_region_code | Represents Logical Types that use the ISO-3166... | category | {category} | True | True | Categorical |
23 | Timedelta | timedelta | Represents Logical Types that contain values s... | timedelta64[ns] | {} | True | True | Unknown |
24 | URL | url | Represents Logical Types that contain URLs, wh... | string | {} | True | True | Unknown |
25 | Unknown | unknown | Represents Logical Types that cannot be inferr... | string | {} | True | True | None |
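Featuretools performs type inference to assign logical types to the data in EntitySets when none are provided, but it is also possible to specify which logical type should be set for any column (provided that the data in that column is compatible with the logical type). To learn more about how logical types are used in EntitySets, see the Creating EntitySets guide. To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on working with Logical Types.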
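Semantic Tags#
Semantic tags provide additional information about the meaning or potential uses of the data in a column. A column can have many semantic tags or none. Some tags are added by Woodwork, some are added by Featuretools, and users can add further tags as they see fit. To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on working with Semantic Tags.
Woodwork-defined Semantic Tags#
Woodwork adds certain semantic tags to columns at initialization. These can be standard tags, which are associated with different sets of logical types, or index tags. There are also tags that users can add to give a column a suggested meaning within Woodwork. To get a list of these tags, use the list_semantic_tags function.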
[2]:
ft.list_semantic_tags()
[2]:
name | is_standard_tag | valid_logical_types | |
---|---|---|---|
0 | numeric | True | [Age, AgeFractional, AgeNullable, Double, Inte... |
1 | category | True | [Categorical, CountryCode, CurrencyCode, Ordin... |
2 | index | False | Any LogicalType |
3 | time_index | False | [Datetime, Age, AgeFractional, AgeNullable, Do... |
4 | date_of_birth | False | [Datetime] |
5 | ignore | False | Any LogicalType |
6 | passthrough | False | Any LogicalType |
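Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools interprets data. One example is the Age primitive, which requires the date_of_birth semantic tag to be present on a column. The date_of_birth tag is not added automatically by Woodwork, so for Featuretools to be able to use the Age primitive, the date_of_birth tag must be added manually to every column to which it applies.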
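Featuretools-defined Semantic Tags#
Just as Woodwork specifies semantic tags internally, Featuretools defines a few tags of its own that allow the full set of features to be generated. These tags have specific meanings when they are present on a column.
- 'last_time_index': added by Featuretools to the last time index column of a DataFrame. Indicates that this column was created by Featuretools.
- 'foreign_key': indicates that this column is the child column of a relationship, meaning that it is related to a corresponding index column of another DataFrame in the EntitySet.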
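Woodwork Throughout Featuretools#
Now that we have described the elements that make up Woodwork's type system, let's see them in action in Featuretools.
Woodwork in EntitySets#
For more information on building EntitySets with Woodwork, see the EntitySet guide. Let's look at the Woodwork typing information stored in a demo EntitySet of retail data: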
[3]:
es = ft.demo.load_retail()
es
[3]:
Entityset: demo_retail_data
DataFrames:
order_products [Rows: 401604, Columns: 8]
products [Rows: 3684, Columns: 4]
orders [Rows: 22190, Columns: 6]
customers [Rows: 4372, Columns: 3]
Relationships:
order_products.product_id -> products.product_id
order_products.order_id -> orders.order_id
orders.customer_name -> customers.customer_name
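Woodwork typing information is not stored in the EntitySet object itself but in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet and then access the Woodwork information through the ww namespace: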
[4]:
df = es["products"]
df.head()
[4]:
product_id | description | first_order_products_time | _ft_last_time | |
---|---|---|---|---|
85123A | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2010-12-01 08:26:00 | 2011-12-09 11:34:00 |
71053 | 71053 | WHITE METAL LANTERN | 2010-12-01 08:26:00 | 2011-12-07 14:12:00 |
84406B | 84406B | CREAM CUPID HEARTS COAT HANGER | 2010-12-01 08:26:00 | 2011-12-05 14:30:00 |
84029G | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 2010-12-01 08:26:00 | 2011-12-09 11:26:00 |
84029E | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 2010-12-01 08:26:00 | 2011-12-09 09:07:00 |
[5]:
df.ww
[5]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
product_id | category | Categorical | ['index'] |
description | string | NaturalLanguage | [] |
first_order_products_time | datetime64[ns] | Datetime | ['time_index'] |
_ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
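Notice that the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: by defining a physical type, a logical type, and semantic tags for each column in a DataFrame, we define the DataFrame's Woodwork schema, and through that schema we can understand the contents of each column. This column-specific typing information, present on every DataFrame in an EntitySet, is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.
Woodwork in DFS#
As the units of computation in Featuretools, Primitives need to be able to specify the input types they allow and to have predictable return types. For an in-depth explanation of Primitives in Featuretools, see the Feature Primitives guide. Here, we look at how the Woodwork types come together into a ColumnSchema object to describe Primitive input and return types. Below is the Woodwork ColumnSchema obtained from the 'product_id' column in the products DataFrame of the retail EntitySet.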
[6]:
products_df = es["products"]
product_ids_series = products_df.ww["product_id"]
column_schema = product_ids_series.ww.schema
column_schema
[6]:
<ColumnSchema (Logical Type = Categorical) (Semantic Tags = ['index'])>
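This combination of logical type and semantic tag typing information is a ColumnSchema. In the case above, the ColumnSchema describes the type definition for a single column of data. Notice that there is no physical type in a ColumnSchema. This is because a ColumnSchema is a collection of Woodwork types that has no data tied to it and therefore has no physical representation. Because a ColumnSchema object is not tied to any data, it can also be used to describe a type space that other columns may or may not fall into. This flexibility allows ColumnSchema objects to be used both as type definitions for every column in an EntitySet and as input and return type spaces for every Primitive in Featuretools. Let's look at a different column in a different DataFrame to see how this works: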
[7]:
order_products_df = es["order_products"]
order_products_df.head()
[7]:
order_product_id | order_id | product_id | quantity | order_date | unit_price | total | _ft_last_time | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 536365 | 85123A | 6 | 2010-12-01 08:26:00 | 4.2075 | 25.245 | 2010-12-01 08:26:00 |
1 | 1 | 536365 | 71053 | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
2 | 2 | 536365 | 84406B | 8 | 2010-12-01 08:26:00 | 4.5375 | 36.300 | 2010-12-01 08:26:00 |
3 | 3 | 536365 | 84029G | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
4 | 4 | 536365 | 84029E | 6 | 2010-12-01 08:26:00 | 5.5935 | 33.561 | 2010-12-01 08:26:00 |
[8]:
quantity_series = order_products_df.ww["quantity"]
column_schema = quantity_series.ww.schema
column_schema
[8]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
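The ColumnSchema above was pulled from the 'quantity' column in the order_products DataFrame of the retail EntitySet. This is a type definition. If we look at the Woodwork typing information for the order_products DataFrame, we can see that several columns have similar ColumnSchema type definitions. To describe subsets of those columns, we can define several ColumnSchema type spaces.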
[9]:
es["order_products"].ww
[9]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_product_id | int64 | Integer | ['index'] |
order_id | category | Categorical | ['category', 'foreign_key'] |
product_id | category | Categorical | ['category', 'foreign_key'] |
quantity | int64 | Integer | ['numeric'] |
order_date | datetime64[ns] | Datetime | ['time_index'] |
unit_price | float64 | Double | ['numeric'] |
total | float64 | Double | ['numeric'] |
_ft_last_time | datetime64[ns] | Datetime | ['last_time_index'] |
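Below are several ColumnSchemas that all include our quantity column, but each describes a different type space. These ColumnSchemas become more restrictive as we go down:
Entire DataFrame#
No restrictions have been placed; any column falls under this definition. This would include the whole DataFrame.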
[10]:
from woodwork.column_schema import ColumnSchema
ColumnSchema()
[10]:
<ColumnSchema>
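An example of a Primitive with this ColumnSchema as its input type is the IsNull transform primitive.
By Semantic Tag#
Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns. It does not include the index column which, despite containing integers, has had its standard tags replaced by the 'index' tag.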
[11]:
ColumnSchema(semantic_tags={"numeric"})
[11]:
<ColumnSchema (Semantic Tags = ['numeric'])>
[12]:
df = es["order_products"].ww.select(include="numeric")
df.ww
[12]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
quantity | int64 | Integer | ['numeric'] |
unit_price | float64 | Double | ['numeric'] |
total | float64 | Double | ['numeric'] |
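An example of a Primitive with this ColumnSchema as its input type is the Mean aggregation primitive.
By Logical Type#
Only columns with a logical type of Integer are included in this definition. The numeric tag is not required, so an index column (which has had its standard tags removed) still applies.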
[13]:
from woodwork.logical_types import Integer
ColumnSchema(logical_type=Integer)
[13]:
<ColumnSchema (Logical Type = Integer)>
[14]:
df = es["order_products"].ww.select(include="Integer")
df.ww
[14]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
order_product_id | int64 | Integer | ['index'] |
quantity | int64 | Integer | ['numeric'] |
By Logical Type and Semantic Tag#
The column must have a logical type of Integer and carry the numeric semantic tag, which excludes index columns.
[15]:
ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
[15]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
[16]:
df = es["order_products"].ww.select(include="numeric")
df = df.ww.select(include="Integer")
df.ww
[16]:
Physical Type | Logical Type | Semantic Tag(s) | |
---|---|---|---|
Column | |||
quantity | int64 | Integer | ['numeric'] |
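In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive when building features during DFS. Each Primitive has input_types and a return_type that are described by Woodwork ColumnSchemas. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the columns in a DataFrame that are valid for a Primitive's input_types. We then get a feature with a column_schema property that indicates the feature's type definition, which lets DFS stack features on top of one another. In this way, Featuretools is able to take the base unit of Woodwork typing information, the ColumnSchema, and use it together with an EntitySet of Woodwork DataFrames to build features with Deep Feature Synthesis.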