Woodwork Typing in Featuretools#

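Featuretools relies on consistent data typing when creating EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system built on objects called Variables. Now and going forward, Featuretools uses an external data typing library for its typing: Woodwork.

Understanding the types that exist in Woodwork, and how Featuretools uses Woodwork's type system, will allow users to:

- build EntitySets that best represent their data
- understand the possible input and return types for Featuretools' Primitives
- understand what features will get generated from a given set of data and primitives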

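Read the Understanding Woodwork Logical Types and Semantic Tags guide for an in-depth walkthrough of the available Woodwork types outlined below. For users familiar with the old Variable objects, the Transitioning to Featuretools Version 1.0 guide will be useful for converting Variable types to Woodwork types.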

Physical Types#

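Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column's dtype. Knowing a Woodwork DataFrame's physical types is important because pandas relies on these types when performing DataFrame operations. Each Woodwork LogicalType class has a single physical type associated with it.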

Logical Types#

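Logical types add information about how data should be interpreted or parsed, beyond what a physical type can express. In fact, multiple logical types share the same physical type, each imparting a different meaning that is not contained in the physical type alone. In Featuretools, a column's logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis. Woodwork provides many different logical types, which can be seen with the list_logical_types function.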

[1]:
import featuretools as ft


ft.list_logical_types()

[1]:
name type_string description physical_type standard_tags is_default_type is_registered parent_type
0 Address address Represents Logical Types that contain address ... string {} True True None
1 Age age Represents Logical Types that contain whole nu... int64 {numeric} True True Integer
2 AgeFractional age_fractional Represents Logical Types that contain non-nega... float64 {numeric} True True Double
3 AgeNullable age_nullable Represents Logical Types that contain whole nu... Int64 {numeric} True True IntegerNullable
4 Boolean boolean Represents Logical Types that contain binary v... bool {} True True BooleanNullable
5 BooleanNullable boolean_nullable Represents Logical Types that contain binary v... boolean {} True True None
6 Categorical categorical Represents Logical Types that contain unordere... category {category} True True None
7 CountryCode country_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
8 CurrencyCode currency_code Represents Logical Types that use the ISO-4217... category {category} True True Categorical
9 Datetime datetime Represents Logical Types that contain date and... datetime64[ns] {} True True None
10 Double double Represents Logical Types that contain positive... float64 {numeric} True True None
11 EmailAddress email_address Represents Logical Types that contain email ad... string {} True True Unknown
12 Filepath filepath Represents Logical Types that specify location... string {} True True None
13 IPAddress ip_address Represents Logical Types that contain IP addre... string {} True True Unknown
14 Integer integer Represents Logical Types that contain positive... int64 {numeric} True True IntegerNullable
15 IntegerNullable integer_nullable Represents Logical Types that contain positive... Int64 {numeric} True True None
16 LatLong lat_long Represents Logical Types that contain latitude... object {} True True None
17 NaturalLanguage natural_language Represents Logical Types that contain text or ... string {} True True None
18 Ordinal ordinal Represents Logical Types that contain ordered ... category {category} True True Categorical
19 PersonFullName person_full_name Represents Logical Types that may contain firs... string {} True True None
20 PhoneNumber phone_number Represents Logical Types that contain numeric ... string {} True True Unknown
21 PostalCode postal_code Represents Logical Types that contain a series... category {category} True True Categorical
22 SubRegionCode sub_region_code Represents Logical Types that use the ISO-3166... category {category} True True Categorical
23 Timedelta timedelta Represents Logical Types that contain values s... timedelta64[ns] {} True True Unknown
24 URL url Represents Logical Types that contain URLs, wh... string {} True True Unknown
25 Unknown unknown Represents Logical Types that cannot be inferr... string {} True True None

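Featuretools will perform type inference to assign logical types to the data in EntitySets if none are provided, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type). To learn more about how logical types are used in EntitySets, see the Creating EntitySets guide. To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on working with Logical Types.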

Semantic Tags#

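Semantic tags provide additional information to columns about the meaning or potential uses of data. Columns can have many or no semantic tags. Some tags are added by Woodwork, some are added by Featuretools, and users can add additional tags as they see fit. To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on working with Semantic Tags.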

Woodwork-defined Semantic Tags#

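Woodwork will add certain semantic tags to columns at initialization. These can be standard tags, which are associated with sets of logical types, or index tags. There are also tags that users can add to confer a suggested meaning to columns in Woodwork. To get a list of these tags, use the list_semantic_tags function.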

[2]:
ft.list_semantic_tags()

[2]:
name is_standard_tag valid_logical_types
0 numeric True [Age, AgeFractional, AgeNullable, Double, Inte...
1 category True [Categorical, CountryCode, CurrencyCode, Ordin...
2 index False Any LogicalType
3 time_index False [Datetime, Age, AgeFractional, AgeNullable, Do...
4 date_of_birth False [Datetime]
5 ignore False Any LogicalType
6 passthrough False Any LogicalType

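Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools interprets data. One example is the Age primitive, which requires that the date_of_birth semantic tag be present on a column. The date_of_birth tag is not added automatically by Woodwork, so in order for Featuretools to be able to use the Age primitive, the date_of_birth tag must be manually added to any column to which it applies.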

Featuretools-defined Semantic Tags#

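Just as Woodwork specifies semantic tags internally, Featuretools also defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.

- 'last_time_index': added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.
- 'foreign_key': used to indicate that this column is the child column of a relationship, meaning that this column is related to a corresponding index column of another DataFrame in the EntitySet.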

Woodwork Throughout Featuretools#

Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.

Woodwork in EntitySets#

For more information on building EntitySets using Woodwork, see the EntitySet guide. Let's look at the Woodwork typing information stored in the retail demo EntitySet:

[3]:
es = ft.demo.load_retail()
es

[3]:
Entityset: demo_retail_data
  DataFrames:
    order_products [Rows: 401604, Columns: 8]
    products [Rows: 3684, Columns: 4]
    orders [Rows: 22190, Columns: 6]
    customers [Rows: 4372, Columns: 3]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_name -> customers.customer_name

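Woodwork typing information is not stored on the EntitySet object itself, but in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the ww namespace: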

[4]:
df = es["products"]
df.head()

[4]:
product_id description first_order_products_time _ft_last_time
85123A 85123A WHITE HANGING HEART T-LIGHT HOLDER 2010-12-01 08:26:00 2011-12-09 11:34:00
71053 71053 WHITE METAL LANTERN 2010-12-01 08:26:00 2011-12-07 14:12:00
84406B 84406B CREAM CUPID HEARTS COAT HANGER 2010-12-01 08:26:00 2011-12-05 14:30:00
84029G 84029G KNITTED UNION FLAG HOT WATER BOTTLE 2010-12-01 08:26:00 2011-12-09 11:26:00
84029E 84029E RED WOOLLY HOTTIE WHITE HEART. 2010-12-01 08:26:00 2011-12-09 09:07:00
[5]:
df.ww

[5]:
Physical Type Logical Type Semantic Tag(s)
Column
product_id category Categorical ['index']
description string NaturalLanguage []
first_order_products_time datetime64[ns] Datetime ['time_index']
_ft_last_time datetime64[ns] Datetime ['last_time_index']

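Notice how the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: by defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined a DataFrame's Woodwork schema, and with it we can gain an understanding of the contents of each column. This column-specific typing information, which exists for every column in every DataFrame in an EntitySet, is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.

Woodwork in DFS#

As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the Feature Primitives guide. Here, we'll look at how the Woodwork types come together into a ColumnSchema object to describe Primitive input and return types.

Below is a Woodwork ColumnSchema that we've obtained from the 'product_id' column in the products DataFrame in the retail EntitySet.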

[6]:
products_df = es["products"]
product_ids_series = products_df.ww["product_id"]
column_schema = product_ids_series.ww.schema
column_schema

[6]:
<ColumnSchema (Logical Type = Categorical) (Semantic Tags = ['index'])>

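This combination of logical type and semantic tag typing information is a ColumnSchema. In the case above, the ColumnSchema describes the type definition for a single column of data.

Notice that there is no physical type in a ColumnSchema. This is because a ColumnSchema is a collection of Woodwork types that doesn't have any data tied to it and therefore has no physical representation. Because a ColumnSchema object is not tied to any data, it can also be used to describe a type space into which other columns may or may not fall.

This flexibility of the ColumnSchema class allows ColumnSchema objects to be used both as type definitions for every column in an EntitySet and as input and return type spaces for every Primitive in Featuretools.

Let's look at a different column in a different DataFrame to see how this works: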

[7]:
order_products_df = es["order_products"]
order_products_df.head()

[7]:
order_product_id order_id product_id quantity order_date unit_price total _ft_last_time
0 0 536365 85123A 6 2010-12-01 08:26:00 4.2075 25.245 2010-12-01 08:26:00
1 1 536365 71053 6 2010-12-01 08:26:00 5.5935 33.561 2010-12-01 08:26:00
2 2 536365 84406B 8 2010-12-01 08:26:00 4.5375 36.300 2010-12-01 08:26:00
3 3 536365 84029G 6 2010-12-01 08:26:00 5.5935 33.561 2010-12-01 08:26:00
4 4 536365 84029E 6 2010-12-01 08:26:00 5.5935 33.561 2010-12-01 08:26:00
[8]:
quantity_series = order_products_df.ww["quantity"]
column_schema = quantity_series.ww.schema
column_schema

[8]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>

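The ColumnSchema above has been pulled from the 'quantity' column in the order_products DataFrame in the retail EntitySet. This is a type definition.

If we look at the Woodwork typing information for the order_products DataFrame, we can see that there are several columns that will have similar ColumnSchema type definitions. If we wanted to describe subsets of those columns, we could define several ColumnSchema type spaces.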

[9]:
es["order_products"].ww

[9]:
Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
order_id category Categorical ['category', 'foreign_key']
product_id category Categorical ['category', 'foreign_key']
quantity int64 Integer ['numeric']
order_date datetime64[ns] Datetime ['time_index']
unit_price float64 Double ['numeric']
total float64 Double ['numeric']
_ft_last_time datetime64[ns] Datetime ['last_time_index']

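Below are several ColumnSchemas that all include our quantity column, but each of them describes a different type space. These ColumnSchemas get more restrictive as we go down:

Entire DataFrame#

No restrictions have been placed; any column falls into this definition. This would include the whole DataFrame.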

[10]:
from woodwork.column_schema import ColumnSchema

ColumnSchema()

[10]:
<ColumnSchema>

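An example of a Primitive with this ColumnSchema as its input type is the IsNull transform primitive.

By Semantic Tag#

Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns. It will not include the index column which, despite containing integers, has had its standard tags replaced by the 'index' tag.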

[11]:
ColumnSchema(semantic_tags={"numeric"})

[11]:
<ColumnSchema (Semantic Tags = ['numeric'])>
[12]:
df = es["order_products"].ww.select(include="numeric")
df.ww

[12]:
Physical Type Logical Type Semantic Tag(s)
Column
quantity int64 Integer ['numeric']
unit_price float64 Double ['numeric']
total float64 Double ['numeric']

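An example of a Primitive with this ColumnSchema as its input type is the Mean aggregation primitive.

By Logical Type#

Only columns with a logical type of Integer are included in this definition. The numeric tag is not required, so an index column (which has had its standard tags removed) still applies.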

[13]:
from woodwork.logical_types import Integer

ColumnSchema(logical_type=Integer)

[13]:
<ColumnSchema (Logical Type = Integer)>
[14]:
df = es["order_products"].ww.select(include="Integer")
df.ww

[14]:
Physical Type Logical Type Semantic Tag(s)
Column
order_product_id int64 Integer ['index']
quantity int64 Integer ['numeric']

By Logical Type and Semantic Tag#

The column must have a logical type of Integer and have the numeric semantic tag, which excludes index columns.

[15]:
ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})

[15]:
<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>
[16]:
df = es["order_products"].ww.select(include="numeric")
df = df.ww.select(include="Integer")
df.ww

[16]:
Physical Type Logical Type Semantic Tag(s)
Column
quantity int64 Integer ['numeric']
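The four type spaces above can be summarized with a small, library-free model. This is a simplified sketch, not Featuretools' or Woodwork's actual matching logic: `TypeSpace`, `fits`, and `select` are hypothetical names, logical types are represented as plain strings, and the column definitions are copied from the order_products typing table shown earlier.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class TypeSpace:
    """Hypothetical stand-in for a ColumnSchema: type information only, no data."""

    logical_type: Optional[str] = None
    semantic_tags: FrozenSet[str] = frozenset()


def fits(column: TypeSpace, space: TypeSpace) -> bool:
    """Return True if a column's type definition falls inside a type space."""
    if space.logical_type is not None and column.logical_type != space.logical_type:
        return False
    # Every tag required by the space must be present on the column.
    return space.semantic_tags <= column.semantic_tags


# Type definitions copied from the order_products table shown earlier.
ORDER_PRODUCTS = {
    "order_product_id": TypeSpace("Integer", frozenset({"index"})),
    "order_id": TypeSpace("Categorical", frozenset({"category", "foreign_key"})),
    "product_id": TypeSpace("Categorical", frozenset({"category", "foreign_key"})),
    "quantity": TypeSpace("Integer", frozenset({"numeric"})),
    "order_date": TypeSpace("Datetime", frozenset({"time_index"})),
    "unit_price": TypeSpace("Double", frozenset({"numeric"})),
    "total": TypeSpace("Double", frozenset({"numeric"})),
    "_ft_last_time": TypeSpace("Datetime", frozenset({"last_time_index"})),
}


def select(space: TypeSpace) -> List[str]:
    """List the order_products columns that fall inside the given type space."""
    return [name for name, column in ORDER_PRODUCTS.items() if fits(column, space)]


everything = select(TypeSpace())
by_tag = select(TypeSpace(semantic_tags=frozenset({"numeric"})))
by_type = select(TypeSpace(logical_type="Integer"))
by_both = select(TypeSpace("Integer", frozenset({"numeric"})))

print(len(everything), by_tag, by_type, by_both)
```

Each successive type space adds a constraint, so the set of matching columns shrinks from all eight columns down to quantity alone, mirroring the `ww.select` results above.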

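In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive when building Features during DFS.

Each Primitive has input_types and a return_type that are described by a Woodwork ColumnSchema. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the columns in each DataFrame that are valid for the Primitive's input_types. We then get a Feature that has a column_schema property indicating what that Feature's typing definition is, which lets DFS stack features on top of one another.

In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the ColumnSchema, and use it in concert with an EntitySet of Woodwork DataFrames to build Features with Deep Feature Synthesis.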