Woodwork Typing in Featuretools#

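Featuretools relies on consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system, which contained objects called Variables. Now and moving forward, Featuretools uses an external data typing library for its typing: Woodwork.

Understanding the Woodwork types that exist and how Featuretools uses Woodwork's type system will allow users to:

- build EntitySets that best represent their data
- understand the possible input and return types for Featuretools' Primitives
- understand which features will be generated from a given set of data and primitives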

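Read the Understanding Woodwork Logical Types and Semantic Tags guide for an in-depth walkthrough of the available Woodwork types outlined below. For users who are familiar with the old Variable objects, the Transitioning to Featuretools Version 1.0 guide will be useful for converting Variable types to Woodwork types.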

Physical Types#

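Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type of a column referred to as the column's dtype. Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. Each Woodwork LogicalType class has a single physical type associated with it.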

Logical Types#

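Logical types add information about how data should be interpreted or parsed beyond what a physical type can contain. In fact, multiple logical types share the same physical type, each imparting a different meaning that is not contained in the physical type alone. In Featuretools, a column's logical type informs how data is read into an EntitySet and how it is used later in Deep Feature Synthesis. Woodwork provides many different logical types, which can be seen with the list_logical_types function.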

[1]:

import featuretools as ft

ft.list_logical_types()

[1]:

	name	type_string	description	physical_type	standard_tags	is_default_type	is_registered	parent_type
0	Address	address	Represents Logical Types that contain address ...	string	{}	True	True	None
1	Age	age	Represents Logical Types that contain whole nu...	int64	{numeric}	True	True	Integer
2	AgeFractional	age_fractional	Represents Logical Types that contain non-nega...	float64	{numeric}	True	True	Double
3	AgeNullable	age_nullable	Represents Logical Types that contain whole nu...	Int64	{numeric}	True	True	IntegerNullable
4	Boolean	boolean	Represents Logical Types that contain binary v...	bool	{}	True	True	BooleanNullable
5	BooleanNullable	boolean_nullable	Represents Logical Types that contain binary v...	boolean	{}	True	True	None
6	Categorical	categorical	Represents Logical Types that contain unordere...	category	{category}	True	True	None
7	CountryCode	country_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
8	CurrencyCode	currency_code	Represents Logical Types that use the ISO-4217...	category	{category}	True	True	Categorical
9	Datetime	datetime	Represents Logical Types that contain date and...	datetime64[ns]	{}	True	True	None
10	Double	double	Represents Logical Types that contain positive...	float64	{numeric}	True	True	None
11	EmailAddress	email_address	Represents Logical Types that contain email ad...	string	{}	True	True	Unknown
12	Filepath	filepath	Represents Logical Types that specify location...	string	{}	True	True	None
13	IPAddress	ip_address	Represents Logical Types that contain IP addre...	string	{}	True	True	Unknown
14	Integer	integer	Represents Logical Types that contain positive...	int64	{numeric}	True	True	IntegerNullable
15	IntegerNullable	integer_nullable	Represents Logical Types that contain positive...	Int64	{numeric}	True	True	None
16	LatLong	lat_long	Represents Logical Types that contain latitude...	object	{}	True	True	None
17	NaturalLanguage	natural_language	Represents Logical Types that contain text or ...	string	{}	True	True	None
18	Ordinal	ordinal	Represents Logical Types that contain ordered ...	category	{category}	True	True	Categorical
19	PersonFullName	person_full_name	Represents Logical Types that may contain firs...	string	{}	True	True	None
20	PhoneNumber	phone_number	Represents Logical Types that contain numeric ...	string	{}	True	True	Unknown
21	PostalCode	postal_code	Represents Logical Types that contain a series...	category	{category}	True	True	Categorical
22	SubRegionCode	sub_region_code	Represents Logical Types that use the ISO-3166...	category	{category}	True	True	Categorical
23	Timedelta	timedelta	Represents Logical Types that contain values s...	timedelta64[ns]	{}	True	True	Unknown
24	URL	url	Represents Logical Types that contain URLs, wh...	string	{}	True	True	Unknown
25	Unknown	unknown	Represents Logical Types that cannot be inferr...	string	{}	True	True	None

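If no logical types are provided, Featuretools performs type inference to assign logical types to the data in EntitySets. It is also possible to specify which logical type should be set for any column, provided that the data in that column is compatible with the logical type. To learn more about how logical types are used in EntitySets, see the Creating EntitySets guide. To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on working with Logical Types.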

Semantic Tags#

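Semantic tags give columns additional information about the meaning or potential uses of their data. A column can have many semantic tags or none. Some tags are added by Woodwork, some are added by Featuretools, and users can add further tags as they see fit. To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on working with Semantic Tags.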

Woodwork-Defined Semantic Tags#

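Woodwork adds certain semantic tags to columns at initialization. These can be standard tags, which are associated with different sets of logical types, or index tags. There are also tags that users can add to confer a suggested meaning on columns in Woodwork. To get a list of these tags, use the list_semantic_tags function.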

[2]:

ft.list_semantic_tags()

[2]:

	name	is_standard_tag	valid_logical_types
0	numeric	True	[Age, AgeFractional, AgeNullable, Double, Inte...
1	category	True	[Categorical, CountryCode, CurrencyCode, Ordin...
2	index	False	Any LogicalType
3	time_index	False	[Datetime, Age, AgeFractional, AgeNullable, Do...
4	date_of_birth	False	[Datetime]
5	ignore	False	Any LogicalType
6	passthrough	False	Any LogicalType

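Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools interprets data. One example is the Age primitive, which requires the date_of_birth semantic tag to be present on a column. Woodwork does not add the date_of_birth tag automatically, so for Featuretools to be able to use the Age primitive, the date_of_birth tag must be added manually to every column to which it applies.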

Featuretools-Defined Semantic Tags#

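Just as Woodwork specifies semantic tags internally, Featuretools defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.

- 'last_time_index' - added by Featuretools to the last time index column of a DataFrame. Indicates that the column was created by Featuretools.
- 'foreign_key' - indicates that the column is the child column of a relationship, meaning that it is related to a corresponding index column of another DataFrame in the EntitySet.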

Woodwork Throughout Featuretools#

Now that we have described the elements that make up Woodwork's type system, let's see them in action in Featuretools.

Woodwork in EntitySets#

For more information on building EntitySets using Woodwork, see the EntitySet guide. Let's look at the Woodwork typing information as it is stored in a demo EntitySet of retail data:

[3]:

es = ft.demo.load_retail()
es

[3]:

Entityset: demo_retail_data
  DataFrames:
    order_products [Rows: 401604, Columns: 8]
    products [Rows: 3684, Columns: 4]
    orders [Rows: 22190, Columns: 6]
    customers [Rows: 4372, Columns: 3]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_name -> customers.customer_name

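Woodwork typing information is not stored in the EntitySet object itself. It is stored in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet and then access the Woodwork information via the ww namespace: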

[4]:

df = es["products"]
df.head()

[4]:

	product_id	description	first_order_products_time	_ft_last_time
85123A	85123A	WHITE HANGING HEART T-LIGHT HOLDER	2010-12-01 08:26:00	2011-12-09 11:34:00
71053	71053	WHITE METAL LANTERN	2010-12-01 08:26:00	2011-12-07 14:12:00
84406B	84406B	CREAM CUPID HEARTS COAT HANGER	2010-12-01 08:26:00	2011-12-05 14:30:00
84029G	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	2010-12-01 08:26:00	2011-12-09 11:26:00
84029E	84029E	RED WOOLLY HOTTIE WHITE HEART.	2010-12-01 08:26:00	2011-12-09 09:07:00

[5]:

df.ww

[5]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
product_id	category	Categorical	['index']
description	string	NaturalLanguage	[]
first_order_products_time	datetime64[ns]	Datetime	['time_index']
_ft_last_time	datetime64[ns]	Datetime	['last_time_index']

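Notice that the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: by defining a physical type, a logical type, and semantic tags for each column in a DataFrame, we have defined the DataFrame's Woodwork schema, and with it we can understand the contents of each column.

This column-specific typing information, which exists for every column of every DataFrame in an EntitySet, is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.

Woodwork in DFS#

As the units of computation in Featuretools, Primitives need to be able to specify the input types they allow and to have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the Feature Primitives guide. Here, we look at how the Woodwork types come together in a ColumnSchema object to describe Primitive input and return types.

Below is the Woodwork ColumnSchema obtained from the 'product_id' column of the products DataFrame in the retail EntitySet.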

[6]:

products_df = es["products"]
product_ids_series = products_df.ww["product_id"]
column_schema = product_ids_series.ww.schema
column_schema

[6]:

<ColumnSchema (Logical Type = Categorical) (Semantic Tags = ['index'])>

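This combination of logical type and semantic tag typing information is a ColumnSchema. In the case above, the ColumnSchema describes the type definition for a single column of data.

Notice that there is no physical type in a ColumnSchema. This is because a ColumnSchema is a collection of Woodwork types that has no data tied to it and therefore has no physical representation. Because a ColumnSchema object is not tied to any data, it can also be used to describe a type space into which other columns may or may not fall.

This flexibility allows ColumnSchema objects to be used both as the type definition for every column in an EntitySet and as the input and return type spaces for every Primitive in Featuretools.

Let's look at a different column in a different DataFrame to see how this works: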

[7]:

order_products_df = es["order_products"]
order_products_df.head()

[7]:

	order_product_id	order_id	product_id	quantity	order_date	unit_price	total	_ft_last_time
0	0	536365	85123A	6	2010-12-01 08:26:00	4.2075	25.245	2010-12-01 08:26:00
1	1	536365	71053	6	2010-12-01 08:26:00	5.5935	33.561	2010-12-01 08:26:00
2	2	536365	84406B	8	2010-12-01 08:26:00	4.5375	36.300	2010-12-01 08:26:00
3	3	536365	84029G	6	2010-12-01 08:26:00	5.5935	33.561	2010-12-01 08:26:00
4	4	536365	84029E	6	2010-12-01 08:26:00	5.5935	33.561	2010-12-01 08:26:00

[8]:

quantity_series = order_products_df.ww["quantity"]
column_schema = quantity_series.ww.schema
column_schema

[8]:

<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>

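The ColumnSchema above was pulled from the 'quantity' column of the order_products DataFrame in the retail EntitySet. It is a type definition.

If we look at the Woodwork typing information for the order_products DataFrame, we can see that several columns have similar ColumnSchema type definitions. If we wanted to describe subsets of those columns, we could define several ColumnSchema type spaces.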

[9]:

es["order_products"].ww

[9]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	int64	Integer	['index']
order_id	category	Categorical	['category', 'foreign_key']
product_id	category	Categorical	['category', 'foreign_key']
quantity	int64	Integer	['numeric']
order_date	datetime64[ns]	Datetime	['time_index']
unit_price	float64	Double	['numeric']
total	float64	Double	['numeric']
_ft_last_time	datetime64[ns]	Datetime	['last_time_index']

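Below are several ColumnSchemas that all include our quantity column, but each describes a different type space. The ColumnSchemas become more restrictive as we go down.

Entire DataFrame#

No restrictions are placed; any column falls into this definition. This includes the whole DataFrame.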

[10]:

from woodwork.column_schema import ColumnSchema

ColumnSchema()

[10]:

<ColumnSchema>

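An example of a Primitive with this ColumnSchema as its input type is the IsNull transform primitive.

By Semantic Tag#

Only columns with the numeric tag apply. This can include Double, Integer, and Age logical type columns. It does not include the index column, which, despite containing integers, has had its standard tags replaced by the 'index' tag.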

[11]:

ColumnSchema(semantic_tags={"numeric"})

[11]:

<ColumnSchema (Semantic Tags = ['numeric'])>

[12]:

df = es["order_products"].ww.select(include="numeric")
df.ww

[12]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
quantity	int64	Integer	['numeric']
unit_price	float64	Double	['numeric']
total	float64	Double	['numeric']

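An example of a Primitive with this ColumnSchema as its input type is the Mean aggregation primitive.

By Logical Type#

Only columns with a logical type of Integer are included in this definition. The numeric tag is not required, so an index column (which has had its standard tags removed) still applies.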

[13]:

from woodwork.logical_types import Integer

ColumnSchema(logical_type=Integer)

[13]:

<ColumnSchema (Logical Type = Integer)>

[14]:

df = es["order_products"].ww.select(include="Integer")
df.ww

[14]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
order_product_id	int64	Integer	['index']
quantity	int64	Integer	['numeric']

By Logical Type and Semantic Tag#

The column must have the Integer logical type and the numeric semantic tag, which excludes index columns.

[15]:

ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})

[15]:

<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>

[16]:

df = es["order_products"].ww.select(include="numeric")
df = df.ww.select(include="Integer")
df.ww

[16]:

	Physical Type	Logical Type	Semantic Tag(s)
Column
quantity	int64	Integer	['numeric']
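
The four type spaces above can be summarized with a deliberately simplified model, sketched below in plain Python. The TypeSpace class and its contains method are hypothetical stand-ins written for this illustration; they are not Woodwork's ColumnSchema or the matching logic Featuretools actually uses, and they ignore details such as parent and child logical types.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TypeSpace:
    """Simplified stand-in for a ColumnSchema (not Woodwork's class)."""

    logical_type: Optional[str] = None
    semantic_tags: FrozenSet[str] = field(default_factory=frozenset)

    def contains(self, column: "TypeSpace") -> bool:
        # An unset logical type places no restriction on the column.
        type_ok = (
            self.logical_type is None
            or self.logical_type == column.logical_type
        )
        # Every tag required by the space must be present on the column.
        tags_ok = self.semantic_tags <= column.semantic_tags
        return type_ok and tags_ok


# Type definitions copied from the order_products typing table above.
columns = {
    "order_product_id": TypeSpace("Integer", frozenset({"index"})),
    "quantity": TypeSpace("Integer", frozenset({"numeric"})),
    "unit_price": TypeSpace("Double", frozenset({"numeric"})),
    "order_date": TypeSpace("Datetime", frozenset({"time_index"})),
}

# The four type spaces, from least to most restrictive.
spaces = {
    "entire DataFrame": TypeSpace(),
    "by semantic tag": TypeSpace(semantic_tags=frozenset({"numeric"})),
    "by logical type": TypeSpace(logical_type="Integer"),
    "by both": TypeSpace("Integer", frozenset({"numeric"})),
}

matches = {
    label: [name for name, col in columns.items() if space.contains(col)]
    for label, space in spaces.items()
}
for label, names in matches.items():
    print(f"{label}: {names}")
```

In this model a type space is just a set of constraints, so the same object works both as a column's exact type definition and as a filter over other columns.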

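In this way, a ColumnSchema can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid inputs for a Primitive when building Features during DFS.

Each Primitive has input_types and a return_type that are described by Woodwork ColumnSchemas. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the columns in each DataFrame that are valid for a Primitive's input_types. The result is a Feature with a column_schema property that gives the Feature's type definition, which lets DFS stack features on top of one another.

Featuretools is thus able to take the base unit of Woodwork typing information, the ColumnSchema, and use it together with an EntitySet of Woodwork DataFrames to build Features with Deep Feature Synthesis.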
