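Feature primitives
Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, primitives can be applied across many datasets and can be stacked to create new calculations.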
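Why use primitives?
The space of potential functions that humans use to create features is vast. By breaking common feature engineering calculations down into primitive components, we can capture the underlying structure of the features humans create today.
A primitive only constrains the input and output data types. This means primitives can be used to transfer calculations known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: the average time between events. This feature is very valuable for predicting fraudulent behavior or future customer engagement.
DFS achieves the same feature by stacking two primitives, "time_since_previous" and "mean".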
[1]:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],
    trans_primitives=["time_since_previous"],
    features_only=True,
)
feature_defs
[1]:
[<Feature: zip_code>,
<Feature: MEAN(transactions.amount)>,
<Feature: TIME_SINCE_PREVIOUS(join_date)>,
<Feature: MEAN(sessions.MEAN(transactions.amount))>,
<Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>]
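To make the last feature in that list concrete, the following is a minimal standard-library sketch of what stacking "time_since_previous" under "mean" computes. It is not how Featuretools implements it; the toy session log and the helper names `time_since_previous` and `mean_by_parent` are hypothetical, and the sketch assumes the time difference is taken over the whole time-ordered sessions column before the values are grouped by customer.

```python
from datetime import datetime
from statistics import mean

# Hypothetical toy session log: (customer_id, session_start), sorted by time.
sessions = [
    (1, datetime(2014, 1, 1, 0, 0)),
    (2, datetime(2014, 1, 1, 0, 10)),
    (1, datetime(2014, 1, 1, 0, 25)),
    (2, datetime(2014, 1, 1, 0, 45)),
    (1, datetime(2014, 1, 1, 1, 5)),
]


def time_since_previous(timestamps):
    """Transform step: seconds since the previous row (None for the first row)."""
    gaps = [None]
    for previous, current in zip(timestamps, timestamps[1:]):
        gaps.append((current - previous).total_seconds())
    return gaps


def mean_by_parent(parent_ids, values):
    """Aggregation step: mean of the non-missing child values per parent."""
    groups = {}
    for parent_id, value in zip(parent_ids, values):
        if value is not None:
            groups.setdefault(parent_id, []).append(value)
    return {parent_id: mean(vals) for parent_id, vals in groups.items()}


customer_ids = [customer_id for customer_id, _ in sessions]
gaps = time_since_previous([start for _, start in sessions])
mean_gap = mean_by_parent(customer_ids, gaps)
print(mean_gap)  # {1: 1050.0, 2: 900.0}
```

The transform produces one value per session row, and the aggregation collapses those rows to one value per customer, which is the shape of MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)).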
Note
The primitive arguments to DFS (e.g. agg_primitives and trans_primitives in the example above) accept snake_case, camelCase, or TitleCase strings of included Featuretools primitives (i.e. time_since_previous, timeSincePrevious, and TimeSincePrevious are all acceptable inputs).
Note
When dfs is called with features_only=True, only feature definitions are returned as output. By default this parameter is set to False. This parameter is used to quickly inspect the feature definitions before spending time calculating the feature matrix.
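A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this to get several different ways of summarizing the time since the previous event.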
[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)
feature_matrix[
    [
        "MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MAX(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MIN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "STD(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))",
    ]
]
[2]:
customer_id | MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)) | MAX(sessions.TIME_SINCE_PREVIOUS(session_start)) | MIN(sessions.TIME_SINCE_PREVIOUS(session_start)) | STD(sessions.TIME_SINCE_PREVIOUS(session_start)) | SKEW(sessions.TIME_SINCE_PREVIOUS(session_start)) |
---|---|---|---|---|---|
5 | 1007.500000 | 1170.0 | 715.0 | 157.884451 | -1.507217 |
4 | 999.375000 | 1625.0 | 650.0 | 308.688904 | 1.065177 |
1 | 966.875000 | 1170.0 | 715.0 | 171.754341 | -0.254557 |
3 | 888.333333 | 1170.0 | 650.0 | 177.613813 | 0.434581 |
2 | 725.833333 | 975.0 | 520.0 | 194.638554 | 0.162631 |
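The five column labels selected above differ only in the aggregation name, so they can be generated instead of typed out. A small sketch, assuming the naming pattern visible in the output (upper-cased primitive name wrapped around the child feature) holds for these primitives; `base` and `columns` are illustrative names:

```python
# Aggregation primitives passed to dfs in the cell above.
agg_primitives = ["mean", "max", "min", "std", "skew"]

# Child feature that each aggregation is stacked on.
base = "sessions.TIME_SINCE_PREVIOUS(session_start)"

# Build one label per aggregation, e.g. "MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))".
columns = [f"{name.upper()}({base})" for name in agg_primitives]

for label in columns:
    print(label)
```

The resulting list could then be used to index the feature matrix, as in feature_matrix[columns].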
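Aggregation vs Transform Primitive
In the example above, we use two types of primitives.
Aggregation primitives: These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g. "count", "sum", "avg_time_between".
Transform primitives: These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. They are applied to a single dataframe. E.g. "hour", "time_since_previous", "absolute".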
The above graphs were generated using the graph_feature function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature.
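The difference between the two primitive types comes down to output shape. The following standard-library sketch, using hypothetical toy rows rather than Featuretools itself, mimics an "hour" transform (one output per input row) and a "count" aggregation (one output per parent):

```python
from collections import Counter
from datetime import datetime

# Hypothetical child rows: (parent_id, timestamp).
rows = [
    ("a", datetime(2014, 1, 1, 9, 30)),
    ("a", datetime(2014, 1, 1, 17, 5)),
    ("b", datetime(2014, 1, 2, 23, 59)),
]

# Transform-style computation: one output value per input row.
hours = [timestamp.hour for _, timestamp in rows]

# Aggregation-style computation: one output value per parent.
counts = dict(Counter(parent_id for parent_id, _ in rows))

print(hours)   # [9, 17, 23]
print(counts)  # {'a': 2, 'b': 1}
```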
To get a DataFrame that lists and describes each built-in primitive in Featuretools, call ft.list_primitives().
[3]:
ft.list_primitives().head(5)
[3]:
 | name | type | description | valid_inputs | return_type |
---|---|---|---|---|---|
0 | variance | aggregation | Calculates the variance of a list of numbers. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = Double) (Semanti... |
1 | std | aggregation | Computes the dispersion relative to the mean value, ignoring `NaN`. | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Semantic Tags = ['numeric'])> |
2 | mode | aggregation | Determines the most commonly repeated value. | <ColumnSchema (Semantic Tags = ['category'])> | None |
3 | min_count | aggregation | Calculates the number of occurrences of the min value in a list | <ColumnSchema (Semantic Tags = ['numeric'])> | <ColumnSchema (Logical Type = IntegerNullable)... |
4 | is_unique | aggregation | Determines whether or not a series of discrete values is all unique. | <ColumnSchema (Semantic Tags = ['category'])> | <ColumnSchema (Logical Type = BooleanNullable)> |
To get a DataFrame of metrics that summarizes various properties and capabilities of all the built-in primitives in Featuretools, call ft.summarize_primitives().
[4]:
ft.summarize_primitives()
[4]:
 | Metric | Count |
---|---|---|
0 | total_primitives | 225 |
1 | aggregation_primitives | 65 |
2 | transform_primitives | 160 |
3 | unique_input_types | 26 |
4 | unique_output_types | 24 |
5 | uses_multi_input | 50 |
6 | uses_multi_output | 4 |
7 | uses_external_data | 9 |
8 | are_controllable | 92 |
9 | uses_address_input | 0 |
10 | uses_age_input | 0 |
11 | uses_age_fractional_input | 0 |
12 | uses_age_nullable_input | 0 |
13 | uses_boolean_input | 18 |
14 | uses_boolean_nullable_input | 12 |
15 | uses_categorical_input | 0 |
16 | uses_country_code_input | 3 |
17 | uses_currency_code_input | 0 |
18 | uses_datetime_input | 68 |
19 | uses_double_input | 4 |
20 | uses_email_address_input | 2 |
21 | uses_filepath_input | 1 |
22 | uses_ip_address_input | 0 |
23 | uses_integer_input | 4 |
24 | uses_integer_nullable_input | 0 |
25 | uses_lat_long_input | 10 |
26 | uses_natural_language_input | 24 |
27 | uses_ordinal_input | 4 |
28 | uses_person_full_name_input | 3 |
29 | uses_phone_number_input | 2 |
30 | uses_postal_code_input | 5 |
31 | uses_sub_region_code_input | 3 |
32 | uses_timedelta_input | 0 |
33 | uses_url_input | 3 |
34 | uses_unknown_input | 0 |
35 | uses_numeric_tag_input | 87 |
36 | uses_category_tag_input | 11 |
37 | uses_index_tag_input | 1 |
38 | uses_time_index_tag_input | 29 |
39 | uses_date_of_birth_tag_input | 1 |
40 | uses_ignore_tag_input | 0 |
41 | uses_passthrough_tag_input | 0 |
42 | uses_foreign_key_tag_input | 1 |
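Defining Custom Primitives
The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:
Specify the type of primitive: Aggregation or Transform
Define the input and return data types
Write a function in Python to do the calculation
Annotate with attributes to constrain how it is applied
Once a primitive is defined, it can be stacked with existing primitives to generate complex patterns. This enables primitives known to be important in one domain to be transferred automatically to another.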
[5]:
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset
Simple Custom Primitives
[6]:
class Absolute(TransformPrimitive):
    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute(column):
            return abs(column)

        return absolute
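Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using TransformPrimitive as a base and overriding get_function to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork Understanding Logical Types and Semantic Tags guide.
Similarly, we can make a new aggregation primitive using AggregationPrimitive.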
[7]:
class Maximum(AggregationPrimitive):
    name = "maximum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def maximum(column):
            return max(column)

        return maximum
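Because we defined an aggregation primitive, the function takes in a list of values but returns only one.
Now that we have defined two primitives, we can use them with the dfs function as if they were built-in primitives.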
[8]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[Maximum],
    trans_primitives=[Absolute],
    max_depth=2,
)
feature_matrix.head(5)[
    [
        "customers.MAXIMUM(transactions.amount)",
        "MAXIMUM(transactions.ABSOLUTE(amount))",
    ]
]
[8]:
session_id | customers.MAXIMUM(transactions.amount) | MAXIMUM(transactions.ABSOLUTE(amount)) |
---|---|---|
1 | 146.81 | 141.66 |
2 | 149.02 | 135.25 |
3 | 149.95 | 147.73 |
4 | 139.43 | 129.00 |
5 | 149.95 | 139.20 |
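The two columns differ because they aggregate over different sets of rows. Reading the feature names, the first appears to be the maximum over all of a customer's transactions, carried down to each of that customer's sessions, while the second is the maximum absolute amount within the session itself. A plain-Python sketch of that reading, with hypothetical toy data and variable names (this is not Featuretools code):

```python
# Hypothetical toy data: session -> customer, and (session_id, amount) transactions.
session_customer = {1: "c1", 2: "c1", 3: "c2"}
transactions = [(1, 20.0), (1, 141.66), (2, 135.25), (2, 5.0), (3, 147.73)]

# MAXIMUM(transactions.ABSOLUTE(amount)): transform each amount, then aggregate per session.
max_abs_per_session = {}
for session_id, amount in transactions:
    current = max_abs_per_session.get(session_id, 0.0)
    max_abs_per_session[session_id] = max(current, abs(amount))

# customers.MAXIMUM(transactions.amount): aggregate per customer first...
max_per_customer = {}
for session_id, amount in transactions:
    customer_id = session_customer[session_id]
    current = max_per_customer.get(customer_id, float("-inf"))
    max_per_customer[customer_id] = max(current, amount)

# ...then copy the customer-level value down to every session of that customer.
customer_max_per_session = {
    session_id: max_per_customer[customer_id]
    for session_id, customer_id in session_customer.items()
}

print(max_abs_per_session)       # {1: 141.66, 2: 135.25, 3: 147.73}
print(customer_max_per_session)  # {1: 141.66, 2: 141.66, 3: 147.73}
```

Sessions 1 and 2 share a customer, so they share the customer-level maximum even though their own session-level maxima differ.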
Word Count Example
Here we define a transform primitive, WordCount, which counts the number of words in each row of an input and returns a list of the counts.
[9]:
class WordCount(TransformPrimitive):
    """
    Counts the number of words in each row of the column. Returns a list
    of the counts for each row.
    """

    name = "word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def word_count(column):
            word_counts = []
            for value in column:
                # split(None) splits on any run of whitespace
                words = value.split(None)
                word_counts.append(len(words))
            return word_counts

        return word_count
[10]:
es = make_ecommerce_entityset()
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[WordCount],
)
feature_matrix[
    [
        "customers.WORD_COUNT(favorite_quote)",
        "STD(log.WORD_COUNT(comments))",
        "SUM(log.WORD_COUNT(comments))",
        "MEAN(log.WORD_COUNT(comments))",
    ]
]
[10]:
id | customers.WORD_COUNT(favorite_quote) | STD(log.WORD_COUNT(comments)) | SUM(log.WORD_COUNT(comments)) | MEAN(log.WORD_COUNT(comments)) |
---|---|---|---|---|
0 | 9.0 | 540.436860 | 2500.0 | 500.0 |
1 | 9.0 | 583.702550 | 1732.0 | 433.0 |
2 | 9.0 | NaN | 246.0 | 246.0 |
3 | 6.0 | 883.883476 | 1256.0 | 628.0 |
4 | 6.0 | 0.000000 | 9.0 | 3.0 |
5 | 12.0 | 19.798990 | 68.0 | 34.0 |
By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.
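The inner word_count function of the WordCount primitive calls .split on every value, so a missing value (None or a float NaN) would raise AttributeError. The sketch below is a standalone, hedged variant that passes missing values through as None; that handling is an assumption made for illustration, not behavior taken from Featuretools.

```python
def word_count(column):
    """Count whitespace-separated words per row; non-string rows yield None."""
    word_counts = []
    for value in column:
        if isinstance(value, str):
            # split() with no argument splits on any run of whitespace
            word_counts.append(len(value.split()))
        else:
            # Assumption for this sketch: propagate missing values instead of failing.
            word_counts.append(None)
    return word_counts


comments = ["hello world", "  one   two three ", "", None]
print(word_count(comments))  # [2, 3, 0, None]
```

Trying the function on a plain list like this is a quick way to check a primitive's logic before wiring it into get_function.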
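Multiple Input Types
If a primitive requires multiple features as input, input_types has multiple elements. For example, [ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})] means the primitive requires two columns with the semantic tag numeric as input. Below is an example of a primitive that has multiple input features.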
[11]:
class MeanSunday(AggregationPrimitive):
    """
    Finds the mean of non-null values of a feature that occurred on Sundays
    """

    name = "mean_sunday"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def mean_sunday(numeric, datetime):
            # pandas numbers weekdays from Monday=0, so Sunday is 6
            days = pd.DatetimeIndex(datetime).weekday.values
            df = pd.DataFrame({"numeric": numeric, "time": days})
            return df[df["time"] == 6]["numeric"].mean()

        return mean_sunday
[12]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[MeanSunday],
    trans_primitives=[],
    max_depth=1,
)
feature_matrix[
    [
        "MEAN_SUNDAY(log.value, datetime)",
        "MEAN_SUNDAY(log.value_2, datetime)",
    ]
]
[12]:
id | MEAN_SUNDAY(log.value, datetime) | MEAN_SUNDAY(log.value_2, datetime) |
---|---|---|
0 | NaN | NaN |
1 | NaN | NaN |
2 | NaN | NaN |
3 | 2.5 | 1.0 |
4 | 7.0 | 3.0 |
5 | NaN | NaN |
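The NaN rows are sessions with no log entries that fall on a Sunday, since the mean of an empty selection is NaN. The logic of mean_sunday can be checked in isolation with a standard-library sketch; the sample values and dates below are hypothetical, and the sketch relies on datetime.weekday() using the same Monday=0 to Sunday=6 numbering as pandas.

```python
import math
from datetime import datetime


def mean_sunday(values, timestamps):
    """Mean of the non-missing values whose timestamp falls on a Sunday."""
    sunday_values = [
        value
        for value, timestamp in zip(values, timestamps)
        if value is not None and timestamp.weekday() == 6  # 6 is Sunday
    ]
    if not sunday_values:
        # Mirror the NaN that an empty selection produces in the table above.
        return float("nan")
    return sum(sunday_values) / len(sunday_values)


# 2014-01-05 and 2014-01-12 are Sundays; 2014-01-06 is a Monday.
values = [1.0, 4.0, 10.0]
timestamps = [datetime(2014, 1, 5), datetime(2014, 1, 12), datetime(2014, 1, 6)]

print(mean_sunday(values, timestamps))  # 2.5
print(math.isnan(mean_sunday([10.0], [datetime(2014, 1, 6)])))  # True
```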