Feature Primitives#

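Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, primitives can be applied across datasets and stacked to create new calculations.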

Why Primitives?#

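The space of potential functions that humans use to create a feature is vast. By breaking common feature engineering calculations down into primitive components, we can capture the underlying structure of the features humans create today.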

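A primitive only constrains the input and output data types. This means primitives can be used to transfer calculations known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: the average time between events. This feature is very valuable for predicting fraudulent behavior or future customer engagement.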
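As a rough illustration of that feature, the plain-Python sketch below computes the mean time between events for a short, made-up list of timestamps. The functions `time_since_previous` and `mean` borrow the primitive names but are hypothetical stand-ins written for this example; they are not the Featuretools implementations.

```python
from datetime import datetime


def time_since_previous(timestamps):
    """Transform-style step: one output per input row (seconds since the previous event)."""
    gaps = [None]  # the first event has no predecessor
    for prev, curr in zip(timestamps, timestamps[1:]):
        gaps.append((curr - prev).total_seconds())
    return gaps


def mean(values):
    """Aggregation-style step: many inputs, one output (missing values are skipped)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Made-up event log for a single customer.
events = [
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 10, 0),
    datetime(2024, 1, 1, 12, 30, 0),
]

mean_gap = mean(time_since_previous(events))
print(mean_gap)  # 900.0 seconds: the gaps are 600 s and 1200 s
```

The first function keeps the row count unchanged and the second collapses the rows into one number, which is the same two-step shape that the stacked primitives express.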

DFS achieves the same feature by stacking two primitives, "time_since_previous" and "mean".

[1]:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)

feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],
    trans_primitives=["time_since_previous"],
    features_only=True,
)

feature_defs

[1]:
[<Feature: zip_code>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: TIME_SINCE_PREVIOUS(join_date)>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>]

Note

The primitive arguments to DFS (e.g., agg_primitives and trans_primitives in the example above) accept snake_case, camelCase, or TitleCase strings of included Featuretools primitives (i.e., time_since_previous, timeSincePrevious, and TimeSincePrevious are all acceptable inputs).
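One way to picture this is that all three spellings reduce to the same key. The sketch below uses a hypothetical helper, `to_snake_case`, written only for this illustration; it is an assumption about how such matching could work, not the code Featuretools runs.

```python
import re


def to_snake_case(name):
    """Hypothetical helper: convert camelCase or TitleCase to snake_case."""
    # Insert "_" at every boundary between a lowercase letter or digit
    # and an uppercase letter, then lowercase the whole string.
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_underscores.lower()


spellings = ["time_since_previous", "timeSincePrevious", "TimeSincePrevious"]
normalized = {to_snake_case(s) for s in spellings}
print(normalized)  # {'time_since_previous'}
```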

Note

When dfs is called with features_only=True, only feature definitions are returned as output. By default this parameter is set to False. This parameter is used to quickly inspect the feature definitions before spending time calculating the feature matrix.

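A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this to get several different ways of summarizing the time since the previous event.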

[2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "min", "std", "skew"],
    trans_primitives=["time_since_previous"],
)

feature_matrix[
    [
        "MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MAX(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "MIN(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "STD(sessions.TIME_SINCE_PREVIOUS(session_start))",
        "SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))",
    ]
]

[2]:
MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)) MAX(sessions.TIME_SINCE_PREVIOUS(session_start)) MIN(sessions.TIME_SINCE_PREVIOUS(session_start)) STD(sessions.TIME_SINCE_PREVIOUS(session_start)) SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))
customer_id
5 1007.500000 1170.0 715.0 157.884451 -1.507217
4 999.375000 1625.0 650.0 308.688904 1.065177
1 966.875000 1170.0 715.0 171.754341 -0.254557
3 888.333333 1170.0 650.0 177.613813 0.434581
2 725.833333 975.0 520.0 194.638554 0.162631
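The pattern behind this table can be sketched in plain Python: one transform output is passed through several aggregation functions, and each combination gets a name built from the primitives involved. The gap values and the `AGGREGATIONS` mapping below are made up for illustration (SKEW is left out because the standard library has no skewness function), and the code is not how Featuretools computes the feature matrix.

```python
import statistics

# Hypothetical mapping from aggregation primitive name to a plain-Python function.
AGGREGATIONS = {
    "MEAN": statistics.mean,
    "MAX": max,
    "MIN": min,
    "STD": statistics.stdev,  # sample standard deviation
}

# Made-up seconds between consecutive sessions for one customer.
gaps = [600.0, 900.0, 1200.0]

# Name of the transform feature that every aggregation is stacked on.
base = "sessions.TIME_SINCE_PREVIOUS(session_start)"
features = {f"{name}({base})": func(gaps) for name, func in AGGREGATIONS.items()}

for feature_name, value in features.items():
    print(feature_name, value)
```

Adding one more entry to the mapping yields one more feature without touching the rest of the code, which is the sense in which the enumeration is parameterized.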

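Aggregation vs. Transform Primitives#

In the example above, we used two types of primitives.

Aggregation primitives: These primitives take related instances as input and output a single value. They are applied across a parent-child relationship in an EntitySet. Examples: "count", "sum", "avg_time_between".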

[Figure: feature lineage graph for COUNT(sessions). The sessions dataframe (session_id (index), customer_id) is grouped by customer_id and passed to the COUNT aggregation, which produces COUNT(sessions) on the target dataframe customers.]

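Transform primitives: These primitives take one or more columns from a dataframe as input and output a new column for that dataframe. They are applied to a single dataframe. Examples: "hour", "time_since_previous", "absolute".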

[Figure: feature lineage graph for TIME_SINCE_PREVIOUS(join_date). The join_date column of the target dataframe customers is passed to the TIME_SINCE_PREVIOUS transform, which produces TIME_SINCE_PREVIOUS(join_date) on the same dataframe.]

The above graphs were generated using the graph_feature function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature.
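The difference between the two kinds of primitives can also be sketched without Featuretools. In the plain-Python example below, the rows and the helpers `count_by_parent` and `hour_of` are all made up for illustration: the aggregation returns one value per parent (customer), while the transform returns one value per input row.

```python
from collections import Counter
from datetime import datetime

# Made-up child rows: (session_id, customer_id, session_start).
sessions = [
    (1, "a", datetime(2024, 1, 1, 9, 15)),
    (2, "b", datetime(2024, 1, 1, 13, 40)),
    (3, "a", datetime(2024, 1, 2, 18, 5)),
]


def count_by_parent(rows):
    """Aggregation-style: group child rows by customer_id, one count per customer."""
    return dict(Counter(customer_id for _, customer_id, _ in rows))


def hour_of(rows):
    """Transform-style: return one value (the hour) for every input row."""
    return [start.hour for _, _, start in rows]


print(count_by_parent(sessions))  # {'a': 2, 'b': 1}
print(hour_of(sessions))          # [9, 13, 18]
```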

To get a DataFrame that lists and describes each built-in primitive in Featuretools, call ft.list_primitives().

[3]:
ft.list_primitives().head(5)

[3]:
   name       type         description                                                          valid_inputs                                   return_type
0  variance   aggregation  Calculates the variance of a list of numbers.                        <ColumnSchema (Semantic Tags = ['numeric'])>   <ColumnSchema (Logical Type = Double) (Semanti...
1  std        aggregation  Computes the dispersion relative to the mean value, ignoring `NaN`.  <ColumnSchema (Semantic Tags = ['numeric'])>   <ColumnSchema (Semantic Tags = ['numeric'])>
2  mode       aggregation  Determines the most commonly repeated value.                         <ColumnSchema (Semantic Tags = ['category'])>  None
3  min_count  aggregation  Determines the number of occurrences of the minimum value in a list  <ColumnSchema (Semantic Tags = ['numeric'])>   <ColumnSchema (Logical Type = IntegerNullable)...
4  is_unique  aggregation  Determines whether a series of discrete values is all unique.        <ColumnSchema (Semantic Tags = ['category'])>  <ColumnSchema (Logical Type = BooleanNullable)>

To get a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call ft.summarize_primitives().

[4]:
ft.summarize_primitives()

[4]:
Metric Count
0 total_primitives 225
1 aggregation_primitives 65
2 transform_primitives 160
3 unique_input_types 26
4 unique_output_types 24
5 uses_multi_input 50
6 uses_multi_output 4
7 uses_external_data 9
8 are_controllable 92
9 uses_address_input 0
10 uses_age_input 0
11 uses_age_fractional_input 0
12 uses_age_nullable_input 0
13 uses_boolean_input 18
14 uses_boolean_nullable_input 12
15 uses_categorical_input 0
16 uses_country_code_input 3
17 uses_currency_code_input 0
18 uses_datetime_input 68
19 uses_double_input 4
20 uses_email_address_input 2
21 uses_filepath_input 1
22 uses_ip_address_input 0
23 uses_integer_input 4
24 uses_integer_nullable_input 0
25 uses_lat_long_input 10
26 uses_natural_language_input 24
27 uses_ordinal_input 4
28 uses_person_full_name_input 3
29 uses_phone_number_input 2
30 uses_postal_code_input 5
31 uses_sub_region_code_input 3
32 uses_timedelta_input 0
33 uses_url_input 3
34 uses_unknown_input 0
35 uses_numeric_tag_input 87
36 uses_category_tag_input 11
37 uses_index_tag_input 1
38 uses_time_index_tag_input 29
39 uses_date_of_birth_tag_input 1
40 uses_ignore_tag_input 0
41 uses_passthrough_tag_input 0
42 uses_foreign_key_tag_input 1

Defining Custom Primitives#

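The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:

  • Specify the type of primitive: Aggregation or Transform

  • Define the input and output data types

  • Write a Python function to do the calculation

  • Annotate it with attributes to constrain how it is applied

Once a primitive is defined, it can be stacked with existing primitives to generate complex patterns. This allows primitives known to be important in one domain to be transferred automatically to another.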

[5]:
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, NaturalLanguage

from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset

Simple Custom Primitives#

[6]:
class Absolute(TransformPrimitive):
    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute(column):
            return abs(column)

        return absolute

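Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new class from TransformPrimitive and overriding get_function to return a function that calculates the feature. We also set the input data types the primitive applies to and its return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide to Woodwork logical types and semantic tags can be found in Woodwork's Understanding Logical Types and Semantic Tags guide.

Similarly, we can create a new aggregation primitive using AggregationPrimitive.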

[7]:
class Maximum(AggregationPrimitive):
    name = "maximum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def maximum(column):
            return max(column)

        return maximum

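Because we defined an aggregation primitive, the function takes in a list of values but returns only one.

Now that we have defined two primitives, we can use them with the dfs function as if they were built-in primitives.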

[8]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[Maximum],
    trans_primitives=[Absolute],
    max_depth=2,
)

feature_matrix.head(5)[
    [
        "customers.MAXIMUM(transactions.amount)",
        "MAXIMUM(transactions.ABSOLUTE(amount))",
    ]
]

[8]:
customers.MAXIMUM(transactions.amount) MAXIMUM(transactions.ABSOLUTE(amount))
session_id
1 146.81 141.66
2 149.02 135.25
3 149.95 147.73
4 139.43 129.00
5 149.95 139.20

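Word Count Example#

Here we define a transform primitive, WordCount, which counts the number of words in each row of an input and returns a list of the counts.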

[9]:
class WordCount(TransformPrimitive):
    """
    Counts the number of words in each row of the column. Returns a list of the counts for each row.
    """

    name = "word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def word_count(column):
            word_counts = []
            for value in column:
                words = value.split(None)
                word_counts.append(len(words))
            return word_counts

        return word_count
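The inner function can be tried on its own before handing the primitive to DFS. The sketch below repeats the same logic in plain Python on made-up strings; it shows that `split(None)` splits on any run of whitespace, so repeated spaces do not inflate the count.

```python
def word_count(column):
    """Same logic as the inner function above: one word count per row."""
    word_counts = []
    for value in column:
        words = value.split(None)  # split on any run of whitespace
        word_counts.append(len(words))
    return word_counts


# Made-up input rows, including irregular spacing and an empty string.
comments = ["hello world", "  spaced   out  text ", ""]
print(word_count(comments))  # [2, 3, 0]
```

Note that a missing value such as `None` has no `split` method, so this function would raise an error on it; data with gaps would need a guard before the split.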

[10]:
es = make_ecommerce_entityset()

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["sum", "mean", "std"],
    trans_primitives=[WordCount],
)

feature_matrix[
    [
        "customers.WORD_COUNT(favorite_quote)",
        "STD(log.WORD_COUNT(comments))",
        "SUM(log.WORD_COUNT(comments))",
        "MEAN(log.WORD_COUNT(comments))",
    ]
]

[10]:
customers.WORD_COUNT(favorite_quote) STD(log.WORD_COUNT(comments)) SUM(log.WORD_COUNT(comments)) MEAN(log.WORD_COUNT(comments))
id
0 9.0 540.436860 2500.0 500.0
1 9.0 583.702550 1732.0 433.0
2 9.0 NaN 246.0 246.0
3 6.0 883.883476 1256.0 628.0
4 6.0 0.000000 9.0 3.0
5 12.0 19.798990 68.0 34.0

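By adding some aggregation primitives as well, Deep Feature Synthesis was able to generate four new features from one new primitive.

Multiple Input Types#

If a primitive requires multiple features as input, input_types has multiple elements. For example, [ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})] means the primitive requires two columns with the semantic tag numeric as input. Below is an example of a primitive that takes multiple input features.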

[11]:
class MeanSunday(AggregationPrimitive):
    """
    Finds the mean of the non-null values of a feature that occurred on Sundays
    """

    name = "mean_sunday"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def mean_sunday(numeric, datetime):
            days = pd.DatetimeIndex(datetime).weekday.values
            df = pd.DataFrame({"numeric": numeric, "time": days})
            return df[df["time"] == 6]["numeric"].mean()

        return mean_sunday
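The comparison with 6 relies on the convention, shared by pandas and the standard library's `datetime.weekday()`, that Monday is 0 and Sunday is 6. The sketch below restates the same calculation with the standard library only; the dates and values are made up, and `mean_sunday_plain` is a hypothetical name used for this illustration.

```python
from datetime import datetime


def mean_sunday_plain(values, timestamps):
    """Average the non-null values whose timestamp falls on a Sunday (weekday() == 6)."""
    sunday_values = [
        v for v, t in zip(values, timestamps) if v is not None and t.weekday() == 6
    ]
    if not sunday_values:
        # No Sunday rows: return NaN, as pandas does for the mean of an empty selection.
        return float("nan")
    return sum(sunday_values) / len(sunday_values)


timestamps = [
    datetime(2024, 1, 7),   # Sunday
    datetime(2024, 1, 8),   # Monday
    datetime(2024, 1, 14),  # Sunday
]
values = [2.0, 5.0, 4.0]

print(mean_sunday_plain(values, timestamps))  # 3.0
```

Both inputs are consumed row by row in the same order, which is why the primitive declares two entries in input_types: one numeric column and one Datetime column.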

[12]:
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=[MeanSunday],
    trans_primitives=[],
    max_depth=1,
)

feature_matrix[
    [
        "MEAN_SUNDAY(log.value, datetime)",
        "MEAN_SUNDAY(log.value_2, datetime)",
    ]
]

[12]:
MEAN_SUNDAY(log.value, datetime) MEAN_SUNDAY(log.value_2, datetime)
id
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 2.5 1.0
4 7.0 3.0
5 NaN NaN