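Specifying Primitive Options#
By default, DFS applies primitives across all dataframes and columns. This behavior can be altered through a few different parameters. Dataframes and columns can be ignored or included either for an entire DFS run or on a per-primitive basis, which gives finer control over the generated features and reduces run time overhead.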
[1]:
import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()

features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    features_only=True,
)
features_list
[1]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: MODE(sessions.device_name)>,
<Feature: MODE(sessions.device_type)>,
<Feature: MODE(log.countrycode)>,
<Feature: MODE(log.priority_level)>,
<Feature: MODE(log.product_id)>,
<Feature: MODE(log.subregioncode)>,
<Feature: MODE(log.zipcode)>,
<Feature: WEEKDAY(birthday)>,
<Feature: WEEKDAY(cancel_date)>,
<Feature: WEEKDAY(signup_date)>,
<Feature: WEEKDAY(upgrade_date)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: MODE(sessions.MODE(log.countrycode))>,
<Feature: MODE(sessions.MODE(log.priority_level))>,
<Feature: MODE(sessions.MODE(log.product_id))>,
<Feature: MODE(sessions.MODE(log.subregioncode))>,
<Feature: MODE(sessions.MODE(log.zipcode))>,
<Feature: MODE(log.sessions.device_name)>,
<Feature: MODE(log.sessions.device_type)>,
<Feature: cohorts.MODE(customers.cancel_reason)>,
<Feature: cohorts.MODE(customers.engagement_level)>,
<Feature: cohorts.MODE(customers.région_id)>,
<Feature: cohorts.MODE(sessions.device_name)>,
<Feature: cohorts.MODE(sessions.device_type)>,
<Feature: cohorts.MODE(log.countrycode)>,
<Feature: cohorts.MODE(log.priority_level)>,
<Feature: cohorts.MODE(log.product_id)>,
<Feature: cohorts.MODE(log.subregioncode)>,
<Feature: cohorts.MODE(log.zipcode)>,
<Feature: cohorts.WEEKDAY(cohort_end)>,
<Feature: régions.MODE(customers.cancel_reason)>,
<Feature: régions.MODE(customers.engagement_level)>,
<Feature: régions.MODE(sessions.device_name)>,
<Feature: régions.MODE(sessions.device_type)>,
<Feature: régions.MODE(log.countrycode)>,
<Feature: régions.MODE(log.priority_level)>,
<Feature: régions.MODE(log.product_id)>,
<Feature: régions.MODE(log.subregioncode)>,
<Feature: régions.MODE(log.zipcode)>]
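Specifying Options for an Entire Run#
The ignore_dataframes and ignore_columns parameters of DFS control which dataframes and columns are ignored for all primitives. This is useful for ignoring columns or dataframes that are unrelated to the problem or otherwise should not be included in the DFS run.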
[2]:
# ignore the 'log' and 'cohorts' dataframes entirely
# ignore the 'birthday' column in 'customers' and the 'device_name' column in 'sessions'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    ignore_dataframes=["log", "cohorts"],
    ignore_columns={"sessions": ["device_name"], "customers": ["birthday"]},
    features_only=True,
)
features_list
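DFS completely ignores the log and cohorts dataframes when creating features. It also ignores the device_name column in sessions and the birthday column in customers. However, both of these options can be overridden by individual primitive options in the primitive_options parameter.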
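Specifying Options for Individual Primitives#
Options for individual primitives or groups of primitives are set through the primitive_options parameter of DFS. This parameter maps any desired options to specific primitives. When options conflict, options set at this level override options set for the entire DFS run, and include options always take priority over their ignore counterparts. Using the string primitive name or the primitive type as the key applies the options to all primitives of that name. You can also set options for a specific instance of a primitive by using the primitive instance as a key in the primitive_options dictionary. Note, however, that specifying options for a specific instance causes that instance to ignore any options set for the generic primitive under the primitive name or class key.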
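The lookup rule described above (an entry keyed by a specific instance replaces the entry keyed by name for that instance) can be sketched with a toy model. This is not Featuretools' implementation: ToyPrimitive and options_for are hypothetical names introduced only for illustration, and class keys are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPrimitive:
    """Hypothetical stand-in for a primitive instance (not a Featuretools class)."""

    name: str
    skipna: bool = True


def options_for(primitive, primitive_options):
    """Return the options that apply to one primitive instance.

    An entry keyed by the instance itself is used as-is. Only when no such
    entry exists does the lookup fall back to the entry keyed by name.
    """
    if primitive in primitive_options:
        return primitive_options[primitive]
    return primitive_options.get(primitive.name, {})


default_mean = ToyPrimitive("mean")
strict_mean = ToyPrimitive("mean", skipna=False)

primitive_options = {
    # applies to every "mean" primitive without an instance-specific entry
    "mean": {"ignore_dataframes": ["log"]},
    # applies only to this instance, which then ignores the "mean" entry
    strict_mean: {"include_dataframes": ["sessions"]},
}

print(options_for(default_mean, primitive_options))
print(options_for(strict_mean, primitive_options))
```

In this sketch the instance-specific entry is not merged with the name-keyed entry; it replaces it, which mirrors the caveat in the paragraph above.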
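Specifying Dataframes for Individual Primitives#
Which dataframes to include or ignore can be specified for a single primitive or a group of primitives. Dataframes are ignored with the ignore_dataframes option in primitive_options, and explicitly included with the include_dataframes option. When include_dataframes is given, every dataframe not listed is ignored by that primitive. No columns from an excluded dataframe are used to generate features with the given primitive.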
[3]:
# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'
# include only the 'customers' dataframe for the primitives 'weekday' and 'day'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday", "day"],
    primitive_options={
        "mode": {"ignore_dataframes": ["cohorts", "log"]},
        ("weekday", "day"): {"include_dataframes": ["customers"]},
    },
    features_only=True,
)
features_list
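In this example, DFS uses only the customers dataframe for both weekday and day, and uses every dataframe except cohorts and log for mode.
Specifying Columns for Individual Primitives#
The columns to include or ignore can also be specified explicitly for a primitive or group of primitives. Columns to ignore are set with the ignore_columns option, and columns to include are set with include_columns. When the include_columns option is set, no other columns from that dataframe are used to generate features with the given primitive.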
[4]:
# include the columns 'product_id' and 'zipcode', 'device_type', and 'cancel_reason' for 'mode'
# ignore the columns 'signup_date' and 'cancel_date' for 'weekday'
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mode"],
    trans_primitives=["weekday"],
    primitive_options={
        "mode": {
            "include_columns": {
                "log": ["product_id", "zipcode"],
                "sessions": ["device_type"],
                "customers": ["cancel_reason"],
            }
        },
        "weekday": {"ignore_columns": {"customers": ["signup_date", "cancel_date"]}},
    },
    features_only=True,
)
features_list
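Here, mode uses only the columns product_id and zipcode from the log dataframe, device_type from the sessions dataframe, and cancel_reason from the customers dataframe. For any other dataframe, mode uses all columns. The weekday primitive uses all columns in all dataframes except signup_date and cancel_date from the customers dataframe.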
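The column rules above can be sketched as a small filter. This is a toy model rather than Featuretools' own logic: usable_columns is a hypothetical helper, the schema dictionary is a hand-written subset of the example columns, and type compatibility (for instance, weekday only applying to datetime columns) is deliberately left out.

```python
def usable_columns(columns_by_dataframe, include_columns=None, ignore_columns=None):
    """Return the columns one primitive may use, per dataframe (toy model).

    A dataframe listed in include_columns contributes only the listed columns.
    Any other dataframe contributes every column not listed in ignore_columns.
    """
    include_columns = include_columns or {}
    ignore_columns = ignore_columns or {}
    usable = {}
    for dataframe, columns in columns_by_dataframe.items():
        if dataframe in include_columns:
            # include takes priority over ignore for this dataframe
            allowed = set(include_columns[dataframe])
            usable[dataframe] = [c for c in columns if c in allowed]
        else:
            ignored = set(ignore_columns.get(dataframe, []))
            usable[dataframe] = [c for c in columns if c not in ignored]
    return usable


# Hand-written subset of the example schema (an assumption for illustration).
schema = {
    "customers": ["cancel_reason", "engagement_level", "birthday",
                  "signup_date", "cancel_date", "upgrade_date"],
    "sessions": ["device_type", "device_name"],
    "log": ["product_id", "zipcode", "countrycode"],
    "cohorts": ["cohort_name", "cohort_end"],
}

mode_columns = usable_columns(
    schema,
    include_columns={
        "log": ["product_id", "zipcode"],
        "sessions": ["device_type"],
        "customers": ["cancel_reason"],
    },
)
weekday_columns = usable_columns(
    schema,
    ignore_columns={"customers": ["signup_date", "cancel_date"]},
)

print(mode_columns)
print(weekday_columns["customers"])
```

Note that cohorts is not mentioned in either option, so in this sketch both primitives keep all of its columns, matching the "any other dataframe" rule above.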
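Specifying GroupBy Options#
GroupBy Transform Primitives have the additional options include_groupby_dataframes, ignore_groupby_dataframes, include_groupby_columns, and ignore_groupby_columns. These options specify which dataframes and columns to include or ignore as groupings for inputs. By default, DFS groups only by foreign key columns. Specifying include_groupby_columns overrides this default, so that grouping uses only the given columns. ignore_groupby_columns, on the other hand, keeps grouping by foreign key columns only, but skips any of the specified columns that are foreign keys. Note that any non-foreign key column included for grouping must be a categorical column.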
[5]:
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="log",
    agg_primitives=[],
    trans_primitives=[],
    groupby_trans_primitives=["cum_sum", "cum_count"],
    primitive_options={
        "cum_sum": {"ignore_groupby_columns": {"log": ["product_id"]}},
        "cum_count": {
            "include_groupby_columns": {"log": ["product_id", "priority_level"]},
            "ignore_groupby_dataframes": ["sessions"],
        },
    },
    features_only=True,
)
features_list
[5]:
[<Feature: session_id>,
<Feature: product_id>,
<Feature: value>,
<Feature: value_2>,
<Feature: zipcode>,
<Feature: countrycode>,
<Feature: subregioncode>,
<Feature: value_many_nans>,
<Feature: priority_level>,
<Feature: purchased>,
<Feature: CUM_COUNT(countrycode) by priority_level>,
<Feature: CUM_COUNT(countrycode) by product_id>,
<Feature: CUM_COUNT(priority_level) by priority_level>,
<Feature: CUM_COUNT(priority_level) by product_id>,
<Feature: CUM_COUNT(product_id) by priority_level>,
<Feature: CUM_COUNT(product_id) by product_id>,
<Feature: CUM_COUNT(subregioncode) by priority_level>,
<Feature: CUM_COUNT(subregioncode) by product_id>,
<Feature: CUM_COUNT(zipcode) by priority_level>,
<Feature: CUM_COUNT(zipcode) by product_id>,
<Feature: CUM_SUM(value) by session_id>,
<Feature: CUM_SUM(value_2) by session_id>,
<Feature: CUM_SUM(value_many_nans) by session_id>,
<Feature: sessions.customer_id>,
<Feature: sessions.device_type>,
<Feature: sessions.device_name>,
<Feature: products.department>,
<Feature: products.rating>,
<Feature: sessions.customers.age>,
<Feature: sessions.customers.région_id>,
<Feature: sessions.customers.cohort>,
<Feature: sessions.customers.loves_ice_cream>,
<Feature: sessions.customers.cancel_reason>,
<Feature: sessions.customers.engagement_level>,
<Feature: CUM_COUNT(countrycode) by products.department>,
<Feature: CUM_COUNT(priority_level) by products.department>,
<Feature: CUM_COUNT(product_id) by products.department>,
<Feature: CUM_COUNT(products.department) by priority_level>,
<Feature: CUM_COUNT(products.department) by product_id>,
<Feature: CUM_COUNT(sessions.device_name) by priority_level>,
<Feature: CUM_COUNT(sessions.device_name) by product_id>,
<Feature: CUM_COUNT(sessions.device_name) by products.department>,
<Feature: CUM_COUNT(sessions.device_type) by priority_level>,
<Feature: CUM_COUNT(sessions.device_type) by product_id>,
<Feature: CUM_COUNT(sessions.device_type) by products.department>,
<Feature: CUM_COUNT(subregioncode) by products.department>,
<Feature: CUM_COUNT(zipcode) by products.department>,
<Feature: CUM_SUM(products.rating) by session_id>,
<Feature: CUM_SUM(products.rating) by sessions.customer_id>,
<Feature: CUM_SUM(value) by sessions.customer_id>,
<Feature: CUM_SUM(value_2) by sessions.customer_id>,
<Feature: CUM_SUM(value_many_nans) by sessions.customer_id>]
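For cum_sum, we ignore product_id as a grouping column, but still use any other foreign key columns in that dataframe or any other dataframe. For cum_count, we use only product_id and priority_level as grouping columns. Note that cum_sum does not use priority_level because it is not a foreign key column, whereas we explicitly include it for cum_count. Finally, note that specifying groupby options does not affect which features the primitive is applied to. For example, cum_count ignores the sessions dataframe for groupings, but the feature <Feature: CUM_COUNT(sessions.device_name) by product_id> is still generated. The grouping column comes from the target dataframe log, so the feature is valid given the associated options. To ignore the sessions dataframe for cum_count altogether, sessions would need to be listed in the ignore_dataframes option for cum_count.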
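The grouping rules for a single dataframe can be sketched as follows. This is a toy model, not Featuretools' implementation: groupby_columns is a hypothetical helper, the foreign keys of log are assumed to be session_id and product_id (consistent with the output above), and groupings that come from other dataframes are not modelled.

```python
def groupby_columns(foreign_keys, include_groupby_columns=None, ignore_groupby_columns=None):
    """Pick the grouping columns of one dataframe for a groupby primitive (toy model).

    - default: group by the foreign key columns only
    - include_groupby_columns: group by exactly the listed columns
    - ignore_groupby_columns: group by foreign keys, minus the listed ones
    """
    if include_groupby_columns is not None:
        return list(include_groupby_columns)
    ignored = set(ignore_groupby_columns or [])
    return [column for column in foreign_keys if column not in ignored]


# Assumed foreign key columns of the 'log' dataframe.
log_foreign_keys = ["session_id", "product_id"]

cum_sum_groups = groupby_columns(
    log_foreign_keys, ignore_groupby_columns=["product_id"]
)
cum_count_groups = groupby_columns(
    log_foreign_keys, include_groupby_columns=["product_id", "priority_level"]
)

print(cum_sum_groups)
print(cum_count_groups)
```

Under these assumptions cum_sum groups by session_id only, while cum_count groups by product_id and priority_level, which lines up with the "by ..." parts of the log-level features listed in the output.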
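Specifying Options for Each Input of a Multi-Input Primitive#
For primitives that take multiple columns as input, such as Trend, the options above can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list must match the number of inputs the primitive takes.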
[6]:
features_list = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["trend"],
    trans_primitives=[],
    primitive_options={
        "trend": [
            {"ignore_columns": {"log": ["value_many_nans"]}},
            {"include_columns": {"customers": ["signup_date"], "log": ["datetime"]}},
        ]
    },
    features_only=True,
)
features_list
[6]:
[<Feature: age>,
<Feature: région_id>,
<Feature: cohort>,
<Feature: loves_ice_cream>,
<Feature: cancel_reason>,
<Feature: engagement_level>,
<Feature: TREND(log.value, datetime)>,
<Feature: TREND(log.value_2, datetime)>,
<Feature: cohorts.cohort_name>,
<Feature: régions.language>,
<Feature: cohorts.TREND(customers.age, signup_date)>,
<Feature: cohorts.TREND(log.value, datetime)>,
<Feature: cohorts.TREND(log.value_2, datetime)>,
<Feature: régions.TREND(customers.age, signup_date)>,
<Feature: régions.TREND(log.value, datetime)>,
<Feature: régions.TREND(log.value_2, datetime)>]
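Here, we pass in a list of primitive options for trend. For the first input of trend, we ignore the value_many_nans column. For the second input, we include the signup_date column from customers and the datetime column from log.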
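The per-input rule can be sketched as a small normalization step. This is a toy model under stated assumptions: options_per_input is a hypothetical helper, and raising ValueError on a length mismatch is an illustrative choice, not necessarily how Featuretools reports the problem.

```python
def options_per_input(options, n_inputs):
    """Expand primitive options into one option dictionary per input (toy model).

    A single dictionary is reused for every input. A list must contain exactly
    one dictionary per input of the primitive.
    """
    if isinstance(options, dict):
        # the same dictionary object is shared across all inputs
        return [options] * n_inputs
    if len(options) != n_inputs:
        raise ValueError(
            f"expected {n_inputs} option dictionaries, got {len(options)}"
        )
    return list(options)


# Trend takes two inputs: a numeric column and a datetime column.
shared = options_per_input({"ignore_dataframes": ["cohorts"]}, n_inputs=2)
separate = options_per_input(
    [
        {"ignore_columns": {"log": ["value_many_nans"]}},
        {"include_columns": {"customers": ["signup_date"], "log": ["datetime"]}},
    ],
    n_inputs=2,
)

print(shared)
print(separate)
```

Passing a list with one or three dictionaries for a two-input primitive raises an error in this sketch, reflecting the length requirement stated above.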