Generating Feature Descriptions
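As features become more complicated, their names can become harder to understand. Both the describe_feature function and the graph_feature function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the describe_feature function can be augmented with custom definitions and templates to improve the resulting descriptions.
By default, describe_feature uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions.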
[2]:
feature_defs[9]
[2]:
<Feature: MONTH(birthday)>
[3]:
ft.describe_feature(feature_defs[9])
[3]:
'The month of the "birthday".'
[4]:
feature_defs[14]
[4]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[5]:
ft.describe_feature(feature_defs[14])
[5]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'
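Improving Descriptions
While the default descriptions can be helpful, they can be improved further by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions.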
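Feature Descriptions
Custom feature definitions are used in a description in place of the automatically generated one. This can be used to better explain what a ColumnSchema or feature is, or to provide descriptions that take advantage of a user's existing knowledge of the data or domain.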
[6]:
feature_descriptions = {"customers: join_date": "the date the customer joined"}
ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)
[6]:
'The month of the "birthday".'
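For example, the dictionary above maps the column name "join_date" to a more descriptive definition of what that column represents in the dataset. The custom text appears only in descriptions of features built on join_date; the feature described here, MONTH(birthday), does not use that column, so its description is unchanged. Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the description attribute present on each ColumnSchema: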
[7]:
join_date_column_schema = es["customers"].ww.columns["join_date"]
join_date_column_schema.description = "the date the customer joined"
es["customers"].ww.columns["join_date"].description
[7]:
'the date the customer joined'
[8]:
feature = ft.TransformFeature(es["customers"].ww["join_date"], ft.primitives.Hour)
feature
[8]:
<Feature: HOUR(join_date)>
[9]:
ft.describe_feature(feature)
[9]:
'The hour value of the date the customer joined.'
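Descriptions must be set on a column before a feature is created from it in order for the description to propagate. Note that if a description is both set directly on a column and passed to describe_feature through the feature_descriptions parameter, the description in the feature_descriptions parameter takes precedence.
Feature descriptions can also be provided for generated features.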
[10]:
feature_descriptions = {
    "sessions: SUM(transactions.amount)": "the total transaction amount for a session"
}
feature_defs[14]
[10]:
<Feature: MODE(sessions.MODE(transactions.product_id))>
[11]:
ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)
[11]:
'The most frequently occurring value of the most frequently occurring value of the "product_id" of all instances of "transactions" for each "session_id" in "sessions" of all instances of "sessions" for each "customer_id" in "customers".'
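Here, we create and pass in a custom description for the intermediate feature SUM(transactions.amount). The description of MEAN(sessions.SUM(transactions.amount)), which is built on top of SUM(transactions.amount), uses the custom description in place of the automatically generated one. The feature shown above, MODE(sessions.MODE(transactions.product_id)), is not built on SUM(transactions.amount), so its description is unchanged. Feature descriptions can be passed in as a dictionary that maps either the feature object itself or the unique feature name, in the form "[dataframe_name]: [feature_name]", to the custom description, as shown above.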
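The way a custom description for an intermediate feature replaces part of a stacked feature's description can be sketched with a small toy model. This is not how Featuretools implements describe_feature: ToyFeature and describe are hypothetical names introduced only for illustration, and the toy omits the "of all instances of ... for each ..." clauses that the real descriptions contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFeature:
    """Hypothetical stand-in for a feature; not a Featuretools class."""

    dataframe_name: str
    name: str
    template: str = ""  # e.g. "the average of {}"; empty for plain columns
    inputs: tuple = ()  # the features this one is built on

    def unique_name(self):
        # Same "[dataframe_name]: [feature_name]" form used as dictionary keys.
        return f"{self.dataframe_name}: {self.name}"


def describe(feature, feature_descriptions=None):
    """Describe a toy feature, preferring custom descriptions at every level."""
    feature_descriptions = feature_descriptions or {}
    custom = feature_descriptions.get(feature.unique_name())
    if custom is not None:
        return custom  # the custom text replaces the generated one
    if not feature.inputs:
        return f'the "{feature.name}"'  # plain column
    input_texts = [describe(f, feature_descriptions) for f in feature.inputs]
    return feature.template.format(*input_texts)


amount = ToyFeature("transactions", "amount")
session_sum = ToyFeature(
    "sessions", "SUM(transactions.amount)", "the sum of {}", (amount,)
)
customer_mean = ToyFeature(
    "customers",
    "MEAN(sessions.SUM(transactions.amount))",
    "the average of {}",
    (session_sum,),
)

custom = {
    "sessions: SUM(transactions.amount)": "the total transaction amount for a session"
}
print(describe(customer_mean))
print(describe(customer_mean, custom))
```

In the second call, the lookup succeeds at the intermediate feature, so the parent's template is filled with the custom text instead of the generated one.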
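Primitive Templates
Primitive descriptions are generated using primitive templates. By default, these are defined using the description_template attribute on the primitive. Primitives without a template default to using the name attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take the input feature descriptions as positional arguments. These can be overridden by mapping primitive instances or primitive names to custom templates and passing them to describe_feature through the primitive_templates argument.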
[12]:
primitive_templates = {"sum": "the total of {}"}
feature_defs[6]
[12]:
<Feature: SUM(transactions.amount)>
[13]:
ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)
[13]:
'The total of the "amount" of all instances of "transactions" for each "customer_id" in "customers".'
In this example, we override the default template 'the sum of {}' with our custom template 'the total of {}'. The description uses our custom template instead of the default.
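Because a template is a string that takes the input descriptions as positional arguments, the substitution can be illustrated with Python's built-in str.format. This is a minimal sketch of the idea only, not a Featuretools call; the two-input ratio_template is a hypothetical example and not a template taken from the library.

```python
# Toy illustration only: plain str.format, no Featuretools involved.
default_template = "the sum of {}"
custom_template = "the total of {}"

# Description of the input feature, passed positionally into the template.
input_description = 'the "amount"'

default_text = default_template.format(input_description)
custom_text = custom_template.format(input_description)

# A hypothetical two-input template: each {} is filled in input order.
ratio_template = "{} divided by {}"
ratio_text = ratio_template.format('the "amount"', 'the "quantity"')

print(default_text)
print(custom_text)
print(ratio_text)
```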
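Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always used for the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the "nth" form is available through the nth_slice keyword.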
[14]:
feature = feature_defs[5]
feature
[14]:
<Feature: N_MOST_COMMON(transactions.product_id)>
[15]:
primitive_templates = {
    "n_most_common": [
        "the 3 most common elements of {}",  # generic multi-output feature
        "the {nth_slice} most common element of {}",  # template for each slice
    ]
}
ft.describe_feature(feature, primitive_templates=primitive_templates)
[15]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
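Notice that the multi-output feature is described using the first template. Each slice of the feature is described using the second, per-slice template: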
[16]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[16]:
'The 1st most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[17]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[17]:
'The 2nd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[18]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[18]:
'The 3rd most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
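The "nth" form seen in these outputs can be reproduced with a short helper. The sketch below assumes zero-based slice indices, as in feature[0]; to_nth is a hypothetical name and not part of the Featuretools API. It also shows that a single template can mix a positional placeholder with the nth_slice keyword.

```python
def to_nth(slice_index: int) -> str:
    """Turn a zero-based slice index into '1st', '2nd', '3rd', '4th', ..."""
    n = slice_index + 1
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


slice_template = "the {nth_slice} most common element of {}"
input_description = 'the "product_id"'

# Positional argument fills {}, keyword argument fills {nth_slice}.
slice_texts = [
    slice_template.format(input_description, nth_slice=to_nth(i)) for i in range(3)
]
for text in slice_texts:
    print(text)
```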
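Alternatively, instead of supplying a single template for all slices, a template can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template.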
[19]:
primitive_templates = {
    "n_most_common": [
        "the 3 most common elements of {}",
        "the most common element of {}",
        "the second most common element of {}",
        "the third most common element of {}",
    ]
}
ft.describe_feature(feature, primitive_templates=primitive_templates)
[19]:
'The 3 most common elements of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[20]:
ft.describe_feature(feature[0], primitive_templates=primitive_templates)
[20]:
'The most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[21]:
ft.describe_feature(feature[1], primitive_templates=primitive_templates)
[21]:
'The second most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
[22]:
ft.describe_feature(feature[2], primitive_templates=primitive_templates)
[22]:
'The third most common element of the "product_id" of all instances of "transactions" for each "customer_id" in "customers".'
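The selection rule for template lists (first entry for the overall feature, then either one shared slice template or one template per slice) can be summarized in a few lines. This is a sketch of the rule as described on this page, not the library's code; pick_template is a hypothetical helper.

```python
def pick_template(templates, slice_index=None):
    """Choose a template from a multi-output template list.

    templates[0] describes the overall feature. If exactly one more
    template follows, it is shared by all slices; otherwise slice i
    uses templates[1 + i].
    """
    if slice_index is None:
        return templates[0]
    slice_templates = templates[1:]
    if len(slice_templates) == 1:
        return slice_templates[0]
    if slice_index >= len(slice_templates):
        raise IndexError("each slice needs its own template")
    return slice_templates[slice_index]


shared = [
    "the 3 most common elements of {}",
    "the {nth_slice} most common element of {}",
]
per_slice = [
    "the 3 most common elements of {}",
    "the most common element of {}",
    "the second most common element of {}",
    "the third most common element of {}",
]

print(pick_template(shared))        # overall feature
print(pick_template(shared, 2))     # shared slice template
print(pick_template(per_slice, 2))  # dedicated template for the third slice
```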
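Custom feature descriptions and primitive templates can also be defined separately in a JSON file and passed to the describe_feature function using the metadata_file keyword argument. Descriptions passed in directly through the feature_descriptions and primitive_templates keyword arguments take precedence over any descriptions provided in the JSON metadata file.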
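As a rough sketch of what such a metadata file and the precedence rule might look like, the snippet below writes a JSON file and merges it with keyword-style overrides using only the standard library. The top-level keys "feature_descriptions" and "primitive_templates" are an assumption that mirrors the keyword argument names; check the describe_feature reference for the exact file layout before relying on it.

```python
import json
import os
import tempfile

# Assumed layout: one top-level key per keyword argument.
metadata = {
    "feature_descriptions": {
        "customers: join_date": "the date the customer joined"
    },
    "primitive_templates": {"sum": "the total of {}"},
}

# Write the metadata to a temporary JSON file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(metadata, f)
    metadata_path = f.name

with open(metadata_path) as f:
    loaded = json.load(f)

# Values passed directly as keyword arguments win over the file's values.
keyword_templates = {"sum": "the grand total of {}"}
keyword_descriptions = {}

merged_templates = {**loaded.get("primitive_templates", {}), **keyword_templates}
merged_descriptions = {
    **loaded.get("feature_descriptions", {}),
    **keyword_descriptions,
}

os.remove(metadata_path)  # clean up the temporary file

print(merged_templates)
print(merged_descriptions)
```

With Featuretools installed, a file like this would be passed as ft.describe_feature(feature, metadata_file=metadata_path), with the path kept on disk until after the call.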