Feature Selection

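Featuretools lets users remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both produce better model results and reduce the computational cost of prediction.

Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:

- ft.selection.remove_highly_null_features
- ft.selection.remove_single_value_features
- ft.selection.remove_highly_correlated_features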

We will describe each of these functions in depth, but first we must create an entity set with which we can run ft.dfs.

[1]:
import pandas as pd

import featuretools as ft
from featuretools.demo.flight import load_flight
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features,
)

es = load_flight(nrows=50)
es

Downloading data ...
[1]:
Entityset: Flight Data
  DataFrames:
    trip_logs [Rows: 50, Columns: 21]
    flights [Rows: 6, Columns: 9]
    airlines [Rows: 1, Columns: 1]
    airports [Rows: 4, Columns: 3]
  Relationships:
    trip_logs.flight_id -> flights.flight_id
    flights.carrier -> airlines.carrier
    flights.dest -> airports.dest

Remove Highly Null Features

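We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose share of null values passes a certain threshold. Below is our feature matrix with such a case: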

[2]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    cutoff_time=pd.DataFrame(
        {
            "trip_log_id": [30, 1, 2, 3, 4],
            "time": pd.to_datetime(["2016-09-22 00:00:00"] * 5),
        }
    ),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2,
)
fm

[2]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time distance carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT NaN NaN NaN NaN <NA> NaN 600.0 NaN NaN NaN NaN NaN <NA> RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
1 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
2 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
3 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
4 NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN

We look at the above feature matrix and decide to remove the highly null features.

[3]:
ft.selection.remove_highly_null_features(fm)

[3]:
flight_id distance flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT 600.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
1 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
2 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
3 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

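Notice that calling remove_highly_null_features didn't remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is above 95%. If we want to lower that threshold, we can set the pct_null_threshold parameter ourselves.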

[4]:
remove_highly_null_features(fm, pct_null_threshold=0.2)

[4]:
trip_log_id
30
1
2
3
4
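
The threshold rule can be sketched in plain Python, independent of Featuretools. The helpers `is_null`, `null_fraction_by_column`, and `keep_columns`, and the toy `columns` dict, are hypothetical names used only for illustration. The comparison (drop a column once its null fraction reaches the threshold) is an assumption chosen to agree with the output above, where columns that are 20% null are dropped at pct_null_threshold=0.2; it is not the library's implementation.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_fraction_by_column(columns):
    """Return {column name: fraction of missing values}."""
    return {
        name: sum(is_null(v) for v in values) / len(values)
        for name, values in columns.items()
    }


def keep_columns(columns, pct_null_threshold=0.95):
    """Keep columns whose null fraction is below the threshold (assumed rule)."""
    fractions = null_fraction_by_column(columns)
    return [name for name, frac in fractions.items() if frac < pct_null_threshold]


nan = float("nan")
# Toy stand-in for three columns of the feature matrix above.
columns = {
    "distance": [600.0, 1773.0, 1773.0, 1773.0, nan],  # 1 of 5 values is null
    "dep_delay": [nan, nan, nan, nan, nan],  # every value is null
    "flights.carrier": ["AA", "AA", "AA", "AA", None],  # 1 of 5 values is null
}

print(keep_columns(columns))  # default threshold of 0.95
print(keep_columns(columns, pct_null_threshold=0.2))
```

With the default threshold only the fully null column is dropped; at 0.2, every column in this toy matrix reaches the threshold, which mirrors the empty result shown above.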

Remove Single Value Features

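Another situation we might run into is one where our calculated features don't have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use remove_single_value_features. Let's see what happens when we remove the single value features of the feature matrix below.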

[5]:
fm

[5]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time distance carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT NaN NaN NaN NaN <NA> NaN 600.0 NaN NaN NaN NaN NaN <NA> RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
1 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
2 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
3 AA-494:CLT->PHX NaN NaN NaN NaN <NA> NaN 1773.0 NaN NaN NaN NaN NaN <NA> CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
4 NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN

Note

A list of feature definitions such as those created by dfs can be provided to the feature selection functions. Doing this will change the outputs to include an updated list of feature definitions.

[6]:
new_fm, new_features = remove_single_value_features(fm, features=features)
new_fm

[6]:
flight_id distance flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT 600.0 RSW Fort Myers, FL FL CLT 3 Charlotte, NC NC
1 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 Phoenix, AZ AZ
2 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 Phoenix, AZ AZ
3 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 Phoenix, AZ AZ
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN

Now that we have the feature definitions for the updated feature matrix, we can see that the features that were removed are:

[7]:
set(features) - set(new_features)

[7]:
{<Feature: air_time>,
 <Feature: arr_delay>,
 <Feature: canceled>,
 <Feature: carrier_delay>,
 <Feature: dep_delay>,
 <Feature: diverted>,
 <Feature: flights.carrier>,
 <Feature: flights.flight_num>,
 <Feature: late_aircraft_delay>,
 <Feature: national_airspace_delay>,
 <Feature: security_delay>,
 <Feature: taxi_in>,
 <Feature: taxi_out>,
 <Feature: weather_delay>}

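With the function used as it is above, null values are not taken into account when counting a feature's unique values. If we'd like to treat NaN as a value of its own, we can set count_nan_as_value to True, and we'll see flights.carrier and flights.flight_num back in the matrix.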

[8]:
new_fm, new_features = remove_single_value_features(
    fm, features=features, count_nan_as_value=True
)
new_fm

[8]:
flight_id distance flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT 600.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
1 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
2 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
3 AA-494:CLT->PHX 1773.0 CLT Charlotte, NC NC PHX 8 AA 494 Phoenix, AZ AZ
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The features that were removed are:

[9]:
set(features) - set(new_features)

[9]:
{<Feature: air_time>,
 <Feature: arr_delay>,
 <Feature: canceled>,
 <Feature: carrier_delay>,
 <Feature: dep_delay>,
 <Feature: diverted>,
 <Feature: late_aircraft_delay>,
 <Feature: national_airspace_delay>,
 <Feature: security_delay>,
 <Feature: taxi_in>,
 <Feature: taxi_out>,
 <Feature: weather_delay>}
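
The effect of count_nan_as_value can be illustrated with a small plain-Python sketch that does not use Featuretools. The helpers `is_null`, `count_unique`, and `is_single_value` are hypothetical names for illustration, and the rule that a column with at most one distinct value counts as single-valued is an assumption consistent with the outputs above rather than the library's code.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_unique(values, count_nan_as_value=False):
    """Count distinct values; nulls add one extra value only if requested."""
    distinct = {v for v in values if not is_null(v)}
    has_null = any(is_null(v) for v in values)
    return len(distinct) + (1 if count_nan_as_value and has_null else 0)


def is_single_value(values, count_nan_as_value=False):
    """A column with at most one distinct value carries no variance."""
    return count_unique(values, count_nan_as_value) <= 1


nan = float("nan")
carrier = ["AA", "AA", "AA", "AA", None]  # like flights.carrier above
dep_delay = [nan, nan, nan, nan, nan]  # entirely null

print(is_single_value(carrier))  # nulls ignored: only "AA" is counted
print(is_single_value(carrier, count_nan_as_value=True))  # "AA" plus null
print(is_single_value(dep_delay, count_nan_as_value=True))  # null is the only value
```

Under these assumptions a column like flights.carrier is single-valued only while nulls are ignored, whereas an entirely null column stays single-valued either way, matching the two sets of removed features shown above.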

Remove Highly Correlated Features

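The last feature selection function lets us remove features that would likely be redundant to the model we're attempting to build, by considering the correlation between pairs of calculated features. When two features are determined to be highly correlated, we remove the more complex of the two. For example, say we have two features: col and -(col). We can see that -(col) is just the negation of col, so we can guess that these features are going to be highly correlated. -(col) has the Negate primitive applied to it, so it is more complex than the identity feature col. Therefore, if we only want to keep one of col and -(col), we should keep the identity feature. For features that don't have an obvious difference in complexity, we discard the feature that comes later in the feature matrix. Let's try this out on our data: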

[10]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="trip_logs",
    trans_primitives=["negate"],
    agg_primitives=[],
    max_depth=3,
)
fm.head()

[10]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time distance carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled -(air_time) -(arr_delay) -(carrier_delay) -(dep_delay) -(distance) -(late_aircraft_delay) -(national_airspace_delay) -(security_delay) -(taxi_in) -(taxi_out) -(weather_delay) flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT -11.0 12.0 10.0 -12.0 False 88.0 600.0 0.0 0.0 0.0 0.0 0.0 False -88.0 12.0 -0.0 11.0 -600.0 -0.0 -0.0 -0.0 -10.0 -12.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
38 AA-495:ATL->PHX -6.0 28.0 5.0 1.0 False 224.0 1587.0 0.0 0.0 0.0 0.0 0.0 False -224.0 -1.0 -0.0 6.0 -1587.0 -0.0 -0.0 -0.0 -5.0 -28.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
46 AA-495:CLT->ATL -2.0 18.0 8.0 -3.0 False 50.0 226.0 0.0 0.0 0.0 0.0 0.0 False -50.0 3.0 -0.0 2.0 -226.0 -0.0 -0.0 -0.0 -8.0 -18.0 -0.0 CLT Charlotte, NC NC ATL 1 AA 495 Atlanta, GA GA
31 AA-494:RSW->CLT 0.0 11.0 10.0 -3.0 False 87.0 600.0 0.0 0.0 0.0 0.0 0.0 False -87.0 3.0 -0.0 -0.0 -600.0 -0.0 -0.0 -0.0 -10.0 -11.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
39 AA-495:ATL->PHX -4.0 26.0 3.0 10.0 False 235.0 1587.0 0.0 0.0 0.0 0.0 0.0 False -235.0 -10.0 -0.0 4.0 -1587.0 -0.0 -0.0 -0.0 -3.0 -26.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
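
To see why a feature and its negation count as highly correlated, here is a plain-Python sketch that does not call Featuretools. It uses the air_time and taxi_out values from the five rows shown above. The helpers `pearson` and `drop_correlated` are hypothetical names for illustration; comparing the absolute Pearson correlation against the threshold, and treating later columns as the more complex ones, are assumptions made for this sketch rather than a description of the library's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, pct_corr_threshold=0.95):
    """Return names to drop, treating later columns as more complex (assumed)."""
    names = list(columns)
    dropped = []
    # Walk from the last column backwards, comparing with every earlier column.
    for i in range(len(names) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= pct_corr_threshold:  # a NaN correlation never matches
                dropped.append(names[i])
                break
    return dropped


columns = {
    "air_time": [88.0, 224.0, 50.0, 87.0, 235.0],
    "taxi_out": [12.0, 28.0, 18.0, 11.0, 26.0],
    "-(air_time)": [-88.0, -224.0, -50.0, -87.0, -235.0],
}
print(drop_correlated(columns))
```

In this toy case the correlation between air_time and -(air_time) is -1, so its absolute value clears the 0.95 threshold and the later, negated column is the one dropped, while taxi_out is kept.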

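Note that we have some pretty clear correlations here between all the features and their negations. Now, using remove_highly_correlated_features with our default correlation threshold of 95%, we'll remove all the clearly correlated features, leaving just the less complex ones.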

[11]:
new_fm, new_features = remove_highly_correlated_features(fm, features=features)
new_fm.head()

/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
[11]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled -(security_delay) -(weather_delay) flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT -11.0 12.0 10.0 -12.0 False 88.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
38 AA-495:ATL->PHX -6.0 28.0 5.0 1.0 False 224.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
46 AA-495:CLT->ATL -2.0 18.0 8.0 -3.0 False 50.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 CLT Charlotte, NC NC ATL 1 AA 495 Atlanta, GA GA
31 AA-494:RSW->CLT 0.0 11.0 10.0 -3.0 False 87.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
39 AA-495:ATL->PHX -4.0 26.0 3.0 10.0 False 235.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ

The removed features are:

[12]:
set(features) - set(new_features)

[12]:
{<Feature: -(carrier_delay)>,
 <Feature: -(arr_delay)>,
 <Feature: distance>,
 <Feature: -(taxi_in)>,
 <Feature: -(distance)>,
 <Feature: -(national_airspace_delay)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(air_time)>,
 <Feature: -(taxi_out)>}
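
A likely explanation for the RuntimeWarning lines in the output above is that some columns are constant in the 50 sampled rows, so their standard deviation is zero and the Pearson correlation is undefined. The following is a minimal sketch of that effect using NumPy directly on made-up arrays; it is not Featuretools code, and the names `varying` and `constant` are illustrative only.

```python
import warnings

import numpy as np

varying = np.array([1.0, 2.0, 3.0, 4.0])
constant = np.zeros(4)  # zero variance, e.g. a delay column that is 0.0 in every row

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # corrcoef divides the covariance by the standard deviations, one of which is 0
    r = np.corrcoef(varying, constant)[0, 1]

saw_runtime_warning = any(issubclass(w.category, RuntimeWarning) for w in caught)
is_flagged = bool(abs(r) >= 0.95)  # any comparison with NaN evaluates to False

print(r, saw_runtime_warning, is_flagged)
```

Because a NaN correlation never reaches the threshold, a constant column would not be flagged as highly correlated. That would be consistent with -(security_delay) and -(weather_delay) remaining in the matrix above while the other negated features are removed, assuming those two columns are constant in this sample.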

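Changing the correlation threshold#

We can lower the threshold for removing correlated features with the pct_corr_threshold parameter, which makes the selection more restrictive.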

[13]:
new_fm, new_features = remove_highly_correlated_features(
    fm, features=features, pct_corr_threshold=0.9
)
new_fm.head()

/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
[13]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time carrier_delay weather_delay security_delay late_aircraft_delay canceled -(security_delay) -(weather_delay) flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT -11.0 12.0 10.0 -12.0 False 88.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
38 AA-495:ATL->PHX -6.0 28.0 5.0 1.0 False 224.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
46 AA-495:CLT->ATL -2.0 18.0 8.0 -3.0 False 50.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 CLT Charlotte, NC NC ATL 1 AA 495 Atlanta, GA GA
31 AA-494:RSW->CLT 0.0 11.0 10.0 -3.0 False 87.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
39 AA-495:ATL->PHX -4.0 26.0 3.0 10.0 False 235.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ

The removed features are:

[14]:
set(features) - set(new_features)

[14]:
{<Feature: -(carrier_delay)>,
 <Feature: -(arr_delay)>,
 <Feature: -(taxi_in)>,
 <Feature: distance>,
 <Feature: -(distance)>,
 <Feature: -(national_airspace_delay)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(air_time)>,
 <Feature: national_airspace_delay>,
 <Feature: -(taxi_out)>}
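
To make the effect of the threshold concrete, here is a simplified sketch on a toy DataFrame. It is not the Featuretools implementation: `columns_to_drop` is a hypothetical helper, the data is invented, and the rule of dropping the later column of a correlated pair is an assumption chosen to match the outputs above, where the negated features are removed and the originals are kept. The value 0.95 is used as the baseline on the assumption that it is the default of pct_corr_threshold.

```python
import pandas as pd


def columns_to_drop(df: pd.DataFrame, threshold: float) -> list[str]:
    """Hypothetical helper: flag the later column of each highly correlated pair."""
    corr = df.corr().abs()
    cols = list(df.columns)
    dropped = []
    # Walk from the last column backwards and compare it with every earlier column.
    for i in range(len(cols) - 1, 0, -1):
        if any(corr.iloc[i, j] >= threshold for j in range(i)):
            dropped.append(cols[i])
    return sorted(dropped)


toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of a, |r| is 1
        "c": [1.0, 2.0, 3.0, 4.0, 9.0],   # |r| with a is roughly 0.91
        "d": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with the rest
    }
)

print(columns_to_drop(toy, 0.95))  # only b is flagged
print(columns_to_drop(toy, 0.9))   # b and c are flagged
```

Lowering the threshold from 0.95 to 0.9 flags one more column in the toy data, which mirrors how national_airspace_delay is additionally removed in the output above.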

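If we only want to check a subset of the features, we can set features_to_check to the list of columns whose correlation we want to examine; no feature outside that list will be removed.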

[15]:
new_fm, new_features = remove_highly_correlated_features(
    fm,
    features=features,
    features_to_check=["air_time", "distance", "flights.distance_group"],
)
new_fm.head()

[15]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled -(air_time) -(arr_delay) -(carrier_delay) -(dep_delay) -(distance) -(late_aircraft_delay) -(national_airspace_delay) -(security_delay) -(taxi_in) -(taxi_out) -(weather_delay) flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT -11.0 12.0 10.0 -12.0 False 88.0 0.0 0.0 0.0 0.0 0.0 False -88.0 12.0 -0.0 11.0 -600.0 -0.0 -0.0 -0.0 -10.0 -12.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
38 AA-495:ATL->PHX -6.0 28.0 5.0 1.0 False 224.0 0.0 0.0 0.0 0.0 0.0 False -224.0 -1.0 -0.0 6.0 -1587.0 -0.0 -0.0 -0.0 -5.0 -28.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
46 AA-495:CLT->ATL -2.0 18.0 8.0 -3.0 False 50.0 0.0 0.0 0.0 0.0 0.0 False -50.0 3.0 -0.0 2.0 -226.0 -0.0 -0.0 -0.0 -8.0 -18.0 -0.0 CLT Charlotte, NC NC ATL 1 AA 495 Atlanta, GA GA
31 AA-494:RSW->CLT 0.0 11.0 10.0 -3.0 False 87.0 0.0 0.0 0.0 0.0 0.0 False -87.0 3.0 -0.0 -0.0 -600.0 -0.0 -0.0 -0.0 -10.0 -11.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
39 AA-495:ATL->PHX -4.0 26.0 3.0 10.0 False 235.0 0.0 0.0 0.0 0.0 0.0 False -235.0 -10.0 -0.0 4.0 -1587.0 -0.0 -0.0 -0.0 -3.0 -26.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ

The removed features are:

[16]:
set(features) - set(new_features)

[16]:
{<Feature: distance>}

To protect specific features from being removed from the feature matrix, we can pass a features_to_keep list; the features in that list will not be removed.

[17]:
new_fm, new_features = remove_highly_correlated_features(
    fm,
    features=features,
    features_to_keep=["air_time", "distance", "flights.distance_group"],
)
new_fm.head()

/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/code/fin_tool/github/featuretools/venv/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
[17]:
flight_id dep_delay taxi_out taxi_in arr_delay diverted air_time distance carrier_delay weather_delay national_airspace_delay security_delay late_aircraft_delay canceled -(security_delay) -(weather_delay) flights.origin flights.origin_city flights.origin_state flights.dest flights.distance_group flights.carrier flights.flight_num flights.airports.dest_city flights.airports.dest_state
trip_log_id
30 AA-494:RSW->CLT -11.0 12.0 10.0 -12.0 False 88.0 600.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
38 AA-495:ATL->PHX -6.0 28.0 5.0 1.0 False 224.0 1587.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ
46 AA-495:CLT->ATL -2.0 18.0 8.0 -3.0 False 50.0 226.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 CLT Charlotte, NC NC ATL 1 AA 495 Atlanta, GA GA
31 AA-494:RSW->CLT 0.0 11.0 10.0 -3.0 False 87.0 600.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 RSW Fort Myers, FL FL CLT 3 AA 494 Charlotte, NC NC
39 AA-495:ATL->PHX -4.0 26.0 3.0 10.0 False 235.0 1587.0 0.0 0.0 0.0 0.0 0.0 False -0.0 -0.0 ATL Atlanta, GA GA PHX 7 AA 495 Phoenix, AZ AZ

The removed features are:

[18]:
set(features) - set(new_features)

[18]:
{<Feature: -(carrier_delay)>,
 <Feature: -(arr_delay)>,
 <Feature: -(taxi_in)>,
 <Feature: -(distance)>,
 <Feature: -(national_airspace_delay)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(air_time)>,
 <Feature: -(taxi_out)>}
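
The difference between features_to_check and features_to_keep can be sketched on a small DataFrame built from the five rows printed above. This is an illustration under stated assumptions, not the Featuretools implementation: `correlated_drops` and its `to_check` and `to_keep` arguments are hypothetical names, `neg_air_time` stands in for -(air_time), and dropping the later column of a correlated pair is assumed.

```python
import pandas as pd


def correlated_drops(df, threshold=0.95, to_check=None, to_keep=None):
    """Hypothetical helper mimicking the two options on a plain DataFrame."""
    # to_check narrows which columns are compared at all.
    cols = [c for c in df.columns if to_check is None or c in to_check]
    protected = set(to_keep or [])
    corr = df[cols].corr().abs()
    dropped = []
    for i in range(len(cols) - 1, 0, -1):
        if cols[i] in protected:
            continue  # to_keep columns are compared against but never dropped
        if any(corr.iloc[i, j] >= threshold for j in range(i)):
            dropped.append(cols[i])
    return sorted(dropped)


toy = pd.DataFrame(
    {
        "air_time": [88.0, 224.0, 50.0, 87.0, 235.0],
        "distance": [600.0, 1587.0, 226.0, 600.0, 1587.0],
        "neg_air_time": [-88.0, -224.0, -50.0, -87.0, -235.0],
    }
)

print(correlated_drops(toy))
print(correlated_drops(toy, to_check=["air_time", "distance"]))
print(correlated_drops(toy, to_keep=["air_time", "distance"]))
```

With to_check, only columns inside the list can be dropped, so only distance is flagged. With to_keep, distance is protected and only the negated column is flagged, which matches the pattern in the two outputs above.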