MatchVariables#

MatchVariables() ensures that the columns in the test set are identical to those in the training set.

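If the test set contains extra columns, they are dropped. If the test set lacks columns that were present in the training set, those columns are added and filled with a user-defined value (for example, np.nan). MatchVariables() also returns the variables in the order in which they appeared in the training set.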
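The behaviour described above can be approximated in plain pandas. The following is a minimal sketch of the idea, not the library's implementation; the helper name `match_columns` and the toy data frames are made up for illustration.

```python
import numpy as np
import pandas as pd


def match_columns(train, test, fill_value=np.nan):
    """Return `test` with exactly the columns of `train`, in the same order.

    Hypothetical helper, for illustration only.
    """
    # reindex drops the columns that are not requested, adds the requested
    # columns that are missing (filled with fill_value), and orders the
    # result as in train.columns.
    return test.reindex(columns=train.columns, fill_value=fill_value)


toy_train = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})
toy_test = pd.DataFrame({"c": ["z"], "extra": [0], "a": [5]})

matched = match_columns(toy_train, toy_test)
print(matched)
```

In this sketch, "extra" is dropped, "b" is added back filled with NaN, and the columns come out in the order a, b, c.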

Let's explore this with an example. First, we load the Titanic dataset and split it into a training set and a test set:

from feature_engine.preprocessing import MatchVariables
from feature_engine.datasets import load_titanic

# Load dataset
data = load_titanic(
    predictors_only=True,
    cabin="letter_only",
)

data['pclass'] = data['pclass'].astype('O')

# Split test and train
train = data.iloc[0:1000, :]
test = data.iloc[1000:, :]

Now, we set up MatchVariables() and fit it to the training set.

# set up the transformer
match_cols = MatchVariables(missing_values="ignore")

# learn the variables in the train set
match_cols.fit(train)

MatchVariables() stores the variables of the training set in one of its attributes:

# the transformer stores the input variables
match_cols.feature_names_in_

['pclass',
 'survived',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'cabin',
 'embarked']

Now, we drop some columns from the test set.

# Let's drop some columns in the test set for the demo
test_t = test.drop(["sex", "age"], axis=1)

test_t.head()

     pclass  survived  sibsp  parch     fare cabin embarked
1000      3         1      0      0   7.7500     n        Q
1001      3         1      2      0  23.2500     n        Q
1002      3         1      2      0  23.2500     n        Q
1003      3         1      2      0  23.2500     n        Q
1004      3         1      0      0   7.7875     n        Q

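If we transform the data frame with the dropped columns using MatchVariables(), the resulting data frame contains all the variables again. The columns that were missing are added back, filled with the default value np.nan.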

# the transformer adds the columns back
test_tt = match_cols.transform(test_t)

test_tt.head()

The following variables are added to the DataFrame: ['age', 'sex']
     pclass  survived  sex  age  sibsp  parch     fare cabin embarked
1000      3         1  NaN  NaN      0      0   7.7500     n        Q
1001      3         1  NaN  NaN      2      0  23.2500     n        Q
1002      3         1  NaN  NaN      2      0  23.2500     n        Q
1003      3         1  NaN  NaN      2      0  23.2500     n        Q
1004      3         1  NaN  NaN      0      0   7.7875     n        Q

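Note how the missing columns were added back to the transformed test set, filled with missing values, and in the same position (that is, the same order) as in the training set.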

Conversely, extra columns in the test set are removed. To see this, let's add some extra columns to the test set:

# let's add some columns for the demo
test_t[['var_a', 'var_b']] = 0

test_t.head()

     pclass  survived  sibsp  parch     fare cabin embarked  var_a  var_b
1000      3         1      0      0   7.7500     n        Q      0      0
1001      3         1      2      0  23.2500     n        Q      0      0
1002      3         1      2      0  23.2500     n        Q      0      0
1003      3         1      2      0  23.2500     n        Q      0      0
1004      3         1      0      0   7.7875     n        Q      0      0

Now, we transform the data with MatchVariables():

test_tt = match_cols.transform(test_t)

test_tt.head()

The following variables are added to the DataFrame: ['age', 'sex']
The following variables are dropped from the DataFrame: ['var_b', 'var_a']
     pclass  survived  sex  age  sibsp  parch     fare cabin embarked
1000      3         1  NaN  NaN      0      0   7.7500     n        Q
1001      3         1  NaN  NaN      2      0  23.2500     n        Q
1002      3         1  NaN  NaN      2      0  23.2500     n        Q
1003      3         1  NaN  NaN      2      0  23.2500     n        Q
1004      3         1  NaN  NaN      0      0   7.7875     n        Q

This time, the transformer both added the missing columns, filled with NA, and dropped the extra columns from the resulting dataset.

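However, if we look closely, the data type of the sex variable does not match between the two sets. This can cause problems if later transformations depend on the correct data types.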

train.sex.dtype

dtype('O')

test_tt.sex.dtype

dtype('float64')

Set the match_dtypes parameter to True to align the data types as well.

match_cols_and_dtypes = MatchVariables(missing_values="ignore", match_dtypes=True)
match_cols_and_dtypes.fit(train)

test_ttt = match_cols_and_dtypes.transform(test_t)

The following variables are added to the DataFrame: ['sex', 'age']
The following variables are dropped from the DataFrame: ['var_b', 'var_a']
The sex dtype is changing from  float64 to object

Now the data types match.

test_ttt.sex.dtype

dtype('O')
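The dtype alignment can be pictured as casting each column of the transformed test set to the dtype recorded from the training set. Below is a minimal pandas sketch of that idea, not the library's implementation; the toy data frames are made up for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame(
    {
        "sex": pd.Series(["male", "female"], dtype=object),
        "fare": [7.25, 71.28],
    }
)

# The test set lacks "sex"; reindex adds it back, filled with NaN
toy_test = pd.DataFrame({"fare": [8.05, 13.0]}).reindex(columns=toy_train.columns)

# Cast every column to the dtype seen in the training set
aligned = toy_test.astype(toy_train.dtypes.to_dict())

print(aligned.dtypes)
```

Note that a plain cast like this can fail for some combinations. For example, a column filled with NaN cannot be cast to a NumPy integer dtype.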

By default, MatchVariables() prints messages stating which variables were added, dropped, or changed. We can switch these messages off through the verbose parameter.

When to use the transformer#

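This transformer is useful in "predict then optimize" problems. In these cases, a machine learning model is trained on a given dataset with a fixed set of input features. The test set is then "post-processed" to reflect the scenario we want to model. For example: "what would happen if the customer received an email campaign?", where the variable "receive_campaign" is switched from 0 to 1.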

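While building these scenario datasets, a lot of metadata can end up in the data, such as the scenario number or the time at which the scenario was generated. We then need to pass these data to the model to obtain the scenario predictions.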

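MatchVariables() provides a simple and elegant way to remove the extra metadata while returning a dataset whose input features are in the correct order, so that the different scenarios can be modelled directly inside a machine learning pipeline.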
