SklearnTransformerWrapper

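The SklearnTransformerWrapper() applies Scikit-learn transformers to a selected group of variables. It works with transformers such as SimpleImputer, OrdinalEncoder, OneHotEncoder and KBinsDiscretizer, as well as with all scalers and feature selection transformers. Other transformers have not been tested, but we expect it to work with most of them.

The SklearnTransformerWrapper() offers functionality similar to the ColumnTransformer class in Scikit-learn. The two differ in how the variables are selected and in the output they return.

The SklearnTransformerWrapper() returns a pandas dataframe with the variables in the same order as in the original data. ColumnTransformer returns a NumPy array, and the order of the variables may differ from that of the original dataset.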
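To make the contrast concrete, here is a minimal sketch of the ColumnTransformer side only, using scikit-learn with its default output settings. The toy dataframe and its column names are made up for illustration. It shows that the result is a NumPy array in which the transformed column comes first and the passed-through columns follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): only column "b" has a missing value
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, np.nan, 30.0],
    "c": [7.0, 8.0, 9.0],
})

# Impute "b" and pass the remaining columns through untouched
ct = ColumnTransformer(
    transformers=[("imputer", SimpleImputer(strategy="mean"), ["b"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)

# The result is a plain array, and "b" has moved to the first position
print(type(out))
print(out)
```

With this setup the column names are lost and the original order a, b, c becomes b, a, c, which is the behaviour the wrapper is designed to avoid.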

In the next code snippet we show how to wrap the SimpleImputer from Scikit-learn to impute only the selected variables.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(transformer = SimpleImputer(strategy='mean'),
                                    variables = ['LotFrontage', 'MasVnrArea'])

# fit the wrapper + SimpleImputer
imputer.fit(X_train)

# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In the next code snippet we show how to wrap the StandardScaler from Scikit-learn to standardize only the selected variables.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# set up the wrapper with the StandardScaler
scaler = SklearnTransformerWrapper(transformer = StandardScaler(),
                                    variables = ['LotFrontage', 'MasVnrArea'])

# fit the wrapper + StandardScaler
scaler.fit(X_train)

# transform the data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
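Conceptually, wrapping a scaler amounts to fitting it on a subset of columns and writing the result back into a copy of the dataframe. The following is a rough sketch of that idea with plain pandas and scikit-learn on made-up data. It is not the wrapper's actual implementation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): only "area" should be standardized
df = pd.DataFrame({
    "area": [100.0, 200.0, 300.0],
    "rooms": [2, 3, 4],
    "city": ["A", "B", "A"],
})
selected = ["area"]

# Fit the scaler on the selected columns only
scaler = StandardScaler().fit(df[selected])

# Write the scaled values back into a copy, keeping names and order
scaled = df.copy()
scaled[selected] = scaler.transform(df[selected])

print(scaled)
```

The other columns, including the categorical one, pass through unchanged, which is why the wrapper can be used on dataframes with mixed types.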

In the next code snippet we show how to wrap SelectKBest from Scikit-learn to select only a subset of the variables.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression, SelectKBest
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# capture the numerical variables
cols = [var for var in X_train.columns if X_train[var].dtypes != 'O']

# set up the wrapper with SelectKBest to keep the 5 best numerical variables

selector = SklearnTransformerWrapper(
    transformer = SelectKBest(f_regression, k=5),
    variables = cols)

selector.fit(X_train.fillna(0), y_train)

# transform the data
X_train_t = selector.transform(X_train.fillna(0))
X_test_t = selector.transform(X_test.fillna(0))
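To see which columns a selector such as SelectKBest retains, one option is to map its boolean support mask back to the column names. The sketch below does this with plain scikit-learn on synthetic data. The column names and the data-generating process are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): only "signal" is related to the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=100),
    "noise_1": rng.normal(size=100),
    "noise_2": rng.normal(size=100),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=100)

# Keep the single best feature according to the univariate F-test
selector = SelectKBest(f_regression, k=1).fit(X, y)

# get_support() returns a boolean mask aligned with the input columns
selected = X.columns[selector.get_support()].tolist()
print(selected)
```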

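Even though Feature-engine has its own implementation of OneHotEncoder, you may want to use Scikit-learn's transformer to access different options, such as dropping the first category. In the following example, we show how to apply Scikit-learn's OneHotEncoder to a subset of the categorical variables using the SklearnTransformerWrapper().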

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    data.drop(["name", "home.dest", "ticket", "boat", "body"], axis=1, inplace=True)
    return data

df = load_titanic()

X_train, X_test, y_train, y_test= train_test_split(
    df.drop("survived", axis=1),
    df["survived"],
    test_size=0.2,
    random_state=42,
)

ohe = SklearnTransformerWrapper(
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions
        OneHotEncoder(sparse_output=False, drop='first'),
        variables = ['pclass','sex'])

ohe.fit(X_train)

X_train_transformed = ohe.transform(X_train)
X_test_transformed = ohe.transform(X_test)

We can examine the result by executing the following:

print(X_train_transformed.head())

The resulting dataframe is:

     age  sibsp  parch     fare cabin embarked  pclass_2  pclass_3  sex_male
772   17      0      0   7.8958     n        S       0.0       1.0       1.0
543   36      0      0     10.5     n        S       1.0       0.0       1.0
289   18      0      2    79.65     E        S       0.0       0.0       0.0
10    47      1      0  227.525     C        C       0.0       0.0       1.0
147  NaN      0      0     42.4     n        S       0.0       0.0       1.0

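Let's say you want to use the SklearnTransformerWrapper() in a more complex context. As you may have noticed, the data contains ? signs that denote unknown values. Because several transformations are needed, we will use a Pipeline to impute the missing values, encode the categorical features, and create interactions between specific variables with Scikit-learn's PolynomialFeatures.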

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from feature_engine.datasets import load_titanic
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.encoding import OrdinalEncoder
from feature_engine.wrappers import SklearnTransformerWrapper

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline(steps = [
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(
        PolynomialFeatures(interaction_only = True, include_bias=False),
        variables=['pclass','sex']))
])

pipeline.fit(X_train)
X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)

print(X_train_transformed.head())

           age  sibsp  parch      fare  cabin  embarked  pclass  sex  \
772  17.000000      0      0    7.8958      0         0     3.0  0.0
543  36.000000      0      0   10.5000      0         0     2.0  0.0
289  18.000000      0      2   79.6500      1         0     1.0  1.0
10   47.000000      1      0  227.5250      2         1     1.0  0.0
147  29.532738      0      0   42.4000      0         0     1.0  0.0

     pclass sex
772         0.0
543         0.0
289         1.0
10          0.0
147         0.0

More details

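In the following Jupyter notebook you can find more details on how to navigate the parameters of the SklearnTransformerWrapper(), and on how to access the parameters of the wrapped Scikit-learn transformer and the output of the transformations.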

The notebook can be found in a dedicated repository.