SklearnTransformerWrapper
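The SklearnTransformerWrapper() applies Scikit-learn transformers to a selected group of variables. It works with transformers such as the SimpleImputer, OrdinalEncoder, OneHotEncoder and KBinsDiscretizer, as well as all scalers and feature selection transformers. Other transformers have not been tested, but we expect it to work with most of them.
The SklearnTransformerWrapper() offers functionality similar to the ColumnTransformer class from Scikit-learn. The two differ in how variables are selected and in the output they return.
The SklearnTransformerWrapper() returns a pandas dataframe with the variables in the same order as in the original data. The ColumnTransformer returns a Numpy array, and the order of the variables may not match that of the original dataset.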
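The ColumnTransformer behaviour described above can be illustrated with a minimal sketch on toy data. The dataframe and its column names are illustrative assumptions, not part of the original examples, and the sketch relies on ColumnTransformer's default output settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): only the last column, "c", is imputed.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": ["x", "y", "z"],
    "c": [10.0, np.nan, 30.0],
})

ct = ColumnTransformer(
    transformers=[("imputer", SimpleImputer(strategy="mean"), ["c"])],
    remainder="passthrough",  # keep the untouched columns in the output
)
out = ct.fit_transform(df)

# The result is a Numpy array, not a dataframe, and the imputed
# column "c" now comes first, followed by "a" and "b".
print(type(out))
print(out)
```

With the wrapper, by contrast, the same imputation would be expected to return a dataframe whose columns stay in the order a, b, c.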
In the next code snippet, we show how to wrap the SimpleImputer from Scikit-learn so that only the selected variables are imputed.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)
# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(transformer=SimpleImputer(strategy='mean'),
                                    variables=['LotFrontage', 'MasVnrArea'])
# fit the wrapper + SimpleImputer
imputer.fit(X_train)
# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
In the next code snippet, we show how to wrap the StandardScaler from Scikit-learn to standardize only the selected variables.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)
# set up the wrapper with the StandardScaler
scaler = SklearnTransformerWrapper(transformer=StandardScaler(),
                                   variables=['LotFrontage', 'MasVnrArea'])
# fit the wrapper + StandardScaler
scaler.fit(X_train)
# transform the data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
In the next code snippet, we show how to wrap SelectKBest from Scikit-learn to select only a subset of the variables.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression, SelectKBest
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)
# capture the numerical variables (those that are not of type object)
cols = [var for var in X_train.columns if X_train[var].dtypes != 'O']
# set up the wrapper with SelectKBest to keep the 5 best numerical variables
selector = SklearnTransformerWrapper(
    transformer=SelectKBest(f_regression, k=5),
    variables=cols)
# fit the wrapper + SelectKBest (missing values are replaced by 0 first)
selector.fit(X_train.fillna(0), y_train)
# transform the data
X_train_t = selector.transform(X_train.fillna(0))
X_test_t = selector.transform(X_test.fillna(0))
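To make the selection step more concrete, the following sketch uses SelectKBest on its own, without the wrapper, on a small made-up dataset. The column names "signal" and "noise" are illustrative assumptions. get_support() returns a boolean mask marking the variables that were kept.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data (hypothetical): "signal" tracks the target closely, "noise" does not.
X_toy = pd.DataFrame({
    "signal": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
})
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Keep the single variable with the highest F-statistic against the target.
kbest = SelectKBest(f_regression, k=1).fit(X_toy, y_toy)

# Boolean mask over the input columns: True for the selected ones.
mask = kbest.get_support()
selected = list(X_toy.columns[mask])

print(selected)
```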
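Although Feature-engine has its own implementation of the OneHotEncoder, you may want to use Scikit-learn's transformer to access different options, such as dropping the first category. In the following example, we show how to apply Scikit-learn's OneHotEncoder to a subset of the categorical variables using the SklearnTransformerWrapper().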
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from feature_engine.wrappers import SklearnTransformerWrapper
# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # unknown values are marked with '?' in the raw data
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    data = data.drop(["name", "home.dest", "ticket", "boat", "body"], axis=1)
    return data
df = load_titanic()
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("survived", axis=1),
    df["survived"],
    test_size=0.2,
    random_state=42,
)
# set up the wrapper with the OneHotEncoder
# note: `sparse_output` replaced the `sparse` parameter in Scikit-learn 1.2
ohe = SklearnTransformerWrapper(
    OneHotEncoder(sparse_output=False, drop='first'),
    variables=['pclass', 'sex'])
ohe.fit(X_train)
X_train_transformed = ohe.transform(X_train)
X_test_transformed = ohe.transform(X_test)
We can inspect the result by executing the following command:
print(X_train_transformed.head())
The resulting dataframe is:
     age  sibsp  parch     fare cabin embarked  pclass_2  pclass_3  sex_male
772   17      0      0   7.8958     n        S       0.0       1.0       1.0
543   36      0      0     10.5     n        S       1.0       0.0       1.0
289   18      0      2    79.65     E        S       0.0       0.0       0.0
10    47      1      0  227.525     C        C       0.0       0.0       1.0
147  NaN      0      0     42.4     n        S       0.0       0.0       1.0
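Suppose you want to use the SklearnTransformerWrapper() in a more complex context. You may notice that the data uses the ? symbol to denote unknown values. Given the complexity of the transformations required, we will use a Pipeline to impute missing values, encode categorical features, and create interactions for specific variables with Scikit-learn's PolynomialFeatures.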
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from feature_engine.datasets import load_titanic
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.encoding import OrdinalEncoder
from feature_engine.wrappers import SklearnTransformerWrapper
X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    cabin="letter_only",
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline(steps=[
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(
        PolynomialFeatures(interaction_only=True, include_bias=False),
        variables=['pclass', 'sex']))
])
pipeline.fit(X_train)
X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)
print(X_train_transformed.head())
           age  sibsp  parch      fare  cabin  embarked  pclass  sex  \
772  17.000000      0      0    7.8958      0         0     3.0  0.0
543  36.000000      0      0   10.5000      0         0     2.0  0.0
289  18.000000      0      2   79.6500      1         0     1.0  1.0
10   47.000000      1      0  227.5250      2         1     1.0  0.0
147  29.532738      0      0   42.4000      0         0     1.0  0.0
     pclass sex
772         0.0
543         0.0
289         1.0
10          0.0
147         0.0
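More details
In the following Jupyter notebooks, you can find more details on how to navigate the parameters of the SklearnTransformerWrapper(), how to access the parameters of the wrapped Scikit-learn transformer, and the output of the transformations.
The notebooks can be found in a dedicated repository.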