ColumnSelector: Scikit-learn utility function to select specific columns in a pipeline

Implementation of a column selector class for scikit-learn pipelines.

from mlxtend.feature_selection import ColumnSelector

Overview

The ColumnSelector can be used for "manual" feature selection, for example as part of a grid search via a scikit-learn pipeline.

Example 1 - Fitting an Estimator on a Feature Subset

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

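The ColumnSelector is a simple transformer class that selects specific columns (features) from a dataset. For instance, calling the transform method returns a reduced dataset that contains only two features (here, the first two features, selected via the indices 0 and 1):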

from mlxtend.feature_selection import ColumnSelector

col_selector = ColumnSelector(cols=(0, 1))
# col_selector.fit(X) is optional; it does nothing
col_selector.transform(X).shape

(150, 2)
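
As a rough mental model, the selection step amounts to slicing columns out of a 2D array. The sketch below is not the mlxtend implementation: `select_columns` is a hypothetical helper, and the random array only stands in for the 150 x 4 iris feature matrix.

```python
import numpy as np


def select_columns(X, cols):
    """Hypothetical stand-in: return only the requested columns of a 2D array."""
    # A list of integer indices triggers NumPy fancy indexing along axis 1.
    return X[:, list(cols)]


# Synthetic data with the same shape as the iris feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))

X_subset = select_columns(X_demo, cols=(0, 1))
print(X_subset.shape)  # (150, 2)
```

Wrapping this slice in an object with fit and transform methods is what allows it to be used as a pipeline step.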

The ColumnSelector works with both numpy arrays and pandas DataFrames:

import pandas as pd

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

col_selector = ColumnSelector(cols=("sepal length (cm)", "sepal width (cm)"))
col_selector.transform(iris_df).shape

(150, 2)

Similarly, we can use the ColumnSelector as part of a scikit-learn Pipeline:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(cols=(0, 1)),
                     KNeighborsClassifier())

pipe.fit(X, y)
pipe.score(X, y)

0.84

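Example 2 - Feature Selection via GridSearch

Example 1 showed a simple usage of the ColumnSelector; however, selecting columns from a dataset is trivial and does not require a dedicated transformer class, since we could have achieved the same results via: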

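# `classifier` stands in for any estimator, e.g. the scaler + KNN pair from Example 1
classifier = make_pipeline(StandardScaler(), KNeighborsClassifier())
classifier.fit(X[:, :2], y)
classifier.score(X[:, :2], y)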

However, the ColumnSelector becomes really useful when feature selection is part of a grid search, as shown in this example.

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Create all possible combinations of feature indices:

from itertools import combinations

all_comb = []
for size in range(1, 5):
    all_comb += list(combinations(range(X.shape[1]), r=size))
print(all_comb)

[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]
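
The length of this list can be checked against the closed form for the number of non-empty subsets of n features, 2^n - 1. The snippet below is a small sanity check using only the standard library; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

n_features = 4  # the iris data has four feature columns

subsets = [c for size in range(1, n_features + 1)
           for c in combinations(range(n_features), r=size)]

# C(4,1) + C(4,2) + C(4,3) + C(4,4) equals 2**4 - 1
expected = sum(comb(n_features, k) for k in range(1, n_features + 1))
print(len(subsets), expected, 2 ** n_features - 1)  # 15 15 15
```

Because the count doubles with every additional feature, exhaustive enumeration of this kind is only practical for datasets with a small number of columns.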

Feature and model selection via grid search:

from mlxtend.feature_selection import ColumnSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(),
                     KNeighborsClassifier())

param_grid = {'columnselector__cols': all_comb,
              'kneighborsclassifier__n_neighbors': list(range(1, 11))}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)
print('Best performance:', grid.best_score_)

Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}
Best performance: 0.98
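
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below rebuilds the same two parameter lists with the standard library and counts candidates and cross-validation fits; it does not call scikit-learn, and `n_folds` simply mirrors the `cv=5` setting used above.

```python
from itertools import combinations, product

all_comb = [c for size in range(1, 5) for c in combinations(range(4), r=size)]
n_neighbors = list(range(1, 11))

# One candidate per (column subset, n_neighbors) pair.
candidates = list(product(all_comb, n_neighbors))
n_folds = 5  # matches cv=5 in the grid search above

print(len(candidates))            # 150
print(len(candidates) * n_folds)  # 750
```

So the search evaluates 150 candidate settings with 750 model fits during cross-validation, plus one final refit on the full data under GridSearchCV's default `refit=True`.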

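Example 3 - Scaling of a Subset of Features in a scikit-learn Pipeline

The following example illustrates how the ColumnSelector can be combined with scikit-learn's FeatureUnion inside a Pipeline to scale only certain features of a dataset (in this toy example, only the first and second features).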

from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data


X, y = iris_data()

scale_pipe = make_pipeline(ColumnSelector(cols=(0, 1)),
                           MinMaxScaler())

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('col_1-2', scale_pipe),
        ('col_3-4', ColumnSelector(cols=(2, 3)))
    ])),
    ('clf', KNeighborsClassifier())
])


pipeline.fit(X, y)

Pipeline(memory=None,
     steps=[('feats', FeatureUnion(n_jobs=None,
       transformer_list=[('col_1-2', Pipeline(memory=None,
     steps=[('columnselector', ColumnSelector(cols=(0, 1), drop_axis=False)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])), ('col_3-4', ColumnSelector(cols=(2, 3), drop_axis=Fa...ki',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'))])
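
Conceptually, the union above scales the first two columns, passes the last two through unchanged, and stacks the two results side by side. A minimal NumPy sketch of that data flow follows. It illustrates the idea on assumed toy data and is not how FeatureUnion is implemented; `min_max_scale` is a hypothetical helper.

```python
import numpy as np


def min_max_scale(A):
    """Hypothetical helper: rescale each column of A to the range [0, 1]."""
    col_min = A.min(axis=0)
    col_max = A.max(axis=0)
    return (A - col_min) / (col_max - col_min)


# Toy data standing in for a four-feature dataset.
X_demo = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [6.3, 3.3, 6.0, 2.5]])

scaled = min_max_scale(X_demo[:, [0, 1]])  # corresponds to branch 'col_1-2'
untouched = X_demo[:, [2, 3]]              # corresponds to branch 'col_3-4'
X_out = np.hstack([scaled, untouched])     # the concatenation step

print(X_out.shape)  # (3, 4)
```

The classifier at the end of the pipeline then sees all four columns, of which only the first two have been rescaled.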

API

ColumnSelector(cols=None, drop_axis=False)

Object for selecting specific columns from a data set.

Parameters

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/
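
The `drop_axis` flag in the signature matters when a single column is selected. As we understand the mlxtend documentation, `drop_axis=True` then returns an array of shape `(n_samples,)` instead of `(n_samples, 1)`. The NumPy sketch below mimics that behavior with a hypothetical `select` helper; it is an assumption-based illustration, not the library code.

```python
import numpy as np


def select(X, cols, drop_axis=False):
    """Hypothetical stand-in for the selection logic, for illustration only."""
    out = X[:, list(cols)]
    # Collapse the trailing axis only when exactly one column was selected.
    if drop_axis and out.shape[-1] == 1:
        out = out.reshape(-1)
    return out


X_demo = np.arange(12, dtype=float).reshape(3, 4)

print(select(X_demo, cols=(1,)).shape)                  # (3, 1)
print(select(X_demo, cols=(1,), drop_axis=True).shape)  # (3,)
```

A 1D result is convenient when the next step expects a flat sequence rather than a single-column matrix.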

Methods


fit(X, y=None)

Mock method. Does nothing.

Parameters

Returns

self


fit_transform(X, y=None)

Return a slice of the input array.

Parameters

Returns


get_params(deep=True)

Get parameters for this estimator.

Parameters

Returns


set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self
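
The `<component>__<parameter>` naming can be tried out on a small pipeline. The example below uses only scikit-learn classes to keep it short; the same convention produces the `columnselector__cols` key used in the grid search of Example 2.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# so the classifier's parameter is addressed as <step>__<parameter>.
returned = pipe.set_params(kneighborsclassifier__n_neighbors=3)

print(returned is pipe)  # True
print(pipe.get_params()['kneighborsclassifier__n_neighbors'])  # 3
```

Since `set_params` returns the estimator itself, calls can be chained.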


transform(X, y=None)

Return a slice of the input array.

Parameters

Returns