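ColumnSelector: Scikit-learn utility for selecting specific columns in a pipeline

Implementation of a column selector class for scikit-learn pipelines.

from mlxtend.feature_selection import ColumnSelector

Overview

The ColumnSelector can be used for "manual" feature selection, e.g., as part of a grid search via a scikit-learn pipeline.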

Example 1 - Fitting an Estimator on a Feature Subset

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

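The ColumnSelector is a simple transformer class that selects specific columns (features) from a dataset. For instance, the transform method returns a reduced dataset that contains only two features (here, the first two, selected via the indices 0 and 1):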

from mlxtend.feature_selection import ColumnSelector

col_selector = ColumnSelector(cols=(0, 1))
# col_selector.fit(X) is optional and does nothing
col_selector.transform(X).shape

(150, 2)
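
For a NumPy array, selecting columns by index amounts in effect to integer-array indexing along the second axis. The sketch below assumes that equivalence and uses a small synthetic array (`X_demo`, a name introduced here for illustration) instead of the iris data. It also shows that non-adjacent columns can be picked, which a plain slice cannot express.

```python
import numpy as np

# Synthetic stand-in for a feature matrix: 5 samples, 4 features.
X_demo = np.arange(20, dtype=float).reshape(5, 4)

# Pick the 1st and 3rd feature columns by integer index.
cols = (0, 2)
X_subset = X_demo[:, list(cols)]

print(X_subset.shape)  # (5, 2)
print(X_subset[0])     # [0. 2.]
```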

The ColumnSelector works with both NumPy arrays and pandas DataFrames:

import pandas as pd

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

col_selector = ColumnSelector(cols=("sepal length (cm)", "sepal width (cm)"))
col_selector.transform(iris_df).shape

(150, 2)

Similarly, we can use the ColumnSelector as part of a scikit-learn Pipeline:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(cols=(0, 1)),
                     KNeighborsClassifier())

pipe.fit(X, y)
pipe.score(X, y)

0.84

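Example 2 - Feature Selection via Grid Search

Example 1 showed a simple use of the ColumnSelector; however, selecting columns from a dataset is trivial and does not require a dedicated transformer class, since we could achieve the same result via: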

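# Assumes X, y, and KNeighborsClassifier from Example 1 are in scope
classifier = KNeighborsClassifier()
classifier.fit(X[:, :2], y)
classifier.score(X[:, :2], y)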

However, the ColumnSelector becomes really useful when feature selection is part of a grid search, as shown in this example.
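
To see why a transformer helps here, note that GridSearchCV can only vary values that are exposed as estimator parameters. The following is a minimal sketch of a column-selecting transformer, not mlxtend's actual implementation; the class name `MinimalColumnSelector` is hypothetical and it handles NumPy arrays only. Because `cols` is a constructor argument, it becomes addressable as `minimalcolumnselector__cols` inside a pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


class MinimalColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a column selector (NumPy arrays only)."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn; the selection is fully determined by `cols`.
        return self

    def transform(self, X):
        X = np.asarray(X)
        if self.cols is None:
            return X
        return X[:, list(self.cols)]


demo_pipe = make_pipeline(MinimalColumnSelector(), KNeighborsClassifier())

# The step name is the lowercased class name, so `cols` is reachable
# through the usual <component>__<parameter> syntax.
demo_pipe.set_params(minimalcolumnselector__cols=(0, 1))
print(demo_pipe.get_params()["minimalcolumnselector__cols"])  # (0, 1)
```

A grid search can then treat the column subset like any other hyperparameter, which is not possible when the slicing happens outside the pipeline.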

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Create all possible column combinations:

from itertools import combinations

all_comb = []
for size in range(1, 5):
    all_comb += list(combinations(range(X.shape[1]), r=size))
print(all_comb)

[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]
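
As a quick sanity check on the list above: a set of n features has 2^n - 1 non-empty subsets, so four features should yield 15 combinations. The snippet below is a self-contained sketch that hard-codes `n_features = 4` in place of `X.shape[1]`.

```python
from itertools import combinations

n_features = 4  # stands in for X.shape[1] on the iris data

all_comb = []
for size in range(1, n_features + 1):
    all_comb += list(combinations(range(n_features), r=size))

# Every non-empty subset appears exactly once.
print(len(all_comb))                         # 15
print(len(all_comb) == 2 ** n_features - 1)  # True
```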

Feature and model selection via grid search:

from mlxtend.feature_selection import ColumnSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(),
                     KNeighborsClassifier())

param_grid = {'columnselector__cols': all_comb,
              'kneighborsclassifier__n_neighbors': list(range(1, 11))}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)
print('Best performance:', grid.best_score_)

Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}
Best performance: 0.98
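
It can help to estimate the cost of such a search before running it. An exhaustive grid search fits one model per parameter combination per cross-validation fold (plus, by default, one final refit on the full data). The following back-of-the-envelope sketch uses only the standard library, with the grid sizes taken from the example above; the variable names are illustrative.

```python
from itertools import combinations, product

n_features = 4
col_subsets = [c for size in range(1, n_features + 1)
               for c in combinations(range(n_features), size)]
n_neighbors_values = list(range(1, 11))
cv_folds = 5

# One candidate per (column subset, n_neighbors) pair.
candidates = list(product(col_subsets, n_neighbors_values))
total_fits = len(candidates) * cv_folds

print(len(candidates))  # 150
print(total_fits)       # 750
```

Since the number of column subsets doubles with every additional feature, this exhaustive approach is practical only for small feature counts.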

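Example 3 - Scaling of a Subset of Features in a scikit-learn Pipeline

The following example illustrates how the ColumnSelector can be combined with scikit-learn's FeatureUnion inside a Pipeline to scale only certain features of a dataset (here, the first and second feature).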

from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data


X, y = iris_data()

scale_pipe = make_pipeline(ColumnSelector(cols=(0, 1)),
                           MinMaxScaler())

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('col_1-2', scale_pipe),
        ('col_3-4', ColumnSelector(cols=(2, 3)))
    ])),
    ('clf', KNeighborsClassifier())
])


pipeline.fit(X, y)

Pipeline(memory=None,
     steps=[('feats', FeatureUnion(n_jobs=None,
       transformer_list=[('col_1-2', Pipeline(memory=None,
     steps=[('columnselector', ColumnSelector(cols=(0, 1), drop_axis=False)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])), ('col_3-4', ColumnSelector(cols=(2, 3), drop_axis=Fa...ki',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'))])
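
Conceptually, the FeatureUnion above applies each branch to the same input and concatenates the results column-wise. The following NumPy-only sketch mimics that outcome by hand on a small synthetic array (`X_demo` is illustrative, and min-max scaling is written out explicitly rather than calling MinMaxScaler). It is meant to show the data flow, not to replace the pipeline.

```python
import numpy as np

X_demo = np.array([[5.0, 3.0, 1.4, 0.2],
                   [7.0, 2.0, 4.7, 1.4],
                   [6.0, 4.0, 6.0, 2.5]])

# Branch 1: select columns 0 and 1, then min-max scale each to [0, 1].
first = X_demo[:, [0, 1]]
col_min = first.min(axis=0)
col_max = first.max(axis=0)
scaled = (first - col_min) / (col_max - col_min)

# Branch 2: select columns 2 and 3 and pass them through unchanged.
untouched = X_demo[:, [2, 3]]

# The union stacks both branches side by side.
X_union = np.hstack([scaled, untouched])
print(X_union.shape)  # (3, 4)
```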

API

ColumnSelector(cols=None, drop_axis=False)

Object for selecting specific columns from a data set.

Parameters

cols : array-like (default: None)

A list specifying the feature indices to be selected. For example, [1, 4, 5] to select the 2nd, 5th, and 6th feature columns. For pandas DataFrames, column names can be given instead, as in Example 1. If None, returns all columns in the array.

drop_axis : bool (default=False)

Drops the last axis if True and only one column is selected. This is useful, e.g., when the ColumnSelector is used to select a single column and the resulting array should be passed to a scikit-learn component that expects a one-dimensional array. E.g., instead of returning an array with shape (n_samples, 1), drop_axis=True will return an array with shape (n_samples,).
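
The shape difference described for drop_axis can be illustrated with plain NumPy. This sketch does not call ColumnSelector; it only reproduces the two output shapes on a synthetic array (`X_demo` is an illustrative name).

```python
import numpy as np

X_demo = np.arange(12, dtype=float).reshape(6, 2)

# Selecting a single column with a list index keeps the array 2-D.
one_col_2d = X_demo[:, [1]]
print(one_col_2d.shape)  # (6, 1)

# Dropping the last axis gives the 1-D form described for drop_axis=True.
one_col_1d = one_col_2d.reshape(-1)
print(one_col_1d.shape)  # (6,)
```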

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/

Methods

fit(X, y=None)

Mock method. Does nothing.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] (default: None)

Returns

self

fit_transform(X, y=None)

Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self
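
The `<component>__<parameter>` form can be tried on any scikit-learn pipeline. Here is a short sketch with two standard scikit-learn steps; ColumnSelector is left out so that the snippet depends on scikit-learn only, and `demo_pipe` is an illustrative name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
demo_pipe.set_params(kneighborsclassifier__n_neighbors=3)

print(demo_pipe.get_params()["kneighborsclassifier__n_neighbors"])  # 3
print(demo_pipe.named_steps["kneighborsclassifier"].n_neighbors)    # 3
```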

transform(X, y=None)

Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features
