ColumnSelector: scikit-learn utility for selecting specific columns in a pipeline
Implementation of a column selector class for scikit-learn pipelines.
from mlxtend.feature_selection import ColumnSelector
Overview
The ColumnSelector can be used for "manual" feature selection, e.g., as part of a grid search via a scikit-learn pipeline.
Example 1 - Fitting an Estimator on a Feature Subset
Load a simple benchmark dataset:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
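The ColumnSelector is a simple transformer class that selects specific columns (features) from a dataset. For example, the transform method returns a reduced dataset that contains only two features (here, the first two features, selected via the indices 0 and 1):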
from mlxtend.feature_selection import ColumnSelector
col_selector = ColumnSelector(cols=(0, 1))
# col_selector.fit(X) is optional; it does nothing
col_selector.transform(X).shape
(150, 2)
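As a rough cross-check, the selection above gives the same shape as plain NumPy integer-array indexing. The sketch below uses only NumPy and a hypothetical stand-in array, X_demo, holding the first five iris samples. It illustrates the idea and is not a statement about how ColumnSelector is implemented.

```python
import numpy as np

# Hypothetical stand-in for X: the first five iris samples (4 features each).
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

cols = (0, 1)
# Integer-array indexing keeps every row and only the requested columns.
X_subset = X_demo[:, list(cols)]
print(X_subset.shape)  # (5, 2)
```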
The ColumnSelector works with both NumPy arrays and pandas DataFrames:
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
col_selector = ColumnSelector(cols=("sepal length (cm)", "sepal width (cm)"))
col_selector.transform(iris_df).shape
(150, 2)
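For comparison, selecting columns by label directly in pandas yields an array of the same shape. The sketch below rebuilds only the five rows shown in the table above as a hypothetical stand-in, iris_head. It shows the equivalent pandas operation and makes no claim about ColumnSelector's internals.

```python
import pandas as pd

# Hypothetical stand-in for iris_df: the five rows shown in the table above.
iris_head = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal width (cm)": [0.2, 0.2, 0.2, 0.2, 0.2],
    }
)

cols = ["sepal length (cm)", "sepal width (cm)"]
# Label-based selection returns a DataFrame; .to_numpy() converts it to an array.
subset = iris_head.loc[:, cols].to_numpy()
print(subset.shape)  # (5, 2)
```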
Similarly, we can use the ColumnSelector as part of a scikit-learn Pipeline:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(cols=(0, 1)),
                     KNeighborsClassifier())
pipe.fit(X, y)
pipe.score(X, y)
0.84
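Example 2 - Feature Selection via GridSearch
Example 1 showed a simple usage of the ColumnSelector; however, selecting columns from a dataset is trivial and does not require a dedicated transformer class, since we could obtain the same result via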
classifier.fit(X[:, :2], y)
classifier.score(X[:, :2], y)
However, the ColumnSelector becomes very useful when feature selection is part of a grid search, as shown in this example.
Load a simple benchmark dataset:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
Create all possible combinations:
from itertools import combinations
all_comb = []
for size in range(1, 5):
    all_comb += list(combinations(range(X.shape[1]), r=size))
print(all_comb)
[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]
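To get a feel for the cost of searching over these subsets, the sketch below counts them and estimates the number of cross-validation fits. It assumes the grid used in this example (ten n_neighbors values and cv=5). The variable names are illustrative only.

```python
from itertools import combinations

n_features = 4  # iris has four feature columns

# All non-empty column subsets, built the same way as all_comb.
all_comb = [
    subset
    for size in range(1, n_features + 1)
    for subset in combinations(range(n_features), r=size)
]

n_subsets = len(all_comb)            # 2**4 - 1 non-empty subsets
n_neighbors_values = 10              # assumed: n_neighbors in 1..10
n_folds = 5                          # assumed: 5-fold cross-validation

n_candidates = n_subsets * n_neighbors_values
n_cv_fits = n_candidates * n_folds   # excludes the final refit on all data

print(n_subsets, n_candidates, n_cv_fits)  # 15 150 750
```

Because the number of subsets grows as 2**n - 1, this exhaustive approach is only practical for a small number of features.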
Feature and model selection via grid search:
from mlxtend.feature_selection import ColumnSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(),
                     KNeighborsClassifier())
param_grid = {'columnselector__cols': all_comb,
              'kneighborsclassifier__n_neighbors': list(range(1, 11))}
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)
print('Best performance:', grid.best_score_)
Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}
Best performance: 0.98
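The winning column indices are easier to interpret once mapped back to feature names. The sketch below does this with the best parameters printed above, copied into a hypothetical best_params dictionary so that it runs without refitting the grid search. In practice, grid.best_params_ and iris.feature_names would take the place of the hard-coded values.

```python
# Feature names as reported by load_iris().feature_names.
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Copied from the printed output above (stand-in for grid.best_params_).
best_params = {
    "columnselector__cols": (2, 3),
    "kneighborsclassifier__n_neighbors": 1,
}

best_cols = best_params["columnselector__cols"]
selected = [feature_names[i] for i in best_cols]
print(selected)  # ['petal length (cm)', 'petal width (cm)']
```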
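Example 3 - Scaling of a Subset of Features in a scikit-learn Pipeline
The following example illustrates how to use the ColumnSelector in a Pipeline together with scikit-learn's FeatureUnion to scale only certain features of a dataset (in this example, the first and second features).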
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data
X, y = iris_data()
scale_pipe = make_pipeline(ColumnSelector(cols=(0, 1)),
                           MinMaxScaler())
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('col_1-2', scale_pipe),
        ('col_3-4', ColumnSelector(cols=(2, 3)))
    ])),
    ('clf', KNeighborsClassifier())
])
pipeline.fit(X, y)
Pipeline(memory=None,
steps=[('feats', FeatureUnion(n_jobs=None,
transformer_list=[('col_1-2', Pipeline(memory=None,
steps=[('columnselector', ColumnSelector(cols=(0, 1), drop_axis=False)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])), ('col_3-4', ColumnSelector(cols=(2, 3), drop_axis=Fa...ki',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform'))])
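To make the effect of the 'feats' step concrete, the sketch below reproduces the data flow with plain NumPy. It min-max scales columns 0 and 1, leaves columns 2 and 3 untouched, and concatenates the two blocks side by side. X_demo (the first five iris samples) and min_max_scale are hypothetical stand-ins, not part of mlxtend or scikit-learn.

```python
import numpy as np

# Hypothetical stand-in for X: the first five iris samples.
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])


def min_max_scale(block):
    """Rescale each column to [0, 1]; assumes no column is constant."""
    col_min = block.min(axis=0)
    col_max = block.max(axis=0)
    return (block - col_min) / (col_max - col_min)


scaled_part = min_max_scale(X_demo[:, [0, 1]])  # 'col_1-2' branch: select, then scale
raw_part = X_demo[:, [2, 3]]                    # 'col_3-4' branch: select only
X_union = np.hstack([scaled_part, raw_part])    # side-by-side concatenation

print(X_union.shape)  # (5, 4)
```

The classifier at the end of the pipeline therefore sees all four features, with only the first two rescaled.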
API
ColumnSelector(cols=None, drop_axis=False)
Object for selecting specific columns from a data set.
Parameters
- cols : array-like (default: None)
  A list specifying the feature indices to be selected. For example, [1, 4, 5] to select the 2nd, 5th, and 6th feature columns. If None, returns all columns in the array.
- drop_axis : bool (default=False)
  Drops the last axis if True and only one column is selected. This is useful, e.g., when the ColumnSelector is used for selecting only one column and the resulting array should be fed to, e.g., a scikit-learn column selector. For example, instead of returning an array with shape (n_samples, 1), drop_axis=True will return an array with shape (n_samples,).
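The shape difference described for drop_axis can be illustrated with plain NumPy. The sketch below uses a hypothetical 4x3 array and mimics the two documented output shapes. It does not call ColumnSelector and should not be read as its implementation.

```python
import numpy as np

# Hypothetical data: 4 samples, 3 features.
X_demo = np.arange(12.0).reshape(4, 3)

# Selecting one column with a list keeps the trailing axis,
# matching the shape documented for drop_axis=False.
single_2d = X_demo[:, [1]]

# Flattening removes the trailing axis,
# matching the shape documented for drop_axis=True.
single_1d = single_2d.reshape(-1)

print(single_2d.shape, single_1d.shape)  # (4, 1) (4,)
```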
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/
Methods
fit(X, y=None)
Mock method. Does nothing.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples] (default: None)
Returns
self
fit_transform(X, y=None)
Return a slice of the input array.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples] (default: None)
Returns
- X_slice : shape = [n_samples, k_features]
  Subset of the feature space where k_features <= n_features
get_params(deep=True)
Get parameters for this estimator.
Parameters
- deep : boolean, optional
  If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
- params : mapping of string to any
  Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns
self
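The <component>__<parameter> convention is the same one used for the grid-search keys in Example 2. The sketch below shows it on a small scikit-learn pipeline without mlxtend, so it only assumes scikit-learn is installed. The step name kneighborsclassifier is the one make_pipeline derives from the lowercased class name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Update a parameter of the nested KNeighborsClassifier step.
returned = pipe.set_params(kneighborsclassifier__n_neighbors=3)

# get_params() exposes the nested value under the same double-underscore key.
print(pipe.get_params()["kneighborsclassifier__n_neighbors"])  # 3
```

Because set_params returns the estimator itself, calls can be chained, e.g. pipe.set_params(...).fit(X, y).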
transform(X, y=None)
Return a slice of the input array.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples] (default: None)
Returns
- X_slice : shape = [n_samples, k_features]
  Subset of the feature space where k_features <= n_features
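The fit, transform, and fit_transform contract described above can be approximated by a few lines of NumPy. MiniColumnSelector below is a hypothetical, simplified stand-in written for illustration. It is not mlxtend's implementation, handles only integer column indices on array-like input, and omits drop_axis, get_params, and set_params.

```python
import numpy as np


class MiniColumnSelector:
    """Hypothetical, simplified stand-in; NOT mlxtend's implementation."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn: column selection is stateless.
        return self

    def transform(self, X, y=None):
        X = np.asarray(X)
        if self.cols is None:
            return X  # no selection requested: return every column
        return X[:, list(self.cols)]

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


# Hypothetical data: 4 samples, 3 features.
X_demo = np.arange(12.0).reshape(4, 3)
X_slice = MiniColumnSelector(cols=(0, 2)).fit_transform(X_demo)
print(X_slice.shape)  # (4, 2)
```

Since fit has nothing to estimate, transform can be called on its own, which is consistent with fit being described as a mock method.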