Sequential Feature Selection: popular forward and backward feature selection approaches, including floating variants

Implementation of sequential feature algorithms (SFAs) -- greedy search algorithms -- that have been developed as a suboptimal solution to the computationally often infeasible exhaustive search.

> from mlxtend.feature_selection import SequentialFeatureSelector

Overview

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features most relevant to the problem. The goal of feature selection is two-fold: we want to improve computational efficiency and reduce the model's generalization error by removing irrelevant features or noise. In addition, wrapper approaches such as sequential feature selection are advantageous when embedded feature selection -- for example, a regularization penalty like LASSO -- is not applicable.

In a nutshell, sequential feature selection algorithms add or remove one feature at a time based on the classifier performance until a feature subset of the desired size k is reached. There are four different flavors of sequential feature selection available via the SequentialFeatureSelector:

  1. Sequential Forward Selection (SFS)
  2. Sequential Backward Selection (SBS)
  3. Sequential Forward Floating Selection (SFFS)
  4. Sequential Backward Floating Selection (SBFS)

The floating variants, SFFS and SBFS, can be considered extensions to the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step to remove features once they were included (or excluded), so that a larger number of feature subset combinations can be sampled. It is important to emphasize that this step is conditional and only occurs if the resulting feature subset is assessed as "better" by the criterion function after removal (or addition) of a particular feature. Furthermore, I added an optional check to skip the conditional exclusion steps if the algorithm gets stuck in cycles.


How is this different from recursive feature elimination (RFE) -- e.g., as implemented in sklearn.feature_selection.RFE? RFE is computationally less complex, as it eliminates features recursively based on the feature weight coefficients (e.g., of linear models) or feature importances (of tree-based algorithms), whereas SFS eliminates (or adds) features based on a user-defined classifier/regression performance metric.
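As a minimal illustration of this difference (a sketch for comparison purposes, not taken from the mlxtend documentation), both selectors can be fit side by side on the same data: RFE ranks features via the fitted model's coefficients, whereas SFS evaluates candidate subsets via a cross-validated score.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# RFE: recursively drops the feature with the smallest |coefficient|
rfe = RFE(model, n_features_to_select=2).fit(X, y)
print('RFE mask:', rfe.support_)

# SFS: greedily adds the feature that maximizes cross-validated accuracy
sfs = SFS(model, k_features=2, forward=True, scoring='accuracy', cv=5).fit(X, y)
print('SFS subset:', sfs.k_feature_idx_)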

Tutorial Video

<iframe width="560" height="315" src="https://www.youtube.com/embed/0vCXcGJg5Bo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


Visual Illustration

A visual illustration of the sequential backward selection process is provided below, taken from the paper

Algorithm Details

Sequential Forward Selection (SFS)

Input: $Y = \{y_1, y_2, ..., y_d\}$

Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

Initialization: $X_0 = \emptyset$, $k = 0$

Step 1 (Inclusion):

$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1

Termination: $k = p$
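The inclusion step above translates almost directly into code. Below is a minimal, illustrative sketch (not the library's implementation), assuming J is any user-supplied callable that scores a set of feature indices, e.g., cross-validated accuracy:

def sequential_forward_selection(J, d, p):
    """Greedy SFS: grow a feature subset one feature at a time.

    J: callable mapping a set of feature indices to a score
    d: total number of features, p: desired subset size
    """
    X_k = set()                          # X_0 = empty set, k = 0
    while len(X_k) < p:                  # termination: k = p
        # Step 1 (inclusion): x+ = arg max J(X_k + x), x in Y - X_k
        best = max((x for x in range(d) if x not in X_k),
                   key=lambda x: J(X_k | {x}))
        X_k = X_k | {best}               # X_{k+1} = X_k + x+
    return X_k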

Sequential Backward Selection (SBS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

Initialization: $X_0 = Y$, $k = d$

Step 1 (Exclusion):

$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1

Termination: $k = p$

Sequential Backward Floating Selection (SBFS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

Initialization: $X_0 = Y$, $k = d$

Step 1 (Exclusion):

$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 2

Step 2 (Conditional Inclusion):

$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
if $J(X_k + x) > J(X_k)$:
    $X_{k+1} = X_k + x^+$
    $k = k + 1$
Go to Step 1

Termination: $k = p$

Sequential Forward Floating Selection (SFFS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

Output: a subset of features, $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

Initialization: $X_0 = \emptyset$, $k = 0$

Step 1 (Inclusion):

$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 2

Step 2 (Conditional Exclusion):

$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
if $J(X_k - x) > J(X_k)$:
    $X_{k-1} = X_k - x^-$
    $k = k - 1$
Go to Step 1

Termination: stop when $k$ equals the desired number of features.
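To make the floating logic concrete, here is a minimal, illustrative sketch of SFFS (again not the library's implementation, with J an arbitrary subset-scoring callable as above). Note that the conditional exclusion only fires when removing a feature strictly improves the criterion:

def sequential_forward_floating_selection(J, d, p):
    X_k = set()
    while len(X_k) < p:
        # Step 1 (inclusion)
        best = max((x for x in range(d) if x not in X_k),
                   key=lambda x: J(X_k | {x}))
        X_k = X_k | {best}
        # Step 2 (conditional exclusion): remove features again while
        # that strictly improves J; skipping the just-added feature is
        # one simple guard against the cycles mentioned above
        while len(X_k) > 2:
            worst = max(X_k - {best}, key=lambda x: J(X_k - {x}))
            if J(X_k - {worst}) > J(X_k):
                X_k = X_k - {worst}
            else:
                break
    return X_k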

References

Example 1 - A simple Sequential Forward Selection example

Initializing a simple classifier from scikit-learn:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)

We start by selecting the "3 best" features from the Iris dataset via Sequential Forward Selection (SFS). Here, we set forward=True and floating=False. By choosing cv=0, we don't perform any cross-validation; the performance (here: 'accuracy') is therefore computed entirely on the training set.

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334

Via the subsets_ attribute, we can take a look at the selected feature indices at each step:

sfs1.subsets_

{1: {'feature_idx': (3,),
  'cv_scores': array([0.96]),
  'avg_score': 0.96,
  'feature_names': ('3',)},
 2: {'feature_idx': (2, 3),
  'cv_scores': array([0.97333333]),
  'avg_score': 0.9733333333333334,
  'feature_names': ('2', '3')},
 3: {'feature_idx': (1, 2, 3),
  'cv_scores': array([0.97333333]),
  'avg_score': 0.9733333333333334,
  'feature_names': ('1', '2', '3')}}

Furthermore, we can directly access the indices of the 3 best features via the k_feature_idx_ attribute:

sfs1.k_feature_idx_

(1, 2, 3)

Finally, the prediction score for these 3 features can be accessed via k_score_:

sfs1.k_score_

0.9733333333333334

Feature Names

When working with large datasets, the feature indices can be hard to interpret. In this case, we recommend using a pandas DataFrame with distinct column names as input:

import pandas as pd

df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()

   Sepal length  Sepal width  Petal length  Petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
sfs1 = sfs1.fit(df_X, y)

print('Best accuracy score: %.2f' % sfs1.k_score_)
print('Best subset (indices):', sfs1.k_feature_idx_)
print('Best subset (corresponding names):', sfs1.k_feature_names_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best accuracy score: 0.97
Best subset (indices): (1, 2, 3)
Best subset (corresponding names): ('Sepal width', 'Petal length', 'Petal width')


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334

Example 2 - Toggling between SFS, SBS, SFFS, and SBFS

Using the forward and floating parameters, we can toggle between SFS, SBS, SFFS, and SBFS as shown below. Note that we are performing (stratified) 4-fold cross-validation for more robust estimates, in contrast to Example 1. Via n_jobs=-1, we choose to run the cross-validation on all our available CPU cores.

# Sequential Forward Selection
sfs = SFS(knn, 
          k_features=3, 
          forward=True, 
          floating=False, 
          scoring='accuracy',
          cv=4,
          n_jobs=-1)
sfs = sfs.fit(X, y)

print('\nSequential Forward Selection (k=3):')
print(sfs.k_feature_idx_)
print('CV Score:')
print(sfs.k_score_)

###################################################

# Sequential Backward Selection
sbs = SFS(knn, 
          k_features=3, 
          forward=False, 
          floating=False, 
          scoring='accuracy',
          cv=4,
          n_jobs=-1)
sbs = sbs.fit(X, y)

print('\nSequential Backward Selection (k=3):')
print(sbs.k_feature_idx_)
print('CV Score:')
print(sbs.k_score_)

###################################################

# Sequential Forward Floating Selection
sffs = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=True, 
           scoring='accuracy',
           cv=4,
           n_jobs=-1)
sffs = sffs.fit(X, y)

print('\nSequential Forward Floating Selection (k=3):')
print(sffs.k_feature_idx_)
print('CV Score:')
print(sffs.k_score_)

###################################################

# Sequential Backward Floating Selection
sbfs = SFS(knn, 
           k_features=3, 
           forward=False, 
           floating=True, 
           scoring='accuracy',
           cv=4,
           n_jobs=-1)
sbfs = sbfs.fit(X, y)

print('\nSequential Backward Floating Selection (k=3):')
print(sbfs.k_feature_idx_)
print('CV Score:')
print(sbfs.k_score_)

Sequential Forward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088

Sequential Backward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088

Sequential Forward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088

Sequential Backward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088

In this simple scenario, selecting the best 3 features out of the 4 available features in the Iris dataset, we end up with similar results regardless of which sequential selection algorithm we use.

Example 3 - Visualizing the results in DataFrames

For our convenience, we can visualize the output from the feature selection in a pandas DataFrame format using the get_metric_dict method of the SequentialFeatureSelector object. The columns std_dev and std_err represent the standard deviation and standard errors of the cross-validation scores, respectively.

Below, we see the DataFrame of the Sequential Forward Selector from Example 2:

import pandas as pd
pd.DataFrame.from_dict(sfs.get_metric_dict()).T

  feature_idx  cv_scores                                          avg_score  feature_names  ci_bound  std_dev   std_err
1 (3,)         [0.9736842105263158, 0.9473684210526315, 0.918...  0.959993   (3,)           0.048319  0.030143  0.017403
2 (2, 3)       [0.9736842105263158, 0.9473684210526315, 0.918...  0.959993   (2, 3)         0.048319  0.030143  0.017403
3 (1, 2, 3)    [0.9736842105263158, 1.0, 0.9459459459459459, ...  0.973151   (1, 2, 3)      0.030639  0.019113  0.011035

Now, let's compare it to the Sequential Backward Selector:

pd.DataFrame.from_dict(sbs.get_metric_dict()).T

  feature_idx   cv_scores                                          avg_score  feature_names  ci_bound  std_dev   std_err
4 (0, 1, 2, 3)  [0.9736842105263158, 0.9473684210526315, 0.918...  0.953236   (0, 1, 2, 3)   0.03602   0.022471  0.012974
3 (1, 2, 3)     [0.9736842105263158, 1.0, 0.9459459459459459, ...  0.973151   (1, 2, 3)      0.030639  0.019113  0.011035

We can see that SFS and SBS found the same "best" 3 features; however, the intermediate steps were obviously different.

The ci_bound column in the DataFrames above represents the confidence interval around the computed cross-validation scores. By default, a 95% confidence interval is used, but we can choose different confidence bounds via the confidence_interval parameter. E.g., the confidence bounds for a 90% confidence interval can be obtained as follows:

pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T

  feature_idx   cv_scores                                          avg_score  feature_names  ci_bound  std_dev   std_err
4 (0, 1, 2, 3)  [0.9736842105263158, 0.9473684210526315, 0.918...  0.953236   (0, 1, 2, 3)   0.027658  0.022471  0.012974
3 (1, 2, 3)     [0.9736842105263158, 1.0, 0.9459459459459459, ...  0.973151   (1, 2, 3)      0.023525  0.019113  0.011035

Example 4 - Plotting the results

After importing the little helper function plotting.plot_sequential_feature_selection, we can also visualize the results using matplotlib figures.

from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

sfs = SFS(knn, 
          k_features=4, 
          forward=True, 
          floating=False, 
          scoring='accuracy',
          verbose=2,
          cv=5)

sfs = sfs.fit(X, y)

fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')

plt.ylim([0.8, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2023-05-17 08:36:18] Features: 1/4 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished

[2023-05-17 08:36:18] Features: 2/4 -- score: 0.9666666666666668[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:36:18] Features: 3/4 -- score: 0.9533333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished

[2023-05-17 08:36:18] Features: 4/4 -- score: 0.9733333333333334


Example 5 - Sequential Feature Selection for Regression

Similar to the classification examples above, the SequentialFeatureSelector also supports scikit-learn's estimators for regression.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target

lr = LinearRegression()

sfs = SFS(lr, 
          k_features=8, 
          forward=True, 
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=10)

sfs = sfs.fit(X, y)
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')

plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()


Example 6 -- Feature selection with fixed train/validation splits

If you do not wish to use cross-validation (here: k-fold cross-validation, i.e., rotating training and validation folds), you can use the PredefinedHoldoutSplit class to specify your own, fixed training and validation split.

from sklearn.datasets import load_iris
from mlxtend.evaluate import PredefinedHoldoutSplit
import numpy as np


iris = load_iris()
X = iris.data
y = iris.target

rng = np.random.RandomState(123)
my_validation_indices = rng.permutation(np.arange(150))[:30]
print(my_validation_indices)

[ 72 112 132  88  37 138  87  42   8  90 141  33  59 116 135 104  36  13
  63  45  28 133  24 127  46  20  31 121 117   4]
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS



knn = KNeighborsClassifier(n_neighbors=4)
piter = PredefinedHoldoutSplit(my_validation_indices)

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=piter)

sfs1 = sfs1.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2023-05-17 08:36:19] Features: 1/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished

[2023-05-17 08:36:19] Features: 2/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:36:19] Features: 3/3 -- score: 0.9666666666666667

Example 7 -- Using the selected feature subset for making new predictions

# Initialize the dataset

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.33, random_state=1)

knn = KNeighborsClassifier(n_neighbors=4)

# Select the "best" three features via
# 5-fold cross-validation on the training set.

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)
sfs1 = sfs1.fit(X_train, y_train)

print('Selected features:', sfs1.k_feature_idx_)

Selected features: (1, 2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, sfs1.k_feature_idx_]

X_train_sfs = sfs1.transform(X_train)
X_test_sfs = sfs1.transform(X_test)

# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_sfs, y_train)
y_pred = knn.predict(X_test_sfs)

# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))

Test set accuracy: 96.00 %

Example 8 - Sequential feature selection and GridSearch

In the following example, we are tuning the SFS's estimator using GridSearch. To avoid unwanted behavior or side effects, it's advised to use the estimator inside and outside of the SFS as separate instances.

# Initialize the dataset

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=123)

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

knn1 = KNeighborsClassifier()
knn2 = KNeighborsClassifier()

sfs1 = SFS(estimator=knn1, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)

pipe = Pipeline([('sfs', sfs1), 
                 ('knn2', knn2)])

param_grid = {
    'sfs__k_features': [1, 2, 3],
    'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn
    'knn2__n_neighbors': [3, 4, 7] # outer knn
  }

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  refit=False)

# run the grid search
gs = gs.fit(X_train, y_train)

Let's take a look at the suggested hyperparameters below:

For each candidate setting, we can print the parameters stored in the cross-validation results together with the corresponding mean test accuracy.
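A sketch of that loop, using scikit-learn's standard cv_results_ keys ('params' and 'mean_test_score'), could look as follows:

for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])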

The "best" parameters determined via GridSearch are...

print("Best parameters via GridSearch", gs.best_params_)

Best parameters via GridSearch {'knn2__n_neighbors': 7, 'sfs__estimator__n_neighbors': 3, 'sfs__k_features': 3}
pipe.set_params(**gs.best_params_).fit(X_train, y_train)

Pipeline(steps=[('sfs',
                 SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
                                           k_features=(3, 3),
                                           scoring='accuracy')),
                ('knn2', KNeighborsClassifier(n_neighbors=7))])

Example 9 -- Selecting the "best" feature combination in a k-range

If k_features is set to a tuple (min_k, max_k) (new in 0.4.2), the SFS will now select the best feature combination that it discovered by iterating from k=1 to max_k (forward), or from max_k to min_k (backward). The size of the returned feature subset is then within max_k and min_k, depending on which combination scored best during cross-validation.

X.shape

(150, 4)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = wine_data()
X_train, X_test, y_train, y_test= train_test_split(X, y, 
                                                   stratify=y,
                                                   test_size=0.3,
                                                   random_state=1)

knn = KNeighborsClassifier(n_neighbors=2)

sfs1 = SFS(estimator=knn, 
           k_features=(3, 10),
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)

pipe = make_pipeline(StandardScaler(), sfs1)

pipe.fit(X_train, y_train)

print('best combination (ACC: %.3f): %s\n' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)
plot_sfs(sfs1.get_metric_dict(), kind='std_err');

best combination (ACC: 0.992): (0, 1, 2, 3, 6, 8, 9, 10, 11, 12)

all subsets:
 {1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8  , 0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx': (6, 9), 'cv_scores': array([0.92      , 0.88      , 1.        , 0.96      , 0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6', '9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92      , 0.92      , 0.96      , 1.        , 0.95833333]), 'avg_score': 0.9516666666666665, 'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12), 'cv_scores': array([0.96      , 0.96      , 0.96      , 1.        , 0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6', '9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1.  , 1.  , 1.  ]), 'avg_score': 0.976, 'feature_names': ('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1.  , 0.96, 1.  ]), 'avg_score': 0.968, 'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0, 2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1.  , 1.  , 1.  ]), 'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10', '12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores': array([1.  , 0.92, 1.  , 1.  , 1.  ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1.  , 0.92, 1.  , 1.  , 1.  ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1.  , 0.96, 1.  , 1.  , 1.  ]), 'avg_score': 0.992, 'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}


Example 10 -- Using other cross-validation schemes

In addition to standard k-fold and stratified k-fold, other cross-validation schemes can be used with SequentialFeatureSelector, for example, GroupKFold or LeaveOneOut cross-validation from scikit-learn.

Using GroupKFold with SequentialFeatureSelector

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import GroupKFold
import numpy as np

X, y = iris_data()
groups = np.arange(len(y)) // 10
print('groups: {}'.format(groups))

groups: [ 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  2  2  2  2
  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9
  9  9  9  9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
 14 14 14 14 14 14]

Calling the split() method of a scikit-learn cross-validator object will return a generator that yields train, test splits.

cv_gen = GroupKFold(4).split(X, y, groups)
cv_gen

<generator object _BaseKFold.split at 0x17c109580>

SequentialFeatureSelectorcv 参数必须是一个 int 或一个可迭代的对象,该对象产生训练和测试的划分。这个可迭代对象可以通过将训练、测试划分生成器传递给内置的 list() 函数来构造。

cv = list(cv_gen)

knn = KNeighborsClassifier(n_neighbors=2)
sfs = SFS(estimator=knn, 
          k_features=2,
          scoring='accuracy',
          cv=cv)

sfs.fit(X, y)

print('best combination (ACC: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))

best combination (ACC: 0.940): (2, 3)

Example 11 - Interrupting Long Runs for Intermediate Results

If your run is taking too long, it is possible to trigger a KeyboardInterrupt (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results.

Toy dataset

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


X, y = make_classification(
    n_samples=20000,
    n_features=500,
    n_informative=10,
    n_redundant=40,
    n_repeated=25,
    n_clusters_per_class=5,
    flip_y=0.05,
    class_sep=0.5,
    random_state=123,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

Long run with interruption

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

sfs1 = SFS(model, 
           k_features=10, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=5)

sfs1 = sfs1.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    8.3s finished

[2023-05-17 08:36:32] Features: 1/10 -- score: 0.5965[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 499 out of 499 | elapsed:   13.8s finished

[2023-05-17 08:36:45] Features: 2/10 -- score: 0.6256875000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 498 out of 498 | elapsed:   18.1s finished

[2023-05-17 08:37:03] Features: 3/10 -- score: 0.642[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 497 out of 497 | elapsed:   20.4s finished

[2023-05-17 08:37:24] Features: 4/10 -- score: 0.6463125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 496 out of 496 | elapsed:   22.2s finished

[2023-05-17 08:37:46] Features: 5/10 -- score: 0.6495000000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 495 out of 495 | elapsed:   26.1s finished

[2023-05-17 08:38:12] Features: 6/10 -- score: 0.6514374999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 494 out of 494 | elapsed:   26.1s finished

[2023-05-17 08:38:38] Features: 7/10 -- score: 0.6533749999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 493 out of 493 | elapsed:   25.3s finished

[2023-05-17 08:39:04] Features: 8/10 -- score: 0.6545624999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 492 out of 492 | elapsed:   26.3s finished

[2023-05-17 08:39:30] Features: 9/10 -- score: 0.6549375[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 491 out of 491 | elapsed:   27.0s finished

[2023-05-17 08:39:57] Features: 10/10 -- score: 0.6554374999999999

Finalizing the fit

Note that the feature selection run hasn't finished, so certain attributes may not be available. In order to use the SFS instance, it is recommended to call finalize_fit, which will make the SFS estimator appear as "fitted" and process the temporary results:

sfs1.finalize_fit()

print(sfs1.k_feature_idx_)
print(sfs1.k_score_)

(30, 128, 144, 160, 184, 229, 256, 356, 439, 458)
0.6554374999999999

Example 12 - Using Pandas DataFrames

Optionally, we can also use pandas DataFrames and pandas Series as input to the fit function. In this case, the column names of the pandas DataFrame will be used as feature names. Note, however, that if custom_feature_names are provided in the fit function, these custom_feature_names take precedence over the DataFrame column-based feature names.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'petal width'])
X_df.head()

   sepal len  petal len  sepal width  petal width
0        5.1        3.5          1.4          0.2
1        4.9        3.0          1.4          0.2
2        4.7        3.2          1.3          0.2
3        4.6        3.1          1.5          0.2
4        5.0        3.6          1.4          0.2

Also, the target array, y, can be optionally cast as a pandas Series:

y_series = pd.Series(y)
y_series.head()

0    0
1    0
2    0
3    0
4    0
dtype: int64
sfs1 = sfs1.fit(X_df, y_series)

Note that the only difference when passing a pandas DataFrame as input is that the sfs1.subsets_ dictionary will now contain the DataFrame column names as feature names:

sfs1.subsets_

{1: {'feature_idx': (3,),
  'cv_scores': array([0.96]),
  'avg_score': 0.96,
  'feature_names': ('petal width',)},
 2: {'feature_idx': (2, 3),
  'cv_scores': array([0.97333333]),
  'avg_score': 0.9733333333333334,
  'feature_names': ('sepal width', 'petal width')},
 3: {'feature_idx': (1, 2, 3),
  'cv_scores': array([0.97333333]),
  'avg_score': 0.9733333333333334,
  'feature_names': ('petal len', 'sepal width', 'petal width')}}

Note: mlxtend version >= 0.13 supports pandas DataFrames as feature input to the SequentialFeatureSelector instead of NumPy arrays or other NumPy-like array types.

Example 13 - Specifying Fixed Feature Sets

Often, it can be useful to specify a fixed set of features we want to use for a given model (e.g., determined by prior knowledge or domain expertise). Since mlxtend v0.18.0, it is now possible to specify such features via the fixed_features attribute. This means that these features are guaranteed to be included in the selected subsets.

Note that this feature works for all options regarding forward and backward selection, and whether floating selection is used or not.

The example below shows how we can set features 0 and 2 in the dataset as fixed:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(knn, 
           k_features=4, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           fixed_features=(0, 2),
           cv=3)

sfs1 = sfs1.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9733333333333333[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs1.subsets_

{2: {'feature_idx': (0, 2),
  'cv_scores': array([0.98, 0.92, 0.94]),
  'avg_score': 0.9466666666666667,
  'feature_names': ('0', '2')},
 3: {'feature_idx': (0, 2, 3),
  'cv_scores': array([0.98, 0.96, 0.98]),
  'avg_score': 0.9733333333333333,
  'feature_names': ('0', '2', '3')},
 4: {'feature_idx': (0, 1, 2, 3),
  'cv_scores': array([0.98, 0.96, 0.98]),
  'avg_score': 0.9733333333333333,
  'feature_names': ('0', '1', '2', '3')}}

If the input dataset is a pandas DataFrame, we can also use the column names directly:

import pandas as pd

X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'petal width'])
X_df.head()

   sepal len  petal len  sepal width  petal width
0        5.1        3.5          1.4          0.2
1        4.9        3.0          1.4          0.2
2        4.7        3.2          1.3          0.2
3        4.6        3.1          1.5          0.2
4        5.0        3.6          1.4          0.2
sfs2 = SFS(knn, 
           k_features=4, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           fixed_features=('sepal len', 'petal len'),
           cv=3)

sfs2 = sfs2.fit(X_df, y_series)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9466666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs2.subsets_

{2: {'feature_idx': (0, 1),
  'cv_scores': array([0.72, 0.74, 0.78]),
  'avg_score': 0.7466666666666667,
  'feature_names': ('sepal len', 'petal len')},
 3: {'feature_idx': (0, 1, 2),
  'cv_scores': array([0.98, 0.92, 0.94]),
  'avg_score': 0.9466666666666667,
  'feature_names': ('sepal len', 'petal len', 'sepal width')},
 4: {'feature_idx': (0, 1, 2, 3),
  'cv_scores': array([0.98, 0.96, 0.98]),
  'avg_score': 0.9733333333333333,
  'feature_names': ('sepal len', 'petal len', 'sepal width', 'petal width')}}

Example 14 - Working with Feature Groups

Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding -- if you want to treat the one-hot encoded features as a single feature:

In the example below, we specify sepal length and sepal width as a feature group so that they are always selected together:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = iris.data
y = iris.target

X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal wid', 'petal wid'])
X_df.head()


   sepal len  petal len  sepal wid  petal wid
0        5.1        3.5        1.4        0.2
1        4.9        3.0        1.4        0.2
2        4.7        3.2        1.3        0.2
3        4.6        3.1        1.5        0.2
4        5.0        3.6        1.4        0.2
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

knn = KNeighborsClassifier(n_neighbors=3)

sfs1 = SFS(knn, 
           k_features=2, 
           scoring='accuracy',
           feature_groups=(['sepal len', 'sepal wid'], ['petal len'], ['petal wid']),
           cv=3)

sfs1 = sfs1.fit(X_df, y)

If we are working with a NumPy array instead of a DataFrame, the feature groups can be specified via the column indices accordingly:

sfs1 = SFS(knn, 
           k_features=2, 
           scoring='accuracy',
           feature_groups=[[0, 2], [1], [3]],
           cv=3)

sfs1 = sfs1.fit(X, y)

Example 15 - Multiclass Metrics


Certain scoring metrics, like ROC AUC, are originally designed for binary classification. However, they can also be used in multiclass settings. It is best to consult [this scikit-learn metrics table](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) for this.

For example, we can use the ROC AUC One-vs-Rest score via "roc_auc_ovr" as shown below.



from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=4, n_features=5, random_state=0)

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='roc_auc_ovr',
           cv=0)

sfs1 = sfs1.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 1/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 2/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished

[2023-05-17 08:39:57] Features: 3/3 -- score: 1.0

API

SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)

Sequential Feature Selection for Classification and Regression.

Parameters

Attributes

Notes

(1) If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.

(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficient of the feature 2 and 3 can be different
even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

(4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.
If user is still interested in getting the best score, they can use method
`finalize_fit`.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

Methods


finalize_fit()

None


fit(X, y, groups=None, **fit_params)

Perform feature selection and learn model from training data.

Parameters

Returns


fit_transform(X, y, groups=None, **fit_params)

Fit to training data then reduce X to its most important features.

Parameters

Returns

Reduced feature subset of X, shape={n_samples, k_features}


generate_error_message_k_features(name)

None


get_metric_dict(confidence_interval=0.95)

Return metric dictionary

Parameters

Returns

Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows:
'feature_idx': tuple of the indices of the feature subset
'cv_scores': list with individual CV scores
'avg_score': average of CV scores
'std_dev': standard deviation of the CV score average
'std_err': standard error of the CV score average
'ci_bound': confidence interval bound of the CV score average


get_params(deep=True)

Get parameters for this estimator.

Parameters

Returns


set_params(**params)

Set the parameters of this estimator. Valid parameter keys can be listed with get_params().

Returns

self


transform(X)

Reduce X to its most important features.

Parameters

Returns

Reduced feature subset of X, shape={n_samples, k_features}

Properties


named_estimators

Returns

List of named estimator tuples, like [('svc', SVC(...))]
