SequentialFeatureSelector: The popular forward and backward feature selection approaches (including floating variants)
Implementation of sequential feature algorithms (SFAs) -- greedy search algorithms -- that have been developed as a suboptimal solution to the computationally often infeasible exhaustive search.
> from mlxtend.feature_selection import SequentialFeatureSelector
Overview
Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem. The goal of feature selection is two-fold: we want to improve computational efficiency and reduce the model's generalization error by removing irrelevant features or noise. In addition, a wrapper approach such as sequential feature selection is advantageous if embedded feature selection -- for example, a regularization penalty like LASSO -- is not applicable.
In a nutshell, sequential feature selection algorithms add or remove one feature at a time based on the classifier performance until a feature subset of the desired size k is reached. There are four different flavors of sequential feature selection available via the SequentialFeatureSelector:
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
- Sequential Forward Floating Selection (SFFS)
- Sequential Backward Floating Selection (SBFS)
The floating variants, SFFS and SBFS, can be considered extensions to the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step to remove features once they were included (or excluded), so that a larger number of feature subset combinations can be sampled. It is important to emphasize that this step is conditional and only occurs if the resulting feature subset is assessed as "better" by the criterion function after the removal (or addition) of a particular feature. Furthermore, I added an optional check to skip the conditional exclusion steps in case the algorithm gets stuck in cycles.
How is this different from recursive feature elimination (RFE) -- e.g., as implemented in sklearn.feature_selection.RFE? RFE is computationally less complex: it eliminates features recursively based on the feature weight coefficients (e.g., of linear models) or feature importances (of tree-based algorithms), whereas SFS eliminates (or adds) features based on a user-defined classifier/regression performance metric.
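To make the contrast concrete, below is a minimal, illustrative sketch comparing the two APIs (an illustration only, assuming a linear model so that RFE can use the coefficient magnitudes; it is not part of the original example code):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# RFE recursively drops the feature with the smallest coefficient magnitude
rfe = RFE(model, n_features_to_select=2).fit(X, y)

# SFS greedily adds the feature that maximizes cross-validated accuracy
sfs = SFS(model, k_features=2, scoring='accuracy', cv=3).fit(X, y)

print(rfe.support_)        # boolean mask of the selected features
print(sfs.k_feature_idx_)  # tuple of the selected feature indices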
Tutorial Video
<iframe width="560" height="315" src="https://www.youtube.com/embed/0vCXcGJg5Bo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Visual Illustration
A visual illustration of the sequential backward selection process is provided below, from the paper
- Joe Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, and Leslie A. Kuhn (2020). Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition. Biomolecules 2020, 10, 454. https://www.mdpi.com/2218-273X/10/3/454#
Algorithm Details
Sequential Forward Selection (SFS)
Input: $Y = \{y_1, y_2, ..., y_d\}$
- The SFS algorithm takes the whole $d$-dimensional feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = \emptyset$, $k = 0$
- We initialize the algorithm with an empty set $\emptyset$ ("null set") so that $k = 0$ (where $k$ is the size of the subset).
Step 1 (Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1
- In this step, we add an additional feature, $x^+$, to our feature subset $X_k$.
- $x^+$ is the feature that maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is added to $X_k$.
- We repeat this procedure until the termination criterion is satisfied.
Termination: $k = p$
- We add features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
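To make the notation above concrete, here is a minimal, illustrative Python sketch of the SFS loop (not the mlxtend implementation); J is assumed to be a user-supplied criterion function that scores a list of feature indices, e.g., via cross-validated accuracy:
def sfs(J, d, p):
    # X_0 = empty set; greedily add the best feature until k = p
    selected = []
    remaining = set(range(d))
    while len(selected) < p:
        # Step 1 (Inclusion): pick the x that maximizes J(X_k + x)
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return selected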
Sequential Backward Selection (SBS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SBS algorithm takes the whole feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SBS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = Y$, $k = d$
- We initialize the algorithm with the given feature set so that $k = d$.
Step 1 (Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1
- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.
- We repeat this procedure until the termination criterion is satisfied.
Termination: $k = p$
- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
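Analogously, a minimal illustrative sketch of the SBS loop (same hypothetical criterion function J as in the SFS sketch above):
def sbs(J, d, p):
    # X_0 = Y; greedily remove the feature whose removal scores best
    selected = list(range(d))
    while len(selected) > p:
        # Step 1 (Exclusion): pick the x that maximizes J(X_k - x)
        worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
        selected.remove(worst)
    return selected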
Sequential Backward Floating Selection (SBFS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SBFS algorithm takes the whole feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SBFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = Y$, $k = d$
- We initialize the algorithm with the given feature set so that $k = d$.
Step 1 (Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 2
- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.
Step 2 (Conditional Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
if $J(X_k + x) > J(X_k)$:
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1
- In Step 2, we search for features that improve the classifier performance if they are added back to the feature subset. If such features exist, we add the feature $x^+$ for which the performance improvement is maximized. If $k = 2$ or an improvement cannot be made (i.e., no such feature $x^+$ can be found), go back to Step 1; else, repeat this step.
Termination: $k = p$
- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
Sequential Forward Floating Selection (SFFS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SFFS algorithm takes the whole feature set as input, if our feature space consists of, e.g., 10 dimensions ($d = 10$).
Output: a subset of features, $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- The returned output of the algorithm is a subset of the feature space of a specified size. For example, a subset of 5 features from a 10-dimensional feature space ($k = 5$, $d = 10$).
Initialization: $X_0 = \emptyset$, $k = 0$
- We initialize the algorithm with an empty set ("null set") so that $k = 0$ (where $k$ is the size of the subset).
Step 1 (Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 2
Step 2 (Conditional Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
if $J(X_k - x) > J(X_k)$:
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1
- In Step 1, we include the feature from the feature space that leads to the best performance increase for our feature subset (assessed by the criterion function). Then, we go over to Step 2.
- In Step 2, we only remove a feature if the resulting subset would gain an increase in performance. If $k = 2$ or an improvement cannot be made (i.e., no such feature $x^-$ can be found), go back to Step 1; else, repeat this step.
- Steps 1 and 2 are repeated until the termination criterion is reached.
Termination: stop when $k$ equals the number of desired features.
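The floating logic can be sketched the same way; the SFFS sketch below adds the conditional exclusion step after each inclusion (SBFS mirrors it with the roles of inclusion and exclusion swapped). Again, this is only an illustration with the same hypothetical criterion function J, not the mlxtend internals:
def sffs(J, d, p):
    selected, remaining = [], set(range(d))
    while len(selected) < p:
        # Step 1 (Inclusion)
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
        # Step 2 (Conditional Exclusion): drop features again as long as
        # doing so improves the criterion; stop once the subset is down to 2
        while len(selected) > 2:
            worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
            if J([f for f in selected if f != worst]) > J(selected):
                selected.remove(worst)
                remaining.add(worst)
            else:
                break
    return selected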
References
- Ferri, F. J., Pudil, P., Hatef, M., Kittler, J. (1994). "Comparative study of techniques for large-scale feature selection." Pattern Recognition in Practice IV: 403-413.
- Pudil, P., Novovičová, J., & Kittler, J. (1994). "Floating search methods in feature selection." Pattern Recognition Letters 15.11: 1119-1125.
Example 1 - A simple Sequential Forward Selection example
Initializing a simple classifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
We start by selecting the "3 best features" from the Iris dataset via Sequential Forward Selection (SFS). Here, we set forward=True and floating=False. By choosing cv=0, we don't perform any cross-validation; therefore, the performance (here: 'accuracy') is computed entirely on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Via the subsets_ attribute, we can take a look at the selected feature indices at each step:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('3',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('2', '3')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('1', '2', '3')}}
Furthermore, we can access the indices of the 3 best features directly via the k_feature_idx_ attribute:
sfs1.k_feature_idx_
(1, 2, 3)
Finally, the prediction score for these 3 features can be accessed via k_score_:
sfs1.k_score_
0.9733333333333334
Feature Names
When working with large datasets, the feature indices can be hard to interpret. In this case, we recommend using pandas DataFrames with distinct column names as input:
import pandas as pd
df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()
|   | Sepal length | Sepal width | Petal length | Petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs1 = sfs1.fit(df_X, y)
print('Best accuracy score: %.2f' % sfs1.k_score_)
print('Best subset (indices):', sfs1.k_feature_idx_)
print('Best subset (corresponding names):', sfs1.k_feature_names_)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best accuracy score: 0.97
Best subset (indices): (1, 2, 3)
Best subset (corresponding names): ('Sepal width', 'Petal length', 'Petal width')
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Example 2 - Toggling between SFS, SBS, SFFS, and SBFS
Using the forward and floating parameters, we can toggle between SFS, SBS, SFFS, and SBFS as shown below. Note that we are performing (stratified) 4-fold cross-validation for more robust estimates, in contrast to Example 1. Via n_jobs=-1, we choose to run the cross-validation on all our available CPU cores.
# Sequential Forward Selection
sfs = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sfs = sfs.fit(X, y)
print('\nSequential Forward Selection (k=3):')
print(sfs.k_feature_idx_)
print('CV Score:')
print(sfs.k_score_)
###################################################
# Sequential Backward Selection
sbs = SFS(knn,
k_features=3,
forward=False,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbs = sbs.fit(X, y)
print('\nSequential Backward Selection (k=3):')
print(sbs.k_feature_idx_)
print('CV Score:')
print(sbs.k_score_)
###################################################
# Sequential Forward Floating Selection
sffs = SFS(knn,
k_features=3,
forward=True,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sffs = sffs.fit(X, y)
print('\nSequential Forward Floating Selection (k=3):')
print(sffs.k_feature_idx_)
print('CV Score:')
print(sffs.k_score_)
###################################################
# Sequential Backward Floating Selection
sbfs = SFS(knn,
k_features=3,
forward=False,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbfs = sbfs.fit(X, y)
print('\nSequential Backward Floating Selection (k=3):')
print(sbfs.k_feature_idx_)
print('CV Score:')
print(sbfs.k_score_)
Sequential Forward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Forward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
In this simple scenario, selecting the 3 best features from the Iris dataset, we end up with similar results regardless of which sequential selection algorithm we used.
Example 3 - Visualizing the results in DataFrames
For our convenience, we can visualize the output from the feature selection in a pandas DataFrame format using the get_metric_dict method of the SequentialFeatureSelector object. The columns std_dev and std_err represent the standard deviation and standard error of the cross-validation scores, respectively.
Below, we see the DataFrame of the Sequential Forward Selector from Example 2:
import pandas as pd
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 1 | (3,) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (3,) | 0.048319 | 0.030143 | 0.017403 |
| 2 | (2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (2, 3) | 0.048319 | 0.030143 | 0.017403 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
Now, let's compare it to the Sequential Backward Selector:
pd.DataFrame.from_dict(sbs.get_metric_dict()).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.03602 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
We can see that both SFS and SBS found the same "best" 3 features; however, the intermediate steps are obviously different.
The ci_bound column in the DataFrames above represents the confidence interval around the computed cross-validation scores. By default, a 95% confidence interval is used, but we can use different confidence bounds via the confidence_interval parameter. E.g., the confidence bounds for a 90% confidence interval can be obtained as follows:
pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.027658 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.023525 | 0.019113 | 0.011035 |
Example 4 - Plotting the results
After importing the little helper function plotting.plot_sequential_feature_selection, we can also visualize the results using matplotlib figures.
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
sfs = SFS(knn,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
verbose=2,
cv=5)
sfs = sfs.fit(X, y)
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.ylim([0.8, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 1/4 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 2/4 -- score: 0.9666666666666668[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 3/4 -- score: 0.9533333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 4/4 -- score: 0.9733333333333334
Example 5 - Sequential Feature Selection for Regression
Similar to the classification examples above, the SequentialFeatureSelector also supports scikit-learn's estimators for regression.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
lr = LinearRegression()
sfs = SFS(lr,
k_features=8,
forward=True,
floating=False,
scoring='neg_mean_squared_error',
cv=10)
sfs = sfs.fit(X, y)
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()
Example 6 -- Feature Selection with Fixed Train/Validation Splits
If you do not wish to use cross-validation (here: k-fold cross-validation, i.e., rotating training and validation folds), you can use the PredefinedHoldoutSplit class to specify your own, fixed training and validation split.
from sklearn.datasets import load_iris
from mlxtend.evaluate import PredefinedHoldoutSplit
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
rng = np.random.RandomState(123)
my_validation_indices = rng.permutation(np.arange(150))[:30]
print(my_validation_indices)
[ 72 112 132 88 37 138 87 42 8 90 141 33 59 116 135 104 36 13
63 45 28 133 24 127 46 20 31 121 117 4]
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=4)
piter = PredefinedHoldoutSplit(my_validation_indices)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=piter)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 1/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 2/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 3/3 -- score: 0.9666666666666667
Example 7 -- Using the Selected Feature Subset For Making New Predictions
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
knn = KNeighborsClassifier(n_neighbors=4)
# Select the "best" three features via
# 5-fold cross-validation on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
print('Selected features:', sfs1.k_feature_idx_)
Selected features: (1, 2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, sfs1.k_feature_idx_]
X_train_sfs = sfs1.transform(X_train)
X_test_sfs = sfs1.transform(X_test)
# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_sfs, y_train)
y_pred = knn.predict(X_test_sfs)
# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))
Test set accuracy: 96.00 %
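As a side note, the manual accuracy computation above is equivalent to scikit-learn's accuracy_score helper:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print('Test set accuracy: %.2f %%' % (acc * 100))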
Example 8 -- Sequential Feature Selection and GridSearch
In the following example, we are tuning the SFS's estimator using GridSearch. To avoid unwanted behavior or side effects, it's advised to use the estimator inside and outside of SFS as separate instances.
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
knn1 = KNeighborsClassifier()
knn2 = KNeighborsClassifier()
sfs1 = SFS(estimator=knn1,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = Pipeline([('sfs', sfs1),
('knn2', knn2)])
param_grid = {
'sfs__k_features': [1, 2, 3],
'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn
'knn2__n_neighbors': [3, 4, 7] # outer knn
}
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=5,
refit=False)
# Run the grid search
gs = gs.fit(X_train, y_train)
Let's take a look at the suggested hyperparameters below. Each candidate parameter setting and its average test accuracy can be printed from the cross-validation results:
for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
The "best" parameters determined via GridSearch are ...
print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'knn2__n_neighbors': 7, 'sfs__estimator__n_neighbors': 3, 'sfs__k_features': 3}
pipe.set_params(**gs.best_params_).fit(X_train, y_train)
Pipeline(steps=[('sfs', SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3), k_features=(3, 3), scoring='accuracy')), ('knn2', KNeighborsClassifier(n_neighbors=7))])
Example 9 -- Selecting the "best" feature combination in a k-range
If k_features is set to a tuple (min_k, max_k) (new in 0.4.2), the SFS will now select the best feature combination that it discovered by iterating from k=1 to max_k (forward), or from max_k to min_k (backward). The size of the returned feature subset is then between max_k and min_k, depending on which combination scored best during cross-validation.
X.shape
(150, 4)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X, y = wine_data()
X_train, X_test, y_train, y_test= train_test_split(X, y,
stratify=y,
test_size=0.3,
random_state=1)
knn = KNeighborsClassifier(n_neighbors=2)
sfs1 = SFS(estimator=knn,
k_features=(3, 10),
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = make_pipeline(StandardScaler(), sfs1)
pipe.fit(X_train, y_train)
print('best combination (ACC: %.3f): %s\n' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)
plot_sfs(sfs1.get_metric_dict(), kind='std_err');
best combination (ACC: 0.992): (0, 1, 2, 3, 6, 8, 9, 10, 11, 12)
all subsets:
{1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8 , 0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx': (6, 9), 'cv_scores': array([0.92 , 0.88 , 1. , 0.96 , 0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6', '9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92 , 0.92 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9516666666666665, 'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12), 'cv_scores': array([0.96 , 0.96 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6', '9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 1. , 1. ]), 'avg_score': 0.976, 'feature_names': ('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 0.96, 1. ]), 'avg_score': 0.968, 'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0, 2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1. , 1. , 1. ]), 'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10', '12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.96, 1. , 1. , 1. ]), 'avg_score': 0.992, 'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}
Example 10 -- Using other cross-validation schemes
In addition to standard k-fold and stratified k-fold cross-validation, other cross-validation schemes can be used with SequentialFeatureSelector; for example, GroupKFold or LeaveOneOut cross-validation from scikit-learn.
Using GroupKFold with SequentialFeatureSelector
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import GroupKFold
import numpy as np
X, y = iris_data()
groups = np.arange(len(y)) // 10
print('groups: {}'.format(groups))
groups: [ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7
7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9
9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
14 14 14 14 14 14]
Calling the split() method of a scikit-learn cross-validator object will return a generator that yields train, test splits.
cv_gen = GroupKFold(4).split(X, y, groups)
cv_gen
<generator object _BaseKFold.split at 0x17c109580>
The cv parameter of SequentialFeatureSelector must be either an int or an iterable yielding train, test splits. Such an iterable can be constructed by passing the train, test split generator to the built-in list() function.
cv = list(cv_gen)
knn = KNeighborsClassifier(n_neighbors=2)
sfs = SFS(estimator=knn,
k_features=2,
scoring='accuracy',
cv=cv)
sfs.fit(X, y)
print('best combination (ACC: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))
best combination (ACC: 0.940): (2, 3)
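LeaveOneOut, mentioned at the beginning of this example, can be plugged in the same way. A minimal sketch (keep in mind that leave-one-out is computationally expensive, since one model is fit per left-out sample):
from sklearn.model_selection import LeaveOneOut
loo_cv = list(LeaveOneOut().split(X, y))
sfs_loo = SFS(estimator=knn,
              k_features=2,
              scoring='accuracy',
              cv=loo_cv)
sfs_loo = sfs_loo.fit(X, y)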
Example 11 - Interrupting Long Runs for Intermediate Results
If your run is taking too long, it is possible to trigger a KeyboardInterrupt (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results.
Toy dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=20000,
n_features=500,
n_informative=10,
n_redundant=40,
n_repeated=25,
n_clusters_per_class=5,
flip_y=0.05,
class_sep=0.5,
random_state=123,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123
)
Long run with KeyboardInterrupt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
sfs1 = SFS(model,
k_features=10,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 8.3s finished
[2023-05-17 08:36:32] Features: 1/10 -- score: 0.5965[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 499 out of 499 | elapsed: 13.8s finished
[2023-05-17 08:36:45] Features: 2/10 -- score: 0.6256875000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 498 out of 498 | elapsed: 18.1s finished
[2023-05-17 08:37:03] Features: 3/10 -- score: 0.642[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 497 out of 497 | elapsed: 20.4s finished
[2023-05-17 08:37:24] Features: 4/10 -- score: 0.6463125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 496 out of 496 | elapsed: 22.2s finished
[2023-05-17 08:37:46] Features: 5/10 -- score: 0.6495000000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 495 out of 495 | elapsed: 26.1s finished
[2023-05-17 08:38:12] Features: 6/10 -- score: 0.6514374999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 494 out of 494 | elapsed: 26.1s finished
[2023-05-17 08:38:38] Features: 7/10 -- score: 0.6533749999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 493 out of 493 | elapsed: 25.3s finished
[2023-05-17 08:39:04] Features: 8/10 -- score: 0.6545624999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 492 out of 492 | elapsed: 26.3s finished
[2023-05-17 08:39:30] Features: 9/10 -- score: 0.6549375[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 491 out of 491 | elapsed: 27.0s finished
[2023-05-17 08:39:57] Features: 10/10 -- score: 0.6554374999999999
Finalizing the fit
Note that the feature selection run hasn't finished, so certain attributes may not be available. In order to use the SFS instance, it is recommended to call finalize_fit, which will make the SFS estimator appear as "fitted" and process the temporary results:
sfs1.finalize_fit()
print(sfs1.k_feature_idx_)
print(sfs1.k_score_)
(30, 128, 144, 160, 184, 229, 256, 356, 439, 458)
0.6554374999999999
Example 12 - Using Pandas DataFrames
Optionally, we can also use pandas DataFrames and pandas Series as input to the fit function. In this case, the column names of the pandas DataFrame will be used as feature names. However, note that if custom_feature_names are provided in the fit function, these custom_feature_names take precedence over the DataFrame column-based feature names.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=0)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
|   | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Also, the target array y can optionally be cast as a pandas Series:
y_series = pd.Series(y)
y_series.head()
0 0
1 0
2 0
3 0
4 0
dtype: int64
sfs1 = sfs1.fit(X_df, y_series)
Note that the only difference when passing a pandas DataFrame as input is that the sfs1.subsets_ dictionary will now contain the DataFrame column names as the feature names:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('petal width',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('sepal width', 'petal width')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('petal len', 'sepal width', 'petal width')}}
Support for pandas DataFrames (instead of NumPy arrays or other NumPy-like array types) as feature input to the SequentialFeatureSelector was added in mlxtend version >= 0.13.
Example 13 - Specifying Fixed Pre-Selected Features
Often, it can be useful to specify a fixed set of features we want to use for a given model (e.g., determined by prior or domain knowledge). Since MLxtend v 0.18.0, it is possible to specify such features via the fixed_features attribute. This means that these features are guaranteed to be included in the selected subsets.
Note that this feature works for all options regarding forward and backward selection, and whether or not floating selection is used.
The example below illustrates how we can set features 0 and 2 in the dataset as fixed:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=(0, 2),
cv=3)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9733333333333333[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs1.subsets_
{2: {'feature_idx': (0, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('0', '2')},
3: {'feature_idx': (0, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '2', '3')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '1', '2', '3')}}
If the input dataset is a pandas DataFrame, we can also use the column names directly:
import pandas as pd
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
|   | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs2 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=('sepal len', 'petal len'),
cv=3)
sfs2 = sfs2.fit(X_df, y_series)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9466666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs2.subsets_
{2: {'feature_idx': (0, 1),
'cv_scores': array([0.72, 0.74, 0.78]),
'avg_score': 0.7466666666666667,
'feature_names': ('sepal len', 'petal len')},
3: {'feature_idx': (0, 1, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('sepal len', 'petal len', 'sepal width')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('sepal len', 'petal len', 'sepal width', 'petal width')}}
Example 14 - Working with Feature Groups
Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding -- if you want to treat one-hot encoded features as a single feature:
In the following example, we specify sepal length and sepal width as a feature group so that they are always selected together:
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X = iris.data
y = iris.target
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal wid', 'petal wid'])
X_df.head()
|   | sepal len | petal len | sepal wid | petal wid |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=(['sepal len', 'sepal wid'], ['petal len'], ['petal wid']),
cv=3)
sfs1 = sfs1.fit(X_df, y)
If the input is a NumPy array instead of a DataFrame, the same feature groups can be specified via integer indices (here, indices 0 and 2 correspond to the sepal length and sepal width columns):
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=[[0, 2], [1], [3]],
cv=3)
sfs1 = sfs1.fit(X, y)
Example 15 - Multiclass Metrics
Certain scoring metrics like ROC AUC are originally designed for binary classification. However, they can also be used in multiclass settings. It is best to consult this [scikit-learn metrics table](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).
For example, we can use a ROC AUC One-Vs-Rest score via "roc_auc_ovr" as shown below.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=4, n_features=5, random_state=0)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='roc_auc_ovr',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 1/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 2/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/3 -- score: 1.0
API
SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)
Sequential Feature Selection for Classification and Regression.
Parameters
-
estimator
: scikit-learn classifier or regressor -
k_features
: int or tuple or str (default: 1) Number of features to select, where k_features < the full feature set. New in 0.4.2: A tuple containing a min and max value can be provided, and the SFS will return any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination from 1 up to 4 features instead of a fixed number of features k. New in 0.8.0: A string argument "best" or "parsimonious". If "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the cross-validation performance will be selected (see the usage sketch after this parameter list).
-
forward
: bool (default: True) Forward selection if True, backward selection otherwise.
-
floating
: bool (default: False) Adds a conditional exclusion/inclusion if True.
-
verbose
: int (default: 0), level of verbosity to use in logging. If 0, no output, if 1 number of features in current set, if 2 detailed logging including timestamp and cv scores at step.
-
scoring
: str, callable, or None (default: None) If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to conform to sklearn's signature
scorer(estimator, X, y)
; see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information. -
cv
: int (default: 5) Integer or iterable yielding train, test splits. If cv is an integer and
estimator
is a classifier (or y consists of integer class labels) stratified k-fold. Otherwise regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0. -
n_jobs
: int (default: 1) The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
-
pre_dispatch
: int, or string (default: '2*n_jobs') Controls the number of jobs that get dispatched during parallel execution if
n_jobs > 1
orn_jobs=-1
. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs An int, giving the exact number of total jobs that are spawned A string, giving an expression as a function of n_jobs, as in2*n_jobs
-
clone_estimator
: bool (default: True) Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0, and n_jobs=1.
-
fixed_features
: tuple (default: None) If not
None
, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, iffixed_features=(1, 3, 7)
, the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that iffixed_features
is notNone
, make sure that the number of features to be selected is greater thanlen(fixed_features)
. In other words, ensure thatk_features > len(fixed_features)
. New in mlxtend v. 0.18.0. -
feature_groups
: list or None (default: None)Optional argument for treating certain features as a group. This means, the features within a group are always selected together, never split. For example,
feature_groups=[[1], [2], [3, 4, 5]]
specifies 3 feature groups. In this case, possible feature selection results withk_features=2
are[[1], [2]
,[[1], [3, 4, 5]]
, or[[2], [3, 4, 5]]
. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring). New in mlxtend v. 0.21.0.
Attributes
-
k_feature_idx_
: array-like, shape = [n_predictions] Feature Indices of the selected feature subsets.
-
k_feature_names_
: array-like, shape = [n_predictions] Feature names of the selected feature subsets. If pandas DataFrames are used in the
fit
method, the feature names correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. New in v 0.13.0. -
k_score_
: float Cross validation average score of the selected subset.
-
subsets_
: dict A dictionary of selected feature subsets during the sequential selection, where the dictionary keys are the lengths k of these feature subsets. If the parameter
feature_groups
is not None, the value of key indicates the number of groups that are selected together. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset) 'feature_names' (tuple of feature names of the feat. subset) 'cv_scores' (list individual cross-validation scores) 'avg_score' (average cross-validation score) Note that if pandas DataFrames are used in thefit
method, the 'feature_names' correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. The 'feature_names' is new in v 0.13.0.
Notes
(1) If parameter feature_groups
is not None, the
number of features is equal to the number of feature groups, i.e.
len(feature_groups)
. For example, if feature_groups = [[0], [1], [2, 3],
[4]]
, then the max_features
value cannot exceed 4.
(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficient of the feature 2 and 3 can be different
even if they are considered as one group in feature_groups.
(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.
(4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.
If user is still interested in getting the best score, they can use method
`finalize_fit`.
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
Methods
finalize_fit()
None
fit(X, y, groups=None, **fit_params)
Perform feature selection and learn model from training data.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
-
y
: array-like, shape = [n_samples] Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.
-
groups
: array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
-
fit_params
: various, optional Additional parameters that are being passed to the estimator. For example,
sample_weights=weights
.
Returns
self
: object
fit_transform(X, y, groups=None, **fit_params)
Fit to training data then reduce X to its most important features.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
-
y
: array-like, shape = [n_samples] Target values. New in v 0.13.0: a pandas Series is now also accepted as argument for y.
-
groups
: array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
-
fit_params
: various, optional Additional parameters that are being passed to the estimator. For example,
sample_weights=weights
.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
generate_error_message_k_features(name)
None
get_metric_dict(confidence_interval=0.95)
Return metric dictionary
Parameters
-
confidence_interval
: float (default: 0.95) A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.
Returns
Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows: 'feature_idx': tuple of the indices of the feature subset 'cv_scores': list with individual CV scores 'avg_score': of CV average scores 'std_dev': standard deviation of the CV score average 'std_err': standard error of the CV score average 'ci_bound': confidence interval bound of the CV score average
get_params(deep=True)
Get parameters for this estimator.
Parameters
-
deep
: bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
-
params
: dict Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params()
.
Returns
self
transform(X)
Reduce X to its most important features.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
Properties
named_estimators
Returns
List of named estimator tuples, like [('svc', SVC(...))]