SequentialFeatureSelector: The popular forward and backward feature selection approaches (including floating variants)
Implementation of sequential feature algorithms (SFAs) -- greedy search algorithms -- that have been developed as a suboptimal solution to the computationally often infeasible exhaustive search.
> from mlxtend.feature_selection import SequentialFeatureSelector
Overview
Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem. The goal of feature selection is two-fold: we want to improve computational efficiency and reduce the model's generalization error by removing irrelevant features or noise. In addition, a wrapper approach such as sequential feature selection is advantageous if embedded feature selection -- for example, a regularization penalty like LASSO -- is not applicable.
In a nutshell, sequential feature selection algorithms add or remove one feature at a time based on the classifier performance until a feature subset of the desired size k is reached. There are four different flavors of sequential feature selection available via the SequentialFeatureSelector:
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
- Sequential Forward Floating Selection (SFFS)
- Sequential Backward Floating Selection (SBFS)
The floating variants, SFFS and SBFS, can be considered extensions to the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step to remove features once they were included (or excluded), so that a larger number of feature subset combinations can be sampled. It is important to emphasize that this step is conditional and only occurs if the resulting feature subset is assessed as "better" by the criterion function after the removal (or addition) of a particular feature. Furthermore, I added an optional check to skip the conditional exclusion steps in case the algorithm gets stuck in cycles.
How is this different from recursive feature elimination (RFE) -- e.g., as implemented in sklearn.feature_selection.RFE? RFE is computationally less complex: it eliminates features recursively based on the feature weight coefficients (e.g., of linear models) or feature importances (of tree-based algorithms), whereas SFS eliminates (or adds) features based on a user-defined classifier/regression performance metric.
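To make the contrast concrete, below is a minimal, illustrative sketch comparing the two APIs (an illustration only, assuming a linear model so that RFE can use the coefficient magnitudes; it is not part of the original example code):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# RFE recursively drops the feature with the smallest coefficient magnitude
rfe = RFE(model, n_features_to_select=2).fit(X, y)

# SFS greedily adds the feature that maximizes cross-validated accuracy
sfs = SFS(model, k_features=2, scoring='accuracy', cv=3).fit(X, y)

print(rfe.support_)        # boolean mask of the selected features
print(sfs.k_feature_idx_)  # tuple of the selected feature indices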
Tutorial Video
<iframe width="560" height="315" src="https://www.youtube.com/embed/0vCXcGJg5Bo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Visual Illustration
A visual illustration of the sequential backward selection process is provided below, from the paper
- Joe Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, and Leslie A. Kuhn (2020). Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition. Biomolecules 2020, 10, 454. https://www.mdpi.com/2218-273X/10/3/454#
Algorithm Details
Sequential Forward Selection (SFS)
Input: $Y = \{y_1, y_2, ..., y_d\}$
- The SFS algorithm takes the whole $d$-dimensional feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = \emptyset$, $k = 0$
- We initialize the algorithm with an empty set $\emptyset$ ("null set") so that $k = 0$ (where $k$ is the size of the subset).
Step 1 (Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1
- In this step, we add an additional feature, $x^+$, to our feature subset $X_k$.
- $x^+$ is the feature that maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is added to $X_k$.
- We repeat this procedure until the termination criterion is satisfied.
Termination: $k = p$
- We add features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
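To make the notation above concrete, here is a minimal, illustrative Python sketch of the SFS loop (not the mlxtend implementation); J is assumed to be a user-supplied criterion function that scores a list of feature indices, e.g., via cross-validated accuracy:
def sfs(J, d, p):
    # X_0 = empty set; greedily add the best feature until k = p
    selected = []
    remaining = set(range(d))
    while len(selected) < p:
        # Step 1 (Inclusion): pick the x that maximizes J(X_k + x)
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return selected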
Sequential Backward Selection (SBS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SBS algorithm takes the whole feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SBS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = Y$, $k = d$
- We initialize the algorithm with the given feature set so that $k = d$.
Step 1 (Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1
- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.
- We repeat this procedure until the termination criterion is satisfied.
Termination: $k = p$
- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
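Analogously, a minimal illustrative sketch of the SBS loop (same hypothetical criterion function J as in the SFS sketch above):
def sbs(J, d, p):
    # X_0 = Y; greedily remove the feature whose removal scores best
    selected = list(range(d))
    while len(selected) > p:
        # Step 1 (Exclusion): pick the x that maximizes J(X_k - x)
        worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
        selected.remove(worst)
    return selected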
Sequential Backward Floating Selection (SBFS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SBFS algorithm takes the whole feature set as input.
Output: $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- SBFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.
Initialization: $X_0 = Y$, $k = d$
- We initialize the algorithm with the given feature set so that $k = d$.
Step 1 (Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 2
- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature that maximizes our criterion function upon removal, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.
Step 2 (Conditional Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
if $J(X_k + x) > J(X_k)$:
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1
- In Step 2, we search for features that improve the classifier performance if they are added back to the feature subset. If such features exist, we add the feature $x^+$ for which the performance improvement is maximized. If $k = 2$ or an improvement cannot be made (i.e., no such feature $x^+$ can be found), go back to Step 1; else, repeat this step.
Termination: $k = p$
- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
Sequential Forward Floating Selection (SFFS)
Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$
- The SFFS algorithm takes the whole feature set as input, if our feature space consists of, e.g., 10 dimensions ($d = 10$).
Output: a subset of features, $X_k = \{x_j \; | \; j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$
- The returned output of the algorithm is a subset of the feature space of a specified size. For example, a subset of 5 features from a 10-dimensional feature space ($k = 5$, $d = 10$).
Initialization: $X_0 = \emptyset$, $k = 0$
- We initialize the algorithm with an empty set ("null set") so that $k = 0$ (where $k$ is the size of the subset).
Step 1 (Inclusion):
$x^+ = \text{ arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 2
Step 2 (Conditional Exclusion):
$x^- = \text{ arg max } J(X_k - x), \text{ where } x \in X_k$
if $J(X_k - x) > J(X_k)$:
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1
- In Step 1, we include the feature from the feature space that leads to the best performance increase for our feature subset (assessed by the criterion function). Then, we go over to Step 2.
- In Step 2, we only remove a feature if the resulting subset would gain an increase in performance. If $k = 2$ or an improvement cannot be made (i.e., no such feature $x^-$ can be found), go back to Step 1; else, repeat this step.
- Steps 1 and 2 are repeated until the termination criterion is reached.
Termination: stop when $k$ equals the number of desired features.
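The floating logic can be sketched the same way; the SFFS sketch below adds the conditional exclusion step after each inclusion (SBFS mirrors it with the roles of inclusion and exclusion swapped). Again, this is only an illustration with the same hypothetical criterion function J, not the mlxtend internals:
def sffs(J, d, p):
    selected, remaining = [], set(range(d))
    while len(selected) < p:
        # Step 1 (Inclusion)
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
        # Step 2 (Conditional Exclusion): drop features again as long as
        # doing so improves the criterion; stop once the subset is down to 2
        while len(selected) > 2:
            worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
            if J([f for f in selected if f != worst]) > J(selected):
                selected.remove(worst)
                remaining.add(worst)
            else:
                break
    return selected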
References
- Ferri, F. J., Pudil, P., Hatef, M., Kittler, J. (1994). "Comparative study of techniques for large-scale feature selection." Pattern Recognition in Practice IV: 403-413.
- Pudil, P., Novovičová, J., & Kittler, J. (1994). "Floating search methods in feature selection." Pattern Recognition Letters 15.11: 1119-1125.
Example 1 - A simple Sequential Forward Selection example
Initializing a simple classifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
We start by selecting the "3 best features" from the Iris dataset via Sequential Forward Selection (SFS). Here, we set forward=True and floating=False. By choosing cv=0, we don't perform any cross-validation; therefore, the performance (here: 'accuracy') is computed entirely on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Via the subsets_ attribute, we can take a look at the selected feature indices at each step:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('3',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('2', '3')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('1', '2', '3')}}
Furthermore, we can access the indices of the 3 best features directly via the k_feature_idx_ attribute:
sfs1.k_feature_idx_
(1, 2, 3)
Finally, the prediction score for these 3 features can be accessed via k_score_:
sfs1.k_score_
0.9733333333333334
Feature Names
When working with large datasets, the feature indices can be hard to interpret. In this case, we recommend using pandas DataFrames with distinct column names as input:
import pandas as pd
df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()
|   | Sepal length | Sepal width | Petal length | Petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs1 = sfs1.fit(df_X, y)
print('Best accuracy score: %.2f' % sfs1.k_score_)
print('Best subset (indices):', sfs1.k_feature_idx_)
print('Best subset (corresponding names):', sfs1.k_feature_names_)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best accuracy score: 0.97
Best subset (indices): (1, 2, 3)
Best subset (corresponding names): ('Sepal width', 'Petal length', 'Petal width')
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Example 2 - Toggling between SFS, SBS, SFFS, and SBFS
Using the forward and floating parameters, we can toggle between SFS, SBS, SFFS, and SBFS as shown below. Note that we are performing (stratified) 4-fold cross-validation for more robust estimates, in contrast to Example 1. Via n_jobs=-1, we choose to run the cross-validation on all our available CPU cores.
# Sequential Forward Selection
sfs = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sfs = sfs.fit(X, y)
print('\nSequential Forward Selection (k=3):')
print(sfs.k_feature_idx_)
print('CV Score:')
print(sfs.k_score_)
###################################################
# Sequential Backward Selection
sbs = SFS(knn,
k_features=3,
forward=False,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbs = sbs.fit(X, y)
print('\nSequential Backward Selection (k=3):')
print(sbs.k_feature_idx_)
print('CV Score:')
print(sbs.k_score_)
###################################################
# Sequential Forward Floating Selection
sffs = SFS(knn,
k_features=3,
forward=True,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sffs = sffs.fit(X, y)
print('\nSequential Forward Floating Selection (k=3):')
print(sffs.k_feature_idx_)
print('CV Score:')
print(sffs.k_score_)
###################################################
# Sequential Backward Floating Selection
sbfs = SFS(knn,
k_features=3,
forward=False,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbfs = sbfs.fit(X, y)
print('\nSequential Backward Floating Selection (k=3):')
print(sbfs.k_feature_idx_)
print('CV Score:')
print(sbfs.k_score_)
Sequential Forward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Forward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
In this simple scenario, selecting the 3 best features from the Iris dataset, we end up with similar results regardless of which sequential selection algorithm we used.
Example 3 - Visualizing the results in DataFrames
For our convenience, we can visualize the output from the feature selection in a pandas DataFrame format using the get_metric_dict method of the SequentialFeatureSelector object. The columns std_dev and std_err represent the standard deviation and standard error of the cross-validation scores, respectively.
Below, we see the DataFrame of the Sequential Forward Selector from Example 2:
import pandas as pd
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 1 | (3,) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (3,) | 0.048319 | 0.030143 | 0.017403 |
| 2 | (2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (2, 3) | 0.048319 | 0.030143 | 0.017403 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
Now, let's compare it to the Sequential Backward Selector:
pd.DataFrame.from_dict(sbs.get_metric_dict()).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.03602 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
We can see that both SFS and SBS found the same "best" 3 features; however, the intermediate steps are obviously different.
The ci_bound column in the DataFrames above represents the confidence interval around the computed cross-validation scores. By default, a 95% confidence interval is used, but we can use different confidence bounds via the confidence_interval parameter. E.g., the confidence bounds for a 90% confidence interval can be obtained as follows:
pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T
|   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.027658 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.023525 | 0.019113 | 0.011035 |
Example 4 - Plotting the results
After importing the little helper function plotting.plot_sequential_feature_selection, we can also visualize the results using matplotlib figures.
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
sfs = SFS(knn,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
verbose=2,
cv=5)
sfs = sfs.fit(X, y)
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.ylim([0.8, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 1/4 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 2/4 -- score: 0.9666666666666668[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 3/4 -- score: 0.9533333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 4/4 -- score: 0.9733333333333334
Example 5 - Sequential Feature Selection for Regression
Similar to the classification examples above, the SequentialFeatureSelector also supports scikit-learn's estimators for regression.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
lr = LinearRegression()
sfs = SFS(lr,
k_features=8,
forward=True,
floating=False,
scoring='neg_mean_squared_error',
cv=10)
sfs = sfs.fit(X, y)
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()
Example 6 -- Feature Selection with Fixed Train/Validation Splits
If you do not wish to use cross-validation (here: k-fold cross-validation, i.e., rotating training and validation folds), you can use the PredefinedHoldoutSplit class to specify your own, fixed training and validation split.
from sklearn.datasets import load_iris
from mlxtend.evaluate import PredefinedHoldoutSplit
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
rng = np.random.RandomState(123)
my_validation_indices = rng.permutation(np.arange(150))[:30]
print(my_validation_indices)
[ 72 112 132 88 37 138 87 42 8 90 141 33 59 116 135 104 36 13
63 45 28 133 24 127 46 20 31 121 117 4]
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=4)
piter = PredefinedHoldoutSplit(my_validation_indices)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=piter)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 1/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 2/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 3/3 -- score: 0.9666666666666667
Example 7 -- Using the Selected Feature Subset For Making New Predictions
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
knn = KNeighborsClassifier(n_neighbors=4)
# Select the "best" three features via
# 5-fold cross-validation on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
print('Selected features:', sfs1.k_feature_idx_)
Selected features: (1, 2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, sfs1.k_feature_idx_]
X_train_sfs = sfs1.transform(X_train)
X_test_sfs = sfs1.transform(X_test)
# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_sfs, y_train)
y_pred = knn.predict(X_test_sfs)
# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))
Test set accuracy: 96.00 %
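As a side note, the manual accuracy computation above is equivalent to scikit-learn's accuracy_score helper:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print('Test set accuracy: %.2f %%' % (acc * 100))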
Example 8 -- Sequential Feature Selection and GridSearch
In the following example, we are tuning the SFS's estimator using GridSearch. To avoid unwanted behavior or side effects, it's advised to use the estimator inside and outside of SFS as separate instances.
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
knn1 = KNeighborsClassifier()
knn2 = KNeighborsClassifier()
sfs1 = SFS(estimator=knn1,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = Pipeline([('sfs', sfs1),
('knn2', knn2)])
param_grid = {
'sfs__k_features': [1, 2, 3],
'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn
'knn2__n_neighbors': [3, 4, 7] # outer knn
}
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=5,
refit=False)
# Run the grid search
gs = gs.fit(X_train, y_train)
Let's take a look at the suggested hyperparameters below. Each candidate parameter setting and its average test accuracy can be printed from the cross-validation results:
for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
The "best" parameters determined via GridSearch are ...
print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'knn2__n_neighbors': 7, 'sfs__estimator__n_neighbors': 3, 'sfs__k_features': 3}
pipe.set_params(**gs.best_params_).fit(X_train, y_train)
Pipeline(steps=[('sfs', SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3), k_features=(3, 3), scoring='accuracy')), ('knn2', KNeighborsClassifier(n_neighbors=7))])
Example 9 -- Selecting the "best" feature combination in a k-range
If k_features is set to a tuple (min_k, max_k) (new in 0.4.2), the SFS will now select the best feature combination that it discovered by iterating from k=1 to max_k (forward), or from max_k to min_k (backward). The size of the returned feature subset is then between max_k and min_k, depending on which combination scored best during cross-validation.
X.shape
(150, 4)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X, y = wine_data()
X_train, X_test, y_train, y_test= train_test_split(X, y,
stratify=y,
test_size=0.3,
random_state=1)
knn = KNeighborsClassifier(n_neighbors=2)
sfs1 = SFS(estimator=knn,
k_features=(3, 10),
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = make_pipeline(StandardScaler(), sfs1)
pipe.fit(X_train, y_train)
print('best combination (ACC: %.3f): %s\n' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)
plot_sfs(sfs1.get_metric_dict(), kind='std_err');
best combination (ACC: 0.992): (0, 1, 2, 3, 6, 8, 9, 10, 11, 12)
all subsets:
{1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8 , 0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx': (6, 9), 'cv_scores': array([0.92 , 0.88 , 1. , 0.96 , 0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6', '9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92 , 0.92 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9516666666666665, 'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12), 'cv_scores': array([0.96 , 0.96 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6', '9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 1. , 1. ]), 'avg_score': 0.976, 'feature_names': ('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 0.96, 1. ]), 'avg_score': 0.968, 'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0, 2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1. , 1. , 1. ]), 'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10', '12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.96, 1. , 1. , 1. ]), 'avg_score': 0.992, 'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}
Example 10 -- Using other cross-validation schemes
In addition to standard k-fold and stratified k-fold cross-validation, other cross-validation schemes can be used with SequentialFeatureSelector; for example, GroupKFold or LeaveOneOut cross-validation from scikit-learn.
Using GroupKFold with SequentialFeatureSelector
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import GroupKFold
import numpy as np
X, y = iris_data()
groups = np.arange(len(y)) // 10
print('groups: {}'.format(groups))
groups: [ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7
7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9
9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
14 14 14 14 14 14]
Calling the split() method of a scikit-learn cross-validator object will return a generator that yields train, test splits.
cv_gen = GroupKFold(4).split(X, y, groups)
cv_gen
<generator object _BaseKFold.split at 0x17c109580>
The cv parameter of SequentialFeatureSelector must be either an int or an iterable yielding train, test splits. Such an iterable can be constructed by passing the train, test split generator to the built-in list() function.
cv = list(cv_gen)
knn = KNeighborsClassifier(n_neighbors=2)
sfs = SFS(estimator=knn,
k_features=2,
scoring='accuracy',
cv=cv)
sfs.fit(X, y)
print('best combination (ACC: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))
best combination (ACC: 0.940): (2, 3)
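LeaveOneOut, mentioned at the beginning of this example, can be plugged in the same way. A minimal sketch (keep in mind that leave-one-out is computationally expensive, since one model is fit per left-out sample):
from sklearn.model_selection import LeaveOneOut
loo_cv = list(LeaveOneOut().split(X, y))
sfs_loo = SFS(estimator=knn,
              k_features=2,
              scoring='accuracy',
              cv=loo_cv)
sfs_loo = sfs_loo.fit(X, y)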
Example 11 - Interrupting Long Runs for Intermediate Results
If your run is taking too long, it is possible to trigger a KeyboardInterrupt (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results.
Toy dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=20000,
n_features=500,
n_informative=10,
n_redundant=40,
n_repeated=25,
n_clusters_per_class=5,
flip_y=0.05,
class_sep=0.5,
random_state=123,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123
)
Long run with KeyboardInterrupt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
sfs1 = SFS(model,
k_features=10,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 8.3s finished
[2023-05-17 08:36:32] Features: 1/10 -- score: 0.5965[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 499 out of 499 | elapsed: 13.8s finished
[2023-05-17 08:36:45] Features: 2/10 -- score: 0.6256875000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 498 out of 498 | elapsed: 18.1s finished
[2023-05-17 08:37:03] Features: 3/10 -- score: 0.642[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 497 out of 497 | elapsed: 20.4s finished
[2023-05-17 08:37:24] Features: 4/10 -- score: 0.6463125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 496 out of 496 | elapsed: 22.2s finished
[2023-05-17 08:37:46] Features: 5/10 -- score: 0.6495000000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 495 out of 495 | elapsed: 26.1s finished
[2023-05-17 08:38:12] Features: 6/10 -- score: 0.6514374999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 494 out of 494 | elapsed: 26.1s finished
[2023-05-17 08:38:38] Features: 7/10 -- score: 0.6533749999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 493 out of 493 | elapsed: 25.3s finished
[2023-05-17 08:39:04] Features: 8/10 -- score: 0.6545624999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 492 out of 492 | elapsed: 26.3s finished
[2023-05-17 08:39:30] Features: 9/10 -- score: 0.6549375[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 491 out of 491 | elapsed: 27.0s finished
[2023-05-17 08:39:57] Features: 10/10 -- score: 0.6554374999999999
Finalizing the fit
Note that the feature selection run hasn't finished, so certain attributes may not be available. In order to use the SFS instance, it is recommended to call finalize_fit, which will make the SFS estimator appear as "fitted" and process the temporary results:
sfs1.finalize_fit()
print(sfs1.k_feature_idx_)
print(sfs1.k_score_)
(30, 128, 144, 160, 184, 229, 256, 356, 439, 458)
0.6554374999999999
Example 12 - Using Pandas DataFrames
Optionally, we can also use pandas DataFrames and pandas Series as input to the fit function. In this case, the column names of the pandas DataFrame will be used as feature names. However, note that if custom_feature_names are provided in the fit function, these custom_feature_names take precedence over the DataFrame column-based feature names.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=0)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
|   | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Also, the target array y can optionally be cast as a pandas Series:
y_series = pd.Series(y)
y_series.head()
0 0
1 0
2 0
3 0
4 0
dtype: int64
sfs1 = sfs1.fit(X_df, y_series)
Note that the only difference when passing a pandas DataFrame as input is that the sfs1.subsets_ dictionary will now contain the DataFrame column names as the feature names:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('petal width',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('sepal width', 'petal width')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('petal len', 'sepal width', 'petal width')}}
Support for pandas DataFrames (instead of NumPy arrays or other NumPy-like array types) as feature input to the SequentialFeatureSelector was added in mlxtend version >= 0.13.
Example 13 - Specifying Fixed Pre-Selected Features
Often, it can be useful to specify a fixed set of features we want to use for a given model (e.g., determined by prior or domain knowledge). Since MLxtend v 0.18.0, it is possible to specify such features via the fixed_features attribute. This means that these features are guaranteed to be included in the selected subsets.
Note that this feature works for all options regarding forward and backward selection, and whether or not floating selection is used.
The example below illustrates how we can set features 0 and 2 in the dataset as fixed:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=(0, 2),
cv=3)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9733333333333333[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs1.subsets_
{2: {'feature_idx': (0, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('0', '2')},
3: {'feature_idx': (0, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '2', '3')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '1', '2', '3')}}
If the input dataset is a pandas DataFrame, we can also use the column names directly:
import pandas as pd
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
|   | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs2 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=('sepal len', 'petal len'),
cv=3)
sfs2 = sfs2.fit(X_df, y_series)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9466666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs2.subsets_
{2: {'feature_idx': (0, 1),
'cv_scores': array([0.72, 0.74, 0.78]),
'avg_score': 0.7466666666666667,
'feature_names': ('sepal len', 'petal len')},
3: {'feature_idx': (0, 1, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('sepal len', 'petal len', 'sepal width')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('sepal len', 'petal len', 'sepal width', 'petal width')}}
Example 14 - Working with Feature Groups
Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding -- if you want to treat one-hot encoded features as a single feature:
In the following example, we specify sepal length and sepal width as a feature group so that they are always selected together:
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X = iris.data
y = iris.target
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal wid', 'petal wid'])
X_df.head()
|   | sepal len | petal len | sepal wid | petal wid |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=(['sepal len', 'sepal wid'], ['petal len'], ['petal wid']),
cv=3)
sfs1 = sfs1.fit(X_df, y)
If the input is a NumPy array instead of a DataFrame, the same feature groups can be specified via integer indices (here, indices 0 and 2 correspond to the sepal length and sepal width columns):
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=[[0, 2], [1], [3]],
cv=3)
sfs1 = sfs1.fit(X, y)
Example 15 - Multiclass Metrics
Certain scoring metrics like ROC AUC are originally designed for binary classification. However, they can also be used in multiclass settings. It is best to consult this [scikit-learn metrics table](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).
For example, we can use a ROC AUC One-Vs-Rest score via "roc_auc_ovr" as shown below.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=4, n_features=5, random_state=0)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='roc_auc_ovr',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 1/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 2/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/3 -- score: 1.0
API
SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)
Sequential Feature Selection for Classification and Regression.
Parameters
-
estimator
: scikit-learn classifier or regressor -
k_features
: int or tuple or str (default: 1) Number of features to select, where k_features < the full feature set. New in 0.4.2: A tuple containing a min and max value can be provided, and the SFS will return any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination from 1 up to 4 features instead of a fixed number of features k. New in 0.8.0: A string argument "best" or "parsimonious". If "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the cross-validation performance will be selected (see the usage sketch after this parameter list).
-
forward
: bool (default: True) Forward selection if True, backward selection otherwise.
-
floating
: bool (default: False) Adds a conditional exclusion/inclusion if True.
-
verbose
: int (default: 0), level of verbosity to use in logging. If 0, no output, if 1 number of features in current set, if 2 detailed logging including timestamp and cv scores at step.
-
scoring
: str, callable, or None (default: None) If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to conform to sklearn's signature
scorer(estimator, X, y)
; see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information. -
cv
: int (default: 5) Integer or iterable yielding train, test splits. If cv is an integer and
estimator
is a classifier (or y consists of integer class labels) stratified k-fold. Otherwise regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0. -
n_jobs
: int (default: 1) The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
-
pre_dispatch
: int, or string (default: '2*n_jobs') Controls the number of jobs that get dispatched during parallel execution if
n_jobs > 1
orn_jobs=-1
. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs An int, giving the exact number of total jobs that are spawned A string, giving an expression as a function of n_jobs, as in2*n_jobs
-
clone_estimator
: bool (default: True) Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0, and n_jobs=1.
-
fixed_features
: tuple (default: None) If not
None
, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, iffixed_features=(1, 3, 7)
, the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that iffixed_features
is notNone
, make sure that the number of features to be selected is greater thanlen(fixed_features)
. In other words, ensure thatk_features > len(fixed_features)
. New in mlxtend v. 0.18.0. -
feature_groups
: list or None (default: None)Optional argument for treating certain features as a group. This means, the features within a group are always selected together, never split. For example,
feature_groups=[[1], [2], [3, 4, 5]]
specifies 3 feature groups. In this case, possible feature selection results withk_features=2
are[[1], [2]
,[[1], [3, 4, 5]]
, or[[2], [3, 4, 5]]
. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring). New in mlxtend v. 0.21.0.
Attributes
-
k_feature_idx_
: array-like, shape = [n_predictions] Feature Indices of the selected feature subsets.
-
k_feature_names_
: array-like, shape = [n_predictions] Feature names of the selected feature subsets. If pandas DataFrames are used in the
fit
method, the feature names correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. New in v 0.13.0. -
k_score_
: float Cross validation average score of the selected subset.
-
subsets_
: dict A dictionary of selected feature subsets during the sequential selection, where the dictionary keys are the lengths k of these feature subsets. If the parameter
feature_groups
is not None, the value of key indicates the number of groups that are selected together. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset) 'feature_names' (tuple of feature names of the feat. subset) 'cv_scores' (list individual cross-validation scores) 'avg_score' (average cross-validation score) Note that if pandas DataFrames are used in thefit
method, the 'feature_names' correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. The 'feature_names' is new in v 0.13.0.
Notes
(1) If parameter feature_groups
is not None, the
number of features is equal to the number of feature groups, i.e.
len(feature_groups)
. For example, if feature_groups = [[0], [1], [2, 3],
[4]]
, then the max_features
value cannot exceed 4.
(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficient of the feature 2 and 3 can be different
even if they are considered as one group in feature_groups.
(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.
(4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.
If user is still interested in getting the best score, they can use method
`finalize_fit`.
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
Methods
finalize_fit()
None
fit(X, y, groups=None, **fit_params)
Perform feature selection and learn model from training data.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
-
y
: array-like, shape = [n_samples] Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.
-
groups
: array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
-
fit_params
: various, optional Additional parameters that are being passed to the estimator. For example,
sample_weights=weights
.
Returns
self
: object
fit_transform(X, y, groups=None, **fit_params)
Fit to training data then reduce X to its most important features.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
-
y
: array-like, shape = [n_samples] Target values. New in v 0.13.0: a pandas Series is now also accepted as argument for y.
-
groups
: array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
-
fit_params
: various, optional Additional parameters that are being passed to the estimator. For example,
sample_weights=weights
.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
generate_error_message_k_features(name)
None
get_metric_dict(confidence_interval=0.95)
Return metric dictionary
Parameters
-
confidence_interval
: float (default: 0.95) A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.
Returns
Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows: 'feature_idx': tuple of the indices of the feature subset 'cv_scores': list with individual CV scores 'avg_score': of CV average scores 'std_dev': standard deviation of the CV score average 'std_err': standard error of the CV score average 'ci_bound': confidence interval bound of the CV score average
get_params(deep=True)
Get parameters for this estimator.
Parameters
-
deep
: bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
-
params
: dict Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params()
.
Returns
self
transform(X)
Reduce X to its most important features.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features] Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
Properties
named_estimators
Returns
List of named estimator tuples, like [('svc', SVC(...))]