Successive Halving Iterations


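This example illustrates how a successive halving search (HalvingGridSearchCV and HalvingRandomSearchCV) iteratively chooses the best parameter combination out of multiple candidates.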

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import randint

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV

We first define the parameter space and train a HalvingRandomSearchCV instance (sklearn.model_selection.HalvingRandomSearchCV).

rng = np.random.RandomState(0)

X, y = datasets.make_classification(n_samples=400, n_features=12, random_state=rng)

clf = RandomForestClassifier(n_estimators=20, random_state=rng)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 6),
    "min_samples_split": randint(2, 11),
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

rsh = HalvingRandomSearchCV(
    estimator=clf, param_distributions=param_dist, factor=2, random_state=rng
)
rsh.fit(X, y)
HalvingRandomSearchCV(estimator=RandomForestClassifier(n_estimators=20,
                                                       random_state=RandomState(MT19937) at 0xFFFFA3943E40),
                      factor=2,
                      param_distributions={'bootstrap': [True, False],
                                           'criterion': ['gini', 'entropy'],
                                           'max_depth': [3, None],
                                           'max_features': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0xffff4c3fa710>,
                                           'min_samples_split': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0xffff400ceb10>},
                      random_state=RandomState(MT19937) at 0xFFFFA3943E40)


We can now use the cv_results_ attribute of the search estimator to inspect and plot the evolution of the search.

results = pd.DataFrame(rsh.cv_results_)
results["params_str"] = results.params.apply(str)
results.drop_duplicates(subset=("params_str", "iter"), inplace=True)
mean_scores = results.pivot(
    index="iter", columns="params_str", values="mean_test_score"
)
ax = mean_scores.plot(legend=False, alpha=0.6)

labels = [
    f"iter={i}\nn_samples={rsh.n_resources_[i]}\nn_candidates={rsh.n_candidates_[i]}"
    for i in range(rsh.n_iterations_)
]

ax.set_xticks(range(rsh.n_iterations_))
ax.set_xticklabels(labels, rotation=45, multialignment="left")
ax.set_title("Scores of candidates over iterations")
ax.set_ylabel("mean test score", fontsize=15)
ax.set_xlabel("iterations", fontsize=15)
plt.tight_layout()
plt.show()
Figure: Scores of candidates over iterations

Number of candidates and amount of resource at each iteration

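At the first iteration, a small amount of resources is used. The resource here is the number of samples that the estimators are trained on. All candidates are evaluated.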
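As a rough illustration of what "resource" means here, the sketch below draws a small subset of row indices for a first round. The helper `subsample_indices`, the seed, and the budget of 20 rows out of 400 are assumptions made for illustration; scikit-learn handles the subsampling internally and its exact procedure may differ.

```python
import random


def subsample_indices(n_samples, n_resources, seed=0):
    """Pick n_resources distinct row indices out of n_samples.

    Hypothetical helper for illustration only, not part of scikit-learn.
    """
    rng = random.Random(seed)
    # Sample without replacement so no row is used twice.
    return sorted(rng.sample(range(n_samples), n_resources))


# Illustrative budget: 20 of 400 rows for the first iteration.
first_round = subsample_indices(n_samples=400, n_resources=20)
print(len(first_round))
```

Every candidate would be fitted and scored on a subset of this size, which keeps the first round cheap even when there are many candidates.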

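At the second iteration, only the best half of the candidates is evaluated. The number of allocated resources is doubled: candidates are evaluated on twice as many samples.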
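The selection step can be sketched in plain Python. The helper `keep_best` and the candidate names and scores are hypothetical; the sketch only shows the idea of keeping the top 1/factor of the candidates (rounded up) and is not scikit-learn's internal code.

```python
from math import ceil


def keep_best(scores, factor=2):
    """Return the names of the best ceil(n / factor) candidates.

    `scores` maps a candidate name to its mean test score.
    Hypothetical helper for illustration only.
    """
    n_keep = ceil(len(scores) / factor)
    # Rank candidates from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Made-up scores for four candidates.
scores = {"a": 0.70, "b": 0.90, "c": 0.80, "d": 0.60}
survivors = keep_best(scores, factor=2)
print(survivors)  # ['b', 'c']
```

With factor=2, as in this example, half of the candidates survive each round; a larger factor would prune more aggressively.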

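This process is repeated until the last iteration, where only 2 candidates are left. The best candidate is the candidate that has the best score at the last iteration.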
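The whole schedule described above can be approximated with a short simulation. This is a simplified sketch under stated assumptions: the function `halving_schedule` is hypothetical, and the starting values (20 candidates, 20 samples, a maximum of 400 samples) are illustrative inputs. For a fitted search, the actual values are available in rsh.n_candidates_ and rsh.n_resources_.

```python
from math import ceil


def halving_schedule(n_candidates, min_resources, max_resources, factor=2):
    """Simulate (iteration, n_candidates, n_resources) for each round.

    Simplified, hypothetical sketch: candidates are divided by `factor`
    (rounded up) and resources multiplied by `factor` after every round.
    """
    schedule = []
    n_resources = min_resources
    iteration = 0
    while n_resources <= max_resources:
        schedule.append((iteration, n_candidates, n_resources))
        if n_candidates <= factor:
            # The winner is picked from this round; nothing left to prune.
            break
        n_candidates = ceil(n_candidates / factor)
        n_resources *= factor
        iteration += 1
    return schedule


# Illustrative inputs, not values read from the fitted search.
schedule = halving_schedule(n_candidates=20, min_resources=20, max_resources=400)
for iteration, n_cand, n_res in schedule:
    print(f"iter={iteration} n_candidates={n_cand} n_samples={n_res}")
```

With these illustrative inputs the simulation stops after five rounds, the last one evaluating 2 candidates on 320 samples, which mirrors the "only 2 candidates are left" behaviour described above.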
