Scale XGBoost¶

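Dask and XGBoost can work together to train gradient boosted trees in parallel. This notebook shows how to use the two together.

XGBoost provides a powerful prediction framework that works well in practice. It wins Kaggle competitions and is popular in industry because it performs well and is easy to interpret (that is, it is easy to find the important features of an XGBoost model).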


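Set up Dask¶

We set up a Dask client, which provides performance and progress metrics via the dashboard.

You can view the dashboard by clicking the link shown after running the cell.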

[ ]:

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1)
client

Create data¶

First we create a batch of synthetic data with 100,000 examples and 20 features.

[ ]:

from dask_ml.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=20,
                           chunks=1000, n_informative=4,
                           random_state=0)
X

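Dask-XGBoost works with both arrays and dataframes. For more information on creating Dask arrays and dataframes from real data, see the documentation on Dask arrays or Dask dataframes.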
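
The `chunks=1000` argument used above partitions the 100,000 rows into blocks of 1,000 rows each, which are the units Dask distributes across workers. The following is a minimal pure-Python sketch of that row partitioning; `chunk_bounds` is a hypothetical helper for illustration and is not part of Dask.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row ranges for consecutive blocks of at most chunk_size rows."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


# Same sizes as the synthetic dataset above: 100,000 rows in blocks of 1,000.
bounds = chunk_bounds(100_000, 1_000)
print(len(bounds))            # number of row blocks
print(bounds[0], bounds[-1])  # first and last row range
```

Smaller chunks give more, lighter tasks; larger chunks give fewer, heavier ones.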

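Split data for training and testing¶

We split our dataset into training and testing data so that the model is evaluated fairly, on data it did not see during training: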

[ ]:

from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
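
With `test_size=0.15`, roughly 15% of the samples are held out for testing. The sketch below illustrates the idea of a shuffled hold-out split on plain indices. It is not how `dask_ml` implements it (as far as we understand, `dask_ml` may split each block separately, so exact counts can differ slightly), and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples, test_size, seed=0):
    """Shuffle the indices 0..n_samples-1 and hold out a test_size fraction."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100_000, 0.15)
print(len(train_idx), len(test_idx))
```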

Now, let's try to do something with this data using dask-xgboost.

Train with Dask-XGBoost¶

[ ]:

import dask
import xgboost
import dask_xgboost

dask-xgboost is a small wrapper around xgboost. Dask sets XGBoost up, hands it the data, and lets XGBoost train in the background using all the workers Dask has available.

Let's do some training:

[ ]:

params = {'objective': 'binary:logistic',
          'max_depth': 4, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=10)
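
To give a feel for two of the parameters above: with `'objective': 'binary:logistic'` the model turns a summed score (the margin) into a probability through the logistic function, and `'eta'` shrinks the contribution of each tree. The sketch below is a deliberately simplified illustration with made-up leaf values; `predict_proba` is a hypothetical helper, not an XGBoost function, and it does not reproduce the booster's internals.

```python
import math


def predict_proba(leaf_values, eta, base_margin=0.0):
    """Sum the shrunken per-tree outputs and map the margin to a probability."""
    margin = base_margin + eta * sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))


# Ten hypothetical trees that each vote +1.0 for the positive class.
p_small_eta = predict_proba([1.0] * 10, eta=0.01)
p_large_eta = predict_proba([1.0] * 10, eta=0.3)
print(p_small_eta, p_large_eta)
```

Under these assumptions, a small `eta` combined with only 10 boosting rounds keeps predictions close to 0.5, so more rounds or a larger `eta` may be worth trying on real data.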

Visualize results¶

The bst object is a regular xgboost.Booster object.

[ ]:

bst

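This means all the methods mentioned in the XGBoost documentation are available. We show two examples to expand on this, but they are examples of XGBoost rather than of Dask.

Plot feature importance¶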

[ ]:

%matplotlib inline
import matplotlib.pyplot as plt

ax = xgboost.plot_importance(bst, height=0.8, max_num_features=9)
ax.grid(False, axis="y")
ax.set_title('Estimated feature importance')
plt.show()

We specified that only 4 features were informative when creating our data, and only 3 features show up as important.
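
By default, `xgboost.plot_importance` uses the `'weight'` importance type, which counts how many times each feature is used in a split across all trees. A feature that never appears in a split receives no score, which is one way an informative feature can be missing from the plot. The sketch below illustrates that counting on a made-up ensemble; `weight_importance` and the toy trees are hypothetical and do not read a real booster.

```python
from collections import Counter


def weight_importance(trees):
    """Count how many split nodes use each feature across all trees."""
    counts = Counter()
    for split_features in trees:
        counts.update(split_features)
    return dict(counts)


# Hypothetical toy ensemble: each tree is listed as the features used at its split nodes.
toy_trees = [
    ["f3", "f7", "f3"],
    ["f3", "f12"],
    ["f7", "f3", "f12"],
]
importance = weight_importance(toy_trees)
print(importance)
```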

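Plot the Receiver Operating Characteristic curve¶

We can use a fancier metric to determine how well our classifier is doing by plotting the Receiver Operating Characteristic (ROC) curve: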

[ ]:

y_hat = dask_xgboost.predict(client, bst, X_test).persist()
y_hat

[ ]:

from sklearn.metrics import roc_curve

y_test, y_hat = dask.compute(y_test, y_hat)
fpr, tpr, _ = roc_curve(y_test, y_hat)

[ ]:

from sklearn.metrics import auc

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, lw=3,
        label='ROC Curve (area = {:.2f})'.format(auc(fpr, tpr)))
ax.plot([0, 1], [0, 1], 'k--', lw=2)
ax.set(
    xlim=(0, 1),
    ylim=(0, 1),
    title="ROC Curve",
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
)
ax.legend()
plt.show()

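This Receiver Operating Characteristic (ROC) curve tells us how well our classifier is doing. We can judge its performance by how far the curve bends toward the upper-left corner. The curve of a perfect classifier would pass through the upper-left corner, and that of a random classifier would follow the diagonal line.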
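
To make the construction of a ROC curve concrete, the sketch below sweeps a decision threshold over the predicted scores and records the false positive rate and true positive rate at each step. It is a minimal illustration on four made-up samples, not a replacement for `sklearn.metrics.roc_curve`; `roc_points` is a hypothetical helper.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels and predicted scores for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
```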

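In the recorded run, the area under this curve is 0.76. The area under the ROC curve (AUC) is the probability that the classifier scores a randomly chosen positive instance higher than a randomly chosen negative one; 0.5 corresponds to random guessing and 1.0 to a perfect ranking.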
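
The area under a ROC curve can also be read as a ranking probability: the chance that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it directly by comparing every positive/negative pair on made-up data, counting ties as half; `pairwise_auc` is a hypothetical helper for illustration.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores for illustration.
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(labels, predicted)
print(auc_value)
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples like this one.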

Learn more¶

A recorded screencast steps through the real-world example above.
A blog post on dask-xgboost: http://matthewrocklin.com/blog/work/2017/03/28/dask-xgboost
XGBoost documentation: https://xgboost.readthedocs.io/en/latest/python/python_intro.html#
Dask-XGBoost documentation: http://ml.dask.org/xgboost.html
