Using XGBoost on GPU Devices

This demo shows how to train a model on the Forest Cover Type dataset using GPU acceleration. The dataset has 581,012 rows and 54 features, so it is time-consuming to process. We compare the runtime and accuracy of the GPU and CPU histogram algorithms.

In addition, the demo showcases other GPU-related libraries, including cupy and cuml. These libraries are not strictly required.
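Since cupy and cuml are optional, the same preprocessing and split can be done entirely on the CPU with NumPy and scikit-learn. A minimal sketch, using a small synthetic stand-in for the covtype arrays (the shapes and random labels here are illustrative, not the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 54 features, labels in 1..7 like covtype.
rng = np.random.default_rng(42)
X = rng.random((100, 54))
y = rng.integers(1, 8, size=100)
y -= y.min()  # shift labels so classes start at 0, as XGBoost expects

# Same 0.75/0.25 split as the demo, on plain NumPy arrays.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, train_size=0.75, random_state=42
)
```

With CPU-resident arrays like these, `sklearn.model_selection.train_test_split` is a drop-in replacement for the cuml version used below.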

import time

import cupy as cp
from cuml.model_selection import train_test_split
from sklearn.datasets import fetch_covtype

import xgboost as xgb

# Fetch dataset using sklearn
X, y = fetch_covtype(return_X_y=True)
X = cp.array(X)
y = cp.array(y)
# covtype labels run 1..7; shift them to start at 0 as XGBoost expects
y -= y.min()

# Create 0.75/0.25 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, train_size=0.75, random_state=42
)

# Specify sufficient boosting iterations to reach a minimum
num_round = 3000

# Leave most parameters as default
clf = xgb.XGBClassifier(device="cuda", n_estimators=num_round)
# Train model
start = time.time()
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
gpu_res = clf.evals_result()
print("GPU Training Time: %s seconds" % (str(time.time() - start)))
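`evals_result()` returns a nested dict keyed by evaluation-set name and then by metric, with one value per boosting round; for multiclass problems XGBoost's default metric is mlogloss. A minimal sketch of reading the final value, using a hypothetical dict shaped like `gpu_res` (the numbers are made up):

```python
# Hypothetical evals_result() output after three rounds of training.
gpu_res = {"validation_0": {"mlogloss": [1.2, 0.9, 0.7]}}

# The last entry is the metric value after the final boosting round.
final_loss = gpu_res["validation_0"]["mlogloss"][-1]
print(final_loss)  # 0.7
```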

# Repeat for CPU algorithm
clf = xgb.XGBClassifier(device="cpu", n_estimators=num_round)
start = time.time()
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
cpu_res = clf.evals_result()
print("CPU Training Time: %s seconds" % (str(time.time() - start)))
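The demo compares accuracy as well as runtime; one way to measure it is `sklearn.metrics.accuracy_score` on the test-set predictions (with cupy arrays, convert to NumPy first, e.g. via `cp.asnumpy`). A minimal sketch with hypothetical label arrays standing in for `y_test` and `clf.predict(X_test)`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted class labels (in the demo these would be
# y_test and clf.predict(X_test), moved to host memory if they are cupy arrays).
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8 (4 of 5 predictions match)
```

Computing this once for the GPU-trained model and once for the CPU-trained model completes the accuracy comparison described above.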
