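Predicting League of Legends wins with XGBoost
This notebook uses the Kaggle dataset League of Legends Ranked Matches, which contains 180,000 ranked League of Legends games played since 2014. We use this data to build an XGBoost model that predicts whether a player's team will win, based on that player's performance statistics in the match.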
from pathlib import Path
import matplotlib.pyplot as pl
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
import shap
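To run this notebook yourself, download the dataset from Kaggle and make sure the folder_path variable below points to it. To do so, follow the link above, then download and unzip the data. Change folder_path if needed.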
# read in the data
folder_path = Path("../local_scratch/data/league-of-legends-ranked-matches/")
matches = pd.read_csv(folder_path / "matches.csv")
participants = pd.read_csv(folder_path / "participants.csv")
stats1 = pd.read_csv(folder_path / "stats1.csv", low_memory=False)
stats2 = pd.read_csv(folder_path / "stats2.csv", low_memory=False)
stats = pd.concat([stats1, stats2])
# merge into a single DataFrame
a = pd.merge(
    participants, matches, left_on="matchid", right_on="id", suffixes=("", "_matches")
)
allstats_orig = pd.merge(
    a, stats, left_on="matchid", right_on="id", suffixes=("", "_stats")
)
allstats = allstats_orig.copy()
# drop games that lasted less than 10 minutes
allstats = allstats.loc[allstats["duration"] >= 10 * 60, :]
# Convert string-based categories to numeric values
cat_cols = ["role", "position", "version", "platformid"]
for c in cat_cols:
    allstats[c] = allstats[c].astype("category")
    allstats[c] = allstats[c].cat.codes
allstats["wardsbought"] = allstats["wardsbought"].astype(np.int32)
X = allstats.drop(columns=["win"])
y = allstats["win"]
# convert all features we want to consider as rates
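# the features labeled "per min." in full_names below
rate_features = [
    "kills", "deaths", "assists", "killingsprees", "doublekills",
    "triplekills", "quadrakills", "pentakills", "legendarykills",
    "totdmgdealt", "magicdmgdealt", "physicaldmgdealt", "truedmgdealt",
    "totdmgtochamp", "magicdmgtochamp", "physdmgtochamp", "truedmgtochamp",
    "totheal", "totunitshealed", "dmgtoobj", "timecc", "totdmgtaken",
    "magicdmgtaken", "physdmgtaken", "truedmgtaken", "goldearned", "goldspent",
    "totminionskilled", "neutralminionskilled", "ownjunglekills",
    "enemyjunglekills", "totcctimedealt", "pinksbought", "wardsbought",
    "wardsplaced",
]
for feature_name in rate_features:
    X[feature_name] /= X["duration"] / 60  # per minute rate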
# convert to fraction of game
X["longesttimespentliving"] /= X["duration"]
# define friendly names for the features
full_names = {
    "kills": "Kills per min.",
    "deaths": "Deaths per min.",
    "assists": "Assists per min.",
    "killingsprees": "Killing sprees per min.",
    "longesttimespentliving": "Longest time living as % of game",
    "doublekills": "Double kills per min.",
    "triplekills": "Triple kills per min.",
    "quadrakills": "Quadra kills per min.",
    "pentakills": "Penta kills per min.",
    "legendarykills": "Legendary kills per min.",
    "totdmgdealt": "Total damage dealt per min.",
    "magicdmgdealt": "Magic damage dealt per min.",
    "physicaldmgdealt": "Physical damage dealt per min.",
    "truedmgdealt": "True damage dealt per min.",
    "totdmgtochamp": "Total damage to champions per min.",
    "magicdmgtochamp": "Magic damage to champions per min.",
    "physdmgtochamp": "Physical damage to champions per min.",
    "truedmgtochamp": "True damage to champions per min.",
    "totheal": "Total healing per min.",
    "totunitshealed": "Total units healed per min.",
    "dmgtoobj": "Damage to objects per min.",
    "timecc": "Time spent with crowd control per min.",
    "totdmgtaken": "Total damage taken per min.",
    "magicdmgtaken": "Magic damage taken per min.",
    "physdmgtaken": "Physical damage taken per min.",
    "truedmgtaken": "True damage taken per min.",
    "goldearned": "Gold earned per min.",
    "goldspent": "Gold spent per min.",
    "totminionskilled": "Total minions killed per min.",
    "neutralminionskilled": "Neutral minions killed per min.",
    "ownjunglekills": "Own jungle kills per min.",
    "enemyjunglekills": "Enemy jungle kills per min.",
    "totcctimedealt": "Total crowd control time dealt per min.",
    "pinksbought": "Pink wards bought per min.",
    "wardsbought": "Wards bought per min.",
    "wardsplaced": "Wards placed per min.",
    "turretkills": "# of turret kills",
    "inhibkills": "# of inhibitor kills",
    "dmgtoturrets": "Damage to turrets",
}
feature_names = [full_names.get(n, n) for n in X.columns]
X.columns = feature_names
# create train/validation split
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=10)
dt = xgb.DMatrix(Xt, label=yt.values)
dv = xgb.DMatrix(Xv, label=yv.values)
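The rate conversion in the preprocessing above is easiest to see on a tiny example. The sketch below uses a made-up two-row frame (the values and the name toy are hypothetical) and assumes duration is measured in seconds, as the 10 * 60 filter suggests.

```python
import pandas as pd

# Toy data with hypothetical values; duration is assumed to be in seconds
toy = pd.DataFrame(
    {
        "duration": [1800, 1200],
        "kills": [9.0, 2.0],
        "longesttimespentliving": [900.0, 300.0],
    }
)

# Divide by the game length in minutes to get a per-minute rate
toy["kills"] /= toy["duration"] / 60
# Divide by the game length in seconds to get a fraction of the game
toy["longesttimespentliving"] /= toy["duration"]

print(toy)
```

Normalizing by game length keeps long games from looking better than short ones simply because more time was available to accumulate kills or gold.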
Train the XGBoost model
params = {
    "objective": "binary:logistic",
    "base_score": np.mean(yt),
    "eval_metric": "logloss",
}
model = xgb.train(
    params,
    dt,
    evals=[(dt, "train"), (dv, "valid")],
)
[0] train-logloss:0.57255 valid-logloss:0.57258
[9] train-logloss:0.34293 valid-logloss:0.34323
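Because the Tree SHAP algorithm is implemented in XGBoost, we can compute exact SHAP values quickly over thousands of samples. The SHAP values for a single prediction (together with the expected output, which XGBoost's native contribution output places in the last column) sum to the model output for that prediction.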
# compute the SHAP values for every prediction in the validation dataset
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Xv)
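The SHAP values sum to the difference between the model's expected output and its output for the current player. Note that the Tree SHAP implementation explains the margin output of the model, not the transformed output (such as the probability in logistic regression). This means the SHAP values for this model are in units of log odds. Large positive values mean the player is likely to win, while large negative values mean they are likely to lose.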
shap.force_plot(explainer.expected_value, shap_values[0, :], Xv.iloc[0, :])
xs = np.linspace(-4, 4, 100)
pl.xlabel("Log odds of winning")
pl.ylabel("Probability of winning")
pl.title("How changes in log odds convert to probability of winning")
pl.plot(xs, 1 / (1 + np.exp(-xs)))
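To make the log odds units concrete, here is a small worked example that adds a base value and a few SHAP values and converts the result to a probability. All numbers are hypothetical stand-ins, not values from the trained model, and log_odds_to_probability is a helper defined only for this sketch.

```python
import numpy as np


def log_odds_to_probability(log_odds):
    """Logistic (sigmoid) transform from log odds to probability."""
    return 1.0 / (1.0 + np.exp(-np.asarray(log_odds, dtype=float)))


# Hypothetical numbers, not taken from the trained model
expected_log_odds = 0.0                    # stand-in for explainer.expected_value
player_shap = np.array([1.2, -0.4, 0.3])   # stand-in for one row of shap_values

# The margin output is the base value plus the sum of the SHAP values
margin = expected_log_odds + player_shap.sum()
probability = log_odds_to_probability(margin)

print(f"margin = {margin:.2f}, win probability = {probability:.3f}")
```

Because the transform is nonlinear, the same SHAP value shifts the probability more near 50% than near 0% or 100%.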
shap.summary_plot(shap_values, Xv)
shap.dependence_plot(
    "Gold earned per min.", shap_values, Xv, interaction_index="Deaths per min."
)
# sort the features indexes by their importance in the model
# (sum of SHAP value magnitudes over the validation dataset)
top_inds = np.argsort(-np.sum(np.abs(shap_values), 0))
# make SHAP dependence plots of the 20 most important features
for i in range(20):
    shap.dependence_plot(top_inds[i], shap_values, Xv)
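The importance ranking used above (sum of SHAP magnitudes, then argsort on the negated totals) can be illustrated on a small made-up matrix. The array toy_shap and the names in toy_names are hypothetical.

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples (rows) x 3 features (columns)
toy_shap = np.array(
    [
        [0.0, 0.5, -0.1],
        [0.1, -0.7, 0.2],
        [0.0, 0.6, -0.3],
        [-0.1, -0.4, 0.1],
    ]
)
toy_names = ["wards", "gold", "deaths"]  # made-up feature names

# Total absolute SHAP value per feature, summed over samples
importance = np.abs(toy_shap).sum(axis=0)
# Negate so that argsort returns the largest totals first
order = np.argsort(-importance)
ranked = [toy_names[i] for i in order]

print(ranked)
```

Taking absolute values before summing matters: a feature that pushes some players strongly toward winning and others strongly toward losing would otherwise cancel out and look unimportant.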