Using Model Sweep as an Initial Model Selection Tool

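Prerequisites: Basic knowledge of deep learning and of tabular problems such as regression and classification. Please also read the tutorial "Approaching any Tabular Problem using PyTorch Tabular".
Level: Intermediate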

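In this tutorial, we will look at a simple way to assess how the different deep learning models in PyTorch Tabular perform on a dataset. It is similar to the way PyCaret compares models. In PyTorch Tabular, we call this Model Sweep.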

from rich import print
from rich.pretty import pprint

Data

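We will use the Covertype dataset from the UCI ML Repository and split it into train and test sets. We could also carve out a validation set, but even if we do not, PyTorch Tabular will automatically split one off from the training set for us.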

from pytorch_tabular.utils import load_covertype_dataset
from sklearn.model_selection import train_test_split

data, cat_col_names, num_col_names, target_col = load_covertype_dataset()
train, test = train_test_split(data, random_state=42, test_size=0.2)
print(f"Train Shape: {train.shape} | Test Shape: {test.shape}")

Train Shape: (464809, 13) | Test Shape: (116203, 13)
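
If you prefer to control the validation split yourself, a two-stage split is one way to do it. The sketch below uses a plain Python list as a stand-in for the DataFrame rows so that it runs on its own, and the 80/20 ratios are illustrative assumptions. Passing the validation rows on to the sweep is assumed to work through a `validation` argument; verify that against the API reference of your installed version.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the rows of a DataFrame; train_test_split handles DataFrames the same way.
rows = list(range(1000))

# First split: hold out 20% of all rows as the test set.
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
# Second split: hold out 20% of the remaining rows as the validation set.
train_rows, val_rows = train_test_split(train_rows, test_size=0.2, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))
```

Fixing `random_state` in both calls keeps the three subsets reproducible across runs.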

Defining the Configs

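As you saw in the basic tutorial, we need to define a set of configs. For a model sweep, too, we need to define every config except the ModelConfig. We will keep most settings at their defaults and set a few to control the training process:
- Automatic learning rate finding
- Batch size
- Maximum number of epochs
- Turning off the progress bar and the model summary so that they do not clutter the output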

from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)
from pytorch_tabular.models.common.heads import LinearHeadConfig

data_config = DataConfig(
    target=[target_col],
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=25,
    auto_lr_find=True,
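    early_stopping=None,  # Disable early stopping; set to "valid_loss" to monitor validation loss
    # early_stopping_mode="min",  # Set the mode as min because for val_loss, lower is better
    # early_stopping_patience=5,  # No. of epochs of degradation training will wait before terminating
    checkpoints="valid_loss",  # Save the best checkpoint, monitoring validation loss
    load_best=True,  # After training, load the best checkpoint
    progress_bar="none",  # Turn off the progress bar
    trainer_kwargs=dict(enable_model_summary=False),  # Turn off the model summary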
    accelerator="cpu",
)
optimizer_config = OptimizerConfig()

head_config = LinearHeadConfig(
    layers="",  # No additional layers in the head, just a mapping layer to output_dim
    dropout=0.1,
    initialization="kaiming",
).__dict__  # Convert to dict to pass to the model config (OmegaConf doesn't accept objects)
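
The `.__dict__` conversion works when the config is a plain dataclass, because its fields live in the instance dictionary. The sketch below illustrates the idea with a hypothetical stand-in class (`HeadConfigSketch` is an assumption for illustration and is not part of PyTorch Tabular).

```python
from dataclasses import asdict, dataclass


@dataclass
class HeadConfigSketch:
    """Hypothetical stand-in for a head config; not a PyTorch Tabular class."""

    layers: str = ""
    dropout: float = 0.1
    initialization: str = "kaiming"


cfg = HeadConfigSketch()

# For a flat dataclass, the instance __dict__ and dataclasses.asdict hold the same items.
as_plain_dict = cfg.__dict__
assert as_plain_dict == asdict(cfg)
print(as_plain_dict)
```

Note that `__dict__` is the live attribute dictionary of the instance, while `dataclasses.asdict` returns a copy and also recurses into nested dataclasses.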

Model Sweep

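Model sweep lets you quickly run through different models and configurations. It takes either a list of model configs or one of the presets defined in pytorch_tabular.MODEL_SWEEP_PRESETS, trains each model on the data, ranks the models by the metric you provide, and returns the best one.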

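The main parameters of the model_sweep function are:
- task: The type of prediction task. Either "classification" or "regression".
- train: The training data.
- test: The test data on which performance is evaluated.
- Configs: All the config objects can be passed either as objects or as paths to YAML files.
- model_list: The list of models to compare. This can be one of the presets defined in pytorch_tabular.MODEL_SWEEP_PRESETS or a list of ModelConfig objects.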

There are four presets defined in pytorch_tabular.MODEL_SWEEP_PRESETS:

from pytorch_tabular import MODEL_SWEEP_PRESETS

print(list(MODEL_SWEEP_PRESETS.keys()))

['lite', 'standard', 'full', 'high_memory']

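lite: A set of models that are fast to train. This is the default value for model_list. The models and their hyperparameters were chosen carefully so that they have a comparable number of parameters, train relatively quickly, and give good results. The models included are: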

pprint(MODEL_SWEEP_PRESETS["lite"])

(
│   ('CategoryEmbeddingModelConfig', {'layers': '256-128-64'}),
│   ('GANDALFConfig', {'gflu_stages': 6}),
│   ('TabNetModelConfig', {'n_d': 32, 'n_a': 32, 'n_steps': 3, 'gamma': 1.5, 'n_independent': 1, 'n_shared': 2})
)

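standard: A set of models with around 100 thousand learnable parameters or fewer, so memory requirements stay modest. All the models from the lite preset are included as well. The models and their hyperparameters were chosen carefully so that they are comparable in parameter count and give good results. The models included are: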

pprint(MODEL_SWEEP_PRESETS["standard"])

(
│   ('CategoryEmbeddingModelConfig', {'layers': '256-128-64'}),
│   ('CategoryEmbeddingModelConfig', {'layers': '512-128-64'}),
│   ('GANDALFConfig', {'gflu_stages': 6}),
│   ('GANDALFConfig', {'gflu_stages': 15}),
│   ('TabNetModelConfig', {'n_d': 32, 'n_a': 32, 'n_steps': 3, 'gamma': 1.5, 'n_independent': 1, 'n_shared': 2}),
│   ('TabNetModelConfig', {'n_d': 32, 'n_a': 32, 'n_steps': 5, 'gamma': 1.5, 'n_independent': 2, 'n_shared': 3}),
│   ('FTTransformerConfig', {'num_heads': 4, 'num_attn_blocks': 4})
)
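
Each preset entry is a pair: a config class name and a dictionary of hyperparameter overrides. The sketch below shows one way such pairs could be expanded into config objects, with the task and any shared arguments merged in. The two dataclasses and the `REGISTRY` lookup are hypothetical stand-ins for illustration, not PyTorch Tabular's classes or internals.

```python
from dataclasses import dataclass


@dataclass
class MLPConfigSketch:  # hypothetical stand-in for CategoryEmbeddingModelConfig
    task: str
    layers: str = "128-64-32"
    head: str = "LinearHead"


@dataclass
class GandalfConfigSketch:  # hypothetical stand-in for GANDALFConfig
    task: str
    gflu_stages: int = 6
    head: str = "LinearHead"


# Hypothetical name-to-class lookup used only in this sketch.
REGISTRY = {
    "MLPConfigSketch": MLPConfigSketch,
    "GandalfConfigSketch": GandalfConfigSketch,
}

preset = (
    ("MLPConfigSketch", {"layers": "256-128-64"}),
    ("MLPConfigSketch", {"layers": "512-128-64"}),
    ("GandalfConfigSketch", {"gflu_stages": 15}),
)
common_model_args = {"head": "LinearHead"}

# Merge the task, the per-entry overrides, and the shared arguments into each config.
configs = [
    REGISTRY[name](task="classification", **overrides, **common_model_args)
    for name, overrides in preset
]
for cfg in configs:
    print(cfg)
```

Because the same class name can appear more than once with different overrides, a preset can compare several sizes of one architecture in a single run.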

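full: A full sweep of the models implemented in PyTorch Tabular, with default hyperparameters, except Mixture Density Networks (a specialized model for probabilistic regression) and NODE (a model that needs high compute and memory). The models included are: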

pprint(list(MODEL_SWEEP_PRESETS["full"]))

[
│   'AutoIntConfig',
│   'CategoryEmbeddingModelConfig',
│   'DANetConfig',
│   'FTTransformerConfig',
│   'GANDALFConfig',
│   'GatedAdditiveTreeEnsembleConfig',
│   'TabNetModelConfig',
│   'TabTransformerConfig'
]

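high_memory: A full sweep of the models implemented in PyTorch Tabular, with default hyperparameters, except Mixture Density Networks (a specialized model for probabilistic regression). It is recommended only if you have enough CPU/GPU memory to hold the models and the data. The models included are: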

pprint(list(MODEL_SWEEP_PRESETS["high_memory"]))

[
│   'AutoIntConfig',
│   'CategoryEmbeddingModelConfig',
│   'DANetConfig',
│   'FTTransformerConfig',
│   'GANDALFConfig',
│   'GatedAdditiveTreeEnsembleConfig',
│   'NodeConfig',
│   'TabNetModelConfig',
│   'TabTransformerConfig'
]

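- metrics, metrics_params, metrics_prob_input: The metrics to use for evaluation. These parameters have the same meaning as in ModelConfig.
- rank_metric: The metric used to rank the models. This is a tuple whose first element is the metric name and whose second element is the direction, either "lower_is_better" or "higher_is_better". Defaults to ('loss', "lower_is_better").
- return_best_model: If True, the best model is returned. Defaults to True.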
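
`metrics`, `metrics_params`, and `metrics_prob_input` are parallel lists: the i-th entry of each describes the same metric. A small sanity check like the one below can catch mismatched lengths before a long sweep starts. `pair_metric_specs` is a hypothetical helper written for this illustration, not a PyTorch Tabular function.

```python
def pair_metric_specs(metrics, metrics_params, metrics_prob_input):
    """Zip the three parallel lists into one spec per metric (illustrative helper)."""
    if not (len(metrics) == len(metrics_params) == len(metrics_prob_input)):
        raise ValueError(
            "metrics, metrics_params and metrics_prob_input must have equal length"
        )
    return [
        {"name": name, "params": params, "prob_input": prob_input}
        for name, params, prob_input in zip(metrics, metrics_params, metrics_prob_input)
    ]


specs = pair_metric_specs(
    metrics=["accuracy", "f1_score"],
    metrics_params=[{}, {"average": "macro"}],
    metrics_prob_input=[False, True],
)
for spec in specs:
    print(spec)
```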

Now let's run the sweep on the Covertype dataset using the lite preset.

%%time
from pytorch_tabular import model_sweep
import warnings

# Filtering out the warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sweep_df, best_model = model_sweep(
        task="classification",  # One of "classification", "regression"
        train=train,
        test=test,
        data_config=data_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
        model_list="lite",
        common_model_args=dict(head="LinearHead", head_config=head_config),
        metrics=["accuracy", "f1_score"],
        metrics_params=[{}, {"average": "macro"}],
        metrics_prob_input=[False, True],
        rank_metric=("accuracy", "higher_is_better"),
        progress_bar=True,
        verbose=False,
        suppress_lightning_logger=True,
    )

CPU times: user 2h 29min 42s, sys: 15.8 s, total: 2h 29min 58s
Wall time: 16min 37s

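The output sweep_df is a pandas DataFrame with the following columns:
- model: The name of the model
- # Params: The number of trainable parameters in the model
- test_loss: The loss on the test set
- test_<metric>: The value of the metric on the test set
- time_taken: The time taken to train the model
- epochs: The number of epochs trained
- time_taken_per_epoch: The time taken per epoch
- params: The config used to train the model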

Let's check which model performed best.

sweep_df.drop(columns=["params", "time_taken", "epochs"]).style.background_gradient(
    subset=["test_accuracy", "test_f1_score"], cmap="RdYlGn"
).background_gradient(subset=["time_taken_per_epoch", "test_loss"], cmap="RdYlGn_r")

	model	# Params	test_loss	test_accuracy	test_f1_score	time_taken_per_epoch
1	GANDALFModel	43 T	0.189933	0.924494	0.924418	10.985013
2	TabNetModel	50 T	0.259448	0.895175	0.894817	19.809555
0	CategoryEmbeddingModel	51 T	0.302084	0.878024	0.876729	7.634541
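
The ranking itself amounts to a sort on the chosen metric in the chosen direction. As an illustration, the sketch below re-ranks the three rows shown above with plain Python. `rank_results` is a hypothetical helper, not the library's implementation.

```python
def rank_results(rows, rank_metric):
    """Sort result rows by (metric_name, direction), best row first."""
    metric, direction = rank_metric
    if direction not in ("lower_is_better", "higher_is_better"):
        raise ValueError(f"unknown direction: {direction}")
    key = f"test_{metric}"
    return sorted(rows, key=lambda r: r[key], reverse=(direction == "higher_is_better"))


# Values copied from the lite sweep results above.
rows = [
    {"model": "CategoryEmbeddingModel", "test_loss": 0.302084, "test_accuracy": 0.878024},
    {"model": "GANDALFModel", "test_loss": 0.189933, "test_accuracy": 0.924494},
    {"model": "TabNetModel", "test_loss": 0.259448, "test_accuracy": 0.895175},
]

by_accuracy = rank_results(rows, ("accuracy", "higher_is_better"))
by_loss = rank_results(rows, ("loss", "lower_is_better"))
print([r["model"] for r in by_accuracy])
print([r["model"] for r in by_loss])
```

For these results both rankings agree, but on other datasets loss and accuracy can order the models differently, which is why the direction is part of `rank_metric`.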

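We trained three fast models on the dataset in about 17 minutes of wall-clock time on a CPU, which is quite fast. The GANDALF model performed best in terms of accuracy, loss, and F1 score, and its training time per epoch is comparable to that of a regular MLP. A natural next step is to tune the model further to find the best parameters.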

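Alternatively, if you have more time, access to a reasonably sized GPU, and want to try more models, you can use the standard preset. Even on a CPU it will probably run for only a few hours, and it will give you a good idea of how the different models perform.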

Let's try running the standard preset.

%%time
# Filtering out the warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sweep_df, best_model = model_sweep(
        task="classification",  # One of "classification", "regression"
        train=train,
        test=test,
        data_config=data_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
        model_list="standard",
        common_model_args=dict(head="LinearHead", head_config=head_config),
        metrics=["accuracy", "f1_score"],
        metrics_params=[{}, {"average": "macro"}],
        metrics_prob_input=[False, True],
        rank_metric=("accuracy", "higher_is_better"),
        progress_bar=True,
        verbose=False,
        suppress_lightning_logger=True,
    )

CPU times: user 10h 11min 4s, sys: 2min 16s, total: 10h 13min 20s
Wall time: 1h 6min 18s

sweep_df.drop(columns=["params", "time_taken", "epochs"]).style.background_gradient(
    subset=["test_accuracy", "test_f1_score"], cmap="RdYlGn"
).background_gradient(subset=["time_taken_per_epoch", "test_loss"], cmap="RdYlGn_r")

	model	# Params	test_loss	test_accuracy	test_f1_score	time_taken_per_epoch
3	GANDALFModel	107 T	0.163602	0.935071	0.935061	15.870558
1	CategoryEmbeddingModel	93 T	0.233573	0.906560	0.905311	9.128509
6	FTTransformerModel	117 T	0.243499	0.900330	0.900065	63.771070
2	GANDALFModel	43 T	0.257583	0.898075	0.897640	10.899241
4	TabNetModel	50 T	0.260693	0.894461	0.894012	18.629878
0	CategoryEmbeddingModel	51 T	0.263826	0.893875	0.894207	7.868230
5	TabNetModel	129 T	0.534261	0.766813	0.760403	32.926586

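The larger GANDALF model performed best in terms of accuracy, loss, and F1 score. Its training time is somewhat higher than that of the comparable MLP, but it is still quite fast.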

Instead of using a preset, you can also pass a list of ModelConfig objects. Let's try a sweep with a list of ModelConfig objects.

from pytorch_tabular.models import CategoryEmbeddingModelConfig, GANDALFConfig

common_params = {
    "task": "classification",
    "head": "LinearHead",
    "head_config": head_config,
}
model_list = [
    CategoryEmbeddingModelConfig(layers="1024-512-256", **common_params),
    GANDALFConfig(gflu_stages=2, **common_params),
    GANDALFConfig(gflu_stages=6, learnable_sparsity=False, **common_params),
]

# Filtering out the warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sweep_df, best_model = model_sweep(
        task="classification",  # One of "classification", "regression"
        train=train,
        test=test,
        data_config=data_config,
        optimizer_config=optimizer_config,
        trainer_config=trainer_config,
        model_list=model_list,
        metrics=["accuracy", "f1_score"],
        metrics_params=[{}, {"average": "macro"}],
        metrics_prob_input=[False, True],
        rank_metric=("accuracy", "higher_is_better"),
        progress_bar=True,
        verbose=False,
        suppress_lightning_logger=True,
    )

sweep_df.drop(columns=["params", "time_taken", "epochs"]).style.background_gradient(
    subset=["test_accuracy", "test_f1_score"], cmap="RdYlGn"
).background_gradient(subset=["time_taken_per_epoch", "test_loss"], cmap="RdYlGn_r")

	model	# Params	test_loss	test_accuracy	test_f1_score	time_taken_per_epoch
0	CategoryEmbeddingModel	694 T	0.276405	0.888075	0.795560	14.553613
1	GANDALFModel	15 T	0.284878	0.885967	0.797202	8.369561
2	GANDALFModel	43 T	0.287677	0.884142	0.793678	10.864214

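Although we picked these hyperparameters somewhat arbitrarily, we can see that the GANDALF models perform very close to the MLP while using far fewer parameters and taking less time per epoch to train.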

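Congratulations! You have learned how to use Model Sweep in PyTorch Tabular to check how multiple models perform on a single dataset. This is a very useful first step in choosing the right model for your problem.
Now try it on your own dataset. You can also try the `full` preset and see how it performs.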