Training in Tune (tune.Trainable, train.report)#

Training can be done with either the Function API (train.report()) or the Class API (tune.Trainable).

For the sake of example, let's maximize this objective function:

def objective(x, a, b):
    return a * (x ** 0.5) + b

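Function Trainable API#

Use the Function API to define a custom training function that Tune runs in Ray actor processes. Each trial is placed into its own Ray actor process, and trials run in parallel.

The config argument in the function is a dictionary that Ray Tune populates automatically. It holds the hyperparameters selected for the trial from the search space.

With the Function API, you can report intermediate metrics by calling train.report() inside the function.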

from ray import train, tune


def trainable(config: dict):
    intermediate_score = 0
    for x in range(20):
        intermediate_score = objective(x, config["a"], config["b"])
        train.report({"score": intermediate_score})  # This sends the score to Tune.


tuner = tune.Tuner(trainable, param_space={"a": 2, "b": 4})
results = tuner.fit()
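To see what Tune receives from this trainable, the reported values can be reproduced without Ray. The following is a minimal sketch that only re-evaluates the objective function defined earlier for a = 2 and b = 4; the helper name reported_scores is hypothetical and is not part of Tune.

```python
def objective(x, a, b):
    return a * (x ** 0.5) + b


def reported_scores(config: dict, steps: int = 20) -> list:
    """Return the values the trainable above would pass to train.report()."""
    return [objective(x, config["a"], config["b"]) for x in range(steps)]


scores = reported_scores({"a": 2, "b": 4})
print(len(scores))   # 20 reports, one per loop iteration
print(scores[0])     # 4.0, because objective(0, 2, 4) = 2 * 0 + 4
print(scores[4])     # 8.0, because objective(4, 2, 4) = 2 * 2 + 4
```

Because the objective grows with x, the last of the 20 reported values is also the largest.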

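Tip

Do not use train.report() within a Trainable class.

In the previous example, we reported on every step, but this metric reporting frequency is configurable. For example, we could also report only once at the end, with the final score: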

from ray import train, tune


def trainable(config: dict):
    final_score = 0
    for x in range(20):
        final_score = objective(x, config["a"], config["b"])

    train.report({"score": final_score})  # This sends the score to Tune.


tuner = tune.Tuner(trainable, param_space={"a": 2, "b": 4})
results = tuner.fit()

You can also pass the final set of metrics to Tune by returning them from the function:

def trainable(config: dict):
    final_score = 0
    for x in range(20):
        final_score = objective(x, config["a"], config["b"])

    return {"score": final_score}  # This sends the score to Tune.

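Note that Ray Tune outputs extra values in addition to the user-reported metrics, such as iterations_since_restore. See "How to use log metrics in Tune?" for an explanation of these values.

For checkpointing, see the guide on how to configure checkpointing for a function trainable.

Class Trainable API#

Caution

Do not use train.report() within a Trainable class.

The Trainable class API requires you to subclass ray.tune.Trainable. Here is a simple example of this API: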

from ray import train, tune


class Trainable(tune.Trainable):
    def setup(self, config: dict):
        # config (dict): A dict of hyperparameters
        self.x = 0
        self.a = config["a"]
        self.b = config["b"]

    def step(self):  # This is called iteratively.
        score = objective(self.x, self.a, self.b)
        self.x += 1
        return {"score": score}


tuner = tune.Tuner(
    Trainable,
    run_config=train.RunConfig(
        # Train for 20 steps
        stop={"training_iteration": 20},
        checkpoint_config=train.CheckpointConfig(
            # We haven't implemented checkpointing yet. See below!
            checkpoint_at_end=False
        ),
    ),
    param_space={"a": 2, "b": 4},
)
results = tuner.fit()

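When you subclass tune.Trainable, Tune creates a Trainable object in a separate process (using the Ray Actor API).

setup is invoked once when training starts.

step is invoked multiple times. Each call executes one logical training iteration of the tuning process, which may include one or more iterations of actual training.

cleanup is invoked when training is finished.

The config argument in the setup method is a dictionary that Tune populates automatically. It holds the hyperparameters selected for the trial from the search space.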
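The call order described above can be sketched without Ray. This is a simplified illustration, not Tune's actual implementation: SketchTrainable is a plain Python class that only mirrors the setup / step / cleanup method names, and run_trial is a hypothetical driver standing in for what Tune does inside the actor process.

```python
def objective(x, a, b):
    return a * (x ** 0.5) + b


class SketchTrainable:
    """Plain stand-in that mirrors the method names of tune.Trainable."""

    def setup(self, config: dict):
        self.calls = ["setup"]
        self.x = 0
        self.a = config["a"]
        self.b = config["b"]

    def step(self):
        self.calls.append("step")
        score = objective(self.x, self.a, self.b)
        self.x += 1
        return {"score": score}

    def cleanup(self):
        self.calls.append("cleanup")


def run_trial(trainable, config: dict, num_iterations: int) -> list:
    """Hypothetical driver: setup once, step repeatedly, cleanup once."""
    trainable.setup(config)
    results = []
    for training_iteration in range(1, num_iterations + 1):
        result = trainable.step()
        # Each step call counts as one training iteration.
        result["training_iteration"] = training_iteration
        results.append(result)
    trainable.cleanup()
    return results


trial = SketchTrainable()
results = run_trial(trial, {"a": 2, "b": 4}, num_iterations=3)
print(trial.calls)                        # ['setup', 'step', 'step', 'step', 'cleanup']
print(results[-1]["training_iteration"])  # 3
```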

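Tip

As a rule of thumb, the execution time of step should be long enough to avoid overhead (more than a few seconds), but short enough to report progress periodically (at most a few minutes).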
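One way to follow this guideline is to run several actual training iterations inside a single step call. The sketch below assumes a hypothetical config key inner_iterations (not a Tune setting) and uses a plain Python class rather than a real tune.Trainable subclass, so it only illustrates the structure of such a step method.

```python
def objective(x, a, b):
    return a * (x ** 0.5) + b


class BatchedStepSketch:
    """Plain stand-in whose step() groups several actual iterations."""

    def setup(self, config: dict):
        self.x = 0
        self.a = config["a"]
        self.b = config["b"]
        # Hypothetical knob: actual iterations per logical iteration.
        self.inner_iterations = config.get("inner_iterations", 1)

    def step(self):
        score = None
        for _ in range(self.inner_iterations):
            score = objective(self.x, self.a, self.b)
            self.x += 1
        # Only one result is returned per logical iteration.
        return {"score": score, "actual_iterations": self.x}


sketch = BatchedStepSketch()
sketch.setup({"a": 2, "b": 4, "inner_iterations": 5})
first = sketch.step()
print(first)  # {'score': 8.0, 'actual_iterations': 5}
```

Raising inner_iterations makes each step longer and reporting less frequent; lowering it does the opposite.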

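You'll notice that Ray Tune outputs extra values in addition to the user-reported metrics, such as iterations_since_restore. See "How to use log metrics in Tune?" for an explanation and glossary of these values.

For checkpointing, see the guide on how to configure checkpointing for a class trainable.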

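Advanced: Reusing Actors in Tune#

Note

This feature is only available for the Trainable Class API.

Your Trainable can take a long time to start. To avoid paying this cost for every trial, you can pass tune.TuneConfig(reuse_actors=True) (which is taken in by Tuner) to reuse the same Trainable Python process and object for multiple hyperparameter configurations.

This requires you to implement Trainable.reset_config, which receives a new set of hyperparameters. It is up to you to correctly update the hyperparameters of your trainable.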

import torch.optim as optim
from ray import tune

# ConvNet and get_data_loaders are assumed to be defined elsewhere.


class PytorchTrainable(tune.Trainable):
    """Train a Pytorch ConvNet."""

    def setup(self, config):
        self.train_loader, self.test_loader = get_data_loaders()
        self.model = ConvNet()
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=config.get("lr", 0.01),
            momentum=config.get("momentum", 0.9))

    def reset_config(self, new_config):
        # Start the new trial from a fresh model, then rebuild the optimizer
        # so that it tracks the new model's parameters. Updating only the old
        # optimizer's param_groups would leave it pointing at the old model.
        self.model = ConvNet()
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=new_config.get("lr", 0.01),
            momentum=new_config.get("momentum", 0.9))

        self.config = new_config
        return True
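The same pattern can be shown on the toy objective from the start of this page. The following is a Ray-free sketch under stated assumptions: ReusableSketch is a plain class that mirrors the setup, step, and reset_config method names, and the loop at the bottom is a hypothetical stand-in for Tune handing a new configuration to an existing actor. It does not reproduce Tune's scheduling.

```python
def objective(x, a, b):
    return a * (x ** 0.5) + b


class ReusableSketch:
    """Plain stand-in that mirrors setup / step / reset_config."""

    setup_calls = 0  # counts how often the expensive start-up ran

    def setup(self, config: dict):
        type(self).setup_calls += 1
        self.config = config
        self.x = 0

    def step(self):
        score = objective(self.x, self.config["a"], self.config["b"])
        self.x += 1
        return {"score": score}

    def reset_config(self, new_config: dict) -> bool:
        # Swap in the new hyperparameters and clear per-trial state.
        self.config = new_config
        self.x = 0
        return True  # signals that the object was reset successfully


configs = [{"a": 2, "b": 4}, {"a": 3, "b": 1}]
trainable = ReusableSketch()
trainable.setup(configs[0])
first_scores = []
for index, config in enumerate(configs):
    if index > 0:
        # Reuse the existing object instead of constructing a new one.
        assert trainable.reset_config(config)
    first_scores.append(trainable.step()["score"])

print(ReusableSketch.setup_calls)  # 1
print(first_scores)                # [4.0, 1.0]
```

The point of the sketch is that setup runs once while two configurations are evaluated, and that reset_config must clear any per-trial state (here, x) in addition to storing the new hyperparameters.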

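Comparing Tune's Function API and Class API#

The following table shows how some key concepts map to the Function API and the Class API.

Concept	Function API	Class API
Training iteration	Increments on each `train.report` call	Increments on each `Trainable.step` call
Reporting metrics	`train.report(metrics)`	Return metrics from `Trainable.step`
Saving a checkpoint	`train.report(..., checkpoint=checkpoint)`	`Trainable.save_checkpoint`
Loading a checkpoint	`train.get_checkpoint()`	`Trainable.load_checkpoint`
Accessing config	Passed as an argument `def train_func(config):`	Passed through `Trainable.setup`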

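Advanced Resource Allocation#

Trainables can themselves be distributed. If your trainable function or class creates further Ray actors or tasks that also consume CPU or GPU resources, you need to add more bundles to the PlacementGroupFactory to reserve the extra resource slots. For example, if a trainable class requires 1 GPU itself but also launches 4 actors, each using another GPU, then you should use tune.with_resources like this: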

from ray import train, tune

tuner = tune.Tuner(
    tune.with_resources(my_trainable, tune.PlacementGroupFactory([
        {"CPU": 1, "GPU": 1},  # the trainable itself
        {"GPU": 1},            # one bundle per launched actor
        {"GPU": 1},
        {"GPU": 1},
        {"GPU": 1}
    ])),
    run_config=train.RunConfig(name="my_trainable")
)

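The Trainable also provides the default_resource_request interface to automatically declare the resources per trial based on the given configuration.

It is also possible to specify memory ("memory", in bytes) and custom resource requirements.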
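As a quick check of what such a request reserves, the bundles from the example above can be summed with plain Python. This sketch does not call Ray; total_resources is a hypothetical helper, and the 4 GiB figure is only an illustrative value for a "memory" entry expressed in bytes.

```python
from collections import Counter

# Same bundles as in the PlacementGroupFactory example: one for the
# trainable itself, then one per launched actor.
bundles = [
    {"CPU": 1, "GPU": 1},
    {"GPU": 1},
    {"GPU": 1},
    {"GPU": 1},
    {"GPU": 1},
]


def total_resources(bundles: list) -> dict:
    """Sum the resources reserved across all bundles of one trial."""
    total = Counter()
    for bundle in bundles:
        total.update(bundle)
    return dict(total)


print(total_resources(bundles))  # {'CPU': 1, 'GPU': 5}

# Memory is given in bytes, so 4 GiB would be written as:
memory_bytes = 4 * 1024 ** 3
print(memory_bytes)  # 4294967296
```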

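Function API#

To report results and checkpoints with the Function API, see the Ray Train utilities documentation.

Trainable (Class API)#

Constructor#

Trainable

Abstract class for trainable models, functions, etc.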

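Trainable Methods to Implement#

`setup`	Subclasses should override this for custom initialization.
`save_checkpoint`	Subclasses should override this to implement `save()`.
`load_checkpoint`	Subclasses should override this to implement `restore()`.
`step`	Subclasses should override this to implement `train()`.
`reset_config`	Resets configuration without restarting the trial.
`cleanup`	Subclasses should override this for any cleanup on stop.
`default_resource_request`	Provides a static resource requirement for the given configuration.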

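Tune Trainable Utilities#

Tune Data Ingestion Utilities#

tune.with_parameters

Wrapper for trainables to pass arbitrarily large data objects.

Tune Resource Assignment Utilities#

`tune.with_resources`	Wrapper for trainables to specify resource requests.
`PlacementGroupFactory`	Wrapper class that creates placement groups for trials.
`tune.utils.wait_for_gpu`	Checks if a given GPU has freed memory.

Tune Trainable Debugging Utilities#

`tune.utils.diagnose_serialization`	Utility for detecting why your trainable function isn't serializing.
`tune.utils.validate_save_restore`	Helper method to check if your Trainable class will resume correctly.
`tune.utils.util.validate_warmstart`	Generic validation of a Searcher's warm start functionality.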