Using Aim with Tune#

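Aim is an easy-to-use and powerful open-source experiment tracking tool. Aim logs your training runs, provides a well-designed UI for comparing them, and offers an API for querying them programmatically.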

Ray Tune currently offers a built-in integration with Aim. The AimLoggerCallback automatically logs the metrics reported to Tune, using the Aim API.

Logging Tune Hyperparameter Configurations and Results to Aim#

The following example demonstrates how to use the AimLoggerCallback in a Tune experiment. First, install and import the necessary modules:

%pip install aim
%pip install ray[tune]

import numpy as np

import ray
from ray import train, tune
from ray.tune.logger.aim import AimLoggerCallback

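Next, define a simple train_function (a Tune Trainable) that reports a loss to Tune. The objective function itself is not important for this example, since our main focus is on the integration with Aim.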

def train_function(config):
    for _ in range(50):
        loss = config["mean"] + config["sd"] * np.random.randn()
        train.report({"loss": loss})

Here is an example of a simple grid-search Tune experiment that uses the AimLoggerCallback. The logger records each of the 9 grid-search trials as a separate Aim run.

tuner = tune.Tuner(
    train_function,
    run_config=train.RunConfig(
        callbacks=[AimLoggerCallback()],
        storage_path="/tmp/ray_results",
        name="aim_example",
    ),
    param_space={
        "mean": tune.grid_search([1, 2, 3, 4, 5, 6, 7, 8, 9]),
        "sd": tune.uniform(0.1, 0.9),
    },
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
    ),
)
tuner.fit()

2023-02-07 00:04:11,228	INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 

Tune Status

Current time:	2023-02-07 00:04:19
Running for:	00:00:06.86
Memory:	32.8/64.0 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/26.93 GiB heap, 0.0/2.0 GiB objects

Trial Status

Trial name	status	loc	mean	sd	iter	total time (s)	loss
train_function_01a3b_00000	TERMINATED	127.0.0.1:10277	1	0.385428	50	4.48031	1.01928
train_function_01a3b_00001	TERMINATED	127.0.0.1:10296	2	0.819716	50	2.97272	3.01491
train_function_01a3b_00002	TERMINATED	127.0.0.1:10301	3	0.769197	50	2.39572	3.87155
train_function_01a3b_00003	TERMINATED	127.0.0.1:10307	4	0.29466	50	2.41568	4.1507
train_function_01a3b_00004	TERMINATED	127.0.0.1:10313	5	0.152208	50	1.68383	5.10225
train_function_01a3b_00005	TERMINATED	127.0.0.1:10321	6	0.879814	50	1.54015	6.20238
train_function_01a3b_00006	TERMINATED	127.0.0.1:10329	7	0.487499	50	1.44706	7.79551
train_function_01a3b_00007	TERMINATED	127.0.0.1:10333	8	0.639783	50	1.4261	7.94189
train_function_01a3b_00008	TERMINATED	127.0.0.1:10341	9	0.12285	50	1.07701	8.82304

Trial Progress

Trial name	date	done	experiment_id	experiment_tag	hostname	iterations_since_restore	loss	node_ip	pid	time_since_restore	time_this_iter_s	time_total_s	timestamp	training_iteration	trial_id	warmup_time
train_function_01a3b_00000	2023-02-07_00-04-18	True	c8447fdceea6436c9edd6f030a5b1d82	0_mean=1,sd=0.3854	Justins-MacBook-Pro-16	50	1.01928	127.0.0.1	10277	4.48031	0.013865	4.48031	1675757058	50	01a3b_00000	0.00264072
train_function_01a3b_00001	2023-02-07_00-04-18	True	7dd6d3ee24244a0885b354c285064728	1_mean=2,sd=0.8197	Justins-MacBook-Pro-16	50	3.01491	127.0.0.1	10296	2.97272	0.0584073	2.97272	1675757058	50	01a3b_00001	0.0316792
train_function_01a3b_00002	2023-02-07_00-04-18	True	e3da49ebad034c4b8fdaf0aa87927b1a	2_mean=3,sd=0.7692	Justins-MacBook-Pro-16	50	3.87155	127.0.0.1	10301	2.39572	0.0695491	2.39572	1675757058	50	01a3b_00002	0.0315411
train_function_01a3b_00003	2023-02-07_00-04-18	True	95c60c4f67c4481ebccff25b0a49e75d	3_mean=4,sd=0.2947	Justins-MacBook-Pro-16	50	4.1507	127.0.0.1	10307	2.41568	0.0175381	2.41568	1675757058	50	01a3b_00003	0.0310779
train_function_01a3b_00004	2023-02-07_00-04-18	True	a216253cb41e47caa229e65488deb019	4_mean=5,sd=0.1522	Justins-MacBook-Pro-16	50	5.10225	127.0.0.1	10313	1.68383	0.064441	1.68383	1675757058	50	01a3b_00004	0.00450182
train_function_01a3b_00005	2023-02-07_00-04-18	True	23834104277f476cb99d9c696281fceb	5_mean=6,sd=0.8798	Justins-MacBook-Pro-16	50	6.20238	127.0.0.1	10321	1.54015	0.00910306	1.54015	1675757058	50	01a3b_00005	0.0480251
train_function_01a3b_00006	2023-02-07_00-04-18	True	15f650121df747c3bd2720481d47b265	6_mean=7,sd=0.4875	Justins-MacBook-Pro-16	50	7.79551	127.0.0.1	10329	1.44706	0.00600386	1.44706	1675757058	50	01a3b_00006	0.00202489
train_function_01a3b_00007	2023-02-07_00-04-19	True	78b1673cf2034ed99135b80a0cb31e0e	7_mean=8,sd=0.6398	Justins-MacBook-Pro-16	50	7.94189	127.0.0.1	10333	1.4261	0.00225306	1.4261	1675757059	50	01a3b_00007	0.00209713
train_function_01a3b_00008	2023-02-07_00-04-19	True	c7f5d86154cb46b6aa27bef523edcd6f	8_mean=9,sd=0.1228	Justins-MacBook-Pro-16	50	8.82304	127.0.0.1	10341	1.07701	0.00291467	1.07701	1675757059	50	01a3b_00008	0.00240111

2023-02-07 00:04:19,366	INFO tune.py:798 -- Total run time: 7.38 seconds (6.85 seconds for the tuning loop).

<ray.tune.result_grid.ResultGrid at 0x137de07c0>

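When the script runs, Tune performs the grid search and saves the results to an Aim repo stored in the default location: the experiment log directory (in this case, /tmp/ray_results/aim_example).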
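The nine trials in the status table come from the nine grid values of "mean", each paired with one random draw of "sd". The sketch below mimics that expansion in plain Python so the count is easy to verify. It is an illustration only, not Tune's search-space code: expand_param_space, grid_means, and sd_range are hypothetical names, and num_samples stands in for the TuneConfig option of the same name, which the example leaves at its default of 1.

```python
import random

# Hypothetical stand-ins for the param_space used in the example:
# nine grid values for "mean" and a uniform range for "sd".
grid_means = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sd_range = (0.1, 0.9)


def expand_param_space(means, sd_bounds, num_samples=1, seed=0):
    """Return one config per (sample, grid value) pair."""
    rng = random.Random(seed)
    configs = []
    for _ in range(num_samples):
        for mean in means:
            # A fresh "sd" is drawn for every generated config.
            configs.append({"mean": mean, "sd": rng.uniform(*sd_bounds)})
    return configs


configs = expand_param_space(grid_means, sd_range)
print(len(configs))  # 9
```

Raising num_samples in this sketch repeats the whole grid, so the number of configs, and therefore the number of Aim runs, would be 9 times num_samples.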

More Configuration Options for Aim#

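In the example above, we used the default configuration of the AimLoggerCallback. Several options can be passed as arguments to the callback. For example, AimLoggerCallback(repo="/path/to/repo") logs results to the Aim repo at that file path, which is useful if you keep the results of multiple Tune experiments in one central location. Paths relative to the working directory from which the Tune script is launched can be used as well. By default, the repo is set to the experiment log directory. See the API reference for more configuration options.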
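As a minimal sketch of a non-default configuration, the snippet below collects the callback arguments in a dictionary and builds a RunConfig from them. The repo path, experiment name, and the helper make_run_config are hypothetical values chosen for illustration; the argument names (repo, experiment_name, metrics) are taken from the API reference on this page. The Ray imports live inside the helper so the module can be loaded without Ray or Aim installed.

```python
# Arguments for AimLoggerCallback. The values are illustrative assumptions.
aim_callback_kwargs = {
    "repo": "/tmp/aim_central_repo",      # hypothetical shared Aim repo
    "experiment_name": "aim_example_v2",  # hypothetical experiment name
    "metrics": ["loss"],                  # track only the reported loss
}


def make_run_config():
    """Build a RunConfig that logs to the shared Aim repo."""
    # Imported lazily; requires `ray[tune]` and `aim` to be installed.
    from ray import train
    from ray.tune.logger.aim import AimLoggerCallback

    return train.RunConfig(
        callbacks=[AimLoggerCallback(**aim_callback_kwargs)],
        storage_path="/tmp/ray_results",
        name="aim_example",
    )
```

Passing run_config=make_run_config() to tune.Tuner in place of the RunConfig used earlier would then send the runs to the shared repo instead of the experiment log directory.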

Launching the Aim UI#

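Now that the results are logged to the Aim repository, we can view them in Aim's web UI. To do this, we first locate the directory that contains the Aim repository, then use the Aim CLI to launch the web interface.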

# Uncomment the following line to launch the Aim UI!
#!aim up --repo=/tmp/ray_results/aim_example

--------------------------------------------------------------------------
                Aim UI collects anonymous usage analytics.                
                        Read how to opt-out here:                         
    https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
--------------------------------------------------------------------------
Running Aim UI on repo `<Repo#-5734997863388805469 path=/tmp/ray_results/aim_example/.aim read_only=None>`
Open http://127.0.0.1:43800
Press Ctrl+C to exit
^C

After launching the Aim UI, we can open the web interface at localhost:43800.
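Besides the UI, the logged runs can also be read back with the Aim Python SDK. The following is a sketch under assumptions rather than part of the Tune integration: load_metric_curves is a hypothetical helper, it assumes the Aim SDK calls Repo.query_metrics(...).iter_runs() and metric.values.sparse_numpy() behave as described in the Aim documentation for your installed version, and it assumes the loss is stored under the metric name "loss".

```python
def load_metric_curves(repo_path, metric_name="loss"):
    """Return {run_hash: (steps, values)} for one metric in an Aim repo."""
    # Imported lazily; requires `aim` to be installed.
    from aim import Repo

    repo = Repo(repo_path)
    query = f"metric.name == '{metric_name}'"
    curves = {}
    for run_metrics in repo.query_metrics(query).iter_runs():
        for metric in run_metrics:
            # Arrays of the tracked steps and the values recorded at them.
            steps, values = metric.values.sparse_numpy()
            # If a run tracks the same name in several contexts,
            # only the last one seen is kept.
            curves[metric.run.hash] = (steps, values)
    return curves
```

For the experiment above, calling load_metric_curves("/tmp/ray_results/aim_example") would be expected to return one entry per trial.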

The next section contains more in-depth information on the API of the Tune-Aim integration.

Tune Aim Logger API#

class ray.tune.logger.aim.AimLoggerCallback(repo: str | None = None, experiment_name: str | None = None, metrics: List[str] | None = None, **aim_run_kwargs)

Aim Logger: logs metrics in Aim format.

Aim is an open-source, self-hosted ML experiment tracking tool. It’s good at tracking lots (thousands) of training runs, and it allows you to compare them with a performant and well-designed UI.

Source: aimhubio/aim

Parameters:

repo – Aim repository directory or a Repo object that the Run object will log results to. If not provided, a default repo will be set up in the experiment directory (one level above trial directories).
experiment_name – Sets the experiment property of each Run object, which is the experiment name associated with it. Can be used later to query runs/sequences. If not provided, the default will be the Tune experiment name set by RunConfig(name=...).
metrics – List of metric names (out of the metrics reported by Tune) to track in Aim. If no metrics are specified, log everything that is reported.
aim_run_kwargs – Additional arguments that will be passed when creating the individual Run objects for each trial. For the full list of arguments, please see the Aim documentation: https://aimstack.readthedocs.io/en/latest/refs/sdk.html
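The effect of the metrics parameter can be illustrated with a small stand-alone function. This is a sketch of the documented rule only (an empty or missing list means everything reported is logged), not the callback's actual code, and it ignores any other filtering the callback may apply. select_metrics and the sample result dictionary are hypothetical.

```python
def select_metrics(result, metrics=None):
    """Return the entries of `result` that the metrics filter lets through."""
    if not metrics:
        # No filter given: keep everything that was reported.
        return dict(result)
    return {name: value for name, value in result.items() if name in metrics}


# Hypothetical result reported by one training iteration.
result = {"loss": 1.02, "training_iteration": 50, "time_total_s": 4.48}

print(select_metrics(result, metrics=["loss"]))  # {'loss': 1.02}
print(len(select_metrics(result)))               # 3
```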