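Using Aim with Tune
Aim is an easy-to-use, powerful open-source experiment tracker. Aim logs your training runs, provides a well-designed UI for comparing them, and offers an API for querying them programmatically.
Ray Tune currently offers a built-in integration with Aim. The AimLoggerCallback automatically logs the metrics reported to Tune by using the Aim API.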
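Conceptually, such a logger maps every Tune trial to its own run and appends each reported metric to a per-metric sequence indexed by the training iteration. The pure-Python sketch below illustrates that idea only. `ToyRun`, `ToyAimLogger`, and `on_trial_result` are hypothetical names. They are not Ray's callback interface or Aim's SDK, and the real callback delegates storage to Aim.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyRun:
    """Hypothetical stand-in for an Aim run (one per Tune trial)."""

    experiment: str
    hparams: Dict[str, Any]
    # Metric name -> list of (step, value) pairs.
    sequences: Dict[str, List[Tuple[int, float]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def track(self, value: float, name: str, step: int) -> None:
        self.sequences[name].append((step, value))


class ToyAimLogger:
    """Hypothetical sketch: one run per trial, one sequence per metric."""

    def __init__(self, experiment: str) -> None:
        self.experiment = experiment
        self.runs: Dict[str, ToyRun] = {}

    def on_trial_result(self, trial_id: str, config: Dict[str, Any],
                        result: Dict[str, Any]) -> None:
        # The first result of a trial opens a new run that stores the trial's config.
        if trial_id not in self.runs:
            self.runs[trial_id] = ToyRun(self.experiment, dict(config))
        run = self.runs[trial_id]
        step = result["training_iteration"]
        # Track every numeric metric at this step; bools such as `done` are skipped.
        for name, value in result.items():
            if name == "training_iteration" or isinstance(value, bool):
                continue
            if isinstance(value, (int, float)):
                run.track(value, name=name, step=step)


logger = ToyAimLogger("aim_example")
for step in range(1, 4):
    logger.on_trial_result("t0", {"mean": 1}, {"loss": 1.0 + step, "training_iteration": step, "done": False})
    logger.on_trial_result("t1", {"mean": 2}, {"loss": 2.0 + step, "training_iteration": step})

print(len(logger.runs))                     # 2
print(logger.runs["t0"].sequences["loss"])  # [(1, 2.0), (2, 3.0), (3, 4.0)]
```

Keeping one run per trial is what lets a UI compare trials side by side, because each run carries its own hyperparameters alongside its metric sequences.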
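Logging Tune hyperparameter configurations and results to Aim
The following example demonstrates how to use the AimLoggerCallback in a Tune experiment.
First, install and import the necessary modules: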
%pip install aim
%pip install ray[tune]
import numpy as np
import ray
from ray import train, tune
from ray.tune.logger.aim import AimLoggerCallback
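Next, define a simple train_function, a Trainable that reports a loss to Tune. The objective function itself is not important for this example, since the main focus is the integration with Aim.
def train_function(config):
    for _ in range(50):
        # Draw a noisy loss around config["mean"] with spread config["sd"].
        loss = config["mean"] + config["sd"] * np.random.randn()
        # Report the metric to Tune; the AimLoggerCallback picks it up from here.
        train.report({"loss": loss})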
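Because train_function only draws samples and reports them, its behavior can be checked outside of Tune. The sketch below is a hypothetical standalone variant. It takes the reporting function as a parameter (`report` is an assumption standing in for `train.report`) and uses the standard-library `random.gauss` instead of NumPy. Each of the 50 iterations reports one loss drawn from a normal distribution with mean `config["mean"]` and standard deviation `config["sd"]`.

```python
import random
from typing import Callable, Dict, List


def train_function_standalone(
    config: Dict[str, float],
    report: Callable[[Dict[str, float]], None],
    num_iters: int = 50,
) -> None:
    """Hypothetical variant of train_function with an injectable `report` callback."""
    for _ in range(num_iters):
        # Same objective as train_function: loss ~ Normal(config["mean"], config["sd"]).
        loss = random.gauss(config["mean"], config["sd"])
        report({"loss": loss})


# Collect the reported metrics in a list instead of sending them to Tune.
random.seed(0)
reported: List[Dict[str, float]] = []
train_function_standalone({"mean": 3, "sd": 0.5}, reported.append)

losses = [r["loss"] for r in reported]
print(len(losses))                # 50
print(sum(losses) / len(losses))  # average of the draws, close to config["mean"] = 3
```

This also explains why, in the trial table of the run below, each trial's final loss stays close to its `mean` value.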
Here is an example of a simple grid-search Tune experiment that uses the AimLoggerCallback. The logger records each of the 9 grid-search trials as a separate Aim run.
tuner = tune.Tuner(
    train_function,
    run_config=train.RunConfig(
        # Log every trial to Aim as a separate run.
        callbacks=[AimLoggerCallback()],
        storage_path="/tmp/ray_results",
        name="aim_example",
    ),
    param_space={
        # 9 grid points for "mean"; "sd" is sampled uniformly for each trial.
        "mean": tune.grid_search([1, 2, 3, 4, 5, 6, 7, 8, 9]),
        "sd": tune.uniform(0.1, 0.9),
    },
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
    ),
)
tuner.fit()
2023-02-07 00:04:11,228 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Tune Status
Current time: 2023-02-07 00:04:19
Running for: 00:00:06.86
Memory: 32.8/64.0 GiB
System Info
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/26.93 GiB heap, 0.0/2.0 GiB objects
Trial Status
Trial name | status | loc | mean | sd | iter | total time (s) | loss |
---|---|---|---|---|---|---|---|
train_function_01a3b_00000 | TERMINATED | 127.0.0.1:10277 | 1 | 0.385428 | 50 | 4.48031 | 1.01928 |
train_function_01a3b_00001 | TERMINATED | 127.0.0.1:10296 | 2 | 0.819716 | 50 | 2.97272 | 3.01491 |
train_function_01a3b_00002 | TERMINATED | 127.0.0.1:10301 | 3 | 0.769197 | 50 | 2.39572 | 3.87155 |
train_function_01a3b_00003 | TERMINATED | 127.0.0.1:10307 | 4 | 0.29466 | 50 | 2.41568 | 4.1507 |
train_function_01a3b_00004 | TERMINATED | 127.0.0.1:10313 | 5 | 0.152208 | 50 | 1.68383 | 5.10225 |
train_function_01a3b_00005 | TERMINATED | 127.0.0.1:10321 | 6 | 0.879814 | 50 | 1.54015 | 6.20238 |
train_function_01a3b_00006 | TERMINATED | 127.0.0.1:10329 | 7 | 0.487499 | 50 | 1.44706 | 7.79551 |
train_function_01a3b_00007 | TERMINATED | 127.0.0.1:10333 | 8 | 0.639783 | 50 | 1.4261 | 7.94189 |
train_function_01a3b_00008 | TERMINATED | 127.0.0.1:10341 | 9 | 0.12285 | 50 | 1.07701 | 8.82304 |
Trial Progress
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train_function_01a3b_00000 | 2023-02-07_00-04-18 | True |  | c8447fdceea6436c9edd6f030a5b1d82 | 0_mean=1,sd=0.3854 | Justins-MacBook-Pro-16 | 50 | 1.01928 | 127.0.0.1 | 10277 | 4.48031 | 0.013865 | 4.48031 | 1675757058 | 0 |  | 50 | 01a3b_00000 | 0.00264072 |
train_function_01a3b_00001 | 2023-02-07_00-04-18 | True |  | 7dd6d3ee24244a0885b354c285064728 | 1_mean=2,sd=0.8197 | Justins-MacBook-Pro-16 | 50 | 3.01491 | 127.0.0.1 | 10296 | 2.97272 | 0.0584073 | 2.97272 | 1675757058 | 0 |  | 50 | 01a3b_00001 | 0.0316792 |
train_function_01a3b_00002 | 2023-02-07_00-04-18 | True |  | e3da49ebad034c4b8fdaf0aa87927b1a | 2_mean=3,sd=0.7692 | Justins-MacBook-Pro-16 | 50 | 3.87155 | 127.0.0.1 | 10301 | 2.39572 | 0.0695491 | 2.39572 | 1675757058 | 0 |  | 50 | 01a3b_00002 | 0.0315411 |
train_function_01a3b_00003 | 2023-02-07_00-04-18 | True |  | 95c60c4f67c4481ebccff25b0a49e75d | 3_mean=4,sd=0.2947 | Justins-MacBook-Pro-16 | 50 | 4.1507 | 127.0.0.1 | 10307 | 2.41568 | 0.0175381 | 2.41568 | 1675757058 | 0 |  | 50 | 01a3b_00003 | 0.0310779 |
train_function_01a3b_00004 | 2023-02-07_00-04-18 | True |  | a216253cb41e47caa229e65488deb019 | 4_mean=5,sd=0.1522 | Justins-MacBook-Pro-16 | 50 | 5.10225 | 127.0.0.1 | 10313 | 1.68383 | 0.064441 | 1.68383 | 1675757058 | 0 |  | 50 | 01a3b_00004 | 0.00450182 |
train_function_01a3b_00005 | 2023-02-07_00-04-18 | True |  | 23834104277f476cb99d9c696281fceb | 5_mean=6,sd=0.8798 | Justins-MacBook-Pro-16 | 50 | 6.20238 | 127.0.0.1 | 10321 | 1.54015 | 0.00910306 | 1.54015 | 1675757058 | 0 |  | 50 | 01a3b_00005 | 0.0480251 |
train_function_01a3b_00006 | 2023-02-07_00-04-18 | True |  | 15f650121df747c3bd2720481d47b265 | 6_mean=7,sd=0.4875 | Justins-MacBook-Pro-16 | 50 | 7.79551 | 127.0.0.1 | 10329 | 1.44706 | 0.00600386 | 1.44706 | 1675757058 | 0 |  | 50 | 01a3b_00006 | 0.00202489 |
train_function_01a3b_00007 | 2023-02-07_00-04-19 | True |  | 78b1673cf2034ed99135b80a0cb31e0e | 7_mean=8,sd=0.6398 | Justins-MacBook-Pro-16 | 50 | 7.94189 | 127.0.0.1 | 10333 | 1.4261 | 0.00225306 | 1.4261 | 1675757059 | 0 |  | 50 | 01a3b_00007 | 0.00209713 |
train_function_01a3b_00008 | 2023-02-07_00-04-19 | True |  | c7f5d86154cb46b6aa27bef523edcd6f | 8_mean=9,sd=0.1228 | Justins-MacBook-Pro-16 | 50 | 8.82304 | 127.0.0.1 | 10341 | 1.07701 | 0.00291467 | 1.07701 | 1675757059 | 0 |  | 50 | 01a3b_00008 | 0.00240111 |
2023-02-07 00:04:19,366 INFO tune.py:798 -- Total run time: 7.38 seconds (6.85 seconds for the tuning loop).
<ray.tune.result_grid.ResultGrid at 0x137de07c0>
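The nine rows in the trial tables above follow from how param_space is built. `tune.grid_search` contributes one trial per listed value of `mean`, and `tune.uniform(0.1, 0.9)` draws a fresh `sd` for each of those trials. The pure-Python sketch below mimics that expansion for illustration. `expand_param_space` is a hypothetical helper, not Tune's internal variant generator. The `num_samples` parameter models the assumption that the whole grid is repeated once per sample.

```python
import random
from typing import Dict, List


def expand_param_space(grid_means: List[int], sd_low: float, sd_high: float,
                       num_samples: int = 1, seed: int = 0) -> List[Dict[str, float]]:
    """Hypothetical sketch: one config per grid value (repeated num_samples times),
    each with an independently sampled `sd`."""
    rng = random.Random(seed)
    configs = []
    for _ in range(num_samples):
        for mean in grid_means:
            configs.append({"mean": mean, "sd": rng.uniform(sd_low, sd_high)})
    return configs


configs = expand_param_space(list(range(1, 10)), 0.1, 0.9)
print(len(configs))                  # 9
print([c["mean"] for c in configs])  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The exact `sd` values differ from the table above because they depend on the random state.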
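When the script executes, a grid search is carried out and the results are saved to an Aim repo in the default location, the experiment log directory (in this case, /tmp/ray_results/aim_example).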
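More configuration options for Aim
In the example above, we used the default configuration of the AimLoggerCallback. A few options can be configured as arguments to the callback. For example, setting AimLoggerCallback(repo="/path/to/repo") logs results to the Aim repo at that file path, which is useful if you have a central location where the results of multiple Tune experiments are stored. Paths relative to the working directory in which the Tune script is launched can also be used. By default, the repo is set to the experiment log directory. See the API reference for further configuration options.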
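The repo rules in this paragraph can be summarized in a few lines. If no `repo` is given, the experiment log directory is used. A relative path is resolved against the directory the Tune script is launched from. An absolute path is used as-is. `resolve_aim_repo` is a hypothetical helper written for illustration. It is not part of Ray's API, and the real callback may differ in details.

```python
import os
from typing import Optional


def resolve_aim_repo(repo: Optional[str], experiment_dir: str,
                     cwd: Optional[str] = None) -> str:
    """Hypothetical sketch of where the Aim results end up."""
    if repo is None:
        # Default: the experiment log directory, i.e. <storage_path>/<name>.
        return experiment_dir
    if os.path.isabs(repo):
        # Absolute paths, e.g. a central repo shared by several experiments.
        return repo
    # Relative paths are interpreted against the launch (working) directory.
    return os.path.normpath(os.path.join(cwd or os.getcwd(), repo))


experiment_dir = os.path.join("/tmp/ray_results", "aim_example")
print(resolve_aim_repo(None, experiment_dir))             # /tmp/ray_results/aim_example
print(resolve_aim_repo("/path/to/repo", experiment_dir))  # /path/to/repo
print(resolve_aim_repo("aim_repo", experiment_dir, cwd="/home/user/project"))  # /home/user/project/aim_repo
```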
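Launching the Aim UI
Now that the results are logged to an Aim repository, we can view them in Aim's web UI. To do this, first find the directory where the Aim repository lives, then launch the web interface with the Aim CLI.
# Uncomment the following line to launch the Aim UI!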
#!aim up --repo=/tmp/ray_results/aim_example
--------------------------------------------------------------------------
Aim UI collects anonymous usage analytics.
Read how to opt-out here:
https://aimstack.readthedocs.io/en/latest/community/telemetry.html
--------------------------------------------------------------------------
Running Aim UI on repo `<Repo#-5734997863388805469 path=/tmp/ray_results/aim_example/.aim read_only=None>`
Open http://127.0.0.1:43800
Press Ctrl+C to exit
^C
After launching the Aim UI, we can open the web interface at localhost:43800.
The next sections contain more in-depth information on the API of the Tune-Aim integration.
Tune Aim Logger API
- class ray.tune.logger.aim.AimLoggerCallback(repo: str | None = None, experiment_name: str | None = None, metrics: List[str] | None = None, **aim_run_kwargs)
Aim Logger: logs metrics in Aim format.
Aim is an open-source, self-hosted ML experiment tracking tool. It’s good at tracking lots (thousands) of training runs, and it allows you to compare them with a performant and well-designed UI.
Source: aimhubio/aim
- Parameters:
  - repo – Aim repository directory or a Repo object that the Run object will log results to. If not provided, a default repo will be set up in the experiment directory (one level above the trial directories).
  - experiment_name – Sets the experiment property of each Run object, which is the experiment name associated with it. Can be used later to query runs/sequences. If not provided, the default will be the Tune experiment name set by RunConfig(name=...).
  - metrics – List of metric names (out of the metrics reported by Tune) to track in Aim. If no metrics are specified, everything that is reported is logged.
  - aim_run_kwargs – Additional arguments that will be passed when creating the individual Run objects for each trial. For the full list of arguments, please see the Aim documentation: https://aimstack.readthedocs.io/en/latest/refs/sdk.html
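The `metrics` and `experiment_name` semantics described above can be sketched as follows. `select_metrics` and `default_experiment_name` are hypothetical helpers for illustration only. They mirror the documented behavior, not Ray's actual implementation: only the listed metrics are tracked, or everything if none are given, and the experiment name falls back to the Tune experiment name from `RunConfig(name=...)`. Treating an empty list like "no metrics specified" is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional


def select_metrics(result: Dict[str, Any], metrics: Optional[List[str]]) -> Dict[str, Any]:
    """Keep only the requested metrics; with no list given, keep everything reported."""
    if not metrics:
        return dict(result)
    return {name: value for name, value in result.items() if name in metrics}


def default_experiment_name(experiment_name: Optional[str], tune_experiment_name: str) -> str:
    """Fall back to the Tune experiment name (RunConfig(name=...)) when none is given."""
    return experiment_name if experiment_name is not None else tune_experiment_name


result = {"loss": 1.01928, "training_iteration": 50, "time_this_iter_s": 0.013865}
print(select_metrics(result, ["loss"]))              # {'loss': 1.01928}
print(len(select_metrics(result, None)))             # 3
print(default_experiment_name(None, "aim_example"))  # aim_example
```

In a real script, these options correspond to the callback arguments in the signature above, for example AimLoggerCallback(metrics=["loss"], experiment_name="my_experiment"), where the experiment name is a placeholder.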