Using Weights & Biases with Tune#
Weights & Biases (Wandb) is a tool for experiment tracking, model optimization, and dataset versioning. It is very popular in the machine learning and data science community for its superb visualization tools.
Ray Tune currently offers two lightweight integrations for Weights & Biases. One is the `WandbLoggerCallback`, which automatically logs metrics reported to Tune to the Wandb API.
The other one is the `setup_wandb()` function, which can be used with the function API. It automatically initializes the Wandb API with Tune's training information. You can just use the Wandb API as usual, e.g. using `wandb.log()` to log your training process.
Running a Weights & Biases Example#
In the following example, we'll use both of the above methods, namely the `WandbLoggerCallback` and the `setup_wandb` function, to log metrics.
As a first step, make sure that you are logged in to wandb on all machines you are running your training on:

```bash
wandb login
```
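If interactive login is not possible (for example on CI machines), wandb also reads the API key from the `WANDB_API_KEY` environment variable. A minimal sketch; the key value below is a placeholder, not a real key:

```python
import os

# Non-interactive alternative to `wandb login`: wandb picks up the API key
# from the WANDB_API_KEY environment variable. The value here is a placeholder.
os.environ["WANDB_API_KEY"] = "your-api-key"
```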
Then, we can start with a few crucial imports:

```python
import numpy as np

import ray
from ray import train, tune
from ray.air.integrations.wandb import WandbLoggerCallback, setup_wandb
```
Next, let's define an easy `train_function` function (a Tune `Trainable`) that reports a random loss to Tune. The objective function itself is not important for this example, since our main focus is the Weights & Biases integration.
```python
def train_function(config):
    for i in range(30):
        loss = config["mean"] + config["sd"] * np.random.randn()
        train.report({"loss": loss})
```
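You can sanity-check what this objective reports outside of Tune: each loss is a draw from a normal distribution with the configured mean and standard deviation, so over many draws the average lands near `config["mean"]`. A quick illustrative sketch:

```python
import numpy as np

# Simulate the objective's loss outside of Tune: with mean=2.0 and sd=0.5,
# each loss is a N(2.0, 0.5) draw, so the average over many draws is close to 2.
rng = np.random.default_rng(seed=0)
losses = [2.0 + 0.5 * rng.standard_normal() for _ in range(10_000)]
mean_loss = sum(losses) / len(losses)
```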
You can define a simple grid-search Tune run using a `WandbLoggerCallback` as follows:
```python
def tune_with_callback():
    """Example for using a WandbLoggerCallback with the function API."""
    tuner = tune.Tuner(
        train_function,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        run_config=train.RunConfig(
            callbacks=[WandbLoggerCallback(project="Wandb_example")]
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
        },
    )
    tuner.fit()
```
To use the `setup_wandb` utility, simply call this function in your objective. Note that we also use `wandb.log(...)` to log the `loss` to Weights & Biases as a dictionary. Otherwise, this version of our objective is identical to its original.
```python
def train_function_wandb(config):
    wandb = setup_wandb(config, project="Wandb_example")

    for i in range(30):
        loss = config["mean"] + config["sd"] * np.random.randn()
        train.report({"loss": loss})
        wandb.log(dict(loss=loss))
```
With `train_function_wandb` defined, your Tune experiment will set up `wandb` in each trial once it starts!
```python
def tune_with_setup():
    """Example for using the setup_wandb utility with the function API."""
    tuner = tune.Tuner(
        train_function_wandb,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
        },
    )
    tuner.fit()
```
Finally, you can also define a class-based Tune `Trainable` by using `setup_wandb` in the `setup()` method and storing the run object as an attribute. Please note that with the class trainable, you have to pass the trial ID, name, and group separately:
```python
class WandbTrainable(tune.Trainable):
    def setup(self, config):
        self.wandb = setup_wandb(
            config,
            trial_id=self.trial_id,
            trial_name=self.trial_name,
            group="Example",
            project="Wandb_example",
        )

    def step(self):
        for i in range(30):
            loss = self.config["mean"] + self.config["sd"] * np.random.randn()
            self.wandb.log({"loss": loss})
        return {"loss": loss, "done": True}

    def save_checkpoint(self, checkpoint_dir: str):
        pass

    def load_checkpoint(self, checkpoint_dir: str):
        pass
```
Running Tune with this `WandbTrainable` works exactly the same as with the function API. The `tune_trainable` function below differs from `tune_with_setup` above only in the first argument we pass to `Tuner()`:
```python
def tune_trainable():
    """Example for using a WandbTrainable with the class API."""
    tuner = tune.Tuner(
        WandbTrainable,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
        },
    )
    results = tuner.fit()
    return results.get_best_result().config
```
Since you may not have an API key for Wandb, we can _mock_ the Wandb logger and test all three of our training functions as follows.
If you are logged in to wandb, you can set `mock_api = False` to actually upload your results to Weights & Biases.
```python
import os

mock_api = True

if mock_api:
    os.environ.setdefault("WANDB_MODE", "disabled")
    os.environ.setdefault("WANDB_API_KEY", "abcd")
    ray.init(
        runtime_env={"env_vars": {"WANDB_MODE": "disabled", "WANDB_API_KEY": "abcd"}}
    )

tune_with_callback()
tune_with_setup()
tune_trainable()
```
2022-11-02 16:02:45,355 INFO worker.py:1534 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
2022-11-02 16:02:46,513 INFO wandb.py:282 -- Already logged into W&B.
Tune Status
Current time: 2022-11-02 16:03:13
Running for: 00:00:27.28
Memory: 10.8/16.0 GiB
System Info
Using FIFO scheduling algorithm. Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.44 GiB heap, 0.0/1.72 GiB objects
Trial Status
Trial name | status | loc | mean | sd | iter | total time (s) | loss |
---|---|---|---|---|---|---|---|
train_function_7676d_00000 | TERMINATED | 127.0.0.1:14578 | 1 | 0.411212 | 30 | 0.236137 | 0.828527 |
train_function_7676d_00001 | TERMINATED | 127.0.0.1:14591 | 2 | 0.756339 | 30 | 5.57185 | 3.13156 |
train_function_7676d_00002 | TERMINATED | 127.0.0.1:14593 | 3 | 0.436643 | 30 | 5.50237 | 3.26679 |
train_function_7676d_00003 | TERMINATED | 127.0.0.1:14595 | 4 | 0.295929 | 30 | 5.60986 | 3.70388 |
train_function_7676d_00004 | TERMINATED | 127.0.0.1:14596 | 5 | 0.335292 | 30 | 5.61385 | 4.74294 |
Trial Progress
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train_function_7676d_00000 | 2022-11-02_16-02-53 | True | a9f242fa70184d9dadd8952b16fb0ecc | 0_mean=1,sd=0.4112 | Kais-MBP.local.meter | 30 | 0.828527 | 127.0.0.1 | 14578 | 0.236137 | 0.00381589 | 0.236137 | 1667430173 | 0 | 30 | 7676d_00000 | 0.00366998 | ||
train_function_7676d_00001 | 2022-11-02_16-03-03 | True | f57118365bcb4c229fe41c5911f05ad6 | 1_mean=2,sd=0.7563 | Kais-MBP.local.meter | 30 | 3.13156 | 127.0.0.1 | 14591 | 5.57185 | 0.00627518 | 5.57185 | 1667430183 | 0 | 30 | 7676d_00001 | 0.0027349 | ||
train_function_7676d_00002 | 2022-11-02_16-03-03 | True | 394021d4515d4616bae7126668f73b2b | 2_mean=3,sd=0.4366 | Kais-MBP.local.meter | 30 | 3.26679 | 127.0.0.1 | 14593 | 5.50237 | 0.00494576 | 5.50237 | 1667430183 | 0 | 30 | 7676d_00002 | 0.00286222 | ||
train_function_7676d_00003 | 2022-11-02_16-03-03 | True | a575e79c9d95485fa37deaa86267aea4 | 3_mean=4,sd=0.2959 | Kais-MBP.local.meter | 30 | 3.70388 | 127.0.0.1 | 14595 | 5.60986 | 0.00689816 | 5.60986 | 1667430183 | 0 | 30 | 7676d_00003 | 0.00299597 | ||
train_function_7676d_00004 | 2022-11-02_16-03-03 | True | 91ce57dcdbb54536b1874666b711350d | 4_mean=5,sd=0.3353 | Kais-MBP.local.meter | 30 | 4.74294 | 127.0.0.1 | 14596 | 5.61385 | 0.00672579 | 5.61385 | 1667430183 | 0 | 30 | 7676d_00004 | 0.00323987 |
2022-11-02 16:03:13,913 INFO tune.py:788 -- Total run time: 28.53 seconds (27.28 seconds for the tuning loop).
Tune Status
Current time: 2022-11-02 16:03:22
Running for: 00:00:08.49
Memory: 9.9/16.0 GiB
System Info
Using FIFO scheduling algorithm. Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.44 GiB heap, 0.0/1.72 GiB objects
Trial Status
Trial name | status | loc | mean | sd | iter | total time (s) | loss |
---|---|---|---|---|---|---|---|
train_function_wandb_877eb_00000 | TERMINATED | 127.0.0.1:14647 | 1 | 0.738281 | 30 | 1.61319 | 0.555153 |
train_function_wandb_877eb_00001 | TERMINATED | 127.0.0.1:14660 | 2 | 0.321178 | 30 | 1.72447 | 2.52109 |
train_function_wandb_877eb_00002 | TERMINATED | 127.0.0.1:14661 | 3 | 0.202487 | 30 | 1.8159 | 2.45412 |
train_function_wandb_877eb_00003 | TERMINATED | 127.0.0.1:14662 | 4 | 0.515434 | 30 | 1.715 | 4.51413 |
train_function_wandb_877eb_00004 | TERMINATED | 127.0.0.1:14663 | 5 | 0.216098 | 30 | 1.72827 | 5.2814 |
(train_function_wandb pid=14647) 2022-11-02 16:03:17,149 INFO wandb.py:282 -- Already logged into W&B.
Trial Progress
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train_function_wandb_877eb_00000 | 2022-11-02_16-03-18 | True | 7b250c9f31ab484dad1a1fd29823afdf | 0_mean=1,sd=0.7383 | Kais-MBP.local.meter | 30 | 0.555153 | 127.0.0.1 | 14647 | 1.61319 | 0.00232315 | 1.61319 | 1667430198 | 0 | 30 | 877eb_00000 | 0.00391102 | ||
train_function_wandb_877eb_00001 | 2022-11-02_16-03-22 | True | 5172868368074557a3044ea3a9146673 | 1_mean=2,sd=0.3212 | Kais-MBP.local.meter | 30 | 2.52109 | 127.0.0.1 | 14660 | 1.72447 | 0.0152011 | 1.72447 | 1667430202 | 0 | 30 | 877eb_00001 | 0.00901699 | ||
train_function_wandb_877eb_00002 | 2022-11-02_16-03-22 | True | b13d9bccb1964b4b95e1a858a3ea64c7 | 2_mean=3,sd=0.2025 | Kais-MBP.local.meter | 30 | 2.45412 | 127.0.0.1 | 14661 | 1.8159 | 0.00437403 | 1.8159 | 1667430202 | 0 | 30 | 877eb_00002 | 0.00844812 | ||
train_function_wandb_877eb_00003 | 2022-11-02_16-03-22 | True | 869d7ec7a3544a8387985103e626818f | 3_mean=4,sd=0.5154 | Kais-MBP.local.meter | 30 | 4.51413 | 127.0.0.1 | 14662 | 1.715 | 0.00247812 | 1.715 | 1667430202 | 0 | 30 | 877eb_00003 | 0.00282907 | ||
train_function_wandb_877eb_00004 | 2022-11-02_16-03-22 | True | 84d3112d66f64325bc469e44b8447ef5 | 4_mean=5,sd=0.2161 | Kais-MBP.local.meter | 30 | 5.2814 | 127.0.0.1 | 14663 | 1.72827 | 0.00517201 | 1.72827 | 1667430202 | 0 | 30 | 877eb_00004 | 0.00272107 |
(train_function_wandb pid=14660) 2022-11-02 16:03:20,600 INFO wandb.py:282 -- Already logged into W&B.
(train_function_wandb pid=14661) 2022-11-02 16:03:20,600 INFO wandb.py:282 -- Already logged into W&B.
(train_function_wandb pid=14663) 2022-11-02 16:03:20,628 INFO wandb.py:282 -- Already logged into W&B.
(train_function_wandb pid=14662) 2022-11-02 16:03:20,723 INFO wandb.py:282 -- Already logged into W&B.
2022-11-02 16:03:22,565 INFO tune.py:788 -- Total run time: 8.60 seconds (8.48 seconds for the tuning loop).
Tune Status
Current time: 2022-11-02 16:03:31
Running for: 00:00:09.28
Memory: 9.9/16.0 GiB
System Info
Using FIFO scheduling algorithm. Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.44 GiB heap, 0.0/1.72 GiB objects
Trial Status
Trial name | status | loc | mean | sd | iter | total time (s) | loss |
---|---|---|---|---|---|---|---|
WandbTrainable_8ca33_00000 | TERMINATED | 127.0.0.1:14718 | 1 | 0.397894 | 1 | 0.000187159 | 0.742345 |
WandbTrainable_8ca33_00001 | TERMINATED | 127.0.0.1:14737 | 2 | 0.386883 | 1 | 0.000151873 | 2.5709 |
WandbTrainable_8ca33_00002 | TERMINATED | 127.0.0.1:14738 | 3 | 0.290693 | 1 | 0.00014019 | 2.99601 |
WandbTrainable_8ca33_00003 | TERMINATED | 127.0.0.1:14739 | 4 | 0.33333 | 1 | 0.00015831 | 3.91276 |
WandbTrainable_8ca33_00004 | TERMINATED | 127.0.0.1:14740 | 5 | 0.645479 | 1 | 0.000150919 | 5.47779 |
(WandbTrainable pid=14718) 2022-11-02 16:03:25,742 INFO wandb.py:282 -- Already logged into W&B.
Trial Progress
Trial name | date | done | episodes_total | experiment_id | hostname | iterations_since_restore | loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WandbTrainable_8ca33_00000 | 2022-11-02_16-03-27 | True | 3adb4d0ae0d74d1c9ddd07924b5653b0 | Kais-MBP.local.meter | 1 | 0.742345 | 127.0.0.1 | 14718 | 0.000187159 | 0.000187159 | 0.000187159 | 1667430207 | 0 | 1 | 8ca33_00000 | 1.31382 | ||
WandbTrainable_8ca33_00001 | 2022-11-02_16-03-31 | True | f1511cfd51f94b3d9cf192181ccc08a9 | Kais-MBP.local.meter | 1 | 2.5709 | 127.0.0.1 | 14737 | 0.000151873 | 0.000151873 | 0.000151873 | 1667430211 | 0 | 1 | 8ca33_00001 | 1.31668 | ||
WandbTrainable_8ca33_00002 | 2022-11-02_16-03-31 | True | a7528ec6adf74de0b73aa98ebedab66d | Kais-MBP.local.meter | 1 | 2.99601 | 127.0.0.1 | 14738 | 0.00014019 | 0.00014019 | 0.00014019 | 1667430211 | 0 | 1 | 8ca33_00002 | 1.32008 | ||
WandbTrainable_8ca33_00003 | 2022-11-02_16-03-31 | True | b7af756ca586449ba2d4c44141b53b06 | Kais-MBP.local.meter | 1 | 3.91276 | 127.0.0.1 | 14739 | 0.00015831 | 0.00015831 | 0.00015831 | 1667430211 | 0 | 1 | 8ca33_00003 | 1.31879 | ||
WandbTrainable_8ca33_00004 | 2022-11-02_16-03-31 | True | 196624f42bcc45c18a26778573a43a2c | Kais-MBP.local.meter | 1 | 5.47779 | 127.0.0.1 | 14740 | 0.000150919 | 0.000150919 | 0.000150919 | 1667430211 | 0 | 1 | 8ca33_00004 | 1.31945 |
(WandbTrainable pid=14739) 2022-11-02 16:03:30,360 INFO wandb.py:282 -- Already logged into W&B.
(WandbTrainable pid=14740) 2022-11-02 16:03:30,393 INFO wandb.py:282 -- Already logged into W&B.
(WandbTrainable pid=14737) 2022-11-02 16:03:30,454 INFO wandb.py:282 -- Already logged into W&B.
(WandbTrainable pid=14738) 2022-11-02 16:03:30,510 INFO wandb.py:282 -- Already logged into W&B.
2022-11-02 16:03:31,985 INFO tune.py:788 -- Total run time: 9.40 seconds (9.27 seconds for the tuning loop).
{'mean': 1, 'sd': 0.3978937765393781, 'wandb': {'project': 'Wandb_example'}}
This completes our Tune and Wandb walk-through. In the following sections you can find more details on the API of the Tune-Wandb integration.
Tune Wandb API Reference#
WandbLoggerCallback#
- class ray.air.integrations.wandb.WandbLoggerCallback(project: str | None = None, group: str | None = None, api_key_file: str | None = None, api_key: str | None = None, excludes: List[str] | None = None, log_config: bool = False, upload_checkpoints: bool = False, save_checkpoints: bool = False, upload_timeout: int = 1800, **kwargs)[source]
Weights and biases (https://www.wandb.ai/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune `LoggerCallback` sends metrics to Wandb for automatic tracking and visualization.

Example

```python
import random

from ray import train, tune
from ray.train import RunConfig
from ray.air.integrations.wandb import WandbLoggerCallback


def train_func(config):
    offset = random.random() / 5
    for epoch in range(2, config["epochs"]):
        acc = 1 - (2 + config["lr"]) ** -epoch - random.random() / epoch - offset
        loss = (2 + config["lr"]) ** -epoch + random.random() / epoch + offset
        train.report({"acc": acc, "loss": loss})


tuner = tune.Tuner(
    train_func,
    param_space={
        "lr": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
        "epochs": 10,
    },
    run_config=RunConfig(
        callbacks=[WandbLoggerCallback(project="Optimization_Project")]
    ),
)
results = tuner.fit()
```
Parameters:
- project – Name of the Wandb project. Mandatory.
- group – Name of the Wandb group. Defaults to the trainable name.
- api_key_file – Path to file containing the Wandb API KEY. This file only needs to be present on the node running the Tune script if using the WandbLogger.
- api_key – Wandb API Key. Alternative to setting `api_key_file`.
- excludes – List of metrics and config that should be excluded from the log.
- log_config – Boolean indicating if the `config` parameter of the `results` dict should be logged. This makes sense if parameters will change during training, e.g. with PopulationBasedTraining. Defaults to False.
- upload_checkpoints – If True, model checkpoints will be uploaded to Wandb as artifacts. Defaults to False.
- `**kwargs` – The keyword arguments will be passed to `wandb.init()`.

Wandb's `group`, `run_id` and `run_name` are automatically selected by Tune, but can be overwritten by filling out the respective configuration values. Please see here for all other valid configuration settings: https://docs.wandb.ai/library/init
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
setup_wandb#
- ray.air.integrations.wandb.setup_wandb(config: Dict | None = None, api_key: str | None = None, api_key_file: str | None = None, rank_zero_only: bool = True, **kwargs) → wandb.wandb_run.Run | wandb.sdk.lib.disabled.RunDisabled [source]
Set up a Weights & Biases session.

This function can be used to initialize a Weights & Biases session in a (distributed) training or tuning run.

By default, the run ID is the trial ID, the run name is the trial name, and the run group is the experiment name. These settings can be overwritten by passing the respective arguments as `kwargs`, which will be passed to `wandb.init()`.

In distributed training with Ray Train, only the zero-rank worker will initialize wandb. All other workers will return a disabled run object, so that logging is not duplicated in a distributed run. This can be disabled by passing `rank_zero_only=False`, which will then initialize wandb in every training worker.

The `config` argument will be passed to Weights and Biases and will be logged as the run configuration.

If no API key or key file are passed, wandb will try to authenticate using locally stored credentials, created for instance by running `wandb login`.

Keyword arguments passed to `setup_wandb()` will be passed to `wandb.init()` and take precedence over any potential default settings.

Parameters:
- config – Configuration dict to be logged to Weights and Biases. Can contain arguments for `wandb.init()` as well as authentication information.
- api_key – API key to use for authentication with Weights and Biases.
- api_key_file – File pointing to the API key for Weights and Biases.
- rank_zero_only – If True, will return an initialized session only for the rank 0 worker in distributed training. If False, will initialize a session for all workers.
- kwargs – Passed to `wandb.init()`.
Example

```python
from ray.air.integrations.wandb import setup_wandb


def training_loop(config):
    wandb = setup_wandb(config)
    # ...
    wandb.log({"loss": 0.123})
```
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
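The rank-zero behavior described for `setup_wandb` can be illustrated with a plain-Python sketch. The names below (`DisabledRun`, `setup_run`) are hypothetical and exist only for illustration; they are not part of the Ray or wandb APIs:

```python
# Illustrative sketch of the rank-zero pattern: only worker 0 receives a real
# run object; all other workers get a no-op stand-in so that metrics are not
# logged once per worker. Names here are hypothetical, not Ray APIs.
class DisabledRun:
    def log(self, *args, **kwargs):
        pass  # swallow all logging calls on non-zero ranks


def setup_run(world_rank, make_real_run, rank_zero_only=True):
    if not rank_zero_only or world_rank == 0:
        return make_real_run()
    return DisabledRun()
```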