Explicit Reporting Tutorial

In this tutorial, learn how to extend ClearML's automatic capture of inputs and outputs with explicit reporting.

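In this example, you add the following to the pytorch_mnist.py example script from ClearML's GitHub repository:

Setting an output destination for model checkpoints (snapshots).
Explicitly reporting scalars, other (non-scalar) data, and text.
Registering an artifact, which is uploaded to ClearML Server and whose changes ClearML logs.
Uploading an artifact, which is uploaded to ClearML Server but whose changes are not logged.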

Prerequisites

The clearml repository is cloned.
The clearml package is installed.

Before Starting

Make a copy of pytorch_mnist.py to add explicit reporting to.

cp pytorch_mnist.py pytorch_mnist_tutorial.py

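Step 1: Setting an Output Destination for Model Checkpoints

Specify a default output location, which is where model checkpoints (snapshots) and artifacts are stored while the experiment runs. Possible destinations include:

Local destination
Shared folder
Cloud storage:
- Amazon S3
- Google Cloud Storage
- Azure Storage

Specify the output location in the output_uri parameter of Task.init(). In this tutorial, a local folder is the destination.

In pytorch_mnist_tutorial.py, change the code from: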

task = Task.init(project_name='examples', task_name='pytorch mnist train')

to:

import os  # place with the script's other imports

# Local folder that receives model checkpoints (snapshots) and artifacts
model_snapshots_path = '/mnt/clearml'
os.makedirs(model_snapshots_path, exist_ok=True)

task = Task.init(
  project_name='examples', 
  task_name='extending automagical ClearML example', 
  output_uri=model_snapshots_path
)

When the script runs, ClearML creates the following directory structure:

+ - <output destination name>
|   +-- <project name>
|       +-- <task name>.<Task Id>
|           +-- models
|           +-- artifacts

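and puts the model checkpoints (snapshots) and artifacts in that folder.

For example, if the output destination is a folder named model_snapshots and the Task ID is 9ed78536b91a44fbb3cc7a006128c1b0, the directory structure is: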

+ - model_snapshots
|   +-- examples
|       +-- extending automagical ClearML example.9ed78536b91a44fbb3cc7a006128c1b0
|           +-- models
|           +-- artifacts
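
The layout above can be sketched as a small path helper. This is only an illustration built from the directory tree shown in this tutorial; the function name expected_output_dirs is hypothetical and is not part of the clearml package, and the code does not contact ClearML.

```python
from pathlib import Path


def expected_output_dirs(output_root, project_name, task_name, task_id):
    """Return the (models, artifacts) folders implied by the layout above.

    Hypothetical helper for illustration only; not a clearml API.
    """
    # The task folder is named "<task name>.<Task Id>" under the project folder
    task_dir = Path(output_root) / project_name / "{}.{}".format(task_name, task_id)
    return task_dir / "models", task_dir / "artifacts"


models_dir, artifacts_dir = expected_output_dirs(
    "model_snapshots",
    "examples",
    "extending automagical ClearML example",
    "9ed78536b91a44fbb3cc7a006128c1b0",
)
print(models_dir)
print(artifacts_dir)
```

A helper like this can be handy for locating checkpoints on disk after a run, assuming the destination is a local or shared folder.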

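Step 2: Logger Class Reporting Methods

In addition to ClearML's automatic logging, the clearml Python package contains methods for explicitly reporting plots, log text, media, and tables. These methods include:

Logger.report_histogram
Logger.report_confusion_matrix
Logger.report_scatter2d
Logger.report_scatter3d
Logger.report_surface (surface diagrams)
Logger.report_image - Report an image and upload its contents.
Logger.report_table - Report a table as a Pandas DataFrame, a CSV file, or a URL for a CSV file.
Logger.report_media - Report media, including images, audio, and video.
Logger.get_default_upload_destination - Retrieve the destination that is set for uploaded media.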
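
As an illustration of one of the methods listed above, the following sketch prepares per-epoch accuracy points and passes them to report_scatter2d. The RecordingLogger class is a stand-in defined here so the example runs without a ClearML Server; it is not part of clearml, and its signature only mirrors the keyword arguments used in the call. The helper name, plot title, and sample numbers are assumptions made for the example. In the tutorial script, the object returned by task.get_logger() would take its place.

```python
class RecordingLogger:
    """Stand-in logger (hypothetical); it only records what would be reported."""

    def __init__(self):
        self.plots = []

    def report_scatter2d(self, title, series, scatter, iteration=None,
                         xaxis=None, yaxis=None, mode='lines'):
        self.plots.append({'title': title, 'series': series,
                           'scatter': scatter, 'mode': mode})


def report_accuracy_curve(logger, correct_per_epoch, dataset_size):
    """Report test accuracy (in percent) per epoch as a 2D scatter plot."""
    # Each point is an (x, y) pair: (epoch number, accuracy in percent)
    points = [
        (epoch, 100.0 * correct / dataset_size)
        for epoch, correct in enumerate(correct_per_epoch, start=1)
    ]
    logger.report_scatter2d(
        title='Accuracy per epoch',
        series='Test',
        scatter=points,
        iteration=0,
        xaxis='Epoch',
        yaxis='Accuracy (%)',
        mode='lines+markers',
    )
    return points


demo_logger = RecordingLogger()
points = report_accuracy_curve(demo_logger, [9000, 9500], dataset_size=10000)
print(points)
```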

Getting a Logger

First, get a logger for the Task using Task.get_logger():

logger = task.get_logger()

Plotting Scalar Metrics

Add scalar metrics using Logger.report_scalar() to report the loss metrics.

def train(args, model, device, train_loader, optimizer, epoch):
    
    save_loss = []
    
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
    
        # Store the Python float, not the tensor, so the autograd graph is not kept alive
        save_loss.append(loss.item())
    
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                        100. * batch_idx / len(train_loader), loss.item()))
            # Add manual scalar reporting for loss metrics
            logger.report_scalar(title='Scalar example {} - epoch'.format(epoch), 
                series='Loss', value=loss.item(), iteration=batch_idx)
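
The reporting pattern in train can be tried in isolation with a sketch like the one below. RecordingLogger is a stand-in defined here so the example runs without a ClearML Server; it is not part of clearml. The helper name report_epoch_losses and the sample loss values are assumptions made for the example. The report_scalar call uses the same keyword arguments as the tutorial code above.

```python
class RecordingLogger:
    """Stand-in logger (hypothetical); it only records the reported scalars."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))


def report_epoch_losses(logger, losses, epoch, log_interval=10):
    """Report every log_interval-th batch loss under one plot title per epoch."""
    for batch_idx, loss in enumerate(losses):
        if batch_idx % log_interval == 0:
            logger.report_scalar(
                title='Scalar example {} - epoch'.format(epoch),
                series='Loss',
                value=float(loss),
                iteration=batch_idx,
            )


demo_logger = RecordingLogger()
report_epoch_losses(demo_logger, [0.9, 0.7, 0.6, 0.5, 0.4], epoch=1, log_interval=2)
print(demo_logger.calls)
```

Because the title includes the epoch number, each epoch gets its own plot, with the batch index on the x-axis.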

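Plotting Other (Non-Scalar) Data

The script contains a function named test, which determines the loss and the number of correct predictions for the trained model. Add a histogram and a confusion matrix to log them.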

def test(args, model, device, test_loader):
    
    save_test_loss = []
    save_correct = []
    
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # get the index of the max log-probability
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    
            save_test_loss.append(test_loss)
            save_correct.append(correct)
    
    test_loss /= len(test_loader.dataset)
    
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    
    logger.report_histogram(
      title='Histogram example', 
      series='correct',
      iteration=1, 
      values=save_correct, 
      xaxis='Test', 
      yaxis='Correct'
    )
    
    # Manually report test loss and correct as a confusion matrix
    # (requires `import numpy as np` at the top of the script)
    matrix = np.array([save_test_loss, save_correct])
    logger.report_confusion_matrix(
      title='Confusion matrix example', 
      series='Test loss / correct', 
      matrix=matrix, 
      iteration=1
    )
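
The tutorial code above stacks the running loss and the running correct count into a two-row array to demonstrate the plot type. For comparison, a conventional confusion matrix counts pairs of true and predicted classes. The following is a minimal sketch of that counting step in plain Python; the function name and the sample labels are assumptions made for the example and are not taken from pytorch_mnist.py.

```python
def confusion_matrix(targets, predictions, num_classes):
    """Count (true class, predicted class) pairs.

    Rows are true classes and columns are predicted classes.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(targets, predictions):
        matrix[true_label][predicted_label] += 1
    return matrix


targets = [0, 1, 2, 2, 1, 0]
predictions = [0, 2, 2, 2, 1, 1]
matrix = confusion_matrix(targets, predictions, num_classes=3)
for row in matrix:
    print(row)
```

In the test function, the labels could presumably be collected from target and pred across batches, and the resulting matrix converted with np.array() before it is passed as the matrix argument of Logger.report_confusion_matrix().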

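Logging Text

Extend ClearML by explicitly logging text, including errors, warnings, and debugging statements. Use Logger.report_text() and its level argument to report a debugging message.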

import logging  # place with the script's other imports; needed for logging.DEBUG

logger.report_text(
  'The default output destination for model snapshots and artifacts is: {}'.format(
    model_snapshots_path
  ), 
  level=logging.DEBUG
)
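
The level argument takes the standard logging levels, so the same call can carry warnings as well as debug messages. The sketch below shows this with a stand-in logger so it runs without a ClearML Server. RecordingLogger and report_output_destination are hypothetical names introduced for the example, and the warning text is an assumption.

```python
import logging


class RecordingLogger:
    """Stand-in logger (hypothetical); it only records (level, message) pairs."""

    def __init__(self):
        self.messages = []

    def report_text(self, msg, level=logging.INFO):
        self.messages.append((level, msg))


def report_output_destination(logger, path, exists):
    """Log the destination at DEBUG level, plus a WARNING if it is missing."""
    logger.report_text(
        'The default output destination for model snapshots and artifacts is: {}'.format(path),
        level=logging.DEBUG,
    )
    if not exists:
        logger.report_text(
            'Output destination {} does not exist yet'.format(path),
            level=logging.WARNING,
        )


demo_logger = RecordingLogger()
report_output_destination(demo_logger, '/mnt/clearml', exists=False)
for level, msg in demo_logger.messages:
    print(logging.getLevelName(level), msg)
```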

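Step 3: Registering Artifacts

Registering an artifact uploads it to ClearML Server, and if it changes, the change is logged in ClearML Server. Currently, ClearML supports Pandas DataFrames as registered artifacts.

Register the Artifact

In the test function of the tutorial script, assign the test loss and correct data to a Pandas DataFrame object, and register that DataFrame using Task.register_artifact().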

# Create the Pandas DataFrame (requires `import pandas as pd` at the top of the script)
test_loss_correct = {
        'test loss': save_test_loss,
        'correct': save_correct
}
df = pd.DataFrame(test_loss_correct, columns=['test loss', 'correct'])
    
# Register the test loss and correct as a Pandas DataFrame artifact
task.register_artifact(
  'Test_Loss_Correct', 
  df, 
  metadata={
    'metadata string': 'apple', 
    'metadata int': 100, 
    'metadata dict': {'dict string': 'pear', 'dict int': 200}
  }
)

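Reference the Registered Artifact

Once an artifact is registered, it can be referenced and used in the Python experiment script.

In the tutorial script, add Task.current_task() and Task.get_registered_artifacts() to take a sample.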

# Once the artifact is registered, we can get it and work with it. Here, we sample it.
sample = Task.current_task().get_registered_artifacts()['Test_Loss_Correct'].sample(
  frac=0.5, 
  replace=True, 
  random_state=1
)

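Step 4: Uploading Artifacts

Artifacts can be uploaded to ClearML Server without their changes being logged.

Supported artifacts include:

Pandas DataFrames
Files of any type, including image files
Folders - stored as ZIP files
Images - stored as PNG files
Dictionaries - stored as JSON
Numpy arrays - stored as NPZ files

In the tutorial script, upload the loss data as an artifact using Task.upload_artifact(), and specify metadata in the metadata parameter.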

# Upload the test loss as an artifact. Here, the artifact is a NumPy array
task.upload_artifact(
  'Predictions',
  artifact_object=np.array(save_test_loss),
  metadata={
    'metadata string': 'banana', 
    'metadata integer': 300,
    'metadata dictionary': {'dict string': 'orange', 'dict int': 400}
  }
)
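
Since dictionaries are among the supported artifact types, a small summary of the loss values could be uploaded in the same way. The sketch below uses a stand-in task object so it runs without a ClearML Server. RecordingTask, upload_loss_summary, the artifact name, and the metadata key are hypothetical names introduced for the example. The upload_artifact call uses the same name, artifact_object, and metadata arguments as the tutorial code above.

```python
import statistics


class RecordingTask:
    """Stand-in task (hypothetical); it only records what would be uploaded."""

    def __init__(self):
        self.uploads = {}

    def upload_artifact(self, name, artifact_object, metadata=None):
        self.uploads[name] = (artifact_object, metadata)
        return True


def upload_loss_summary(task, losses):
    """Upload min, max, and mean loss as a dictionary artifact."""
    summary = {
        'min': min(losses),
        'max': max(losses),
        'mean': statistics.mean(losses),
    }
    return task.upload_artifact(
        'Loss summary',
        artifact_object=summary,
        metadata={'num batches': len(losses)},
    )


demo_task = RecordingTask()
uploaded = upload_loss_summary(demo_task, [0.5, 0.25, 0.75])
print(demo_task.uploads['Loss summary'])
```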

Additional Information

After extending the Python experiment script, run it and view the results in the ClearML Web UI.

python pytorch_mnist_tutorial.py

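To view the experiment results, do the following:

In the ClearML Web UI, on the Projects page, click the examples project.
In the experiments table, click the extending automagical ClearML example experiment.
In the ARTIFACTS tab, DATA AUDIT section, click Test_Loss_Correct. The registered Pandas DataFrame appears, including the file path, size, hash, metadata, and a preview.
In the OTHER section, click Predictions. The uploaded numpy array and its related information appear.
Click the CONSOLE tab to see the debugging message showing the Pandas DataFrame sample.
Click the SCALARS tab to see the scalar plots for the logged loss.
Click the PLOTS tab to see the confusion matrix and histogram.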

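Next Steps

See the User Interface section to learn about its features.
See the ClearML Python Package Reference to learn about all the available classes and methods.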
