ClearML Data CLI

important

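This page covers clearml-data, ClearML's file-based data management solution. For ClearML's advanced queryable dataset management solution, see Hyper-Datasets.

clearml-data is a data management CLI tool that comes as part of the clearml Python package. Use clearml-data to create, modify, and manage your datasets. You can upload your dataset to any storage service of your choice (S3 / GS / Azure / network storage) by setting the dataset's upload destination (see --storage). Once you have uploaded your dataset, you can access it from any machine.

This page provides a reference for clearml-data's CLI commands.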

create

Creates a new dataset.

clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
                    --name NAME [--version VERSION] [--output-uri OUTPUT_URI]
                    [--tags [TAGS [TAGS ...]]]

Arguments

Name	Description
`--name`	Dataset's name
`--project`	Dataset's project
`--version`	Dataset version. Use the semantic versioning scheme. If not specified, a version is assigned automatically
`--parents`	IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered
`--output-uri`	Sets where the dataset and its previews are uploaded
`--tags`	Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets

Dataset ID

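For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the Dataset UI.
For datasets created with earlier versions of clearml, or if using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.
clearml-data works in a stateful mode, so once a new dataset is created, the following commands do not require the --id flag.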
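
As a rough illustration of the synopsis above, the following Python sketch assembles a `clearml-data create` command line from the documented flags and prints it. The helper `build_create_command` and the project, dataset, and tag names are hypothetical. The sketch only builds the argument list; it does not contact a ClearML server.

```python
import shlex


def build_create_command(name, project=None, version=None, parents=(),
                         output_uri=None, tags=()):
    """Assemble the argv list for `clearml-data create` (hypothetical helper)."""
    argv = ["clearml-data", "create", "--name", name]  # --name is required
    if project:
        argv += ["--project", project]
    if version:
        argv += ["--version", version]
    if parents:
        # Parents are merged in the order they are listed
        argv += ["--parents", *parents]
    if output_uri:
        argv += ["--output-uri", output_uri]
    if tags:
        argv += ["--tags", *tags]
    return argv


# Hypothetical names, for illustration only
create_argv = build_create_command(
    "example-dataset", project="Example Project", version="1.0.0", tags=["raw"]
)
print(shlex.join(create_argv))
```

To actually run the command, the list could be passed to `subprocess.run(create_argv, check=True)` on a machine where clearml is installed and configured.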

add

Add individual files or complete folders to the dataset.

clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
                 [--files [FILES [FILES ...]]] [--wildcard [WILDCARD [WILDCARD ...]]]
                 [--links [LINKS [LINKS ...]]] [--non-recursive] [--verbose]

Arguments

Name	Description
`--id`	Dataset's ID. Default: previously created / accessed dataset
`--files`	Files / folders to add. Items will be uploaded to the dataset's designated storage.
`--wildcard`	Add specific set of files, denoted by these wildcards. For example: `~/data/*.jpg ~/data/json`. Multiple wildcards can be passed.
`--links`	Files / folders link to add. Supports S3, GS, Azure links. Example: `s3://bucket/data` `azure://<account name>.blob.core.windows.net/path/to/file`. Items remain in their original location.
`--dataset-folder`	Dataset base folder to add the files to in the dataset. Default: dataset root
`--non-recursive`	Disable recursive scan of files
`--verbose`	Verbose reporting
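
The difference between `--files` (uploaded to the dataset's storage) and `--links` (left in their original location) can be sketched as follows. The helper `build_add_command`, the local path, and the target folder name are hypothetical. The sketch omits `--id` by default, relying on the stateful mode described earlier.

```python
import shlex


def build_add_command(files=(), links=(), dataset_folder=None,
                      dataset_id=None, recursive=True):
    """Assemble the argv list for `clearml-data add` (hypothetical helper)."""
    argv = ["clearml-data", "add"]
    if dataset_id:
        argv += ["--id", dataset_id]
    if dataset_folder:
        argv += ["--dataset-folder", dataset_folder]
    if files:
        # Local files / folders: uploaded to the dataset's designated storage
        argv += ["--files", *files]
    if links:
        # Remote links: registered in the dataset, items stay where they are
        argv += ["--links", *links]
    if not recursive:
        argv.append("--non-recursive")
    return argv


# Hypothetical local folder and dataset folder; the S3 link is the table's example
add_argv = build_add_command(
    files=["./data/images"], links=["s3://bucket/data"], dataset_folder="train"
)
print(shlex.join(add_argv))
```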

remove

Remove files/links from a dataset.

clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]]
                    [--non-recursive] [--verbose]

Arguments

Name	Description
`--id`	Dataset's ID. Default: previously created / accessed dataset
`--files`	Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path. For links, you can specify their URL (for example, `s3://bucket/data`)
`--non-recursive`	Disable recursive scan of files
`--verbose`	Verbose reporting

upload

Upload the local dataset changes to the server. By default, the changes are uploaded to the ClearML file server. You can specify a different storage medium by entering an upload destination. For example:

A shared folder: /mnt/shared/folder
S3: s3://bucket/folder
Non-AWS S3-like services (such as MinIO): s3://host_addr:port/bucket
Google Cloud Storage: gs://bucket-name/folder
Azure Storage: azure://<account name>.blob.core.windows.net/path/to/file

clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE]
                    [--verbose]

Arguments

Name	Description
`--id`	Dataset's ID. Default: previously created / accessed dataset
`--storage`	Remote storage to use for the dataset files. Default: files_server
`--chunk-size`	Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.
`--verbose`	Verbose reporting
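
To get a feel for what `--chunk-size` implies, the sketch below estimates how many chunks an upload might produce. It is plain arithmetic under the assumption that data packs evenly into chunks. The real split depends on how clearml groups individual files, so treat the result as a rough guide only. The function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(total_size_mb, chunk_size_mb=512):
    """Rough chunk-count estimate (hypothetical helper, not part of clearml)."""
    if chunk_size_mb == -1:
        # -1 requests a single chunk, regardless of dataset size
        return 1
    if chunk_size_mb <= 0:
        raise ValueError("chunk size must be positive, or -1 for a single chunk")
    return math.ceil(total_size_mb / chunk_size_mb)


print(estimate_chunk_count(1300))      # default 512 MB chunks
print(estimate_chunk_count(1300, -1))  # single chunk
```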

close

Finalize the dataset and make it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.

clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
                   [--chunk-size CHUNK_SIZE] [--verbose]

Arguments

Name	Description
`--id`	Dataset's ID. Default: previously created / accessed dataset
`--storage`	Remote storage to use for the dataset files. Default: files_server
`--disable-upload`	Disable automatic upload when closing the dataset
`--chunk-size`	Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.
`--verbose`	Verbose reporting
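
Putting the commands so far together, a typical create, add, close sequence might look like the sketch below. The project, dataset, folder, and bucket names are hypothetical. No `--id` is passed because of the stateful mode, and no separate `upload` step is listed because `close` uploads pending files automatically. The sketch only prints the commands.

```python
import shlex

# Hypothetical project, dataset, folder, and bucket names
workflow = [
    ["clearml-data", "create", "--project", "Example Project", "--name", "example-dataset"],
    ["clearml-data", "add", "--files", "./data"],
    # close uploads any files not yet uploaded, then finalizes the dataset
    ["clearml-data", "close", "--storage", "s3://bucket/folder"],
]

for argv in workflow:
    print(shlex.join(argv))
```

On a configured machine, each list could be executed in order with `subprocess.run(argv, check=True)`, so that a failing step stops the sequence.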

sync

Sync a folder's content with ClearML. This option is useful when a user has a single source of truth (i.e. a folder) that is updated from time to time.

When an update should be reflected in ClearML's system, call clearml-data sync and pass the folder path. The changes (file additions, modifications, or removals) are then reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.

clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
                  [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
                  [--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
                  [--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]

Arguments

Name	Description
`--id`	Dataset's ID. Default: previously created / accessed dataset
`--dataset-folder`	Dataset base folder to add the files to (default: Dataset root)
`--folder`	Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`
`--storage`	Remote storage to use for the dataset files. Default: files server
`--parents`	IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset
`--project`	If creating a new dataset, specify the dataset's project name
`--name`	If creating a new dataset, specify the dataset's name
`--version`	Specify the dataset's version using the semantic versioning scheme. Default: `1.0.0`
`--tags`	Dataset user tags
`--skip-close`	Do not auto close dataset after syncing folders
`--chunk-size`	Set dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks.
`--verbose`	Verbose reporting
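
A minimal sketch of a `sync` invocation, built from the documented flags, is shown below. The helper `build_sync_command` and the folder, project, and dataset names are hypothetical. Passing `--project` and `--name` corresponds to the "creating a new dataset" case in the table above.

```python
import shlex


def build_sync_command(folder, project=None, name=None, storage=None,
                       skip_close=False):
    """Assemble the argv list for `clearml-data sync` (hypothetical helper)."""
    argv = ["clearml-data", "sync", "--folder", folder]  # --folder is required
    if project:
        argv += ["--project", project]
    if name:
        argv += ["--name", name]
    if storage:
        argv += ["--storage", storage]
    if skip_close:
        # Do not close (finalize) the dataset automatically after syncing
        argv.append("--skip-close")
    return argv


# Hypothetical folder, project, and dataset names
sync_argv = build_sync_command(
    "./data", project="Example Project", name="example-dataset"
)
print(shlex.join(sync_argv))
```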

list

List a dataset's contents.

clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
                  [--filter [FILTER [FILTER ...]]] [--modified]

Arguments

Name	Description
`--id`	Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset
`--project`	Specify dataset project name (if used instead of ID, dataset name is also required)
`--name`	Specify dataset name (if used instead of ID, dataset project is also required)
`--version`	Specify dataset version. Default: most recent version
`--filter`	Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/subfolder`
`--modified`	Only list file changes (add / remove / modify) introduced in this version

set-description

Sets the description of an existing dataset.

clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]

Arguments

Name	Description
`--id`	Dataset's ID
`--description`	Description to be set

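delete

Delete a dataset. Pass any of the attributes of the dataset(s) you want to delete. If multiple datasets match the request, an exception is raised unless you pass both --entire-dataset and --force. In that case, all matching datasets are deleted.

If a dataset is a parent of other datasets, you must pass --force to delete it.

warning

Deleting a parent dataset may cause child datasets to lose data!

clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME]
                    [--version VERSION] [--force] [--entire-dataset]

Arguments

Name	Description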
`--id`	ID of the dataset to delete (alternatively, use project / name combination).
`--project`	Specify dataset project name (if used instead of ID, dataset name is also required)
`--name`	Specify dataset name (if used instead of ID, dataset project is also required)
`--version`	Specify dataset version
`--force`	Force dataset deletion even if other dataset versions depend on it. Must also be used if the `--entire-dataset` flag is used
`--entire-dataset`	Delete all found datasets
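
The interaction between `--force` and `--entire-dataset` described above can be modeled as a small decision function. This is only a sketch of the documented rules, not clearml's actual implementation. The function name `check_delete_request` and its parameters are hypothetical.

```python
def check_delete_request(num_matches, has_children=False,
                         force=False, entire_dataset=False):
    """Return None if the documented rules allow the delete, else a reason string."""
    if entire_dataset and not force:
        return "--entire-dataset must be used together with --force"
    if num_matches > 1 and not entire_dataset:
        return "multiple datasets match; pass --entire-dataset and --force"
    if has_children and not force:
        return "dataset is a parent of other datasets; pass --force"
    return None


# One match, no dependents: allowed without extra flags
print(check_delete_request(num_matches=1))
# Several matches without --entire-dataset: rejected
print(check_delete_request(num_matches=3, force=True))
# Several matches with both flags: allowed
print(check_delete_request(num_matches=3, force=True, entire_dataset=True))
```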

rename

Rename a dataset (and all of its versions).

clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME

Arguments

Name	Description
`--new-name`	The new name of the dataset
`--project`	The project the dataset to be renamed belongs to
`--name`	The current name of the dataset(s) to be renamed

move

Move a dataset to another project.

clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME

Arguments

Name	Description
`--new-project`	The new project of the dataset
`--project`	The current project that the dataset to be moved belongs to
`--name`	The name of the dataset to be moved

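search

Search datasets in the system by project, name, ID, and/or tags.

Returns a list of all datasets in the system that match the search request, sorted by creation time.

clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]
                    [--name NAME] [--tags [TAGS [TAGS ...]]]

Arguments

Name	Description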
`--ids`	A list of dataset IDs
`--project`	The project name of the datasets
`--name`	A dataset name or a partial name to filter datasets by
`--tags`	A list of dataset user tags

compare

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this: `Comparison summary: 4 files removed, 3 files modified, 0 files added`

clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]

Arguments

Name	Description
`--source`	Source dataset ID (used as baseline)
`--target`	Target dataset ID (compare against the source baseline dataset)
`--verbose`	Verbose report all file changes (instead of summary)
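
If the summary line needs to be consumed by a script, it could be parsed as in the sketch below. This assumes the output matches the example line above exactly; other clearml versions may word it differently. The names `SUMMARY_RE` and `parse_comparison_summary` are hypothetical.

```python
import re

# Pattern modeled on the example summary line; the exact format is an assumption
SUMMARY_RE = re.compile(
    r"Comparison summary: (\d+) files removed, (\d+) files modified, (\d+) files added"
)


def parse_comparison_summary(text):
    """Extract the three counters from a summary line, or None if not found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    removed, modified, added = (int(group) for group in match.groups())
    return {"removed": removed, "modified": modified, "added": added}


summary = parse_comparison_summary(
    "Comparison summary: 4 files removed, 3 files modified, 0 files added"
)
print(summary)
```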

squash

Squash multiple datasets into a single dataset version (merge down).

clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]

Arguments

Name	Description
`--name`	Create squashed dataset name
`--ids`	Source dataset IDs to squash (merge down)
`--storage`	Remote storage to use for the dataset files. Default: files_server
`--verbose`	Verbose report all file changes (instead of summary)

verify

Verify that the dataset content matches the data from the local source.

clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]

Arguments

Name	Description
`--id`	Specify dataset ID. Default: previously created/accessed dataset
`--folder`	Specify dataset local copy (if not provided the local cache folder will be verified)
`--filesize`	If `True`, only verify file size and skip hash checks (default: `False`)
`--verbose`	Verbose report all file changes (instead of summary)

get

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.

clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
                 [--num-parts NUM_PARTS] [--overwrite] [--verbose]

Arguments

Name	Description
`--id`	Specify dataset ID. Default: previously created / accessed dataset
`--copy`	Get a writable copy of the dataset to a specific output folder
`--link`	Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset
`--part`	Retrieve a partial copy of the dataset. Part number (0 to `--num-parts`-1) of total parts `--num-parts`.
`--num-parts`	Total number of parts to divide the dataset into. Notice, minimum retrieved part is a single chunk in a dataset (or its parents). Example: Dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts
`--overwrite`	If `True`, overwrite the target folder
`--verbose`	Verbose report all file changes (instead of summary)
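
The `--part` / `--num-parts` pair can be illustrated by generating one `get` command per part, for example to spread a dataset across several workers. The helper `build_get_part_commands` and the placeholder dataset ID are hypothetical. Part numbers run from 0 to `--num-parts` minus 1, as stated in the table.

```python
import shlex


def build_get_part_commands(dataset_id, num_parts):
    """Return one `clearml-data get` argv list per part (hypothetical helper)."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    return [
        ["clearml-data", "get", "--id", dataset_id,
         "--part", str(part), "--num-parts", str(num_parts)]
        for part in range(num_parts)  # parts are numbered 0 .. num_parts - 1
    ]


# Placeholder ID; four parts, as in the table's gen4 example
part_commands = build_get_part_commands("<dataset_id>", 4)
for argv in part_commands:
    print(shlex.join(argv))
```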

publish

Publish the dataset for public use. The dataset must be finalized before it is published.

clearml-data publish [-h] --id ID

Arguments

Name	Description
`--id`	The dataset task ID to be published.
