ClearML Data Command Line Interface
This page covers clearml-data, ClearML's file-based data management solution.
For ClearML's advanced queryable dataset management solution, see Hyper-Datasets.
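clearml-data is a data management CLI tool that ships as part of the clearml Python package. Use clearml-data to create, modify, and manage your datasets. You can upload a dataset to any storage service of your choice (S3 / GS / Azure / network storage) by setting the dataset's upload destination (see --storage). Once a dataset is uploaded, you can access it from any machine.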
The rest of this page is a reference for the clearml-data CLI commands.
create
Create a new dataset.
clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
--name NAME [--version VERSION] [--output-uri OUTPUT_URI]
[--tags [TAGS [TAGS ...]]]
Arguments
Name | Description | Optional |
---|---|---|
--name | Dataset's name | |
--project | Dataset's project | |
--version | Dataset version. Use the semantic versioning scheme. If not specified a version will automatically be assigned | |
--parents | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered | |
--output-uri | Sets where dataset and its previews are uploaded to | |
--tags | Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets |
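- For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the Dataset UI. For datasets created with earlier versions of clearml, or when using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.
- clearml-data works in a stateful mode, so once a new dataset is created, subsequent commands do not require the --id flag.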
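Because the CLI is stateful, a typical session names the dataset only once. The sketch below merely builds and prints the command lines for such a session; the project name, dataset name, and folder path are hypothetical placeholders, and actually running the printed commands assumes a configured clearml installation.

```python
import shlex

# Hypothetical placeholders, not taken from the documentation.
PROJECT = "examples"
NAME = "my_dataset"
DATA_DIR = "./data"


def build_session(project, name, data_dir):
    """Return the create -> add -> close command lines for one dataset."""
    return [
        ["clearml-data", "create", "--project", project, "--name", name],
        # No --id below: the CLI defaults to the previously created dataset.
        ["clearml-data", "add", "--files", data_dir],
        ["clearml-data", "close"],
    ]


session = build_session(PROJECT, NAME, DATA_DIR)
for argv in session:
    print(shlex.join(argv))
```

Keeping each command as an argument list (rather than one string) makes it straightforward to hand the same lists to subprocess.run later without worrying about shell quoting.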
add
Add individual files or complete folders to the dataset.
clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
[--files [FILES [FILES ...]]] [--wildcard [WILDCARD [WILDCARD ...]]]
[--links [LINKS [LINKS ...]]] [--non-recursive] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID. Default: previously created / accessed dataset | |
--files | Files / folders to add. Items will be uploaded to the dataset's designated storage. | |
--wildcard | Add specific set of files, denoted by these wildcards. For example: ~/data/*.jpg ~/data/json . Multiple wildcards can be passed. | |
--links | Files / folders link to add. Supports S3, GS, Azure links. Example: s3://bucket/data azure://<account name>.blob.core.windows.net/path/to/file . Items remain in their original location. | |
--dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root | |
--non-recursive | Disable recursive scan of files | |
--verbose | Verbose reporting |
remove
Remove files / links from the dataset.
clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]]
[--non-recursive] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID. Default: previously created / accessed dataset | |
--files | Files / folders to remove (wildcard selection is supported, for example: ~/data/*.jpg ~/data/json ). Notice: file path is the path within the dataset, not the local path. For links, you can specify their URL (for example, s3://bucket/data ) | |
--non-recursive | Disable recursive scan of files | |
--verbose | Verbose reporting |
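The --files description notes that wildcards are matched against paths inside the dataset, not local paths. As a rough illustration of that idea, the sketch below applies shell-style patterns to a hypothetical list of dataset-relative paths using Python's fnmatch module. This is only an approximation: the file names are invented, and clearml-data's own matching rules (for example, how it treats folders and recursion) may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical dataset listing: paths are relative to the dataset root.
dataset_files = [
    "images/cat.jpg",
    "images/dog.jpg",
    "images/labels.json",
    "docs/readme.txt",
]


def select(paths, patterns):
    """Return the paths that match at least one shell-style pattern."""
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


to_remove = select(dataset_files, ["images/*.jpg"])
print(to_remove)
```

Previewing a selection this way before running the remove command can help avoid dropping more files than intended.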
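upload
Upload the local dataset changes to the server. By default, the data is uploaded to the ClearML file server. You can specify a different storage medium by entering an upload destination. For example:
- A shared folder: /mnt/shared/folder
- S3: s3://bucket/folder
- Non-AWS S3-like services (such as MinIO): s3://host_addr:port/bucket
- Google Cloud Storage: gs://bucket-name/folder
- Azure Storage: azure://<account name>.blob.core.windows.net/path/to/file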
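The destination formats listed above differ mainly in their URL scheme. The following sketch classifies a destination string by scheme using only the standard library; it is an illustration of the naming convention, not how clearml-data resolves storage internally, and the function name and labels are made up for this example.

```python
from urllib.parse import urlparse

# Labels are illustrative; the schemes come from the list of destinations.
KINDS = {
    "s3": "S3 or S3-compatible",
    "gs": "Google Cloud Storage",
    "azure": "Azure Storage",
}


def storage_kind(destination):
    """Classify an upload destination by its URL scheme."""
    scheme = urlparse(destination).scheme
    if scheme == "":
        # No scheme: treat it as a local or shared folder path.
        return "local or shared folder"
    return KINDS.get(scheme, "unknown")


for dest in ["/mnt/shared/folder", "s3://bucket/folder", "gs://bucket-name/folder"]:
    print(dest, "->", storage_kind(dest))
```

Note that a Windows path such as C:\data would be reported as "unknown" here, because the drive letter parses as a scheme; the sketch only targets the POSIX-style examples above.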
clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE]
[--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID. Default: previously created / accessed dataset | |
--storage | Remote storage to use for the dataset files. Default: files_server | |
--chunk-size | Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. | |
--verbose | Verbose reporting |
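To get a feel for what a given --chunk-size means, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: it assumes the data divides evenly by size, whereas the real chunk boundaries depend on how individual files are packed, so the actual count may differ. The function name is hypothetical.

```python
import math


def estimated_chunks(total_mb, chunk_size_mb=512):
    """Estimate how many upload chunks a dataset of total_mb megabytes needs."""
    if chunk_size_mb == -1:
        # -1 requests a single chunk regardless of size.
        return 1
    if chunk_size_mb <= 0:
        raise ValueError("chunk size must be positive, or -1 for a single chunk")
    return max(1, math.ceil(total_mb / chunk_size_mb))


print(estimated_chunks(1300))       # default 512 MB chunks
print(estimated_chunks(1300, -1))   # single chunk
```

Smaller chunks make partial downloads finer-grained, at the cost of more upload artifacts per dataset version.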
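close
Finalize the dataset and make it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.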
clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
[--chunk-size CHUNK_SIZE] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID. Default: previously created / accessed dataset | |
--storage | Remote storage to use for the dataset files. Default: files_server | |
--disable-upload | Disable automatic upload when closing the dataset | |
--chunk-size | Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. | |
--verbose | Verbose reporting |
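sync
Sync a folder's content with ClearML. This option is useful when you have a single source of truth (i.e. a folder) that is updated from time to time.
When an update should be reflected in ClearML, call clearml-data sync and pass the folder path. The changes (file additions, modifications, and removals) are then reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.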
clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
[--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
[--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
[--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID. Default: previously created / accessed dataset | |
--dataset-folder | Dataset base folder to add the files to (default: Dataset root) | |
--folder | Local folder to sync. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json | |
--storage | Remote storage to use for the dataset files. Default: files server | |
--parents | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | |
--project | If creating a new dataset, specify the dataset's project name | |
--name | If creating a new dataset, specify the dataset's name | |
--version | Specify the dataset's version using the semantic versioning scheme. Default: 1.0.0 | |
--tags | Dataset user tags | |
--skip-close | Do not auto close dataset after syncing folders | |
--chunk-size | Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. | |
--verbose | Verbose reporting |
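Conceptually, syncing means classifying every file as added, modified, or removed relative to the previous state. The sketch below shows that classification on two hypothetical snapshots that map relative paths to content hashes. It is a simplified model for illustration, not ClearML's implementation, and the snapshot contents are invented.

```python
def diff_snapshots(old, new):
    """Classify paths as added, modified, or removed between two snapshots.

    Each snapshot maps a relative file path to a content hash.
    """
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical folder states before and after an update.
previous = {"a.csv": "h1", "b.csv": "h2", "c.csv": "h3"}
current = {"a.csv": "h1", "b.csv": "h2-changed", "d.csv": "h4"}

changes = diff_snapshots(previous, current)
print(changes)
```

Unchanged files (a.csv here) appear in none of the three lists, which is why repeated syncs of a mostly static folder only need to transfer the differences.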
list
List a dataset's contents.
clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
[--filter [FILTER [FILTER ...]]] [--modified]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | |
--project | Specify dataset project name (if used instead of ID, dataset name is also required) | |
--name | Specify dataset name (if used instead of ID, dataset project is also required) | |
--version | Specify dataset version. Default: most recent version | |
--filter | Filter files based on folder / wildcard. Multiple filters are supported. Example: folder/date_*.json folder/subfolder | |
--modified | Only list file changes (add / remove / modify) introduced in this version |
set-description
Set the description of an existing dataset.
clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]
Arguments
Name | Description | Optional |
---|---|---|
--id | Dataset's ID | |
--description | Description to be set |
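delete
Delete a dataset. Pass any of the attributes of the dataset(s) you want to delete. If multiple datasets match the request, an exception is raised unless you pass both --entire-dataset and --force, in which case all matching datasets are deleted.
If a dataset is a parent of other datasets, you must pass --force to delete it.
Deleting a parent dataset may cause child datasets to lose data!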
clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME]
[--version VERSION] [--force] [--entire-dataset]
Arguments
Name | Description | Optional |
---|---|---|
--id | ID of the dataset to delete (alternatively, use project / name combination). | |
--project | Specify dataset project name (if used instead of ID, dataset name is also required) | |
--name | Specify dataset name (if used instead of ID, dataset project is also required) | |
--version | Specify dataset version | |
--force | Force dataset deletion even if other dataset versions depend on it. Must also be used if --entire-dataset flag is used | |
--entire-dataset | Delete all found datasets |
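The interaction between --force and --entire-dataset can be summarized as a small decision function. The sketch below is only a model of the rules as described in this section; the function name, parameters, and error messages are hypothetical and do not correspond to ClearML code.

```python
def check_delete(num_matches, has_children, force=False, entire_dataset=False):
    """Model the documented delete rules; return how many datasets would go."""
    if num_matches > 1 and not (force and entire_dataset):
        # Several datasets match and the caller did not opt in to deleting all.
        raise ValueError("multiple datasets match: pass --entire-dataset and --force")
    if has_children and not force:
        # Parent datasets are protected unless deletion is forced.
        raise ValueError("dataset has children: pass --force")
    return num_matches


print(check_delete(1, has_children=False))
print(check_delete(3, has_children=False, force=True, entire_dataset=True))
```

Both safeguards exist for the same reason: a deletion that reaches further than one leaf dataset has to be requested explicitly.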
rename
Rename a dataset (and all of its versions).
clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME
Arguments
Name | Description | Optional |
---|---|---|
--new-name | The new name of the dataset | |
--project | The project the dataset to be renamed belongs to | |
--name | The current name of the dataset(s) to be renamed |
move
Move a dataset to another project.
clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME
Arguments
Name | Description | Optional |
---|---|---|
--new-project | The new project of the dataset | |
--project | The current project the dataset to be moved belongs to | |
--name | The name of the dataset to be moved |
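search
Search datasets in the system by project, name, ID, and/or tags.
Returns a list of all datasets in the system that match the search request, sorted by creation time.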
clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]
[--name NAME] [--tags [TAGS [TAGS ...]]]
Arguments
Name | Description | Optional |
---|---|---|
--ids | A list of dataset IDs | |
--project | The project name of the datasets | |
--name | A dataset name or a partial name to filter datasets by | |
--tags | A list of dataset user tags |
compare
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
Comparison summary: 4 files removed, 3 files modified, 0 files added
clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--source | Source dataset ID (used as baseline) | |
--target | Target dataset ID (compare against the source baseline dataset) | |
--verbose | Verbose report all file changes (instead of summary) |
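If the summary line needs to be consumed by a script, it can be parsed with a regular expression. The sketch below assumes the exact wording of the sample summary shown in this section; the format may vary between clearml versions, so treat the pattern as an assumption rather than a stable contract.

```python
import re

# Pattern derived from the sample summary line in this section.
SUMMARY_RE = re.compile(
    r"Comparison summary: (\d+) files removed, "
    r"(\d+) files modified, (\d+) files added"
)


def parse_summary(line):
    """Extract the removed / modified / added counts from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a comparison summary line")
    removed, modified, added = (int(g) for g in match.groups())
    return {"removed": removed, "modified": modified, "added": added}


sample = "Comparison summary: 4 files removed, 3 files modified, 0 files added"
counts = parse_summary(sample)
print(counts)
```

A script could, for instance, fail a pipeline step when the removed count is unexpectedly non-zero.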
squash
Squash multiple datasets into a single dataset version (merge down).
clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--name | Create squashed dataset name | |
--ids | Source dataset IDs to squash (merge down) | |
--storage | Remote storage to use for the dataset files. Default: files_server | |
--verbose | Verbose report all file changes (instead of summary) |
verify
Verify that the dataset content matches the data from the local source.
clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Specify dataset ID. Default: previously created/accessed dataset | |
--folder | Specify dataset local copy (if not provided the local cache folder will be verified) | |
--filesize | If True , only verify file size and skip hash checks (default: False ) | |
--verbose | Verbose report all file changes (instead of summary) |
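The trade-off behind --filesize is that a size check is fast but can miss changes that keep the byte count the same. The sketch below demonstrates this on a temporary file. It is a generic illustration: the use of SHA-256 and the helper names are assumptions for this example, not a statement about how clearml-data hashes files.

```python
import hashlib
import os
import tempfile


def same_size(path, expected_size):
    """Cheap check: compare only the file size in bytes."""
    return os.path.getsize(path) == expected_size


def same_hash(path, expected_sha256):
    """Thorough check: compare a SHA-256 digest of the file content."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    recorded_size = 4
    recorded_hash = hashlib.sha256(b"abcd").hexdigest()
    # Write content with the same length but different bytes.
    with open(path, "wb") as f:
        f.write(b"abcX")
    size_ok = same_size(path, recorded_size)
    hash_ok = same_hash(path, recorded_hash)

print("size check passed:", size_ok)
print("hash check passed:", hash_ok)
```

In this run the size check passes while the hash check fails, which is the kind of change a size-only verification would not detect.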
get
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.
clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
[--num-parts NUM_PARTS] [--overwrite] [--verbose]
Arguments
Name | Description | Optional |
---|---|---|
--id | Specify dataset ID. Default: previously created / accessed dataset | |
--copy | Get a writable copy of the dataset to a specific output folder | |
--link | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | |
--part | Retrieve a partial copy of the dataset. Part number (0 to --num-parts -1) of total parts --num-parts . | |
--num-parts | Total number of parts to divide the dataset into. Notice, minimum retrieved part is a single chunk in a dataset (or its parents). Example: Dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts | |
--overwrite | If True , overwrite the target folder | |
--verbose | Verbose report all file changes (instead of summary) |
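Since --part takes values from 0 to --num-parts minus 1, splitting a download across several workers amounts to giving each worker its own part index. The sketch below generates one command line per worker. The dataset ID is a hypothetical placeholder, and the snippet only prints the commands rather than running them.

```python
import shlex


def part_commands(dataset_id, num_parts):
    """Build one 'clearml-data get' command per part index (0 .. num_parts - 1)."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    return [
        shlex.join([
            "clearml-data", "get",
            "--id", dataset_id,
            "--part", str(part),
            "--num-parts", str(num_parts),
        ])
        for part in range(num_parts)
    ]


# "abc123" is a made-up ID used purely for illustration.
cmds = part_commands("abc123", 4)
for cmd in cmds:
    print(cmd)
```

As the --num-parts description notes, the smallest retrievable unit is a single chunk, so requesting more parts than the dataset (and its parents) has chunks will not subdivide the data further.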
publish
Publish the dataset for public use. The dataset must be finalized before it is published.
clearml-data publish [-h] --id ID
Arguments
Name | Description | Optional |
---|---|---|
--id | The dataset task ID to be published. |