ClearML Data Command Line Interface

Important

This page covers clearml-data, ClearML's file-based data management solution. For ClearML's advanced queryable dataset management solution, see Hyper-Datasets.

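clearml-data is a data management CLI tool that comes as part of the clearml Python package. Use clearml-data to create, modify, and manage your datasets. You can upload a dataset to any storage service of your choice (S3 / GS / Azure / network storage) by setting the dataset's upload destination (see --storage). Once you have uploaded a dataset, you can access it from any machine.

The following sections provide a reference for the clearml-data CLI commands.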

create

Creates a new dataset.

clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] 
--name NAME [--version VERSION] [--output-uri OUTPUT_URI]
[--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Optional |
|---|---|---|
| --name | Dataset's name | No |
| --project | Dataset's project | No |
| --version | Dataset version. Use the semantic versioning scheme. If not specified, a version is assigned automatically | Yes |
| --parents | IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered | Yes |
| --output-uri | Sets where the dataset and its previews are uploaded | Yes |
| --tags | Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets | Yes |
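Dataset ID
  • For datasets created with clearml v1.6 or newer on ClearML Server v1.6 or newer, find the ID in the dataset version's info panel in the Dataset UI.
    For datasets created with earlier versions of clearml, or if using an earlier version of ClearML Server, find the ID in the task header of the dataset task's info panel.
  • clearml-data works in a stateful mode, so once a new dataset is created, the following commands do not require the --id flag.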
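The optional flags of create can be assembled programmatically before the command is run. The following is a minimal sketch, assuming you drive the CLI from a Python script; `build_create_command` is a hypothetical helper (not part of clearml), and the project and dataset names are placeholders. It only builds and prints the command line and does not execute anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_create_command(
    project: str,
    name: str,
    version: Optional[str] = None,
    parents: Sequence[str] = (),
    output_uri: Optional[str] = None,
    tags: Sequence[str] = (),
) -> List[str]:
    """Return the argv list for `clearml-data create` (hypothetical helper)."""
    # --project and --name are the two required parameters.
    cmd = ["clearml-data", "create", "--project", project, "--name", name]
    if version is not None:
        cmd += ["--version", version]
    if parents:
        # Parents are merged in the order they are listed.
        cmd += ["--parents", *parents]
    if output_uri is not None:
        cmd += ["--output-uri", output_uri]
    if tags:
        cmd += ["--tags", *tags]
    return cmd


if __name__ == "__main__":
    # Placeholder names; the command is printed, not executed.
    print(shlex.join(build_create_command("Example Project", "example-dataset", version="1.0.0")))
```

Passing the returned list to `subprocess.run` would execute it, provided clearml is installed and configured on the machine.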

add

Adds individual files or complete folders to the dataset.

clearml-data add [-h] [--id ID] [--dataset-folder DATASET_FOLDER]
[--files [FILES [FILES ...]]] [--wildcard [WILDCARD [WILDCARD ...]]]
[--links [LINKS [LINKS ...]]] [--non-recursive] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --files | Files / folders to add. Items will be uploaded to the dataset's designated storage. | Yes |
| --wildcard | Add a specific set of files, denoted by these wildcards. For example: ~/data/*.jpg ~/data/json. Multiple wildcards can be passed. | Yes |
| --links | Links to files / folders to add. Supports S3, GS, and Azure links. Example: s3://bucket/data azure://<account name>.blob.core.windows.net/path/to/file. Items remain in their original location. | Yes |
| --dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root | Yes |
| --non-recursive | Disable recursive scan of files | Yes |
| --verbose | Verbose reporting | Yes |
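The difference between --files (items are uploaded) and --links (items stay where they are) can be illustrated with a small sketch. It assumes that anything starting with one of the link schemes named in the table is passed as a link and everything else as a local file; both function names are hypothetical, and combining --files and --links in one call is an assumption based on the usage line, where both are optional.

```python
from typing import Iterable, List, Optional, Tuple

# URL schemes mentioned in the --links description (S3, GS, Azure).
LINK_SCHEMES = ("s3://", "gs://", "azure://")


def split_files_and_links(items: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate local paths from remote links by URL scheme."""
    files: List[str] = []
    links: List[str] = []
    for item in items:
        (links if item.startswith(LINK_SCHEMES) else files).append(item)
    return files, links


def build_add_command(items: Iterable[str], dataset_folder: Optional[str] = None) -> List[str]:
    """Return the argv list for `clearml-data add` (hypothetical helper)."""
    files, links = split_files_and_links(items)
    cmd = ["clearml-data", "add"]
    if dataset_folder is not None:
        cmd += ["--dataset-folder", dataset_folder]
    if files:
        cmd += ["--files", *files]
    if links:
        cmd += ["--links", *links]
    return cmd
```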

remove

Removes files/links from the dataset.

clearml-data remove [-h] [--id ID] [--files [FILES [FILES ...]]] 
[--non-recursive] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --files | Files / folders to remove (wildcard selection is supported, for example: ~/data/*.jpg ~/data/json). Note: the file path is the path within the dataset, not the local path. For links, you can specify their URL (for example, s3://bucket/data) | No |
| --non-recursive | Disable recursive scan of files | Yes |
| --verbose | Verbose reporting | Yes |

upload

Uploads the local dataset changes to the server. By default, it uploads to the ClearML file server. You can specify a different storage medium by entering an upload destination. For example:

  • A shared folder: /mnt/shared/folder
  • S3: s3://bucket/folder
  • Non-AWS S3-like services (such as MinIO): s3://host_addr:port/bucket
  • Google Cloud Storage: gs://bucket-name/folder
  • Azure Storage: azure://<account name>.blob.core.windows.net/path/to/file

clearml-data upload [-h] [--id ID] [--storage STORAGE] [--chunk-size CHUNK_SIZE]
[--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| --verbose | Verbose reporting | Yes |
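As a rough sketch of how a storage target and chunk size end up on the command line, the snippet below formats a target for an S3-compatible service following the s3://host_addr:port/bucket pattern listed for this command, then builds the upload command. The helper names, host, port, and bucket are illustrative assumptions, not part of clearml.

```python
from typing import List, Optional


def s3_compatible_target(host: str, port: int, bucket: str) -> str:
    """Format a target for a non-AWS S3-like service such as MinIO."""
    return f"s3://{host}:{port}/{bucket}"


def build_upload_command(
    storage: Optional[str] = None,
    chunk_size_mb: Optional[int] = None,
) -> List[str]:
    """Return the argv list for `clearml-data upload` (hypothetical helper)."""
    cmd = ["clearml-data", "upload"]
    if storage is not None:
        cmd += ["--storage", storage]
    if chunk_size_mb is not None:
        # -1 requests a single chunk, per the --chunk-size description.
        cmd += ["--chunk-size", str(chunk_size_mb)]
    return cmd
```

Leaving both arguments unset yields a bare `clearml-data upload`, which relies on the defaults listed in the table.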

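close

Finalizes the dataset and makes it ready to be consumed. This automatically uploads all files that were not previously uploaded. Once a dataset is finalized, it can no longer be modified.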

clearml-data close [-h] [--id ID] [--storage STORAGE] [--disable-upload]
[--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --disable-upload | Disable automatic upload when closing the dataset | Yes |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| --verbose | Verbose reporting | Yes |
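Because clearml-data is stateful, a typical create, add, close sequence can omit --id after the first command. The sketch below lists such a sequence and renders it as shell command lines. The project name, dataset name, and ./data path are placeholders, and `run_workflow` is a hypothetical helper that only executes the commands when `dry_run=False`.

```python
import shlex
import subprocess
from typing import List, Sequence

# Placeholder project, dataset name, and local folder.
WORKFLOW: List[List[str]] = [
    ["clearml-data", "create", "--project", "Example Project", "--name", "example-dataset"],
    ["clearml-data", "add", "--files", "./data"],
    ["clearml-data", "close"],  # uploads pending files and finalizes the dataset
]


def run_workflow(commands: Sequence[Sequence[str]], dry_run: bool = True) -> List[str]:
    """Render each command as a shell line; execute it only if dry_run is False."""
    rendered = []
    for cmd in commands:
        rendered.append(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(list(cmd), check=True)
    return rendered


if __name__ == "__main__":
    for line in run_workflow(WORKFLOW):
        print(line)
```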

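sync

Synchronizes a folder's contents with ClearML. This option is useful when there is a single source of truth (that is, one folder) and that folder is updated from time to time.

Once an update should be reflected in ClearML's system, call clearml-data sync and pass the folder path. The changes (file additions, modifications, or removals) are then reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.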

clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
[--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
[--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
[--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID. Default: previously created / accessed dataset | Yes |
| --dataset-folder | Dataset base folder to add the files to. Default: dataset root | Yes |
| --folder | Local folder to sync. Wildcard selection is supported, for example: ~/data/*.jpg ~/data/json | No |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --parents | IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset | Yes |
| --project | If creating a new dataset, specify the dataset's project name | Yes |
| --name | If creating a new dataset, specify the dataset's name | Yes |
| --version | Specify the dataset's version using the semantic versioning scheme. Default: 1.0.0 | Yes |
| --tags | Dataset user tags | Yes |
| --skip-close | Do not auto close the dataset after syncing folders | Yes |
| --chunk-size | Dataset artifact upload chunk size in MB. Default: 512 (pass -1 for a single chunk). Example: with 512, the dataset is split and uploaded in 512 MB chunks. | Yes |
| --verbose | Verbose reporting | Yes |
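To make the notion of additions, modifications, and removals concrete, here is a conceptual sketch that compares two snapshots of a folder by content hash. It is an illustration of the idea only and is not how clearml-data is implemented; `snapshot_folder` and `diff_snapshots` are hypothetical names, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def snapshot_folder(folder: str) -> Dict[str, str]:
    """Map each file's path (relative to folder) to the SHA-256 of its content."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, modified, removed) paths between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # A file is modified when it exists in both snapshots with different hashes.
    modified = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, modified, removed
```

Taking a snapshot before and after editing the folder and diffing the two gives a preview of the kind of changes a subsequent sync would be expected to pick up.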

list

Lists a dataset's contents.

clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
[--filter [FILTER [FILTER ...]]] [--modified]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset | Yes |
| --project | Specify dataset project name (if used instead of ID, dataset name is also required) | Yes |
| --name | Specify dataset name (if used instead of ID, dataset project is also required) | Yes |
| --version | Specify dataset version. Default: most recent version | Yes |
| --filter | Filter files based on folder / wildcard. Multiple filters are supported. Example: folder/date_*.json folder/subfolder | Yes |
| --modified | Only list file changes (add / remove / modify) introduced in this version | Yes |

set-description

Sets the description of an existing dataset.

clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Dataset's ID | No |
| --description | Description to be set | No |

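delete

Deletes a dataset. Pass any of the attributes of the dataset(s) you want to delete. If multiple datasets match the request, an exception is raised unless you pass both --entire-dataset and --force, in which case all matching datasets are deleted.

If a dataset is a parent of other datasets, you must pass --force to delete it.

Warning

Deleting a parent dataset may cause child datasets to lose data!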

clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME] 
[--version VERSION] [--force] [--entire-dataset]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | ID of the dataset to delete (alternatively, use project / name combination). | Yes |
| --project | Specify dataset project name (if used instead of ID, dataset name is also required) | Yes |
| --name | Specify dataset name (if used instead of ID, dataset project is also required) | Yes |
| --version | Specify dataset version | Yes |
| --force | Force dataset deletion even if other dataset versions depend on it. Must also be used if the --entire-dataset flag is used | Yes |
| --entire-dataset | Delete all found datasets | Yes |
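The interaction between --force and --entire-dataset can be summarized as a small decision function. This is only a model of the rules as described in this section, written under the assumption that they are checked in the order shown; it is not clearml's implementation, and `plan_delete` and its arguments are hypothetical.

```python
from typing import Collection, List, Sequence


def plan_delete(
    matches: Sequence[str],
    parents_with_children: Collection[str] = (),
    force: bool = False,
    entire_dataset: bool = False,
) -> List[str]:
    """Return the dataset IDs that would be deleted, or raise ValueError.

    matches: IDs of datasets matching the request.
    parents_with_children: IDs among them that other datasets depend on.
    """
    if entire_dataset and not force:
        raise ValueError("--entire-dataset must be combined with --force")
    if len(matches) > 1 and not entire_dataset:
        raise ValueError("multiple datasets match; pass --entire-dataset and --force")
    if not force and any(m in parents_with_children for m in matches):
        raise ValueError("dataset is a parent of other datasets; pass --force")
    return list(matches)
```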

rename

Renames a dataset (and all of its versions).

clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME

Parameters

| Name | Description | Optional |
|---|---|---|
| --new-name | The new name of the dataset | No |
| --project | The project the dataset to be renamed belongs to | No |
| --name | The current name of the dataset(s) to be renamed | No |

move

Moves a dataset to another project.

clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME

Parameters

| Name | Description | Optional |
|---|---|---|
| --new-project | The new project of the dataset | No |
| --project | The current project the dataset to be moved belongs to | No |
| --name | The name of the dataset to be moved | No |

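search

Searches the system for datasets by project, name, ID, and/or tags.

Returns a list of all datasets in the system that match the search request, sorted by creation time.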

clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT] 
[--name NAME] [--tags [TAGS [TAGS ...]]]

Parameters

| Name | Description | Optional |
|---|---|---|
| --ids | A list of dataset IDs | Yes |
| --project | The project name of the datasets | Yes |
| --name | A dataset name or a partial name to filter datasets by | Yes |
| --tags | A list of dataset user tags | Yes |

compare

Compares two datasets (target vs. source). The command returns a comparison summary that looks like this:

Comparison summary: 4 files removed, 3 files modified, 0 files added

clearml-data compare [-h] --source SOURCE --target TARGET [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --source | Source dataset ID (used as baseline) | No |
| --target | Target dataset ID (compare against the source baseline dataset) | No |
| --verbose | Verbose report of all file changes (instead of summary) | Yes |
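If the comparison summary needs to be consumed by a script, its counts can be extracted with a regular expression. The sketch below assumes the summary line has exactly the shape of the example in this section ("Comparison summary: 4 files removed, 3 files modified, 0 files added"); the real output may differ in wording or pluralization, and `parse_comparison_summary` is a hypothetical helper.

```python
import re
from typing import Dict

# Pattern modeled on the example summary line from this section.
SUMMARY_RE = re.compile(
    r"Comparison summary:\s*(\d+) files removed,\s*(\d+) files modified,\s*(\d+) files added"
)


def parse_comparison_summary(text: str) -> Dict[str, int]:
    """Extract removed / modified / added counts from a summary line."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no comparison summary found in the given text")
    removed, modified, added = (int(g) for g in match.groups())
    return {"removed": removed, "modified": modified, "added": added}
```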

squash

Squashes multiple datasets into a single dataset version (merge down).

clearml-data squash [-h] --name NAME --ids [IDS [IDS ...]] [--storage STORAGE] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --name | Name of the squashed dataset to create | No |
| --ids | Source dataset IDs to squash (merge down) | No |
| --storage | Remote storage to use for the dataset files. Default: files_server | Yes |
| --verbose | Verbose report of all file changes (instead of summary) | Yes |

verify

Verifies that the dataset content matches the data from the local source.

clearml-data verify [-h] [--id ID] [--folder FOLDER] [--filesize] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| --folder | Specify dataset local copy (if not provided, the local cache folder will be verified) | Yes |
| --filesize | If True, only verify file size and skip hash checks (default: False) | Yes |
| --verbose | Verbose report of all file changes (instead of summary) | Yes |

get

Gets a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the --copy flag.

clearml-data get [-h] [--id ID] [--copy COPY] [--link LINK] [--part PART]
[--num-parts NUM_PARTS] [--overwrite] [--verbose]

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | Specify dataset ID. Default: previously created / accessed dataset | Yes |
| --copy | Get a writable copy of the dataset to a specific output folder | Yes |
| --link | Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset | Yes |
| --part | Retrieve a partial copy of the dataset. Part number, from 0 to (--num-parts - 1), out of --num-parts total parts. | Yes |
| --num-parts | Total number of parts to divide the dataset into. Note that the minimum retrieved part is a single chunk in a dataset (or its parents). Example: dataset gen4, with 3 parents, each with a single chunk, can be divided into 4 parts | Yes |
| --overwrite | If True, overwrite the target folder | Yes |
| --verbose | Verbose report of all file changes (instead of summary) | Yes |
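The --part and --num-parts flags are zero-indexed, so dividing a dataset into N parts means issuing one get command per part number 0 through N-1. A minimal sketch, assuming each part is fetched by a separate command (for example, one per worker machine); `build_get_part_commands` and the dataset ID are hypothetical.

```python
from typing import List


def build_get_part_commands(dataset_id: str, num_parts: int) -> List[List[str]]:
    """Return one `clearml-data get` argv list per part, numbered 0..num_parts-1."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    return [
        [
            "clearml-data", "get",
            "--id", dataset_id,
            "--part", str(part),
            "--num-parts", str(num_parts),
        ]
        for part in range(num_parts)
    ]
```

Whether a given split is actually possible depends on how many chunks the dataset and its parents contain, as noted in the --num-parts description.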

publish

Publishes the dataset for public use. The dataset must be finalized before it is published.

clearml-data publish [-h] --id ID

Parameters

| Name | Description | Optional |
|---|---|---|
| --id | The dataset task ID to be published. | No |