ray.data.read_images

ray.data.read_images(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, parallelism: int = -1, meta_provider: BaseFileMetadataProvider | None = None, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Dict[str, Any] | None = None, partition_filter: PathPartitionFilter | None = None, partitioning: Partitioning = None, size: Tuple[int, int] | None = None, mode: str | None = None, include_paths: bool = False, ignore_missing_paths: bool = False, shuffle: Literal['files'] | None = None, file_extensions: List[str] | None = ['png', 'jpg', 'jpeg', 'tif', 'tiff', 'bmp', 'gif'], concurrency: int | None = None, override_num_blocks: int | None = None) → Dataset

Creates a Dataset from image files.

Examples

>>> import ray
>>> path = "s3://anonymous@ray-example-data/batoidea/JPEGImages/"
>>> ds = ray.data.read_images(path)
>>> ds.schema()
Column  Type
------  ----
image   numpy.ndarray(shape=(32, 32, 3), dtype=uint8)

If you need the image file paths, set include_paths=True.

>>> ds = ray.data.read_images(path, include_paths=True)
>>> ds.schema()
Column  Type
------  ----
image   numpy.ndarray(shape=(32, 32, 3), dtype=uint8)
path    string
>>> ds.take(1)[0]["path"]
'ray-example-data/batoidea/JPEGImages/1.jpeg'

If your images are arranged like this:

root/dog/xxx.png
root/dog/xxy.png

root/cat/123.png
root/cat/nsdf3.png

Then you can include the labels by specifying a Partitioning.

>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> root = "s3://anonymous@ray-example-data/image-datasets/dir-partitioned"
>>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root)
>>> ds = ray.data.read_images(root, size=(224, 224), partitioning=partitioning)
>>> ds.schema()
Column  Type
------  ----
image   numpy.ndarray(shape=(224, 224, 3), dtype=uint8)
class   string
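
The directory-to-label mapping in the example above can be sketched in plain Python. This is an illustration of the idea only, not Ray's implementation: parse_dir_partition is a hypothetical helper, and it assumes POSIX-style paths in which every directory between base_dir and the file name is one partition value.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def parse_dir_partition(path: str, base_dir: str, field_names: List[str]) -> Dict[str, str]:
    """Map the directories between base_dir and the file name to field names."""
    relative = PurePosixPath(path).relative_to(PurePosixPath(base_dir))
    # Every path component except the final file name is a partition value.
    partition_dirs = relative.parts[:-1]
    if len(partition_dirs) != len(field_names):
        raise ValueError(
            f"Expected {len(field_names)} partition directories, "
            f"got {len(partition_dirs)}: {path}"
        )
    return dict(zip(field_names, partition_dirs))


labels = [
    parse_dir_partition(p, "root", ["class"])
    for p in ["root/dog/xxx.png", "root/cat/123.png"]
]
print(labels)  # [{'class': 'dog'}, {'class': 'cat'}]
```

With field_names=["class"], each image under root/dog/ gets the label "dog" and each image under root/cat/ gets "cat", which matches the class column in the schema above.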

Parameters:

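paths – A single file or directory, or a list of file or directory paths. A list of paths can contain both files and directories.
filesystem – The pyarrow filesystem implementation to read from. These filesystems are specified in the pyarrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
meta_provider – A file metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or more accurately. In most cases, you do not need to set this. If None, this function uses a system-chosen implementation.
ray_remote_args – kwargs passed to remote() in the read tasks.
arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file when opening input files to read.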
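partition_filter – A PathPartitionFilter. Use with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match *.png, *.jpg, *.jpeg, *.tif, *.tiff, *.bmp, or *.gif.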
partitioning – A Partitioning object that describes how paths are organized. Defaults to None.
size – The desired height and width of loaded images, in that order. If unspecified, images retain their original shape.
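mode – A Pillow mode describing the desired type and depth of pixels. If unspecified, image modes are inferred by Pillow.
include_paths – If True, include the path to each image. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignores any file or directory paths in paths that are not found. Defaults to False.
shuffle – If set to "files", randomly shuffles the input file order before reading. Defaults to None, which does not shuffle.
file_extensions – A list of file extensions to filter files by.
concurrency – The maximum number of Ray tasks to run concurrently. Set this to control the number of tasks that run at the same time. This does not change the total number of tasks run or the total number of output blocks. By default, concurrency is decided dynamically based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is decided dynamically based on the input data size and available resources. In most cases, you should not set this value manually.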
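
As a rough sketch of how several of the parameters above combine, the block below collects a set of keyword arguments and passes them to ray.data.read_images. The values are illustrative, and read_training_images and READ_KWARGS are hypothetical names, not part of Ray. The import is deferred so that the settings can be inspected without Ray installed.

```python
from typing import Any, Dict, List, Union

# Keyword arguments taken from the parameter list above; the values are illustrative.
READ_KWARGS: Dict[str, Any] = {
    "size": (224, 224),                         # resize every image to 224 x 224
    "mode": "RGB",                              # Pillow mode: 3 channels, 8 bits each
    "include_paths": True,                      # adds a "path" column
    "ignore_missing_paths": True,               # skip paths that are not found
    "shuffle": "files",                         # randomize the input file order
    "file_extensions": ["png", "jpg", "jpeg"],  # read only these extensions
    "override_num_blocks": 8,                   # usually best left unset
}


def read_training_images(paths: Union[str, List[str]]):
    """Hypothetical helper: read images with the settings in READ_KWARGS."""
    import ray  # deferred import; requires Ray to be installed when called

    return ray.data.read_images(paths, **READ_KWARGS)
```

Calling read_training_images(path) with a path like the one in the examples above should return a Dataset with an image column and a path column, assuming the installed Ray version accepts all of these keyword arguments.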

Returns:

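A Dataset producing tensors that represent the images at the specified paths. For information on working with tensors, read the tensor data guide.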
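
One common way to work with the returned tensors is to transform them batch by batch. The sketch below assumes that each batch arrives as a dict of NumPy arrays keyed by column name, and that all images share one shape (for example because size was set). normalize_images is a hypothetical function name, and the batch here is a stand-in built by hand rather than read from a Dataset.

```python
from typing import Dict

import numpy as np


def normalize_images(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Scale uint8 pixel values in the "image" column to float32 in [0, 1]."""
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch


# Stand-in for one batch of four 32x32 RGB images, as in the schema above.
fake_batch = {"image": np.full((4, 32, 32, 3), 255, dtype=np.uint8)}
out = normalize_images(fake_batch)
print(out["image"].shape, out["image"].dtype, float(out["image"].max()))
# (4, 32, 32, 3) float32 1.0
```

With a real Dataset, a function of this shape would typically be applied with ds.map_batches(normalize_images).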

Raises:

ValueError – if size contains non-positive numbers.
ValueError – if mode is unsupported.
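
If you build size from user input, it can help to validate it before starting a read. The following is a minimal sketch of such a check. validate_size is a hypothetical helper, not Ray's own validation code, and the error messages are made up for illustration.

```python
from typing import Optional, Tuple


def validate_size(size: Optional[Tuple[int, int]]) -> None:
    """Raise ValueError early if size is not a pair of positive numbers."""
    if size is None:
        return  # no resizing requested; images keep their original shape
    if len(size) != 2:
        raise ValueError(f"size must have exactly two elements, got {size!r}")
    if any(dim <= 0 for dim in size):
        raise ValueError(f"size must contain only positive numbers, got {size!r}")


validate_size((224, 224))  # passes silently
try:
    validate_size((0, 224))
except ValueError as err:
    print(f"rejected: {err}")
```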

PublicAPI (beta): This API is in beta and may change before becoming stable.