ray.data.read_tfrecords

ray.data.read_tfrecords(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, parallelism: int = -1, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider: BaseFileMetadataProvider | None = None, partition_filter: PathPartitionFilter | None = None, include_paths: bool = False, ignore_missing_paths: bool = False, tf_schema: schema_pb2.Schema | None = None, shuffle: Literal['files'] | None = None, file_extensions: List[str] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None, tfx_read_options: TFXReadOptions | None = None) → Dataset

Create a Dataset from TFRecord files that contain tf.train.Example messages.

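Tip

Reading with the tfx-bsl library is more performant for large datasets (for example, in production use cases). To use this implementation, first install tfx-bsl:

pip install tfx_bsl --no-dependencies
Then pass tfx_read_options to read_tfrecords, for example: ds = read_tfrecords(path, ..., tfx_read_options=TFXReadOptions())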

Warning

This function only supports tf.train.Example messages. If a file contains a message that isn't of type tf.train.Example, this function fails.

Examples

>>> import ray
>>> ray.data.read_tfrecords("s3://anonymous@ray-example-data/iris.tfrecords")
Dataset(
   num_rows=?,
   schema={...}
)

You can also read compressed TFRecord files that use one of the compression types supported by Arrow:

>>> ray.data.read_tfrecords(
...     "s3://anonymous@ray-example-data/iris.tfrecords.gz",
...     arrow_open_stream_args={"compression": "gzip"},
... )
Dataset(
   num_rows=?,
   schema={...}
)
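The compression argument in the example above can also be derived from the file name. The following is a minimal sketch, not part of Ray: compression_args is a hypothetical helper, and the suffix-to-codec mapping is an assumption that covers only a few of the codecs Arrow supports.

```python
from pathlib import PurePosixPath
from typing import Any, Dict

# Assumed mapping from file suffix to an Arrow codec name; extend as needed.
_SUFFIX_TO_CODEC = {".gz": "gzip", ".bz2": "bz2", ".zst": "zstd"}


def compression_args(path: str) -> Dict[str, Any]:
    """Hypothetical helper: build arrow_open_stream_args from a path's last suffix."""
    suffix = PurePosixPath(path).suffix.lower()
    codec = _SUFFIX_TO_CODEC.get(suffix)
    # An empty dict means "no extra arguments", i.e. an uncompressed file.
    return {"compression": codec} if codec else {}


path = "s3://anonymous@ray-example-data/iris.tfrecords.gz"
args = compression_args(path)
print(args)
```

The resulting dictionary could then be passed along as ray.data.read_tfrecords(path, arrow_open_stream_args=args).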

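Parameters:

paths – A single file or directory, or a list of file or directory paths. A list of paths can contain both files and directories.
filesystem – The PyArrow filesystem implementation to read from. These filesystems are specified in the PyArrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file when opening input files to read. To read a compressed TFRecord file, pass the corresponding compression type. For example, for GZIP or ZLIB, use arrow_open_stream_args={'compression': 'gzip'}.
meta_provider – A file metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. In most cases, you do not need to set this. If None, this function uses a system-chosen implementation.
partition_filter – A PathPartitionFilter. Use with a custom callback to read only selected partitions of a dataset.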
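include_paths – If True, include the path to each file. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignore any file paths in paths that are not found. Defaults to False.
tf_schema – Optional TensorFlow Schema used to explicitly set the schema of the underlying Dataset.
shuffle – If set to "files", randomly shuffle the order of the input files before reading. Defaults to None, which does not shuffle.
file_extensions – A list of file extensions to filter files by.
concurrency – The maximum number of Ray tasks to run concurrently. Set this to control the number of tasks that run at the same time. This doesn't change the total number of tasks run or the total number of output blocks. By default, concurrency is decided dynamically based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is decided dynamically based on the input data size and available resources. In most cases, you shouldn't set this value manually.
tfx_read_options – Read options to use when reading TFRecord files with TFX. If no options are provided, the default implementation without tfx-bsl is used to read the TFRecord files.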
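Several of the optional parameters above are often set together. The sketch below is illustrative only: build_read_kwargs and read_tfrecords_with are hypothetical helpers, not part of Ray, and the extension format (given without a leading dot) is an assumption. Ray is imported lazily, so the keyword-building part runs without Ray installed.

```python
from typing import Any, Dict, List, Optional


def build_read_kwargs(
    shuffle_files: bool = False,
    extensions: Optional[List[str]] = None,
    num_blocks: Optional[int] = None,
    include_paths: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for read_tfrecords."""
    kwargs: Dict[str, Any] = {
        # When True, each row gains a 'path' column with its source file.
        "include_paths": include_paths,
        # "files" shuffles the input file order; None leaves it unchanged.
        "shuffle": "files" if shuffle_files else None,
    }
    if extensions is not None:
        # Assumption: extensions are written without the leading dot.
        kwargs["file_extensions"] = extensions
    if num_blocks is not None:
        # Use override_num_blocks rather than the deprecated parallelism.
        kwargs["override_num_blocks"] = num_blocks
    return kwargs


def read_tfrecords_with(paths, **options):
    """Hypothetical wrapper: forward the collected arguments to Ray."""
    import ray  # lazy import; only needed when a read is actually requested

    return ray.data.read_tfrecords(paths, **build_read_kwargs(**options))


kwargs = build_read_kwargs(shuffle_files=True, extensions=["tfrecords"], num_blocks=8)
print(kwargs)
```

With Ray installed, a call such as read_tfrecords_with("s3://anonymous@ray-example-data/iris.tfrecords", shuffle_files=True) would forward these arguments to ray.data.read_tfrecords.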

Returns:

A Dataset that contains the example features.

Raises:

ValueError – If a file contains a message that isn't a tf.train.Example.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.