dask.bag.read_text


dask.bag.read_text(urlpath, blocksize=None, compression='infer', encoding='utf-8', errors='strict', linedelimiter=None, collection=True, storage_options=None, files_per_partition=None, include_path=False)

Read lines from text files.

Parameters

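urlpath: string or list: Absolute or relative file path(s). Prefix with a protocol such as s3:// to read from other filesystems. To read from multiple files, pass a glob string or a list of paths; all paths must use the same protocol.
blocksize: None, int, or str: Size in bytes of the blocks into which larger files are split. Defaults to None, which streams each file without splitting. Otherwise an integer number of bytes or a string such as "128MiB".
compression: string: Compression format such as 'gzip' or 'xz'. Defaults to 'infer'.
encoding: string
errors: string
linedelimiter: string or None
collection: bool, optional: If True, return a dask.bag; if False, return a list of delayed values.
storage_options: dict: Extra options that make sense for a particular storage connection, e.g. host, port, username, password.
files_per_partition: None or int: If set, group input files into partitions of the requested size instead of one partition per file. Mutually exclusive with blocksize.
include_path: bool: Whether to include the path in the bag. If True, elements are tuples of (line, path). Defaults to False.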
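The interaction between files_per_partition and blocksize can be illustrated with a small sketch. The temporary directory, the part.*.txt file names, and the use of the synchronous scheduler are assumptions made only to keep the example self-contained; they are not part of the documented API.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: run everything in a single process to keep the demo simple.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: four small files with three lines each.
tmpdir = tempfile.mkdtemp()
for i in range(4):
    with open(os.path.join(tmpdir, f"part.{i}.txt"), "w") as f:
        f.write("a\nb\nc\n")

pattern = os.path.join(tmpdir, "part.*.txt")

# Default behaviour: one partition per input file.
per_file = db.read_text(pattern)

# Group two files into each partition instead.
grouped = db.read_text(pattern, files_per_partition=2)

n_per_file = per_file.npartitions
n_grouped = grouped.npartitions
n_lines = grouped.count().compute()

# blocksize and files_per_partition are mutually exclusive,
# so passing both is expected to raise a ValueError.
try:
    db.read_text(pattern, blocksize="1MiB", files_per_partition=2)
    both_rejected = False
except ValueError:
    both_rejected = True

print(n_per_file, n_grouped, n_lines, both_rejected)
```

Grouping files this way is mainly useful when there are many small inputs and one task per file would add scheduling overhead.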

Returns

dask.bag.Bag or list: A dask.bag.Bag if collection is True; otherwise a list of Delayed lists.
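To make the two return types concrete, the following sketch reads the same files with collection=True and collection=False. The log.*.txt file names and the synchronous scheduler setting are assumptions chosen for illustration.

```python
import os
import tempfile

import dask
import dask.bag as db
from dask.delayed import Delayed

# Assumption: single-process execution keeps the demo simple.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: two files with two lines each.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"log.{i}.txt"), "w") as f:
        f.write(f"file {i} line 0\nfile {i} line 1\n")

pattern = os.path.join(tmpdir, "log.*.txt")

bag = db.read_text(pattern)                      # a dask.bag.Bag
parts = db.read_text(pattern, collection=False)  # a list of Delayed objects

all_delayed = all(isinstance(p, Delayed) for p in parts)

# Each Delayed evaluates to the lines of one partition.
lines_per_part = [list(p) for p in dask.compute(*parts)]
all_lines = bag.compute()

print(type(bag).__name__, len(parts), len(all_lines))
```

The Delayed form can be convenient when each partition needs custom handling before a Bag is built, for example with dask.bag.from_delayed.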

See Also

from_sequence: Build a bag from a Python sequence

Examples

>>> b = read_text('myfiles.1.txt')  
>>> b = read_text('myfiles.*.txt')  
>>> b = read_text('myfiles.*.txt.gz')  
>>> b = read_text('s3://bucket/myfiles.*.txt')  
>>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')  
>>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')  

Parallelize a large file by specifying the number of uncompressed bytes to load into each partition.

>>> b = read_text('largefile.txt', blocksize='10MB')  
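As a rough, self-contained illustration of how blocksize affects partitioning, the sketch below writes a small file and reads it with and without a block size. The file contents, the 2000-byte block size, and the synchronous scheduler are assumptions for the demo; the exact number of partitions depends on how the file is divided and is not asserted here.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: single-process execution keeps the demo simple.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: 1000 short lines in one file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "largefile.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write(f"row {i:05d}\n")

# blocksize=None streams the file as a single partition.
streamed = db.read_text(path)

# An integer blocksize splits the file into several partitions,
# with splits falling on line boundaries.
chunked = db.read_text(path, blocksize=2000)

n_streamed = streamed.npartitions
n_chunked = chunked.npartitions
n_lines = chunked.count().compute()

print(n_streamed, n_chunked, n_lines)
```

No line is lost or duplicated by the split, so the total line count matches the file regardless of the block size.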

Get the file path of each line by setting include_path=True.

>>> b = read_text('myfiles.*.txt', include_path=True) 
>>> b.take(1) 
(('first line of the first file', '/home/dask/myfiles.0.txt'),)
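One possible use of include_path=True is counting lines per source file. This is a minimal sketch under stated assumptions: the temporary myfiles.*.txt files and the synchronous scheduler are set up only for the demo.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: single-process execution keeps the demo simple.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: two files with two lines each.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"myfiles.{i}.txt"), "w") as f:
        f.write(f"first line of file {i}\nsecond line of file {i}\n")

b = db.read_text(os.path.join(tmpdir, "myfiles.*.txt"), include_path=True)

# Each element is a (line, path) tuple; keep only the file name
# and count how many lines came from each file.
counts = dict(
    b.map(lambda pair: os.path.basename(pair[1])).frequencies().compute()
)

print(counts)
```

Keeping the path alongside each line makes it possible to trace a record back to its source file after later filtering or mapping steps.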
