DedocBaseLoader

class langchain_community.document_loaders.dedoc.DedocBaseLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

Base loader that uses dedoc (https://dedoc.readthedocs.io).

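The loader extracts text, tables and attached files from the given file:
  • Text can be split by pages, dedoc tree nodes or textual lines (according to the split parameter).

  • Attached files (when with_attachments=True) are split according to the split parameter. For attachments, the langchain Document object has an additional metadata field type="attachment".

  • Tables (when with_tables=True) are not split: each table corresponds to one langchain Document object. For tables, the Document object has the additional metadata fields type="table" and text_as_html, which holds the HTML representation of the table.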
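
The metadata fields described above can be used to separate text, tables and attachments after loading. The following is a minimal sketch under stated assumptions: `Doc` is a hypothetical stand-in for the langchain Document class, `group_by_type` is a hypothetical helper, the sample data is hand-written rather than real loader output, and treating documents without a `type` field as plain text is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_type(docs):
    """Bucket documents by metadata["type"].

    Assumption: documents without a "type" field are plain text.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", "text")].append(doc)
    return dict(groups)


# Hand-written sample data, not real loader output.
docs = [
    Doc("Annual report"),
    Doc("Q1 10\nQ2 12", {"type": "table", "text_as_html": "<table>...</table>"}),
    Doc("Appendix text", {"type": "attachment"}),
]

groups = group_by_type(docs)
# Collect the HTML representation of every table.
tables_html = [d.metadata["text_as_html"] for d in groups.get("table", [])]
```

With a real loader, the list passed to `group_by_type` would be the result of `load()`.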

Initialize with file path and parsing parameters.

Parameters:
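  • file_path (str) – path to the file for processing

  • split (str) –

    type of document splitting into parts (each part is returned separately), default value "document"

    "document": document text is returned as a single langchain Document object (no splitting)

    "page": split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)

    "node": split document text into tree nodes (title nodes, list item nodes, raw text nodes)

    "line": split document text into lines

  • with_tables (bool) – add tables to the result: each table is returned as a single langchain Document object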

  Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

  • with_attachments (str | bool) – enable attached files extraction

  • recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True

  • pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]

  • language (str) – language of the document for PDF without a textual layer and images, available options ["eng", "rus", "rus+eng" (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

  • pages (str) – page slice to define the reading range for parsing PDF documents

  • is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options ["true", "false", "auto" (default)]

  • document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options ["auto" (default), "no_change"]

  • need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images

  • need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images

  • need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images

  • delimiter (str | None) – column separator for CSV, TSV files

  • encoding (str | None) – encoding of TXT, CSV, TSV
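
Several of these parameters accept only a fixed set of string values. The sketch below checks keyword arguments against the options listed above before they reach a loader. `ALLOWED` and `check_loader_kwargs` are hypothetical names, not part of langchain or dedoc, and whether the loader validates these values itself is not stated here. `language` is left unchecked because its list can be extended.

```python
# Allowed values copied from the parameter descriptions above.
ALLOWED = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_loader_kwargs(**kwargs):
    """Return kwargs unchanged, or raise ValueError on an unlisted value.

    Hypothetical helper: parameters that are not in ALLOWED pass through
    without any check.
    """
    for name, value in kwargs.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return kwargs


kwargs = check_loader_kwargs(
    split="page",
    pdf_with_text_layer="auto",
    with_tables=True,
)
```

The returned dictionary can then be unpacked into a loader constructor together with `file_path`.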

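Methods

__init__(file_path, *[, split, with_tables, ...])

Initialize with file path and parsing parameters.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Lazily load documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.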

__init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) → None

Initialize with file path and parsing parameters.

Parameters:
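  • file_path (str) – path to the file for processing

  • split (str) –

    type of document splitting into parts (each part is returned separately), default value "document"

    "document": document text is returned as a single langchain Document object (no splitting)

    "page": split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)

    "node": split document text into tree nodes (title nodes, list item nodes, raw text nodes)

    "line": split document text into lines

  • with_tables (bool) – add tables to the result: each table is returned as a single langchain Document object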

  Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

  • with_attachments (str | bool) – enable attached files extraction

  • recursion_deep_attachments (int) – recursion level for attached files extraction, works only when with_attachments==True

  • pdf_with_text_layer (str) – type of handler for parsing PDF documents, available options ["true", "false", "tabby", "auto", "auto_tabby" (default)]

  • language (str) – language of the document for PDF without a textual layer and images, available options ["eng", "rus", "rus+eng" (default)]; the list of languages can be extended, please see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html

  • pages (str) – page slice to define the reading range for parsing PDF documents

  • is_one_column_document (str) – detect number of columns for PDF without a textual layer and images, available options ["true", "false", "auto" (default)]

  • document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images, available options ["auto" (default), "no_change"]

  • need_header_footer_analysis (str | bool) – remove headers and footers from the output result for parsing PDF and images

  • need_binarization (str | bool) – clean pages background (binarize) for PDF without a textual layer and images

  • need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images

  • delimiter (str | None) – column separator for CSV, TSV files

  • encoding (str | None) – encoding of TXT, CSV, TSV

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]
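
An AsyncIterator is consumed with `async for` inside a coroutine. The sketch below shows that pattern and stops early once enough items have arrived. `fake_alazy_load` and `collect` are hypothetical names, the yielded strings stand in for Document objects, and no dedoc call is made.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load(); yields plain strings."""
    for text in ("first part", "second part", "third part"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text


async def collect(source: AsyncIterator[str], limit: int) -> list:
    """Consume at most `limit` items from an async iterator."""
    items = []
    if limit <= 0:
        return items
    async for item in source:
        items.append(item)
        if len(items) >= limit:
            break  # stop early; later items are never produced
    return items


first_two = asyncio.run(collect(fake_alazy_load(), limit=2))
```

With a concrete loader, `fake_alazy_load()` would be replaced by `loader.alazy_load()`.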

async aload() → list[Document]

Load data into Document objects.

Return type:

list[Document]

lazy_load() → Iterator[Document]

Lazily load documents.

Return type:

Iterator[Document]
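
Because the return value is an iterator, a caller can take only the first few items without building the rest. The sketch below illustrates this with `itertools.islice`. `fake_lazy_load` is a hypothetical stand-in for `lazy_load()`, the strings stand in for Document objects, and the `produced` list exists only to show how many items were actually generated.

```python
from itertools import islice
from typing import Iterator

produced = []  # records which items the generator has actually built


def fake_lazy_load() -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields plain strings."""
    for number in range(1, 1001):
        produced.append(number)
        yield f"page {number}"


# Take a three-item preview; the other 997 items are never generated.
preview = list(islice(fake_lazy_load(), 3))
```

By contrast, `load()` returns a complete list, so every Document is created before the caller sees the first one.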

load() → list[Document]

Load data into Document objects.

Return type:

list[Document]

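load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered to be deprecated!

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

list[Document]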