DedocAPIFileLoader

class langchain_community.document_loaders.dedoc.DedocAPIFileLoader(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)

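Load files using the dedoc API. The file loader automatically detects the file type (even if the extension is wrong). By default, the loader calls a locally hosted dedoc API. More information about the dedoc API can be found in the dedoc documentation: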

https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html

See the documentation of DedocBaseLoader for more details.

Setup:

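You do not need to install the dedoc library to use this loader. Instead, the dedoc API must be running. You can use a Docker container for this. See the dedoc documentation for more details: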

https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
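
Before instantiating the loader, it can help to confirm that something is listening on the API port. The sketch below is not part of langchain or dedoc: `dedoc_api_reachable` is a hypothetical helper that only checks whether a TCP connection to the host and port in the URL succeeds, and its default URL (`http://127.0.0.1:1231`) is an assumption based on the port mapping above.

```python
import socket
from urllib.parse import urlsplit


def dedoc_api_reachable(url: str = "http://127.0.0.1:1231", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it does not verify that the listener is really dedoc,
    only that the port accepts connections.
    """
    parts = urlsplit(url)
    host = parts.hostname or "127.0.0.1"
    # Fall back to the scheme's conventional port when the URL has none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Example use (requires a running container, so it is left as a comment):
# if not dedoc_api_reachable():
#     raise RuntimeError("dedoc API is not reachable; is the container running?")
```

A successful check only shows that the port is open; the loader itself may still fail if the service behind it is not the dedoc API.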

Instantiate:

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    file_path="example.pdf",
    # url=...,
    # split=...,
    # with_tables=...,
    # pdf_with_text_layer=...,
    # pages=...,
    # ...
)

Load:

docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Lazy load:

docs = []
docs_lazy = loader.lazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)

Some text
{
    'file_name': 'example.pdf',
    'file_type': 'application/pdf',
    # ...
}

Initialize with a file path, API URL and parsing parameters.

Parameters:

file_path (str) – path to the file to process
url (str) – URL of the dedoc API to call
split (str) – how the document is split into parts (each part is returned separately); the default is "document":
    "document": the document is returned as a single langchain Document object (no splitting)
    "page": split the document into pages (works for PDF, DJVU, PPTX, PPT, ODP)
    "node": split the document into tree nodes (title nodes, list item nodes, raw text nodes)
    "line": split the document into lines
with_tables (bool) – add tables to the result; each table is returned as a separate langchain Document object

Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):

with_attachments (str | bool) – enable extraction of attached files
recursion_deep_attachments (int) – recursion level for extraction of attached files; works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents; available options: "true", "false", "tabby", "auto", "auto_tabby" (default)
language (str) – language of the document for PDF without a textual layer and images; available options: "eng", "rus", "rus+eng" (default). The list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice that defines the reading range for parsing PDF documents
is_one_column_document (str) – detect the number of columns for PDF without a textual layer and images; available options: "true", "false", "auto" (default)
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images; available options: "auto" (default), "no_change"
need_header_footer_analysis (str | bool) – remove headers and footers from the output when parsing PDF and images
need_binarization (str | bool) – clean the page background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV and TSV files
encoding (str | None) – encoding of TXT, CSV and TSV files
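
Several of these parameters accept only a fixed set of string values, so a small validation step can catch typos before any request is sent. The following is a minimal sketch, not part of the library: `dedoc_loader_kwargs` and `make_loader` are hypothetical helpers, the option sets are copied from the parameter list above, and the sketch does not know whether the loader or the API performs similar checks itself.

```python
from typing import Any, Dict

# Allowed values copied from the parameter descriptions above.
SPLIT_OPTIONS = {"document", "page", "node", "line"}
PDF_TEXT_LAYER_OPTIONS = {"true", "false", "tabby", "auto", "auto_tabby"}


def dedoc_loader_kwargs(
    split: str = "document",
    pdf_with_text_layer: str = "auto_tabby",
    with_tables: bool = True,
) -> Dict[str, Any]:
    """Validate a few options and return them as keyword arguments (hypothetical helper)."""
    if split not in SPLIT_OPTIONS:
        raise ValueError(f"split must be one of {sorted(SPLIT_OPTIONS)}, got {split!r}")
    if pdf_with_text_layer not in PDF_TEXT_LAYER_OPTIONS:
        raise ValueError(
            f"pdf_with_text_layer must be one of {sorted(PDF_TEXT_LAYER_OPTIONS)}, "
            f"got {pdf_with_text_layer!r}"
        )
    return {
        "split": split,
        "pdf_with_text_layer": pdf_with_text_layer,
        "with_tables": with_tables,
    }


def make_loader(file_path: str, **options: Any):
    """Build a DedocAPIFileLoader from validated options (hypothetical helper)."""
    # Imported lazily so the validation helper works without langchain_community.
    from langchain_community.document_loaders import DedocAPIFileLoader

    return DedocAPIFileLoader(file_path, **dedoc_loader_kwargs(**options))


# Example use (needs langchain_community installed):
# loader = make_loader("example.pdf", split="page", pdf_with_text_layer="tabby")
```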

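Methods

`__init__`(file_path, *[, url, split, ...])	Initialize with a file path, API URL and parsing parameters.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`lazy_load`()	Lazily load documents.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load Documents and split them into chunks.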

__init__(file_path: str, *, url: str = 'http://0.0.0.0:1231', split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) → None

Initialize with a file path, API URL and parsing parameters.

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]
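
Because alazy_load() returns an async iterator, it is consumed with `async for` inside a coroutine. The sketch below shows one way to gather the results; `collect_documents` and its `limit` argument are hypothetical names introduced for illustration, and the function works with any object that exposes an `alazy_load()` method.

```python
import asyncio
from typing import Any, List, Optional


async def collect_documents(loader: Any, limit: Optional[int] = None) -> List[Any]:
    """Gather documents from loader.alazy_load(), stopping early at `limit` if given."""
    docs: List[Any] = []
    async for doc in loader.alazy_load():
        docs.append(doc)
        if limit is not None and len(docs) >= limit:
            break
    return docs


# Example use (needs a running dedoc API and a loader instance):
# docs = asyncio.run(collect_documents(loader, limit=10))
```

Stopping early avoids holding every Document in memory when only the first few are needed.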

async aload() → list[Document]

Load data into Document objects.

Return type: list[Document]

lazy_load() → Iterator[Document]

Lazily load documents.

Return type: Iterator[Document]

load() → list[Document]

Load data into Document objects.

Return type: list[Document]

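load_and_split(text_splitter: TextSplitter | None = None) → list[Document]

Load Documents and split them into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated!

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: list[Document]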

Examples using DedocAPIFileLoader

Dedoc