DedocFileLoader
- class langchain_community.document_loaders.dedoc.DedocFileLoader(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None)
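DedocFileLoader is a document loader integration for loading files using dedoc.
The file loader detects the file type automatically, provided the file has the correct extension. The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. See the documentation of DedocBaseLoader for more details.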
- Setup:
Install the dedoc package.
    pip install -U dedoc
- Instantiate:
    from langchain_community.document_loaders import DedocFileLoader

    loader = DedocFileLoader(
        file_path="example.pdf",
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )
- Load:
    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
- Lazy load:
    docs = []
    docs_lazy = loader.lazy_load()
    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
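Because `lazy_load()` returns an iterator, documents can be consumed one at a time and collection can stop early. The following is a minimal sketch of that pattern, not part of langchain or dedoc: `take_until_chars`, `StubDoc`, and `stub_pages` are hypothetical names, and the stub generator only imitates a loader that yields several documents (for example with `split="page"`). Whether the underlying parsing is itself incremental depends on the loader implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class StubDoc:
    """Hypothetical stand-in exposing the same page_content attribute as Document."""

    page_content: str


def take_until_chars(docs: Iterable, max_chars: int) -> List:
    """Collect documents from an iterator until max_chars of text is gathered."""
    collected = []
    total = 0
    for doc in docs:
        collected.append(doc)
        total += len(doc.page_content)
        if total >= max_chars:
            break  # stop pulling further documents from the iterator
    return collected


def stub_pages() -> Iterator[StubDoc]:
    """Hypothetical generator imitating loader.lazy_load()."""
    for text in ("first page", "second page", "third page"):
        yield StubDoc(page_content=text)


sample = take_until_chars(stub_pages(), max_chars=15)
print(len(sample))
```

With a real loader, the same helper could be called as `take_until_chars(loader.lazy_load(), max_chars=1000)`.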
Initialize with file path and parsing parameters.
- Parameters:
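file_path (str) – path to the file to process
split (str) – how the document is split into parts (each part is returned separately); the default value is "document"
- "document": the document text is returned as a single langchain Document object (no splitting)
- "page": split the document text into pages (works for PDF, DJVU, PPTX, PPT, ODP)
- "node": split the document text into tree nodes (title nodes, list item nodes, raw text nodes)
- "line": split the document text into lines
with_tables (bool) – add tables to the result; each table is returned as a single langchain Document object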
Parameters used for document parsing via dedoc (https://dedoc.readthedocs.io/en/latest/parameters/parameters.html):
with_attachments (str | bool) – enable extraction of attached files
recursion_deep_attachments (int) – recursion level for attached files extraction; works only when with_attachments==True
pdf_with_text_layer (str) – type of handler for parsing PDF documents; available options: "true", "false", "tabby", "auto", "auto_tabby" (default)
language (str) – language of the document for PDF without a textual layer and images; available options: "eng", "rus", "rus+eng" (default). The list of languages can be extended, see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
pages (str) – page slice that defines the reading range for parsing PDF documents
is_one_column_document (str) – detect the number of columns for PDF without a textual layer and images; available options: "true", "false", "auto" (default)
document_orientation (str) – fix document orientation (90, 180, 270 degrees) for PDF without a textual layer and images; available options: "auto" (default), "no_change"
need_header_footer_analysis (str | bool) – remove headers and footers from the output when parsing PDF and images
need_binarization (str | bool) – clean the page background (binarize) for PDF without a textual layer and images
need_pdf_table_analysis (str | bool) – parse tables for PDF without a textual layer and images
delimiter (str | None) – column separator for CSV and TSV files
encoding (str | None) – encoding of TXT, CSV, and TSV files
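Several of the string-valued options above accept only the values listed, so a small pre-flight check can catch typos before dedoc is invoked. The following is a minimal sketch under stated assumptions: `ALLOWED_VALUES` and `check_options` are hypothetical helper names (not part of langchain or dedoc), the allowed values are copied from the parameter descriptions above, and `language` is deliberately left unchecked because the list of languages can be extended.

```python
from typing import Any, Dict

# Allowed values copied from the parameter descriptions above.
ALLOWED_VALUES = {
    "split": {"document", "page", "node", "line"},
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def check_options(**options: Any) -> Dict[str, Any]:
    """Return the options unchanged, or raise ValueError on an unlisted value."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        # Options without an enumerated value list (e.g. language) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


# Example: a set of options that uses only documented values.
checked_options = check_options(
    split="page",
    pdf_with_text_layer="false",
    document_orientation="auto",
    language="eng",
)
print(checked_options)
```

With dedoc installed, the checked dictionary can be passed on as `DedocFileLoader("example.pdf", **checked_options)`.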
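Methods
__init__(file_path, *[, split, with_tables, ...]) – Initialize with file path and parsing parameters.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazily load documents.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split them into chunks.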
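Since `aload()` is a coroutine, several loaders can be awaited together. The following is a minimal sketch of that idea: `load_all` and `StubLoader` are hypothetical names introduced for illustration, and the stub only imitates the `aload()` method so that the snippet runs without dedoc installed. How much real concurrency this gives depends on how the loader implements `aload()`.

```python
import asyncio
from typing import List, Sequence


class StubLoader:
    """Hypothetical stand-in that imitates a loader's async aload() method."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def aload(self) -> List[str]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return [f"contents of {self.name}"]


async def load_all(loaders: Sequence) -> List:
    """Await aload() on every loader and flatten the results in input order."""
    per_loader = await asyncio.gather(*(loader.aload() for loader in loaders))
    return [doc for docs in per_loader for doc in docs]


results = asyncio.run(load_all([StubLoader("a.pdf"), StubLoader("b.docx")]))
print(results)
```

With real loaders, the call would look like `asyncio.run(load_all([DedocFileLoader("a.pdf"), DedocFileLoader("b.docx")]))`.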
- __init__(file_path: str, *, split: str = 'document', with_tables: bool = True, with_attachments: str | bool = False, recursion_deep_attachments: int = 10, pdf_with_text_layer: str = 'auto_tabby', language: str = 'rus+eng', pages: str = ':', is_one_column_document: str = 'auto', document_orientation: str = 'auto', need_header_footer_analysis: str | bool = False, need_binarization: str | bool = False, need_pdf_table_analysis: str | bool = True, delimiter: str | None = None, encoding: str | None = None) -> None
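Initialize with file path and parsing parameters. The parameters are the same as those documented for the class above.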
- Return type:
None
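- load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]
Load Documents and split them into chunks. The chunks are returned as Documents.
Do not override this method. It should be considered deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
list[Document]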
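Because `load_and_split` should be considered deprecated, a common alternative is to call `load()` and pass the result to a text splitter's `split_documents` method. The sketch below is illustrative only: `load_then_split`, `StubLoader`, and `StubSplitter` are hypothetical names, and the stubs stand in for a real loader and a real `TextSplitter` so that the snippet runs without extra packages.

```python
from typing import List


def load_then_split(loader, splitter) -> List:
    """Load every document first, then let the splitter chunk them."""
    documents = loader.load()
    return splitter.split_documents(documents)


class StubLoader:
    """Hypothetical loader returning plain strings instead of Document objects."""

    def load(self) -> List[str]:
        return ["alpha beta gamma delta"]


class StubSplitter:
    """Hypothetical splitter that cuts each text into two-word chunks."""

    def split_documents(self, documents: List[str]) -> List[str]:
        chunks = []
        for text in documents:
            words = text.split()
            for i in range(0, len(words), 2):
                chunks.append(" ".join(words[i:i + 2]))
        return chunks


chunks = load_then_split(StubLoader(), StubSplitter())
print(chunks)
```

With the real classes, the same helper would be called with a `DedocFileLoader` instance and a splitter such as `RecursiveCharacterTextSplitter`, the default named above.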
Examples using DedocFileLoader