LLM Sherpa

This page shows how to use LLM Sherpa to load files of many types. LLM Sherpa supports several file formats, including DOCX, PPTX, HTML, TXT, and XML.

LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. The tool is designed to parse PDFs while preserving their layout information, which most PDF-to-text parsers lose.

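Key features of LayoutPDFReader:

It can identify and extract sections and subsections along with their levels.
It combines lines to form paragraphs.
It can identify links between sections and paragraphs.
It can extract tables along with the section each table is found in.
It can identify and extract lists and nested lists.
It can join content spread across pages.
It can remove repeating headers and footers.
It can remove watermarks.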
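To make one of the features above concrete, here is a toy sketch of removing repeating headers and footers. It is not LayoutPDFReader's algorithm, which this page does not describe; the function name `strip_repeated_lines`, the `min_fraction` threshold, and the sample pages are all assumptions made for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least `min_fraction` of the pages.

    `pages` is a list of pages, each page a list of text lines.
    This is an illustrative heuristic only, not the library's method.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each distinct line once per page
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in page if line not in repeated] for page in pages]


# Hypothetical sample: the same header and footer on every page.
pages = [
    ["ACME Report", "Intro text", "Page footer"],
    ["ACME Report", "Method text", "Page footer"],
    ["ACME Report", "Results text", "Page footer"],
]
cleaned = strip_repeated_lines(pages)
```

A real parser would also need to handle page numbers that change from page to page, which this exact-match heuristic would miss.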

See the llmsherpa documentation.

Note: this library fails on some PDF files, so use it with caution.

# Install package
# !pip install --upgrade --quiet llmsherpa

LLMSherpaFileLoader

Under the hood, LLMSherpaFileLoader defines several strategies for loading file contents: ["sections", "chunks", "html", "text"]. Set up nlm-ingestor to obtain an llmsherpa_api_url, or use the default.
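The choices above can be checked before any request is sent. The following is a minimal sketch, not the loader's actual implementation: the helper name `check_loader_args` is hypothetical, and the query parameter names `useNewIndentParser` and `applyOcr` are assumptions about how the `new_indent_parser` and `apply_ocr` flags might be passed to the parsing service.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# The four strategies named in the text above.
STRATEGIES = ("sections", "chunks", "html", "text")


def check_loader_args(strategy, api_url, new_indent_parser=True, apply_ocr=True):
    """Validate the strategy and return a normalized API URL (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"strategy must be one of {STRATEGIES}, got {strategy!r}")
    parts = urlparse(api_url)
    if "/api/parseDocument" not in parts.path:
        raise ValueError("api_url does not look like a parseDocument endpoint")
    query = dict(parse_qsl(parts.query))
    query.setdefault("renderFormat", "all")
    if new_indent_parser:
        query["useNewIndentParser"] = "true"  # assumed parameter name
    if apply_ocr:
        query["applyOcr"] = "yes"  # assumed parameter name
    return urlunparse(parts._replace(query=urlencode(query)))


url = check_loader_args(
    "sections",
    "http://localhost:5010/api/parseDocument?renderFormat=all",
)
```

Failing fast on an unknown strategy or a malformed URL gives a clearer error than a failed HTTP call later on.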

sections strategy: parse the file into sections

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="sections",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

docs[1]

Document(page_content='Abstract\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\nReferences\nFull-length Article\nTopic\nOutline\n2022 Winter Olympics\nOpening Ceremony\nResearch via Question Asking\nRetrieval and Multi-perspective Question Asking.\nSTORM models the pre-writing stage by\nLLM\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage.\nWe further gather feedback from experienced Wikipedia editors.\nCompared to articles generated by an outlinedriven retrieval-augmented baseline, more of STORM’s articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%).\nThe expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.\n1. Can you provide any information about the transportation arrangements for the opening ceremony?\nLLM\n2. Can you provide any information about the budget for the 2022 Winter Olympics opening ceremony?…\nLLM- Role1\nLLM- Role2\nLLM- Role1', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'section_number': 1, 'section_title': 'Abstract'})

len(docs)
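Each document returned by the sections strategy carries `section_number` and `section_title` in its metadata, as the output above shows. A small sketch of building an outline and a title lookup from those keys follows. The `Doc` dataclass is a stand-in so the example runs without a parsing server, and the sample contents are hypothetical; only the metadata keys come from the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two attributes as the documents shown above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical sample data; real values come from loader.load().
docs = [
    Doc("Sample title text",
        {"source": "sample.pdf", "section_number": 0, "section_title": "Sample Title"}),
    Doc("Abstract\nSample abstract text",
        {"source": "sample.pdf", "section_number": 1, "section_title": "Abstract"}),
    Doc("Introduction\nSample introduction text",
        {"source": "sample.pdf", "section_number": 2, "section_title": "Introduction"}),
]

# Outline of the file: (section number, section title) in document order.
outline = [(d.metadata["section_number"], d.metadata["section_title"]) for d in docs]

# Look up a section by title; a repeated title would keep only the last match.
by_title = {d.metadata["section_title"]: d for d in docs}
abstract = by_title["Abstract"].page_content
```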

chunks strategy: parse the file into chunks

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="chunks",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

docs[1]

Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})

len(docs)
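Documents from the chunks strategy carry `chunk_number` and `chunk_type` in their metadata, so they can be counted or filtered by type. The sketch below uses a stand-in `Doc` class and hypothetical sample data so it runs without a parsing server. The value `'para'` appears in the output above; `'table'` is an assumed type value used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two attributes as the documents shown above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical sample data; real values come from loader.load().
docs = [
    Doc("Sample title block",
        {"source": "sample.pdf", "chunk_number": 0, "chunk_type": "para"}),
    Doc("Sample paragraph",
        {"source": "sample.pdf", "chunk_number": 1, "chunk_type": "para"}),
    Doc("| a | b |",
        {"source": "sample.pdf", "chunk_number": 2, "chunk_type": "table"}),  # assumed type
]

# How many chunks of each type the file produced.
type_counts = Counter(d.metadata["chunk_type"] for d in docs)

# Keep only paragraph chunks, for example before embedding them.
paragraphs = [d for d in docs if d.metadata["chunk_type"] == "para"]
```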

html strategy: parse the file into a single HTML document

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="html",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
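Assuming the single document returned by the html strategy holds an HTML string in `page_content`, its headings can be pulled out with the standard library. This is a sketch under that assumption: `sample_html` is a hypothetical stand-in for `docs[0].page_content`, and `HeadingCollector` is an illustrative helper, not part of any library.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # heading level currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None


# Hypothetical stand-in for docs[0].page_content.
sample_html = (
    "<html><h1>Sample Title</h1><p>Intro paragraph.</p>"
    "<h2>Abstract</h2><p>Body text.</p></html>"
)
collector = HeadingCollector()
collector.feed(sample_html)
headings = collector.headings
```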

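Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

Abstract

We study how to apply large language models to write grounded, organized long-form articles. We propose a new method that helps authors use large language models to generate Wikipedia-style articles. The method combines information retrieval, large language models, and interactive prompting to help authors write articles with high-quality content and structure. We also propose an evaluation method for assessing the quality and structure of the generated articles. Our experimental results show that the method helps authors write high-quality Wikipedia-style articles without writing large amounts of content in advance or having specialist knowledge.

Introduction

Wikipedia is an online encyclopedia written and edited by volunteers. It covers a wide range of topics, from history and science to popular culture and current events. Writing a Wikipedia article, however, takes a great deal of time and expertise. To help authors write Wikipedia-style articles, we propose a new method that uses large language models to assist in writing them from scratch.

Our method combines information retrieval, large language models, and interactive prompting. First, the author provides a topic or keywords, and our system uses information retrieval to collect relevant text passages. Next, a large language model generates an initial version of the article that incorporates the retrieved content. Finally, interactive prompting helps the author organize and refine the article so that its content and structure are of high quality.

To assess the quality and structure of the generated articles, we propose an evaluation method that combines automatic and human evaluation. Our experimental results show that the method helps authors write high-quality Wikipedia-style articles without writing large amounts of content in advance or having specialist knowledge.

Conclusion

In this paper, we proposed a new method that helps authors use large language models to write Wikipedia-style articles. The method combines information retrieval, large language models, and interactive prompting to help authors write articles with high-quality content and structure. Our experimental results show that the method helps authors write high-quality Wikipedia-style articles without writing large amounts of content in advance or having specialist knowledge.

[20] Paper citation: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models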
