
ZeroxPDFLoader

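Overview

ZeroxPDFLoader is a document loader built on the Zerox library. Zerox converts PDF documents into images, processes them with a vision-capable language model, and generates a structured Markdown representation. The loader supports asynchronous operation and extracts documents at the page level.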

Integration details

Class | Package | Local | Serializable | JS support
ZeroxPDFLoader | langchain_community | | |

Loader features

Source | Document lazy loading | Native async support
ZeroxPDFLoader | |

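Setup

Credentials

Appropriate credentials need to be set in environment variables. The loader supports a number of different models and model providers. See the examples below, or visit the Zerox documentation for the full list of supported models.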
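Because a missing credential usually only surfaces once the first model call is made, it can help to check the environment up front. The following is a minimal sketch, not part of ZeroxPDFLoader or Zerox: `missing_env_vars` is a hypothetical helper, and the list of required names is something you supply yourself based on the provider you use (`OPENAI_API_KEY` is the one used on this page).

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or blank.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to try out without touching os.environ.
    """
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]


# Hypothetical usage: check before constructing the loader.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Which variables a given provider needs is defined by Zerox and its model backend, so the list passed in should come from the Zerox documentation rather than from this sketch.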

Installation

To use ZeroxPDFLoader, you need to install the zerox package. Also make sure langchain-community is installed.

pip install zerox langchain-community

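Initialization

ZeroxPDFLoader extracts text from a PDF by converting each page into an image and processing the images asynchronously with a vision-capable language model. To use the loader, specify a model and configure any environment variables Zerox requires, such as API keys.

If you are working in an environment such as a Jupyter Notebook, you may need nest_asyncio to handle the asynchronous code. You can set it up as follows: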

import nest_asyncio
nest_asyncio.apply()
import os

# use nest_asyncio (only necessary inside of jupyter notebook)
import nest_asyncio
from langchain_community.document_loaders.pdf import ZeroxPDFLoader

nest_asyncio.apply()

# Specify the URL or file path of the PDF you want to process.
# This example uses a PDF hosted on the web.
file_path = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

# Set up the environment variables required by the vision model.
# Replace the placeholder with your own API key.
os.environ["OPENAI_API_KEY"] = "your-api-key"

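# Initialize ZeroxPDFLoader with the desired model.
# Note: the "azure/" prefix selects Azure OpenAI, which reads Azure-specific
# credentials (AZURE_API_KEY, AZURE_API_BASE, AZURE_API_VERSION) rather than
# OPENAI_API_KEY; pass model="gpt-4o-mini" to use the OpenAI key set above.
loader = ZeroxPDFLoader(file_path=file_path, model="azure/gpt-4o-mini")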
API Reference: ZeroxPDFLoader

Loading

# Load the document and look at the first page:
documents = loader.load()
documents[0]
Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\n\nOpenAI is an AI research laboratory.\n\n#ai-models #ai\n\n## Revenue\n- **$1,000,000,000**  \n  2023\n\n## Valuation\n- **$28,000,000,000**  \n  2023\n\n## Growth Rate (Y/Y)\n- **400%**  \n  2023\n\n## Funding\n- **$11,300,000,000**  \n  2023\n\n---\n\n## Details\n- **Headquarters:** San Francisco, CA\n- **CEO:** Sam Altman\n\n[Visit Website](#)\n\n---\n\n## Revenue\n### ARR ($M)  | Growth\n--- | ---\n$1000M  | 456%\n$750M   | \n$500M   | \n$250M   | $36M\n$0     | $200M\n\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."\n\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. 
That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n\n---\n\n## Valuation\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n\n---\n\n## Product\n\n### ChatGPT\n| Examples                       | Capabilities                        | Limitations                        |\n|---------------------------------|-------------------------------------|------------------------------------|\n| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |\n| "What can you give me for my dad\'s birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |\n| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" |                                    |')
# Let's look at parsed first page
print(documents[0].page_content)
# OpenAI

OpenAI is an AI research laboratory.

#ai-models #ai

## Revenue
- **$1,000,000,000**
2023

## Valuation
- **$28,000,000,000**
2023

## Growth Rate (Y/Y)
- **400%**
2023

## Funding
- **$11,300,000,000**
2023

---

## Details
- **Headquarters:** San Francisco, CA
- **CEO:** Sam Altman

[Visit Website](#)

---

## Revenue
### ARR ($M) | Growth
--- | ---
$1000M | 456%
$750M |
$500M |
$250M | $36M
$0 | $200M

is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.

OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be "the most capital-intensive startup in Silicon Valley history."

The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.

---

## Valuation
In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.

Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.

---

## Product

### ChatGPT
| Examples | Capabilities | Limitations |
|---------------------------------|-------------------------------------|------------------------------------|
| "Explain quantum computing in simple terms" | "Remember what users said earlier in the conversation" | May occasionally generate incorrect information |
| "What can you give me for my dad's birthday?" | "Allows users to follow-up questions" | Limited knowledge of world events after 2021 |
| "How do I make an HTTP request in JavaScript?" | "Trained to provide harmless requests" | |

Lazy loading

The loader always loads results lazily. Calling .load() is equivalent to collecting the output of .lazy_load() into a list.
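The general pattern for consuming a lazy iterator is to pull only as many items as you need. The sketch below illustrates that with `itertools.islice`; `take_pages` and `FakeLoader` are hypothetical names used only for illustration, and `FakeLoader` stands in for a real loader so the example runs without any model calls. Whether stopping early actually saves work with ZeroxPDFLoader depends on how the loader produces its pages internally, which is not covered here.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def take_pages(pages: Iterable[Any], limit: int) -> List[Any]:
    """Return at most `limit` items, pulling no more than that from `pages`."""
    return list(islice(pages, limit))


class FakeLoader:
    """Hypothetical stand-in for a loader; counts how many pages it produced."""

    def __init__(self, total_pages: int) -> None:
        self.total_pages = total_pages
        self.produced = 0

    def lazy_load(self) -> Iterator[str]:
        for number in range(1, self.total_pages + 1):
            self.produced += 1
            yield f"page {number}"


fake = FakeLoader(total_pages=5)
first_two = take_pages(fake.lazy_load(), limit=2)
print(first_two)      # ['page 1', 'page 2']
print(fake.produced)  # 2
```

With a real loader the call would look like `take_pages(loader.lazy_load(), limit=2)`.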

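API reference

ZeroxPDFLoader

This loader class is initialized with a file path and a model type, and supports custom configuration through zerox_kwargs for Zerox-specific parameters.

Parameters:

  • file_path (Union[str, Path]): Path (or URL) of the PDF file.
  • model (str): The vision-capable model to use for processing, in the format <provider>/<model>. Some valid examples:
    • model = "gpt-4o-mini" ## openai model
    • model = "azure/gpt-4o-mini"
    • model = "gemini/gpt-4o-mini"
    • model="claude-3-opus-20240229"
    • model = "vertex_ai/gemini-1.5-flash-001"
    • See the Zerox documentation for more details.
    • Defaults to "gpt-4o-mini".
  • **zerox_kwargs (dict): Additional Zerox-specific parameters, such as API keys and endpoints.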
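The model examples above follow a naming convention in which an optional provider prefix is separated from the model name by the first slash. A minimal sketch of reading that convention is shown below; `split_model_name` is a hypothetical helper written for illustration and is not how Zerox itself resolves the string.

```python
from typing import Optional, Tuple


def split_model_name(model: str) -> Tuple[Optional[str], str]:
    """Split "provider/model" at the first slash.

    Returns (None, model) when the string carries no provider prefix,
    as in "gpt-4o-mini" or "claude-3-opus-20240229".
    """
    provider, separator, name = model.partition("/")
    if not separator:
        return None, model
    return provider, name


for example in ("gpt-4o-mini", "azure/gpt-4o-mini", "vertex_ai/gemini-1.5-flash-001"):
    print(example, "->", split_model_name(example))
```

A helper like this can be handy for deciding which credentials to check before constructing the loader.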

Methods:

  • lazy_load: Generates an iterator of Document instances, each representing one page of the PDF and carrying metadata such as the page number and source.
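Since each Document covers a single page and carries a `page` entry in its metadata (as in the sample output earlier on this page), the pages can be reassembled into one Markdown string. The sketch below assumes only the `page_content` and `metadata["page"]` attributes; `join_pages` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without LangChain installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def join_pages(documents: Iterable[Any], separator: str = "\n\n---\n\n") -> str:
    """Concatenate page_content in ascending order of metadata['page']."""
    ordered = sorted(documents, key=lambda doc: doc.metadata["page"])
    return separator.join(doc.page_content for doc in ordered)


# Stand-in objects exposing the same two attributes as a LangChain Document.
pages = [
    SimpleNamespace(page_content="# Second", metadata={"page": 2, "num_pages": 2}),
    SimpleNamespace(page_content="# First", metadata={"page": 1, "num_pages": 2}),
]
print(join_pages(pages))
```

With real results the call would be `join_pages(loader.load())`; the separator is an arbitrary choice and can be changed to suit the downstream consumer.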

See the full API documentation for details.

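Notes

  • Model compatibility: Zerox supports a range of vision-capable models. See the Zerox GitHub documentation for the list of supported models and configuration details.
  • Environment variables: Make sure the required environment variables, such as API_KEY or endpoint details, are set as specified in the Zerox documentation.
  • Asynchronous processing: If you run into event-loop errors in Jupyter Notebooks, you may need to apply nest_asyncio as shown in the Setup section.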

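Troubleshooting

  • RuntimeError: This event loop is already running: Use nest_asyncio.apply() in environments such as Jupyter to prevent async loop conflicts.
  • Configuration errors: Verify that zerox_kwargs matches the parameters expected by the chosen model, and make sure all necessary environment variables are set.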
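If the same code has to run both as a plain script and inside a notebook, one option is to patch the event loop only when a loop is already running. This is a minimal sketch under that assumption; `event_loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical helper names, and `nest_asyncio` is imported lazily so the script case does not require it.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from a thread that has a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Apply nest_asyncio only inside a running loop; report whether it was applied."""
    if not event_loop_is_running():
        return False
    import nest_asyncio  # third-party; only needed in the notebook case

    nest_asyncio.apply()
    return True


# In a plain script no loop is running, so nothing is patched.
print(apply_nest_asyncio_if_needed())
```

Calling `apply_nest_asyncio_if_needed()` once before creating the loader covers both environments.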
