# Recursive URL

The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents.
## Overview

### Integration details

| Class | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |

### Loader features

| Source | Document lazy loading | Native async support |
| --- | --- | --- |
| RecursiveUrlLoader | ✅ | ❌ |
## Setup

### Credentials

No credentials are required to use the `RecursiveUrlLoader`.
### Installation

The `RecursiveUrlLoader` lives in the `langchain-community` package. There are no other required packages, though you will get richer default Document metadata if `beautifulsoup4` is installed.

```python
%pip install -qU langchain-community beautifulsoup4 lxml
```
## Instantiation

Now we can instantiate our document loader object and load Documents:

```python
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
```
## Load

Use `.load()` to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max depth.

Let's run through a basic example of how to use the `RecursiveUrlLoader` on the Python 3.9 docs.
```python
docs = loader.load()
docs[0].metadata
```
```output
{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}
```
Great! The first Document looks like the root page we started from. Let's look at the metadata of the next Document:
```python
docs[1].metadata
```

```output
{'source': 'https://docs.python.org/3.9/using/index.html',
 'content_type': 'text/html',
 'title': 'Python Setup and Usage — Python 3.9.19 documentation',
 'language': None}
```
That URL looks like a child of our root page, which is great! Let's move on from metadata and inspect the content of one of our Documents.
```python
print(docs[0].page_content[:300])
```

```output
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">

    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel=
```
That certainly looks like HTML from https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations we can make to our basic example that can be helpful in different situations.
## Lazy loading

If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:
```python
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(pages)
        pages = []
```
In this example we never have more than 10 Documents loaded into memory at a time.
## Adding an extractor

By default the loader sets the raw HTML of each link as the Document page content. To parse this HTML into a more human/LLM-friendly format, you can pass in a custom `extractor` method:
```python
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
```
```output
3.9.19 Documentation

Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
```
This looks much better!
You can similarly pass in a `metadata_extractor` to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
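As a minimal sketch of what such a callable might look like, here is a stdlib-only extractor that pulls the page `<title>` out of the raw HTML. Note that `simple_metadata_extractor` is a hypothetical name, and the exact signature the loader expects (in particular whether the `requests.Response` is also passed in) varies by version, so check the API reference before relying on it:

```python
import re


def simple_metadata_extractor(raw_html: str, url: str) -> dict:
    """Build a metadata dict from the raw HTML and the page URL.

    Pulls the <title> tag (if any) out of the HTML with a regex.
    """
    match = re.search(
        r"<title[^>]*>(.*?)</title>", raw_html, re.IGNORECASE | re.DOTALL
    )
    return {
        "source": url,
        "title": match.group(1).strip() if match else None,
    }


# loader = RecursiveUrlLoader(
#     "https://docs.python.org/3.9/",
#     metadata_extractor=simple_metadata_extractor,
# )
```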
## API reference

These examples show just a few of the ways in which you can modify the default `RecursiveUrlLoader`, but there are many more modifications that can be made to best fit your use case. Using the parameters `link_regex` and `exclude_dirs` can help you filter out unwanted URLs, `aload()` and `alazy_load()` can be used for async loading, and more.
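To sketch how those knobs fit together (the `exclude_dirs` URL and the regex below are illustrative assumptions, not recommendations, and the exact form `link_regex` must take should be checked against the API reference):

```python
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    max_depth=2,
    # Skip entire subtrees, e.g. the "what's new" pages:
    exclude_dirs=("https://docs.python.org/3.9/whatsnew/",),
    # Only follow links whose href ends in .html:
    link_regex=r'href="([^"]*\.html)"',
)

# Async variants (run inside an async context):
# docs = await loader.aload()
# async for doc in loader.alazy_load():
#     ...
```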
For detailed information on configuring and calling the `RecursiveUrlLoader`, see the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html.