URL
This example covers how to load HTML documents from a list of URLs and convert them into the Document format that we can use downstream.
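To make the idea concrete, here is a minimal, standard-library-only sketch of what "HTML in, Document out" means. It is not how the loaders below are implemented; `SimpleDocument`, `_TextExtractor`, and `html_to_document` are hypothetical names used only for illustration, and the HTML is an inline string so that nothing is fetched over the network.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: extracted text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_document(html: str, source: str) -> SimpleDocument:
    """Turn one HTML string into a SimpleDocument tagged with its source URL."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return SimpleDocument(
        page_content="\n".join(parser.chunks),
        metadata={"source": source},
    )


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)
doc = html_to_document(sample_html, "https://example.com/demo")
print(doc.page_content)
print(doc.metadata)
```

The loaders in the rest of this page return objects of the same general shape: a `page_content` string and a `metadata` dictionary that records the source URL.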
Unstructured URL Loader
You need to install the unstructured library:
!pip install -U unstructured
from langchain_community.document_loaders import UnstructuredURLLoader
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
If you run into an SSL verification error, pass ssl_verify=False along with headers=headers when constructing the loader.
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
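The note about ssl_verify=False and headers=headers could look roughly like the sketch below. It assumes, as that note implies, that UnstructuredURLLoader forwards extra keyword arguments to unstructured's HTML partitioning. The helper names and the User-Agent value are hypothetical, and the loader is imported lazily so that the helpers can be defined without the dependency installed.

```python
def build_unstructured_kwargs(user_agent: str) -> dict:
    """Extra keyword arguments to pass alongside `urls` (illustrative values)."""
    return {
        "headers": {"User-Agent": user_agent},  # hypothetical header value
        "ssl_verify": False,  # skip certificate verification
    }


def load_skipping_ssl_verification(urls, user_agent="my-app/0.1"):
    """Load the given URLs with certificate verification turned off."""
    # Lazy import: requires langchain-community and unstructured at call time.
    from langchain_community.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls, **build_unstructured_kwargs(user_agent))
    return loader.load()
```

Calling `load_skipping_ssl_verification(urls)` would then return the loaded documents. Turning verification off removes a safety check, so it is best reserved for hosts you trust.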
Selenium URL Loader
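This section covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.
Using Selenium allows us to load pages that require JavaScript to render.
To use the SeleniumURLLoader, you need to install selenium and unstructured.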
!pip install -U selenium unstructured
from langchain_community.document_loaders import SeleniumURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()
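Once `loader.load()` returns, it can help to check what was actually extracted from each page. The sketch below assumes each returned item exposes `page_content` and a `metadata` dictionary with a "source" key; `summarize` is a hypothetical helper, and the stand-in objects exist only so the example runs without a browser.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars=80):
    """Return (source, character count, one-line preview) for each document."""
    rows = []
    for doc in docs:
        source = doc.metadata.get("source", "<unknown>")
        # Collapse runs of whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        rows.append((source, len(doc.page_content), preview))
    return rows


# Stand-in for the list returned by loader.load(); not real page content.
fake_docs = [
    SimpleNamespace(
        page_content="Hello\n\nworld",
        metadata={"source": "https://example.com"},
    )
]

for row in summarize(fake_docs):
    print(row)
```

With real results, `summarize(data)` makes it easy to spot pages that came back empty, for example because they failed to render.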
Playwright URL Loader
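This section covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.
Playwright enables reliable end-to-end testing for modern web apps.
As in the Selenium case, Playwright allows us to load and render JavaScript pages.
To use the PlaywrightURLLoader, you need to install playwright and unstructured. Additionally, you need to install the Playwright Chromium browser: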
!pip install -U playwright unstructured
!playwright install
from langchain_community.document_loaders import PlaywrightURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()
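The `remove_selectors` argument used above lists the elements to strip from each rendered page before its text is extracted. A small wrapper such as the following keeps that choice in one place. This is only a sketch: `SELECTORS_TO_DROP` and `load_rendered_pages` are hypothetical names, and it assumes the `continue_on_failure` and `headless` keyword arguments are available in your installed version.

```python
SELECTORS_TO_DROP = ["header", "footer"]  # page chrome we do not want in the text


def load_rendered_pages(urls, selectors=SELECTORS_TO_DROP):
    """Render each URL in a headless browser and return the loaded documents."""
    # Lazy import: requires langchain-community, playwright, and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader

    loader = PlaywrightURLLoader(
        urls=urls,
        remove_selectors=list(selectors),
        continue_on_failure=True,  # assumed option: skip URLs that fail to load
        headless=True,  # assumed option: run without a visible browser window
    )
    return loader.load()
```

Usage would be `data = load_rendered_pages(urls)`, with extra selectors added to the list as needed for a given site.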