

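This guide covers how to load web pages into the LangChain Document format that we use downstream. Web pages contain text, images, and other multimedia elements, and are typically represented with HTML. They may include links to other pages or resources.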

LangChain integrates with a range of parsers suited to web pages. The right parser depends on your needs. Below we demonstrate two possibilities:

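  • Simple and fast parsing, in which we recover one Document per web page, with its content represented as a "flattened" string;
  • Advanced parsing, in which we recover multiple Document objects per page, making it possible to identify and traverse sections, links, tables, and other structures.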
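
To make the difference between the two styles concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not how the LangChain loaders work internally; `SAMPLE_HTML`, `CATEGORY_BY_TAG`, and `ElementCollector` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    "<h1>Setup</h1>"
    "<p>Install the package.</p>"
    "<ul><li>Step one</li><li>Step two</li></ul>"
)

# Hypothetical mapping from HTML tags to element categories.
CATEGORY_BY_TAG = {"h1": "Title", "p": "NarrativeText", "li": "ListItem"}


class ElementCollector(HTMLParser):
    """Collect (category, text) pairs for the tags in CATEGORY_BY_TAG."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # category of the tag being read, if any

    def handle_starttag(self, tag, attrs):
        self._current = CATEGORY_BY_TAG.get(tag)

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.elements.append((self._current, data.strip()))


collector = ElementCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# "Simple" view: one flattened string for the whole page.
flat_text = " ".join(text for _, text in collector.elements)

# "Advanced" view: one record per element, each tagged with a category.
structured = collector.elements

print(flat_text)
for category, text in structured:
    print(f"{category}: {text}")
```

The flattened string is convenient for quick indexing, while the per-element records keep enough structure to select, say, only list items or only the text under a given heading.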



%pip install -qU langchain-community beautifulsoup4

For advanced parsing, we will use langchain-unstructured:

%pip install -qU langchain-unstructured


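If you are looking for a simple string representation of the text embedded in a web page, the method below is appropriate. It returns a list of Document objects, one per page, each containing the page's text as a single string. Under the hood, it uses the beautifulsoup4 Python library.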

LangChain document loaders implement lazy_load and its async variant, alazy_load, which return iterators of Document objects. We will use these below.
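
As a rough illustration of the lazy-loading pattern, the sketch below defines a hypothetical `ToyLoader` and `ToyDocument` (stand-ins, not LangChain classes) with a synchronous generator and an async generator. It also shows one way to consume the async iterator from a plain script, where a top-level `async for` is not available.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Iterator


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader that yields one document per in-memory page."""

    def __init__(self, pages: dict):
        self.pages = pages

    def lazy_load(self) -> Iterator[ToyDocument]:
        # Documents are produced one at a time, only when requested.
        for source, text in self.pages.items():
            yield ToyDocument(page_content=text, metadata={"source": source})

    async def alazy_load(self) -> AsyncIterator[ToyDocument]:
        for doc in self.lazy_load():
            await asyncio.sleep(0)  # hand control back to the event loop
            yield doc


async def collect(loader: ToyLoader) -> list:
    # Outside a notebook, `async for` has to run inside a coroutine.
    return [doc async for doc in loader.alazy_load()]


loader = ToyLoader({"page-1": "Hello", "page-2": "World"})
sync_docs = list(loader.lazy_load())
async_docs = asyncio.run(collect(loader))
print(len(sync_docs), len(async_docs))
```

In a notebook, the event loop is already running, so the `async for` loops in this guide can be written at the top level of a cell instead.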

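import bs4
from langchain_community.document_loaders import WebBaseLoader

page_url = ""

loader = WebBaseLoader(web_paths=[page_url])
docs = []
# Top-level `async for` works in a notebook; in a script, wrap it in a coroutine.
async for doc in loader.alazy_load():
    docs.append(doc)

# One URL was passed in, so exactly one Document is expected.
assert len(docs) == 1
doc = docs[0]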
USER_AGENT environment variable not set, consider setting it to identify your requests.
{'source': '', 'title': 'How to add memory to chatbots | 🦜️🔗 LangChain', 'description': 'A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:', 'language': 'en'}

How to add memory to chatbots | 🦜️🔗 LangChain

Skip to main contentShare your thoughts on AI agents. Take the 3-min survey.IntegrationsAPI ReferenceMoreContributingPeopleLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a Simple LLM Application with LCELBuild a Query Analysis SystemBuild a ChatbotConversational RAGBuild an Extraction ChainBuild an AgentTaggingd



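loader = WebBaseLoader(
    web_paths=[page_url],
    # Keyword arguments forwarded to BeautifulSoup: keep only the article body.
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="theme-doc-markdown markdown"),
    },
    # Keyword arguments forwarded to BeautifulSoup's get_text().
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]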
{'source': ''}

How to add memory to chatbots | A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including: | Simply stuffing previous messages into a chat model prompt. | The above, but trimming old messages to reduce the amount of distracting information the model has to deal with. | More complex modifications like synthesizing summaries for long running conversations. | We'll go into more detail on a few techniq
a greeting. Nemo then asks the AI how it is doing, and the AI responds that it is fine.'), | HumanMessage(content='What did I say my name was?'), | AIMessage(content='You introduced yourself as Nemo. How can I assist you today, Nemo?')] | Note that invoking the chain again will generate another summary generated from the initial summary plus new messages and so on. You could also design a hybrid approach where a certain number of messages are retained in chat history while others are summarized.






from langchain_unstructured import UnstructuredLoader

page_url = ""
loader = UnstructuredLoader(web_url=page_url)

docs = []
# Each yielded Document corresponds to one element of the page.
async for doc in loader.alazy_load():
    docs.append(doc)
INFO: Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO: NumExpr defaulting to 8 threads.


for doc in docs[:5]:
    print(doc.page_content)
How to add memory to chatbots
A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.



for doc in docs[:5]:
    print(f'{doc.metadata["category"]}: {doc.page_content}')
Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
ListItem: The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
ListItem: More complex modifications like synthesizing summaries for long running conversations.
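
Because each element carries a category, it can be filtered or tallied directly. The sketch below works on plain dictionaries shaped like the output above; the `elements` list is a hypothetical stand-in for `docs`, with the narrative text shortened, so that the example runs without fetching a page.

```python
from collections import Counter

# Hypothetical records mirroring the category/text pairs printed above.
elements = [
    {"category": "Title", "text": "How to add memory to chatbots"},
    {"category": "NarrativeText", "text": "A key feature of chatbots is their ability to use content of previous conversation turns as context."},
    {"category": "ListItem", "text": "Simply stuffing previous messages into a chat model prompt."},
    {"category": "ListItem", "text": "The above, but trimming old messages to reduce the amount of distracting information the model has to deal with."},
    {"category": "ListItem", "text": "More complex modifications like synthesizing summaries for long running conversations."},
]

# How many elements of each category were found?
category_counts = Counter(el["category"] for el in elements)

# Keep only the list items.
list_items = [el["text"] for el in elements if el["category"] == "ListItem"]

print(dict(category_counts))
for item in list_items:
    print("-", item)
```

With real Document objects, the same idea would read the category from `doc.metadata["category"]` and the text from `doc.page_content`, as in the loop above.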



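from typing import List

from langchain_core.documents import Document


async def _get_setup_docs_from_url(url: str) -> List[Document]:
    loader = UnstructuredLoader(web_url=url)

    setup_docs = []
    parent_id = -1  # sentinel: no "Setup" title seen yet
    async for doc in loader.alazy_load():
        # Remember the element ID of the "Setup" heading...
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]
        # ...and keep every element whose parent is that heading.
        if doc.metadata.get("parent_id") == parent_id:
            setup_docs.append(doc)

    return setup_docs


page_urls = [
    # The page URLs are elided in the source; list the pages to parse here.
]
setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)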
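
The parent-ID filter used above can be tried out on its own. The following sketch applies the same logic to hypothetical in-memory records (the `elements` list and its IDs are made up for illustration), so it runs without network access or the Unstructured dependency.

```python
def select_section(elements, heading_prefix):
    """Return elements whose parent_id is the element_id of the matching Title."""
    selected = []
    parent_id = -1  # sentinel: no matching Title seen yet
    for el in elements:
        if el["category"] == "Title" and el["text"].startswith(heading_prefix):
            parent_id = el["element_id"]
        if el.get("parent_id") == parent_id:
            selected.append(el)
    return selected


# Hypothetical elements: three headings, each followed by its child elements.
elements = [
    {"element_id": "a1", "category": "Title", "text": "How to add memory to chatbots"},
    {"element_id": "a2", "category": "NarrativeText", "text": "A key feature of chatbots.", "parent_id": "a1"},
    {"element_id": "b1", "category": "Title", "text": "Setup"},
    {"element_id": "b2", "category": "NarrativeText", "text": "You'll need to install a few packages.", "parent_id": "b1"},
    {"element_id": "b3", "category": "NarrativeText", "text": "Set the API key as an environment variable.", "parent_id": "b1"},
    {"element_id": "c1", "category": "Title", "text": "Message passing"},
    {"element_id": "c2", "category": "NarrativeText", "text": "The simplest form of memory.", "parent_id": "c1"},
]

setup_elements = select_section(elements, "Setup")
for el in setup_elements:
    print(el["text"])
```

Note that the heading itself is not included in the result: only elements that name it as their parent are kept.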
from collections import defaultdict

setup_text = defaultdict(str)

# Concatenate the text of each page's setup elements, keyed by page URL.
for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)

{'': "You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\n[33mWARNING: You are using pip version 22.0.4; however, version 23.3.2 is available.\nYou should consider upgrading via the '/Users/jacoblee/.pyenv/versions/3.10.5/bin/python -m pip install --upgrade pip' command.[0m[33m\n[0mNote: you may need to restart the kernel to use updated packages.\n",
'': "For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.\nYou'll need to sign up for an account on the Tavily website, and install the following packages:\n%pip install --upgrade --quiet langchain-community langchain-openai tavily-python\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\nYou will also need your OpenAI key set as OPENAI_API_KEY and your Tavily API key set as TAVILY_API_KEY.\n"}


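Once we have loaded the page contents into LangChain Document objects, we can index them in the usual way (e.g., for a RAG application). Below we use OpenAI embeddings, although any LangChain embedding model will work.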

%pip install -qU langchain-openai
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(setup_docs, OpenAIEmbeddings())
retrieved_docs = vector_store.similarity_search("Install Tavily", k=2)
for doc in retrieved_docs:
    print(f'Page {doc.metadata["url"]}: {doc.page_content[:300]}\n')
INFO: HTTP Request: POST "HTTP/1.1 200 OK"
INFO: HTTP Request: POST "HTTP/1.1 200 OK"
Page You'll need to sign up for an account on the Tavily website, and install the following packages:

Page For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.
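
For intuition about what a similarity search does, here is a toy sketch using only the standard library. It replaces the learned embedding model with simple word counts and ranks texts by cosine similarity; `embed`, `cosine`, and `similarity_search` are hypothetical helpers, not the implementation of InMemoryVectorStore or OpenAIEmbeddings.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(texts, query, k=2):
    """Return the k texts most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(embed(t), query_vec), reverse=True)
    return ranked[:k]


texts = [
    "set the openai api key as an environment variable",
    "install a few packages",
    "install the tavily package and sign up for an account",
]

results = similarity_search(texts, "Install Tavily", k=2)
for text in results:
    print(text)
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to relevant text even when the two share no words.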


