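How to summarize text through iterative refinement

LLMs can summarize and otherwise distill desired information from text, including large volumes of text. In many cases, especially when the amount of text is large compared to the size of the model's context window, it can be helpful (or necessary) to break up the summarization task into smaller components.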

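Iterative refinement is one strategy for summarizing long texts. The strategy is as follows:

Split a text into smaller documents;
Summarize the first document;
Refine or update the result based on the next document;
Repeat through the sequence of documents until finished.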
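
The steps above can be sketched in plain Python, without any LLM. This is a minimal illustration of the control flow only, not LangChain's implementation: `iterative_refine`, `fake_summarize`, and `fake_refine` are hypothetical names, and the two "fake" functions are stand-ins for model calls that merely transform strings.

```python
from typing import Callable, List


def iterative_refine(
    docs: List[str],
    summarize_first: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize docs[0], then fold each later document into the running summary."""
    if not docs:
        raise ValueError("need at least one document")
    summary = summarize_first(docs[0])
    for doc in docs[1:]:
        # Each step sees only the running summary and the next document.
        summary = refine(summary, doc)
    return summary


# Hypothetical stand-ins for LLM calls (assumptions for illustration only).
def fake_summarize(doc: str) -> str:
    return doc.upper()


def fake_refine(summary: str, doc: str) -> str:
    return f"{summary}; {doc.upper()}"


result = iterative_refine(["a", "b", "c"], fake_summarize, fake_refine)
print(result)  # A; B; C
```

Because every call to `refine` depends on the previous result, the loop is inherently sequential.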

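Note that this strategy is not parallelized. It is especially effective when understanding a sub-document depends on prior context, for instance when summarizing a novel or a body of text with an inherent sequence.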

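LangGraph, built on top of langchain-core, is well suited to this problem:

LangGraph allows individual steps (such as successive summaries) to be streamed, giving greater control over execution;
LangGraph's checkpointing supports error recovery, extends to human-in-the-loop workflows, and makes it easier to incorporate into conversational applications.
Because it is assembled from modular components, it is also simple to extend or modify (for example, to incorporate tool calling or other behavior).

Below, we demonstrate how to summarize text via iterative refinement.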

Load chat model

Let's first load a chat model:


pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

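Load documents

Next, we need some documents to summarize. Below, we generate some toy documents for illustrative purposes. See the document loader how-to guides and integration pages for additional sources of data. The summarization tutorial also includes an example that summarizes a blog post.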

from langchain_core.documents import Document

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]

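
The toy documents above are already small. For a real long text, the first step of the strategy is to split it into chunks. The following is a minimal character-based sketch of that step, assuming a fixed chunk size with a small overlap so that context carries across boundaries. `split_text` and its parameters are hypothetical names for illustration; LangChain's own text splitters handle this more robustly.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts with the last `overlap` characters
    of the previous chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each resulting chunk could then serve as the `page_content` of one `Document` in the sequence.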

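Create graph

Below we show a LangGraph implementation of this process:

We generate a simple chain for the initial summary that takes the first document, formats it into a prompt, and runs inference with our LLM.
We generate a second chain, refine_summary_chain, that operates on each successive document, refining the initial summary.

We will need to install langgraph: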

pip install -qU langgraph

from typing import List, Literal, TypedDict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.graph import END, START, StateGraph

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refining the summary with new docs
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])

refine_summary_chain = refine_prompt | llm | StrOutputParser()


# We will define the state of the graph to hold the document
# contents and summary. We also include an index to keep track
# of our position in the sequence of documents.
class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
    # Pass the prompt variable by name, mirroring refine_summary below.
    summary = await initial_summary_chain.ainvoke(
        {"context": state["contents"][0]},
        config,
    )
    return {"summary": summary, "index": 1}


# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
    if state["index"] >= len(state["contents"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

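
To see how the conditional edges drive the loop, the routing can be traced without LangGraph or a model. This is a sketch under stated assumptions: `END` is a local string stand-in for `langgraph.graph.END`, `trace_nodes` is a hypothetical helper, and at least one document is assumed (the real graph would fail on an empty list when it indexes the first document).

```python
from typing import List

END = "__end__"  # local stand-in for langgraph.graph.END (assumption for this sketch)


def should_refine(state: dict) -> str:
    # Same rule as the graph's conditional edge: stop once every document is consumed.
    if state["index"] >= len(state["contents"]):
        return END
    return "refine_summary"


def trace_nodes(contents: List[str]) -> List[str]:
    """Return the node names visited, in order, for the given documents."""
    # generate_initial_summary always runs first and sets index to 1.
    state = {"contents": contents, "index": 1}
    visited = ["generate_initial_summary"]
    while should_refine(state) != END:
        visited.append("refine_summary")
        state["index"] += 1  # refine_summary advances the index by one
    return visited


visited = trace_nodes(["doc 0", "doc 1", "doc 2"])
print(visited)
# ['generate_initial_summary', 'refine_summary', 'refine_summary']
```

For N documents the graph therefore makes N model calls: one initial summary and N - 1 refinements.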

LangGraph allows the graph structure to be plotted to help visualize its function:

from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())

Invoke graph

We can step through the execution as follows, printing out the summary as it is refined:

async for step in app.astream(
    {"contents": [doc.page_content for doc in documents]},
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)

Apples are characterized by their red color.
Apples are characterized by their red color, while blueberries are known for their blue hue.
Apples are characterized by their red color, blueberries are known for their blue hue, and bananas are recognized for their yellow color.

The final step contains the summary as synthesized from the entire set of documents.
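
A top-level `async for` works in a notebook, but in a plain Python script it has to live inside a coroutine that is run with `asyncio.run`. The sketch below shows that pattern while keeping only the last streamed summary. `fake_astream` is a hypothetical stand-in for `app.astream(..., stream_mode="values")`: it concatenates text instead of calling a model, so the example is self-contained.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_astream(contents: List[str]) -> AsyncIterator[Dict]:
    """Hypothetical stand-in for app.astream(..., stream_mode="values")."""
    summary = ""
    for index, text in enumerate(contents, start=1):
        summary = f"{summary} {text}".strip()
        # Each yielded value mimics a full state snapshot after one step.
        yield {"contents": contents, "index": index, "summary": summary}


async def final_summary(contents: List[str]) -> str:
    last = ""
    async for step in fake_astream(contents):
        if summary := step.get("summary"):
            last = summary  # keep overwriting; the last value wins
    return last


result = asyncio.run(final_summary(["Apples are red.", "Bananas are yellow."]))
print(result)  # Apples are red. Bananas are yellow.
```

With the real graph, the same structure applies with `app.astream` in place of the stand-in.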

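Next steps

Check out the summarization how-to guides for additional summarization strategies, including those designed for larger volumes of text.

See the summarization tutorial for more detail on summarization.

See also the LangGraph documentation for details on building with LangGraph.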
