
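How to Build a Chatbot#

LlamaIndex serves as a bridge between your data and Large Language Models (LLMs), providing a toolkit that lets you build a query interface around your data for a variety of tasks, such as question-answering and summarization.

In this tutorial, we'll walk you through building a context-augmented chatbot using a Data Agent. This agent, powered by LLMs, is capable of intelligently executing tasks over your data. The end result is a chatbot agent equipped with a robust set of data interface tools provided by LlamaIndex to answer queries about your data.

Note: This tutorial builds on initial work on creating a query interface over SEC 10-K filings - check it out here.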

Context#

In this guide, we'll build a "10-K Chatbot" that uses raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions related to the 10-K filings.

Preparation#

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

import nest_asyncio

nest_asyncio.apply()

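Ingest Data#

Let's first download the raw 10-K files for 2019 through 2022.

# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files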
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data
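
The `!` commands above only work inside a notebook. Outside Jupyter, a rough standard-library equivalent might look like the sketch below. The helper names `download_file` and `extract_zip` are illustrative and not part of LlamaIndex; the URL is the one used above.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def download_file(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_zip(zip_path: Path, out_dir: Path) -> List[str]:
    """Extract every member of `zip_path` into `out_dir` and return their names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


# Example usage (performs a network request, so it is left commented out):
# archive = download_file(
#     "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1",
#     Path("data/UBER.zip"),
# )
# extract_zip(archive, Path("data"))
```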

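To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can directly integrate with Unstructured, allowing conversion of any text into a Document format that LlamaIndex can ingest.

First we install the necessary packages: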

!pip install llama-index-readers-file unstructured

Then we can use the UnstructuredReader to parse the HTML files into a list of Document objects.

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each document
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
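
Note that assigning `d.metadata = {"year": year}` replaces any metadata the reader may already have attached to the document. If you would rather keep existing keys, one option is to merge instead. The sketch below uses a stand-in `FakeDocument` class (hypothetical, defined only so the example runs without LlamaIndex installed) and an illustrative helper `tag_with_year` to show the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeDocument:
    """Stand-in for a LlamaIndex Document; only the fields used here."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def tag_with_year(doc: FakeDocument, year: int) -> FakeDocument:
    """Add a `year` key while keeping any metadata already present."""
    doc.metadata = {**doc.metadata, "year": year}
    return doc


# Assumed example: a document that already carries a filename key.
doc = FakeDocument(text="...", metadata={"filename": "UBER_2022.html"})
tag_with_year(doc, 2022)
print(doc.metadata)
```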

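Setting up Vector Indices for Each Year#

We first set up a vector index for each year. Each vector index allows us to ask questions about the 10-K filing of a given year.

We build each index and save it to disk.

# initialize simple vector indices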
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core import Settings

Settings.chunk_size = 512
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

To load an index from disk, do the following:

# load indices from disk
from llama_index.core import load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
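
In practice you may want to build an index only when it has not been persisted yet, and load it otherwise. The sketch below shows one way to express that choice. `load_or_build` is a hypothetical helper, not a LlamaIndex function, and it takes the load and build steps as callables so the example stays runnable without the library.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    load_fn: Callable[[str], T],
    build_fn: Callable[[str], T],
) -> T:
    """Call `load_fn` if `persist_dir` exists and is non-empty, else `build_fn`."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    return build_fn(persist_dir)


# In this tutorial, the callables could wrap the code shown above, e.g.:
#   load_fn  = lambda d: load_index_from_storage(
#       StorageContext.from_defaults(persist_dir=d))
#   build_fn = a function that calls VectorStoreIndex.from_documents(...)
#              and then storage_context.persist(persist_dir=d)
```

Checking for a non-empty directory is a deliberately simple heuristic; it does not verify that the persisted files are complete or compatible.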

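Setting up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings#

Since we have access to documents from 4 years, we may want to ask not only questions about the 10-K filing of a given year, but also questions that require analysis across all 10-K filings.

To address this, we can use a Sub Question Query Engine. It decomposes a query into subqueries, each answered by an individual vector index, and synthesizes the results to answer the overall query.

LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First we define a QueryEngineTool for each vector index. Each tool has a name and a description; these are what the LLM agent sees when deciding which tool to choose.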

from llama_index.core.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=f"useful for when you want to answer queries about the {year} SEC 10-K for Uber",
        ),
    )
    for year in years
]
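
Now we can create the Sub Question Query Engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools we defined above, as well as an llm that will be used to run the subqueries.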

from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
    llm=OpenAI(model="gpt-3.5-turbo"),
)
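
To make the decompose-answer-synthesize flow more concrete, here is a toy sketch in plain Python. It is not how SubQuestionQueryEngine is implemented: the real engine asks an LLM to write targeted sub-questions and to synthesize the final answer, whereas this version sends the same question to every tool and concatenates the results. All names below (`make_stub_engine`, `decompose`, `answer_with_sub_questions`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A stub "query engine" maps a question string to an answer string.
Engine = Callable[[str], str]


def make_stub_engine(year: int) -> Engine:
    """Return a fake per-year engine that just reports which year answered."""
    return lambda question: f"(answer from the {year} filing)"


def decompose(query: str, tool_names: List[str]) -> List[Tuple[str, str]]:
    """Naive decomposition: ask every tool the same question.

    The real engine asks an LLM to generate sub-questions instead.
    """
    return [(name, query) for name in tool_names]


def answer_with_sub_questions(query: str, engines: Dict[str, Engine]) -> str:
    """Run each sub-question against its engine and join the partial answers."""
    sub_questions = decompose(query, list(engines))
    partial = [f"[{name}] {engines[name](q)}" for name, q in sub_questions]
    # The real engine hands the partial answers to an LLM for synthesis;
    # here they are simply concatenated.
    return "\n".join(partial)


engines = {f"vector_index_{y}": make_stub_engine(y) for y in (2022, 2021)}
print(answer_with_sub_questions("What are the risk factors?", engines))
```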

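Setting up the Chatbot Agent#

We use a LlamaIndex Data Agent to set up the outer chatbot agent, which has access to a set of Tools. Specifically, we will use an OpenAIAgent, which takes advantage of OpenAI API function calling. We want to use the separate Tools we defined previously for each index (corresponding to a given year), as well as a tool for the sub question query engine we defined above.

First we define a QueryEngineTool for the sub question query engine: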

query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description="useful for when you want to answer queries that require analyzing multiple SEC 10-K documents for Uber",
    ),
)

Then, we combine the Tools we defined above into a single list of tools for the agent:

tools = individual_query_engine_tools + [query_engine_tool]

Finally, we call OpenAIAgent.from_tools to create the agent, passing in the list of tools we defined above.

from llama_index.agent.openai import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)
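
Conceptually, each chat turn ends in one of two ways: the model replies directly, or it names a tool and the agent dispatches the call by tool name. The toy sketch below imitates that dispatch step. It is only an illustration: in the real agent the LLM makes the choice from the tool names and descriptions via function calling, while the hypothetical `choose_tool` here uses a simple regex on years as a stand-in.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Tool = Callable[[str], str]


def choose_tool(message: str, tools: Dict[str, Tool]) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM's decision.

    Picks a per-year tool if exactly one known year is mentioned, the
    sub-question tool if several are, and no tool otherwise.
    """
    years = {
        y
        for y in re.findall(r"\b(20\d{2})\b", message)
        if f"vector_index_{y}" in tools
    }
    if len(years) == 1:
        return f"vector_index_{years.pop()}", message
    if len(years) > 1 and "sub_question_query_engine" in tools:
        return "sub_question_query_engine", message
    return None


def chat(message: str, tools: Dict[str, Tool]) -> str:
    """Dispatch one message: call the chosen tool by name, or reply directly."""
    choice = choose_tool(message, tools)
    if choice is None:
        return "direct reply (no tool used)"
    name, tool_input = choice
    return f"{name} -> {tools[name](tool_input)}"


# Stub tools keyed by the same names used in this tutorial.
tools: Dict[str, Tool] = {
    "vector_index_2020": lambda q: "2020 answer",
    "vector_index_2021": lambda q: "2021 answer",
    "sub_question_query_engine": lambda q: "cross-year answer",
}
print(chat("hi, i am bob", tools))
print(chat("Biggest risk factors in 2020?", tools))
print(chat("Compare 2020 and 2021 risk factors", tools))
```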

Testing the Agent#

We can now test the agent with various queries.

If we test it with a simple "hello" query, the agent does not use any Tools.

response = agent.chat("hi, i am bob")
print(str(response))
Hello Bob! How can I assist you today?

If we test it with a query regarding the 10-K of a given year, the agent will use the relevant vector index Tool.

response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

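=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "biggest risk factors"
}
Got output: The biggest risk factors mentioned in the context are:
1. The adverse impact of the COVID-19 pandemic and actions taken to mitigate it on the business.
2. The potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries, with low-cost alternatives and well-capitalized competitors.
4. The need to lower fares or service fees and offer driver incentives and consumer discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. The risk of not attracting or maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges related to the workplace culture and forward-leaning approach.
8. The potential negative impact of international investments and the challenges of conducting business in foreign countries.
9. Risks associated with operational and compliance challenges, localization, laws and regulations, competition, social acceptance, technological compatibility, improper business practices, liability uncertainty, managing international operations, currency fluctuations, cash transactions, tax consequences, and payment fraud.
========================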
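Some of the biggest risk factors for Uber in 2020 were:

1. The adverse impact of the COVID-19 pandemic and actions taken to mitigate it on the business.
2. The potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries, with low-cost alternatives and well-capitalized competitors.
4. The need to lower fares or service fees and offer driver incentives and consumer discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. The risk of not attracting or maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges related to the workplace culture and forward-leaning approach.
8. The potential negative impact of international investments and the challenges of conducting business in foreign countries.
9. Risks associated with operational and compliance challenges, localization, laws and regulations, competition, social acceptance, technological compatibility, improper business practices, liability uncertainty, managing international operations, currency fluctuations, cash transactions, tax consequences, and payment fraud.

These risk factors highlight the challenges and uncertainties that Uber faced in 2020.

Finally, if we test it with a query to compare/contrast risk factors across years, the agent will use the Sub Question Query Engine Tool.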

cross_query_str = "Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points."

response = agent.chat(cross_query_str)
print(str(response))
=== Calling Function ===
Calling function: sub_question_query_engine with args: {
  "input": "Compare/contrast the risk factors described in the Uber 10-K across years"
}
Generated 4 sub questions.
[vector_index_2022] Q: What are the risk factors described in the 2022 SEC 10-K for Uber?
[vector_index_2021] Q: What are the risk factors described in the 2021 SEC 10-K for Uber?
[vector_index_2020] Q: What are the risk factors described in the 2020 SEC 10-K for Uber?
[vector_index_2019] Q: What are the risk factors described in the 2019 SEC 10-K for Uber?
[vector_index_2021] A: The risk factors described in the 2021 SEC 10-K for Uber include the adverse impact of the COVID-19 pandemic on their business, the potential reclassification of drivers as employees instead of independent contractors, intense competition in the mobility, delivery, and logistics industries, the need to lower fares and offer incentives to remain competitive, significant losses incurred by the company, the importance of attracting and maintaining a critical mass of platform users, and the ongoing legal challenges regarding driver classification.
[vector_index_2020] A: The risk factors described in the 2020 SEC 10-K for Uber include the adverse impact of the COVID-19 pandemic on their business, the potential reclassification of drivers as employees instead of independent contractors, intense competition in the mobility, delivery, and logistics industries, the need to lower fares and offer incentives to remain competitive, significant losses and the uncertainty of achieving profitability, the importance of attracting and retaining a critical mass of drivers and users, and the challenges associated with their workplace culture and operational compliance.
[vector_index_2022] A: The risk factors described in the 2022 SEC 10-K for Uber include the potential adverse effect on their business if drivers were classified as employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, the need to lower fares or service fees to remain competitive in certain markets, the company's history of significant losses and the expectation of increased operating expenses in the future, and the potential impact on their platform if they are unable to attract or maintain a critical mass of drivers, consumers, merchants, shippers, and carriers.
[vector_index_2019] A: The risk factors described in the 2019 SEC 10-K for Uber include the loss of their license to operate in London, the complexity of their business and operating model due to regulatory uncertainties, the potential for additional regulations for their other products in the Other Bets segment, the evolving laws and regulations regarding the development and deployment of autonomous vehicles, and the increasing number of data protection and privacy laws around the world. Additionally, there are legal proceedings, litigation, claims, and government investigations that Uber is involved in, which could impose a burden on management and employees and come with defense costs or unfavorable rulings.
Got output: The risk factors described in the Uber 10-K reports across the years include the potential reclassification of drivers as employees instead of independent contractors, intense competition in the mobility, delivery, and logistics industries, the need to lower fares and offer incentives to remain competitive, significant losses incurred by the company, the importance of attracting and maintaining a critical mass of platform users, and the ongoing legal challenges regarding driver classification. Additionally, there are specific risk factors mentioned in each year's report, such as the adverse impact of the COVID-19 pandemic in 2020 and 2021, the loss of their license to operate in London in 2019, and the evolving laws and regulations regarding autonomous vehicles in 2019. Overall, while there are some similarities in the risk factors mentioned, there are also specific factors that vary across the years.
========================
=== Calling Function ===
Calling function: vector_index_2022 with args: {
  "input": "risk factors"
}
Got output: Some of the risk factors mentioned in the context include the potential adverse effect on the business if drivers were classified as employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, the need to lower fares or service fees to remain competitive, the company's history of significant losses and the expectation of increased operating expenses, the impact of future pandemics or disease outbreaks on the business and financial results, and the potential harm to the business due to economic conditions and their effect on discretionary consumer spending.
========================
=== Calling Function ===
Calling function: vector_index_2021 with args: {
  "input": "risk factors"
}
Got output: The COVID-19 pandemic and the impact of actions to mitigate the pandemic have adversely affected and may continue to adversely affect parts of our business. Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors. The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region. To remain competitive in certain markets, we have in the past lowered, and may continue to lower, fares or service fees, and we have in the past offered, and may continue to offer, significant Driver incentives and consumer discounts and promotions. We have incurred significant losses since inception, including in the United States and other major markets. We expect our operating expenses to increase significantly in the foreseeable future, and we may not achieve or maintain profitability. If we are unable to attract or maintain a critical mass of Drivers, consumers, merchants, shippers, and carriers, whether as a result of competition or other factors, our platform will become less appealing to platform users.
========================
=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "risk factors"
}
Got output: The risk factors mentioned in the context include the adverse impact of the COVID-19 pandemic on the business, the potential reclassification of drivers as employees, the highly competitive nature of the mobility, delivery, and logistics industries, the need to lower fares or service fees to remain competitive, the company's history of significant losses and potential future expenses, the importance of attracting and maintaining a critical mass of platform users, and the operational and cultural challenges faced by the company.
========================
=== Calling Function ===
Calling function: vector_index_2019 with args: {
  "input": "risk factors"
}
Got output: The risk factors mentioned in the context include competition with local companies, differing levels of social acceptance, technological compatibility issues, exposure to improper business practices, legal uncertainty, difficulties in managing international operations, fluctuations in currency exchange rates, regulations governing local currencies, tax consequences, financial accounting burdens, difficulties in implementing financial systems, import and export restrictions, political and economic instability, public health concerns, reduced protection for intellectual property rights, limited influence over minority-owned affiliates, and regulatory complexities. These risk factors could adversely affect the international operations, business, financial condition, and operating results of the company.
========================
Here is a comparison of the risk factors described in the Uber 10-K reports across years:

2022 Risk Factors:
- Potential adverse effect if drivers were classified as employees instead of independent contractors.
- Highly competitive nature of the mobility, delivery, and logistics industries.
- Need to lower fares or service fees to remain competitive.
- History of significant losses and expectation of increased operating expenses.
- Impact of future pandemics or disease outbreaks on the business and financial results.
- Potential harm to the business due to economic conditions and their effect on discretionary consumer spending.

2021 Risk Factors:
- Adverse impact of the COVID-19 pandemic and actions to mitigate it on the business.
- Potential reclassification of drivers as employees instead of independent contractors.
- Highly competitive nature of the mobility, delivery, and logistics industries.
- Need to lower fares or service fees and offer incentives to remain competitive.
- History of significant losses and uncertainty of achieving profitability.
- Importance of attracting and maintaining a critical mass of platform users.

2020 Risk Factors:
- Adverse impact of the COVID-19 pandemic on the business.
- Potential reclassification of drivers as employees.
- Highly competitive nature of the mobility, delivery, and logistics industries.
- Need to lower fares or service fees to remain competitive.
- History of significant losses and potential future expenses.
- Importance of attracting and maintaining a critical mass of platform users.
- Operational and cultural challenges faced by the company.

2019 Risk Factors:
- Competition with local companies.
- Differing levels of social acceptance.
- Technological compatibility issues.
- Exposure to improper business practices.
- Legal uncertainty.
- Difficulties in managing international operations.
- Fluctuations in currency exchange rates.
- Regulations governing local currencies.
- Tax consequences.
- Financial accounting burdens.
- Difficulties in implementing financial systems.
- Import and export restrictions.
- Political and economic instability.
- Public health concerns.
- Reduced protection for intellectual property rights.
- Limited influence over minority-owned affiliates.
- Regulatory complexities.

These comparisons highlight both common and unique risk factors that Uber faced in different years.

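Setting up the Chatbot Loop#

Now that we have the chatbot set up, it only takes a few more steps to set up a basic interactive loop to chat with our SEC-augmented chatbot!

agent = OpenAIAgent.from_tools(tools)  # verbose=False by default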

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = agent.chat(text_input)
    print(f"Agent: {response}")

Here's an example of the loop in action:

User:  What were some of the legal proceedings against Uber in 2022?
Agent: In 2022, Uber faced several legal proceedings. Some of the notable ones include:

1. Petition against Proposition 22: A petition was filed in California alleging that Proposition 22, which classifies app-based drivers as independent contractors, is unconstitutional.

2. Lawsuit by Massachusetts Attorney General: The Massachusetts Attorney General filed a lawsuit against Uber, claiming that drivers should be classified as employees and entitled to protections under wage and labor laws.

3. Allegations by New York Attorney General: The New York Attorney General made allegations against Uber regarding the misclassification of drivers and related employment violations.

4. Swiss social security rulings: Swiss social security rulings classified Uber drivers as employees, which could have implications for Uber's operations in Switzerland.

5. Class action lawsuits in Australia: Uber faced class action lawsuits in Australia, with allegations that the company conspired to harm participants in the taxi, hire-car, and limousine industries.

It's important to note that the outcomes of these legal proceedings are uncertain and may vary.

User:
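
The loop above reads from `input()` directly, which makes it hard to test and stops only on the exact string "exit". One possible variation is sketched below: the inputs and the output function are passed in, blank lines are skipped, and exit words are matched case-insensitively. `run_chat_loop` is a hypothetical helper, and the agent is replaced by a stub so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def run_chat_loop(
    chat_fn: Callable[[str], str],
    inputs: Iterable[str],
    emit: Callable[[str], None] = print,
    exit_words: Tuple[str, ...] = ("exit", "quit"),
) -> int:
    """Feed each input to `chat_fn` until an exit word; return the turn count."""
    turns = 0
    for raw in inputs:
        text = raw.strip()
        if not text:
            continue  # skip blank lines instead of sending them to the agent
        if text.lower() in exit_words:
            break
        emit(f"Agent: {chat_fn(text)}")
        turns += 1
    return turns


# With the real agent this might be driven interactively as:
#   run_chat_loop(lambda t: str(agent.chat(t)), iter(lambda: input("User: "), None))
# Here a stub chat function (upper-casing the text) stands in for the agent.
collected: List[str] = []
turns = run_chat_loop(
    lambda t: t.upper(), ["hello", "  ", "EXIT", "ignored"], emit=collected.append
)
print(collected, turns)
```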

Notebook#

Take a look at our corresponding notebook.