
Financial Document Analysis with LlamaIndex


In this example notebook, we show how to use the LlamaIndex framework to perform financial analysis on 10-K filings with just a few lines of code.

Notebook Outline

Introduction

LlamaIndex

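LlamaIndex is a data framework for LLM applications. With just a few lines of code, you can build a retrieval-augmented generation (RAG) system in minutes. For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing data, modules for retrieval and re-ranking, and composable components for building custom query engines.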

For more details, see the full documentation.

Financial Analysis over 10-K Documents

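A key part of a financial analyst's job is extracting information and synthesizing insight from long financial documents. A good example is Form 10-K, the annual report required by the U.S. Securities and Exchange Commission (SEC), which gives a comprehensive summary of a company's financial performance. These documents often run to hundreds of pages and contain domain-specific terminology, which makes them hard for a layperson to digest quickly.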

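We show how LlamaIndex can help a financial analyst quickly extract information and synthesize insights across multiple documents with very little code.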

Setup

To begin, we install the llama-index library, along with pypdf for parsing the PDF files.

!pip install llama-index pypdf

Now, we import all the modules used in this tutorial.

from langchain import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

Before we start, we configure the LLM provider and model that will power our RAG system.
Here, we pick gpt-3.5-turbo-instruct from OpenAI.

llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", max_tokens=-1)

We construct a ServiceContext and set it as the global default, so that all subsequent operations that depend on LLM calls use the model we configured here.

service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

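Data Loading and Indexing

Now, we load and parse two PDFs (one is Uber's 10-K for 2021, the other is Lyft's 10-K for 2021).
Under the hood, each PDF is converted into plain-text Document objects, one per page.

Note: this operation might take a while to run, since each document is more than 100 pages long.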

lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

print(f'Loaded lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages

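Now, we can build an (in-memory) VectorStoreIndex over the documents we loaded.

Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings for the document chunks.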

lyft_index = VectorStoreIndex.from_documents(lyft_docs)
uber_index = VectorStoreIndex.from_documents(uber_docs)
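To make "document chunks" concrete, here is a minimal sketch of splitting one page of text into overlapping chunks before embedding. It assumes a simple word-count splitter; the function name `split_into_chunks`, the sample sentence, and the chunk sizes are made up for illustration, and the splitter and defaults that LlamaIndex actually uses may differ.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Made-up sample text standing in for one page of a filing.
page_text = (
    "Revenue increased in 2021 as ride volume recovered "
    "from the pandemic lows of 2020"
)
chunks = split_into_chunks(page_text, chunk_size=6, overlap=2)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which tends to help retrieval.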

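Simple QA

This section runs simple question-answering queries against each index.

Now we are ready to run some queries against our indices!
To do so, we first configure a QueryEngine, which simply captures a set of configurations for how we want to query the underlying index.

For a VectorStoreIndex, the most common configuration to adjust is similarity_top_k, which controls how many document chunks (which we call Node objects) are retrieved and used as context for answering the question.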

lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

uber_engine = uber_index.as_query_engine(similarity_top_k=3)
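As a rough illustration of what similarity_top_k controls, the sketch below ranks a few chunks by cosine similarity to a query vector and keeps the best k. It is a toy under stated assumptions, not LlamaIndex's implementation: the three-dimensional vectors and chunk names are invented, whereas real embeddings come from the embedding model and have far more dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy embeddings; the names are hypothetical chunk labels.
toy_chunks = {
    "revenue_table": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.1],
    "revenue_discussion": [0.8, 0.2, 0.1],
    "legal_proceedings": [0.0, 0.3, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedded question

results = top_k(query, toy_chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

A larger k gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including irrelevant chunks.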

Let's see some queries in action!

response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')

print(response)


$3,208.3 million (page 63)

response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

print(response)


$17,455 (page 53)

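Advanced QA - Compare and Contrast

More complex financial analysis often requires referencing multiple documents.

As an example, let's look at how to run compare-and-contrast queries over both the Lyft and Uber financials. For this, we build a SubQuestionQueryEngine, which breaks a complex compare-and-contrast query into simpler sub-questions, each executed on the sub-query engine backed by the corresponding index.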

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
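The control flow of sub-question decomposition can be sketched without any LLM calls. In the toy below, every name (`decompose`, `answer`, the stub engines) is hypothetical, and the decomposition and synthesis steps are hard-coded stand-ins for what the real engine delegates to the LLM. It is only meant to show the shape: split the question, route each part to a named tool, then combine the answers.

```python
from typing import Callable, Dict, List, Tuple


def lyft_stub(question: str) -> str:
    """Stand-in for a query engine backed by the Lyft index."""
    return "stub answer from the Lyft filing for: " + question


def uber_stub(question: str) -> str:
    """Stand-in for a query engine backed by the Uber index."""
    return "stub answer from the Uber filing for: " + question


# Tool names mirror the ToolMetadata names used in this notebook.
tools: Dict[str, Callable[[str], str]] = {
    "lyft_10k": lyft_stub,
    "uber_10k": uber_stub,
}


def decompose(question: str, tool_names: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for the LLM step: one sub-question per tool."""
    return [(name, f"{question} (restricted to {name})") for name in tool_names]


def answer(question: str, available_tools: Dict[str, Callable[[str], str]]) -> str:
    """Route each sub-question to its tool, then join the sub-answers."""
    sub_questions = decompose(question, list(available_tools))
    sub_answers = [(name, available_tools[name](sub_q)) for name, sub_q in sub_questions]
    # A real engine would ask the LLM to synthesize; here we simply concatenate.
    return "\n".join(f"[{name}] {text}" for name, text in sub_answers)


combined = answer("Compare revenue growth from 2020 to 2021", tools)
print(combined)
```

In the real engine, the tool descriptions matter because the LLM reads them when deciding which tool should receive each sub-question.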

Let's see these queries in action!

response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[uber_10k] Q: What customer segments grew the fastest for Uber
[uber_10k] A: in 2021?

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.
[uber_10k] Q: What geographies grew the fastest for Uber
[uber_10k] A:
Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.
[lyft_10k] Q: What customer segments grew the fastest for Lyft
[lyft_10k] A:
The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.
[lyft_10k] Q: What geographies grew the fastest for Lyft
[lyft_10k] A:
It is not possible to answer this question with the given context information.

print(response)


The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.

The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 2 sub questions.
[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021
[uber_10k] A:
The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis.
[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021
[lyft_10k] A:
The revenue growth of Lyft from 2020 to 2021 is 36%, increasing from $2,364,681 thousand to $3,208,323 thousand.

print(response)


The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis, while the revenue growth of Lyft from 2020 to 2021 was 36%. This means that Uber had a higher revenue growth than Lyft from 2020 to 2021.
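Since LLM answers can contain arithmetic mistakes, it is worth sanity-checking reported figures where the inputs are available. The short sketch below recomputes Lyft's growth rate from the two revenue figures quoted in the sub-answer above; the helper name `growth_rate` is our own. Uber's figure is not rechecked here because the output above does not include its 2020 revenue.

```python
def growth_rate(previous: float, current: float) -> float:
    """Year-over-year growth as a fraction, e.g. 0.25 for 25%."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous


# Revenue in thousands of USD, as quoted in the lyft_10k sub-answer above.
lyft_2020 = 2_364_681
lyft_2021 = 3_208_323

lyft_growth = growth_rate(lyft_2020, lyft_2021)
print(f"Lyft revenue growth 2020 to 2021: {lyft_growth:.1%}")  # prints 35.7%
```

The result rounds to 36%, which agrees with the figure in the generated answer.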