OpenAI Assistant Advanced Retrieval Cookbook¶
In this notebook, we try out the OpenAI Assistants API for advanced retrieval tasks, by plugging in a variety of query engine tools and datasets. The wrapper abstraction we use is our OpenAIAssistantAgent
class, which allows us to plug in custom tools. We explore how OpenAIAssistant
can complement/replace existing workflows solved by our retrievers/query engines through its agent execution + function calling loop.
- Joint QA + Summarization
- Auto retrieval
- Joint SQL and vector search
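Before diving in, the agent execution + function calling loop mentioned above can be sketched in plain Python. This is a toy illustration, not the Assistants API: `pick_tool` is a hypothetical keyword heuristic standing in for the LLM's tool-choice step, and the two tool functions return placeholder strings.

```python
# Toy sketch of an agent loop: the "LLM" step is faked with a keyword
# heuristic; a real assistant chooses the tool and its arguments itself.
from typing import Callable, Dict


def summary_tool(query: str) -> str:
    return f"[summary of corpus for: {query}]"


def vector_tool(query: str) -> str:
    return f"[top-k chunks relevant to: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "summary_tool": summary_tool,
    "vector_tool": vector_tool,
}


def pick_tool(query: str) -> str:
    # stand-in for the LLM's function-calling decision
    return "summary_tool" if "summar" in query.lower() else "vector_tool"


def agent_chat(query: str) -> str:
    tool_name = pick_tool(query)
    tool_output = TOOLS[tool_name](query)
    # a real agent would feed tool_output back to the LLM for a final answer
    return f"{tool_name} -> {tool_output}"
```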
In [ ]:
%pip install llama-index-agent-openai
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-readers-wikipedia
%pip install llama-index-llms-openai
In [ ]:
!pip install llama-index
In [ ]:
import nest_asyncio

nest_asyncio.apply()
Joint QA and Summarization¶
In this section we show how to have the assistant agent answer both fact-based questions and summarization questions. This is something that the built-in retrieval tool struggles to accomplish.
Load Data¶
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-11 09:40:13-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘data/paul_graham/paul_graham_essay.txt’ data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.009s 2023-11-11 09:40:14 (8.24 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
In [ ]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
Setup Vector + Summary Indexes/Query Engines/Tools¶
In [ ]:
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core import SummaryIndex

# initialize settings (set chunk size)
Settings.llm = OpenAI()
Settings.chunk_size = 1024
nodes = Settings.node_parser.get_nodes_from_documents(documents)

# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# define summary index and vector index over the same data
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

# define query engines
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()
In [ ]:
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    name="summary_tool",
    description=(
        "Useful for summarization questions related to the author's life"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context to answer specific questions about the author's life"
    ),
)
Define Assistant Agent¶
In [ ]:
from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="QA bot",
    instructions="You are a bot designed to answer questions about the author",
    openai_tools=[],
    tools=[summary_tool, vector_tool],
    verbose=True,
    run_retrieve_sleep_time=1.0,
)
Results: A bit flaky¶
In [ ]:
response = agent.chat("Can you give me a summary about the author's life?")
print(str(response))
=== Calling Function === Calling function: summary_tool with args: {"input":"Can you give me a summary about the author's life?"} Got output: The author, Paul Graham, had a strong interest in writing and programming from a young age. They started writing short stories and experimenting with programming in high school. In college, they initially studied philosophy but switched to studying artificial intelligence. However, they realized that the AI being practiced at the time was not going to lead to true understanding of natural language. This led them to focus on Lisp programming and eventually write a book about Lisp hacking. Despite being in a PhD program in computer science, the author also developed a passion for art and decided to pursue it further. They attended the Accademia di Belli Arti in Florence but found that it did not teach them much. They then returned to the US and got a job at a software company. Afterward, they attended the Rhode Island School of Design but dropped out due to the focus on developing a signature style rather than teaching the fundamentals of art. They then moved to New York City and became interested in the World Wide Web, eventually starting a company called Viaweb. They later founded Y Combinator, an investment firm, and created Hacker News. ======================== Paul Graham is an author with eclectic interests and a varied career path. He began with interests in writing and programming, engaged in philosophy and artificial intelligence during college, and authored a book on Lisp programming. With an equally strong passion for art, he studied at the Accademia di Belli Arti in Florence and briefly at the Rhode Island School of Design before immersing himself in the tech industry by starting Viaweb and later founding the influential startup accelerator Y Combinator. He also created Hacker News, a social news website focused on computer science and entrepreneurship. 
Graham's life reflects a blend of technology, entrepreneurship, and the arts.
In [ ]:
response = agent.query("What did the author do after RICS?")
print(str(response))
=== Calling Function === Calling function: vector_tool with args: {"input":"After RICS"} Got output: After RICS, the author moved back to Providence to continue at RISD. However, it became clear that art school, specifically the painting department, did not have the same relationship to art as medical school had to medicine. Painting students were expected to express themselves and develop a distinctive signature style. ======================== After the author's time at the Royal Institution of Chartered Surveyors (RICS), they moved back to Providence to continue their studies at the Rhode Island School of Design (RISD). There, the author noted a significant difference in the educational approaches between RISD and medical school, specifically in the painting department. At RISD, students were encouraged to express themselves and to develop a unique and distinctive signature style in their artwork.
AutoRetrieval from Vector Database¶
Our existing "auto-retrieval" capabilities (in VectorIndexAutoRetriever
) allow an LLM to infer the right query parameters for a vector database, including both the query string and metadata filters.
Since the Assistants API can call functions and infer function parameters, we explore its capabilities in performing auto-retrieval here.
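The filtering half of auto-retrieval can be illustrated without a vector store: given an inferred query and a set of exact-match metadata filters, records failing any filter are dropped before similarity search. A minimal stand-in over plain dicts (`RECORDS` and `apply_filters` are hypothetical, for illustration only):

```python
# Apply inferred exact-match metadata filters to a toy record set.
# In the real flow, the LLM infers the filters and the query string,
# and the vector store applies them before similarity search.
from typing import Dict, List

RECORDS = [
    {"text": "Michael Jordan ...", "category": "Sports", "country": "United States"},
    {"text": "Rihanna ...", "category": "Music", "country": "Barbados"},
    {"text": "Cristiano Ronaldo ...", "category": "Sports", "country": "Portugal"},
]


def apply_filters(records: List[Dict], filters: Dict[str, str]) -> List[Dict]:
    # keep only records matching every (key, value) filter exactly
    return [
        r for r in records if all(r.get(k) == v for k, v in filters.items())
    ]
```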
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
import pinecone
import os

api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp")
/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console) from tqdm.autonotebook import tqdm
In [ ]:
# dimensions are for text-embedding-ada-002
try:
    pinecone.create_index(
        "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
    )
except Exception:
    # most likely index already exists
    pass
In [ ]:
pinecone_index = pinecone.Index("quickstart")
In [ ]:
# Optional: delete data in your pinecone index
pinecone_index.delete(deleteAll=True, namespace="test")
Out[ ]:
{}
In [ ]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
In [ ]:
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=(
            "Michael Jordan is a retired professional basketball player,"
            " widely regarded as one of the greatest basketball players of all"
            " time."
        ),
        metadata={
            "category": "Sports",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Angelina Jolie is an American actress, filmmaker, and"
            " humanitarian. She has received numerous awards for her acting"
            " and is known for her philanthropic work."
        ),
        metadata={
            "category": "Entertainment",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Elon Musk is a business magnate, industrial designer, and"
            " engineer. He is the founder, CEO, and lead designer of SpaceX,"
            " Tesla, Inc., Neuralink, and The Boring Company."
        ),
        metadata={
            "category": "Business",
            "country": "United States",
        },
    ),
    TextNode(
        text=(
            "Rihanna is a Barbadian singer, actress, and businesswoman. She"
            " has achieved significant success in the music industry and is"
            " known for her versatile musical style."
        ),
        metadata={
            "category": "Music",
            "country": "Barbados",
        },
    ),
    TextNode(
        text=(
            "Cristiano Ronaldo is a Portuguese professional footballer who is"
            " considered one of the greatest football players of all time. He"
            " has won numerous awards and set multiple records during his"
            " career."
        ),
        metadata={
            "category": "Sports",
            "country": "Portugal",
        },
    ),
]
In [ ]:
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
In [ ]:
index = VectorStoreIndex(nodes, storage_context=storage_context)
Upserted vectors: 0%| | 0/5 [00:00<?, ?it/s]
Define Function Tool¶
Here we define the function interface, which is passed to OpenAI to perform auto-retrieval.
We weren't able to get OpenAI to work with nested pydantic objects or tuples as arguments, so we converted the metadata filter keys and values into lists for the function API to work with.
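The round-trip for that workaround is easy to verify in isolation: two flat, parallel lists of strings zip back into the (key, value) pairs that the filter objects are built from. A small hypothetical helper, not part of the LlamaIndex API:

```python
from typing import Dict, List


def lists_to_filter_pairs(
    keys: List[str], values: List[str]
) -> List[Dict[str, str]]:
    """Reassemble flat parallel lists into (key, value) filter pairs."""
    if len(keys) != len(values):
        raise ValueError("filter key/value lists must be the same length")
    return [{"key": k, "value": v} for k, v in zip(keys, values)]
```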
In [ ]:
# define function tool
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import (
    VectorStoreInfo,
    MetadataInfo,
    ExactMatchFilter,
    MetadataFilters,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

from typing import List, Tuple, Any
from pydantic import BaseModel, Field


# hardcode top k for now
top_k = 3

# define vector store info describing schema of vector store
vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description=(
                "Category of the celebrity, one of [Sports, Entertainment,"
                " Business, Music]"
            ),
        ),
        MetadataInfo(
            name="country",
            type="str",
            description=(
                "Country of the celebrity, one of [United States, Barbados,"
                " Portugal]"
            ),
        ),
    ],
)


# define pydantic model for auto-retrieval function
class AutoRetrieveModel(BaseModel):
    query: str = Field(..., description="natural language query string")
    filter_key_list: List[str] = Field(
        ..., description="List of metadata filter field names"
    )
    filter_value_list: List[str] = Field(
        ...,
        description=(
            "List of metadata filter field values (corresponding to names"
            " specified in filter_key_list)"
        ),
    )


def auto_retrieve_fn(
    query: str, filter_key_list: List[str], filter_value_list: List[str]
):
    """Auto retrieval function.

    Performs auto-retrieval from a vector database, and then applies a set of filters.

    """
    query = query or "Query"

    exact_match_filters = [
        ExactMatchFilter(key=k, value=v)
        for k, v in zip(filter_key_list, filter_value_list)
    ]
    retriever = VectorIndexRetriever(
        index,
        filters=MetadataFilters(filters=exact_match_filters),
        top_k=top_k,
    )
    results = retriever.retrieve(query)
    return [r.get_content() for r in results]


description = f"""\
Use this tool to look up biographical information about celebrities.
The vector database schema is given below:
{vector_store_info.json()}
"""

auto_retrieve_tool = FunctionTool.from_defaults(
    fn=auto_retrieve_fn,
    name="celebrity_bios",
    description=description,
    fn_schema=AutoRetrieveModel,
)
In [ ]:
auto_retrieve_fn(
    "celebrity from the United States",
    filter_key_list=["country"],
    filter_value_list=["United States"],
)
Out[ ]:
['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.']
Initialize Agent¶
In [ ]:
from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="Celebrity bot",
    instructions="You are a bot designed to answer questions about celebrities.",
    tools=[auto_retrieve_tool],
    verbose=True,
)
In [ ]:
response = agent.chat("Tell me about two celebrities from the United States. ")
print(str(response))
=== Calling Function === Calling function: celebrity_bios with args: {"query": "celebrity from United States", "filter_key_list": ["country"], "filter_value_list": ["United States"]} Got output: ['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.'] ======================== === Calling Function === Calling function: celebrity_bios with args: {"query": "celebrity from United States", "filter_key_list": ["country"], "filter_value_list": ["United States"]} Got output: ['Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', 'Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.'] ======================== Here is some information about two celebrities from the United States: 1. Angelina Jolie - Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work. Over the years, Jolie has starred in several critically acclaimed and commercially successful films, and she has also been involved in various humanitarian causes, advocating for refugees and children's education, among other things. 2. Michael Jordan - Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time. During his career, Jordan dominated the NBA with his scoring ability, athleticism, and competitiveness. He won six NBA championships with the Chicago Bulls and earned the NBA Most Valuable Player Award five times. Jordan has also been a successful businessman and the principal owner of the Charlotte Hornets basketball team. 
Both figures have made significant impacts in their respective fields and continue to be influential even after reaching the peaks of their careers.
Joint Text-to-SQL and Semantic Search¶
This is currently handled by our SQLAutoVectorQueryEngine
.
Let's try implementing this by giving our OpenAIAssistantAgent
access to two query tools: SQL and vector search.
NOTE: Any text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. It is recommended to take precautions as needed, such as using restricted roles, read-only databases, sandboxing, etc.
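One of those precautions, a read-only database, can be demonstrated with the standard library alone: SQLite accepts `mode=ro` in a URI-style connection string, after which any write raises `sqlite3.OperationalError`. A sketch using a throwaway file:

```python
import os
import sqlite3
import tempfile

# build a throwaway database file first
path = os.path.join(tempfile.mkdtemp(), "city.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE city_stats (city_name TEXT, population INTEGER)")
conn.execute("INSERT INTO city_stats VALUES ('Tokyo', 13960000)")
conn.commit()
conn.close()

# reopen the same file read-only: SELECTs work, writes fail
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
rows = ro.execute("SELECT * FROM city_stats").fetchall()
try:
    ro.execute("DELETE FROM city_stats")
    writable = True
except sqlite3.OperationalError:
    writable = False
ro.close()
```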
Load and Index Structured Data¶
We load sample structured datapoints into a SQL db and index it.
In [ ]:
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
    column,
)
from llama_index.core import SQLDatabase
from llama_index.core.indices import SQLStructStoreIndex

engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
In [ ]:
# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
In [ ]:
# print tables
metadata_obj.tables.keys()
Out[ ]:
dict_keys(['city_stats'])
In [ ]:
from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)
In [ ]:
with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
    print(cursor.fetchall())
[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Berlin', 3645000, 'Germany')]
In [ ]:
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
In [ ]:
from llama_index.core.query_engine import NLSQLTableQueryEngine
In [ ]:
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)
Load and Index Unstructured Data¶
We load unstructured data into a vector index backed by Pinecone.
In [ ]:
# install wikipedia python package
!pip install wikipedia
Requirement already satisfied: wikipedia in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (1.4.0) Requirement already satisfied: requests<3.0.0,>=2.0.0 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from wikipedia) (2.28.2) Requirement already satisfied: beautifulsoup4 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from wikipedia) (4.12.2) Requirement already satisfied: charset-normalizer<4,>=2 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.1.0) Requirement already satisfied: idna<4,>=2.5 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.4) Requirement already satisfied: certifi>=2017.4.17 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2022.12.7) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (1.26.15) Requirement already satisfied: soupsieve>1.2 in /Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages (from beautifulsoup4->wikipedia) (2.4.1) [notice] A new release of pip available: 22.3.1 -> 23.1.2 [notice] To update, run: pip install --upgrade pip
In [ ]:
Copied!
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
In [ ]:
cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = WikipediaReader().load_data(pages=cities)
In [ ]:
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.llms.openai import OpenAI

# define node parser and LLM
Settings.chunk_size = 1024
Settings.llm = OpenAI(temperature=0, model="gpt-4")
text_splitter = TokenTextSplitter(chunk_size=1024)

# use default in-memory store
storage_context = StorageContext.from_defaults()
vector_index = VectorStoreIndex([], storage_context=storage_context)
In [ ]:
# Insert documents into vector index
# Each document has metadata of the city attached
for city, wiki_doc in zip(cities, wiki_docs):
    nodes = text_splitter.get_nodes_from_documents([wiki_doc])
    # add metadata to each node
    for node in nodes:
        node.metadata = {"title": city}
    vector_index.insert_nodes(nodes)
Define Query Engines / Tools¶
In [ ]:
from llama_index.core.tools import QueryEngineTool
In [ ]:
sql_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="sql_tool",
    description=(
        "Useful for translating a natural language query into a SQL query over"
        " a table containing: city_stats, containing the population/country of"
        " each city"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=2),
    name="vector_tool",
    description=(
        "Useful for answering semantic questions about different cities"
    ),
)
Initialize Agent¶
In [ ]:
from llama_index.agent.openai import OpenAIAssistantAgent

agent = OpenAIAssistantAgent.from_new(
    name="City bot",
    instructions="You are a bot designed to answer questions about cities (both unstructured and structured data)",
    tools=[sql_tool, vector_tool],
    verbose=True,
)
In [ ]:
response = agent.chat(
    "Tell me about the arts and culture of the city with the highest"
    " population"
)
print(str(response))
=== Calling Function === Calling function: sql_tool with args: {"input":"SELECT name, country FROM city_stats ORDER BY population DESC LIMIT 1"} Got output: The city with the highest population is Tokyo, Japan. ======================== === Calling Function === Calling function: vector_tool with args: {"input":"What are the arts and culture like in Tokyo, Japan?"} Got output: Tokyo has a vibrant arts and culture scene. The city is home to many museums, including the Tokyo National Museum, which specializes in traditional Japanese art, the National Museum of Western Art, and the Edo-Tokyo Museum. There are also theaters for traditional forms of Japanese drama, such as the National Noh Theatre and the Kabuki-za. Tokyo hosts modern Japanese and international pop and rock music concerts, and the New National Theater Tokyo is a hub for opera, ballet, contemporary dance, and drama. The city also celebrates various festivals throughout the year, including the Sannō, Sanja, and Kanda Festivals. Additionally, Tokyo is known for its youth style, fashion, and cosplay in the Harajuku neighborhood. ======================== Tokyo, Japan, which has the highest population of any city, boasts a rich and diverse arts and culture landscape. The city is a hub for traditional Japanese art as showcased in prominent institutions like the Tokyo National Museum, and it also features artwork from different parts of the world at the National Museum of Western Art. Tokyo has a deep appreciation for its historical roots, with the Edo-Tokyo Museum presenting the past in a detailed and engaging manner. The traditional performing arts have a significant presence in Tokyo, with theaters such as the National Noh Theatre presenting classical Noh dramas and the iconic Kabuki-za offering enchanting Kabuki performances. For enthusiasts of modern entertainment, Tokyo is a prime spot for contemporary music, including both Japanese pop and rock as well as international acts. 
Opera, ballet, contemporary dance, and drama find a prestigious platform at the New National Theater Tokyo. Tokyo's calendar is filled with a variety of festivals that reflect the city's vibrant cultural heritage, including the Sannō, Sanja, and Kanda Festivals. Additionally, Tokyo is at the forefront of fashion and youth culture, particularly in the Harajuku district, which is famous for its unique fashion, style, and cosplay. This mix of traditional and modern, local and international arts and culture makes Tokyo a dynamic and culturally rich city.
In [ ]:
Copied!
response = agent.chat("Tell me about the history of Berlin")
print(str(response))
response = agent.chat("Tell me about the history of Berlin")
print(str(response))
=== Calling Function === Calling function: vector_tool with args: {"input":"What is the history of Berlin, Germany?"} Got output: Berlin has a rich and diverse history. It was first documented in the 13th century and has served as the capital of various entities throughout history, including the Margraviate of Brandenburg, the Kingdom of Prussia, the German Empire, the Weimar Republic, and Nazi Germany. After World War II, the city was divided, with West Berlin becoming a part of West Germany and East Berlin becoming the capital of East Germany. Following German reunification in 1990, Berlin once again became the capital of all of Germany. Throughout its history, Berlin has been a center of scientific, artistic, and philosophical activity, and has experienced periods of economic growth and cultural flourishing. Today, it is a world city of culture, politics, media, and science, known for its vibrant arts scene, diverse architecture, and high quality of life. ======================== Berlin, the capital city of Germany, has a rich and complex history that stretches back to its first documentation in the 13th century. Throughout the centuries, Berlin has been at the heart of numerous important historical movements and events. Initially a small town, Berlin grew in significance as the capital of the Margraviate of Brandenburg. Later on, it ascended in prominence as the capital of the Kingdom of Prussia. With the unification of Germany, Berlin became the imperial capital of the German Empire, a position it retained until the end of World War I. The interwar period saw Berlin as the capital of the Weimar Republic, and it was during this time that the city became known for its vibrant cultural scene. However, the rise of the Nazi regime in the 1930s led to a dark period in Berlin's history, and the city was heavily damaged during World War II. Following the war's end, Berlin became a divided city. 
The division was physical, represented by the Berlin Wall, and ideological, with West Berlin aligning with democratic West Germany while East Berlin became the capital of the socialist East Germany. The fall of the Berlin Wall in November 1989 was a historic moment, leading to German reunification in 1990. Berlin was once again chosen as the capital of a united Germany. Since reunification, Berlin has undergone massive reconstruction and has become a hub of contemporary culture, politics, media, and science. Today, Berlin celebrates its diverse heritage, from its grand historical landmarks like the Brandenburg Gate and the Reichstag, to its remembrance of the past with monuments such as the Berlin Wall Memorial and the Holocaust Memorial. It is a city known for its cultural dynamism, thriving arts and music scenes, and a high quality of life. Berlin's history has shaped it into a unique world city that continues to play a significant role on the global stage.
In [ ]:
Copied!
response = agent.chat(
"Can you give me the country corresponding to each city?"
)
print(str(response))
response = agent.chat(
"Can you give me the country corresponding to each city?"
)
print(str(response))
=== Calling Function === Calling function: sql_tool with args: {"input":"SELECT name, country FROM city_stats"} Got output: The cities in the city_stats table are Toronto from Canada, Tokyo from Japan, and Berlin from Germany. ======================== Here are the countries corresponding to each city: - Toronto: Canada - Tokyo: Japan - Berlin: Germany