使用LlamaIndex + DuckDB的SQL查询引擎¶
本指南展示了使用DuckDB的核心LlamaIndex SQL功能。
我们将介绍一些核心的LlamaIndex数据结构,包括NLSQLTableQueryEngine
和SQLTableRetrieverQueryEngine
。
注意: 任何文本转SQL应用程序都应该意识到执行任意SQL查询可能存在安全风险。建议采取必要的预防措施,比如使用受限角色、只读数据库、沙盒等。
如果您在colab上打开这个笔记本,您可能需要安装LlamaIndex 🦙。
In [ ]:
Copied!
%pip install llama-index-readers-wikipedia
%pip install llama-index-readers-wikipedia
In [ ]:
Copied!
!pip install llama-index
!pip install llama-index
In [ ]:
Copied!
!pip install duckdb duckdb-engine
!pip install duckdb duckdb-engine
In [ ]:
Copied!
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
Copied!
from llama_index.core import SQLDatabase, SimpleDirectoryReader, Document
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.indices.struct_store import SQLTableRetrieverQueryEngine
from llama_index.core import SQLDatabase, SimpleDirectoryReader, Document
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.indices.struct_store import SQLTableRetrieverQueryEngine
In [ ]:
Copied!
from IPython.display import Markdown, display
from IPython.display import Markdown, display
使用我们的NLSQLTableQueryEngine
进行基本的文本到SQL转换¶
在这个初始的例子中,我们将介绍如何使用一些测试数据填充SQL数据库,并使用我们的文本到SQL功能进行查询。
创建数据库架构 + 测试数据¶
我们使用sqlalchemy,一个流行的SQL数据库工具包,连接到DuckDB并创建一个空的city_stats
表。然后我们用一些测试数据填充它。
In [ ]:
Copied!
from sqlalchemy import (
create_engine,
MetaData,
Table,
Column,
String,
Integer,
select,
column,
)
from sqlalchemy import (
create_engine,
MetaData,
Table,
Column,
String,
Integer,
select,
column,
)
In [ ]:
Copied!
engine = create_engine("duckdb:///:memory:")
# 取消注释以使其与MotherDuck一起工作
# engine = create_engine("duckdb:///md:llama-index")
metadata_obj = MetaData()
engine = create_engine("duckdb:///:memory:")
# 取消注释以使其与MotherDuck一起工作
# engine = create_engine("duckdb:///md:llama-index")
metadata_obj = MetaData()
In [ ]:
Copied!
# 创建城市SQL表
table_name = "city_stats"
city_stats_table = Table(
table_name,
metadata_obj,
Column("city_name", String(16), primary_key=True),
Column("population", Integer),
Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
# 创建城市SQL表
table_name = "city_stats"
city_stats_table = Table(
table_name,
metadata_obj,
Column("city_name", String(16), primary_key=True),
Column("population", Integer),
Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
In [ ]:
Copied!
# 打印表格
metadata_obj.tables.keys()
# 打印表格
metadata_obj.tables.keys()
Out[ ]:
dict_keys(['city_stats'])
我们向city_stats
表中引入一些测试数据
In [ ]:
Copied!
from sqlalchemy import insert
rows = [
{"city_name": "Toronto", "population": 2930000, "country": "Canada"},
{"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
{
"city_name": "Chicago",
"population": 2679000,
"country": "United States",
},
{"city_name": "Seoul", "population": 9776000, "country": "South Korea"},
]
for row in rows:
stmt = insert(city_stats_table).values(**row)
with engine.begin() as connection:
cursor = connection.execute(stmt)
from sqlalchemy import insert
rows = [
{"city_name": "Toronto", "population": 2930000, "country": "Canada"},
{"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
{
"city_name": "Chicago",
"population": 2679000,
"country": "United States",
},
{"city_name": "Seoul", "population": 9776000, "country": "South Korea"},
]
for row in rows:
stmt = insert(city_stats_table).values(**row)
with engine.begin() as connection:
cursor = connection.execute(stmt)
In [ ]:
Copied!
with engine.connect() as connection:
cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
print(cursor.fetchall())
with engine.connect() as connection:
cursor = connection.exec_driver_sql("SELECT * FROM city_stats")
print(cursor.fetchall())
[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Chicago', 2679000, 'United States'), ('Seoul', 9776000, 'South Korea')]
创建SQLDatabase对象¶
我们首先定义我们的SQLDatabase抽象(一个围绕着SQLAlchemy的轻量级封装)。
In [ ]:
Copied!
from llama_index.core import SQLDatabase
from llama_index.core import SQLDatabase
In [ ]:
Copied!
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/duckdb_engine/__init__.py:162: DuckDBEngineWarning: duckdb-engine doesn't yet support reflection on indices warnings.warn(
查询索引¶
在这里,我们演示了NLSQLTableQueryEngine
的功能,它可以执行文本到SQL的转换。
- 我们构建了一个
NLSQLTableQueryEngine
,并传入我们的SQL数据库对象。 - 我们对查询引擎运行查询。
In [ ]:
Copied!
query_engine = NLSQLTableQueryEngine(sql_database)
query_engine = NLSQLTableQueryEngine(sql_database)
In [ ]:
Copied!
response = query_engine.query("Which city has the highest population?")
response = query_engine.query("Which city has the highest population?")
INFO:llama_index.indices.struct_store.sql_query:> Table desc str: Table 'city_stats' has columns: city_name (VARCHAR), population (INTEGER), country (VARCHAR) and foreign keys: . > Table desc str: Table 'city_stats' has columns: city_name (VARCHAR), population (INTEGER), country (VARCHAR) and foreign keys: .
/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/langchain/sql_database.py:238: UserWarning: This method is deprecated - please use `get_usable_table_names`. warnings.warn(
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 332 tokens > [query] Total LLM token usage: 332 tokens INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens > [query] Total embedding token usage: 0 tokens
In [ ]:
Copied!
str(response)
str(response)
Out[ ]:
' Tokyo has the highest population, with 13,960,000 people.'
In [ ]:
Copied!
response.metadata
response.metadata
Out[ ]:
{'result': [('Tokyo', 13960000)], 'sql_query': 'SELECT city_name, population \nFROM city_stats \nORDER BY population DESC \nLIMIT 1;'}
使用我们的SQLTableRetrieverQueryEngine
进行高级文本到SQL转换¶
在本指南中,我们将解决的情况是您的数据库中有大量的表,将所有表模式放入提示可能会导致文本到SQL提示溢出的设置。
我们首先使用我们的ObjectIndex
对模式进行索引,然后在其上使用我们的SQLTableRetrieverQueryEngine
抽象。
In [ ]:
Copied!
engine = create_engine("duckdb:///:memory:")
# 取消注释以使其与MotherDuck一起工作
# engine = create_engine("duckdb:///md:llama-index")
metadata_obj = MetaData()
engine = create_engine("duckdb:///:memory:")
# 取消注释以使其与MotherDuck一起工作
# engine = create_engine("duckdb:///md:llama-index")
metadata_obj = MetaData()
In [ ]:
Copied!
# 创建城市SQL表
table_name = "city_stats"
city_stats_table = Table(
table_name,
metadata_obj,
Column("city_name", String(16), primary_key=True),
Column("population", Integer),
Column("country", String(16), nullable=False),
)
all_table_names = ["city_stats"]
# 创建大量虚拟表
n = 100
for i in range(n):
tmp_table_name = f"tmp_table_{i}"
tmp_table = Table(
tmp_table_name,
metadata_obj,
Column(f"tmp_field_{i}_1", String(16), primary_key=True),
Column(f"tmp_field_{i}_2", Integer),
Column(f"tmp_field_{i}_3", String(16), nullable=False),
)
all_table_names.append(f"tmp_table_{i}")
metadata_obj.create_all(engine)
# 创建城市SQL表
table_name = "city_stats"
city_stats_table = Table(
table_name,
metadata_obj,
Column("city_name", String(16), primary_key=True),
Column("population", Integer),
Column("country", String(16), nullable=False),
)
all_table_names = ["city_stats"]
# 创建大量虚拟表
n = 100
for i in range(n):
tmp_table_name = f"tmp_table_{i}"
tmp_table = Table(
tmp_table_name,
metadata_obj,
Column(f"tmp_field_{i}_1", String(16), primary_key=True),
Column(f"tmp_field_{i}_2", Integer),
Column(f"tmp_field_{i}_3", String(16), nullable=False),
)
all_table_names.append(f"tmp_table_{i}")
metadata_obj.create_all(engine)
In [ ]:
Copied!
# 插入虚拟数据
from sqlalchemy import insert
rows = [
{"city_name": "多伦多", "population": 2930000, "country": "加拿大"},
{"city_name": "东京", "population": 13960000, "country": "日本"},
{
"city_name": "芝加哥",
"population": 2679000,
"country": "美国",
},
{"city_name": "首尔", "population": 9776000, "country": "韩国"},
]
for row in rows:
stmt = insert(city_stats_table).values(**row)
with engine.begin() as connection:
cursor = connection.execute(stmt)
# 插入虚拟数据
from sqlalchemy import insert
rows = [
{"city_name": "多伦多", "population": 2930000, "country": "加拿大"},
{"city_name": "东京", "population": 13960000, "country": "日本"},
{
"city_name": "芝加哥",
"population": 2679000,
"country": "美国",
},
{"city_name": "首尔", "population": 9776000, "country": "韩国"},
]
for row in rows:
stmt = insert(city_stats_table).values(**row)
with engine.begin() as connection:
cursor = connection.execute(stmt)
In [ ]:
Copied!
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
构建对象索引¶
In [ ]:
Copied!
from llama_index.core.indices.struct_store import SQLTableRetrieverQueryEngine
from llama_index.core.objects import (
SQLTableNodeMapping,
ObjectIndex,
SQLTableSchema,
)
from llama_index.core import VectorStoreIndex
from llama_index.core.indices.struct_store import SQLTableRetrieverQueryEngine
from llama_index.core.objects import (
SQLTableNodeMapping,
ObjectIndex,
SQLTableSchema,
)
from llama_index.core import VectorStoreIndex
In [ ]:
Copied!
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = []
for table_name in all_table_names:
table_schema_objs.append(SQLTableSchema(table_name=table_name))
obj_index = ObjectIndex.from_objects(
table_schema_objs,
table_node_mapping,
VectorStoreIndex,
)
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = []
for table_name in all_table_names:
table_schema_objs.append(SQLTableSchema(table_name=table_name))
obj_index = ObjectIndex.from_objects(
table_schema_objs,
table_node_mapping,
VectorStoreIndex,
)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens > [build_index_from_nodes] Total LLM token usage: 0 tokens INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 6343 tokens > [build_index_from_nodes] Total embedding token usage: 6343 tokens
使用SQLTableRetrieverQueryEngine
查询索引¶
SQLTableRetrieverQueryEngine
是一个用于查询索引的工具,可以帮助用户快速检索数据库中的索引信息。通过使用这个工具,用户可以轻松地查找数据库中的索引,从而更好地了解数据库的结构和性能优化情况。
In [ ]:
Copied!
query_engine = SQLTableRetrieverQueryEngine(
sql_database,
obj_index.as_retriever(similarity_top_k=1),
)
query_engine = SQLTableRetrieverQueryEngine(
sql_database,
obj_index.as_retriever(similarity_top_k=1),
)
In [ ]:
Copied!
response = query_engine.query("Which city has the highest population?")
response = query_engine.query("Which city has the highest population?")
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens > [retrieve] Total LLM token usage: 0 tokens INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 7 tokens > [retrieve] Total embedding token usage: 7 tokens INFO:llama_index.indices.struct_store.sql_query:> Table desc str: Table 'city_stats' has columns: city_name (VARCHAR), population (INTEGER), country (VARCHAR) and foreign keys: . > Table desc str: Table 'city_stats' has columns: city_name (VARCHAR), population (INTEGER), country (VARCHAR) and foreign keys: . INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 337 tokens > [query] Total LLM token usage: 337 tokens INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens > [query] Total embedding token usage: 0 tokens
In [ ]:
Copied!
response
response
Out[ ]:
Response(response=' The city with the highest population is Tokyo, with a population of 13,960,000.', source_nodes=[], metadata={'result': [('Tokyo', 13960000)], 'sql_query': 'SELECT city_name, population \nFROM city_stats \nORDER BY population DESC \nLIMIT 1;'})