Cassandra数据库工具¶
Apache Cassandra®是一个广泛使用的用于存储事务应用数据的数据库。在大型语言模型中引入函数和工具为生成式AI应用程序中现有数据开辟了一些令人兴奋的用例。Cassandra数据库工具包使AI工程师能够高效地将代理与Cassandra数据集成,提供以下功能:
- 通过优化查询实现快速数据访问。大多数查询应在单个数字毫秒或更短的时间内运行。
- 模式内省以增强LLM推理能力
- 与各种Cassandra部署兼容,包括Apache Cassandra®、DataStax Enterprise™和DataStax Astra™
- 目前,该工具包仅限于SELECT查询和模式内省操作。(安全第一)
快速开始¶
- 安装cassio库
- 为连接的Cassandra数据库设置环境变量
- 初始化CassandraDatabase
- 将工具传递给您的代理并使用spec.to_tool_list()
- 坐下来看着它为您完成所有工作
操作理论¶
Cassandra查询语言(CQL)是与Cassandra数据库进行交互的主要面向人的方式。虽然在生成查询时提供了一些灵活性,但它需要了解Cassandra数据建模的最佳实践。LLM函数调用赋予代理推理的能力,然后选择满足请求的工具。使用LLM的代理在选择适当的工具或工具链时应使用特定于Cassandra的逻辑进行推理。这减少了在强制LLM提供自上而下解决方案时引入的随机性。您想让LLM完全无限制地访问您的数据库吗?是的。可能不。为了实现这一点,我们在构建代理的问题时提供了一个提示:
您是一个Apache Cassandra专家查询分析机器人,具有以下功能和规则:
- 您将接受最终用户关于在数据库中查找特定数据的问题。
- 您将检查数据库的模式并创建查询路径。
- 您将为用户提供正确的查询以查找他们正在寻找的数据,显示查询路径提供的步骤。
- 您将使用使用分区键和聚簇列查询Apache Cassandra的最佳实践。
- 避免在查询中使用ALLOW FILTERING。
- 目标是找到一个查询路径,因此可能需要查询其他表才能得到最终答案。
以下是JSON格式中查询路径的示例:
{
"query_paths": [
{
"description": "使用电子邮件直接查询用户表",
"steps": [
{
"table": "user_credentials",
"query":
"SELECT userid FROM user_credentials WHERE email = 'example@example.com';"
},
{
"table": "users",
"query": "SELECT * FROM users WHERE userid = ?;"
}
]
}
]
}
提供的工具¶
cassandra_db_schema
¶
收集连接数据库或特定模式的所有模式信息。在确定操作时对代理至关重要。
cassandra_db_select_table_data
¶
从特定keyspace和表中选择数据。代理可以传递谓词和返回记录数的限制参数。
cassandra_db_query
¶
实验性的替代cassandra_db_select_table_data
,它接受完全由代理形成的查询字符串,而不是参数。警告:这可能导致不太高效(甚至无法工作)的异常查询。这可能会在将来的版本中被移除。如果它有一些很酷的功能,我们也想知道。你永远不知道!
环境设置¶
安装以下Python模块:
pip install ipykernel python-dotenv cassio llama-index llama-index-agent-openai llama-index-llms-openai llama-index-tools-cassandra
.env 文件¶
连接是通过 cassio
,使用 auto=True
参数,笔记本使用OpenAI。您应相应地创建一个 .env
文件。
对于Cassandra,设置:
CASSANDRA_CONTACT_POINTS
CASSANDRA_USERNAME
CASSANDRA_PASSWORD
CASSANDRA_KEYSPACE
对于Astra,设置:
ASTRA_DB_APPLICATION_TOKEN
ASTRA_DB_DATABASE_ID
ASTRA_DB_KEYSPACE
例如:
# 连接到Astra:
ASTRA_DB_DATABASE_ID=a1b2c3d4-...
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
ASTRA_DB_KEYSPACE=notebooks
# 也设置
OPENAI_API_KEY=sk-....
(您也可以修改下面的代码直接连接到 cassio
。)
from dotenv import load_dotenv
load_dotenv(override=True)
# 导入必要的库
import os
import cassio
from llama_index.tools.cassandra.base import CassandraDatabaseToolSpec
from llama_index.tools.cassandra.cassandra_database_wrapper import (
CassandraDatabase,
)
from llama_index.agent.openai import OpenAIAgent
from llama_index.llms.openai import OpenAI
要连接到Cassandra数据库¶
cassio.init(auto=True)
session = cassio.config.resolve_session()
if not session:
raise Exception(
"Check environment configuration or manually configure cassio connection parameters"
)
# 测试数据准备
session = cassio.config.resolve_session()
session.execute("""DROP KEYSPACE IF EXISTS llamaindex_agent_test; """)
session.execute(
"""
CREATE KEYSPACE if not exists llamaindex_agent_test
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
"""
)
session.execute(
"""
CREATE TABLE IF NOT EXIST llamaindex_agent_test.user_credentials (
user_email 文本 PRIMARY KEY,
user_id UUID,
password TEXT
);
"""
)
session.execute(
"""
CREATE TABLE IF NOT EXIST llamaindex_agent_test.users (
id UUID PRIMARY KEY,
name TEXT,
email TEXT
);"""
)
session.execute(
"""
CREATE TABLE IF NOT EXIST llamaindex_agent_test.user_videos (
user_id UUID,
video_id UUID,
title TEXT,
description TEXT,
PRIMARY KEY (user_id, video_id)
);
"""
)
user_id = "522b1fe2-2e36-4cef-a667-cd4237d08b89"
video_id = "27066014-bad7-9f58-5a30-f63fe03718f6"
session.execute(
f"""
INSERT INTO llamaindex_agent_test.user_credentials (user_id, user_email)
VALUES ({user_id}, 'patrick@datastax.com');
"""
)
session.execute(
f"""
INSERT INTO llamaindex_agent_test.users (id, name, email)
VALUES ({user_id}, 'Patrick McFadin', 'patrick@datastax.com');
"""
)
session.execute(
f"""
INSERT INTO llamaindex_agent_test.user_videos (user_id, video_id, title)
VALUES ({user_id}, {video_id}, 'Use Langflow to Build an LLM Application in 5 Minutes');
"""
)
session.set_keyspace("llamaindex_agent_test")
# 创建一个CassandraDatabaseToolSpec对象
db = CassandraDatabase()
spec = CassandraDatabaseToolSpec(db=db)
tools = spec.to_tool_list()
for tool in tools:
print(tool.metadata.name)
print(tool.metadata.description)
print(tool.metadata.fn_schema)
cassandra_db_schema cassandra_db_schema(keyspace: str) -> List[llama_index.core.schema.Document] Input to this tool is a keyspace name, output is a table description of Apache Cassandra tables. If the query is not correct, an error message will be returned. If an error is returned, report back to the user that the keyspace doesn't exist and stop. Args: keyspace (str): The name of the keyspace for which to return the schema. Returns: List[Document]: A list of Document objects, each containing a table description. <class 'pydantic.main.cassandra_db_schema'> cassandra_db_select_table_data cassandra_db_select_table_data(keyspace: str, table: str, predicate: str, limit: int) -> List[llama_index.core.schema.Document] Tool for getting data from a table in an Apache Cassandra database. Use the WHERE clause to specify the predicate for the query that uses the primary key. A blank predicate will return all rows. Avoid this if possible. Use the limit to specify the number of rows to return. A blank limit will return all rows. Args: keyspace (str): The name of the keyspace containing the table. table (str): The name of the table for which to return data. predicate (str): The predicate for the query that uses the primary key. limit (int): The maximum number of rows to return. Returns: List[Document]: A list of Document objects, each containing a row of data. <class 'pydantic.main.cassandra_db_select_table_data'>
# 选择将驱动代理的LLM
# 只有某些模型支持这一功能
llm = OpenAI(model="gpt-4-1106-preview")
# 使用我们的工具创建代理。Verbose将回显代理的动作
agent = OpenAIAgent.from_tools(tools, llm=llm, verbose=True)
使用工具调用代理程序¶
我们创建了一个代理程序,它使用LLM进行推理和与工具列表进行通信,现在我们可以简单地向代理程序提问,并观察它如何利用我们提供的工具。
# 向我们的新代理提出一系列问题。代理如何使用工具来获得答案。
agent.chat("在keyspace llamaindex_agent_test中有哪些表?")
agent.chat("patrick@datastax.com的用户ID是多少?")
agent.chat("用户patrick@datastax.com上传了哪些视频?")
Added user message to memory: What tables are in the keyspace llamaindex_agent_test? === Calling Function === Calling function: cassandra_db_schema with args: {"keyspace":"llamaindex_agent_test"} Got output: [Document(id_='4b6011e6-62e6-4db2-9198-046534b7c8dd', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Table Name: user_credentials\n- Keyspace: llamaindex_agent_test\n- Columns\n - password (text)\n - user_email (text)\n - user_id (uuid)\n- Partition Keys: (user_email)\n- Clustering Keys: \n\nTable Name: user_videos\n- Keyspace: llamaindex_agent_test\n- Columns\n - description (text)\n - title (text)\n - user_id (uuid)\n - video_id (uuid)\n- Partition Keys: (user_id)\n- Clustering Keys: (video_id asc)\n\n\nTable Name: users\n- Keyspace: llamaindex_agent_test\n- Columns\n - email (text)\n - id (uuid)\n - name (text)\n- Partition Keys: (id)\n- Clustering Keys: \n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ======================== Added user message to memory: What is the userid for patrick@datastax.com ? === Calling Function === Calling function: cassandra_db_select_table_data with args: {"keyspace":"llamaindex_agent_test","table":"user_credentials","predicate":"user_email = 'patrick@datastax.com'","limit":1} Got output: [Document(id_='e5620177-c735-46f8-a09a-a0e062efcdec', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_email='patrick@datastax.com', password=None, user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'))", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ======================== Added user message to memory: What videos did user patrick@datastax.com upload? === Calling Function === Calling function: cassandra_db_select_table_data with args: {"keyspace":"llamaindex_agent_test","table":"user_videos","predicate":"user_id = 522b1fe2-2e36-4cef-a667-cd4237d08b89","limit":10} Got output: [Document(id_='e3ecfba1-e8e1-4ce3-b321-3f51e12077a1', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'), video_id=UUID('27066014-bad7-9f58-5a30-f63fe03718f6'), description=None, title='Use Langflow to Build an LLM Application in 5 Minutes')", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')] ========================
AgentChatResponse(response='The user `patrick@datastax.com` uploaded the following video in the `llamaindex_agent_test` keyspace:\n\n- Title: "Use Langflow to Build an LLM Application in 5 Minutes"\n- Video ID: `27066014-bad7-9f58-5a30-f63fe03718f6`\n- Description: Not provided', sources=[ToolOutput(content='[Document(id_=\'e3ecfba1-e8e1-4ce3-b321-3f51e12077a1\', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID(\'522b1fe2-2e36-4cef-a667-cd4237d08b89\'), video_id=UUID(\'27066014-bad7-9f58-5a30-f63fe03718f6\'), description=None, title=\'Use Langflow to Build an LLM Application in 5 Minutes\')", start_char_idx=None, end_char_idx=None, text_template=\'{metadata_str}\\n\\n{content}\', metadata_template=\'{key}: {value}\', metadata_seperator=\'\\n\')]', tool_name='cassandra_db_select_table_data', raw_input={'args': (), 'kwargs': {'keyspace': 'llamaindex_agent_test', 'table': 'user_videos', 'predicate': 'user_id = 522b1fe2-2e36-4cef-a667-cd4237d08b89', 'limit': 10}}, raw_output=[Document(id_='e3ecfba1-e8e1-4ce3-b321-3f51e12077a1', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Row(user_id=UUID('522b1fe2-2e36-4cef-a667-cd4237d08b89'), video_id=UUID('27066014-bad7-9f58-5a30-f63fe03718f6'), description=None, title='Use Langflow to Build an LLM Application in 5 Minutes')", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')], is_error=False)], source_nodes=[], is_dummy_stream=False)