ClickHouse Vector Store¶
In this notebook we show how to get up and running quickly with the ClickHouse Vector Store.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
!pip install clickhouse_connect
Creating a ClickHouse Client¶
In [ ]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
from os import environ
import clickhouse_connect
environ["OPENAI_API_KEY"] = "sk-*"
# initialize client
client = clickhouse_connect.get_client(
host="localhost",
port=8123,
username="default",
password="",
)
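A quick sanity query is a handy way to confirm the client can actually reach the server (a minimal sketch; SELECT version() is just an illustrative statement):
In [ ]:
# Ask the server for its version to verify connectivity.
print(client.command("SELECT version()"))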
Load documents, build and store the VectorStoreIndex with ClickHouseVectorStore¶
Here we use a set of Paul Graham essays as the text from which to generate embeddings, store them in the ClickHouseVectorStore, and query it to find context for our LLM QnA loop.
In [ ]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore
In [ ]:
# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents:", len(documents))
Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents: 1
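To sanity-check what was loaded, you can peek at the start of the first document's text (an illustrative snippet, not part of the original flow):
In [ ]:
# Print the first 100 characters of the loaded essay.
print(documents[0].text[:100])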
Download Data
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-02-13 10:08:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.003s

2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
You can process your files individually using SimpleDirectoryReader:
In [ ]:
loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
print(file)
# here is where you can do any preprocessing (see the sketch below)
data/paul_graham/paul_graham_essay.txt
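As an illustration of the preprocessing hook in the loop above, here is a minimal sketch that normalizes whitespace in each document; what preprocessing you actually need is entirely up to you:
In [ ]:
# Hypothetical preprocessing step: collapse runs of whitespace in each document.
for document in documents:
    document.text = " ".join(document.text.split())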
In [ ]:
# initialize with metadata filters and store the index
from llama_index.core import StorageContext
for document in documents:
document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
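Before wiring up a full query engine, you can sanity-check retrieval directly against the index; a minimal sketch using the default retriever (the question is just an example):
In [ ]:
# Retrieve the top-2 most similar chunks for a sample question.
retriever = index.as_retriever(similarity_top_k=2)
for node_with_score in retriever.retrieve("What did the author do growing up?"):
    print(node_with_score.score, node_with_score.node.node_id)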
In [ ]:
import textwrap
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# set logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
filters=MetadataFilters(
filters=[
ExactMatchFilter(key="user_id", value="123"),
]
),
similarity_top_k=2,
vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned several things during their time at Interleaf, including the importance of having technology companies run by product people rather than sales people, the drawbacks of having too many people edit code, the value of corridor conversations over planned meetings, the challenges of dealing with big bureaucratic customers, and the importance of being the "entry level" option in a market.
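To see which chunks (and which of the metadata we attached) backed that answer, you can inspect the response's source nodes, for example:
In [ ]:
# Show the score and metadata of each retrieved source chunk.
for source in response.source_nodes:
    print(source.score, source.node.metadata)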
Clear All Indexes¶
In [ ]:
for document in documents:
index.delete_ref_doc(document.doc_id)
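Deleting the ref docs removes the stored rows; if you also want to drop the backing ClickHouse table, something like the following works (assuming the store used the default table name llama_index in the default database, which you should verify against your configuration):
In [ ]:
# Drop the backing table entirely (table name is an assumption; adjust to yours).
client.command("DROP TABLE IF EXISTS default.llama_index")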