ClickHouse Vector Store¶
In this notebook we show how to get up and running quickly with the ClickHouse Vector Store.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
!pip install clickhouse_connect
Creating a ClickHouse Client¶
In [ ]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
from os import environ
import clickhouse_connect
environ["OPENAI_API_KEY"] = "sk-*"
# initialize client
client = clickhouse_connect.get_client(
host="localhost",
port=8123,
username="default",
password="",
)
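A quick sanity query is a handy way to confirm the client can actually reach the server (a minimal sketch; SELECT version() is just an illustrative statement):
In [ ]:
# Ask the server for its version to verify connectivity.
print(client.command("SELECT version()"))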
Load documents, build and store the VectorStoreIndex with ClickHouseVectorStore¶
Here we use a set of Paul Graham essays as the text from which to generate embeddings, store them in the ClickHouseVectorStore, and query it to find context for our LLM QnA loop.
In [ ]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore
In [ ]:
# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents:", len(documents))
Document ID: d03ac7db-8dae-4199-bc38-445dec51a534
Number of Documents: 1
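To sanity-check what was loaded, you can peek at the start of the first document's text (an illustrative snippet, not part of the original flow):
In [ ]:
# Print the first 100 characters of the loaded essay.
print(documents[0].text[:100])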
Download Data
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-02-13 10:08:31--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.003s

2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
You can process your files individually using SimpleDirectoryReader:
In [ ]:
loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
print(file)
# here is where you can do any preprocessing (see the sketch below)
data/paul_graham/paul_graham_essay.txt
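As an illustration of the preprocessing hook in the loop above, here is a minimal sketch that normalizes whitespace in each document; what preprocessing you actually need is entirely up to you:
In [ ]:
# Hypothetical preprocessing step: collapse runs of whitespace in each document.
for document in documents:
    document.text = " ".join(document.text.split())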
In [ ]:
# initialize with metadata filters and store the index
from llama_index.core import StorageContext
for document in documents:
document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
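Before wiring up a full query engine, you can sanity-check retrieval directly against the index; a minimal sketch using the default retriever (the question is just an example):
In [ ]:
# Retrieve the top-2 most similar chunks for a sample question.
retriever = index.as_retriever(similarity_top_k=2)
for node_with_score in retriever.retrieve("What did the author do growing up?"):
    print(node_with_score.score, node_with_score.node.node_id)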
In [ ]:
import textwrap
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# set logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
filters=MetadataFilters(
filters=[
ExactMatchFilter(key="user_id", value="123"),
]
),
similarity_top_k=2,
vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned several things during their time at Interleaf, including the importance of having technology companies run by product people rather than sales people, the drawbacks of having too many people edit code, the value of corridor conversations over planned meetings, the challenges of dealing with big bureaucratic customers, and the importance of being the "entry level" option in a market.
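To see which chunks (and which of the metadata we attached) backed that answer, you can inspect the response's source nodes, for example:
In [ ]:
# Show the score and metadata of each retrieved source chunk.
for source in response.source_nodes:
    print(source.score, source.node.metadata)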
Clear All Indexes¶
In [ ]:
for document in documents:
index.delete_ref_doc(document.doc_id)
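Deleting the ref docs removes the stored rows; if you also want to drop the backing ClickHouse table, something like the following works (assuming the store used the default table name llama_index in the default database, which you should verify against your configuration):
In [ ]:
# Drop the backing table entirely (table name is an assumption; adjust to yours).
client.command("DROP TABLE IF EXISTS default.llama_index")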