LanceDB Vector Store

In this notebook, we show how to use LanceDB to perform vector search in LlamaIndex.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

%pip install llama-index-vector-stores-lancedb
!pip install llama-index

import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
import textwrap

Setup OpenAI

The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.

import openai

openai.api_key = ""
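
The cell above leaves the key as an empty string for you to fill in. As a minimal sketch of an alternative, the key can be read from the `OPENAI_API_KEY` environment variable so that it never appears in the notebook itself. The helper name `resolve_openai_key` is hypothetical and is not part of LlamaIndex or the OpenAI library.

```python
import os


def resolve_openai_key(env=os.environ):
    """Return the OpenAI key found in `env`, or raise a clear error.

    `resolve_openai_key` is an illustrative helper, not a library function.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before building the index")
    return key


# Usage, assuming the `import openai` from the cell above:
# openai.api_key = resolve_openai_key()
```

Failing early with an explicit message is a design choice here: an empty key would otherwise only surface later, when the first embedding request is made.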

Download Data

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Loading documents

Load the documents stored in data/paul_graham/ using the SimpleDirectoryReader.

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print("Document ID:", documents[0].doc_id, "Document Hash:", documents[0].hash)

Document ID: 855fe1d1-1c1a-4fbe-82ba-6bea663a5920 Document Hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35

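Create the index

Here we create an index backed by LanceDB using the documents loaded previously. LanceDBVectorStore takes a few arguments:

uri (str, required): Location where LanceDB will store its files.
table_name (str, optional): Name of the table where the embeddings will be stored. Defaults to "vectors".
nprobes (int, optional): The number of probes used. A higher number makes the search more accurate but also slower. Defaults to 20.
refine_factor (int, optional): Refine the results by reading extra elements and re-ranking them in memory. Defaults to None.
More details can be found in the LanceDB documentation.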

vector_store = LanceDBVectorStore(uri="/tmp/lancedb")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
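
To make the accuracy and speed trade-off behind nprobes and refine_factor more concrete, here is a toy sketch in plain Python. It is not LanceDB's implementation: the partitioning scheme, the rounding used as a stand-in for compressed vectors, and every name in it (`build_partitions`, `search`, the sample data) are assumptions made for illustration only.

```python
import math


def build_partitions(vectors, centroids):
    """Assign each vector index to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions


def search(query, vectors, centroids, partitions, k=1, nprobes=1, refine_factor=None):
    """Return the indices of the k best candidates found under the given settings."""
    # Probe only the `nprobes` partitions whose centroids are closest to the query.
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobes] for i in partitions[c]]

    # Rank candidates by a coarse distance; rounding stands in for compression.
    def coarse(i):
        return math.dist(query, [round(x) for x in vectors[i]])

    ranked = sorted(candidates, key=coarse)
    if refine_factor is not None:
        # Read k * refine_factor candidates, then re-rank them with exact distances.
        extra = ranked[: k * refine_factor]
        ranked = sorted(extra, key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


# Hypothetical sample data: two partitions along the x axis.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (3.0, 0.0), (5.4, 0.45), (5.1, 0.55), (9.0, 0.0)]
partitions = build_partitions(vectors, centroids)
query = (4.9, 0.0)

print(search(query, vectors, centroids, partitions, nprobes=1))  # [1]
print(search(query, vectors, centroids, partitions, nprobes=2))  # [2]
print(search(query, vectors, centroids, partitions, nprobes=2, refine_factor=2))  # [3]
```

With a single probe the search never looks at the partition holding the true nearest neighbor (index 3). Probing both partitions reaches it, but the coarse distance still ranks index 2 first. Only the re-ranking step with exact distances returns index 3, at the cost of reading more candidates.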

Query the index

We can now ask questions using our index.

query_engine = index.as_query_engine()
response = query_engine.query("How much did Viaweb charge per month?")

print(textwrap.fill(str(response), 100))

Viaweb charged $100 per month for a small store and $300 per month for a big one.

response = query_engine.query("What did the author do growing up?")

print(textwrap.fill(str(response), 100))

The author worked on writing and programming outside of school before college. They wrote short
stories and tried writing programs on the IBM 1401 computer. They also mentioned getting a
microcomputer, a TRS-80, and started programming on it.

Appending data

You can also add data to an existing index.

del index

# `uri` is a LanceDBVectorStore argument; from_documents does not use it,
# so the location is passed through a storage context instead.
new_store = LanceDBVectorStore(uri="/tmp/new_dataset")
new_context = StorageContext.from_defaults(vector_store=new_store)
index = VectorStoreIndex.from_documents(
    [Document(text="The sky is purple in Portland, Maine")],
    storage_context=new_context,
)

query_engine = index.as_query_engine()
response = query_engine.query("Where is the sky purple?")
print(textwrap.fill(str(response), 100))

The sky is purple in Portland, Maine.

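# `uri` belongs to LanceDBVectorStore, so the location is passed via a storage context
new_context = StorageContext.from_defaults(vector_store=LanceDBVectorStore(uri="/tmp/new_dataset"))
index = VectorStoreIndex.from_documents(documents, storage_context=new_context)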

query_engine = index.as_query_engine()
response = query_engine.query("What companies did the author start?")
print(textwrap.fill(str(response), 100))

The author started two companies: Viaweb and Y Combinator.