Deep Lake Vector Store Quickstart
Deep Lake can be installed with pip.
In [ ]:
%pip install llama-index-vector-stores-deeplake
In [ ]:
!pip install llama-index
!pip install deeplake
Next, let's import the required modules and set the necessary environment variables:
In [ ]:
import os
import textwrap
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
os.environ["OPENAI_API_KEY"] = "sk-********************************"
os.environ["ACTIVELOOP_TOKEN"] = "********************************"
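Hardcoding keys in a notebook makes them easy to leak. As a minimal sketch, assuming the keys are already exported in your shell, a small helper can read them and fail with a clear message when one is missing. The helper name `require_env` is hypothetical and is not part of LlamaIndex or Deep Lake.

```python
import os
from typing import Optional


def require_env(name: str, default: Optional[str] = None) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `require_env` is an illustrative helper, not a library function.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Example usage (uncomment once the variables are exported in your shell):
# openai_key = require_env("OPENAI_API_KEY")
# activeloop_token = require_env("ACTIVELOOP_TOKEN")
```

With this approach the notebook itself never contains the secret values.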
We will store a Paul Graham essay locally and embed it into the Deep Lake Vector Store. First, we download the data into a directory named data/paul_graham.
In [ ]:
import os
import urllib.request

# Create the target directory first; urlretrieve does not create it.
os.makedirs("data/paul_graham", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt",
    "data/paul_graham/paul_graham_essay.txt",
)
We can now create documents from the source data file.
In [ ]:
# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(
    "Document ID:",
    documents[0].doc_id,
    "Document Hash:",
    documents[0].hash,
)
Document ID: a98b6686-e666-41a9-a0bc-b79f0d666bde Document Hash: beaa54b3e9cea641e91e6975d2207af4f4200f4b2d629725d688f272372ce5bb
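The hash printed above is 64 hexadecimal characters long, which is consistent with a SHA-256 digest. The sketch below shows how a content hash of this kind can detect whether a document changed between runs. What exactly LlamaIndex feeds into its hash is not shown in this quickstart, so the `content_hash` function and its input format are assumptions made for illustration only.

```python
import hashlib


def content_hash(text: str, metadata: dict) -> str:
    """Hash a document's text together with its metadata.

    Illustrative only: the payload format here is an assumption,
    not the one LlamaIndex uses internally.
    """
    payload = text + repr(sorted(metadata.items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = content_hash("What I Worked On", {"file_name": "paul_graham_essay.txt"})
unchanged = content_hash("What I Worked On", {"file_name": "paul_graham_essay.txt"})
edited = content_hash("What I Worked On (edited)", {"file_name": "paul_graham_essay.txt"})

# Identical content yields the same digest; any edit yields a different one.
print(original == unchanged, original == edited)
```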
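Finally, let's create the Deep Lake Vector Store and populate it with data. We use the default tensor configuration, which creates tensors named text (str), metadata (json), id (str, auto-populated), and embedding (float32). See the Deep Lake documentation for more on tensor customizability.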
In [ ]:
from llama_index.core import StorageContext

dataset_path = "./dataset/paul_graham"

# Build an index over the documents
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Uploading data to deeplake dataset.
100%|██████████| 22/22 [00:00<00:00, 684.80it/s]
Dataset(path='./dataset/paul_graham', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape       dtype    compression
  -------    -------     -------     -------   -----------
   text       text       (22, 1)       str        None
 metadata     json       (22, 1)       str        None
 embedding  embedding   (22, 1536)   float32      None
    id        text       (22, 1)       str        None
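Performing Vector Search
Deep Lake offers highly flexible vector search and hybrid search options, which are discussed in detail in the Deep Lake tutorials. In this quickstart, we show a simple example using the default options.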
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query(
"What did the author learn?",
)
In [ ]:
print(textwrap.fill(str(response), 100))
The author learned that working on things that are not prestigious can be a good thing, as it can lead to discovering something real and avoiding the wrong track. The author also learned that ignorance can be beneficial, as it can lead to discovering something new and unexpected. The author also learned the importance of working hard, even at the parts of the job they don't like, in order to set an example for others. The author also learned the value of unsolicited advice, as it can be beneficial in unexpected ways, such as when Robert Morris suggested that the author should make sure Y Combinator wasn't the last cool thing they did.
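The `textwrap.fill(str(response), 100)` call used above only changes the layout of the answer. As a small standalone sketch, using a made-up string in place of a real query response, it breaks a long line at whitespace so that no output line exceeds the given width:

```python
import textwrap

# A stand-in for str(response); the text here is illustrative.
long_answer = (
    "The author learned that working on things that are not prestigious "
    "can be a good thing. "
) * 3

# Break the text at whitespace so that each line is at most 100 characters.
wrapped = textwrap.fill(long_answer, width=100)
print(wrapped)
```

The words themselves are unchanged; only spaces are replaced by newlines where a line would otherwise run past the width.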
In [ ]:
response = query_engine.query("What was a hard moment for the author?")
In [ ]:
print(textwrap.fill(str(response), 100))
The author experienced a hard moment when one of his programs on the IBM 1401 computer did not terminate. This was a social as well as a technical error, as the data center manager's expression made clear.
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query("What was a hard moment for the author?")
print(textwrap.fill(str(response), 100))
The author experienced a hard moment when one of his programs on the IBM 1401 computer did not terminate. This was a social as well as a technical error, as the data center manager's expression made clear.
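Conceptually, a vector search compares the query's embedding against every stored embedding and returns the closest chunks, which are then passed to the language model. The following is a minimal pure-Python sketch of that ranking step, assuming cosine similarity as the metric and using tiny 3-dimensional vectors in place of the 1536-dimensional embeddings shown earlier. The function names, sample rows, and `k=2` are illustrative and do not reflect how Deep Lake implements search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for the (text, embedding) rows held by the vector store.
rows = [
    ("chunk about programming", [0.9, 0.1, 0.0]),
    ("chunk about painting", [0.1, 0.9, 0.0]),
    ("chunk about startups", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, rows, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```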
Deleting Items from the Database
To find the ID of a document to delete, you can query the underlying deeplake dataset directly.
In [ ]:
import deeplake
ds = deeplake.load(dataset_path)
idx = ds.id[0].numpy().tolist()
idx
./dataset/paul_graham loaded successfully.
Out[ ]:
['42f8220e-673d-4c65-884d-5a48a1a15b03']
In [ ]:
index.delete(idx[0])
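To make the effect of a delete easier to picture, here is a toy in-memory model of a store whose rows each hold one chunk. It is not Deep Lake's implementation: the `Row` and `ToyStore` classes and the `ref_doc_id` metadata key are assumptions chosen for illustration. It shows that removing a document by its ID can remove several rows at once, since one document is split into many chunks (22 in the upload above).

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """One stored chunk: an id, its text, and free-form metadata."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


class ToyStore:
    """A hypothetical in-memory stand-in for a vector store."""

    def __init__(self):
        self.rows = []

    def add(self, row: Row) -> None:
        self.rows.append(row)

    def delete_by_ref_doc_id(self, ref_doc_id: str) -> int:
        """Drop every row that belongs to the given source document."""
        before = len(self.rows)
        self.rows = [
            r for r in self.rows if r.metadata.get("ref_doc_id") != ref_doc_id
        ]
        return before - len(self.rows)


store = ToyStore()
for i in range(3):
    store.add(Row(id=f"node-{i}", text=f"chunk {i}", metadata={"ref_doc_id": "doc-a"}))
store.add(Row(id="node-3", text="other chunk", metadata={"ref_doc_id": "doc-b"}))

removed = store.delete_by_ref_doc_id("doc-a")
print(removed, [r.id for r in store.rows])
```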