As a prerequisite, you need a running Epsilla vector database (for example, through our Docker image) and the pyepsilla package installed.
See the docs for the full documentation.
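If you do not have an Epsilla instance yet, one way to start it locally is via Epsilla's public Docker image. A minimal sketch, run from a terminal (the image name and port mapping below follow Epsilla's published image and may need adjusting for your environment):

docker pull epsilla/vectordb
docker run -d -p 8888:8888 epsilla/vectordb

This exposes the database on localhost:8888, which matches the default connection used later in this notebook.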
In [ ]:
%pip install llama-index-vector-stores-epsilla
In [ ]:
!pip install pyepsilla
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.epsilla import EpsillaVectorStore
import textwrap
Setup OpenAI¶
Let's first add the OpenAI API key. It will be used to create embeddings for the documents loaded into the index.
In [ ]:
import openai
import getpass

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
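Alternatively, if you already export the key in your shell, you can read it from the environment instead of typing it interactively; a minimal sketch, assuming the OPENAI_API_KEY environment variable is set:

import os
import openai

# Assumes OPENAI_API_KEY is already exported in the environment;
# LlamaIndex's OpenAI integrations read this variable by default.
openai.api_key = os.environ["OPENAI_API_KEY"]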
Download Data¶
In this section, we will download the data we need.
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents¶
Load the documents stored in the ./data/paul_graham/ folder using the SimpleDirectoryReader.
In [ ]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
Total documents: 1 First document, id: ac7f23f0-ce15-4d94-a0a2-5020fa87df61 First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
Create the index¶
Here we create an index backed by Epsilla using the documents loaded previously. EpsillaVectorStore takes a few arguments.
client (Any): Epsilla client to connect to.
collection_name (str, optional): Which collection to use. Defaults to "llama_collection".
db_path (str, optional): The path where the database will be persisted. Defaults to "/tmp/langchain-epsilla".
db_name (str, optional): Give a name to the loaded db. Defaults to "langchain_store".
dimension (int, optional): The dimension of the embeddings. If not provided, the collection will be created on the first insert. Defaults to None.
overwrite (bool, optional): Whether to overwrite an existing collection with the same name. Defaults to False.
Epsilla vectordb is running with default host "localhost" and port "8888".
In [ ]:
# Create an index over the documents
from pyepsilla import vectordb

client = vectordb.Client()
vector_store = EpsillaVectorStore(client=client, db_path="/tmp/llamastore")

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
[INFO] Connected to localhost:8888 successfully.
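If your Epsilla instance is not running at the default location, or you want to control the collection explicitly, the arguments described above can be passed directly. A minimal sketch (the host, port, collection name, and dimension below are illustrative assumptions, not values used in this example):

from pyepsilla import vectordb

# Connect to a non-default Epsilla host/port (illustrative values).
custom_client = vectordb.Client(host="192.168.1.10", port="8888")

# Name the collection and declare the embedding dimension up front
# (1536 matches OpenAI's text-embedding-ada-002; adjust for your model).
custom_store = EpsillaVectorStore(
    client=custom_client,
    collection_name="my_collection",
    db_path="/tmp/llamastore",
    dimension=1536,
    overwrite=False,
)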
Query the data¶
Now that we have our documents stored in the index, we can ask questions against the index.
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
In [ ]:
response = query_engine.query("How did the author learn about AI?")
print(textwrap.fill(str(response), 100))
The author learned about AI through various sources. One source was a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Another source was a PBS documentary that showed Terry Winograd using SHRDLU, a program that could understand natural language. These experiences sparked the author's interest in AI and motivated them to start learning about it, including teaching themselves Lisp, which was regarded as the language of AI at the time.
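The query engine above uses default retrieval settings. If you want more context chunks pulled in per query, retriever options can be passed through as_query_engine; a minimal sketch (the top-k value and query string are arbitrary illustrations):

# Retrieve the top 3 most similar chunks per query instead of the default.
query_engine_k3 = index.as_query_engine(similarity_top_k=3)
response = query_engine_k3.query("What did the author work on?")
print(textwrap.fill(str(response), 100))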
Next, let's try to overwrite the previous data.
In [ ]:
vector_store = EpsillaVectorStore(client=client, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
single_doc = Document(text="Epsilla is the vector database we are using.")
index = VectorStoreIndex.from_documents(
    [single_doc],
    storage_context=storage_context,
)

query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
There is no information provided about the author in the given context.
In [ ]:
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.
Next, let's add more data to the existing collection.
In [ ]:
vector_store = EpsillaVectorStore(client=client, overwrite=False)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
for doc in documents:
    index.insert(document=doc)

query_engine = index.as_query_engine()
response = query_engine.query("Who is the author?")
print(textwrap.fill(str(response), 100))
The author of the given context information is Paul Graham.
In [ ]:
response = query_engine.query("What vector database is being used?")
print(textwrap.fill(str(response), 100))
Epsilla is the vector database being used.