Amazon DocumentDB
Amazon DocumentDB (with MongoDB compatibility) makes it easy to set up, operate, and scale MongoDB-compatible databases in the cloud. With Amazon DocumentDB, you can run the same application code and use the same drivers and tools that you use with MongoDB. Vector search for Amazon DocumentDB combines the flexibility and rich querying capability of a JSON-based document database with the power of vector search.
This notebook shows you how to use Amazon DocumentDB vector search to store documents in collections, create indexes, and perform vector search queries using approximate nearest neighbor algorithms such as "cosine", "euclidean", and "dotProduct". By default, DocumentDB creates Hierarchical Navigable Small World (HNSW) indexes. To learn about other supported vector index types, please refer to the documentation linked above.
To use DocumentDB, you first need to deploy a cluster. Please refer to the Developer Guide for more details.
Sign up for free to get started today.
!pip install pymongo
# The rest of this notebook also uses langchain-community, langchain-openai, and
# langchain-text-splitters; install them as well if they are not already available.
import getpass
# DocumentDB connection string
# i.e., "mongodb://{username}:{pass}@{cluster_endpoint}:{port}/?{params}"
CONNECTION_STRING = getpass.getpass("DocumentDB Cluster URI:")
INDEX_NAME = "izzy-test-index"
NAMESPACE = "izzy_test_db.izzy_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")
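Optionally, you can verify that the connection string works before going any further. This quick check is not part of the original walkthrough; it simply issues a ping command with pymongo:
from pymongo import MongoClient

# Quick connectivity check: raises an exception if the cluster is unreachable.
check_client = MongoClient(CONNECTION_STRING)
check_client.admin.command("ping")
check_client.close()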
We want to use OpenAIEmbeddings, so we need to set up our OpenAI environment variables.
import getpass
import os
# Set up the OpenAI Environment Variables
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["OPENAI_EMBEDDINGS_DEPLOYMENT"] = (
"smart-agent-embedding-ada" # the deployment name for the embedding model
)
os.environ["OPENAI_EMBEDDINGS_MODEL_NAME"] = "text-embedding-ada-002" # the model name
Now we will load the documents into the collection, create the index, and then run queries against the index.
If you have questions about certain parameters, please refer to the documentation.
from langchain_community.vectorstores.documentdb import (
    DocumentDBSimilarityType,
    DocumentDBVectorSearch,
)
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
SOURCE_FILE_NAME = "../../how_to/state_of_the_union.txt"
loader = TextLoader(SOURCE_FILE_NAME)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# OpenAI Settings
model_deployment = os.getenv(
    "OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")
openai_embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
    deployment=model_deployment, model=model_name
)
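As a quick sanity check (not in the original notebook), you can confirm that the embedding model returns 1536-dimensional vectors, which must match the dimensions value passed to create_index below:
# Embed a short test string and check the vector length;
# text-embedding-ada-002 produces 1536-dimensional embeddings.
test_vector = openai_embeddings.embed_query("test")
print(len(test_vector))  # expected: 1536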
from pymongo import MongoClient
INDEX_NAME = "izzy-test-index-2"
NAMESPACE = "izzy_test_db.izzy_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")
client: MongoClient = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]
model_deployment = os.getenv(
    "OPENAI_EMBEDDINGS_DEPLOYMENT", "smart-agent-embedding-ada"
)
model_name = os.getenv("OPENAI_EMBEDDINGS_MODEL_NAME", "text-embedding-ada-002")
vectorstore = DocumentDBVectorSearch.from_documents(
    documents=docs,
    embedding=openai_embeddings,
    collection=collection,
    index_name=INDEX_NAME,
)
# number of dimensions used by model above
dimensions = 1536
# specify similarity algorithm, valid options are:
# cosine (COS), euclidean (EUC), dotProduct (DOT)
similarity_algorithm = DocumentDBSimilarityType.COS
vectorstore.create_index(dimensions, similarity_algorithm)
{ 'createdCollectionAutomatically' : false,
'numIndexesBefore' : 1,
'numIndexesAfter' : 2,
'ok' : 1,
'operationTime' : Timestamp(1703656982, 1)}
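If you want to confirm that the vector index was created, one option (an optional check, not part of the original walkthrough) is to list the collection's indexes with pymongo:
# List the indexes on the collection; the new vector index should appear
# alongside the default _id_ index.
print(collection.index_information())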
# perform a similarity search between the embedding of the query and the embeddings of the documents
query = "What did the President say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
Once the documents have been loaded and the index has been created, you can now instantiate the vector store directly and run queries against the index.
vectorstore = DocumentDBVectorSearch.from_connection_string(
    connection_string=CONNECTION_STRING,
    namespace=NAMESPACE,
    embedding=openai_embeddings,
    index_name=INDEX_NAME,
)
# perform a similarity search between a query and the ingested documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
# perform a similarity search between a query and the ingested documents
query = "Which stats did the President share about the U.S. economy"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
And unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind.
And it worked. It created jobs. Lots of jobs.
In fact—our economy created over 6.5 Million new jobs just last year, more jobs created in one year
than ever before in the history of America.
Our economy grew at a rate of 5.7% last year, the strongest growth in nearly 40 years, the first step in bringing fundamental change to an economy that hasn’t worked for the working people of this nation for too long.
For the past 40 years we were told that if we gave tax breaks to those at the very top, the benefits would trickle down to everyone else.
But that trickle-down theory led to weaker economic growth, lower wages, bigger deficits, and the widest gap between those at the top and everyone else in nearly a century.
Question Answering
qa_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 25},
)
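The retriever can also be used on its own. As a minimal sketch (assuming the standard retriever interface), you can fetch the top matching documents directly before wiring up a full QA chain:
# Retrieve the top-k documents for a query without running the full QA chain.
retrieved_docs = qa_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson"
)
print(len(retrieved_docs))
print(retrieved_docs[0].page_content)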
from langchain_core.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
API Reference: PromptTemplate
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=qa_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)
docs = qa({"query": "gpt-4 compute requirements"})
print(docs["result"])
print(docs["source_documents"])
API Reference: RetrievalQA | OpenAI
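When you are done experimenting, you may want to clean up the test collection. A minimal cleanup sketch using pymongo (optional, and assuming you no longer need the data):
# Drop the test collection (this also removes its indexes) and close the client.
client[DB_NAME].drop_collection(COLLECTION_NAME)
client.close()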