Philosophy with Vector Embeddings, OpenAI and Astra DB

AstraPy version

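In this quickstart you will learn how to build a "philosophy quote finder & generator" using OpenAI's vector embeddings and DataStax Astra DB as the vector store for data persistence.

The basic workflow of this notebook is outlined below. You will compute and store the vector embeddings for a number of quotes by famous philosophers, use them to build a powerful search engine and, after that, even a generator of new quotes.

The notebook exemplifies some of the standard usage patterns of vector search, while showing how easy it is to get started with Astra DB.

For background on using vector search and text embeddings to build a question-answering system, check out this hands-on notebook: Question answering using embeddings.

Table of contents:

  • Setup
  • Create the vector collection
  • Connect to OpenAI
  • Load quotes into the vector store
  • Use case 1: quote search engine
  • Use case 2: quote generator
  • Cleanup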

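How it works

Indexing

Each quote is turned into an embedding vector with OpenAI's Embedding. These vectors are saved in the vector store for later use in searching. Some metadata, including the author's name and a few pre-computed tags, are stored alongside to allow for search customization.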

1_vector_indexing

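Search

To find a quote similar to the provided search quote, the latter is turned into an embedding vector on the fly, and this vector is used to query the store for similar vectors, that is, similar quotes that were previously indexed. The search can optionally be constrained by additional metadata ("find me quotes by Spinoza similar to this one ...").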

2_vector_search

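The key point here is that "quotes similar in content" translate, in vector space, to vectors that are metrically close to each other: thus, vector similarity search effectively implements semantic similarity. This is the key reason vector embeddings are so powerful.

The sketch below tries to convey this idea. Each quote, once made into a vector, is a point in space. In this case it lies on a sphere, since OpenAI's embedding vectors, like most others, are normalized to unit length. And this sphere is actually not three-dimensional, but 1536-dimensional!

So, in essence, a similarity search in vector space returns the vectors that are closest to the query vector: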

3_vector_space
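
The idea of "closest vectors on a sphere" can be illustrated offline with a minimal sketch. The three-dimensional vectors and their labels below are made-up toy values, not real model output, and `normalize`, `dot` and `nearest` are hypothetical helper names; a real vector store uses approximate indexes rather than the brute-force scan shown here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, so that it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Scalar product; for unit vectors this equals the cosine of their angle."""
    return sum(x * y for x, y in zip(a, b))

def nearest(query, stored, n=1):
    """Return the n (label, vector) pairs most similar to the query (brute force)."""
    q = normalize(query)
    ranked = sorted(stored, key=lambda item: dot(q, item[1]), reverse=True)
    return ranked[:n]

# Toy 3-dimensional "embeddings" (assumed values for illustration only)
toy_store = [
    ("quote about war", normalize([1.0, 0.1, 0.0])),
    ("quote about love", normalize([0.0, 1.0, 0.2])),
    ("quote about animals", normalize([0.1, 0.0, 1.0])),
]

best = nearest([0.1, 0.9, 0.3], toy_store, n=1)
print(best[0][0])
```

The query points mostly along the second axis, so the entry whose vector does the same is ranked first.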

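Generation

Given a suggestion (a topic or a tentative quote), the search step is performed and the top returned results (quotes) are fed into an LLM prompt, which asks a generative model to invent a new text along the lines of the passed examples and the initial suggestion.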

4_quote_generation

Setup

Install and import the necessary dependencies:

!pip install --quiet "astrapy>=0.6.0" "openai>=1.0.0" datasets

from getpass import getpass
from collections import Counter

from astrapy.db import AstraDB
import openai
from datasets import load_dataset

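Connection parameters

Please retrieve your database credentials on your Astra dashboard: you will supply them momentarily.

Example values:

  • API Endpoint: https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com
  • Token: AstraCS:6gBhNmsk135...
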
ASTRA_DB_API_ENDPOINT = input("Please enter your API Endpoint:")
ASTRA_DB_APPLICATION_TOKEN = getpass("Please enter your Token")

Please enter your API Endpoint: https://4f835778-ec78-42b0-9ae3-29e3cf45b596-us-east1.apps.astra.datastax.com
Please enter your Token ········

Instantiate an Astra DB client

astra_db = AstraDB(
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)

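Create the vector collection

Apart from the collection name, the only parameter you need to specify is the dimension of the vectors you will store. Other parameters, notably the similarity metric used for searches, are optional.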

coll_name = "philosophers_astra_db"
collection = astra_db.create_collection(coll_name, dimension=1536)

Connect to OpenAI

Set up your secret key

OPENAI_API_KEY = getpass("Please enter your OpenAI API Key: ")

Please enter your OpenAI API Key:  ········

A test call for embeddings

A quick check of how to get the embedding vectors for a list of input texts:

client = openai.OpenAI(api_key=OPENAI_API_KEY)
embedding_model_name = "text-embedding-3-small"

result = client.embeddings.create(
    input=[
        "This is a sentence",
        "A second sentence"
    ],
    model=embedding_model_name,
)

Note: the above is the syntax for OpenAI v1.0+. With earlier versions, the code to get the embeddings looks different.

print(f"len(result.data)              = {len(result.data)}")
print(f"result.data[1].embedding = {str(result.data[1].embedding)[:55]}...")
print(f"len(result.data[1].embedding) = {len(result.data[1].embedding)}")

len(result.data)              = 2
result.data[1].embedding = [-0.0108176339417696, 0.0013546717818826437, 0.00362232...
len(result.data[1].embedding) = 1536
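
Since the sphere picture relies on embeddings having unit length, it can be worth verifying that property on the vectors you get back. The following is a minimal offline sketch: `l2_norm` and `is_unit_length` are hypothetical helpers, the tolerance is an arbitrary assumption, and the sample vector is a toy value standing in for `result.data[0].embedding`.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def is_unit_length(vec, tol=1e-3):
    """True if the vector's length is within `tol` of 1 (tolerance is an assumption)."""
    return abs(l2_norm(vec) - 1.0) <= tol

# Toy stand-in for an embedding; in the notebook you would pass
# result.data[0].embedding instead.
sample = [0.6, 0.8]
print(is_unit_length(sample))
```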

Load quotes into the vector store

Get a dataset with the quotes. (We adapted and augmented the data from a Kaggle dataset, ready to use in this demo.)

philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

A quick inspection:

print("An example entry:")
print(philo_dataset[16])

An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}

Check the dataset size:

author_count = Counter(entry["author"] for entry in philo_dataset)
print(f"Total: {len(philo_dataset)} quotes. By author:")
for author, count in author_count.most_common():
    print(f" {author:<20}: {count} quotes")

Total: 450 quotes. By author:
aristotle : 50 quotes
schopenhauer : 50 quotes
spinoza : 50 quotes
hegel : 50 quotes
freud : 50 quotes
nietzsche : 50 quotes
sartre : 50 quotes
plato : 50 quotes
kant : 50 quotes

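Write to the vector collection

You will compute the embeddings for the quotes and save them into the vector store, along with the text itself and the metadata you will use later.

To optimize speed and reduce the number of calls, you will perform batched calls to the OpenAI embedding service.

To store the quote objects, you will use the insert_many method of the collection (one call per batch). When preparing the documents for insertion you can choose suitable field names, but keep in mind that the embedding vector must go in the fixed special $vector field.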

BATCH_SIZE = 20

num_batches = ((len(philo_dataset) + BATCH_SIZE - 1) // BATCH_SIZE)

quotes_list = philo_dataset["quote"]
authors_list = philo_dataset["author"]
tags_list = philo_dataset["tags"]

print("Starting to store entries: ", end="")
for batch_i in range(num_batches):
    b_start = batch_i * BATCH_SIZE
    b_end = (batch_i + 1) * BATCH_SIZE
    # compute the embedding vectors for this batch
    b_emb_results = client.embeddings.create(
        input=quotes_list[b_start : b_end],
        model=embedding_model_name,
    )
    # prepare the documents for insertion
    b_docs = []
    for entry_idx, emb_result in zip(range(b_start, b_end), b_emb_results.data):
        if tags_list[entry_idx]:
            tags = {
                tag: True
                for tag in tags_list[entry_idx].split(";")
            }
        else:
            tags = {}
        b_docs.append({
            "quote": quotes_list[entry_idx],
            "$vector": emb_result.embedding,
            "author": authors_list[entry_idx],
            "tags": tags,
        })
    # write to the vector collection
    collection.insert_many(b_docs)
    print(f"[{len(b_docs)}]", end="")

print("\nFinished storing entries.")

Starting to store entries: [20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][20][10]
Finished storing entries.
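
Two small pieces of the loading loop can be tried in isolation, without touching OpenAI or the database: the ceiling-division batching and the conversion of the semicolon-separated tags string into a map. The sketch below uses hypothetical helper names (`batch_bounds`, `tags_to_map`) that are not part of the notebook; unlike the loop above, it clips the last batch explicitly instead of relying on `zip` to truncate it.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in consecutive batches."""
    num_batches = (total + batch_size - 1) // batch_size  # ceiling division
    for batch_i in range(num_batches):
        start = batch_i * batch_size
        # clip the final batch so that `end` never exceeds `total`
        yield start, min(start + batch_size, total)

def tags_to_map(tags_field):
    """Turn 'love;ethics' into {'love': True, 'ethics': True}; empty input gives {}."""
    if not tags_field:
        return {}
    return {tag: True for tag in tags_field.split(";")}

bounds = list(batch_bounds(450, 20))
print(len(bounds), bounds[-1])
print(tags_to_map("love;ethics"))
```

With 450 quotes and a batch size of 20 this gives 23 batches, the last one holding 10 entries, which agrees with the progress output above. Storing tags as a map of `True` values, as the loop does, is what lets the search function later filter on a tag by name.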

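Use case 1: quote search engine

For the quote-search functionality, you first need to turn the input quote into a vector, and then use it to query the store (besides passing any optional metadata constraints to the search call).

Encapsulate the search-engine functionality into a function for ease of reuse. At its core is the vector_find method of the collection: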

def find_quote_and_author(query_quote, n, author=None, tags=None):
    query_vector = client.embeddings.create(
        input=[query_quote],
        model=embedding_model_name,
    ).data[0].embedding
    filter_clause = {}
    if author:
        filter_clause["author"] = author
    if tags:
        filter_clause["tags"] = {}
        for tag in tags:
            filter_clause["tags"][tag] = True
    #
    results = collection.vector_find(
        query_vector,
        limit=n,
        filter=filter_clause,
        fields=["quote", "author"]
    )
    return [
        (result["quote"], result["author"])
        for result in results
    ]
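
To see what the metadata filter looks like for different arguments, the filter-building part of the function can be pulled out and run on its own. This is only a sketch: `build_filter` is a hypothetical name, and it simply mirrors the logic of the function above without calling the database.

```python
def build_filter(author=None, tags=None):
    """Build the filter document that the search function passes to vector_find."""
    filter_clause = {}
    if author:
        filter_clause["author"] = author
    if tags:
        # each requested tag must be present (stored as True) in the document
        filter_clause["tags"] = {tag: True for tag in tags}
    return filter_clause

print(build_filter())
print(build_filter(author="nietzsche"))
print(build_filter(author="plato", tags=["politics", "ethics"]))
```

An empty filter places no constraint on the search, so the first call corresponds to a plain similarity search over the whole collection.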

Putting search to test

Passing just a quote:

find_quote_and_author("We struggle all our life for nothing", 3)

[('Life to the great majority is only a constant struggle for mere existence, with the certainty of losing it at last.',
'schopenhauer'),
('We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.',
'aristotle'),
('Perhaps the gods are kind to us, by making life more disagreeable as we grow older. In the end death seems less intolerable than the manifold burdens we carry',
'freud')]

Search restricted to an author:

find_quote_and_author("We struggle all our life for nothing", 2, author="nietzsche")

[('To live is to suffer, to survive is to find some meaning in the suffering.',
'nietzsche'),
('What makes us heroic?--Confronting simultaneously our supreme suffering and our supreme hope.',
'nietzsche')]

Search constrained to a tag (out of those saved earlier with the quotes):

find_quote_and_author("We struggle all our life for nothing", 2, tags=["politics"])

[('He who seeks equality between unequals seeks an absurdity.', 'spinoza'),
('The people are that part of the state that does not know what it wants.',
'hegel')]

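Cutting out irrelevant results

Vector similarity search generally returns the vectors closest to the query, even if that means results that are only loosely relevant when nothing better is available.

To keep this under control, you can get the actual "similarity" between the query and each result and then apply a threshold to it, discarding the results that fall below it. Tuning this threshold correctly is not an easy problem: here, we just show you the way.

To get a feeling for how this works, try the following query and play with the choice of quote and threshold to compare the results. Note that the similarity is returned as the special $similarity field in each result document. It is returned by default, unless you pass include_similarity = False to the search method.

Note (for the mathematically inclined): this value is a rescaling, between zero and one, of the cosine similarity between the two vectors, that is, of the scalar product divided by the product of the norms of the two vectors. In other words, it is 0 for opposite-facing vectors and +1 for parallel vectors. For other similarity measures (cosine is the default), check the metric parameter of AstraDB.create_collection and the documentation on its allowed values.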
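
The rescaling described in the note can be written out explicitly. The sketch below assumes the mapping is the linear one, (1 + cos) / 2, which is consistent with the stated endpoints (0 for opposite vectors, 1 for parallel ones); the function names are hypothetical and the vectors are toy values.

```python
import math

def cosine(a, b):
    """Scalar product divided by the product of the two norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rescaled_similarity(a, b):
    """Map the cosine from [-1, 1] onto [0, 1] (assumed linear rescaling)."""
    return (1.0 + cosine(a, b)) / 2.0

print(rescaled_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
print(rescaled_similarity([1.0, 0.0], [2.0, 0.0]))   # parallel vectors
print(rescaled_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

Under this assumption, a similarity threshold of 0.92 corresponds to a plain cosine of 2 × 0.92 − 1 = 0.84.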

quote = "Animals are our equals."
# quote = "Be good."
# quote = "This teapot is strange."

metric_threshold = 0.92

quote_vector = client.embeddings.create(
    input=[quote],
    model=embedding_model_name,
).data[0].embedding

results_full = collection.vector_find(
    quote_vector,
    limit=8,
    fields=["quote"]
)
results = [res for res in results_full if res["$similarity"] >= metric_threshold]

print(f"{len(results)} quotes within the threshold:")
for idx, result in enumerate(results):
    print(f" {idx}. [similarity={result['$similarity']:.3f}] \"{result['quote'][:70]}...\"")

3 quotes within the threshold:
0. [similarity=0.927] "The assumption that animals are without rights, and the illusion that ..."
1. [similarity=0.922] "Animals are in possession of themselves; their soul is in possession o..."
2. [similarity=0.920] "At his best, man is the noblest of all animals; separated from law and..."

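Use case 2: quote generator

For this task you need another component from OpenAI, namely an LLM to generate the quote for us (based on input obtained by querying the vector store).

You also need a prompt template that will be filled in for the generate-quote LLM completion task.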

completion_model_name = "gpt-3.5-turbo"

generation_prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

Like for search, this functionality is best wrapped into a handy function (which internally uses search):

def generate_quote(topic, n=2, author=None, tags=None):
    quotes = find_quote_and_author(query_quote=topic, n=n, author=author, tags=tags)
    if quotes:
        prompt = generation_prompt_template.format(
            topic=topic,
            examples="\n".join(f" - {quote[0]}" for quote in quotes),
        )
        # a little logging:
        print("** quotes found:")
        for q, a in quotes:
            print(f"** - {q} ({a})")
        print("** end of logging")
        #
        response = client.chat.completions.create(
            model=completion_model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=320,
        )
        return response.choices[0].message.content.replace('"', '').strip()
    else:
        print("** no quotes found.")
        return None

Note: similar to the embedding computation, the code for the Chat Completion API would be slightly different for OpenAI prior to v1.0.
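
Before spending tokens on the LLM, it can help to inspect the exact prompt the generator will send and how the reply is post-processed. The sketch below does both offline; `prompt_template`, `build_prompt` and `clean_reply` are hypothetical names that mirror the template and the steps in the function above, and the sample quote is taken from the dataset purely as an illustration.

```python
# Same wording as the notebook's prompt template, repeated here so the
# sketch is self-contained.
prompt_template = """Generate a single short philosophical quote on the given topic,
similar in spirit and form to the provided actual example quotes.
Do not exceed 20-30 words in your quote.

REFERENCE TOPIC: "{topic}"

ACTUAL EXAMPLES:
{examples}
"""

def build_prompt(topic, quotes):
    """Fill the template with the topic and a bulleted list of example quotes."""
    examples = "\n".join(f" - {quote}" for quote, _author in quotes)
    return prompt_template.format(topic=topic, examples=examples)

def clean_reply(text):
    """Drop double quotes and surrounding whitespace from the model's reply."""
    return text.replace('"', '').strip()

sample_quotes = [("Happiness is the reward of virtue.", "aristotle")]
prompt = build_prompt("politics and virtue", sample_quotes)
print(prompt)
print(clean_reply(' "Be good." '))
```

Only the quote text goes into the prompt; the author is used for logging and filtering but is not shown to the model.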

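Putting quote generation to test

Just passing a text (a "quote", but one can actually just suggest a topic since its vector embedding will still end up at the right place in the vector space):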

q_topic = generate_quote("politics and virtue")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
** - Happiness is the reward of virtue. (aristotle)
** - Our moral virtues benefit mainly other people; intellectual virtues, on the other hand, benefit primarily ourselves; therefore the former make us universally popular, the latter unpopular. (schopenhauer)
** end of logging

A new generated quote:
True politics lies in the virtuous pursuit of justice, for it is through virtue that we build a better world for all.

Use inspiration from just a single philosopher:

q_topic = generate_quote("animals", author="schopenhauer")
print("\nA new generated quote:")
print(q_topic)

** quotes found:
** - Because Christian morality leaves animals out of account, they are at once outlawed in philosophical morals; they are mere 'things,' mere means to any ends whatsoever. They can therefore be used for vivisection, hunting, coursing, bullfights, and horse racing, and can be whipped to death as they struggle along with heavy carts of stone. Shame on such a morality that is worthy of pariahs, and that fails to recognize the eternal essence that exists in every living thing, and shines forth with inscrutable significance from all eyes that see the sun! (schopenhauer)
** - The assumption that animals are without rights, and the illusion that our treatment of them has no moral significance, is a positively outrageous example of Western crudity and barbarity. Universal compassion is the only guarantee of morality. (schopenhauer)
** end of logging

A new generated quote:
Excluding animals from ethical consideration reveals a moral blindness that allows for their exploitation and suffering. True morality embraces universal compassion.

Cleanup

If you want to remove all resources used for this demo, run this cell (warning: this will irreversibly delete the collection and its data!):

astra_db.delete_collection(coll_name)

{'status': {'ok': 1}}