Kusto as a Vector Database for AI Embeddings

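This notebook provides step-by-step instructions for using Azure Data Explorer (Kusto) as a vector database with OpenAI embeddings.

This notebook walks through an end-to-end process:

  1. Using precomputed embeddings created with the OpenAI API.
  2. Storing the embeddings in Kusto.
  3. Converting a raw text query to an embedding with the OpenAI API.
  4. Using Kusto to perform a cosine similarity search over the stored embeddings.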

Prerequisites

To complete this exercise, we need the following:

  1. An Azure Data Explorer (Kusto) server instance: https://azure.microsoft.com/en-us/products/data-explorer
  2. Azure OpenAI credentials or an OpenAI API key.

%pip install wget

%pip install openai

%pip install azure-kusto-data


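Download precomputed embeddings

In this section, we load prepared embedding data so that you don't have to spend your own credits recomputing the embeddings of the Wikipedia articles.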

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is about 700 MB, so the download takes some time.
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("/lakehouse/default/Files/data")

import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/lakehouse/default/Files/data/vector_database_wikipedia_articles_embedded.csv')
# Read the vectors from strings and convert them back into lists
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

  | id | url | title | text | title_vector | content_vector | vector_id
0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0
1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1
2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2
3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3
4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4
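
The CSV file stores each vector as a string, which is why the cell above applies literal_eval to both vector columns. As a minimal sketch of that conversion (the sample cell value and the helper name parse_vector are hypothetical, and real vectors are far longer):

```python
from ast import literal_eval

# Hypothetical, shortened stand-in for one CSV cell; real vectors are much longer.
raw_cell = "[0.001, -0.02, 0.5]"


def parse_vector(cell: str) -> list[float]:
    """Parse a stringified list of numbers into a list of floats."""
    values = literal_eval(cell)  # safely evaluates Python literals only
    if not isinstance(values, list):
        raise ValueError("expected a list literal")
    return [float(v) for v in values]


vector = parse_vector(raw_cell)
print(type(vector).__name__, len(vector))
```

literal_eval only evaluates Python literals, so it is a safer choice than eval for data read from a file.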

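Store vectors in a Kusto table

Create a table in Kusto based on the contents of the dataframe and load the vectors into it. The Spark option CreateIfNotExist creates the table automatically if it does not already exist.

# Replace the values below with your AAD tenant ID, Kusto cluster URI, Kusto database name, and Kusto table name: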
AAD_TENANT_ID = ""
KUSTO_CLUSTER = ""
KUSTO_DATABASE = "Vector"
KUSTO_TABLE = "Wiki"

kustoOptions = {"kustoCluster": KUSTO_CLUSTER, "kustoDatabase" :KUSTO_DATABASE, "kustoTable" : KUSTO_TABLE }

# Replace the authentication method with the mechanism you want to use - https://github.com/Azure/azure-kusto-spark/blob/master/docs/Authentication.md
access_token=mssparkutils.credentials.getToken(kustoOptions["kustoCluster"])

# Convert the pandas dataframe to a Spark dataframe
sparkDF=spark.createDataFrame(article_df)

# Write the data to the Kusto table
sparkDF.write. \
    format("com.microsoft.kusto.spark.synapse.datasource"). \
    option("kustoCluster", kustoOptions["kustoCluster"]). \
    option("kustoDatabase", kustoOptions["kustoDatabase"]). \
    option("kustoTable", kustoOptions["kustoTable"]). \
    option("accessToken", access_token). \
    option("tableCreateOptions", "CreateIfNotExist"). \
    mode("Append"). \
    save()

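Prepare your OpenAI API key

The OpenAI API key is used to vectorize the documents and queries. You can follow these instructions to create and retrieve your Azure OpenAI key and endpoint: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/tutorials/embeddings

Make sure to use the text-embedding-3-small model. Because the precomputed embeddings were created with text-embedding-3-small, we must use the same model during search.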
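
One cheap sanity check for a model mismatch is to compare the length of a query embedding with the length of a stored vector before searching. The sketch below is only an illustration: the helper name check_same_dimension and the toy vectors are hypothetical, and matching lengths do not prove that the same model produced both vectors.

```python
def check_same_dimension(query_vector, stored_vector):
    """Return the shared dimension, or raise if the two vectors differ in length."""
    if len(query_vector) != len(stored_vector):
        raise ValueError(
            f"dimension mismatch: query has {len(query_vector)} values, "
            f"stored vector has {len(stored_vector)}"
        )
    return len(query_vector)


# Toy vectors for illustration; real embeddings have many more dimensions.
dimension = check_same_dimension([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(dimension)
```

In the notebook, this could be called with the result of embed(...) and one entry of article_df["content_vector"].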

import openai

If using Azure OpenAI

openai.api_version = '2022-12-01'
openai.api_base = '' # Add your endpoint here
openai.api_type = 'azure'
openai.api_key = '' # Add your API key here

def embed(query):
    # Generate an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        deployment_id="embed", # Replace with your deployment ID
        chunk_size=1
    )["data"][0]["embedding"]
    return embedded_query

If using OpenAI

Run this cell only if you plan to use OpenAI for embeddings.

openai.api_key = ""


def embed(query):
    # Generate an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]
    return embedded_query

Generate an embedding for the search term

searchedEmbedding = embed("places where you worship")
# Uncomment to print the embedding of the search term
# print(searchedEmbedding)

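Semantic search in Kusto

We will search the Kusto table for the closest vectors.

We will use the series_cosine_similarity_fl UDF for the similarity search.

Before continuing, create the function in your database: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/functions-library/series-cosine-similarity-fl?tabs=query-defined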
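
To make the score in the results easier to read, here is a minimal pure-Python sketch of cosine similarity. It is not the Kusto implementation. It assumes that the two trailing arguments passed as 1, 1 in the queries in this notebook are precomputed vector norms, which lets the norm computation be skipped for unit-length vectors; the function and parameter names here are hypothetical.

```python
import math


def cosine_similarity(a, b, a_norm=None, b_norm=None):
    """Cosine similarity of two equal-length vectors.

    a_norm and b_norm are optional precomputed Euclidean norms
    (assumption: this mirrors the optional size arguments of the Kusto UDF).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    if a_norm is None:
        a_norm = math.sqrt(sum(x * x for x in a))
    if b_norm is None:
        b_norm = math.sqrt(sum(y * y for y in b))
    if a_norm == 0 or b_norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (a_norm * b_norm)


# Both toy vectors have unit length, so passing norms of 1 skips the square roots.
score = cosine_similarity([0.6, 0.8], [0.8, 0.6], 1, 1)
print(round(score, 2))
```

A score close to 1 means the two vectors point in nearly the same direction, which is why the search keeps the rows with the highest similarity.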

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd

KCSB = KustoConnectionStringBuilder.with_aad_device_authentication(
    KUSTO_CLUSTER)
KCSB.authority_id = AAD_TENANT_ID

KUSTO_CLIENT = KustoClient(KCSB)

KUSTO_QUERY = "Wiki | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), content_vector,1,1) | top 10 by similarity desc "

RESPONSE = KUSTO_CLIENT.execute(KUSTO_DATABASE, KUSTO_QUERY)

df = dataframe_from_result_table(RESPONSE.primary_results[0])
df

  | id | url | title | text | title_vector | content_vector | vector_id | similarity
0 | 852 | https://simple.wikipedia.org/wiki/Temple | Temple | A temple is a building where people go to prac... | [-0.021837441250681877, -0.007722342386841774,... | [-0.0019541378132998943, 0.007151313126087189,... | 413 | 0.834495
1 | 78094 | https://simple.wikipedia.org/wiki/Christian%20... | Christian worship | In Christianity, worship has been thought as b... | [0.0017675267299637198, -0.008890199474990368,... | [0.020530683919787407, 0.0024345638230443, -0.... | 20320 | 0.832132
2 | 59154 | https://simple.wikipedia.org/wiki/Service%20of... | Service of worship | A service of worship is a religious meeting wh... | [-0.007969820871949196, 0.0004240311391185969,... | [0.003784010885283351, -0.0030924836173653603,... | 15519 | 0.831633
3 | 51910 | https://simple.wikipedia.org/wiki/Worship | Worship | Worship is a word often used in religion. It ... | [0.0036036288365721703, -0.01276545226573944, ... | [0.007925753481686115, -0.0110504487529397, 0.... | 14010 | 0.828185
4 | 29576 | https://simple.wikipedia.org/wiki/Altar | Altar | An altar is a place, often a table, where a re... | [0.007887467741966248, -0.02706138789653778, -... | [0.023901859298348427, -0.031175222247838977, ... | 8708 | 0.824124
5 | 92507 | https://simple.wikipedia.org/wiki/Shrine | Shrine | A shrine is a holy or sacred place with someth... | [-0.011601685546338558, 0.006366696208715439, ... | [0.016423320397734642, -0.0015560361789539456,... | 23945 | 0.823863
6 | 815 | https://simple.wikipedia.org/wiki/Synagogue | Synagogue | A synagogue is a place where Jews meet to wors... | [-0.017317570745944977, 0.0022673190105706453,... | [-0.004515442531555891, 0.003739549545571208, ... | 398 | 0.819942
7 | 68080 | https://simple.wikipedia.org/wiki/Shinto%20shrine | Shinto shrine | A Shinto shrine is a sacred place or site wher... | [0.0035740730818361044, 0.0028098472394049168,... | [0.011014971882104874, 0.00042272370774298906,... | 18106 | 0.818475
8 | 57790 | https://simple.wikipedia.org/wiki/Chapel | Chapel | A chapel is a place for Christian worship. The... | [-0.01371884811669588, 0.0031672674231231213, ... | [0.002526090247556567, 0.02482965588569641, 0.... | 15260 | 0.817608
9 | 142 | https://simple.wikipedia.org/wiki/Church%20%28... | Church (building) | A church is a building that was constructed to... | [0.0021336888894438744, 0.0029748091474175453,... | [0.016109377145767212, 0.022908871993422508, 0... | 74 | 0.812636

searchedEmbedding = embed("unfortunate events in history")

KUSTO_QUERY = "Wiki | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), title_vector,1,1) | top 10 by similarity desc "
RESPONSE = KUSTO_CLIENT.execute(KUSTO_DATABASE, KUSTO_QUERY)

df = dataframe_from_result_table(RESPONSE.primary_results[0])
df

  | id | url | title | text | title_vector | content_vector | vector_id | similarity
0 | 848 | https://simple.wikipedia.org/wiki/Tragedy | Tragedy | In theatre, a tragedy as defined by Aristotle ... | [-0.019502468407154083, -0.010160734876990318,... | [-0.012951433658599854, -0.018836138769984245,... | 410 | 0.851848
1 | 4469 | https://simple.wikipedia.org/wiki/The%20Holocaust | The Holocaust | The Holocaust, sometimes called The Shoah (), ... | [-0.030233195051550865, -0.024401605129241943,... | [-0.016398731619119644, -0.013267949223518372,... | 1203 | 0.847222
2 | 64216 | https://simple.wikipedia.org/wiki/List%20of%20... | List of historical plagues | This list contains famous or well documented o... | [-0.010667890310287476, -0.0003575817099772393... | [-0.010863155126571655, -0.0012196656316518784... | 16859 | 0.844411
3 | 4397 | https://simple.wikipedia.org/wiki/List%20of%20... | List of disasters | This is a list of disasters, both natural and ... | [-0.02713736332952976, -0.005278210621327162, ... | [-0.023679986596107483, -0.006126823835074902,... | 1158 | 0.843063
4 | 23073 | https://simple.wikipedia.org/wiki/Disaster | Disaster | A disaster is something very not good that hap... | [-0.018235962837934497, -0.020034968852996823,... | [-0.02504003793001175, 0.007415903266519308, 0... | 7251 | 0.840334
5 | 4382 | https://simple.wikipedia.org/wiki/List%20of%20... | List of terrorist incidents | The following is a list by date of acts and fa... | [-0.03989032283425331, -0.012808636762201786, ... | [-0.045838188380002975, -0.01682935282588005, ... | 1149 | 0.836162
6 | 13528 | https://simple.wikipedia.org/wiki/A%20Series%2... | A Series of Unfortunate Events | A Series of Unfortunate Events is a series of ... | [0.0010618815431371331, -0.0267023965716362, -... | [0.002801976166665554, -0.02904471382498741, -... | 4347 | 0.835172
7 | 42874 | https://simple.wikipedia.org/wiki/History%20of... | History of the world | The history of the world (also called human hi... | [0.0026915925554931164, -0.022206028923392296,... | [0.013645033352077007, -0.005165994167327881, ... | 11672 | 0.830243
8 | 4452 | https://simple.wikipedia.org/wiki/Accident | Accident | An accident is when something goes wrong when ... | [-0.004075294826179743, -0.0059883203357458115... | [0.00926120299845934, 0.013705797493457794, 0.... | 1190 | 0.826898
9 | 324 | https://simple.wikipedia.org/wiki/History | History | History is the study of past events. People kn... | [0.006603690329939127, -0.011856242083013058, ... | [0.0048830462619662285, 0.0032003086525946856,... | 170 | 0.824645
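
Both searches in this notebook build the KQL text by string concatenation and differ only in the vector column (content_vector or title_vector). A small helper can make that explicit. The sketch below is hypothetical: the name build_similarity_query, its parameters, and the identifier check are assumptions for illustration and are not part of the Kusto client library.

```python
import json


def build_similarity_query(table, vector_column, embedding, top_n=10):
    """Build the KQL text for a cosine similarity search over one vector column."""
    # Conservative check (assumption): allow only plain identifier-like names.
    if not table.isidentifier() or not vector_column.isidentifier():
        raise ValueError("table and column names must be simple identifiers")
    # Serialize the embedding as a JSON array for the dynamic() literal.
    vector_literal = json.dumps([float(x) for x in embedding])
    return (
        f"{table} | extend similarity = series_cosine_similarity_fl("
        f"dynamic({vector_literal}), {vector_column}, 1, 1) "
        f"| top {int(top_n)} by similarity desc"
    )


# Toy embedding for illustration; a real one comes from embed(...).
query = build_similarity_query("Wiki", "content_vector", [0.5, -0.25], top_n=3)
print(query)
```

The resulting string can be passed to KUSTO_CLIENT.execute(KUSTO_DATABASE, query) in the same way as KUSTO_QUERY in the cells of this notebook.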