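Using Azure AI Search as a vector database for OpenAI embeddings

This notebook provides step-by-step instructions for using Azure AI Search (formerly Azure Cognitive Search) as a vector database with OpenAI embeddings. Azure AI Search is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.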

Prerequisites:

To complete this exercise, you need the following:
- An Azure AI Search service
- An OpenAI key or Azure OpenAI credentials

! pip install wget
! pip install azure-search-documents 
! pip install azure-identity
! pip install openai

Import required libraries

import json  
import wget
import pandas as pd
import zipfile
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient, SearchIndexingBufferedSender  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
    VectorizedQuery,
)
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)

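Configure OpenAI settings

This section guides you through setting up authentication for Azure OpenAI so that you can interact with the service securely using either Azure Active Directory (AAD) or an API key. Before proceeding, make sure you have your Azure OpenAI endpoint and credentials ready. For detailed instructions on setting up AAD with Azure OpenAI, refer to the official documentation.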

endpoint: str = "YOUR_AZURE_OPENAI_ENDPOINT"
api_key: str = "YOUR_AZURE_OPENAI_KEY"
api_version: str = "2023-05-15"
deployment = "YOUR_AZURE_OPENAI_DEPLOYMENT_NAME"
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# Set this flag to True if you are using Azure Active Directory.
use_aad_for_aoai = True

if use_aad_for_aoai:
    # Use Azure Active Directory (AAD) authentication
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
else:
    # Use API key authentication
    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=endpoint,
    )
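Hard-coding keys in a notebook makes them easy to leak. The following is a minimal sketch, assuming the settings are supplied through environment variables. The variable names `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_OPENAI_DEPLOYMENT`, and the helper `load_aoai_settings`, are hypothetical choices for illustration and are not required by the SDK:

```python
import os


def load_aoai_settings(env=None):
    """Collect Azure OpenAI settings from environment variables.

    The variable names used here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    deployment = env.get("AZURE_OPENAI_DEPLOYMENT")
    if not endpoint or not deployment:
        raise ValueError(
            "AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_DEPLOYMENT must be set"
        )
    # Treat a missing or empty key as "no key available".
    api_key = env.get("AZURE_OPENAI_API_KEY") or None
    return {
        "endpoint": endpoint,
        "deployment": deployment,
        "api_key": api_key,
        # With no key available, fall back to AAD (token-based) authentication.
        "use_aad": api_key is None,
    }
```

The returned values could then stand in for the literals assigned to `endpoint`, `api_key`, `deployment`, and `use_aad_for_aoai` in the cell above.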

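Configure Azure AI Search vector store settings

This section explains how to set up the Azure AI Search client for integration with the vector store feature. You can find your Azure AI Search service details in the Azure portal or programmatically through the Search Management SDK.

# Configuration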
search_service_endpoint: str = "YOUR_AZURE_SEARCH_ENDPOINT"
search_service_api_key: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"
index_name: str = "azure-ai-search-openai-cookbook-demo"

# Set this flag to True if you are using Azure Active Directory.
use_aad_for_search = True

if use_aad_for_search:
    # Use Azure Active Directory (AAD) authentication
    credential = DefaultAzureCredential()
else:
    # Use API key authentication
    credential = AzureKeyCredential(search_service_api_key)

# Initialize the SearchClient with the selected authentication method
search_client = SearchClient(
    endpoint=search_service_endpoint, index_name=index_name, credential=credential
)

Load data

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is about 700 MB, so the download will take some time.
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("../../data")

article_df = pd.read_csv("../../data/vector_database_wikipedia_articles_embedded.csv")

# Read the vectors back from strings into lists using `json.loads`
article_df["title_vector"] = article_df.title_vector.apply(json.loads)
article_df["content_vector"] = article_df.content_vector.apply(json.loads)
article_df["vector_id"] = article_df["vector_id"].apply(str)
article_df.head()

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the J...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year ...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imag...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alph...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4
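The index used in this notebook expects 1536-dimensional vectors, so it can be worth checking the parsed columns before uploading. The following is a minimal sketch using only the standard library. `parse_vector` is a hypothetical helper, and the default of 1536 is taken from the `vector_search_dimensions` value used in this notebook:

```python
import json


def parse_vector(raw, expected_dim=1536):
    """Parse a JSON-encoded vector and check its length and element types."""
    vector = json.loads(raw)
    if not isinstance(vector, list):
        raise ValueError("expected a JSON array")
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector
    ):
        raise ValueError("vector contains non-numeric values")
    return [float(x) for x in vector]
```

It could be used in place of `json.loads`, for example `article_df.title_vector.apply(parse_vector)`.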

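Create an index

This code snippet demonstrates how to define and create a search index using the SearchIndexClient from the Azure AI Search Python SDK. The index combines vector search with the semantic ranker. For more details, see our documentation on how to create a vector index.

# Initialize the SearchIndexClient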
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential
)

# Define the fields for the index
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="url", type=SearchFieldDataType.String),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SearchField(
        name="title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
]

# Set up the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw",
            kind=VectorSearchAlgorithmKind.HNSW,
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric=VectorSearchAlgorithmMetric.COSINE,
            ),
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-config",
            algorithm_configuration_name="my-hnsw",
        )
    ],
)

# Set up the semantic search configuration
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="my-semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                keywords_fields=[SemanticField(field_name="url")],
                content_fields=[SemanticField(field_name="text")],
            ),
        )
    ]
)

# Create the search index with the vector search and semantic search configurations
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search,
)

# Create or update the index
result = index_client.create_or_update_index(index)
print(f"{result.name} created")

azure-ai-search-openai-cookbook-demo created

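Upload data to the Azure AI Search index

The following code snippet outlines how to upload a batch of documents (specifically, Wikipedia articles with precomputed embeddings) from a pandas DataFrame to an Azure AI Search index. For a detailed guide to data import strategies and best practices, see Data import in Azure AI Search.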

from azure.core.exceptions import HttpResponseError

# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field
article_df["id"] = article_df["id"].astype(str)
article_df["vector_id"] = article_df["vector_id"].astype(str)
# Convert the DataFrame to a list of dictionaries
documents = article_df.to_dict(orient="records")

# Create a `SearchIndexingBufferedSender`
batch_client = SearchIndexingBufferedSender(
    search_service_endpoint, index_name, credential
)

try:
    # Add upload actions for all documents in a single call
    batch_client.upload_documents(documents=documents)

    # Manually flush to send any documents remaining in the buffer
    batch_client.flush()
except HttpResponseError as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up resources
    batch_client.close()

print(f"Uploaded {len(documents)} documents in total")

Uploaded 25000 documents in total
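`SearchIndexingBufferedSender` batches documents on its own. If you call `SearchClient.upload_documents` directly instead, the documents need to be split into batches first, since the service limits the size of a single indexing request. The following is a minimal sketch of such a splitter. `chunked` is a hypothetical helper, and the default batch size of 1000 is an assumption based on the documented per-request document limit:

```python
def chunked(items, size=1000):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be passed to `search_client.upload_documents(documents=batch)`.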

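If your dataset does not already contain precomputed embeddings, you can create them with the function below, which uses the openai Python library. Note that the same function and model are also used to generate the query embeddings for vector search.

# Example function for generating document embeddings
def generate_embeddings(text, model):
    # Generate an embedding for the provided text using the specified model.
    embeddings_response = client.embeddings.create(model=model, input=text)
    # Extract the embedding from the response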
    embedding = embeddings_response.data[0].embedding
    return embedding


first_document_content = documents[0]["text"]
print(f"Content: {first_document_content[:100]}")

content_vector = generate_embeddings(first_document_content, deployment)
print("Content vector generated")

Content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March
Content vector generated
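To embed many documents, it is usually cheaper to send several texts per request than to call the function once per document. The following is a minimal sketch that keeps the batching logic independent of any client. `embed_in_batches` is a hypothetical helper, `embed_fn` is any callable that maps a list of strings to one vector per string in the same order, and the batch size of 16 is an arbitrary illustrative choice:

```python
def embed_in_batches(texts, embed_fn, batch_size=16):
    """Embed `texts` by calling `embed_fn` on chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        chunk_vectors = embed_fn(chunk)
        # Guard against a backend that silently drops or duplicates inputs.
        if len(chunk_vectors) != len(chunk):
            raise ValueError(
                "embed_fn returned a different number of vectors than inputs"
            )
        vectors.extend(chunk_vectors)
    return vectors
```

With the client from this notebook, `embed_fn` could be `lambda chunk: [item.embedding for item in client.embeddings.create(model=deployment, input=chunk).data]`, assuming the response items come back in input order.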

Perform a vector similarity search

# Pure vector search
query = "modern art in Europe"
  
search_client = SearchClient(search_service_endpoint, index_name, credential)  
vector_query = VectorizedQuery(vector=generate_embeddings(query, deployment), k_nearest_neighbors=3, fields="content_vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query], 
    select=["title", "text", "url"] 
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  

Title: Documenta
Score: 0.8599451
URL: https://simple.wikipedia.org/wiki/Documenta

Title: Museum of Modern Art
Score: 0.85260946
URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art

Title: Expressionism
Score: 0.852354
URL: https://simple.wikipedia.org/wiki/Expressionism
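With the cosine metric, the `@search.score` of a pure vector query is not the raw cosine similarity. To my understanding of the Azure AI Search documentation, it is computed as 1 / (1 + cosine distance), where cosine distance is 1 minus cosine similarity. The following is a minimal sketch of the conversion under that assumption; both function names are hypothetical:

```python
def similarity_to_score(cosine_similarity):
    """Map a cosine similarity in [-1, 1] to a score, assuming 1 / (1 + distance)."""
    cosine_distance = 1.0 - cosine_similarity
    return 1.0 / (1.0 + cosine_distance)


def score_to_similarity(score):
    """Invert similarity_to_score for a score in (0, 1]."""
    cosine_distance = (1.0 - score) / score
    return 1.0 - cosine_distance
```

Under this assumption, the top score above (0.8599451) corresponds to a cosine similarity of roughly 0.837.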

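Perform a hybrid search

Hybrid search combines traditional keyword-based search with vector-based similarity search to return more relevant, context-aware results. This approach is especially useful for complex queries that benefit from understanding the semantic meaning behind the text.

The following code snippet demonstrates how to run a hybrid search query:

# Hybrid search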
query = "Famous battles in Scottish history"  
  
search_client = SearchClient(search_service_endpoint, index_name, credential)  
vector_query = VectorizedQuery(vector=generate_embeddings(query, deployment), k_nearest_neighbors=3, fields="content_vector")
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query], 
    select=["title", "text", "url"],
    top=3
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  

Title: Wars of Scottish Independence
Score: 0.03306011110544205
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence

Title: Battle of Bannockburn
Score: 0.022253260016441345
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn

Title: Scottish
Score: 0.016393441706895828
URL: https://simple.wikipedia.org/wiki/Scottish
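Hybrid scores are much smaller than the pure vector scores because, according to the Azure AI Search documentation, the keyword and vector result lists are merged with Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over the lists it appears in. The following is a minimal sketch of the idea. The function name is hypothetical, and the constant k = 60 and the 1-based ranks are assumptions about the service's behavior rather than something this notebook confirms:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into (id, score) pairs sorted by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document ranked first by only one of the two queries would score 1/61 ≈ 0.0164 under these assumptions, which is in line with the third result above.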

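Hybrid search with reranking (powered by Bing)

The semantic ranker significantly improves search relevance by using language understanding to rerank search results. In addition, you can retrieve captions, answers, and highlights.

# Semantic hybrid search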
query = "What were the key technological advancements during the Industrial Revolution?"

search_client = SearchClient(search_service_endpoint, index_name, credential)
vector_query = VectorizedQuery(
    vector=generate_embeddings(query, deployment),
    k_nearest_neighbors=3,
    fields="content_vector",
)

results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "text", "url"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3,
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"URL: {result['url']}")
    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

Semantic Answer: Advancements  During the industrial revolution, new technology brought many changes. For example:<em>   Canals</em> were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.
Semantic Answer Score: 0.90478515625

Title: Industrial Revolution
Reranker Score: 3.408700942993164
URL: https://simple.wikipedia.org/wiki/Industrial%20Revolution
Caption: Advancements  During the industrial revolution, new technology brought many changes. For example:   Canals were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.

Title: Printing
Reranker Score: 1.603400707244873
URL: https://simple.wikipedia.org/wiki/Printing
Caption: Machines to speed printing, cheaper paper, automatic stitching and binding all arrived in the 19th century during the industrial revolution. What had once been done by a few men by hand was now done by limited companies on huge machines. The result was much lower prices, and a much wider readership.

Title: Industrialisation
Reranker Score: 1.3238357305526733
URL: https://simple.wikipedia.org/wiki/Industrialisation
Caption: <em>Industrialisation</em> (or<em> industrialization)</em> is a process that happens in countries when they start to use machines to do work that was once done by people.<em> Industrialisation changes</em> the things people do.<em> Industrialisation</em> caused towns to grow larger. Many people left farming to take higher paid jobs in factories in towns.
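Highlights come back with `<em>` tags around the matched phrases, as the output above shows. The following is a minimal sketch for rendering them as plain text. `strip_highlight_tags` is a hypothetical helper and assumes the `<em>` markers seen in this output:

```python
import re

_EM_TAG = re.compile(r"</?em>")


def strip_highlight_tags(text):
    """Remove <em> and </em> markers and collapse runs of whitespace."""
    without_tags = _EM_TAG.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()
```

It could be applied to `caption.highlights` or `answer.highlights` before printing.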
