Introduction

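This notebook is an example of using SingleStoreDB vector storage and functions to build an interactive Q&A application with ChatGPT. If you have started a trial in SingleStoreDB, you can find the same notebook among our sample notebooks and connect natively.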

First, let's talk to ChatGPT directly and try to get a response

# The ChatCompletion calls below use the pre-1.0 interface of the openai package, so pin it.
!pip install "openai<1.0" --quiet

import openai

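# Query embeddings must come from the same model that produced the precomputed embeddings in the CSV.
EMBEDDING_MODEL = "text-embedding-3-small"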
GPT_MODEL = "gpt-3.5-turbo"

Let's connect to OpenAI and see what we get when we ask about an event that took place after 2021

openai.api_key = 'OPENAI API KEY'

response = openai.ChatCompletion.create(
  model=GPT_MODEL,
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in Olympics 2022?"},
    ]
)

print(response['choices'][0]['message']['content'])

I'm sorry, I cannot provide information about events that have not occurred yet. The Winter Olympics 2022 will be held in Beijing, China from February 4 to 20, 2022. The curling events will take place during this time and the results will not be known until after the competition has concluded.

Get data about the Winter Olympics and provide it to ChatGPT as context

1. Setup

!pip install matplotlib plotly.express scikit-learn tabulate tiktoken wget --quiet

import pandas as pd
import os
import wget
import ast

Step 1 - Fetch the data from the CSV and prepare it

# Download the pre-chunked text and precomputed embeddings.
# The file is about 200 MB, so depending on your connection speed the download may take a minute.
embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
file_path = "winter_olympics_2022.csv"

if not os.path.exists(file_path):
    wget.download(embeddings_path, file_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

File downloaded successfully.

df = pd.read_csv(
    "winter_olympics_2022.csv"
)

# Convert the embeddings from CSV strings back into lists
df['embedding'] = df['embedding'].apply(ast.literal_eval)

df

df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6059 entries, 0 to 6058
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       6059 non-null   object
 1   embedding  6059 non-null   object
dtypes: object(2)
memory usage: 94.8+ KB
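
A CSV file stores every cell as text, which is why each embedding arrives as a string and has to be parsed back into a list. The sketch below reproduces that round trip on two made-up rows (not the real dataset), using only the standard library; the names `rows`, `raw_value`, and `parsed_value` are illustrative.

```python
import ast
import csv
import io

# Two made-up rows standing in for the real dataset.
rows = [
    {"text": "first chunk", "embedding": [0.1, -0.2, 0.3]},
    {"text": "second chunk", "embedding": [0.4, 0.5, -0.6]},
]

# Writing to CSV turns each list into its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)

# Reading it back yields plain strings, not lists.
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["embedding"]

# ast.literal_eval parses the string as a Python literal without executing code.
parsed_value = ast.literal_eval(raw_value)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for data read from a file.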

2. Set up SingleStoreDB

import singlestoredb as s2

conn = s2.connect("<user>:<Password>@<host>:3306/")

cur = conn.cursor()

# Create the database
stmt = """
    CREATE DATABASE IF NOT EXISTS winter_wikipedia2;
"""

cur.execute(stmt)

# Create the table
stmt = """
CREATE TABLE IF NOT EXISTS winter_wikipedia2.winter_olympics_2022 (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);"""

cur.execute(stmt)
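
The `embedding` column is a BLOB because each vector is stored in packed binary form rather than as JSON text. As a rough local illustration of what packing a JSON array into 64-bit floats means, here is a sketch in plain Python. It is not SingleStoreDB's implementation: `pack_f64` and `unpack_f64` are hypothetical helper names, and the little-endian byte order is an assumption.

```python
import json
import struct


def pack_f64(json_array: str) -> bytes:
    """Pack a JSON array of numbers into consecutive 8-byte floats.

    Little-endian byte order is assumed here for illustration.
    """
    values = json.loads(json_array)
    return struct.pack(f"<{len(values)}d", *values)


def unpack_f64(blob: bytes) -> list:
    """Inverse of pack_f64: read the blob back as a list of floats."""
    count = len(blob) // 8
    return list(struct.unpack(f"<{count}d", blob))


# Three numbers become 3 * 8 = 24 bytes.
blob = pack_f64("[0.5, -1.25, 2.0]")
restored = unpack_f64(blob)
```

Each number takes a fixed 8 bytes in the packed form, so the blob size grows linearly with the vector length.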

3. Populate the table with our DataFrame df, using JSON_ARRAY_PACK_F64 to pack each embedding into binary form.

%%time

# Prepare the INSERT statement
stmt = """
    INSERT INTO winter_wikipedia2.winter_olympics_2022 (
        id,
        text,
        embedding
    )
    VALUES (
        %s,
        %s,
        JSON_ARRAY_PACK_F64(%s)
    )
"""

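# Convert the DataFrame to a NumPy record array (the index becomes the id column)
record_arr = df.to_records(index=True)

# Set the batch size
batch_size = 1000

# Iterate over the rows of the record array in batches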
for i in range(0, len(record_arr), batch_size):
    batch = record_arr[i:i+batch_size]
    values = [(row[0], row[1], str(row[2])) for row in batch]
    cur.executemany(stmt, values)

CPU times: user 8.79 s, sys: 4.63 s, total: 13.4 s
Wall time: 11min 4s
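
The loop above slices the records into fixed-size batches, with a shorter final batch when the row count is not a multiple of the batch size. The same slicing logic can be sketched as a small generator; `batches` is a hypothetical helper name, and `range(6059)` merely stands in for the 6059 rows reported by `df.info()`.

```python
from typing import Iterator, Sequence


def batches(rows: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# range(6059) stands in for the 6059 rows of the DataFrame.
sizes = [len(batch) for batch in batches(range(6059), 1000)]
```

With 6059 rows and a batch size of 1000, this gives six full batches and a final batch of 59 rows, so `executemany` is called seven times.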

4. Run a semantic search on the same topic as the question above, so the results can be sent back to OpenAI.

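from scipy import spatial  # needed by the default relatedness_fn
from utils.embeddings_utils import get_embedding

def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Return a list of strings and their relatednesses, sorted from most to least related.

    The ranking is computed in SingleStoreDB, so df and relatedness_fn are not used here;
    they are kept so that existing callers keep working.
    """

    # Get the embedding of the query.
    query_embedding_response = get_embedding(query, EMBEDDING_MODEL)

    # Build the SQL statement: score every row by its dot product with the query embedding.
    stmt = """
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), embedding) AS score
        FROM winter_wikipedia2.winter_olympics_2022
        ORDER BY score DESC
        LIMIT %s
    """

    # Execute the SQL statement.
    cur.execute(stmt, [str(query_embedding_response), top_n])

    # Fetch the results.
    results = cur.fetchall()

    strings = []
    relatednesses = []

    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results (the query already limits them to top_n rows).
    return strings, relatednesses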

from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    top_n=5
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
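
The SQL query scores every row by its dot product with the query embedding, sorts by descending score, and keeps the first `top_n` rows. The sketch below mimics that ranking locally on made-up two-dimensional vectors; `dot`, `cosine`, and `rank_by_dot_product` are hypothetical helper names. It also shows that dot product and cosine similarity agree when the vectors have unit length, which is assumed (not verified here) to hold for the embeddings in this dataset.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_by_dot_product(query, rows, top_n):
    """Mimic ORDER BY score DESC LIMIT top_n for (text, embedding) rows."""
    scored = [(text, dot(query, embedding)) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up unit-length vectors, not real embeddings.
rows = [
    ("curling", [1.0, 0.0]),
    ("skiing", [0.0, 1.0]),
    ("mixed doubles curling", [0.6, 0.8]),
]
query = [1.0, 0.0]
top = rank_by_dot_product(query, rows, top_n=2)
```

If the stored vectors were not normalized, longer vectors would score higher under a plain dot product, and cosine similarity would be the safer measure.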

5. Send the right context to ChatGPT to get a more accurate answer

import tiktoken

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Retrieve relevant source texts from SingleStoreDB and return a message for GPT."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
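
`query_message` adds article sections in ranked order and stops as soon as the next section would push the message over the token budget. The sketch below isolates that greedy packing rule. To stay self-contained it counts whitespace-separated words instead of real tokens, so `count_tokens` is only a stand-in for the tiktoken-based `num_tokens`, and `pack_sections` is a hypothetical name.

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def pack_sections(introduction, sections, question, token_budget):
    """Append sections in order until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f"\n\n{section}"
        # The question is always appended, so it counts against the budget too.
        if count_tokens(message + candidate + question) > token_budget:
            break
        message += candidate
    return message + question


message = pack_sections(
    introduction="Use the articles below.",
    sections=["alpha beta gamma", "delta epsilon", "zeta eta theta iota"],
    question="\n\nQuestion: who won?",
    token_budget=12,
)
```

Here the first two sections fit exactly within the budget of 12 and the third is dropped. Because the loop breaks at the first section that does not fit, a shorter section further down the ranking is never considered.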

6. Get the answer from ChatGPT

from pprint import pprint

answer = ask('Who won the gold medal for curling in Olympics 2022?')

pprint(answer)

("There were three curling events at the 2022 Winter Olympics: men's, women's, "
 'and mixed doubles. The gold medalists for each event are:\n'
 '\n'
 "- Men's: Sweden (Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer "
 'Sundgren, Daniel Magnusson)\n'
 "- Women's: Great Britain (Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey "
 'Duff, Mili Smith)\n'
 '- Mixed doubles: Italy (Stefania Constantini, Amos Mosaner)')
