Embedding Wikipedia articles for search

This notebook shows how we prepared a dataset of Wikipedia articles for search, used in Question_answering_using_embeddings.ipynb.

Procedure:

  1. Prerequisites: Import libraries, set API key (if needed)
  2. Collect: We download a few hundred Wikipedia articles about the 2022 Olympics
  3. Chunk: Documents are split into short, mostly self-contained sections to be embedded
  4. Embed: Each section is embedded with the OpenAI API
  5. Store: Embeddings are saved in a CSV file (for large datasets, use a vector database)

0. Prerequisites

Import libraries

# imports
import mwclient  # for downloading example Wikipedia articles
import mwparserfromhell  # for splitting Wikipedia articles into sections
import openai  # for generating embeddings
import os  # for reading environment variables
import pandas as pd  # for DataFrames to store article sections and embeddings
import re  # for cutting <ref> links out of Wikipedia articles
import tiktoken  # for counting tokens

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

Install any missing libraries with pip install in your terminal. E.g.,

pip install openai

(You can also do this in a notebook cell with !pip install openai.)

If you install any libraries, make sure to restart the notebook kernel.

Set API key (if needed)

Note that the OpenAI library will try to read your API key from the OPENAI_API_KEY environment variable. If you haven't already, set this environment variable by following these instructions.
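
If you'd rather set the key only for the current notebook session, a minimal sketch (the placeholder value is an assumption; avoid hard-coding real keys in shared notebooks):

import os

os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"  # placeholder value, not a real key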

1. Collect documents

In this example, we'll download a few hundred Wikipedia articles related to the 2022 Winter Olympics.

# get Wikipedia pages about the 2022 Winter Olympics

CATEGORY_TITLE = "Category:2022 Winter Olympics"
WIKI_SITE = "en.wikipedia.org"


def titles_from_category(
    category: mwclient.listing.Category, max_depth: int
) -> set[str]:
    """Return a set of page titles in a given Wikipedia category and its subcategories."""
    titles = set()
    for cm in category.members():
        if type(cm) == mwclient.page.Page:
            # ^type() used instead of isinstance() to catch matches without inheritance
            titles.add(cm.name)
        elif isinstance(cm, mwclient.listing.Category) and max_depth > 0:
            deeper_titles = titles_from_category(cm, max_depth=max_depth - 1)
            titles.update(deeper_titles)
    return titles


site = mwclient.Site(WIKI_SITE)
category_page = site.pages[CATEGORY_TITLE]
titles = titles_from_category(category_page, max_depth=1)
# ^note: max_depth=1 means we go one level deep in the category tree
print(f"Found {len(titles)} article titles in {CATEGORY_TITLE}.")

Found 731 article titles in Category:2022 Winter Olympics.
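
To spot-check the collection, you can print a few of the titles; a minimal peek (the exact titles returned will vary as the category contents change over time):

# peek at a few of the collected titles
for t in sorted(titles)[:3]:
    print(t)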

2. Chunk documents

Now that we have our reference documents, we need to prepare them for search.

Because GPT can only read a limited amount of text at once, we'll split each document into chunks short enough to be read.

For this specific example with Wikipedia articles, we'll:

  - Discard less relevant-looking sections like External links and Footnotes
  - Clean up the text by removing reference tags (e.g., <ref>), whitespace, and super-short sections
  - Split each article into sections
  - Prepend titles and subtitles to each section's text, to help GPT understand the context
  - If a section is long (say, > 1,600 tokens), we'll recursively split it into smaller sections, trying to split along semantic boundaries like paragraphs

# define functions to split Wikipedia pages into sections

SECTIONS_TO_IGNORE = [
    "See also",
    "References",
    "External links",
    "Further reading",
    "Footnotes",
    "Bibliography",
    "Sources",
    "Citations",
    "Literature",
    "Notes and references",
    "Photo gallery",
    "Works cited",
    "Photos",
    "Gallery",
    "Notes",
    "References and sources",
    "References and notes",
]


def all_subsections_from_section(
    section: mwparserfromhell.wikicode.Wikicode,
    parent_titles: list[str],
    sections_to_ignore: set[str],
) -> list[tuple[list[str], str]]:
    """
    From a Wikipedia section, return a flattened list of all nested subsections.
    Each subsection is a tuple, where:
        - the first element is a list of parent subtitles, starting with the page title
        - the second element is the text of the subsection (but not any children)
    """
    headings = [str(h) for h in section.filter_headings()]
    title = headings[0]
    if title.strip("=" + " ") in sections_to_ignore:
        # ^wiki headings are wrapped like "== Heading =="
        return []
    titles = parent_titles + [title]
    full_text = str(section)
    section_text = full_text.split(title)[1]
    if len(headings) == 1:
        return [(titles, section_text)]
    else:
        first_subtitle = headings[1]
        section_text = section_text.split(first_subtitle)[0]
        results = [(titles, section_text)]
        for subsection in section.get_sections(levels=[len(titles) + 1]):
            results.extend(all_subsections_from_section(subsection, titles, sections_to_ignore))
        return results


def all_subsections_from_title(
    title: str,
    sections_to_ignore: set[str] = SECTIONS_TO_IGNORE,
    site_name: str = WIKI_SITE,
) -> list[tuple[list[str], str]]:
    """From a Wikipedia page title, return a flattened list of all nested subsections.
    Each subsection is a tuple, where:
        - the first element is a list of parent subtitles, starting with the page title
        - the second element is the text of the subsection (but not any children)
    """
    site = mwclient.Site(site_name)
    page = site.pages[title]
    text = page.text()
    parsed_text = mwparserfromhell.parse(text)
    headings = [str(h) for h in parsed_text.filter_headings()]
    if headings:
        summary_text = str(parsed_text).split(headings[0])[0]
    else:
        summary_text = str(parsed_text)
    results = [([title], summary_text)]
    for subsection in parsed_text.get_sections(levels=[2]):
        results.extend(all_subsections_from_section(subsection, [title], sections_to_ignore))
    return results

# split pages into sections
# may take ~1 minute per 100 articles
wikipedia_sections = []
for title in titles:
    wikipedia_sections.extend(all_subsections_from_title(title))
print(f"Found {len(wikipedia_sections)} sections in {len(titles)} pages.")

Found 5730 sections in 731 pages.

# clean text
def clean_section(section: tuple[list[str], str]) -> tuple[list[str], str]:
    """
    Return a cleaned-up section with:
        - <ref>xyz</ref> patterns removed
        - leading/trailing whitespace removed
    """
    titles, text = section
    text = re.sub(r"<ref.*?</ref>", "", text)
    text = text.strip()
    return (titles, text)


wikipedia_sections = [clean_section(ws) for ws in wikipedia_sections]

# filter out short/blank sections
def keep_section(section: tuple[list[str], str]) -> bool:
    """Return True if the section should be kept, False otherwise."""
    titles, text = section
    if len(text) < 16:
        return False
    else:
        return True


original_num_sections = len(wikipedia_sections)
wikipedia_sections = [ws for ws in wikipedia_sections if keep_section(ws)]
print(f"Filtered out {original_num_sections-len(wikipedia_sections)} sections, leaving {len(wikipedia_sections)} sections.")

Filtered out 530 sections, leaving 5200 sections.

# print example data
for ws in wikipedia_sections[:5]:
    print(ws[0])
    display(ws[1][:77] + "...")
    print()

['Lviv bid for the 2022 Winter Olympics']
'{{Olympic bid|2022|Winter|\n| Paralympics = yes\n| logo = Lviv 2022 Winter Olym...'

['Lviv bid for the 2022 Winter Olympics', '==History==']
'[[Image:Lwów - Rynek 01.JPG|thumb|right|200px|View of Rynok Square in Lviv]]\n...'

['Lviv bid for the 2022 Winter Olympics', '==Venues==']
'{{Location map+\n|Ukraine\n|border =\n|caption = Venue areas\n|float = left\n|widt...'

['Lviv bid for the 2022 Winter Olympics', '==Venues==', '===City zone===']
'The main Olympic Park would be centered around the [[Arena Lviv]], hosting th...'

['Lviv bid for the 2022 Winter Olympics', '==Venues==', '===Mountain zone===', '====Venue cluster Tysovets-Panasivka====']
'An existing military ski training facility in [[Tysovets, Skole Raion|Tysovet...'

Next, we'll recursively split long sections into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:

  - Longer sections may be better for questions that require more context
  - Longer sections may be worse for retrieval, as they may have more topics muddled together
  - Shorter sections are better for reducing costs (which are proportional to the number of tokens)
  - Shorter sections allow more sections to be retrieved, which may help with recall
  - Overlapping sections may help prevent answers from being cut off by section boundaries (see the sketch below)
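
This notebook doesn't implement overlap, but to illustrate that last tradeoff, here is a minimal sketch of a token-overlap chunker; the function name, chunk size, and overlap amount are assumptions for illustration, not part of the original recipe:

def overlapping_chunks(
    text: str,
    max_tokens: int = 400,
    overlap: int = 100,
    model: str = "gpt-3.5-turbo",
) -> list[str]:
    """Split text into chunks of up to max_tokens tokens, where consecutive
    chunks share `overlap` tokens (hypothetical helper, for illustration)."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    chunks = []
    step = max_tokens - overlap  # advance by less than a full chunk so chunks overlap
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start : start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last chunk reached the end of the text
    return chunks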

Here, we'll use a simple approach that limits sections to 1,600 tokens each, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, we'll split along paragraph boundaries when possible.

GPT_MODEL = "gpt-3.5-turbo"  # only matters insofar as it selects which tokenizer to use


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str]:
    """Split a string in two, on a delimiter, trying to balance tokens on each side."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = True,
) -> str:
    """Truncate a string to a maximum number of tokens."""
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated_string = encoding.decode(encoded_string[:max_tokens])
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated_string


def split_strings_from_subsection(
    subsection: tuple[list[str], str],
    max_tokens: int = 1000,
    model: str = GPT_MODEL,
    max_recursion: int = 5,
) -> list[str]:
    """
    Split a subsection into a list of subsection strings, each with no more than max_tokens.
    Each subsection is a tuple of parent titles [H1, H2, ...] and text (str).
    """
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    num_tokens_in_string = num_tokens(string)
    # if length is fine, return string
    if num_tokens_in_string <= max_tokens:
        return [string]
    # if recursion hasn't found a split after X iterations, just truncate
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    # otherwise, split in half and recurse
    else:
        titles, text = subsection
        for delimiter in ["\n\n", "\n", ". "]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":
                # if either half is empty, retry with a more fine-grained delimiter
                continue
            else:
                # recurse on each half
                results = []
                for half in [left, right]:
                    half_subsection = (titles, half)
                    half_strings = split_strings_from_subsection(
                        half_subsection,
                        max_tokens=max_tokens,
                        model=model,
                        max_recursion=max_recursion - 1,
                    )
                    results.extend(half_strings)
                return results
    # otherwise no split was found, so just truncate (should be very rare)
    return [truncated_string(string, model=model, max_tokens=max_tokens)]
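
As a quick sanity check of the helpers above, you can try them on a small string; a minimal sketch (the printed token count depends on the tokenizer, so treat any output as illustrative):

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(num_tokens(sample))  # token count of the sample string
print(halved_by_delimiter(sample, delimiter="\n\n"))  # two roughly token-balanced halves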

# split sections into chunks
MAX_TOKENS = 1600
wikipedia_strings = []
for section in wikipedia_sections:
    wikipedia_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

5200 Wikipedia sections split into 6059 strings.

# print example data
print(wikipedia_strings[1])

Lviv bid for the 2022 Winter Olympics

==History==

[[Image:Lwów - Rynek 01.JPG|thumb|right|200px|View of Rynok Square in Lviv]]

On 27 May 2010, [[President of Ukraine]] [[Viktor Yanukovych]] stated during a visit to [[Lviv]] that Ukraine "will start working on the official nomination of our country as the holder of the Winter Olympic Games in [[Carpathian Mountains|Carpathians]]".

In September 2012, [[government of Ukraine]] approved a document about the technical-economic substantiation of the national project "Olympic Hope 2022". This was announced by Vladyslav Kaskiv, the head of Ukraine´s Derzhinvestproekt (State investment project). The organizers announced on their website venue plans featuring Lviv as the host city and location for the "ice sport" venues, [[Volovets]] (around {{convert|185|km|mi|abbr=on}} from Lviv) as venue for the [[Alpine skiing]] competitions and [[Tysovets, Skole Raion|Tysovets]] (around {{convert|130|km|mi|abbr=on}} from Lviv) as venue for all other "snow sport" competitions. By March 2013 no other preparations than the feasibility study had been approved.

On 24 October 2013, session of the Lviv City Council adopted a resolution "About submission to the International Olympic Committee for nomination of city to participate in the procedure for determining the host city of Olympic and Paralympic Winter Games in 2022".

On 5 November 2013, it was confirmed that Lviv was bidding to host the [[2022 Winter Olympics]]. Lviv would host the ice sport events, while the skiing events would be held in the [[Carpathian]] mountains. This was the first bid Ukraine had ever submitted for an Olympic Games.

On 30 June 2014, the International Olympic Committee announced "Lviv will turn its attention to an Olympic bid for 2026, and not continue with its application for 2022. The decision comes as a result of the present political and economic circumstances in Ukraine."

Ukraine's Deputy Prime Minister Oleksandr Vilkul said that the Winter Games "will be an impetus not just for promotion of sports and tourism in Ukraine, but a very important component in the economic development of Ukraine, the attraction of the investments, the creation of new jobs, opening Ukraine to the world, returning Ukrainians working abroad to their motherland."

Lviv was one of the host cities of [[UEFA Euro 2012]].

3. Embed document chunks

Now that we've split our library into shorter self-contained strings, we can compute embeddings for each.

(For large embedding jobs, use a script like api_request_parallel_processor.py to parallelize requests while throttling to stay under rate limits.)
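
If you'd rather stay in a single process, a simple retry-with-backoff wrapper is another option. This sketch assumes the client created earlier and the EMBEDDING_MODEL defined in the next cell; embed_with_backoff is a hypothetical helper, not part of the original notebook:

import random
import time


def embed_with_backoff(batch: list[str], max_retries: int = 5) -> list[list[float]]:
    """Embed one batch of strings, retrying with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
            return [e.embedding for e in response.data]
        except openai.RateLimitError:
            sleep_s = 2**attempt + random.random()  # exponential backoff with jitter
            print(f"Rate limited; retrying in {sleep_s:.1f} seconds")
            time.sleep(sleep_s)
    raise RuntimeError("Exceeded maximum retries for embedding request")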

EMBEDDING_MODEL = "text-embedding-3-small"
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request

embeddings = []
for batch_start in range(0, len(wikipedia_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = wikipedia_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response.data):
        assert i == be.index  # double check embeddings are in same order as input
    batch_embeddings = [e.embedding for e in response.data]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": wikipedia_strings, "embedding": embeddings})

Batch 0 to 999
Batch 1000 to 1999
Batch 2000 to 2999
Batch 3000 to 3999
Batch 4000 to 4999
Batch 5000 to 5999
Batch 6000 to 6999

4. Store document chunks and embeddings

Because this example only uses a few thousand strings, we'll store them in a CSV file.

(For larger datasets, use a vector database, which will be more performant.)

# save document chunks and embeddings

SAVE_PATH = "data/winter_olympics_2022.csv"

df.to_csv(SAVE_PATH, index=False)