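2. Creating a synthetic Q&A dataset
We use davinci-instruct-beta-v3, a model specialized in following instructions, to create questions based on the given context. We then use davinci-instruct-beta-v3 again to answer those questions, given the same context.
This is expensive and takes a long time, because we call the davinci engine for each section. You can simply download the final dataset instead.
We're using the dataset created in the previous notebook.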
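To get a feel for why this step is costly, the token usage can be estimated before running it. The sketch below assumes one completion call per section for questions and one for answers, each capped at max_tokens=257 as in the calls used in this notebook. The function name `estimate_token_budget` and the `prompt_overhead` allowance are illustrative assumptions, and the result is a rough estimate rather than a measurement.

```python
def estimate_token_budget(section_tokens, max_tokens=257, prompt_overhead=20):
    """Rough token estimate for one question call and one answer call per section.

    prompt_overhead is an assumed allowance for the instruction text in each prompt.
    """
    total = 0
    for tokens in section_tokens:
        # Question call: context and instructions in, up to max_tokens out.
        question_call = tokens + prompt_overhead + max_tokens
        # Answer call: context, instructions and generated questions in, up to max_tokens out.
        answer_call = tokens + prompt_overhead + max_tokens + max_tokens
        total += question_call + answer_call
    return total

# Token counts of the first five sections shown in the table in section 2.1.
sample_sections = [713, 126, 369, 298, 163]
print(estimate_token_budget(sample_sections))
```

Multiplying the per-section figure by the number of sections in the full dataset gives an order-of-magnitude budget for the whole run.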
2.1 Read in the data, and create a context
Create a context by concatenating the title, the heading and the content of each section.
import pandas as pd
df = pd.read_csv('olympics-data/olympics_sections.csv')
df['context'] = df.title + "\n" + df.heading + "\n\n" + df.content
df.head()
 | title | heading | content | tokens | context |
---|---|---|---|---|---|
0 | 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 713 | 2020 Summer Olympics\nSummary\n\nThe 2020 Summ... |
1 | 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 | 2020 Summer Olympics\nHost city selection\n\nT... |
2 | 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 369 | 2020 Summer Olympics\nImpact of the COVID-19 p... |
3 | 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 | 2020 Summer Olympics\nQualifying event cancell... |
4 | 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 | 2020 Summer Olympics\nEffect on doping tests\n... |
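The layout of the context string matters later, because check_context in section 2.7 splits it on newlines and compares the first two lines with the title and heading. A minimal sketch on a toy one-row frame (the row content is abbreviated and only illustrative) shows the resulting structure:

```python
import pandas as pd

# Toy one-row frame standing in for olympics_sections.csv (content abbreviated).
toy = pd.DataFrame({
    "title": ["2020 Summer Olympics"],
    "heading": ["Summary"],
    "content": ["The 2020 Summer Olympics was held in Tokyo."],
})

# Same concatenation as above: title, newline, heading, blank line, content.
toy["context"] = toy.title + "\n" + toy.heading + "\n\n" + toy.content

# Splitting on newlines recovers title, heading, an empty line, then the content.
title, heading, blank, content = toy.context[0].split("\n")
print(repr(toy.context[0]))
```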
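2.2 Creating questions based on the context
Use davinci-instruct to generate a number of plausible questions relating to the Wikipedia section contents.
Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get more diverse questions.
WARNING: This step will take a long time and consume a lot of tokens, as it calls davinci-instruct for every section to generate a number of questions.
import os
from openai import OpenAI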
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

def get_questions(context):
    try:
        # prompt= belongs to the completions endpoint, not chat.completions
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"],
        )
        return response.choices[0].text
    except Exception:
        # Any API failure yields an empty question list for this section
        return ""

df['questions'] = df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])
1. What is the 2020 Summer Olympics?
2. When did the 2020 Summer Olympics take place?
3. Who won the most medals at the 2020 Summer Olympics?
4. Who won the most gold medals at the 2020 Summer Olympics?
5. Who won the most medals at the 2020 Summer Olympics?
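The prompt is designed to generate a number of questions. The example questions above were generated from the summary section of the 2020 Summer Olympics page.
We can observe that questions 3 and 5 above are duplicates. Sometimes the generated questions can also be ambiguous without the context. We will show that, even with these limitations, we can create a successful model.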
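The notebook keeps such duplicates in the dataset. If you wanted to flag them, a minimal sketch could compare the question texts after stripping the number prefix. The helper `find_duplicate_questions` is hypothetical and not part of the original pipeline.

```python
def find_duplicate_questions(numbered_questions):
    """Return the numbers of questions whose text repeats an earlier question."""
    seen = set()
    duplicates = []
    for line in numbered_questions.split("\n"):
        # "3. Who won ..." -> number "3", text "Who won ..."
        number, _, text = line.partition(". ")
        key = text.strip().lower()
        if not key or not number.strip().isdigit():
            continue  # skip blank or unnumbered lines
        if key in seen:
            duplicates.append(int(number))
        else:
            seen.add(key)
    return duplicates

sample = (
    "1. What is the 2020 Summer Olympics?\n"
    "2. When did the 2020 Summer Olympics take place?\n"
    "3. Who won the most medals at the 2020 Summer Olympics?\n"
    "4. Who won the most gold medals at the 2020 Summer Olympics?\n"
    "5. Who won the most medals at the 2020 Summer Olympics?"
)
print(find_duplicate_questions(sample))
```

Exact matching after lowercasing only catches verbatim repeats; paraphrased duplicates would need a fuzzier comparison.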
print(df.content.values[0])
The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July.
Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 name for marketing and branding purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic. The Summer Paralympics were held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.The 2020 Games were the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter) and Nagano 1998 (Winter) games. Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games were the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea and preceding the 2022 Winter Olympics in Beijing, China.
New events were introduced in existing sports for 2020, including 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allowed the host organizing committee to add new sports to the Olympic program for just one Games. The disciplines added by the Japanese Olympic Committee were baseball and softball, karate, sport climbing, surfing and skateboarding, the last four of which made their Olympic debuts, and the last three of which will remain on the Olympic program.The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88). Host nation Japan finished third, setting a record for the most gold medals and total medals ever won by their delegation at an Olympic Games with 27 and 58. Great Britain finished fourth, with a total of 22 gold and 65 medals, becoming the first nation at the Summer Olympics to increase or equal their total medals won in the two Games subsequent to hosting them. The Russian delegation competing as the ROC (not to be confused with the Republic of China (Taiwan) which competed as Chinese Taipei, not ROC) finished fifth with 20 gold medals and third in the overall medal count, with 71 medals. Bermuda, the Philippines and Qatar won their first-ever Olympic gold medals. Burkina Faso, San Marino and Turkmenistan won their first-ever Olympic medals.
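2.3 Creating answers based on the context
Use davinci-instruct to answer the questions given the relevant Wikipedia section contents.
Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get more diverse answers.
WARNING: This step will take a long time and consume a lot of tokens, as it calls davinci-instruct for every section to answer all the questions.
def get_answers(row):
    try:
        # prompt= belongs to the completions endpoint; the model is passed as model=
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write answer based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response.choices[0].text
    except Exception as e:
        print(e)
        return ""

df['answers'] = df.apply(get_answers, axis=1)
df['answers'] = "1." + df.answers
df = df.dropna().reset_index().drop('index', axis=1)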
print(df[['answers']].values[0][0])
1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.
2. The 2020 Summer Olympics took place from 23 July to 8 August 2021.
3. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
4. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
5. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
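These are the answers to the questions above, generated from the context of the same summary section.
We can see that answers 3-5 contain the correct answer, but instead of answering the question directly, the answer is a verbatim extraction. Despite these occasional lower-quality answers, we will show that the model can learn the task reasonably well given a large number of examples.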
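Because questions and answers are stored as two numbered lists, any later use needs them matched up again. The sketch below assumes the answers keep the same numbering and order as the questions; `pair_questions_and_answers` is a hypothetical helper, not a function from this notebook.

```python
def pair_questions_and_answers(questions, answers):
    """Split two numbered lists on newlines and pair entries by position."""
    def strip_number(line):
        # Drop a leading "N." prefix if present, otherwise keep the line as is.
        number, sep, rest = line.partition(".")
        if sep and number.strip().isdigit():
            return rest.strip()
        return line.strip()

    qs = [strip_number(q) for q in questions.split("\n") if q.strip()]
    ans = [strip_number(a) for a in answers.split("\n") if a.strip()]
    # zip stops at the shorter list, so unmatched trailing entries are dropped.
    return list(zip(qs, ans))

sample_questions = (
    "1. What is the 2020 Summer Olympics?\n"
    "2. When did the 2020 Summer Olympics take place?"
)
sample_answers = (
    "1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.\n"
    "2. The 2020 Summer Olympics took place from 23 July to 8 August 2021."
)
pairs = pair_questions_and_answers(sample_questions, sample_answers)
for question, answer in pairs:
    print(question, "->", answer)
```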
2.4 Saving the Olympics Q&A dataset based on Wikipedia sections
We save the file for use in the next notebook.
df.to_csv('olympics-data/olympics_qa.csv', index=False)
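2.5 Search file (DEPRECATED)
We create a search file (API reference), which can be used to retrieve the relevant context when a question is asked.
DEPRECATED: The /search endpoint is deprecated in favour of using embeddings. Embeddings are cheaper, faster and can support a better search experience. See the Question Answering Guide for a search implementation using embeddings.
df = df[df.tokens<2000]
df[['context', 'tokens']].rename(columns={'context':'text','tokens':'metadata'}).to_json('olympics-data/olympics_search.jsonl', orient='records', lines=True)
search_file = client.files.create(
    file=open("olympics-data/olympics_search.jsonl", "rb"),
    purpose='search'
)
# The v1 client returns an object, so the id is read as an attribute
olympics_search_fileid = search_file.id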
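Since the /search endpoint is deprecated in favour of embeddings, retrieval can instead be sketched as ranking sections by cosine similarity between a query vector and the section vectors. This is not the notebook's method: the sketch assumes the embeddings have already been computed by some embedding model, uses tiny hand-written vectors as stand-ins, and the helper names `cosine_similarity` and `rank_sections` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_sections(query_vector, section_vectors):
    """Return section keys sorted from most to least similar to the query."""
    scored = {key: cosine_similarity(query_vector, vec) for key, vec in section_vectors.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Stand-in 3-dimensional vectors; real embeddings are much higher-dimensional.
section_vectors = {
    "Women's 4 x 100 metres relay / Summary": [0.9, 0.1, 0.0],
    "Men's 4 x 100 metres relay / Qualification": [0.6, 0.6, 0.1],
    "Host city selection": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]
ranking = rank_sections(query_vector, section_vectors)
print(ranking)
```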
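2.6 Answer questions based on the context provided
We will use a simple implementation of the answers endpoint. It uses the /search endpoint to search an indexed file for relevant sections, includes those sections in the context, and follows them with a question-answering prompt for the specified model.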
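The context-assembly step can be sketched offline. The version below assumes the search results arrive already ranked as (text, token_count) pairs and reuses the `\n\n###\n\n` separator and its 4-token allowance mentioned in check_context in section 2.7. The name `build_context` is hypothetical; the actual create_context in answers_with_ft may differ.

```python
def build_context(ranked_sections, max_len=400, separator="\n\n###\n\n", separator_tokens=4):
    """Concatenate ranked (text, token_count) sections until the token budget is exceeded."""
    chosen = []
    cur_len = 0
    for text, tokens in ranked_sections:
        # Each section costs its own tokens plus the separator.
        cur_len += tokens + separator_tokens
        if cur_len > max_len:
            break
        chosen.append(text)
    return separator.join(chosen)

# Illustrative ranked results: section text with its token count.
ranked = [("Section A", 58), ("Section B", 300), ("Section C", 100)]
context = build_context(ranked, max_len=400)
print(context)
```

With a budget of 400 tokens the first two sections fit (62 + 304 = 366), while the third would push the total to 470 and is left out.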
from answers_with_ft import create_context, answer_question
print(create_context("Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", olympics_search_fileid, max_len=400))
Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay
Summary
The women's 4 × 100 metres relay event at the 2020 Summer Olympics took place on 5 and 6 August 2021 at the Japan National Stadium. There were 16 competing relay teams, with each team having 5 members from which 4 were selected in each round.
###
Athletics at the 2020 Summer Olympics – Men's 4 × 100 metres relay
Qualification
National Olympic Committees (NOCs) could qualify one relay team in one of three following ways:
The top 8 NOCs at the 2019 World Athletics Championships qualified a relay team.
The top 8 NOCs at the 2021 World Athletics Relays qualified a relay team.
Where an NOC placed in the top 8 at both the 2019 World Championships and the 2021 World Relays, the quota place was allocated to the world top list as of 29 June 2021. In this case, 4 teams did so, so there are 4 places available through the world rankings.A total of five athletes may be entered for a relay team. Should a NOC have also entered individual athletes in the corresponding individual event (100 m), the entered individual athletes must be included in the total of five (5) athletes entered for the relay event. In addition of five, NOCs can nominate a maximum of one alternate athlete for each team.
The qualifying period was originally from 1 May 2019 to 29 June 2020. Due to the COVID-19 pandemic, the period was suspended from 6 April 2020 to 30 November 2020, with the end date extended to 29 June 2021. The qualifying time standards could be obtained in various meets during the given period that have the approval of the IAAF. Both indoor and outdoor meets are eligible. The most recent Area Championships may be counted in the ranking, even if not during the qualifying period.
answer_question(olympics_search_fileid, "davinci-instruct-beta-v3",
"Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?")
' Japan National Stadium'
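After we fine-tune the model for Q&A, we will be able to use it instead of davinci-instruct-beta-v3 to obtain better answers when the question can't be answered from the context. One downside of davinci-instruct-beta-v3 is that it always attempts to answer the question, regardless of whether the relevant context is present. (Note that the second question asks about a future event, set in 2048.)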
answer_question(olympics_search_fileid, "davinci-instruct-beta-v3",
"Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?", max_len=1000)
' Japan National Stadium'
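We can see that davinci has a tendency to answer the question even when it can't be answered from the context provided. Note that the question asks about the 2048 Summer Olympics, which have not happened yet, and the retrieved content only covers 2020.
2.7 (Optional) Investigation into how likely the search endpoint is to return the relevant context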
def check_context(title, heading, question, max_len=1800, search_model='ada', max_rerank=10):
    """
    Evaluate the performance of the search model in retrieving the correct context

    Parameters
    ----------
    title: str
        The title of the Wikipedia page
    heading: str
        The heading of the Wikipedia section
    question: str
        The question
    max_len: int
        The maximum length of the context
    search_model: str
        The search model to use - `ada` is most cost effective
    max_rerank: int
        The maximum number of reranking documents to use the search model on

    Returns
    -------
    rank: int
        The rank of the correct context
    token_length: int
        The number of tokens needed to obtain the correct context
    """
    try:
        # TODO: openai.Engine(search_model) is deprecated.
        results = openai.Engine(search_model).search(
            search_model=search_model,
            query=question,
            max_rerank=max_rerank,
            file=olympics_search_fileid,
            return_metadata=True
        )
        index = -1
        returns = []
        cur_len = 0
        for result in results['data']:
            cur_len += int(result['metadata']) + 4  # we add 4 tokens for the separator `\n\n###\n\n`
            if cur_len > max_len:
                break
            returns.append(result['text'])
            res = result['text'].split('\n')
            if res[0] == title and res[1] == heading:
                index = len(returns) - 1
                break
        return index, cur_len
    except Exception as e:
        # print(e)
        return []
print(check_context("Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay", "Summary", "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", max_len=10000))
(0, 58)
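We use the questions generated from each context to estimate how often we can retrieve the original context. These questions are noisy, so this is not a perfect estimate.
Our questions and answers are prefixed with numbered bullet points. However, because of the way they were generated, they are missing the first number, so we add "1." to the list of questions (and answers).
We calculate the rank of the section retrieved using ada search, and the number of tokens of context needed to retrieve the relevant section in full.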
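The rank and token bookkeeping can be checked without the deprecated endpoint. The sketch below mirrors the loop in check_context on a hand-written list of (text, token_count) results; the function name `rank_of_target` and the sample results are illustrative assumptions.

```python
def rank_of_target(results, title, heading, max_len=1800, separator_tokens=4):
    """Return (rank, tokens_used) for the matching section, or (-1, tokens_used) if not found."""
    cur_len = 0
    for position, (text, tokens) in enumerate(results):
        # Every visited result costs its tokens plus the separator.
        cur_len += tokens + separator_tokens
        if cur_len > max_len:
            break
        lines = text.split("\n")
        if len(lines) > 1 and lines[0] == title and lines[1] == heading:
            return position, cur_len
    return -1, cur_len

# Illustrative ranked results: context text and its token count.
toy_results = [
    ("Page A\nSummary\n\ntext", 100),
    ("Page B\nHistory\n\ntext", 200),
    ("Page C\nSummary\n\ntext", 50),
]
print(rank_of_target(toy_results, "Page B", "History"))
print(rank_of_target(toy_results, "Page Z", "Summary"))
```

A rank of 0 means the correct section came back first, and the token count is what a prompt would have to hold to include it.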
ada_results = df.apply(lambda x: [
    check_context(
        x.title,
        x.heading,
        q[3:],            # remove the number prefix
        max_len=1000000,  # set a large number to get the full context
        search_model='ada',
        max_rerank=200,
    )
    for q in (x.questions).split('\n')  # split the questions
    if len(q) > 10                      # remove the empty questions
], axis=1)
ada_results.head()
0 [(132, 27104), (-1, 22939), (8, 2151), (2, 121...
1 [(4, 1737), (0, 130), (8, 744), (96, 17208), (...
2 [(0, 373), (0, 373), (-1, 40610), (1, 570)]
3 [(0, 302), (0, 302), (5, 968), (8, 1425)]
4 [(0, 167), (0, 167), (2, 1442)]
Name: ada, dtype: object
out = pd.concat([ada_results], axis=1)
out.columns = ['ada']
out.to_csv('olympics-data/search_engine_results.csv')
def expand_lists(out):
    """
    Expand a pandas series containing lists into a series, where each list element becomes a value on its own

    Input is a row per paragraph, which has multiple questions
    Output is a row per question
    """
    cols = [pd.DataFrame(out[name].tolist()).stack().reset_index(level=1, drop=True).rename(name) for name in out.columns]
    return pd.concat(cols, axis=1)
out_expanded = expand_lists(out)
out_expanded['rank'] = out_expanded.ada.apply(lambda x: x[0] if x != [] else -2)
out_expanded['tokens'] = out_expanded.ada.apply(lambda x: x[1] if x != [] else -2)
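The same one-row-per-question expansion can also be expressed with `Series.explode`. This is an alternative sketch on toy data, not the code used to produce the results in this notebook. Unlike the stack-based version, a section whose list is empty would come back as a NaN row and would need to be dropped.

```python
import pandas as pd

# Toy stand-in for `out`: one row per section, each holding a list of (rank, tokens) tuples.
toy_out = pd.DataFrame({"ada": [[(0, 373), (1, 570)], [(2, 1442)]]})

# explode() yields one row per list element and repeats the original index.
per_question = toy_out["ada"].explode()
ranks = per_question.apply(lambda x: x[0])
tokens = per_question.apply(lambda x: x[1])
print(ranks.tolist(), tokens.tolist())
```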
within_2k = (out_expanded.tokens < 2000).mean()
print(f"{within_2k*100:.1f}% of relevant paragraphs are retrieved within the first 2k tokens")
74.3% of relevant paragraphs are retrieved within the first 2k tokens
On this dataset, the relevant context is retrieved within the first 2,000 tokens 74% of the time.
outside_200 = (out_expanded['rank'] == -1).mean()
print(f"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results")
7.4% of relevant paragraphs are not retrieved within the first 200 results
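7.4% of the time, this is because the keyword search part of the search algorithm does not retrieve the relevant context within the first 200 results. The remaining 18.3% of the time, it is because the semantic search does not place the relevant context within the first 2,000 tokens.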
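The three shares are complementary: 100% - 74.3% - 7.4% = 18.3%. A small sketch of the same split on illustrative (rank, tokens) pairs follows; the function name `retrieval_breakdown` and the sample values are assumptions, and it only counts a result as "within the limit" when it was actually retrieved.

```python
def retrieval_breakdown(results, token_limit=2000):
    """Split (rank, tokens) results into three shares that sum to 1."""
    n = len(results)
    # Rank -1 marks contexts missing from the returned results.
    not_retrieved = sum(1 for rank, _ in results if rank == -1) / n
    # Retrieved and reachable within the token limit.
    within_limit = sum(1 for rank, tokens in results if rank >= 0 and tokens < token_limit) / n
    # Retrieved, but too far down the ranking to fit in the limit.
    beyond_limit = 1 - within_limit - not_retrieved
    return within_limit, not_retrieved, beyond_limit

toy_results = [
    (0, 373), (0, 373), (-1, 40610), (1, 570), (96, 17208),
    (8, 744), (4, 1737), (132, 27104), (2, 1442), (0, 130),
]
within, missing, beyond = retrieval_breakdown(toy_results)
print(f"{within:.1%} within limit, {missing:.1%} not retrieved, {beyond:.1%} beyond limit")
```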
import matplotlib.pyplot as plt
# plot a histogram, and add axis descriptions and title
out_expanded[(out_expanded['rank'] >=0)&(out_expanded['rank'] <30)]['rank'].hist(bins=29)
plt.xlabel('rank')
plt.ylabel('count')
plt.title('Histogram of ranks of retrieved paragraphs')
plt.show()
out_expanded[(out_expanded.tokens>=0)&(out_expanded.tokens < 2000)]['tokens'].hist(bins=29)
plt.xlabel('tokens')
plt.ylabel('count')
plt.title('Histogram of the number of minimum tokens needed')
plt.show()
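We can observe that the context is most likely to be returned as one of the first results, and most likely to be returned within the first 200-500 tokens.
# normalized value_counts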
out_expanded['rank'].value_counts(normalize=True).sort_index()[:13]
-2 0.000063
-1 0.074428
0 0.453420
1 0.089515
2 0.047146
3 0.032437
4 0.024139
5 0.019676
6 0.015967
7 0.013452
8 0.011189
9 0.009869
10 0.009178
Name: rank, dtype: float64
Probabilities of the relevant context being returned at each rank. (-2 means a processing error, -1 means the rank is greater than 200.)
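Summing these per-rank probabilities gives the chance that the relevant section appears among the top k results. The sketch below copies the values for ranks 0 to 10 from the output above; the helper `recall_at` is a hypothetical name.

```python
# Probabilities per rank copied from the value_counts output above (ranks 0 to 10).
rank_probabilities = {
    0: 0.453420, 1: 0.089515, 2: 0.047146, 3: 0.032437, 4: 0.024139,
    5: 0.019676, 6: 0.015967, 7: 0.013452, 8: 0.011189, 9: 0.009869, 10: 0.009178,
}

def recall_at(k, probabilities=rank_probabilities):
    """Probability that the relevant section is among the top k results (ranks 0 to k-1)."""
    return sum(p for rank, p in probabilities.items() if 0 <= rank < k)

for k in (1, 3, 5, 10):
    print(f"top {k}: {recall_at(k):.3f}")
```

Only ranks 0 to 10 are listed in the output, so values of k above 11 cannot be computed from this table alone.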