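Question answering using a search API and re-ranking
Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don't despair: GPTs can do much of this work for us. In this guide we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.
Two ways of retrieving information for GPT are:
- Mimicking human browsing: GPT triggers a search, evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would.
- Retrieval with embeddings: Calculate embeddings for your content and for the user query, then retrieve the content that is most related as measured by cosine similarity. This technique is used heavily by search engines like Google.
Both approaches are promising, but each has shortcomings: the first can be slow because of its iterative nature, and the second requires embedding your entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.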
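To make the embedding-based option concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors are made-up toy values standing in for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of any library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

# Toy "embeddings" (assumed values, for illustration only)
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]

ranked = top_k(query, docs, k=2)
print(ranked)  # document 1 ranks first, then document 0
```

In a real system the vectors would come from an embedding model and would usually live in a vector database, which is the maintenance cost mentioned above.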
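By combining these approaches and drawing inspiration from re-ranking methods, we identify an approach that sits in the middle. It can be implemented on top of any existing search system, such as the Slack search API or an internal ElasticSearch instance with private data. Here's how it works:
Step 1: Search
- The user asks a question.
- GPT generates a list of potential queries.
- The search queries are executed in parallel.
Step 2: Re-rank
- Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user's question.
- Results are ranked and filtered based on this similarity metric.
Step 3: Answer
- Given the top search results, the model generates an answer to the user's question, including references and links.
This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint without requiring the upkeep of a vector database. Let's dive in! We will use the News API as an example domain to search over.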
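The three steps can be sketched as a single function before filling in the details. This is only an outline under stated assumptions: `answer_question` and its callable parameters (`generate_queries`, `search`, `score`, `write_answer`) are hypothetical names, the stubs in the demo replace the real model and search calls, and `ThreadPoolExecutor` is one possible way to run the queries in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def answer_question(
    question: str,
    generate_queries: Callable[[str], list[str]],
    search: Callable[[str], list[dict]],
    score: Callable[[dict], float],
    write_answer: Callable[[str, list[dict]], str],
    top_k: int = 5,
) -> str:
    # Step 1: generate queries and run the searches concurrently
    queries = generate_queries(question)
    with ThreadPoolExecutor(max_workers=8) as pool:
        result_lists = list(pool.map(search, queries))
    # Deduplicate results by URL, keeping first-seen order
    unique = {r["url"]: r for results in result_lists for r in results}
    # Step 2: re-rank by similarity score and keep the best results
    ranked = sorted(unique.values(), key=score, reverse=True)[:top_k]
    # Step 3: generate the answer from the top results
    return write_answer(question, ranked)

# Demo with stand-in stubs (no network or model calls)
corpus = {
    "nba": [{"url": "a", "title": "Nuggets win title"}, {"url": "b", "title": "Beer ad"}],
    "finals": [{"url": "a", "title": "Nuggets win title"}, {"url": "c", "title": "Finals recap"}],
}
demo = answer_question(
    "Who won?",
    generate_queries=lambda q: ["nba", "finals"],
    search=lambda q: corpus[q],
    score=lambda r: 1.0 if "win" in r["title"] else 0.0,
    write_answer=lambda q, results: ", ".join(r["url"] for r in results),
    top_k=2,
)
print(demo)
```

Passing the steps in as callables keeps the outline independent of any particular search backend or model, which matches the claim that the approach works on top of an existing search endpoint.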
Setup
In addition to your OPENAI_API_KEY, you will need a NEWS_API_KEY in your environment. You can get an API key from News API (newsapi.org).
%%capture
%env NEWS_API_KEY = YOUR_NEWS_API_KEY
# Dependencies
from datetime import date, timedelta  # date handling for fetching recent news
from IPython import display  # for pretty printing
import json  # for parsing the JSON API responses and model outputs
from numpy import dot  # for cosine similarity
from openai import OpenAI
import os  # for loading environment variables
import requests  # for making the API requests
from tqdm.notebook import tqdm  # for printing progress bars

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Load environment variables
news_api_key = os.getenv("NEWS_API_KEY")

GPT_MODEL = "gpt-3.5-turbo"

# Helper functions
def json_gpt(input: str):
    # Ask the model for JSON only, then parse its reply
    completion = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": input},
        ],
        temperature=0.5,
    )
    text = completion.choices[0].message.content
    parsed = json.loads(text)
    return parsed

def embeddings(input: list[str]) -> list[list[float]]:
    # One embedding vector (a list of floats) per input string
    response = client.embeddings.create(model="text-embedding-3-small", input=input)
    return [data.embedding for data in response.data]
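One caveat with `json_gpt`: `json.loads` raises an error if the model wraps its reply in a code fence or adds prose around the JSON. The following is a sketch of a more tolerant parser, assuming the reply contains a single JSON object; `parse_json_loose` is a hypothetical helper and is not used elsewhere in this guide.

```python
import json

def parse_json_loose(text: str):
    """Parse model output as JSON, tolerating code fences or surrounding prose."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

# Simulated model reply wrapped in a Markdown code fence
fence = "`" * 3
sample = f'{fence}json\n{{"queries": ["a", "b"]}}\n{fence}'
parsed_sample = parse_json_loose(sample)
print(parsed_sample)
```

The fallback is deliberately simple and will still fail on replies that contain several separate JSON objects.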
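1. Search
It all starts with a user question.
# User asks a question
USER_QUESTION = "Who won the NBA championship? And who was the MVP? Tell me a bit about the last game."
Now, to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.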
QUERIES_INPUT = f"""
You have access to a search API that returns recent news articles.
Generate an array of search queries that are relevant to this question.
Use a variation of related keywords for the queries, trying to be as general as possible.
Include as many queries as you can think of, including and excluding terms.
For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].
Be creative. The more queries you include, the more likely you are to find relevant results.
User question: {USER_QUESTION}
Format: {{"queries": ["query_1", "query_2", "query_3"]}}
"""
queries = json_gpt(QUERIES_INPUT)["queries"]
# Also include the original question, for completeness
queries.append(USER_QUESTION)
queries
['NBA championship winner',
'MVP of NBA championship',
'Last game of NBA championship',
'NBA finals winner',
'Most valuable player of NBA championship',
'Finals game of NBA',
'Who won the NBA finals',
'NBA championship game summary',
'NBA finals MVP',
'Champion of NBA playoffs',
'NBA finals last game highlights',
'NBA championship series result',
'NBA finals game score',
'NBA finals game recap',
'NBA champion team and player',
'NBA finals statistics',
'NBA championship final score',
'NBA finals best player',
'NBA playoffs champion and MVP',
'NBA finals game analysis',
'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']
The queries look good, so let's run the searches.
def search_news(
    query: str,
    news_api_key: str = news_api_key,
    num_articles: int = 50,
    from_datetime: str = "2023-06-01",  # the 2023 NBA finals were played in June 2023
    to_datetime: str = "2023-06-30",
) -> dict:
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )
    return response.json()

articles = []
for query in tqdm(queries):
    result = search_news(query)
    if result["status"] == "ok":
        articles = articles + result["articles"]
    else:
        raise Exception(result["message"])

# Remove duplicates (articles that share a URL)
articles = list({article["url"]: article for article in articles}.values())

print("Total number of articles:", len(articles))
print("Top 5 articles of query 1:", "\n")
for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()
Total number of articles: 554
Top 5 articles of query 1:
Title: Nascar takes on Le Mans as LeBron James gets centenary race under way
Description: <ul><li>Nascar has presence at iconic race for first time since 1976</li><li>NBA superstar LeBron James waves flag as honorary starter</li></ul>The crowd chanted “U-S-A! U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…
Content: The crowd chanted U-S-A! U-S-A! as Nascar driver lineup for the 24 Hours of Le Mans passed through t...
Title: NBA finals predictions: Nuggets or Heat? Our writers share their picks
Description: Denver or Miami? Our contributors pick the winner, key players and dark horses before the NBA’s grand finale tips offA lot has been made of the importance of a balanced roster with continuity, but, somehow, still not enough. The Nuggets are the prime example …
Content: The Nuggets are here because
A lot has been made of the importance of a balanced roster with conti...
Title: Unboxing: Michelob ULTRA and Artist Futura Enshrine the NBA Championship In Custom Hand-Painted Bottles
Description: As the 2022-2023 NBA Championship nears the end, Michelob ULTRA brings joy to sports fans who will gather to watch the showdown between the Denver Nuggets and Miami Heat. The beermaker teamed up with artist Futura to remix its newly-designed 2023 Champ Bottle…
Content: As the 2022-2023 NBA Championship nears the end, Michelob ULTRA brings joy to sports fans who will g...
Title: Futura and Michelob ULTRA Toast to the NBA Finals With Abstract Artwork Crafted From the Brand’s 2023 Limited-Edition Championship Bottles
Description: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beermaker is back with its celebratory NBA Champ Bottles. This year, the self-proclaimed MVP of joy is dropping a limited-edition bottle made in collaboration with a…
Content: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beerma...
Title: Signed and Delivered, Futura and Michelob ULTRA Will Gift Hand-Painted Bottles to This Year’s NBA Championship Team
Description: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basketball lovers and sports fans around the globe as the NBA 2022-2023 season comes to a nail-biting close. In collaboration with artist Futura, Michelob ULTRA will…
Content: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basket...
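As we can see, search queries often return a large number of results, many of which are not relevant to the original question asked by the user. To improve the quality of the final answer, we use embeddings to re-rank and filter the results.
2. Re-rank
Drawing inspiration from HyDE (Gao et al.), we first generate a hypothetical ideal answer to compare our results against and re-rank them. This helps prioritize results that look like good answers, rather than results that are merely similar to our question. Here's the prompt we use to generate the hypothetical answer.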
HA_INPUT = f"""
Generate a hypothetical answer to the user's question. This answer will be used to rank search results.
Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders
like NAME did something, or NAME said something at PLACE.
User question: {USER_QUESTION}
Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""
hypothetical_answer = json_gpt(HA_INPUT)["hypotheticalAnswer"]
hypothetical_answer
'The NBA championship was won by TEAM NAME. The MVP was awarded to PLAYER NAME. The last game was held at STADIUM NAME, where both teams played with great energy and enthusiasm. It was a close game, but in the end, TEAM NAME emerged victorious.'
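Now, let's generate embeddings for the search results and the hypothetical answer. We then calculate the cosine similarity between these embeddings, which gives us a semantic similarity metric. Note that because OpenAI embeddings are returned normalized by our API, we can simply calculate the dot product instead of doing a full cosine similarity calculation.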
# embeddings() expects a list of strings, so wrap the single answer in a list
hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]
article_embeddings = embeddings(
    [
        f"{article['title']} {article['description']} {article['content'][0:100]}"
        for article in articles
    ]
)

# Calculate cosine similarity (a dot product, since the embeddings are normalized)
cosine_similarities = []
for article_embedding in article_embeddings:
    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))

cosine_similarities[0:10]
[0.7854456526852069,
0.8086023500072106,
0.8002998147018501,
0.7961229569526956,
0.798354506673743,
0.758216458795653,
0.7753754083127359,
0.7494958338411927,
0.804733946801739,
0.8405965885235218]
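The dot-product shortcut used for these scores follows directly from the definition of cosine similarity when both vectors have unit length. Here a and b stand for any two embedding vectors:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \mathbf{a} \cdot \mathbf{b}
  \quad \text{when } \lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1 .
```

If the vectors were not normalized, the denominator would have to be computed explicitly.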
Finally, we use these similarity scores to sort and filter the results.
scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print the top 5 articles
print("Top 5 articles:", "\n")
for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()
Top 5 articles:
Title: NBA Finals: Denver Nuggets beat Miami Hea, lift thier first-ever NBA title
Description: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the NBA Final held on Tuesday at the Ball Arena in Denver
Content: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the ...
Score: 0.8445817523602124
Title: Photos: Denver Nuggets celebrate their first NBA title
Description: The Nuggets capped off an impressive postseason by beating the Miami Heat in the NBA Finals.
Content: Thousands of supporters watched along the streets of Denver, Colorado as the US National Basketball ...
Score: 0.842070667753606
Title: Denver Nuggets win first NBA championship title in Game 5 victory over Miami Heat
Description: The Denver Nuggets won their first NBA championship Monday night, downing the Miami Heat 94-89 at Ball Arena in Denver to take Game 5 of the NBA Finals.
Content: The Denver Nuggets won their first NBA championship Monday night, downing the Miami Heat 94-89 at Ba...
Score: 0.8409346078172385
Title: Denver Nuggets Capture Their First NBA Championship Behind Unbreakable Chemistry
Description: After 47 years of waiting, the Denver Nuggets are NBA champions. Led by Nikola Jokic and Jamal Murray, they reached the mountain top by staying true to themselves.
Content: DENVER, CO - JUNE 12: Jamal Murray (27) of the Denver Nuggets celebrates as he leaves the court ... ...
Score: 0.8405965885235218
Title: NBA Finals: Nikola Jokic, Denver Nuggets survive Miami Heat to secure franchise's first NBA championship
Description: In a rock-fight of a Game 5, the Denver Nuggets reached the NBA mountaintop from the foothills of the Rockies, winning their first-ever championship and setting Nikola Jokic's legacy as an all-timer in stone.
Content: DENVER, COLORADO - JUNE 12: Jamal Murray #27 of the Denver Nuggets reacts during the fourth quarter ...
Score: 0.8389716330890262
Awesome! These results look a lot more relevant to our original query. Now, let's use the top 5 results to generate a final answer.
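Taking the top 5 is a fixed-size cutoff. If fewer than five results are actually relevant, a minimum-score threshold can be combined with it. The sketch below assumes (item, score) pairs like those produced by the sorting step; `filter_and_rank` is a hypothetical helper, and the 0.8 threshold and toy scores are arbitrary illustrative values, not tuned recommendations.

```python
def filter_and_rank(scored, min_score: float = 0.8, top_k: int = 5):
    """Keep pairs scoring at least min_score, best first, at most top_k of them."""
    kept = [(item, s) for item, s in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Toy (item, score) pairs standing in for articles and their similarity scores
scored = [
    ("article_a", 0.84),
    ("article_b", 0.75),
    ("article_c", 0.81),
    ("article_d", 0.79),
]
selected = filter_and_rank(scored, min_score=0.8, top_k=5)
print(selected)
```

A sensible threshold depends on the embedding model and the data, so it would need to be chosen by inspecting real scores.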
3. Answer
formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:5]
]

ANSWER_INPUT = f"""
Generate an answer to the user's question based on the given search results.
TOP_RESULTS: {formatted_top_results}
USER_QUESTION: {USER_QUESTION}
Include as much information as possible in the answer. Reference the relevant search result urls as markdown links.
"""

completion = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[{"role": "user", "content": ANSWER_INPUT}],
    temperature=0.5,
    stream=True,
)

text = ""
for chunk in completion:
    # delta.content can be None (for example on the final chunk), so fall back to ""
    text += chunk.choices[0].delta.content or ""
    display.clear_output(wait=True)
    display.display(display.Markdown(text))
The Denver Nuggets won their first-ever NBA championship by defeating the Miami Heat 94-89 in Game 5 of the NBA Finals held on Tuesday at the Ball Arena in Denver, according to this Business Standard article. Nikola Jokic, the Nuggets’ center, was named the NBA Finals MVP. In a rock-fight of a Game 5, the Nuggets reached the NBA mountaintop, securing their franchise’s first NBA championship and setting Nikola Jokic’s legacy as an all-timer in stone, according to this Yahoo Sports article. For more information and photos of the Nuggets’ celebration, check out this Al Jazeera article and this CNN article.