
How to use functions with a knowledge base


This notebook builds on the concepts in the argument generation notebook by creating an agent with access to a knowledge base and two functions that it can call based on the user's requirements.

We'll create an agent that uses data from arXiv to answer questions about academic subjects. It has two functions at its disposal:

  • get_articles: a function that fetches arXiv articles on a subject and provides the user with summaries and links.
  • read_article_and_summarize: a function that takes one of the previously searched articles, reads it in its entirety, and summarizes the core argument, evidence, and conclusions.

This will get you comfortable with a multi-function workflow that can choose between multiple services, and in which some of the data from the first function is persisted for the second function to use.

Walkthrough

This cookbook takes you through the following workflow:

  • Search utilities: Creating the two functions that access arXiv for answers.
  • Configure agent: Building up the agent's behaviour so it assesses whether a function is needed and, if so, calls that function and presents the results back to the agent.
  • arXiv conversation: Putting all of this together in a live conversation.
!pip install scipy --quiet
!pip install tenacity --quiet
!pip install tiktoken==0.3.3 --quiet
!pip install termcolor --quiet
!pip install openai --quiet
!pip install arxiv --quiet
!pip install pandas --quiet
!pip install PyPDF2 --quiet
!pip install tqdm --quiet

import arxiv
import ast
import concurrent.futures  # imported explicitly so concurrent.futures resolves below
import json
import os
import pandas as pd
import tiktoken
from csv import writer
from IPython.display import display, Markdown, Latex
from openai import OpenAI
from PyPDF2 import PdfReader
from scipy import spatial
from tenacity import retry, wait_random_exponential, stop_after_attempt
from tqdm import tqdm
from termcolor import colored

GPT_MODEL = "gpt-3.5-turbo-0613"
EMBEDDING_MODEL = "text-embedding-ada-002"
client = OpenAI()

Search utilities

We'll first set up some utilities that will underpin our two functions.

Downloaded papers will be stored in a directory (here we use ./data/papers). We create a file, arxiv_library.csv, to store the embeddings and details of downloaded papers so they can be retrieved by summarize_text.
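Each appended row in arxiv_library.csv holds three fields: the paper's title, the local filepath of the downloaded PDF, and an embedding of the title. Here is a minimal sketch of that layout with hypothetical values (the real rows are written by get_articles below):

import io

# One library row: [title, local PDF filepath, embedding of the title]
example_row = [
    "Example Paper Title",
    "./data/papers/example_paper.pdf",
    [0.0123, -0.0456, 0.0789],  # truncated for illustration; ada-002 embeddings have 1536 dimensions
]
buffer = io.StringIO()
writer(buffer).writerow(example_row)
print(buffer.getvalue())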

directory = './data/papers'

# Check if the directory already exists
if not os.path.exists(directory):
    # If the directory doesn't exist, create it and any necessary intermediate directories
    os.makedirs(directory)
    print(f"Directory '{directory}' created successfully.")
else:
    # If the directory already exists, print a message indicating it
    print(f"Directory '{directory}' already exists.")

Directory './data/papers' already exists.
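As an aside, the check-then-create pattern above can be collapsed into a single call, since exist_ok makes makedirs a no-op when the directory is already there:

# Equivalent one-liner: create the directory tree only if it's missing
os.makedirs(directory, exist_ok=True)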
# Set a directory to store downloaded papers
data_dir = os.path.join(os.curdir, "data", "papers")
paper_dir_filepath = "./data/arxiv_library.csv"

# Generate a blank dataframe where we can store downloaded files
df = pd.DataFrame(list())
df.to_csv(paper_dir_filepath)

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def embedding_request(text):
    response = client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return response


@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def get_articles(query, library=paper_dir_filepath, top_k=5):
    """This function gets the top_k articles based on a user's query, sorted by relevance.
    It also downloads the files and stores them in arxiv_library.csv to be retrieved by read_article_and_summarize.
    """
    # Use a dedicated name so we don't shadow the global OpenAI client
    arxiv_client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=top_k,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    result_list = []
    for result in arxiv_client.results(search):
        result_dict = {}
        result_dict.update({"title": result.title})
        result_dict.update({"summary": result.summary})

        # Take the first URL provided
        result_dict.update({"article_url": [x.href for x in result.links][0]})
        result_dict.update({"pdf_url": [x.href for x in result.links][1]})
        result_list.append(result_dict)

        # Store references in the library file
        response = embedding_request(text=result.title)
        file_reference = [
            result.title,
            result.download_pdf(data_dir),
            response.data[0].embedding,
        ]

        # Write to file
        with open(library, "a") as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(file_reference)
    return result_list


# Test that the search works
result_output = get_articles("ppo reinforcement learning")
result_output[0]


{'title': 'Quantum types: going beyond qubits and quantum gates',
'summary': 'Quantum computing is a growing field with significant potential applications.\nLearning how to code quantum programs means understanding how qubits work and\nlearning to use quantum gates. This is analogous to creating classical\nalgorithms using logic gates and bits. Even after learning all concepts, it is\ndifficult to create new algorithms, which hinders the acceptance of quantum\nprogramming by most developers. This article outlines the need for higher-level\nabstractions and proposes some of them in a developer-friendly programming\nlanguage called Rhyme. The new quantum types are extensions of classical types,\nincluding bits, integers, floats, characters, arrays, and strings. We show how\nto use such types with code snippets.',
'article_url': 'http://arxiv.org/abs/2401.15073v1',
'pdf_url': 'http://arxiv.org/pdf/2401.15073v1'}
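Each successful call also appends one row per paper to the library file. As a quick sanity check, you can read the file back the same way summarize_text will later (a sketch reusing the document's own column layout):

library_df = pd.read_csv(paper_dir_filepath).reset_index()
library_df.columns = ["title", "filepath", "embedding"]
library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
print(library_df[["title", "filepath"]].head())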
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> list[str]:
    """Returns a list of filepaths, sorted from most related to least related to the query."""
    query_embedding_response = embedding_request(query)
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["filepath"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n]


def read_pdf(filepath):
    """Takes a filepath to a PDF and returns a string of the PDF's contents."""
    # Create a PDF reader object
    reader = PdfReader(filepath)
    pdf_text = ""
    page_number = 0
    for page in reader.pages:
        page_number += 1
        pdf_text += page.extract_text() + f"\nPage Number: {page_number}"
    return pdf_text


# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Returns successive n-sized chunks from the provided text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j
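
# Quick sketch (hypothetical text): chunk a short string into ~10-token pieces,
# then decode each chunk back to text to see where the sentence-aware splits land:
#   enc = tiktoken.get_encoding("cl100k_base")
#   for chunk in create_chunks("One sentence. Another sentence.", 10, enc):
#       print(enc.decode(chunk))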


def extract_chunk(content, template_prompt):
    """This function applies a prompt to some input content. In this case it returns a summarized chunk of text."""
    prompt = template_prompt + content
    response = client.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return response.choices[0].message.content


def summarize_text(query):
    """This function does the following:
    - Reads in the arxiv_library.csv file, including the embeddings
    - Finds the closest file to the user's query
    - Scrapes the text out of the file and chunks it
    - Summarizes each chunk in parallel
    - Does one final summary and returns this to the user"""

    # A prompt to dictate how the recursive summarizations should approach the input paper
    summary_prompt = """Summarize this text from an academic paper. Extract any key points with reasoning.\n\nContent:"""

    # If the library is empty (no searches have been performed yet), we perform one and download the results
    library_df = pd.read_csv(paper_dir_filepath).reset_index()
    if len(library_df) == 0:
        print("No papers searched yet, downloading first.")
        get_articles(query)
        print("Papers downloaded, continuing")
        library_df = pd.read_csv(paper_dir_filepath).reset_index()
    library_df.columns = ["title", "filepath", "embedding"]
    library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
    strings = strings_ranked_by_relatedness(query, library_df, top_n=1)
    print("Chunking text from paper")
    pdf_text = read_pdf(strings[0])

    # Initialise tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    results = ""

    # Chunk up the document into 1500-token chunks
    chunks = create_chunks(pdf_text, 1500, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
    print("Summarizing each chunk of text")

    # Parallel process the summaries
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(text_chunks)
    ) as executor:
        futures = [
            executor.submit(extract_chunk, chunk, summary_prompt)
            for chunk in text_chunks
        ]
        with tqdm(total=len(text_chunks)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)
        for future in futures:
            data = future.result()
            results += data

    # Final summary
    print("Summarizing into overall summary")
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""Write a summary collated from this collection of key points extracted from an academic paper.
                        The summary should highlight the core argument, conclusions and evidence, and answer the user's query.
                        User query: {query}
                        The summary should be structured in bulleted lists following the headings Core Argument, Evidence, and Conclusions.
                        Key points:\n{results}\nSummary:\n""",
            }
        ],
        temperature=0,
    )
    return response


# Test the summarize_text function works
chat_test_response = summarize_text("PPO reinforcement learning sequence generation")


Chunking text from paper
Summarizing each chunk of text
100%|██████████| 6/6 [00:06<00:00,  1.08s/it]
Summarizing into overall summary
print(chat_test_response.choices[0].message.content)


Core Argument:
- The academic paper explores the connection between the transverse field Ising (TFI) model and the ϕ4 model, highlighting the analogy between topological solitary waves in the ϕ4 model and the effect of the transverse field on spin flips in the TFI model.
- The study reveals regimes of memory/loss of memory and coherence/decoherence in the classical ϕ4 model subjected to periodic perturbations, which are essential in annealing phenomena.
- The exploration of the analogy between lower-dimensional linear quantum systems and higher-dimensional classical nonlinear systems can lead to a deeper understanding of information processing in these systems.

Evidence:
- The authors analyze the dynamics and relaxation of weakly coupled ϕ4 chains through numerical simulations, observing kink and breather excitations and investigating the structural phase transition associated with the double well potential.
- The critical temperature (Tc) approaches zero as the inter-chain coupling strength (C⊥) approaches zero, but there is a finite Tc for C⊥>0.
- The spectral function shows peaks corresponding to particle motion across the double-well potential at higher temperatures and oscillations in a single well at lower temperatures.
- The soft-mode frequency (ωs) decreases as temperature approaches Ts, the dynamical crossover temperature.
- The relaxation process of the average displacement (QD) is controlled by spatially extended vibrations and large kink densities.
- The mean domain size (⟨DS⟩) exhibits an algebraic decay for finite C⊥>0.
- The probability of larger domain sizes is higher before a kick compared to after a kick for C⊥>0.

Conclusions:
- The authors suggest further exploration of the crossover between decoherence and finite coherence in periodic-kick strength space.
- They propose extending the study to different kick profiles, introducing kink defects, and studying weakly-coupled chains in higher dimensions.
- Recognizing similarities between classical nonlinear equations and quantum linear ones in information processing is important.
- Future research directions include investigating the dynamics of quantum annealing, measurement and memory in the periodically driven complex Ginzburg-Landau equation, and the behavior of solitons and domain walls in various systems.

Configure agent

In this step we'll create our agent, including a Conversation class to support multiple turns with the API, and some Python functions to enable interaction between the ChatCompletion API and our knowledge base functions.
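Before the code, here is a sketch of the response shape the dispatcher below branches on when the model opts to call a function; the values are illustrative only, not a real API response:

# When a tool call is requested, the relevant fields look roughly like:
#   response.choices[0].finish_reason                    -> "function_call"
#   response.choices[0].message.function_call.name       -> "get_articles"
#   response.choices[0].message.function_call.arguments  -> '{"query": "PPO reinforcement learning"}'
# The arguments field arrives as a JSON string, so it is parsed before use:
parsed = json.loads('{"query": "PPO reinforcement learning"}')
print(parsed["query"])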

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def chat_completion_request(messages, functions=None, model=GPT_MODEL):
    try:
        # Only pass `functions` when we actually have definitions to offer
        kwargs = {"model": model, "messages": messages}
        if functions is not None:
            kwargs["functions"] = functions
        response = client.chat.completions.create(**kwargs)
        return response
    except Exception as e:
        print("Unable to generate ChatCompletion response")
        print(f"Exception: {e}")
        return e


class Conversation:
    def __init__(self):
        self.conversation_history = []

    def add_message(self, role, content):
        message = {"role": role, "content": content}
        self.conversation_history.append(message)

    def display_conversation(self, detailed=False):
        role_to_color = {
            "system": "red",
            "user": "green",
            "assistant": "blue",
            "function": "magenta",
        }
        for message in self.conversation_history:
            print(
                colored(
                    f"{message['role']}: {message['content']}\n\n",
                    role_to_color[message["role"]],
                )
            )
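
# Quick usage sketch (hypothetical messages) showing how the class accumulates
# and colour-codes a history:
demo = Conversation()
demo.add_message("system", "You are a helpful assistant.")
demo.add_message("user", "Hello!")
demo.display_conversation()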

# Initiate our get_articles and read_article_and_summarize functions
arxiv_functions = [
    {
        "name": "get_articles",
        "description": """Use this function to get academic papers from arXiv to answer user questions.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": """
                            User query in JSON. Responses should be summarized and should include the article URL reference
                            """,
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_article_and_summarize",
        "description": """Use this function to read whole papers and provide a summary for users.
        You should NEVER call this function before get_articles has been called in the conversation.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": """
                            Description of the article in plain text based on the user's query
                            """,
                }
            },
            "required": ["query"],
        },
    }
]


def chat_completion_with_function_execution(messages, functions=None):
    """This function makes a ChatCompletion API call with the option of adding functions."""
    response = chat_completion_request(messages, functions)
    full_message = response.choices[0]
    if full_message.finish_reason == "function_call":
        print("Function generation requested, calling function")
        return call_arxiv_function(messages, full_message)
    else:
        print("Function not required, responding to user")
        return response


def call_arxiv_function(messages, full_message):
    """Function-calling function that executes function calls when the model believes it is necessary.
    Currently extended by adding clauses to this if statement."""

    if full_message.message.function_call.name == "get_articles":
        try:
            parsed_output = json.loads(
                full_message.message.function_call.arguments
            )
            print("Getting search results")
            results = get_articles(parsed_output["query"])
        except Exception as e:
            print(parsed_output)
            print("Function execution failed")
            print(f"Error message: {e}")
            results = f"Function execution failed: {e}"  # fall back so the conversation can continue
        messages.append(
            {
                "role": "function",
                "name": full_message.message.function_call.name,
                "content": str(results),
            }
        )
        try:
            print("Got search results, summarizing content")
            response = chat_completion_request(messages)
            return response
        except Exception as e:
            print(type(e))
            raise Exception("Function chat request failed")

    elif (
        full_message.message.function_call.name == "read_article_and_summarize"
    ):
        parsed_output = json.loads(
            full_message.message.function_call.arguments
        )
        print("Finding and reading paper")
        summary = summarize_text(parsed_output["query"])
        return summary

    else:
        raise Exception("Function does not exist and cannot be called")
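
As the docstring notes, the dispatcher is extended by adding clauses to its if statement. A minimal sketch, assuming a hypothetical get_citations tool (not part of this cookbook), would pair a new entry in arxiv_functions with a clause like this:

def get_citations(query):
    """Hypothetical placeholder tool: would fetch citation data for the query."""
    return [f"Citations related to: {query}"]

# The clause to add inside call_arxiv_function's if/elif chain:
#   elif full_message.message.function_call.name == "get_citations":
#       parsed_output = json.loads(full_message.message.function_call.arguments)
#       return get_citations(parsed_output["query"])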


arXiv conversation

Let's put this all together by testing our functions out in a conversation.

# Start with a system message
paper_system_message = """You are arXivGPT, a helpful assistant that pulls academic papers to answer user questions.
You summarize the papers clearly so the customer can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!"""
paper_conversation = Conversation()
paper_conversation.add_message("system", paper_system_message)


# Add a user message
paper_conversation.add_message("user", "Hi, how does PPO reinforcement learning work?")
chat_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
assistant_message = chat_response.choices[0].message.content
paper_conversation.add_message("assistant", assistant_message)
display(Markdown(assistant_message))


Function generation requested, calling function
Getting search results
Got search results, summarizing content

PPO (Proximal Policy Optimization) is a reinforcement learning algorithm that aims to find the optimal policy for an agent by optimizing the policy parameters in an iterative manner. Here are a few papers that discuss PPO in more detail:

  1. Title: “Proximal Policy Optimization Algorithms” Article URL: arxiv.org/abs/1707.06347v2 Summary: This paper introduces two algorithms, PPO (Proximal Policy Optimization) and TRPO (Trust Region Policy Optimization), that address the issue of sample efficiency and stability in reinforcement learning. PPO uses a surrogate objective function that makes smaller updates to the policy parameters, resulting in more stable and efficient learning.

  2. Title: “Emergence of Locomotion Behaviours in Rich Environments with PPO” Article URL: arxiv.org/abs/1707.02286v3 Summary: This paper explores the use of PPO in training agents to learn locomotion behaviors in complex and dynamic environments. The authors demonstrate the effectiveness of PPO in learning a variety of locomotion skills, such as walking, jumping, and climbing.

  3. Title: “Proximal Policy Optimization for Multi-Agent Systems” Article URL: arxiv.org/abs/2006.14171v2 Summary: This paper extends PPO to the domain of multi-agent systems, where multiple agents interact and learn together. The authors propose a decentralized version of PPO that allows each agent to update its policy independently based on its local observations, resulting in more scalable and efficient learning in multi-agent environments.

These papers provide detailed explanations of the PPO algorithm, its advantages, and its applications in different scenarios. Reading them can give you a deeper understanding of how PPO reinforcement learning works.

# Add another user message to induce our system to use the second tool
paper_conversation.add_message(
    "user",
    "Can you read the PPO sequence generation paper for me and give me a summary",
)
updated_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
display(Markdown(updated_response.choices[0].message.content))


Function generation requested, calling function
Finding and reading paper
Chunking text from paper
Summarizing each chunk of text
100%|██████████| 6/6 [00:07<00:00,  1.19s/it]
Summarizing into overall summary

Core Argument:
- The academic paper explores the connection between the transverse field Ising (TFI) model and the ϕ4 model, highlighting the analogy between the coupling of topological solitary waves in the ϕ4 model and the effect of the transverse field on spin flips in the TFI model.
- The study reveals regimes of memory/loss of memory and coherence/decoherence in the classical ϕ4 model subjected to periodic perturbations, which are essential in annealing phenomena.
- The exploration of the analogy between lower-dimensional linear quantum systems and higher-dimensional classical nonlinear systems can lead to a deeper understanding of information processing in these systems.

Evidence:
- The authors analyze the dynamics and relaxation of weakly coupled ϕ4 chains through numerical simulations, studying the behavior of kink and breather excitations and the structural phase transition associated with the double well potential.
- The critical temperature (Tc) approaches zero as the inter-chain coupling strength (C⊥) approaches zero, but there is a finite Tc for C⊥>0.
- The spectral function shows peaks corresponding to particle motion across the double-well potential at higher temperatures and oscillations in a single well at lower temperatures.
- The soft-mode frequency (ωs) decreases as temperature approaches Ts, the dynamical crossover temperature.
- The relaxation process of the average displacement (QD) is controlled by spatially extended vibrations and large kink densities.
- The mean domain size (⟨DS⟩) exhibits an algebraic decay for finite C⊥>0.
- The probability of larger domain sizes is higher before a kick compared to after a kick for C⊥>0.

Conclusions:
- The study of weakly-coupled classical ϕ4 chains provides insights into quantum annealing architectures and the role of topological excitations in these systems.
- The equilibration of the system is faster for higher kick strengths, and the mean domain size increases with higher final temperatures.
- Further exploration of the crossover between decoherence and finite coherence in periodic-kick strength space is suggested.
- The paper highlights the importance of recognizing similarities between classical nonlinear equations and quantum linear ones in information processing and suggests future research directions in this area.