电子邮件数据提取¶
OpenAI函数可用于从电子邮件中提取数据。这是使用LLamaIndex从非结构化内容中获取结构化数据的另一个示例。
这个示例的主要目标是将原始电子邮件内容转换为易于解释的JSON格式,展示语言模型在数据提取中的实际应用。提取的结构化JSON数据可以在任何下游应用中使用。
我们将使用下面显示的样本电子邮件。这封电子邮件模拟了ARK投资公司向其订阅者发送的典型日常通信。这封样本电子邮件包含有关其交易所交易基金(ETF)下的交易的详细信息。通过使用这个特定示例,我们旨在展示如何有效地从现实世界的电子邮件场景中提取和结构化复杂的金融数据,将其转换为可理解的JSON格式。
In [ ]:
Copied!
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
%pip install llama-index-program-openai
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
%pip install llama-index-program-openai
In [ ]:
Copied!
# LlamaIndex
!pip install llama-index
# 从 .eml 和 .msg 文件中获取文本内容
!pip install "unstructured[msg]"
# LlamaIndex
!pip install llama-index
# 从 .eml 和 .msg 文件中获取文本内容
!pip install "unstructured[msg]"
启用日志记录并设置OpenAI API密钥¶
在这一步中,我们设置日志记录以监控程序的执行,并在需要时进行调试。我们还配置OpenAI API密钥,这是利用OpenAI服务的关键。将"YOUR_KEY_HERE"替换为您实际的OpenAI API密钥。
In [ ]:
Copied!
import logging
import sys, json
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys, json
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
Copied!
import os
import openai
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]
import os
import openai
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"
openai.api_key = os.environ["OPENAI_API_KEY"]
设置预期的JSON输出定义(JSON模式)¶
在这里,我们使用Pydantic库定义了一个名为EmailData
的Python类。该类对我们期望从电子邮件中提取的数据结构进行建模,包括发件人、收件人、电子邮件的日期和时间,以及包含在该ETF下交易的股票列表。
In [ ]:
Copied!
from pydantic import BaseModel, Field
from typing import List
class Instrument(BaseModel):
"""ticker交易详情的数据模型。"""
direction: str = Field(description="ticker交易方向 - 买入、卖出、持有等")
ticker: str = Field(
description="股票代码。1-4个字符的代码。例如:AAPL,TSLS,MSFT,VZ"
)
company_name: str = Field(
description="与股票代码对应的公司名称"
)
shares_traded: float = Field(description="交易的股票数量")
percent_of_etf: float = Field(description="ETF的百分比")
class Etf(BaseModel):
"""ETF交易数据模型。"""
etf_ticker: str = Field(
description="ETF代码。例如:ARKK,FSPTX"
)
trade_date: str = Field(description="交易日期")
stocks: List[Instrument] = Field(
description="在该ETF下交易的工具或股票列表"
)
class EmailData(BaseModel):
"""用于电子邮件提取信息的数据模型。"""
etfs: List[Etf] = Field(
description="电子邮件中描述的ETF列表,其中包括在其下交易的股票列表"
)
trade_notification_date: str = Field(
description="交易通知日期"
)
sender_email_id: str = Field(description="电子邮件发送者的电子邮件地址。")
email_date_time: str = Field(description="电子邮件的日期和时间")
from pydantic import BaseModel, Field
from typing import List
class Instrument(BaseModel):
"""ticker交易详情的数据模型。"""
direction: str = Field(description="ticker交易方向 - 买入、卖出、持有等")
ticker: str = Field(
description="股票代码。1-4个字符的代码。例如:AAPL,TSLS,MSFT,VZ"
)
company_name: str = Field(
description="与股票代码对应的公司名称"
)
shares_traded: float = Field(description="交易的股票数量")
percent_of_etf: float = Field(description="ETF的百分比")
class Etf(BaseModel):
"""ETF交易数据模型。"""
etf_ticker: str = Field(
description="ETF代码。例如:ARKK,FSPTX"
)
trade_date: str = Field(description="交易日期")
stocks: List[Instrument] = Field(
description="在该ETF下交易的工具或股票列表"
)
class EmailData(BaseModel):
"""用于电子邮件提取信息的数据模型。"""
etfs: List[Etf] = Field(
description="电子邮件中描述的ETF列表,其中包括在其下交易的股票列表"
)
trade_notification_date: str = Field(
description="交易通知日期"
)
sender_email_id: str = Field(description="电子邮件发送者的电子邮件地址。")
email_date_time: str = Field(description="电子邮件的日期和时间")
从 .eml / .msg 文件中加载内容¶
在这一步中,我们将使用 llama-hub
中的 UnstructuredReader
来加载 .eml 邮件文件或 .msg Outlook 文件的内容。然后将该文件的内容存储在一个变量中,以便进行进一步处理。
In [ ]:
Copied!
# 获取下载加载器
from llama_index.core import download_loader
# 获取下载加载器
from llama_index.core import download_loader
In [ ]:
Copied!
# 创建一个下载加载器
from llama_index.readers.file import UnstructuredReader
# 初始化UnstructuredReader
loader = UnstructuredReader()
# 对于eml文件
eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml")
email_content = eml_documents[0].text
print("\n\n 邮件内容")
print(email_content)
# 创建一个下载加载器
from llama_index.readers.file import UnstructuredReader
# 初始化UnstructuredReader
loader = UnstructuredReader()
# 对于eml文件
eml_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.eml")
email_content = eml_documents[0].text
print("\n\n 邮件内容")
print(email_content)
In [ ]:
Copied!
# 对于Outlook消息
msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg") # 加载数据
msg_content = msg_documents[0].text # 获取消息内容
print("\n\n Outlook内容")
print(msg_content)
# 对于Outlook消息
msg_documents = loader.load_data("../data/email/ark-trading-jan-12-2024.msg") # 加载数据
msg_content = msg_documents[0].text # 获取消息内容
print("\n\n Outlook内容")
print(msg_content)
使用LLM函数以JSON格式提取内容¶
在最后一步中,我们利用llama_index
包来创建一个提示模板,以从加载的电子邮件中提取见解。我们使用OpenAI
模型的一个实例来解释电子邮件内容,并根据我们预定义的EmailData
模式提取相关信息。然后将输出转换为字典格式,以便于查看和处理。
In [ ]:
Copied!
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
In [ ]:
Copied!
prompt = ChatPromptTemplate(
message_templates=[
ChatMessage(
role="system",
content=(
"You are an expert assitant for extracting insights from email in JSON format. \n"
"You extract data and returns it in JSON format, according to provided JSON schema, from given email message. \n"
"REMEMBER to return extracted data only from provided email message."
),
),
ChatMessage(
role="user",
content=(
"Email Message: \n" "------\n" "{email_msg_content}\n" "------"
),
),
]
)
llm = OpenAI(model="gpt-3.5-turbo-1106")
program = OpenAIPydanticProgram.from_defaults(
output_cls=EmailData,
llm=llm,
prompt=prompt,
verbose=True,
)
prompt = ChatPromptTemplate(
message_templates=[
ChatMessage(
role="system",
content=(
"You are an expert assitant for extracting insights from email in JSON format. \n"
"You extract data and returns it in JSON format, according to provided JSON schema, from given email message. \n"
"REMEMBER to return extracted data only from provided email message."
),
),
ChatMessage(
role="user",
content=(
"Email Message: \n" "------\n" "{email_msg_content}\n" "------"
),
),
]
)
llm = OpenAI(model="gpt-3.5-turbo-1106")
program = OpenAIPydanticProgram.from_defaults(
output_cls=EmailData,
llm=llm,
prompt=prompt,
verbose=True,
)
In [ ]:
Copied!
output = program(email_msg_content=email_content)
print("Output JSON From .eml File: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=email_content)
print("Output JSON From .eml File: ")
print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.016},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}],"trade_date":"1/12/2024"},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}],"trade_date":"1/12/2024"}],"trade_notification_date":"1/12/2024","sender_email_id":"ark@ark-funds.com","email_date_time":"1/12/2024"} Output JSON From .eml File: { "etfs": [ { "etf_ticker": "ARKK", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 93654.0, "percent_of_etf": 0.2453 }, { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 159506.0, "percent_of_etf": 0.0907 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 86268.0, "percent_of_etf": 0.0669 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 289619.0, "percent_of_etf": 0.0391 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 927.0, "percent_of_etf": 0.0001 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 100766.0, "percent_of_etf": 0.0829 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 108523.0, "percent_of_etf": 0.0957 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 302096.0, "percent_of_etf": 0.0958 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 553172.0, "percent_of_etf": 0.1476 } ] }, { "etf_ticker": "ARKW", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 18148.0, "percent_of_etf": 0.2454 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 49.0, "percent_of_etf": 0.0 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 9756.0, "percent_of_etf": 0.016 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 21849.0, "percent_of_etf": 0.0994 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 105944.0, "percent_of_etf": 0.1459 } ] }, { "etf_ticker": "ARKG", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 38042.0, "percent_of_etf": 0.0864 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 21197.0, "percent_of_etf": 0.0656 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 67422.0, "percent_of_etf": 0.0363 }, { "direction": "Buy", "ticker": "RPTX", "company_name": "REPARE THERAPEUTICS INC", "shares_traded": 15410.0, "percent_of_etf": 0.0049 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 32057.0, "percent_of_etf": 0.1052 } ] } ], "trade_notification_date": "1/12/2024", "sender_email_id": "ark@ark-funds.com", "email_date_time": "1/12/2024" }
针对Outlook邮件的处理¶
In [ ]:
Copied!
output = program(email_msg_content=msg_content)
print("Output JSON from .msg file: ")
print(json.dumps(output.dict(), indent=2))
output = program(email_msg_content=msg_content)
print("Output JSON from .msg file: ")
print(json.dumps(output.dict(), indent=2))
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" Function call: EmailData with args: {"etfs":[{"etf_ticker":"ARKK","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":93654,"percent_of_etf":0.2453},{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":159506,"percent_of_etf":0.0907},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":86268,"percent_of_etf":0.0669},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":289619,"percent_of_etf":0.0391},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":927,"percent_of_etf":0.0001},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":100766,"percent_of_etf":0.0829},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":108523,"percent_of_etf":0.0957},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":302096,"percent_of_etf":0.0958},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":553172,"percent_of_etf":0.1476}]},{"etf_ticker":"ARKW","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TSLA","company_name":"TESLA INC","shares_traded":18148,"percent_of_etf":0.2454},{"direction":"Sell","ticker":"HOOD","company_name":"ROBINHOOD MARKETS INC","shares_traded":49,"percent_of_etf":0.0000},{"direction":"Sell","ticker":"PD","company_name":"PAGERDUTY INC","shares_traded":9756,"percent_of_etf":0.0160},{"direction":"Sell","ticker":"TWLO","company_name":"TWILIO INC","shares_traded":21849,"percent_of_etf":0.0994},{"direction":"Sell","ticker":"PATH","company_name":"UIPATH INC","shares_traded":105944,"percent_of_etf":0.1459}]},{"etf_ticker":"ARKG","trade_date":"1/12/2024","stocks":[{"direction":"Buy","ticker":"TXG","company_name":"10X GENOMICS INC","shares_traded":38042,"percent_of_etf":0.0864},{"direction":"Buy","ticker":"CRSP","company_name":"CRISPR THERAPEUTICS AG","shares_traded":21197,"percent_of_etf":0.0656},{"direction":"Buy","ticker":"RXRX","company_name":"RECURSION PHARMACEUTICALS","shares_traded":67422,"percent_of_etf":0.0363},{"direction":"Buy","ticker":"RPTX","company_name":"REPARE THERAPEUTICS INC","shares_traded":15410,"percent_of_etf":0.0049},{"direction":"Sell","ticker":"EXAS","company_name":"EXACT SCIENCES CORP","shares_traded":32057,"percent_of_etf":0.1052}]}],"trade_notification_date":"1/12/2024","sender_email_id":"ark-invest.com","email_date_time":"1/12/2024"} Output JSON : { "etfs": [ { "etf_ticker": "ARKK", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 93654.0, "percent_of_etf": 0.2453 }, { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 159506.0, "percent_of_etf": 0.0907 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 86268.0, "percent_of_etf": 0.0669 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 289619.0, "percent_of_etf": 0.0391 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 927.0, "percent_of_etf": 0.0001 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 100766.0, "percent_of_etf": 0.0829 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 108523.0, "percent_of_etf": 0.0957 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 302096.0, "percent_of_etf": 0.0958 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 553172.0, "percent_of_etf": 0.1476 } ] }, { "etf_ticker": "ARKW", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TSLA", "company_name": "TESLA INC", "shares_traded": 18148.0, "percent_of_etf": 0.2454 }, { "direction": "Sell", "ticker": "HOOD", "company_name": "ROBINHOOD MARKETS INC", "shares_traded": 49.0, "percent_of_etf": 0.0 }, { "direction": "Sell", "ticker": "PD", "company_name": "PAGERDUTY INC", "shares_traded": 9756.0, "percent_of_etf": 0.016 }, { "direction": "Sell", "ticker": "TWLO", "company_name": "TWILIO INC", "shares_traded": 21849.0, "percent_of_etf": 0.0994 }, { "direction": "Sell", "ticker": "PATH", "company_name": "UIPATH INC", "shares_traded": 105944.0, "percent_of_etf": 0.1459 } ] }, { "etf_ticker": "ARKG", "trade_date": "1/12/2024", "stocks": [ { "direction": "Buy", "ticker": "TXG", "company_name": "10X GENOMICS INC", "shares_traded": 38042.0, "percent_of_etf": 0.0864 }, { "direction": "Buy", "ticker": "CRSP", "company_name": "CRISPR THERAPEUTICS AG", "shares_traded": 21197.0, "percent_of_etf": 0.0656 }, { "direction": "Buy", "ticker": "RXRX", "company_name": "RECURSION PHARMACEUTICALS", "shares_traded": 67422.0, "percent_of_etf": 0.0363 }, { "direction": "Buy", "ticker": "RPTX", "company_name": "REPARE THERAPEUTICS INC", "shares_traded": 15410.0, "percent_of_etf": 0.0049 }, { "direction": "Sell", "ticker": "EXAS", "company_name": "EXACT SCIENCES CORP", "shares_traded": 32057.0, "percent_of_etf": 0.1052 } ] } ], "trade_notification_date": "1/12/2024", "sender_email_id": "ark-invest.com", "email_date_time": "1/12/2024" }