OpenAI JSON Mode vs. Function Calling for Data Extraction
OpenAI recently released JSON mode: this new configuration constrains the LLM to only generate strings that parse into valid JSON (but with no guarantee of validation against any particular schema).
Before this, the best way to extract structured data from text was via function calling.
In this notebook, we explore the trade-offs between the new JSON mode and function calling for structured outputs and extraction.
Update: OpenAI has clarified that JSON mode is always enabled for function calling, and opt-in for regular messages (https://community.openai.com/t/json-mode-vs-function-calling/476994/4).
Generate synthetic data
We'll start by generating some synthetic data for our data extraction task. Let's ask our LLM for a hypothetical sales call transcript.
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
In [ ]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-1106")

response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)
In [ ]:
transcript = response.text
print(transcript)
[Phone rings]
John: Hello, this is John.
Sarah: Hi John, this is Sarah from XYZ Company. I'm calling to discuss our new product, the XYZ Widget, and see if it might be a good fit for your business.
John: Hi Sarah, thanks for reaching out. I'm definitely interested in learning more about the XYZ Widget. Can you give me a quick overview of what it does?
Sarah: Of course! The XYZ Widget is a cutting-edge tool that helps businesses streamline their workflow and improve productivity. It's designed to automate repetitive tasks and provide real-time data analytics to help you make informed decisions.
John: That sounds really interesting. I can see how that could benefit our team. Do you have any case studies or success stories from other companies who have used the XYZ Widget?
Sarah: Absolutely, we have several case studies that I can share with you. I'll send those over along with some additional information about the product. I'd also love to schedule a demo for you and your team to see the XYZ Widget in action.
John: That would be great. I'll make sure to review the case studies and then we can set up a time for the demo. In the meantime, are there any specific action items or next steps we should take?
Sarah: Yes, I'll send over the information and then follow up with you to schedule the demo. In the meantime, feel free to reach out if you have any questions or need further information.
John: Sounds good, I appreciate your help Sarah. I'm looking forward to learning more about the XYZ Widget and seeing how it can benefit our business.
Sarah: Thank you, John. I'll be in touch soon. Have a great day!
John: You too, bye.
Setting up our desired schema
Let's specify the desired output "shape" as a Pydantic model.
In [ ]:
from pydantic import BaseModel, Field
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")
Data extraction with function calling
We can use the OpenAIPydanticProgram module in LlamaIndex to make things super easy: simply define a prompt template, and pass in the LLM and pydantic model we've defined.
In [ ]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
In [ ]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)
In [ ]:
output = program(transcript=transcript)
Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Review case studies","Schedule demo"]}
We now have the desired structured data, as a Pydantic model. A quick sanity check shows that the results are as we'd expect.
In [ ]:
output.dict()
Out[ ]:
{'summary': 'Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.', 'products': ['XYZ Widget'], 'rep_name': 'Sarah', 'prospect_name': 'John', 'action_items': ['Review case studies', 'Schedule demo']}
Data extraction with JSON mode
Let's attempt the same thing with JSON mode, instead of function calling.
In [ ]:
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON following the given schema below:\n"
                "{json_schema}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
In [ ]:
messages = prompt.format_messages(
    json_schema=CallSummary.schema_json(), transcript=transcript
)
In [ ]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
We get a valid JSON, but it merely repeats the schema we specified, without actually performing the extraction.
In [ ]:
print(output)
{ "title": "CallSummary", "description": "Data model for a call summary.", "type": "object", "properties": { "summary": { "title": "Summary", "description": "High-level summary of the call transcript. Should not exceed 3 sentences.", "type": "string" }, "products": { "title": "Products", "description": "List of products discussed in the call", "type": "array", "items": { "type": "string" } }, "rep_name": { "title": "Rep Name", "description": "Name of the sales rep", "type": "string" }, "prospect_name": { "title": "Prospect Name", "description": "Name of the prospect", "type": "string" }, "action_items": { "title": "Action Items", "description": "List of action items", "type": "array", "items": { "type": "string" } } }, "required": ["summary", "products", "rep_name", "prospect_name", "action_items"] }
Let's try again by only showing an example of the JSON format we want, instead of specifying the full schema.
In [ ]:
import json

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}

json_example = json.dumps(dict_example)
In [ ]:
messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)
In [ ]:
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
Now we are able to get the extracted structured data as we expected.
In [ ]:
print(output)
{ "summary": "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, which is designed to streamline workflow and improve productivity. They discussed case studies and scheduling a demo for John and his team. The next steps include Sarah sending over information and following up to schedule the demo.", "products": ["XYZ Widget"], "rep_name": "Sarah", "prospect_name": "John", "action_items": ["Review case studies", "Schedule demo"] }
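Since JSON mode only guarantees parseable JSON (not conformance to our schema), it is still worth validating the output against the Pydantic model as a final step. Below is a minimal sketch of that check; it redefines CallSummary so the snippet is self-contained, and uses a hardcoded string standing in for the JSON-mode output above:

```python
import json

from pydantic import BaseModel, Field, ValidationError
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(description="List of products discussed in the call")
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")


# Stand-in for the JSON-mode output above
output = (
    '{"summary": "Sarah called John to discuss the XYZ Widget.",'
    ' "products": ["XYZ Widget"], "rep_name": "Sarah",'
    ' "prospect_name": "John",'
    ' "action_items": ["Review case studies", "Schedule demo"]}'
)

try:
    # Passing parsed fields as kwargs works on both pydantic v1 and v2
    summary = CallSummary(**json.loads(output))
    print(summary.rep_name)  # Sarah
except ValidationError as e:
    # Raised when the JSON parses but does not match the schema,
    # e.g. when the model echoes the schema back instead of extracting
    print(e)
```

This catches the failure mode seen earlier, where the model returned the schema itself: that output is valid JSON but raises a ValidationError here.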
Quick Takeaways
This section covers a few quick takeaways from this tutorial.
- Function calling remains easier to use for structured data extraction (especially if you have already specified your schema as, e.g., a Pydantic model).
- While JSON mode enforces the format of the output, it does not help with validation against a specified schema. Directly passing in a schema may not generate the expected JSON, and may require extra careful formatting and prompting.