OpenAI Pydantic Program¶
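This guide shows how to generate structured data with the new OpenAI API via LlamaIndex. The user only needs to specify a Pydantic object.
We demonstrate two settings:
- Extraction into an Album object (which can contain a list of Song objects)
- Extraction into a DirectoryTree object (which can contain recursive Node objects)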
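Extraction into Album¶
This is a simple example of parsing the output into an Album schema, which can contain multiple songs.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.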
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
%pip install llama-index
from pydantic import BaseModel
from typing import List
from llama_index.program.openai import OpenAIPydanticProgram
Without docstring in Model¶
Define the output schema (without docstrings)
class Song(BaseModel):
    title: str
    length_seconds: int


class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]
Define the OpenAI Pydantic program
prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
)
Run the program to get structured output.
output = program(
    movie_name="The Shining", description="Data model for an album."
)
Function call: Album with args: { "name": "The Shining", "artist": "Various Artists", "songs": [ { "title": "Main Title", "length_seconds": 180 }, { "title": "Opening Credits", "length_seconds": 120 }, { "title": "The Overlook Hotel", "length_seconds": 240 }, { "title": "Redrum", "length_seconds": 150 }, { "title": "Here's Johnny!", "length_seconds": 200 } ] }
With docstring in Model¶
class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]
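To see what the docstring changes, it can help to inspect the JSON schema that Pydantic derives from the model, since a schema of this kind is what describes the function to the LLM. The following is a minimal sketch, assuming only that `pydantic` is installed; the `get_schema` lookup is a convenience added here so the snippet runs on both Pydantic v1 (`schema()`) and v2 (`model_json_schema()`), and it is not part of the original guide.

```python
from typing import List

from pydantic import BaseModel


class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]


# Pydantic v2 exposes model_json_schema(); v1 exposes schema().
get_schema = getattr(Album, "model_json_schema", None) or Album.schema
schema = get_schema()

# The class docstring becomes the schema-level "description".
print(schema["description"])
# The declared fields become the schema "properties".
print(sorted(schema["properties"]))
```

Without a docstring the schema has no `description` key, which is presumably why the earlier call passes `description=` explicitly.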
prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
)
Run the program to get structured output.
output = program(movie_name="The Shining")
Function call: Album with args: { "name": "The Shining", "artist": "Various Artists", "songs": [ { "title": "Main Title", "length_seconds": 180 }, { "title": "Opening Credits", "length_seconds": 120 }, { "title": "The Overlook Hotel", "length_seconds": 240 }, { "title": "Redrum", "length_seconds": 150 }, { "title": "Here's Johnny", "length_seconds": 200 } ] }
The output is a valid Pydantic object that we can then use to call functions/APIs.
output
Album(name='The Shining', artist='Various Artists', songs=[Song(title='Main Title', length_seconds=180), Song(title='Opening Credits', length_seconds=120), Song(title='The Overlook Hotel', length_seconds=240), Song(title='Redrum', length_seconds=150), Song(title="Here's Johnny", length_seconds=200)])
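Because the result is an ordinary object with typed attributes, downstream code can use it directly. The sketch below is illustrative only: `total_length_seconds` and `format_duration` are hypothetical helpers, and a `SimpleNamespace` stands in for the `Album` returned by the program so the snippet runs without an API call. The song lengths are copied from the output above.

```python
from types import SimpleNamespace


def total_length_seconds(album) -> int:
    """Sum the length of every song on an album-like object."""
    return sum(song.length_seconds for song in album.songs)


def format_duration(seconds: int) -> str:
    """Render a number of seconds as M:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


# Stand-in for the Album object returned by the program (hypothetical data
# holder; the real object is a Pydantic model with the same attributes).
album = SimpleNamespace(
    name="The Shining",
    artist="Various Artists",
    songs=[
        SimpleNamespace(title="Main Title", length_seconds=180),
        SimpleNamespace(title="Opening Credits", length_seconds=120),
        SimpleNamespace(title="The Overlook Hotel", length_seconds=240),
        SimpleNamespace(title="Redrum", length_seconds=150),
        SimpleNamespace(title="Here's Johnny", length_seconds=200),
    ],
)

total = total_length_seconds(album)
print(f"{album.name}: {format_duration(total)}")  # The Shining: 14:50
```

The same helpers should work unchanged on the real `output`, since they rely only on the `songs` and `length_seconds` attributes.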
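Stream partial intermediate Pydantic Objects¶
Instead of waiting for the function call to generate the entire JSON, we can use the stream_partial_objects() method of the program to stream valid intermediate instances of the Pydantic output class as soon as they become available. 🔥
First, let's define the output Pydantic class.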
from pydantic import BaseModel, Field


class CharacterInfo(BaseModel):
    """Information about a character."""

    character_name: str
    name: str = Field(..., description="Name of the actor/actress")
    hometown: str


class Characters(BaseModel):
    """List of characters."""

    characters: list[CharacterInfo] = Field(default_factory=list)
Now we'll initialize the program with a prompt template.
from llama_index.program.openai import OpenAIPydanticProgram

prompt_template_str = "Information about 3 characters from the movie: {movie}"
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Characters, prompt_template_str=prompt_template_str
)
Finally, we stream the partial objects using the stream_partial_objects() method.
# Iterate over the partial objects in the stream
for partial_object in program.stream_partial_objects(movie="Harry Potter"):
    # Send the partial object to the frontend for a better user experience
    print(partial_object)
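When forwarding partial objects to a frontend, it may be worth skipping snapshots that have not changed since the previous one. The sketch below is an assumption-laden illustration, not part of LlamaIndex: `changed_only` is a hypothetical helper, and plain dicts stand in for the partial `Characters` objects. Whether the real stream ever repeats a snapshot is not stated in this guide.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def changed_only(snapshots: Iterable[T]) -> Iterator[T]:
    """Yield a snapshot only when it differs from the previous one."""
    sentinel = object()
    previous = sentinel
    for snapshot in snapshots:
        if previous is sentinel or snapshot != previous:
            yield snapshot
        previous = snapshot


# Stand-in data: plain dicts play the role of partial Characters objects.
fake_stream = [
    {"characters": []},
    {"characters": []},
    {"characters": [{"character_name": "Harry Potter"}]},
    {"characters": [{"character_name": "Harry Potter"}]},
    {
        "characters": [
            {"character_name": "Harry Potter"},
            {"character_name": "Hermione Granger"},
        ]
    },
]

updates = list(changed_only(fake_stream))
for update in updates:
    print(update)
```

Since Pydantic models support equality comparison, the same wrapper could be placed around `program.stream_partial_objects(movie="Harry Potter")`.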
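Extracting a List of Album (with Parallel Function Calling)¶
With OpenAI's latest Parallel Function Calling feature, we can extract multiple structured data objects from a single prompt at once!
To do this, we need to:
- pick one of the latest models (e.g. gpt-3.5-turbo-1106), and
- set allow_multiple to True in our OpenAIPydanticProgram (if not, it will only return the first object and raise a warning).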
from llama_index.llms.openai import OpenAI

prompt_template_str = """\
Generate 4 albums about spring, summer, fall, and winter.
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    llm=OpenAI(model="gpt-3.5-turbo-1106"),
    prompt_template_str=prompt_template_str,
    allow_multiple=True,
    verbose=True,
)
output = program()
Function call: Album with args: {"name": "Spring", "artist": "Various Artists", "songs": [{"title": "Blossom", "length_seconds": 180}, {"title": "Sunshine", "length_seconds": 240}, {"title": "Renewal", "length_seconds": 200}]} Function call: Album with args: {"name": "Summer", "artist": "Beach Boys", "songs": [{"title": "Beach Party", "length_seconds": 220}, {"title": "Heatwave", "length_seconds": 260}, {"title": "Vacation", "length_seconds": 180}]} Function call: Album with args: {"name": "Fall", "artist": "Autumn Leaves", "songs": [{"title": "Golden Days", "length_seconds": 210}, {"title": "Harvest Moon", "length_seconds": 240}, {"title": "Crisp Air", "length_seconds": 190}]} Function call: Album with args: {"name": "Winter", "artist": "Snowflakes", "songs": [{"title": "Frosty Morning", "length_seconds": 190}, {"title": "Snowfall", "length_seconds": 220}, {"title": "Cozy Nights", "length_seconds": 250}]}
The output is a list of valid Pydantic objects.
output
[Album(name='Spring', artist='Various Artists', songs=[Song(title='Blossom', length_seconds=180), Song(title='Sunshine', length_seconds=240), Song(title='Renewal', length_seconds=200)]), Album(name='Summer', artist='Beach Boys', songs=[Song(title='Beach Party', length_seconds=220), Song(title='Heatwave', length_seconds=260), Song(title='Vacation', length_seconds=180)]), Album(name='Fall', artist='Autumn Leaves', songs=[Song(title='Golden Days', length_seconds=210), Song(title='Harvest Moon', length_seconds=240), Song(title='Crisp Air', length_seconds=190)]), Album(name='Winter', artist='Snowflakes', songs=[Song(title='Frosty Morning', length_seconds=190), Song(title='Snowfall', length_seconds=220), Song(title='Cozy Nights', length_seconds=250)])]
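Conceptually, each parallel function call carries its own JSON argument string, and each string is parsed into one object. The following sketch mimics the behavior described above (return all objects when `allow_multiple` is set, otherwise keep the first and warn). It uses standard-library dataclasses as stand-ins for the Pydantic models, and `parse_calls` is a hypothetical helper, not LlamaIndex's actual implementation.

```python
import json
import warnings
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Song:
    title: str
    length_seconds: int


@dataclass
class Album:
    name: str
    artist: str
    songs: List[Song]


def album_from_args(raw_args: str) -> Album:
    """Parse the JSON arguments of one function call into an Album."""
    data = json.loads(raw_args)
    songs = [Song(**song) for song in data["songs"]]
    return Album(name=data["name"], artist=data["artist"], songs=songs)


def parse_calls(
    call_args: List[str], allow_multiple: bool = False
) -> Union[Album, List[Album]]:
    """Return every parsed call, or only the first one (with a warning)."""
    albums = [album_from_args(raw) for raw in call_args]
    if allow_multiple:
        return albums
    if len(albums) > 1:
        warnings.warn("Multiple function calls returned; keeping only the first.")
    return albums[0]


# Two argument strings shaped like the verbose output above (abridged).
calls = [
    '{"name": "Spring", "artist": "Various Artists", '
    '"songs": [{"title": "Blossom", "length_seconds": 180}]}',
    '{"name": "Summer", "artist": "Beach Boys", '
    '"songs": [{"title": "Heatwave", "length_seconds": 260}]}',
]

albums = parse_calls(calls, allow_multiple=True)
print([album.name for album in albums])  # ['Spring', 'Summer']
```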
Extraction into Album (Streaming)¶
We also support streaming a list of objects through our stream_list function.
Full credit for this idea goes to the openai_function_call repo: https://github.com/jxnl/openai_function_call/tree/main/examples/streaming_multitask
prompt_template_str = "{input_str}"
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=False,
)

output = program.stream_list(
    input_str="make up 5 random albums",
)
for obj in output:
    print(obj.json(indent=2))
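The general idea behind streaming a list is that each element can be emitted as soon as its JSON is complete, without waiting for the closing bracket of the whole array. The sketch below shows one way to do this with the standard library's `json.JSONDecoder.raw_decode`. It assumes the streamed text is a bare JSON array of objects, and `stream_json_objects` is a hypothetical helper. It is not how `stream_list` is implemented internally.

```python
import json
from typing import Iterable, Iterator


def stream_json_objects(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield each object of a streamed JSON array as soon as it is complete."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("{")
            if start == -1:
                break  # no object has started yet
            try:
                obj, end = decoder.raw_decode(buffer, start)
            except json.JSONDecodeError:
                break  # the object is still incomplete; wait for more text
            yield obj
            buffer = buffer[end:]  # drop the text that was just consumed


# Simulated token stream, split at arbitrary points.
chunks = [
    '[{"name": "A", "so',
    'ngs": []}, {"na',
    'me": "B", "songs": []}]',
]

albums = list(stream_json_objects(chunks))
for album in albums:
    print(album["name"])
```

Each yielded dict could then be validated against the output class before being handed to the caller.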
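Extraction into DirectoryTree object¶
This is directly inspired by jxnl's awesome repo here: https://github.com/jxnl/openai_function_call.
That repository shows how to use OpenAI's function API to parse recursive Pydantic objects. The main requirement is that the recursive Pydantic object must be "wrapped" in a non-recursive one.
Here we show an example in a "directory" setting, where a DirectoryTree object wraps recursive Node objects, to parse a file structure.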
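The guide imports `DirectoryTree` and `Node` from a local `directory` module whose source is not shown. The following is a reconstruction guessed from the schema printed by `DirectoryTree.schema()`, so the field names, defaults, and descriptions come from that output, while the enum member names and the forward-reference handling are assumptions. It requires `pydantic` and tries to run on both v1 and v2.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field


class NodeType(str, Enum):
    """Enumeration representing the types of nodes in a filesystem."""

    FILE = "file"
    FOLDER = "folder"


class Node(BaseModel):
    """Class representing a single node in a filesystem."""

    name: str = Field(..., description="Name of the folder")
    children: List["Node"] = Field(
        default_factory=list,
        description="List of children nodes, only applicable for folders",
    )
    node_type: NodeType = Field(
        default=NodeType.FILE,
        description="Either a file or folder",
    )


# Resolve the self-reference: model_rebuild() on v2, update_forward_refs() on v1.
if hasattr(Node, "model_rebuild"):
    Node.model_rebuild()
else:
    Node.update_forward_refs()


class DirectoryTree(BaseModel):
    """Non-recursive container that wraps the recursive Node type."""

    root: Node = Field(..., description="Root folder of the directory tree")


tree = DirectoryTree(
    root={
        "name": "root",
        "node_type": "folder",
        "children": [{"name": "file1.txt"}],
    }
)
print(tree.root.children[0].name)  # file1.txt
```

Only `Node` refers to itself; `DirectoryTree` has a single `root` field, which is the wrapping described above.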
# NOTE: defining recursive objects in a notebook causes errors
from directory import DirectoryTree, Node
DirectoryTree.schema()
{'title': 'DirectoryTree', 'description': 'Container class representing a directory tree.\n\nArgs:\n root (Node): The root node of the tree.', 'type': 'object', 'properties': {'root': {'title': 'Root', 'description': 'Root folder of the directory tree', 'allOf': [{'$ref': '#/definitions/Node'}]}}, 'required': ['root'], 'definitions': {'NodeType': {'title': 'NodeType', 'description': 'Enumeration representing the types of nodes in a filesystem.', 'enum': ['file', 'folder'], 'type': 'string'}, 'Node': {'title': 'Node', 'description': 'Class representing a single node in a filesystem. Can be either a file or a folder.\nNote that a file cannot have children, but a folder can.\n\nArgs:\n name (str): The name of the node.\n children (List[Node]): The list of child nodes (if any).\n node_type (NodeType): The type of the node, either a file or a folder.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'Name of the folder', 'type': 'string'}, 'children': {'title': 'Children', 'description': 'List of children nodes, only applicable for folders, files cannot have children', 'type': 'array', 'items': {'$ref': '#/definitions/Node'}}, 'node_type': {'description': 'Either a file or folder, use the name to determine which it could be', 'default': 'file', 'allOf': [{'$ref': '#/definitions/NodeType'}]}}, 'required': ['name']}}}
program = OpenAIPydanticProgram.from_defaults(
    output_cls=DirectoryTree,
    prompt_template_str="{input_str}",
    verbose=True,
)
input_str = """
root
├── folder1
│   ├── file1.txt
│   └── file2.txt
└── folder2
    ├── file3.txt
    └── subfolder1
        └── file4.txt
"""
output = program(input_str=input_str)
Function call: DirectoryTree with args: { "root": { "name": "root", "children": [ { "name": "folder1", "children": [ { "name": "file1.txt", "children": [], "node_type": "file" }, { "name": "file2.txt", "children": [], "node_type": "file" } ], "node_type": "folder" }, { "name": "folder2", "children": [ { "name": "file3.txt", "children": [], "node_type": "file" }, { "name": "subfolder1", "children": [ { "name": "file4.txt", "children": [], "node_type": "file" } ], "node_type": "folder" } ], "node_type": "folder" } ], "node_type": "folder" } }
The output is a full DirectoryTree structure with recursive Node objects.
output
DirectoryTree(root=Node(name='root', children=[Node(name='folder1', children=[Node(name='file1.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='file2.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>), Node(name='folder2', children=[Node(name='file3.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='subfolder1', children=[Node(name='file4.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>))
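Once parsed, the recursive structure can be traversed like any other tree. The sketch below flattens it into file paths. The dataclass `Node` is a stand-in for the Pydantic `Node` so the snippet runs on its own, and `iter_file_paths` is a hypothetical helper. It reads `node_type` through `.value` when present, so it should accept either an enum member or a plain string.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Stand-in for the Pydantic Node: a file or a folder with children."""

    name: str
    node_type: str = "file"
    children: List["Node"] = field(default_factory=list)


def iter_file_paths(node, prefix: str = "") -> Iterator[str]:
    """Depth-first walk that yields the path of every file under node."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    kind = getattr(node.node_type, "value", node.node_type)
    if kind == "file":
        yield path
    for child in node.children:
        yield from iter_file_paths(child, path)


# Same layout as the parsed output above.
root = Node(
    "root",
    "folder",
    [
        Node("folder1", "folder", [Node("file1.txt"), Node("file2.txt")]),
        Node(
            "folder2",
            "folder",
            [
                Node("file3.txt"),
                Node("subfolder1", "folder", [Node("file4.txt")]),
            ],
        ),
    ],
)

paths = list(iter_file_paths(root))
for p in paths:
    print(p)
```

With the real result, the call would be `iter_file_paths(output.root)`.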