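Tutorial: Using Audio in DSPy Programs¶
This tutorial walks you through building pipelines for audio-based applications with DSPy.
Install Dependencies¶
Make sure you are using the latest version of DSPy: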
pip install -U dspy
To process audio data, install the following dependencies:
pip install datasets soundfile torch==2.0.1+cu118 torchaudio==2.0.2+cu118
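Load the Spoken-SQuAD Dataset¶
We will use the Spoken-SQuAD dataset (the official release and a HuggingFace version used for this tutorial's demonstration), which contains spoken audio passages for question answering: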
import random
import dspy
from dspy.datasets import DataLoader
kwargs = dict(fields=("context", "instruction", "answer"), input_keys=("context", "instruction"))
spoken_squad = DataLoader().from_huggingface(dataset_name="AudioLLMs/spoken_squad_test", split="train", trust_remote_code=True, **kwargs)
random.Random(42).shuffle(spoken_squad)
spoken_squad = spoken_squad[:100]
split_idx = len(spoken_squad) // 2
trainset_raw, testset_raw = spoken_squad[:split_idx], spoken_squad[split_idx:]
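Preprocess Audio Data¶
The audio clips in the dataset need some preprocessing: each one is converted into a byte array together with its sampling rate.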
def preprocess(x):
    audio = dspy.Audio.from_array(x.context["array"], x.context["sampling_rate"])
    return dspy.Example(
        passage_audio=audio,
        question=x.instruction,
        answer=x.answer
    ).with_inputs("passage_audio", "question")
trainset = [preprocess(x) for x in trainset_raw]
testset = [preprocess(x) for x in testset_raw]
len(trainset), len(testset)
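To make "a byte array plus a sampling rate" concrete, here is a minimal standard-library sketch that encodes a float array as base64 WAV bytes and reads the header back. The helper names `array_to_wav_base64` and `wav_base64_info` are hypothetical, and this is only an illustration of the idea, not a description of how `dspy.Audio.from_array` is implemented.

```python
import base64
import io
import math
import struct
import wave


def array_to_wav_base64(samples, sampling_rate):
    """Encode float samples in [-1, 1] as base64 text of 16-bit mono WAV bytes."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 16-bit PCM
        wav_file.setframerate(sampling_rate)  # stored in the WAV header
        wav_file.writeframes(pcm)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def wav_base64_info(encoded):
    """Return (sampling_rate, num_frames) read back from a base64 WAV payload."""
    with wave.open(io.BytesIO(base64.b64decode(encoded)), "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnframes()


# A 0.1 second, 440 Hz tone at 16 kHz stands in for a dataset clip.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
encoded = array_to_wav_base64(tone, rate)
print(wav_base64_info(encoded))  # (16000, 1600)
```

The sampling rate travels inside the WAV header, which is why the array and the rate have to be supplied together when the clip is encoded.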
class SpokenQASignature(dspy.Signature):
    """Answer the question based on the audio clip."""
    passage_audio: dspy.Audio = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc='factoid answer between 1 and 5 words')
spoken_qa = dspy.ChainOfThought(SpokenQASignature)
Now let's configure an LLM that can process audio input.
dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini-audio-preview-2024-12-17'))
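Note: Using dspy.Audio in the signature allows audio to be passed directly to the model.
Define the Evaluation Metric¶
We will use the exact match metric (dspy.evaluate.answer_exact_match) to measure how accurately the predicted answers match the provided reference answers: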
evaluate_program = dspy.Evaluate(devset=testset, metric=dspy.evaluate.answer_exact_match,display_progress=True, num_threads = 10, display_table=True)
evaluate_program(spoken_qa)
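For intuition about what an exact match metric rewards, the sketch below compares two answers after SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace). The normalization rules and the names `normalize_answer` and `exact_match` are assumptions for illustration; `dspy.evaluate.answer_exact_match` may differ in its details.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return True only when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(exact_match("Paris, France", "Paris"))             # False
```

Because any extra word counts as a miss, a strict metric like this pairs naturally with the short "between 1 and 5 words" answer constraint in the signature.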
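Optimize with DSPy¶
You can optimize this audio-based program just as you would with any DSPy optimizer.
Note: Audio tokens can be costly, so it is recommended to configure optimizers such as dspy.BootstrapFewShotWithRandomSearch or dspy.MIPROv2 conservatively, with 0-2 few-shot examples and fewer candidates/trials than the optimizer's defaults.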
optimizer = dspy.BootstrapFewShotWithRandomSearch(metric = dspy.evaluate.answer_exact_match, max_bootstrapped_demos=2, max_labeled_demos=2, num_candidate_programs=5)
optimized_program = optimizer.compile(spoken_qa, trainset = trainset)
evaluate_program(optimized_program)
prompt_lm = dspy.LM(model='gpt-4o-mini') #NOTE - this is the LLM guiding the MIPROv2 instruction candidate proposal
optimizer = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", prompt_model = prompt_lm)
#NOTE - MIPROv2's dataset summarizer cannot process the audio files in the dataset, so we turn off the data_aware_proposer
optimized_program = optimizer.compile(spoken_qa, trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2, data_aware_proposer=False)
evaluate_program(optimized_program)
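With this small subset, MIPROv2 led to an improvement of roughly 10% over the baseline performance.
Now that we have seen how to use an LLM that accepts audio input in DSPy, let's flip the setup.
In the next task, we will use a standard text-based LLM to generate prompts for a text-to-speech model, and then evaluate the quality of the generated speech for a downstream task. This approach is generally more cost-effective than asking an LLM such as gpt-4o-mini-audio-preview-2024-12-17 to generate audio directly, while still giving us a pipeline that can be optimized for higher-quality speech output.
Load the CREMA-D Dataset¶
We will use the CREMA-D dataset (the official release and a HuggingFace version used for this tutorial's demonstration), which contains audio clips of selected participants speaking the same line with one of six target emotions: neutral, happy, sad, anger, fear, and disgust.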
from collections import defaultdict
label_map = ['neutral', 'happy', 'sad', 'anger', 'fear', 'disgust']
kwargs = dict(fields=("sentence", "label", "audio"), input_keys=("sentence", "label"))
crema_d = DataLoader().from_huggingface(dataset_name="myleslinder/crema-d", split="train", trust_remote_code=True, **kwargs)
def preprocess(x):
    return dspy.Example(
        raw_line=x.sentence,
        target_style=label_map[x.label],
        reference_audio=dspy.Audio.from_array(x.audio["array"], x.audio["sampling_rate"])
    ).with_inputs("raw_line", "target_style")
random.Random(42).shuffle(crema_d)
crema_d = crema_d[:100]
random.seed(42)
label_to_indices = defaultdict(list)
for idx, x in enumerate(crema_d):
    label_to_indices[x.label].append(idx)
per_label = 100 // len(label_map)
train_indices, test_indices = [], []
for indices in label_to_indices.values():
    selected = random.sample(indices, min(per_label, len(indices)))
    split = len(selected) // 2
    train_indices.extend(selected[:split])
    test_indices.extend(selected[split:])
trainset = [preprocess(crema_d[idx]) for idx in train_indices]
testset = [preprocess(crema_d[idx]) for idx in test_indices]
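The arithmetic of the per-label split can be checked on toy data. The sketch below replaces the dataset with hypothetical random integer labels (an assumption, not real CREMA-D data) and counts how many examples each emotion contributes to each half.

```python
import random
from collections import defaultdict

label_names = ["neutral", "happy", "sad", "anger", "fear", "disgust"]

# Hypothetical stand-in for the 100 shuffled examples: integer labels only.
rng = random.Random(0)
toy_labels = [rng.randrange(len(label_names)) for _ in range(100)]

indices_by_label = defaultdict(list)
for idx, label in enumerate(toy_labels):
    indices_by_label[label].append(idx)

per_label = 100 // len(label_names)  # 16: the cap on examples per emotion
train_counts, test_counts = {}, {}
for label, indices in indices_by_label.items():
    selected = rng.sample(indices, min(per_label, len(indices)))
    half = len(selected) // 2
    train_counts[label_names[label]] = half
    test_counts[label_names[label]] = len(selected) - half

print(train_counts)
print(test_counts)
```

With six labels the cap is 16 examples per emotion, so each split holds at most 8 per emotion and 48 in total; an emotion that appears fewer than 16 times among the 100 sampled clips contributes fewer.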
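DSPy Pipeline for Generating TTS Instructions for a Target Emotion¶
We will now build a pipeline that generates emotionally expressive speech by giving a TTS model both the line of text and an instruction on how to speak it. The goal of this task is to use DSPy to generate prompts that steer the TTS output to match the emotion and style of the reference audio in the dataset.
First, let's set up the TTS generator to produce spoken audio with a specified emotion or style.
We use gpt-4o-mini-tts because it accepts the raw input text together with a speaking instruction, and returns the audio response as a .wav file that we wrap in dspy.Audio.
We also set up a cache for the TTS output.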
import os
import base64
import hashlib
from openai import OpenAI
CACHE_DIR = ".audio_cache"
os.makedirs(CACHE_DIR, exist_ok=True)
def hash_key(raw_line: str, prompt: str) -> str:
    return hashlib.sha256(f"{raw_line}|||{prompt}".encode("utf-8")).hexdigest()

def generate_dspy_audio(raw_line: str, prompt: str) -> dspy.Audio:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    key = hash_key(raw_line, prompt)
    wav_path = os.path.join(CACHE_DIR, f"{key}.wav")
    if not os.path.exists(wav_path):
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="coral", #NOTE - this can be configured to any of the 11 offered OpenAI TTS voices - https://platform.openai.com/docs/guides/text-to-speech#voice-options.
            input=raw_line,
            instructions=prompt,
            response_format="wav"
        )
        with open(wav_path, "wb") as f:
            f.write(response.content)
    with open(wav_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return dspy.Audio(data=encoded, format="wav")
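Now let's define the DSPy program that generates the TTS instructions. For this program we can go back to a standard text-based LLM, since we are only generating instructions.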
class EmotionStylePromptSignature(dspy.Signature):
    """Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style."""
    raw_line: str = dspy.InputField()
    target_style: str = dspy.InputField()
    openai_instruction: str = dspy.OutputField()

class EmotionStylePrompter(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prompter = dspy.ChainOfThought(EmotionStylePromptSignature)

    def forward(self, raw_line, target_style):
        # Generate the speaking instruction, then synthesize the line with it.
        out = self.prompter(raw_line=raw_line, target_style=target_style)
        audio = generate_dspy_audio(raw_line, out.openai_instruction)
        return dspy.Prediction(audio=audio)
dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini'))
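Define the Evaluation Metric¶
Comparing generated audio against a reference is generally a non-trivial task, because judgments of speech are subjective, especially for emotional expression. In this tutorial, we use an embedding-based similarity metric for objective evaluation: Wav2Vec 2.0 converts the audio into embeddings, and we compute the cosine similarity between the reference audio and the generated audio. For a more accurate assessment of audio quality, human feedback or perceptual metrics would be more suitable.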
import torch
import torchaudio
import soundfile as sf
import io
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
def decode_dspy_audio(dspy_audio):
    audio_bytes = base64.b64decode(dspy_audio.data)
    # The sampling rate is discarded here; the model expects bundle.sample_rate,
    # so resample first if your clips use a different rate.
    array, _ = sf.read(io.BytesIO(audio_bytes), dtype="float32")
    return torch.tensor(array).unsqueeze(0)

def extract_embedding(audio_tensor):
    with torch.inference_mode():
        # Mean-pool the frame-level features into a single embedding.
        return model(audio_tensor)[0].mean(dim=1)

def cosine_similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b).item()

def audio_similarity_metric(example, pred, trace=None):
    ref_audio = decode_dspy_audio(example.reference_audio)
    gen_audio = decode_dspy_audio(pred.audio)
    ref_embed = extract_embedding(ref_audio)
    gen_embed = extract_embedding(gen_audio)
    score = cosine_similarity(ref_embed, gen_embed)
    if trace is not None:
        return score > 0.8
    return score
evaluate_program = dspy.Evaluate(devset=testset, metric=audio_similarity_metric, display_progress=True, num_threads = 10, display_table=True)
evaluate_program(EmotionStylePrompter())
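The metric above combines three ideas: mean-pooling frame-level features into one embedding, cosine similarity between two embeddings, and a boolean threshold that applies only when `trace` is set (as happens while an optimizer is bootstrapping demonstrations). The following dependency-free sketch reproduces that logic on tiny hand-written vectors; `mean_pool`, `cosine`, and `toy_metric` are hypothetical names, and the 2-dimensional "features" are made up for illustration.

```python
import math


def mean_pool(frames):
    """Average a list of per-frame feature vectors into one embedding."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_metric(ref_frames, gen_frames, trace=None, threshold=0.8):
    """Float score for evaluation; pass/fail boolean when a trace is supplied."""
    score = cosine(mean_pool(ref_frames), mean_pool(gen_frames))
    if trace is not None:
        return score > threshold
    return score


ref = [[1.0, 0.0], [1.0, 2.0]]  # pools to [1.0, 1.0]
gen = [[2.0, 0.0], [0.0, 0.0]]  # pools to [1.0, 0.0]
print(toy_metric(ref, gen))            # about 0.707
print(toy_metric(ref, gen, trace=[]))  # False, since 0.707 is below 0.8
```

Returning a float during evaluation keeps the reported score informative, while the stricter boolean form lets an optimizer keep only demonstrations that clear the threshold.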
Let's look at an example to see what instruction the DSPy program generated and the corresponding score:
program = EmotionStylePrompter()
pred = program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)
print(audio_similarity_metric(testset[1], pred)) #0.5725605487823486
dspy.inspect_history(n=1)
[2025-05-15T22:01:22.667596]

System message:

Your input fields are:
1. `raw_line` (str)
2. `target_style` (str)

Your output fields are:
1. `reasoning` (str)
2. `openai_instruction` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## raw_line ## ]]
{raw_line}

[[ ## target_style ## ]]
{target_style}

[[ ## reasoning ## ]]
{reasoning}

[[ ## openai_instruction ## ]]
{openai_instruction}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style.

User message:

[[ ## raw_line ## ]]
It's eleven o'clock

[[ ## target_style ## ]]
disgust

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
To generate the OpenAI TTS instruction, we need to specify the target emotion or style, which in this case is 'disgust'. We will use the OpenAI TTS instruction format, which includes the text to be spoken and the desired emotion or style.

[[ ## openai_instruction ## ]]
"Speak the following line with a tone of disgust: It's eleven o'clock"

[[ ## completed ## ]]
TTS Instruction:
Speak the following line with a tone of disgust: It's eleven o'clock
from IPython.display import Audio
audio_bytes = base64.b64decode(pred.audio.data)
array, rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
Audio(array, rate=rate)
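The instruction specifies the target emotion but is otherwise not very informative. We can also see that the audio score for this sample is not very high. Let's see if we can do better by optimizing this pipeline.
Optimize with DSPy¶
We can use dspy.MIPROv2 to optimize for the downstream task objective and produce higher-quality TTS instructions, leading to more accurate and expressive audio generation: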
prompt_lm = dspy.LM(model='gpt-4o-mini')
teleprompter = dspy.MIPROv2(metric=audio_similarity_metric, auto="light", prompt_model = prompt_lm)
optimized_program = teleprompter.compile(EmotionStylePrompter(),trainset=trainset)
evaluate_program(optimized_program)
Let's take a look at how the optimized program performs:
pred = optimized_program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)
print(audio_similarity_metric(testset[1], pred)) #0.6691027879714966
dspy.inspect_history(n=1)
[2025-05-15T22:09:40.088592]

System message:

Your input fields are:
1. `raw_line` (str)
2. `target_style` (str)

Your output fields are:
1. `reasoning` (str)
2. `openai_instruction` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## raw_line ## ]]
{raw_line}

[[ ## target_style ## ]]
{target_style}

[[ ## reasoning ## ]]
{reasoning}

[[ ## openai_instruction ## ]]
{openai_instruction}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. "irate customer", "angry boss", etc.]. The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.

User message:

[[ ## raw_line ## ]]
It's eleven o'clock

[[ ## target_style ## ]]
disgust

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
To convey disgust, the speaker's voice should be characterized by a high-pitched tone, a slightly nasal quality, and a sense of revulsion. The speaker's words should be delivered with a sense of distaste and aversion, as if the speaker is trying to convey their strong negative emotions.

[[ ## openai_instruction ## ]]
Generate a text-to-speech synthesis of the input text "It's eleven o'clock" with the following characteristics:
- Tone: Disgusted
- Pitch: High-pitched, slightly nasal
- Emphasis: Emphasize the words to convey a sense of distaste and aversion
- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions
- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.

[[ ## completed ## ]]
MIPROv2 Optimized Program Instruction:
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. "irate customer", "angry boss", etc.]. The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.
TTS Instruction:
Generate a text-to-speech synthesis of the input text "It's eleven o'clock" with the following characteristics:
- Tone: Disgusted
- Pitch: High-pitched, slightly nasal
- Emphasis: Emphasize the words to convey a sense of distaste and aversion
- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions
- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.
from IPython.display import Audio
audio_bytes = base64.b64decode(pred.audio.data)
array, rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
Audio(array, rate=rate)
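MIPROv2's instruction tuning added more flavor to the overall task objective and gave more criteria for how the TTS instruction should be defined. In turn, the generated instruction is much more specific about the various factors of speech prosody, and it produces a higher similarity score.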
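To put the two printed similarity scores for this test example side by side, the short calculation below derives the absolute and relative gain. It covers a single example only, so it should not be read as the improvement over the whole test set.

```python
baseline_score = 0.5725605487823486   # similarity printed before optimization
optimized_score = 0.6691027879714966  # similarity printed after MIPROv2

absolute_gain = optimized_score - baseline_score
relative_gain = absolute_gain / baseline_score

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

Note that even the optimized score stays below the 0.8 threshold the metric applies when a trace is supplied.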