
Enhancing Whisper transcriptions: pre- and post-processing techniques


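This notebook offers a guide to improving Whisper's transcriptions. We'll streamline your audio data by trimming and segmenting it, which improves Whisper's transcription quality. After transcription, we'll refine the output by adding punctuation, adjusting product terminology (for example, changing 'five two nine' to '529'), and mitigating Unicode issues. These strategies will help improve the clarity of your transcripts, but tailoring them to your own use case may be beneficial.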

Setup

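To get started, let's import a few libraries:

  • PyDub is a simple, easy-to-use Python library for audio processing tasks such as slicing, concatenating, and exporting audio files.

  • The Audio class from the IPython.display module creates an audio control that plays sound in Jupyter notebooks, a straightforward way to play audio data directly in the notebook.

For our audio file, we'll use a fictional earnings call written by ChatGPT and read aloud by the author. The file is relatively short, but it should illustrate how these pre- and post-processing steps can be applied to any audio file.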

from openai import OpenAI
import os
import urllib.request
from IPython.display import Audio
from pathlib import Path
from pydub import AudioSegment
import ssl

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

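# Set the download URL
earnings_call_remote_filepath = "https://cdn.openai.com/API/examples/data/EarningsCall.wav"

# Set the local save location
earnings_call_filepath = "data/EarningsCall.wav"

# Create the local directory if it does not exist yet
os.makedirs(os.path.dirname(earnings_call_filepath), exist_ok=True)

# Download the example audio file and save it locally
# Note: the next line disables HTTPS certificate verification; remove it if your environment verifies certificates correctly
ssl._create_default_https_context = ssl._create_unverified_context
urllib.request.urlretrieve(earnings_call_remote_filepath, earnings_call_filepath)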

('data/EarningsCall.wav', <http.client.HTTPMessage at 0x11be41f50>)

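Sometimes a long silence at the beginning of a file can cause Whisper to transcribe the audio incorrectly. We'll use PyDub to detect and trim that silence.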

Here, we've set the silence threshold to -20 dBFS. You can change this value if needed.

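# Function to detect leading silence
# Returns the number of milliseconds until the first sound (a chunk averaging more than X decibels)
def milliseconds_until_sound(sound, silence_threshold_in_decibels=-20.0, chunk_size=10):
    trim_ms = 0  # ms

    assert chunk_size > 0  # to avoid an infinite loop
    # Check the bounds first so an empty slice past the end of the audio is never measured
    while trim_ms < len(sound) and sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold_in_decibels:
        trim_ms += chunk_size

    return trim_ms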


# Trim the leading silence and save the result next to the original as "trimmed_<filename>"
def trim_start(filepath):
    path = Path(filepath)
    directory = path.parent
    filename = path.name
    audio = AudioSegment.from_file(filepath, format="wav")
    start_trim = milliseconds_until_sound(audio)
    trimmed = audio[start_trim:]
    new_filename = directory / f"trimmed_{filename}"
    trimmed.export(new_filename, format="wav")
    return trimmed, new_filename


# Send one audio file to the Whisper API and return the transcribed text
def transcribe_audio(file, output_dir):
    audio_path = os.path.join(output_dir, file)
    with open(audio_path, 'rb') as audio_data:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_data)
        return transcription.text
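
To make the threshold logic concrete, here is a minimal sketch, without PyDub, of how leading silence could be measured on raw samples. The helper names `dbfs` and `samples_until_sound`, the 16-bit full-scale value, and the sample signal are all assumptions for illustration. The formula used (20·log10 of RMS over full scale) is the common definition of dBFS and is assumed, not confirmed here, to match what PyDub reports.

```python
import math

def dbfs(samples, max_amplitude=32768):
    """Loudness of a chunk relative to full scale; -inf for silence or an empty chunk."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / max_amplitude) if rms > 0 else float("-inf")

def samples_until_sound(samples, threshold_db=-20.0, chunk_size=10):
    """Offset of the first chunk whose loudness reaches the threshold."""
    assert chunk_size > 0  # avoid an infinite loop
    pos = 0
    while pos < len(samples) and dbfs(samples[pos:pos + chunk_size]) < threshold_db:
        pos += chunk_size
    return pos

# Hypothetical signal: 30 silent samples followed by a loud square wave (16-bit range)
signal = [0] * 30 + [20000, -20000] * 20
print(samples_until_sound(signal))  # → 30
```

A threshold closer to 0 dBFS treats more of the audio as silence, so quiet speech at the start of a recording may call for a lower value such as -40.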

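Sometimes we see Unicode character injection in transcripts. Removing any non-ASCII characters should help mitigate this issue.

Keep in mind that you should not use this function if you are transcribing Greek, Cyrillic, Arabic, Chinese, or any other language written with non-ASCII characters.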

# Define function to remove non-ASCII characters
def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i) < 128)
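
The caveat about non-ASCII languages also applies, in a milder form, to English text containing accented letters or currency symbols. The sketch below shows what a plain ASCII filter drops. It also shows a hypothetical alternative, `fold_to_ascii`, which is not part of this notebook: it decomposes accented letters with `unicodedata.normalize` before filtering. The sample sentence is invented for illustration.

```python
import unicodedata

def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)

def fold_to_ascii(text: str) -> str:
    """Hypothetical alternative: decompose accented letters first, then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)

sample = "Café revenue rose to €125 million, a naïve estimate"
print(strip_non_ascii(sample))  # → Caf revenue rose to 125 million, a nave estimate
print(fold_to_ascii(sample))    # → Cafe revenue rose to 125 million, a naive estimate
```

Both variants still delete symbols such as the euro sign, so neither is a safe default when currency symbols matter.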


This function adds formatting and punctuation to our transcript. Whisper generates a transcript with punctuation but without formatting.

# Define function to add punctuation
def punctuation_assistant(ascii_transcript):

    system_prompt = """You are a helpful assistant that adds punctuation to text.
    Preserve the original words and only insert necessary punctuation such as periods,
    commas, capitalization, symbols like dollar signs or percentage signs, and formatting.
    Use only the context provided. If there is no context provided say, 'No context provided'\n"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": ascii_transcript
            }
        ]
    )
    return response
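
The system prompt asks the model to preserve the original words, but nothing in the function enforces it. As a rough sanity check, one could compare the word sequences before and after the punctuation pass while ignoring case and punctuation. The helpers `words_only` and `words_preserved` are hypothetical names introduced here, and the tokenization rule is an assumption that may need adjusting for your transcripts.

```python
import re

def words_only(text: str) -> list:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def words_preserved(before: str, after: str) -> bool:
    """True when both texts contain the same words in the same order."""
    return words_only(before) == words_only(after)

raw = "we invested 25 million in AAA rated bonds enhancing our risk adjusted returns"
punctuated = "We invested $25 million in AAA rated bonds, enhancing our risk-adjusted returns."
print(words_preserved(raw, punctuated))  # → True
```

A `False` result does not say where the texts diverge; it only flags that the punctuated transcript deserves a closer look.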


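Our audio file is a recording of a fake earnings call that mentions many financial products. This function helps ensure that if Whisper transcribes these financial product names incorrectly, they can be corrected.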

# Define function to fix product misspellings
def product_assistant(ascii_transcript):
    system_prompt = """You are an intelligent assistant specializing in financial products;
    your task is to process transcripts of earnings calls, ensuring that all references to
    financial products and common financial terms are in the correct format. For each
    financial product or common term that is typically abbreviated as an acronym, the full term
    should be spelled out followed by the acronym in parentheses. For example, '401k' should be
    transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)'
    , 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)'
    , and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing
    financial products into their numeric representations, followed by the full name of the product in parentheses.
    For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'.
    However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for
    'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to
    and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not
    represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to
    analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted
    transcript and a list of the words you've changed"""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": ascii_transcript
            }
        ]
    )
    return response
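
The prompt asks the model to turn spoken digit sequences such as 'five two nine' into numerals. For comparison, the sketch below does only that narrow part deterministically with a regular expression. This is a swapped-in technique, not what the notebook does, and `collapse_spoken_digits` is a hypothetical helper. It requires at least two digit words in a row so that phrases like 'tier one capital' are left alone, and it makes no attempt to attach product names.

```python
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_WORD = "|".join(DIGIT_WORDS)
# Two or more digit words in a row, separated by whitespace
_RUN = re.compile(rf"\b(?:{_WORD})(?:\s+(?:{_WORD}))+\b", re.IGNORECASE)

def collapse_spoken_digits(text: str) -> str:
    """Replace runs of spoken digits ('five two nine') with numerals ('529')."""
    def to_digits(match):
        return "".join(DIGIT_WORDS[word.lower()] for word in match.group(0).split())
    return _RUN.sub(to_digits, text)

print(collapse_spoken_digits("our five two nine plan and four zero one k offering"))
# → our 529 plan and 401 k offering
```

A rule like this cannot tell a product code from any other run of digits, which is one reason the notebook delegates the decision to a language model with context.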


This function creates a new file whose name is the original file name prefixed with 'trimmed_'.

# Trim the start of the original audio file
trimmed_audio, trimmed_filename = trim_start(earnings_call_filepath)


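Our fake earnings call audio file is fairly short, so we'll adjust the segments accordingly. Keep in mind that you can adjust the segment length as needed.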

# Segment audio
trimmed_audio = AudioSegment.from_wav(trimmed_filename)  # Load the trimmed audio file

one_minute = 1 * 60 * 1000  # Duration of each segment (in milliseconds)

start_time = 0  # Start time of the first segment

i = 0  # Index used to name the segment files

output_dir_trimmed = "trimmed_earnings_directory"  # Output directory for the segmented files

if not os.path.isdir(output_dir_trimmed):  # Create the output directory if it does not exist
    os.makedirs(output_dir_trimmed)

while start_time < len(trimmed_audio):  # Loop over the trimmed audio file
    segment = trimmed_audio[start_time:start_time + one_minute]  # Extract a segment
    segment.export(os.path.join(output_dir_trimmed, f"trimmed_{i:02d}.wav"), format="wav")  # Save the segment
    start_time += one_minute  # Update the start time for the next segment
    i += 1  # Increment the index for naming the next file
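
The segmentation loop above boils down to computing fixed-size millisecond windows over the audio. A minimal sketch of just that arithmetic, without any audio library, might look like this. `segment_bounds` is a hypothetical helper, and the 150-second duration is an invented example.

```python
def segment_bounds(total_ms: int, segment_ms: int) -> list:
    """Return (start, end) millisecond pairs covering [0, total_ms) in fixed-size segments."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]

print(segment_bounds(150_000, 60_000))
# → [(0, 60000), (60000, 120000), (120000, 150000)]
```

Fixed windows can cut in the middle of a sentence, so the transcript may show a stray period or ellipsis where two segments meet.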


# Get the list of trimmed and segmented audio files, sorted numerically
audio_files = sorted(
    (f for f in os.listdir(output_dir_trimmed) if f.endswith(".wav")),
    key=lambda f: int(''.join(filter(str.isdigit, f)))
)
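
The numeric sort key matters because a plain string sort orders file names character by character. The sketch below uses hypothetical, unpadded file names to show the difference. The notebook's own names are zero-padded to two digits, so the two orderings only diverge there once more than 100 segments exist.

```python
# Hypothetical file names without zero padding
names = ["trimmed_10.wav", "trimmed_2.wav", "trimmed_1.wav"]

def numeric_key(filename: str) -> int:
    """Join every digit in the name and read the result as an integer."""
    return int("".join(filter(str.isdigit, filename)))

print(sorted(names))
# → ['trimmed_1.wav', 'trimmed_10.wav', 'trimmed_2.wav']
print(sorted(names, key=numeric_key))
# → ['trimmed_1.wav', 'trimmed_2.wav', 'trimmed_10.wav']
```

Because the key joins every digit in the name, a file such as `q2_trimmed_01.wav` would sort as 201, so this approach assumes the index is the only number in the file name.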


# Use a loop to apply the transcribe function to all audio files
transcriptions = [transcribe_audio(file, output_dir_trimmed) for file in audio_files]


# Concatenate the transcriptions
full_transcript = ' '.join(transcriptions)

print(full_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

# Remove non-ASCII characters from the transcript
ascii_transcript = remove_non_ascii(full_transcript)

print(ascii_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

# Use the punctuation assistant function
response = punctuation_assistant(ascii_transcript)

# Extract the punctuated transcript from the model's response
punctuated_transcript = response.choices[0].message.content


print(punctuated_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

# Use the product assistant function
response = product_assistant(punctuated_transcript)


# Extract the final transcript from the model's response
final_transcript = response.choices[0].message.content

print(final_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar second quarter (Q2) with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in second quarter (Q2) 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in Collateralized Debt Obligations (CDOs), and Residential Mortgage-Backed Securities (RMBS). We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our Debt-to-Equity (D/E) ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with Customer Acquisition Cost (CAC) dropping by 15% and Lifetime Value (LTV) growing by 25%. Our LTV to CAC (LTVCAC) ratio is at an impressive 3.5%. In terms of risk management, we have a Value at Risk (VaR) model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy Tier 1 Capital ratio of 12.5%. Our forecast for the coming quarter is positive. 
We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming Initial Public Offering (IPO) of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful third quarter (Q3). Thank you so much.

Words Changed:
1. Q2 -> second quarter (Q2)
2. EBITDA -> Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)
3. Q2 2022 -> second quarter (Q2) 2022
4. CDOs -> Collateralized Debt Obligations (CDOs)
5. RMBS -> Residential Mortgage-Backed Securities (RMBS)
6. D/E -> Debt-to-Equity (D/E)
7. CAC -> Customer Acquisition Cost (CAC)
8. LTV -> Lifetime Value (LTV)
9. LTVCAC -> LTV to CAC (LTVCAC)
10. VaR -> Value at Risk (VaR)
11. IPO -> Initial Public Offering (IPO)
12. Q3 -> third quarter (Q3)
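
Since the product assistant returns the adjusted transcript and the list of changes in one string, it can be useful to separate the two. The sketch below assumes the reply contains a 'Words Changed:' heading, as the output above happens to. The model is not guaranteed to use that exact heading, so `split_product_response` (a hypothetical helper) falls back to returning the whole text as the transcript when it is missing.

```python
def split_product_response(content: str, marker: str = "Words Changed:"):
    """Split the reply into (transcript, list of change lines) at the first marker."""
    transcript, found, changes = content.partition(marker)
    if not found:
        # Marker missing: treat the entire reply as the transcript
        return content.strip(), []
    change_lines = [line.strip() for line in changes.splitlines() if line.strip()]
    return transcript.strip(), change_lines

# Shortened, invented reply in the same shape as the output above
sample = (
    "Our Value at Risk (VaR) model is in place.\n\n"
    "Words Changed:\n"
    "1. VaR -> Value at Risk (VaR)\n"
    "2. IPO -> Initial Public Offering (IPO)"
)
transcript, changes = split_product_response(sample)
print(transcript)  # → Our Value at Risk (VaR) model is in place.
print(changes)     # → ['1. VaR -> Value at Risk (VaR)', '2. IPO -> Initial Public Offering (IPO)']
```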