Whisper

Overview

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

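The abstract from the paper is the following:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.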

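This model was contributed by Arthur Zucker. The TensorFlow version of this model was contributed by amyeroberts. The original code can be found in the OpenAI Whisper repository on GitHub.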

Quick usage

You can run Whisper in fewer than 4 lines of code and transcribe audio in under a minute!

# pip install transformers torch

import torch
from transformers import pipeline

whisper = pipeline("automatic-speech-recognition", "openai/whisper-large-v3", torch_dtype=torch.float16, device="cuda:0")

transcription = whisper("<audio_file.mp3>")

print(transcription["text"])

Voila! With the same pipeline you can swap in any Whisper checkpoint from the Hugging Face Hub, depending on your needs.

Bonus: you can replace "cuda" with "mps" to run it seamlessly on a Mac.
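
The device and checkpoint choices above can be wrapped in a small helper. The following is a minimal sketch, not part of the Transformers API: `pick_device` and `build_asr` are hypothetical names, and falling back to float32 on CPU is an assumption made here because half precision is mainly useful on accelerators.

```python
import importlib.util
from typing import Optional


def pick_device() -> str:
    """Return "cuda:0", "mps", or "cpu", depending on what PyTorch reports."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so nothing else can be probed
    import torch

    if torch.cuda.is_available():
        return "cuda:0"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def build_asr(checkpoint: str = "openai/whisper-tiny", device: Optional[str] = None):
    """Create a speech recognition pipeline for a Whisper checkpoint on the Hub."""
    # Imported lazily so that pick_device() stays usable without these packages.
    import torch
    from transformers import pipeline

    device = device or pick_device()
    # Assumption: keep full precision on CPU, use half precision elsewhere.
    dtype = torch.float32 if device == "cpu" else torch.float16
    return pipeline("automatic-speech-recognition", checkpoint, torch_dtype=dtype, device=device)


# Example (downloads the checkpoint, so it is left commented out):
# asr = build_asr("openai/whisper-small")
# print(asr("<audio_file.mp3>")["text"])
```

Smaller checkpoints trade accuracy for speed, so the default checkpoint here is only a convenient starting point.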

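Usage tips

The model usually performs well without any fine-tuning.
The architecture follows a classic encoder-decoder design, which means it relies on the generate() function for inference.
WhisperProcessor can be used to prepare audio for the model and to decode the predicted IDs back into text.
To convert the model and the processor, we recommend using the following: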

python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True

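The script automatically determines all necessary parameters from the OpenAI checkpoint. The tiktoken library must be installed to convert the OpenAI tokenizer to the tokenizers version.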
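
Since the conversion depends on tiktoken being installed, a quick preflight check can save a failed run. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, not part of the conversion script.

```python
import importlib.util


def missing_packages(required=("tiktoken",)):
    """Return the names in `required` that cannot be imported in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install before converting: pip install " + " ".join(missing))
    else:
        print("All listed conversion dependencies are importable.")
```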

Inference

Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model:

>>> from datasets import load_dataset
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file and read it:
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> audio_sample = ds[0]["audio"]

>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

>>> # Use the model and processor to transcribe the audio:
>>> input_features = processor(
...     audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
... ).input_features

>>> # Generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # Decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
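
The processor call above turns raw samples into fixed-size input features. As a rough illustration of why the size is fixed, the sketch below assumes what we believe are the default feature extractor settings (16 kHz audio, 30-second windows, a hop of 160 samples). These values are assumptions rather than something stated on this page, and `pad_or_trim` and `num_frames` are hypothetical helpers, not library functions.

```python
SAMPLING_RATE = 16_000  # assumed default sampling rate, in Hz
CHUNK_SECONDS = 30      # assumed window length, in seconds
HOP_LENGTH = 160        # assumed hop between spectrogram frames, in samples


def pad_or_trim(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Cut `samples` to `target_len`, or right-pad it with zeros (silence)."""
    trimmed = list(samples[:target_len])
    return trimmed + [0.0] * (target_len - len(trimmed))


def num_frames(n_samples, hop=HOP_LENGTH):
    """Number of spectrogram frames produced for `n_samples` audio samples."""
    return n_samples // hop


one_second = [0.0] * SAMPLING_RATE
window = pad_or_trim(one_second)
print(len(window), num_frames(len(window)))
```

Under these assumptions every clip, short or long, is presented to the encoder as the same number of frames, which is why a single generate() call handles clips of up to one window.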

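Whisper is compatible with the following optimisations for both short-form and long-form generation:

PyTorch Scaled Dot Product Attention (SDPA): flash attention and memory-efficient attention kernels. Enabled by default for torch>=2.1.1.
Flash Attention 2: an improved implementation of flash attention through better parallelism and work partitioning.
torch.compile: JIT-compiles the forward pass to dispatch to efficient fused kernels.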
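
The attention backend is selected through the `attn_implementation` argument of `from_pretrained`. The sketch below shows one possible way to pick a value; `choose_attn_implementation` is a hypothetical helper, and it only checks whether the `flash_attn` package is importable. Flash Attention 2 also needs a supported GPU and a half-precision model, which this sketch assumes and does not verify.

```python
import importlib.util


def choose_attn_implementation(prefer_flash: bool = True) -> str:
    """Pick a value for the `attn_implementation` argument of `from_pretrained`."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and has_flash:
        return "flash_attention_2"
    return "sdpa"


print(choose_attn_implementation())

# Intended use (downloads the checkpoint, so it is left commented out):
# model = WhisperForConditionalGeneration.from_pretrained(
#     "openai/whisper-tiny.en", attn_implementation=choose_attn_implementation()
# )
```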

As an example, the following code snippet enables SDPA and torch.compile for up to 5x faster inference:

>>> import torch
>>> from datasets import load_dataset
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file and read it:
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> audio_sample = ds[0]["audio"]

>>> # Load the Whisper model with SDPA attention
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", attn_implementation="sdpa")

>>> # Enable static cache and compile the forward pass
>>> model.generation_config.cache_implementation = "static"
>>> model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

>>> # Use the model and processor to transcribe the audio:
>>> input_features = processor(
...     audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
... ).input_features

>>> # Warm up: the first calls trigger compilation of the forward pass
>>> for _ in range(2):
...     model.generate(input_features)

>>> # Generate token ids using compiled graph (fast!)
>>> predicted_ids = model.generate(input_features)

>>> # Decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
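
The speed-up depends on hardware and model size, so it is worth measuring on your own setup. The helper below is a minimal timing sketch using only the standard library; `time_call` is a hypothetical name, and the warm-up count of 2 simply mirrors the warm-up loop above.

```python
import time


def time_call(fn, *args, warmup: int = 2, repeats: int = 5, **kwargs) -> float:
    """Return the average wall-clock seconds per call of `fn`, after warm-up calls."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # untimed: absorbs compilation and cache setup
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args, **kwargs)
    return (time.perf_counter() - start) / repeats


# Stand-in workload so the sketch runs without a model:
print(f"{time_call(sum, range(100_000)):.6f} s per call")

# With the model above, one would time generation like this:
# seconds = time_call(model.generate, input_features)
```

On a GPU, kernels run asynchronously, so calling torch.cuda.synchronize() inside the timed function gives more trustworthy numbers.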

For more details on each optimisation, refer to the documentation for the optimisations listed above.

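Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Whisper. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Fine-tune Whisper on your own dataset for better downstream performance.
Distil-Whisper: distilled Whisper models for English that are up to 6x faster and 2x smaller. We release the model checkpoints and the distillation code.
A fork with a script to convert a Whisper model in Hugging Face format to OpenAI format. 🌎 Usage example: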

pip install -U openai-whisper
python convert_hf_to_openai.py \
    --checkpoint openai/whisper-tiny \
    --whisper_dump_path whisper-tiny-openai.pt
