Transformers

Wav2Vec2Phoneme

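Overview

The Wav2Vec2Phoneme model was proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al., 2021) by Qiantong Xu, Alexei Baevski, and Michael Auli.

The abstract from the paper is the following:

Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.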

Relevant checkpoints can be found under https://huggingface.co/models?other=phoneme-recognition.

This model was contributed by patrickvonplaten.

The original code can be found here.

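Usage tips

Wav2Vec2Phoneme uses the exact same architecture as Wav2Vec2.
Wav2Vec2Phoneme is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
Wav2Vec2Phoneme was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2PhonemeCTCTokenizer.
Wav2Vec2Phoneme can be fine-tuned on multiple languages at once and can decode unseen languages into a sequence of phonemes in a single forward pass.
By default, the model outputs a sequence of phonemes. To transform the phonemes into a sequence of words, one should make use of a dictionary and a language model.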
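
To make the CTC decoding step more concrete, the following is a minimal sketch of greedy CTC collapsing: consecutive repeated predictions are merged, then blank tokens are dropped. It is an illustration only, not the tokenizer's actual implementation. The toy vocabulary, the frame predictions, and the choice of id 0 as the blank token are all assumptions made for this example.

```python
from itertools import groupby

# Hypothetical toy vocabulary; a real checkpoint ships its own vocabulary file.
ID_TO_PHONEME = {0: "<pad>", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}
BLANK_ID = 0  # assumption: the padding token doubles as the CTC blank


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeats, then drop blank tokens."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# One predicted id per output frame (for example, the argmax over the logits).
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0, 3, 4, 4, 0]

phoneme_ids = ctc_greedy_collapse(frame_ids)
phonemes = " ".join(ID_TO_PHONEME[i] for i in phoneme_ids)
print(phonemes)
```

Note that the blank between the two `l` frames is what keeps them as two separate phonemes; without it, the repeats would be merged into one.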

Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model. For the API reference, check out Wav2Vec2's documentation page, except for the tokenizer.

Wav2Vec2PhonemeCTCTokenizer

class transformers.Wav2Vec2PhonemeCTCTokenizer

( vocab_file bos_token = '<s>' eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' phone_delimiter_token = ' ' word_delimiter_token = None do_phonemize = True phonemizer_lang = 'en-us' phonemizer_backend = 'espeak' **kwargs )

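Parameters

vocab_file (str) — File containing the vocabulary.
bos_token (str, optional, defaults to "<s>") — The beginning of sentence token.
eos_token (str, optional, defaults to "</s>") — The end of sentence token.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
do_phonemize (bool, optional, defaults to True) — Whether the tokenizer should phonemize the input or not. do_phonemize should be set to False only if a sequence of phonemes is passed to the tokenizer.
phonemizer_lang (str, optional, defaults to "en-us") — The language of the phoneme set into which the tokenizer should phonemize the input text.
phonemizer_backend (str, optional, defaults to "espeak") — The backend phonemization library that shall be used by the phonemizer library. Defaults to espeak-ng. See the phonemizer package for more information.
**kwargs — Additional keyword arguments passed along to PreTrainedTokenizer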

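Constructs a Wav2Vec2PhonemeCTC tokenizer.

This tokenizer inherits from PreTrainedTokenizer, which contains some of the main methods. Users should refer to the superclass for more information regarding those methods.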
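
As a rough illustration of how the vocabulary, the phone delimiter, and the unknown token interact when the input is already a phoneme sequence (the do_phonemize=False case), here is a simplified sketch. It is not the library's implementation: the helper `phonemes_to_ids` and the vocabulary contents are hypothetical, and a real vocabulary is loaded from the checkpoint's vocab_file.

```python
# Hypothetical vocabulary mapping each phoneme (and special token) to an id.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "h": 4, "ə": 5, "l": 6, "oʊ": 7}


def phonemes_to_ids(phoneme_string, vocab, phone_delimiter_token=" ", unk_token="<unk>"):
    """Split a phoneme string on the delimiter and look each phoneme up.

    Phonemes missing from the vocabulary fall back to the unknown token's id.
    """
    phonemes = [p for p in phoneme_string.split(phone_delimiter_token) if p]
    unk_id = vocab[unk_token]
    return [vocab.get(p, unk_id) for p in phonemes]


# "ʒ" is not in the toy vocabulary, so it maps to the id of "<unk>".
ids = phonemes_to_ids("h ə l oʊ ʒ", vocab)
print(ids)
```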

__call__

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

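Parameters

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.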
padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
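If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence, to provide some overlap between the truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) — Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will then tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between ['right', 'left']. The default value is picked from the class attribute of the same name.
return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of Python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.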
return_token_type_ids (bool, optional) — Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.
What are token type IDs?
return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
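What are attention masks?
return_overflowing_tokens (bool, optional, defaults to False) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) — Whether or not to return special tokens mask information.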
return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token.
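This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast. If using a Python tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) — Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
**kwargs — passed to the self.tokenize() method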

BatchEncoding

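A BatchEncoding with the following fields:

input_ids — List of token ids to be fed to a model.

What are input IDs?
token_type_ids — List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names).

What are token type IDs?
attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names).

What are attention masks?
overflowing_tokens — List of overflowing token sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length — The length of the inputs (when return_length=True)

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.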
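
The interaction between padding to the longest sequence, pad_to_multiple_of, padding_side, and the resulting attention mask can be sketched in plain Python. This is a simplified illustration of the behavior described by the parameters above, not the library's code; the helper `pad_batch` and the pad id 0 are assumptions made for this example.

```python
import math

PAD_ID = 0  # hypothetical id of the padding token


def pad_batch(batch, pad_id=PAD_ID, pad_to_multiple_of=None, padding_side="right"):
    """Pad every sequence to the longest one, optionally rounded up to a multiple."""
    target = max(len(seq) for seq in batch)
    if pad_to_multiple_of:
        target = math.ceil(target / pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = target - len(seq)
        if padding_side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The longest sequence has 5 ids; rounding up to a multiple of 4 gives length 8.
encoded = pad_batch([[4, 5, 6], [7, 8, 9, 10, 11]], pad_to_multiple_of=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

In the attention mask, 1 marks real tokens and 0 marks padding, which mirrors the attention_mask field of the returned BatchEncoding.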

batch_decode

( sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None output_char_offsets: bool = False **kwargs ) → List[str] or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

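Parameters

sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.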
output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.

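Please take a look at the example of ~models.wav2vec2.tokenization_wav2vec2.decode to better understand how to make use of output_word_offsets. ~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode works analogously with phonemes and batched output.
kwargs (additional keyword arguments, optional) — Will be passed to the underlying model-specific decode method.

List[str] or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

The decoded sentence. Will be a ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput when output_char_offsets == True.

Convert a list of lists of token ids into a list of strings by calling decode.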

decode

( token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None output_char_offsets: bool = False **kwargs ) → str or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

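Parameters

token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.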
output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.

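Please take a look at the example of ~models.wav2vec2.tokenization_wav2vec2.decode to better understand how to make use of output_word_offsets. ~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode works the same way with phonemes.
kwargs (additional keyword arguments, optional) — Will be passed to the underlying model-specific decode method.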
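
The description of output_char_offsets mentions combining offsets with the sampling rate and the model's downsampling rate to obtain time-stamps. A minimal sketch of that arithmetic follows. The numbers are assumptions for illustration (16 kHz audio, one output frame per 320 input samples), and the offset entries are hypothetical data shaped like a list of dicts with "char", "start_offset", and "end_offset" keys; check your own checkpoint's configuration and the tokenizer output before relying on these values.

```python
# Assumptions for illustration only.
SAMPLING_RATE = 16_000        # audio samples per second
INPUTS_TO_LOGITS_RATIO = 320  # input samples per output frame (downsampling rate)

# Hypothetical character offsets, measured in output frames.
char_offsets = [
    {"char": "h", "start_offset": 5, "end_offset": 7},
    {"char": "ə", "start_offset": 9, "end_offset": 12},
    {"char": "l", "start_offset": 14, "end_offset": 15},
]


def offsets_to_seconds(offsets, sampling_rate, downsampling_rate):
    """Convert frame offsets to seconds: frames * downsampling / sampling rate."""
    seconds_per_frame = downsampling_rate / sampling_rate
    return [
        {
            "char": entry["char"],
            "start_time": round(entry["start_offset"] * seconds_per_frame, 3),
            "end_time": round(entry["end_offset"] * seconds_per_frame, 3),
        }
        for entry in offsets
    ]


timestamps = offsets_to_seconds(char_offsets, SAMPLING_RATE, INPUTS_TO_LOGITS_RATIO)
for entry in timestamps:
    print(entry)
```

With these assumed values each output frame covers 0.02 seconds, so an offset of 5 frames corresponds to 0.1 seconds into the audio.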