Transformers documentation

Pop2Piano

Overview

The Pop2Piano model was proposed in Pop2Piano : Pop Audio-based Piano Cover Generation by Jongho Choi and Kyogu Lee.

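Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great expertise in playing the piano as well as knowledge of the different characteristics and melodies of a song. With Pop2Piano you can generate a cover directly from a song's audio waveform. It is the first model to generate a piano cover directly from pop audio without melody and chord extraction modules.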

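Pop2Piano is an encoder-decoder Transformer model based on T5. The input audio is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four different token types: time, velocity, note and "special". The token ids are then decoded to their equivalent MIDI file.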
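To make the four token types concrete, the following is a minimal sketch of how a flat token id could be split into a type and a value. The range sizes (4 special, 128 note, 2 velocity and 100 time tokens) and their ordering are assumptions chosen for illustration, and `split_token_id` is a hypothetical helper, not part of the library. In practice this mapping is handled by Pop2PianoTokenizer.

```python
# Hypothetical layout: consecutive id ranges, one per token type.
# The sizes and the ordering below are illustrative assumptions.
TOKEN_RANGES = [("special", 4), ("note", 128), ("velocity", 2), ("time", 100)]


def split_token_id(token_id):
    """Return (token_type, value) for a flat token id under the assumed layout."""
    if token_id < 0:
        raise ValueError("token id must be non-negative")
    offset = 0
    for token_type, size in TOKEN_RANGES:
        if token_id < offset + size:
            return token_type, token_id - offset
        offset += size
    raise ValueError(f"token id {token_id} is outside the assumed vocabulary")


decoded = [split_token_id(t) for t in (1, 64, 133, 150)]
print(decoded)
```

Under these assumed ranges, id 64 falls in the note range and maps to note value 60, while id 150 maps to time value 16.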

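The abstract from the paper is the following:

Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.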

This model was contributed by Susnato Dhar. The original code can be found here.

Usage tips

To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:

pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy

Please note that you may need to restart your runtime after installation.
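Before loading the model, it can help to confirm that the third-party modules are importable in the current environment. The snippet below is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the names in `REQUIRED_MODULES` are the import names of the packages in the install command above (pretty-midi is imported as `pretty_midi`).

```python
import importlib.util

# Import names for the third-party packages in the install command above.
REQUIRED_MODULES = ["pretty_midi", "essentia", "librosa", "scipy"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Pop2Piano dependencies are importable.")
```

If a module is still reported as missing right after installation, restarting the runtime as noted above is the first thing to try.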

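Pop2Piano is an encoder-decoder based model like T5.
Pop2Piano can be used to generate MIDI audio files for a given audio sequence.
Choosing different composers in Pop2PianoForConditionalGeneration.generate() can lead to a variety of different results.
Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
Though Pop2Piano was mainly trained on Korean pop music, it also does pretty well on Western pop or hip hop songs.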

Examples

Example using HuggingFace Dataset:

>>> from datasets import load_dataset
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")

>>> inputs = processor(
...     audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
... )
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
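The example above writes to `./Outputs/`, which has to exist before the MIDI file is written (assuming, as is usual for file writes, that `write` does not create missing directories). A minimal sketch of preparing the path with the standard library follows; `prepare_output_path` is a hypothetical helper name.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return output_path


# Intended use with the example above (left commented out so the sketch
# runs without a model):
# midi_path = prepare_output_path("./Outputs/midi_output.mid")
# tokenizer_output.write(str(midi_path))
```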

Example using your own audio file:

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> audio, sr = librosa.load("<your_audio_file_here>", sr=44100)  # feel free to change the sr to a suitable value.
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")

Example of processing multiple audio files in batch:

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)  
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt")
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
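The batched example passes an attention mask because audio clips of different lengths are padded to a common shape. The sketch below illustrates the idea on plain Python lists. It is a simplified illustration: `pad_with_mask` is a hypothetical helper, and the real feature extractor's padding layout may differ.

```python
def pad_with_mask(sequences, pad_value=0.0):
    """Right-pad sequences to a common length and build a 1/0 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        # 1 marks real positions, 0 marks padding.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


padded, mask = pad_with_mask([[0.1, 0.2, 0.3], [0.4]])
print(padded)  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```

Without the mask, the model would have no way to tell the padded zeros of the shorter clip apart from real input.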

Example of processing multiple audio files in batch (using Pop2PianoFeatureExtractor and Pop2PianoTokenizer):

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoFeatureExtractor, Pop2PianoTokenizer

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)  
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
>>> tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = feature_extractor(
...     audio=[audio1, audio2], 
...     sampling_rate=[sr1, sr2], 
...     return_attention_mask=True, 
...     return_tensors="pt",
... )
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = tokenizer.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")

Pop2PianoConfig

class transformers.Pop2PianoConfig

( vocab_size = 2400, composer_vocab_size = 21, d_model = 512, d_kv = 64, d_ff = 2048, num_layers = 6, num_decoder_layers = None, num_heads = 8, relative_attention_num_buckets = 32, relative_attention_max_distance = 128, dropout_rate = 0.1, layer_norm_epsilon = 1e-06, initializer_factor = 1.0, feed_forward_proj = 'gated-gelu', is_encoder_decoder = True, use_cache = True, pad_token_id = 0, eos_token_id = 1, dense_act_fn = 'relu', **kwargs )

Parameters

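vocab_size (int, optional, defaults to 2400) — Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that can be represented by the input_ids passed when calling Pop2PianoForConditionalGeneration.
composer_vocab_size (int, optional, defaults to 21) — Denotes the number of composers.
d_model (int, optional, defaults to 512) — Size of the encoder layers and the pooler layer.
d_kv (int, optional, defaults to 64) — Size of the key, query, value projections per attention head. The inner_dim of the projection layer will be defined as num_heads * d_kv.
d_ff (int, optional, defaults to 2048) — Size of the intermediate feed forward layer in each Pop2PianoBlock.
num_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer encoder.
num_decoder_layers (int, optional) — Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.
num_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
relative_attention_num_buckets (int, optional, defaults to 32) — The number of buckets to use for each attention layer.
relative_attention_max_distance (int, optional, defaults to 128) — The maximum distance of the longer sequences for the bucket separation.
dropout_rate (float, optional, defaults to 0.1) — The ratio for all dropout layers.
layer_norm_epsilon (float, optional, defaults to 1e-6) — The epsilon used by the layer normalization layers.
initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization testing).
feed_forward_proj (string, optional, defaults to "gated-gelu") — Type of feed forward layer to be used. Should be one of "relu" or "gated-gelu".
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
dense_act_fn (string, optional, defaults to "relu") — Type of activation function to be used in Pop2PianoDenseActDense and in Pop2PianoDenseGatedActDense.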

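This is the configuration class to store the configuration of a Pop2PianoForConditionalGeneration. It is used to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the sweetcocoa/pop2piano architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.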
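Two of the documented relationships between parameters can be checked with simple arithmetic: `inner_dim = num_heads * d_kv`, and `num_decoder_layers` falling back to `num_layers` when unset. The sketch below uses a hypothetical stand-in class, `ConfigSketch`, that copies a few defaults from the signature above. It is not the library's Pop2PianoConfig and exists only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in that mirrors a few Pop2PianoConfig defaults."""

    d_model: int = 512
    d_kv: int = 64
    num_layers: int = 6
    num_decoder_layers: Optional[int] = None
    num_heads: int = 8

    def __post_init__(self):
        # Documented fallback: reuse num_layers when num_decoder_layers is unset.
        if self.num_decoder_layers is None:
            self.num_decoder_layers = self.num_layers

    @property
    def inner_dim(self):
        # Documented relation: inner_dim = num_heads * d_kv.
        return self.num_heads * self.d_kv


config = ConfigSketch()
print(config.inner_dim, config.num_decoder_layers)  # 512 6
```

With the defaults, `inner_dim` equals `d_model` (8 * 64 = 512), but the two are independent parameters and need not match.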

Pop2PianoFeatureExtractor

class transformers.Pop2PianoFeatureExtractor

( *args, **kwargs )

__call__

( *args, **kwargs )

Call self as a function.

Pop2PianoForConditionalGeneration

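class transformers.Pop2PianoForConditionalGeneration

( config: Pop2PianoConfig )

Parameters

config (Pop2PianoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Pop2Piano Model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.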

forward

( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, decoder_head_mask: typing.Optional[torch.FloatTensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, input_features: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Pop2Piano is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. To know more on how to prepare input_ids for pretraining take a look at Pop2Piano Training.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Pop2Piano uses the pad_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). To know more on how to prepare
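decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
decoder_head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Does the same task as inputs_embeds. If inputs_embeds is not present but input_features is present then input_features will be considered as inputs_embeds.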
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
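use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. It is used to update the cache in the correct position and to infer the complete sequence length.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]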
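The `labels` entry above says that positions labelled -100 are ignored when the loss is computed. The sketch below shows that behavior for a cross-entropy loss in plain Python. It is an illustration of the masking rule under that assumption, not the model's actual loss code, and `masked_cross_entropy` is a hypothetical helper.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped when computing the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the logits for this position.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[label]
        count += 1
    if count == 0:
        raise ValueError("all labels are ignored; the loss is undefined")
    return total / count


logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0]]
labels = [2, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Only the first position counts here, and its logits are uniform over four classes, so the loss equals ln 4.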

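transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Pop2PianoConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.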

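The Pop2PianoForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.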
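The difference between calling the module instance and calling `forward` directly can be illustrated with a toy class. `TinyModule` below is a hypothetical, heavily simplified stand-in for the dispatch idea behind torch.nn.Module: `__call__` runs registered pre-processing hooks and then delegates to `forward`. It is not PyTorch's implementation.

```python
class TinyModule:
    """Hypothetical, heavily simplified stand-in for torch.nn.Module."""

    def __init__(self):
        self.pre_hooks = []

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        # Run registered pre-processing hooks, then delegate to forward.
        for hook in self.pre_hooks:
            x = hook(x)
        return self.forward(x)


module = TinyModule()
module.pre_hooks.append(lambda x: x + 1)

print(module(3))          # hooks run: (3 + 1) * 2 = 8
print(module.forward(3))  # hooks skipped: 3 * 2 = 6
```

The second call silently skips the hook, which is the failure mode the note above warns about.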

generate

( input_features, attention_mask = None, composer = 'composer1', generation_config = None, **kwargs ) → ModelOutput or torch.LongTensor

Parameters

input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — This is the featurized version of audio generated by Pop2PianoFeatureExtractor.
attention_mask — For batched generation, input_features are padded to have the same shape across all examples. attention_mask helps to determine which areas were padded and which were not.
- 1 for tokens that are not padded,
- 0 for tokens that are padded.
composer (str, optional, defaults to "composer1") — This value is passed to Pop2PianoConcatEmbeddingToMel to generate different embeddings for each "composer". Please make sure that the composer value is present in composer_to_feature_token in generation_config. For an example please see https://huggingface.co/sweetcocoa/pop2piano/blob/main/generation_config.json .
generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
kwargs — Ad hoc parametrization of generation_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.
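The `composer` parameter must name a key in `composer_to_feature_token`. The sketch below shows what such a lookup with validation could look like. The mapping values are made up for illustration (the real ones live in the model's generation_config.json), and `composer_token` is a hypothetical helper rather than a library function.

```python
# Hypothetical mapping; the real one lives in generation_config.json
# and its token values are not reproduced here.
composer_to_feature_token = {"composer1": 1001, "composer2": 1002}


def composer_token(composer, mapping):
    """Return the feature token for `composer`, or raise if it is unknown."""
    if composer not in mapping:
        known = ", ".join(sorted(mapping))
        raise ValueError(f"Unknown composer {composer!r}; expected one of: {known}")
    return mapping[composer]


print(composer_token("composer1", composer_to_feature_token))  # 1001
```

Failing early with the list of known composers makes a misspelled name easier to diagnose than a downstream key error.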

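ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor. Since Pop2Piano is an encoder-decoder model (model.config.is_encoder_decoder=True), the returned ModelOutput is one of the encoder-decoder generation output types.

Generates token ids for MIDI outputs.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model's default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the generation strategies guide.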
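The override rule described above, where keyword arguments that match a generation setting replace it and the rest are forwarded to the model, can be sketched with plain dictionaries. `resolve_generation_settings` and the default values below are hypothetical and only illustrate the precedence; they are not the library's implementation.

```python
def resolve_generation_settings(base_config, **kwargs):
    """Split kwargs into generation overrides and model-specific kwargs.

    Hypothetical helper: kwargs that match a key of base_config override it;
    all other kwargs are returned separately, as they would be forwarded
    to the model.
    """
    settings = dict(base_config)
    model_kwargs = {}
    for key, value in kwargs.items():
        if key in settings:
            settings[key] = value
        else:
            model_kwargs[key] = value
    return settings, model_kwargs


# Illustrative defaults, not the model's real generation configuration.
defaults = {"num_beams": 1, "do_sample": False, "max_length": 256}
settings, model_kwargs = resolve_generation_settings(
    defaults, num_beams=4, do_sample=True, attention_mask=[[1, 1, 0]]
)
print(settings)      # {'num_beams': 4, 'do_sample': True, 'max_length': 256}
print(model_kwargs)  # {'attention_mask': [[1, 1, 0]]}
```

The base dictionary is copied, so the defaults stay unchanged for later calls.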

Pop2PianoTokenizer

class transformers.Pop2PianoTokenizer

( *args, **kwargs )

__call__

( *args, **kwargs )

Call self as a function.

Pop2PianoProcessor

class transformers.Pop2PianoProcessor

( *args, **kwargs )

__call__

( *args, **kwargs )

Call self as a function.
