Time Series Transformer

Overview

The Time Series Transformer model is a vanilla encoder-decoder Transformer for time series forecasting. This model was contributed by kashif.

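Usage tips

Similar to other models in the library, TimeSeriesTransformerModel is the raw Transformer without any head on top, and TimeSeriesTransformerForPrediction adds a distribution head on top of the former, which can be used for time-series forecasting. Note that this is a so-called probabilistic forecasting model, not a point forecasting model. This means that the model learns a distribution, from which one can sample. The model doesn't directly output values.
TimeSeriesTransformerForPrediction consists of 2 blocks: an encoder, which takes a context_length of time series values as input (called past_values), and a decoder, which predicts a prediction_length of time series values into the future (called future_values). During training, one needs to provide pairs of (past_values and future_values) to the model.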
In addition to the raw (past_values and future_values), one typically provides additional features to the model. These can be the following:
- past_time_features: temporal features which the model will add to past_values. These serve as “positional encodings” for the Transformer encoder. Examples are “day of the month”, “month of the year”, etc. as scalar values (and then stacked together as a vector). e.g. if a given time-series value was obtained on the 11th of August, then one could have [11, 8] as time feature vector (11 being “day of the month”, 8 being “month of the year”).
- future_time_features: temporal features which the model will add to future_values. These serve as “positional encodings” for the Transformer decoder. Examples are “day of the month”, “month of the year”, etc. as scalar values (and then stacked together as a vector). e.g. if a given time-series value was obtained on the 11th of August, then one could have [11, 8] as time feature vector (11 being “day of the month”, 8 being “month of the year”).
- static_categorical_features: categorical features which are static over time (i.e., have the same value for all past_values and future_values). An example here is the store ID or region ID that identifies a given time-series. Note that these features need to be known for ALL data points (also those in the future).
- static_real_features: real-valued features which are static over time (i.e., have the same value for all past_values and future_values). An example here is the image representation of the product for which you have the time-series values (like the ResNet embedding of a “shoe” picture, if your time-series is about the sales of shoes). Note that these features need to be known for ALL data points (also those in the future).
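As an illustration of the time features described above, the following minimal sketch builds the [day of the month, month of the year] vector for a single time step. The helper name calendar_time_features and the year 2023 are assumptions made for this example only; they are not part of the library.

```python
from datetime import date
from typing import List


def calendar_time_features(day: date) -> List[int]:
    """Return [day of the month, month of the year] for one time step."""
    return [day.day, day.month]


# A value observed on the 11th of August maps to [11, 8], as in the text above.
# The year is arbitrary and only needed to construct a date object.
features = calendar_time_features(date(2023, 8, 11))
print(features)  # [11, 8]
```

In practice one such vector is computed per time step and the vectors are stacked along the time axis to form past_time_features and future_time_features.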
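The model is trained using “teacher-forcing”, similar to how a Transformer is trained for machine translation. This means that, during training, one shifts the future_values one position to the right as input to the decoder, prepended by the last value of past_values. At each time step, the model needs to predict the next target. So the set-up of training is similar to a GPT model for language, except that there's no notion of decoder_start_token_id (we just use the last value of the context as initial input for the decoder).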
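The shift used for teacher forcing can be sketched on plain Python lists. This is a simplified illustration, not the library's implementation: the function build_decoder_input is a hypothetical helper, and the real model additionally works with lagged values and extra features.

```python
from typing import List


def build_decoder_input(past_values: List[float], future_values: List[float]) -> List[float]:
    """Shift future_values right by one step and prepend the last past value."""
    return [past_values[-1]] + future_values[:-1]


past_values = [1.0, 2.0, 3.0]
future_values = [4.0, 5.0, 6.0]

decoder_input = build_decoder_input(past_values, future_values)
# decoder_input[t] is what the decoder sees when it must predict future_values[t].
print(decoder_input)  # [3.0, 4.0, 5.0]
```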
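At inference time, we give the final value of the past_values as input to the decoder. Next, we can sample from the model to make a prediction at the next time step, which is then fed to the decoder in order to make the next prediction (also called autoregressive generation).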
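The autoregressive loop can be sketched as follows. The function toy_sample_next is a hypothetical stand-in for the decoder plus distribution head (here simply a Gaussian centred on the previous value); it is an assumption for illustration and does not reflect how the model computes its distribution.

```python
import random
from typing import Callable, List


def sample_forecast(
    past_values: List[float],
    prediction_length: int,
    sample_next: Callable[[List[float], random.Random], float],
    rng: random.Random,
) -> List[float]:
    """Generate one sample path by feeding each sampled value back to the decoder."""
    decoder_inputs = [past_values[-1]]  # the initial decoder input is the last context value
    forecast = []
    for _ in range(prediction_length):
        next_value = sample_next(decoder_inputs, rng)
        forecast.append(next_value)
        decoder_inputs.append(next_value)  # the sample becomes the next decoder input
    return forecast


def toy_sample_next(decoder_inputs: List[float], rng: random.Random) -> float:
    # Hypothetical stand-in for the learned distribution at the next time step.
    return rng.gauss(decoder_inputs[-1], 1.0)


rng = random.Random(0)
# Several sample paths drawn from the same context, analogous to num_parallel_samples.
paths = [sample_forecast([1.0, 2.0, 3.0], 4, toy_sample_next, rng) for _ in range(5)]
print(len(paths), len(paths[0]))  # 5 4
```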

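Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Check out the Time Series Transformer blog post in the HuggingFace blog: Probabilistic Time Series Forecasting with 🤗 Transformers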

TimeSeriesTransformerConfig

class transformers.TimeSeriesTransformerConfig

( prediction_length: typing.Optional[int] = None context_length: typing.Optional[int] = None distribution_output: str = 'student_t' loss: str = 'nll' input_size: int = 1 lags_sequence: typing.List[int] = [1, 2, 3, 4, 5, 6, 7] scaling: typing.Union[bool, str, NoneType] = 'mean' num_dynamic_real_features: int = 0 num_static_categorical_features: int = 0 num_static_real_features: int = 0 num_time_features: int = 0 cardinality: typing.Optional[typing.List[int]] = None embedding_dimension: typing.Optional[typing.List[int]] = None encoder_ffn_dim: int = 32 decoder_ffn_dim: int = 32 encoder_attention_heads: int = 2 decoder_attention_heads: int = 2 encoder_layers: int = 2 decoder_layers: int = 2 is_encoder_decoder: bool = True activation_function: str = 'gelu' d_model: int = 64 dropout: float = 0.1 encoder_layerdrop: float = 0.1 decoder_layerdrop: float = 0.1 attention_dropout: float = 0.1 activation_dropout: float = 0.1 num_parallel_samples: int = 100 init_std: float = 0.02 use_cache = True **kwargs )

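Parameters

prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model. This value is typically dictated by the dataset and we recommend setting it appropriately.
context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If None, the context length will be the same as the prediction_length.
distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either “student_t”, “normal” or “negative_binomial”.
loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll), which is currently the only supported one.
input_size (int, optional, defaults to 1) — The size of the target variable, which by default is 1 for univariate targets. It is larger than 1 for multivariate targets.
scaling (string or bool, optional, defaults to "mean") — Whether to scale the input targets via the “mean” scaler, the “std” scaler, or no scaler if None. If True, the scaler is set to “mean”.
lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series used as covariates, often dictated by the frequency of the data. The default is [1, 2, 3, 4, 5, 6, 7], but we recommend changing it appropriately based on the dataset.
num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.
num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real-valued features.
num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.
num_static_real_features (int, optional, defaults to 0) — The number of static real-valued features.
cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers with the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers with the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.
encoder_layers (int, optional, defaults to 2) — Number of encoder layers.
decoder_layers (int, optional, defaults to 2) — Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.
encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in the encoder.
decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in the decoder.
activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.
dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the encoder and decoder.
encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each encoder layer.
decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each decoder layer.
attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.
activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.
num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.
init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
use_cache (bool, optional, defaults to True) — Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.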
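This is the configuration class to store the configuration of a TimeSeriesTransformerModel. It is used to instantiate a Time Series Transformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Time Series Transformer huggingface/time-series-transformer-tourism-monthly architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:
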
>>> from transformers import TimeSeriesTransformerConfig, TimeSeriesTransformerModel

>>> # Initializing a Time Series Transformer configuration with 12 time steps for prediction
>>> configuration = TimeSeriesTransformerConfig(prediction_length=12)

>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = TimeSeriesTransformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
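
The documented constraint on cardinality and embedding_dimension can be checked before building a configuration. The sketch below encodes only the rule as stated in the parameter list above; check_static_categorical_settings is a hypothetical helper, the values 366 and 2 are illustrative, and the library's own validation may behave differently.

```python
from typing import List, Optional


def check_static_categorical_settings(
    num_static_categorical_features: int,
    cardinality: Optional[List[int]],
    embedding_dimension: Optional[List[int]],
) -> None:
    """Raise ValueError if the documented length rule is violated."""
    if num_static_categorical_features == 0:
        return  # nothing to check when there are no static categorical features
    settings = (("cardinality", cardinality), ("embedding_dimension", embedding_dimension))
    for name, value in settings:
        if value is None:
            raise ValueError(f"{name} must be set when static categorical features are used")
        if len(value) != num_static_categorical_features:
            raise ValueError(
                f"{name} has {len(value)} entries, expected {num_static_categorical_features}"
            )


# One static categorical feature (for example a series ID) with 366 distinct values,
# embedded into 2 dimensions. Both numbers are illustrative.
check_static_categorical_settings(1, cardinality=[366], embedding_dimension=[2])
```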

TimeSeriesTransformerModel

class transformers.TimeSeriesTransformerModel

( config: TimeSeriesTransformerConfig )

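Parameters

config (TimeSeriesTransformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Time Series Transformer Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward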

( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

Parameters

past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”.
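The sequence_length here is equal to config.context_length + max(config.lags_sequence), which, if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past.

The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags).

Optionally, missing values need to be replaced with zeros and indicated via the past_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.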
past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
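These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).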
static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
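Static categorical features are features which have the same value for all time steps (static over time).

A typical example of a static categorical feature is a time series ID.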
static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
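Static real features are features which have the same value for all time steps (static over time).

A typical example of a static real feature is promotion information.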
future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values.
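The sequence length here is equal to prediction_length.

See the demo notebook and code snippets for details.

Optionally, during training any missing values need to be replaced with zeros and indicated via the future_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.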
future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
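These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.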
future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).
This mask is used to filter out missing values for the final loss calculation.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
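What are attention masks?
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.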
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
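Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.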
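
Two of the input requirements above, the length of past_values and the handling of missing values, can be sketched with plain Python lists. The helpers required_past_length and fill_missing are hypothetical names used for illustration, and the context length of 24 is an arbitrary example value.

```python
import math
from typing import List, Tuple


def required_past_length(context_length: int, lags_sequence: List[int]) -> int:
    """Length of past_values: the context window plus the largest lag."""
    return context_length + max(lags_sequence)


def fill_missing(values: List[float]) -> Tuple[List[float], List[int]]:
    """Replace NaNs with zeros and return the matching observed mask (1 = observed)."""
    observed_mask = [0 if math.isnan(v) else 1 for v in values]
    filled = [0.0 if math.isnan(v) else v for v in values]
    return filled, observed_mask


# With the default lags [1, ..., 7], a context of 24 steps needs 31 past values.
print(required_past_length(24, [1, 2, 3, 4, 5, 6, 7]))  # 31

past_values, past_observed_mask = fill_missing([1.5, float("nan"), 2.5])
print(past_values)         # [1.5, 0.0, 2.5]
print(past_observed_mask)  # [1, 0, 1]
```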

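transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqTSModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TimeSeriesTransformerConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, which are used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, which are used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.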
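
The role of loc and scale can be illustrated with a simplified “mean” scaler on a single context window. This is a sketch under stated assumptions: the scale is taken as the mean absolute value of the observed entries and loc is fixed to 0; the helper names are hypothetical and the library's scaler may treat edge cases differently.

```python
from typing import List, Tuple


def mean_scale(context: List[float], observed_mask: List[int]) -> Tuple[List[float], float, float]:
    """Scale a context window by the mean absolute value of its observed entries."""
    num_observed = sum(observed_mask)
    total = sum(abs(v) * m for v, m in zip(context, observed_mask))
    # Fall back to a scale of 1.0 when nothing is observed or all values are zero.
    scale = total / num_observed if num_observed > 0 and total > 0 else 1.0
    loc = 0.0  # assumption: a mean scaler only rescales and does not shift
    scaled = [(v - loc) / scale for v in context]
    return scaled, loc, scale


def unscale(values: List[float], loc: float, scale: float) -> List[float]:
    """Map model-space values back to the original magnitude."""
    return [v * scale + loc for v in values]


scaled, loc, scale = mean_scale([10.0, 20.0, 30.0], [1, 1, 1])
print(scaled, scale)                 # [0.5, 1.0, 1.5] 20.0
print(unscale(scaled, loc, scale))   # [10.0, 20.0, 30.0]
```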

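The TimeSeriesTransformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: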

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import TimeSeriesTransformerModel

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = TimeSeriesTransformerModel.from_pretrained("huggingface/time-series-transformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> last_hidden_state = outputs.last_hidden_state

TimeSeriesTransformerForPrediction

class transformers.TimeSeriesTransformerForPrediction

( config: TimeSeriesTransformerConfig )

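Parameters

config (TimeSeriesTransformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Time Series Transformer Model with a distribution head on top for time-series forecasting. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward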

( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None future_observed_mask: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

Parameters

past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”.
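The sequence_length here is equal to config.context_length + max(config.lags_sequence), which, if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past.

The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags).

Optionally, missing values need to be replaced with zeros and indicated via the past_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.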
past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
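These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).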
static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
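Static categorical features are features which have the same value for all time steps (static over time).

A typical example of a static categorical feature is a time series ID.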
static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
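Static real features are features which have the same value for all time steps (static over time).

A typical example of a static real feature is promotion information.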
future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values.
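The sequence length here is equal to prediction_length.

See the demo notebook and code snippets for details.

Optionally, during training any missing values need to be replaced with zeros and indicated via the future_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.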
future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
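These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.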
future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).
This mask is used to filter out missing values for the final loss calculation.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
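What are attention masks?
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.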
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
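Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.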
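
The effect of future_observed_mask on the loss can be sketched as a masked average of per-step losses. This is an illustration of the idea only: masked_mean_loss is a hypothetical helper, the loss values are made up, and the library's actual reduction may differ in detail.

```python
from typing import List


def masked_mean_loss(per_step_loss: List[float], future_observed_mask: List[int]) -> float:
    """Average the per-step loss over observed steps only (mask value 1)."""
    weighted = sum(loss * m for loss, m in zip(per_step_loss, future_observed_mask))
    num_observed = sum(future_observed_mask)
    return weighted / num_observed if num_observed > 0 else 0.0


# The second step is missing (mask 0), so its loss of 5.0 is ignored.
print(masked_mean_loss([1.0, 5.0, 3.0], [1, 0, 1]))  # 2.0
```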

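transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, which are used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, which are used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.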

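The TimeSeriesTransformerForPrediction forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: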

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import TimeSeriesTransformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = TimeSeriesTransformerForPrediction.from_pretrained(
...     "huggingface/time-series-transformer-tourism-monthly"
... )

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> loss = outputs.loss
>>> loss.backward()

>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_time_features=batch["future_time_features"],
... )

>>> mean_prediction = outputs.sequences.mean(dim=1)
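
Because the model returns sample paths rather than point values, forecasts are usually summarized across the sample dimension, as the mean over dim=1 does above. The sketch below shows the same idea on nested Python lists for a single series, with a median added as a second summary; summarize_samples and the sample values are illustrative assumptions.

```python
from statistics import mean, median
from typing import List, Tuple


def summarize_samples(samples: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Reduce sample paths of shape (num_samples, prediction_length) per time step."""
    per_step = list(zip(*samples))  # regroup values by time step
    return [mean(step) for step in per_step], [median(step) for step in per_step]


# Three sampled paths over a prediction length of 2 (made-up numbers).
samples = [[1.0, 2.0], [2.0, 4.0], [6.0, 6.0]]
mean_forecast, median_forecast = summarize_samples(samples)
print(mean_forecast)    # [3.0, 4.0]
print(median_forecast)  # [2.0, 4.0]
```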
