Informer

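Overview

The Informer model was proposed in Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.

The method introduces a probabilistic attention mechanism that selects the "active" queries rather than the "lazy" ones. The result is a sparse Transformer that mitigates the quadratic compute and memory requirements of vanilla attention.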
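The idea of separating "active" from "lazy" queries can be sketched in a few lines. The snippet below is a minimal, dependency-free illustration and not the library's implementation: it scores each query by the maximum minus the mean of its scaled dot products with the keys, and keeps the top u queries with u = factor * ceil(ln L). The function names are hypothetical. For clarity the sketch scores every query against all keys, which is itself quadratic; the paper instead estimates the score from a random subset of keys.

```python
import math
import random


def sparsity_scores(queries, keys):
    """Return max-minus-mean of the scaled dot products for every query."""
    d = len(keys[0])
    scores = []
    for q in queries:
        dots = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        scores.append(max(dots) - sum(dots) / len(dots))
    return scores


def select_active_queries(queries, keys, sampling_factor=5):
    """Pick the indices of the top-u queries, with u = factor * ceil(ln L)."""
    length = len(queries)
    u = min(length, max(1, sampling_factor * math.ceil(math.log(length))))
    scores = sparsity_scores(queries, keys)
    ranked = sorted(range(length), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:u])


random.seed(0)
L, d = 64, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
active = select_active_queries(Q, K, sampling_factor=2)
print(len(active), "of", L, "queries kept")  # 10 of 64 queries kept
```

Only the selected queries would then attend over the keys; in the paper the remaining queries receive the mean of the values.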

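The abstract from the paper is the following:

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L logL) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

This model was contributed by elisim and kashif. The original code can be found here.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Check out the Informer blog post on the Hugging Face blog: Multivariate Probabilistic Time Series Forecasting with Informer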

InformerConfig

class transformers.InformerConfig

( prediction_length: typing.Optional[int] = None context_length: typing.Optional[int] = None distribution_output: str = 'student_t' loss: str = 'nll' input_size: int = 1 lags_sequence: typing.List[int] = None scaling: typing.Union[str, bool, NoneType] = 'mean' num_dynamic_real_features: int = 0 num_static_real_features: int = 0 num_static_categorical_features: int = 0 num_time_features: int = 0 cardinality: typing.Optional[typing.List[int]] = None embedding_dimension: typing.Optional[typing.List[int]] = None d_model: int = 64 encoder_ffn_dim: int = 32 decoder_ffn_dim: int = 32 encoder_attention_heads: int = 2 decoder_attention_heads: int = 2 encoder_layers: int = 2 decoder_layers: int = 2 is_encoder_decoder: bool = True activation_function: str = 'gelu' dropout: float = 0.05 encoder_layerdrop: float = 0.1 decoder_layerdrop: float = 0.1 attention_dropout: float = 0.1 activation_dropout: float = 0.1 num_parallel_samples: int = 100 init_std: float = 0.02 use_cache = True attention_type: str = 'prob' sampling_factor: int = 5 distil: bool = True **kwargs )

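Parameters

prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model. This value is typically dictated by the dataset and we recommend setting it appropriately.
context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If None, the context length will be the same as the prediction_length.
distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either "student_t", "normal" or "negative_binomial".
loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll), which is currently the only supported one.
input_size (int, optional, defaults to 1) — The size of the target variable, which by default is 1 for univariate targets. It is greater than 1 for multivariate targets.
scaling (string or bool, optional, defaults to "mean") — Whether to scale the input targets via the "mean" scaler, the "std" scaler, or no scaler if None. If True, the scaler is set to "mean".
lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series as covariates, often dictated by the frequency of the data. Default is [1, 2, 3, 4, 5, 6, 7], but we recommend changing it to suit the dataset.
num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.
num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real valued features.
num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.
num_static_real_features (int, optional, defaults to 0) — The number of static real valued features.
cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers with the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers with the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.
encoder_layers (int, optional, defaults to 2) — Number of encoder layers.
decoder_layers (int, optional, defaults to 2) — Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.
encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the "intermediate" (often named feed-forward) layer in the decoder.
activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.
dropout (float, optional, defaults to 0.05) — The dropout probability for all fully connected layers in the encoder and decoder.
encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers of each encoder layer.
decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers of each decoder layer.
attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.
activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.
num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.
init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
use_cache (bool, optional, defaults to True) — Whether to use the past key/value attentions (if applicable to the model) to speed up decoding.
attention_type (str, optional, defaults to "prob") — Attention used in the encoder. This can be set to "prob" (Informer's ProbAttention) or "full" (the vanilla transformer's canonical self-attention).
sampling_factor (int, optional, defaults to 5) — ProbSparse sampling factor (only takes effect when attention_type="prob"). It is used to control the input length of the reduced query matrix (Q_reduce).
distil (bool, optional, defaults to True) — Whether to use distilling in the encoder.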
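To give a feel for what distil does to sequence lengths, here is a rough, dependency-free sketch. It approximates the distilling step by max pooling alone (kernel 3, stride 2, padding 1); the paper also applies a 1-D convolution and an ELU activation before pooling, which are left out here. It further assumes that pooling follows every encoder layer except the last. Both function names are hypothetical.

```python
def max_pool_1d(values, kernel_size=3, stride=2, padding=1):
    """1-D max pooling over a list of numbers, padding both ends with -inf."""
    pad = [float("-inf")] * padding
    padded = pad + list(values) + pad
    return [
        max(padded[start:start + kernel_size])
        for start in range(0, len(padded) - kernel_size + 1, stride)
    ]


def distilled_lengths(input_length, num_layers):
    """Sequence length seen by each encoder layer (assumption: pooling after all but the last layer)."""
    lengths = [input_length]
    for _ in range(num_layers - 1):
        lengths.append((lengths[-1] - 1) // 2 + 1)
    return lengths


print(max_pool_1d([1.0, 5.0, 2.0, 4.0, 3.0, 0.0, 7.0, 6.0]))  # [5.0, 5.0, 4.0, 7.0]
print(distilled_lengths(96, 3))  # [96, 48, 24]
```

Each pooling step roughly halves the sequence, which is what lets the encoder handle long inputs with less memory.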

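This is the configuration class to store the configuration of an InformerModel. It is used to instantiate an Informer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Informer huggingface/informer-tourism-monthly architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: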

>>> from transformers import InformerConfig, InformerModel

>>> # Initializing an Informer configuration with 12 time steps for prediction
>>> configuration = InformerConfig(prediction_length=12)

>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = InformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
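The interplay between context_length and lags_sequence is easier to see with a small example. The sketch below is a simplified, dependency-free illustration (the library's own lag indexing may differ in detail): it shows why the past window must hold context_length + max(lags_sequence) values, and how lagged values can be gathered for each context step. The helper names are hypothetical.

```python
def past_length(context_length, lags_sequence):
    """Number of past values needed to build all lag features."""
    return context_length + max(lags_sequence)


def lagged_features(series, context_length, lags_sequence):
    """For each of the last `context_length` steps, collect the values `lag` steps earlier."""
    needed = past_length(context_length, lags_sequence)
    if len(series) < needed:
        raise ValueError(f"need at least {needed} past values, got {len(series)}")
    start = len(series) - context_length
    return [[series[t - lag] for lag in lags_sequence] for t in range(start, len(series))]


series = list(range(10))  # toy series 0..9
print(past_length(3, [1, 2, 7]))  # 10
print(lagged_features(series, context_length=3, lags_sequence=[1, 2, 7]))
# [[6, 5, 0], [7, 6, 1], [8, 7, 2]]
```

With the largest lag equal to 7, the earliest context step already reaches back to the very first value of the 10-step window, so a shorter window would not be enough.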

InformerModel

class transformers.InformerModel

( config: InformerConfig )

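Parameters

config (InformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Informer Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward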

( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

Parameters

past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”.
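The sequence_length here is equal to config.context_length + max(config.lags_sequence), which, if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past.

The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags).

Optionally, missing values need to be replaced with zeros and indicated via the past_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.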
past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
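These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).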
static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
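Static categorical features are features which have the same value for all time steps (static over time).

A typical example of a static categorical feature is a time series ID.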
static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
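Static real features are features which have the same value for all time steps (static over time).

A typical example of a static real feature is promotion information.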
future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values.
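The sequence length here is equal to prediction_length.

See the demo notebook and code snippets for details.

Optionally, during training any missing values need to be replaced with zeros and indicated via the future_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.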
future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
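These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.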
future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).
This mask is used to filter out missing values for the final loss calculation.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
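What are attention masks?
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.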
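past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.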
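The convention for missing values described above (replace NaNs with zeros and flag them in an observed mask) can be sketched as follows. Plain Python lists stand in for tensors to keep the example dependency-free, and the helper name is hypothetical; with real tensors, torch.isnan and torch.nan_to_num can be used for the same purpose.

```python
import math


def mask_missing(values):
    """Replace NaNs with 0.0 and return (filled_values, observed_mask)."""
    observed = [0 if math.isnan(v) else 1 for v in values]
    filled = [0.0 if math.isnan(v) else v for v in values]
    return filled, observed


filled, mask = mask_missing([3.0, float("nan"), 5.0])
print(filled)  # [3.0, 0.0, 5.0]
print(mask)    # [1, 0, 1]
```

The filled values would play the role of past_values (or future_values) and the mask that of past_observed_mask (or future_observed_mask).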

transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqTSModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InformerConfig) and inputs.

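last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, used to give the model inputs of the same magnitude and then to shift back to the original magnitude.
scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, used to give the model inputs of the same magnitude and then to rescale back to the original magnitude.
static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.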
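The role of loc and scale can be illustrated with a simplified "mean" scaler. This is a sketch under stated assumptions, not the library's scaler: it divides by the mean absolute value of the observed context entries, assumes loc is 0 for mean scaling, and uses a hypothetical minimum_scale guard against division by zero. The function names are made up for this example.

```python
def mean_scale(values, observed, minimum_scale=1e-10):
    """Scale by the mean absolute value of the observed entries."""
    total = sum(abs(v) * m for v, m in zip(values, observed))
    count = sum(observed)
    scale = max(total / max(count, 1), minimum_scale)
    return [v / scale for v in values], scale


def rescale(scaled_values, loc, scale):
    """Map scaled values back to the original magnitude."""
    return [v * scale + loc for v in scaled_values]


context = [10.0, 0.0, 30.0]   # the middle value was missing and zero-filled
observed = [1, 0, 1]
scaled, scale = mean_scale(context, observed)
print(scale)                       # 20.0
print(scaled)                      # [0.5, 0.0, 1.5]
print(rescale(scaled, 0.0, scale)) # [10.0, 0.0, 30.0]
```

Only observed entries contribute to the scale, so zero-filled gaps do not drag it down.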

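The InformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: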

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import InformerModel

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = InformerModel.from_pretrained("huggingface/informer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> last_hidden_state = outputs.last_hidden_state

InformerForPrediction

class transformers.InformerForPrediction

( config: InformerConfig )

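Parameters

config (InformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Informer Model with a distribution head on top for time-series forecasting. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward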

( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: typing.Optional[torch.Tensor] = None static_real_features: typing.Optional[torch.Tensor] = None future_values: typing.Optional[torch.Tensor] = None future_time_features: typing.Optional[torch.Tensor] = None future_observed_mask: typing.Optional[torch.Tensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None use_cache: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

Parameters

past_values (torch.FloatTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size)) — Past values of the time series, that serve as context in order to predict the future. The sequence size of this tensor must be larger than the context_length of the model, since the model will use the larger size to construct lag features, i.e. additional values from the past which are added in order to serve as “extra context”.
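The sequence_length here is equal to config.context_length + max(config.lags_sequence), which, if no lags_sequence is configured, is equal to config.context_length + 7 (as by default, the largest look-back index in config.lags_sequence is 7). The property _past_length returns the actual length of the past.

The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features and lags).

Optionally, missing values need to be replaced with zeros and indicated via the past_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.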
past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features)) — Required time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
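These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.
past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).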
static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
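Static categorical features are features which have the same value for all time steps (static over time).

A typical example of a static categorical feature is a time series ID.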
static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
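Static real features are features which have the same value for all time steps (static over time).

A typical example of a static real feature is promotion information.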
future_values (torch.FloatTensor of shape (batch_size, prediction_length) or (batch_size, prediction_length, input_size), optional) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs during training to learn to output, given the past_values.
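The sequence length here is equal to prediction_length.

See the demo notebook and code snippets for details.

Optionally, during training any missing values need to be replaced with zeros and indicated via the future_observed_mask.

For multivariate time series, the input_size > 1 dimension is required and corresponds to the number of variates in the time series per time step.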
future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features)) — Required time features for the prediction window, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step. Holiday features are also a good example of time features.
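These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires additional time features to be provided. The Time Series Transformer only learns additional embeddings for static_categorical_features.

Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these features must be known at prediction time.

The num_features here is equal to config.num_time_features + config.num_dynamic_real_features.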
future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length) or (batch_size, sequence_length, input_size), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:
- 1 for values that are observed,
- 0 for values that are missing (i.e. NaNs that were replaced by zeros).
This mask is used to filter out missing values for the final loss calculation.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
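What are attention masks?
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.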
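past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.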
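The "age" feature mentioned for past_time_features and future_time_features can be built in many ways. One common choice, shown here purely as an assumption and not something this page prescribes, is log10(2 + t) over the combined past and future window, which is small for distant past steps and grows monotonically. The helper name is hypothetical and plain lists stand in for tensors.

```python
import math


def age_feature(past_len, future_len):
    """log10(2 + t) over the whole window, split into past and future parts."""
    ages = [math.log10(2.0 + t) for t in range(past_len + future_len)]
    return ages[:past_len], ages[past_len:]


past_age, future_age = age_feature(past_len=4, future_len=2)
print([round(a, 3) for a in past_age])
print([round(a, 3) for a in future_age])
```

Each part would form one column of the corresponding time-feature tensor, counted in config.num_time_features.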

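transformers.modeling_outputs.Seq2SeqTSModelOutput or tuple(torch.FloatTensor)

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, used to give the model inputs of the same magnitude and then to shift back to the original magnitude.
scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, used to give the model inputs of the same magnitude and then to rescale back to the original magnitude.
static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.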

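The InformerForPrediction forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples: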

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import InformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = InformerForPrediction.from_pretrained(
...     "huggingface/informer-tourism-monthly"
... )

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> loss = outputs.loss
>>> loss.backward()

>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_time_features=batch["future_time_features"],
... )

>>> mean_prediction = outputs.sequences.mean(dim=1)
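The last line above averages over the sample dimension. The same reduction, along with a median as a more outlier-robust alternative, can be sketched on plain Python lists. The function name is hypothetical, and the toy input stands in for the sampled trajectories of a single series (one list per sample, each of length prediction_length).

```python
import statistics


def summarize_samples(samples):
    """Reduce a list of sample paths to per-step mean and median forecasts."""
    per_step = list(zip(*samples))  # regroup values by time step
    mean = [statistics.fmean(step) for step in per_step]
    median = [statistics.median(step) for step in per_step]
    return mean, median


samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]  # 3 samples, 2 future steps
mean, median = summarize_samples(samples)
print(mean)    # [3.0, 5.0]
print(median)  # [3.0, 4.0]
```

With tensors, the median over the sample dimension can be obtained in the same spirit via outputs.sequences.median(dim=1).values.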
