Cohere

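Overview

The Cohere Command-R model was proposed in the blog post Command-R: Retrieval Augmented Generation at Production Scale by the Cohere team.

The abstract from the blog post is the following:

Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. Today, we are introducing Command-R, a new LLM aimed at large-scale production workloads. Command-R targets the emerging "scalable" category of models that balance high efficiency with strong accuracy, enabling companies to move beyond proof of concept and into production.

Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:

- Strong accuracy on RAG and Tool Use
- Low latency and high throughput
- Longer 128k context and lower pricing
- Strong capabilities across 10 key languages
- Model weights available on HuggingFace for research and evaluation

The model checkpoints are available on the Hugging Face Hub. This model was contributed by Saurabh Dash and Ahmet Üstün. The implementation in Hugging Face is based on the GPT-NeoX code.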

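Usage tips

The checkpoints uploaded on the Hub use torch_dtype = 'float16', which the AutoModel API uses to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant unless you use torch_dtype="auto" when initializing a model, as in model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model is first downloaded (using the dtype of the online checkpoints), then cast to the default dtype of torch (torch.float32), and finally, if a torch_dtype is provided in the config, that dtype is used.

Training the model in float16 is not recommended and is known to produce nan; the model should be trained in bfloat16 instead.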

The model and tokenizer can be loaded via:

# pip install transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format message with the command-r chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids, 
    max_new_tokens=100, 
    do_sample=True, 
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
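The comment in the snippet above shows the prompt string that the chat template produces. As a rough illustration of that layout, here is a pure-Python sketch that rebuilds the same string. It is not the tokenizer's actual template: `render_prompt` and `ROLE_TOKENS` are hypothetical names, and mapping the "assistant" role to `<|CHATBOT_TOKEN|>` is an assumption based on the generation prompt shown in the comment.

```python
# Special tokens copied from the rendered prompt shown in the comment above.
BOS = "<BOS_TOKEN>"
START_OF_TURN = "<|START_OF_TURN_TOKEN|>"
END_OF_TURN = "<|END_OF_TURN_TOKEN|>"
# Assumption: "assistant" turns use the same chatbot token that opens the generation prompt.
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper: build the prompt string for a list of chat messages."""
    parts = [BOS]
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]
        parts.append(f"{START_OF_TURN}{role_token}{message['content']}{END_OF_TURN}")
    if add_generation_prompt:
        # Open a chatbot turn so the model continues with its reply.
        parts.append(f"{START_OF_TURN}{ROLE_TOKENS['assistant']}")
    return "".join(parts)


prompt = render_prompt([{"role": "user", "content": "Hello, how are you?"}])
print(prompt)
```

In practice, tokenizer.apply_chat_template should be preferred, since it uses the template shipped with the checkpoint.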

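When using Flash Attention 2 via attn_implementation="flash_attention_2", don't pass torch_dtype to the from_pretrained class method, and use Automatic Mixed-Precision training. When using Trainer, this simply means setting either fp16 or bf16 to True. Otherwise, make sure you are using torch.autocast. This is required because Flash Attention only supports the fp16 and bf16 data types.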

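Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Command-R. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.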

Text Generation

Loading FP16 model

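# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Request half precision explicitly; without torch_dtype the weights are cast to torch.float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)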

# Format message with the command-r chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids, 
    max_new_tokens=100, 
    do_sample=True, 
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

Loading bitsandbytes 4-bit quantized model

# pip install transformers bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)

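model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Format message with the command-r chat template and move it to the model's device
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
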
gen_tokens = model.generate(
    input_ids, 
    max_new_tokens=100, 
    do_sample=True, 
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

CohereConfig

class transformers.CohereConfig

( vocab_size = 256000 hidden_size = 8192 intermediate_size = 22528 logit_scale = 0.0625 num_hidden_layers = 40 num_attention_heads = 64 num_key_value_heads = None hidden_act = 'silu' max_position_embeddings = 8192 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True pad_token_id = 0 bos_token_id = 5 eos_token_id = 255001 tie_word_embeddings = True rope_theta = 10000.0 rope_scaling = None attention_bias = False attention_dropout = 0.0 use_qk_norm = False **kwargs )

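Parameters

vocab_size (int, optional, defaults to 256000) — Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling CohereModel.
hidden_size (int, optional, defaults to 8192) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 22528) — Dimension of the MLP representations.
logit_scale (float, optional, defaults to 0.0625) — The scaling factor for the output logits.
num_hidden_layers (int, optional, defaults to 40) — Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional) — The number of key_value heads used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see the Grouped Query Attention paper. If not specified, defaults to num_attention_heads.
hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 8192) — The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional, defaults to 0) — Padding token id.
bos_token_id (int, optional, defaults to 5) — Beginning of stream token id.
eos_token_id (int, optional, defaults to 255001) — End of stream token id.
tie_word_embeddings (bool, optional, defaults to True) — Whether to tie weight embeddings.
rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.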
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and expect the model to work on a longer max_position_embeddings, we recommend updating this value accordingly. Expected contents:
- rope_type (str): The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', 'llama3'], with 'default' being the original RoPE implementation.
- factor (float, optional): Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x enables the model to handle sequences of length x * original maximum pre-trained length.
- original_max_position_embeddings (int, optional): Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during pretraining.
- attention_factor (float, optional): Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the factor field to infer the suggested value.
- beta_fast (float, optional): Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
- beta_slow (float, optional): Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
- short_factor (List[float], optional): Only used with 'longrope'. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
- long_factor (List[float], optional): Only used with 'longrope'. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
- low_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE.
- high_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE.
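attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
use_qk_norm (bool, optional, defaults to False) — Whether to use query-key normalization in the attention.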
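The rule relating num_key_value_heads to the attention variant can be sketched in a few lines of plain Python. The helper names `attention_variant` and `queries_per_kv_head` are hypothetical and only illustrate the description above; they are not part of the library.

```python
def attention_variant(num_attention_heads, num_key_value_heads=None):
    """Hypothetical helper: name the attention variant implied by the head counts."""
    if num_key_value_heads is None:
        # As described above, an unspecified value falls back to num_attention_heads.
        num_key_value_heads = num_attention_heads
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def queries_per_kv_head(num_attention_heads, num_key_value_heads):
    """Number of query heads that share one key/value head."""
    return num_attention_heads // num_key_value_heads


print(attention_variant(64))        # default config: MHA
print(attention_variant(64, 8))     # GQA, 8 query heads per key/value head
print(attention_variant(64, 1))     # MQA
```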

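This is the configuration class to store the configuration of a CohereModel. It is used to instantiate a Cohere model according to the specified arguments, defining the model architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information. Instantiating a configuration with the defaults will yield a configuration similar to that of the CohereForAI/c4ai-command-r-v01 model.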

>>> from transformers import CohereModel, CohereConfig

>>> # Initializing a Cohere model configuration
>>> configuration = CohereConfig()

>>> # Initializing a model from the Cohere configuration
>>> model = CohereModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
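To make the rope_theta and rope_scaling parameters more concrete, the sketch below computes rotary inverse frequencies in plain Python. It assumes the per-head dimension is hidden_size // num_attention_heads, and that 'linear' scaling divides every frequency by factor; `rope_inverse_frequencies` is a hypothetical helper, not a library function.

```python
def rope_inverse_frequencies(head_dim, rope_theta=10000.0, linear_factor=1.0):
    """Hypothetical helper: one inverse frequency per pair of head dimensions."""
    inv_freq = [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    # Assumption: 'linear' scaling stretches positions by dividing the frequencies.
    return [f / linear_factor for f in inv_freq]


# Defaults from the signature above: hidden_size=8192, num_attention_heads=64.
head_dim = 8192 // 64
default_freqs = rope_inverse_frequencies(head_dim)
scaled_freqs = rope_inverse_frequencies(head_dim, linear_factor=2.0)

print(len(default_freqs))   # one frequency per dimension pair
print(default_freqs[0], scaled_freqs[0])
```

With a factor of 2, each position rotates half as far, which is why a factor of x is described as covering roughly x times the original sequence length.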

CohereTokenizerFast

class transformers.CohereTokenizerFast

( vocab_file = None merges_file = None tokenizer_file = None clean_up_tokenization_spaces = False unk_token = '<UNK>' bos_token = '<BOS_TOKEN>' eos_token = '<|END_OF_TURN_TOKEN|>' add_bos_token = True add_eos_token = False use_default_system_prompt = False add_prefix_space = False **kwargs )

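Parameters

vocab_file (str, optional) — Path to the vocabulary file.
merges_file (str, optional) — Path to the merges file.
tokenizer_file (str, optional) — tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer.
clean_up_tokenization_spaces (bool, optional, defaults to False) — Whether or not to clean up spaces after decoding; cleanup consists of removing potential artifacts such as extra spaces.
unk_token (str or tokenizers.AddedToken, optional, defaults to "<UNK>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str or tokenizers.AddedToken, optional, defaults to "<BOS_TOKEN>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
eos_token (str or tokenizers.AddedToken, optional, defaults to "<|END_OF_TURN_TOKEN|>") — The end of sequence token.
add_bos_token (bool, optional, defaults to True) — Whether or not to add a bos_token at the start of sequences.
add_eos_token (bool, optional, defaults to False) — Whether or not to add an eos_token at the end of sequences.
use_default_system_prompt (bool, optional, defaults to False) — Whether or not the default system prompt for the Cohere tokenizer should be used.
add_prefix_space (bool, optional, defaults to False) — Whether or not the tokenizer should automatically add a prefix space.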

Construct a Cohere tokenizer. Based on byte-level Byte-Pair-Encoding.

This notably uses ByteFallback and NFC normalization.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> tokenizer.encode("Hello this is a test")
[5, 28339, 2075, 1801, 1671, 3282]

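If you want to change the bos_token or the eos_token, make sure to specify them when initializing the tokenizer, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, see the post-processors documentation (https://huggingface.co/docs/tokenizers/api/post-processors).

Passing add_prefix_space=True when instantiating this tokenizer makes it add a leading space to the input, but since the model was not pretrained this way, it might decrease performance.

When used with is_split_into_words=True, this tokenizer needs to be instantiated with add_prefix_space=True.

This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.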

build_inputs_with_special_tokens

( token_ids_0 token_ids_1 = None )
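This method has no description here, so the following is only a sketch of the behavior that the add_bos_token and add_eos_token parameters suggest: each sequence is wrapped with the BOS id and, optionally, the EOS id. The ids 5 and 255001 are the CohereConfig defaults; the standalone function and its keyword arguments are assumptions for illustration, not the tokenizer's actual code.

```python
BOS_TOKEN_ID = 5        # bos_token_id default from CohereConfig
EOS_TOKEN_ID = 255001   # eos_token_id default from CohereConfig


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     add_bos_token=True, add_eos_token=False):
    """Sketch: wrap each sequence with BOS/EOS ids according to the two flags."""
    bos = [BOS_TOKEN_ID] if add_bos_token else []
    eos = [EOS_TOKEN_ID] if add_eos_token else []
    output = bos + list(token_ids_0) + eos
    if token_ids_1 is not None:
        # Assumption: a second sequence is wrapped the same way and appended.
        output = output + bos + list(token_ids_1) + eos
    return output


# Ids of "Hello this is a test" without the leading BOS, taken from the example above.
ids = [28339, 2075, 1801, 1671, 3282]
print(build_inputs_with_special_tokens(ids))
```

With the default flags this reproduces the encoded sequence from the tokenizer example, which starts with the BOS id 5 and has no trailing EOS.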

get_special_tokens_mask

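( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]

Parameters

token_ids_0 (List[int]) — List of ids of the first sequence.
token_ids_1 (List[int], optional) — List of ids of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.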
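The shape of the returned mask can be illustrated with a small sketch for the already_has_special_tokens=True case. The function `special_tokens_mask` and the set of special ids are hypothetical; the two ids used are the BOS and EOS defaults from CohereConfig.

```python
# Assumption: only the BOS (5) and EOS (255001) ids are treated as special here.
SPECIAL_IDS = {5, 255001}


def special_tokens_mask(token_ids, special_ids=SPECIAL_IDS):
    """Sketch: 1 marks a special token, 0 marks a regular sequence token."""
    return [1 if token_id in special_ids else 0 for token_id in token_ids]


mask = special_tokens_mask([5, 28339, 2075, 1801, 1671, 3282])
print(mask)
```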

create_token_type_ids_from_sequences

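( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

token_ids_0 (List[int]) — The first tokenized sequence.
token_ids_1 (List[int], optional) — The second tokenized sequence.

Returns

List[int]: The token type ids.

Create the token type IDs corresponding to the sequences passed. What are token type IDs?

Should be overridden in a subclass if the model has a special way of building those.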
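As a minimal sketch of what "token type IDs corresponding to the sequences passed" usually means, the function below assigns 0 to every token of the first sequence and 1 to every token of the second. This mirrors the common default convention and is an assumption about this tokenizer, not a copy of its code.

```python
def create_token_type_ids(token_ids_0, token_ids_1=None):
    """Sketch: 0 for tokens of the first sequence, 1 for tokens of the second."""
    type_ids = [0] * len(token_ids_0)
    if token_ids_1 is not None:
        type_ids += [1] * len(token_ids_1)
    return type_ids


print(create_token_type_ids([28339, 2075], [1801, 1671, 3282]))
```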

update_post_processor

( )

Updates the underlying post processor with the current bos_token and eos_token.

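save_vocabulary

( save_directory: str filename_prefix: typing.Optional[str] = None ) → Tuple(str)

Parameters

save_directory (str) — The directory in which to save the vocabulary.
filename_prefix (str, optional) — An optional prefix to add to the names of the saved files.

Returns

Tuple(str): Paths to the files saved.

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won't save the configuration and special token mappings of the tokenizer. Use _save_pretrained() to save the whole state of the tokenizer.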

CohereModel

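class transformers.CohereModel

( config: CohereConfig )

Parameters

config (CohereConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Cohere Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a CohereDecoderLayer.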

forward

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
What are position IDs?
past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
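The relationship between input_ids and attention_mask can be sketched without any library: sequences of different lengths are padded to a common length, and the mask records which positions hold real tokens. The helper `pad_batch` is hypothetical, right-padding is chosen only for illustration, and the pad id 0 is the pad_token_id default from CohereConfig.

```python
def pad_batch(sequences, pad_token_id=0):
    """Sketch: right-pad token id lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 for tokens that are not masked, 0 for padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[5, 28339, 2075], [5, 1801]])
print(batch_ids)
print(batch_mask)
```

In real usage the tokenizer returns both tensors when called with padding enabled, so this helper is only meant to show what the mask encodes.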

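The CohereModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.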

CohereForCausalLM

class transformers.CohereForCausalLM

( config )

forward

( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 **loss_kwargs ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
What are position IDs?
past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
num_logits_to_keep (int, optional) — Calculate logits for the last num_logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary size.
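The "0 means all positions" special case of num_logits_to_keep follows naturally from negative slicing, which the sketch below shows on a plain list standing in for the sequence dimension. The helper `positions_to_project` is hypothetical and only illustrates the selection rule described above.

```python
def positions_to_project(hidden_states, num_logits_to_keep=0):
    """Sketch: keep only the last num_logits_to_keep positions (0 keeps all)."""
    # Since -0 == 0, the slice [-0:] starts at index 0 and keeps the whole sequence.
    return hidden_states[-num_logits_to_keep:]


# One placeholder entry per sequence position.
sequence = ["h0", "h1", "h2", "h3"]
print(positions_to_project(sequence))       # all positions
print(positions_to_project(sequence, 1))    # only the last position, as in generation
```

Projecting a single position instead of the full sequence avoids materializing a logits tensor of size sequence_length x vocab_size, which is where the memory saving comes from.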

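Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CohereConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.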
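The loss entry and the -100 convention for labels can be illustrated with a small pure-Python sketch of a next-token cross-entropy for a single sequence. It assumes the usual causal setup in which the logits at position t are scored against the label at position t + 1; `causal_lm_loss` is a hypothetical helper and not the library's loss function.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped when computing the loss


def causal_lm_loss(logits, labels):
    """Sketch: mean next-token cross-entropy, skipping labels equal to -100.

    logits: list of per-position score lists, labels: list of token ids.
    """
    # Assumption: position t predicts the label at position t + 1.
    shifted_logits = logits[:-1]
    shifted_labels = labels[1:]
    total, count = 0.0, 0
    for scores, label in zip(shifted_logits, shifted_labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else float("nan")


# Uniform scores over a 4-token vocabulary; only one label is not ignored.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = causal_lm_loss(uniform_logits, [-100, 2, -100])
print(loss)
```

With uniform scores every token has probability 1/4, so the loss equals log(4) regardless of which label is used.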

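The CohereForCausalLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: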

>>> from transformers import AutoTokenizer, CohereForCausalLM

>>> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")
>>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
