Trainer

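The Trainer class provides an API for feature-complete training in PyTorch. It supports distributed training on multiple GPUs/TPUs and mixed precision training on NVIDIA GPUs and AMD GPUs, as well as through torch.amp. Trainer goes hand-in-hand with the TrainingArguments class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API.

Seq2SeqTrainer and Seq2SeqTrainingArguments inherit from the Trainer and TrainingArguments classes and are adapted for training models for sequence-to-sequence tasks such as summarization or translation.

The Trainer class is optimized for 🤗 Transformers models and can have surprising behaviors when used with other models. When using it with your own model, make sure:

your model always returns tuples or subclasses of ModelOutput
your model can compute the loss if a labels argument is provided, and that loss is returned as the first element of the tuple (if your model returns tuples)
your model can accept multiple label arguments (use label_names in TrainingArguments to indicate their names to the Trainer), but none of them should be named "label"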

Trainer

class transformers.Trainer

( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None args: TrainingArguments = None data_collator: typing.Optional[transformers.data.data_collator.DataCollator] = None train_dataset: typing.Union[torch.utils.data.dataset.Dataset, torch.utils.data.dataset.IterableDataset, ForwardRef('datasets.Dataset'), NoneType] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], ForwardRef('datasets.Dataset'), NoneType] = None processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None model_init: typing.Optional[typing.Callable[[], transformers.modeling_utils.PreTrainedModel]] = None compute_loss_func: typing.Optional[typing.Callable] = None compute_metrics: typing.Optional[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict]] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[typing.Optional[torch.optim.optimizer.Optimizer], typing.Optional[torch.optim.lr_scheduler.LambdaLR]] = (None, None) optimizer_cls_and_kwargs: typing.Optional[typing.Tuple[typing.Type[torch.optim.optimizer.Optimizer], typing.Dict[str, typing.Any]]] = None preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None )

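Parameters

model (PreTrainedModel or torch.nn.Module, optional) — The model to train, evaluate or use for predictions. If not provided, a model_init must be passed.

Trainer is optimized to work with the PreTrainedModel provided by the library. You can still use your own models defined as torch.nn.Module, as long as they work the same way as the 🤗 Transformers models.
args (TrainingArguments, optional) — The arguments to tweak for training. If not provided, defaults to a basic instance of TrainingArguments with output_dir set to a directory named tmp_trainer in the current directory.
data_collator (DataCollator, optional) — The function used to form a batch from a list of elements of train_dataset or eval_dataset. Defaults to default_data_collator() if no processing_class is provided, or to an instance of DataCollatorWithPadding if the processing_class is a feature extractor or tokenizer.
train_dataset (Union[torch.utils.data.Dataset, torch.utils.data.IterableDataset, datasets.Dataset], optional) — The dataset to use for training. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
Note that if it is a torch.utils.data.IterableDataset with some randomization and you are training in a distributed fashion, your iterable dataset should either use an internal attribute generator that is a torch.Generator for the randomization that must be identical on all processes (the Trainer will manually set the seed of this generator at each epoch), or have a set_epoch() method that internally sets the seed of the RNGs used.
eval_dataset (Union[torch.utils.data.Dataset, Dict[str, torch.utils.data.Dataset], datasets.Dataset], optional) — The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, evaluation runs on each dataset and the dictionary key is prepended to the metric name.
processing_class (PreTrainedTokenizerBase or BaseImageProcessor or FeatureExtractionMixin or ProcessorMixin, optional) — Processing class used to process the data. If provided, it is used to automatically process the inputs for the model, and it is saved along with the model to make it easier to rerun an interrupted training or reuse the fine-tuned model. This supersedes the tokenizer argument, which is now deprecated.
model_init (Callable[[], PreTrainedModel], optional) — A function that instantiates the model to be used. If provided, each call to train() will start from a new instance of the model as given by this function.
The function may take zero arguments, or a single one containing the optuna/Ray Tune/SigOpt trial object, so that it can choose different architectures according to hyperparameters (such as layer count, sizes of inner layers, dropout probabilities, etc.).
compute_loss_func (Callable, optional) — A function that accepts the raw model outputs, the labels, and the number of items in the entire accumulated batch (batch_size * gradient_accumulation_steps) and returns the loss. For an example, see the default loss function used by Trainer.
compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping strings to metric values. Note: when passing TrainingArguments with batch_eval_metrics set to True, your compute_metrics function must take a boolean compute_result argument. It is set after the last eval batch to signal that the function needs to calculate and return the global summary statistics rather than accumulating the batch-level statistics.
callbacks (List of TrainerCallback, optional) — A list of callbacks to customize the training loop. These are added to the list of default callbacks.
If you want to remove one of the default callbacks, use the Trainer.remove_callback() method.
optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional, defaults to (None, None)) — A tuple containing the optimizer and the scheduler to use. Defaults to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup(), controlled by args.
optimizer_cls_and_kwargs (Tuple[Type[torch.optim.Optimizer], Dict[str, Any]], optional) — A tuple containing the optimizer class and keyword arguments to use. Overrides optim and optim_args in args. Incompatible with the optimizers argument.
Unlike optimizers, this argument avoids the need to place model parameters on the correct devices before initializing the Trainer.
preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional) — A function that preprocesses the logits right before caching them at each evaluation step. Must take two tensors, the logits and the labels, and return the logits once processed as desired. The modifications made by this function will be reflected in the predictions received by compute_metrics.
Note that the labels (second parameter) will be None if the dataset does not have them.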
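
As a rough illustration of the compute_metrics contract described above, the sketch below takes an object exposing predictions and label_ids and returns a dictionary mapping metric names to values. It uses plain Python lists so that it runs without dependencies. FakeEvalPrediction is a hypothetical stand-in for EvalPrediction; with the real class the fields are NumPy arrays, so the argmax would typically be done with NumPy.

```python
from typing import Dict, List, NamedTuple


class FakeEvalPrediction(NamedTuple):
    """Hypothetical stand-in for transformers.EvalPrediction (illustration only)."""

    predictions: List[List[float]]  # one row of logits per example
    label_ids: List[int]  # one gold class index per example


def compute_metrics(eval_pred) -> Dict[str, float]:
    """Return a dict mapping metric names to values."""
    # Pick the highest-scoring class for each example (argmax over the logits).
    predicted = [
        max(range(len(row)), key=row.__getitem__) for row in eval_pred.predictions
    ]
    correct = sum(int(p == y) for p, y in zip(predicted, eval_pred.label_ids))
    return {"accuracy": correct / len(eval_pred.label_ids)}


example = FakeEvalPrediction(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
    label_ids=[1, 0, 0, 0],
)
metrics = compute_metrics(example)
print(metrics)
```

A function of this shape would be passed as the compute_metrics argument of Trainer. During evaluation the returned keys are prefixed with metric_key_prefix, so "accuracy" would be reported as "eval_accuracy" with the default prefix.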

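Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.

Important attributes:

model — Always points to the core model. If using a transformers model, it will be a PreTrainedModel subclass.
model_wrapped — Always points to the most external model in case one or more other modules wrap the original model. This is the model that should be used for the forward pass. For example, under DeepSpeed, the inner model is wrapped in DeepSpeed and then again in torch.nn.DistributedDataParallel. If the inner model has not been wrapped, self.model_wrapped is the same as self.model.
is_model_parallel — Whether or not the model has been switched to a model parallel mode (different from data parallelism, this means some of the model layers are split across different GPUs).
place_model_on_device — Whether or not to automatically place the model on the device. It is set to False if model parallelism or deepspeed is used, or if the default TrainingArguments.place_model_on_device is overridden to return False.
is_in_train — Whether or not the model is currently running train (e.g. when evaluate is called while in train).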

add_callback

( callback )

Parameters

callback (type or TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, a member of that class is instantiated.

Adds a callback to the current list of TrainerCallback.

autocast_smart_context_manager

( cache_enabled: typing.Optional[bool] = True )

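A helper wrapper that creates the appropriate context manager for autocast, depending on the situation, while feeding it the desired arguments.

compute_loss

( model inputs return_outputs = False num_items_in_batch = None )

How the loss is computed by Trainer. By default, all models return the loss in the first element.

Subclass and override for custom behavior.

compute_loss_context_manager

( )

A helper wrapper to group together context managers.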

create_model_card

( language: typing.Optional[str] = None license: typing.Optional[str] = None tags: typing.Union[str, typing.List[str], NoneType] = None model_name: typing.Optional[str] = None finetuned_from: typing.Optional[str] = None tasks: typing.Union[str, typing.List[str], NoneType] = None dataset_tags: typing.Union[str, typing.List[str], NoneType] = None dataset: typing.Union[str, typing.List[str], NoneType] = None dataset_args: typing.Union[str, typing.List[str], NoneType] = None )

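Parameters

language (str, optional) — The language of the model (if applicable).
license (str, optional) — The license of the model. Defaults to the license of the pretrained model used, if the original model given to the Trainer comes from a repo on the Hub.
tags (str or List[str], optional) — Some tags to be included in the metadata of the model card.
model_name (str, optional) — The name of the model.
finetuned_from (str, optional) — The name of the model used to fine-tune this one (if applicable). Defaults to the name of the repo of the original model given to the Trainer (if it comes from the Hub).
tasks (str or List[str], optional) — One or several task identifiers, to be included in the metadata of the model card.
dataset_tags (str or List[str], optional) — One or several dataset tags, to be included in the metadata of the model card.
dataset (str or List[str], optional) — One or several dataset identifiers, to be included in the metadata of the model card.
dataset_args (str or List[str], optional) — One or several dataset arguments, to be included in the metadata of the model card.

Creates a draft of a model card using the information available to the Trainer.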

create_optimizer

( )

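Sets up the optimizer.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer's init through optimizers, or subclass and override this method.

create_optimizer_and_scheduler

( num_training_steps: int )

Sets up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer's init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler).

create_scheduler

( num_training_steps: int optimizer: Optimizer = None )

Parameters

num_training_steps (int) — The number of training steps to do.

Sets up the scheduler. The optimizer of the trainer must have been set up either before this method is called or passed as an argument.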

evaluate

( eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], NoneType] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' )

Parameters

eval_dataset (Union[Dataset, Dict[str, Dataset]], optional) — Pass a dataset if you wish to override self.eval_dataset. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, it will evaluate on each dataset, prepending the dictionary key to the metric name. Datasets must implement the __len__ method.

If you pass a dictionary with names of datasets as keys and datasets as values, evaluate will run a separate evaluation on each dataset. This can be useful to monitor how training affects other datasets or simply to get a more fine-grained evaluation. When used with load_best_model_at_end, make sure metric_for_best_model references exactly one of the datasets. For example, if you pass {"data1": data1, "data2": data2} for two datasets data1 and data2, you can specify metric_for_best_model="eval_data1_loss" to use the loss on data1, or metric_for_best_model="eval_data2_loss" to use the loss on data2.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example, the metric "bleu" will be named "eval_bleu" if the prefix is "eval" (default).
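
The naming scheme described above for a dictionary of evaluation datasets can be sketched in a few lines. This is only an illustration of how the reported keys are composed from the prefix, the dictionary key and the metric name. The helper prefixed_metric_names and the metric values are hypothetical and not part of the library.

```python
from typing import Dict


def prefixed_metric_names(
    per_dataset_metrics: Dict[str, Dict[str, float]],
    metric_key_prefix: str = "eval",
) -> Dict[str, float]:
    """Flatten {dataset_name: {metric: value}} into prefixed metric keys."""
    flat: Dict[str, float] = {}
    for dataset_name, metrics in per_dataset_metrics.items():
        for name, value in metrics.items():
            # The dataset key sits between the prefix and the metric name.
            flat[f"{metric_key_prefix}_{dataset_name}_{name}"] = value
    return flat


# Hypothetical metric values for two evaluation datasets.
results = prefixed_metric_names(
    {"data1": {"loss": 0.42, "bleu": 27.1}, "data2": {"loss": 0.57, "bleu": 24.8}}
)
print(sorted(results))
```

Any of the resulting keys, such as "eval_data1_loss", is a valid value for metric_for_best_model in this setup.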

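Runs evaluation and returns metrics.

The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

You can also subclass and override this method to inject custom behavior.

evaluation_loop

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.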

floating_point_ops

( inputs: typing.Dict[str, typing.Union[torch.Tensor, typing.Any]] ) → int

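Parameters

inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model.

Returns

int

The number of floating-point operations.

For models that inherit from PreTrainedModel, uses the model's own method to compute the number of floating point operations for every backward + forward pass. If using another model, either implement such a method in the model or subclass and override this method.

get_decay_parameter_names

( model )

Gets the names of all parameters that weight decay will be applied to.

Note that some models implement their own layernorm instead of calling nn.LayerNorm. Weight decay could still apply to those modules, since this function only filters out instances of nn.LayerNorm.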

get_eval_dataloader

( eval_dataset: typing.Union[str, torch.utils.data.dataset.Dataset, NoneType] = None )

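Parameters

eval_dataset (str or torch.utils.data.Dataset, optional) — If a str, self.eval_dataset[eval_dataset] is used as the evaluation dataset. If a Dataset, it overrides self.eval_dataset and must implement __len__. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.

Returns the evaluation torch.utils.data.DataLoader.

Subclass and override this method if you want to inject some custom behavior.

get_learning_rates

( )

Returns the learning rate of each parameter from self.optimizer.

get_num_trainable_parameters

( )

Gets the number of trainable parameters.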

get_optimizer_cls_and_kwargs

( args: TrainingArguments model: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None )

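Parameters

args (transformers.training_args.TrainingArguments) — The training arguments for the training session.

Returns the optimizer class and optimizer parameters based on the training arguments.

get_optimizer_group

( param: typing.Union[str, torch.nn.parameter.Parameter, NoneType] = None )

Parameters

param (str or torch.nn.parameter.Parameter, optional) — The parameter for which the optimizer group needs to be returned.

Returns the optimizer group for a parameter if one is given, otherwise returns the optimizer groups for all parameters.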

get_test_dataloader

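( test_dataset: Dataset )

Parameters

test_dataset (torch.utils.data.Dataset, optional) — The test dataset to use. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

Returns the test torch.utils.data.DataLoader.

Subclass and override this method if you want to inject some custom behavior.

get_train_dataloader

( )

Returns the training torch.utils.data.DataLoader.

Uses no sampler if train_dataset does not implement __len__, and a random sampler (adapted to distributed training if necessary) otherwise.

Subclass and override this method if you want to inject some custom behavior.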

hyperparameter_search

( hp_space: typing.Optional[typing.Callable[[ForwardRef('optuna.Trial')], typing.Dict[str, float]]] = None compute_objective: typing.Optional[typing.Callable[[typing.Dict[str, float]], float]] = None n_trials: int = 20 direction: typing.Union[str, typing.List[str]] = 'minimize' backend: typing.Union[ForwardRef('str'), transformers.trainer_utils.HPSearchBackend, NoneType] = None hp_name: typing.Optional[typing.Callable[[ForwardRef('optuna.Trial')], str]] = None **kwargs ) → [trainer_utils.BestRun or List[trainer_utils.BestRun]]

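Parameters

hp_space (Callable[["optuna.Trial"], Dict[str, float]], optional) — A function that defines the hyperparameter search space. Defaults to default_hp_space_optuna(), default_hp_space_ray() or default_hp_space_sigopt(), depending on your backend.
compute_objective (Callable[[Dict[str, float]], float], optional) — A function computing the objective to minimize or maximize from the metrics returned by the evaluate method. Defaults to default_compute_objective().
n_trials (int, optional, defaults to 20) — The number of trial runs to test.
direction (str or List[str], optional, defaults to "minimize") — For single-objective optimization, direction is a str and can be "minimize" or "maximize". For multi-objective optimization, direction is a List[str] whose entries are "minimize" or "maximize". In both cases, pick "minimize" when optimizing the validation loss and "maximize" when optimizing one or several metrics.
backend (str or trainer_utils.HPSearchBackend, optional) — The backend to use for hyperparameter search. Defaults to optuna, Ray Tune or SigOpt, depending on which one is installed. If all are installed, defaults to optuna.
hp_name (Callable[["optuna.Trial"], str], optional) — A function that defines the trial/run name. Defaults to None.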
kwargs (Dict[str, Any], optional) — Additional keyword arguments for each backend:
- optuna: parameters from optuna.study.create_study and also the parameters timeout, n_jobs and gc_after_trial from optuna.study.Study.optimize
- ray: parameters from tune.run. If resources_per_trial is not set in the kwargs, it defaults to 1 CPU core and 1 GPU (if available). If progress_reporter is not set in the kwargs, ray.tune.CLIReporter is used.
- sigopt: the parameter proxies from sigopt.Connection.set_proxies.

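Returns

[trainer_utils.BestRun or List[trainer_utils.BestRun]]

All the information about the best run, or about the best runs for multi-objective optimization. For the Ray backend, the experiment summary can be found in the run_summary attribute.

Launches a hyperparameter search using optuna, Ray Tune or SigOpt. The optimized quantity is determined by compute_objective, which defaults to a function returning the evaluation loss when no metric is provided, and the sum of all metrics otherwise.

To use this method, you need to have provided a model_init when initializing your Trainer, because the model has to be reinitialized at each new run. This is incompatible with the optimizers argument, so you need to subclass Trainer and override the method create_optimizer_and_scheduler() for a custom optimizer/scheduler.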
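
To make the default objective described above concrete, here is a minimal sketch of a function with that behavior, next to a custom objective. sketch_default_objective and f1_objective are hypothetical names, and the sketch assumes the metrics dictionary always contains "eval_loss". The library's own default_compute_objective may also drop bookkeeping entries (timing values, for example) before summing, which is left out here.

```python
from typing import Dict


def sketch_default_objective(metrics: Dict[str, float]) -> float:
    """Evaluation loss if it is the only entry, otherwise the sum of the others."""
    remaining = dict(metrics)  # copy so the caller's dict is not modified
    loss = remaining.pop("eval_loss")
    return loss if not remaining else sum(remaining.values())


def f1_objective(metrics: Dict[str, float]) -> float:
    """Custom objective that optimizes a single metric."""
    return metrics["eval_f1"]


# Hypothetical metric values, chosen only for illustration.
all_metrics = {"eval_loss": 0.35, "eval_accuracy": 0.5, "eval_f1": 0.25}

loss_only = sketch_default_objective({"eval_loss": 0.35})
with_metrics = sketch_default_objective(all_metrics)
custom = f1_objective(all_metrics)
print(loss_only, with_metrics, custom)
```

A function like f1_objective would be passed as compute_objective, together with direction="maximize", since a higher F1 score is better.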

init_hf_repo

( token: typing.Optional[str] = None )

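Initializes a git repo in self.args.hub_model_id.

is_local_process_zero

( )

Whether or not this process is the local main process (e.g., the main process on one machine when training in a distributed fashion on several machines).

is_world_process_zero

( )

Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is True for only one process).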
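
The difference between the two checks can be illustrated with a small, library-free sketch. It assumes that global ranks are assigned contiguously per machine, which is a common launcher convention but an assumption here. main_process_flags is a hypothetical helper, not a Trainer method.

```python
from typing import List, Tuple


def main_process_flags(
    num_nodes: int, procs_per_node: int
) -> List[Tuple[int, bool, bool]]:
    """Return (global_rank, is_local_zero, is_world_zero) for every process."""
    flags = []
    for node in range(num_nodes):
        for local_rank in range(procs_per_node):
            # Assumption: ranks are numbered node by node.
            global_rank = node * procs_per_node + local_rank
            flags.append((global_rank, local_rank == 0, global_rank == 0))
    return flags


flags = main_process_flags(num_nodes=2, procs_per_node=4)
local_mains = [rank for rank, is_local, _ in flags if is_local]
world_mains = [rank for rank, _, is_world in flags if is_world]
print(local_mains, world_mains)
```

With two machines of four processes each, one process per machine is a local main process, but only a single process is the global one. Work that must happen once per run, such as writing a final report, is therefore usually guarded by is_world_process_zero().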

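log

( logs: typing.Dict[str, float] start_time: typing.Optional[float] = None )

Parameters

logs (Dict[str, float]) — The values to log.
start_time (Optional[float]) — The start time of training.

Logs logs on the various objects watching training.

Subclass and override this method to inject custom behavior.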

log_metrics

( split metrics )

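Parameters

split (str) — Mode/split name: one of train, eval, test
metrics (Dict[str, float]) — The metrics returned from train/evaluate/predict, as a dictionary

Logs metrics in a specially formatted way.

In a distributed environment this is done only for the process with rank 0.

Notes on memory reports:

To get a memory usage report you need to install psutil. You can do that with pip install psutil.

When this method is run, you will then see a report that includes: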

init_mem_cpu_alloc_delta   =     1301MB
init_mem_cpu_peaked_delta  =      154MB
init_mem_gpu_alloc_delta   =      230MB
init_mem_gpu_peaked_delta  =        0MB
train_mem_cpu_alloc_delta  =     1345MB
train_mem_cpu_peaked_delta =        0MB
train_mem_gpu_alloc_delta  =      693MB
train_mem_gpu_peaked_delta =        7MB

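Understanding the reports:

The first segment, e.g. train_, tells you which stage the metrics are for. Reports starting with init_ are added to the first stage that gets run. So if only evaluation is run, the memory usage of __init__ is reported along with the eval_ metrics.
The third segment, either cpu or gpu, tells you whether it is the general RAM or the gpu0 memory metric.
*_alloc_delta - is the difference in the used/allocated memory counter between the end and the start of the stage. It can be negative if a function released more memory than it allocated.
*_peaked_delta - is any extra memory that was consumed and then freed, relative to the current allocated memory counter. It is never negative. When you look at the metrics of any stage, you add up alloc_delta + peaked_delta to know how much memory was needed to complete that stage.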
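
The rule of adding alloc_delta and peaked_delta can be applied to the sample report shown earlier. The sketch below parses those eight lines and sums the two values per stage and device. stage_memory_needed is a hypothetical helper written for this illustration, and the regular expression assumes the exact key layout of the sample report.

```python
import re
from typing import Dict

REPORT = """\
init_mem_cpu_alloc_delta   =     1301MB
init_mem_cpu_peaked_delta  =      154MB
init_mem_gpu_alloc_delta   =      230MB
init_mem_gpu_peaked_delta  =        0MB
train_mem_cpu_alloc_delta  =     1345MB
train_mem_cpu_peaked_delta =        0MB
train_mem_gpu_alloc_delta  =      693MB
train_mem_gpu_peaked_delta =        7MB
"""

LINE = re.compile(r"^(\w+?)_mem_(cpu|gpu)_(alloc|peaked)_delta\s*=\s*(-?\d+)MB$")


def stage_memory_needed(report: str) -> Dict[str, int]:
    """Sum alloc_delta and peaked_delta (in MB) for each stage/device pair."""
    totals: Dict[str, int] = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a memory metric line
        stage, device, _kind, megabytes = match.groups()
        key = f"{stage}_{device}"
        totals[key] = totals.get(key, 0) + int(megabytes)
    return totals


needed = stage_memory_needed(REPORT)
print(needed)
```

For the sample report, the train stage needed 693MB + 7MB of GPU memory, and the init stage needed 1301MB + 154MB of CPU memory.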

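The reporting happens only for the process with rank 0 and for gpu 0 (if there is a gpu). Typically this is enough, since the main process does the bulk of the work, but it may not be if model parallelism is used, in which case other GPUs may use a different amount of gpu memory. It is also not the same under DataParallel, where gpu0 may require much more memory than the rest, since it stores the gradients and optimizer states for all participating GPUs. Perhaps in the future these reports will evolve to measure those too.

The CPU RAM metric measures RSS (Resident Set Size), which includes both the memory that is unique to the process and the memory shared with other processes. Note that it does not include swapped-out memory, so the reports could be imprecise.

The CPU peak memory is measured using a sampling thread. Because of Python's GIL (global interpreter lock), it may miss some of the peak memory if that thread did not get a chance to run when the highest memory was used. This report can therefore be lower than reality. Using tracemalloc would have reported the exact peak memory, but it does not report memory allocations outside of Python. So if some C++ CUDA extension allocated its own memory, it would not be reported. It was therefore dropped in favor of the memory sampling approach, which reads the current process memory usage.

The GPU allocated and peak memory reporting is done with torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated(). This metric reports only "deltas" for pytorch-specific allocations, as the torch.cuda memory management system does not track any memory allocated outside of pytorch. For example, the very first cuda call typically loads CUDA kernels, which may take from 0.5 to 2GB of GPU memory.

Note that this tracker does not account for memory allocations outside of Trainer's __init__, train, evaluate and predict calls.

Because evaluation calls may happen during train, nested invocations cannot be handled: torch.cuda.max_memory_allocated is a single counter, so if it gets reset by a nested eval call, train's tracker will report incorrect info. If this pytorch issue gets resolved, it will be possible to change this class to be re-entrant. Until then, only the outer level of the train, evaluate and predict methods is tracked. This means that if eval is called during train, train accounts for both its own memory usage and that of the nested eval call.

This also means that if any other tool used along with the Trainer calls torch.cuda.reset_peak_memory_stats, the gpu peak memory stats could be invalid. The Trainer will likewise disrupt the normal behavior of any tool that relies on calling torch.cuda.reset_peak_memory_stats itself.

For best performance you may want to consider turning the memory profiling off for production runs.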

metrics_format

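( metrics: typing.Dict[str, float] ) → metrics (Dict[str, float])

Parameters

metrics (Dict[str, float]) — The metrics returned from train/evaluate/predict

Returns

metrics (Dict[str, float])

The reformatted metrics

Reformats Trainer metrics values to a human-readable format.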

num_examples

( dataloader: DataLoader )

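Helper to get the number of samples in a torch.utils.data.DataLoader by accessing its dataset. When dataloader.dataset does not exist or has no length, it estimates as best it can.

num_tokens

( train_dl: DataLoader max_steps: typing.Optional[int] = None )

Helper to get the number of tokens in a torch.utils.data.DataLoader by enumerating the dataloader.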

pop_callback

( callback ) → TrainerCallback

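Parameters

callback (type or TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, the first member of that class found in the list of callbacks is popped.

Returns

TrainerCallback

The callback removed, if found.

Removes a callback from the current list of TrainerCallback and returns it.

If the callback is not found, returns None (and no error is raised).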

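predict

( test_dataset: Dataset ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'test' )

Parameters

test_dataset (Dataset) — Dataset to run the predictions on. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It has to implement the __len__ method.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "test") — An optional prefix to be used as the metrics key prefix. For example, the metric "bleu" will be named "test_bleu" if the prefix is "test" (default).

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence lengths (for instance because you are doing dynamic padding in a token classification task), the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.

Returns: NamedTuple A namedtuple with the following keys:

predictions (np.ndarray): The predictions on test_dataset.
label_ids (np.ndarray, optional): The labels (if the dataset contained some).
metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).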
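
The right-padding with -100 described above can be pictured with plain Python lists. This is a simplified sketch of the idea, not the library's implementation, which operates on tensors or arrays. pad_and_concatenate is a hypothetical helper name.

```python
from typing import List

PAD_INDEX = -100  # padding index named in the text above


def pad_and_concatenate(
    batches: List[List[List[int]]], pad_index: int = PAD_INDEX
) -> List[List[int]]:
    """Right-pad every row to the longest sequence length, then stack all rows."""
    max_len = max(len(row) for batch in batches for row in batch)
    return [
        row + [pad_index] * (max_len - len(row))
        for batch in batches
        for row in batch
    ]


# Two batches whose rows were dynamically padded to different lengths (3 and 5).
batch_a = [[1, 2, 3], [4, 5, 0]]
batch_b = [[6, 7, 8, 9, 10]]
stacked = pad_and_concatenate([batch_a, batch_b])
for row in stacked:
    print(row)
```

When computing metrics on such output, positions holding -100 are normally masked out so that the padding does not count as a prediction or a label.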

prediction_loop

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' )

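Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.

prediction_step

( model: Module inputs: typing.Dict[str, typing.Union[torch.Tensor, typing.Any]] prediction_loss_only: bool ignore_keys: typing.Optional[typing.List[str]] = None ) → Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

Parameters

model (nn.Module) — The model to evaluate.
inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model's documentation for all accepted arguments.
prediction_loss_only (bool) — Whether or not to return the loss only.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.

Returns

Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

A tuple with the loss, logits and labels (each being optional).

Performs an evaluation step on model using inputs.

Subclass and override to inject custom behavior.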

propagate_args_to_deepspeed

( auto_find_batch_size = False )

Sets values in the deepspeed plugin based on the Trainer args.

push_to_hub

( commit_message: typing.Optional[str] = 'End of training' blocking: bool = True token: typing.Optional[str] = None revision: typing.Optional[str] = None **kwargs )

Parameters

commit_message (str, optional, defaults to "End of training") — Message to commit while pushing.
blocking (bool, optional, defaults to True) — Whether the function should return only when the git push has finished.
token (str, optional, defaults to None) — Token with write permission, used to override the Trainer's original args.
revision (str, optional) — The git revision to commit from. Defaults to the head of the "main" branch.
kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to create_model_card().

Uploads self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.

remove_callback

( callback )

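Parameters

callback (type or TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, the first member of that class found in the list of callbacks is removed.

Removes a callback from the current list of TrainerCallback.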

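save_metrics

( split metrics combined = True )

Parameters

split (str) — Mode/split name: one of train, eval, test, all
metrics (Dict[str, float]) — The metrics returned from train/evaluate/predict
combined (bool, optional, defaults to True) — Creates combined metrics by updating all_results.json with the metrics of this call

Saves metrics into a json file for that split, e.g. train_results.json.

In a distributed environment this is done only for the process with rank 0.

To understand the metrics, please read the docstring of log_metrics(). The only difference is that raw unformatted numbers are saved in the current method.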
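
The file layout described above, one json file per split plus a combined all_results.json, can be sketched with the standard library alone. save_metrics_sketch is a hypothetical stand-in that mirrors the described behavior in a temporary directory. The real method writes to the Trainer's output directory, and its exact formatting is not claimed here.

```python
import json
import os
import tempfile
from typing import Dict


def save_metrics_sketch(
    output_dir: str, split: str, metrics: Dict[str, float], combined: bool = True
) -> None:
    """Write <split>_results.json and, if combined, merge into all_results.json."""
    with open(os.path.join(output_dir, f"{split}_results.json"), "w") as f:
        json.dump(metrics, f, indent=4, sort_keys=True)
    if not combined:
        return
    all_path = os.path.join(output_dir, "all_results.json")
    all_metrics: Dict[str, float] = {}
    if os.path.exists(all_path):
        with open(all_path) as f:
            all_metrics = json.load(f)
    all_metrics.update(metrics)  # later calls add to or overwrite earlier keys
    with open(all_path, "w") as f:
        json.dump(all_metrics, f, indent=4, sort_keys=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    save_metrics_sketch(tmp_dir, "train", {"train_loss": 0.5})
    save_metrics_sketch(tmp_dir, "eval", {"eval_loss": 0.75})
    written = sorted(os.listdir(tmp_dir))
    with open(os.path.join(tmp_dir, "all_results.json")) as f:
        combined_metrics = json.load(f)

print(written)
print(combined_metrics)
```

After saving the train and eval splits, the combined file holds the metrics of both calls, which is what the combined flag is for.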

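save_model

( output_dir: typing.Optional[str] = None _internal_call: bool = False )

Saves the model, so you can reload it using from_pretrained().

Saves only from the main process.

save_state

( )

Saves the Trainer state, since Trainer.save_model saves only the tokenizer with the model.

In a distributed environment this is done only for the process with rank 0.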

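train

( resume_from_checkpoint: typing.Union[str, bool, NoneType] = None trial: typing.Union[ForwardRef('optuna.Trial'), typing.Dict[str, typing.Any]] = None ignore_keys_for_eval: typing.Optional[typing.List[str]] = None **kwargs )

Parameters

resume_from_checkpoint (str or bool, optional) — If a str, the local path to a checkpoint saved by a previous instance of Trainer. If a bool equal to True, loads the last checkpoint in args.output_dir saved by a previous instance of Trainer. If present, training resumes from the model/optimizer/scheduler states loaded here.
trial (optuna.Trial or Dict[str, Any], optional) — The trial run or the hyperparameter dictionary for hyperparameter search.
ignore_keys_for_eval (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during training.
kwargs (Dict[str, Any], optional) — Additional keyword arguments used to hide deprecated arguments

Main training entry point.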

training_step

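( model: Module inputs: typing.Dict[str, typing.Union[torch.Tensor, typing.Any]] num_items_in_batch = None ) → torch.Tensor

Parameters

model (nn.Module) — The model to train.
inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model's documentation for all accepted arguments.

Returns

torch.Tensor

The tensor with the training loss on this batch.

Performs a training step on a batch of inputs.

Subclass and override to inject custom behavior.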

Seq2SeqTrainer

class transformers.Seq2SeqTrainer

( model: typing.Union[ForwardRef('PreTrainedModel'), torch.nn.modules.module.Module] = None args: TrainingArguments = None data_collator: typing.Optional[ForwardRef('DataCollator')] = None train_dataset: typing.Union[torch.utils.data.dataset.Dataset, ForwardRef('IterableDataset'), ForwardRef('datasets.Dataset'), NoneType] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], NoneType] = None processing_class: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('BaseImageProcessor'), ForwardRef('FeatureExtractionMixin'), ForwardRef('ProcessorMixin'), NoneType] = None model_init: typing.Optional[typing.Callable[[], ForwardRef('PreTrainedModel')]] = None compute_metrics: typing.Optional[typing.Callable[[ForwardRef('EvalPrediction')], typing.Dict]] = None callbacks: typing.Optional[typing.List[ForwardRef('TrainerCallback')]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None )

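evaluate

( eval_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' **gen_kwargs )

Parameters

eval_dataset (Dataset, optional) — Pass a dataset if you wish to override self.eval_dataset. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement the __len__ method.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example, the metric "bleu" will be named "eval_bleu" if the prefix is "eval" (default).
max_length (int, optional) — The maximum target length to use when predicting with the generate method.
num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search.
gen_kwargs — Additional generate-specific kwargs.

Runs evaluation and returns metrics.

The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

You can also subclass and override this method to inject custom behavior.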

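predict

( test_dataset: Dataset ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'test' **gen_kwargs )

Parameters

test_dataset (Dataset) — Dataset to run the predictions on. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It has to implement the __len__ method.
ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
metric_key_prefix (str, optional, defaults to "test") — An optional prefix to be used as the metrics key prefix. For example, the metric "bleu" will be named "test_bleu" if the prefix is "test" (default).
max_length (int, optional) — The maximum target length to use when predicting with the generate method.
num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search.
gen_kwargs — Additional generate-specific kwargs.

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence lengths (for instance because you are doing dynamic padding in a token classification task), the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.

Returns: NamedTuple A namedtuple with the following keys:

predictions (np.ndarray): The predictions on test_dataset.
label_ids (np.ndarray, optional): The labels (if the dataset contained some).
metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).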

TrainingArguments

class transformers.TrainingArguments

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False 
fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[typing.List[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional

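Parameters

output_dir (str) — The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (bool, optional, defaults to False) — If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
do_train (bool, optional, defaults to False) — Whether to run training or not. This argument is not directly used by Trainer. It is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
do_eval (bool, optional) — Whether to run evaluation on the validation set or not. Set to True if eval_strategy is different from "no". This argument is not directly used by Trainer. It is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
do_predict (bool, optional, defaults to False) — Whether to run predictions on the test set or not. This argument is not directly used by Trainer. It is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
eval_strategy (str or IntervalStrategy, optional, defaults to "no") — The evaluation strategy to adopt during training. Possible values are:
- "no": No evaluation is done during training.
- "steps": Evaluation is done (and logged) every eval_steps.
- "epoch": Evaluation is done at the end of each epoch.
prediction_loss_only (bool, optional, defaults to False) — When performing evaluation and generating predictions, only returns the loss.
per_device_train_batch_size (int, optional, defaults to 8) — The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8) — The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.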
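gradient_accumulation_steps (int, optional, defaults to 1) — Number of update steps to accumulate the gradients for, before performing a backward/update pass.

When using gradient accumulation, one step is counted as one step with a backward pass. Therefore, logging, evaluation and saving will be conducted every gradient_accumulation_steps * xxx_step training examples.
eval_accumulation_steps (int, optional) — Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, all predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster but requires more memory).
eval_delay (float, optional) — Number of epochs or steps to wait for before the first evaluation can be performed, depending on the eval_strategy.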
torch_empty_cache_steps (int, optional) — Number of steps to wait before calling torch.<device>.empty_cache(). If left unset or set to None, cache will not be emptied.

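This can help avoid CUDA out-of-memory errors by lowering peak VRAM usage, at a cost of about 10% slower performance.
learning_rate (float, optional, defaults to 5e-5) — The initial learning rate for the AdamW optimizer.
weight_decay (float, optional, defaults to 0) — The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
adam_beta1 (float, optional, defaults to 0.9) — The beta1 hyperparameter for the AdamW optimizer.
adam_beta2 (float, optional, defaults to 0.999) — The beta2 hyperparameter for the AdamW optimizer.
adam_epsilon (float, optional, defaults to 1e-8) — The epsilon hyperparameter for the AdamW optimizer.
max_grad_norm (float, optional, defaults to 1.0) — Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0) — Total number of training epochs to perform (if not an integer, the fractional part of the last epoch is performed before training stops).
max_steps (int, optional, defaults to -1) — If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached.
lr_scheduler_type (str or SchedulerType, optional, defaults to "linear") — The scheduler type to use. See the documentation of SchedulerType for all possible values.
lr_scheduler_kwargs (dict, optional, defaults to {}) — The extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
log_level (str, optional, defaults to "passive") — Logger log level to use on the main process. Possible choices are the log levels as strings: "debug", "info", "warning", "error" and "critical", plus a "passive" level which does not set anything and keeps the current log level for the Transformers library (which is "warning" by default).
log_level_replica (str, optional, defaults to "warning") — Logger log level to use on replicas. Same choices as log_level.
log_on_each_node (bool, optional, defaults to True) — In multi-node distributed training, whether to log using log_level once per node, or only on the main node.
logging_dir (str, optional) — TensorBoard log directory. Defaults to output_dir/runs/CURRENT_DATETIME_HOSTNAME.
logging_strategy (str or IntervalStrategy, optional, defaults to "steps") — The logging strategy to adopt during training. Possible values are:
- "no": No logging is done during training.
- "epoch": Logging is done at the end of each epoch.
- "steps": Logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False) — Whether to log the first global_step or not.
logging_steps (int or float, optional, defaults to 500) — Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, it is interpreted as a ratio of total training steps.
logging_nan_inf_filter (bool, optional, defaults to True) — Whether to filter nan and inf losses for logging. If set to True the loss of every step that is nan or inf is filtered and the average loss of the current logging window is taken instead.

logging_nan_inf_filter only influences the logging of loss values. It does not change how the gradient is computed or applied to the model.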
save_strategy (str or SaveStrategy, optional, defaults to "steps") — The checkpoint save strategy to adopt during training. Possible values are:
- "no": No save is done during training.
- "epoch": Save is done at the end of each epoch.
- "steps": Save is done every save_steps.
- "best": Save is done whenever a new best_metric is achieved.
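If "epoch" or "steps" is chosen, saving is also always performed at the very end of training.
save_steps (int or float, optional, defaults to 500) — Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, it is interpreted as a ratio of total training steps.
save_total_limit (int, optional) — If a value is passed, limits the total number of checkpoints. Deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model is always retained in addition to the most recent ones. For example, for save_total_limit=5 and load_best_model_at_end, the four last checkpoints are always retained alongside the best model. When save_total_limit=1 and load_best_model_at_end, it is possible that two checkpoints are saved: the last one and the best one (if they are different).
save_safetensors (bool, optional, defaults to True) — Use safetensors saving and loading for state dicts instead of the default torch.load and torch.save.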
save_on_each_node (bool, optional, defaults to False) — When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one.
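This should not be activated when the different nodes use the same storage, as the files will be saved with the same names for each node.
save_only_model (bool, optional, defaults to False) — When checkpointing, whether to only save the model, or also the optimizer, scheduler and rng state. Note that when this is True, you will not be able to resume training from a checkpoint. This lets you save storage by not storing the optimizer, scheduler and rng state. With this option set to True, you can only load the model using from_pretrained.
restore_callback_states_from_checkpoint (bool, optional, defaults to False) — Whether to restore the callback states from the checkpoint. If True, overrides the callbacks passed to the Trainer if they exist in the checkpoint.
use_cpu (bool, optional, defaults to False) — Whether or not to use the CPU. If set to False, a cuda or mps device is used if available.
seed (int, optional, defaults to 42) — Random seed that is set at the beginning of training. To ensure reproducibility across runs, use the Trainer.model_init function to instantiate the model if it has some randomly initialized parameters.
data_seed (int, optional) — Random seed to be used with data samplers. If not set, random generators for data sampling use the same seed as seed. This can be used to ensure reproducibility of data sampling, independent of the model seed.
jit_mode_eval (bool, optional, defaults to False) — Whether or not to use PyTorch jit trace for inference.
use_ipex (bool, optional, defaults to False) — Use the Intel extension for PyTorch (IPEX) when it is available. See the IPEX installation instructions.
bf16 (bool, optional, defaults to False) — Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu) or Ascend NPU. This is an experimental API and it may change.
fp16 (bool, optional, defaults to False) — Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
fp16_opt_level (str, optional, defaults to 'O1') — For fp16 training, the Apex AMP optimization level, selected from ['O0', 'O1', 'O2', 'O3']. See the Apex documentation for details.
fp16_backend (str, optional, defaults to "auto") — This argument is deprecated. Use half_precision_backend instead.
half_precision_backend (str, optional, defaults to "auto") — The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp". "auto" uses CPU/CUDA AMP or APEX depending on the PyTorch version detected, while the other choices force the requested backend.
bf16_full_eval (bool, optional, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values. This is an experimental API and it may change.
fp16_full_eval (bool, optional, defaults to False) — Whether to use full float16 evaluation instead of 32-bit. This is faster and saves memory but can harm metric values.
tf32 (bool, optional) — Whether to enable the TF32 mode, available in Ampere and newer GPU architectures. The default value depends on the PyTorch version's default for torch.backends.cuda.matmul.allow_tf32. For more details, please refer to the TF32 documentation. This is an experimental API and it may change.
local_rank (int, optional, defaults to -1) — Rank of the process during distributed training.
ddp_backend (str, optional) — The backend to use for distributed training. Must be one of "nccl", "mpi", "ccl", "gloo", "hccl".
tpu_num_cores (int, optional) — When training on TPU, the number of TPU cores (automatically passed by the launcher script).
dataloader_drop_last (bool, optional, defaults to False) — Whether or not to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size).
eval_steps (int or float, optional) — Number of update steps between two evaluations if eval_strategy="steps". Defaults to the same value as logging_steps if not set. Should be an integer or a float in range [0,1). If smaller than 1, it is interpreted as a ratio of total training steps.
dataloader_num_workers (int, optional, defaults to 0) — Number of subprocesses to use for data loading (PyTorch only). 0 means that the data is loaded in the main process.
past_index (int, optional, defaults to -1) — Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer uses the corresponding output (usually index 2) as the past state and feeds it to the model at the next training step under the keyword argument mems.
run_name (str, optional, defaults to output_dir) — A descriptor for the run. Typically used for wandb, mlflow and comet logging. If not specified, it is the same as output_dir.
disable_tqdm (bool, optional) — Whether or not to disable the tqdm progress bars and the table of metrics produced by NotebookTrainingTracker in Jupyter Notebooks. Defaults to True if the logging level is set to warn or lower (default), False otherwise.
remove_unused_columns (bool, optional, defaults to True) — Whether or not to automatically remove the columns unused by the model forward method.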
label_names (List[str], optional) — The list of keys in your dictionary of inputs that correspond to the labels.
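Will eventually default to the list of argument names accepted by the model that contain the word "label", except if the model used is one of the XxxForQuestionAnswering classes, in which case it will also include the ["start_positions", "end_positions"] keys.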
load_best_model_at_end (bool, optional, defaults to False) — Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. See save_total_limit for more.

When set to True, the parameter save_strategy needs to be the same as eval_strategy, and in the case it is "steps", save_steps must be a round multiple of eval_steps.
metric_for_best_model (str, optional) — Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_". Will default to "loss" if unspecified and load_best_model_at_end=True (to use the evaluation loss).
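If you set this value, greater_is_better will default to True unless the metric name ends in "loss" (see greater_is_better). Don't forget to set it to False if your metric is better when lower.
greater_is_better (bool, optional) — Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to:
- True if metric_for_best_model is set to a value that doesn't end in "loss".
- False if metric_for_best_model is not set, or set to a value that ends in "loss".
ignore_data_skip (bool, optional, defaults to False) — When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training. If set to True, the training will begin faster (as that skipping step can take a long time) but will not yield the same results as the interrupted training would have.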
fsdp (bool, str or list of FSDPOption, optional, defaults to '') — Use PyTorch Distributed Parallel Training (in distributed training only).
A list of options along the following:
- "full_shard": Shard parameters, gradients and optimizer states.
- "shard_grad_op": Shard optimizer states and gradients.
- "hybrid_shard": Apply FULL_SHARD within a node, and replicate parameters across nodes.
- "hybrid_shard_zero2": Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
- "offload": Offload parameters and gradients to CPUs (only compatible with "full_shard" and "shard_grad_op").
- "auto_wrap": Automatically recursively wrap layers with FSDP using default_auto_wrap_policy.
fsdp_config (str or dict, optional) — Config to be used with fsdp (Pytorch Distributed Parallel Training). The value is either a location of fsdp json config file (e.g., fsdp_config.json) or an already loaded json file as dict.
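A list of config and its options:
- min_num_params (int, optional, defaults to 0): FSDP's minimum number of parameters for default auto wrapping (useful only when the fsdp field is passed).
- transformer_layer_cls_to_wrap (List[str], optional): List of transformer layer class names (case-sensitive) to wrap, e.g. BertLayer, GPTJBlock, T5Block … (useful only when the fsdp flag is passed).
- backward_prefetch (str, optional): FSDP's backward prefetch mode. Controls when to prefetch the next set of parameters (useful only when the fsdp field is passed).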
  
  A list of options along the following:
  - "backward_pre" : Prefetches the next set of parameters before the current set of parameter’s gradient computation.
  - "backward_post" : This prefetches the next set of parameters after the current set of parameter’s gradient computation.
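- forward_prefetch (bool, optional, defaults to False): FSDP's forward prefetch mode (useful only when the fsdp field is passed). If "True", then FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass.
- limit_all_gathers (bool, optional, defaults to False): FSDP's limit_all_gathers (useful only when the fsdp field is passed). If "True", FSDP explicitly synchronizes the CPU thread to prevent too many in-flight all-gathers.
- use_orig_params (bool, optional, defaults to True): If "True", allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. Useful in cases such as parameter-efficient fine-tuning. Please refer to this [blog](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019).
- sync_module_states (bool, optional, defaults to True): If "True", each individually wrapped FSDP unit will broadcast module parameters from rank 0 to ensure they are the same across all ranks after initialization.
- cpu_ram_efficient_loading (bool, optional, defaults to False): If "True", only the first process loads the pretrained model checkpoint while all other processes have empty weights. When this setting is "True", sync_module_states must also be "True", otherwise all the processes except the main process would have random weights, leading to unexpected behaviour during training.
- activation_checkpointing (bool, optional, defaults to False): If "True", enables activation checkpointing, a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reduced memory usage.
- xla (bool, optional, defaults to False): Whether to use PyTorch/XLA Fully Sharded Data Parallel Training. This is an experimental feature and its API may evolve in the future.
- xla_fsdp_settings (dict, optional): The value is a dictionary which stores the XLA FSDP wrapping parameters.
  
  For a complete list of options, please see here.
- xla_fsdp_grad_ckpt (bool, optional, defaults to False): Will use gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be used when the xla flag is set to true, and an auto wrapping policy is specified through fsdp_min_num_params or fsdp_transformer_layer_cls_to_wrap.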
deepspeed (str or dict, optional) — Use Deepspeed. This is an experimental feature and its API may evolve in the future. The value is either the location of DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file as a dict.
If enabling any Zero-init, make sure that your model is not initialized until *after* initializing the `TrainingArguments`, else it will not be applied.
accelerator_config (str, dict, or AcceleratorConfig, optional) — Config to be used with the internal Accelerator implementation. The value is either a location of accelerator json config file (e.g., accelerator_config.json), an already loaded json file as dict, or an instance of AcceleratorConfig.
A list of config and its options:
- split_batches (bool, optional, defaults to False): Whether or not the accelerator should split the batches yielded by the dataloaders across the devices. If True the actual batch size used will be the same on any kind of distributed processes, but it must be a round multiple of the num_processes you are using. If False, actual batch size used will be the one set in your script multiplied by the number of processes.
- dispatch_batches (bool, optional): If set to True, the dataloader prepared by the Accelerator is only iterated through on the main process and then the batches are split and broadcast to each process. Will default to True for DataLoader whose underlying dataset is an IterableDataset, False otherwise.
- even_batches (bool, optional, defaults to True): If set to True, in cases where the total batch size across all processes does not exactly divide the dataset, samples at the start of the dataset will be duplicated so the batch can be divided equally among all workers.
- use_seedable_sampler (bool, optional, defaults to True): Whether or not to use a fully seedable random sampler (accelerate.data_loader.SeedableRandomSampler). Ensures training results are fully reproducible using a different sampling technique. While seed-to-seed results may differ, on average the differences are negligible when using multiple different seeds to compare. Should also be run with ~utils.set_seed for the best results.
- use_configured_state (bool, optional, defaults to False): Whether or not to use a pre-configured AcceleratorState or PartialState defined before calling TrainingArguments. If True, an Accelerator or PartialState must be initialized. Note that by doing so, this could lead to issues with hyperparameter tuning.
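label_smoothing_factor (float, optional, defaults to 0.0) — The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying one-hot-encoded labels are changed from 0s and 1s to label_smoothing_factor/num_labels and 1 - label_smoothing_factor + label_smoothing_factor/num_labels respectively.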
debug (str or list of DebugOption, optional, defaults to "") — Enable one or more debug features. This is an experimental feature.
Possible options are:
- "underflow_overflow": detects overflow in model’s input/outputs and reports the last frames that led to the event
- "tpu_metrics_debug": print debug metrics on TPU
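The options should be separated by whitespace.
optim (str or training_args.OptimizerNames, optional, defaults to "adamw_torch") — The optimizer to use, such as "adamw_hf", "adamw_torch", "adamw_torch_fused", "adamw_apex_fused", "adamw_anyprecision", "adafactor". See OptimizerNames in training_args.py for a full list of optimizers.
optim_args (str, optional) — Optional arguments that are supplied to optimizers such as AnyPrecisionAdamW, AdEMAMix, and GaLore.
group_by_length (bool, optional, defaults to False) — Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.
length_column_name (str, optional, defaults to "length") — Column name for precomputed lengths. If the column exists, grouping by length will use these values rather than computing them on train startup. Ignored unless group_by_length is True and the dataset is an instance of Dataset.
report_to (str or List[str], optional, defaults to "all") — The list of integrations to report the results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". Use "all" to report to all integrations installed, "none" for no integrations.
ddp_find_unused_parameters (bool, optional) — When using distributed training, the value of the flag find_unused_parameters passed to DistributedDataParallel. Will default to False if gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional) — When using distributed training, the value of the flag bucket_cap_mb passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional) — When using distributed training, the value of the flag broadcast_buffers passed to DistributedDataParallel. Will default to False if gradient checkpointing is used, True otherwise.
dataloader_pin_memory (bool, optional, defaults to True) — Whether you want to pin memory in data loaders or not.
dataloader_persistent_workers (bool, optional, defaults to False) — If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. Can potentially speed up training, but will increase RAM usage.
dataloader_prefetch_factor (int, optional) — Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers.
skip_memory_metrics (bool, optional, defaults to True) — Whether to skip adding memory profiler reports to metrics. This is skipped by default because it slows down training and evaluation.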
push_to_hub (bool, optional, defaults to False) — Whether or not to push the model to the Hub every time the model is saved. If this is activated, output_dir will begin a git directory synced with the repo (determined by hub_model_id) and the content will be pushed each time a save is triggered (depending on your save_strategy). Calling save_model() will also trigger a push.

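If output_dir exists, it needs to be a local clone of the repository to which the Trainer will be pushed.
resume_from_checkpoint (str, optional) — The path to a folder with a valid checkpoint for your model. This argument is not directly used by Trainer; it is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.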
hub_model_id (str, optional) — The name of the repository to keep in sync with the local output_dir. It can be a simple model ID in which case the model will be pushed in your namespace. Otherwise it should be the whole repository name, for instance "user_name/model", which allows you to push to an organization you are a member of with "organization_name/model". Will default to user_name/output_dir_name with output_dir_name being the name of output_dir.
Will default to the name of output_dir.
hub_strategy (str or HubStrategy, optional, defaults to "every_save") — Defines the scope of what is pushed to the Hub and when. Possible values are:
- "end": push the model, its configuration, the processing class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card when the save_model() method is called.
- "every_save": push the model, its configuration, the processing class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card each time there is a model save. The pushes are asynchronous to not block training, and in case the save are very frequent, a new push is only attempted if the previous one is finished. A last push is made with the final model at the end of training.
- "checkpoint": like "every_save" but the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to resume training easily with trainer.train(resume_from_checkpoint="last-checkpoint").
- "all_checkpoints": like "checkpoint" but all checkpoints are pushed like they appear in the output folder (so you will get one checkpoint folder per folder in your final repository)
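hub_token (str, optional) — The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional) — Whether to make the repo private. If None (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.
hub_always_push (bool, optional, defaults to False) — Unless this is True, the Trainer will skip pushing a checkpoint when the previous push is not finished.
gradient_checkpointing (bool, optional, defaults to False) — If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
gradient_checkpointing_kwargs (dict, optional, defaults to None) — Keyword arguments to be passed to the gradient_checkpointing_enable method.
include_inputs_for_metrics (bool, optional, defaults to False) — This argument is deprecated. Use include_for_metrics instead, e.g. include_for_metrics = ["inputs"].
include_for_metrics (List[str], optional, defaults to []) — Include additional data in the compute_metrics function if needed for metrics computation. Possible options to add to the include_for_metrics list:
- "inputs": Input data passed to the model, intended for calculating input-dependent metrics.
- "loss": Loss values computed during evaluation, intended for calculating loss-dependent metrics.
eval_do_concat_batches (bool, optional, defaults to True) — Whether to recursively concatenate inputs/losses/labels/predictions across batches. If False, they are instead stored as lists, with each batch kept separate.
auto_find_batch_size (bool, optional, defaults to False) — Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
full_determinism (bool, optional, defaults to False) — If True, enable_full_determinism() is called instead of set_seed() to ensure reproducible results in distributed training. Important: this will negatively impact performance, so only use it for debugging.
torchdynamo (str, optional) — If set, the backend compiler for TorchDynamo. Possible choices are "eager", "aot_eager", "inductor", "nvfuser", "aot_nvfuser", "aot_cudagraphs", "ofi", "fx2trt", "onnxrt" and "ipex".
ray_scope (str, optional, defaults to "last") — The scope to use when doing hyperparameter search with Ray. By default, "last" will be used. Ray will then use the last checkpoint of all trials, compare those, and select the best one. However, other options are also available. See the Ray documentation for more options.
ddp_timeout (int, optional, defaults to 1800) — The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when performing slow operations in distributed runs. Please refer to the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more information.
use_mps_device (bool, optional, defaults to False) — This argument is deprecated. The mps device will be used if it is available, similar to the cuda device.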
torch_compile (bool, optional, defaults to False) — Whether or not to compile the model using PyTorch 2.0 torch.compile.
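This will use the best defaults for the torch.compile API. You can customize the defaults with the arguments torch_compile_backend and torch_compile_mode, but we don't guarantee any of them will work as the support is progressively rolled out in PyTorch.

This flag and the whole compile API are experimental and subject to change in future releases.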
torch_compile_backend (str, optional) — The backend to use in torch.compile. If set to any value, torch_compile will be set to True.
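Refer to the PyTorch documentation for possible values and note that they may change across PyTorch versions.

This flag is experimental and subject to change in future releases.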
torch_compile_mode (str, optional) — The mode to use in torch.compile. If set to any value, torch_compile will be set to True.
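Refer to the PyTorch documentation for possible values and note that they may change across PyTorch versions.

This flag is experimental and subject to change in future releases.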
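split_batches (bool, optional) — Whether or not the accelerator should split the batches yielded by the dataloaders across the devices during distributed training. If set to True, the actual batch size used will be the same on any kind of distributed processes, but it must be a round multiple of the number of processes you are using (such as GPUs).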
include_tokens_per_second (bool, optional) — Whether or not to compute the number of tokens per second per device for training speed metrics.
This will iterate over the entire training dataloader once beforehand, and will slow down the entire process.
include_num_input_tokens_seen (bool, optional) — Whether or not to track the number of input tokens seen throughout training.
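May be slower in distributed training as gather operations must be called.
neftune_noise_alpha (Optional[float]) — If not None, this will activate NEFTune noise embeddings. This can drastically improve model performance for instruction fine-tuning. Check out the original paper and the original code. Supports transformers PreTrainedModel and also PeftModel from peft. The original paper used values in the range [5.0, 15.0].
optim_target_modules (Union[str, List[str]], optional) — The target modules to optimize, i.e. the module names that you would like to train. Right now this is used only for the GaLore algorithm (https://arxiv.org/abs/2403.03507); see https://github.com/jiaweizzhao/GaLore for more details. You need to make sure to pass a valid GaLore optimizer, e.g. one of "galore_adamw", "galore_adamw_8bit", "galore_adafactor", and make sure that the target modules are nn.Linear modules only.
batch_eval_metrics (Optional[bool], defaults to False) — If set to True, evaluation will call compute_metrics at the end of each batch to accumulate statistics rather than saving all eval logits in memory. When set to True, you must pass a compute_metrics function that takes a boolean argument compute_result, which when passed True, will trigger the final global summary statistics from the batch-level summary statistics you've accumulated over the evaluation set.
eval_on_start (bool, optional, defaults to False) — Whether to perform an evaluation step (sanity check) before the training to ensure the validation steps work correctly.
eval_use_gather_object (bool, optional, defaults to False) — Whether to recursively gather objects in a nested list/tuple/dictionary of objects from all devices. This should only be enabled if users are not just returning tensors, and it is actively discouraged by PyTorch.
use_liger_kernel (bool, optional, defaults to False) — Whether to enable the Liger kernel for LLM model training. It can effectively increase multi-GPU training throughput by about 20% and reduce memory usage by about 60%, and works out of the box with flash attention, PyTorch FSDP, and Microsoft DeepSpeed. Currently, it supports llama, mistral, mixtral and gemma models.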
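Several of the interval arguments above (save_steps, eval_steps, logging_steps) accept either an absolute step count or a float in [0,1) that is read as a ratio of the total training steps. The following is a minimal sketch of that interpretation; the helper name resolve_interval is hypothetical, and rounding the ratio up is an assumption rather than something stated in the text above.

```python
import math


def resolve_interval(steps, max_steps):
    """Turn a save_steps-style value into an absolute number of update steps.

    Assumption: a fractional value is scaled by the total number of training
    steps and rounded up, so a very small ratio still maps to at least 1 step.
    """
    if steps < 1:
        return math.ceil(max_steps * steps)
    return int(steps)


# An absolute value is kept as-is; a ratio is scaled by the run length.
print(resolve_interval(500, 10_000))
print(resolve_interval(0.25, 10_000))
```

With this reading, save_steps=0.25 on a 10,000-step run would checkpoint every 2,500 update steps.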

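TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.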
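To illustrate the idea of turning a dataclass into command-line arguments, here is a standard-library sketch. It is not HfArgumentParser itself: MiniArguments, build_parser and parse_into are hypothetical names, and the three fields are only a stand-in for the much larger TrainingArguments.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class MiniArguments:
    # Hypothetical stand-in for TrainingArguments with only three fields.
    output_dir: str = "out"
    learning_rate: float = 5e-5
    seed: int = 42


def build_parser(cls):
    """Create one --flag per dataclass field, typed and defaulted from it."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


def parse_into(cls, argv):
    """Parse argv and rebuild the dataclass from the parsed namespace."""
    namespace = build_parser(cls).parse_args(argv)
    return cls(**vars(namespace))


args = parse_into(MiniArguments, ["--learning_rate", "1e-4", "--seed", "7"])
print(args)
```

Fields that are not passed on the command line keep their dataclass defaults, which is the behavior one would expect from a parser generated this way.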

get_process_log_level

( )

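Returns the log level to be used depending on whether this process is the main process of node 0, the main process of a non-0 node, or a non-main process.

For the main process, the log level defaults to the logging level set (logging.WARNING if you didn't do anything) unless overridden by the log_level argument.

For the replica processes, the log level defaults to logging.WARNING unless overridden by the log_level_replica argument.

The choice between the main and replica process settings is made according to the return value of should_log.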
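The selection described above can be sketched in a few lines of plain Python. The function process_log_level and the LEVELS table are hypothetical; in particular, treating "passive" as "keep whatever level is currently set" for both kinds of process is an assumption of this sketch.

```python
import logging

# "passive" is modelled as -1, meaning "do not override the current level".
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
    "passive": -1,
}


def process_log_level(should_log, log_level="passive",
                      log_level_replica="warning",
                      current_level=logging.WARNING):
    """Pick the level for this process from the main or replica setting."""
    main = LEVELS[log_level]
    replica = LEVELS[log_level_replica]
    if main == -1:
        main = current_level
    if replica == -1:
        replica = current_level
    return main if should_log else replica


# A main process asked for "info"; replicas stay at the default "warning".
print(logging.getLevelName(process_log_level(True, log_level="info")))
print(logging.getLevelName(process_log_level(False, log_level="info")))
```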

get_warmup_steps

( num_training_steps: int )

Get number of steps used for a linear warmup.
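As a rough sketch of how the warmup length might be derived from warmup_steps and warmup_ratio: an explicit warmup_steps takes precedence (as the warmup_steps description elsewhere on this page says), otherwise the ratio is applied to the total number of training steps. The helper name warmup_steps_for is hypothetical, and rounding up is an assumption.

```python
import math


def warmup_steps_for(num_training_steps, warmup_steps=0, warmup_ratio=0.0):
    """Number of linear-warmup steps for a run of num_training_steps."""
    if warmup_steps > 0:
        # An explicit step count overrides any effect of warmup_ratio.
        return warmup_steps
    # Assumption: the ratio is rounded up to a whole number of steps.
    return math.ceil(num_training_steps * warmup_ratio)


print(warmup_steps_for(1000, warmup_ratio=0.25))
print(warmup_steps_for(1000, warmup_steps=100, warmup_ratio=0.25))
```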

main_process_first

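( local = True desc = 'work' )

Parameters

local (bool, optional, defaults to True) — If True, "first" means the process of rank 0 of each node; if False, "first" means the process of rank 0 of node rank 0. In a multi-node environment with a shared filesystem you most likely will want to use local=False so that only the main process of the first node will do the processing. If, however, the filesystem is not shared, then the main process of each node will need to do the processing, which is the default behavior.
desc (str, optional, defaults to "work") — A work description to be used in debug logs.

A context manager for a torch distributed environment where one needs to do something on the main process, while blocking replicas, and releasing the replicas when it is finished.

One such use is for the map feature of datasets which, to be efficient, should be run once on the main process; upon completion it saves a cached version of the results, which is then automatically loaded by the replicas.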
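The "main process first" pattern can be sketched without torch by using threads and a threading.Barrier in place of a distributed barrier. Everything here (main_first, run_demo, the two worker names) is hypothetical and only illustrates the ordering: replicas wait before the body, the main worker waits after it.

```python
import threading
from contextlib import contextmanager


@contextmanager
def main_first(is_main, barrier):
    """Let the main worker run the body first; the others wait for it."""
    if not is_main:
        barrier.wait()  # replicas block here until main has finished
    try:
        yield
    finally:
        if is_main:
            barrier.wait()  # main reaches the barrier and releases replicas


def run_demo():
    barrier = threading.Barrier(2)
    order = []

    def worker(name, is_main):
        with main_first(is_main, barrier):
            order.append(name)  # stands in for e.g. a cached preprocessing step

    threads = [
        threading.Thread(target=worker, args=("replica", False)),
        threading.Thread(target=worker, args=("main", True)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


print(run_demo())
```

Even though the replica thread is started first, its body only runs after the main worker has completed its own.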

set_dataloader

( train_batch_size: int = 8 eval_batch_size: int = 8 drop_last: bool = False num_workers: int = 0 pin_memory: bool = True persistent_workers: bool = False prefetch_factor: typing.Optional[int] = None auto_find_batch_size: bool = False ignore_data_skip: bool = False sampler_seed: typing.Optional[int] = None )

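Parameters

drop_last (bool, optional, defaults to False) — Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.
num_workers (int, optional, defaults to 0) — Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
pin_memory (bool, optional, defaults to True) — Whether you want to pin memory in data loaders or not.
persistent_workers (bool, optional, defaults to False) — If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. Can potentially speed up training, but will increase RAM usage.
prefetch_factor (int, optional) — Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers.
auto_find_batch_size (bool, optional, defaults to False) — Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
ignore_data_skip (bool, optional, defaults to False) — When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training. If set to True, the training will begin faster (as that skipping step can take a long time) but will not yield the same results as the interrupted training would have.
sampler_seed (int, optional) — Random seed to be used with data samplers. If not set, random generators for data sampling will use the same seed as self.seed. This can be used to ensure reproducibility of data sampling, independent of the model seed.

A method that regroups all arguments linked to the dataloaders creation.

Example: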

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_dataloader(train_batch_size=16, eval_batch_size=64)
>>> args.per_device_train_batch_size
16

set_evaluate

( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'no' steps: int = 500 batch_size: int = 8 accumulation_steps: typing.Optional[int] = None delay: typing.Optional[float] = None loss_only: bool = False jit_mode: bool = False )

Parameters

strategy (str or IntervalStrategy, optional, defaults to "no") — The evaluation strategy to adopt during training. Possible values are:
- "no": No evaluation is done during training.
- "steps": Evaluation is done (and logged) every steps.
- "epoch": Evaluation is done at the end of each epoch.
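Setting a strategy different from "no" will set self.do_eval to True.
steps (int, optional, defaults to 500) — Number of update steps between two evaluations if strategy="steps".
batch_size (int, optional, defaults to 8) — The batch size per device (GPU/TPU core/CPU…) used for evaluation.
accumulation_steps (int, optional) — Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but requires more memory).
delay (float, optional) — Number of epochs or steps to wait for before the first evaluation can be performed, depending on the eval_strategy.
loss_only (bool, optional, defaults to False) — Ignores all outputs except the loss.
jit_mode (bool, optional) — Whether or not to use PyTorch jit trace for inference.

A method that regroups all arguments linked to evaluation.

Example: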

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_evaluate(strategy="steps", steps=100)
>>> args.eval_steps
100

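set_logging

( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps' steps: int = 500 report_to: typing.Union[str, typing.List[str]] = 'none' level: str = 'passive' first_step: bool = False nan_inf_filter: bool = False on_each_node: bool = False replica_level: str = 'passive' )

Parameters

strategy (str or IntervalStrategy, optional, defaults to "steps") — The logging strategy to adopt during training. Possible values are:
- "no": No logging is done during training.
- "epoch": Logging is done at the end of each epoch.
- "steps": Logging is done every logging_steps.
steps (int, optional, defaults to 500) — Number of update steps between two logs if strategy="steps".
level (str, optional, defaults to "passive") — Logger log level to use on the main process. Possible choices are the log levels as strings: "debug", "info", "warning", "error" and "critical", plus a "passive" level which doesn't set anything and lets the application set the level.
report_to (str or List[str], optional, defaults to "all") — The list of integrations to report the results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". Use "all" to report to all integrations installed, "none" for no integrations.
first_step (bool, optional, defaults to False) — Whether to log and evaluate the first global_step or not.
nan_inf_filter (bool, optional, defaults to True) — Whether to filter nan and inf losses for logging. If set to True the loss of every step that is nan or inf is filtered and the average loss of the current logging window is taken instead.

nan_inf_filter only influences the logging of loss values; it does not change how the gradient is computed or applied to the model.
on_each_node (bool, optional, defaults to True) — In multinode distributed training, whether to log using log_level once per node, or only on the main node.
replica_level (str, optional, defaults to "passive") — Logger log level to use on replicas. Same choices as log_level.

A method that regroups all arguments linked to logging.

Example: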

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_logging(strategy="steps", steps=100)
>>> args.logging_steps
100

set_lr_scheduler

( name: typing.Union[str, transformers.trainer_utils.SchedulerType] = 'linear' num_epochs: float = 3.0 max_steps: int = -1 warmup_ratio: float = 0 warmup_steps: int = 0 )

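Parameters

name (str or SchedulerType, optional, defaults to "linear") — The scheduler type to use. See the documentation of SchedulerType for all possible values.
num_epochs (float, optional, defaults to 3.0) — Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).
max_steps (int, optional, defaults to -1) — If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached.
warmup_ratio (float, optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.

A method that regroups all arguments linked to the learning rate scheduler and its hyperparameters.

Example: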

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05)
>>> args.warmup_ratio
0.05
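For intuition about "a linear warmup from 0 to learning_rate", here is a minimal sketch of the default "linear" schedule shape: the rate rises linearly during warmup and then decays linearly to zero. The function linear_lr is hypothetical and written from the description above, not taken from the library.

```python
def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step for linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 10 warmup steps in a 110-step run, sampled at a few points.
for step in (0, 5, 10, 60, 110):
    print(step, linear_lr(step, 1.0, 10, 110))
```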

set_optimizer

( name: typing.Union[str, transformers.training_args.OptimizerNames] = 'adamw_torch' learning_rate: float = 5e-05 weight_decay: float = 0 beta1: float = 0.9 beta2: float = 0.999 epsilon: float = 1e-08 args: typing.Optional[str] = None )

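Parameters

name (str or training_args.OptimizerNames, optional, defaults to "adamw_torch") — The optimizer to use: "adamw_hf", "adamw_torch", "adamw_torch_fused", "adamw_apex_fused", "adamw_anyprecision" or "adafactor".
learning_rate (float, optional, defaults to 5e-5) — The initial learning rate.
weight_decay (float, optional, defaults to 0) — The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights.
beta1 (float, optional, defaults to 0.9) — The beta1 hyperparameter for the adam optimizer or its variants.
beta2 (float, optional, defaults to 0.999) — The beta2 hyperparameter for the adam optimizer or its variants.
epsilon (float, optional, defaults to 1e-8) — The epsilon hyperparameter for the adam optimizer or its variants.
args (str, optional) — Optional arguments that are supplied to AnyPrecisionAdamW (only useful when optim="adamw_anyprecision").

A method that regroups all arguments linked to the optimizer and its hyperparameters.

Example: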

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_optimizer(name="adamw_torch", beta1=0.8)
>>> args.optim
'adamw_torch'
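The weight_decay description says decay is applied to everything except bias and LayerNorm weights. One common way to express that is to split parameters into two groups by name; the sketch below does this on plain strings. The function split_for_weight_decay and the name-matching rule are assumptions for illustration, and the actual library may select parameters differently (for example by layer type).

```python
def split_for_weight_decay(param_names, weight_decay):
    """Group parameter names into a decayed and a non-decayed group."""
    # Assumption: biases and LayerNorm weights are identified by name.
    no_decay_markers = ("bias", "LayerNorm.weight")

    def is_exempt(name):
        return any(marker in name for marker in no_decay_markers)

    decay = [n for n in param_names if not is_exempt(n)]
    no_decay = [n for n in param_names if is_exempt(n)]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


names = [
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.attention.query.bias",
    "encoder.layer.0.LayerNorm.weight",
]
groups = split_for_weight_decay(names, weight_decay=0.01)
print(groups)
```

Groups in this shape are what a PyTorch optimizer accepts as per-group options, once the names are replaced by the corresponding tensors.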

set_push_to_hub

( model_id: str strategy: typing.Union[str, transformers.trainer_utils.HubStrategy] = 'every_save' token: typing.Optional[str] = None private_repo: typing.Optional[bool] = None always_push: bool = False )

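Parameters

model_id (str) — The name of the repository to keep in sync with the local output_dir. It can be a simple model ID, in which case the model will be pushed in your namespace. Otherwise it should be the whole repository name, for instance "user_name/model", which allows you to push to an organization you are a member of with "organization_name/model".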
strategy (str or HubStrategy, optional, defaults to "every_save") — Defines the scope of what is pushed to the Hub and when. Possible values are:
- "end": push the model, its configuration, the processing_class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card when the save_model() method is called.
- "every_save": push the model, its configuration, the processing_class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card each time there is a model save. The pushes are asynchronous to not block training, and in case the save are very frequent, a new push is only attempted if the previous one is finished. A last push is made with the final model at the end of training.
- "checkpoint": like "every_save" but the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to resume training easily with trainer.train(resume_from_checkpoint="last-checkpoint").
- "all_checkpoints": like "checkpoint" but all checkpoints are pushed like they appear in the output folder (so you will get one checkpoint folder per folder in your final repository)
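token (str, optional) — The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with huggingface-cli login.
private_repo (bool, optional, defaults to False) — Whether to make the repo private. If None (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.
always_push (bool, optional, defaults to False) — Unless this is True, the Trainer will skip pushing a checkpoint when the previous push is not finished.

A method that regroups all arguments linked to synchronizing checkpoints with the Hub.

Calling this method will set self.push_to_hub to True, which means the output_dir will begin a git directory synced with the repo (determined by model_id) and the content will be pushed each time a save is triggered (depending on your self.save_strategy). Calling save_model() will also trigger a push.

Example: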

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_push_to_hub("me/awesome-model")
>>> args.hub_model_id
'me/awesome-model'

set_save

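( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps' steps: int = 500 total_limit: typing.Optional[int] = None on_each_node: bool = False )

Parameters

strategy (str or IntervalStrategy, optional, defaults to "steps") — The checkpoint save strategy to adopt during training. Possible values are:
- "no": No save is done during training.
- "epoch": Save is done at the end of each epoch.
- "steps": Save is done every save_steps.
steps (int, optional, defaults to 500) — Number of update steps between two checkpoint saves if strategy="steps".
total_limit (int, optional) — If a value is passed, limits the total number of checkpoints and deletes the older checkpoints in output_dir.
on_each_node (bool, optional, defaults to False) — When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one.
This should not be activated when the different nodes use the same storage, as the files will be saved with the same names for each node.

A method that regroups all arguments linked to checkpoint saving.

Example: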

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_save(strategy="steps", steps=100)
>>> args.save_steps
100
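The interaction between a checkpoint limit and a protected "best" checkpoint (see save_total_limit in TrainingArguments) can be sketched as a small pure function. checkpoints_to_delete is a hypothetical helper written from that description: the best checkpoint is always kept, the remaining slots go to the newest checkpoints, and a limit of 1 is widened to 2 when the best and the last checkpoint differ.

```python
def checkpoints_to_delete(checkpoints, total_limit, best=None):
    """Return the checkpoints to remove; `checkpoints` is ordered oldest first."""
    if total_limit is None or total_limit <= 0:
        return []
    limit = total_limit
    # With a limit of 1, keep both the last and the best if they differ.
    if best is not None and limit == 1 and checkpoints and checkpoints[-1] != best:
        limit = 2
    keep = set()
    if best is not None:
        keep.add(best)
    # Fill the remaining slots with the most recent checkpoints.
    for ckpt in reversed(checkpoints):
        if len(keep) >= limit:
            break
        keep.add(ckpt)
    return [c for c in checkpoints if c not in keep]


ckpts = [f"checkpoint-{n}" for n in range(100, 800, 100)]
print(checkpoints_to_delete(ckpts, 5, best="checkpoint-100"))
```

With seven checkpoints, a limit of 5 and checkpoint-100 as the best one, the four most recent checkpoints and the best one survive, matching the example given for save_total_limit.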

set_testing

( batch_size: int = 8 loss_only: bool = False jit_mode: bool = False )

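Parameters

batch_size (int, optional, defaults to 8) — The batch size per device (GPU/TPU core/CPU…) used for testing.
loss_only (bool, optional, defaults to False) — Ignores all outputs except the loss.
jit_mode (bool, optional) — Whether or not to use PyTorch jit trace for inference.

A method that regroups all basic arguments linked to testing on a held-out dataset.

Calling this method will automatically set self.do_predict to True.

Example: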

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_testing(batch_size=32)
>>> args.per_device_eval_batch_size
32

set_training

( learning_rate: float = 5e-05 batch_size: int = 8 weight_decay: float = 0 num_epochs: float = 3 max_steps: int = -1 gradient_accumulation_steps: int = 1 seed: int = 42 gradient_checkpointing: bool = False )

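Parameters

learning_rate (float, optional, defaults to 5e-5) — The initial learning rate for the optimizer.
batch_size (int, optional, defaults to 8) — The batch size per device (GPU/TPU core/CPU…) used for training.
weight_decay (float, optional, defaults to 0) — The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the optimizer.
num_epochs (float, optional, defaults to 3.0) — Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).
max_steps (int, optional, defaults to -1) — If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached.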
gradient_accumulation_steps (int, optional, defaults to 1) — Number of updates steps to accumulate the gradients for, before performing a backward/update pass.

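When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation, save will be conducted every gradient_accumulation_steps * xxx_step training examples.
seed (int, optional, defaults to 42) — Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the ~Trainer.model_init function to instantiate the model if it has some randomly initialized parameters.
gradient_checkpointing (bool, optional, defaults to False) — If True, use gradient checkpointing to save memory at the expense of a slower backward pass.

A method that regroups all basic arguments linked to the training.

Calling this method will automatically set self.do_train to True.

Example: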

>>> from transformers import TrainingArguments

>>> args = TrainingArguments("working_dir")
>>> args = args.set_training(learning_rate=1e-4, batch_size=32)
>>> args.learning_rate
0.0001
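The note on gradient_accumulation_steps in the parameter list can be made concrete with a little arithmetic. The helpers below are hypothetical and assume that each update step consumes gradient_accumulation_steps micro-batches on every device.

```python
def samples_per_update(per_device_batch_size, gradient_accumulation_steps=1,
                       num_devices=1):
    """Training samples consumed by one optimizer update (assumed layout)."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def micro_batches_between_events(xxx_steps, gradient_accumulation_steps=1):
    """Forward/backward passes per device between two logs, evals or saves."""
    return xxx_steps * gradient_accumulation_steps


# batch_size=32 with 4 accumulation steps on 2 devices.
print(samples_per_update(32, gradient_accumulation_steps=4, num_devices=2))
# Saving every 100 update steps means 400 micro-batches per device.
print(micro_batches_between_events(100, gradient_accumulation_steps=4))
```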

to_dict

( )

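Serializes this instance while replacing Enum members by their values (for JSON serialization support). It obfuscates the token values by removing their value.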
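A minimal sketch of this kind of serialization, using a hypothetical MiniArgs dataclass: Enum members are replaced by their values and any field whose name ends in "_token" is masked. The placeholder format used for masked tokens is an assumption of this sketch.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Strategy(Enum):
    NO = "no"
    STEPS = "steps"


@dataclass
class MiniArgs:
    # Hypothetical stand-in for TrainingArguments.
    save_strategy: Strategy = Strategy.STEPS
    hub_token: str = "hf_secret"
    seed: int = 42


def to_plain_dict(args):
    """Return a JSON-friendly dict with Enums unwrapped and tokens masked."""
    result = {}
    for key, value in asdict(args).items():
        if isinstance(value, Enum):
            value = value.value
        if key.endswith("_token"):
            # Assumption: the secret is replaced by a placeholder string.
            value = f"<{key.upper()}>"
        result[key] = value
    return result


plain = to_plain_dict(MiniArgs())
print(json.dumps(plain))
```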

to_json_string

( )

Serializes this instance to a JSON string.

to_sanitized_dict

( )

Sanitized serialization to use with TensorBoard's hparams.

Seq2SeqTrainingArguments

class transformers.Seq2SeqTrainingArguments

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False 
fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[typing.List[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional

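Parameters

output_dir (str) — The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (bool, optional, defaults to False) — If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
do_train (bool, optional, defaults to False) — Whether to run training or not. This argument is not directly used by Trainer; it is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
do_eval (bool, optional) — Whether to run evaluation on the validation set or not. Will be set to True if eval_strategy is different from "no". This argument is not directly used by Trainer; it is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
do_predict (bool, optional, defaults to False) — Whether to run predictions on the test set or not. This argument is not directly used by Trainer; it is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.
eval_strategy (str or IntervalStrategy, optional, defaults to "no") — The evaluation strategy to adopt during training. Possible values are:
- "no": No evaluation is done during training.
- "steps": Evaluation is done (and logged) every eval_steps.
- "epoch": Evaluation is done at the end of each epoch.
prediction_loss_only (bool, optional, defaults to False) — When performing evaluation and generating predictions, only returns the loss.
per_device_train_batch_size (int, optional, defaults to 8) — The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8) — The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.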
gradient_accumulation_steps (int, optional, defaults to 1) — Number of updates steps to accumulate the gradients for, before performing a backward/update pass.

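When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation, save will be conducted every gradient_accumulation_steps * xxx_step training examples.
eval_accumulation_steps (int, optional) — Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, the whole predictions are accumulated on GPU/NPU/TPU before being moved to the CPU (faster but requires more memory).
eval_delay (float, optional) — Number of epochs or steps to wait for before the first evaluation can be performed, depending on the eval_strategy.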
torch_empty_cache_steps (int, optional) — Number of steps to wait before calling torch.<device>.empty_cache(). If left unset or set to None, cache will not be emptied.

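This can help avoid CUDA out-of-memory errors by lowering peak VRAM usage, at a cost of about 10% slower performance.
learning_rate (float, optional, defaults to 5e-5) — The initial learning rate for the AdamW optimizer.
weight_decay (float, optional, defaults to 0) — The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
adam_beta1 (float, optional, defaults to 0.9) — The beta1 hyperparameter for the AdamW optimizer.
adam_beta2 (float, optional, defaults to 0.999) — The beta2 hyperparameter for the AdamW optimizer.
adam_epsilon (float, optional, defaults to 1e-8) — The epsilon hyperparameter for the AdamW optimizer.
max_grad_norm (float, optional, defaults to 1.0) — Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0) — Total number of training epochs to perform (if not an integer, the fractional part is performed as a partial last epoch before training stops).
max_steps (int, optional, defaults to -1) — If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached.
lr_scheduler_type (str or SchedulerType, optional, defaults to "linear") — The scheduler type to use. See the documentation of SchedulerType for all possible values.
lr_scheduler_kwargs (dict, optional, defaults to {}) — The extra arguments for the lr_scheduler. See the documentation of each scheduler for possible values.
warmup_ratio (float, optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.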
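The interaction between warmup_steps and warmup_ratio can be sketched as follows. This is a minimal illustration, assuming that the ratio is rounded up to a whole number of steps and that the "linear" scheduler decays to zero after warmup; the function names are hypothetical and this is not the library's own code.

```python
import math


def resolve_warmup_steps(num_training_steps, warmup_steps=0, warmup_ratio=0.0):
    """A positive warmup_steps wins; otherwise derive the count from warmup_ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(num_training_steps * warmup_ratio)


def linear_lr(step, learning_rate, num_training_steps, num_warmup_steps):
    """Linear warmup from 0 to learning_rate, then linear decay back to 0."""
    if step < num_warmup_steps:
        return learning_rate * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return learning_rate * max(0.0, remaining / decay_span)


warmup = resolve_warmup_steps(1000, warmup_ratio=0.25)
print(warmup)                               # 250
print(linear_lr(250, 5e-5, 1000, warmup))   # 5e-05 (peak, end of warmup)
```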
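log_level (str, optional, defaults to passive) — Logger log level to use on the main process. Possible choices are the log levels as strings: 'debug', 'info', 'warning', 'error' and 'critical', plus a 'passive' level which doesn't set anything and keeps the current log level for the Transformers library (which will be "warning" by default).
log_level_replica (str, optional, defaults to "warning") — Logger log level to use on replicas. Same choices as log_level.
log_on_each_node (bool, optional, defaults to True) — In multinode distributed training, whether to log using log_level once per node, or only on the main node.
logging_dir (str, optional) — TensorBoard log directory. Will default to *output_dir/runs/CURRENT_DATETIME_HOSTNAME*.
logging_strategy (str or IntervalStrategy, optional, defaults to "steps") — The logging strategy to adopt during training. Possible values are:
- "no": No logging is done during training.
- "epoch": Logging is done at the end of each epoch.
- "steps": Logging is done every logging_steps.
logging_first_step (bool, optional, defaults to False) — Whether to log the first global_step or not.
logging_steps (int or float, optional, defaults to 500) — Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps.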
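The same "integer or ratio" convention is described for logging_steps, save_steps, and eval_steps. A minimal sketch of how such a value could be resolved, assuming the fractional case is rounded up (the helper name is hypothetical):

```python
import math


def resolve_interval(steps, max_steps):
    """Turn an interval given as int or ratio into an absolute number of update steps."""
    if steps < 1:
        # Values below 1 are read as a fraction of the total training steps.
        return math.ceil(max_steps * steps)
    return int(steps)


print(resolve_interval(500, max_steps=2000))   # 500
print(resolve_interval(0.25, max_steps=2000))  # 500
```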
logging_nan_inf_filter (bool, optional, defaults to True) — Whether to filter nan and inf losses for logging. If set to True the loss of every step that is nan or inf is filtered and the average loss of the current logging window is taken instead.

logging_nan_inf_filter only influences the logging of loss values; it does not change how the gradient is computed or applied to the model.
save_strategy (str or SaveStrategy, optional, defaults to "steps") — The checkpoint save strategy to adopt during training. Possible values are:
- "no": No save is done during training.
- "epoch": Save is done at the end of each epoch.
- "steps": Save is done every save_steps.
- "best": Save is done whenever a new best_metric is achieved.
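If "epoch" or "steps" is chosen, saving will also always be performed at the very end of training.
save_steps (int or float, optional, defaults to 500) — Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps.
save_total_limit (int, optional) — If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in output_dir. When load_best_model_at_end is enabled, the "best" checkpoint according to metric_for_best_model will always be retained in addition to the most recent ones. For example, for save_total_limit=5 and load_best_model_at_end, the four last checkpoints will always be retained alongside the best model. When save_total_limit=1 and load_best_model_at_end, it is possible that two checkpoints are saved: the last one and the best one (if they are different).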
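The retention rule described for save_total_limit can be modeled with a small function. This is a simplified sketch of the behavior as stated above, not the Trainer's actual rotation code; the function name and the use of step numbers to identify checkpoints are assumptions.

```python
def checkpoints_to_keep(steps, save_total_limit=None, best_step=None):
    """Return the checkpoint steps that survive rotation, oldest first."""
    ordered = sorted(steps)
    if save_total_limit is None or len(ordered) <= save_total_limit:
        return ordered
    limit = save_total_limit
    # With a limit of 1, both the last and the best checkpoint may be kept.
    if limit == 1 and best_step is not None and ordered[-1] != best_step:
        limit = 2
    if best_step is None or best_step not in ordered:
        return ordered[-limit:]
    others = [s for s in ordered if s != best_step]
    recent = others[-(limit - 1):] if limit > 1 else []
    return sorted(recent + [best_step])


saved = [100, 200, 300, 400, 500, 600]
print(checkpoints_to_keep(saved, save_total_limit=5, best_step=200))
# [200, 300, 400, 500, 600]
print(checkpoints_to_keep(saved, save_total_limit=1, best_step=200))
# [200, 600]
```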
save_safetensors (bool, optional, defaults to True) — Use safetensors saving and loading for state dicts instead of the default torch.load and torch.save.
save_on_each_node (bool, optional, defaults to False) — When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one.
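This should not be activated when the different nodes use the same storage, as the files will be saved with the same names for each node.
save_only_model (bool, optional, defaults to False) — When checkpointing, whether to only save the model, or also the optimizer, scheduler and RNG state. Note that when this is True, you won't be able to resume training from a checkpoint. This enables you to save storage by not storing the optimizer, scheduler and RNG state. With this option set to True, you can only load the model using from_pretrained.
restore_callback_states_from_checkpoint (bool, optional, defaults to False) — Whether to restore the callback states from the checkpoint. If True, will override callbacks passed to the Trainer if they exist in the checkpoint.
use_cpu (bool, optional, defaults to False) — Whether or not to use the CPU. If set to False, a cuda or mps device will be used if available.
seed (int, optional, defaults to 42) — Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the ~Trainer.model_init function to instantiate the model if it has some randomly initialized parameters.
data_seed (int, optional) — Random seed to be used with data samplers. If not set, random generators for data sampling will use the same seed as seed. This can be used to ensure reproducibility of data sampling, independent of the model seed.
jit_mode_eval (bool, optional, defaults to False) — Whether or not to use PyTorch jit trace for inference.
use_ipex (bool, optional, defaults to False) — Use the Intel extension for PyTorch when it is available. See the IPEX installation instructions.
bf16 (bool, optional, defaults to False) — Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires Ampere or higher NVIDIA architecture, or using CPU (use_cpu) or Ascend NPU. This is an experimental API and it may change.
fp16 (bool, optional, defaults to False) — Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
fp16_opt_level (str, optional, defaults to 'O1') — For fp16 training, the Apex AMP optimization level, selected from ['O0', 'O1', 'O2', 'O3']. See the Apex documentation for details.
fp16_backend (str, optional, defaults to "auto") — This argument is deprecated. Use half_precision_backend instead.
half_precision_backend (str, optional, defaults to "auto") — The backend to use for mixed precision training. Must be one of "auto", "apex", "cpu_amp". "auto" will use CPU/CUDA AMP or APEX depending on the PyTorch version detected, while the other choices will force the requested backend.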
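bf16_full_eval (bool, optional, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm metric values. This is an experimental API and it may change.
fp16_full_eval (bool, optional, defaults to False) — Whether to use full float16 evaluation instead of 32-bit. This will be faster and save memory but can harm metric values.
tf32 (bool, optional) — Whether to enable the TF32 mode, available in Ampere and newer GPU architectures. The default value depends on PyTorch's version default of torch.backends.cuda.matmul.allow_tf32. For more details please refer to the TF32 documentation. This is an experimental API and it may change.
local_rank (int, optional, defaults to -1) — Rank of the process during distributed training.
ddp_backend (str, optional) — The backend to use for distributed training. Must be one of "nccl", "mpi", "ccl", "gloo", "hccl".
tpu_num_cores (int, optional) — When training on TPU, the number of TPU cores (automatically passed by the launcher script).
dataloader_drop_last (bool, optional, defaults to False) — Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.
eval_steps (int or float, optional) — Number of update steps between two evaluations if eval_strategy="steps". Will default to the same value as logging_steps if not set. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps.
dataloader_num_workers (int, optional, defaults to 0) — Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
past_index (int, optional, defaults to -1) — Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.
run_name (str, optional, defaults to output_dir) — A descriptor for the run. Typically used for wandb, mlflow and comet logging. If not specified, will be the same as output_dir.
disable_tqdm (bool, optional) — Whether or not to disable the tqdm progress bars and table of metrics produced by ~notebook.NotebookTrainingTracker in Jupyter Notebooks. Will default to True if the logging level is set to warn or lower (default), False otherwise.
remove_unused_columns (bool, optional, defaults to True) — Whether or not to automatically remove the columns unused by the model forward method.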
label_names (List[str], optional) — The list of keys in your dictionary of inputs that correspond to the labels.
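Will eventually default to the list of argument names accepted by the model that contain the word "label", except if the model used is one of the XxxForQuestionAnswering models, in which case it will also include the ["start_positions", "end_positions"] keys.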
load_best_model_at_end (bool, optional, defaults to False) — Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved. See save_total_limit for more.

When set to True, the parameter save_strategy needs to be the same as eval_strategy, and in the case it is "steps", save_steps must be a round multiple of eval_steps.
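The constraint above can be expressed as a small pre-flight check. This is a sketch of the rule as stated, assuming integer step counts; the function name is hypothetical and the library's own validation may differ in detail.

```python
def check_best_model_settings(save_strategy, eval_strategy, save_steps=None, eval_steps=None):
    """Raise ValueError if the settings conflict with load_best_model_at_end=True."""
    if save_strategy != eval_strategy:
        raise ValueError(
            f"save_strategy ({save_strategy!r}) must match eval_strategy ({eval_strategy!r})"
        )
    if eval_strategy == "steps" and save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps ({save_steps}) must be a round multiple of eval_steps ({eval_steps})"
        )


# Valid: a checkpoint is saved at every second evaluation.
check_best_model_settings("steps", "steps", save_steps=1000, eval_steps=500)
```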
metric_for_best_model (str, optional) — Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_". Will default to "loss" if unspecified and load_best_model_at_end=True (to use the evaluation loss).
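If you set this value, greater_is_better will default to True unless the name ends in "loss". Don't forget to set it to False if your metric is better when lower.
greater_is_better (bool, optional) — Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to:
- True if metric_for_best_model is set to a value that doesn't end in "loss".
- False if metric_for_best_model is not set, or set to a value that ends in "loss".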
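The default for greater_is_better described in the two bullets above amounts to the following rule. This is a minimal sketch; the function name is hypothetical.

```python
def default_greater_is_better(metric_for_best_model=None):
    """Default direction of improvement implied by the metric name."""
    if metric_for_best_model is None:
        return False  # falls back to the evaluation loss, where lower is better
    return not metric_for_best_model.endswith("loss")


print(default_greater_is_better("eval_accuracy"))  # True
print(default_greater_is_better("eval_loss"))      # False
print(default_greater_is_better())                 # False
```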
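ignore_data_skip (bool, optional, defaults to False) — When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training. If set to True, the training will begin faster (as that skipping step can take a long time) but will not yield the same results as the interrupted training would have.
fsdp (bool, str or list of FSDPOption, optional, defaults to '') — Use PyTorch Fully Sharded Data Parallel training (in distributed training only).
The available options are: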
- "full_shard": Shard parameters, gradients and optimizer states.
- "shard_grad_op": Shard optimizer states and gradients.
- "hybrid_shard": Apply FULL_SHARD within a node, and replicate parameters across nodes.
- "hybrid_shard_zero2": Apply SHARD_GRAD_OP within a node, and replicate parameters across nodes.
- "offload": Offload parameters and gradients to CPUs (only compatible with "full_shard" and "shard_grad_op").
- "auto_wrap": Automatically recursively wrap layers with FSDP using default_auto_wrap_policy.
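fsdp_config (str or dict, optional) — Config to be used with fsdp (PyTorch Fully Sharded Data Parallel training). The value is either the location of an FSDP json config file (e.g., fsdp_config.json) or an already loaded json file as a dict.
The config options are:
- min_num_params (int, optional, defaults to 0): FSDP's minimum number of parameters for default auto wrapping (useful only when the fsdp field is passed).
- transformer_layer_cls_to_wrap (List[str], optional): List of transformer layer class names (case-sensitive) to wrap, e.g., BertLayer, GPTJBlock, T5Block … (useful only when the fsdp flag is passed).
- backward_prefetch (str, optional): FSDP's backward prefetch mode. Controls when to prefetch the next set of parameters (useful only when the fsdp field is passed).
  
  The available options are:
  - "backward_pre": Prefetches the next set of parameters before the current set of parameters' gradient computation.
  - "backward_post": Prefetches the next set of parameters after the current set of parameters' gradient computation.
- forward_prefetch (bool, optional, defaults to False): FSDP's forward prefetch mode (useful only when the fsdp field is passed). If "True", FSDP explicitly prefetches the next upcoming all-gather while executing the forward pass.
- limit_all_gathers (bool, optional, defaults to False): FSDP's limit_all_gathers (useful only when the fsdp field is passed). If "True", FSDP explicitly synchronizes the CPU thread to prevent too many in-flight all-gathers.
- use_orig_params (bool, optional, defaults to True): If "True", allows non-uniform requires_grad during init, which means support for interspersed frozen and trainable parameters. Useful in cases such as parameter-efficient fine-tuning. Please refer to this [blog](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019).
- sync_module_states (bool, optional, defaults to True): If "True", each individually wrapped FSDP unit will broadcast module parameters from rank 0 to ensure they are the same across all ranks after initialization.
- cpu_ram_efficient_loading (bool, optional, defaults to False): If "True", only the first process loads the pretrained model checkpoint while all other processes have empty weights. When this setting is "True", sync_module_states must also be "True"; otherwise all processes except the main process would have random weights, leading to unexpected behavior during training.
- activation_checkpointing (bool, optional, defaults to False): If "True", enables activation checkpointing, a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reduced memory usage.
- xla (bool, optional, defaults to False): Whether to use PyTorch/XLA Fully Sharded Data Parallel training. This is an experimental feature and its API may evolve in the future.
- xla_fsdp_settings (dict, optional): A dictionary which stores the XLA FSDP wrapping parameters.
  
  For a complete list of options, please see here.
- xla_fsdp_grad_ckpt (bool, optional, defaults to False): Will use gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be used when the xla flag is set to true and an auto wrapping policy is specified through fsdp_min_num_params or fsdp_transformer_layer_cls_to_wrap.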
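For illustration, an fsdp_config built from the keys listed above might look like the following. The values are arbitrary examples rather than recommendations, and BertLayer is only a sample class name taken from the list; the consistency check at the end mirrors the stated cpu_ram_efficient_loading requirement and is not library code.

```python
import json

# Example values only; every key comes from the option list above.
fsdp_config = {
    "transformer_layer_cls_to_wrap": ["BertLayer"],
    "backward_prefetch": "backward_pre",
    "forward_prefetch": False,
    "limit_all_gathers": False,
    "use_orig_params": True,
    "sync_module_states": True,
    "cpu_ram_efficient_loading": True,
    "activation_checkpointing": False,
}

# cpu_ram_efficient_loading needs sync_module_states, otherwise non-main
# processes would keep random weights.
if fsdp_config["cpu_ram_efficient_loading"] and not fsdp_config["sync_module_states"]:
    raise ValueError("cpu_ram_efficient_loading requires sync_module_states")

# This string is what a file such as fsdp_config.json would contain.
serialized = json.dumps(fsdp_config, indent=2)
print(serialized)
```

Either the dict itself or the path to a file holding the serialized JSON can then be supplied as the fsdp_config value.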
deepspeed (str or dict, optional) — Use DeepSpeed. This is an experimental feature and its API may evolve in the future. The value is either the location of a DeepSpeed json config file (e.g., ds_config.json) or an already loaded json file as a dict.
If enabling any Zero-init, make sure that your model is not initialized until *after* initializing the `TrainingArguments`, else it will not be applied.
accelerator_config (str, dict, or AcceleratorConfig, optional) — Config to be used with the internal Accelerator implementation. The value is either a location of accelerator json config file (e.g., accelerator_config.json), an already loaded json file as dict, or an instance of AcceleratorConfig.
The config options are:
- split_batches (bool, optional, defaults to False): Whether or not the accelerator should split the batches yielded by the dataloaders across the devices. If True the actual batch size used will be the same on any kind of distributed processes, but it must be a round multiple of the num_processes you are using. If False, actual batch size used will be the one set in your script multiplied by the number of processes.
- dispatch_batches (bool, optional): If set to True, the dataloader prepared by the Accelerator is only iterated through on the main process and then the batches are split and broadcast to each process. Will default to True for DataLoader whose underlying dataset is an IterableDataset, False otherwise.
- even_batches (bool, optional, defaults to True): If set to True, in cases where the total batch size across all processes does not exactly divide the dataset, samples at the start of the dataset will be duplicated so the batch can be divided equally among all workers.
- use_seedable_sampler (bool, optional, defaults to True): Whether or not to use a fully seedable random sampler (accelerate.data_loader.SeedableRandomSampler). Ensures training results are fully reproducible using a different sampling technique. While seed-to-seed results may differ, on average the differences are negligible when using multiple different seeds to compare. Should also be run with ~utils.set_seed for the best results.
- use_configured_state (bool, optional, defaults to False): Whether or not to use a pre-configured AcceleratorState or PartialState defined before calling TrainingArguments. If True, an Accelerator or PartialState must be initialized. Note that by doing so, this could lead to issues with hyperparameter tuning.
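The effect of split_batches on batch sizes can be sketched with plain arithmetic. This is an illustration of the description above, not Accelerate's implementation; the function name is hypothetical.

```python
def batch_sizes(script_batch_size, num_processes, split_batches=False):
    """Return (per_process_batch_size, total_batch_size)."""
    if split_batches:
        # The dataloader batch is divided among processes, so it must split evenly.
        if script_batch_size % num_processes != 0:
            raise ValueError("batch size must be a round multiple of num_processes")
        return script_batch_size // num_processes, script_batch_size
    # Each process draws its own full batch, so the total scales with the process count.
    return script_batch_size, script_batch_size * num_processes


print(batch_sizes(32, 4, split_batches=True))   # (8, 32)
print(batch_sizes(32, 4, split_batches=False))  # (32, 128)
```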
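label_smoothing_factor (float, optional, defaults to 0.0) — The label smoothing factor to use. Zero means no label smoothing; otherwise the underlying one-hot encoded labels are changed from 0s and 1s to label_smoothing_factor/num_labels and 1 - label_smoothing_factor + label_smoothing_factor/num_labels respectively.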
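As a worked example of the formula above, the sketch below builds the smoothed target vector for a single label. The function name is hypothetical; only the two expressions come from the description.

```python
def smoothed_targets(label, num_labels, label_smoothing_factor):
    """Smoothed one-hot target distribution for one class index."""
    off_value = label_smoothing_factor / num_labels      # replaces the 0s
    on_value = 1.0 - label_smoothing_factor + off_value  # replaces the 1
    return [on_value if i == label else off_value for i in range(num_labels)]


targets = smoothed_targets(label=1, num_labels=4, label_smoothing_factor=0.1)
# Roughly [0.025, 0.925, 0.025, 0.025]; the entries still sum to 1.
print(targets)
```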
debug (str or list of DebugOption, optional, defaults to "") — Enable one or more debug features. This is an experimental feature.
Possible options are:
- "underflow_overflow": detects overflow in model’s input/outputs and reports the last frames that led to the event
- "tpu_metrics_debug": print debug metrics on TPU
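The options should be separated by whitespace.
optim (str or training_args.OptimizerNames, optional, defaults to "adamw_torch") — The optimizer to use, such as "adamw_hf", "adamw_torch", "adamw_torch_fused", "adamw_apex_fused", "adamw_anyprecision", or "adafactor". See OptimizerNames in training_args.py for the full list of optimizers.
optim_args (str, optional) — Optional arguments that are supplied to optimizers such as AnyPrecisionAdamW, AdEMAMix, and GaLore.
group_by_length (bool, optional, defaults to False) — Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding and be more efficient). Only useful if applying dynamic padding.
length_column_name (str, optional, defaults to "length") — Column name for precomputed lengths. If the column exists, grouping by length will use these values rather than computing them on train startup. Ignored unless group_by_length is True and the dataset is an instance of Dataset.
report_to (str or List[str], optional, defaults to "all") — The list of integrations to report the results and logs to. Supported platforms are "azure_ml", "clearml", "codecarbon", "comet_ml", "dagshub", "dvclive", "flyte", "mlflow", "neptune", "tensorboard", and "wandb". Use "all" to report to all integrations installed, "none" for no integrations.
ddp_find_unused_parameters (bool, optional) — When using distributed training, the value of the flag find_unused_parameters passed to DistributedDataParallel. Will default to False if gradient checkpointing is used, True otherwise.
ddp_bucket_cap_mb (int, optional) — When using distributed training, the value of the flag bucket_cap_mb passed to DistributedDataParallel.
ddp_broadcast_buffers (bool, optional) — When using distributed training, the value of the flag broadcast_buffers passed to DistributedDataParallel. Will default to False if gradient checkpointing is used, True otherwise.
dataloader_pin_memory (bool, optional, defaults to True) — Whether you want to pin memory in data loaders or not.
dataloader_persistent_workers (bool, optional, defaults to False) — If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. Can potentially speed up training, but will increase RAM usage.
dataloader_prefetch_factor (int, optional) — Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers.
skip_memory_metrics (bool, optional, defaults to True) — Whether to skip adding memory profiler reports to metrics. This is skipped by default because it slows down the training and evaluation speed.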
push_to_hub (bool, optional, defaults to False) — Whether or not to push the model to the Hub every time the model is saved. If this is activated, output_dir will begin a git directory synced with the repo (determined by hub_model_id) and the content will be pushed each time a save is triggered (depending on your save_strategy). Calling save_model() will also trigger a push.

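If output_dir exists, it needs to be a local clone of the repository to which the Trainer will push.
resume_from_checkpoint (str, optional) — The path to a folder with a valid checkpoint for your model. This argument is not directly used by Trainer; it is intended to be used by your training/evaluation scripts instead. See the example scripts for more details.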
hub_model_id (str, optional) — The name of the repository to keep in sync with the local output_dir. It can be a simple model ID in which case the model will be pushed in your namespace. Otherwise it should be the whole repository name, for instance "user_name/model", which allows you to push to an organization you are a member of with "organization_name/model". Will default to user_name/output_dir_name with output_dir_name being the name of output_dir.
Will default to the name of output_dir.
hub_strategy (str or HubStrategy, optional, defaults to "every_save") — Defines the scope of what is pushed to the Hub and when. Possible values are:
- "end": push the model, its configuration, the processing class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card when the save_model() method is called.
- "every_save": push the model, its configuration, the processing class e.g. tokenizer (if passed along to the Trainer) and a draft of a model card each time there is a model save. The pushes are asynchronous to not block training, and in case the save are very frequent, a new push is only attempted if the previous one is finished. A last push is made with the final model at the end of training.
- "checkpoint": like "every_save" but the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to resume training easily with trainer.train(resume_from_checkpoint="last-checkpoint").
- "all_checkpoints": like "checkpoint" but all checkpoints are pushed like they appear in the output folder (so you will get one checkpoint folder per folder in your final repository)
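hub_token (str, optional) — The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with huggingface-cli login.
hub_private_repo (bool, optional) — Whether to make the repo private. If None (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.
hub_always_push (bool, optional, defaults to False) — Unless this is True, the Trainer will skip pushing a checkpoint when the previous push is not finished.
gradient_checkpointing (bool, optional, defaults to False) — If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
gradient_checkpointing_kwargs (dict, optional, defaults to None) — Keyword arguments to be passed to the gradient_checkpointing_enable method.
include_inputs_for_metrics (bool, optional, defaults to False) — This argument is deprecated. Use include_for_metrics instead, e.g., include_for_metrics = ["inputs"].
include_for_metrics (List[str], optional, defaults to []) — Include additional data in the compute_metrics function if needed for metrics computation. Possible options to add to the include_for_metrics list:
- "inputs": Input data passed to the model, intended for calculating input-dependent metrics.
- "loss": Loss values computed during evaluation, intended for calculating loss-dependent metrics.
eval_do_concat_batches (bool, optional, defaults to True) — Whether to recursively concatenate inputs/losses/labels/predictions across batches. If False, they are stored as lists instead, with each batch kept separate.
auto_find_batch_size (bool, optional, defaults to False) — Whether to automatically find a batch size that will fit into memory through exponential decay, avoiding CUDA out-of-memory errors. Requires accelerate to be installed (pip install accelerate).
full_determinism (bool, optional, defaults to False) — If True, enable_full_determinism() is called instead of set_seed() to ensure reproducible results in distributed training. Important: this will negatively impact performance, so only use it for debugging.
torchdynamo (str, optional) — If set, the backend compiler for TorchDynamo. Possible choices are "eager", "aot_eager", "inductor", "nvfuser", "aot_nvfuser", "aot_cudagraphs", "ofi", "fx2trt", "onnxrt" and "ipex".
ray_scope (str, optional, defaults to "last") — The scope to use when doing hyperparameter search with Ray. By default, "last" will be used. Ray will then use the last checkpoint of all trials, compare those, and select the best one. However, other options are also available. See the Ray documentation for more options.
ddp_timeout (int, optional, defaults to 1800) — The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when performing slow operations in distributed runs. Please refer to the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more information.
use_mps_device (bool, optional, defaults to False) — This argument is deprecated. The mps device will be used if it is available, similar to the cuda device.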
torch_compile (bool, optional, defaults to False) — Whether or not to compile the model using PyTorch 2.0 torch.compile.
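This will use the best defaults for the torch.compile API. You can customize the defaults with the arguments torch_compile_backend and torch_compile_mode, but we don't guarantee any of them will work, as support is progressively rolled out in PyTorch.

This flag and the whole compile API are experimental and subject to change in future releases.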
torch_compile_backend (str, optional) — The backend to use in torch.compile. If set to any value, torch_compile will be set to True.
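Refer to the PyTorch documentation for possible values and note that they may change across PyTorch versions.

This flag is experimental and subject to change in future releases.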
torch_compile_mode (str, optional) — The mode to use in torch.compile. If set to any value, torch_compile will be set to True.
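Refer to the PyTorch documentation for possible values and note that they may change across PyTorch versions.

This flag is experimental and subject to change in future releases.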
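split_batches (bool, optional) — Whether or not the accelerator should split the batches yielded by the dataloaders across the devices during distributed training. If set to True, the actual batch size used will be the same on any kind of distributed processes, but it must be a round multiple of the number of processes you are using (such as GPUs).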
include_tokens_per_second (bool, optional) — Whether or not to compute the number of tokens per second per device for training speed metrics.
This will iterate over the entire training dataloader once beforehand, and will slow down the entire process.
include_num_input_tokens_seen (bool, optional) — Whether or not to track the number of input tokens seen throughout training.
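May be slower in distributed training, as gather operations must be called.
neftune_noise_alpha (Optional[float]) — If not None, this will activate NEFTune noise embeddings. This can drastically improve model performance for instruction fine-tuning. Check out the original paper and the original code. Supports transformers PreTrainedModel and also PeftModel from peft. The original paper used values in the range [5.0, 15.0].
optim_target_modules (Union[str, List[str]], optional) — The target modules to optimize, i.e. the names of the modules that you would like to train. Right now this is used only for the GaLore algorithm (https://arxiv.org/abs/2403.03507); see https://github.com/jiaweizzhao/GaLore for more details. You need to make sure to pass a valid GaLore optimizer, e.g. one of "galore_adamw", "galore_adamw_8bit", "galore_adafactor", and make sure that the target modules are nn.Linear modules only.
batch_eval_metrics (Optional[bool], defaults to False) — If set to True, evaluation will call compute_metrics at the end of each batch to accumulate statistics rather than saving all eval logits in memory. When set to True, you must pass a compute_metrics function that takes a boolean argument compute_result, which when passed True, will trigger the final global summary statistics from the batch-level summary statistics you've accumulated over the evaluation set.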
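A compute_metrics callable that fits this contract might look like the sketch below. It is only an illustration: the BatchAccuracy class is hypothetical, the predictions are assumed to be already reduced to class ids held in plain lists, and a SimpleNamespace stands in for the object the Trainer would pass (assumed to expose predictions and label_ids).

```python
from types import SimpleNamespace


class BatchAccuracy:
    """Accumulates accuracy statistics batch by batch."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, eval_pred, compute_result=False):
        # Update the running counts with the current batch.
        for pred, label in zip(eval_pred.predictions, eval_pred.label_ids):
            self.correct += int(pred == label)
            self.total += 1
        if not compute_result:
            return {}
        # Last batch: emit the global summary and reset for the next evaluation.
        accuracy = self.correct / self.total if self.total else 0.0
        self.correct = self.total = 0
        return {"accuracy": accuracy}


metric = BatchAccuracy()
metric(SimpleNamespace(predictions=[0, 1, 1], label_ids=[0, 1, 0]))
result = metric(SimpleNamespace(predictions=[1, 0], label_ids=[1, 0]), compute_result=True)
print(result)  # {'accuracy': 0.8}
```

Keeping the counters on an object rather than in module-level variables makes it straightforward to reset the state between evaluation runs.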
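eval_on_start (bool, optional, defaults to False) — Whether to perform an evaluation step (sanity check) before training to ensure the validation step works correctly.
eval_use_gather_object (bool, optional, defaults to False) — Whether to recursively gather objects in a nested list/tuple/dictionary of objects from all devices. This should only be enabled if users are not just returning tensors, and this is actively discouraged by PyTorch.
use_liger_kernel (bool, optional, defaults to False) — Whether to enable the Liger Kernel for LLM model training. It can increase multi-GPU training throughput by about 20% and reduce memory usage by about 60%, and works out of the box with flash attention, PyTorch FSDP, and Microsoft DeepSpeed. Currently, it supports llama, mistral, mixtral and gemma models.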
sortish_sampler (bool, optional, defaults to False) — Whether to use a sortish sampler or not. Only possible if the underlying datasets are Seq2SeqDataset for now but will become generally available in the near future.
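It sorts the inputs according to their lengths in order to minimize the padding size, with a bit of randomness for the training set.
predict_with_generate (bool, optional, defaults to False) — Whether to use generate to calculate generative metrics (ROUGE, BLEU).
generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.
generation_num_beams (int, optional) — The num_beams to use on each evaluation loop when predict_with_generate=True. Will default to the num_beams value of the model configuration.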
generation_config (str or Path or GenerationConfig, optional) — Allows to load a GenerationConfig from the from_pretrained method. This can be either:
- a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.
- a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/.
- a GenerationConfig object.

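TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.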
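The idea of turning dataclass fields into command-line flags can be sketched with the standard library alone. The snippet below is not HfArgumentParser itself: MiniTrainingArguments and parse_into_dataclass are hypothetical stand-ins that only cover simple str and float fields.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class MiniTrainingArguments:
    # A tiny, made-up subset of fields used purely for illustration.
    output_dir: str = "out"
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0


def parse_into_dataclass(cls, argv):
    """Expose every dataclass field as a --flag and build an instance from argv."""
    parser = argparse.ArgumentParser()
    for field in fields(cls):
        parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    return cls(**vars(parser.parse_args(argv)))


args = parse_into_dataclass(MiniTrainingArguments, ["--learning_rate", "1e-4"])
print(args.learning_rate, args.output_dir)  # 0.0001 out
```

With the real library, the equivalent entry point is HfArgumentParser(TrainingArguments).parse_args_into_dataclasses(), which additionally handles booleans, enums, and optional types.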

to_dict

( )

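Serializes this instance while replacing Enum members by their values and GenerationConfig by dictionaries (for JSON serialization support). It obfuscates the token values by removing their value.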
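The behavior described here can be approximated on a toy dataclass. This is a sketch, not the library's code: MiniArgs and Strategy are hypothetical, and the placeholder written in place of the token is an assumption.

```python
import json
from dataclasses import dataclass, fields
from enum import Enum


class Strategy(Enum):
    NO = "no"
    STEPS = "steps"


@dataclass
class MiniArgs:
    eval_strategy: Strategy = Strategy.STEPS
    hub_token: str = "hf_secret"

    def to_dict(self):
        d = {f.name: getattr(self, f.name) for f in fields(self)}
        for key, value in d.items():
            if isinstance(value, Enum):
                d[key] = value.value         # Enum member -> plain value
            if key.endswith("_token"):
                d[key] = f"<{key.upper()}>"  # never serialize the secret itself
        return d


print(json.dumps(MiniArgs().to_dict()))
# {"eval_strategy": "steps", "hub_token": "<HUB_TOKEN>"}
```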
