HFTrainer

Trains a new Hugging Face Transformers model using the Trainer framework.

Example

The following shows a simple example using this pipeline.

import pandas as pd

from datasets import load_dataset

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Pandas DataFrame
df = pd.read_csv("training.csv")
model, tokenizer = trainer("bert-base-uncased", df)

# Hugging Face dataset
ds = load_dataset("glue", "sst2")
model, tokenizer = trainer("bert-base-uncased", ds["train"], columns=("sentence", "label"))

# List of dicts
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
model, tokenizer = trainer("bert-base-uncased", dt)

# Supports additional TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt, 
                            learning_rate=3e-5, num_train_epochs=5)

All TrainingArguments are supported as function arguments to the trainer call.
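
For example, any other transformers.TrainingArguments field can be passed the same way. A minimal sketch reusing the dt list above, with illustrative values:

# Keyword arguments not consumed by the pipeline are parsed into TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt,
                           output_dir="model-output",
                           per_device_train_batch_size=16,
                           weight_decay=0.01,
                           num_train_epochs=3)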

See the following links for more detailed examples.

| Notebook | Description |
|:---------|:------------|
| Train a text labeler | Build text sequence classification models |
| Train without labels | Use zero-shot classifiers to train new models |
| Train a QA model | Build and fine-tune question-answering models |
| Train a language model from scratch | Build new language models |

Training tasks

The HFTrainer pipeline builds and/or fine-tunes models for the following training tasks.

| Task | Description |
|:-----|:------------|
| language-generation | Causal language model for text generation (e.g. GPT) |
| language-modeling | Masked language model for general tasks (e.g. BERT) |
| question-answering | Extractive question-answering model, typically with the SQuAD dataset |
| sequence-sequence | Sequence-to-sequence model (e.g. T5) |
| text-classification | Classify text using a set of labels |
| token-detection | ELECTRA-style pre-training with replaced token detection |
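
The task parameter determines the model type that is built. As a minimal sketch, assuming records with only a text field, masked language modeling could be run as follows (model id and data are illustrative):

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Each record only needs a "text" field for masked language modeling (illustrative data)
data = [{"text": "txtai is an all-in-one embeddings database"}] * 32

model, tokenizer = trainer("bert-base-uncased", data, task="language-modeling")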

PEFT

Parameter-Efficient Fine-Tuning (PEFT) is supported through Hugging Face's PEFT library. Quantization is provided through bitsandbytes. See the example below.

from txtai.pipeline import HFTrainer

trainer = HFTrainer()
trainer(..., quantize=True, lora=True)

When these parameters are set to True, they use default configurations. These configurations can also be customized.

quantize = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16"
}

lora = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none"
}

trainer(..., quantize=quantize, lora=lora)

These parameters also accept transformers.BitsAndBytesConfig and peft.LoraConfig instances.
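
For instance, a minimal sketch passing configuration objects instead of dicts, mirroring the default values shown above:

import torch

from peft import LoraConfig
from transformers import BitsAndBytesConfig

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Same settings as the dicts above, expressed as configuration objects
quantize = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora = LoraConfig(r=16, lora_alpha=8, target_modules="all-linear", lora_dropout=0.05, bias="none")

trainer(..., quantize=quantize, lora=lora)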

See the PEFT documentation for more information.

Methods

Python documentation for the pipeline.

__call__(base, train, validation=None, columns=None, maxlength=None, stride=128, task='text-classification', prefix=None, metrics=None, tokenizers=None, checkpoint=None, quantize=None, lora=None, **args)

Builds a new model using arguments.

Parameters:

| Name | Description | Default |
|:-----|:------------|:--------|
| base | path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple | required |
| train | training data | required |
| validation | validation data | None |
| columns | tuple of columns to use for text/label, defaults to (text, None, label) | None |
| maxlength | maximum sequence length, defaults to tokenizer.model_max_length | None |
| stride | chunk size for splitting data for QA tasks | 128 |
| task | optional model task or category, determines the model type, defaults to "text-classification" | 'text-classification' |
| prefix | optional source prefix | None |
| metrics | optional function that computes and returns a dict of evaluation metrics | None |
| tokenizers | optional number of concurrent tokenizers, defaults to None | None |
| checkpoint | optional resume from checkpoint flag or path to checkpoint directory, defaults to None | None |
| quantize | quantization configuration to pass to base model | None |
| lora | lora configuration to pass to PEFT model | None |
| args | training arguments | {} |

Returns:

(model, tokenizer)

Source code in txtai/pipeline/train/hftrainer.py
def __call__(
    self,
    base,
    train,
    validation=None,
    columns=None,
    maxlength=None,
    stride=128,
    task="text-classification",
    prefix=None,
    metrics=None,
    tokenizers=None,
    checkpoint=None,
    quantize=None,
    lora=None,
    **args
):
    """
    Builds a new model using arguments.

    Args:
        base: path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple
        train: training data
        validation: validation data
        columns: tuple of columns to use for text/label, defaults to (text, None, label)
        maxlength: maximum sequence length, defaults to tokenizer.model_max_length
        stride: chunk size for splitting data for QA tasks
        task: optional model task or category, determines the model type, defaults to "text-classification"
        prefix: optional source prefix
        metrics: optional function that computes and returns a dict of evaluation metrics
        tokenizers: optional number of concurrent tokenizers, defaults to None
        checkpoint: optional resume from checkpoint flag or path to checkpoint directory, defaults to None
        quantize: quantization configuration to pass to base model
        lora: lora configuration to pass to PEFT model
        args: training arguments

    Returns:
        (model, tokenizer)
    """

    # Quantization / LoRA support
    if (quantize or lora) and not PEFT:
        raise ImportError('PEFT is not available - install "pipeline" extra to enable')

    # Parse TrainingArguments
    args = self.parse(args)

    # Set seed for model reproducibility
    set_seed(args.seed)

    # Load model configuration, tokenizer and max sequence length
    config, tokenizer, maxlength = self.load(base, maxlength)

    # Default tokenizer pad token if it's not set
    tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

    # Prepare parameters
    process, collator, labels = self.prepare(task, train, tokenizer, columns, maxlength, stride, prefix, args)

    # Tokenize training and validation data
    train, validation = process(train, validation, os.cpu_count() if tokenizers and isinstance(tokenizers, bool) else tokenizers)

    # Create model to train
    model = self.model(task, base, config, labels, tokenizer, quantize)

    # Default config pad token if it's not set
    model.config.pad_token_id = model.config.pad_token_id if model.config.pad_token_id is not None else model.config.eos_token_id

    # Load as PEFT model, if necessary
    model = self.peft(task, lora, model)

    # Add model to collator
    if collator:
        collator.model = model

    # Build trainer
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=collator,
        args=args,
        train_dataset=train,
        eval_dataset=validation if validation else None,
        compute_metrics=metrics,
    )

    # Run training
    trainer.train(resume_from_checkpoint=checkpoint)

    # Run evaluation
    if validation:
        trainer.evaluate()

    # Save model outputs
    if args.should_save:
        trainer.save_model()
        trainer.save_state()

    # Put model in eval mode to disable weight updates and return (model, tokenizer)
    return (model.eval(), tokenizer)
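
The validation and metrics parameters map directly to the underlying Trainer's eval_dataset and compute_metrics. A minimal sketch, assuming a text-classification run with illustrative train/validation records and an accuracy function:

import numpy as np

from txtai.pipeline import HFTrainer

def accuracy(eval_pred):
    # eval_pred is a transformers EvalPrediction: (predictions, label_ids)
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

trainer = HFTrainer()

# Illustrative data, any of the supported dataset formats works here
train = [{"text": "great product", "label": 1}, {"text": "poor service", "label": 0}] * 16
validation = [{"text": "works well", "label": 1}, {"text": "does not work", "label": 0}] * 4

model, tokenizer = trainer("bert-base-uncased", train, validation=validation, metrics=accuracy)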