
GemmaPreprocessor layer

`GemmaPreprocessor` class

keras_nlp.models.GemmaPreprocessor(
    tokenizer, sequence_length=8192, add_start_token=True, add_end_token=True, **kwargs
)

Gemma preprocessing layer which tokenizes and packs inputs.

This preprocessing layer will do 2 things:

1. Tokenize the inputs using the tokenizer.
2. Construct a dictionary with keys "token_ids", "padding_mask", that can be passed directly to a keras_nlp.models.GemmaBackbone.

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.

The call method of this layer accepts three arguments, x, y, and sample_weight. x can be a python string or tensor representing a single segment, a list of python strings representing a batch of single segments, or a list of tensors representing multiple segments to be packed together. y and sample_weight are both optional, can have any format, and will be passed through unaltered.
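The pass-through behavior described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `passthrough_call` and `fake_preprocess` are hypothetical names, and the stand-in preprocessing function does not tokenize anything.

```python
def passthrough_call(preprocess_x, x, y=None, sample_weight=None):
    """Preprocess only x; return y and sample_weight exactly as given.

    Hypothetical helper that mimics the (x, y, sample_weight) convention
    used by keras.Model.fit. It is not part of keras_nlp.
    """
    x = preprocess_x(x)
    if y is None:
        return x
    if sample_weight is None:
        return (x, y)
    return (x, y, sample_weight)


def fake_preprocess(text):
    # Stand-in for tokenization and packing; assumed for illustration only.
    return {"num_chars": len(text)}


unlabeled = passthrough_call(fake_preprocess, "Call me Ishmael.")
labeled = passthrough_call(fake_preprocess, "Call me Ishmael.", y=1)
weighted = passthrough_call(
    fake_preprocess, "Call me Ishmael.", y=1, sample_weight=0.5
)
```

Only the features change; the label and the weight come back untouched, which is why the layer can be mapped over either labeled or unlabeled datasets.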

GemmaPreprocessor expects the input to have only one segment, as Gemma is mainly used for generation tasks. For tasks with multi-segment inputs, combine the segments into a single string before passing it to the preprocessor layer.

Arguments

tokenizer: A keras_nlp.models.GemmaTokenizer instance.
sequence_length: The length of the packed inputs.
add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence.
add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence.
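To make the effect of these arguments concrete, here is a minimal pure-Python sketch of packing a single tokenized sequence. It is a simplification under stated assumptions, not the layer's actual code: `pack_tokens` is a hypothetical helper, the ids 0, 1, and 2 for padding, start, and end are borrowed from the custom-vocabulary example on this page, and the choice to truncate so that the end token always fits is an assumption about the behavior.

```python
def pack_tokens(
    token_ids,
    sequence_length,
    start_id=1,  # assumed start token id (bos_id=1 in the custom vocabulary)
    end_id=2,    # assumed end token id (eos_id=2 in the custom vocabulary)
    pad_id=0,    # assumed padding id (pad_id=0 in the custom vocabulary)
    add_start_token=True,
    add_end_token=True,
):
    """Hypothetical sketch: add start/end tokens, truncate, then pad."""
    ids = list(token_ids)
    if add_start_token:
        ids = [start_id] + ids
    if add_end_token:
        # Assumption: trim first so the end token always fits.
        ids = ids[: sequence_length - 1] + [end_id]
    ids = ids[:sequence_length]
    num_real = len(ids)
    num_pad = sequence_length - num_real
    return {
        "token_ids": ids + [pad_id] * num_pad,
        # True marks real tokens, False marks padding.
        "padding_mask": [True] * num_real + [False] * num_pad,
    }


packed = pack_tokens([4, 5, 6], sequence_length=8)
truncated = pack_tokens([4, 5, 6, 7, 8], sequence_length=4)
no_markers = pack_tokens(
    [4, 5], sequence_length=4, add_start_token=False, add_end_token=False
)
```

Every output has exactly `sequence_length` entries, so a batch of packed sequences stacks into a dense tensor regardless of the original text lengths.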

Call arguments

x: A string, tf.Tensor or list of python strings.
y: Any label data. Will be passed through unaltered.
sample_weight: Any label weight data. Will be passed through unaltered.
sequence_length: Pass to override the configured sequence_length of the layer.

Examples

Directly calling the layer on data.

import io

import keras_nlp
import sentencepiece
import tensorflow as tf

preprocessor = keras_nlp.models.GemmaPreprocessor.from_preset(
    "gemma_2b_en"
)

# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")

# Tokenize a batch of sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])

# Custom vocabulary.
bytes_io = io.BytesIO()
ds = tf.data.Dataset.from_tensor_slices(["The quick brown fox jumped."])
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=8,
    model_type="WORD",
    pad_id=0,
    bos_id=1,
    eos_id=2,
    unk_id=3,
    pad_piece="<pad>",
    bos_piece="<bos>",
    eos_piece="<eos>",
    unk_piece="<unk>",
)
tokenizer = keras_nlp.models.GemmaTokenizer(
    proto=bytes_io.getvalue(),
)
preprocessor = keras_nlp.models.GemmaPreprocessor(tokenizer=tokenizer)
preprocessor("The quick brown fox jumped.")

Apply preprocessing to a tf.data.Dataset.

import keras_nlp
import tensorflow as tf

preprocessor = keras_nlp.models.GemmaPreprocessor.from_preset(
    "gemma_2b_en"
)

text = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
label = tf.constant([1, 1])

# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((text, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(text)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)


`from_preset` method

GemmaPreprocessor.from_preset(preset, **kwargs)

Instantiate a keras_nlp.models.Preprocessor from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

a built-in preset identifier like 'bert_base_en'
a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
a Hugging Face handle like 'hf://user/bert_base_en'
a path to a local preset directory like './bert_base_en'
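The four forms listed above can be told apart by their prefix. The sketch below is an illustration under assumptions, not the library's resolution logic: `describe_preset` is a hypothetical helper, and treating anything that starts with `./`, `../`, or `/` as a local path is a simplification.

```python
def describe_preset(preset: str) -> str:
    """Hypothetical classifier for the four preset forms listed above."""
    if preset.startswith("kaggle://"):
        return "kaggle_handle"
    if preset.startswith("hf://"):
        return "hugging_face_handle"
    # Assumption: local directories are written with an explicit path prefix.
    if preset.startswith(("./", "../", "/")):
        return "local_directory"
    # Anything else is treated as a built-in identifier in this sketch.
    return "builtin_identifier"


kinds = {
    preset: describe_preset(preset)
    for preset in [
        "bert_base_en",
        "kaggle://user/bert/keras/bert_base_en",
        "hf://user/bert_base_en",
        "./bert_base_en",
    ]
}
```

The real loader may also check whether a built-in name is registered or whether the directory exists; this sketch only looks at the string.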

For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().

Arguments

preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.

Examples

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)

Preset name	Parameters	Description
gemma_2b_en	2.51B	2 billion parameter, 18-layer, base Gemma model.
gemma_instruct_2b_en	2.51B	2 billion parameter, 18-layer, instruction tuned Gemma model.
gemma_1.1_instruct_2b_en	2.51B	2 billion parameter, 18-layer, instruction tuned Gemma model. The 1.1 update improves model quality.
code_gemma_1.1_2b_en	2.51B	2 billion parameter, 18-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion. The 1.1 update improves model quality.
code_gemma_2b_en	2.51B	2 billion parameter, 18-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion.
gemma_7b_en	8.54B	7 billion parameter, 28-layer, base Gemma model.
gemma_instruct_7b_en	8.54B	7 billion parameter, 28-layer, instruction tuned Gemma model.
gemma_1.1_instruct_7b_en	8.54B	7 billion parameter, 28-layer, instruction tuned Gemma model. The 1.1 update improves model quality.
code_gemma_7b_en	8.54B	7 billion parameter, 28-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion.
code_gemma_instruct_7b_en	8.54B	7 billion parameter, 28-layer, instruction tuned CodeGemma model. This model has been trained for chat use cases related to code.
code_gemma_1.1_instruct_7b_en	8.54B	7 billion parameter, 28-layer, instruction tuned CodeGemma model. This model has been trained for chat use cases related to code. The 1.1 update improves model quality.
gemma2_2b_en	2.61B	2 billion parameter, 26-layer, base Gemma model.
gemma2_instruct_2b_en	2.61B	2 billion parameter, 26-layer, instruction tuned Gemma model.
gemma2_9b_en	9.24B	9 billion parameter, 42-layer, base Gemma model.
gemma2_instruct_9b_en	9.24B	9 billion parameter, 42-layer, instruction tuned Gemma model.
gemma2_27b_en	27.23B	27 billion parameter, 42-layer, base Gemma model.
gemma2_instruct_27b_en	27.23B	27 billion parameter, 42-layer, instruction tuned Gemma model.
shieldgemma_2b_en	2.61B	2 billion parameter, 26-layer, ShieldGemma model.
shieldgemma_9b_en	9.24B	9 billion parameter, 42-layer, ShieldGemma model.
shieldgemma_27b_en	27.23B	27 billion parameter, 42-layer, ShieldGemma model.
pali_gemma_3b_mix_224	2.92B	image size 224, mix fine tuned, text sequence length is 256
pali_gemma_3b_mix_448	2.92B	image size 448, mix fine tuned, text sequence length is 512
pali_gemma_3b_224	2.92B	image size 224, pre trained, text sequence length is 128
pali_gemma_3b_448	2.92B	image size 448, pre trained, text sequence length is 512
pali_gemma_3b_896	2.93B	image size 896, pre trained, text sequence length is 512

`tokenizer` property

keras_nlp.models.GemmaPreprocessor.tokenizer

The tokenizer used to tokenize strings.
