GemmaPreprocessor class
keras_nlp.models.GemmaPreprocessor(
    tokenizer, sequence_length=8192, add_start_token=True, add_end_token=True, **kwargs
)
Gemma preprocessing layer which tokenizes and packs inputs.
This preprocessing layer will do 2 things:
- Tokenize the inputs using the tokenizer.
- Construct a dictionary with keys "token_ids" and "padding_mask" that can be passed directly to a keras_nlp.models.GemmaBackbone.
This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.
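To make the shape of the output dictionary concrete, here is a minimal pure-Python sketch of the packing step. It is not the library's implementation: the helper name pack_tokens is hypothetical, the ids 0, 1 and 2 for padding, start and end are illustrative, and keeping the end id when truncating is an assumption.

```python
def pack_tokens(token_ids, sequence_length, start_id=None, end_id=None, pad_id=0):
    """Toy packer: add optional start/end ids, then truncate or pad."""
    ids = list(token_ids)
    if start_id is not None:
        ids = [start_id] + ids
    if end_id is not None:
        # Assumption: leave room for the end id when the input is too long.
        ids = ids[: sequence_length - 1] + [end_id]
    else:
        ids = ids[:sequence_length]
    num_real = len(ids)
    num_pad = sequence_length - num_real
    return {
        "token_ids": ids + [pad_id] * num_pad,
        # True marks real tokens, False marks padding.
        "padding_mask": [True] * num_real + [False] * num_pad,
    }


packed = pack_tokens([5, 6, 7], sequence_length=6, start_id=1, end_id=2)
print(packed["token_ids"])     # [1, 5, 6, 7, 2, 0]
print(packed["padding_mask"])  # [True, True, True, True, True, False]
```

The real layer returns tensors rather than Python lists; the sketch only shows how the two keys relate to each other.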
The call method of this layer accepts three arguments, x, y, and sample_weight. x can be a Python string or tensor representing a single segment, a list of Python strings representing a batch of single segments, or a list of tensors representing multiple segments to be packed together. y and sample_weight are both optional, can have any format, and will be passed through unaltered.
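The pass-through behavior can be sketched in plain Python. The function toy_call below is a hypothetical stand-in, not the layer itself: whitespace splitting replaces real tokenization, and returning a result whose arity mirrors the inputs is an assumption modeled on the (x, y, sample_weight) convention.

```python
def toy_call(x, y=None, sample_weight=None):
    """Transform x only; hand y and sample_weight back untouched."""
    features = {"tokens": x.split()}  # stand-in for tokenizing and packing
    if y is None:
        return features
    if sample_weight is None:
        return features, y
    return features, y, sample_weight


label = {"any": "format"}
features, same_label, weight = toy_call("Call me Ishmael.", label, 0.5)
print(features)             # {'tokens': ['Call', 'me', 'Ishmael.']}
print(same_label is label)  # True
```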
GemmaPreprocessor expects the input to have only one segment, as Gemma is mainly used for generation tasks. For tasks with multi-segment inputs, combine the inputs into a single string before passing them to the preprocessor layer.
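One way to combine segments is to join them with a separator before preprocessing. The helper combine_segments and the newline separator are assumptions for illustration; the text above does not prescribe a particular format.

```python
def combine_segments(segments, separator="\n"):
    """Join several text segments into one string for a single-segment model."""
    return separator.join(segment.strip() for segment in segments)


prompt = combine_segments(["What is Keras?", " A deep learning API. "])
print(repr(prompt))  # 'What is Keras?\nA deep learning API.'
```

The combined string can then be passed to the preprocessor as a single input.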
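Arguments
- tokenizer: A keras_nlp.models.GemmaTokenizer instance.
- sequence_length: The length of the packed inputs.
- add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence.
- add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence.
Call arguments
- x: A string, tf.Tensor or list of Python strings.
- y: Any label data. Will be passed through unaltered.
- sample_weight: Any label weight data. Will be passed through unaltered.
- sequence_length: Pass to override the configured sequence_length of the layer.
Examples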
Directly calling the layer on data.
import io

import keras_nlp
import sentencepiece
import tensorflow as tf

preprocessor = keras_nlp.models.GemmaPreprocessor.from_preset(
    "gemma_2b_en"
)
# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")
# Tokenize a batch of sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])
# Custom vocabulary.
bytes_io = io.BytesIO()
ds = tf.data.Dataset.from_tensor_slices(["The quick brown fox jumped."])
# Train a tiny SentencePiece model in memory and write it to bytes_io.
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=8,
    model_type="WORD",
    pad_id=0,
    bos_id=1,
    eos_id=2,
    unk_id=3,
    pad_piece="<pad>",
    bos_piece="<bos>",
    eos_piece="<eos>",
    unk_piece="<unk>",
)
tokenizer = keras_nlp.models.GemmaTokenizer(
    proto=bytes_io.getvalue(),
)
preprocessor = keras_nlp.models.GemmaPreprocessor(tokenizer=tokenizer)
preprocessor("The quick brown fox jumped.")
Apply preprocessing to a tf.data.Dataset.
import keras_nlp
import tensorflow as tf

preprocessor = keras_nlp.models.GemmaPreprocessor.from_preset(
    "gemma_2b_en"
)
text = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
label = tf.constant([1, 1])
# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((text, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(text)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
from_preset method
GemmaPreprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
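A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:
- a built-in preset identifier like 'bert_base_en'
- a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
- a Hugging Face handle like 'hf://user/bert_base_en'
- a path to a local preset directory like './bert_base_en'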
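The four accepted forms (built-in name, kaggle:// handle, hf:// handle, local path) can be told apart by their prefix. The sketch below is illustrative only: classify_preset is a hypothetical helper, not the logic the library uses, and the set of built-in names is supplied by the caller.

```python
def classify_preset(preset, builtin_names):
    """Guess which kind of preset string was passed (illustrative only)."""
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    if preset in builtin_names:
        return "builtin"
    # Anything else is treated as a path to a local preset directory.
    return "local_path"


builtin = {"bert_base_en"}
for preset in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(preset, "->", classify_preset(preset, builtin))
```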
For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().
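The idea of per-class presets can be illustrated with toy classes. Everything here is hypothetical (the class names, the preset names and the stored settings); it only shows why the preset listing and from_preset are tied to a specific subclass, and is not how keras_nlp stores or loads presets.

```python
class ToyPreprocessor:
    presets = {}  # the base class ships no presets in this sketch

    def __init__(self, sequence_length):
        self.sequence_length = sequence_length

    @classmethod
    def from_preset(cls, preset):
        if preset not in cls.presets:
            raise ValueError(
                f"Unknown preset {preset!r} for {cls.__name__}; "
                f"choose from {sorted(cls.presets)}"
            )
        return cls(**cls.presets[preset])


class ToyGemmaPreprocessor(ToyPreprocessor):
    presets = {"toy_gemma": {"sequence_length": 8192}}


class ToyBertPreprocessor(ToyPreprocessor):
    presets = {"toy_bert": {"sequence_length": 512}}


print(list(ToyGemmaPreprocessor.presets.keys()))  # ['toy_gemma']
preprocessor = ToyGemmaPreprocessor.from_preset("toy_gemma")
print(type(preprocessor).__name__, preprocessor.sequence_length)
```

Calling from_preset on the toy base class raises a ValueError, because that class has no presets of its own.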
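Arguments
- preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.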
Examples
import keras_nlp

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)
Preset name | Parameters | Description |
---|---|---|
gemma_2b_en | 2.51B | 2 billion parameter, 18-layer, base Gemma model. |
gemma_instruct_2b_en | 2.51B | 2 billion parameter, 18-layer, instruction tuned Gemma model. |
gemma_1.1_instruct_2b_en | 2.51B | 2 billion parameter, 18-layer, instruction tuned Gemma model. The 1.1 update improves model quality. |
code_gemma_1.1_2b_en | 2.51B | 2 billion parameter, 18-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion. The 1.1 update improves model quality. |
code_gemma_2b_en | 2.51B | 2 billion parameter, 18-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion. |
gemma_7b_en | 8.54B | 7 billion parameter, 28-layer, base Gemma model. |
gemma_instruct_7b_en | 8.54B | 7 billion parameter, 28-layer, instruction tuned Gemma model. |
gemma_1.1_instruct_7b_en | 8.54B | 7 billion parameter, 28-layer, instruction tuned Gemma model. The 1.1 update improves model quality. |
code_gemma_7b_en | 8.54B | 7 billion parameter, 28-layer, CodeGemma model. This model has been trained on a fill-in-the-middle (FIM) task for code completion. |
code_gemma_instruct_7b_en | 8.54B | 7 billion parameter, 28-layer, instruction tuned CodeGemma model. This model has been trained for chat use cases related to code. |
code_gemma_1.1_instruct_7b_en | 8.54B | 7 billion parameter, 28-layer, instruction tuned CodeGemma model. This model has been trained for chat use cases related to code. The 1.1 update improves model quality. |
gemma2_2b_en | 2.61B | 2 billion parameter, 26-layer, base Gemma model. |
gemma2_instruct_2b_en | 2.61B | 2 billion parameter, 26-layer, instruction tuned Gemma model. |
gemma2_9b_en | 9.24B | 9 billion parameter, 42-layer, base Gemma model. |
gemma2_instruct_9b_en | 9.24B | 9 billion parameter, 42-layer, instruction tuned Gemma model. |
gemma2_27b_en | 27.23B | 27 billion parameter, 42-layer, base Gemma model. |
gemma2_instruct_27b_en | 27.23B | 27 billion parameter, 42-layer, instruction tuned Gemma model. |
shieldgemma_2b_en | 2.61B | 2 billion parameter, 26-layer, ShieldGemma model. |
shieldgemma_9b_en | 9.24B | 9 billion parameter, 42-layer, ShieldGemma model. |
shieldgemma_27b_en | 27.23B | 27 billion parameter, 42-layer, ShieldGemma model. |
pali_gemma_3b_mix_224 | 2.92B | image size 224, mix fine tuned, text sequence length is 256 |
pali_gemma_3b_mix_448 | 2.92B | image size 448, mix fine tuned, text sequence length is 512 |
pali_gemma_3b_224 | 2.92B | image size 224, pre trained, text sequence length is 128 |
pali_gemma_3b_448 | 2.92B | image size 448, pre trained, text sequence length is 512 |
pali_gemma_3b_896 | 2.93B | image size 896, pre trained, text sequence length is 512 |
tokenizer property
keras_nlp.models.GemmaPreprocessor.tokenizer
The tokenizer used to tokenize strings.