PaliGemmaCausalLMPreprocessor class
keras_nlp.models.PaliGemmaCausalLMPreprocessor(
    tokenizer, sequence_length=512, add_start_token=True, add_end_token=True, **kwargs
)
Gemma Causal LM preprocessor.
This preprocessing layer is meant for use with keras_nlp.models.GemmaCausalLM. By default, it takes in batches of strings and returns outputs in a (x, y, sample_weight) format, where the y label is the next token id in the x sequence.
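The (x, y, sample_weight) layout can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: the token ids are reused from the generation example on this page, the pad id of 0 is an assumption, and the shift-by-one slicing shows how next-token labels are commonly derived from a packed sequence.

```python
import numpy as np

# Packed token ids and padding mask for one sequence (0 is assumed to be the pad id).
token_ids = np.array([[2, 714, 4320, 8426, 25341, 32292, 235265, 0]])
padding_mask = np.array([[1, 1, 1, 1, 1, 1, 1, 0]])

# Features: every position except the last.
x = {
    "token_ids": token_ids[:, :-1],
    "padding_mask": padding_mask[:, :-1],
}
# Labels: the same sequence shifted left by one, so y[i] is the token after x[i].
y = token_ids[:, 1:]
# Weights: zero wherever the label position is padding.
sample_weight = padding_mask[:, 1:]

print(y[0].tolist())
print(sample_weight[0].tolist())
```

Each label position lines up with the token that follows the corresponding input position, and padded label positions receive a weight of zero so they would not contribute to a loss.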
For use with generation, the layer also exposes two methods, generate_preprocess() and generate_postprocess(). When this preprocessor is attached to a keras_nlp.models.GemmaCausalLM instance, these methods are called implicitly in generate(). They can also be called standalone (e.g. to precompute preprocessing inputs for generation in a separate process).
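To make the generation path concrete, the sketch below mimics the two steps with plain NumPy: packing a prompt with a start token but no end token, then using the padding mask to drop padding before detokenizing. The helper names pack_for_generation and strip_padding, the start id 2, and the pad id 0 are hypothetical and chosen only for illustration; the real layer relies on its tokenizer for the string conversions.

```python
import numpy as np

START_ID, PAD_ID = 2, 0  # assumed special token ids, for illustration only


def pack_for_generation(prompt_ids, sequence_length):
    """Prepend the start token, truncate, and right-pad to sequence_length."""
    ids = ([START_ID] + list(prompt_ids))[:sequence_length]
    n_pad = sequence_length - len(ids)
    return {
        "token_ids": np.array([ids + [PAD_ID] * n_pad]),
        "padding_mask": np.array([[1] * len(ids) + [0] * n_pad]),
    }


def strip_padding(outputs):
    """Keep only the ids whose padding_mask entry is 1, per sequence."""
    return [
        ids[mask.astype(bool)].tolist()
        for ids, mask in zip(outputs["token_ids"], outputs["padding_mask"])
    ]


packed = pack_for_generation([714, 4320, 8426], sequence_length=6)
print(packed["token_ids"])
print(packed["padding_mask"])
print(strip_padding(packed))
```

Leaving the end token off keeps the sequence open so a model can continue it, and the padding mask is what lets the postprocessing step recover only the meaningful ids.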
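Arguments
tokenizer: A keras_nlp.models.GemmaTokenizer instance.
sequence_length: The length of the packed inputs. Defaults to 512.
add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence.
add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence.
Call arguments
x: A string, tf.Tensor or list of python strings.
y: Label data. Should always be None as the layer generates labels.
sample_weight: Label weights. Should always be None as the layer generates label weights.
sequence_length: Pass to override the configured sequence_length of the layer.
Examples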
import keras_nlp
import numpy as np
import tensorflow as tf

# Load the preprocessor from a preset.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en"
)
# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")
# Tokenize a batch of sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])
# Apply tokenization to a tf.data.Dataset.
features = tf.constant(["The quick brown fox.", "Call me Ishmael."])
ds = tf.data.Dataset.from_tensor_slices(features)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Prepare tokens for generation (no end token).
preprocessor.generate_preprocess(["The quick brown fox jumped."])
# Map generation outputs back to strings.
preprocessor.generate_postprocess({
    'token_ids': np.array([[2, 714, 4320, 8426, 25341, 32292, 235265, 0]]),
    'padding_mask': np.array([[1, 1, 1, 1, 1, 1, 1, 0]]),
})
from_preset method
PaliGemmaCausalLMPreprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:
1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'
For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().
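The four preset forms can be told apart by their prefix. The helper below is a hypothetical illustration of that distinction, not the resolver that from_preset() uses internally; the function name and the returned labels are invented for this sketch.

```python
def classify_preset(preset: str) -> str:
    """Label a preset string by the form it takes (illustrative only)."""
    if preset.startswith("kaggle://"):
        return "kaggle_handle"
    if preset.startswith("hf://"):
        return "hugging_face_handle"
    if preset.startswith(("./", "../", "/")):
        return "local_path"
    # Anything else is treated as a built-in preset name.
    return "builtin"


for preset in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(preset, "->", classify_preset(preset))
```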
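Arguments
preset: A string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.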
Examples
import keras_nlp

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)
| Preset name | Parameters | Description |
|---|---|---|
| pali_gemma_3b_mix_224 | 2.92B | image size 224, mix fine-tuned, text sequence length 256 |
| pali_gemma_3b_mix_448 | 2.92B | image size 448, mix fine-tuned, text sequence length 512 |
| pali_gemma_3b_224 | 2.92B | image size 224, pre-trained, text sequence length 128 |
| pali_gemma_3b_448 | 2.92B | image size 448, pre-trained, text sequence length 512 |
| pali_gemma_3b_896 | 2.93B | image size 896, pre-trained, text sequence length 512 |
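When choosing a preset programmatically, the table can be mirrored as a small lookup. This is only a sketch: the PRESETS dictionary and the presets_for_image_size helper are hypothetical, simply restate the rows above, and are not part of keras_nlp.

```python
# (image_size, variant, text_sequence_length) per preset, copied from the table.
PRESETS = {
    "pali_gemma_3b_mix_224": (224, "mix fine-tuned", 256),
    "pali_gemma_3b_mix_448": (448, "mix fine-tuned", 512),
    "pali_gemma_3b_224": (224, "pre-trained", 128),
    "pali_gemma_3b_448": (448, "pre-trained", 512),
    "pali_gemma_3b_896": (896, "pre-trained", 512),
}


def presets_for_image_size(image_size):
    """Return the preset names whose image size matches, in table order."""
    return [name for name, (size, _, _) in PRESETS.items() if size == image_size]


print(presets_for_image_size(224))
```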
tokenizer property
keras_nlp.models.PaliGemmaCausalLMPreprocessor.tokenizer
The tokenizer used to tokenize strings.