Llama3Preprocessor class

keras_nlp.models.Llama3Preprocessor(
    tokenizer, sequence_length=1024, add_start_token=True, add_end_token=False, **kwargs
)
A Llama preprocessing layer which tokenizes and packs inputs.
This preprocessing layer will do three things:
1. Tokenize any number of input segments using the tokenizer.
2. Pack the inputs together with a keras_nlp.layers.StartEndPacker, adding the appropriate tokens.
3. Construct a dictionary with keys "token_ids" and "padding_mask" that can be passed directly to keras_nlp.models.LlamaBackbone.

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.
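For orientation, the sketch below is not part of the upstream example set; it assumes the "llama3_8b_en" preset listed further down and the default sequence_length=1024, and shows the structure of the packed output.

import keras_nlp

# Hedged sketch: inspect the dictionary produced by the layer.
preprocessor = keras_nlp.models.Llama3Preprocessor.from_preset("llama3_8b_en")
x = preprocessor("The quick brown fox jumped.")
# Expected structure for an unbatched string input:
#   x["token_ids"]    -> integer tensor of length 1024 (the packed, padded ids)
#   x["padding_mask"] -> boolean mask of length 1024, True at real token positions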
Arguments

- tokenizer: A keras_nlp.models.LlamaTokenizer instance.
- sequence_length: The length of the packed inputs.
- add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence. Default is True.
- add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence. Default is False.

Call arguments

- x: A tensor of single string sequences, or a tuple of multiple tensor sequences to be packed together. Inputs may be batched or unbatched. For single sequences, raw python inputs will be converted to tensors. For multiple sequences, pass tensors directly.
- y: Any label data. Will be passed through unaltered.
- sample_weight: Any label weight data. Will be passed through unaltered.
- sequence_length: Pass to override the configured sequence_length of the layer.

Examples
Directly calling the layer on data.
preprocessor = keras_nlp.models.Llama3Preprocessor.from_preset(
    "llama3_8b_en"
)
# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")
# Tokenize a batch of single sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])
# Preprocess a batch of sentence pairs.
# When handling multiple sequences, always convert to tensors first!
first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
preprocessor((first, second))
Mapping with tf.data.Dataset.
preprocessor = keras_nlp.models.Llama3Preprocessor.from_preset(
    "llama3_8b_en"
)
first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
label = tf.constant([1, 1])
# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((first, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(first)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map labeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices(((first, second), label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map unlabeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices((first, second))
# Watch out for tf.data's default unpacking of tuples here!
# Best to invoke the `preprocessor` directly in this case.
ds = ds.map(
    lambda first, second: preprocessor(x=(first, second)),
    num_parallel_calls=tf.data.AUTOTUNE,
)
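Constructing the preprocessor directly from a tokenizer. This is a hedged sketch rather than an upstream example: it assumes keras_nlp.models.Llama3Tokenizer can be loaded from the same "llama3_8b_en" preset, and the settings shown are illustrative.

import keras_nlp

# Build the preprocessor yourself to control the constructor arguments.
tokenizer = keras_nlp.models.Llama3Tokenizer.from_preset("llama3_8b_en")
preprocessor = keras_nlp.models.Llama3Preprocessor(
    tokenizer=tokenizer,
    sequence_length=256,    # pack/pad to 256 tokens instead of the default 1024
    add_start_token=True,   # prepend the tokenizer start token
    add_end_token=True,     # also append the tokenizer end token
)
preprocessor("The quick brown fox jumped.")
# The configured length can also be overridden for a single call.
preprocessor("The quick brown fox jumped.", sequence_length=64)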
from_preset method

Llama3Preprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used
to save and load a pre-trained model. The preset can be passed as
one of:

1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'

For any Preprocessor subclass, you can run cls.presets.keys() to
list all built-in presets available on the class.
As there are usually multiple preprocessing classes for a given model,
this method should be called on a specific subclass like
keras_nlp.models.BertPreprocessor.from_preset().
Arguments

- preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
"gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
"bert_base_en",
)
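As a small sketch (not an upstream example), the cls.presets.keys() call mentioned above can be used to check which presets this class ships with; the expected entries are those in the table below.

import keras_nlp

# List the built-in presets registered on this preprocessor class.
print(keras_nlp.models.Llama3Preprocessor.presets.keys())
# Expected to include "llama3_8b_en" and "llama3_instruct_8b_en".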
| Preset name | Parameters | Description |
|---|---|---|
| llama3_8b_en | 8.03B | LLaMA 3 8B Base model |
| llama3_instruct_8b_en | 8.03B | LLaMA 3 8B Instruct model |
tokenizer property

keras_nlp.models.Llama3Preprocessor.tokenizer
The tokenizer used to tokenize strings.
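A brief sketch of using the property; this is not from the upstream docs, the preset name is illustrative, and it assumes the returned Llama3Tokenizer exposes the usual tokenize and detokenize calls.

import keras_nlp

# Access the underlying tokenizer through the property.
preprocessor = keras_nlp.models.Llama3Preprocessor.from_preset("llama3_8b_en")
tokenizer = preprocessor.tokenizer
token_ids = tokenizer("The quick brown fox jumped.")  # raw token ids, no packing
text = tokenizer.detokenize(token_ids)                # map ids back to a string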