Phi3CausalLMPreprocessor class
keras_nlp.models.Phi3CausalLMPreprocessor(
    tokenizer, sequence_length=4096, add_start_token=True, add_end_token=False, **kwargs
)
Phi3 Causal LM preprocessor.
This preprocessing layer is meant for use with keras_nlp.models.Phi3CausalLM. By default, it takes in batches of strings and returns outputs in an (x, y, sample_weight) format, where the y label is the next token id in the x sequence.
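To make the label shift concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the token ids are made up, and the pad id of 0 is an assumption chosen for illustration. It only shows that y is x shifted left by one position and that sample_weight masks out padded label positions.

```python
# Hypothetical packed token ids for one sequence (0 is an assumed pad id).
packed_ids = [1, 450, 7205, 310, 0, 0]
padding_mask = [token_id != 0 for token_id in packed_ids]


def make_causal_lm_example(token_ids, mask):
    """Split a packed sequence into features, next-token labels and weights."""
    x = {"token_ids": token_ids[:-1], "padding_mask": mask[:-1]}
    y = token_ids[1:]  # the label at position i is the token at position i + 1
    sample_weight = mask[1:]  # padded label positions get zero weight
    return x, y, sample_weight


x, y, sample_weight = make_causal_lm_example(packed_ids, padding_mask)
print(x["token_ids"])  # [1, 450, 7205, 310, 0]
print(y)               # [450, 7205, 310, 0, 0]
print(sample_weight)   # [True, True, True, False, False]
```

Because the features and labels both drop one position, the sketch starts from a packed sequence that is one token longer than the features it produces.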
For use with generation, the layer also exposes two methods, generate_preprocess() and generate_postprocess(). When this preprocessor is attached to a keras_nlp.models.Phi3CausalLM instance, these methods will be called implicitly in generate(). They can also be called standalone (e.g. to precompute preprocessing inputs for generation in a separate process).
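The round trip these two methods perform can be sketched with toy stand-ins. Everything below is hypothetical: the vocabulary, the whitespace "tokenizer", and the toy_* function names are invented for illustration, and the token_ids/padding_mask dictionary layout is an assumption about the data exchanged, not a statement of the library's exact behavior.

```python
# Toy stand-ins: a whitespace "tokenizer" with a made-up vocabulary.
VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2, "fish": 3, "taco": 4, "please": 5}
ID_TO_WORD = {token_id: word for word, token_id in VOCAB.items()}


def toy_generate_preprocess(text, sequence_length):
    """Tokenize a prompt, prepend the start token, and pad to a fixed length."""
    ids = [VOCAB["<s>"]] + [VOCAB[word] for word in text.split()]
    ids = ids[:sequence_length]
    num_padding = sequence_length - len(ids)
    mask = [True] * len(ids) + [False] * num_padding
    ids = ids + [VOCAB["<pad>"]] * num_padding
    return {"token_ids": ids, "padding_mask": mask}


def toy_generate_postprocess(inputs):
    """Drop padding and special tokens, then join the rest back into text."""
    special = {VOCAB["<pad>"], VOCAB["<s>"], VOCAB["</s>"]}
    kept = [
        token_id
        for token_id, is_real in zip(inputs["token_ids"], inputs["padding_mask"])
        if is_real and token_id not in special
    ]
    return " ".join(ID_TO_WORD[token_id] for token_id in kept)


prompt = toy_generate_preprocess("fish taco", sequence_length=6)
# A model would fill in the padded positions here; we fake one generated token.
prompt["token_ids"][3] = VOCAB["please"]
prompt["padding_mask"][3] = True
text = toy_generate_postprocess(prompt)
print(text)  # fish taco please
```

Splitting the work this way is what allows the preprocessing half to run in a separate process from the model.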
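Arguments
tokenizer: A keras_nlp.models.Phi3Tokenizer instance.
sequence_length: The length of the packed inputs. Default is 4096.
add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence. Default is True.
add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence. Default is False.
Call arguments
x: A string, tf.Tensor or list of Python strings.
y: Label data. Should always be None as the layer generates labels.
sample_weight: Label weights. Should always be None as the layer generates label weights.
sequence_length: Pass to override the configured sequence_length of the layer.
Examples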
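How add_start_token, add_end_token and sequence_length shape a packed sequence can be sketched in plain Python. This is a simplified illustration, not the layer's code: the pack helper is hypothetical, the start, end and pad ids are assumed values, and the truncation strategy (trimming the content so the special tokens survive) is an assumption.

```python
def pack(token_ids, sequence_length, add_start_token=True, add_end_token=False,
         start_id=1, end_id=2, pad_id=0):
    """Wrap, truncate and pad one sequence. All ids are assumed values."""
    # Reserve room for the special tokens, then truncate the content to fit.
    budget = max(sequence_length - int(add_start_token) - int(add_end_token), 0)
    ids = list(token_ids)[:budget]
    if add_start_token:
        ids = [start_id] + ids
    if add_end_token:
        ids = ids + [end_id]
    # Pad on the right up to the requested length.
    return ids + [pad_id] * (sequence_length - len(ids))


print(pack([10, 11, 12], sequence_length=6))
# [1, 10, 11, 12, 0, 0]
print(pack([10, 11, 12], sequence_length=6, add_end_token=True))
# [1, 10, 11, 12, 2, 0]
print(pack([10, 11, 12, 13, 14, 15], sequence_length=4))
# [1, 10, 11, 12]
```

Passing a different sequence_length per call, as in the last line, mirrors the call-time override described above.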
import keras_nlp
import tensorflow as tf

# Load the preprocessor from a preset.
preprocessor = keras_nlp.models.Phi3CausalLMPreprocessor.from_preset(
    "phi3_mini_4k_instruct_en"
)
# Tokenize and pack a single sentence.
sentence = tf.constant("League of legends")
preprocessor(sentence)
# Same output.
preprocessor("League of legends")
# Tokenize a batch of sentences.
sentences = tf.constant(["Taco tuesday", "Fish taco please!"])
preprocessor(sentences)
# Same output.
preprocessor(["Taco tuesday", "Fish taco please!"])
# Map a dataset to preprocess a single sentence.
features = tf.constant(
    [
        "Avatar 2 is amazing!",
        "Well, I am not sure.",
    ]
)
labels = tf.constant([1, 0])
ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map a dataset to preprocess unlabeled sentences.
ds = tf.data.Dataset.from_tensor_slices(features)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
from_preset method
Phi3CausalLMPreprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:
1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'
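The four forms can be told apart by their prefix. The helper below is a hypothetical sketch for illustration only; describe_preset is not part of keras_nlp, and the library's actual resolution logic may differ (for example, in how it detects local directories).

```python
def describe_preset(preset: str) -> str:
    """Hypothetical helper: label which form a preset string takes."""
    if preset.startswith("kaggle://"):
        return "Kaggle Models handle"
    if preset.startswith("hf://"):
        return "Hugging Face handle"
    if preset.startswith(("./", "../", "/")):
        return "local preset directory"
    # Anything without a scheme or path prefix is treated as a built-in name.
    return "built-in preset identifier"


for preset in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(f"{preset!r}: {describe_preset(preset)}")
```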
For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().
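Arguments
preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.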
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)
Preset name | Parameters | Description |
---|---|---|
phi3_mini_4k_instruct_en | 3.82B | 3.8 billion parameters, 32 layers, 4k context length, Phi-3 model. The model was trained using the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. |
phi3_mini_128k_instruct_en | 3.82B | 3.8 billion parameters, 32 layers, 128k context length, Phi-3 model. The model was trained using the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. |
tokenizer property
keras_nlp.models.Phi3CausalLMPreprocessor.tokenizer
The tokenizer used to tokenize strings.