Preprocessor

`class keras_nlp.models.Preprocessor()`
Base class for preprocessing layers.

A `Preprocessor` layer wraps a `keras_nlp.tokenizer.Tokenizer` to provide a complete preprocessing setup for a given task. For example, a masked language modeling preprocessor will take in raw input strings and output `(x, y, sample_weight)` tuples, where `x` contains token id sequences with some tokens masked out, `y` contains the original values of the masked tokens, and `sample_weight` marks which label positions should contribute to the loss.
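The shape of these `(x, y, sample_weight)` tuples can be illustrated with a toy sketch. This is plain Python rather than KerasNLP, and the token ids, reserved mask id, and masking rate are made up for illustration:

```python
import random

MASK_ID = 0  # hypothetical id reserved for the mask token


def toy_mlm_preprocess(token_ids, mask_rate=0.25, seed=7):
    """Turn a token id sequence into a toy (x, y, sample_weight) MLM tuple."""
    rng = random.Random(seed)
    x, y, sample_weight = [], [], []
    for tok in token_ids:
        if rng.random() < mask_rate:
            x.append(MASK_ID)        # masked-out input position
            y.append(tok)            # label: the original token id
            sample_weight.append(1)  # this label position counts in the loss
        else:
            x.append(tok)
            y.append(0)              # placeholder label
            sample_weight.append(0)  # ignored by the loss
    return x, y, sample_weight


x, y, sw = toy_mlm_preprocess([5, 8, 13, 21, 34, 55])
print(x, y, sw)
```

A real masked language modeling preprocessor also handles tokenization, padding, and special tokens; the sketch only shows how the three outputs relate to each other.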
This class can be subclassed similarly to any `keras.layers.Layer`, by defining `build()`, `call()` and `get_config()` methods. All subclasses should set the `tokenizer` property on construction.
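The pattern described above, wrapping a tokenizer set on construction and exposing a callable plus `get_config()`, can be sketched in plain Python. The `ToyTokenizer`, vocabulary, and `sequence_length` parameter here are invented stand-ins, not KerasNLP APIs:

```python
class ToyTokenizer:
    """Stand-in for a keras_nlp.tokenizer.Tokenizer: maps words to ids."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return [self.vocab.get(word, 0) for word in text.split()]


class ToyPreprocessor:
    """Mimics the Preprocessor pattern: set `tokenizer` on construction,
    implement a call method and get_config()."""

    def __init__(self, tokenizer, sequence_length=8):
        self.tokenizer = tokenizer  # subclasses must set this property
        self.sequence_length = sequence_length

    def __call__(self, text):
        ids = self.tokenizer(text)[: self.sequence_length]
        # Pad to a fixed length, as task preprocessors typically do.
        return ids + [0] * (self.sequence_length - len(ids))

    def get_config(self):
        return {"sequence_length": self.sequence_length}


vocab = {"hello": 1, "world": 2}
pre = ToyPreprocessor(ToyTokenizer(vocab), sequence_length=4)
print(pre("hello world"))
```

Real subclasses inherit from `keras_nlp.models.Preprocessor` and delegate to a real `Tokenizer`; the sketch only shows the structural contract.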
from_preset method

`Preprocessor.from_preset(preset, **kwargs)`
Instantiate a `keras_nlp.models.Preprocessor` from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The `preset` can be passed as one of:

a built-in preset identifier like `'bert_base_en'`
a Kaggle Models handle like `'kaggle://user/bert/keras/bert_base_en'`
a Hugging Face handle like `'hf://user/bert_base_en'`
a path to a local preset directory like `'./bert_base_en'`
For any `Preprocessor` subclass, you can run `cls.presets.keys()` to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like `keras_nlp.models.BertPreprocessor.from_preset()`.
Arguments

preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
"gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
"bert_base_en",
)
save_to_preset method

`Preprocessor.save_to_preset(preset_dir)`
Save preprocessor to a preset directory.
Arguments

preset_dir: The path to the local preset directory to save to.
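The save-then-reload round trip behind presets can be sketched as a toy analogue in plain Python. Real presets also bundle tokenizer assets and weights, and the `preprocessor.json` filename here is an invented stand-in, not the actual preset layout:

```python
import json
import os
import tempfile


def toy_save_to_preset(preprocessor_config, preset_dir):
    """Toy analogue of save_to_preset: writes a config JSON into a directory."""
    os.makedirs(preset_dir, exist_ok=True)
    with open(os.path.join(preset_dir, "preprocessor.json"), "w") as f:
        json.dump(preprocessor_config, f)


def toy_from_preset(preset_dir):
    """Toy analogue of from_preset: reads the config back from the directory."""
    with open(os.path.join(preset_dir, "preprocessor.json")) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as preset_dir:
    toy_save_to_preset({"sequence_length": 128}, preset_dir)
    restored = toy_from_preset(preset_dir)
print(restored)
```

The point is only that a preset directory is a self-describing bundle: whatever `save_to_preset` writes, `from_preset` can reconstruct.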
tokenizer property

`keras_nlp.models.Preprocessor.tokenizer`
The tokenizer used to tokenize strings.