Tokenizer class
keras_nlp.models.Tokenizer()
A base class for tokenizer layers.
Tokenizers in the KerasNLP library should all subclass this layer.
The class provides two core methods, tokenize() and detokenize(), for
going from plain text to sequences and back. A tokenizer is a subclass of
keras.layers.Layer and can be combined into a keras.Model.
Subclassers should always implement the tokenize() method, which will also
be the default when calling the layer directly on inputs.
Subclassers can optionally implement the detokenize() method if the
tokenization is reversible. Otherwise, this can be skipped.
Subclassers should implement get_vocabulary(), vocabulary_size(),
token_to_id() and id_to_token() if applicable. For some simple
"vocab free" tokenizers, such as the whitespace splitter shown below, these
methods do not apply and can be skipped.
Example
import keras_nlp
import tensorflow as tf

class WhitespaceSplitterTokenizer(keras_nlp.tokenizers.Tokenizer):
    def tokenize(self, inputs):
        # Split each input string on whitespace.
        return tf.strings.split(inputs)

    def detokenize(self, inputs):
        # Join tokens back into a single string, separated by spaces.
        return tf.strings.reduce_join(inputs, separator=" ", axis=-1)

tokenizer = WhitespaceSplitterTokenizer()
# Tokenize some inputs.
tokenizer.tokenize("This is a test")
# Shorthand for `tokenize()`.
tokenizer("This is a test")
# Detokenize some outputs.
tokenizer.detokenize(["This", "is", "a", "test"])
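The whitespace splitter above is vocabulary-free, so it skips the vocabulary methods. As a rough illustration of what get_vocabulary(), vocabulary_size(), token_to_id() and id_to_token() might look like for a tokenizer that does carry a vocabulary, here is a framework-free sketch. The class name ToyVocabTokenizer, the unk_token parameter and the fallback to an unknown-token id are assumptions made for illustration. The class deliberately does not subclass the KerasNLP Tokenizer and operates on plain Python strings and lists.

```python
class ToyVocabTokenizer:
    """Hypothetical stand-in; NOT a subclass of the KerasNLP Tokenizer."""

    def __init__(self, vocabulary, unk_token="[UNK]"):
        # Assumption: `vocabulary` is a list of unique strings that
        # contains `unk_token`.
        self._id_to_token = list(vocabulary)
        self._token_to_id = {tok: i for i, tok in enumerate(self._id_to_token)}
        self._unk_id = self._token_to_id[unk_token]

    def get_vocabulary(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._id_to_token)

    def vocabulary_size(self):
        return len(self._id_to_token)

    def token_to_id(self, token):
        # Unknown tokens fall back to the unk id (an illustrative choice).
        return self._token_to_id.get(token, self._unk_id)

    def id_to_token(self, id):
        return self._id_to_token[id]

    def tokenize(self, text):
        # Whitespace split, then map each token to its integer id.
        return [self.token_to_id(tok) for tok in text.split()]

    def detokenize(self, ids):
        return " ".join(self.id_to_token(i) for i in ids)

    # Calling the object is shorthand for tokenize().
    __call__ = tokenize


tokenizer = ToyVocabTokenizer(["[UNK]", "this", "is", "a", "test"])
ids = tokenizer("this is a test")
print(ids)
print(tokenizer.detokenize(ids))
```

In this sketch detokenize() only round-trips text whose words are all in the vocabulary; anything else comes back as the unknown token, which is one reason a tokenizer may not be fully reversible.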
from_preset method
Tokenizer.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Tokenizer from a model preset.
A preset is a directory of configs, weights and other file assets used
to save and load a pre-trained model. The preset can be passed as
one of:
a built-in preset identifier like 'bert_base_en'
a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
a Hugging Face handle like 'hf://user/bert_base_en'
a path to a local preset directory like './bert_base_en'
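To make the four accepted forms concrete, the sketch below labels a preset string by its shape. The helper classify_preset, its builtin_presets parameter and the returned labels are hypothetical and are not part of KerasNLP; the library's own resolution logic may differ. Only the 'kaggle://' and 'hf://' prefixes and the example strings come from the list above.

```python
def classify_preset(preset, builtin_presets=()):
    """Hypothetical helper: label which form a preset string takes."""
    # Remote handles are recognizable by their URI-style prefix.
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    # Assumption: built-in identifiers are matched against a known set.
    if preset in builtin_presets:
        return "builtin"
    # Anything else is treated as a path to a local preset directory.
    return "local_path"


known = {"bert_base_en"}  # illustrative set of built-in names
for name in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(name, "->", classify_preset(name, known))
```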
For any Tokenizer subclass, you can run cls.presets.keys() to list
all built-in presets available on the class.
This constructor can be called in one of two ways: either from the base
class, like keras_nlp.models.Tokenizer.from_preset(), or from
a model class, like keras_nlp.models.GemmaTokenizer.from_preset().
If calling from the base class, the subclass of the returned object
will be inferred from the config in the preset directory.
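The base-class dispatch described above can be pictured with a small, framework-free sketch: a preset directory stores the name of the class that wrote it, and the base class looks that name up when loading. Everything here is an assumption made for illustration, not the KerasNLP implementation: the class names, the tokenizer.json file name, the "class_name" and "config" keys, and the error raised when a subclass is asked to load a preset written by a different class.

```python
import json
import os
import tempfile


class BaseTokenizer:
    # Hypothetical registry mapping class names to subclasses.
    _registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        BaseTokenizer._registry[cls.__name__] = cls

    def get_config(self):
        return {}

    def save_to_preset(self, preset_dir):
        # Assumed layout: <preset_dir>/tokenizer.json holds class name + config.
        os.makedirs(preset_dir, exist_ok=True)
        payload = {"class_name": type(self).__name__, "config": self.get_config()}
        with open(os.path.join(preset_dir, "tokenizer.json"), "w") as f:
            json.dump(payload, f)

    @classmethod
    def from_preset(cls, preset_dir):
        with open(os.path.join(preset_dir, "tokenizer.json")) as f:
            payload = json.load(f)
        # Infer the concrete class from the saved config.
        target = BaseTokenizer._registry[payload["class_name"]]
        # Assumption: calling from a subclass requires a compatible preset.
        if not issubclass(target, cls):
            raise ValueError(
                f"Preset holds a {target.__name__}, not a {cls.__name__}."
            )
        return target(**payload["config"])


class WordTokenizer(BaseTokenizer):
    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def get_config(self):
        return {"lowercase": self.lowercase}


class CharTokenizer(BaseTokenizer):
    pass


preset_dir = tempfile.mkdtemp()
WordTokenizer(lowercase=False).save_to_preset(preset_dir)

# Loading through the base class recovers the concrete subclass.
loaded = BaseTokenizer.from_preset(preset_dir)
print(type(loaded).__name__, loaded.lowercase)
```

Calling WordTokenizer.from_preset(preset_dir) returns the same kind of object, while in this sketch CharTokenizer.from_preset(preset_dir) raises a ValueError because the saved class does not match.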
Arguments
If True, the weights will be loaded into the
model architecture. If False, the weights will be randomly
initialized.
Examples
# Load a preset tokenizer.
tokenizer = keras_nlp.tokenizers.Tokenizer.from_preset("bert_base_en")
# Tokenize some input.
tokenizer("The quick brown fox tripped.")
# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])
save_to_preset method
Tokenizer.save_to_preset(preset_dir)
Save tokenizer to a preset directory.
Arguments