BloomPreprocessor class

keras_nlp.models.BloomPreprocessor(
    tokenizer, sequence_length=2048, add_start_token=True, add_end_token=True, **kwargs
)
BLOOM preprocessing layer which tokenizes and packs inputs.
This preprocessing layer will do 2 things:

- Tokenize the inputs using the tokenizer.
- Construct a dictionary with keys "token_ids" and "padding_mask" that can be passed directly to a keras_nlp.models.BloomBackbone.

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.
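As a minimal sketch of the resulting structure (the exact token ids depend on the preset vocabulary, so the shapes below are the reliable part):

import keras_nlp

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset("bloom_560m_multi")
# A single string maps to a dict of dense tensors padded to `sequence_length`.
outputs = preprocessor("a quick fox.")
print(outputs["token_ids"].shape)     # (2048,) with the default sequence_length
print(outputs["padding_mask"].shape)  # (2048,); True for real tokens, False for padding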
The call method of this layer accepts three arguments, x, y, and sample_weight. x can be a python string or tensor representing a single segment, a list of python strings representing a batch of single segments, or a list of tensors representing multiple segments to be packed together. y and sample_weight are both optional, can have any format, and will be passed through unaltered.
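For example, a short sketch of the pass-through behavior, reusing the preprocessor from the sketch above (the label and weight values are arbitrary):

# `y` and `sample_weight` are returned unaltered alongside the preprocessed `x`.
x, y = preprocessor("a quick fox.", y=1)
x, y, sw = preprocessor("a quick fox.", y=1, sample_weight=0.5)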
Arguments

- tokenizer: A keras_nlp.models.BloomTokenizer instance.
- sequence_length: The length of the packed inputs.
- add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence. Defaults to True.
- add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence. Defaults to True.

Call arguments

- x: A string, tf.Tensor or list of python strings.
- y: Any label data. Will be passed through unaltered.
- sample_weight: Any label weight data. Will be passed through unaltered.
- sequence_length: Pass to override the configured sequence_length of the layer.
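For instance, a sketch of overriding these defaults when loading from a preset (the override values here are illustrative):

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
    "bloom_560m_multi",
    sequence_length=512,    # pack/pad to 512 tokens instead of the default 2048
    add_start_token=False,  # skip the tokenizer start token
)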
Examples
Directly calling the layer on data.
preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
    "bloom_560m_multi"
)
# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")
# Tokenize a batch of single sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])
# Custom vocabulary.
features = ["a quick fox.", "a fox quick."]
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "a": 3, "Ġquick": 4, "Ġfox": 5}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]
tokenizer = keras_nlp.models.BloomTokenizer(
    vocabulary=vocab,
    merges=merges,
)
preprocessor = keras_nlp.models.BloomPreprocessor(tokenizer=tokenizer)
preprocessor("a quick fox.")
Mapping with tf.data.Dataset.
preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
    "bloom_560m_multi"
)
text = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
label = tf.constant([1, 1])
# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((text, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(text)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
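The (x, y, sample_weight) format described above also maps directly; a sketch with illustrative per-example weights:

# Map labeled single sentences with sample weights.
sample_weight = tf.constant([0.5, 1.0])
ds = tf.data.Dataset.from_tensor_slices((text, label, sample_weight))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)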
from_preset method

BloomPreprocessor.from_preset(preset, **kwargs)

Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'
For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().
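For example, to list the presets available on this class:

# List all built-in preset names registered on the class.
print(keras_nlp.models.BloomPreprocessor.presets.keys())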
Arguments

- preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)
Preset name | Parameters | Description
---|---|---
bloom_560m_multi | 559.21M | 24-layer Bloom model with hidden dimension of 1024. Trained on 45 natural languages and 12 programming languages.
bloom_1.1b_multi | 1.07B | 24-layer Bloom model with hidden dimension of 1536. Trained on 45 natural languages and 12 programming languages.
bloom_1.7b_multi | 1.72B | 24-layer Bloom model with hidden dimension of 2048. Trained on 45 natural languages and 12 programming languages.
bloom_3b_multi | 3.00B | 30-layer Bloom model with hidden dimension of 2560. Trained on 45 natural languages and 12 programming languages.
bloomz_560m_multi | 559.21M | 24-layer Bloom model with hidden dimension of 1024. Finetuned on the crosslingual task mixture (xP3) dataset.
bloomz_1.1b_multi | 1.07B | 24-layer Bloom model with hidden dimension of 1536. Finetuned on the crosslingual task mixture (xP3) dataset.
bloomz_1.7b_multi | 1.72B | 24-layer Bloom model with hidden dimension of 2048. Finetuned on the crosslingual task mixture (xP3) dataset.
bloomz_3b_multi | 3.00B | 30-layer Bloom model with hidden dimension of 2560. Finetuned on the crosslingual task mixture (xP3) dataset.
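Any preset name from the table above can be passed to from_preset; a brief sketch:

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset("bloomz_560m_multi")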
tokenizer property

keras_nlp.models.BloomPreprocessor.tokenizer

The tokenizer used to tokenize strings.
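A minimal sketch of using this property directly (the tokenizer itself is callable on strings):

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset("bloom_560m_multi")
# Tokenize only, without the packing or padding applied by the preprocessor.
token_ids = preprocessor.tokenizer("a quick fox.")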