BloomPreprocessor class

keras_nlp.models.BloomPreprocessor(
    tokenizer, sequence_length=2048, add_start_token=True, add_end_token=True, **kwargs
)
BLOOM preprocessing layer which tokenizes and packs inputs.
This preprocessing layer will do 2 things:

- Tokenize the inputs using the tokenizer.
- Construct a dictionary with keys "token_ids", "padding_mask", that can be passed directly to a keras_nlp.models.BloomBackbone.

This layer can be used directly with tf.data.Dataset.map to preprocess
string data in the (x, y, sample_weight) format used by
keras.Model.fit.
The call method of this layer accepts three arguments, x, y, and
sample_weight. x can be a python string or tensor representing a single
segment, a list of python strings representing a batch of single segments,
or a list of tensors representing multiple segments to be packed together.
y and sample_weight are both optional, can have any format, and will be
passed through unaltered.
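A minimal sketch of this passthrough behavior (assuming the "bloom_560m_multi" preset used in the examples below):

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
    "bloom_560m_multi"
)
# x is tokenized and packed; y and sample_weight are returned unaltered.
x, y, sample_weight = preprocessor(
    x="The quick brown fox jumped.",
    y=1,
    sample_weight=0.5,
)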
Arguments

- tokenizer: A keras_nlp.models.BloomTokenizer instance.
- sequence_length: The length of the packed inputs.
- add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence.
- add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence.

Call arguments

- x: A string, tf.Tensor or list of python strings.
- y: Any label data. Will be passed through unaltered.
- sample_weight: Any label weight data. Will be passed through unaltered.
- sequence_length: Pass to override the configured sequence_length of the layer.

Examples
Directly calling the layer on data.
preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
"bloom_560m_multi"
)
# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")
# Tokenize a batch of single sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])
# Custom vocabulary.
features = ["a quick fox.", "a fox quick."]
vocab = {"<pad>": 0, "<s>":1, "</s>":2, "a": 3, "Ġquick": 4, "Ġfox": 5}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]
tokenizer = keras_nlp.models.BloomTokenizer(
vocabulary=vocab,
merges=merges,
)
preprocessor = keras_nlp.models.BloomPreprocessor(tokenizer=tokenizer)
preprocessor("The quick brown fox jumped.")
Mapping with tf.data.Dataset.
preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
"bloom_560m_multi"
)
text = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
label = tf.constant([1, 1])
# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((text, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(text)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
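# Sample weights ride along in the same way (a sketch assuming
# per-example weights; they pass through the preprocessor unaltered).
sample_weight = tf.constant([1.0, 0.5])
ds = tf.data.Dataset.from_tensor_slices((text, label, sample_weight))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)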
from_preset method

BloomPreprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used
to save and load a pre-trained model. The preset can be passed as
one of:

- a built-in preset identifier like 'bert_base_en'
- a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
- a Hugging Face handle like 'hf://user/bert_base_en'
- a path to a local preset directory like './bert_base_en'

For any Preprocessor subclass, you can run cls.presets.keys() to
list all built-in presets available on the class.
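For example, to see what is available on this class (the Bloom presets are tabulated at the end of this page):

# List all built-in presets registered on BloomPreprocessor.
print(keras_nlp.models.BloomPreprocessor.presets.keys())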
As there are usually multiple preprocessing classes for a given model,
this method should be called on a specific subclass like
keras_nlp.models.BertPreprocessor.from_preset().
Arguments

- preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local preset directory.
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
"gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
"bert_base_en",
)
| Preset name | Parameters | Description |
|---|---|---|
| bloom_560m_multi | 559.21M | 24-layer Bloom model with hidden dimension of 1024. Trained on 45 natural languages and 12 programming languages. |
| bloom_1.1b_multi | 1.07B | 24-layer Bloom model with hidden dimension of 1536. Trained on 45 natural languages and 12 programming languages. |
| bloom_1.7b_multi | 1.72B | 24-layer Bloom model with hidden dimension of 2048. Trained on 45 natural languages and 12 programming languages. |
| bloom_3b_multi | 3.00B | 30-layer Bloom model with hidden dimension of 2560. Trained on 45 natural languages and 12 programming languages. |
| bloomz_560m_multi | 559.21M | 24-layer Bloom model with hidden dimension of 1024. Finetuned on the crosslingual task mixture (xP3) dataset. |
| bloomz_1.1b_multi | 1.07B | 24-layer Bloom model with hidden dimension of 1536. Finetuned on the crosslingual task mixture (xP3) dataset. |
| bloomz_1.7b_multi | 1.72B | 24-layer Bloom model with hidden dimension of 2048. Finetuned on the crosslingual task mixture (xP3) dataset. |
| bloomz_3b_multi | 3.00B | 30-layer Bloom model with hidden dimension of 2560. Finetuned on the crosslingual task mixture (xP3) dataset. |
tokenizer property

keras_nlp.models.BloomPreprocessor.tokenizer
The tokenizer used to tokenize strings.
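A short sketch of working with the tokenizer exposed by this property (assuming the same "bloom_560m_multi" preset as above):

preprocessor = keras_nlp.models.BloomPreprocessor.from_preset(
    "bloom_560m_multi"
)
# Access the underlying BloomTokenizer directly.
token_ids = preprocessor.tokenizer("The quick brown fox jumped.")
# Map token ids back to a string.
text = preprocessor.tokenizer.detokenize(token_ids)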