BartSeq2SeqLMPreprocessor class

keras_nlp.models.BartSeq2SeqLMPreprocessor(
    tokenizer, encoder_sequence_length=1024, decoder_sequence_length=1024, **kwargs
)
BART Seq2Seq LM preprocessor.
This layer is used as a preprocessor for seq2seq tasks using the BART model.
This class subclasses keras_nlp.models.BartPreprocessor and keeps most of
its functionality. It has two changes from the superclass:

1. Sets the y (label) and sample_weights fields by shifting the
   decoder input sequence one step towards the left. Both these fields are
   inferred internally, and any passed values will be ignored.
2. Drops the last token from the decoder input sequence, as it does not
   have a successor.

Arguments

- tokenizer: A keras_nlp.models.BartTokenizer instance.
- encoder_sequence_length: The length of the packed encoder inputs.
- decoder_sequence_length: The length of the packed decoder inputs.

Call arguments
- x: A dictionary with encoder_text and decoder_text as its keys.
  Each value in the dictionary should be a tensor of single string
  sequences. Inputs may be batched or unbatched. Raw python inputs
  will be converted to tensors.
- y: Label data. Should always be None, as the layer generates labels by
  shifting the decoder input sequence one step to the left.
- sample_weight: Label weights. Should always be None, as the layer
  generates label weights by shifting the padding mask one step to the
  left.

Examples
Directly calling the layer on data
preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor.from_preset("bart_base_en")
# Preprocess unbatched inputs.
inputs = {
"encoder_text": "The fox was sleeping.",
"decoder_text": "The fox was awake."
}
preprocessor(inputs)
# Preprocess batched inputs.
inputs = {
"encoder_text": ["The fox was sleeping.", "The lion was quiet."],
"decoder_text": ["The fox was awake.", "The lion was roaring."]
}
preprocessor(inputs)
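Unlike the base preprocessor, calling this layer returns an (x, y, sample_weight) tuple. Below is a minimal sketch of that structure; the dictionary key names are assumed to match the BART backbone's inputs and are shown for illustration only.

# Unpack the tuple produced by the batched call above.
x, y, sample_weight = preprocessor(inputs)

# y is the packed decoder sequence shifted one step to the left, so each
# label is the next token the decoder should learn to predict.
print(x["decoder_token_ids"])  # assumed key name
print(y)

# sample_weight is the shifted decoder padding mask, so padded positions
# contribute no loss.
print(sample_weight)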
# Custom vocabulary.
vocab = {
"<s>": 0,
"<pad>": 1,
"</s>": 2,
"Ġafter": 5,
"noon": 6,
"Ġsun": 7,
}
merges = ["Ġ a", "Ġ s", "Ġ n", "e r", "n o", "o n", "Ġs u", "Ġa f", "no on"]
merges += ["Ġsu n", "Ġaf t", "Ġaft er"]
tokenizer = keras_nlp.models.BartTokenizer(
vocabulary=vocab,
merges=merges,
)
preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor(
tokenizer=tokenizer,
encoder_sequence_length=20,
decoder_sequence_length=10,
)
inputs = {
"encoder_text": "The fox was sleeping.",
"decoder_text": "The fox was awake."
}
preprocessor(inputs)
Mapping with tf.data.Dataset.
import tensorflow as tf

preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor.from_preset("bart_base_en")
# Map single sentences.
features = {
"encoder_text": tf.constant(
["The fox was sleeping.", "The lion was quiet."]
),
"decoder_text": tf.constant(
["The fox was awake.", "The lion was roaring."]
)
}
ds = tf.data.Dataset.from_tensor_slices(features)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)
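A common follow-on step, sketched here rather than taken from this page, is to batch the mapped dataset and train a seq2seq task model with its built-in preprocessing disabled (this assumes keras_nlp.models.BartSeq2SeqLM and its preprocessor=None option).

# Each mapped element is already an (x, y, sample_weight) tuple.
ds = ds.batch(2).prefetch(tf.data.AUTOTUNE)

# Assumed usage: skip the task's own preprocessing, since the dataset is
# already tokenized and packed by the layer above.
seq_2_seq_lm = keras_nlp.models.BartSeq2SeqLM.from_preset(
    "bart_base_en",
    preprocessor=None,
)
seq_2_seq_lm.fit(ds, epochs=1)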
from_preset method

BartSeq2SeqLMPreprocessor.from_preset(preset, **kwargs)
Instantiate a keras_nlp.models.Preprocessor from a model preset.
A preset is a directory of configs, weights and other file assets used
to save and load a pre-trained model. The preset can be passed as
one of:

1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'

For any Preprocessor subclass, you can run cls.presets.keys() to
list all built-in presets available on the class.
As there are usually multiple preprocessing classes for a given model,
this method should be called on a specific subclass like
keras_nlp.models.BertPreprocessor.from_preset().
Arguments

- preset: string. A built-in preset identifier, a Kaggle Models handle,
  a Hugging Face handle, or a path to a local directory.
Examples
# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
"gemma_2b_en",
)
# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
"bert_base_en",
)
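For the class documented on this page, the same pattern applies with one of the BART presets listed below.

# Load the BART seq2seq LM preprocessor.
preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor.from_preset(
    "bart_base_en",
)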
| Preset name | Parameters | Description |
|---|---|---|
| bart_base_en | 139.42M | 6-layer BART model where case is maintained. Trained on BookCorpus, English Wikipedia and CommonCrawl. |
| bart_large_en | 406.29M | 12-layer BART model where case is maintained. Trained on BookCorpus, English Wikipedia and CommonCrawl. |
| bart_large_en_cnn | 406.29M | The bart_large_en backbone model fine-tuned on the CNN+DM summarization dataset. |
generate_preprocess method

BartSeq2SeqLMPreprocessor.generate_preprocess(
x, encoder_sequence_length=None, decoder_sequence_length=None, sequence_length=None
)
Convert encoder and decoder input strings to integer token inputs for generation.
Similar to calling the layer for training, this method takes in a dict
containing "encoder_text" and "decoder_text", with strings or tensor
strings for values, tokenizes and packs the input, and computes a
padding mask that marks which positions hold real tokens rather than padding.
Unlike calling the layer for training, this method does not compute labels
and will never append a tokenizer.end_token_id to the end of the decoder
sequence, as generation is expected to continue at the end of the decoder
prompt passed in.
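A hedged usage sketch follows; the output key names are assumed to mirror the backbone inputs and are not guaranteed by this page.

preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor.from_preset("bart_base_en")

# Tokenize and pack an encoder document plus a decoder prompt.
features = preprocessor.generate_preprocess(
    {
        "encoder_text": "The fox was sleeping.",
        "decoder_text": "The fox",
    }
)

# No labels are returned, and no end token is appended to the decoder ids,
# so generation can continue from the end of the prompt.
print(features["decoder_token_ids"])     # assumed key name
print(features["decoder_padding_mask"])  # assumed key name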
generate_postprocess method

BartSeq2SeqLMPreprocessor.generate_postprocess(x)
Convert integer token output to strings for generation.
This method reverses generate_preprocess(), by first removing all
padding and start/end tokens, and then converting the integer sequence
back to a string.
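A round-trip sketch, assuming the same dictionary keys as in the generate_preprocess() sketch above:

features = preprocessor.generate_preprocess(
    {
        "encoder_text": "The fox was sleeping.",
        "decoder_text": "The fox was",
    }
)

# Strip padding and start/end tokens, then detokenize back to a string.
text = preprocessor.generate_postprocess(
    {
        "decoder_token_ids": features["decoder_token_ids"],
        "decoder_padding_mask": features["decoder_padding_mask"],
    }
)
print(text)  # roughly recovers "The fox was"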
tokenizer property

keras_nlp.models.BartSeq2SeqLMPreprocessor.tokenizer
The tokenizer used to tokenize strings.
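A small usage sketch:

preprocessor = keras_nlp.models.BartSeq2SeqLMPreprocessor.from_preset("bart_base_en")

# The underlying BartTokenizer can be called directly on strings.
token_ids = preprocessor.tokenizer("The fox was sleeping.")
print(token_ids)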