compute_sentence_piece_proto
keras_nlp.tokenizers.compute_sentence_piece_proto(
    data, vocabulary_size, model_type="unigram", proto_output_file=None, lowercase=False
)
A utility to train a SentencePiece vocabulary.
Trains a SentencePiece vocabulary from an input dataset or a list of filenames. If data is a list of filenames, the files must be plain text, and their contents will be read line by line during training.
Arguments

data: A tf.data.Dataset, or a list of filenames.
vocabulary_size: int. The maximum size of the vocabulary to be trained.
model_type: str. The model algorithm; must be one of "unigram", "bpe", "word" or "char". Defaults to "unigram".
proto_output_file: str. If provided, the serialized proto will be written to this file. If None, the model file will be an io.BytesIO object and the proto is returned as bytes. Defaults to None.
lowercase: bool. If True, the input text will be lowercased before tokenization. Defaults to False.

Returns

A bytes object with a serialized SentencePiece proto, or None if proto_output_file is provided.
Examples
Basic Usage (from Dataset).
>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_nlp.tokenizers.compute_sentence_piece_proto(inputs, vocabulary_size=15)
>>> tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
... print(output)
tf.Tensor([ 4 8 12 5 9 14 5 6 13 4 7 10 11 6 13],
shape=(15,), dtype=int32)
Basic Usage (with files).
with open("test.txt", "w+") as f:
    f.write("Drifting Along\n")
inputs = ["test.txt"]
proto = keras_nlp.tokenizers.compute_sentence_piece_proto(
inputs, vocabulary_size=15, proto_output_file="model.spm")
tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto="model.spm")
ds = tf.data.Dataset.from_tensor_slices(["the quick brown fox."])
ds = ds.map(tokenizer)
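Usage with model_type (a sketch, not part of the original examples). The same call trains a BPE vocabulary instead of the default unigram model; since BPE counts every character toward the vocabulary, very small sizes can fail on tiny inputs, so the size below is illustrative.
inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
# Same call as above, but selecting the BPE algorithm.
proto = keras_nlp.tokenizers.compute_sentence_piece_proto(
    inputs, vocabulary_size=20, model_type="bpe")
tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=proto)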
Usage with lowercase.
>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_nlp.tokenizers.compute_sentence_piece_proto(
... inputs, vocabulary_size=15, lowercase=True)
>>> tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
... print(output)
tf.Tensor([ 4 8 12 5 9 14 5 6 13 4 7 10 11 6 13],
shape=(15,), dtype=int32)
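Return value (a sketch, not part of the original examples). As described under Returns, the serialized proto only comes back when no proto_output_file is given; with a file, the proto is written to disk and None is returned. The asserts below are illustrative.
inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
# In-memory mode: the serialized proto comes back as bytes.
proto_bytes = keras_nlp.tokenizers.compute_sentence_piece_proto(
    inputs, vocabulary_size=15)
assert isinstance(proto_bytes, bytes)
# File mode: the proto is written to "model.spm" and None is returned.
result = keras_nlp.tokenizers.compute_sentence_piece_proto(
    inputs, vocabulary_size=15, proto_output_file="model.spm")
assert result is None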