compute_word_piece_vocabulary function

keras_nlp.tokenizers.compute_word_piece_vocabulary(
data,
vocabulary_size,
vocabulary_output_file=None,
lowercase=False,
strip_accents=False,
split=True,
split_on_cjk=True,
suffix_indicator="##",
reserved_tokens=["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"],
)
A utility to train a WordPiece vocabulary.

Trains a WordPiece vocabulary from an input dataset or a list of filenames.

For custom data loading and pretokenization (split=False), the input data should be a tf.data.Dataset. If data is a list of filenames, the files must be plain text, and the text will be read in line by line during training.
Arguments

data: A tf.data.Dataset, or a list of filenames.
vocabulary_size: The maximum size of the vocabulary to be trained.
vocabulary_output_file: The location to write a vocabulary file. Defaults to None.
lowercase: If True, the input text will be lowercased before tokenization. Defaults to False.
strip_accents: If True, all accent marks will be removed from text before tokenization. Defaults to False.
split: If True, input will be split on whitespace and punctuation marks, and all punctuation marks will be kept as tokens. If False, input should be split ("pre-tokenized") before calling the tokenizer, and passed as a dense or ragged tensor of whole words. split is required to be True when data is a list of filenames. Defaults to True.
split_on_cjk: If True, input will be split on CJK characters, i.e., Chinese, Japanese, Korean and Vietnamese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)). Note that this is applicable only when split is True. Defaults to True.
suffix_indicator: The characters prepended to a wordpiece to indicate that it is a suffix to another subword, e.g. "##ing". Defaults to "##".
reserved_tokens: A list of tokens that must be included in the vocabulary.

Returns
Returns a list of vocabulary terms.
Examples
Basic Usage (from Dataset).
>>> inputs = tf.data.Dataset.from_tensor_slices(["bat sat pat mat rat"])
>>> vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(inputs, 13)
>>> vocab
['[PAD]', '[CLS]', '[SEP]', '[UNK]', '[MASK]', 'a', 'b', 'm', 'p', 'r', 's', 't', '##at']
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab, oov_token="[UNK]")
>>> outputs = inputs.map(tokenizer.tokenize)
>>> for x in outputs:
... print(x)
tf.Tensor([ 6 12 10 12 8 12 7 12 9 12], shape=(10,), dtype=int32)
Basic Usage (from filenames).
with open("test.txt", "w+") as f:
f.write("bat sat pat mat rat\n")
inputs = ["test.txt"]
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(inputs, 13)
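Writing a Vocabulary File (from filenames).
A rough sketch of the vocabulary_output_file argument described above. The "vocab.txt" path is an illustrative choice, and passing a vocabulary file path directly to WordPieceTokenizer is assumed to be supported by your installed KerasNLP version; check its documentation if in doubt.
with open("test.txt", "w+") as f:
    f.write("bat sat pat mat rat\n")
inputs = ["test.txt"]
# Write the trained vocabulary, one term per line, to "vocab.txt".
keras_nlp.tokenizers.compute_word_piece_vocabulary(
    inputs, 13, vocabulary_output_file="vocab.txt"
)
# The saved file can then stand in for an in-memory vocabulary list.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary="vocab.txt")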
Custom Split Usage (from Dataset).
>>> def normalize_and_split(x):
... "Strip punctuation and split on whitespace."
... x = tf.strings.regex_replace(x, r"\p{P}", "")
... return tf.strings.split(x)
>>> inputs = tf.data.Dataset.from_tensor_slices(["bat sat: pat mat rat.\n"])
>>> split_inputs = inputs.map(normalize_and_split)
>>> vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
... split_inputs, 13, split=False,
... )
>>> vocab
['[PAD]', '[CLS]', '[SEP]', '[UNK]', '[MASK]', 'a', 'b', 'm', 'p', 'r', 's', 't', '##at']
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
>>> inputs.map(tokenizer.tokenize)
Custom Split Usage (from filenames).
def normalize_and_split(x):
"Strip punctuation and split on whitespace."
x = tf.strings.regex_replace(x, r"\p{P}", "")
return tf.strings.split(x)
with open("test.txt", "w+") as f:
f.write("bat sat: pat mat rat.\n")
inputs = tf.data.TextLineDataset(["test.txt"])
split_inputs = inputs.map(normalize_and_split)
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
split_inputs, 13, split=False
)
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
inputs.map(tokenizer.tokenize)
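Custom Options Usage (sketch).
A minimal sketch combining the normalization and vocabulary options from the Arguments section. The accented sample text, the reserved_tokens list, and the matching tokenizer settings are illustrative assumptions rather than verified output; the exact vocabulary learned depends on the data and vocabulary_size.
inputs = tf.data.Dataset.from_tensor_slices(["Bát sát pat mat rat"])
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    inputs,
    vocabulary_size=13,
    lowercase=True,  # lowercase the text before training
    strip_accents=True,  # drop accent marks before training
    suffix_indicator="##",  # marks non-initial subwords, e.g. "##at"
    reserved_tokens=["[PAD]", "[UNK]"],  # always placed in the vocabulary
)
# Match the tokenizer's normalization to the settings used during training.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    strip_accents=True,
    oov_token="[UNK]",
)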