ByteTokenizer class

keras_nlp.tokenizers.ByteTokenizer(
    lowercase=True,
    sequence_length=None,
    normalization_form=None,
    errors="replace",
    replacement_char=65533,
    dtype="int32",
    **kwargs
)
Raw byte tokenizer.
This tokenizer is a vocabulary-free tokenizer which will tokenize text as raw bytes from [0, 256).
Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.

If input is a batch of strings: by default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.

If input is a scalar string: there are two cases here. If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length]. Otherwise, the output will be a dense tf.Tensor of shape [None].
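For instance, a scalar input combined with sequence_length should come back as a fixed-length, zero-padded vector (a quick sketch mirroring the batched examples below):

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=8)
>>> np.array(tokenizer("hi"))
array([104, 105,   0,   0,   0,   0,   0,   0], dtype=int32)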
The output dtype can be controlled via the dtype argument, which should be an integer type ("int16", "int32", etc.).
Arguments

lowercase: boolean. If True, the input text will be lowercased before tokenization. Defaults to True.
sequence_length: int. If set, the output will be converted to a dense tensor and padded/trimmed so all outputs are of sequence_length. Defaults to None.
normalization_form: string. One of (None, "NFC", "NFKC", "NFD", "NFKD"). If set, every UTF-8 string in the input tensor text will be normalized to the given form before tokenizing. Defaults to None.
errors: string. One of ('strict', 'replace', 'ignore'). Specifies the detokenize() behavior when an invalid byte sequence is encountered. The value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' will cause the tokenizer to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the tokenizer to skip any invalid formatting in the input and produce no corresponding output character. Defaults to 'replace'.
replacement_char: int. The replacement character to use when an invalid byte sequence is encountered and errors is set to "replace" (same behaviour as https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode). The replacement character (U+FFFD) is 65533. Defaults to 65533.

Examples
Basic usage.
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> outputs = tokenizer("hello")
>>> np.array(outputs)
array([104, 101, 108, 108, 111], dtype=int32)
Ragged outputs.
>>> inputs = ["hello", "hi"]
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([104, 101, 108, 108, 111], dtype=int32)
>>> np.array(seq2)
array([104, 105], dtype=int32)
Dense outputs.
>>> inputs = ["hello", "hi"]
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=8)
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([104, 101, 108, 108, 111, 0, 0, 0], dtype=int32)
>>> np.array(seq2)
array([104, 105, 0, 0, 0, 0, 0, 0], dtype=int32)
Tokenize, then batch for ragged outputs.
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>
Batch, then tokenize for ragged outputs.
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>
Tokenize, then batch for dense outputs (sequence_length provided).
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
[102, 117, 110, 0, 0]], dtype=int32)>
Batch, then tokenize for dense outputs (sequence_length provided).
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
[102, 117, 110, 0, 0]], dtype=int32)>
Detokenization.
>>> inputs = [104, 101, 108, 108, 111]
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('hello', dtype='<U5')
Detokenization with invalid bytes.
>>> # The 255 below is invalid utf-8.
>>> inputs = [104, 101, 255, 108, 108, 111]
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(
... errors="replace", replacement_char=88)
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('heXllo', dtype='<U6')
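For comparison, errors="ignore" should simply drop the invalid byte rather than substituting a replacement character (a sketch of the expected behavior):

>>> inputs = [104, 101, 255, 108, 108, 111]
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(errors="ignore")
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('hello', dtype='<U5')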
tokenize method

ByteTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.

Arguments

inputs: Input tensor, or dense, ragged, or padded list of python strings.
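Calling tokenize() directly is equivalent to calling the layer on the input (a small sketch):

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> np.array(tokenizer.tokenize("hi"))
array([104, 105], dtype=int32)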
detokenize method

ByteTokenizer.detokenize(inputs)

Transform tokens back into strings.

Arguments

inputs: Input tensor, or dense, ragged, or padded tensor of token ids.
get_vocabulary method

ByteTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method

ByteTokenizer.vocabulary_size()

Get the integer size of the tokenizer vocabulary.
token_to_id method

ByteTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method

ByteTokenizer.id_to_token(id)

Convert an integer id to a string token.
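Since the vocabulary is just the 256 possible byte values, these helpers behave much like ord()/chr(); the exact return values below are a sketch based on that assumption:

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> tokenizer.vocabulary_size()
256
>>> tokenizer.id_to_token(104)
'h'
>>> tokenizer.token_to_id("h")
104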