UnicodeCodepointTokenizer class

keras_nlp.tokenizers.UnicodeCodepointTokenizer(
    sequence_length=None,
    lowercase=True,
    normalization_form=None,
    errors="replace",
    replacement_char=65533,
    input_encoding="UTF-8",
    output_encoding="UTF-8",
    vocabulary_size=None,
    dtype="int32",
    **kwargs
)
A Unicode character tokenizer layer.

This tokenizer is a vocabulary-free tokenizer that tokenizes text as Unicode character codepoints.
Tokenizer outputs can either be padded and truncated with a
sequence_length argument, or left un-truncated. The exact output will
depend on the rank of the input tensors.
If input is a batch of strings (rank > 0):
By default, the layer will output a tf.RaggedTensor where the last
dimension of the output is ragged. If sequence_length is set, the layer
will output a dense tf.Tensor where all inputs have been padded or
truncated to sequence_length.
If input is a scalar string (rank == 0):
By default, the layer will output a dense tf.Tensor with static shape
[None]. If sequence_length is set, the output will be
a dense tf.Tensor of shape [sequence_length].
The output dtype can be controlled via the dtype argument, which should be
an integer type ("int16", "int32", etc.).
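For example, a scalar input with both sequence_length and a narrower integer dtype set (a minimal sketch; the output values follow from the defaults described above, including lowercasing):

>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=8, dtype="int16")
>>> np.array(tokenizer("Keras"))
array([107, 101, 114,  97, 115,   0,   0,   0], dtype=int16)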
Arguments

lowercase: If True, the input text will be lowercased before tokenization. Defaults to True.
sequence_length: If set, the output will be converted to a dense tensor and padded or truncated so all outputs have length sequence_length.
normalization_form: One of None, 'NFC', 'NFKC', 'NFD', 'NFKD'. If set, the input text will be normalized to the given Unicode form before tokenizing.
errors: One of 'strict', 'replace', 'ignore'. Specifies the detokenize() behavior when an invalid codepoint is encountered. A value of 'strict' will cause the tokenizer to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' will cause the tokenizer to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the tokenizer to skip any invalid formatting in the input and produce no corresponding output character.
replacement_char: The Unicode codepoint to use in place of invalid codepoints. Defaults to 65533.
input_encoding: The encoding of the input text. Defaults to "UTF-8".
output_encoding: The encoding of the output text. Defaults to "UTF-8".
vocabulary_size: Set the vocabulary size, by clamping all codepoints to the range [0, vocabulary_size). Effectively this will make the vocabulary_size - 1 id the OOV value.

Examples
Basic Usage.
>>> inputs = "Unicode Tokenizer"
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([117, 110, 105, 99, 111, 100, 101, 32, 116, 111, 107, 101, 110,
105, 122, 101, 114], dtype=int32)
Ragged outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325], dtype=int32)
>>> np.array(seq2)
array([1705, 1578, 1575, 1576], dtype=int32)
Dense outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=8)
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325, 0, 0], dtype=int32)
>>> np.array(seq2)
array([1705, 1578, 1575, 1576, 0, 0, 0, 0], dtype=int32)
Tokenize, then batch for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
[2346, 2369, 2360, 2381, 2340, 2325],
[1705, 1578, 1575, 1576]]>
Batch, then tokenize for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
[2346, 2369, 2360, 2381, 2340, 2325],
[1705, 1578, 1575, 1576]]>
Tokenize, then batch for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[ 98, 111, 111, 107, 0],
[2346, 2369, 2360, 2381, 2340],
[1705, 1578, 1575, 1576, 0]], dtype=int32)>
Batch, then tokenize for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[ 98, 111, 111, 107, 0],
[2346, 2369, 2360, 2381, 2340],
[1705, 1578, 1575, 1576, 0]], dtype=int32)>
Tokenization with truncation.
>>> inputs = ["I Like to Travel a Lot", "मैं किताबें पढ़ना पसंद करता हूं"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([[ 105, 32, 108, 105, 107],
[2350, 2376, 2306, 32, 2325]], dtype=int32)
Tokenization with vocabulary_size.
>>> latin_ext_cutoff = 592
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... vocabulary_size=latin_ext_cutoff)
>>> outputs = tokenizer("¿Cómo estás?")
>>> np.array(outputs)
array([191, 99, 243, 109, 111, 32, 101, 115, 116, 225, 115, 63],
dtype=int32)
>>> outputs = tokenizer("आप कैसे हैं")
>>> np.array(outputs)
array([591, 591, 32, 591, 591, 591, 591, 32, 591, 591, 591],
dtype=int32)
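Tokenization with normalization_form (a sketch; assumes standard NFC composition of "e" followed by U+0301 into "é", codepoint 233).
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
...     normalization_form="NFC")
>>> outputs = tokenizer("cafe\u0301")
>>> np.array(outputs)
array([ 99,  97, 102, 233], dtype=int32)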
Detokenization.
>>> inputs = tf.constant([110, 105, 110, 106, 97], dtype="int32")
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('ninja', dtype='<U5')
Detokenization with padding.
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=7)
>>> dataset = tf.data.Dataset.from_tensor_slices(["a b c", "b c", "a"])
>>> dataset = dataset.map(tokenizer)
>>> dataset.take(1).get_single_element()
<tf.Tensor: shape=(7,), dtype=int32,
numpy=array([97, 32, 98, 32, 99, 0, 0], dtype=int32)>
>>> detokunbatched = dataset.map(tokenizer.detokenize)
>>> detokunbatched.take(1).get_single_element()
<tf.Tensor: shape=(), dtype=string, numpy=b'a b c'>
Detokenization with invalid bytes.
>>> inputs = tf.constant([110, 105, 10000000, 110, 106, 97])
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
... errors="replace", replacement_char=88)
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('niXnja', dtype='<U6')
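Detokenization skipping invalid codepoints (a sketch; assumes errors="ignore" drops invalid values rather than replacing them, as described under the errors argument).
>>> inputs = tf.constant([110, 105, 10000000, 110, 106, 97])
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
...     errors="ignore")
>>> outputs = tokenizer.detokenize(inputs)
>>> np.array(outputs).astype("U")
array('ninja', dtype='<U5')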
tokenize method

UnicodeCodepointTokenizer.tokenize(inputs)
Transform input tensors of strings into output tokens.
Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method

UnicodeCodepointTokenizer.detokenize(inputs)
Transform tokens back into strings.
Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method

UnicodeCodepointTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method

UnicodeCodepointTokenizer.vocabulary_size()

Get the size of the tokenizer vocabulary. None implies no vocabulary size was provided.
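For example (a minimal sketch; the returned value is assumed to simply echo the constructor argument):
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(
...     vocabulary_size=1000)
>>> tokenizer.vocabulary_size()
1000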
token_to_id method

UnicodeCodepointTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method

UnicodeCodepointTokenizer.id_to_token(id)

Convert an integer id to a string token.
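Both methods map between single characters and their Unicode codepoints (a minimal sketch; assumes no vocabulary_size clamping):
>>> tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
>>> tokenizer.token_to_id("a")
97
>>> tokenizer.id_to_token(97)
'a'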