BytePairTokenizer
keras_nlp.tokenizers.BytePairTokenizer(
vocabulary=None,
merges=None,
sequence_length=None,
add_prefix_space=False,
unsplittable_tokens=None,
dtype="int32",
**kwargs
)
Byte-pair encoding tokenizer layer.

This BPE tokenizer provides the same functionality as the official GPT-2
tokenizer. Given the same vocabulary which maps tokens to ids, and merges
which describes BPE merge rules, it should provide the same output as the
OpenAI implementation
(https://github.com/openai/gpt-2/blob/master/src/encoder.py). Unlike the
OpenAI implementation, this one is graph-compatible, so you can use it
within a tf.data pipeline.
If input is a batch of strings (rank > 0): by default, the layer will
output a tf.RaggedTensor where the last dimension of the output is ragged.
If sequence_length is set, the layer will output a dense tf.Tensor where
all inputs have been padded or truncated to sequence_length.

If input is a scalar string (rank == 0): by default, the layer will output
a dense tf.Tensor with static shape [None]. If sequence_length is set, the
output will be a dense tf.Tensor of shape [sequence_length].
Arguments
- vocabulary: string or dict, maps tokens to integer ids. If it is a
  string, it should be the file path to a vocabulary file.
- merges: string or list, contains the merge rules. If it is a string, it
  should be the file path to the merge rules.
- sequence_length: int. If set, the output will be padded or truncated to
  sequence_length. Defaults to None.
- add_prefix_space: bool. Whether to add an initial space to the input.
  This tokenizer is whitespace aware, and will tokenize a word with a
  leading space differently. Adding a prefix space to the first word will
  cause it to be tokenized equivalently to all subsequent words in the
  sequence. Defaults to False.
- unsplittable_tokens: list. A list of strings that will never be split
  during the word-level splitting applied before the byte-pair encoding.
  This can be used to ensure special tokens map to unique indices in the
  vocabulary, even if these special tokens contain splittable characters
  such as punctuation. Defaults to None.
- dtype: the dtype of the layer outputs. Defaults to "int32".

Examples
Tokenize
>>> vocab = {"butter": 1, "fly": 2}
>>> merge = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
>>> tokenizer = keras_nlp.tokenizers.BytePairTokenizer(vocab, merge)
>>> outputs = tokenizer("butterfly")
>>> np.array(outputs)
array([1, 2], dtype=int32)
>>> seq1, seq2 = tokenizer(["butterfly", "butter"])
>>> np.array(seq1)
array([1, 2], dtype=int32)
>>> np.array(seq2)
array([1], dtype=int32)
>>> tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
... vocab, merge, sequence_length=2)
>>> seq1, seq2 = tokenizer(["butterfly", "butter"])
>>> np.array(seq1)
array([1, 2], dtype=int32)
>>> np.array(seq2)
array([1, 0], dtype=int32)
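The greedy merge procedure that produces these tokens can be sketched in a
few lines of plain Python. This is an illustration of the BPE algorithm
using the example vocabulary's merge rules, not the keras_nlp graph
implementation:

```python
def bpe_tokenize(word, merges):
    """Greedily apply BPE merge rules, in priority order, to a word."""
    # Rank each merge pair by its position in the merges list
    # (earlier rules have higher priority).
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    pieces = list(word)
    while len(pieces) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(pieces, pieces[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merge rules remain
        # Merge the winning pair into a single piece.
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces

merges = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
print(bpe_tokenize("butterfly", merges))  # ['butter', 'fly']
```

The resulting pieces are then looked up in the vocabulary, which is how
"butterfly" maps to ids [1, 2] above.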
Detokenize
>>> vocab = {"butter": 1, "fly": 2}
>>> merge = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
>>> tokenizer = keras_nlp.tokenizers.BytePairTokenizer(vocab, merge)
>>> tokenizer.detokenize([[1, 2]])
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'butterfly'],
dtype=object)>
tokenize method
BytePairTokenizer.tokenize(inputs)
Transform input tensors of strings into output tokens.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method
BytePairTokenizer.detokenize(inputs)
Transform tokens back into strings.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method
BytePairTokenizer.get_vocabulary()
Get the tokenizer vocabulary as a list of string tokens.
vocabulary_size method
BytePairTokenizer.vocabulary_size()
Get the integer size of the tokenizer vocabulary.
token_to_id method
BytePairTokenizer.token_to_id(token)
Convert a string token to an integer id.
id_to_token method
BytePairTokenizer.id_to_token(id)
Convert an integer id to a string token.
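A plain-dict sketch of the lookups these vocabulary utilities provide,
using the example vocabulary from above (illustration only, not the
layer's implementation):

```python
# The example vocabulary from the docs above, mapping tokens to ids.
vocab = {"butter": 1, "fly": 2}

# token_to_id is a forward lookup; id_to_token is its inverse.
id_to_token = {i: tok for tok, i in vocab.items()}

print(len(vocab))       # like vocabulary_size() -> 2
print(sorted(vocab))    # like get_vocabulary() -> ['butter', 'fly']
print(vocab["fly"])     # like token_to_id("fly") -> 2
print(id_to_token[1])   # like id_to_token(1) -> 'butter'
```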