Tokenizers in Language Models – MachineLearningMastery.com

Tokenization is a crucial preprocessing step in natural language processing (NLP) that converts raw text into tokens that can be processed by language models. Modern language models use sophisticated tokenization algorithms to handle the complexity of human language. In this article, we will explore common tokenization algorithms used in modern LLMs, their implementation, and how to use them.
Let's get started!

Tokenizers in Language Models
Photo by Belle Co. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Naive Tokenization
- Stemming and Lemmatization
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece and Unigram
Naive Tokenization
The simplest form of tokenization splits text into tokens based on whitespace. This is a common tokenization method used in many NLP tasks.
text = "Hello, world! This is a test."
tokens = text.split()
print(f"Tokens: {tokens}")
The output is:
Tokens: ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
While simple and fast, this approach has several limitations. Recall that a model handling text needs to know its vocabulary (the set of all possible tokens). Using this naive tokenization, the vocabulary consists of all words in the provided text. When training a model, you create the vocabulary from your training data. However, when using the trained model in your project, you may encounter words not in the vocabulary. In such cases, your model cannot handle them or must substitute them with a special "unknown" token.
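A minimal sketch of this limitation, using made-up training and test sentences: the vocabulary is built from whitespace tokens of the training text, and any word not seen during training falls back to a special "<unk>" token (the token name is just a convention chosen for illustration).

# Build a vocabulary from whitespace tokens of some toy training text
train_text = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(train_text.split())))}
vocab["<unk>"] = len(vocab)  # special token for out-of-vocabulary words

# At inference time, unseen words such as "dog" can only map to <unk>
new_text = "the dog sat on the mat"
token_ids = [vocab.get(word, vocab["<unk>"]) for word in new_text.split()]
print(token_ids)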
Another problem with naive tokenization is its poor handling of punctuation and special characters. For example, "world!" becomes one token, while in another sentence, "world" might be a separate token. This creates two different tokens in the vocabulary for essentially the same word. Similar issues arise with capitalization and hyphenation.
Why tokenize words by space? In English, space is how we separate words, and words are the basic units of language. You wouldn't want to tokenize input by bytes, as you would get meaningless characters that make it difficult for the model to understand the text's meaning. Similarly, tokenizing by sentences isn't ideal because there are several orders of magnitude more sentences than words. Training a model to understand text at the sentence level would require proportionally more training data.
However, are words the optimal level for tokenization? Ideally, you want to break text down into the smallest meaningful units. In German, space-based tokenization isn't ideal due to numerous compound words. Even in English, prefixes and suffixes that aren't standalone words carry meaning when combined with other words. For example, "unhappy" should be understood as "un-" + "happy".
Therefore, you need a better tokenization strategy.
Stemming and Lemmatization
By implementing more sophisticated tokenization algorithms, you can create a better vocabulary. For example, this regular expression tokenizes text into words, punctuation, and numbers:
import re
text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text)
print(f"Tokens: {tokens}")
To further reduce vocabulary size, you can convert everything to lowercase:
import re
text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text.lower())
print(f"Tokens: {tokens}")
and the output is:
Tokens: ['hello', ',', 'world', '!', 'this', 'is', 'a', 'test', '.']
However, this still doesn't address the problem of word variations.
Stemming and lemmatization are two techniques for reducing words to their root form. Stemming is a more aggressive technique that removes prefixes and suffixes based on rules. Lemmatization is gentler, reducing words to their base form using a dictionary. Both are language-specific, but stemming may produce invalid words.
In English, the Porter stemming algorithm is commonly used. You can implement it using the nltk library:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# download the required resources if you haven't done so
nltk.download('punkt_tab')

text = "These models may become unstable quickly if not initialized."
stemmer = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
and the output is:
['these', 'model', 'may', 'becom', 'unstabl', 'quickli', 'if', 'not', 'initi', '.']
You can see that "unstabl" is not a valid word, but it's what the Porter stemming algorithm produces.
Lemmatization is gentler and almost always produces valid words. Here's how to use the nltk library for lemmatization:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the required resources if you haven't done so
nltk.download('wordnet')

text = "These models may become unstable quickly if not initialized."
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
and the output is:
['These', 'model', 'may', 'become', 'unstable', 'quickly', 'if', 'not', 'initialized', '.']
In both cases, you first tokenize the words and then transform them with a stemmer or lemmatizer. This normalization step produces a more consistent vocabulary. However, the fundamental tokenization issues, such as recognizing subwords, remain unsolved.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is one of the most widely used tokenization algorithms in modern language models. Originally created as a text compression algorithm, it was introduced for machine translation and later adopted by GPT models. BPE works by iteratively merging the most frequent adjacent pairs of characters or tokens in the training data.
The algorithm starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs into new tokens. This process continues until the desired vocabulary size is reached. For English text, you can start with just the alphabet and some punctuation, making the initial character set very small. Then, common letter combinations are added to the vocabulary iteratively. The resulting vocabulary contains both individual characters and common subword units.
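To make the merge loop concrete, here is a toy sketch of the idea only, not the optimized implementation used by real tokenizers. The corpus and word frequencies are made up for illustration; each word starts as a tuple of characters, and each step merges the most frequent adjacent pair.

from collections import Counter

# Toy corpus: each word is a sequence of characters with a made-up frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def merge_step(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)  # the most frequent pair becomes a new token
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(3):  # a few merges; real tokenizers run until the target vocabulary size
    corpus, best = merge_step(corpus)
    print("merged:", best)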
BPE is trained on specific data, so the exact tokenization depends on the training data. Therefore, you should save and load the BPE tokenizer model for use in your project.
BPE doesn't specify how to define a word. For example, hyphenated words like "pre-trained" can be treated as one word or two words. This is determined by the "pre-tokenizer," which in its simplest form splits words by spaces.
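For instance, with the Hugging Face Tokenizers library you can inspect what a pre-tokenizer does before BPE ever runs. A short sketch comparing two built-in pre-tokenizers on a hyphenated word:

from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "pre-trained models"
# Whitespace() splits on whitespace *and* punctuation, so the hyphen separates tokens
print(Whitespace().pre_tokenize_str(text))
# WhitespaceSplit() splits on whitespace only, keeping "pre-trained" as one piece
print(WhitespaceSplit().pre_tokenize_str(text))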
Many transformer models use BPE, including GPT, BART, and RoBERTa. You can use their trained BPE tokenizers. Here's how to use the BPE tokenizer from the Hugging Face Transformers library:
from transformers import GPT2Tokenizer
# Load the GPT-2 tokenizer (which uses BPE)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tokenize a text
text = "Pre-trained models are available."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and its output is:
Token IDs: [6719, 12, 35311, 4981, 389, 1695, 13]
Tokens: ['Pre', '-', 'trained', 'Ġmodels', 'Ġare', 'Ġavailable', '.']
Decoded: Pre-trained models are available.
You can see that the tokenizer uses "Ġ" to represent spaces between words. This is a special marker used by GPT-2's BPE to indicate word boundaries. Notice that words are neither stemmed nor lemmatized: "models" stays as is, not transformed to "model".
An alternative to Hugging Face's tokenizer is OpenAI's tiktoken library. Here's an example:
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
text = "Pre-trained models are available."
tokens = encoding.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {[encoding.decode_single_token_bytes(t) for t in tokens]}")
print(f"Decoded: {encoding.decode(tokens)}")
To train your own BPE tokenizer, the Hugging Face Tokenizers library is the easiest option. Here's an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
print(ds)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
print(tokenizer)

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
print(tokenizer)
tokenizer.save("my-tokenizer.json")

# reload the trained tokenizer
tokenizer = Tokenizer.from_file("my-tokenizer.json")
Running this, you will see:
DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None, unk_token="[UNK]", continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))
[00:00:04] Pre-processing sequences  ███████████████████████████ 0 / 0
[00:00:00] Tokenize words            ███████████████████████████ 608587 / 608587
[00:00:00] Count pairs               ███████████████████████████ 608587 / 608587
[00:00:02] Compute merges            ███████████████████████████ 25018 / 25018
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":1, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":2, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":3, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":4, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None, unk_token="[UNK]", continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={"[UNK]":0, "[CLS]":1, "[SEP]":2, "[PAD]":3, "[MASK]":4, ...}, merges=[("t", "h"), ("i", "n"), ("e", "r"), ("a", "n"), ("th", "e"), ...]))
The BpeTrainer object has more arguments for controlling the training process. In this example, you loaded a dataset using Hugging Face's datasets library and trained the tokenizer on the text data. Every dataset is different. This one has "test", "train", and "validation" splits. Each split has one feature named "text" containing strings. We trained the tokenizer using ds["train"]["text"] and let the trainer find merges until reaching the desired vocabulary size.
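For example, you can cap the vocabulary size and require a minimum frequency before a merge is considered. A sketch of a more constrained trainer (the specific numbers here are arbitrary choices, not recommendations):

from tokenizers.trainers import BpeTrainer

# vocab_size and min_frequency are the most commonly adjusted knobs
trainer = BpeTrainer(
    vocab_size=16000,     # stop merging once the vocabulary reaches this size
    min_frequency=2,      # ignore pairs seen fewer than twice
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)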
You can see that the tokenizer's state before and after training differs. Tokens learned from the training data are added and associated with token IDs.
A key advantage of the BPE tokenizer is its ability to handle unknown words by breaking them down into known subword units.
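For example, using the GPT-2 tokenizer loaded earlier, an invented word is still encoded, just as a longer sequence of smaller pieces rather than an unknown token. The exact split you get depends on GPT-2's learned merges:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# An invented word is decomposed into known subword units instead of becoming unknown
tokens = tokenizer.tokenize("untokenizable")
print(tokens)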
WordPiece
WordPiece is a popular tokenization algorithm proposed by Google in 2016, used by BERT and its variants. It is also a subword tokenization algorithm. Let's see how it tokenizes a sentence:
from transformers import BertTokenizer
# Load the WordPiece tokenizer from BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a text
text = "These models are usually initialized with Gaussian random values."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
The output of this code is:
Token IDs: [101, 2122, 4275, 2024, 2788, 3988, 3550, 2007, 11721, 17854, 2937, 6721, 5300, 1012, 102]
Tokens: ['[CLS]', 'these', 'models', 'are', 'usually', 'initial', '##ized', 'with', 'ga', '##uss', '##ian', 'random', 'values', '.', '[SEP]']
Decoded: [CLS] these models are usually initialized with gaussian random values. [SEP]
From this output, you can see that the tokenizer splits "initialized" into "initial" and "##ized". The "##" prefix indicates that the token is a continuation of the previous word. If a word isn't prefixed with "##", it's assumed to have a space before it.
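At inference time, WordPiece segments each word greedily: it takes the longest prefix found in the vocabulary, then repeats on the remainder with the "##" continuation prefix. Below is a simplified sketch of that matching logic, using a tiny made-up vocabulary:

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as used by WordPiece at inference."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Tiny made-up vocabulary for illustration
vocab = {"initial", "##ize", "##d", "##ized", "with"}
print(wordpiece_tokenize("initialized", vocab))  # ['initial', '##ized']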
This result reflects some BERT-specific design choices. In this BERT model, all text is converted to lowercase, which the tokenizer handles implicitly. BERT also assumes text sequences start with a [CLS] token and end with a [SEP] token. These special tokens are added automatically by the tokenizer. None of these are required by the WordPiece algorithm, so you might not see them in other models.
WordPiece is similar to BPE. Both start with the set of all characters and merge some into new vocabulary tokens. BPE merges the most frequent token pairs, while WordPiece uses a scoring approach that maximizes the likelihood of the training data. The key difference is that BPE may create subword tokens from common words, while WordPiece typically keeps common words as single tokens.
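A common way to describe the WordPiece criterion is to score a candidate pair by its frequency divided by the product of the frequencies of its parts, so pairs whose parts rarely occur apart are favored. A toy comparison with made-up counts, for illustration only:

# Made-up counts for two candidate merges
counts = {
    ("th", "e"): {"pair": 50, "first": 300, "second": 400},      # very common parts
    ("gauss", "ian"): {"pair": 20, "first": 25, "second": 60},   # parts rarely seen apart
}

for (a, b), c in counts.items():
    bpe_criterion = c["pair"]  # BPE: merge the most frequent pair
    wordpiece_score = c["pair"] / (c["first"] * c["second"])  # one common WordPiece formulation
    print(f"{a}+{b}: BPE frequency={bpe_criterion}, WordPiece score={wordpiece_score:.6f}")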
Training a WordPiece tokenizer using the Hugging Face Tokenizers library is similar to BPE. You can use the WordPieceTrainer to train the tokenizer. Here's an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
tokenizer.save("my-tokenizer.json")
SentencePiece and Unigram
BPE and WordPiece are built from the bottom up. They start with the set of all characters and merge some into new vocabulary tokens. You can also build a tokenizer from the top down, starting with all words from the training data and pruning the vocabulary down to the desired size.
Unigram is such an algorithm. Training a Unigram tokenizer involves removing vocabulary items at each step based on a log-likelihood score. Unlike BPE and WordPiece, the trained Unigram tokenizer isn't rule-based but statistical: it stores the probability of each token, which is used to determine the tokenization of new text.
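Once trained, tokenizing new text amounts to finding the segmentation with the highest total log-probability. Below is a small dynamic-programming sketch of that idea with a made-up probability table; a real Unigram model learns these probabilities from data.

import math

# Made-up token probabilities for illustration
probs = {"un": 0.05, "related": 0.03, "rel": 0.01, "ated": 0.02, "u": 0.005, "n": 0.005}

def unigram_tokenize(text, probs):
    """Find the segmentation with the highest sum of log-probabilities (Viterbi search)."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, split_point) for each prefix length
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk back through the best split points to recover the tokens
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

print(unigram_tokenize("unrelated", probs))  # ['un', 'related']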
While it's theoretically possible to have a standalone Unigram tokenizer, it's most commonly seen as part of SentencePiece.
SentencePiece is a language-neutral tokenization algorithm that doesn't require pre-tokenization of the input text. It's particularly useful for multilingual scenarios because, for example, English uses spaces to separate words but Chinese doesn't. SentencePiece handles such differences by treating the input text as a stream of Unicode characters. It then uses either BPE or Unigram to create the tokenization.
Here's how to use the SentencePiece tokenizer from the Hugging Face Transformers library:
from transformers import T5Tokenizer
# Load the T5 tokenizer (which uses SentencePiece+Unigram)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Tokenize a text
text = "SentencePiece is a subword tokenizer used in models such as XLNet and T5."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and the output is:
Token IDs: [4892, 17, 1433, 345, 23, 15, 565, 19, 3, 9, 769, 6051, 14145, 8585, 261, 16, 2250, 224, 38, 3, 4, 434, 9688, 11, 332, 9125, 1]
Tokens: ['▁Sen', 't', 'ence', 'P', 'i', 'e', 'ce', '▁is', '▁', 'a', '▁sub', 'word', '▁token', 'izer', '▁used', '▁in', '▁models', '▁such', '▁as', '▁', 'X', 'L', 'Net', '▁and', '▁T', '5.', '</s>']
Decoded: SentencePiece is a subword tokenizer used in models such as XLNet and T5.</s>
Similar to WordPiece's "##", a special prefix (the "▁" character, which looks like an underscore) is used to distinguish subwords from full words: a token starting with "▁" begins a new word, while a token without it continues the previous one.
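Because of this convention, reconstructing the text is straightforward: concatenate the tokens and turn each "▁" back into a space. A small sketch using the tokens from the output above (special tokens removed):

# Tokens as produced by the T5 tokenizer above, without the trailing special token
tokens = ['▁Sen', 't', 'ence', 'P', 'i', 'e', 'ce', '▁is', '▁', 'a', '▁sub', 'word',
          '▁token', 'izer', '▁used', '▁in', '▁models', '▁such', '▁as', '▁', 'X', 'L',
          'Net', '▁and', '▁T', '5.']
text = "".join(tokens).replace("▁", " ").strip()
print(text)  # SentencePiece is a subword tokenizer used in models such as XLNet and T5.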
Training a SentencePiece tokenizer with the Hugging Face Tokenizers library is also similar. Here's an example:
from datasets import load_dataset
from tokenizers import SentencePieceUnigramTokenizer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
tokenizer = SentencePieceUnigramTokenizer()

tokenizer.train_from_iterator(ds["train"]["text"])
tokenizer.save("my-tokenizer.json")
You can also use Google's sentencepiece library for the same purpose.
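Here's a minimal sketch with the sentencepiece package, assuming a plain-text training file named corpus.txt (one sentence per line); the file name, vocabulary size, and model prefix are placeholders:

import sentencepiece as spm

# Train a Unigram model on a local text file (corpus.txt is a placeholder path)
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="my_sp", vocab_size=8000, model_type="unigram"
)

# Load the trained model and tokenize some text
sp = spm.SentencePieceProcessor(model_file="my_sp.model")
print(sp.encode("Pre-trained models are available.", out_type=str))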
Further Readings
Below are some further readings on the topic:
Summary
In this article, you explored different types of tokenization algorithms used in modern language models. You learned that:
- BPE is widely used in GPT models and works by merging frequent adjacent pairs
- WordPiece is used in BERT models and maximizes the likelihood of the training data
- SentencePiece is more flexible and can handle different languages without pre-tokenization
- Modern tokenizers include important features like special tokens, truncation, and padding
Understanding these tokenization algorithms is crucial for working with modern language models and preprocessing text data effectively.