Training a Tokenizer for BERT Models
BERT is an early transformer-based model for NLP tasks that is small and fast enough to train on a home computer. Like all deep learning models, it requires a tokenizer to convert text into integer tokens. This article shows how to train a WordPiece tokenizer following BERT’s original design.
Let’s get started.
Photo by JOHN TOWNER. Some rights reserved.
Overview
This article is divided into two parts; they are:
- Choosing a Dataset
- Training a Tokenizer
Choosing a Dataset
To keep things simple, we’ll use English text only. WikiText is a popular preprocessed dataset for experiments, available through the Hugging Face datasets library:
```python
import random
from datasets import load_dataset

# path and name of the dataset
path, name = "wikitext", "wikitext-2-raw-v1"
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Print a few samples
for idx in random.sample(range(len(dataset)), 5):
    text = dataset[idx]["text"].strip()
    print(f"{idx}: {text}")
```
On first run, the dataset downloads to ~/.cache/huggingface/datasets and is cached for future use. WikiText-2, used above, is a smaller dataset suitable for quick experiments, while WikiText-103 is larger and more representative of real-world text, which makes for a better model.
The output of this code may look like this:
```
size: 36718
23905: Dudgeon Creek
4242: In 1825 the Congress of Mexico established the Port of Galveston and in 1830 …
7181: Crew : 5
24596: On March 19 , 2007 , Sports Illustrated posted on its website an article in its …
12920: The newest building included in the list is in the Quantock Hills . The …
```
The dataset contains strings of varying lengths with spaces around punctuation marks. While you could split on whitespace, this would not capture sub-word components. That is what the WordPiece tokenization algorithm is good at.
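To make the idea of sub-word components concrete, here is a minimal sketch of how a WordPiece-style tokenizer could split words at inference time using greedy longest-match-first lookup. The function name and the toy vocabulary are hypothetical, chosen only for illustration; a real vocabulary is learned from data, and a library implementation may differ in details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"Hell", "##o", "world", "token", "##izer", "##s", ",", "!"}

text = "Hello , world ! tokenizers"
tokens = []
for word in text.split():  # plain whitespace split, as in the dataset
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

A plain whitespace split would keep "tokenizers" as a single unit, and any word missing from the vocabulary would be lost entirely. With sub-word pieces, a rare word can still be represented by smaller pieces the vocabulary does contain.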
Training a Tokenizer
Several tokenization algorithms support sub-word components. BERT uses WordPiece, while modern LLMs often use Byte-Pair Encoding (BPE). We’ll train a WordPiece tokenizer following BERT’s original design.
The tokenizers library implements several tokenization algorithms that can be configured to your needs, which saves you the trouble of implementing the algorithm from scratch. You can install it by running the pip command “pip install tokenizers”.
Let’s train a tokenizer:
```python
import tokenizers
from datasets import load_dataset

path, name = "wikitext", "wikitext-103-raw-v1"
vocab_size = 30522
dataset = load_dataset(path, name, split="train")
print(f"size: {len(dataset)}")

# Collect texts, skip empty lines and title lines starting with "="
texts = []
for line in dataset["text"]:
    line = line.strip()
    if line and not line.startswith("="):
        texts.append(line)

# Configure WordPiece tokenizer with NFKC normalization and special tokens
tokenizer = tokenizers.Tokenizer(tokenizers.models.WordPiece())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
tokenizer.decoder = tokenizers.decoders.WordPiece()
tokenizer.normalizer = tokenizers.normalizers.NFKC()
trainer = tokenizers.trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"],
)

# Train the tokenizer and save it
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer_path = f"{name}_wordpiece.json"  # file name derived from the dataset name
tokenizer.save(tokenizer_path, pretty=True)

# Test the tokenizer by reloading it from the saved file
tokenizer = tokenizers.Tokenizer.from_file(tokenizer_path)
print(tokenizer.encode("Hello, world!").tokens)
print(tokenizer.decode(tokenizer.encode("Hello, world!").ids))
```
Running this code may print the following output:
```
wikitext-103-raw-v1/train-00000-of-00002(…): 100%|█████| 157M/157M [00:46<00:00, 3.40MB/s]
wikitext-103-raw-v1/train-00001-of-00002(…): 100%|█████| 157M/157M [00:04<00:00, 37.0MB/s]
Generating test split: 100%|███████████████| 4358/4358 [00:00<00:00, 174470.75 examples/s]
Generating train split: 100%|████████| 1801350/1801350 [00:09<00:00, 199210.10 examples/s]
Generating validation split: 100%|█████████| 3760/3760 [00:00<00:00, 201086.14 examples/s]
size: 1801350
[00:00:04] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:00] Tokenize words ████████████████████████████ 606445 / 606445
[00:00:00] Count pairs ████████████████████████████ 606445 / 606445
[00:00:04] Compute merges ████████████████████████████ 22020 / 22020
['Hell', '##o', ',', 'world', '!']
Hello, world!
```
This code uses the WikiText-103 dataset. The first run downloads the training data (two files of 157MB each in the log above), which contains 1.8 million lines. The training takes a few seconds. The example shows how “Hello, world!” becomes 5 tokens, with “Hello” split into “Hell” and “##o” (the “##” prefix indicates a sub-word component).
The tokenizer created in the code above has the following properties:
- Vocabulary size: 30,522 tokens (matching the original BERT model)
- Special tokens: [PAD], [CLS], [SEP], [MASK], and [UNK] are added to the vocabulary even though they are not in the dataset.
- Pre-tokenizer: Whitespace splitting (since the dataset has spaces around punctuation)
- Normalizer: NFKC normalization for Unicode text. Note that you can also configure the tokenizer to convert everything into lowercase, as the common BERT-uncased model does.
- Algorithm: WordPiece is used. Hence the decoder should be set accordingly so that the “##” prefix for sub-word components is recognized.
- Padding: Enabled with the [PAD] token for batch processing. This is not demonstrated in the code above, but it will be useful when you are training a BERT model.
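Two of the properties listed above, normalization and padding, can be illustrated in plain Python. The sketch below uses only the standard library; the helper names and the token ids are hypothetical, and the pad id of 0 is an assumption made for the example. In the tokenizers library itself, padding is handled by enable_padding, and lowercasing could presumably be added by combining NFKC with a lowercase normalizer.

```python
import unicodedata


def normalize(text, lowercase=True):
    """NFKC-normalize text and optionally lowercase it (as an uncased model would)."""
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text


def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    masks = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return padded, masks


# "\ufb01" is the single-character "fi" ligature, "\uff22" is a full-width "B"
print(normalize("\ufb01ne \uff22ERT"))

# Hypothetical token ids; 0 stands in for the id of [PAD]
padded, masks = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(padded)
print(masks)
```

NFKC folds compatibility characters such as ligatures and full-width letters into their plain equivalents, so visually similar strings map to the same tokens. Padding makes all sequences in a batch the same length, and the mask records which positions hold real tokens.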
The tokenizer is saved to a fairly large JSON file containing the full vocabulary, allowing you to reload the tokenizer later without retraining.
To convert a string into a list of tokens, use tokenizer.encode(text).tokens, in which each token is just a string. For use in a model, use tokenizer.encode(text).ids instead, which gives a list of integers. The decode method converts a list of integers back to a string. This is demonstrated in the code above.
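The relationship between tokens, ids, and the decoded string can be sketched without the library. The vocabulary, its id assignments, and the function name below are hypothetical. The punctuation cleanup is a simplification for illustration, not the library's actual decoder.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join WordPiece tokens back into a string."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # glue continuation onto previous piece
        else:
            words.append(token)
    text = " ".join(words)
    # Simplified cleanup: drop the space before common punctuation
    for mark in [",", ".", "!", "?"]:
        text = text.replace(" " + mark, mark)
    return text


# Hypothetical toy vocabulary mapping each token to an integer id
token_to_id = {
    tok: i
    for i, tok in enumerate(["[PAD]", "[UNK]", "Hell", "##o", ",", "world", "!"])
}
id_to_token = {i: tok for tok, i in token_to_id.items()}

tokens = ["Hell", "##o", ",", "world", "!"]
ids = [token_to_id[t] for t in tokens]    # what a model would consume
restored = [id_to_token[i] for i in ids]  # map the ids back to tokens
print(ids)
print(wordpiece_decode(restored))
```

This shows why the decoder has to know about the “##” prefix: without it, the continuation piece would be joined with a space and the original word would not be recovered.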
This article demonstrated how to train a WordPiece tokenizer for BERT using the WikiText dataset. You learned how to configure the tokenizer with appropriate normalization and special tokens, and how to encode text to tokens and decode back to strings. This is just a starting point for tokenizer training. Consider leveraging existing libraries and tools to optimize tokenizer training speed so it does not become a bottleneck in your training process.