Chapter 7: Tokenization - Your First Step Into Coding LLMs
Reading Time: 60 minutes
Coding Time: 90 minutes
Welcome to Chapter 7! THIS IS WHERE WE START CODING!
What we've learned so far:
- Chapter 1-6: Theory, architecture, roadmap
Today:
- Build a complete tokenizer from scratch
- Python implementation
- Encoder + Decoder
- Handle special tokens
- Real code you can run!
Grab your laptop and let's code!
Table of Contents
- What is Tokenization?
- Why Tokenization Matters
- The 3 Steps of Data Preparation
- Step 1: Breaking Text Into Tokens
- Step 2: Converting Tokens to Token IDs
- Building the Vocabulary
- Implementing Encoder and Decoder
- Special Context Tokens
- Complete Tokenizer Class
- Testing Our Tokenizer
- Why GPT Uses Byte Pair Encoding
- Chapter Summary
What is Tokenization?
The Simplest Definition
Tokenization = Breaking text into smaller units (tokens)
At its most basic:
Sentence: "The cat sat on the mat"
Tokens: ["The", "cat", "sat", "on", "the", "mat"]
But it's MORE than just splitting by spaces!
Why Can't We Just Use Words?
Good question! Here's why simple word splitting doesn't work:
Example 1: Punctuation
Sentence: "Hello, world!"
Wrong: ["Hello,", "world!"] ❌
Correct: ["Hello", ",", "world", "!"] ✅
Example 2: Contractions
Sentence: "I don't know"
Wrong: ["I", "don't", "know"] ❌
Correct: ["I", "don", "'", "t", "know"] ✅
Example 3: Special Characters
Sentence: "It's 2024--amazing!"
Wrong: ["It's", "2024--amazing!"] ❌
Correct: ["It", "'", "s", "2024", "--", "amazing", "!"] ✅
Tokenization handles ALL these cases properly!
Key Insight
Tokenization is the FIRST step in data preprocessing for LLMs.
LLM DATA PREPARATION PIPELINE
  1. TOKENIZATION (Today!)             Text → Tokens
  2. TOKEN IDs (Today!)                Tokens → Numbers
  3. VECTOR EMBEDDINGS (Next chapter)  Numbers → Vectors
  4. TRAINING                          Vectors → Trained LLM
Today: Steps 1 & 2!
Why Tokenization Matters
The Core Problem
Neural networks work with NUMBERS, not TEXT!
What we have:
"The cat sat on the mat"
What LLMs need:
[101, 202, 303, 404, 101, 505]
Tokenization bridges this gap!
Visual Flow
INPUT TEXT:  "The cat sat on the mat"
      ↓ TOKENIZATION
TOKENS:      ["The", "cat", "sat", "on", "the", "mat"]
      ↓ TOKEN IDs
TOKEN IDs:   [101, 202, 303, 404, 101, 505]
      ↓ EMBEDDINGS (Next chapter)
VECTORS:     [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]
      ↓ TRAINING
TRAINED LLM
Real-World Impact
Good tokenization:
- ✅ Reduces vocabulary size
- ✅ Handles unknown words
- ✅ Saves memory
- ✅ Faster training
- ✅ Better performance
Poor tokenization:
- ❌ Huge vocabulary
- ❌ Can't handle new words
- ❌ Wastes memory
- ❌ Slower training
- ❌ Worse performance
Tokenization is CRITICAL!
The 3 Steps of Data Preparation
Complete Overview
Remember from Chapter 6's roadmap:
STAGE 1: Building Blocks
├── Data Preparation   ← WE ARE HERE!
│   ├── Step 1: Tokenization
│   ├── Step 2: Token IDs
│   └── Step 3: Vector Embeddings
├── Attention Mechanisms
└── LLM Architecture
Today's Focus
Step 1: Split text into tokens
Input: "This is an example"
Output: ["This", "is", "an", "example"]
Step 2: Convert tokens to token IDs
Input: ["This", "is", "an", "example"]
Output: [45, 12, 89, 234]
Step 3: (Next chapter!)
Input: [45, 12, 89, 234]
Output: [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]
Step 1: Breaking Text Into Tokens
Setup: Loading Our Dataset
We'll use a real book for practice!
Book: "The Verdict" by Edith Wharton (1908)
Why? Free to download, perfect size for learning
Size: ~20,000 characters
Python Code: Loading the Book
# Load the text file
with open("the_verdict.txt", "r") as file:
    raw_text = file.read()

# Check what we loaded
print(f"Total characters: {len(raw_text)}")
print(f"First 100 characters:\n{raw_text[:100]}")
Output:
Total characters: 20479
First 100 characters:
I had always thought Jack Gisburn rather a cheap genius--though
a good fellow enough--so it was no...
Success! We've loaded the text! ✅
Simple Tokenization Attempt
Let's try the simplest approach: split by spaces
# Split by spaces
text = "Hello, world! This is an example."
tokens = text.split(" ")
print(tokens)
Output:
['Hello,', 'world!', 'This', 'is', 'an', 'example.']
Problem: Punctuation is stuck to words! ❌
Better Approach: Regular Expressions
Use Python's re library for smart splitting!
import re
text = "Hello, world! This is an example."
# Split on whitespace
result = re.split(r'(\s)', text)
print(result)
Output:
['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'an', ' ', 'example.']
Better! But there are still issues: whitespace characters are kept as tokens, and punctuation is still stuck to the words.
Advanced Tokenization
Split on whitespace AND punctuation!
import re
text = "Hello, world! This is--an example."
# Split on whitespace AND punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)
Output:
['Hello', ',', ' ', 'world', '!', ' ', 'This', ' ', 'is', '--',
'an', ' ', 'example', '.', '']
Much better! Punctuation is now separate! ✅
Remove Whitespaces
We don't need spaces as separate tokens:
import re
text = "Hello, world! This is--an example."
# Split on whitespace and punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# Remove whitespaces
tokens = [item for item in result if item.strip()]
print(tokens)
Output:
['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']
Perfect! Clean tokens without whitespace! ✅
Complete Tokenization Function
import re

def simple_tokenize(text):
    """
    Tokenize text by splitting on:
    - Whitespace
    - Punctuation: , . : ; ? _ ! " ' ( ) --

    Returns list of tokens (no whitespace)
    """
    # Split on whitespace and punctuation
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Remove whitespace tokens
    tokens = [item for item in result if item.strip()]
    return tokens

# Test
text = "Hello, world! This is--an example."
tokens = simple_tokenize(text)
print(tokens)
Output:
['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']
Apply to Full Book
# Tokenize the entire book!
preprocessed = simple_tokenize(raw_text)
print(f"Total tokens: {len(preprocessed)}")
print(f"First 30 tokens: {preprocessed[:30]}")
Output:
Total tokens: 4690
First 30 tokens: ['I', 'had', 'always', 'thought', 'Jack', 'Gisburn',
'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow',
'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me',
'to', 'hear', 'that', ',', 'in']
Amazing! 20,479 characters → 4,690 tokens!
Key Takeaway: Step 1
- ✅ We can break text into tokens
- ✅ Handle punctuation properly
- ✅ Remove unnecessary whitespace
- ✅ Tokenized the entire book (4,690 tokens)
Next: Convert tokens to numbers!
Step 2: Converting Tokens to Token IDs
Why Token IDs?
Problem: Computers can't understand words like "cat" or "dog"
Solution: Convert each token to a unique number (Token ID)!
Token: "cat" → Token ID: 42
Token: "dog" → Token ID: 87
Token: "fox" → Token ID: 103
Now computers can process them! ✅
Building a Vocabulary
Vocabulary = Mapping from tokens to token IDs
Process:
- Get unique tokens (no duplicates)
- Sort alphabetically (for consistency)
- Assign sequential IDs (0, 1, 2, 3, ...)
Example: Small Dataset
Training text:
"The quick brown fox jumps over the lazy dog"
Step 1: Tokenize
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Step 2: Get unique tokens (sorted)
["The", "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]
Note: "The" and "the" are different! (Case-sensitive)
For simplicity, we'll keep them separate. (Uppercase letters sort before lowercase, so "The" comes first.)
Step 3: Assign Token IDs
| Token | Token ID |
|---|---|
| The | 0 |
| brown | 1 |
| dog | 2 |
| fox | 3 |
| jumps | 4 |
| lazy | 5 |
| over | 6 |
| quick | 7 |
| the | 8 |
This is our VOCABULARY!
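To see the whole process end to end, here is a minimal sketch that reproduces the table above (plain str.split is enough here, because the toy sentence has no punctuation):
tokens = "The quick brown fox jumps over the lazy dog".split()
unique_tokens = sorted(set(tokens))  # case-sensitive: "The" and "the" both survive
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}
print(toy_vocab)
# {'The': 0, 'brown': 1, 'dog': 2, 'fox': 3, 'jumps': 4,
#  'lazy': 5, 'over': 6, 'quick': 7, 'the': 8}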
Python Code: Build Vocabulary
# Get unique tokens and sort alphabetically
all_words = sorted(set(preprocessed))
# Create vocabulary: token β token_id
vocab = {token: idx for idx, token in enumerate(all_words)}
# Check vocabulary size
print(f"Vocabulary size: {len(vocab)}")
Output:
Vocabulary size: 1130
We have 1,130 unique tokens in "The Verdict"!
Inspect the Vocabulary
# Show first 50 entries
for i, (token, token_id) in enumerate(vocab.items()):
    print(f"{token}: {token_id}")
    if i >= 49:
        break
Output:
!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10
A: 11
Ah: 12
Among: 13
And: 14
Are: 15
Art: 16
As: 17
At: 18
...
Notice:
- Punctuation gets IDs 0-10
- Words are alphabetically ordered
- Each unique token has unique ID
Simplified Code
# One-line vocabulary creation!
all_words = sorted(set(preprocessed))
vocab = {token: idx for idx, token in enumerate(all_words)}
print(f"Vocabulary size: {len(vocab)}")
That's it! Two lines to build a vocabulary!
Key Takeaway: Step 2
- ✅ Built vocabulary (1,130 unique tokens)
- ✅ Each token has a unique ID
- ✅ Alphabetically sorted for consistency
- ✅ Ready to encode text!
Building the Vocabulary
What is a Vocabulary?
Vocabulary = A dictionary mapping tokens to token IDs
Think of it like a real dictionary:
Dictionary: "apple" → "a fruit that grows on trees"
Vocabulary: "apple" → 42 (token ID)
Why Alphabetical Order?
Three reasons:
- Consistency: Same text always gives same IDs
- Reproducibility: Anyone can rebuild same vocabulary
- Debugging: Easy to find tokens
Example without sorting:
# Without sorting - arbitrary order!
tokens = list({"cat", "dog", "fox"})
vocab = {token: idx for idx, token in enumerate(tokens)}
# Result depends on set iteration order (unpredictable across runs!)
Example with sorting:
# With sorting - CONSISTENT!
tokens = sorted(["cat", "dog", "fox"])
vocab = {token: idx for idx, token in enumerate(tokens)}
# Result: {"cat": 0, "dog": 1, "fox": 2} (always the same!)
Vocabulary Statistics
print(f"Total tokens in text: {len(preprocessed)}")
print(f"Unique tokens (vocab size): {len(vocab)}")
print(f"Compression ratio: {len(preprocessed) / len(vocab):.2f}x")
Output:
Total tokens in text: 4690
Unique tokens (vocab size): 1130
Compression ratio: 4.15x
Interpretation:
- 4,690 total tokens
- Only 1,130 are unique
- Average token appears 4.15 times
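If you want to dig a little deeper, collections.Counter gives the frequency of every token (an optional sketch; the exact counts depend on your copy of the text):
from collections import Counter

token_counts = Counter(preprocessed)
print(token_counts.most_common(5))   # the five most frequent tokens and their counts
print(sum(token_counts.values()) / len(token_counts))  # average occurrences per unique token (~4.15 here)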
Vocabulary is Your LLM's Dictionary
Small vocabulary:
- ✅ Faster processing
- ✅ Less memory
- ❌ May miss words
Large vocabulary:
- ✅ Handles more words
- ❌ Slower processing
- ❌ More memory
GPT-3 vocabulary: ~50,000 tokens
Our toy example: 1,130 tokens
Implementing Encoder and Decoder
The Two-Way Process
ENCODER: Text → Token IDs
DECODER: Token IDs → Text
TEXT: "Hello, world!"
        ↓ ENCODER
TOKEN IDs: [523, 5, 892, 10]
        ↓ DECODER
TEXT: "Hello, world!"
Why do we need both?
- Encoder: Prepare data for training
- Decoder: Convert LLM output back to readable text
Encoder: Text → Token IDs
def encode(text, vocab):
    """
    Convert text to token IDs

    Args:
        text: String to encode
        vocab: Dictionary {token: token_id}

    Returns:
        List of token IDs
    """
    # Step 1: Tokenize
    tokens = simple_tokenize(text)
    # Step 2: Convert to IDs
    ids = [vocab[token] for token in tokens]
    return ids
# Test
text = "It's the last he painted, you know."
ids = encode(text, vocab)
print(f"Text: {text}")
print(f"Token IDs: {ids}")
Output:
Text: It's the last he painted, you know.
Token IDs: [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
It works! ✅
Decoder: Token IDs → Text
def decode(ids, vocab):
    """
    Convert token IDs back to text

    Args:
        ids: List of token IDs
        vocab: Dictionary {token: token_id}

    Returns:
        Decoded text string
    """
    # Step 1: Create reverse vocabulary (id → token)
    inv_vocab = {idx: token for token, idx in vocab.items()}
    # Step 2: Convert IDs to tokens
    tokens = [inv_vocab[idx] for idx in ids]
    # Step 3: Join tokens
    text = " ".join(tokens)
    # Step 4: Fix spacing before punctuation
    text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
    return text
# Test
decoded_text = decode(ids, vocab)
print(f"Decoded: {decoded_text}")
Output:
Decoded: It's the last he painted, you know.
Perfect! We got back the original text!
Complete Tokenizer Class
Let's package everything into a reusable class:
class SimpleTokenizerV1:
    """
    Simple tokenizer with encode and decode methods
    """
    def __init__(self, vocab):
        """
        Initialize with vocabulary

        Args:
            vocab: Dictionary {token: token_id}
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        """Convert text to token IDs"""
        # Tokenize
        tokens = simple_tokenize(text)
        # Convert to IDs
        ids = [self.str_to_int[token] for token in tokens]
        return ids

    def decode(self, ids):
        """Convert token IDs to text"""
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        # Join tokens
        text = " ".join(tokens)
        # Fix punctuation spacing
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        return text
Test the Tokenizer Class
# Create tokenizer instance
tokenizer = SimpleTokenizerV1(vocab)
# Test encode
text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
print(f"Original: {text}")
print(f"Encoded: {ids}")
# Test decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Output:
Original: It's the last he painted, you know.
Encoded: [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded: It's the last he painted, you know.
Tokenizer works perfectly! ✅
Key Takeaway
- ✅ Built encoder (text → IDs)
- ✅ Built decoder (IDs → text)
- ✅ Packaged into a reusable class
- ✅ Tested and verified!
Special Context Tokens
The Unknown Word Problem
Test this:
text = "Hello, do you like tea?"
ids = tokenizer.encode(text)
Output:
KeyError: 'Hello'
ERROR! Why?
"Hello" is not in our vocabulary! (The book was written in 1908 and doesn't use "Hello")
The Problem
Our tokenizer crashes on unknown words!
Training text: "The Verdict" (1908 book)
Vocabulary: Only words from that book
User input: "Hello, do you like pizza?"
        ↓
CRASH! ❌
This is UNACCEPTABLE for real LLMs!
The Solution: Special Tokens
Add special tokens to the vocabulary:
- <|unk|> - Unknown word token
- <|endoftext|> - End of text token
Special Token #1: Unknown Token
Purpose: Handle words not in vocabulary
How it works:
Input: "Hello, world!"
Vocabulary: {world, !, ,} (no "Hello")
Without <|unk|>: ERROR ❌
With <|unk|>: [<|unk|>, ",", "world", "!"] ✅
Unknown words → <|unk|> token
Special Token #2: End of Text
Purpose: Separate different texts during training
Example:
Text Source 1: Harry Potter Chapter 1
Text Source 2: Wikipedia article on AI
Text Source 3: News article
Without <|endoftext|>:
"Harry cast a spell on AI researchers who reported today..."
(Confusing! Mixes sources!)
With <|endoftext|>:
"Harry cast a spell <|endoftext|> AI researchers are <|endoftext|> Today the news..."
(Clear separation!)
Visual: Multiple Text Sources
TEXT SOURCE 1 (Book):      "Once upon a time, there was a wizard..."
        ↓
   <|endoftext|>
        ↓
TEXT SOURCE 2 (Wikipedia): "Machine learning is a subset of AI..."
        ↓
   <|endoftext|>
        ↓
TEXT SOURCE 3 (News):      "Tech companies announced today..."
<|endoftext|> keeps sources separate!
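In code, stitching a training corpus together this way can be as simple as joining the individual documents with the separator (a small sketch; the document strings are just placeholders):
documents = [
    "Once upon a time, there was a wizard...",   # book
    "Machine learning is a subset of AI...",     # Wikipedia
    "Tech companies announced today...",         # news
]
training_text = " <|endoftext|> ".join(documents)
print(training_text)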
Add Special Tokens to Vocabulary
# Start with original vocabulary
all_tokens = sorted(set(preprocessed))
# Add special tokens
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
# Rebuild vocabulary with special tokens
vocab = {token: idx for idx, token in enumerate(all_tokens)}

print(f"New vocabulary size: {len(vocab)}")
print("Last 4 entries:")
for token, idx in list(vocab.items())[-4:]:
    print(f" {token}: {idx}")
Output:
New vocabulary size: 1132
Last 4 entries:
 your: 1128
 yourself: 1129
 <|endoftext|>: 1130
 <|unk|>: 1131
Special tokens added! ✅
Updated Tokenizer with Special Tokens
class SimpleTokenizerV2:
    """
    Improved tokenizer that handles unknown words
    """
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        """Encode text, replacing unknown words with <|unk|>"""
        tokens = simple_tokenize(text)
        # Replace unknown tokens
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        return ids

    def decode(self, ids):
        """Decode token IDs to text"""
        tokens = [self.int_to_str[idx] for idx in ids]
        text = " ".join(tokens)
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        return text
Key change: self.str_to_int.get(token, self.str_to_int["<|unk|>"])
This returns <|unk|> ID if token not found!
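You can see this fallback in isolation with a toy dictionary (a tiny illustrative sketch, separate from the tokenizer itself):
toy_vocab = {"<|unk|>": 0, "cat": 1}
print(toy_vocab.get("cat", toy_vocab["<|unk|>"]))  # 1 -- known token keeps its own ID
print(toy_vocab.get("dog", toy_vocab["<|unk|>"]))  # 0 -- unknown token falls back to <|unk|>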
Test with Unknown Words
# Create new tokenizer
tokenizer = SimpleTokenizerV2(vocab)
# Test with unknown words
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
# Combine with <|endoftext|>
full_text = text1 + " <|endoftext|> " + text2
# Encode
ids = tokenizer.encode(full_text)
print(f"Text: {full_text}")
print(f"IDs: {ids}")
# Decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Output:
Text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
IDs: [1131, 5, 1095, 1131, 1131, 1131, 10, 1130, 1089, ...]
Decoded: <|unk|>, do you <|unk|> <|unk|>? <|endoftext|> In the sunlit terraces of the <|unk|>.
Notice:
- "Hello" → <|unk|> (token ID 1131)
- "like" → <|unk|> (not in vocabulary)
- "tea" → <|unk|> (not in vocabulary)
- "palace" → <|unk|> (not in vocabulary)
- <|endoftext|> → preserved (token ID 1130)
No error! The tokenizer handles unknowns gracefully! ✅
Other Special Tokens in Real LLMs
We covered:
- ✅ <|unk|> - Unknown token
- ✅ <|endoftext|> - End of text
Others used in research:
| Token | Purpose |
|---|---|
| <\|BOS\|> | Beginning of sequence |
| <\|EOS\|> | End of sequence |
| <\|PAD\|> | Padding to make sequences the same length |
Important Note About GPT
GPT models (GPT-2, GPT-3, GPT-4):
- ✅ Use: <|endoftext|>
- ❌ Don't use: <|unk|>, <|BOS|>, <|EOS|>, <|PAD|>
Why no <|unk|>?
GPT uses Byte Pair Encoding (BPE) which breaks unknown words into subwords!
Example:
Unknown word: "unbelievable"
BPE: ["un", "believ", "able"]
(All subwords are in vocabulary!)
Next chapter: We'll learn BPE in detail!
Complete Tokenizer Class
Final Production-Ready Tokenizer
import re

class SimpleTokenizerV2:
    """
    Complete tokenizer with:
    - Encoding and decoding
    - Unknown word handling
    - Special token support
    """
    def __init__(self, vocab):
        """
        Initialize tokenizer with vocabulary

        Args:
            vocab: Dictionary mapping tokens to IDs
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        """
        Encode text to token IDs

        Args:
            text: Input string

        Returns:
            List of token IDs
        """
        # Tokenize text
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [item for item in tokens if item.strip()]
        # Convert to IDs (use <|unk|> for unknown tokens)
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        return ids

    def decode(self, ids):
        """
        Decode token IDs to text

        Args:
            ids: List of token IDs

        Returns:
            Decoded text string
        """
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        # Join tokens
        text = " ".join(tokens)
        # Fix spacing before punctuation
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        return text
How to Use
# 1. Build vocabulary from training data
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

# Tokenize
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]

# Create vocabulary
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}

# 2. Create tokenizer
tokenizer = SimpleTokenizerV2(vocab)

# 3. Use it!
text = "Hello, world! <|endoftext|> This is a test."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Original: {text}")
print(f"Encoded: {ids}")
print(f"Decoded: {decoded}")
Testing Our Tokenizer
Test Case 1: Simple Sentence
text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Original: {text}")
print(f"Encoded: {ids}")
print(f"Decoded: {decoded}")
print(f"Match: {text == decoded}")
Output:
Original: It's the last he painted, you know.
Encoded: [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded: It's the last he painted, you know.
Match: True
✅ Perfect!
Test Case 2: Unknown Words
text = "Hello, do you like pizza?"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(f"Original: {text}")
print(f"Encoded: {ids}")
print(f"Decoded: {decoded}")
Output:
Original: Hello, do you like pizza?
Encoded: [1131, 5, 1095, 1131, 1131, 1131, 10]
Decoded: <|unk|>, do you <|unk|> <|unk|>?
✅ Handles unknowns gracefully!
Test Case 3: Multiple Texts
text1 = "Hello, world!"
text2 = "This is a test."
combined = text1 + " <|endoftext|> " + text2
ids = tokenizer.encode(combined)
decoded = tokenizer.decode(ids)
print(f"Original: {combined}")
print(f"Decoded: {decoded}")
Output:
Original: Hello, world! <|endoftext|> This is a test.
Decoded: <|unk|>, <|unk|>! <|endoftext|> This is a test.
✅ Preserves the <|endoftext|> separator!
Test Case 4: Edge Cases
# Empty string
print(tokenizer.encode("")) # []
# Only punctuation
print(tokenizer.encode("!?,.")) # [0, 10, 5, 7]
# Only special tokens
print(tokenizer.encode("<|endoftext|>")) # [1130]
✅ All edge cases handled!
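One more edge case is worth knowing about: encode is forgiving (unknown words fall back to <|unk|>), but decode is not -- an ID that never existed in the vocabulary raises a KeyError (a small illustrative sketch):
try:
    tokenizer.decode([999999])   # no such ID in our 1,132-entry vocabulary
except KeyError as err:
    print(f"Unknown token ID: {err}")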
Why GPT Uses Byte Pair Encoding
Limitations of Our Tokenizer
Our tokenizer:
Word = 1 token
"unbelievable" → 1 token
"supercalifragilisticexpialidocious" → 1 token
Problems:
1. Huge vocabulary needed
   - Every word = separate token
   - English has 170,000+ words
   - New words added daily!
2. Unknown words
   - Can't handle "blockchain", "selfie", "emoji"
   - Replaces them with <|unk|> (loses information!)
3. Memory waste
   - Rare words take space
   - "antidisestablishmentarianism" only appears once!
The BPE Solution
Byte Pair Encoding (BPE):
Break words into subword units!
Example:
Word: "unbelievable"
Our tokenizer: ["unbelievable"] (1 token)
BPE tokenizer: ["un", "believ", "able"] (3 tokens)
Benefits:
1. Smaller vocabulary
   - Reuse subwords across words
   - "un" appears in: unbelievable, unhappy, undo, etc.
2. No unknown words!
   - Any word can be broken into subwords (see the sketch after this list)
   - Worst case: break into characters
3. Efficient
   - Common words = 1 token
   - Rare words = multiple tokens
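To build intuition for the "no unknown words" claim, here is a toy greedy longest-match segmenter over a hand-written subword list. This is NOT the real BPE merge algorithm (that comes next chapter) -- just an illustration of how subwords plus a character-level fallback can cover any word:
def greedy_subword_split(word, subwords):
    """Greedy longest-match segmentation over a fixed subword set (illustration only)."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first; fall back to a single character
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

subwords = {"un", "believ", "able", "happy"}
print(greedy_subword_split("unbelievable", subwords))  # ['un', 'believ', 'able']
print(greedy_subword_split("unhappy", subwords))       # ['un', 'happy']
print(greedy_subword_split("zzz", subwords))           # ['z', 'z', 'z']  (character fallback)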
Comparison
| Tokenizer | "running" | "unbelievable" | Vocab Size |
|---|---|---|---|
| Ours (word-level) | ["running"] | ["unbelievable"] | ~170,000 |
| BPE (subword-level) | ["run", "ning"] | ["un", "believ", "able"] | ~50,000 |
BPE vocabulary is 3x smaller!
GPT Tokenization
GPT-2/GPT-3/GPT-4 all use BPE!
Benefits:
- ✅ Vocabulary: ~50,000 tokens
- ✅ No <|unk|> needed
- ✅ Handles all languages
- ✅ Efficient encoding
- β Efficient encoding
Example:
# GPT-2/GPT-3-style BPE via OpenAI's tiktoken library (preview of next chapter)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("unbelievable")
print([enc.decode([i]) for i in ids])
# Something like ["un", "believ", "able"] -- the exact split depends on the learned merges
Next Chapter Preview
Chapter 8: Byte Pair Encoding (BPE)
We'll learn:
- How BPE works
- Building a BPE tokenizer from scratch
- Using GPT's tokenizer
- Comparing tokenization methods
Get ready for more coding!
Chapter Summary
What We Accomplished Today
This was a HUGE chapter! Let's recap:
1. Understanding Tokenization
Tokenization = Breaking text into tokens
Why? Neural networks need numbers, not text!
Process:
Text → Tokens → Token IDs → Embeddings → LLM
2. Built Tokenizer from Scratch
Step 1: Text → Tokens
import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]
Step 2: Tokens → Token IDs
all_tokens = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(all_tokens)}
3. Implemented Encoder & Decoder
Encoder:
def encode(text):
    tokens = simple_tokenize(text)
    ids = [vocab[token] for token in tokens]
    return ids
Decoder:
def decode(ids):
    tokens = [inv_vocab[idx] for idx in ids]
    text = " ".join(tokens)
    return text
4. Added Special Tokens
Two key tokens:
- <|unk|> - Unknown words
- <|endoftext|> - Text separator
Why they matter:
- Handle unknown words gracefully
- Separate different text sources
- Used in real GPT models!
5. Complete Tokenizer Class
class SimpleTokenizerV2:
    def __init__(self, vocab):
        ...  # Initialize

    def encode(self, text):
        ...  # Text → Token IDs

    def decode(self, ids):
        ...  # Token IDs → Text
Production-ready! ✅
6. Learned About BPE
Word-level tokenization (our approach):
- 1 word = 1 token
- Large vocabulary
- Unknown word problem
Subword tokenization (BPE - GPT's approach):
- 1 word = multiple subword tokens
- Smaller vocabulary
- No unknown words!
Next chapter: Build BPE tokenizer!
Key Statistics
Our dataset:
- Book: "The Verdict" (1908)
- Characters: 20,479
- Tokens: 4,690
- Vocabulary: 1,132 (including special tokens)
Real LLMs:
- GPT-3 vocabulary: ~50,000 tokens
- Trained on: 300 billion tokens
- Uses: Byte Pair Encoding (BPE)
Key Takeaways
- Tokenization is essential for LLM data preprocessing
- Vocabulary maps tokens to IDs
- Encoder converts text to IDs
- Decoder converts IDs back to text
- Special tokens handle edge cases
- BPE is better than word-level tokenization
- GPT uses BPE, not our simple method
What We Learned (Checklist)
- [x] What is tokenization
- [x] Why tokenization matters
- [x] Breaking text into tokens (regex)
- [x] Building vocabulary
- [x] Converting tokens to IDs
- [x] Implementing encoder
- [x] Implementing decoder
- [x] Handling unknown words
- [x] Special tokens (<|unk|>, <|endoftext|>)
- [x] Complete tokenizer class
- [x] Testing and edge cases
- [x] Why GPT uses BPE
Next Chapter: Chapter 8
Topic: Byte Pair Encoding (BPE) Tokenization
What we'll learn:
- How the BPE algorithm works
- Building a BPE tokenizer from scratch
- Using OpenAI's tiktoken library
- Comparing tokenization methods
- Vocabulary size optimization
- Handling multiple languages
Hands-on coding continues!
Practice Exercise
Try this before next chapter:
- Download a different book (Project Gutenberg)
- Build vocabulary from that book
- Create tokenizer instance
- Encode some sentences
- Decode them back
- Count vocabulary size
Share your results in the comments!
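If you want a head start on the exercise, here is a minimal starter sketch (the URL is a placeholder -- swap in the plain-text link of whichever Project Gutenberg book you pick):
import re
import urllib.request

url = "https://www.gutenberg.org/cache/epub/XXXX/pgXXXX.txt"   # placeholder -- choose your own book
raw_text = urllib.request.urlopen(url).read().decode("utf-8")

tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]

all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}

tokenizer = SimpleTokenizerV2(vocab)
print(f"Vocabulary size: {len(vocab)}")
print(tokenizer.decode(tokenizer.encode("Your favorite sentence goes here.")))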
Take Action Now!
- Run the Code - Download the notebook and execute every cell
- Experiment - Try different texts, books, datasets
- Practice - Build a tokenizer for your own data
- Ask Questions - Comment if anything is unclear
- Bookmark - Reference material for the future
- Get Ready - Next chapter: Byte Pair Encoding!
Quick Reference
Key Functions:
# Tokenization
import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]
# Vocabulary
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
# Encoding
ids = [vocab[token] for token in tokens]
# Decoding
inv_vocab = {idx: token for token, idx in vocab.items()}
tokens = [inv_vocab[idx] for idx in ids]
text = " ".join(tokens)
Special Tokens:
| Token | Purpose | Usage |
|---|---|---|
| <\|unk\|> | Unknown word fallback | Our SimpleTokenizerV2 (not GPT) |
| <\|endoftext\|> | Separates text sources | Our tokenizer and GPT models |
| <\|BOS\|> | Beginning of sequence | Other LLMs (not GPT) |
| <\|EOS\|> | End of sequence | Other LLMs (not GPT) |
| <\|PAD\|> | Padding to equal lengths | Other LLMs (not GPT) |
Tokenizer Class Template:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}

    def encode(self, text):
        # Tokenize and convert to IDs
        pass

    def decode(self, ids):
        # Convert IDs to text
        pass
Thank You!
You've completed Chapter 7 - Tokenization!
You now know:
- ✅ How tokenization works
- ✅ How to build an encoder/decoder
- ✅ How to handle special tokens
- ✅ Why BPE is superior
- ✅ A real implementation in Python
Next chapter: Byte Pair Encoding (BPE)
This is just the beginning!
Your Feedback Matters!
Drop a comment:
- Which part was most interesting?
- What was challenging?
- What would you like more detail on?
- Share your tokenizer experiments!
We respond to every comment!
Coming Up
Chapter 8: Byte Pair Encoding (BPE)
Chapter 9: Vector Embeddings
Chapter 10: Positional Encoding
Chapter 11: Self-Attention Mechanism
Stay tuned! The coding adventure continues!
See you in Chapter 8, where we build a BPE tokenizer!
Questions? Stuck somewhere? Drop a comment below! We're here to help.