Chapter 8: Byte Pair Encoding - The Secret Behind GPT Tokenization
📖 Reading Time: 70 minutes
💻 Coding Time: 60 minutes
Welcome to Chapter 8! Today we learn THE tokenization method used in GPT! 🎉
What we learned last chapter:
- Simple word-level tokenization
- Building encoder/decoder from scratch
- Special tokens (<|unk|>, <|endoftext|>)
Today:
- Why word-level tokenization isn't enough
- What is Byte Pair Encoding (BPE)?
- How BPE works step-by-step
- Building BPE tokenizer
- Using tiktoken (OpenAI's tokenizer)
- Why GPT uses 50K tokens instead of 170K+ words
This is how GPT really works! 💡
📑 Table of Contents
- The 3 Types of Tokenization
- Word-Level Tokenization Problems
- Character-Level Tokenization Problems
- Subword Tokenization: Best of Both Worlds
- What is Byte Pair Encoding?
- BPE Algorithm Step-by-Step
- BPE for LLMs: Practical Example
- Building BPE Tokenizer
- Using tiktoken Library
- Why GPT Uses BPE
- Chapter Summary
The 3 Types of Tokenization
📚 Overview
There are 3 main approaches to tokenization:
1. Word-Level Tokenization
   └── Each word = 1 token
2. Character-Level Tokenization
   └── Each character = 1 token
3. Subword-Level Tokenization ← BPE is here!
   └── Words broken into subwords
Today's focus: Why #3 (Subword) is the best!
🎯 Quick Comparison
| Method | Example: "playing" | Vocab Size | Problem |
|---|---|---|---|
| Word-level | ["playing"] | ~170,000 | Unknown words |
| Character-level | ["p","l","a","y","i","n","g"] | ~256 | Loses meaning |
| Subword (BPE) | ["play", "ing"] | ~50,000 | None! ✅ |
Subword tokenization is the Goldilocks solution!
Word-Level Tokenization Problems
📝 How It Works
Sentence:
"My hobby is playing cricket"
Tokens (word-level):
["My", "hobby", "is", "playing", "cricket"]
Each word = 1 token ✅
❌ Problem #1: Out-of-Vocabulary (OOV) Words
Training data:
"My hobby is playing cricket"
Vocabulary: {My, hobby, is, playing, cricket}
User input:
"My hobby is playing football"
Error! ❌ "football" not in vocabulary!
Real-World Impact
Training: Read 1 million books from 1900-1950
Vocabulary: {radio, telegram, typewriter, ...}
User in 2024: "I love my smartphone"
Result: ERROR! ❌
New words constantly appear!
- "blockchain" (2008)
- "selfie" (2013)
- "emoji" (2014)
- "ChatGPT" (2022)
Word-level tokenization can't handle them!
❌ Problem #2: Root Words Not Captured
Example:
Words: "boy" and "boys"
Word-level:
- "boy" β Token ID: 523
- "boys" β Token ID: 1892
Problem: No relationship captured!
These words are VERY similar (same root!), but treated completely differently!
More Examples
token β tokenize β tokenization β tokenizer
Word-level treats all as separate!
But they all share "token" as root!
modern β modernize β modernization
All share "modern" as root!
But word-level doesn't capture this!
Meaning is lost! π’
β Problem #3: Huge Vocabulary Size
English language:
- ~170,000 words in common use
- ~470,000 words in total
Problem:
170,000 words = 170,000 tokens
Memory needed: HUGE! 💾
Training time: SLOW! 🐌
Model size: MASSIVE! 📦
Impractical for modern LLMs!
📊 Summary: Word-Level Problems
❌ Problem 1: Out-of-vocabulary words
❌ Problem 2: Root meanings not captured
❌ Problem 3: Massive vocabulary (170K+)
Example:
"boy" and "boys" → Treated completely differently
"smartphone" → Error (if trained before 2000)
Character-Level Tokenization Problems
📝 How It Works
Sentence:
"My hobby is playing cricket"
Tokens (character-level):
["M","y"," ","h","o","b","b","y"," ","i","s"," ","p","l","a","y","i","n","g"," ","c","r","i","c","k","e","t"]
Each character = 1 token
✅ Advantage #1: Tiny Vocabulary
English has only:
- 26 lowercase letters (a-z)
- 26 uppercase letters (A-Z)
- 10 digits (0-9)
- ~40 punctuation marks
- Total: ~256 characters!
Vocabulary size: 256 tokens! 🎉
Compare:
- Word-level: 170,000 tokens
- Character-level: 256 tokens
~660x smaller! 🎉
✅ Advantage #2: No OOV Problem!
Any word can be broken into characters:
Unknown word: "blockchain"
Character tokens: ["b","l","o","c","k","c","h","a","i","n"]
✅ All characters are in vocabulary!
New word: "ChatGPT"
Character tokens: ["C","h","a","t","G","P","T"]
✅ Works perfectly!
No word is "unknown" at character level!
❌ Problem #1: Meaning is Lost
Example:
Word: "dinosaur"
Word-level: ["dinosaur"] (1 token)
✅ Preserves meaning
Character-level: ["d","i","n","o","s","a","u","r"] (8 tokens)
❌ Meaning is lost!
Words have meaning. Characters don't!
Impact on Learning
Sentence: "The boy loves dinosaurs"
Word-level understanding:
- "boy" β human child
- "loves" β emotion
- "dinosaurs" β prehistoric creatures
Character-level understanding:
- "b","o","y" β ???
- "d","i","n","o","s","a","u","r","s" β ???
Model must learn to group characters back into words!
MUCH harder to learn! π°
β Problem #2: Sequence is TOO Long
Example:
Paragraph: 100 words
Word-level: 100 tokens β
Character-level: 500-600 tokens β
Character sequences are 5-6x longer!
Impact
Longer sequences mean:
❌ More computation needed
❌ Slower training
❌ More memory required
❌ Harder for the model to learn long-range dependencies
Example:
Book: "Harry Potter"
- 77,000 words
- 385,000+ characters!
Training on 385K tokens vs 77K tokens?
5x more computation! 💸
❌ Problem #3: Similar Words Not Grouped
Example:
"boy" β ["b","o","y"]
"boys" β ["b","o","y","s"]
Better! At least "b","o","y" is shared!
But:
"modernization" β ["m","o","d","e","r","n","i","z","a","t","i","o","n"]
"tokenization" β ["t","o","k","e","n","i","z","a","t","i","o","n"]
Common suffix "-ization" is NOT captured as a unit!
Model must learn this pattern from scratch!
Root words and patterns still not effectively captured!
📊 Summary: Character-Level Problems
✅ Advantage 1: Tiny vocabulary (256)
✅ Advantage 2: No OOV problem
❌ Problem 1: Word meanings lost
❌ Problem 2: Sequences 5-6x longer
❌ Problem 3: Patterns not captured
Example:
"dinosaur" → 8 separate characters (meaning lost)
"Harry Potter book" → 385K+ tokens (too long!)
Subword Tokenization: Best of Both Worlds
💡 The Solution: Subword Tokenization
Subword tokenization combines the best of word-level AND character-level tokenization!
Key idea: Break words into meaningful subword units
🎯 The Two Rules
Rule #1: Keep frequent words intact
Word appears often → Keep as single token
Example: "the", "is", "play"
These are common, so keep them as-is!
Rule #2: Split rare words into subwords
Word is rare → Break into meaningful subwords
Example: "unbelievable"
Rare word → Split into ["un", "believ", "able"]
Best of both worlds! ✨
📊 Visual Comparison
Word: "unbelievable"
WORD-LEVEL
["unbelievable"]
1 token
✅ Compact
❌ Huge vocabulary

CHARACTER-LEVEL
["u","n","b","e","l","i","e","v","a","b","l","e"]
12 tokens
✅ Small vocabulary
❌ Too long, meaning lost

SUBWORD-LEVEL (BPE) ← BEST!
["un", "believ", "able"]
3 tokens
✅ Reasonable length
✅ Manageable vocabulary
✅ Captures root meanings!

Subword wins! 🏆
🌟 Advantages of Subword Tokenization
Advantage #1: Captures Root Words
"token" → ["token"]
"tokens" → ["token", "s"]
"tokenize" → ["token", "ize"]
"tokenization" → ["token", "ization"]
All share "token" root! ✅
Model learns: These words are related!
Advantage #2: Captures Common Patterns
"modernization" → ["modern", "ization"]
"tokenization" → ["token", "ization"]
Both share "-ization" suffix!
Model learns: These are both noun forms!
"unhappy" → ["un", "happy"]
"undo" → ["un", "do"]
"unbelievable" → ["un", "believ", "able"]
All share "un-" prefix!
Model learns: "un-" means negation!
Advantage #3: Handles Unknown Words!
New word: "blockchain"
Word-level: ERROR! ❌
Subword-level: ["block", "chain"] ✅
(Both subwords likely in vocabulary!)
New word: "selfie"
Word-level: ERROR! ❌
Subword-level: ["self", "ie"] ✅
(Can break into known subwords!)
Even unknown words work! 🎉
Advantage #4: Reasonable Vocabulary Size
Word-level: 170,000+ tokens ❌
Character-level: 256 tokens (but long sequences) ❌
Subword-level: 30,000-50,000 tokens ✅
Sweet spot! 🎯
GPT-2/GPT-3 use 50,257 tokens!
📊 Complete Comparison Table
| Aspect | Word | Character | Subword (BPE) |
|---|---|---|---|
| Vocabulary size | 170K+ | 256 | 30K-50K |
| OOV words | ❌ Fails | ✅ Handles | ✅ Handles |
| Root meanings | ❌ Lost | ❌ Lost | ✅ Captured |
| Sequence length | ✅ Short | ❌ Long | ✅ Medium |
| Memory | ❌ High | ✅ Low | ✅ Medium |
| Training speed | ❌ Slow | ❌ Slow | ✅ Fast |
Subword (BPE) wins on almost every metric! 🏆
💡 Key Takeaway
Subword tokenization (BPE) is:
✅ Not too big (like word-level)
✅ Not too small (like character-level)
✅ Just right! (Goldilocks solution)
That's why GPT uses it!
What is Byte Pair Encoding?
📜 History
BPE was invented in 1994!
Original purpose: Data compression (not for LLMs!)
Paper: "A New Algorithm for Data Compression" (1994)
How it was used:
Find most common byte pair → Replace with new byte
Repeat until data is compressed
📦 BPE for Data Compression
Original algorithm:
Step 1: Find most common pair of bytes
Step 2: Replace with a new byte
Step 3: Repeat until stopping criteria
Example:
Original: "aaabdaaabac"
Step 1: "aa" appears 4 times (most common)
Replace "aa" with "Z"
Result: "Zabdaabac"
Step 2: "Zab" appears twice... wait, "ab" appears 2 times
Replace "ab" with "Y"
Result: "ZYdZYac"
Compressed! β
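Here is a tiny sketch of that compression step in Python (the placeholder symbol "Z" and the helper name most_common_pair are just illustrative):

from collections import Counter

def most_common_pair(s):
    """Count every adjacent character pair (overlapping) and return the most frequent one."""
    pairs = Counter(s[i:i + 2] for i in range(len(s) - 1))
    return pairs.most_common(1)[0][0]

data = "aaabdaaabac"
pair = most_common_pair(data)         # "aa"
compressed = data.replace(pair, "Z")  # replace the pair with a new symbol
print(pair, "->", compressed)         # aa -> ZabdZabac
# Repeating the same find-and-replace step keeps shrinking the string.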
💡 BPE Adapted for LLMs
Same algorithm, different purpose!
For LLMs:
Step 1: Find most common character pair
Step 2: Merge into single token
Step 3: Repeat until vocabulary size reached
Result: Subword vocabulary!
🎯 BPE for LLMs: Key Idea
Instead of compressing data, we create a vocabulary!
Start: Individual characters
End: Characters + subwords + words
Example vocabulary after BPE:
["a", "b", "c", ..., "the", "ing", "tion", "token", "ization", ...]
Mix of characters and subwords! ✅
BPE Algorithm Step-by-Step
📋 Example Dataset
Let's build a BPE tokenizer from scratch!
Training data (word frequencies):
"old" appears 7 times
"older" appears 3 times
"finest" appears 9 times
"lowest" appears 4 times
Total: 23 words
🎯 Preprocessing: Add End Tokens
Add </w> to mark word endings:
"old" → "old</w>" (7 times)
"older" → "older</w>" (3 times)
"finest" → "finest</w>" (9 times)
"lowest" → "lowest</w>" (4 times)
Why </w>?
- Marks word boundaries
- Distinguishes "est" at the end of "finest" from "est" inside "estimate"
- "est</w>" means the word ends with "est"
🔢 Step 0: Initial Character Vocabulary
Break words into characters:
"old</w>" → ['o', 'l', 'd', '</w>']
"older</w>" → ['o', 'l', 'd', 'e', 'r', '</w>']
"finest</w>" → ['f', 'i', 'n', 'e', 's', 't', '</w>']
"lowest</w>" → ['l', 'o', 'w', 'e', 's', 't', '</w>']
Count character frequencies:
| Character | Frequency |
|---|---|
| o | 14 (7+3+4) |
| l | 14 (7+3+4) |
| d | 10 (7+3) |
| e | 16 (3+9+4) |
| r | 3 |
| f | 9 |
| i | 9 |
| n | 9 |
| s | 13 (9+4) |
| t | 13 (9+4) |
| w | 4 |
| </w> | 23 (all words) |
Initial vocabulary: 12 tokens (all characters)
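To make the counting concrete, here is a small sketch that rebuilds the same character-frequency table (variable names like word_freqs and to_symbols are just illustrative):

from collections import Counter

# Toy corpus: word frequencies, with </w> already appended to mark word endings
word_freqs = {"old</w>": 7, "older</w>": 3, "finest</w>": 9, "lowest</w>": 4}

def to_symbols(word):
    """Split a word into single characters, keeping the trailing </w> as one symbol."""
    return tuple(word[:-len("</w>")]) + ("</w>",)

char_freqs = Counter()
for word, freq in word_freqs.items():
    for symbol in to_symbols(word):
        char_freqs[symbol] += freq

print(char_freqs)
# Counter({'</w>': 23, 'e': 16, 'o': 14, 'l': 14, 's': 13, 't': 13, 'd': 10, ...})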
🔄 Iteration 1: Find Most Common Pair
Look for most common character pair:
Scan all words:
- "es" in "finest": 9 times
- "es" in "lowest": 4 times
- Total: "es" appears 13 times! ← Most common
Merge "es" into single token:
Before: ['f', 'i', 'n', 'e', 's', 't', '</w>']
After: ['f', 'i', 'n', 'es', 't', '</w>']
Before: ['l', 'o', 'w', 'e', 's', 't', '</w>']
After: ['l', 'o', 'w', 'es', 't', '</w>']
Update vocabulary:
| Token | Frequency |
|---|---|
| … | … |
| e | 3 (16-13) |
| s | 0 (13-13) |
| es | 13 ← New! |
| … | … |
Vocabulary size: 13 tokens
🔄 Iteration 2: Find Next Common Pair
Now "es" is a single token!
Look for next most common pair:
"est" pair:
- "es" + "t" in "finest": 9 times
- "es" + "t" in "lowest": 4 times
- Total: 13 times! ← Most common
Merge "es" and "t" → "est":
Before: ['f', 'i', 'n', 'es', 't', '</w>']
After: ['f', 'i', 'n', 'est', '</w>']
Before: ['l', 'o', 'w', 'es', 't', '</w>']
After: ['l', 'o', 'w', 'est', '</w>']
Update vocabulary:
| Token | Frequency |
|---|---|
| es | 0 (used in "est") |
| est | 13 ← New! |
Vocabulary size: 14 tokens
🎉 We discovered "est" as a common suffix!
🔄 Iteration 3: Find Next Common Pair
Look for next common pair:
"est</w>" pair:
- "est" + "</w>" in "finest": 9 times
- "est" + "</w>" in "lowest": 4 times
- Total: 13 times!
Merge "est" and "</w>" → "est</w>":
Before: ['f', 'i', 'n', 'est', '</w>']
After: ['f', 'i', 'n', 'est</w>']
Before: ['l', 'o', 'w', 'est', '</w>']
After: ['l', 'o', 'w', 'est</w>']
Why merge with </w>?
- Distinguishes "est" at word END ("est</w>") from "est" in the middle
- "estimate" contains "est" but NOT "est</w>"
- "finest" contains "est</w>" (the word ends with "est")
🎯 Important distinction learned!
🔄 Iteration 4: Find Next Common Pair
Look for next common pair:
"ol" pair:
- "o" + "l" in "old</w>": 7 times
- "o" + "l" in "older</w>": 3 times
- Total: 10 times!
Merge "o" and "l" → "ol":
Before: ['o', 'l', 'd', '</w>']
After: ['ol', 'd', '</w>']
Before: ['o', 'l', 'd', 'e', 'r', '</w>']
After: ['ol', 'd', 'e', 'r', '</w>']
New token: "ol" (appears 10 times)
🔄 Iteration 5: Find Next Common Pair
Look for next common pair:
"old" pair:
- "ol" + "d" in "old</w>": 7 times
- "ol" + "d" in "older</w>": 3 times
- Total: 10 times!
Merge "ol" and "d" → "old":
Before: ['ol', 'd', '</w>']
After: ['old', '</w>']
Before: ['ol', 'd', 'e', 'r', '</w>']
After: ['old', 'e', 'r', '</w>']
🎉 We discovered "old" as a root word!
✅ Final Vocabulary
After all iterations:
| Token | Type | Frequency |
|---|---|---|
| d | character | 0 |
| e | character | 3 |
| f | character | 9 |
| i | character | 9 |
| l | character | 4 |
| n | character | 9 |
| o | character | 4 |
| r | character | 3 |
| w | character | 4 |
| </w> | special | 10 |
| old | subword | 10 |
| est</w> | subword | 13 |
Final vocabulary: 11 tokens still in use ("d" and the intermediate tokens "s", "t", "es", "est", "ol" now have frequency 0)
What we learned:
- ✅ "old" is a root word (in "old" and "older")
- ✅ "est</w>" is a common ending (in "finest" and "lowest")
- ✅ Characters still available for unknown words!
📝 Tokenization Results
Using final vocabulary:
"old" → ["old", "</w>"]
"older" → ["old", "e", "r", "</w>"]
"finest" → ["f", "i", "n", "est</w>"]
"lowest" → ["l", "o", "w", "est</w>"]
Notice:
- ✅ "old" is preserved as root
- ✅ "est</w>" shows the word-ending pattern
- ✅ Other characters broken down
Subword tokenization in action! 🎉
🎯 Key Insights
1. Root words captured:
"old" appears in "old" and "older"
→ BPE creates an "old" token!
2. Common patterns captured:
"est" ending in "finest" and "lowest"
→ BPE creates an "est</w>" token!
3. Can handle unknown words:
Unknown word: "oldest"
→ ["old", "est</w>"]
(Both subwords are already in the vocabulary!)
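The whole walkthrough can be reproduced with a short from-scratch sketch. This is a toy illustration of the merge loop (not GPT-2's byte-level implementation, and the function names are just illustrative); with ties broken in scan order it finds the same five merges as above:

from collections import Counter

# Toy corpus from this section: each word is a tuple of symbols, mapped to its frequency
corpus = {
    ("o", "l", "d", "</w>"): 7,
    ("o", "l", "d", "e", "r", "</w>"): 3,
    ("f", "i", "n", "e", "s", "t", "</w>"): 9,
    ("l", "o", "w", "e", "s", "t", "</w>"): 4,
}

def count_pairs(corpus):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    a, b = pair
    new_corpus = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(a + b)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus[tuple(new_word)] = freq
    return new_corpus

merges = []
for _ in range(5):  # 5 merges, matching the walkthrough above (ties broken arbitrarily)
    best_pair = count_pairs(corpus).most_common(1)[0][0]
    merges.append(best_pair)
    corpus = merge_pair(corpus, best_pair)

print(merges)        # [('e', 's'), ('es', 't'), ('est', '</w>'), ('o', 'l'), ('ol', 'd')]
print(list(corpus))  # matches the Tokenization Results above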
BPE for LLMs: Practical Example
🎯 Stopping Criteria
When to stop merging?
Option 1: Target vocabulary size
Goal: 50,000 tokens
Keep merging until vocab reaches 50K
Option 2: Number of iterations
Goal: 10,000 iterations
Stop after 10K merge operations
Option 3: Minimum frequency threshold
Goal: Pair must appear at least 5 times
Stop when no pair appears ≥ 5 times
GPT-2/GPT-3: Uses target vocabulary size (~50K tokens)
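In code, the stopping check could look roughly like this (a sketch; the names and default thresholds are illustrative, not any library's API):

# Sketch of the three stopping rules for the merge loop
def should_stop(vocab_size, num_merges, best_pair_count,
                target_vocab_size=50_257, max_merges=10_000, min_pair_freq=5):
    return (
        vocab_size >= target_vocab_size      # Option 1: target vocabulary size reached
        or num_merges >= max_merges          # Option 2: fixed number of merge operations done
        or best_pair_count < min_pair_freq   # Option 3: best pair is too rare to be worth merging
    )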
🌍 BPE for Real LLMs
Training data: Billions of words!
Process:
1. Start with individual bytes/characters (256 tokens)
2. Find most common pair
3. Merge into new token
4. Repeat ~50,000 times
5. Final vocabulary: 50,257 tokens (256 base tokens + 50,000 merges + <|endoftext|>)!
Result:
- Characters: a, b, c, β¦
- Common subwords: ing, tion, est, β¦
- Common words: the, is, are, β¦
- All in one vocabulary! ✨
💡 Advantages of BPE
1. Handles unknown words:
Word not in training: "cryptocurrency"
BPE breaks down: ["crypto", "currency"]
Or even: ["crypt", "o", "currency"]
Works! ✅
2. Smaller vocabulary than word-level:
Word-level: 170K+ tokens
BPE: 50K tokens
3.4x smaller! 🎉
3. Captures patterns:
"tokenize" → ["token", "ize"]
"modernize" → ["modern", "ize"]
Pattern "-ize" learned! ✅
4. Reasonable sequence length:
Paragraph: 100 words
BPE: ~130 tokens (1.3x original)
Much better than character-level (5-6x)!
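The ~1.3x figure depends on the text, but you can measure it yourself with tiktoken (introduced in the next section). A minimal sketch, using an arbitrary paragraph:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Byte pair encoding keeps frequent words intact and splits rare words "
    "into smaller subword units, so the token count stays close to the word count."
)

num_words = len(text.split())
num_tokens = len(tokenizer.encode(text))
print(f"{num_words} words -> {num_tokens} tokens "
      f"({num_tokens / num_words:.2f} tokens per word)")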
Building BPE Tokenizer
💻 Using tiktoken Library
OpenAI's official tokenizer!
Install:
pip install tiktoken
Import:
import tiktoken
# Check version
print(tiktoken.__version__)
# Output: 0.6.0
🔧 Initialize GPT-2 Tokenizer
# Load GPT-2 BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# This tokenizer has:
# - 50,257 tokens
# - Trained on massive datasets
# - Used in GPT-2, GPT-3, ChatGPT
✅ Test: Simple Sentence
text = "Hello, do you like tea?"
# Encode
ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")
# Decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Output:
Text: Hello, do you like tea?
Token IDs: [15496, 11, 466, 345, 588, 8887, 30]
Decoded: Hello, do you like tea?
Perfect! ✅
🧪 Test: Unknown Words
# Word that didn't exist in training
text = "some unknown place"
ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Output:
Text: some unknown place
Token IDs: [..., 6439, 1295]
Decoded: some unknown place
No error! BPE handles it! ✅
How? Any word the tokenizer doesn't know as a whole gets broken into subwords that exist in the vocabulary!
🎯 Test: End of Text Token
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces"
# Combine with <|endoftext|>
combined = text1 + " <|endoftext|> " + text2
# Special tokens must be explicitly allowed, otherwise tiktoken raises an error
ids = tokenizer.encode(combined, allowed_special={"<|endoftext|>"})
print(f"Token IDs: {ids}")
# Find <|endoftext|> token ID
endoftext_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]
print(f"<|endoftext|> token ID: {endoftext_id}")
Output:
Token IDs: [15496, 11, 466, 345, 588, 8887, 30, 50256, 554, 262, ...]
<|endoftext|> token ID: 50256
Special insights:
- <|endoftext|> token ID: 50256
- This is the LAST token in GPT-2 vocabulary
- Vocabulary size: 50,257 (0 to 50256)
🔬 Test: Completely Random Text
# Nonsense text
text = "akwirwiер"  # Random characters
ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Output:
Text: akwirwiер
Token IDs: [461, 4246, 343, 72, 10047]
Decoded: akwirwiер
Still works! No error! ✅
Why? BPE breaks it into characters/subwords that exist!
🔍 How BPE Handles Unknown Words
Example: Word not in vocabulary
Word: "blockchain"
BPE process:
1. Check if "blockchain" is a token → No
2. Try breaking: "block" + "chain"
- "block" is a token? Likely yes! ✅
- "chain" is a token? Likely yes! ✅
3. Tokenize as: ["block", "chain"]
Success! ✅
Worst case scenario:
Word: "xyzabc" (complete nonsense)
BPE process:
1. Check "xyzabc" β No
2. Check "xyz" β No
3. Check "xy" β No
4. Break to characters: ["x", "y", "z", "a", "b", "c"]
5. All characters are in vocabulary! β
Success! β
BPE ALWAYS succeeds! Because vocabulary includes individual characters!
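A minimal sketch of that fallback behaviour, reusing the five merges learned from the toy corpus earlier in this chapter (real tokenizers like GPT-2's work on bytes and apply merges by rank, but the idea is the same):

def bpe_encode(word, merges):
    """Split a word into symbols, then apply the learned merges in training order."""
    symbols = list(word) + ["</w>"]   # starting from characters means encoding never fails
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merges learned from the toy corpus earlier in the chapter
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("o", "l"), ("ol", "d")]

print(bpe_encode("oldest", merges))  # ['old', 'est</w>']      -- both subwords were learned
print(bpe_encode("xyz", merges))     # ['x', 'y', 'z', '</w>'] -- falls back to characters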
Using tiktoken Library
📝 Complete Code Example
import tiktoken
# Initialize GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# Test text
text = "Hello, world! <|endoftext|> This is amazing."
# Encode
token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})  # allow the special token
print(f"Original text: {text}")
print(f"Token IDs: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")
# Decode
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded text: {decoded_text}")
# Verify
print(f"Match: {text == decoded_text}")
Output:
Original text: Hello, world! <|endoftext|> This is amazing.
Token IDs: [15496, 11, 995, 0, 50256, 770, 318, 4998, 13]
Number of tokens: 9
Decoded text: Hello, world! <|endoftext|> This is amazing.
Match: True
🔍 Exploring Individual Tokens
# Encode individual words
words = ["Hello", "world", "amazing", "tokenization"]
for word in words:
    ids = tokenizer.encode(word)
    print(f"{word:15} → {ids}")
Output:
Hello → [15496]
world → [6894]
amazing → [38094]
tokenization → [30001, 1634]
Notice:
- Common words: 1 token
- Complex words: multiple tokens
- "tokenization" → ["token", "ization"]
🎯 Token to Text Mapping
# Decode single token IDs
token_ids = [15496, 11, 995, 0]
for token_id in token_ids:
    text = tokenizer.decode([token_id])
    print(f"Token ID {token_id:5} → '{text}'")
Output:
Token ID 15496 → 'Hello'
Token ID 11 → ','
Token ID 995 → ' world'
Token ID 0 → '!'
Interesting: Space is often included with word!
📊 Vocabulary Statistics
# Get vocabulary size
vocab_size = tokenizer.n_vocab
print(f"Vocabulary size: {vocab_size}")
# Check special tokens
endoftext = "<|endoftext|>"
endoftext_id = tokenizer.encode(endoftext, allowed_special={endoftext})[0]
print(f"{endoftext} token ID: {endoftext_id}")
# Verify it's the last token
print(f"Is last token: {endoftext_id == vocab_size - 1}")
Output:
Vocabulary size: 50257
<|endoftext|> token ID: 50256
Is last token: True
🧪 Compare with Our Simple Tokenizer
Chapter 7 tokenizer:
# Our simple tokenizer from Chapter 7
text = "Hello, world!"
# Would fail on unknown words! ❌
BPE tokenizer:
# GPT-2 BPE tokenizer
text = "Hello, world! Even with unknown words!"
ids = tokenizer.encode(text)
# Always works! ✅
BPE is MUCH better! 🎉
Why GPT Uses BPE
🎯 The Numbers
English language:
- Total words: ~170,000+ in common use
GPT-2/GPT-3 vocabulary:
- BPE tokens: 50,257
Reduction: 3.4x smaller! 🎉
💰 Cost Savings
With word-level (170K tokens):
Memory: HUGE 💾
Training: SLOW 🐌
Inference: EXPENSIVE 💸
With BPE (50K tokens):
Memory: 3.4x less ✅
Training: 3.4x faster ✅
Inference: 3.4x cheaper ✅
Millions saved! 💰
✅ Advantages Summary
1. No unknown words:
Any word can be broken into subwords/characters
100% coverage! ✅
2. Captures patterns:
"tokenize" β ["token", "ize"]
"modernize" β ["modern", "ize"]
Model learns "-ize" pattern!
3. Reasonable vocabulary:
Not too big (word-level: 170K)
Not too small (character-level: 256)
Just right! (50K)
4. Handles all languages:
English: ✅
Spanish: ✅
Chinese: ✅
Code: ✅
Emoji: ✅
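You can verify this yourself: GPT-2's BPE operates on raw bytes, so any Unicode text encodes and decodes back losslessly (the sample strings below are arbitrary):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

samples = ["Hello!", "¿Cómo estás?", "你好，世界", "print('hi')", "🚀🔥"]
for text in samples:
    ids = tokenizer.encode(text)
    ok = tokenizer.decode(ids) == text
    print(f"{text!r}: {len(ids)} tokens, round-trip ok: {ok}")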
🌍 Real-World Impact
GPT-3 statistics:
- Trained on 300 billion tokens (BPE tokens!)
- Vocabulary: 50,257 tokens
- Cost: $4.6 million
If word-level was used:
- Vocabulary: 170,000+ tokens
- Cost: ~$15+ million (estimated)
- Training: 3-4x longer
BPE saved OpenAI millions! 💰
📊 BPE vs Others: Final Comparison
| Metric | Word | Character | BPE |
|---|---|---|---|
| Vocab size | 170K+ | 256 | 50K |
| Unknown words | ❌ Fails | ✅ Works | ✅ Works |
| Root words | ❌ Lost | ❌ Lost | ✅ Captured |
| Patterns | ❌ No | ❌ No | ✅ Yes |
| Seq length | ✅ Short | ❌ 6x | ✅ 1.3x |
| Memory | ❌ High | ✅ Low | ✅ Medium |
| Speed | ❌ Slow | ❌ Slow | ✅ Fast |
| Used in GPT? | ❌ No | ❌ No | ✅ YES! |
BPE is the clear winner! 🏆
Chapter Summary
📚 What We Learned Today
This was a DEEP chapter! Let's recap:
1. Three Tokenization Approaches
Word-level:
Pros: Short sequences
Cons: 170K+ vocab, unknown words, no root capture
Character-level:
Pros: Small vocab (256), no unknown words
Cons: Long sequences, meaning lost, patterns not captured
Subword-level (BPE):
Pros: Medium vocab (50K), no unknown words, captures patterns, reasonable length
Cons: None! ✅
2. BPE Algorithm
Core idea:
1. Start with characters
2. Find most common pair
3. Merge into single token
4. Repeat until target vocabulary size
Example:
"finest" + "lowest" β Discovers "est</w>" as common ending
"old" + "older" β Discovers "old" as root word
3. BPE for LLMs
Rules:
Rule 1: Keep frequent words intact
Rule 2: Split rare words into subwords
Result:
Vocabulary: Mix of characters + subwords + common words
Size: ~50,000 tokens (GPT-2/GPT-3)
Coverage: 100% (no unknown words!)
4. Using tiktoken
import tiktoken
# Load GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# Encode
ids = tokenizer.encode("Hello, world!")
# Decode
text = tokenizer.decode(ids)
Simple and powerful! ✅
5. Why GPT Uses BPE
✅ Handles unknown words
✅ Captures root words and patterns
✅ Reasonable vocabulary size (50K vs 170K+)
✅ 3.4x memory/speed improvement
✅ Millions saved in training costs
BPE is the secret behind GPT's success!
📊 Key Statistics
GPT-2/GPT-3:
- Vocabulary: 50,257 tokens
- Last token: <|endoftext|> (ID: 50256)
- Method: Byte Pair Encoding (BPE)
Comparison:
- Word-level: 170,000+ tokens
- Character-level: 256 tokens
- BPE: 50,257 tokens
BPE is the Goldilocks solution! 🎯
💡 Key Takeaways
- BPE combines best of word and character tokenization
- BPE handles any word (even unknown ones!)
- BPE captures root words ("token" in "tokenize", "tokenization")
- BPE captures patterns ("-ize", "-tion", "un-")
- GPT uses 50K BPE tokens instead of 170K+ words
- tiktoken is OpenAI's official BPE library
- BPE saved millions in training costs
🎯 What We Learned (Checklist)
- [x] Word-level tokenization problems
- [x] Character-level tokenization problems
- [x] Subword tokenization advantages
- [x] BPE algorithm step-by-step
- [x] Building BPE vocabulary
- [x] Using tiktoken library
- [x] Encoding and decoding with BPE
- [x] Handling unknown words
- [x] Why GPT uses BPE
- [x] Cost savings from BPE
🚀 Next Chapter: Chapter 9
Topic: Data Sampling, Batch Sizes, and Context Windows
What we'll learn:
- How to feed tokens to LLMs
- What is a context window?
- Batch size considerations
- Creating training batches
- Sliding window approach
- Preparing data for training
From tokenization to training! 🚀
📝 Practice Exercise
Try this before next chapter:
- Install tiktoken
- Tokenize your favorite book paragraph
- Count tokens
- Try unknown/nonsense words
- Compare with character count
- Calculate compression ratio
Share your findings in comments! 💬
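A starter script for this exercise (paste in your own paragraph; the variable names are only suggestions):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Paste any paragraph you like here
paragraph = "Replace this text with a paragraph from your favorite book."

ids = tokenizer.encode(paragraph)
chars = len(paragraph)
words = len(paragraph.split())
tokens = len(ids)

print(f"Characters: {chars}")
print(f"Words:      {words}")
print(f"BPE tokens: {tokens}")
print(f"Characters per token: {chars / tokens:.2f}")
print(f"Tokens per word:      {tokens / words:.2f}")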
🚀 Take Action Now!
- 💻 Install tiktoken - pip install tiktoken
- 🧪 Experiment - Tokenize different texts
- 📝 Practice - Try unknown words
- ❓ Ask Questions - Comment if unclear
- 🔖 Bookmark - Reference material
- ➡️ Get Ready - Next: Data sampling!
Quick Reference
BPE Algorithm:
# Pseudocode
vocabulary = list_of_characters
while len(vocabulary) < target_size:
    pair = find_most_common_pair()
    new_token = merge(pair)
    vocabulary.add(new_token)
    update_frequencies()
tiktoken Usage:
import tiktoken
# Initialize
tokenizer = tiktoken.get_encoding("gpt2")
# Encode
ids = tokenizer.encode("Your text here")
# Decode
text = tokenizer.decode(ids)
# Vocabulary size
size = tokenizer.n_vocab # 50257
Key Comparisons:
| Aspect | Word | Char | BPE |
|---|---|---|---|
| Vocab | 170K+ | 256 | 50K |
| Unknown | ❌ | ✅ | ✅ |
| Root | ❌ | ❌ | ✅ |
| Speed | ❌ | ❌ | ✅ |
Thank You!
You've completed Chapter 8 - Byte Pair Encoding! 🎉
You now know:
- ✅ Why BPE is superior
- ✅ How BPE algorithm works
- ✅ How to use tiktoken
- ✅ Why GPT uses 50K tokens
- ✅ How BPE handles unknown words
Next chapter: Data sampling and context windows
You're mastering LLMs! 🚀
📣 Your Feedback Matters!
Drop a comment:
- Did you understand BPE algorithm?
- Which part was most interesting?
- Any questions about tokenization?
- Share your tiktoken experiments!
We respond to every comment! 💬
🎯 Coming Up
Chapter 9: Data Sampling & Context Windows
Chapter 10: Vector Embeddings
Chapter 11: Positional Encoding
Chapter 12: Self-Attention Mechanism
The journey continues! 💻🔥
See you in Chapter 9 where we learn data sampling! 👋
Questions? Experiments to share? Drop them below! We're here to help. 💪