Chapter 8: Byte Pair Encoding (BPE) - How GPT Tokenizes Text


πŸ“– Reading Time: 70 minutes
πŸ’» Coding Time: 60 minutes

Welcome to Chapter 8! Today we learn THE tokenization method used in GPT! πŸš€

What we learned last chapter:

  • Simple word-level tokenization
  • Building encoder/decoder from scratch
  • Special tokens (<|unk|>, <|endoftext|>)

Today:

  • Why word-level tokenization isn’t enough
  • What is Byte Pair Encoding (BPE)?
  • How BPE works step-by-step
  • Building BPE tokenizer
  • Using tiktoken (OpenAI’s tokenizer)
  • Why GPT uses 50K tokens instead of 170K+ words

This is how GPT really works! πŸ’‘



The 3 Types of Tokenization

πŸ“Š Overview

There are 3 main approaches to tokenization:

1. Word-Level Tokenization
   └── Each word = 1 token

2. Character-Level Tokenization
   └── Each character = 1 token

3. Subword-Level Tokenization ← BPE is here!
   └── Words broken into subwords

Today’s focus: Why #3 (Subword) is the best!


🎯 Quick Comparison

Method          | Example: "playing"            | Vocab Size | Problem
Word-level      | ["playing"]                   | ~170,000   | Unknown words
Character-level | ["p","l","a","y","i","n","g"] | ~256       | Loses meaning
Subword (BPE)   | ["play", "ing"]               | ~50,000    | Practically none βœ…

Subword tokenization is the goldilocks solution!


Word-Level Tokenization Problems

πŸ“ How It Works

Sentence:

"My hobby is playing cricket"

Tokens (word-level):

["My", "hobby", "is", "playing", "cricket"]

Each word = 1 token βœ…


❌ Problem #1: Out-of-Vocabulary (OOV) Words

Training data:

"My hobby is playing cricket"
Vocabulary: {My, hobby, is, playing, cricket}

User input:

"My hobby is playing football"

Error! ❌ β€œfootball” not in vocabulary!


Real-World Impact

Training: Read 1 million books from 1900-1950
Vocabulary: {radio, telegram, typewriter, ...}

User in 2024: "I love my smartphone"
Result: ERROR! ❌

New words constantly appear!

  • β€œblockchain” (2008)
  • β€œselfie” (2013)
  • β€œemoji” (2014)
  • β€œChatGPT” (2022)

Word-level tokenization can’t handle them!
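
To make the failure concrete, here is a tiny illustrative word-level tokenizer (a sketch, not the Chapter 7 code) trained only on the five words above:

# Minimal word-level tokenizer (illustrative sketch)
training_words = "My hobby is playing cricket".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(training_words)))}

def encode(text):
    # Every word must already be in the vocabulary
    return [vocab[word] for word in text.split()]

print(encode("My hobby is playing cricket"))   # works: all 5 words are known
print(encode("My hobby is playing football"))  # KeyError: 'football' (out of vocabulary!)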


❌ Problem #2: Root Words Not Captured

Example:

Words: "boy" and "boys"

Word-level:
- "boy"  β†’ Token ID: 523
- "boys" β†’ Token ID: 1892

Problem: No relationship captured!

These words are VERY similar (same root!), but treated completely differently!


More Examples

token β†’ tokenize β†’ tokenization β†’ tokenizer

Word-level treats all as separate!
But they all share "token" as root!
modern β†’ modernize β†’ modernization

All share "modern" as root!
But word-level doesn't capture this!

Meaning is lost! 😒


❌ Problem #3: Huge Vocabulary Size

English language:

  • ~170,000 words in common use
  • ~470,000 words in total

Problem:

170,000 words = 170,000 tokens

Memory needed: HUGE! πŸ’Ύ
Training time: SLOW! 🐌
Model size: MASSIVE! πŸ“¦

Impractical for modern LLMs!


πŸ“Š Summary: Word-Level Problems

❌ Problem 1: Out-of-vocabulary words
❌ Problem 2: Root meanings not captured
❌ Problem 3: Massive vocabulary (170K+)

Example:
"boy" and "boys" β†’ Treated completely different
"smartphone" β†’ Error (if trained before 2000)

Character-Level Tokenization Problems

πŸ“ How It Works

Sentence:

"My hobby is playing cricket"

Tokens (character-level):

["M","y"," ","h","o","b","b","y"," ","i","s"," ","p","l","a","y","i","n","g"," ","c","r","i","c","k","e","t"]

Each character = 1 token


βœ… Advantage #1: Tiny Vocabulary

English has only:

  • 26 lowercase letters (a-z)
  • 26 uppercase letters (A-Z)
  • 10 digits (0-9)
  • ~40 punctuation marks
  • Total: roughly 100 symbols in everyday text
    (byte-level tokenizers round this up to all 256 possible byte values)

Vocabulary size: ~256 tokens! πŸŽ‰

Compare:

  • Word-level: 170,000 tokens
  • Character-level: 256 tokens

Over 600x smaller! 😍


βœ… Advantage #2: No OOV Problem!

Any word can be broken into characters:

Unknown word: "blockchain"
Character tokens: ["b","l","o","c","k","c","h","a","i","n"]
βœ… All characters are in vocabulary!
New word: "ChatGPT"
Character tokens: ["C","h","a","t","G","P","T"]
βœ… Works perfectly!

No word is β€œunknown” at character level!
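
A tiny sketch (illustrative only) shows both sides of the trade-off: the vocabulary is tiny and nothing is out-of-vocabulary, but the token sequences get much longer:

# Minimal character-level tokenizer (illustrative sketch)
training_text = "My hobby is playing cricket"
vocab = {ch: idx for idx, ch in enumerate(sorted(set(training_text)))}
print(f"Vocabulary size: {len(vocab)}")        # only 18 distinct characters here

def encode(text):
    # Any character seen in training can be encoded; real byte-level
    # tokenizers cover all 256 byte values, so nothing is ever unknown
    return [vocab[ch] for ch in text]

ids = encode("playing cricket")
print(f"'playing cricket' β†’ {len(ids)} tokens")   # 15 tokens vs. 2 at word level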


❌ Problem #1: Meaning is Lost

Example:

Word: "dinosaur"

Word-level: ["dinosaur"] (1 token)
βœ… Preserves meaning

Character-level: ["d","i","n","o","s","a","u","r"] (8 tokens)
❌ Meaning is lost!

Words have meaning. Characters don’t!


Impact on Learning

Sentence: "The boy loves dinosaurs"

Word-level understanding:
- "boy" β†’ human child
- "loves" β†’ emotion
- "dinosaurs" β†’ prehistoric creatures

Character-level understanding:
- "b","o","y" β†’ ???
- "d","i","n","o","s","a","u","r","s" β†’ ???

Model must learn to group characters back into words!
MUCH harder to learn! 😰

❌ Problem #2: Sequence is TOO Long

Example:

Paragraph: 100 words

Word-level: 100 tokens βœ…
Character-level: 500-600 tokens ❌

Character sequences are 5-6x longer!


Impact

Longer sequences means:

❌ More computation needed
❌ Slower training
❌ More memory required
❌ Harder for model to learn long-range dependencies

Example:

Book: "Harry Potter"
- 77,000 words
- 385,000+ characters!

Training on 385K tokens vs 77K tokens?
5x more computation! πŸ’Έ

❌ Problem #3: Similar Words Not Grouped

Example:

"boy" β†’ ["b","o","y"]
"boys" β†’ ["b","o","y","s"]

Better! At least "b","o","y" is shared!

But:

"modernization" β†’ ["m","o","d","e","r","n","i","z","a","t","i","o","n"]
"tokenization" β†’ ["t","o","k","e","n","i","z","a","t","i","o","n"]

Common suffix "-ization" is NOT captured as a unit!
Model must learn this pattern from scratch!

Root words and patterns still not effectively captured!


πŸ“Š Summary: Character-Level Problems

βœ… Advantage 1: Tiny vocabulary (256)
βœ… Advantage 2: No OOV problem

❌ Problem 1: Word meanings lost
❌ Problem 2: Sequences 5-6x longer
❌ Problem 3: Patterns not captured

Example:
"dinosaur" β†’ 8 separate characters (meaning lost)
"Harry Potter book" β†’ 385K+ tokens (too long!)

Subword Tokenization: Best of Both Worlds

πŸ’‘ The Solution: Subword Tokenization

Subword tokenization combines the best of word-level AND character-level tokenization!

Key idea: Break words into meaningful subword units


🎯 The Two Rules

Rule #1: Keep frequent words intact

Word appears often β†’ Keep as single token

Example: "the", "is", "play"
These are common, so keep them as-is!

Rule #2: Split rare words into subwords

Word is rare β†’ Break into meaningful subwords

Example: "unbelievable"
Rare word β†’ Split into ["un", "believ", "able"]

Best of both worlds! ✨


πŸ“Š Visual Comparison

Word: β€œunbelievable”

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  WORD-LEVEL                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  ["unbelievable"]                      β”‚
β”‚  1 token                               β”‚
β”‚  βœ… Compact                            β”‚
β”‚  ❌ Huge vocabulary                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  CHARACTER-LEVEL                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  ["u","n","b","e","l","i","e","v",    β”‚
β”‚   "a","b","l","e"]                     β”‚
β”‚  12 tokens                             β”‚
β”‚  βœ… Small vocabulary                   β”‚
β”‚  ❌ Too long, meaning lost             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SUBWORD-LEVEL (BPE) ← BEST!          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  ["un", "believ", "able"]              β”‚
β”‚  3 tokens                              β”‚
β”‚  βœ… Reasonable length                  β”‚
β”‚  βœ… Manageable vocabulary              β”‚
β”‚  βœ… Captures root meanings!            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Subword wins! πŸ†


🌟 Advantages of Subword Tokenization

Advantage #1: Captures Root Words

"token" β†’ ["token"]
"tokens" β†’ ["token", "s"]
"tokenize" β†’ ["token", "ize"]
"tokenization" β†’ ["token", "ization"]

All share "token" root! βœ…
Model learns: These words are related!

Advantage #2: Captures Common Patterns

"modernization" β†’ ["modern", "ization"]
"tokenization" β†’ ["token", "ization"]

Both share "-ization" suffix!
Model learns: These are both noun forms!
"unhappy" β†’ ["un", "happy"]
"undo" β†’ ["un", "do"]
"unbelievable" β†’ ["un", "believ", "able"]

All share "un-" prefix!
Model learns: "un-" means negation!

Advantage #3: Handles Unknown Words!

New word: "blockchain"

Word-level: ERROR! ❌

Subword-level: ["block", "chain"] βœ…
(Both subwords likely in vocabulary!)
New word: "selfie"

Word-level: ERROR! ❌

Subword-level: ["self", "ie"] βœ…
(Can break into known subwords!)

Even unknown words work! πŸŽ‰


Advantage #4: Reasonable Vocabulary Size

Word-level: 170,000+ tokens ❌
Character-level: 256 tokens (but long sequences) ❌
Subword-level: 30,000-50,000 tokens βœ…

Sweet spot! 🎯

GPT-2/GPT-3 use 50,257 tokens!


πŸ“Š Complete Comparison Table

Aspect          | Word     | Character  | Subword (BPE)
Vocabulary size | 170K+    | 256        | 30K-50K
OOV words       | ❌ Fails | βœ… Handles | βœ… Handles
Root meanings   | ❌ Lost  | ❌ Lost    | βœ… Captured
Sequence length | βœ… Short | ❌ Long    | βœ… Medium
Memory          | ❌ High  | βœ… Low     | βœ… Medium
Training speed  | ❌ Slow  | ❌ Slow    | βœ… Fast

Subword (BPE) wins on almost every metric! πŸ†


πŸ’‘ Key Takeaway

Subword tokenization (BPE) is:
βœ… Not too big (like word-level)
βœ… Not too small (like character-level)
βœ… Just right! (Goldilocks solution)

That's why GPT uses it!

What is Byte Pair Encoding?

πŸ“š History

BPE was invented in 1994!

Original purpose: Data compression (not for LLMs!)

Paper: β€œA New Algorithm for Data Compression” (1994)

How it was used:

Find most common byte pair β†’ Replace with new byte
Repeat until data is compressed

πŸ”„ BPE for Data Compression

Original algorithm:

Step 1: Find most common pair of bytes
Step 2: Replace with a new byte
Step 3: Repeat until stopping criteria

Example:

Original: "aaabdaaabac"

Step 1: "aa" is the most common pair
        Replace "aa" with "Z"
        Result: "ZabdZabac"

Step 2: "ab" is now the most common pair (appears twice)
        Replace "ab" with "Y"
        Result: "ZYdZYac"

Compressed! βœ…
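
Here is a rough sketch of that compression loop in Python. It is only an illustration: when two pairs are tied for most common, this code simply takes whichever it saw first, so its intermediate strings can differ from the walkthrough above.

from collections import Counter

def most_common_pair(seq):
    # Count adjacent pairs and return the most frequent one
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]

def merge(seq, pair, new_symbol):
    # Replace every occurrence of `pair` with `new_symbol`
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

data = list("aaabdaaabac")
for new_symbol in ["Z", "Y", "X"]:
    pair = most_common_pair(data)
    data = merge(data, pair, new_symbol)
    print(f"merged {pair} β†’ {new_symbol}: {''.join(data)}")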

πŸ’‘ BPE Adapted for LLMs

Same algorithm, different purpose!

For LLMs:

Step 1: Find most common character pair
Step 2: Merge into single token
Step 3: Repeat until vocabulary size reached

Result: Subword vocabulary!

🎯 BPE for LLMs: Key Idea

Instead of compressing data, we create a vocabulary!

Start: Individual characters
End: Characters + subwords + words

Example vocabulary after BPE:
["a", "b", "c", ..., "the", "ing", "tion", "token", "ization", ...]

Mix of characters and subwords! βœ…

BPE Algorithm Step-by-Step

πŸ“ Example Dataset

Let’s build a BPE tokenizer from scratch!

Training data (word frequencies):

"old" appears 7 times
"older" appears 3 times
"finest" appears 9 times
"lowest" appears 4 times

Total: 23 words

🎯 Preprocessing: Add End Tokens

Add </w> to mark word endings:

"old" β†’ "old</w>"       (7 times)
"older" β†’ "older</w>"   (3 times)  
"finest" β†’ "finest</w>" (9 times)
"lowest" β†’ "lowest</w>" (4 times)

Why </w>?

  • Marks word boundaries
  • Distinguishes "est" at the end of "finest" from "est" at the start of "estimate"
  • "est</w>" means the word ends with "est"

πŸ”’ Step 0: Initial Character Vocabulary

Break words into characters:

"old</w>" β†’ ['o', 'l', 'd', '</w>']
"older</w>" β†’ ['o', 'l', 'd', 'e', 'r', '</w>']
"finest</w>" β†’ ['f', 'i', 'n', 'e', 's', 't', '</w>']
"lowest</w>" β†’ ['l', 'o', 'w', 'e', 's', 't', '</w>']

Count character frequencies:

Character | Frequency
o         | 14 (7+3+4)
l         | 14 (7+3+4)
d         | 10 (7+3)
e         | 16 (3+9+4)
r         | 3
f         | 9
i         | 9
n         | 9
s         | 13 (9+4)
t         | 13 (9+4)
w         | 4
</w>      | 23 (all words)

Initial vocabulary: 12 tokens (all characters)


πŸ”„ Iteration 1: Find Most Common Pair

Look for most common character pair:

Scan all words:
- "es" in "finest": 9 times
- "es" in "lowest": 4 times
- Total: "es" appears 13 times! ← Most common (tied with "st" and "t</w>"; ties can be broken arbitrarily)

Merge β€œes” into single token:

Before: ['f', 'i', 'n', 'e', 's', 't', '</w>']
After:  ['f', 'i', 'n', 'es', 't', '</w>']

Before: ['l', 'o', 'w', 'e', 's', 't', '</w>']  
After:  ['l', 'o', 'w', 'es', 't', '</w>']

Update vocabulary:

Token | Frequency
…     | …
e     | 3 (16-13)
s     | 0 (13-13)
es    | 13 ← New!
…     | …

Vocabulary size: 13 tokens


πŸ”„ Iteration 2: Find Next Common Pair

Now β€œes” is a single token!

Look for next most common pair:

"est" pair:
- "es" + "t" in "finest": 9 times
- "es" + "t" in "lowest": 4 times
- Total: 13 times! ← Most common

Merge β€œes” and β€œt” β†’ β€œest”:

Before: ['f', 'i', 'n', 'es', 't', '</w>']
After:  ['f', 'i', 'n', 'est', '</w>']

Before: ['l', 'o', 'w', 'es', 't', '</w>']
After:  ['l', 'o', 'w', 'est', '</w>']

Update vocabulary:

Token | Frequency
es    | 0 (merged into "est")
est   | 13 ← New!

Vocabulary size: 14 tokens

πŸŽ‰ We discovered β€œest” as a root word!


πŸ”„ Iteration 3: Find Next Common Pair

Look for next common pair:

"est</w>" pair:
- "est" + "</w>" in "finest": 9 times
- "est" + "</w>" in "lowest": 4 times
- Total: 13 times!

Merge "est" and "</w>" β†’ "est</w>":

Before: ['f', 'i', 'n', 'est', '</w>']
After:  ['f', 'i', 'n', 'est</w>']

Before: ['l', 'o', 'w', 'est', '</w>']
After:  ['l', 'o', 'w', 'est</w>']

Why merge with </w>?

  • Distinguishes "est" at word END from "est" in the middle of a word
  • "estimate" contains "est" but NOT "est</w>"
  • "finest" contains "est</w>" (the word ends with "est")

🎯 Important distinction learned!


πŸ”„ Iteration 4: Find Next Common Pair

Look for next common pair:

"ol" pair:
- "o" + "l" in "old</w>": 7 times
- "o" + "l" in "older</w>": 3 times
- Total: 10 times!

Merge β€œo” and β€œl” β†’ β€œol”:

Before: ['o', 'l', 'd', '</w>']
After:  ['ol', 'd', '</w>']

Before: ['o', 'l', 'd', 'e', 'r', '</w>']
After:  ['ol', 'd', 'e', 'r', '</w>']

New token: β€œol” (appears 10 times)


πŸ”„ Iteration 5: Find Next Common Pair

Look for next common pair:

"old" pair:
- "ol" + "d" in "old</w>": 7 times
- "ol" + "d" in "older</w>": 3 times
- Total: 10 times!

Merge β€œol” and β€œd” β†’ β€œold”:

Before: ['ol', 'd', '</w>']
After:  ['old', '</w>']

Before: ['ol', 'd', 'e', 'r', '</w>']
After:  ['old', 'e', 'r', '</w>']

πŸŽ‰ We discovered β€œold” as a root word!


βœ… Final Vocabulary

After all iterations:

Token   | Type      | Frequency
e       | character | 3
f       | character | 9
i       | character | 9
l       | character | 4
n       | character | 9
o       | character | 4
r       | character | 3
w       | character | 4
</w>    | special   | 10
old     | subword   | 10
est</w> | subword   | 13

Final vocabulary size: 11 tokens

(Tokens whose count dropped to 0 during merging (d, s, t, es, est) are omitted from this table. In practice, BPE keeps all base characters and all merge results in the vocabulary, which is what lets it tokenize unknown words later.)

What we learned:

  • βœ… β€œold” is a root word (in β€œold” and β€œolder”)
  • βœ… β€œest” is a common ending (in β€œfinest” and β€œlowest”)
  • βœ… Characters still available for unknown words!

πŸ“Š Tokenization Results

Using final vocabulary:

"old" β†’ ["old", "</w>"]
"older" β†’ ["old", "e", "r", "</w>"]
"finest" β†’ ["f", "i", "n", "est</w>"]
"lowest" β†’ ["l", "o", "w", "est</w>"]

Notice:

  • βœ… β€œold” is preserved as root
  • βœ… β€œest” shows word ending pattern
  • βœ… Other characters broken down

Subword tokenization in action! πŸŽ‰


🎯 Key Insights

1. Root words captured:

"old" appears in "old" and "older"
β†’ BPE creates "old" token!

2. Common patterns captured:

"est" ending in "finest" and "lowest"
β†’ BPE creates "est</w>" token!

3. Can handle unknown words:

Unknown word: "oldest"
β†’ ["old", "est</w>"]
(Both pieces are already in our vocabulary!)

BPE for LLMs: Practical Example

🎯 Stopping Criteria

When to stop merging?

Option 1: Target vocabulary size

Goal: 50,000 tokens
Keep merging until vocab reaches 50K

Option 2: Number of iterations

Goal: 10,000 iterations
Stop after 10K merge operations

Option 3: Minimum frequency threshold

Goal: Pair must appear at least 5 times
Stop when no pair appears β‰₯5 times

GPT-2/GPT-3: Uses target vocabulary size (~50K tokens)
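
A hedged sketch of option 1, reusing count_pairs() and merge_pair() from the sketch in the previous section (and starting again from the character-level word_freqs):

# Stopping criterion: merge until the vocabulary reaches a target size
vocab = {symbol for word in word_freqs for symbol in word}   # 12 starting symbols
target_size = 17     # 12 symbols + 5 merges for our toy corpus; GPT-2 targets ~50,257

while len(vocab) < target_size:
    pairs = count_pairs(word_freqs)
    if not pairs:
        break        # nothing left to merge
    best = max(pairs, key=pairs.get)
    word_freqs = merge_pair(word_freqs, best)
    vocab.add(best[0] + best[1])   # the newly created token

print(sorted(vocab))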


πŸ“Š BPE for Real LLMs

Training data: Billions of words!

Process:

1. Start with characters (256 tokens)
2. Find most common pair
3. Merge into new token
4. Repeat 49,744 more times
5. Final vocabulary: 50,000 tokens!

Result:

  • Characters: a, b, c, …
  • Common subwords: ing, tion, est, …
  • Common words: the, is, are, …
  • All in one vocabulary! ✨

πŸ’‘ Advantages of BPE

1. Handles unknown words:

Word not in training: "cryptocurrency"
BPE breaks down: ["crypto", "currency"]
Or even: ["crypt", "o", "currency"]
Works! βœ…

2. Smaller vocabulary than word-level:

Word-level: 170K+ tokens
BPE: 50K tokens
3.4x smaller! πŸŽ‰

3. Captures patterns:

"tokenize" β†’ ["token", "ize"]
"modernize" β†’ ["modern", "ize"]
Pattern "-ize" learned! βœ…

4. Reasonable sequence length:

Paragraph: 100 words
BPE: ~130 tokens (1.3x original)
Much better than character-level (5-6x)!

Building BPE Tokenizer

πŸ’» Using tiktoken Library

OpenAI’s official tokenizer!

Install:

pip install tiktoken

Import:

import tiktoken

# Check version
print(tiktoken.__version__)
# Output (your version may differ): 0.6.0

πŸ”§ Initialize GPT-2 Tokenizer

# Load GPT-2 BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# This tokenizer has:
# - 50,257 tokens
# - Trained on massive datasets
# - Used in GPT-2, GPT-3, ChatGPT

βœ… Test: Simple Sentence

text = "Hello, do you like tea?"

# Encode
ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")

# Decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Output:

Text: Hello, do you like tea?
Token IDs: [15496, 11, 466, 345, 588, 8887, 30]
Decoded: Hello, do you like tea?

Perfect! βœ…


πŸ§ͺ Test: Unknown Words

# Word that didn't exist in training
text = "some unknown place"

ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")

decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Output:

Text: some unknown place
Token IDs: […, 6439, 1295]   (three IDs, one per word)
Decoded: some unknown place

No error! BPE handles it! βœ…

How? Breaks β€œunknown” into subwords that exist in vocabulary!
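
To see exactly which pieces BPE chose, decode each token ID on its own (the exact split depends on the GPT-2 vocabulary, so run it to see the actual pieces):

# Decode each token ID individually to reveal the subword pieces
for token_id in tokenizer.encode("some unknown place"):
    print(token_id, "β†’", repr(tokenizer.decode([token_id])))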


🎯 Test: End of Text Token

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces"

# Combine with <|endoftext|>
combined = text1 + " <|endoftext|> " + text2

# Special tokens must be explicitly allowed, otherwise tiktoken raises an error
ids = tokenizer.encode(combined, allowed_special={"<|endoftext|>"})
print(f"Token IDs: {ids}")

# Find <|endoftext|> token ID
endoftext_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]
print(f"<|endoftext|> token ID: {endoftext_id}")

Output:

Token IDs: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, ...]
<|endoftext|> token ID: 50256

Special insights:

  • <|endoftext|> token ID: 50256
  • This is the LAST token in GPT-2 vocabulary
  • Vocabulary size: 50,257 (0 to 50256)

πŸ”¬ Test: Completely Random Text

# Nonsense text
text = "akwirwiΠ΅Ρ€"  # Random characters

ids = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {ids}")

decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Output:

Text: akwirwiΠ΅Ρ€
Token IDs: [461, 4246, 343, 72, 10047]
Decoded: akwirwiΠ΅Ρ€

Still works! No error! βœ…

Why? BPE breaks it into characters/subwords that exist!


πŸ“Š How BPE Handles Unknown Words

Example: Word not in vocabulary

Word: "blockchain"

BPE process:
1. Check if "blockchain" is a token β†’ No
2. Try breaking: "block" + "chain"
   - "block" is a token? Likely yes! βœ…
   - "chain" is a token? Likely yes! βœ…
3. Tokenize as: ["block", "chain"]

Success! βœ…

Worst case scenario:

Word: "xyzabc" (complete nonsense)

BPE process:
1. Check "xyzabc" β†’ No
2. Check "xyz" β†’ No
3. Check "xy" β†’ No  
4. Break to characters: ["x", "y", "z", "a", "b", "c"]
5. All characters are in vocabulary! βœ…

Success! βœ…

BPE ALWAYS succeeds! Because vocabulary includes individual characters!
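
GPT-2's BPE actually operates on bytes, not just English letters, so non-English text and emoji also fall back to byte-level tokens rather than failing. A quick check (exact IDs and counts depend on the vocabulary; decoding a lone byte-level token may show a replacement character):

# Byte-level fallback: anything representable in UTF-8 can be tokenized
for sample in ["xyzabc", "schΓΆn", "πŸ™‚"]:
    ids = tokenizer.encode(sample)
    pieces = [tokenizer.decode([i]) for i in ids]
    print(f"{sample!r}: {len(ids)} tokens β†’ {pieces}")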


Using tiktoken Library

πŸ“š Complete Code Example

import tiktoken

# Initialize GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Test text
text = "Hello, world! <|endoftext|> This is amazing."

# Encode (allow the <|endoftext|> special token, otherwise tiktoken raises an error)
token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(f"Original text: {text}")
print(f"Token IDs: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

# Decode
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded text: {decoded_text}")

# Verify
print(f"Match: {text == decoded_text}")

Output:

Original text: Hello, world! <|endoftext|> This is amazing.
Token IDs: [15496, 11, 995, 0, 220, 50256, 770, 318, 4998, 13]
Number of tokens: 10
Decoded text: Hello, world! <|endoftext|> This is amazing.
Match: True

πŸ” Exploring Individual Tokens

# Encode individual words
words = ["Hello", "world", "amazing", "tokenization"]

for word in words:
    ids = tokenizer.encode(word)
    print(f"{word:15} β†’ {ids}")

Output:

Hello           β†’ [15496]
world           β†’ [6894]
amazing         β†’ [38094]
tokenization    β†’ [30001, 1634]

Notice:

  • Common words: 1 token
  • Complex words: multiple tokens
  • β€œtokenization” β†’ [β€œtoken”, β€œization”]

🎯 Token to Text Mapping

# Decode single token IDs
token_ids = [15496, 11, 995, 0]

for token_id in token_ids:
    text = tokenizer.decode([token_id])
    print(f"Token ID {token_id:5} β†’ '{text}'")

Output:

Token ID 15496 β†’ 'Hello'
Token ID    11 β†’ ','
Token ID   995 β†’ ' world'
Token ID     0 β†’ '!'

Interesting: Space is often included with word!
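
You can verify this yourself: GPT-2 treats a word with and without a leading space as two different tokens, consistent with the IDs shown above:

# "world" at the start of a text vs. " world" after another word
print(tokenizer.encode("world"))    # [6894]  (no leading space)
print(tokenizer.encode(" world"))   # [995]   (leading space is part of the token)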


πŸ“Š Vocabulary Statistics

# Get vocabulary size
vocab_size = tokenizer.n_vocab
print(f"Vocabulary size: {vocab_size}")

# Check special tokens (must be explicitly allowed when encoding)
endoftext = "<|endoftext|>"
endoftext_id = tokenizer.encode(endoftext, allowed_special={endoftext})[0]
print(f"{endoftext} token ID: {endoftext_id}")

# Verify it's the last token
print(f"Is last token: {endoftext_id == vocab_size - 1}")

Output:

Vocabulary size: 50257
<|endoftext|> token ID: 50256
Is last token: True

πŸ§ͺ Compare with Our Simple Tokenizer

Chapter 7 tokenizer:

# Our simple word-level tokenizer from Chapter 7
text = "Hello, world!"
# Any word missing from its fixed vocabulary β†’ error (or <|unk|>)! ❌

BPE tokenizer:

# GPT-2 BPE tokenizer
text = "Hello, world! Even with unknown words!"
ids = tokenizer.encode(text)
# Always works! βœ…

BPE is MUCH better! πŸŽ‰


Why GPT Uses BPE

🎯 The Numbers

English language:

  • Total words: ~170,000+ in common use

GPT-2/GPT-3 vocabulary:

  • BPE tokens: 50,257

Reduction: 3.4x smaller! πŸŽ‰


πŸ’° Cost Savings

With word-level (170K tokens):

Memory: HUGE πŸ’Ύ
Training: SLOW 🐌
Inference: EXPENSIVE πŸ’Έ

With BPE (50K tokens):

Memory: ~3.4x smaller embedding & output layers βœ…
Training: faster βœ…
Inference: cheaper βœ…

Big savings at scale! πŸ’°


βœ… Advantages Summary

1. No unknown words:

Any word can be broken into subwords/characters
100% coverage! βœ…

2. Captures patterns:

"tokenize" β†’ ["token", "ize"]
"modernize" β†’ ["modern", "ize"]
Model learns "-ize" pattern!

3. Reasonable vocabulary:

Not too big (word-level: 170K)
Not too small (character-level: 256)
Just right! (50K)

4. Handles all languages:

English: βœ…
Spanish: βœ…
Chinese: βœ…
Code: βœ…
Emoji: βœ…

🌍 Real-World Impact

GPT-3 statistics:

  • Trained on ~300 billion tokens (BPE tokens!)
  • Vocabulary: 50,257 tokens
  • Estimated training cost: several million dollars (~$4.6M is a commonly cited estimate)

If word-level tokenization were used instead:

  • Vocabulary: 170,000+ tokens
  • Much larger embedding and output layers
  • Training: slower and considerably more expensive

BPE saved OpenAI millions! πŸ’°


πŸ“Š BPE vs Others: Final Comparison

Metric        | Word     | Character | BPE
Vocab size    | 170K+    | 256       | 50K
Unknown words | ❌ Fails | βœ… Works  | βœ… Works
Root words    | ❌ Lost  | ❌ Lost   | βœ… Captured
Patterns      | ❌ No    | ❌ No     | βœ… Yes
Seq length    | βœ… Short | ❌ ~6x    | βœ… ~1.3x
Memory        | ❌ High  | βœ… Low    | βœ… Medium
Speed         | ❌ Slow  | ❌ Slow   | βœ… Fast
Used in GPT?  | ❌ No    | ❌ No     | βœ… YES!

BPE is the clear winner! πŸ†


Chapter Summary

πŸŽ‰ What We Learned Today

This was a DEEP chapter! Let’s recap:


1. Three Tokenization Approaches

Word-level:

Pros: Short sequences
Cons: 170K+ vocab, unknown words, no root capture

Character-level:

Pros: Small vocab (256), no unknown words
Cons: Long sequences, meaning lost, patterns not captured

Subword-level (BPE):

Pros: Medium vocab (50K), no unknown words, captures patterns, reasonable length
Cons: Practically none for LLM use βœ…

2. BPE Algorithm

Core idea:

1. Start with characters
2. Find most common pair
3. Merge into single token
4. Repeat until target vocabulary size

Example:

"finest" + "lowest" β†’ Discovers "est</w>" as common ending
"old" + "older" β†’ Discovers "old" as root word

3. BPE for LLMs

Rules:

Rule 1: Keep frequent words intact
Rule 2: Split rare words into subwords

Result:

Vocabulary: Mix of characters + subwords + common words
Size: ~50,000 tokens (GPT-2/GPT-3)
Coverage: 100% (no unknown words!)

4. Using tiktoken

import tiktoken

# Load GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Encode
ids = tokenizer.encode("Hello, world!")

# Decode
text = tokenizer.decode(ids)

Simple and powerful! βœ…


5. Why GPT Uses BPE

βœ… Handles unknown words
βœ… Captures root words and patterns
βœ… Reasonable vocabulary size (50K vs 170K+)
βœ… 3.4x memory/speed improvement
βœ… Billions saved in training costs

BPE is the secret behind GPT’s success!


πŸ“Š Key Statistics

GPT-2/GPT-3:

  • Vocabulary: 50,257 tokens
  • Last token: <|endoftext|> (ID: 50256)
  • Method: Byte Pair Encoding (BPE)

Comparison:

  • Word-level: 170,000+ tokens
  • Character-level: 256 tokens
  • BPE: 50,257 tokens

BPE is the goldilocks solution! 🎯


πŸ’‘ Key Takeaways

  1. BPE combines best of word and character tokenization
  2. BPE handles any word (even unknown ones!)
  3. BPE captures root words (β€œtoken” in β€œtokenize”, β€œtokenization”)
  4. BPE captures patterns (β€œ-ize”, β€œ-tion”, β€œun-”)
  5. GPT uses 50K BPE tokens instead of 170K+ words
  6. tiktoken is OpenAI’s official BPE library
  7. BPE saved millions in training costs

🎯 What We Learned (Checklist)

  • [x] Word-level tokenization problems
  • [x] Character-level tokenization problems
  • [x] Subword tokenization advantages
  • [x] BPE algorithm step-by-step
  • [x] Building BPE vocabulary
  • [x] Using tiktoken library
  • [x] Encoding and decoding with BPE
  • [x] Handling unknown words
  • [x] Why GPT uses BPE
  • [x] Cost savings from BPE

πŸ”œ Next Chapter: Chapter 9

Topic: Data Sampling, Batch Sizes, and Context Windows

What we’ll learn:

  • How to feed tokens to LLMs
  • What is a context window?
  • Batch size considerations
  • Creating training batches
  • Sliding window approach
  • Preparing data for training

From tokenization to training! πŸš€


πŸ“ Practice Exercise

Try this before next chapter:

  1. Install tiktoken
  2. Tokenize your favorite book paragraph
  3. Count tokens
  4. Try unknown/nonsense words
  5. Compare with character count
  6. Calculate compression ratio

Share your findings in comments! πŸ’¬
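
If you want a head start, here is a small starter snippet for the exercise (swap in your own paragraph; your ratios will depend on the text):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

paragraph = "Paste your favorite book paragraph here."
token_ids = tokenizer.encode(paragraph)

num_tokens = len(token_ids)
num_words = len(paragraph.split())
num_chars = len(paragraph)

print(f"Words: {num_words}, Characters: {num_chars}, BPE tokens: {num_tokens}")
print(f"Tokens per word: {num_tokens / num_words:.2f}")
print(f"Characters per token: {num_chars / num_tokens:.2f}")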


πŸš€ Take Action Now!

  1. πŸ’» Install tiktoken - pip install tiktoken
  2. πŸ§ͺ Experiment - Tokenize different texts
  3. πŸ“ Practice - Try unknown words
  4. ❓ Ask Questions - Comment if unclear
  5. πŸ”– Bookmark - Reference material
  6. ⏭️ Get Ready - Next: Data sampling!

Quick Reference

BPE Algorithm:

# Pseudocode
vocabulary = list_of_characters
while len(vocabulary) < target_size:
    pair = find_most_common_pair()
    new_token = merge(pair)
    vocabulary.add(new_token)
    update_frequencies()

tiktoken Usage:

import tiktoken

# Initialize
tokenizer = tiktoken.get_encoding("gpt2")

# Encode
ids = tokenizer.encode("Your text here")

# Decode
text = tokenizer.decode(ids)

# Vocabulary size
size = tokenizer.n_vocab  # 50257

Key Comparisons:

Aspect  | Word  | Char | BPE
Vocab   | 170K+ | 256  | 50K
Unknown | ❌    | βœ…   | βœ…
Root    | ❌    | ❌   | βœ…
Speed   | ❌    | ❌   | βœ…

Thank You!

You’ve completed Chapter 8 - Byte Pair Encoding! πŸŽ‰

You now know:

  • βœ… Why BPE is superior
  • βœ… How BPE algorithm works
  • βœ… How to use tiktoken
  • βœ… Why GPT uses 50K tokens
  • βœ… How BPE handles unknown words

Next chapter: Data sampling and context windows

You’re mastering LLMs! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Did you understand BPE algorithm?
  • Which part was most interesting?
  • Any questions about tokenization?
  • Share your tiktoken experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 9: Data Sampling & Context Windows
Chapter 10: Vector Embeddings
Chapter 11: Positional Encoding
Chapter 12: Self-Attention Mechanism

The journey continues! πŸ’»πŸ”₯


See you in Chapter 9 where we learn data sampling! πŸš€


Questions? Experiments to share? Drop them below! We’re here to help. πŸ’ͺ