Chapter 7: Tokenization Explained - Building Your First Tokenizer From Scratch


πŸ“– Reading Time: 60 minutes
πŸ’» Coding Time: 90 minutes

Welcome to Chapter 7! THIS IS WHERE WE START CODING! πŸš€

What we’ve learned so far:

  • Chapters 1-6: Theory, architecture, roadmap

Today:

  • Build a complete tokenizer from scratch
  • Python implementation
  • Encoder + Decoder
  • Handle special tokens
  • Real code you can run!

Grab your laptop and let’s code! πŸ’»


πŸ“‘ Table of Contents

  • What is Tokenization?
  • Why Tokenization Matters
  • The 3 Steps of Data Preparation
  • Step 1: Breaking Text Into Tokens
  • Step 2: Converting Tokens to Token IDs
  • Building the Vocabulary
  • Implementing Encoder and Decoder
  • Special Context Tokens
  • Complete Tokenizer Class
  • Testing Our Tokenizer
  • Why GPT Uses Byte Pair Encoding
  • Chapter Summary

What is Tokenization?

πŸ”€ The Simplest Definition

Tokenization = Breaking text into smaller units (tokens)

At its basic form:

Sentence: "The cat sat on the mat"
Tokens:   ["The", "cat", "sat", "on", "the", "mat"]

But it’s MORE than just splitting by spaces!


πŸ€” Why Can’t We Just Use Words?

Good question! Here’s why simple word splitting doesn’t work:

Example 1: Punctuation

Sentence: "Hello, world!"
Wrong:    ["Hello,", "world!"]  ❌
Correct:  ["Hello", ",", "world", "!"]  βœ…

Example 2: Contractions

Sentence: "I don't know"
Wrong:    ["I", "don't", "know"]  ❌
Correct:  ["I", "don", "'", "t", "know"]  βœ…

Example 3: Special Characters

Sentence: "It's 2024--amazing!"
Wrong:    ["It's", "2024--amazing!"]  ❌
Correct:  ["It", "'", "s", "2024", "--", "amazing", "!"]  βœ…

Tokenization handles ALL these cases properly!


πŸ’‘ Key Insight

Tokenization is the FIRST step in data preprocessing for LLMs.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      LLM DATA PREPARATION PIPELINE       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                          β”‚
β”‚  1. TOKENIZATION (Today!)                β”‚
β”‚     Text β†’ Tokens                        β”‚
β”‚                                          β”‚
β”‚  2. TOKEN IDs (Today!)                   β”‚
β”‚     Tokens β†’ Numbers                     β”‚
β”‚                                          β”‚
β”‚  3. VECTOR EMBEDDINGS (Next chapter)     β”‚
β”‚     Numbers β†’ Vectors                    β”‚
β”‚                                          β”‚
β”‚  4. TRAINING                             β”‚
β”‚     Vectors β†’ Trained LLM                β”‚
β”‚                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Today: Steps 1 & 2!


Why Tokenization Matters

🎯 The Core Problem

Neural networks work with NUMBERS, not TEXT!

What we have:

"The cat sat on the mat"

What LLMs need:

[101, 202, 303, 404, 101, 505]

Tokenization bridges this gap!


πŸ“Š Visual Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                             β”‚
β”‚  INPUT TEXT                                 β”‚
β”‚  "The cat sat on the mat"                  β”‚
β”‚                                             β”‚
β”‚           ↓ TOKENIZATION                    β”‚
β”‚                                             β”‚
β”‚  TOKENS                                     β”‚
β”‚  ["The", "cat", "sat", "on", "the", "mat"]β”‚
β”‚                                             β”‚
β”‚           ↓ TOKEN IDs                       β”‚
β”‚                                             β”‚
β”‚  TOKEN IDs                                  β”‚
β”‚  [101, 202, 303, 404, 101, 505]           β”‚
β”‚                                             β”‚
β”‚           ↓ EMBEDDINGS (Next chapter)       β”‚
β”‚                                             β”‚
β”‚  VECTORS                                    β”‚
β”‚  [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]  β”‚
β”‚                                             β”‚
β”‚           ↓ TRAINING                        β”‚
β”‚                                             β”‚
β”‚  TRAINED LLM                                β”‚
β”‚                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’° Real-World Impact

Good tokenization:

  • βœ… Reduces vocabulary size
  • βœ… Handles unknown words
  • βœ… Saves memory
  • βœ… Faster training
  • βœ… Better performance

Poor tokenization:

  • ❌ Huge vocabulary
  • ❌ Can’t handle new words
  • ❌ Wastes memory
  • ❌ Slower training
  • ❌ Worse performance

Tokenization is CRITICAL!


The 3 Steps of Data Preparation

πŸ“‹ Complete Overview

Remember from Chapter 6’s roadmap:

STAGE 1: Building Blocks
β”œβ”€β”€ Data Preparation ← WE ARE HERE!
β”‚   β”œβ”€β”€ Step 1: Tokenization
β”‚   β”œβ”€β”€ Step 2: Token IDs
β”‚   └── Step 3: Vector Embeddings
β”œβ”€β”€ Attention Mechanisms
└── LLM Architecture

🎯 Today’s Focus

Step 1: Split text into tokens

Input:  "This is an example"
Output: ["This", "is", "an", "example"]

Step 2: Convert tokens to token IDs

Input:  ["This", "is", "an", "example"]
Output: [45, 12, 89, 234]

Step 3: (Next chapter!)

Input:  [45, 12, 89, 234]
Output: [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]
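Just to preview Step 3: conceptually, the ID-to-vector step is a lookup table that maps each token ID to a learned vector. A minimal sketch with made-up numbers (the real, trainable version comes next chapter):

# Conceptual preview: each token ID looks up a (learned) vector
embedding_table = {
    45:  [0.2, 0.8, -0.1, 0.5],
    12:  [0.5, 0.3,  0.9, -0.2],
    89:  [-0.4, 0.1, 0.7, 0.0],
    234: [0.6, -0.5, 0.2, 0.3],
}

token_ids = [45, 12, 89, 234]
vectors = [embedding_table[i] for i in token_ids]
print(vectors)  # one small vector per token ID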

Step 1: Breaking Text Into Tokens

πŸ› οΈ Setup: Loading Our Dataset

We’ll use a real book for practice!

Book: β€œThe Verdict” by Edith Wharton (1908)
Why? Free to download, perfect size for learning
Size: ~20,000 characters


πŸ’» Python Code: Loading the Book

# Load the text file
with open("the_verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()

# Check what we loaded
print(f"Total characters: {len(raw_text)}")
print(f"First 100 characters:\n{raw_text[:100]}")

Output:

Total characters: 20479
First 100 characters:
I had always thought Jack Gisburn rather a cheap genius--though
a good fellow enough--so it was no...

Success! We’ve loaded the text! βœ…


πŸ”§ Simple Tokenization Attempt

Let’s try the simplest approach: split by spaces

# Split by spaces
text = "Hello, world! This is an example."
tokens = text.split(" ")
print(tokens)

Output:

['Hello,', 'world!', 'This', 'is', 'an', 'example.']

Problem: Punctuation is stuck to words! ❌
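There's a second problem with split(" "): it only splits on single spaces, so repeated spaces and newlines make a mess. A quick check you can run:

# split(" ") stumbles on multiple spaces and newlines
text = "Hello,   world!\nNew line"
print(text.split(" "))

Output:

['Hello,', '', '', 'world!\nNew line']

Empty strings appear for the extra spaces, and the newline stays glued to the words. That's another reason to reach for regular expressions next.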


🎯 Better Approach: Regular Expressions

Use Python’s re library for smart splitting!

import re

text = "Hello, world! This is an example."

# Split on whitespace
result = re.split(r'(\s)', text)
print(result)

Output:

['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'an', ' ', 'example.']

Better! But there are still issues: the whitespace characters are kept as tokens, and punctuation is still stuck to the words.


πŸš€ Advanced Tokenization

Split on whitespace AND punctuation!

import re

text = "Hello, world! This is--an example."

# Split on whitespace AND punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)

Output:

['Hello', ',', '', ' ', 'world', '!', '', ' ', 'This', ' ', 'is', '--', 
 'an', ' ', 'example', '.', '']

Much better! Punctuation is now separate! (The empty strings show up wherever two separators sit next to each other; the next step removes them along with the whitespace.) βœ…


🧹 Remove Whitespaces

We don’t need spaces as separate tokens:

import re

text = "Hello, world! This is--an example."

# Split on whitespace and punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Remove whitespaces
tokens = [item for item in result if item.strip()]
print(tokens)

Output:

['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']

Perfect! Clean tokens without whitespaces! βœ…


πŸ“ Complete Tokenization Function

import re

def simple_tokenize(text):
    """
    Tokenize text by splitting on:
    - Whitespace
    - Punctuation: , . : ; ? _ ! " ' ( ) --
    
    Returns list of tokens (no whitespaces)
    """
    # Split on whitespace and punctuation
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    
    # Remove whitespace tokens
    tokens = [item for item in result if item.strip()]
    
    return tokens

# Test
text = "Hello, world! This is--an example."
tokens = simple_tokenize(text)
print(tokens)

Output:

['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']

🎯 Apply to Full Book

# Tokenize the entire book!
preprocessed = simple_tokenize(raw_text)

print(f"Total tokens: {len(preprocessed)}")
print(f"First 30 tokens: {preprocessed[:30]}")

Output:

Total tokens: 4690
First 30 tokens: ['I', 'had', 'always', 'thought', 'Jack', 'Gisburn', 
'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 
'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 
'to', 'hear', 'that', ',', 'in']

Amazing! 20,479 characters β†’ 4,690 tokens! πŸŽ‰
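If you're curious which tokens dominate the book, a quick frequency count is a nice optional check (using the preprocessed list from above):

from collections import Counter

token_counts = Counter(preprocessed)
print(token_counts.most_common(10))  # the 10 most frequent tokens (likely punctuation and common short words)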


πŸ’‘ Key Takeaway: Step 1

βœ… We can break text into tokens
βœ… Handle punctuation properly
βœ… Remove unnecessary whitespaces
βœ… Tokenized entire book (4,690 tokens)

Next: Convert tokens to numbers!


Step 2: Converting Tokens to Token IDs

πŸ€” Why Token IDs?

Problem: Computers can’t understand words like β€œcat” or β€œdog”

Solution: Convert each token to a unique number (Token ID)!

Token:    "cat"  β†’  Token ID: 42
Token:    "dog"  β†’  Token ID: 87
Token:    "fox"  β†’  Token ID: 103

Now computers can process them! βœ…


πŸ“š Building a Vocabulary

Vocabulary = Mapping from tokens to token IDs

Process:

  1. Get unique tokens (no duplicates)
  2. Sort alphabetically (for consistency)
  3. Assign sequential IDs (0, 1, 2, 3…)

🎯 Example: Small Dataset

Training text:

"The quick brown fox jumps over the lazy dog"

Step 1: Tokenize

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Step 2: Get unique tokens (sorted)

["The", "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]

Note: β€œThe” and β€œthe” are different tokens! (Case-sensitive)
For simplicity, we’ll keep them separate. (Sorting is by character code, so capitalized β€œThe” comes first.)

Step 3: Assign Token IDs

Token  Token ID
The    0
brown  1
dog    2
fox    3
jumps  4
lazy   5
over   6
quick  7
the    8

This is our VOCABULARY! πŸ“–
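Here's that toy example in code, so you can see the case-sensitivity (The vs the) for yourself β€” a tiny standalone sketch:

# Build the toy vocabulary from the example sentence
tokens = "The quick brown fox jumps over the lazy dog".split()
unique_tokens = sorted(set(tokens))
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}
print(toy_vocab)

Output:

{'The': 0, 'brown': 1, 'dog': 2, 'fox': 3, 'jumps': 4, 'lazy': 5, 'over': 6, 'quick': 7, 'the': 8}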


πŸ’» Python Code: Build Vocabulary

# Get unique tokens and sort alphabetically
all_words = sorted(set(preprocessed))

# Create vocabulary: token β†’ token_id
vocab = {token: idx for idx, token in enumerate(all_words)}

# Check vocabulary size
print(f"Vocabulary size: {len(vocab)}")

Output:

Vocabulary size: 1130

We have 1,130 unique tokens in β€œThe Verdict”!


πŸ” Inspect the Vocabulary

# Show first 50 entries
for i, (token, token_id) in enumerate(vocab.items()):
    print(f"{token}: {token_id}")
    if i >= 49:
        break

Output:

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10
A: 11
Ah: 12
Among: 13
And: 14
Are: 15
Art: 16
As: 17
At: 18
...

Notice:

  • Punctuation gets IDs 0-10
  • Words are sorted by character code (capitalized words come before lowercase ones)
  • Each unique token has a unique ID

🎯 Simplified Code

# One-line vocabulary creation!
all_words = sorted(set(preprocessed))
vocab = {token: idx for idx, token in enumerate(all_words)}

print(f"Vocabulary size: {len(vocab)}")

That’s it! Two lines to build a vocabulary! ✨


πŸ’‘ Key Takeaway: Step 2

βœ… Built vocabulary (1,130 unique tokens)
βœ… Each token has unique ID
βœ… Alphabetically sorted for consistency
βœ… Ready to encode text!

Building the Vocabulary

πŸ“– What is a Vocabulary?

Vocabulary = A dictionary mapping tokens to token IDs

Think of it like a real dictionary:

Dictionary: "apple" β†’ "a fruit that grows on trees"
Vocabulary: "apple" β†’ 42 (token ID)

πŸ”§ Why Alphabetical Order?

Three reasons:

  1. Consistency: Same text always gives same IDs
  2. Reproducibility: Anyone can rebuild same vocabulary
  3. Debugging: Easy to find tokens

Example without sorting:

# Without sorting - unpredictable order! ❌
tokens = set(["cat", "dog", "fox"])
vocab = {token: idx for idx, token in enumerate(tokens)}
# IDs depend on set iteration order (not guaranteed to be stable!)

Example with sorting:

# With sorting - CONSISTENT! βœ…
tokens = sorted(["cat", "dog", "fox"])
vocab = {token: idx for idx, token in enumerate(tokens)}
# Result: {"cat": 0, "dog": 1, "fox": 2} (always same!)

πŸ“Š Vocabulary Statistics

print(f"Total tokens in text: {len(preprocessed)}")
print(f"Unique tokens (vocab size): {len(vocab)}")
print(f"Compression ratio: {len(preprocessed) / len(vocab):.2f}x")

Output:

Total tokens in text: 4690
Unique tokens (vocab size): 1130
Compression ratio: 4.15x

Interpretation:

  • 4,690 total tokens
  • Only 1,130 are unique
  • Average token appears 4.15 times

🎯 Vocabulary is Your LLM’s Dictionary

Small vocabulary:

  • βœ… Faster processing
  • βœ… Less memory
  • ❌ May miss words

Large vocabulary:

  • βœ… Handles more words
  • ❌ Slower processing
  • ❌ More memory

GPT-3 vocabulary: ~50,000 tokens
Our toy example: 1,130 tokens


Implementing Encoder and Decoder

πŸ”„ The Two-Way Process

ENCODER: Text β†’ Token IDs
DECODER: Token IDs β†’ Text

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                         β”‚
β”‚  TEXT: "Hello, world!"                  β”‚
β”‚         ↓                               β”‚
β”‚         ENCODER                         β”‚
β”‚         ↓                               β”‚
β”‚  TOKEN IDs: [523, 5, 892, 10]          β”‚
β”‚         ↓                               β”‚
β”‚         DECODER                         β”‚
β”‚         ↓                               β”‚
β”‚  TEXT: "Hello, world!"                  β”‚
β”‚                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why do we need both?

  • Encoder: Prepare data for training
  • Decoder: Convert LLM output back to readable text

🎯 Encoder: Text β†’ Token IDs

def encode(text, vocab):
    """
    Convert text to token IDs
    
    Args:
        text: String to encode
        vocab: Dictionary {token: token_id}
    
    Returns:
        List of token IDs
    """
    # Step 1: Tokenize
    tokens = simple_tokenize(text)
    
    # Step 2: Convert to IDs
    ids = [vocab[token] for token in tokens]
    
    return ids

# Test
text = "It's the last he painted, you know."
ids = encode(text, vocab)
print(f"Text: {text}")
print(f"Token IDs: {ids}")

Output:

Text: It's the last he painted, you know.
Token IDs: [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]

It works! βœ…


πŸ”„ Decoder: Token IDs β†’ Text

def decode(ids, vocab):
    """
    Convert token IDs back to text
    
    Args:
        ids: List of token IDs
        vocab: Dictionary {token: token_id}
    
    Returns:
        Decoded text string
    """
    # Step 1: Create reverse vocabulary (id β†’ token)
    inv_vocab = {idx: token for token, idx in vocab.items()}
    
    # Step 2: Convert IDs to tokens
    tokens = [inv_vocab[idx] for idx in ids]
    
    # Step 3: Join tokens
    text = " ".join(tokens)
    
    # Step 4: Fix spacing before punctuation
    text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
    
    return text

# Test
decoded_text = decode(ids, vocab)
print(f"Decoded: {decoded_text}")

Output:

Decoded: It ' s the last he painted, you know.

Almost perfect! The words and punctuation come back in the right order. The only difference is the spacing around the apostrophe: our simple decoder joins every token with a space and only removes spaces before punctuation like , and . β€” good enough for now! πŸŽ‰


πŸ—οΈ Complete Tokenizer Class

Let’s package everything into a reusable class:

class SimpleTokenizerV1:
    """
    Simple tokenizer with encode and decode methods
    """
    
    def __init__(self, vocab):
        """
        Initialize with vocabulary
        
        Args:
            vocab: Dictionary {token: token_id}
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """Convert text to token IDs"""
        # Tokenize
        tokens = simple_tokenize(text)
        
        # Convert to IDs
        ids = [self.str_to_int[token] for token in tokens]
        
        return ids
    
    def decode(self, ids):
        """Convert token IDs to text"""
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        
        # Join tokens
        text = " ".join(tokens)
        
        # Fix punctuation spacing
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        
        return text

βœ… Test the Tokenizer Class

# Create tokenizer instance
tokenizer = SimpleTokenizerV1(vocab)

# Test encode
text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
print(f"Original: {text}")
print(f"Encoded:  {ids}")

# Test decode
decoded = tokenizer.decode(ids)
print(f"Decoded:  {decoded}")

Output:

Original: It's the last he painted, you know.
Encoded:  [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded:  It ' s the last he painted, you know.

The tokenizer round-trips the text (apart from the extra spaces around the apostrophe)! βœ…


πŸ’‘ Key Takeaway

βœ… Built encoder (text β†’ IDs)
βœ… Built decoder (IDs β†’ text)
βœ… Packaged into reusable class
βœ… Tested and verified!

Special Context Tokens

⚠️ The Unknown Word Problem

Test this:

text = "Hello, do you like tea?"
ids = tokenizer.encode(text)

Output:

KeyError: 'Hello'

ERROR! Why? πŸ€”

β€œHello” is not in our vocabulary! (Book written in 1908, doesn’t use β€œHello”)


🚨 The Problem

Our tokenizer crashes on unknown words!

Training text: "The Verdict" (1908 book)
Vocabulary: Only words from that book

User input: "Hello, do you like pizza?"
        ↓
   CRASH! ❌

This is UNACCEPTABLE for real LLMs!


πŸ’‘ The Solution: Special Tokens

Add special tokens to vocabulary:

  1. <|unk|> - Unknown word token
  2. <|endoftext|> - End of text token

πŸ”§ Special Token #1: Unknown Token

Purpose: Handle words not in vocabulary

How it works:

Input: "Hello, world!"
Vocabulary: {world, !, ,} (no "Hello")

Without <|unk|>: ERROR ❌
With <|unk|>:    [<|unk|>, ,, world, !] βœ…

Unknown words β†’ <|unk|> token
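In code, this fallback is just a dictionary lookup with a default value. A minimal sketch, assuming <|unk|> has already been added to the vocabulary (we do that below):

# Unknown tokens fall back to the <|unk|> ID
unk_id = vocab["<|unk|>"]
print(vocab.get("the", unk_id))    # a real ID - "the" is in the vocabulary
print(vocab.get("Hello", unk_id))  # the <|unk|> ID - "Hello" is not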


πŸ”§ Special Token #2: End of Text

Purpose: Separate different texts during training

Example:

Text Source 1: Harry Potter Chapter 1
Text Source 2: Wikipedia article on AI
Text Source 3: News article

Without <|endoftext|>:

"Harry cast a spell on AI researchers who reported today..."
(Confusing! Mixes sources!)

With <|endoftext|>:

"Harry cast a spell <|endoftext|> AI researchers are <|endoftext|> Today the news..."
(Clear separation!)

πŸ“Š Visual: Multiple Text Sources

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 1 (Book)                    β”‚
β”‚  "Once upon a time, there was a wizard..."β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         <|endoftext|>
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 2 (Wikipedia)               β”‚
β”‚  "Machine learning is a subset of AI..." β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         <|endoftext|>
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 3 (News)                    β”‚
β”‚  "Tech companies announced today..."     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

<|endoftext|> keeps sources separate!
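In practice, stitching sources together is just a string join. A small sketch (the documents here are placeholders):

# Join multiple documents into one training text, separated by <|endoftext|>
documents = [
    "Once upon a time, there was a wizard...",   # book
    "Machine learning is a subset of AI...",     # encyclopedia article
    "Tech companies announced today...",         # news
]
training_text = " <|endoftext|> ".join(documents)
print(training_text)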


πŸ’» Add Special Tokens to Vocabulary

# Start with original vocabulary
all_tokens = sorted(set(preprocessed))

# Add special tokens
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

# Rebuild vocabulary with special tokens
vocab = {token: idx for idx, token in enumerate(all_tokens)}

print(f"New vocabulary size: {len(vocab)}")
print(f"Last 5 entries:")
for token, idx in list(vocab.items())[-5:]:
    print(f"  {token}: {idx}")

Output:

New vocabulary size: 1132
Last 5 entries:
  younger: 1127
  your: 1128
  yourself: 1129
  <|endoftext|>: 1130
  <|unk|>: 1131

Special tokens added! βœ…


πŸ”§ Updated Tokenizer with Special Tokens

class SimpleTokenizerV2:
    """
    Improved tokenizer that handles unknown words
    """
    
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """Encode text, replacing unknown words with <|unk|>"""
        tokens = simple_tokenize(text)
        
        # Replace unknown tokens
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        
        return ids
    
    def decode(self, ids):
        """Decode token IDs to text"""
        tokens = [self.int_to_str[idx] for idx in ids]
        text = " ".join(tokens)
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        return text

Key change: self.str_to_int.get(token, self.str_to_int["<|unk|>"])

This returns <|unk|> ID if token not found!


βœ… Test with Unknown Words

# Create new tokenizer
tokenizer = SimpleTokenizerV2(vocab)

# Test with unknown words
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

# Combine with <|endoftext|>
full_text = text1 + " <|endoftext|> " + text2

# Encode
ids = tokenizer.encode(full_text)
print(f"Text: {full_text}")
print(f"IDs: {ids}")

# Decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Output:

Text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
IDs: [1131, 5, 1095, 1131, 1131, 1131, 10, 1130, 1089, ...]
Decoded: <|unk|>, do you <|unk|> <|unk|>? <|endoftext|> In the sunlit terraces of the <|unk|>.

Notice:

  • β€œHello” β†’ <|unk|> (token ID 1131)
  • β€œtea” β†’ <|unk|> (not in vocabulary)
  • β€œpalace” β†’ <|unk|> (not in vocabulary)
  • <|endoftext|> β†’ Preserved (token ID 1130)

No error! Tokenizer handles unknowns gracefully! βœ…


🎯 Other Special Tokens in Real LLMs

We covered:

  • βœ… <|unk|> - Unknown token
  • βœ… <|endoftext|> - End of text

Others used in research:

Token          Purpose                 Example use
<|BOS|>        Beginning of sequence   Marks where a text starts
<|EOS|>        End of sequence         Marks where a text ends
<|PAD|>        Padding                 Fills shorter texts so a batch has equal lengths

πŸ’‘ Important Note About GPT

GPT models (GPT-2, GPT-3, GPT-4):

βœ… Use: <|endoftext|>
❌ Don’t use: <|unk|>, <|BOS|>, <|EOS|>, <|PAD|>

Why no <|unk|>?

GPT uses Byte Pair Encoding (BPE) which breaks unknown words into subwords!

Example:

Unknown word: "unbelievable"
BPE: ["un", "believ", "able"]
(All subwords are in vocabulary!)
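To build some intuition, here's a toy greedy longest-match splitter over a hand-picked subword list. This is NOT the real BPE algorithm (BPE learns its merges from data β€” next chapter!), just a sketch of how a fixed subword vocabulary can cover words it has never seen whole:

# Toy subword splitting - illustrative only, not real BPE
subwords = {"un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"}  # made-up mini-vocabulary

def greedy_subword_split(word, subwords):
    """Split word into the longest known subwords, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # worst case: fall back to a single character
            i += 1
    return pieces

print(greedy_subword_split("unbelievable", subwords))  # ['un', 'believ', 'able']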

Next chapter: We’ll learn BPE in detail!


Complete Tokenizer Class

πŸ—οΈ Final Production-Ready Tokenizer

import re

class SimpleTokenizerV2:
    """
    Complete tokenizer with:
    - Encoding and decoding
    - Unknown word handling
    - Special token support
    """
    
    def __init__(self, vocab):
        """
        Initialize tokenizer with vocabulary
        
        Args:
            vocab: Dictionary mapping tokens to IDs
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """
        Encode text to token IDs
        
        Args:
            text: Input string
            
        Returns:
            List of token IDs
        """
        # Tokenize text
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [item for item in tokens if item.strip()]
        
        # Convert to IDs (use <|unk|> for unknown tokens)
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        
        return ids
    
    def decode(self, ids):
        """
        Decode token IDs to text
        
        Args:
            ids: List of token IDs
            
        Returns:
            Decoded text string
        """
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        
        # Join tokens
        text = " ".join(tokens)
        
        # Fix spacing before punctuation
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        
        return text

πŸ“¦ How to Use

# 1. Build vocabulary from training data
with open("the_verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]

# Create vocabulary
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}

# 2. Create tokenizer
tokenizer = SimpleTokenizerV2(vocab)

# 3. Use it!
text = "Hello, world! <|endoftext|> This is a test."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")

Testing Our Tokenizer

πŸ§ͺ Test Case 1: Simple Sentence

text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")
print(f"Match: {text == decoded}")

Output:

Original: It's the last he painted, you know.
Encoded:  [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded:  It ' s the last he painted, you know.
Match: False

βœ… Almost! Only the spacing around the apostrophe differs β€” our simple decoder joins every token with a space.


πŸ§ͺ Test Case 2: Unknown Words

text = "Hello, do you like pizza?"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")

Output:

Original: Hello, do you like pizza?
Encoded:  [1131, 5, 1095, 1131, 1131, 1131, 10]
Decoded:  <|unk|>, do you <|unk|> <|unk|>?

βœ… Handles unknowns gracefully!
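A small debugging helper (hypothetical β€” not part of the tokenizer class) makes it easy to see which words fell back to <|unk|>:

def find_unknowns(text, vocab):
    """Return the tokens in text that are missing from the vocabulary."""
    return [tok for tok in simple_tokenize(text) if tok not in vocab]

print(find_unknowns("Hello, do you like pizza?", vocab))  # exact result depends on your vocabulary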


πŸ§ͺ Test Case 3: Multiple Texts

text1 = "Hello, world!"
text2 = "This is a test."
combined = text1 + " <|endoftext|> " + text2

ids = tokenizer.encode(combined)
decoded = tokenizer.decode(ids)

print(f"Original: {combined}")
print(f"Decoded:  {decoded}")

Output:

Original: Hello, world! <|endoftext|> This is a test.
Decoded:  <|unk|>, <|unk|>! <|endoftext|> This is a test.

βœ… Preserves <|endoftext|> separator!


πŸ§ͺ Test Case 4: Edge Cases

# Empty string
print(tokenizer.encode(""))  # []

# Only punctuation
print(tokenizer.encode("!?,."))  # [0, 10, 5, 7]

# Only special tokens
print(tokenizer.encode("<|endoftext|>"))  # [1130]

βœ… All edge cases handled!


Why GPT Uses Byte Pair Encoding

πŸ€” Limitations of Our Tokenizer

Our tokenizer:

Word = 1 token

"unbelievable" β†’ 1 token
"supercalifragilisticexpialidocious" β†’ 1 token

Problems:

  1. Huge vocabulary needed

    • Every word = separate token
    • English has 170,000+ words
    • New words added daily!
  2. Unknown words

    • Can’t handle β€œblockchain”, β€œselfie”, β€œemoji”
    • Replaces with <|unk|> (loses information!)
  3. Memory waste

    • Rare words take space
    • β€œantidisestablishmentarianism” only appears once!

πŸ’‘ The BPE Solution

Byte Pair Encoding (BPE):

Break words into subword units!

Example:

Word: "unbelievable"

Our tokenizer: ["unbelievable"] (1 token)
BPE tokenizer:  ["un", "believ", "able"] (3 tokens)

Benefits:

  1. Smaller vocabulary

    • Reuse subwords across words
    • β€œun” appears in: unbelievable, unhappy, undo, etc.
  2. No unknown words!

    • Any word can be broken into subwords
    • Worst case: break into characters
  3. Efficient

    • Common words = 1 token
    • Rare words = multiple tokens

πŸ“Š Comparison

Tokenizer             β€œrunning”          β€œunbelievable”            Vocab Size
Our (word-level)      ["running"]        ["unbelievable"]          ~170,000
BPE (subword-level)   ["run", "ning"]    ["un", "believ", "able"]  ~50,000

BPE vocabulary is 3x smaller!


🎯 GPT Tokenization

GPT-2/GPT-3/GPT-4 all use BPE!

Benefits:

  • βœ… Vocabulary: ~50,000 tokens
  • βœ… No <|unk|> needed
  • βœ… Handles all languages
  • βœ… Efficient encoding

Example (a quick peek with OpenAI’s tiktoken library, which we cover properly next chapter β€” the exact split may differ):

import tiktoken

enc = tiktoken.get_encoding("gpt2")    # the BPE vocabulary used by GPT-2/GPT-3
ids = enc.encode("unbelievable")
print([enc.decode([i]) for i in ids])  # a handful of subword pieces, along the lines of ["un", "believ", "able"]

πŸ“š Next Chapter Preview

Chapter 8: Byte Pair Encoding (BPE)

We’ll learn:

  • How BPE works
  • Building BPE tokenizer from scratch
  • Using GPT’s tokenizer
  • Comparing tokenization methods

Get ready for more coding! πŸš€


Chapter Summary

πŸŽ‰ What We Accomplished Today

This was a HUGE chapter! Let’s recap:


1. Understanding Tokenization

Tokenization = Breaking text into tokens

Why? Neural networks need numbers, not text!

Process:
Text β†’ Tokens β†’ Token IDs β†’ Embeddings β†’ LLM

2. Built Tokenizer from Scratch

Step 1: Text β†’ Tokens

import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]

Step 2: Tokens β†’ Token IDs

all_tokens = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(all_tokens)}

3. Implemented Encoder & Decoder

Encoder:

def encode(text):
    tokens = tokenize(text)
    ids = [vocab[token] for token in tokens]
    return ids

Decoder:

def decode(ids):
    tokens = [inv_vocab[idx] for idx in ids]
    text = " ".join(tokens)
    return text

4. Added Special Tokens

Two key tokens:

  • <|unk|> - Unknown words
  • <|endoftext|> - Text separator

Why they matter:

  • Handle unknown words gracefully
  • Separate different text sources
  • Used in real GPT models!

5. Complete Tokenizer Class

class SimpleTokenizerV2:
    def __init__(self, vocab):
        ...  # store vocab and its inverse mapping

    def encode(self, text):
        ...  # Text β†’ Token IDs (unknown words β†’ <|unk|>)

    def decode(self, ids):
        ...

Production-ready! βœ…


6. Learned About BPE

Word-level tokenization (our approach):

  • 1 word = 1 token
  • Large vocabulary
  • Unknown word problem

Subword tokenization (BPE - GPT’s approach):

  • 1 word = multiple subword tokens
  • Smaller vocabulary
  • No unknown words!

Next chapter: Build BPE tokenizer!


πŸ“Š Key Statistics

Our dataset:

  • Book: β€œThe Verdict” (1908)
  • Characters: 20,479
  • Tokens: 4,690
  • Vocabulary: 1,132 (including special tokens)

Real LLMs:

  • GPT-3 vocabulary: ~50,000 tokens
  • Trained on: 300 billion tokens
  • Uses: Byte Pair Encoding (BPE)

πŸ’‘ Key Takeaways

  1. Tokenization is essential for LLM data preprocessing
  2. Vocabulary maps tokens to IDs
  3. Encoder converts text to IDs
  4. Decoder converts IDs back to text
  5. Special tokens handle edge cases
  6. BPE is better than word-level tokenization
  7. GPT uses BPE, not our simple method

🎯 What We Learned (Checklist)

  • [x] What is tokenization
  • [x] Why tokenization matters
  • [x] Breaking text into tokens (regex)
  • [x] Building vocabulary
  • [x] Converting tokens to IDs
  • [x] Implementing encoder
  • [x] Implementing decoder
  • [x] Handling unknown words
  • [x] Special tokens (<|unk|>, <|endoftext|>)
  • [x] Complete tokenizer class
  • [x] Testing and edge cases
  • [x] Why GPT uses BPE

πŸ”œ Next Chapter: Chapter 8

Topic: Byte Pair Encoding (BPE) Tokenization

What we’ll learn:

  • How BPE algorithm works
  • Building BPE tokenizer from scratch
  • Using OpenAI’s tiktoken library
  • Comparing tokenization methods
  • Vocabulary size optimization
  • Handling multiple languages

Hands-on coding continues! πŸ’»


πŸ“ Practice Exercise

Try this before next chapter:

  1. Download a different book (Project Gutenberg)
  2. Build vocabulary from that book
  3. Create tokenizer instance
  4. Encode some sentences
  5. Decode them back
  6. Count vocabulary size

Share your results in comments! πŸ’¬
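To get you started, here's a rough starter sketch (the file name is a placeholder β€” use whichever Project Gutenberg book you downloaded):

import re

# 1. Load your own book (placeholder file name)
with open("my_book.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# 2. Tokenize and build the vocabulary (same recipe as this chapter)
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(f"Vocabulary size: {len(vocab)}")

# 3. Encode and decode a sentence of your choice
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode("Replace this with a sentence from your book.")
print(ids)
print(tokenizer.decode(ids))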


πŸš€ Take Action Now!

  1. πŸ’» Run the Code - Download notebook and execute every cell
  2. πŸ§ͺ Experiment - Try different texts, books, datasets
  3. πŸ“ Practice - Build tokenizer for your own data
  4. ❓ Ask Questions - Comment if anything is unclear
  5. πŸ”– Bookmark - Reference material for future
  6. ⏭️ Get Ready - Next chapter: Byte Pair Encoding!

Quick Reference

Key Functions:

# Tokenization
import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]

# Vocabulary
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encoding
ids = [vocab[token] for token in tokens]

# Decoding
inv_vocab = {idx: token for token, idx in vocab.items()}
tokens = [inv_vocab[idx] for idx in ids]
text = " ".join(tokens)

Special Tokens:

Token           Purpose                 Usage
<|unk|>         Unknown word            Fallback ID for tokens not in the vocabulary
<|endoftext|>   End of text             Separates independent text sources (used by GPT)
<|BOS|>         Beginning of sequence   Marks the start of a text (not used by GPT)
<|EOS|>         End of sequence         Marks the end of a text (not used by GPT)
<|PAD|>         Padding                 Pads shorter texts in a batch (not used by GPT)

Tokenizer Class Template:

class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}
    
    def encode(self, text):
        # Tokenize and convert to IDs
        pass
    
    def decode(self, ids):
        # Convert IDs to text
        pass

Thank You!

You’ve completed Chapter 7 - Tokenization! πŸŽ‰

You now know:

  • βœ… How tokenization works
  • βœ… How to build encoder/decoder
  • βœ… How to handle special tokens
  • βœ… Why BPE is superior
  • βœ… Real implementation in Python

Next chapter: Byte Pair Encoding (BPE)

This is just the beginning! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Which part was most interesting?
  • What was challenging?
  • What would you like more detail on?
  • Share your tokenizer experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 8: Byte Pair Encoding (BPE)
Chapter 9: Vector Embeddings
Chapter 10: Positional Encoding
Chapter 11: Self-Attention Mechanism

Stay tuned! The coding adventure continues! πŸ’»πŸ”₯


See you in Chapter 8 where we build BPE tokenizer! πŸš€


Questions? Stuck somewhere? Drop a comment below! We’re here to help. πŸ’ͺ