Chapter 7: Tokenization Explained - Building Your First Tokenizer From Scratch


πŸ“– Reading Time: 60 minutes
πŸ’» Coding Time: 90 minutes

Welcome to Chapter 7! THIS IS WHERE WE START CODING! πŸš€

What we’ve learned so far:

  • Chapters 1-6: Theory, architecture, roadmap

Today:

  • Build a complete tokenizer from scratch
  • Python implementation
  • Encoder + Decoder
  • Handle special tokens
  • Real code you can run!

Grab your laptop and let’s code! πŸ’»


πŸ“‘ Table of Contents

  • What is Tokenization?
  • Why Tokenization Matters
  • The 3 Steps of Data Preparation
  • Step 1: Breaking Text Into Tokens
  • Step 2: Converting Tokens to Token IDs
  • Building the Vocabulary
  • Implementing Encoder and Decoder
  • Special Context Tokens
  • Complete Tokenizer Class
  • Testing Our Tokenizer
  • Why GPT Uses Byte Pair Encoding
  • Chapter Summary

What is Tokenization?

πŸ”€ The Simplest Definition

Tokenization = Breaking text into smaller units (tokens)

At its basic form:

Sentence: "The cat sat on the mat"
Tokens:   ["The", "cat", "sat", "on", "the", "mat"]

But it’s MORE than just splitting by spaces!


πŸ€” Why Can’t We Just Use Words?

Good question! Here’s why simple word splitting doesn’t work:

Example 1: Punctuation

Sentence: "Hello, world!"
Wrong:    ["Hello,", "world!"]  ❌
Correct:  ["Hello", ",", "world", "!"]  βœ…

Example 2: Contractions

Sentence: "I don't know"
Wrong:    ["I", "don't", "know"]  ❌
Correct:  ["I", "don", "'", "t", "know"]  βœ…

Example 3: Special Characters

Sentence: "It's 2024--amazing!"
Wrong:    ["It's", "2024--amazing!"]  ❌
Correct:  ["It", "'", "s", "2024", "--", "amazing", "!"]  βœ…

Tokenization handles ALL these cases properly!


πŸ’‘ Key Insight

Tokenization is the FIRST step in data preprocessing for LLMs.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      LLM DATA PREPARATION PIPELINE       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                          β”‚
β”‚  1. TOKENIZATION (Today!)                β”‚
β”‚     Text β†’ Tokens                        β”‚
β”‚                                          β”‚
β”‚  2. TOKEN IDs (Today!)                   β”‚
β”‚     Tokens β†’ Numbers                     β”‚
β”‚                                          β”‚
β”‚  3. VECTOR EMBEDDINGS (Next chapter)     β”‚
β”‚     Numbers β†’ Vectors                    β”‚
β”‚                                          β”‚
β”‚  4. TRAINING                             β”‚
β”‚     Vectors β†’ Trained LLM                β”‚
β”‚                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Today: Steps 1 & 2!


Why Tokenization Matters

🎯 The Core Problem

Neural networks work with NUMBERS, not TEXT!

What we have:

"The cat sat on the mat"

What LLMs need:

[101, 202, 303, 404, 101, 505]

Tokenization bridges this gap!


πŸ“Š Visual Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                             β”‚
β”‚  INPUT TEXT                                 β”‚
β”‚  "The cat sat on the mat"                  β”‚
β”‚                                             β”‚
β”‚           ↓ TOKENIZATION                    β”‚
β”‚                                             β”‚
β”‚  TOKENS                                     β”‚
β”‚  ["The", "cat", "sat", "on", "the", "mat"]β”‚
β”‚                                             β”‚
β”‚           ↓ TOKEN IDs                       β”‚
β”‚                                             β”‚
β”‚  TOKEN IDs                                  β”‚
β”‚  [101, 202, 303, 404, 101, 505]           β”‚
β”‚                                             β”‚
β”‚           ↓ EMBEDDINGS (Next chapter)       β”‚
β”‚                                             β”‚
β”‚  VECTORS                                    β”‚
β”‚  [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]  β”‚
β”‚                                             β”‚
β”‚           ↓ TRAINING                        β”‚
β”‚                                             β”‚
β”‚  TRAINED LLM                                β”‚
β”‚                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’° Real-World Impact

Good tokenization:

  • βœ… Reduces vocabulary size
  • βœ… Handles unknown words
  • βœ… Saves memory
  • βœ… Faster training
  • βœ… Better performance

Poor tokenization:

  • ❌ Huge vocabulary
  • ❌ Can’t handle new words
  • ❌ Wastes memory
  • ❌ Slower training
  • ❌ Worse performance

Tokenization is CRITICAL!


The 3 Steps of Data Preparation

πŸ“‹ Complete Overview

Remember from Chapter 6’s roadmap:

STAGE 1: Building Blocks
β”œβ”€β”€ Data Preparation ← WE ARE HERE!
β”‚   β”œβ”€β”€ Step 1: Tokenization
β”‚   β”œβ”€β”€ Step 2: Token IDs
β”‚   └── Step 3: Vector Embeddings
β”œβ”€β”€ Attention Mechanisms
└── LLM Architecture

🎯 Today’s Focus

Step 1: Split text into tokens

Input:  "This is an example"
Output: ["This", "is", "an", "example"]

Step 2: Convert tokens to token IDs

Input:  ["This", "is", "an", "example"]
Output: [45, 12, 89, 234]

Step 3: (Next chapter!)

Input:  [45, 12, 89, 234]
Output: [[0.2, 0.8, ...], [0.5, 0.3, ...], ...]
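Just to preview Step 3: conceptually, the ID-to-vector step is a lookup table that maps each token ID to a learned vector. A minimal sketch with made-up numbers (the real, trainable version comes next chapter):

# Conceptual preview: each token ID looks up a (learned) vector
embedding_table = {
    45:  [0.2, 0.8, -0.1, 0.5],
    12:  [0.5, 0.3,  0.9, -0.2],
    89:  [-0.4, 0.1, 0.7, 0.0],
    234: [0.6, -0.5, 0.2, 0.3],
}

token_ids = [45, 12, 89, 234]
vectors = [embedding_table[i] for i in token_ids]
print(vectors)  # one small vector per token ID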

Step 1: Breaking Text Into Tokens

πŸ› οΈ Setup: Loading Our Dataset

We’ll use a real book for practice!

Book: β€œThe Verdict” by Edith Wharton (1908)
Why? Free to download, perfect size for learning
Size: ~20,000 characters


πŸ’» Python Code: Loading the Book

# Load the text file
with open("the_verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()

# Check what we loaded
print(f"Total characters: {len(raw_text)}")
print(f"First 100 characters:\n{raw_text[:100]}")

Output:

Total characters: 20479
First 100 characters:
I had always thought Jack Gisburn rather a cheap genius--though
a good fellow enough--so it was no...

Success! We’ve loaded the text! βœ…


πŸ”§ Simple Tokenization Attempt

Let’s try the simplest approach: split by spaces

# Split by spaces
text = "Hello, world! This is an example."
tokens = text.split(" ")
print(tokens)

Output:

['Hello,', 'world!', 'This', 'is', 'an', 'example.']

Problem: Punctuation is stuck to words! ❌
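There's a second problem with split(" "): it only splits on single spaces, so repeated spaces and newlines make a mess. A quick check you can run:

# split(" ") stumbles on multiple spaces and newlines
text = "Hello,   world!\nNew line"
print(text.split(" "))

Output:

['Hello,', '', '', 'world!\nNew line']

Empty strings appear for the extra spaces, and the newline stays glued to the words. That's another reason to reach for regular expressions next.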


🎯 Better Approach: Regular Expressions

Use Python’s re library for smart splitting!

import re

text = "Hello, world! This is an example."

# Split on whitespace
result = re.split(r'(\s)', text)
print(result)

Output:

['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'an', ' ', 'example.']

Better! But there are still issues: the whitespace characters are kept as tokens, and punctuation is still stuck to the words.


πŸš€ Advanced Tokenization

Split on whitespace AND punctuation!

import re

text = "Hello, world! This is--an example."

# Split on whitespace AND punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)

Output:

['Hello', ',', '', ' ', 'world', '!', '', ' ', 'This', ' ', 'is', '--', 
 'an', ' ', 'example', '.', '']

Much better! Punctuation is now separate! (The empty strings show up wherever two separators sit next to each other; the next step removes them along with the whitespace.) βœ…


🧹 Remove Whitespaces

We don’t need spaces as separate tokens:

import re

text = "Hello, world! This is--an example."

# Split on whitespace and punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Remove whitespaces
tokens = [item for item in result if item.strip()]
print(tokens)

Output:

['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']

Perfect! Clean tokens without whitespaces! βœ…


πŸ“ Complete Tokenization Function

import re

def simple_tokenize(text):
    """
    Tokenize text by splitting on:
    - Whitespace
    - Punctuation: , . : ; ? _ ! " ' ( ) --
    
    Returns list of tokens (no whitespaces)
    """
    # Split on whitespace and punctuation
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    
    # Remove whitespace tokens
    tokens = [item for item in result if item.strip()]
    
    return tokens

# Test
text = "Hello, world! This is--an example."
tokens = simple_tokenize(text)
print(tokens)

Output:

['Hello', ',', 'world', '!', 'This', 'is', '--', 'an', 'example', '.']

🎯 Apply to Full Book

# Tokenize the entire book!
preprocessed = simple_tokenize(raw_text)

print(f"Total tokens: {len(preprocessed)}")
print(f"First 30 tokens: {preprocessed[:30]}")

Output:

Total tokens: 4690
First 30 tokens: ['I', 'had', 'always', 'thought', 'Jack', 'Gisburn', 
'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 
'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 
'to', 'hear', 'that', ',', 'in']

Amazing! 20,479 characters β†’ 4,690 tokens! πŸŽ‰
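If you're curious which tokens dominate the book, a quick frequency count is a nice optional check (using the preprocessed list from above):

from collections import Counter

token_counts = Counter(preprocessed)
print(token_counts.most_common(10))  # the 10 most frequent tokens (likely punctuation and common short words)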


πŸ’‘ Key Takeaway: Step 1

βœ… We can break text into tokens
βœ… Handle punctuation properly
βœ… Remove unnecessary whitespaces
βœ… Tokenized entire book (4,690 tokens)

Next: Convert tokens to numbers!


Step 2: Converting Tokens to Token IDs

πŸ€” Why Token IDs?

Problem: Computers can’t understand words like β€œcat” or β€œdog”

Solution: Convert each token to a unique number (Token ID)!

Token:    "cat"  β†’  Token ID: 42
Token:    "dog"  β†’  Token ID: 87
Token:    "fox"  β†’  Token ID: 103

Now computers can process them! βœ…


πŸ“š Building a Vocabulary

Vocabulary = Mapping from tokens to token IDs

Process:

  1. Get unique tokens (no duplicates)
  2. Sort alphabetically (for consistency)
  3. Assign sequential IDs (0, 1, 2, 3…)

🎯 Example: Small Dataset

Training text:

"The quick brown fox jumps over the lazy dog"

Step 1: Tokenize

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Step 2: Get unique tokens (sorted)

["The", "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]

Note: β€œThe” and β€œthe” are different tokens! (Case-sensitive)
For simplicity, we’ll keep them separate. (Sorting is by character code, so capitalized β€œThe” comes first.)

Step 3: Assign Token IDs

Token  Token ID
The    0
brown  1
dog    2
fox    3
jumps  4
lazy   5
over   6
quick  7
the    8

This is our VOCABULARY! πŸ“–
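Here's that toy example in code, so you can see the case-sensitivity (The vs the) for yourself β€” a tiny standalone sketch:

# Build the toy vocabulary from the example sentence
tokens = "The quick brown fox jumps over the lazy dog".split()
unique_tokens = sorted(set(tokens))
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}
print(toy_vocab)

Output:

{'The': 0, 'brown': 1, 'dog': 2, 'fox': 3, 'jumps': 4, 'lazy': 5, 'over': 6, 'quick': 7, 'the': 8}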


πŸ’» Python Code: Build Vocabulary

# Get unique tokens and sort alphabetically
all_words = sorted(set(preprocessed))

# Create vocabulary: token β†’ token_id
vocab = {token: idx for idx, token in enumerate(all_words)}

# Check vocabulary size
print(f"Vocabulary size: {len(vocab)}")

Output:

Vocabulary size: 1130

We have 1,130 unique tokens in β€œThe Verdict”!


πŸ” Inspect the Vocabulary

# Show first 50 entries
for i, (token, token_id) in enumerate(vocab.items()):
    print(f"{token}: {token_id}")
    if i >= 49:
        break

Output:

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10
A: 11
Ah: 12
Among: 13
And: 14
Are: 15
Art: 16
As: 17
At: 18
...

Notice:

  • Punctuation gets IDs 0-10
  • Words are sorted by character code (capitalized words come before lowercase ones)
  • Each unique token has a unique ID

🎯 Simplified Code

# One-line vocabulary creation!
all_words = sorted(set(preprocessed))
vocab = {token: idx for idx, token in enumerate(all_words)}

print(f"Vocabulary size: {len(vocab)}")

That’s it! Two lines to build a vocabulary! ✨


πŸ’‘ Key Takeaway: Step 2

βœ… Built vocabulary (1,130 unique tokens)
βœ… Each token has unique ID
βœ… Alphabetically sorted for consistency
βœ… Ready to encode text!

Building the Vocabulary

πŸ“– What is a Vocabulary?

Vocabulary = A dictionary mapping tokens to token IDs

Think of it like a real dictionary:

Dictionary: "apple" β†’ "a fruit that grows on trees"
Vocabulary: "apple" β†’ 42 (token ID)

πŸ”§ Why Alphabetical Order?

Three reasons:

  1. Consistency: Same text always gives same IDs
  2. Reproducibility: Anyone can rebuild same vocabulary
  3. Debugging: Easy to find tokens

Example without sorting:

# Without sorting - unpredictable order! ❌
tokens = set(["cat", "dog", "fox"])
vocab = {token: idx for idx, token in enumerate(tokens)}
# IDs depend on set iteration order (not guaranteed to be stable!)

Example with sorting:

# With sorting - CONSISTENT! βœ…
tokens = sorted(["cat", "dog", "fox"])
vocab = {token: idx for idx, token in enumerate(tokens)}
# Result: {"cat": 0, "dog": 1, "fox": 2} (always same!)

πŸ“Š Vocabulary Statistics

print(f"Total tokens in text: {len(preprocessed)}")
print(f"Unique tokens (vocab size): {len(vocab)}")
print(f"Compression ratio: {len(preprocessed) / len(vocab):.2f}x")

Output:

Total tokens in text: 4690
Unique tokens (vocab size): 1130
Compression ratio: 4.15x

Interpretation:

  • 4,690 total tokens
  • Only 1,130 are unique
  • Average token appears 4.15 times

🎯 Vocabulary is Your LLM’s Dictionary

Small vocabulary:

  • βœ… Faster processing
  • βœ… Less memory
  • ❌ May miss words

Large vocabulary:

  • βœ… Handles more words
  • ❌ Slower processing
  • ❌ More memory

GPT-3 vocabulary: ~50,000 tokens
Our toy example: 1,130 tokens


Implementing Encoder and Decoder

πŸ”„ The Two-Way Process

ENCODER: Text β†’ Token IDs
DECODER: Token IDs β†’ Text

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                         β”‚
β”‚  TEXT: "Hello, world!"                  β”‚
β”‚         ↓                               β”‚
β”‚         ENCODER                         β”‚
β”‚         ↓                               β”‚
β”‚  TOKEN IDs: [523, 5, 892, 10]          β”‚
β”‚         ↓                               β”‚
β”‚         DECODER                         β”‚
β”‚         ↓                               β”‚
β”‚  TEXT: "Hello, world!"                  β”‚
β”‚                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why do we need both?

  • Encoder: Prepare data for training
  • Decoder: Convert LLM output back to readable text

🎯 Encoder: Text β†’ Token IDs

def encode(text, vocab):
    """
    Convert text to token IDs
    
    Args:
        text: String to encode
        vocab: Dictionary {token: token_id}
    
    Returns:
        List of token IDs
    """
    # Step 1: Tokenize
    tokens = simple_tokenize(text)
    
    # Step 2: Convert to IDs
    ids = [vocab[token] for token in tokens]
    
    return ids

# Test
text = "It's the last he painted, you know."
ids = encode(text, vocab)
print(f"Text: {text}")
print(f"Token IDs: {ids}")

Output:

Text: It's the last he painted, you know.
Token IDs: [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]

It works! βœ…


πŸ”„ Decoder: Token IDs β†’ Text

def decode(ids, vocab):
    """
    Convert token IDs back to text
    
    Args:
        ids: List of token IDs
        vocab: Dictionary {token: token_id}
    
    Returns:
        Decoded text string
    """
    # Step 1: Create reverse vocabulary (id β†’ token)
    inv_vocab = {idx: token for token, idx in vocab.items()}
    
    # Step 2: Convert IDs to tokens
    tokens = [inv_vocab[idx] for idx in ids]
    
    # Step 3: Join tokens
    text = " ".join(tokens)
    
    # Step 4: Fix spacing before punctuation
    text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
    
    return text

# Test
decoded_text = decode(ids, vocab)
print(f"Decoded: {decoded_text}")

Output:

Decoded: It ' s the last he painted, you know.

Almost perfect! The words and punctuation come back in the right order. The only difference is the spacing around the apostrophe: our simple decoder joins every token with a space and only removes spaces before punctuation like , and . β€” good enough for now! πŸŽ‰


πŸ—οΈ Complete Tokenizer Class

Let’s package everything into a reusable class:

class SimpleTokenizerV1:
    """
    Simple tokenizer with encode and decode methods
    """
    
    def __init__(self, vocab):
        """
        Initialize with vocabulary
        
        Args:
            vocab: Dictionary {token: token_id}
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """Convert text to token IDs"""
        # Tokenize
        tokens = simple_tokenize(text)
        
        # Convert to IDs
        ids = [self.str_to_int[token] for token in tokens]
        
        return ids
    
    def decode(self, ids):
        """Convert token IDs to text"""
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        
        # Join tokens
        text = " ".join(tokens)
        
        # Fix punctuation spacing
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        
        return text

βœ… Test the Tokenizer Class

# Create tokenizer instance
tokenizer = SimpleTokenizerV1(vocab)

# Test encode
text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
print(f"Original: {text}")
print(f"Encoded:  {ids}")

# Test decode
decoded = tokenizer.decode(ids)
print(f"Decoded:  {decoded}")

Output:

Original: It's the last he painted, you know.
Encoded:  [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded:  It ' s the last he painted, you know.

The tokenizer round-trips the text (apart from the extra spaces around the apostrophe)! βœ…


πŸ’‘ Key Takeaway

βœ… Built encoder (text β†’ IDs)
βœ… Built decoder (IDs β†’ text)
βœ… Packaged into reusable class
βœ… Tested and verified!

Special Context Tokens

⚠️ The Unknown Word Problem

Test this:

text = "Hello, do you like tea?"
ids = tokenizer.encode(text)

Output:

KeyError: 'Hello'

ERROR! Why? πŸ€”

β€œHello” is not in our vocabulary! (Book written in 1908, doesn’t use β€œHello”)


🚨 The Problem

Our tokenizer crashes on unknown words!

Training text: "The Verdict" (1908 book)
Vocabulary: Only words from that book

User input: "Hello, do you like pizza?"
        ↓
   CRASH! ❌

This is UNACCEPTABLE for real LLMs!


πŸ’‘ The Solution: Special Tokens

Add special tokens to vocabulary:

  1. <|unk|> - Unknown word token
  2. <|endoftext|> - End of text token

πŸ”§ Special Token #1: Unknown Token

Purpose: Handle words not in vocabulary

How it works:

Input: "Hello, world!"
Vocabulary: {world, !, ,} (no "Hello")

Without <|unk|>: ERROR ❌
With <|unk|>:    [<|unk|>, ,, world, !] βœ…

Unknown words β†’ <|unk|> token
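In code, this fallback is just a dictionary lookup with a default value. A minimal sketch, assuming <|unk|> has already been added to the vocabulary (we do that below):

# Unknown tokens fall back to the <|unk|> ID
unk_id = vocab["<|unk|>"]
print(vocab.get("the", unk_id))    # a real ID - "the" is in the vocabulary
print(vocab.get("Hello", unk_id))  # the <|unk|> ID - "Hello" is not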


πŸ”§ Special Token #2: End of Text

Purpose: Separate different texts during training

Example:

Text Source 1: Harry Potter Chapter 1
Text Source 2: Wikipedia article on AI
Text Source 3: News article

Without <|endoftext|>:

"Harry cast a spell on AI researchers who reported today..."
(Confusing! Mixes sources!)

With <|endoftext|>:

"Harry cast a spell <|endoftext|> AI researchers are <|endoftext|> Today the news..."
(Clear separation!)

πŸ“Š Visual: Multiple Text Sources

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 1 (Book)                    β”‚
β”‚  "Once upon a time, there was a wizard..."β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         <|endoftext|>
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 2 (Wikipedia)               β”‚
β”‚  "Machine learning is a subset of AI..." β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         <|endoftext|>
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  TEXT SOURCE 3 (News)                    β”‚
β”‚  "Tech companies announced today..."     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

<|endoftext|> keeps sources separate!
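In practice, stitching sources together is just a string join. A small sketch (the documents here are placeholders):

# Join multiple documents into one training text, separated by <|endoftext|>
documents = [
    "Once upon a time, there was a wizard...",   # book
    "Machine learning is a subset of AI...",     # encyclopedia article
    "Tech companies announced today...",         # news
]
training_text = " <|endoftext|> ".join(documents)
print(training_text)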


πŸ’» Add Special Tokens to Vocabulary

# Start with original vocabulary
all_tokens = sorted(set(preprocessed))

# Add special tokens
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

# Rebuild vocabulary with special tokens
vocab = {token: idx for idx, token in enumerate(all_tokens)}

print(f"New vocabulary size: {len(vocab)}")
print(f"Last 5 entries:")
for token, idx in list(vocab.items())[-5:]:
    print(f"  {token}: {idx}")

Output:

New vocabulary size: 1132
Last 5 entries:
  younger: 1127
  your: 1128
  yourself: 1129
  <|endoftext|>: 1130
  <|unk|>: 1131

Special tokens added! βœ…


πŸ”§ Updated Tokenizer with Special Tokens

class SimpleTokenizerV2:
    """
    Improved tokenizer that handles unknown words
    """
    
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """Encode text, replacing unknown words with <|unk|>"""
        tokens = simple_tokenize(text)
        
        # Replace unknown tokens
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        
        return ids
    
    def decode(self, ids):
        """Decode token IDs to text"""
        tokens = [self.int_to_str[idx] for idx in ids]
        text = " ".join(tokens)
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        return text

Key change: self.str_to_int.get(token, self.str_to_int["<|unk|>"])

This returns <|unk|> ID if token not found!


βœ… Test with Unknown Words

# Create new tokenizer
tokenizer = SimpleTokenizerV2(vocab)

# Test with unknown words
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

# Combine with <|endoftext|>
full_text = text1 + " <|endoftext|> " + text2

# Encode
ids = tokenizer.encode(full_text)
print(f"Text: {full_text}")
print(f"IDs: {ids}")

# Decode
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Output:

Text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
IDs: [1131, 5, 1095, 1131, 1131, 1131, 10, 1130, 1089, ...]
Decoded: <|unk|>, do you <|unk|> <|unk|>? <|endoftext|> In the sunlit terraces of the <|unk|>.

Notice:

  • β€œHello” β†’ <|unk|> (token ID 1131)
  • β€œtea” β†’ <|unk|> (not in vocabulary)
  • β€œpalace” β†’ <|unk|> (not in vocabulary)
  • <|endoftext|> β†’ Preserved (token ID 1130)

No error! Tokenizer handles unknowns gracefully! βœ…


🎯 Other Special Tokens in Real LLMs

We covered:

  • βœ… <|unk|> - Unknown token
  • βœ… <|endoftext|> - End of text

Others used in research:

Token          Purpose                 Example use
<|BOS|>        Beginning of sequence   Marks where a text starts
<|EOS|>        End of sequence         Marks where a text ends
<|PAD|>        Padding                 Fills shorter texts so a batch has equal lengths

πŸ’‘ Important Note About GPT

GPT models (GPT-2, GPT-3, GPT-4):

βœ… Use: <|endoftext|>
❌ Don’t use: <|unk|>, <|BOS|>, <|EOS|>, <|PAD|>

Why no <|unk|>?

GPT uses Byte Pair Encoding (BPE) which breaks unknown words into subwords!

Example:

Unknown word: "unbelievable"
BPE: ["un", "believ", "able"]
(All subwords are in vocabulary!)
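To build some intuition, here's a toy greedy longest-match splitter over a hand-picked subword list. This is NOT the real BPE algorithm (BPE learns its merges from data β€” next chapter!), just a sketch of how a fixed subword vocabulary can cover words it has never seen whole:

# Toy subword splitting - illustrative only, not real BPE
subwords = {"un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"}  # made-up mini-vocabulary

def greedy_subword_split(word, subwords):
    """Split word into the longest known subwords, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # worst case: fall back to a single character
            i += 1
    return pieces

print(greedy_subword_split("unbelievable", subwords))  # ['un', 'believ', 'able']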

Next chapter: We’ll learn BPE in detail!


Complete Tokenizer Class

πŸ—οΈ Final Production-Ready Tokenizer

import re

class SimpleTokenizerV2:
    """
    Complete tokenizer with:
    - Encoding and decoding
    - Unknown word handling
    - Special token support
    """
    
    def __init__(self, vocab):
        """
        Initialize tokenizer with vocabulary
        
        Args:
            vocab: Dictionary mapping tokens to IDs
        """
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
    
    def encode(self, text):
        """
        Encode text to token IDs
        
        Args:
            text: Input string
            
        Returns:
            List of token IDs
        """
        # Tokenize text
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [item for item in tokens if item.strip()]
        
        # Convert to IDs (use <|unk|> for unknown tokens)
        ids = [
            self.str_to_int.get(token, self.str_to_int["<|unk|>"])
            for token in tokens
        ]
        
        return ids
    
    def decode(self, ids):
        """
        Decode token IDs to text
        
        Args:
            ids: List of token IDs
            
        Returns:
            Decoded text string
        """
        # Convert IDs to tokens
        tokens = [self.int_to_str[idx] for idx in ids]
        
        # Join tokens
        text = " ".join(tokens)
        
        # Fix spacing before punctuation
        text = re.sub(r'\s+([,.:;?!"])', r'\1', text)
        
        return text

πŸ“¦ How to Use

# 1. Build vocabulary from training data
with open("the_verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]

# Create vocabulary
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}

# 2. Create tokenizer
tokenizer = SimpleTokenizerV2(vocab)

# 3. Use it!
text = "Hello, world! <|endoftext|> This is a test."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")

Testing Our Tokenizer

πŸ§ͺ Test Case 1: Simple Sentence

text = "It's the last he painted, you know."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")
print(f"Match: {text == decoded}")

Output:

Original: It's the last he painted, you know.
Encoded:  [1131, 1135, 1131, 1132, 1133, 1134, 5, 1136, 1137, 7]
Decoded:  It ' s the last he painted, you know.
Match: False

βœ… Almost! Only the spacing around the apostrophe differs β€” our simple decoder joins every token with a space.


πŸ§ͺ Test Case 2: Unknown Words

text = "Hello, do you like pizza?"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(f"Original: {text}")
print(f"Encoded:  {ids}")
print(f"Decoded:  {decoded}")

Output:

Original: Hello, do you like pizza?
Encoded:  [1131, 5, 1095, 1131, 1131, 1131, 10]
Decoded:  <|unk|>, do you <|unk|> <|unk|>?

βœ… Handles unknowns gracefully!
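A small debugging helper (hypothetical β€” not part of the tokenizer class) makes it easy to see which words fell back to <|unk|>:

def find_unknowns(text, vocab):
    """Return the tokens in text that are missing from the vocabulary."""
    return [tok for tok in simple_tokenize(text) if tok not in vocab]

print(find_unknowns("Hello, do you like pizza?", vocab))  # exact result depends on your vocabulary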


πŸ§ͺ Test Case 3: Multiple Texts

text1 = "Hello, world!"
text2 = "This is a test."
combined = text1 + " <|endoftext|> " + text2

ids = tokenizer.encode(combined)
decoded = tokenizer.decode(ids)

print(f"Original: {combined}")
print(f"Decoded:  {decoded}")

Output:

Original: Hello, world! <|endoftext|> This is a test.
Decoded:  <|unk|>, <|unk|>! <|endoftext|> This is a test.

βœ… Preserves <|endoftext|> separator!


πŸ§ͺ Test Case 4: Edge Cases

# Empty string
print(tokenizer.encode(""))  # []

# Only punctuation
print(tokenizer.encode("!?,."))  # [0, 10, 5, 7]

# Only special tokens
print(tokenizer.encode("<|endoftext|>"))  # [1130]

βœ… All edge cases handled!


Why GPT Uses Byte Pair Encoding

πŸ€” Limitations of Our Tokenizer

Our tokenizer:

Word = 1 token

"unbelievable" β†’ 1 token
"supercalifragilisticexpialidocious" β†’ 1 token

Problems:

  1. Huge vocabulary needed

    • Every word = separate token
    • English has 170,000+ words
    • New words added daily!
  2. Unknown words

    • Can’t handle β€œblockchain”, β€œselfie”, β€œemoji”
    • Replaces with <|unk|> (loses information!)
  3. Memory waste

    • Rare words take space
    • β€œantidisestablishmentarianism” only appears once!

πŸ’‘ The BPE Solution

Byte Pair Encoding (BPE):

Break words into subword units!

Example:

Word: "unbelievable"

Our tokenizer: ["unbelievable"] (1 token)
BPE tokenizer:  ["un", "believ", "able"] (3 tokens)

Benefits:

  1. Smaller vocabulary

    • Reuse subwords across words
    • β€œun” appears in: unbelievable, unhappy, undo, etc.
  2. No unknown words!

    • Any word can be broken into subwords
    • Worst case: break into characters
  3. Efficient

    • Common words = 1 token
    • Rare words = multiple tokens

πŸ“Š Comparison

Tokenizer             β€œrunning”          β€œunbelievable”            Vocab Size
Our (word-level)      ["running"]        ["unbelievable"]          ~170,000
BPE (subword-level)   ["run", "ning"]    ["un", "believ", "able"]  ~50,000

BPE vocabulary is 3x smaller!


🎯 GPT Tokenization

GPT-2/GPT-3/GPT-4 all use BPE!

Benefits:

  • βœ… Vocabulary: ~50,000 tokens
  • βœ… No <|unk|> needed
  • βœ… Handles all languages
  • βœ… Efficient encoding

Example (a quick peek with OpenAI’s tiktoken library, which we cover properly next chapter β€” the exact split may differ):

import tiktoken

enc = tiktoken.get_encoding("gpt2")    # the BPE vocabulary used by GPT-2/GPT-3
ids = enc.encode("unbelievable")
print([enc.decode([i]) for i in ids])  # a handful of subword pieces, along the lines of ["un", "believ", "able"]

πŸ“š Next Chapter Preview

Chapter 8: Byte Pair Encoding (BPE)

We’ll learn:

  • How BPE works
  • Building BPE tokenizer from scratch
  • Using GPT’s tokenizer
  • Comparing tokenization methods

Get ready for more coding! πŸš€


Chapter Summary

πŸŽ‰ What We Accomplished Today

This was a HUGE chapter! Let’s recap:


1. Understanding Tokenization

Tokenization = Breaking text into tokens

Why? Neural networks need numbers, not text!

Process:
Text β†’ Tokens β†’ Token IDs β†’ Embeddings β†’ LLM

2. Built Tokenizer from Scratch

Step 1: Text β†’ Tokens

import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]

Step 2: Tokens β†’ Token IDs

all_tokens = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(all_tokens)}

3. Implemented Encoder & Decoder

Encoder:

def encode(text):
    tokens = tokenize(text)
    ids = [vocab[token] for token in tokens]
    return ids

Decoder:

def decode(ids):
    tokens = [inv_vocab[idx] for idx in ids]
    text = " ".join(tokens)
    return text

4. Added Special Tokens

Two key tokens:

  • <|unk|> - Unknown words
  • <|endoftext|> - Text separator

Why they matter:

  • Handle unknown words gracefully
  • Separate different text sources
  • Used in real GPT models!

5. Complete Tokenizer Class

class SimpleTokenizerV2:
    def __init__(self, vocab):
        ...  # store vocab and its inverse mapping

    def encode(self, text):
        ...  # Text β†’ Token IDs (unknown words β†’ <|unk|>)

    def decode(self, ids):
        ...

Production-ready! βœ…


6. Learned About BPE

Word-level tokenization (our approach):

  • 1 word = 1 token
  • Large vocabulary
  • Unknown word problem

Subword tokenization (BPE - GPT’s approach):

  • 1 word = multiple subword tokens
  • Smaller vocabulary
  • No unknown words!

Next chapter: Build BPE tokenizer!


πŸ“Š Key Statistics

Our dataset:

  • Book: β€œThe Verdict” (1908)
  • Characters: 20,479
  • Tokens: 4,690
  • Vocabulary: 1,132 (including special tokens)

Real LLMs:

  • GPT-3 vocabulary: ~50,000 tokens
  • Trained on: 300 billion tokens
  • Uses: Byte Pair Encoding (BPE)

πŸ’‘ Key Takeaways

  1. Tokenization is essential for LLM data preprocessing
  2. Vocabulary maps tokens to IDs
  3. Encoder converts text to IDs
  4. Decoder converts IDs back to text
  5. Special tokens handle edge cases
  6. BPE is better than word-level tokenization
  7. GPT uses BPE, not our simple method

🎯 What We Learned (Checklist)

  • [x] What is tokenization
  • [x] Why tokenization matters
  • [x] Breaking text into tokens (regex)
  • [x] Building vocabulary
  • [x] Converting tokens to IDs
  • [x] Implementing encoder
  • [x] Implementing decoder
  • [x] Handling unknown words
  • [x] Special tokens (<|unk|>, <|endoftext|>)
  • [x] Complete tokenizer class
  • [x] Testing and edge cases
  • [x] Why GPT uses BPE

πŸ”œ Next Chapter: Chapter 8

Topic: Byte Pair Encoding (BPE) Tokenization

What we’ll learn:

  • How BPE algorithm works
  • Building BPE tokenizer from scratch
  • Using OpenAI’s tiktoken library
  • Comparing tokenization methods
  • Vocabulary size optimization
  • Handling multiple languages

Hands-on coding continues! πŸ’»


πŸ“ Practice Exercise

Try this before next chapter:

  1. Download a different book (Project Gutenberg)
  2. Build vocabulary from that book
  3. Create tokenizer instance
  4. Encode some sentences
  5. Decode them back
  6. Count vocabulary size

Share your results in comments! πŸ’¬
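To get you started, here's a rough starter sketch (the file name is a placeholder β€” use whichever Project Gutenberg book you downloaded):

import re

# 1. Load your own book (placeholder file name)
with open("my_book.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# 2. Tokenize and build the vocabulary (same recipe as this chapter)
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(f"Vocabulary size: {len(vocab)}")

# 3. Encode and decode a sentence of your choice
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode("Replace this with a sentence from your book.")
print(ids)
print(tokenizer.decode(ids))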


πŸš€ Take Action Now!

  1. πŸ’» Run the Code - Download notebook and execute every cell
  2. πŸ§ͺ Experiment - Try different texts, books, datasets
  3. πŸ“ Practice - Build tokenizer for your own data
  4. ❓ Ask Questions - Comment if anything is unclear
  5. πŸ”– Bookmark - Reference material for future
  6. ⏭️ Get Ready - Next chapter: Byte Pair Encoding!

Quick Reference

Key Functions:

# Tokenization
import re
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [item for item in tokens if item.strip()]

# Vocabulary
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encoding
ids = [vocab[token] for token in tokens]

# Decoding
inv_vocab = {idx: token for token, idx in vocab.items()}
tokens = [inv_vocab[idx] for idx in ids]
text = " ".join(tokens)

Special Tokens:

Token           Purpose                 Usage
<|unk|>         Unknown word            Fallback ID for tokens not in the vocabulary
<|endoftext|>   End of text             Separates independent text sources (used by GPT)
<|BOS|>         Beginning of sequence   Marks the start of a text (not used by GPT)
<|EOS|>         End of sequence         Marks the end of a text (not used by GPT)
<|PAD|>         Padding                 Pads shorter texts in a batch (not used by GPT)

Tokenizer Class Template:

class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}
    
    def encode(self, text):
        # Tokenize and convert to IDs
        pass
    
    def decode(self, ids):
        # Convert IDs to text
        pass

Thank You!

You’ve completed Chapter 7 - Tokenization! πŸŽ‰

You now know:

  • βœ… How tokenization works
  • βœ… How to build encoder/decoder
  • βœ… How to handle special tokens
  • βœ… Why BPE is superior
  • βœ… Real implementation in Python

Next chapter: Byte Pair Encoding (BPE)

This is just the beginning! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Which part was most interesting?
  • What was challenging?
  • What would you like more detail on?
  • Share your tokenizer experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 8: Byte Pair Encoding (BPE)
Chapter 9: Vector Embeddings
Chapter 10: Positional Encoding
Chapter 11: Self-Attention Mechanism

Stay tuned! The coding adventure continues! πŸ’»πŸ”₯


See you in Chapter 8 where we build BPE tokenizer! πŸš€


Questions? Stuck somewhere? Drop a comment below! We’re here to help. πŸ’ͺ