Chapter 9: Data Sampling & Context Windows - From Tokens to Training Data

πŸ“– Reading Time: 75 minutes
πŸ’» Coding Time: 90 minutes

Welcome to Chapter 9! Today we prepare data for actual LLM training! πŸš€

What we’ve learned so far:

  • Tokenization basics (Chapter 7)
  • Byte Pair Encoding (Chapter 8)
  • How to convert text β†’ tokens

Today:

  • Creating input-target pairs
  • What is a context window?
  • Sliding window approach
  • Batch processing
  • PyTorch DataLoader implementation
  • Preparing data for training!

This is the FINAL step before embeddings! πŸ’‘



Why Do We Need Input-Target Pairs?

πŸ€” The Problem

In other ML tasks, input-output is clear:

Image Classification:

Input: Picture of a cat 🐱
Output: "cat"
βœ… Clear!

House Price Prediction:

Input: House area (2000 sq ft)
Output: Price ($300,000)
βœ… Clear!

But for LLMs? πŸ€”


πŸ’‘ LLM’s Task: Predict Next Word

Given sentence:

"LLMs learn to predict one word at a time"

What’s the input? What’s the output? πŸ€”

Answer: We need to CREATE input-output pairs from the sentence itself!


πŸ“Š Where We Are

STAGE 1: Building Blocks
β”œβ”€β”€ Data Preparation
β”‚   β”œβ”€β”€ βœ… Tokenization (Chapter 7)
β”‚   β”œβ”€β”€ βœ… BPE (Chapter 8)
β”‚   β”œβ”€β”€ βœ… Input-Target Pairs (Chapter 9) ← TODAY!
β”‚   └── ⏳ Vector Embeddings (Next!)
β”œβ”€β”€ Attention Mechanisms
└── LLM Architecture

Almost done with data preparation! πŸŽ‰


Understanding Next-Word Prediction

🎯 The Core Idea

LLMs learn by predicting the NEXT word in a sequence

Sentence:

"LLMs learn to predict one word at a time"

How to train on this? πŸ€”


πŸ“ Creating Multiple Training Examples

From ONE sentence, create MULTIPLE training pairs!

Iteration 1:

Input:  "LLMs"
Target: "learn"

Iteration 2:

Input:  "LLMs learn"
Target: "to"

Iteration 3:

Input:  "LLMs learn to"
Target: "predict"

Iteration 4:

Input:  "LLMs learn to predict"
Target: "one"

See the pattern? ✨


πŸ”„ Visual Representation

Sentence: "LLMs learn to predict one word at a time"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 1                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs]                             β”‚
β”‚  Target: [learn]                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 2                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn]                     β”‚
β”‚  Target: [to]                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 3                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to]                β”‚
β”‚  Target: [predict]                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 4                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to] [predict]      β”‚
β”‚  Target: [one]                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

One sentence β†’ Multiple training examples! πŸŽ‰
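
Here's a minimal sketch of this idea in plain Python, working on words just for intuition (real training works on token IDs, as we'll see later in this chapter):

sentence = "LLMs learn to predict one word at a time"
words = sentence.split()

# Each position gives one training example:
# everything before it is the input, the word at that position is the target
for i in range(1, 5):
    context = words[:i]
    target = words[i]
    print(f"Input: {context} β†’ Target: {target}")

Output:

Input: ['LLMs'] β†’ Target: learn
Input: ['LLMs', 'learn'] β†’ Target: to
Input: ['LLMs', 'learn', 'to'] β†’ Target: predict
Input: ['LLMs', 'learn', 'to', 'predict'] β†’ Target: one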


🎯 Key Insights

1. Target is always NEXT word:

Input: "The cat"
Target: "sat" (next word!)

2. Input grows each iteration:

Iteration 1: [The]
Iteration 2: [The] [cat]
Iteration 3: [The] [cat] [sat]

3. Future words are MASKED:

When predicting "cat":
Can see: "The" βœ…
Cannot see: "sat", "on", "the", "mat" ❌

This is CAUSAL language modeling!


What is a Context Window?

πŸͺŸ Definition

Context Window = How many previous words the model can see to predict the next word

Think of it as the model’s β€œmemory”! 🧠


πŸ“ Example: Context Window = 4

Sentence: β€œLLMs learn to predict one word at a time”

Training examples with context window = 4:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 1                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs]                             β”‚
β”‚  Target: [learn]                            β”‚
β”‚  Context used: 1 word                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 2                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn]                     β”‚
β”‚  Target: [to]                               β”‚
β”‚  Context used: 2 words                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 3                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to]                β”‚
β”‚  Target: [predict]                          β”‚
β”‚  Context used: 3 words                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 4                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to] [predict]      β”‚
β”‚  Target: [one]                              β”‚
β”‚  Context used: 4 words (MAX!)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 5                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [learn] [to] [predict] [one]       β”‚
β”‚          (dropped "LLMs"!)                  β”‚
β”‚  Target: [word]                             β”‚
β”‚  Context used: 4 words (MAX!)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Context window = MAXIMUM words model can see! 🎯
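
In code, the cap is just a slice that keeps only the last context_window words before the target (a word-level sketch; real models work on token IDs):

words = "LLMs learn to predict one word at a time".split()
context_window = 4

for i in range(1, len(words)):
    # Only the last `context_window` words before position i are visible
    visible = words[max(0, i - context_window):i]
    target = words[i]
    print(f"Input: {visible} β†’ Target: {target}")

By the fifth step the window has already dropped "LLMs", exactly as in Example 5 above.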


πŸ”’ Real-World Context Sizes

Model          Context Window   Notes
-------------  ---------------  --------------
Our tutorial   4 tokens         For learning!
GPT-2          1024 tokens      ~750 words
GPT-3          2048 tokens      ~1500 words
GPT-4          8192 tokens      ~6000 words
GPT-4 Turbo    128K tokens      ~96000 words!
Claude 2       100K tokens      ~75000 words

Larger context = More memory! πŸ“ˆ


πŸ’‘ Why Context Window Matters

Small context (4 words):

Input: "The cat sat on"
Predict: "the"

Problem: Can't see full sentence! ❌

Large context (256 words):

Input: [entire paragraph about cats sitting]
Predict: "mat"

Better! Can see full context! βœ…

But: Larger context = More computation + memory! πŸ’°


🎯 Key Takeaway

Context Window = 4 means:
βœ… Model can see up to 4 previous words
βœ… Creates 4 prediction tasks per input-target pair
βœ… Balances memory and performance

GPT training uses context = 256 or more!

Auto-Regressive Training

πŸ”„ What is Auto-Regressive?

Auto-Regressive = Model’s output becomes its next input

β€œAuto” = Self
β€œRegressive” = Using previous values


πŸ“Š Visual Example

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 1                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The"                           β”‚
β”‚  Output: "cat"                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
        Add to input!
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 2                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The cat" ← (includes "cat"!)  β”‚
β”‚  Output: "sat"                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
        Add to input!
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 3                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The cat sat" ← (includes both!)β”‚
β”‚  Output: "on"                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each output feeds back as input! πŸ”„


🎯 Why It’s Called β€œAuto-Regressive”

Compare two iterations:

Iteration 1:
Input:  [1]
Output: [2]

Iteration 2:
Input:  [1, 2] ← Output from Iteration 1!
Output: [3]

The model β€œregresses” on its own outputs!
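
Here's a toy sketch of this feedback loop, with a stand-in predict_next function where a trained model would normally go:

def predict_next(tokens):
    # Stand-in for a trained model: just returns the next integer
    return tokens[-1] + 1

sequence = [1]
for _ in range(3):
    next_token = predict_next(sequence)   # model output...
    sequence.append(next_token)           # ...is appended and becomes part of the next input
    print(sequence)

Output:

[1, 2]
[1, 2, 3]
[1, 2, 3, 4]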


πŸ“ Auto-Regressive = Self-Supervised

Key insight: No manual labeling needed!

Traditional supervised learning:
Data: Image of cat
Label: "cat" ← Someone labeled this! πŸ‘€

Auto-regressive (LLMs):
Data: "The cat sat on the mat"
Labels: Automatically from sentence structure! πŸ€–
  - Input: "The" β†’ Label: "cat"
  - Input: "The cat" β†’ Label: "sat"
  - Input: "The cat sat" β†’ Label: "on"

The sentence ITSELF provides the labels! ✨


πŸ’‘ Key Takeaway

Auto-Regressive Training:
βœ… Output becomes next input
βœ… Self-supervised (no manual labels!)
βœ… How LLMs learn from raw text
βœ… Enables unsupervised learning at scale

This is why LLMs can train on billions of words!

Creating Input-Target Pairs in Python

πŸ’» Simple Example

Goal: Create input-target pairs with context window = 4

# Sample text (already tokenized)
tokens = [1, 2, 3, 4, 5, 6, 7, 8]

# Context window
context_length = 4

# Create input (X) and target (Y)
X = tokens[:context_length]      # [1, 2, 3, 4]
Y = tokens[1:context_length+1]   # [2, 3, 4, 5]

print(f"Input:  {X}")
print(f"Target: {Y}")

Output:

Input:  [1, 2, 3, 4]
Target: [2, 3, 4, 5]

Target is input shifted by 1! βœ…


πŸ“Š Understanding the Pairs

Input: [1, 2, 3, 4]
Target: [2, 3, 4, 5]

This creates 4 prediction tasks:

Input Sequence   Target   Task
--------------   ------   ------------------------------
[1]              2        If input is 1, predict 2
[1, 2]           3        If input is 1, 2, predict 3
[1, 2, 3]        4        If input is 1, 2, 3, predict 4
[1, 2, 3, 4]     5        If input is 1, 2, 3, 4, predict 5

One input-target pair = 4 prediction tasks! 🎯


πŸ”„ Iterating Through Tasks

X = [1, 2, 3, 4]
Y = [2, 3, 4, 5]
context_length = 4

for i in range(1, context_length + 1):
    context = X[:i]
    target = Y[i-1]
    print(f"Input: {context} β†’ Target: {target}")

Output:

Input: [1] β†’ Target: 2
Input: [1, 2] β†’ Target: 3
Input: [1, 2, 3] β†’ Target: 4
Input: [1, 2, 3, 4] β†’ Target: 5

This is how LLMs train! ✨


πŸ“– Real Text Example

# Load and tokenize text
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

text = "and established himself in"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Create input-target pairs
context_length = 4
X = encoded[:context_length]
Y = encoded[1:context_length+1]

# Decode to see words
for i in range(1, context_length + 1):
    context = tokenizer.decode(X[:i])
    target = tokenizer.decode([Y[i-1]])
    print(f"'{context}' β†’ '{target}'")

Output:

Encoded: [290, 4920, 2241, 287]
'and' β†’ 'established'
'and established' β†’ 'himself'
'and established himself' β†’ 'in'

Makes sense! Each word predicts the next! βœ…


The Sliding Window Approach

πŸͺŸ What is Sliding Window?

Sliding Window = Move window across text to create multiple input-target pairs

Think of it like a camera viewport moving across text! πŸ“Ή


πŸ“Š Visual Example

Text: β€œIn the heart of the city stood the old library”

Context Window = 4

Step 1: Window at position 0

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ [In] [the] [heart] [of] β”‚ ← Input
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓ Shift by 1
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ [the] [heart] [of] [the]β”‚ ← Target
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 2: Slide window (stride = 4)

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ [the] [city] [stood] [the]β”‚ ← Input
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↓ Shift by 1
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ [city] [stood] [the] [old]β”‚ ← Target
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 3: Slide again (stride = 4)

                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚ [old] [library] [...] [...]β”‚ ← Input
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Keep sliding until end of text! 🎯
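
In code, the sliding window is just a loop that jumps forward by stride positions each time (a word-level sketch of the same text):

words = "In the heart of the city stood the old library".split()
context_length = 4
stride = 4

for i in range(0, len(words) - context_length, stride):
    input_window = words[i:i + context_length]
    target_window = words[i + 1:i + context_length + 1]
    print(f"Input:  {input_window}")
    print(f"Target: {target_window}")

Output:

Input:  ['In', 'the', 'heart', 'of']
Target: ['the', 'heart', 'of', 'the']
Input:  ['the', 'city', 'stood', 'the']
Target: ['city', 'stood', 'the', 'old']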


🎚️ Stride Parameter

Stride = How much to slide the window

Stride = 1 (overlap):

Window 1: [In] [the] [heart] [of]
Window 2:      [the] [heart] [of] [the]  ← Shifted by 1
Window 3:            [heart] [of] [the] [city]

Maximum training examples!
But: Lots of overlap (may overfit) ⚠️

Stride = 4 (no overlap):

Window 1: [In] [the] [heart] [of]
Window 2:                           [the] [city] [stood] [the]
Window 3:                                                        [old] [library] [...]

No overlap!
But: Fewer training examples

πŸ“Š Visual Comparison

Stride = 1:

Text: "In the heart of the city stood the old library"

Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]

Input 2:  [the] [heart] [of] [the]      ← 3/4 overlap!
Target 2: [heart] [of] [the] [city]

Stride = 4:

Text: "In the heart of the city stood the old library"

Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]

Input 2:  [the] [city] [stood] [the]    ← No overlap!
Target 2: [city] [stood] [the] [old]

🎯 Choosing Stride

Common practice:

Stride = Context Length

Example:
Context = 4 β†’ Stride = 4

Why?
βœ… No overlap (prevents overfitting)
βœ… Uses all data (no words skipped)
βœ… Balanced approach!

GPT training typically uses stride = context_length ✨
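
You can see the trade-off by counting how many windows each stride produces over the same tokens (a quick sketch with made-up token IDs):

tokens = list(range(100))   # pretend we have 100 token IDs
context_length = 4

for stride in (1, 2, 4):
    num_pairs = len(range(0, len(tokens) - context_length, stride))
    print(f"stride={stride}: {num_pairs} input-target pairs")

Output:

stride=1: 96 input-target pairs
stride=2: 48 input-target pairs
stride=4: 24 input-target pairs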


Implementing PyTorch DataLoader

🎯 Why DataLoader?

Problems with manual approach:

❌ Have to manually iterate
❌ No batching support
❌ No parallel processing
❌ Inefficient for large datasets

DataLoader solution:

βœ… Automatic batching
βœ… Parallel data loading
βœ… Shuffling support
βœ… Efficient memory usage

πŸ“¦ Two Components

1. Dataset = Defines how to get one sample
2. DataLoader = Manages batches and workers

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Dataset                           β”‚
β”‚  - How to load data                β”‚
β”‚  - How to get one sample           β”‚
β”‚  - Returns (input, target) pair    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DataLoader                        β”‚
β”‚  - Batch multiple samples          β”‚
β”‚  - Shuffle data                    β”‚
β”‚  - Parallel loading (workers)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’» Step 1: Create Dataset Class

import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize the entire text
        token_ids = tokenizer.encode(txt)
        
        # Create input-target pairs with a sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            # Store as tensors so the DataLoader can stack them into batches
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

πŸ” Understanding the Dataset

__init__: Setup (tokenize text, create pairs)
__len__: Return total number of samples
__getitem__: Return one sample (input, target)

Example (the text must contain more than max_length tokens, otherwise no pairs are created):

dataset = GPTDatasetV1(
    txt="In the heart of the city stood the old library",
    tokenizer=tokenizer,
    max_length=4,
    stride=4
)

# Get first sample
input_ids, target_ids = dataset[0]
print(f"Input:  {input_ids}")
print(f"Target: {target_ids}")

πŸ’» Step 2: Create DataLoader

import tiktoken
from torch.utils.data import DataLoader

def create_dataloader(txt, batch_size=4, max_length=256,
                      stride=128, num_workers=0):
    # Initialize tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=num_workers
    )
    
    return dataloader

βœ… Test the DataLoader

# Load text
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

# Create dataloader
dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4
)

# Get first batch
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Input shape:  {inputs.shape}")   # torch.Size([8, 4])
print(f"Target shape: {targets.shape}")  # torch.Size([8, 4])
print(f"\nFirst input:  {inputs[0]}")
print(f"First target: {targets[0]}")

Output:

Input shape:  torch.Size([8, 4])
Target shape: torch.Size([8, 4])

First input:  tensor([290, 4920, 2241, 287])
First target: tensor([4920, 2241, 287, 257])

Perfect! Batch of 8 input-target pairs! πŸŽ‰
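
To double-check the shift-by-one relationship, you can decode the first pair back to text (a small sketch continuing from the code above; the second line should read like the first, shifted one token to the right):

import tiktoken

# Decode the first input-target pair to verify the one-token shift
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.decode(inputs[0].tolist()))   # e.g. " and established himself in"
print(tokenizer.decode(targets[0].tolist()))  # e.g. " established himself in a"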


Batch Size vs Stride vs Num Workers

🎚️ Three Key Parameters

Let’s understand each one:

  1. Batch Size = Samples processed before parameter update
  2. Stride = How much to slide window
  3. Num Workers = Parallel data loading processes

1️⃣ Batch Size

Definition: Number of samples processed together

Small batch (batch_size=1):

Process 1 sample β†’ Update parameters
Process 1 sample β†’ Update parameters
Process 1 sample β†’ Update parameters

βœ… Less memory
❌ Noisy updates (slow convergence)
❌ Slower (no parallelization)

Large batch (batch_size=32):

Process 32 samples β†’ Update parameters
Process 32 samples β†’ Update parameters

βœ… Faster (parallelization)
βœ… Stable updates
❌ More memory needed

Common values: 4, 8, 16, 32, 64


πŸ“Š Batch Size Visual

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Batch Size = 1                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Sample 1 β†’ Update                     β”‚
β”‚  Sample 2 β†’ Update                     β”‚
β”‚  Sample 3 β†’ Update                     β”‚
β”‚  ...                                   β”‚
β”‚  Many updates (noisy!)                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Batch Size = 8                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Samples 1-8 β†’ Update                  β”‚
β”‚  Samples 9-16 β†’ Update                 β”‚
β”‚  ...                                   β”‚
β”‚  Fewer, stable updates!                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2️⃣ Stride

Definition: How much to slide window between samples

Stride = 1 (max overlap):

Text: "In the heart of the city stood"

Sample 1: [In] [the] [heart] [of]
Sample 2:      [the] [heart] [of] [the]  ← 3/4 overlap!
Sample 3:            [heart] [of] [the] [city]

βœ… Maximum training data
❌ Lots of overlap (overfitting risk)

Stride = 4 (no overlap):

Text: "In the heart of the city stood"

Sample 1: [In] [the] [heart] [of]
Sample 2:                           [the] [city] [stood] [the]

βœ… No overlap (less overfitting)
❌ Less training data

Best practice: stride = max_length


3️⃣ Num Workers

Definition: Number of parallel processes for data loading

num_workers=0 (single process):

Main Process:
  Load data β†’ Train β†’ Load data β†’ Train β†’ ...
  
⏱️ Slower (sequential)

num_workers=4 (parallel):

Main Process: Train β†’ Train β†’ Train β†’ ...
Worker 1: Load batch 1 ─┐
Worker 2: Load batch 2  β”œβ”€β†’ Ready batches
Worker 3: Load batch 3  β”‚
Worker 4: Load batch 4 β”€β”˜

⚑ Faster (parallel loading!)

Common values: 2, 4, 8 (based on CPU cores)
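
There is no universal best value; one common starting point is to tie it to the CPU count and tune from there (a sketch using Python's standard library):

import os

# Modest default: a few workers, capped by the number of CPU cores available
num_workers = min(4, os.cpu_count() or 1)
print(f"Using {num_workers} data-loading workers")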


πŸ“Š Complete Example

# Load data
with open("the_verdict.txt", "r") as f:
    text = f.read()

# Create dataloader with all parameters
dataloader = create_dataloader(
    text,
    batch_size=8,      # Process 8 samples at once
    max_length=256,    # Context window = 256 tokens
    stride=128,        # Slide by 128 (50% overlap)
    num_workers=4      # 4 parallel workers
)

# Iterate through batches
for batch_idx, (inputs, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f"  Inputs shape:  {inputs.shape}")
    print(f"  Targets shape: {targets.shape}")
    
    if batch_idx >= 2:  # Show first 3 batches
        break

Output:

Batch 0:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])
Batch 1:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])
Batch 2:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])

🎯 Parameter Guidelines

Parameter     Small Project   Medium         Large (GPT-like)
-----------   -------------   ------------   ----------------
batch_size    4-8             16-32          64-512
max_length    128             256-512        1024-2048
stride        = max_length    = max_length   = max_length
num_workers   2               4              8-16

Complete Data Pipeline

🎯 Full Implementation

import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize
        token_ids = tokenizer.encode(txt)
        
        # Create pairs with sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# 2. DataLoader Function
def create_dataloader(txt, batch_size=4, max_length=256,
                     stride=128, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=num_workers
    )
    return dataloader

# 3. Usage
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=256,
    stride=128
)

# 4. Training Loop (preview!)
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs shape: [batch_size, max_length]
    # targets shape: [batch_size, max_length]
    
    # This will be fed to the model!
    # model(inputs) β†’ predictions
    # loss = compute_loss(predictions, targets)
    # loss.backward()
    # optimizer.step()
    
    print(f"Batch {batch_idx}: {inputs.shape}")

πŸ“Š Data Flow Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. RAW TEXT                                    β”‚
β”‚  "The cat sat on the mat..."                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
          [TOKENIZATION - BPE]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. TOKEN IDs                                   β”‚
β”‚  [101, 202, 303, 404, ...]                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
       [SLIDING WINDOW - Dataset]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. INPUT-TARGET PAIRS                          β”‚
β”‚  Input:  [101, 202, 303, 404]                  β”‚
β”‚  Target: [202, 303, 404, 505]                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
         [BATCHING - DataLoader]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. BATCHED TENSORS                             β”‚
β”‚  Input tensor:  [batch_size, max_length]        β”‚
β”‚  Target tensor: [batch_size, max_length]        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
          [EMBEDDINGS - Next Chapter!]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. READY FOR TRAINING!                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Chapter Summary

πŸŽ‰ What We Learned Today

This was a CRUCIAL chapter! Let’s recap:


1. Input-Target Pairs

Core concept:

For LLM training, we create pairs where:
- Input: Current sequence
- Target: Input shifted by 1

Example:
Input:  [1, 2, 3, 4]
Target: [2, 3, 4, 5]

This creates multiple prediction tasks!

2. Context Window

Definition:

Maximum number of tokens model can see

Small (4): Fast, less memory, less context
Large (256): Slow, more memory, more context

GPT-2: 1024 tokens
GPT-3: 2048 tokens
GPT-4: 8192+ tokens

3. Auto-Regressive Training

Key insight:

Output of step N β†’ Input of step N+1

Self-supervised learning:
βœ… No manual labels needed
βœ… Sentence structure provides labels
βœ… Scales to billions of words

4. Sliding Window

Approach:

Window = [In] [the] [heart] [of]
Slide β†’ [the] [heart] [of] [the]
Slide β†’ [heart] [of] [the] [city]

Parameters:
- Context length: Window size
- Stride: How much to slide

5. PyTorch DataLoader

Two components:

# Dataset: How to get one sample
class GPTDatasetV1(Dataset):
    def __getitem__(self, idx):
        return input, target

# DataLoader: Batch management
dataloader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4
)

6. Three Key Parameters

Batch Size:

How many samples before parameter update
Small: Fast updates, noisy
Large: Stable updates, more memory
Common: 8, 16, 32

Stride:

How much to slide window
stride = context_length (common)
Prevents overlap, uses all data

Num Workers:

Parallel data loading
More workers = faster loading
Common: 2, 4, 8

πŸ“Š Complete Pipeline

Text β†’ Tokenize β†’ Sliding Window β†’ Batching β†’ Ready for Training!

"The cat sat..."
      ↓
[101, 202, 303, ...]
      ↓
Input: [101, 202, 303, 404]
Target: [202, 303, 404, 505]
      ↓
Batch: [8, 256] tensors
      ↓
Next: Embeddings!

πŸ’‘ Key Takeaways

  1. LLMs train on next-word prediction (auto-regressive)
  2. One sentence β†’ multiple training examples (sliding window)
  3. Context window = model’s memory (GPT-2: 1024)
  4. Target = input shifted by 1 (self-supervised!)
  5. DataLoader enables efficient batching (parallel, fast)
  6. Stride = context_length (best practice)
  7. Ready for embeddings! (next chapter)

🎯 What We Learned (Checklist)

  • [x] Why input-target pairs are needed
  • [x] Next-word prediction concept
  • [x] Context window definition
  • [x] Auto-regressive training
  • [x] Creating pairs in Python
  • [x] Sliding window approach
  • [x] Stride parameter
  • [x] PyTorch Dataset class
  • [x] PyTorch DataLoader
  • [x] Batch size vs stride vs workers
  • [x] Complete data pipeline

πŸ”œ Next Chapter: Chapter 10

Topic: Vector Embeddings (Token Embeddings)

What we’ll learn:

  • Converting token IDs β†’ vectors
  • Why embeddings matter
  • Word2Vec concepts
  • Building embedding layer
  • Embedding dimensions
  • Pre-trained embeddings

From numbers to vectors! πŸš€


πŸ“ Practice Exercise

Try this before next chapter:

  1. Load a different text file
  2. Create Dataset with context=8, stride=4
  3. Create DataLoader with batch_size=4
  4. Print first 3 batches
  5. Verify input-target relationship (see the sketch after this list)
  6. Experiment with different strides
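
Here's a sketch for step 5, assuming you have already built dataloader in steps 2-3:

# Verify that each target sequence is the input shifted by one token
inputs, targets = next(iter(dataloader))
assert (inputs[0][1:] == targets[0][:-1]).all()
print("Shift-by-one relationship confirmed!")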

Share your results! πŸ’¬


πŸš€ Take Action Now!

  1. πŸ’» Run the Code - Implement Dataset and DataLoader
  2. πŸ§ͺ Experiment - Try different parameters
  3. πŸ“ Practice - Create your own dataset
  4. ❓ Ask Questions - Comment if unclear
  5. πŸ”– Bookmark - Critical reference material
  6. ⏭️ Get Ready - Next: Embeddings!

Quick Reference

Dataset Template:

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        # Create input-target pairs
        pass
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

DataLoader Usage:

dataloader = DataLoader(
    dataset,
    batch_size=8,      # Samples per batch
    shuffle=False,      # Don't shuffle
    drop_last=True,     # Drop incomplete batch
    num_workers=4       # Parallel workers
)

Key Parameters:

Parameter     Meaning              Typical Value
-----------   ------------------   --------------
max_length    Context window       256, 512, 1024
stride        Window slide         = max_length
batch_size    Samples/batch        8, 16, 32
num_workers   Parallel processes   2, 4, 8

Thank You!

You’ve completed Chapter 9 - Data Sampling! πŸŽ‰

You now know:

  • βœ… How to create input-target pairs
  • βœ… What context windows are
  • βœ… Auto-regressive training
  • βœ… Sliding window approach
  • βœ… PyTorch DataLoader implementation

Next chapter: Vector Embeddings

Almost done with data preparation! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Did you understand DataLoader?
  • Which part was most challenging?
  • Any questions about batching?
  • Share your experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 10: Token Embeddings
Chapter 11: Positional Encoding
Chapter 12: Self-Attention Mechanism
Chapter 13: Building GPT from Scratch

The journey continues! πŸ’»πŸ”₯


See you in Chapter 10 where we learn embeddings! πŸš€


Questions? Stuck on DataLoader? Drop them below! πŸ’ͺ