Chapter 9: Data Sampling & Context Windows - From Tokens to Training Data

πŸ“– Reading Time: 75 minutes
πŸ’» Coding Time: 90 minutes

Welcome to Chapter 9! Today we prepare data for actual LLM training! πŸš€

What we’ve learned so far:

  • Tokenization basics (Chapter 7)
  • Byte Pair Encoding (Chapter 8)
  • How to convert text β†’ tokens

Today:

  • Creating input-target pairs
  • What is a context window?
  • Sliding window approach
  • Batch processing
  • PyTorch DataLoader implementation
  • Preparing data for training!

This is the FINAL step before embeddings! πŸ’‘



Why Do We Need Input-Target Pairs?

πŸ€” The Problem

In other ML tasks, input-output is clear:

Image Classification:

Input: Picture of a cat 🐱
Output: "cat"
βœ… Clear!

House Price Prediction:

Input: House area (2000 sq ft)
Output: Price ($300,000)
βœ… Clear!

But for LLMs? πŸ€”


πŸ’‘ LLM’s Task: Predict Next Word

Given sentence:

"LLMs learn to predict one word at a time"

What’s the input? What’s the output? πŸ€”

Answer: We need to CREATE input-output pairs from the sentence itself!


πŸ“Š Where We Are

STAGE 1: Building Blocks
β”œβ”€β”€ Data Preparation
β”‚   β”œβ”€β”€ βœ… Tokenization (Chapter 7)
β”‚   β”œβ”€β”€ βœ… BPE (Chapter 8)
β”‚   β”œβ”€β”€ βœ… Input-Target Pairs (Chapter 9) ← TODAY!
β”‚   └── ⏳ Vector Embeddings (Next!)
β”œβ”€β”€ Attention Mechanisms
└── LLM Architecture

Almost done with data preparation! πŸŽ‰


Understanding Next-Word Prediction

🎯 The Core Idea

LLMs learn by predicting the NEXT word in a sequence

Sentence:

"LLMs learn to predict one word at a time"

How to train on this? πŸ€”


πŸ“ Creating Multiple Training Examples

From ONE sentence, create MULTIPLE training pairs!

Iteration 1:

Input:  "LLMs"
Target: "learn"

Iteration 2:

Input:  "LLMs learn"
Target: "to"

Iteration 3:

Input:  "LLMs learn to"
Target: "predict"

Iteration 4:

Input:  "LLMs learn to predict"
Target: "one"

See the pattern? ✨


πŸ”„ Visual Representation

Sentence: "LLMs learn to predict one word at a time"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 1                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs]                             β”‚
β”‚  Target: [learn]                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 2                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn]                     β”‚
β”‚  Target: [to]                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 3                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to]                β”‚
β”‚  Target: [predict]                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 4                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to] [predict]      β”‚
β”‚  Target: [one]                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

One sentence β†’ Multiple training examples! πŸŽ‰
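
Here's a minimal sketch of this idea in plain Python, working on words just for intuition (real training works on token IDs, as we'll see later in this chapter):

sentence = "LLMs learn to predict one word at a time"
words = sentence.split()

# Each position gives one training example:
# everything before it is the input, the word at that position is the target
for i in range(1, 5):
    context = words[:i]
    target = words[i]
    print(f"Input: {context} β†’ Target: {target}")

Output:

Input: ['LLMs'] β†’ Target: learn
Input: ['LLMs', 'learn'] β†’ Target: to
Input: ['LLMs', 'learn', 'to'] β†’ Target: predict
Input: ['LLMs', 'learn', 'to', 'predict'] β†’ Target: one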


🎯 Key Insights

1. Target is always NEXT word:

Input: "The cat"
Target: "sat" (next word!)

2. Input grows each iteration:

Iteration 1: [The]
Iteration 2: [The] [cat]
Iteration 3: [The] [cat] [sat]

3. Future words are MASKED:

When predicting "cat":
Can see: "The" βœ…
Cannot see: "sat", "on", "the", "mat" ❌

This is CAUSAL language modeling!


What is a Context Window?

πŸͺŸ Definition

Context Window = How many previous words the model can see to predict the next word

Think of it as the model’s β€œmemory”! 🧠


πŸ“ Example: Context Window = 4

Sentence: β€œLLMs learn to predict one word at a time”

Training examples with context window = 4:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 1                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs]                             β”‚
β”‚  Target: [learn]                            β”‚
β”‚  Context used: 1 word                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 2                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn]                     β”‚
β”‚  Target: [to]                               β”‚
β”‚  Context used: 2 words                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 3                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to]                β”‚
β”‚  Target: [predict]                          β”‚
β”‚  Context used: 3 words                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 4                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [LLMs] [learn] [to] [predict]      β”‚
β”‚  Target: [one]                              β”‚
β”‚  Context used: 4 words (MAX!)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Example 5                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  [learn] [to] [predict] [one]       β”‚
β”‚          (dropped "LLMs"!)                  β”‚
β”‚  Target: [word]                             β”‚
β”‚  Context used: 4 words (MAX!)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Context window = MAXIMUM words model can see! 🎯
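
In code, the cap is just a slice that keeps only the last context_window words before the target (a word-level sketch; real models work on token IDs):

words = "LLMs learn to predict one word at a time".split()
context_window = 4

for i in range(1, len(words)):
    # Only the last `context_window` words before position i are visible
    visible = words[max(0, i - context_window):i]
    target = words[i]
    print(f"Input: {visible} β†’ Target: {target}")

By the fifth step the window has already dropped "LLMs", exactly as in Example 5 above.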


πŸ”’ Real-World Context Sizes

Model          Context Window   Notes
-------------  ---------------  --------------
Our tutorial   4 tokens         For learning!
GPT-2          1024 tokens      ~750 words
GPT-3          2048 tokens      ~1500 words
GPT-4          8192 tokens      ~6000 words
GPT-4 Turbo    128K tokens      ~96000 words!
Claude 2       100K tokens      ~75000 words

Larger context = More memory! πŸ“ˆ


πŸ’‘ Why Context Window Matters

Small context (4 words):

Input: "The cat sat on"
Predict: "the"

Problem: Can't see full sentence! ❌

Large context (256 words):

Input: [entire paragraph about cats sitting]
Predict: "mat"

Better! Can see full context! βœ…

But: Larger context = More computation + memory! πŸ’°


🎯 Key Takeaway

Context Window = 4 means:
βœ… Model can see up to 4 previous words
βœ… Creates 4 prediction tasks per input-target pair
βœ… Balances memory and performance

GPT training uses context = 256 or more!

Auto-Regressive Training

πŸ”„ What is Auto-Regressive?

Auto-Regressive = Model’s output becomes its next input

β€œAuto” = Self
β€œRegressive” = Using previous values


πŸ“Š Visual Example

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 1                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The"                           β”‚
β”‚  Output: "cat"                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
        Add to input!
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 2                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The cat" ← (includes "cat"!)  β”‚
β”‚  Output: "sat"                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
        Add to input!
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 3                                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input:  "The cat sat" ← (includes both!)β”‚
β”‚  Output: "on"                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each output feeds back as input! πŸ”„


🎯 Why It’s Called β€œAuto-Regressive”

Compare two iterations:

Iteration 1:
Input:  [1]
Output: [2]

Iteration 2:
Input:  [1, 2] ← Output from Iteration 1!
Output: [3]

The model β€œregresses” on its own outputs!
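
Here's a toy sketch of this feedback loop, with a stand-in predict_next function where a trained model would normally go:

def predict_next(tokens):
    # Stand-in for a trained model: just returns the next integer
    return tokens[-1] + 1

sequence = [1]
for _ in range(3):
    next_token = predict_next(sequence)   # model output...
    sequence.append(next_token)           # ...is appended and becomes part of the next input
    print(sequence)

Output:

[1, 2]
[1, 2, 3]
[1, 2, 3, 4]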


πŸ“ Auto-Regressive = Self-Supervised

Key insight: No manual labeling needed!

Traditional supervised learning:
Data: Image of cat
Label: "cat" ← Someone labeled this! πŸ‘€

Auto-regressive (LLMs):
Data: "The cat sat on the mat"
Labels: Automatically from sentence structure! πŸ€–
  - Input: "The" β†’ Label: "cat"
  - Input: "The cat" β†’ Label: "sat"
  - Input: "The cat sat" β†’ Label: "on"

The sentence ITSELF provides the labels! ✨


πŸ’‘ Key Takeaway

Auto-Regressive Training:
βœ… Output becomes next input
βœ… Self-supervised (no manual labels!)
βœ… How LLMs learn from raw text
βœ… Enables unsupervised learning at scale

This is why LLMs can train on billions of words!

Creating Input-Target Pairs in Python

πŸ’» Simple Example

Goal: Create input-target pairs with context window = 4

# Sample text (already tokenized)
tokens = [1, 2, 3, 4, 5, 6, 7, 8]

# Context window
context_length = 4

# Create input (X) and target (Y)
X = tokens[:context_length]      # [1, 2, 3, 4]
Y = tokens[1:context_length+1]   # [2, 3, 4, 5]

print(f"Input:  {X}")
print(f"Target: {Y}")

Output:

Input:  [1, 2, 3, 4]
Target: [2, 3, 4, 5]

Target is input shifted by 1! βœ…


πŸ“Š Understanding the Pairs

Input: [1, 2, 3, 4]
Target: [2, 3, 4, 5]

This creates 4 prediction tasks:

Input Sequence   Target   Task
--------------   ------   ------------------------------
[1]              2        If input is 1, predict 2
[1, 2]           3        If input is 1, 2, predict 3
[1, 2, 3]        4        If input is 1, 2, 3, predict 4
[1, 2, 3, 4]     5        If input is 1, 2, 3, 4, predict 5

One input-target pair = 4 prediction tasks! 🎯


πŸ”„ Iterating Through Tasks

X = [1, 2, 3, 4]
Y = [2, 3, 4, 5]
context_length = 4

for i in range(1, context_length + 1):
    context = X[:i]
    target = Y[i-1]
    print(f"Input: {context} β†’ Target: {target}")

Output:

Input: [1] β†’ Target: 2
Input: [1, 2] β†’ Target: 3
Input: [1, 2, 3] β†’ Target: 4
Input: [1, 2, 3, 4] β†’ Target: 5

This is how LLMs train! ✨


πŸ“– Real Text Example

# Load and tokenize text
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

text = "and established himself in"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Create input-target pairs
context_length = 4
X = encoded[:context_length]
Y = encoded[1:context_length+1]

# Decode to see words
for i in range(1, context_length + 1):
    context = tokenizer.decode(X[:i])
    target = tokenizer.decode([Y[i-1]])
    print(f"'{context}' β†’ '{target}'")

Output:

Encoded: [290, 4920, 2241, 287]
'and' β†’ 'established'
'and established' β†’ 'himself'
'and established himself' β†’ 'in'

Makes sense! Each word predicts the next! βœ…


The Sliding Window Approach

πŸͺŸ What is Sliding Window?

Sliding Window = Move window across text to create multiple input-target pairs

Think of it like a camera viewport moving across text! πŸ“Ή


πŸ“Š Visual Example

Text: β€œIn the heart of the city stood the old library”

Context Window = 4

Step 1: Window at position 0

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ [In] [the] [heart] [of] β”‚ ← Input
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓ Shift by 1
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ [the] [heart] [of] [the]β”‚ ← Target
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 2: Slide window (stride = 4)

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ [the] [city] [stood] [the]β”‚ ← Input
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↓ Shift by 1
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ [city] [stood] [the] [old]β”‚ ← Target
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 3: Slide again (stride = 4)

                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚ [old] [library] [...] [...]β”‚ ← Input
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Keep sliding until end of text! 🎯
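
In code, the sliding window is just a loop that jumps forward by stride positions each time (a word-level sketch of the same text):

words = "In the heart of the city stood the old library".split()
context_length = 4
stride = 4

for i in range(0, len(words) - context_length, stride):
    input_window = words[i:i + context_length]
    target_window = words[i + 1:i + context_length + 1]
    print(f"Input:  {input_window}")
    print(f"Target: {target_window}")

Output:

Input:  ['In', 'the', 'heart', 'of']
Target: ['the', 'heart', 'of', 'the']
Input:  ['the', 'city', 'stood', 'the']
Target: ['city', 'stood', 'the', 'old']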


🎚️ Stride Parameter

Stride = How much to slide the window

Stride = 1 (overlap):

Window 1: [In] [the] [heart] [of]
Window 2:      [the] [heart] [of] [the]  ← Shifted by 1
Window 3:            [heart] [of] [the] [city]

Maximum training examples!
But: Lots of overlap (may overfit) ⚠️

Stride = 4 (no overlap):

Window 1: [In] [the] [heart] [of]
Window 2:                           [the] [city] [stood] [the]
Window 3:                                                        [old] [library] [...]

No overlap!
But: Fewer training examples

πŸ“Š Visual Comparison

Stride = 1:

Text: "In the heart of the city stood the old library"

Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]

Input 2:  [the] [heart] [of] [the]      ← 3/4 overlap!
Target 2: [heart] [of] [the] [city]

Stride = 4:

Text: "In the heart of the city stood the old library"

Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]

Input 2:  [the] [city] [stood] [the]    ← No overlap!
Target 2: [city] [stood] [the] [old]

🎯 Choosing Stride

Common practice:

Stride = Context Length

Example:
Context = 4 β†’ Stride = 4

Why?
βœ… No overlap (prevents overfitting)
βœ… Uses all data (no words skipped)
βœ… Balanced approach!

GPT training typically uses stride = context_length ✨
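
You can see the trade-off by counting how many windows each stride produces over the same tokens (a quick sketch with made-up token IDs):

tokens = list(range(100))   # pretend we have 100 token IDs
context_length = 4

for stride in (1, 2, 4):
    num_pairs = len(range(0, len(tokens) - context_length, stride))
    print(f"stride={stride}: {num_pairs} input-target pairs")

Output:

stride=1: 96 input-target pairs
stride=2: 48 input-target pairs
stride=4: 24 input-target pairs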


Implementing PyTorch DataLoader

🎯 Why DataLoader?

Problems with manual approach:

❌ Have to manually iterate
❌ No batching support
❌ No parallel processing
❌ Inefficient for large datasets

DataLoader solution:

βœ… Automatic batching
βœ… Parallel data loading
βœ… Shuffling support
βœ… Efficient memory usage

πŸ“¦ Two Components

1. Dataset = Defines how to get one sample
2. DataLoader = Manages batches and workers

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Dataset                           β”‚
β”‚  - How to load data                β”‚
β”‚  - How to get one sample           β”‚
β”‚  - Returns (input, target) pair    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DataLoader                        β”‚
β”‚  - Batch multiple samples          β”‚
β”‚  - Shuffle data                    β”‚
β”‚  - Parallel loading (workers)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’» Step 1: Create Dataset Class

import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize the entire text
        token_ids = tokenizer.encode(txt)
        
        # Create input-target pairs with a sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            # Store as tensors so the DataLoader can stack them into batches
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

πŸ” Understanding the Dataset

__init__: Setup (tokenize text, create pairs)
__len__: Return total number of samples
__getitem__: Return one sample (input, target)

Example (the text must contain more than max_length tokens, otherwise no pairs are created):

dataset = GPTDatasetV1(
    txt="In the heart of the city stood the old library",
    tokenizer=tokenizer,
    max_length=4,
    stride=4
)

# Get first sample
input_ids, target_ids = dataset[0]
print(f"Input:  {input_ids}")
print(f"Target: {target_ids}")

πŸ’» Step 2: Create DataLoader

import tiktoken
from torch.utils.data import DataLoader

def create_dataloader(txt, batch_size=4, max_length=256,
                      stride=128, num_workers=0):
    # Initialize tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=num_workers
    )
    
    return dataloader

βœ… Test the DataLoader

# Load text
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

# Create dataloader
dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4
)

# Get first batch
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Input shape:  {inputs.shape}")   # torch.Size([8, 4])
print(f"Target shape: {targets.shape}")  # torch.Size([8, 4])
print(f"\nFirst input:  {inputs[0]}")
print(f"First target: {targets[0]}")

Output:

Input shape:  torch.Size([8, 4])
Target shape: torch.Size([8, 4])

First input:  tensor([290, 4920, 2241, 287])
First target: tensor([4920, 2241, 287, 257])

Perfect! Batch of 8 input-target pairs! πŸŽ‰
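
To double-check the shift-by-one relationship, you can decode the first pair back to text (a small sketch continuing from the code above; the second line should read like the first, shifted one token to the right):

import tiktoken

# Decode the first input-target pair to verify the one-token shift
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.decode(inputs[0].tolist()))   # e.g. " and established himself in"
print(tokenizer.decode(targets[0].tolist()))  # e.g. " established himself in a"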


Batch Size vs Stride vs Num Workers

🎚️ Three Key Parameters

Let’s understand each one:

  1. Batch Size = Samples processed before parameter update
  2. Stride = How much to slide window
  3. Num Workers = Parallel data loading processes

1️⃣ Batch Size

Definition: Number of samples processed together

Small batch (batch_size=1):

Process 1 sample β†’ Update parameters
Process 1 sample β†’ Update parameters
Process 1 sample β†’ Update parameters

βœ… Less memory
❌ Noisy updates (slow convergence)
❌ Slower (no parallelization)

Large batch (batch_size=32):

Process 32 samples β†’ Update parameters
Process 32 samples β†’ Update parameters

βœ… Faster (parallelization)
βœ… Stable updates
❌ More memory needed

Common values: 4, 8, 16, 32, 64


πŸ“Š Batch Size Visual

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Batch Size = 1                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Sample 1 β†’ Update                     β”‚
β”‚  Sample 2 β†’ Update                     β”‚
β”‚  Sample 3 β†’ Update                     β”‚
β”‚  ...                                   β”‚
β”‚  Many updates (noisy!)                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Batch Size = 8                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Samples 1-8 β†’ Update                  β”‚
β”‚  Samples 9-16 β†’ Update                 β”‚
β”‚  ...                                   β”‚
β”‚  Fewer, stable updates!                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2️⃣ Stride

Definition: How much to slide window between samples

Stride = 1 (max overlap):

Text: "In the heart of the city stood"

Sample 1: [In] [the] [heart] [of]
Sample 2:      [the] [heart] [of] [the]  ← 3/4 overlap!
Sample 3:            [heart] [of] [the] [city]

βœ… Maximum training data
❌ Lots of overlap (overfitting risk)

Stride = 4 (no overlap):

Text: "In the heart of the city stood"

Sample 1: [In] [the] [heart] [of]
Sample 2:                           [the] [city] [stood] [the]

βœ… No overlap (less overfitting)
❌ Less training data

Best practice: stride = max_length


3️⃣ Num Workers

Definition: Number of parallel processes for data loading

num_workers=0 (single process):

Main Process:
  Load data β†’ Train β†’ Load data β†’ Train β†’ ...
  
⏱️ Slower (sequential)

num_workers=4 (parallel):

Main Process: Train β†’ Train β†’ Train β†’ ...
Worker 1: Load batch 1 ─┐
Worker 2: Load batch 2  β”œβ”€β†’ Ready batches
Worker 3: Load batch 3  β”‚
Worker 4: Load batch 4 β”€β”˜

⚑ Faster (parallel loading!)

Common values: 2, 4, 8 (based on CPU cores)
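
There is no universal best value; one common starting point is to tie it to the CPU count and tune from there (a sketch using Python's standard library):

import os

# Modest default: a few workers, capped by the number of CPU cores available
num_workers = min(4, os.cpu_count() or 1)
print(f"Using {num_workers} data-loading workers")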


πŸ“Š Complete Example

# Load data
with open("the_verdict.txt", "r") as f:
    text = f.read()

# Create dataloader with all parameters
dataloader = create_dataloader(
    text,
    batch_size=8,      # Process 8 samples at once
    max_length=256,    # Context window = 256 tokens
    stride=128,        # Slide by 128 (50% overlap)
    num_workers=4      # 4 parallel workers
)

# Iterate through batches
for batch_idx, (inputs, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f"  Inputs shape:  {inputs.shape}")
    print(f"  Targets shape: {targets.shape}")
    
    if batch_idx >= 2:  # Show first 3 batches
        break

Output:

Batch 0:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])
Batch 1:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])
Batch 2:
  Inputs shape:  torch.Size([8, 256])
  Targets shape: torch.Size([8, 256])

🎯 Parameter Guidelines

Parameter     Small Project   Medium         Large (GPT-like)
-----------   -------------   ------------   ----------------
batch_size    4-8             16-32          64-512
max_length    128             256-512        1024-2048
stride        = max_length    = max_length   = max_length
num_workers   2               4              8-16

Complete Data Pipeline

🎯 Full Implementation

import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize
        token_ids = tokenizer.encode(txt)
        
        # Create pairs with sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# 2. DataLoader Function
def create_dataloader(txt, batch_size=4, max_length=256,
                     stride=128, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=num_workers
    )
    return dataloader

# 3. Usage
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=256,
    stride=128
)

# 4. Training Loop (preview!)
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs shape: [batch_size, max_length]
    # targets shape: [batch_size, max_length]
    
    # This will be fed to the model!
    # model(inputs) β†’ predictions
    # loss = compute_loss(predictions, targets)
    # loss.backward()
    # optimizer.step()
    
    print(f"Batch {batch_idx}: {inputs.shape}")

πŸ“Š Data Flow Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. RAW TEXT                                    β”‚
β”‚  "The cat sat on the mat..."                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
          [TOKENIZATION - BPE]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. TOKEN IDs                                   β”‚
β”‚  [101, 202, 303, 404, ...]                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
       [SLIDING WINDOW - Dataset]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. INPUT-TARGET PAIRS                          β”‚
β”‚  Input:  [101, 202, 303, 404]                  β”‚
β”‚  Target: [202, 303, 404, 505]                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
         [BATCHING - DataLoader]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. BATCHED TENSORS                             β”‚
β”‚  Input tensor:  [batch_size, max_length]        β”‚
β”‚  Target tensor: [batch_size, max_length]        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
          [EMBEDDINGS - Next Chapter!]
                    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. READY FOR TRAINING!                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Chapter Summary

πŸŽ‰ What We Learned Today

This was a CRUCIAL chapter! Let’s recap:


1. Input-Target Pairs

Core concept:

For LLM training, we create pairs where:
- Input: Current sequence
- Target: Input shifted by 1

Example:
Input:  [1, 2, 3, 4]
Target: [2, 3, 4, 5]

This creates multiple prediction tasks!

2. Context Window

Definition:

Maximum number of tokens model can see

Small (4): Fast, less memory, less context
Large (256): Slow, more memory, more context

GPT-2: 1024 tokens
GPT-3: 2048 tokens
GPT-4: 8192+ tokens

3. Auto-Regressive Training

Key insight:

Output of step N β†’ Input of step N+1

Self-supervised learning:
βœ… No manual labels needed
βœ… Sentence structure provides labels
βœ… Scales to billions of words

4. Sliding Window

Approach:

Window = [In] [the] [heart] [of]
Slide β†’ [the] [heart] [of] [the]
Slide β†’ [heart] [of] [the] [city]

Parameters:
- Context length: Window size
- Stride: How much to slide

5. PyTorch DataLoader

Two components:

# Dataset: How to get one sample
class GPTDatasetV1(Dataset):
    def __getitem__(self, idx):
        return input, target

# DataLoader: Batch management
dataloader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4
)

6. Three Key Parameters

Batch Size:

How many samples before parameter update
Small: Fast updates, noisy
Large: Stable updates, more memory
Common: 8, 16, 32

Stride:

How much to slide window
stride = context_length (common)
Prevents overlap, uses all data

Num Workers:

Parallel data loading
More workers = faster loading
Common: 2, 4, 8

πŸ“Š Complete Pipeline

Text β†’ Tokenize β†’ Sliding Window β†’ Batching β†’ Ready for Training!

"The cat sat..."
      ↓
[101, 202, 303, ...]
      ↓
Input: [101, 202, 303, 404]
Target: [202, 303, 404, 505]
      ↓
Batch: [8, 256] tensors
      ↓
Next: Embeddings!

πŸ’‘ Key Takeaways

  1. LLMs train on next-word prediction (auto-regressive)
  2. One sentence β†’ multiple training examples (sliding window)
  3. Context window = model’s memory (GPT-2: 1024)
  4. Target = input shifted by 1 (self-supervised!)
  5. DataLoader enables efficient batching (parallel, fast)
  6. Stride = context_length (best practice)
  7. Ready for embeddings! (next chapter)

🎯 What We Learned (Checklist)

  • [x] Why input-target pairs are needed
  • [x] Next-word prediction concept
  • [x] Context window definition
  • [x] Auto-regressive training
  • [x] Creating pairs in Python
  • [x] Sliding window approach
  • [x] Stride parameter
  • [x] PyTorch Dataset class
  • [x] PyTorch DataLoader
  • [x] Batch size vs stride vs workers
  • [x] Complete data pipeline

πŸ”œ Next Chapter: Chapter 10

Topic: Vector Embeddings (Token Embeddings)

What we’ll learn:

  • Converting token IDs β†’ vectors
  • Why embeddings matter
  • Word2Vec concepts
  • Building embedding layer
  • Embedding dimensions
  • Pre-trained embeddings

From numbers to vectors! πŸš€


πŸ“ Practice Exercise

Try this before next chapter:

  1. Load a different text file
  2. Create Dataset with context=8, stride=4
  3. Create DataLoader with batch_size=4
  4. Print first 3 batches
  5. Verify input-target relationship (see the sketch after this list)
  6. Experiment with different strides
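
Here's a sketch for step 5, assuming you have already built dataloader in steps 2-3:

# Verify that each target sequence is the input shifted by one token
inputs, targets = next(iter(dataloader))
assert (inputs[0][1:] == targets[0][:-1]).all()
print("Shift-by-one relationship confirmed!")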

Share your results! πŸ’¬


πŸš€ Take Action Now!

  1. πŸ’» Run the Code - Implement Dataset and DataLoader
  2. πŸ§ͺ Experiment - Try different parameters
  3. πŸ“ Practice - Create your own dataset
  4. ❓ Ask Questions - Comment if unclear
  5. πŸ”– Bookmark - Critical reference material
  6. ⏭️ Get Ready - Next: Embeddings!

Quick Reference

Dataset Template:

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        # Create input-target pairs
        pass
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

DataLoader Usage:

dataloader = DataLoader(
    dataset,
    batch_size=8,      # Samples per batch
    shuffle=False,      # Don't shuffle
    drop_last=True,     # Drop incomplete batch
    num_workers=4       # Parallel workers
)

Key Parameters:

Parameter     Meaning              Typical Value
-----------   ------------------   --------------
max_length    Context window       256, 512, 1024
stride        Window slide         = max_length
batch_size    Samples/batch        8, 16, 32
num_workers   Parallel processes   2, 4, 8

Thank You!

You’ve completed Chapter 9 - Data Sampling! πŸŽ‰

You now know:

  • βœ… How to create input-target pairs
  • βœ… What context windows are
  • βœ… Auto-regressive training
  • βœ… Sliding window approach
  • βœ… PyTorch DataLoader implementation

Next chapter: Vector Embeddings

Almost done with data preparation! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Did you understand DataLoader?
  • Which part was most challenging?
  • Any questions about batching?
  • Share your experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 10: Token Embeddings
Chapter 11: Positional Encoding
Chapter 12: Self-Attention Mechanism
Chapter 13: Building GPT from Scratch

The journey continues! πŸ’»πŸ”₯


See you in Chapter 10 where we learn embeddings! πŸš€


Questions? Stuck on DataLoader? Drop them below! πŸ’ͺ