Chapter 9: Data Sampling - From Tokens to Training Data
Reading Time: 75 minutes
Coding Time: 90 minutes
Welcome to Chapter 9! Today we prepare data for actual LLM training!
What we've learned so far:
- Tokenization basics (Chapter 7)
- Byte Pair Encoding (Chapter 8)
- How to convert text → tokens
Today:
- Creating input-target pairs
- What is a context window?
- Sliding window approach
- Batch processing
- PyTorch DataLoader implementation
- Preparing data for training!
This is the FINAL step before embeddings!
Table of Contents
- Why Do We Need Input-Target Pairs?
- Understanding Next-Word Prediction
- What is a Context Window?
- Auto-Regressive Training
- Creating Input-Target Pairs in Python
- The Sliding Window Approach
- Implementing PyTorch DataLoader
- Batch Size vs Stride vs Num Workers
- Complete Data Pipeline
- Chapter Summary
Why Do We Need Input-Target Pairs?
The Problem
In other ML tasks, input-output is clear:
Image Classification:
Input: Picture of a cat
Output: "cat"
✅ Clear!
House Price Prediction:
Input: House area (2000 sq ft)
Output: Price ($300,000)
✅ Clear!
But for LLMs?
LLM's Task: Predict Next Word
Given sentence:
"LLMs learn to predict one word at a time"
What's the input? What's the output?
Answer: We need to CREATE input-output pairs from the sentence itself!
Where We Are
STAGE 1: Building Blocks
├── Data Preparation
│   ├── ✅ Tokenization (Chapter 7)
│   ├── ✅ BPE (Chapter 8)
│   ├── ✅ Input-Target Pairs (Chapter 9) ← TODAY!
│   └── ⏳ Vector Embeddings (Next!)
├── Attention Mechanisms
└── LLM Architecture
Almost done with data preparation!
Understanding Next-Word Prediction
The Core Idea
LLMs learn by predicting the NEXT word in a sequence
Sentence:
"LLMs learn to predict one word at a time"
How to train on this?
Creating Multiple Training Examples
From ONE sentence, create MULTIPLE training pairs!
Iteration 1:
Input: "LLMs"
Target: "learn"
Iteration 2:
Input: "LLMs learn"
Target: "to"
Iteration 3:
Input: "LLMs learn to"
Target: "predict"
Iteration 4:
Input: "LLMs learn to predict"
Target: "one"
See the pattern?
Visual Representation
Sentence: "LLMs learn to predict one word at a time"
Iteration 1:
  Input:  [LLMs]
  Target: [learn]
Iteration 2:
  Input:  [LLMs] [learn]
  Target: [to]
Iteration 3:
  Input:  [LLMs] [learn] [to]
  Target: [predict]
Iteration 4:
  Input:  [LLMs] [learn] [to] [predict]
  Target: [one]
One sentence → multiple training examples!
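If you want to see this in code, here is a minimal sketch (plain Python, splitting on whitespace as a stand-in for real tokenization) that turns the sentence above into prediction pairs:
sentence = "LLMs learn to predict one word at a time"
words = sentence.split()  # stand-in for real tokenization

for i in range(1, len(words)):
    context = words[:i]   # everything seen so far
    target = words[i]     # the very next word
    print(f"Input: {context} → Target: {target}")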
Key Insights
1. Target is always NEXT word:
Input: "The cat"
Target: "sat" (next word!)
2. Input grows each iteration:
Iteration 1: [The]
Iteration 2: [The] [cat]
Iteration 3: [The] [cat] [sat]
3. Future words are MASKED:
When predicting "cat":
Can see: "The" ✅
Cannot see: "sat", "on", "the", "mat" ❌
This is CAUSAL language modeling!
What is a Context Window?
Definition
Context Window = How many previous words the model can see to predict the next word
Think of it as the model's "memory"!
Example: Context Window = 4
Sentence: "LLMs learn to predict one word at a time"
Training examples with context window = 4:
Example 1:
  Input:  [LLMs]
  Target: [learn]
  Context used: 1 word
Example 2:
  Input:  [LLMs] [learn]
  Target: [to]
  Context used: 2 words
Example 3:
  Input:  [LLMs] [learn] [to]
  Target: [predict]
  Context used: 3 words
Example 4:
  Input:  [LLMs] [learn] [to] [predict]
  Target: [one]
  Context used: 4 words (MAX!)
Example 5:
  Input:  [learn] [to] [predict] [one]   (dropped "LLMs"!)
  Target: [word]
  Context used: 4 words (MAX!)
Context window = MAXIMUM words model can see!
Real-World Context Sizes
| Model | Context Window | Notes |
|---|---|---|
| Our tutorial | 4 tokens | For learning! |
| GPT-2 | 1024 tokens | ~750 words |
| GPT-3 | 2048 tokens | ~1500 words |
| GPT-4 | 8192 tokens | ~6000 words |
| GPT-4 Turbo | 128K tokens | ~96000 words! |
| Claude 2 | 100K tokens | ~75000 words |
Larger context = More memory!
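The word estimates in the table come from the common rule of thumb that one GPT-style token is roughly 0.75 English words; this is an approximation, not an exact conversion. A quick sketch:
# Rough heuristic: 1 token ≈ 0.75 English words (approximation only)
for model, tokens in [("GPT-2", 1024), ("GPT-3", 2048), ("GPT-4", 8192)]:
    print(f"{model}: {tokens} tokens ≈ {int(tokens * 0.75)} words")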
Why Context Window Matters
Small context (4 words):
Input: "The cat sat on"
Predict: "the"
Problem: Can't see the full sentence! ❌
Large context (256 words):
Input: [entire paragraph about cats sitting]
Predict: "mat"
Better! Can see the full context! ✅
But: Larger context = More computation + memory!
Key Takeaway
Context Window = 4 means:
✅ Model can see up to 4 previous words
✅ Creates 4 prediction tasks per input-target pair
✅ Balances memory and performance
Real GPT models use much larger contexts (GPT-2 trains with 1024 tokens); this chapter uses 4 here and 256 later so the examples stay easy to follow.
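Here is a minimal sketch of Example 5's behavior (plain Python on words, for illustration only): once the prefix grows past the context window, the oldest words drop out of view.
sentence = "LLMs learn to predict one word at a time"
words = sentence.split()
context_window = 4

for i in range(1, len(words)):
    visible = words[max(0, i - context_window):i]  # at most the last 4 words
    target = words[i]
    print(f"Input: {visible} → Target: {target}")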
Auto-Regressive Training
What is Auto-Regressive?
Auto-Regressive = Model's output becomes its next input
"Auto" = Self
"Regressive" = Using previous values
Visual Example
Step 1:
  Input:  "The"
  Output: "cat"
      ↓ Add to input!
Step 2:
  Input:  "The cat"       ← includes "cat"!
  Output: "sat"
      ↓ Add to input!
Step 3:
  Input:  "The cat sat"   ← includes both!
  Output: "on"
Each output feeds back as input!
Why It's Called "Auto-Regressive"
Compare two iterations:
Iteration 1:
Input: [1]
Output: [2]
Iteration 2:
Input: [1, 2] ← Output from Iteration 1!
Output: [3]
The model "regresses" on its own outputs!
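A tiny sketch of this feedback loop. The fake_model below is a placeholder (an assumption for illustration, not a real LLM) that returns a canned "next token", so you can watch each output being appended to the next input:
def fake_model(tokens):
    # Placeholder for a real LLM: always "predicts" last token ID + 1
    return tokens[-1] + 1

tokens = [1]                         # start with a single token ID
for _ in range(3):
    next_token = fake_model(tokens)  # model output...
    tokens.append(next_token)        # ...becomes part of the next input
    print(f"Input so far: {tokens}")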
Auto-Regressive = Self-Supervised
Key insight: No manual labeling needed!
Traditional supervised learning:
Data: Image of cat
Label: "cat" ← Someone labeled this!
Auto-regressive (LLMs):
Data: "The cat sat on the mat"
Labels: Automatically from sentence structure!
- Input: "The" → Label: "cat"
- Input: "The cat" → Label: "sat"
- Input: "The cat sat" → Label: "on"
The sentence ITSELF provides the labels!
Key Takeaway
Auto-Regressive Training:
✅ Output becomes next input
✅ Self-supervised (no manual labels!)
✅ How LLMs learn from raw text
✅ Enables unsupervised learning at scale
This is why LLMs can train on billions of words!
Creating Input-Target Pairs in Python
Simple Example
Goal: Create input-target pairs with context window = 4
# Sample text (already tokenized)
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
# Context window
context_length = 4
# Create input (X) and target (Y)
X = tokens[:context_length] # [1, 2, 3, 4]
Y = tokens[1:context_length+1] # [2, 3, 4, 5]
print(f"Input: {X}")
print(f"Target: {Y}")
Output:
Input: [1, 2, 3, 4]
Target: [2, 3, 4, 5]
Target is input shifted by 1! ✅
Understanding the Pairs
Input:  [1, 2, 3, 4]
Target: [2, 3, 4, 5]
This creates 4 prediction tasks:
| Input Sequence | Target | Task |
|---|---|---|
| [1] | 2 | If input is 1, predict 2 |
| [1, 2] | 3 | If input is 1, 2, predict 3 |
| [1, 2, 3] | 4 | If input is 1, 2, 3, predict 4 |
| [1, 2, 3, 4] | 5 | If input is 1, 2, 3, 4, predict 5 |
One input-target pair = 4 prediction tasks!
Iterating Through Tasks
X = [1, 2, 3, 4]
Y = [2, 3, 4, 5]
context_length = 4

for i in range(1, context_length + 1):
    context = X[:i]
    target = Y[i-1]
    print(f"Input: {context} → Target: {target}")
Output:
Input: [1] → Target: 2
Input: [1, 2] → Target: 3
Input: [1, 2, 3] → Target: 4
Input: [1, 2, 3, 4] → Target: 5
This is how LLMs train!
Real Text Example
# Load and tokenize text
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "and established himself in"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Create input-target pairs
context_length = 4
X = encoded[:context_length]
Y = encoded[1:context_length+1]

# Decode to see words
for i in range(1, context_length + 1):
    context = tokenizer.decode(X[:i])
    target = tokenizer.decode([Y[i-1]])
    print(f"'{context}' → '{target}'")
Output:
Encoded: [290, 4920, 2241, 287]
'and' → 'established'
'and established' → 'himself'
'and established himself' → 'in'
Makes sense! Each word predicts the next! ✅
The Sliding Window Approach
What is Sliding Window?
Sliding Window = Move window across text to create multiple input-target pairs
Think of it like a camera viewport moving across the text!
Visual Example
Text: "In the heart of the city stood the old library"
Context Window = 4
Step 1: Window at position 0
  Input:  [In] [the] [heart] [of]
      ↓ shift by 1
  Target: [the] [heart] [of] [the]
Step 2: Slide window (stride = 4)
  Input:  [the] [city] [stood] [the]
      ↓ shift by 1
  Target: [city] [stood] [the] [old]
Step 3: Slide again (stride = 4)
  Input:  [old] [library] [...] [...]
Keep sliding until end of text!
Stride Parameter
Stride = How much to slide the window
Stride = 1 (overlap):
Window 1: [In] [the] [heart] [of]
Window 2: [the] [heart] [of] [the] ← shifted by 1
Window 3: [heart] [of] [the] [city]
Maximum training examples!
But: Lots of overlap (may overfit)
Stride = 4 (no overlap):
Window 1: [In] [the] [heart] [of]
Window 2: [the] [city] [stood] [the]
Window 3: [old] [library] [...]
No overlap!
But: Fewer training examples
Visual Comparison
Stride = 1:
Text: "In the heart of the city stood the old library"
Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]
Input 2:  [the] [heart] [of] [the] ← 3/4 overlap!
Target 2: [heart] [of] [the] [city]
Stride = 4:
Text: "In the heart of the city stood the old library"
Input 1:  [In] [the] [heart] [of]
Target 1: [the] [heart] [of] [the]
Input 2:  [the] [city] [stood] [the] ← no overlap!
Target 2: [city] [stood] [the] [old]
Choosing Stride
Common practice:
Stride = Context Length
Example:
Context = 4 → Stride = 4
Why?
✅ No overlap (prevents overfitting)
✅ Uses all data (no words skipped)
✅ Balanced approach!
GPT training typically uses stride = context_length.
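To make the trade-off concrete, here is a small sketch (integer IDs as stand-ins for real tokens) using the same loop the Dataset class below will use:
def sliding_windows(token_ids, max_length, stride):
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        x = token_ids[i:i + max_length]          # input window
        y = token_ids[i + 1:i + max_length + 1]  # same window shifted by one
        pairs.append((x, y))
    return pairs

tokens = list(range(1, 11))  # pretend token IDs 1..10
print(len(sliding_windows(tokens, max_length=4, stride=1)))  # 6 overlapping pairs
print(len(sliding_windows(tokens, max_length=4, stride=4)))  # 2 non-overlapping pairs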
Implementing PyTorch DataLoader
Why DataLoader?
Problems with manual approach:
❌ Have to manually iterate
❌ No batching support
❌ No parallel processing
❌ Inefficient for large datasets
DataLoader solution:
✅ Automatic batching
✅ Parallel data loading
✅ Shuffling support
✅ Efficient memory usage
Two Components
1. Dataset = Defines how to get one sample
2. DataLoader = Manages batches and workers
Dataset
  - How to load data
  - How to get one sample
  - Returns (input, target) pair
        ↓
DataLoader
  - Batch multiple samples
  - Shuffle data
  - Parallel loading (workers)
Step 1: Create Dataset Class
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        # Tokenize entire text
        token_ids = tokenizer.encode(txt)

        # Create input-target pairs with sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            # Store as tensors so the DataLoader can stack them into batches
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
Understanding the Dataset
__init__: Setup (tokenize text, create pairs)
__len__: Return total number of samples
__getitem__: Return one sample (input, target)
Example (the text must be longer than max_length tokens, or the dataset will be empty):
dataset = GPTDatasetV1(
    txt="In the heart of the city stood the old library",
    tokenizer=tokenizer,
    max_length=4,
    stride=4
)

# Get first sample
input_ids, target_ids = dataset[0]
print(f"Input: {input_ids}")
print(f"Target: {target_ids}")
Step 2: Create DataLoader
import tiktoken
from torch.utils.data import DataLoader

def create_dataloader(txt, batch_size=4, max_length=256,
                      stride=128, num_workers=0):
    # Initialize tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,      # keep order so the first batch is reproducible
        drop_last=True,     # drop the last batch if it's smaller than batch_size
        num_workers=num_workers
    )
    return dataloader
Test the DataLoader
# Load text
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

# Create dataloader
dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4
)

# Get first batch
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Input shape: {inputs.shape}")   # torch.Size([8, 4])
print(f"Target shape: {targets.shape}") # torch.Size([8, 4])
print(f"\nFirst input: {inputs[0]}")
print(f"First target: {targets[0]}")
Output:
Input shape: torch.Size([8, 4])
Target shape: torch.Size([8, 4])
First input: tensor([290, 4920, 2241, 287])
First target: tensor([4920, 2241, 287, 257])
Perfect! A batch of 8 input-target pairs!
Batch Size vs Stride vs Num Workers
Three Key Parameters
Let's understand each one:
- Batch Size = Samples processed before parameter update
- Stride = How much to slide window
- Num Workers = Parallel data loading processes
1. Batch Size
Definition: Number of samples processed together
Small batch (batch_size=1):
Process 1 sample → Update parameters
Process 1 sample → Update parameters
Process 1 sample → Update parameters
✅ Less memory
❌ Noisy updates (slow convergence)
❌ Slower (no parallelization)
Large batch (batch_size=32):
Process 32 samples → Update parameters
Process 32 samples → Update parameters
✅ Faster (parallelization)
✅ Stable updates
❌ More memory needed
Common values: 4, 8, 16, 32, 64
Batch Size Visual
Batch Size = 1:
  Sample 1 → Update
  Sample 2 → Update
  Sample 3 → Update
  ...
  Many updates (noisy!)
Batch Size = 8:
  Samples 1-8  → Update
  Samples 9-16 → Update
  ...
  Fewer, stable updates!
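A rough back-of-the-envelope sketch (numbers made up for illustration): for a fixed number of samples, the batch size sets how many parameter updates one pass over the data produces.
num_samples = 1000  # hypothetical dataset size

for batch_size in (1, 8, 32):
    updates_per_epoch = num_samples // batch_size
    print(f"batch_size={batch_size:>2} → {updates_per_epoch} updates per epoch")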
2. Stride
Definition: How much to slide window between samples
Stride = 1 (max overlap):
Text: "In the heart of the city stood"
Sample 1: [In] [the] [heart] [of]
Sample 2: [the] [heart] [of] [the] ← 3/4 overlap!
Sample 3: [heart] [of] [the] [city]
✅ Maximum training data
❌ Lots of overlap (overfitting risk)
Stride = 4 (no overlap):
Text: "In the heart of the city stood"
Sample 1: [In] [the] [heart] [of]
Sample 2: [the] [city] [stood] [the]
✅ No overlap (less overfitting)
❌ Less training data
Best practice: stride = max_length
3. Num Workers
Definition: Number of parallel processes for data loading
num_workers=0 (single process):
Main Process: Load data → Train → Load data → Train → ...
Slower (sequential)
num_workers=4 (parallel):
Main Process: Train → Train → Train → ...
Worker 1: Load batch 1 ─┐
Worker 2: Load batch 2 ─┤ Ready batches
Worker 3: Load batch 3 ─┤
Worker 4: Load batch 4 ─┘
Faster (parallel loading!)
Common values: 2, 4, 8 (based on CPU cores)
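There is no single right value; one common heuristic (an assumption, not a hard rule) is to start small and never exceed your CPU core count:
import os

cpu_cores = os.cpu_count() or 1   # os.cpu_count() can return None on some platforms
num_workers = min(4, cpu_cores)   # start modest; tune from here
print(f"CPU cores: {cpu_cores} → using num_workers={num_workers}")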
Complete Example
# Load data
with open("the_verdict.txt", "r") as f:
    text = f.read()

# Create dataloader with all parameters
dataloader = create_dataloader(
    text,
    batch_size=8,     # Process 8 samples at once
    max_length=256,   # Context window = 256 tokens
    stride=128,       # Slide by 128 (50% overlap)
    num_workers=4     # 4 parallel workers
)

# Iterate through batches
for batch_idx, (inputs, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f"  Inputs shape: {inputs.shape}")
    print(f"  Targets shape: {targets.shape}")
    if batch_idx >= 2:  # Show first 3 batches
        break
Output:
Batch 0:
Inputs shape: torch.Size([8, 256])
Targets shape: torch.Size([8, 256])
Batch 1:
Inputs shape: torch.Size([8, 256])
Targets shape: torch.Size([8, 256])
Batch 2:
Inputs shape: torch.Size([8, 256])
Targets shape: torch.Size([8, 256])
Parameter Guidelines
| Parameter | Small Project | Medium | Large (GPT-like) |
|---|---|---|---|
| batch_size | 4-8 | 16-32 | 64-512 |
| max_length | 128 | 256-512 | 1024-2048 |
| stride | = max_length | = max_length | = max_length |
| num_workers | 2 | 4 | 8-16 |
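To see how max_length and stride interact with your data size, here is a small sketch that counts how many input-target pairs the sliding window produces; it mirrors the Dataset's loop range(0, len(token_ids) - max_length, stride):
def count_samples(num_tokens, max_length, stride):
    # Same range the Dataset's __init__ loop iterates over
    return len(range(0, num_tokens - max_length, stride))

print(count_samples(5000, max_length=256, stride=256))  # non-overlapping windows
print(count_samples(5000, max_length=256, stride=128))  # 50% overlap → twice as many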
Complete Data Pipeline
Full Implementation
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        # Tokenize
        token_ids = tokenizer.encode(txt)

        # Create pairs with sliding window
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# 2. DataLoader Function
def create_dataloader(txt, batch_size=4, max_length=256,
                      stride=128, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=num_workers
    )
    return dataloader

# 3. Usage
with open("the_verdict.txt", "r") as f:
    raw_text = f.read()

dataloader = create_dataloader(
    raw_text,
    batch_size=8,
    max_length=256,
    stride=128
)

# 4. Training Loop (preview!)
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs shape:  [batch_size, max_length]
    # targets shape: [batch_size, max_length]
    # This will be fed to the model!
    # model(inputs) → predictions
    # loss = compute_loss(predictions, targets)
    # loss.backward()
    # optimizer.step()
    print(f"Batch {batch_idx}: {inputs.shape}")
Data Flow Diagram
1. RAW TEXT
   "The cat sat on the mat..."
        ↓  [TOKENIZATION - BPE]
2. TOKEN IDs
   [101, 202, 303, 404, ...]
        ↓  [SLIDING WINDOW - Dataset]
3. INPUT-TARGET PAIRS
   Input:  [101, 202, 303, 404]
   Target: [202, 303, 404, 505]
        ↓  [BATCHING - DataLoader]
4. BATCHED TENSORS
   Input tensor:  [batch_size, max_length]
   Target tensor: [batch_size, max_length]
        ↓  [EMBEDDINGS - Next Chapter!]
5. READY FOR TRAINING!
Chapter Summary
What We Learned Today
This was a CRUCIAL chapter! Let's recap:
1. Input-Target Pairs
Core concept:
For LLM training, we create pairs where:
- Input: Current sequence
- Target: Input shifted by 1
Example:
Input: [1, 2, 3, 4]
Target: [2, 3, 4, 5]
This creates multiple prediction tasks!
2. Context Window
Definition:
Maximum number of tokens model can see
Small (4): Fast, less memory, less context
Large (256): Slow, more memory, more context
GPT-2: 1024 tokens
GPT-3: 2048 tokens
GPT-4: 8192+ tokens
3. Auto-Regressive Training
Key insight:
Output of step N → Input of step N+1
Self-supervised learning:
✅ No manual labels needed
✅ Sentence structure provides labels
✅ Scales to billions of words
4. Sliding Window
Approach:
Window = [In] [the] [heart] [of]
Slide → [the] [heart] [of] [the]
Slide → [heart] [of] [the] [city]
Parameters:
- Context length: Window size
- Stride: How much to slide
5. PyTorch DataLoader
Two components:
# Dataset: How to get one sample
class GPTDatasetV1(Dataset):
    def __getitem__(self, idx):
        return input, target

# DataLoader: Batch management
dataloader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4
)
6. Three Key Parameters
Batch Size:
How many samples before parameter update
Small: Fast updates, noisy
Large: Stable updates, more memory
Common: 8, 16, 32
Stride:
How much to slide window
stride = context_length (common)
Prevents overlap, uses all data
Num Workers:
Parallel data loading
More workers = faster loading
Common: 2, 4, 8
Complete Pipeline
Text → Tokenize → Sliding Window → Batching → Ready for Training!
"The cat sat..."
    ↓
[101, 202, 303, ...]
    ↓
Input:  [101, 202, 303, 404]
Target: [202, 303, 404, 505]
    ↓
Batch: [8, 256] tensors
    ↓
Next: Embeddings!
Key Takeaways
- LLMs train on next-word prediction (auto-regressive)
- One sentence → multiple training examples (sliding window)
- Context window = model's memory (GPT-2: 1024)
- Target = input shifted by 1 (self-supervised!)
- DataLoader enables efficient batching (parallel, fast)
- Stride = context_length (best practice)
- Ready for embeddings! (next chapter)
What We Learned (Checklist)
- [x] Why input-target pairs are needed
- [x] Next-word prediction concept
- [x] Context window definition
- [x] Auto-regressive training
- [x] Creating pairs in Python
- [x] Sliding window approach
- [x] Stride parameter
- [x] PyTorch Dataset class
- [x] PyTorch DataLoader
- [x] Batch size vs stride vs workers
- [x] Complete data pipeline
Next Chapter: Chapter 10
Topic: Vector Embeddings (Token Embeddings)
What we'll learn:
- Converting token IDs → vectors
- Why embeddings matter
- Word2Vec concepts
- Building embedding layer
- Embedding dimensions
- Pre-trained embeddings
From numbers to vectors!
Practice Exercise
Try this before next chapter:
- Load a different text file
- Create Dataset with context=8, stride=4
- Create DataLoader with batch_size=4
- Print first 3 batches
- Verify input-target relationship
- Experiment with different strides
Share your results!
Take Action Now!
- Run the Code - Implement Dataset and DataLoader
- Experiment - Try different parameters
- Practice - Create your own dataset
- Ask Questions - Comment if unclear
- Bookmark - Critical reference material
- Get Ready - Next: Embeddings!
Quick Reference
Dataset Template:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        # Create input-target pairs
        pass

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
DataLoader Usage:
dataloader = DataLoader(
    dataset,
    batch_size=8,    # Samples per batch
    shuffle=False,   # Don't shuffle
    drop_last=True,  # Drop incomplete batch
    num_workers=4    # Parallel workers
)
Key Parameters:
| Parameter | Meaning | Typical Value |
|---|---|---|
| max_length | Context window | 256, 512, 1024 |
| stride | Window slide | = max_length |
| batch_size | Samples/batch | 8, 16, 32 |
| num_workers | Parallel processes | 2, 4, 8 |
Thank You!
You've completed Chapter 9 - Data Sampling!
You now know:
- ✅ How to create input-target pairs
- ✅ What context windows are
- ✅ Auto-regressive training
- ✅ Sliding window approach
- ✅ PyTorch DataLoader implementation
Next chapter: Vector Embeddings
Almost done with data preparation!
Your Feedback Matters!
Drop a comment:
- Did you understand DataLoader?
- Which part was most challenging?
- Any questions about batching?
- Share your experiments!
We respond to every comment!
Coming Up
Chapter 10: Token Embeddings
Chapter 11: Positional Encoding
Chapter 12: Self-Attention Mechanism
Chapter 13: Building GPT from Scratch
The journey continues!
See you in Chapter 10 where we learn embeddings!
Questions? Stuck on DataLoader? Drop them below!