Chapter 5: GPT Architecture - From Transformers to ChatGPT

πŸ“– Reading Time: 45 minutes

Welcome to Chapter 5! This is where we dive DEEP into GPT - the architecture behind ChatGPT! πŸš€

So far we've learned:

  • What LLMs are (Chapter 2)
  • Pre-training vs Fine-tuning (Chapter 3)
  • Transformer architecture basics (Chapter 4)

Today, we'll learn:

  β€’ The complete GPT evolution (2017-2024)
  β€’ What makes GPT different from the original Transformer
  β€’ Zero-shot vs few-shot learning
  β€’ Why it cost $4.6 million to train GPT-3
  β€’ Auto-regressive models
  β€’ Emergent behavior (the magic!)

Let's begin! 🎯



Quick Recap

Where we are:

Chapter 1: Series introduction
Chapter 2: What are LLMs?
Chapter 3: Pre-training + Fine-tuning
Chapter 4: Transformer architecture intro
Chapter 5 (Today): GPT architecture deep-dive

From Chapter 4, remember:

  • Transformers have encoder + decoder
  • Self-attention is the key innovation
  • BERT uses encoder only
  • GPT uses decoder only

Today: We’ll explore GPT in detail!


The Evolution: From Transformers to GPT-4

πŸ—“οΈ The Complete Timeline (2017-2024)

Let's trace the entire evolution:

2017 β†’ Transformers (Google)
  ↓
2018 β†’ GPT-1 (OpenAI)
  ↓
2019 β†’ GPT-2 (OpenAI)
  ↓
2020 β†’ GPT-3 (OpenAI) - The breakthrough!
  ↓
2022 β†’ ChatGPT (GPT-3.5) - Goes viral!
  ↓
2023 β†’ GPT-4 (OpenAI)
  ↓
2024 β†’ GPT-4o (Current)

7 years that changed the world!


πŸ“œ 2017: "Attention Is All You Need" (Transformers)

Published by: Google Brain (8 researchers)

Key Innovation: Self-attention mechanism

Architecture: Encoder + Decoder

Purpose: Language translation (English β†’ German/French)

Impact: Started the revolution

Citations: 100,000+ (unprecedented!)


1️⃣ 2018: GPT-1 (Generative Pre-trained Transformer)

Paper: "Improving Language Understanding by Generative Pre-Training"

Published by: OpenAI

Key Innovation:

  β€’ Removed the encoder (decoder-only!)
  β€’ Unsupervised pre-training
  β€’ No labels needed!

Parameters: 117 million

Training approach: Next word prediction

Key Insight:

"We can train on unlabeled data! The sentence itself provides the labels."

Example:

Sentence: "The cat sat on the mat"
Training: Use "The cat sat on the" to predict "mat"
(No external labels needed!)

Impact: Researchers got excited, but the general public was largely unaware


2️⃣ 2019: GPT-2 - Scaling Up

Paper: "Language Models are Unsupervised Multitask Learners"

Key Innovation: Just scale it up!

Four Versions Released:

Model          Parameters    Size
GPT-2 Small    117 million   Baseline
GPT-2 Medium   345 million   ~3x larger
GPT-2 Large    762 million   ~6.5x larger
GPT-2 XL       1.5 billion   ~13x larger!

Breakthrough moment:

1.5 billion parameters was HUGE at the time! No one had seen language models this large.

Results: Significantly better at many tasks

Citations: 10,000+

Public awareness: Still mostly researchers


3️⃣ 2020: GPT-3 - The Game Changer

Paper: "Language Models are Few-Shot Learners"

THIS was the big one! πŸŽ‰

Parameters: 175 BILLION

Let's put this in perspective:

GPT-1:  117 million
GPT-2:  1.5 billion (13x larger than GPT-1)
GPT-3:  175 billion (117x larger than GPT-2!)

Multiple versions:

Model          Parameters
GPT-3 Small    125 million
GPT-3 Medium   350 million
GPT-3 Large    760 million
GPT-3 XL       1.3 billion
GPT-3 2.7B     2.7 billion
GPT-3 6.7B     6.7 billion
GPT-3 13B      13 billion
GPT-3 175B     175 billion

Training cost: $4.6 million (we'll explore why later!)

Key Claim:

"GPT-3 is a few-shot learner - give it a few examples and it can do amazing things!"

What people discovered:

Even though it was ONLY trained to predict the next word, it could:

  • βœ… Translate languages
  • βœ… Answer questions
  • βœ… Write code
  • βœ… Summarize text
  • βœ… Detect sentiment
  • βœ… Generate creative content

Nobody expected this! 🀯


4️⃣ 2022: ChatGPT (GPT-3.5) - Goes Viral

What changed: Made it accessible to everyone via a chat interface

Impact:

  β€’ 1 million users in 5 days
  β€’ 100 million users in 2 months
  β€’ Fastest-growing consumer app in history (at the time)!

Why it went viral:

  • Easy to use (just chat!)
  • Incredibly capable
  • Free to try

The world woke up to AI!


5️⃣ 2023: GPT-4 - Current State

Parameters: Estimated ~1 trillion (not officially confirmed)

Capabilities:

  β€’ Multimodal (text + images)
  β€’ Much more accurate
  β€’ Better reasoning
  β€’ Longer context (32K tokens)

Versions:

  β€’ GPT-4
  β€’ GPT-4 Turbo
  β€’ GPT-4o (optimized, 2024)

We're here now! ✨


πŸ“Š Visual Comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          GPT EVOLUTION (2018-2024)            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  GPT-1 (117M)   ●──                           β”‚
β”‚  GPT-2 (1.5B)   ●──────                       β”‚
β”‚  GPT-3 (175B)   ●───────────────────          β”‚
β”‚  GPT-4 (~1T?)   ●─────────────────────── (?)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       Parameters (log scale) β†’

πŸ’‘ Key Takeaway

The pattern:

Better results = More parameters + More data + More compute

GPT-1 β†’ GPT-2 β†’ GPT-3 β†’ GPT-4
(Each ~10-100x larger than the previous!)

In just 7 years (2017-2024), we went from translation models to ChatGPT!


Zero-Shot vs Few-Shot Learning

πŸ€” What Do These Terms Mean?

Recall from the GPT-3 section above: the paper was titled "Language Models are Few-Shot Learners".

But what does that mean?


🎯 Zero-Shot Learning

Definition:

The model can perform a task without any examples - just by understanding the instruction.


Example 1: Translation

Your prompt:

Translate English to French:
cheese

GPT-3 output:

fromage

Notice: No examples given! Just told it to translate, and it did it!


Example 2: Sentiment Analysis

Your prompt:

What's the sentiment of this review?
"This movie was terrible! Waste of money."

GPT output:

Negative

Again, no examples provided!


πŸŽ“ One-Shot Learning

Definition:

The model sees one example before performing the task.


Example: Translation with One Example

Your prompt:

Translate English to French.

Example:
sea otter β†’ loutre de mer

Now translate:
cheese β†’ ?

GPT output:

fromage

The model saw ONE example and learned the pattern!


πŸ“š Few-Shot Learning

Definition:

The model sees a few examples (typically 2-10) before performing the task.


Example: Translation with Multiple Examples

Your prompt:

Translate English to French.

Examples:
sea otter β†’ loutre de mer
peppermint β†’ menthe poivrΓ©e
plush giraffe β†’ girafe en peluche

Now translate:
cheese β†’ ?

GPT output:

fromage

With multiple examples, accuracy improves!
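
To make the difference concrete in code, here's a minimal Python sketch of a prompt builder (build_prompt is a hypothetical helper written just for this illustration, not a real library function): an empty example list gives a zero-shot prompt, one pair gives one-shot, and several pairs give few-shot.

def build_prompt(instruction, examples, query):
    """Assemble a zero-, one-, or few-shot prompt.

    An empty `examples` list gives zero-shot; one (source, target)
    pair gives one-shot; several pairs give few-shot.
    """
    lines = [instruction]
    if examples:
        lines.append("\nExamples:")
        lines += [f"{source} β†’ {target}" for source, target in examples]
    lines.append(f"\nNow translate:\n{query} β†’")
    return "\n".join(lines)

# Few-shot version of the prompt above
print(build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"),
     ("peppermint", "menthe poivrΓ©e"),
     ("plush giraffe", "girafe en peluche")],
    "cheese",
))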


πŸ“Š Visual Comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ZERO-SHOT: No examples                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Instruction: Translate to French: breakfast   β”‚
β”‚  Output: petit-dΓ©jeuner                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ONE-SHOT: 1 example                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Example: sea otter β†’ loutre de mer            β”‚
β”‚  Task: breakfast β†’ ?                           β”‚
β”‚  Output: petit-dΓ©jeuner                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  FEW-SHOT: Multiple examples                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Example 1: sea otter β†’ loutre de mer          β”‚
β”‚  Example 2: peppermint β†’ menthe poivrΓ©e        β”‚
β”‚  Example 3: plush giraffe β†’ girafe en peluche  β”‚
β”‚  Task: breakfast β†’ ?                           β”‚
β”‚  Output: petit-dΓ©jeuner (high confidence!)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’¬ Real Example: Unscrambling Words

Zero-shot:

Unscramble this word: "brekafast"
Output: breakfast

Few-shot:

Examples:
"omre" β†’ "more"
"sotry" β†’ "story"

Now unscramble: "brekafast"
Output: breakfast (more confident!)

πŸ€– What About GPT-4?

Let's ask GPT-4 itself!

Question to ChatGPT:

"Are you a zero-shot learner or a few-shot learner?"

GPT-4's Answer:

"I am primarily a few-shot learner. This means I perform better when given a few examples. However, I also have zero-shot capabilities and can handle many tasks without examples. Providing examples helps me generate more accurate responses."


πŸ“ˆ Performance Comparison

Illustrative task accuracy (the exact numbers vary by task, but the GPT-3 paper reports this general trend):

Zero-shot:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘  80%
One-shot:   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘  90%
Few-shot:   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  95%

(More examples = Better performance!)

🎯 Practical Tip

To get better results from ChatGPT:

Bad approach (zero-shot):

Write me a professional email.

Good approach (few-shot):

Write me a professional email.

Here's my style:
- Greeting: "Good morning,"
- Tone: Friendly but professional
- Closing: "Best regards,"

Example:
"Good morning John,
I hope this email finds you well..."

Now write an email to Sarah about the project delay.

Much better results! ✨


πŸ’‘ Key Takeaway

GPT-3 and GPT-4 are few-shot learners:

  β€’ Can do zero-shot (no examples)
  β€’ Excel at few-shot (with examples)
  β€’ More examples = More accuracy

When using ChatGPT, provide examples for better results!
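
If you'd rather experiment from code than from the chat UI, here's a sketch using the official openai Python package (v1+). It assumes an OPENAI_API_KEY is set in your environment; the model name is just an example, pick any chat model you have access to.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter β†’ loutre de mer\n"
    "peppermint β†’ menthe poivrΓ©e\n"
    "cheese β†’"
)

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)  # expected: "fromage"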


The Massive Dataset Behind GPT-3

πŸ“š What Data Was Used?

Training GPT-3 required an insane amount of data!

Total: ~300 billion tokens (for now, think of a token as roughly a word)

But where did this data come from?


🌐 Data Sources Breakdown

Source                   Tokens        Weight in Mix   What It Contains
Common Crawl (filtered)  410 billion   60%             Web pages from across the internet
WebText2                 19 billion    22%             Pages linked from high-karma Reddit posts
Books1 + Books2          67 billion    16%             Published books
Wikipedia                3 billion     3%              Wikipedia articles

(The percentages are sampling weights - how often each source was drawn from during training - not shares of the raw token counts.)

Total used for training: ~300 billion tokens
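
Those weights matter in practice: during training, each document is drawn from the sources in roughly these proportions. A toy Python sketch of that weighted sampling (illustrative only; random.choices does not require the weights to sum exactly to 1):

import random

# Approximate GPT-3 sampling weights from the table above (illustrative)
sources = ["Common Crawl", "WebText2", "Books", "Wikipedia"]
weights = [0.60, 0.22, 0.16, 0.03]

# Which source would each of 10 training documents come from?
print(random.choices(sources, weights=weights, k=10))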


1️⃣ Common Crawl - 410 Billion Tokens

What is it?

An open repository of web crawl data covering a huge slice of the public internet!

Visit: commoncrawl.org

Contains:

  β€’ News articles
  β€’ Blog posts
  β€’ Product reviews
  β€’ Forum discussions
  β€’ Scientific papers
  β€’ Social media content
  β€’ Basically EVERYTHING online!

Free and open since 2007!

Data stats:

  β€’ 250+ billion web pages
  β€’ 17+ years of data
  β€’ Constantly updated

GPT-3 drew 60% of its training mix from here (after heavy quality filtering)!


2️⃣ WebText2 - 19 Billion Tokens

What is it?

Text from web pages that were linked in Reddit posts, using Reddit karma as a quality filter!

Why Reddit links?

  β€’ Karma (upvote) system = built-in quality filter
  β€’ Diverse topics (tech, cooking, science, history, etc.)
  β€’ Human-curated content
  β€’ Conversational language

Linked pages include things like:

  β€’ Programming Q&A (e.g., Stack Overflow)
  β€’ Quality-filtered articles
  β€’ Long-form discussions

Minimum 3 karma on the linking post (quality threshold)


3️⃣ Books1 + Books2 - 67 Billion Tokens (~16% of the mix)

Why books?

  β€’ Proper grammar and structure
  β€’ Long-form storytelling
  β€’ Rich vocabulary
  β€’ Different writing styles
  β€’ Fiction + Non-fiction + Technical

Types included:

  β€’ Classic literature
  β€’ Modern novels
  β€’ Technical manuals
  β€’ Science textbooks
  β€’ Biography and history

4️⃣ Wikipedia - 3 Billion Tokens

Why Wikipedia?

  β€’ Factual information
  β€’ Well-structured
  β€’ Covers millions of topics
  β€’ Multiple languages
  β€’ Regularly updated
  β€’ Free and open

πŸ”’ What is a "Token"?

Simple explanation:

A token is a unit of text that the model reads.

For now, think: 1 token β‰ˆ 1 word

(We'll learn tokenization in detail in later chapters!)
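
If you'd like a sneak preview, the tiktoken package (pip install tiktoken) exposes the GPT-2/GPT-3 vocabulary, so you can inspect real tokens today; a quick sketch:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT-2/GPT-3 BPE vocabulary

tokens = enc.encode("The cat sat on the mat")
print(tokens)              # a list of integer IDs, roughly one per word here
print(enc.decode(tokens))  # round-trips back to the original text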


πŸ“Š Scale Visualization

300 billion words is:

Average book: 80,000 words
300 billion Γ· 80,000 = 3,750,000 books

If you read 1 book per day:
3,750,000 Γ· 365 β‰ˆ 10,273 YEARS!

GPT-3 "read" 10,000 years' worth of books! 🀯


🎯 Why So Much Data?

More data = Better language understanding

Example:

Small dataset (1 million words):

"The bank is near the river"
Model: Confused (river bank? financial bank?)

Massive dataset (300 billion words):

Has seen "bank" used in millions of different contexts!
Understands: river bank vs financial bank vs blood bank

Diversity = Better generalization!


πŸ’‘ Key Takeaway

GPT-3's secret:

  β€’ ~300 billion tokens of training data
  β€’ Drawn from across the internet (Common Crawl)
  β€’ Quality content (Reddit-linked pages, books, Wikipedia)
  β€’ This massive dataset enables human-like language ability

More data = Smarter AI!


GPT Architecture: Decoder-Only Design

πŸ—οΈ Key Difference from Transformers

Remember from Chapter 4:

Original Transformer (2017):
β”œβ”€β”€ Encoder (understands input)
└── Decoder (generates output)

GPT (2018+):
└── Decoder ONLY (no encoder!)

Why remove the encoder?

Because GPT's task is simpler: just predict the next word!


πŸ”„ Comparison Visual

Original Transformer:

Input Text
    ↓
[ENCODER] ← Converts to embeddings
    ↓
Vector Embeddings
    ↓
[DECODER] ← Generates output
    ↓
Output Text

GPT Architecture:

Input Text
    ↓
[DECODER ONLY] ← Does everything!
    ↓
Predicted Next Word

Simpler = Faster = Scalable!


πŸ“Š GPT-3 Architecture Details

Original Transformer (2017):

  β€’ 6 encoder layers
  β€’ 6 decoder layers
  β€’ Total: 12 layers

GPT-3 (2020):

  β€’ 0 encoder layers
  β€’ 96 decoder layers (!!)
  β€’ Total: 96 Transformer blocks

Each decoder layer has:

  β€’ Multi-head self-attention
  β€’ Feed-forward neural network
  β€’ Layer normalization
  β€’ Residual connections

96 layers of attention and feed-forward weights (plus embeddings) add up to 175 billion parameters!
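
The ingredient that makes a decoder "decoder-only" is the causal mask: each position may attend to itself and to earlier positions, never to later ones. A minimal NumPy sketch of that mask:

import numpy as np

# Causal (look-left-only) attention mask for a 5-token sequence.
# Row i marks the positions token i may attend to: itself and
# everything before it - never the future.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(mask)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]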


🎯 Why Decoder-Only Works

GPT's task: Predict the next word (left-to-right)

Input: "The cat sat on the"
Task: Predict next word
Output: "mat"

No need for an encoder because:

  β€’ We're not translating between languages
  β€’ We're not filling in blanks
  β€’ We're just continuing the sequence

The decoder alone is enough!


πŸ” What Happens Inside?

Step-by-step:

1. Input: "The cat sat"
   ↓
2. Tokenization: [The][cat][sat]
   ↓
3. Token IDs: [101][202][303]
   ↓
4. [Decoder Layer 1]
   ↓
5. [Decoder Layer 2]
   ↓
   ... (96 layers total!)
   ↓
96. [Decoder Layer 96]
   ↓
97. Output Layer
   ↓
98. Prediction: "on" (next word!)

Each layer refines the understanding!


πŸ’‘ Key Takeaway

GPT vs Transformer:

  β€’ Transformer: Encoder + Decoder (translation)
  β€’ GPT: Decoder only (text generation)
  β€’ GPT-3: 96 decoder layers
  β€’ Result: 175 billion parameters

Simpler architecture, but MUCH bigger scale!


Auto-Regressive Training

πŸ”„ What Does "Auto-Regressive" Mean?

Auto = Self
Regressive = Using previous outputs

Simple definition:

The model uses its previous output as input for the next prediction.


πŸ“– Visual Example

Sentence to generate: "This is an example"

The process, iteration by iteration:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 1                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input: "This"                            β”‚
β”‚    ↓                                      β”‚
β”‚  [Decoder]                                β”‚
β”‚    ↓                                      β”‚
β”‚  Output: "is"                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 2                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input: "This is"  ← (previous output!)   β”‚
β”‚    ↓                                      β”‚
β”‚  [Decoder]                                β”‚
β”‚    ↓                                      β”‚
β”‚  Output: "an"                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ITERATION 3                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Input: "This is an"  ← (includes "an")   β”‚
β”‚    ↓                                      β”‚
β”‚  [Decoder]                                β”‚
β”‚    ↓                                      β”‚
β”‚  Output: "example"                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Final: "This is an example" βœ…

See the pattern? Each output becomes part of the next input!


πŸ” Step-by-Step Breakdown

Starting sentence: β€œSecond law of robotics: A robot must”

Iteration 1:

Input:  "Second law of robotics:"
Output: "A"

Iteration 2:

Input:  "Second law of robotics: A"  ← ("A" was previous output!)
Output: "robot"

Iteration 3:

Input:  "Second law of robotics: A robot"  ← ("robot" added!)
Output: "must"

Iteration 4:

Input:  "Second law of robotics: A robot must"
Output: "obey"

This continues until the sentence is complete!


🎯 Why "Auto-Regressive"?

"Auto" (Self):

  β€’ The model feeds itself
  β€’ No external input needed after the start
  β€’ Self-contained generation

"Regressive" (Using previous):

  β€’ Each prediction depends on ALL previous words
  β€’ History influences the future
  β€’ Context builds up

πŸ€– Real Example: ChatGPT

When you chat with ChatGPT, it works the same way!

You type:

"Once upon a time"

GPT generates:

Iteration 1: "Once upon a time" β†’ "there"
Iteration 2: "Once upon a time there" β†’ "was"
Iteration 3: "Once upon a time there was" β†’ "a"
Iteration 4: "Once upon a time there was a" β†’ "princess"
...

Each word builds on the previous ones!


πŸ“Š Visual Flow

"This" β†’ [GPT] β†’ "is"
        ↓
"This is" β†’ [GPT] β†’ "an"
        ↓
"This is an" β†’ [GPT] β†’ "example"
        ↓
Complete: "This is an example"

Output of step N becomes input of step N+1!
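
You can watch this loop run with a small open model. A sketch using Hugging Face's transformers package, with GPT-2 standing in for GPT-3 and simple greedy decoding (always take the single most likely next token):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(5):  # generate 5 more tokens
    with torch.no_grad():
        logits = model(ids).logits            # scores for every position
    next_id = logits[0, -1].argmax()          # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # feed it back in

print(tokenizer.decode(ids[0]))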


πŸ” Unsupervised Learning Connection

Here’s the magic:

The sentence ITSELF provides the training labels!

Example sentence: β€œThe lion lives in the jungle”

Training data:

Input (Training)          Output (Label)
"The"                     "lion"
"The lion"                "lives"
"The lion lives"          "in"
"The lion lives in"       "the"
"The lion lives in the"   "jungle"

No external labeling needed! The sentence is broken into input/output pairs automatically.

This is why it's called unsupervised learning (more precisely, self-supervised: the supervision comes from the data itself)!
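
Here's how mechanically simple that self-labeling is, as a short Python sketch (word-level for readability; real models work on tokens):

sentence = "The lion lives in the jungle".split()

# Every prefix becomes an input; the next word becomes its label
pairs = [(" ".join(sentence[:i]), sentence[i])
         for i in range(1, len(sentence))]

for context, target in pairs:
    print(f"{context!r:28} β†’ {target!r}")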


πŸ’‘ Key Takeaway

Auto-regressive means:

  1. Predict the next word
  2. Add it to the input
  3. Predict again
  4. Repeat!

This is how GPT generates text, one word at a time, building on its own outputs!


Why Training Costs $4.6 Million

πŸ’° The Shocking Number

Training GPT-3 cost: ~$4.6 million

And that's just PRE-training! (Not including fine-tuning, infrastructure, salaries, etc.)

Why so expensive? Let's break it down:


πŸ“Š Cost Breakdown (rough estimates)

Resource              Details                        Estimated Cost
Compute (GPUs)        10,000+ V100 GPUs for weeks    ~$3 million
Electricity           Power for running GPUs 24/7    ~$500,000
Cloud infrastructure  AWS/Azure services             ~$1 million
Data storage          300B tokens of data            ~$100,000
Networking            Data transfer, bandwidth       ~$50,000
Monitoring & tools    Training infrastructure        ~$50,000
TOTAL                                                ~$4.6 million

(These are ballpark figures, and they don't include salaries, research costs, or failed experiments!)


πŸ–₯️ 1. GPU Costs - The Biggest Expense

GPUs used: NVIDIA V100

Specs:

  β€’ Price: ~$10,000 per GPU
  β€’ Memory: 16-32 GB
  β€’ Performance: ~125 teraflops (peak)

How many needed: ~10,000 GPUs

Why so many?

175 billion parameters
Γ—
300 billion tokens
=
An insane amount of computation!

Training time: 30+ days continuously

Cost calculation:

GPU rental on cloud: ~$3/hour per GPU
10,000 GPUs Γ— $3/hour Γ— 24 hours Γ— 30 days
= $21.6 million at list rental prices!

(OpenAI likely paid far less: bulk discounts + Microsoft's infrastructure)
Estimated actual compute cost: ~$3 million

⚑ 2. Electricity Costs

Power consumption:

1 GPU = ~300 watts
10,000 GPUs = 3,000,000 watts = 3 megawatts!

That's enough electricity to power:
- 2,000 homes continuously
- Or a small town!

24/7 for a month:

3 megawatts Γ— 24 hours Γ— 30 days = 2,160 megawatt-hours

At ~$0.10/kWh (commercial rates):
2,160,000 kWh Γ— $0.10 = $216,000

Plus cooling, backup power, etc.
Total: ~$500,000

☁️ 3. Cloud Infrastructure

What's needed:

  β€’ Data storage (petabytes!)
  β€’ High-speed networking
  β€’ Load balancers
  β€’ Backup systems
  β€’ Monitoring tools

Estimated cost: ~$1 million


πŸ“ˆ 4. The Computational Challenge

Why does it take so much compute?

Math breakdown (using the standard rule of thumb of ~6 operations per parameter per training token):

Parameters: 175 billion
Training tokens: 300 billion

Total operations:
6 Γ— 175B Γ— 300B β‰ˆ 3.15 Γ— 10^23 operations!

That's 315,000,000,000,000,000,000,000 calculations!

Even with powerful GPUs, this takes WEEKS!
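
You can reproduce this back-of-envelope estimate in a few lines of Python. The 30% GPU utilization below is an assumption; real large-scale runs often achieve less, which stretches training into weeks:

params = 175e9   # GPT-3 parameters
tokens = 300e9   # training tokens

flops = 6 * params * tokens          # ~6 FLOPs per parameter per token
print(f"Total compute: {flops:.2e} FLOPs")   # ~3.15e+23

v100_peak = 125e12                   # V100 peak: ~125 teraflops
effective = v100_peak * 0.30         # assume ~30% utilization (optimistic)
gpu_days = flops / effective / 86400
print(f"{gpu_days:,.0f} GPU-days, i.e. ~{gpu_days / 10_000:.0f} days on 10,000 GPUs")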


πŸ”„ 5. The Training Process

What happens during training:

For each batch drawn from the 300 billion tokens:
  1. Forward pass (prediction)
  2. Calculate error
  3. Backward pass (gradient calculation)
  4. Update 175 billion parameters
  5. Repeat!

This happens BILLIONS of times!

Example:

Sentence: "The cat sat on the mat"
Broken into 5 predictions:
  "The" β†’ predict "cat"
  "The cat" β†’ predict "sat"
  "The cat sat" β†’ predict "on"
  ... etc.

Now multiply by 300 BILLION tokens!
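
In code, one such update looks roughly like this PyTorch sketch. Here `model` is assumed to be any decoder-only LM that maps token IDs to logits of shape (batch, seq, vocab); this is an illustration, not OpenAI's actual training code:

import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    # Shift by one: position i must predict token i+1
    inputs, labels = token_ids[:, :-1], token_ids[:, 1:]

    logits = model(inputs)                     # 1. forward pass (prediction)
    loss = F.cross_entropy(                    # 2. calculate error
        logits.reshape(-1, logits.size(-1)),   #    flatten to (batch*seq, vocab)
        labels.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                            # 3. backward pass (gradients)
    optimizer.step()                           # 4. update the parameters
    return loss.item()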


πŸ“‰ 6. Cost Comparison

GPT-1 (2018):   117M parameters  β†’  ~$50,000 (rough estimate)
GPT-2 (2019):   1.5B parameters  β†’  ~$250,000 (rough estimate)
GPT-3 (2020):   175B parameters  β†’  ~$4.6 MILLION
GPT-4 (2023):   ~1T parameters?  β†’  $100 MILLION? (estimated)

Exponential growth in cost!


🏒 Who Can Afford This?

Only a few companies:

βœ… OpenAI (Microsoft-backed)
βœ… Google (tech giant)
βœ… Meta (tech giant)
βœ… Anthropic ($7B funding)
βœ… Mistral AI ($400M funding)

Total globally: ~10-15 companies

Everyone else: Uses pre-trained models and fine-tunes them!


πŸ’‘ Good News for You!

You DON'T need to pre-train!

What you CAN do:

  1. Use the OpenAI API
  2. Fine-tune existing models
  3. Build applications

Cost: $100 - $10,000 (up to 460x cheaper!)

In this series:

  β€’ We'll learn HOW pre-training works
  β€’ But practically, you'll fine-tune existing models
  β€’ Understanding the process helps you use AI better!

🎯 Key Takeaway

Why ~$4.6 million:

  β€’ 175 billion parameters to optimize
  β€’ 300 billion tokens to process
  β€’ ~10,000 GPUs running for weeks
  β€’ Massive electricity consumption
  β€’ Enterprise cloud infrastructure

This is why only big companies do pre-training!


Emergent Behavior: The Surprise

🀯 The Unexpected Discovery

Here's the crazy part:

GPT was ONLY trained to predict the next word.

That's it! Just this one simple task:

Input: "The cat sat on the"
Task: Predict next word
Output: "mat"

But then researchers discovered it could do SO MUCH MORE!


✨ What is Emergent Behavior?

Definition:

The ability of a model to perform tasks it was never explicitly trained to do.

As the research literature puts it:

"Emergent behavior is the ability to perform tasks that the model wasn't explicitly trained to perform."


πŸ“‹ Tasks GPT Was NOT Trained For (But Can Do!)

Let me show you what GPT can do, even though it was ONLY trained for next-word prediction:


1. Language Translation

NOT trained for this, but:

Me: "Translate 'breakfast' to French"

GPT-4: "petit-dΓ©jeuner"

How?! It learned translation as a side effect of reading billions of multilingual texts!


2. Multiple Choice Questions

NOT trained for this, but:

Me: "Generate 3 MCQs on gravity"

GPT-4:

Q1: What is the acceleration due to gravity on Earth?
A) 8.9 m/sΒ²
B) 9.8 m/sΒ²  ← Correct
C) 10.5 m/sΒ²
D) 7.5 m/sΒ²

Amazing! It learned question formats from reading educational texts!


3. Code Generation

NOT trained for this, but:

Me: "Write Python code to reverse a string"

GPT-4:

def reverse_string(s):
    return s[::-1]  # a slice with step -1 walks the string backwards

# Test
print(reverse_string("Hello"))  # Output: olleH

It learned programming by reading billions of code examples online!


4. Summarization

NOT trained for this, but:

Me: "Summarize this 10-page article in 3 sentences"

GPT-4: [Provides a solid 3-sentence summary]


5. Sentiment Analysis

NOT trained for this, but:

Me: "What's the sentiment: 'This movie was terrible!'"

GPT-4: "Negative sentiment"


6. Creative Writing

NOT trained for this, but:

Me: "Write a poem about AI in the style of Shakespeare"

GPT-4: [Writes a Shakespearean sonnet about AI]


πŸ”¬ How Did This Happen?

Nobody knows for sure! πŸ€·β€β™‚οΈ

Theories:

Theory 1: Pattern Recognition

While learning "next word", it discovered:
- Grammar patterns
- Logical structures
- Cause-and-effect
- Question-answer formats

Theory 2: Massive Scale

175 billion parameters + 300 billion tokens
= Enough complexity to learn abstract reasoning

Theory 3: Diverse Data

Trained on:
- Code repositories (learned programming)
- Wikipedia (learned facts)
- Books (learned storytelling)
- Reddit (learned conversation)

πŸ“Š OpenAI's Own Surprise

From OpenAI's blog (2018):

"We noticed that we can use the underlying language model to begin to perform tasks without ever training on them. For example, performance on tasks like picking the right answer to a multiple choice question steadily increases as the underlying language model improves."

Even the creators were surprised! 🀯


πŸ”¬ Active Research Area

Search "emergent behavior LLMs" on Google Scholar and you'll find:

  β€’ Thousands of research papers
  β€’ Nobody fully understands it yet!
  β€’ A major open question in AI

Research questions:

  β€’ Why does it happen?
  β€’ How can we predict it?
  β€’ What other behaviors will emerge at larger scales?
  β€’ Is this a path to AGI (Artificial General Intelligence)?

πŸ’‘ Real-World Impact

Because of emergent behavior, we can:

βœ… Use ChatGPT for homework help
βœ… Generate code with AI
βœ… Create content automatically
βœ… Build AI assistants
βœ… Automate customer support
βœ… And much more!

All without training for each specific task!


🎯 Key Takeaway

Emergent behavior:

  β€’ Models learn to do tasks they weren't trained for
  β€’ Happens at large scale (billions of parameters)
  β€’ Still not fully understood
  β€’ An active research area
  β€’ This is why LLMs are so powerful and surprising!

GPT's secret: Simple training (next word) + Massive scale = Unexpected superpowers!


Open Source vs Closed Source

πŸ”’ Two Worlds of LLMs

Closed source:

  β€’ GPT-4 (OpenAI)
  β€’ Claude (Anthropic)
  β€’ Gemini (Google)

Open source (more precisely, open weights):

  β€’ Llama (Meta)
  β€’ Mistral
  β€’ Falcon

πŸ“ˆ The Gap is Closing!

Performance
    β”‚
100%β”‚  Closed source (GPT-4) ─────────●
    β”‚                            ●
 90%β”‚                       ●
    β”‚                  ●  Open source (Llama)
 80%β”‚             ●
    β”‚        ●
 70%β”‚   ●
    └──────────────────────────────────→
      2022      2023      2024      2025   Time

2022: Big gap
2024: Almost closed!
2025: Open source might surpass closed source!


πŸ¦™ Llama 3.1: The Game Changer

Released: 2024 (Meta)

Parameters: 405 billion

Performance: Comparable to GPT-4!

Cost: FREE to download (open weights)

Impact: Democratizing AI


πŸ’‘ What This Means for You

Now you can:

  β€’ Download Llama 3.1
  β€’ Run it locally
  β€’ Fine-tune it for your needs
  β€’ No API costs!
  β€’ Full control

The future is open! 🌟


Chapter Summary

πŸŽ“ What We Learned Today

This was a MASSIVE chapter! Let's recap:


1. GPT Evolution (2017-2024)

2017: Transformers (Google) - Started it all
2018: GPT-1 (OpenAI) - Decoder-only, 117M params
2019: GPT-2 (OpenAI) - Scaled to 1.5B params
2020: GPT-3 (OpenAI) - BOOM! 175B params
2022: ChatGPT (GPT-3.5) - Went viral
2023: GPT-4 (OpenAI) - Current state
2024: GPT-4o - Optimized version

7 years, from translation to ChatGPT!


2. Zero-Shot vs Few-Shot

Zero-shot: No examples β†’ Works okay
One-shot: 1 example β†’ Works better
Few-shot: Multiple examples β†’ Works best!

GPT-3/4 are primarily FEW-SHOT learners
(but can do zero-shot too)

Tip: Provide examples for better ChatGPT results!


3. Massive Training Data

GPT-3 trained on:
- 300 billion tokens (~words)
- Common Crawl: 60%
- WebText2: 22%
- Books: 16%
- Wikipedia: 3%

= 10,000 years of reading!

4. Decoder-Only Architecture

Transformer: Encoder + Decoder
GPT: Decoder ONLY

GPT-3: 96 decoder layers
Result: 175 billion parameters

5. Auto-Regressive Training

Output of step N β†’ Input of step N+1

"This" β†’ "is"
"This is" β†’ "an"
"This is an" β†’ "example"

= Auto-regressive!

Also self-supervised (the sentence provides its own labels!)


6. Training Cost: $4.6 Million

Why so expensive:
- 10,000 GPUs for weeks
- Electricity for a small town
- 175B parameters to optimize
- 300B tokens to process
- Cloud infrastructure

Only ~10 companies globally can afford this!

7. Emergent Behavior

Trained ONLY for: Next word prediction

Can ALSO do:
βœ… Translation
βœ… Code generation
βœ… MCQ creation
βœ… Summarization
βœ… Sentiment analysis
βœ… And much more!

How? Nobody knows for sure! (Active research)

8. Open vs Closed Source

2022: Closed >>> Open
2024: Closed β‰ˆ Open
2025: Open might win!

Llama 3.1: 405B params, rivals GPT-4, FREE!

πŸ“š Before Next Chapter

Make sure you understand:

  β€’ [ ] The GPT evolution timeline
  β€’ [ ] Zero-shot vs few-shot learning
  β€’ [ ] GPT's training data sources
  β€’ [ ] Decoder-only architecture
  β€’ [ ] Auto-regressive training
  β€’ [ ] Why training is expensive
  β€’ [ ] What emergent behavior is
  β€’ [ ] Open vs closed source

πŸ”œ What's Next?

In Chapter 6, we'll start CODING!

  β€’ Deep dive into tokenization
  β€’ Building your own tokenizer
  β€’ BPE and WordPiece algorithms
  β€’ Hands-on Python implementation
  β€’ Working with real text data

Get ready to code! πŸ’»


πŸš€ Take Action Now!

  1. πŸ’¬ Comment Below - Which concept amazed you most? Emergent behavior? The $4.6M cost?
  2. βœ… Test Your Understanding - Can you explain GPT to a friend?
  3. πŸ”– Bookmark - This is reference material!
  4. πŸ§ͺ Experiment - Try few-shot prompting in ChatGPT
  5. ⏭️ Stay Tuned - Coding starts next chapter!

Quick Reference

Key Terms:

Term               Meaning
GPT                Generative Pre-trained Transformer
Decoder-only       No encoder, just decoder blocks
Auto-regressive    Output becomes the next input
Zero-shot          Task without examples
Few-shot           Task with examples
Emergent behavior  Abilities not explicitly trained
Token              Unit of text (~1 word)

Important Numbers:

  • GPT-3 parameters: 175 billion
  • Training tokens: 300 billion
  • Training cost: $4.6 million
  • Training time: 30+ days
  • GPUs needed: ~10,000
  • Decoder layers: 96

Thank You!

You've completed Chapter 5! πŸŽ‰

You now understand the complete GPT architecture, its evolution, its training process, and why it's so powerful. Next chapter, we start coding!

Remember:

  β€’ GPT = Decoder-only Transformer
  β€’ Trained auto-regressively
  β€’ Exhibits emergent behavior
  β€’ Costs millions to pre-train
  β€’ But you can fine-tune for thousands!

See you in Chapter 6 where we start building! πŸš€


Questions? Drop them in the comments! We respond to every single one.