Chapter 3: The Two Stages of Building an LLM

πŸ“– Reading Time: 35 minutes

Welcome to Chapter 3! In the previous chapters, we learned what LLMs are and why they’re revolutionary. Now it’s time to understand how they’re actually built.

The big reveal: Building an LLM is not one giant step - it’s a two-stage process:

  1. Pre-training (Stage 1)
  2. Fine-tuning (Stage 2)

By the end of this chapter, you’ll understand:

  • What pre-training is and why it’s called β€œpre”
  • How LLMs learn from billions of words
  • What fine-tuning is and when it’s needed
  • Real examples from top companies
  • The complete lifecycle from data to deployment

Let’s dive in! πŸš€



Quick Recap: Where We Are

In Chapter 1, we introduced the LLM series and our goals.

In Chapter 2, we learned:

  • What LLMs are (neural networks for text)
  • Why they’re called β€œLarge” (billions of parameters)
  • The secret sauce (Transformer architecture)
  • Difference between AI, ML, DL, and LLM

In Chapter 3 (Today), we’ll learn:

  • How these massive models are actually built
  • The step-by-step process from raw data to ChatGPT

The Two-Stage Building Process

🎭 The Two Acts of Building an LLM

Think of building an LLM like training for the Olympics:

Stage 1: General Training (Pre-training)

  • Like an athlete doing years of general fitness training
  • Builds overall strength, endurance, speed
  • Not focused on one specific sport yet

Stage 2: Specialized Training (Fine-tuning)

  • Like an athlete specializing in javelin throw or swimming
  • Focuses on specific skills for a particular event
  • Refines what was learned in general training

For LLMs:

Stage 1: Pre-training
└── Train on EVERYTHING (entire internet)
    └── Result: General-purpose AI

Stage 2: Fine-tuning
└── Train on SPECIFIC data (your company's data)
    └── Result: Specialized AI for your needs

πŸ€” Why Two Stages? Why Not Just One?

Great question!

Analogy: Education System

General Education (Pre-training):

  • Kindergarten to 12th grade
  • Learn everything: Math, Science, Languages, History
  • Become a well-rounded person

Specialization (Fine-tuning):

  • College/University - Choose Engineering
  • Medical school - Become a doctor
  • Law school - Become a lawyer

Same logic for LLMs:

You need general knowledge first (pre-training), then specialize (fine-tuning) for your specific use case.


Stage 1: Pre-training Explained

πŸ“š What is Pre-training?

Simple Definition:

Training the LLM on a massive and diverse dataset so it learns general language understanding.

What β€œmassive” means:

GPT-3 was trained on 300 billion words! (Strictly speaking, 300 billion tokens - word pieces - but we’ll treat them as words for now.)

Let’s put that in perspective:

Average book: 80,000 words
300 billion words = 3,750,000 books

If you read one book per day:
It would take 10,274 YEARS to read all that!

🌐 Where Does This Training Data Come From?

GPT-3’s Training Data Sources:

| Source          | Words       | Share of Training Mix | What It Contains                        |
|-----------------|-------------|-----------------------|-----------------------------------------|
| Common Crawl    | 410 billion | 60%                   | Web pages from a large-scale crawl      |
| WebText2        | 20 billion  | 22%                   | Pages linked from upvoted Reddit posts  |
| Books1 & Books2 | 67 billion  | 16%                   | Published books                         |
| Wikipedia       | 3 billion   | 3%                    | Wikipedia articles                      |

Total: 500 billion words (of which 300 billion were used for training)


πŸ” Let’s Explore These Sources

1. Common Crawl

What is it?

  • An open repository of web data
  • Crawls and stores content from billions of websites
  • Anyone can access it for free

Example content:

  • News articles
  • Blog posts
  • Product reviews
  • Social media discussions
  • Scientific papers
  • Forums and Q&A sites

Try it: Visit commoncrawl.org
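Want to see what this web data actually looks like? Here’s a minimal sketch that streams a few documents from a cleaned Common Crawl snapshot. It assumes you’ve installed the Hugging Face datasets library (pip install datasets) and uses the public allenai/c4 corpus - a filtered Common Crawl derivative - purely as an accessible example:

```python
# A peek at Common Crawl-style web text (assumes: pip install datasets).
# "allenai/c4" is a cleaned Common Crawl snapshot hosted on Hugging Face,
# used here only as an easy-to-access example of crawled web data.
from datasets import load_dataset

web = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Stream the first three documents instead of downloading terabytes.
for i, doc in enumerate(web):
    print(doc["text"][:120].replace("\n", " "), "...")
    if i == 2:
        break
```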


2. WebText2

What is it?

  • High-quality web text discovered through Reddit
  • Only pages linked in upvoted posts made the cut (quality filter)
  • Includes sites like Stack Overflow (programming Q&A)

Why Reddit?

  • Reddit’s upvote system = quality filter
  • Diverse topics (technology, cooking, science, history)
  • Human-written, conversational language

3. Books

Why include books?

  • Proper grammar and structure
  • Long-form storytelling
  • Diverse vocabulary
  • Different writing styles (fiction, non-fiction, technical)

Example books included:

  • Classic literature
  • Technical manuals
  • Science textbooks
  • Fiction novels

4. Wikipedia

Why Wikipedia?

  • Factual, well-structured information
  • Covers millions of topics
  • Multiple languages
  • Regularly updated

🎯 The Training Task: Next Word Prediction

Here’s the fascinating part:

LLMs learn by playing a simple game: β€œGuess the next word”

Example:

Given: "The lion is in the ___"
LLM predicts: "forest"

Given: "I went to the ___"
LLM predicts: "store" (or "park", "school", "mall")

Given: "The capital of France is ___"
LLM predicts: "Paris"

That’s it!

Just train on predicting the next word, billions of times, with billions of examples.
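To make this concrete, here’s a toy version of the game in plain Python. Real LLMs use a neural network trained on billions of words; this tiny bigram counter plays the same β€œguess the next word” game by simply counting (the mini-corpus and function name are invented for illustration):

```python
# Toy next-word predictor: count which word follows which (a bigram model).
# Real LLMs replace these counts with a neural network over billions of words.
from collections import Counter, defaultdict

corpus = (
    "the lion is in the forest . "
    "the capital of france is paris . "
    "i went to the store . i went to the park ."
).split()

# "Training" = counting: for each word, tally what came next.
next_word_counts = defaultdict(Counter)
for word, following in zip(corpus, corpus[1:]):
    next_word_counts[word][following] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("went"))  # -> "to" (seen twice in our tiny corpus)
print(predict_next("the"))   # -> one of the words that followed "the"
```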


🀯 The Surprising Discovery

What researchers found:

When you train an LLM ONLY for β€œnext word prediction” on massive data, something magical happens:

It learns to do MANY other tasks automatically!

Tasks LLMs can do (without specific training):

βœ… Translation

Input: Translate "Hello" to Spanish
Output: Hola

βœ… Summarization

Input: Summarize this 10-page article
Output: [Concise 3-sentence summary]

βœ… Question Answering

Input: What is the capital of Japan?
Output: Tokyo

βœ… Multiple Choice Questions

Input: What is 2+2? A) 3 B) 4 C) 5
Output: B) 4

βœ… Sentiment Analysis

Input: "This movie was terrible!"
Output: Negative sentiment

βœ… Code Generation

Input: Write Python code to reverse a string
Output: [Working Python code]

All of this WITHOUT being specifically trained for these tasks!
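You can watch this task-switching happen with nothing but different prompts. A small open model like GPT-2 is far too weak to do these tasks well, but this sketch (assuming the transformers library is installed) demonstrates the mechanic: one next-word predictor, steered entirely by the prompt:

```python
# Same model, different tasks - switched purely by the prompt.
# (Assumes: pip install transformers torch. GPT-2 is tiny and weak;
#  this only demonstrates the mechanism, not GPT-4-level quality.)
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

prompts = [
    "Translate English to French: Hello ->",
    "Q: What is the capital of Japan?\nA:",
    "Review: 'This movie was terrible!' Sentiment:",
]
for prompt in prompts:
    result = generate(prompt, max_new_tokens=8, do_sample=False)
    print(result[0]["generated_text"], "\n---")
```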


πŸ“Š The Pre-training Result: Foundation Model

After pre-training, you get:

Foundation Model (also called Base Model or Pre-trained Model)

Characteristics:

  • βœ… General-purpose
  • βœ… Can do many tasks
  • βœ… Understands language deeply
  • βœ… But… not specialized for anything specific

Example: The base GPT-4 model is a foundation model

When you use ChatGPT, you’re using that foundation model plus a layer of instruction fine-tuning that teaches it to follow instructions and chat politely (more on exactly that kind of fine-tuning later in this chapter).


πŸ’‘ Key Takeaway

Pre-training is like giving an LLM a complete education:

  • Read everything on the internet
  • Learn general language patterns
  • Become a jack-of-all-trades

But it’s not specialized yet. That’s where fine-tuning comes in!


Stage 2: Fine-tuning Explained

🎯 What is Fine-tuning?

Simple Definition:

Taking a pre-trained model and refining it on a specific, narrow dataset for a particular task or domain.

Analogy: Doctor Specialization

Medical School (Pre-training)
└── Learn general medicine
    └── Graduate: General doctor

Specialization (Fine-tuning)
└── Cardiologist (heart specialist)
└── Neurologist (brain specialist)
└── Pediatrician (child specialist)

Same for LLMs:

Pre-trained GPT-4 (Foundation Model)
└── Knows everything generally

Fine-tuned for Banking (JP Morgan's AI)
└── Specializes in financial analysis

Fine-tuned for Legal (Harvey AI)
└── Specializes in legal cases

Fine-tuned for Telecom (SK Telecom AI)
└── Specializes in customer support (Korean)

πŸ€” Why Not Just Use Pre-trained Models?

Great question! Let’s see with examples:


Scenario 1: You Run an Airline Company

You want: AI chatbot for customer support

Question to AI:

β€œWhat’s the price for Lufthansa flight leaving at 6 PM to Munich?”

If you use pre-trained GPT-4 (without fine-tuning):

Response: "I don't have access to real-time flight prices. 
          Please check the Lufthansa website or contact 
          their customer service at..."

❌ Not helpful! It’s generic.

If you fine-tune on YOUR airline’s data:

Response: "The Lufthansa flight LH456 departing at 6 PM 
          to Munich costs €235 (Economy) or €890 (Business Class). 
          Would you like me to check availability?"

βœ… Perfect! Specific to your company.


Scenario 2: You’re an Educational Platform

You want: AI to generate high-quality exam questions

If you use pre-trained GPT-4:

  • Questions are okay but generic
  • May not match your curriculum
  • Quality varies

If you fine-tune on YOUR past exam papers:

  • Questions match your style exactly
  • Difficulty levels are consistent
  • Covers your specific syllabus

🏒 When Do You Need Fine-tuning?

You DON’T need fine-tuning if:

  • ❌ You’re a student using ChatGPT for homework
  • ❌ You’re using AI for general tasks (writing emails, summaries)
  • ❌ You’re exploring AI capabilities
  • ❌ Generic responses are good enough

You NEED fine-tuning if:

  • βœ… You’re a company with proprietary data
  • βœ… Your domain is highly specialized (legal, medical, finance)
  • βœ… You need consistent, high-quality responses
  • βœ… Your data is not publicly available
  • βœ… You’re building a production application
  • βœ… Generic AI responses are not good enough

πŸ“Š Pre-training vs Fine-tuning: Key Differences

| Aspect      | Pre-training           | Fine-tuning                    |
|-------------|------------------------|--------------------------------|
| Data        | 300 billion+ words     | 10,000 - 1 million examples    |
| Data Source | Entire internet        | Your specific dataset          |
| Data Type   | Unlabeled (raw text)   | Labeled (with answers/tags)    |
| Goal        | Learn general language | Specialize for a specific task |
| Cost        | $4.6 million (GPT-3)   | $1,000 - $100,000              |
| Time        | Weeks to months        | Hours to days                  |
| Result      | Foundation model       | Specialized model              |
| Who Does It | OpenAI, Google, Meta   | Companies, developers, you!    |
| Examples    | GPT-4, Claude, Gemini  | Harvey (legal), your chatbot   |

Real-World Examples

Let’s see how top companies use fine-tuning:


1️⃣ SK Telecom (South Korea)

Company: Major telecommunications provider

Problem:

  • Needed AI customer support chatbot
  • Must understand Korean telecom terminology
  • Generic GPT-4 doesn’t understand telecom jargon

Solution: Fine-tuned GPT-4 on:

  • Past customer service conversations (Korean)
  • Telecom-specific terminology
  • Company policies and procedures

Results:

  • βœ… 35% improvement in conversation summarization
  • βœ… 33% improvement in understanding customer intent
  • βœ… Handles Korean telecom queries far more reliably than the generic model

Source: OpenAI case studies


2️⃣ Harvey AI (Legal)

Company: AI assistant for lawyers and attorneys

Website: harvey.ai

Problem:

  • Lawyers need AI that understands legal case history
  • Generic GPT-4 lacks extensive legal knowledge
  • Legal terminology and precedents are crucial

Solution: Fine-tuned LLM on:

  • Millions of legal case documents
  • Court rulings and precedents
  • Legal contracts and agreements
  • Jurisdiction-specific laws

What Harvey can do:

  • βœ… Research legal cases in seconds
  • βœ… Draft legal documents
  • βœ… Analyze contracts
  • βœ… Provide case law references

Used by:

  • Top law firms globally
  • Corporate legal teams
  • Legal professionals

Why it’s better than ChatGPT:

  • Trained specifically on legal data
  • Understands legal jargon
  • Provides case law citations
  • Domain expertise in law

3️⃣ JP Morgan Chase (Banking)

Company: Major investment bank

Product: Internal AI-powered LLM suite

Announcement: β€œJP Morgan unveils AI-powered LLM suite - may replace research analysts”

Why build their own LLM?

Problem with using GPT-4 directly:

  • ❌ Not trained on JP Morgan’s proprietary data
  • ❌ Lacks internal banking insights
  • ❌ Doesn’t understand company-specific terminology
  • ❌ Can’t access confidential financial models

Solution: Fine-tuned LLM on:

  • Internal research reports
  • Financial analysis documents
  • Market data and trends
  • Company-specific methodologies

Use cases:

  • βœ… Generate financial research reports
  • βœ… Analyze market trends
  • βœ… Summarize earnings calls
  • βœ… Draft investment recommendations

Benefits:

  • Speeds up analyst work
  • Maintains confidentiality
  • Uses proprietary insights
  • Consistent with company standards

🎯 Pattern in All Examples

Notice the pattern:

Generic LLM (GPT-4)
    ↓
  + Company's specific data
    ↓
  Fine-tuning
    ↓
  Specialized LLM for that industry

Key Insight:

Every major company building serious AI applications does some form of fine-tuning. They rarely ship the raw foundation model as-is.


The Complete LLM Lifecycle

πŸ”„ From Raw Data to Production

Let me show you the complete journey:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 1: PRE-TRAINING                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 1: Collect Massive Data
β”œβ”€β”€ Common Crawl (410B words)
β”œβ”€β”€ WebText2 (20B words)
β”œβ”€β”€ Books (67B words)
└── Wikipedia (3B words)
    ↓
    Total: 500 billion words

Step 2: Train the Model
β”œβ”€β”€ Task: Predict next word
β”œβ”€β”€ Hardware: 1000s of GPUs
β”œβ”€β”€ Time: Several weeks
β”œβ”€β”€ Cost: $4.6 million (GPT-3)
└── Parameters: 175 billion

Step 3: Result
└── Foundation Model (Pre-trained LLM)
    └── Can do many tasks
    └── General-purpose
    └── Not specialized

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 2: FINE-TUNING                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 4: Collect Specific Data
β”œβ”€β”€ Your company data (10K-1M examples)
β”œβ”€β”€ Domain-specific (legal, medical, finance)
β”œβ”€β”€ Labeled data (with answers)
└── High-quality examples

Step 5: Fine-tune the Model
β”œβ”€β”€ Start from pre-trained model
β”œβ”€β”€ Train on your specific data
β”œβ”€β”€ Time: Hours to days
└── Cost: $1,000 - $100,000

Step 6: Result
└── Specialized Model
    └── Perfect for your use case
    └── Domain expert
    └── Ready for production

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 3: DEPLOYMENT                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 7: Build Application
β”œβ”€β”€ Chatbot
β”œβ”€β”€ Document analyzer
β”œβ”€β”€ Customer support
└── Code assistant

Step 8: Deploy to Users
└── Companies, employees, customers

πŸ“Š Visual Schematic

Here’s a simplified view:

RAW DATA                 FOUNDATIONAL MODEL      SPECIALIZED APPLICATIONS
(Unlabeled)              (Pre-trained)           (Fine-tuned)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Internet   │────┐    β”‚              │───┬───→│ Personal        β”‚
β”‚  Text       β”‚    β”‚    β”‚              β”‚   β”‚    β”‚ Assistant       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚              β”‚   β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚  Books      β”‚    β”œβ”€β”€β”€β†’β”‚  Foundation  β”‚   β”‚    
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚  Model       β”‚   β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Research   β”‚    β”‚    β”‚  (GPT-4)     β”‚   β”œβ”€β”€β”€β†’β”‚ Language        β”‚
β”‚  Papers     β”‚    β”‚    β”‚              β”‚   β”‚    β”‚ Translator      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚              β”‚   β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚  Wikipedia  β”‚β”€β”€β”€β”€β”˜    β”‚              β”‚   β”‚    
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚              β”‚   β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”œβ”€β”€β”€β†’β”‚ Code            β”‚
                                            β”‚    β”‚ Assistant       β”‚
                                            β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                            β”‚    
                                            β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                            └───→│ Classification  β”‚
                                                 β”‚ Bot             β”‚
                                                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Labeled vs Unlabeled Data

πŸ€” What’s the Difference?

This is super important to understand!


πŸ“ Unlabeled Data

Definition: Raw text without any extra information or tags

Examples:

Example 1: News Article

"Scientists discover new planet orbiting distant star. 
The planet, named Kepler-452b, is located 1,400 light-years 
from Earth and may have conditions suitable for life..."

No labels needed! Just the text.

Example 2: Book

"It was the best of times, it was the worst of times, 
it was the age of wisdom, it was the age of foolishness..."

No labels needed! Just the story.

Used for: Pre-training


🏷️ Labeled Data

Definition: Text with associated labels, tags, or answers

Examples:

Example 1: Email Classification

Text: "Congratulations! You've won $1 million! Click here..."
Label: SPAM ❌

Text: "Meeting rescheduled to 3 PM tomorrow"
Label: NOT SPAM βœ…

Example 2: Sentiment Analysis

Text: "This movie was amazing! Best film ever!"
Label: POSITIVE 😊

Text: "Waste of time and money. Terrible acting."
Label: NEGATIVE 😑

Example 3: Question-Answer Pairs

Question: "What is the capital of France?"
Answer: "Paris"

Question: "Who wrote Romeo and Juliet?"
Answer: "William Shakespeare"

Example 4: Legal Case Data

Case Description: "Contract dispute over intellectual property rights..."
Relevant Precedent: "Smith v. Jones (1995) - similar case ruling..."
Expected Outcome: "Likely favorable to defendant based on precedent"

Used for: Fine-tuning


πŸ“Š Comparison Table

| Feature      | Unlabeled Data               | Labeled Data                               |
|--------------|------------------------------|--------------------------------------------|
| Structure    | Just text                    | Text + tags/answers                        |
| Example      | β€œThe cat sat on the mat”     | Text: β€œMovie was great!” β†’ Label: Positive |
| Cost         | Cheap (scrape the internet)  | Expensive (humans label it)                |
| Availability | Abundant (billions of words) | Limited (thousands of examples)            |
| Used in      | Pre-training                 | Fine-tuning                                |
| Purpose      | Learn general language       | Learn a specific task                      |
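In code, the difference is just the shape of the data. A sketch (all examples invented):

```python
# Unlabeled data (pre-training): a plain stream of raw text.
pretraining_corpus = [
    "Scientists discover new planet orbiting distant star...",
    "It was the best of times, it was the worst of times...",
]

# Labeled data (fine-tuning): every example carries an answer or tag that a
# human (or a careful pipeline) attached - which is why it's expensive.
finetuning_examples = [
    {"text": "Congratulations! You've won $1 million!", "label": "SPAM"},
    {"text": "Meeting rescheduled to 3 PM tomorrow",    "label": "NOT_SPAM"},
    {"text": "This movie was amazing! Best film ever!", "label": "POSITIVE"},
]
```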

πŸ’‘ Why This Matters

Pre-training:

  • Needs MASSIVE amounts of data
  • Unlabeled data is easy to get (entire internet!)
  • Learns general patterns

Fine-tuning:

  • Needs SMALLER amounts of data
  • But data must be labeled (expensive!)
  • Learns specific tasks

This is why:

  • Pre-training costs millions (data + compute)
  • Fine-tuning costs thousands (mostly labeling)

Types of Fine-tuning

Not all fine-tuning is the same! There are two main types:


1️⃣ Instruction Fine-tuning

What it is: Teaching the LLM to follow specific instructions

Format: Instruction β†’ Response pairs

Examples:

Example 1: Translation

Instruction: "Translate this English text to French: Hello, how are you?"
Response: "Bonjour, comment allez-vous?"

Example 2: Summarization

Instruction: "Summarize this article in 3 sentences: [article text]"
Response: "[3-sentence summary]"

Example 3: Customer Support

Instruction: "Customer says: 'My flight is cancelled, what should I do?'"
Response: "I apologize for the inconvenience. Here are your options:
           1. Rebook on the next available flight
           2. Request a full refund
           3. Hotel accommodation for tonight
           Which would you prefer?"

Use cases:

  • βœ… Chatbots
  • βœ… Virtual assistants
  • βœ… Translation services
  • βœ… Summarization tools
  • βœ… Educational tutors
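Concretely, instruction fine-tuning data is usually stored as one JSON object per line (JSONL). Here’s a sketch in the chat-style schema that OpenAI’s fine-tuning API expects - the airline and translation examples are invented:

```python
# Writing an instruction-tuning dataset as JSONL, one example per line.
# The schema follows OpenAI's chat fine-tuning format; examples are invented.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "My flight is cancelled, what should I do?"},
        {"role": "assistant", "content": "I apologize for the inconvenience. "
         "You can rebook on the next available flight or request a full refund."},
    ]},
    {"messages": [
        {"role": "user", "content": "Translate to French: Hello, how are you?"},
        {"role": "assistant", "content": "Bonjour, comment allez-vous ?"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```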

2️⃣ Classification Fine-tuning

What it is: Teaching the LLM to categorize or classify text

Format: Text β†’ Label/Category

Examples:

Example 1: Spam Detection

Input: "Congratulations! You won a free iPhone! Click now!"
Output: SPAM

Input: "Meeting agenda for tomorrow attached"
Output: NOT SPAM

Example 2: Sentiment Analysis

Input: "This product is absolutely terrible. Don't buy it."
Output: NEGATIVE

Input: "Amazing quality! Highly recommend to everyone."
Output: POSITIVE

Example 3: Topic Classification

Input: "Scientists discover new cancer treatment breakthrough..."
Output: SCIENCE

Input: "Stock market hits record high as tech companies surge..."
Output: BUSINESS

Example 4: Intent Detection

Input: "What time does the store close?"
Output: HOURS_INQUIRY

Input: "I want to return this product"
Output: RETURN_REQUEST

Use cases:

  • βœ… Email filtering (spam/not spam)
  • βœ… Sentiment analysis (positive/negative/neutral)
  • βœ… Content moderation (appropriate/inappropriate)
  • βœ… Topic categorization (sports/politics/tech)
  • βœ… Intent recognition (in chatbots)
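Here’s what classification fine-tuning looks like in practice - a deliberately tiny sketch using Hugging Face Transformers. It starts from a small pre-trained model (distilbert-base-uncased, chosen only because it’s light) and nudges it with a two-example toy dataset; a real project would use thousands of labeled examples with proper batching and evaluation:

```python
# Minimal classification fine-tuning sketch
# (assumes: pip install transformers torch).
# Start from a small PRE-TRAINED model, then nudge it with LABELED examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts  = ["Congratulations! You won a free iPhone! Click now!",
          "Meeting agenda for tomorrow attached"]
labels = torch.tensor([1, 0])  # 1 = SPAM, 0 = NOT SPAM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # fresh 2-way classification head

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # toy loop; real fine-tuning iterates over many batches
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```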

πŸ“Š Comparison

| Aspect        | Instruction Fine-tuning                | Classification Fine-tuning |
|---------------|----------------------------------------|----------------------------|
| Output        | Free-form text (answers, translations) | Fixed categories/labels    |
| Complexity    | More complex                           | Simpler                    |
| Examples      | Q&A, translation, summarization        | Spam detection, sentiment  |
| Flexibility   | Very flexible responses                | Predefined categories only |
| Training Data | Instruction-response pairs             | Text-label pairs           |

Cost Analysis: The Money Behind LLMs

πŸ’° Let’s Talk Numbers

Building LLMs is EXPENSIVE. Let’s break down the costs:


πŸ“Š Pre-training Costs

GPT-3 Pre-training:

| Resource             | Quantity            | Cost         |
|----------------------|---------------------|--------------|
| GPUs                 | ~10,000 NVIDIA V100 | $3 million   |
| Electricity          | Several megawatts   | $500,000     |
| Cloud Infrastructure | AWS/Azure           | $1 million   |
| Data Collection      | 500B words          | $100,000     |
| Engineers            | 50+ AI researchers  | Priceless    |
| Total                | -                   | $4.6 million |

Training Duration: 30+ days continuously


πŸ’‘ Why So Expensive?

1. GPUs Are Expensive

Single NVIDIA V100 GPU: ~$10,000
For GPT-3: ~10,000 GPUs needed
Cost: ~$100 million in hardware
(but rented from the cloud, so far cheaper in practice)

2. Electricity Costs

10,000 GPUs running 24/7 for a month
= Electricity for a small town!
= $500,000+ in power bills
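That figure is easy to sanity-check. Every input below is an assumption (GPU power draw, datacenter overhead, electricity rate), but the back-of-envelope result lands right in the quoted ballpark:

```python
# Back-of-envelope electricity check - every input here is an assumption.
gpus          = 10_000
watts_per_gpu = 400      # assumed average draw for a datacenter GPU
pue           = 1.5      # assumed overhead for cooling, networking, etc.
hours         = 30 * 24  # one month of continuous training
usd_per_kwh   = 0.12     # assumed industrial electricity rate

kwh  = gpus * (watts_per_gpu / 1000) * pue * hours
cost = kwh * usd_per_kwh
print(f"{kwh:,.0f} kWh  ->  ${cost:,.0f}")  # ~4,320,000 kWh -> ~$518,400
```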

3. Expertise Needed

AI researchers: $300,000+/year salary
50 researchers Γ— 1 year
= $15 million+ in salaries

πŸ“‰ Fine-tuning Costs (Much Cheaper!)

Typical Fine-tuning Project:

| Resource      | Quantity            | Cost              |
|---------------|---------------------|-------------------|
| Compute       | A few GPUs for days | $1,000 - $10,000  |
| Data Labeling | 10,000 examples     | $5,000 - $50,000  |
| API Costs     | Using OpenAI API    | $100 - $1,000     |
| Total         | -                   | $6,000 - $60,000  |

Training Duration: Few hours to few days


🎯 Cost Comparison

Pre-training:  $4,600,000  πŸ’ΈπŸ’ΈπŸ’ΈπŸ’ΈπŸ’Έ
Fine-tuning:   $   10,000  πŸ’Έ
                ──────────
Difference:    460x cheaper!

This is why:

  • Only big companies (OpenAI, Google, Meta) do pre-training
  • Everyone else uses their pre-trained models and fine-tunes
  • You can fine-tune GPT-4 for your own use case!

🏒 Who Can Afford Pre-training?

Companies that have done pre-training:

βœ… OpenAI (GPT series) - Backed by Microsoft
βœ… Google (Gemini, PaLM) - Tech giant
βœ… Meta (Llama series) - Tech giant
βœ… Anthropic (Claude) - $7B funding
βœ… Mistral AI (Mistral) - $400M funding

Total companies globally: ~10-15

Everyone else: Uses fine-tuning on existing models


πŸ’‘ Good News for You!

You DON’T need to pre-train!

You can:

  1. Use OpenAI’s API
  2. Fine-tune GPT-4 on your data (see the sketch below)
  3. Build amazing applications
  4. Total cost: $100 - $10,000
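Launching a fine-tuning job really is that short. Here’s a sketch using the official openai Python client - it assumes OPENAI_API_KEY is set, a train.jsonl like the one shown earlier exists, and the model name is illustrative (check OpenAI’s docs for which models currently support fine-tuning):

```python
# Kicking off a fine-tuning job with the official OpenAI client
# (assumes: pip install openai, OPENAI_API_KEY set, train.jsonl prepared).
from openai import OpenAI

client = OpenAI()

# 1. Upload your labeled examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the job. The model name below is illustrative - consult
#    OpenAI's docs for the models currently open to fine-tuning.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```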

In this series:

  • We’ll learn both pre-training AND fine-tuning
  • But in practice, you’ll mostly fine-tune existing models
  • Understanding pre-training helps you understand how it all works!

Chapter Summary

πŸŽ“ What We Learned Today

Let’s recap the major concepts:


1. The Two Stages

Building an LLM = Pre-training + Fine-tuning

Stage 1 (Pre-training):
└── Train on 300 billion+ words
    └── Learn general language
        └── Result: Foundation Model

Stage 2 (Fine-tuning):
└── Train on 10,000-1M specific examples
    └── Specialize for a task
        └── Result: Your Custom AI

2. Pre-training

βœ… Massive dataset (entire internet)
βœ… Unlabeled data (raw text)
βœ… Task: Predict next word
βœ… Duration: Weeks to months
βœ… Cost: Millions of dollars
βœ… Result: Foundation model (GPT-4, Claude)
βœ… Done by: OpenAI, Google, Meta

3. Fine-tuning

βœ… Smaller dataset (your company data)
βœ… Labeled data (with answers/tags)
βœ… Task: Specific application
βœ… Duration: Hours to days
βœ… Cost: Thousands of dollars
βœ… Result: Specialized model (Harvey AI, JP Morgan AI)
βœ… Done by: Companies, developers, YOU!

4. Key Differences

| Pre-training     | Fine-tuning     |
|------------------|-----------------|
| 300B+ words      | 10K-1M examples |
| Unlabeled        | Labeled         |
| General-purpose  | Specific task   |
| $4.6M            | $10K            |
| Weeks            | Days            |
| Foundation model | Custom model    |

5. Real-World Examples

SK Telecom: Fine-tuned for Korean telecom support
└── 35% better conversation quality

Harvey AI: Fine-tuned for legal case research
└── Trusted by top law firms

JP Morgan: Fine-tuned for financial analysis
└── May replace research analysts

6. Data Types

Unlabeled (Pre-training):

"The cat sat on the mat."
[Just text, no labels]

Labeled (Fine-tuning):

Text: "This movie was great!"
Label: Positive
[Text + Label]

7. Types of Fine-tuning

1. Instruction Fine-tuning:

  • Format: Instruction β†’ Response
  • Use: Chatbots, translation, Q&A

2. Classification Fine-tuning:

  • Format: Text β†’ Category
  • Use: Spam detection, sentiment analysis

🎯 The Big Picture

Remember this flow:

1. OpenAI/Google pre-trains β†’ Creates GPT-4
2. You fine-tune GPT-4 β†’ Your custom AI
3. You deploy β†’ Real-world application
4. Profit! πŸ’°

πŸ“š Before Next Chapter

Make sure you understand:

  • [ ] What is pre-training?
  • [ ] What is fine-tuning?
  • [ ] Why are there two stages?
  • [ ] Difference between labeled and unlabeled data
  • [ ] When do you need fine-tuning?
  • [ ] At least 2 real-world examples
  • [ ] Cost difference (millions vs thousands)

If anything is unclear, read this chapter again!


πŸ”œ What’s Next?

In Chapter 4, we’ll start diving into the technical details:

  • Introduction to Transformer architecture
  • Brief look at β€œAttention is All You Need” paper
  • Understanding the building blocks
  • Preparing for actual coding!

Get ready to go deeper! πŸ”


πŸš€ Take Action Now!

What to do next:

  1. πŸ’¬ Comment Below - Which stage interested you more: pre-training or fine-tuning?
  2. βœ… Check Your Understanding - Can you explain both stages to a friend?
  3. πŸ”– Bookmark - Save for reference
  4. πŸ”„ Think About Use Cases - What would YOU fine-tune an LLM for?
  5. ⏭️ Stay Tuned - Chapter 4 coming soon!

Quick Reference

Key Terms Learned:

| Term                       | Meaning                                      |
|----------------------------|----------------------------------------------|
| Pre-training               | Training on massive unlabeled data (Stage 1) |
| Fine-tuning                | Refining on specific labeled data (Stage 2)  |
| Foundation Model           | Pre-trained LLM (base model)                 |
| Unlabeled Data             | Raw text without tags                        |
| Labeled Data               | Text with answers/categories                 |
| Instruction Fine-tuning    | Teaching specific tasks (Q&A, translation)   |
| Classification Fine-tuning | Teaching categorization (spam detection)     |

Important Numbers:

  • GPT-3 training data: 300 billion words
  • Pre-training cost: $4.6 million
  • Fine-tuning cost: $1,000 - $100,000
  • Pre-training duration: Weeks to months
  • Fine-tuning duration: Hours to days

Real Companies Using Fine-tuning:

  • βœ… SK Telecom (Telecom support)
  • βœ… Harvey AI (Legal research)
  • βœ… JP Morgan Chase (Financial analysis)
  • βœ… And thousands more!

Thank You!

You’ve completed Chapter 3! πŸŽ‰

You now understand the complete lifecycle of building an LLM - from raw internet data to production-ready specialized AI. This knowledge is crucial for everything that follows!

Remember:

  • Pre-training = General education (expensive, done by big companies)
  • Fine-tuning = Specialization (affordable, YOU can do this!)

In the next chapter, we’ll start exploring the β€œsecret sauce” - the Transformer architecture that makes all of this possible!

See you in Chapter 4! πŸš€


Questions? Drop them in the comments below! We respond to every single one.