Chapter 3: The Two Stages of Building an LLM

πŸ“– Reading Time: 35 minutes

Welcome to Chapter 3! In the previous chapters, we learned what LLMs are and why they’re revolutionary. Now it’s time to understand how they’re actually built.

The big reveal: Building an LLM is not one giant step - it’s a two-stage process:

  1. Pre-training (Stage 1)
  2. Fine-tuning (Stage 2)

By the end of this chapter, you’ll understand:

  • What pre-training is and why it’s called β€œpre”
  • How LLMs learn from billions of words
  • What fine-tuning is and when it’s needed
  • Real examples from top companies
  • The complete lifecycle from data to deployment

Let’s dive in! πŸš€



Quick Recap: Where We Are

In Chapter 1, we introduced the LLM series and our goals.

In Chapter 2, we learned:

  • What LLMs are (neural networks for text)
  • Why they’re called β€œLarge” (billions of parameters)
  • The secret sauce (Transformer architecture)
  • Difference between AI, ML, DL, and LLM

In Chapter 3 (Today), we’ll learn:

  • How these massive models are actually built
  • The step-by-step process from raw data to ChatGPT

The Two-Stage Building Process

🎭 The Two Acts of Building an LLM

Think of building an LLM like training for the Olympics:

Stage 1: General Training (Pre-training)

  • Like an athlete doing years of general fitness training
  • Builds overall strength, endurance, speed
  • Not focused on one specific sport yet

Stage 2: Specialized Training (Fine-tuning)

  • Like an athlete specializing in javelin throw or swimming
  • Focuses on specific skills for a particular event
  • Refines what was learned in general training

For LLMs:

Stage 1: Pre-training
└── Train on EVERYTHING (entire internet)
    └── Result: General-purpose AI

Stage 2: Fine-tuning
└── Train on SPECIFIC data (your company's data)
    └── Result: Specialized AI for your needs

πŸ€” Why Two Stages? Why Not Just One?

Great question!

Analogy: Education System

General Education (Pre-training):

  • Kindergarten to 12th grade
  • Learn everything: Math, Science, Languages, History
  • Become a well-rounded person

Specialization (Fine-tuning):

  • College/University - Choose Engineering
  • Medical school - Become a doctor
  • Law school - Become a lawyer

Same logic for LLMs:

You need general knowledge first (pre-training), then specialize (fine-tuning) for your specific use case.


Stage 1: Pre-training Explained

πŸ“š What is Pre-training?

Simple Definition:

Training the LLM on a massive and diverse dataset so it learns general language understanding.

What β€œmassive” means:

GPT-3 was trained on 300 billion words! (Strictly speaking, 300 billion tokens - word pieces - but we’ll treat them as words for now.)

Let’s put that in perspective:

Average book: 80,000 words
300 billion words = 3,750,000 books

If you read one book per day:
It would take 10,274 YEARS to read all that!

🌐 Where Does This Training Data Come From?

GPT-3’s Training Data Sources:

| Source          | Words       | Share of Training Mix | What It Contains                        |
|-----------------|-------------|-----------------------|-----------------------------------------|
| Common Crawl    | 410 billion | 60%                   | Web pages from a large-scale crawl      |
| WebText2        | 20 billion  | 22%                   | Pages linked from upvoted Reddit posts  |
| Books1 & Books2 | 67 billion  | 16%                   | Published books                         |
| Wikipedia       | 3 billion   | 3%                    | Wikipedia articles                      |

Total: 500 billion words (of which 300 billion were used for training)


πŸ” Let’s Explore These Sources

1. Common Crawl

What is it?

  • An open repository of web data
  • Crawls and stores content from billions of websites
  • Anyone can access it for free

Example content:

  • News articles
  • Blog posts
  • Product reviews
  • Social media discussions
  • Scientific papers
  • Forums and Q&A sites

Try it: Visit commoncrawl.org
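Want to see what this web data actually looks like? Here’s a minimal sketch that streams a few documents from a cleaned Common Crawl snapshot. It assumes you’ve installed the Hugging Face datasets library (pip install datasets) and uses the public allenai/c4 corpus - a filtered Common Crawl derivative - purely as an accessible example:

```python
# A peek at Common Crawl-style web text (assumes: pip install datasets).
# "allenai/c4" is a cleaned Common Crawl snapshot hosted on Hugging Face,
# used here only as an easy-to-access example of crawled web data.
from datasets import load_dataset

web = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Stream the first three documents instead of downloading terabytes.
for i, doc in enumerate(web):
    print(doc["text"][:120].replace("\n", " "), "...")
    if i == 2:
        break
```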


2. WebText2

What is it?

  • High-quality web text discovered through Reddit
  • Only pages linked in upvoted posts made the cut (quality filter)
  • Includes sites like Stack Overflow (programming Q&A)

Why Reddit?

  • Reddit’s upvote system = quality filter
  • Diverse topics (technology, cooking, science, history)
  • Human-written, conversational language

3. Books

Why include books?

  • Proper grammar and structure
  • Long-form storytelling
  • Diverse vocabulary
  • Different writing styles (fiction, non-fiction, technical)

Example books included:

  • Classic literature
  • Technical manuals
  • Science textbooks
  • Fiction novels

4. Wikipedia

Why Wikipedia?

  • Factual, well-structured information
  • Covers millions of topics
  • Multiple languages
  • Regularly updated

🎯 The Training Task: Next Word Prediction

Here’s the fascinating part:

LLMs learn by playing a simple game: β€œGuess the next word”

Example:

Given: "The lion is in the ___"
LLM predicts: "forest"

Given: "I went to the ___"
LLM predicts: "store" (or "park", "school", "mall")

Given: "The capital of France is ___"
LLM predicts: "Paris"

That’s it!

Just train on predicting the next word, billions of times, with billions of examples.
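To make this concrete, here’s a toy version of the game in plain Python. Real LLMs use a neural network trained on billions of words; this tiny bigram counter plays the same β€œguess the next word” game by simply counting (the mini-corpus and function name are invented for illustration):

```python
# Toy next-word predictor: count which word follows which (a bigram model).
# Real LLMs replace these counts with a neural network over billions of words.
from collections import Counter, defaultdict

corpus = (
    "the lion is in the forest . "
    "the capital of france is paris . "
    "i went to the store . i went to the park ."
).split()

# "Training" = counting: for each word, tally what came next.
next_word_counts = defaultdict(Counter)
for word, following in zip(corpus, corpus[1:]):
    next_word_counts[word][following] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("went"))  # -> "to" (seen twice in our tiny corpus)
print(predict_next("the"))   # -> one of the words that followed "the"
```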


🀯 The Surprising Discovery

What researchers found:

When you train an LLM ONLY for β€œnext word prediction” on massive data, something magical happens:

It learns to do MANY other tasks automatically!

Tasks LLMs can do (without specific training):

βœ… Translation

Input: Translate "Hello" to Spanish
Output: Hola

βœ… Summarization

Input: Summarize this 10-page article
Output: [Concise 3-sentence summary]

βœ… Question Answering

Input: What is the capital of Japan?
Output: Tokyo

βœ… Multiple Choice Questions

Input: What is 2+2? A) 3 B) 4 C) 5
Output: B) 4

βœ… Sentiment Analysis

Input: "This movie was terrible!"
Output: Negative sentiment

βœ… Code Generation

Input: Write Python code to reverse a string
Output: [Working Python code]

All of this WITHOUT being specifically trained for these tasks!
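You can watch this task-switching happen with nothing but different prompts. A small open model like GPT-2 is far too weak to do these tasks well, but this sketch (assuming the transformers library is installed) demonstrates the mechanic: one next-word predictor, steered entirely by the prompt:

```python
# Same model, different tasks - switched purely by the prompt.
# (Assumes: pip install transformers torch. GPT-2 is tiny and weak;
#  this only demonstrates the mechanism, not GPT-4-level quality.)
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

prompts = [
    "Translate English to French: Hello ->",
    "Q: What is the capital of Japan?\nA:",
    "Review: 'This movie was terrible!' Sentiment:",
]
for prompt in prompts:
    result = generate(prompt, max_new_tokens=8, do_sample=False)
    print(result[0]["generated_text"], "\n---")
```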


πŸ“Š The Pre-training Result: Foundation Model

After pre-training, you get:

Foundation Model (also called Base Model or Pre-trained Model)

Characteristics:

  • βœ… General-purpose
  • βœ… Can do many tasks
  • βœ… Understands language deeply
  • βœ… But… not specialized for anything specific

Example: The base GPT-4 model is a foundation model

When you use ChatGPT, you’re using that foundation model plus a layer of instruction fine-tuning that teaches it to follow instructions and chat politely (more on exactly that kind of fine-tuning later in this chapter).


πŸ’‘ Key Takeaway

Pre-training is like giving an LLM a complete education:

  • Read everything on the internet
  • Learn general language patterns
  • Become a jack-of-all-trades

But it’s not specialized yet. That’s where fine-tuning comes in!


Stage 2: Fine-tuning Explained

🎯 What is Fine-tuning?

Simple Definition:

Taking a pre-trained model and refining it on a specific, narrow dataset for a particular task or domain.

Analogy: Doctor Specialization

Medical School (Pre-training)
└── Learn general medicine
    └── Graduate: General doctor

Specialization (Fine-tuning)
└── Cardiologist (heart specialist)
└── Neurologist (brain specialist)
└── Pediatrician (child specialist)

Same for LLMs:

Pre-trained GPT-4 (Foundation Model)
└── Knows everything generally

Fine-tuned for Banking (JP Morgan's AI)
└── Specializes in financial analysis

Fine-tuned for Legal (Harvey AI)
└── Specializes in legal cases

Fine-tuned for Telecom (SK Telecom AI)
└── Specializes in customer support (Korean)

πŸ€” Why Not Just Use Pre-trained Models?

Great question! Let’s see with examples:


Scenario 1: You Run an Airline Company

You want: AI chatbot for customer support

Question to AI:

β€œWhat’s the price for Lufthansa flight leaving at 6 PM to Munich?”

If you use pre-trained GPT-4 (without fine-tuning):

Response: "I don't have access to real-time flight prices. 
          Please check the Lufthansa website or contact 
          their customer service at..."

❌ Not helpful! It’s generic.

If you fine-tune on YOUR airline’s data:

Response: "The Lufthansa flight LH456 departing at 6 PM 
          to Munich costs €235 (Economy) or €890 (Business Class). 
          Would you like me to check availability?"

βœ… Perfect! Specific to your company.


Scenario 2: You’re an Educational Platform

You want: AI to generate high-quality exam questions

If you use pre-trained GPT-4:

  • Questions are okay but generic
  • May not match your curriculum
  • Quality varies

If you fine-tune on YOUR past exam papers:

  • Questions match your style exactly
  • Difficulty levels are consistent
  • Covers your specific syllabus

🏒 When Do You Need Fine-tuning?

You DON’T need fine-tuning if:

  • ❌ You’re a student using ChatGPT for homework
  • ❌ You’re using AI for general tasks (writing emails, summaries)
  • ❌ You’re exploring AI capabilities
  • ❌ Generic responses are good enough

You NEED fine-tuning if:

  • βœ… You’re a company with proprietary data
  • βœ… Your domain is highly specialized (legal, medical, finance)
  • βœ… You need consistent, high-quality responses
  • βœ… Your data is not publicly available
  • βœ… You’re building a production application
  • βœ… Generic AI responses are not good enough

πŸ“Š Pre-training vs Fine-tuning: Key Differences

| Aspect      | Pre-training           | Fine-tuning                    |
|-------------|------------------------|--------------------------------|
| Data        | 300 billion+ words     | 10,000 - 1 million examples    |
| Data Source | Entire internet        | Your specific dataset          |
| Data Type   | Unlabeled (raw text)   | Labeled (with answers/tags)    |
| Goal        | Learn general language | Specialize for a specific task |
| Cost        | $4.6 million (GPT-3)   | $1,000 - $100,000              |
| Time        | Weeks to months        | Hours to days                  |
| Result      | Foundation model       | Specialized model              |
| Who Does It | OpenAI, Google, Meta   | Companies, developers, you!    |
| Examples    | GPT-4, Claude, Gemini  | Harvey (legal), your chatbot   |

Real-World Examples

Let’s see how top companies use fine-tuning:


1️⃣ SK Telecom (South Korea)

Company: Major telecommunications provider

Problem:

  • Needed AI customer support chatbot
  • Must understand Korean telecom terminology
  • Generic GPT-4 doesn’t understand telecom jargon

Solution: Fine-tuned GPT-4 on:

  • Past customer service conversations (Korean)
  • Telecom-specific terminology
  • Company policies and procedures

Results:

  • βœ… 35% improvement in conversation summarization
  • βœ… 33% improvement in understanding customer intent
  • βœ… Handles Korean telecom queries far more reliably than the generic model

Source: OpenAI case studies


2️⃣ Harvey AI (Legal)

Company: AI assistant for lawyers and attorneys

Website: harvey.ai

Problem:

  • Lawyers need AI that understands legal case history
  • Generic GPT-4 lacks extensive legal knowledge
  • Legal terminology and precedents are crucial

Solution: Fine-tuned LLM on:

  • Millions of legal case documents
  • Court rulings and precedents
  • Legal contracts and agreements
  • Jurisdiction-specific laws

What Harvey can do:

  • βœ… Research legal cases in seconds
  • βœ… Draft legal documents
  • βœ… Analyze contracts
  • βœ… Provide case law references

Used by:

  • Top law firms globally
  • Corporate legal teams
  • Legal professionals

Why it’s better than ChatGPT:

  • Trained specifically on legal data
  • Understands legal jargon
  • Provides case law citations
  • Domain expertise in law

3️⃣ JP Morgan Chase (Banking)

Company: Major investment bank

Product: Internal AI-powered LLM suite

Announcement: β€œJP Morgan unveils AI-powered LLM suite - may replace research analysts”

Why build their own LLM?

Problem with using GPT-4 directly:

  • ❌ Not trained on JP Morgan’s proprietary data
  • ❌ Lacks internal banking insights
  • ❌ Doesn’t understand company-specific terminology
  • ❌ Can’t access confidential financial models

Solution: Fine-tuned LLM on:

  • Internal research reports
  • Financial analysis documents
  • Market data and trends
  • Company-specific methodologies

Use cases:

  • βœ… Generate financial research reports
  • βœ… Analyze market trends
  • βœ… Summarize earnings calls
  • βœ… Draft investment recommendations

Benefits:

  • Speeds up analyst work
  • Maintains confidentiality
  • Uses proprietary insights
  • Consistent with company standards

🎯 Pattern in All Examples

Notice the pattern:

Generic LLM (GPT-4)
    ↓
  + Company's specific data
    ↓
  Fine-tuning
    ↓
  Specialized LLM for that industry

Key Insight:

Every major company building serious AI applications does some form of fine-tuning. They rarely ship the raw foundation model as-is.


The Complete LLM Lifecycle

πŸ”„ From Raw Data to Production

Let me show you the complete journey:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 1: PRE-TRAINING                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 1: Collect Massive Data
β”œβ”€β”€ Common Crawl (410B words)
β”œβ”€β”€ WebText2 (20B words)
β”œβ”€β”€ Books (67B words)
└── Wikipedia (3B words)
    ↓
    Total: 500 billion words

Step 2: Train the Model
β”œβ”€β”€ Task: Predict next word
β”œβ”€β”€ Hardware: 1000s of GPUs
β”œβ”€β”€ Time: Several weeks
β”œβ”€β”€ Cost: $4.6 million (GPT-3)
└── Parameters: 175 billion

Step 3: Result
└── Foundation Model (Pre-trained LLM)
    └── Can do many tasks
    └── General-purpose
    └── Not specialized

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 2: FINE-TUNING                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 4: Collect Specific Data
β”œβ”€β”€ Your company data (10K-1M examples)
β”œβ”€β”€ Domain-specific (legal, medical, finance)
β”œβ”€β”€ Labeled data (with answers)
└── High-quality examples

Step 5: Fine-tune the Model
β”œβ”€β”€ Start from pre-trained model
β”œβ”€β”€ Train on your specific data
β”œβ”€β”€ Time: Hours to days
└── Cost: $1,000 - $100,000

Step 6: Result
└── Specialized Model
    └── Perfect for your use case
    └── Domain expert
    └── Ready for production

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               STAGE 3: DEPLOYMENT                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 7: Build Application
β”œβ”€β”€ Chatbot
β”œβ”€β”€ Document analyzer
β”œβ”€β”€ Customer support
└── Code assistant

Step 8: Deploy to Users
└── Companies, employees, customers

πŸ“Š Visual Schematic

Here’s a simplified view:

RAW DATA                 FOUNDATIONAL MODEL      SPECIALIZED APPLICATIONS
(Unlabeled)              (Pre-trained)           (Fine-tuned)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Internet   │────┐    β”‚              │───┬───→│ Personal        β”‚
β”‚  Text       β”‚    β”‚    β”‚              β”‚   β”‚    β”‚ Assistant       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚              β”‚   β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚  Books      β”‚    β”œβ”€β”€β”€β†’β”‚  Foundation  β”‚   β”‚    
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚  Model       β”‚   β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Research   β”‚    β”‚    β”‚  (GPT-4)     β”‚   β”œβ”€β”€β”€β†’β”‚ Language        β”‚
β”‚  Papers     β”‚    β”‚    β”‚              β”‚   β”‚    β”‚ Translator      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚    β”‚              β”‚   β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚  Wikipedia  β”‚β”€β”€β”€β”€β”˜    β”‚              β”‚   β”‚    
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚              β”‚   β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”œβ”€β”€β”€β†’β”‚ Code            β”‚
                                            β”‚    β”‚ Assistant       β”‚
                                            β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                            β”‚    
                                            β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                            └───→│ Classification  β”‚
                                                 β”‚ Bot             β”‚
                                                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Labeled vs Unlabeled Data

πŸ€” What’s the Difference?

This is super important to understand!


πŸ“ Unlabeled Data

Definition: Raw text without any extra information or tags

Examples:

Example 1: News Article

"Scientists discover new planet orbiting distant star. 
The planet, named Kepler-452b, is located 1,400 light-years 
from Earth and may have conditions suitable for life..."

No labels needed! Just the text.

Example 2: Book

"It was the best of times, it was the worst of times, 
it was the age of wisdom, it was the age of foolishness..."

No labels needed! Just the story.

Used for: Pre-training


🏷️ Labeled Data

Definition: Text with associated labels, tags, or answers

Examples:

Example 1: Email Classification

Text: "Congratulations! You've won $1 million! Click here..."
Label: SPAM ❌

Text: "Meeting rescheduled to 3 PM tomorrow"
Label: NOT SPAM βœ…

Example 2: Sentiment Analysis

Text: "This movie was amazing! Best film ever!"
Label: POSITIVE 😊

Text: "Waste of time and money. Terrible acting."
Label: NEGATIVE 😑

Example 3: Question-Answer Pairs

Question: "What is the capital of France?"
Answer: "Paris"

Question: "Who wrote Romeo and Juliet?"
Answer: "William Shakespeare"

Example 4: Legal Case Data

Case Description: "Contract dispute over intellectual property rights..."
Relevant Precedent: "Smith v. Jones (1995) - similar case ruling..."
Expected Outcome: "Likely favorable to defendant based on precedent"

Used for: Fine-tuning


πŸ“Š Comparison Table

| Feature      | Unlabeled Data               | Labeled Data                               |
|--------------|------------------------------|--------------------------------------------|
| Structure    | Just text                    | Text + tags/answers                        |
| Example      | β€œThe cat sat on the mat”     | Text: β€œMovie was great!” β†’ Label: Positive |
| Cost         | Cheap (scrape the internet)  | Expensive (humans label it)                |
| Availability | Abundant (billions of words) | Limited (thousands of examples)            |
| Used in      | Pre-training                 | Fine-tuning                                |
| Purpose      | Learn general language       | Learn a specific task                      |
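In code, the difference is just the shape of the data. A sketch (all examples invented):

```python
# Unlabeled data (pre-training): a plain stream of raw text.
pretraining_corpus = [
    "Scientists discover new planet orbiting distant star...",
    "It was the best of times, it was the worst of times...",
]

# Labeled data (fine-tuning): every example carries an answer or tag that a
# human (or a careful pipeline) attached - which is why it's expensive.
finetuning_examples = [
    {"text": "Congratulations! You've won $1 million!", "label": "SPAM"},
    {"text": "Meeting rescheduled to 3 PM tomorrow",    "label": "NOT_SPAM"},
    {"text": "This movie was amazing! Best film ever!", "label": "POSITIVE"},
]
```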

πŸ’‘ Why This Matters

Pre-training:

  • Needs MASSIVE amounts of data
  • Unlabeled data is easy to get (entire internet!)
  • Learns general patterns

Fine-tuning:

  • Needs SMALLER amounts of data
  • But data must be labeled (expensive!)
  • Learns specific tasks

This is why:

  • Pre-training costs millions (data + compute)
  • Fine-tuning costs thousands (mostly labeling)

Types of Fine-tuning

Not all fine-tuning is the same! There are two main types:


1️⃣ Instruction Fine-tuning

What it is: Teaching the LLM to follow specific instructions

Format: Instruction β†’ Response pairs

Examples:

Example 1: Translation

Instruction: "Translate this English text to French: Hello, how are you?"
Response: "Bonjour, comment allez-vous?"

Example 2: Summarization

Instruction: "Summarize this article in 3 sentences: [article text]"
Response: "[3-sentence summary]"

Example 3: Customer Support

Instruction: "Customer says: 'My flight is cancelled, what should I do?'"
Response: "I apologize for the inconvenience. Here are your options:
           1. Rebook on the next available flight
           2. Request a full refund
           3. Hotel accommodation for tonight
           Which would you prefer?"

Use cases:

  • βœ… Chatbots
  • βœ… Virtual assistants
  • βœ… Translation services
  • βœ… Summarization tools
  • βœ… Educational tutors
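Concretely, instruction fine-tuning data is usually stored as one JSON object per line (JSONL). Here’s a sketch in the chat-style schema that OpenAI’s fine-tuning API expects - the airline and translation examples are invented:

```python
# Writing an instruction-tuning dataset as JSONL, one example per line.
# The schema follows OpenAI's chat fine-tuning format; examples are invented.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "My flight is cancelled, what should I do?"},
        {"role": "assistant", "content": "I apologize for the inconvenience. "
         "You can rebook on the next available flight or request a full refund."},
    ]},
    {"messages": [
        {"role": "user", "content": "Translate to French: Hello, how are you?"},
        {"role": "assistant", "content": "Bonjour, comment allez-vous ?"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```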

2️⃣ Classification Fine-tuning

What it is: Teaching the LLM to categorize or classify text

Format: Text β†’ Label/Category

Examples:

Example 1: Spam Detection

Input: "Congratulations! You won a free iPhone! Click now!"
Output: SPAM

Input: "Meeting agenda for tomorrow attached"
Output: NOT SPAM

Example 2: Sentiment Analysis

Input: "This product is absolutely terrible. Don't buy it."
Output: NEGATIVE

Input: "Amazing quality! Highly recommend to everyone."
Output: POSITIVE

Example 3: Topic Classification

Input: "Scientists discover new cancer treatment breakthrough..."
Output: SCIENCE

Input: "Stock market hits record high as tech companies surge..."
Output: BUSINESS

Example 4: Intent Detection

Input: "What time does the store close?"
Output: HOURS_INQUIRY

Input: "I want to return this product"
Output: RETURN_REQUEST

Use cases:

  • βœ… Email filtering (spam/not spam)
  • βœ… Sentiment analysis (positive/negative/neutral)
  • βœ… Content moderation (appropriate/inappropriate)
  • βœ… Topic categorization (sports/politics/tech)
  • βœ… Intent recognition (in chatbots)
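Here’s what classification fine-tuning looks like in practice - a deliberately tiny sketch using Hugging Face Transformers. It starts from a small pre-trained model (distilbert-base-uncased, chosen only because it’s light) and nudges it with a two-example toy dataset; a real project would use thousands of labeled examples with proper batching and evaluation:

```python
# Minimal classification fine-tuning sketch
# (assumes: pip install transformers torch).
# Start from a small PRE-TRAINED model, then nudge it with LABELED examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts  = ["Congratulations! You won a free iPhone! Click now!",
          "Meeting agenda for tomorrow attached"]
labels = torch.tensor([1, 0])  # 1 = SPAM, 0 = NOT SPAM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # fresh 2-way classification head

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # toy loop; real fine-tuning iterates over many batches
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```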

πŸ“Š Comparison

| Aspect        | Instruction Fine-tuning                | Classification Fine-tuning |
|---------------|----------------------------------------|----------------------------|
| Output        | Free-form text (answers, translations) | Fixed categories/labels    |
| Complexity    | More complex                           | Simpler                    |
| Examples      | Q&A, translation, summarization        | Spam detection, sentiment  |
| Flexibility   | Very flexible responses                | Predefined categories only |
| Training Data | Instruction-response pairs             | Text-label pairs           |

Cost Analysis: The Money Behind LLMs

πŸ’° Let’s Talk Numbers

Building LLMs is EXPENSIVE. Let’s break down the costs:


πŸ“Š Pre-training Costs

GPT-3 Pre-training:

| Resource             | Quantity            | Cost         |
|----------------------|---------------------|--------------|
| GPUs                 | ~10,000 NVIDIA V100 | $3 million   |
| Electricity          | Several megawatts   | $500,000     |
| Cloud Infrastructure | AWS/Azure           | $1 million   |
| Data Collection      | 500B words          | $100,000     |
| Engineers            | 50+ AI researchers  | Priceless    |
| Total                | -                   | $4.6 million |

Training Duration: 30+ days continuously


πŸ’‘ Why So Expensive?

1. GPUs Are Expensive

Single NVIDIA V100 GPU: ~$10,000
For GPT-3: ~10,000 GPUs needed
Cost: ~$100 million in hardware
(but rented from the cloud, so far cheaper in practice)

2. Electricity Costs

10,000 GPUs running 24/7 for a month
= Electricity for a small town!
= $500,000+ in power bills
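That figure is easy to sanity-check. Every input below is an assumption (GPU power draw, datacenter overhead, electricity rate), but the back-of-envelope result lands right in the quoted ballpark:

```python
# Back-of-envelope electricity check - every input here is an assumption.
gpus          = 10_000
watts_per_gpu = 400      # assumed average draw for a datacenter GPU
pue           = 1.5      # assumed overhead for cooling, networking, etc.
hours         = 30 * 24  # one month of continuous training
usd_per_kwh   = 0.12     # assumed industrial electricity rate

kwh  = gpus * (watts_per_gpu / 1000) * pue * hours
cost = kwh * usd_per_kwh
print(f"{kwh:,.0f} kWh  ->  ${cost:,.0f}")  # ~4,320,000 kWh -> ~$518,400
```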

3. Expertise Needed

AI researchers: $300,000+/year salary
50 researchers Γ— 1 year
= $15 million+ in salaries

πŸ“‰ Fine-tuning Costs (Much Cheaper!)

Typical Fine-tuning Project:

| Resource      | Quantity            | Cost              |
|---------------|---------------------|-------------------|
| Compute       | A few GPUs for days | $1,000 - $10,000  |
| Data Labeling | 10,000 examples     | $5,000 - $50,000  |
| API Costs     | Using OpenAI API    | $100 - $1,000     |
| Total         | -                   | $6,000 - $60,000  |

Training Duration: Few hours to few days


🎯 Cost Comparison

Pre-training:  $4,600,000  πŸ’ΈπŸ’ΈπŸ’ΈπŸ’ΈπŸ’Έ
Fine-tuning:   $   10,000  πŸ’Έ
                ──────────
Difference:    460x cheaper!

This is why:

  • Only big companies (OpenAI, Google, Meta) do pre-training
  • Everyone else uses their pre-trained models and fine-tunes
  • You can fine-tune GPT-4 for your own use case!

🏒 Who Can Afford Pre-training?

Companies that have done pre-training:

βœ… OpenAI (GPT series) - Backed by Microsoft
βœ… Google (Gemini, PaLM) - Tech giant
βœ… Meta (Llama series) - Tech giant
βœ… Anthropic (Claude) - $7B funding
βœ… Mistral AI (Mistral) - $400M funding

Total companies globally: ~10-15

Everyone else: Uses fine-tuning on existing models


πŸ’‘ Good News for You!

You DON’T need to pre-train!

You can:

  1. Use OpenAI’s API
  2. Fine-tune GPT-4 on your data (see the sketch below)
  3. Build amazing applications
  4. Total cost: $100 - $10,000
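Launching a fine-tuning job really is that short. Here’s a sketch using the official openai Python client - it assumes OPENAI_API_KEY is set, a train.jsonl like the one shown earlier exists, and the model name is illustrative (check OpenAI’s docs for which models currently support fine-tuning):

```python
# Kicking off a fine-tuning job with the official OpenAI client
# (assumes: pip install openai, OPENAI_API_KEY set, train.jsonl prepared).
from openai import OpenAI

client = OpenAI()

# 1. Upload your labeled examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the job. The model name below is illustrative - consult
#    OpenAI's docs for the models currently open to fine-tuning.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```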

In this series:

  • We’ll learn both pre-training AND fine-tuning
  • But in practice, you’ll mostly fine-tune existing models
  • Understanding pre-training helps you understand how it all works!

Chapter Summary

πŸŽ“ What We Learned Today

Let’s recap the major concepts:


1. The Two Stages

Building an LLM = Pre-training + Fine-tuning

Stage 1 (Pre-training):
└── Train on 300 billion+ words
    └── Learn general language
        └── Result: Foundation Model

Stage 2 (Fine-tuning):
└── Train on 10,000-1M specific examples
    └── Specialize for a task
        └── Result: Your Custom AI

2. Pre-training

βœ… Massive dataset (entire internet)
βœ… Unlabeled data (raw text)
βœ… Task: Predict next word
βœ… Duration: Weeks to months
βœ… Cost: Millions of dollars
βœ… Result: Foundation model (GPT-4, Claude)
βœ… Done by: OpenAI, Google, Meta

3. Fine-tuning

βœ… Smaller dataset (your company data)
βœ… Labeled data (with answers/tags)
βœ… Task: Specific application
βœ… Duration: Hours to days
βœ… Cost: Thousands of dollars
βœ… Result: Specialized model (Harvey AI, JP Morgan AI)
βœ… Done by: Companies, developers, YOU!

4. Key Differences

| Pre-training     | Fine-tuning     |
|------------------|-----------------|
| 300B+ words      | 10K-1M examples |
| Unlabeled        | Labeled         |
| General-purpose  | Specific task   |
| $4.6M            | $10K            |
| Weeks            | Days            |
| Foundation model | Custom model    |

5. Real-World Examples

SK Telecom: Fine-tuned for Korean telecom support
└── 35% better conversation quality

Harvey AI: Fine-tuned for legal case research
└── Trusted by top law firms

JP Morgan: Fine-tuned for financial analysis
└── May replace research analysts

6. Data Types

Unlabeled (Pre-training):

"The cat sat on the mat."
[Just text, no labels]

Labeled (Fine-tuning):

Text: "This movie was great!"
Label: Positive
[Text + Label]

7. Types of Fine-tuning

1. Instruction Fine-tuning:

  • Format: Instruction β†’ Response
  • Use: Chatbots, translation, Q&A

2. Classification Fine-tuning:

  • Format: Text β†’ Category
  • Use: Spam detection, sentiment analysis

🎯 The Big Picture

Remember this flow:

1. OpenAI/Google pre-trains β†’ Creates GPT-4
2. You fine-tune GPT-4 β†’ Your custom AI
3. You deploy β†’ Real-world application
4. Profit! πŸ’°

πŸ“š Before Next Chapter

Make sure you understand:

  • [ ] What is pre-training?
  • [ ] What is fine-tuning?
  • [ ] Why are there two stages?
  • [ ] Difference between labeled and unlabeled data
  • [ ] When do you need fine-tuning?
  • [ ] At least 2 real-world examples
  • [ ] Cost difference (millions vs thousands)

If anything is unclear, read this chapter again!


πŸ”œ What’s Next?

In Chapter 4, we’ll start diving into the technical details:

  • Introduction to Transformer architecture
  • Brief look at β€œAttention is All You Need” paper
  • Understanding the building blocks
  • Preparing for actual coding!

Get ready to go deeper! πŸ”


πŸš€ Take Action Now!

What to do next:

  1. πŸ’¬ Comment Below - Which stage interested you more: pre-training or fine-tuning?
  2. βœ… Check Your Understanding - Can you explain both stages to a friend?
  3. πŸ”– Bookmark - Save for reference
  4. πŸ”„ Think About Use Cases - What would YOU fine-tune an LLM for?
  5. ⏭️ Stay Tuned - Chapter 4 coming soon!

Quick Reference

Key Terms Learned:

| Term                       | Meaning                                      |
|----------------------------|----------------------------------------------|
| Pre-training               | Training on massive unlabeled data (Stage 1) |
| Fine-tuning                | Refining on specific labeled data (Stage 2)  |
| Foundation Model           | Pre-trained LLM (base model)                 |
| Unlabeled Data             | Raw text without tags                        |
| Labeled Data               | Text with answers/categories                 |
| Instruction Fine-tuning    | Teaching specific tasks (Q&A, translation)   |
| Classification Fine-tuning | Teaching categorization (spam detection)     |

Important Numbers:

  • GPT-3 training data: 300 billion words
  • Pre-training cost: $4.6 million
  • Fine-tuning cost: $1,000 - $100,000
  • Pre-training duration: Weeks to months
  • Fine-tuning duration: Hours to days

Real Companies Using Fine-tuning:

  • βœ… SK Telecom (Telecom support)
  • βœ… Harvey AI (Legal research)
  • βœ… JP Morgan Chase (Financial analysis)
  • βœ… And thousands more!

Thank You!

You’ve completed Chapter 3! πŸŽ‰

You now understand the complete lifecycle of building an LLM - from raw internet data to production-ready specialized AI. This knowledge is crucial for everything that follows!

Remember:

  • Pre-training = General education (expensive, done by big companies)
  • Fine-tuning = Specialization (affordable, YOU can do this!)

In the next chapter, we’ll start exploring the β€œsecret sauce” - the Transformer architecture that makes all of this possible!

See you in Chapter 4! πŸš€


Questions? Drop them in the comments below! We respond to every single one.