Chapter 10: Token Embeddings - Converting Words to Meaning-Rich Vectors


πŸ“– Reading Time: 90 minutes
πŸ’» Coding Time: 120 minutes

Welcome to Chapter 10! Today we unlock the secret sauce of LLMs! πŸŽ‰

Journey so far:

  • Tokenization (Chapter 7) βœ…
  • Byte Pair Encoding (Chapter 8) βœ…
  • Data Sampling & Context Windows (Chapter 9) βœ…
  • Token IDs are ready!

Today’s mission:

  • Why token IDs aren’t enough
  • What are token embeddings?
  • How vectors capture meaning
  • Building embedding layers
  • Training embeddings
  • The foundation of GPT!

This is where the REAL magic happens! ✨



Why We Need Token Embeddings

🎯 The LLM Pipeline

Where we are:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 1: Tokenization                       β”‚
β”‚  "This is an example"                       β”‚
β”‚  β†’ ["This", "is", "an", "example"]          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 2: Token IDs                          β”‚
β”‚  ["This", "is", "an", "example"]            β”‚
β”‚  β†’ [101, 202, 303, 404]                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 3: Token Embeddings ← TODAY!          β”‚
β”‚  [101, 202, 303, 404]                       β”‚
β”‚  β†’ [[0.23, -0.45, 0.67, ...], ...]          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 4: Feed to LLM Training               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Question: Why not use token IDs directly? πŸ€”


The Problem with Random Token IDs

❌ Token IDs Alone Are Not Enough

Example vocabulary:

Word     Token ID
cat      34
kitten   7421
dog      5
puppy    89
book     33
tablet   9032

Problems:

❌ cat (34) and kitten (7421) are semantically related
   But their token IDs are nowhere near each other!

❌ dog (5) and puppy (89) are similar
   But 5 and 89 are far apart!

❌ cat (34) and book (33) are unrelated
   But their IDs sit right next to each other!

🧠 The Core Issue

Token IDs are just arbitrary numbers. They don’t encode semantic relationships!

Real-world analogy:

πŸ–ΌοΈ Images:
CNNs exploit spatial relationships between pixels
βœ… Eyes are close together
βœ… Ears are close to head
βœ… Position matters!

πŸ’¬ Text:
Token IDs ignore semantic relationships
❌ "cat" and "kitten" = unrelated numbers
❌ Meaning not captured
❌ Huge training inefficiency!

🎯 What We Need

We need to exploit the inherent structure in language:

Just like CNNs exploit spatial structure in images,
we need to exploit SEMANTIC structure in text!

Words have meaning!
Similar words should have similar representations!

Why One-Hot Encoding Fails

πŸ“Š One-Hot Encoding Attempt

Idea: Represent each word as a vector with all 0s and one 1.

Example:

Word    One-Hot Vector
dog     [1, 0, 0, 0]
cat     [0, 1, 0, 0]
puppy   [0, 0, 1, 0]
book    [0, 0, 0, 1]

❌ Problems with One-Hot Encoding

1. No semantic similarity:

dog   = [1, 0, 0, 0]
puppy = [0, 0, 1, 0]

# Distance between dog and puppy
distance = sqrt((1-0)^2 + (0-0)^2 + (0-1)^2 + (0-0)^2)
         = sqrt(2)
         β‰ˆ 1.41

# Distance between dog and book
dog  = [1, 0, 0, 0]
book = [0, 0, 0, 1]

distance = sqrt((1-0)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2)
         = sqrt(2)
         β‰ˆ 1.41

All words are equally distant! No semantic information! ❌


2. Huge dimensionality:

Vocabulary size = 50,000 words
β†’ Each word = 50,000-dimensional vector
β†’ 99.998% of values are 0
β†’ Extremely sparse and inefficient!

3. Orthogonality problem:

Every word vector is perpendicular to every other!
β†’ Dot product always = 0
β†’ No notion of similarity

One-hot encoding completely fails to capture meaning! πŸ’”
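Here's a minimal NumPy sketch (using the same four-word vocabulary) that confirms both failure modes at once: every pair of one-hot vectors sits exactly √2 apart, and every dot product is zero.

import numpy as np

# One-hot vectors for the 4-word vocabulary above
vectors = {
    "dog":   np.array([1, 0, 0, 0]),
    "cat":   np.array([0, 1, 0, 0]),
    "puppy": np.array([0, 0, 1, 0]),
    "book":  np.array([0, 0, 0, 1]),
}

# Related and unrelated pairs are exactly the same distance apart...
print(np.linalg.norm(vectors["dog"] - vectors["puppy"]))  # 1.414...
print(np.linalg.norm(vectors["dog"] - vectors["book"]))   # 1.414...

# ...and every pair is orthogonal (dot product = 0)
print(vectors["dog"] @ vectors["puppy"])  # 0
print(vectors["dog"] @ vectors["book"])   # 0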


Vector Embeddings: The Solution

πŸ’‘ The Big Idea

Represent each word as a DENSE vector where dimensions correspond to semantic features!


🎨 Visual Example

Instead of arbitrary numbers, use features!

Features:

  1. Has tail?
  2. Is edible?
  3. Has four legs?
  4. Makes sound?
  5. Is a pet?

Vector representations:

dog    = [0.9, 0.1, 0.9, 0.8, 0.9]  # High: tail, legs, sound, pet
cat    = [0.9, 0.1, 0.8, 0.7, 0.9]  # High: tail, legs, sound, pet
apple  = [0.0, 0.9, 0.0, 0.0, 0.0]  # High: edible
banana = [0.0, 0.9, 0.0, 0.0, 0.0]  # High: edible

πŸ“Š Visualizing Semantic Space

Has tail?
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.9 β”‚   dog β€’                        β”‚
    β”‚       β€’ cat                    β”‚
    β”‚                                β”‚
0.5 β”‚                                β”‚
    β”‚                                β”‚
0.1 β”‚                                β”‚
    β”‚                  β€’ apple       β”‚
0.0 β”‚                  β€’ banana      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         Is edible? β†’

Similar words cluster together! βœ…


🎯 Key Insight

Compare distances:

# dog and cat (similar animals)
dog = [0.9, 0.1, 0.9, 0.8, 0.9]
cat = [0.9, 0.1, 0.8, 0.7, 0.9]
# Small difference in most dimensions!

# dog and banana (unrelated)
dog    = [0.9, 0.1, 0.9, 0.8, 0.9]
banana = [0.0, 0.9, 0.0, 0.0, 0.0]
# Large difference in most dimensions!

Vectors capture semantic meaning! πŸŽ‰
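Here's a small sketch that puts numbers on this, using cosine similarity on the hand-crafted feature vectors above (the values are illustrative, not learned):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog    = np.array([0.9, 0.1, 0.9, 0.8, 0.9])
cat    = np.array([0.9, 0.1, 0.8, 0.7, 0.9])
banana = np.array([0.0, 0.9, 0.0, 0.0, 0.0])

print(cosine_similarity(dog, cat))     # β‰ˆ 0.998 (nearly identical)
print(cosine_similarity(dog, banana))  # β‰ˆ 0.06  (almost unrelated)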


✨ Benefits of Vector Embeddings

βœ… Semantic similarity captured
βœ… Dense representations (not sparse)
βœ… Efficient for large vocabularies
βœ… Can be trained automatically
βœ… Enable arithmetic operations on meaning!

Word2Vec Magic

πŸͺ„ The Most Famous Demo

Can vectors really capture meaning? Let’s test!


🀴 King - Man + Woman = ?

Hypothesis:

King is to Man as Queen is to Woman

King - Man + Woman should β‰ˆ Queen

Why?

King   = Royalty + Masculine
Man    = Masculine
Woman  = Feminine

King - Man + Woman = Royalty + Masculine - Masculine + Feminine
                   = Royalty + Feminine
                   = Queen!

πŸ’» Testing with Pre-trained Word2Vec

import gensim.downloader as api

# Load pre-trained Word2Vec
# Trained on Google News (100 billion words!)
# 300-dimensional vectors
model = api.load('word2vec-google-news-300')

# Test: King - Man + Woman = ?
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

print(result[0])

Output:

('queen', 0.7118)

IT WORKS! "queen" comes back with a cosine similarity of about 0.71! πŸŽ‰


πŸ”¬ More Examples

Test 1: Gender relationships

# uncle - man + woman = ?
result = model.most_similar(
    positive=['uncle', 'woman'],
    negative=['man']
)
# β†’ aunt (0.75)

# nephew - man + woman = ?
result = model.most_similar(
    positive=['nephew', 'woman'],
    negative=['man']
)
# β†’ niece (0.72)

Test 2: Semantic similarity

# How similar are these words?
print(model.similarity('woman', 'man'))      # 0.766
print(model.similarity('king', 'queen'))     # 0.651
print(model.similarity('uncle', 'aunt'))     # 0.738
print(model.similarity('boy', 'girl'))       # 0.884
print(model.similarity('nephew', 'niece'))   # 0.695

# Unrelated words
print(model.similarity('paper', 'water'))    # 0.183

Related words have high similarity! βœ…


Test 3: Finding similar words

# Find words similar to "tower"
print(model.most_similar('tower'))

Output:

[('towers', 0.79),
 ('skyscraper', 0.73),
 ('spire', 0.69),
 ('building', 0.68),
 ('monument', 0.66)]

Semantically related words cluster together! 🎯


πŸ“ Distance Metric

Vector distance = Semantic distance!

import numpy as np

# Get vectors
man = model['man']
woman = model['woman']
nephew = model['nephew']
niece = model['niece']
semiconductor = model['semiconductor']
earthworm = model['earthworm']

# Compute L2 distance
print(np.linalg.norm(man - woman))           # 1.73 (close!)
print(np.linalg.norm(nephew - niece))        # 1.96 (close!)
print(np.linalg.norm(semiconductor - earthworm))  # 5.67 (far!)

Small distance = Similar meaning! πŸ“Š


🎯 Key Takeaway

Well-trained embeddings encode meaning!

βœ… Similar words have similar vectors
βœ… Vector arithmetic = Semantic arithmetic
βœ… Geometric relationships = Meaning relationships
βœ… This is the foundation of modern NLP!

Building Embedding Layers in PyTorch

πŸ› οΈ Implementation Time!

Now let’s build embeddings for LLMs!


πŸ“¦ The Setup

Example sentence:

"quick fox is in the house"

Step 1: Tokenization

tokens = ["quick", "fox", "is", "in", "the", "house"]

Step 2: Create vocabulary (sorted)

vocab = {
    "fox": 0,
    "house": 1,
    "in": 2,
    "is": 3,
    "quick": 4,
    "the": 5
}

Step 3: Get token IDs

# Sentence: "in is the house"
token_ids = [2, 3, 5, 1]

🎯 Goal

Convert token IDs β†’ embedding vectors

Token ID 2 β†’ [ 0.56,  0.23, -0.78]
Token ID 3 β†’ [-0.34,  0.12,  0.90]
Token ID 5 β†’ [-0.67,  0.34, -0.23]
Token ID 1 β†’ [-0.12,  0.89, -0.34]

πŸ’» Creating Embedding Layer

import torch
import torch.nn as nn

# Parameters
vocab_size = 6       # 6 words in vocabulary
embedding_dim = 3    # 3-dimensional vectors

# Create embedding layer
embedding_layer = nn.Embedding(
    num_embeddings=vocab_size,    # Vocabulary size
    embedding_dim=embedding_dim    # Vector dimension
)

print(embedding_layer)

Output:

Embedding(6, 3)

πŸ” Understanding the Parameters

num_embeddings:

Number of unique tokens in vocabulary
GPT-2: 50,257
Our example: 6

embedding_dim:

Dimension of embedding vectors
GPT-2: 768 (Small) up to 1,600 (XL)
Our example: 3

πŸ“Š The Embedding Weight Matrix

# View the embedding weights
print(embedding_layer.weight)

Output:

Parameter containing:
tensor([[ 0.2300, -0.4500,  0.6700],  # Token ID 0
        [-0.1200,  0.8900, -0.3400],  # Token ID 1
        [ 0.5600,  0.2300, -0.7800],  # Token ID 2
        [-0.3400,  0.1200,  0.9000],  # Token ID 3
        [ 0.7800, -0.5600,  0.1200],  # Token ID 4
        [-0.6700,  0.3400, -0.2300]], # Token ID 5
       requires_grad=True)

Shape: (6, 3) = (vocab_size, embedding_dim)


🎯 Key Points

1. Random initialization:

Weights are initialized randomly!
These will be optimized during training.

2. Matrix structure:

Rows = Number of tokens (vocab_size)
Columns = Embedding dimension (embedding_dim)

3. Each row = One token’s embedding:

Row 0 = Embedding for token ID 0
Row 1 = Embedding for token ID 1
...
Row 5 = Embedding for token ID 5

The Embedding Weight Matrix

πŸ“ Matrix Dimensions

For our example:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Token ID 0: [ 0.23, -0.45, 0.67]  β”‚  ← Row 0
β”‚  Token ID 1: [-0.12,  0.89, -0.34] β”‚  ← Row 1
β”‚  Token ID 2: [ 0.56,  0.23, -0.78] β”‚  ← Row 2
β”‚  Token ID 3: [-0.34,  0.12,  0.90] β”‚  ← Row 3
β”‚  Token ID 4: [ 0.78, -0.56,  0.12] β”‚  ← Row 4
β”‚  Token ID 5: [-0.67,  0.34, -0.23] β”‚  ← Row 5
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        ↑        ↑        ↑
      Dim 0    Dim 1    Dim 2

Matrix shape: (6 tokens, 3 dimensions)


πŸ—οΈ GPT-2 Dimensions

GPT-2 Small:

Vocabulary size: 50,257 tokens
Embedding dimension: 768

Embedding weight matrix shape: (50,257 Γ— 768)
Total parameters: 50,257 Γ— 768 = 38,597,376 parameters!

Just for embeddings! 😱


🎯 How It Works

Each token ID maps to a row:

# Token ID 3 β†’ Row 3
embedding_layer.weight[3]
# Output: tensor([-0.34,  0.12,  0.90])

# Token ID 0 β†’ Row 0
embedding_layer.weight[0]
# Output: tensor([ 0.23, -0.45,  0.67])

Simple lookup operation! ✨


Lookup Table Operations

πŸ”Ž Single Token Lookup

# Get embedding for token ID 3
token_id = torch.tensor([3])
embedding = embedding_layer(token_id)

print(embedding)
# Output: tensor([[-0.34,  0.12,  0.90]])

Returns 3D vector for token ID 3! βœ…


πŸ“š Multiple Token Lookup

# Get embeddings for multiple tokens
input_ids = torch.tensor([2, 3, 5, 1])

embeddings = embedding_layer(input_ids)

print(embeddings)
print(embeddings.shape)

Output:

tensor([[ 0.56,  0.23, -0.78],  # Token ID 2
        [-0.34,  0.12,  0.90],  # Token ID 3
        [-0.67,  0.34, -0.23],  # Token ID 5
        [-0.12,  0.89, -0.34]]) # Token ID 1

torch.Size([4, 3])

4 tokens β†’ 4 embedding vectors! πŸŽ‰


🎯 Why β€œLookup Table”?

It’s literally looking up rows!

Input: Token ID 3
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Row 0: [...]           β”‚
β”‚  Row 1: [...]           β”‚
β”‚  Row 2: [...]           β”‚
β”‚  Row 3: [-0.34, 0.12, 0.90] ← Found it!
β”‚  Row 4: [...]           β”‚
β”‚  Row 5: [...]           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       ↓
Output: [-0.34, 0.12, 0.90]

Super efficient! O(1) lookup! ⚑
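You can verify the lookup behaviour with a tiny self-contained sketch (fresh random weights, same 6Γ—3 shape as our toy example): calling the layer returns exactly the rows you would get by indexing the weight matrix directly.

import torch
import torch.nn as nn

embedding_layer = nn.Embedding(num_embeddings=6, embedding_dim=3)
input_ids = torch.tensor([2, 3, 5, 1])

from_layer  = embedding_layer(input_ids)         # forward pass (lookup)
from_matrix = embedding_layer.weight[input_ids]  # direct row indexing

print(torch.equal(from_layer, from_matrix))  # True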


πŸ“Š Batch Processing

# Batch of sentences
# Sentence 1: [2, 3, 1]
# Sentence 2: [5, 0, 4]

batch = torch.tensor([
    [2, 3, 1],
    [5, 0, 4]
])

embeddings = embedding_layer(batch)

print(embeddings.shape)
# Output: torch.Size([2, 3, 3])
#         ↑  ↑  ↑
#         |  |  └─ Embedding dim
#         |  └──── Sequence length
#         └─────── Batch size

Training Embeddings

πŸ‹οΈ How Are Embeddings Trained?

Remember: Embeddings start random!

Initial embeddings = Random noise
Trained embeddings = Meaningful vectors

How do we go from random β†’ meaningful? πŸ€”


🎯 Training Process

Step 1: Random initialization

embedding_layer = nn.Embedding(vocab_size, embedding_dim)
# Weights are random!

Step 2: Forward pass

# Input: Token IDs
input_ids = torch.tensor([2, 3, 5, 1])

# Get embeddings
embeddings = embedding_layer(input_ids)

# Feed to LLM
output = model(embeddings)

# Compute loss
loss = criterion(output, target)

Step 3: Backward pass

# Compute gradients
loss.backward()

# Update embedding weights!
optimizer.step()

Embeddings are updated via backpropagation! ✨
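The snippets above are schematic (model, criterion, and optimizer aren't defined yet). Here's a minimal, runnable sketch of the same idea, using a toy vocabulary and a single linear output head standing in for the rest of the model, to show that one backward pass really does move the embedding weights:

import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 3
embedding = nn.Embedding(vocab_size, embedding_dim)
head = nn.Linear(embedding_dim, vocab_size)  # toy stand-in for the rest of the model

optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(head.parameters()), lr=0.1
)
criterion = nn.CrossEntropyLoss()

input_ids = torch.tensor([2, 3, 5])  # toy context tokens
targets   = torch.tensor([3, 5, 1])  # toy "next token" targets

before = embedding.weight.detach().clone()

logits = head(embedding(input_ids))  # forward pass through the embeddings
loss = criterion(logits, targets)
loss.backward()                      # gradients flow into embedding.weight
optimizer.step()

# Only the looked-up rows (2, 3, 5) receive gradient updates
print((embedding.weight.detach() - before).abs().sum(dim=1))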


πŸ“Š Training Dynamics

Initial (random):

cat   = [ 0.23, -0.45,  0.67]
kitten = [-0.89,  0.12, -0.34]
# Distance: Large! (random)

After training:

cat   = [ 0.78,  0.34,  0.56]
kitten = [ 0.82,  0.31,  0.59]
# Distance: Small! (similar meaning captured)

Training brings similar words closer! 🎯


πŸŽ“ What Guides Training?

Context in sentences!

"The cat sat on the mat"
"The kitten played on the rug"

β†’ "cat" and "kitten" appear in similar contexts
β†’ Their embeddings should be similar
β†’ Gradients push them closer together!

This is self-supervised learning! 🧠


πŸ”₯ Training Details

1. Embeddings are part of the model:

class GPTModel(nn.Module):
    def __init__(self):
        super().__init__()  # Register submodules with nn.Module
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # ... rest of model
    
    def forward(self, input_ids):
        x = self.embedding(input_ids)  # Gradients flow here!
        # ... rest of forward pass
        return x

2. Updated via backpropagation:

# Training loop
for batch in dataloader:
    input_ids, targets = batch
    
    # Forward pass (embeddings used here)
    outputs = model(input_ids)
    
    # Compute loss
    loss = criterion(outputs, targets)
    
    # Backward pass (embeddings updated here!)
    loss.backward()
    optimizer.step()

3. Requires_grad = True:

print(embedding_layer.weight.requires_grad)
# Output: True

# This means gradients will be computed
# and weights will be updated!

GPT-2 Embeddings Deep Dive

πŸ“ GPT-2 Specifications

GPT-2 Small:

Parameter                    Value
Vocabulary size              50,257
Embedding dimension          768
Embedding matrix shape       (50,257 Γ— 768)
Total embedding parameters   38,597,376

That’s 38 million parameters just for embeddings! 😱


πŸ’» Creating GPT-2 Size Embeddings

import torch.nn as nn

# GPT-2 specifications
vocab_size = 50_257
embedding_dim = 768

# Create embedding layer
gpt2_embeddings = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=embedding_dim
)

print(f"Shape: {gpt2_embeddings.weight.shape}")
print(f"Total parameters: {vocab_size * embedding_dim:,}")

Output:

Shape: torch.Size([50257, 768])
Total parameters: 38,597,376

🎯 Why 768 Dimensions?

Trade-offs:

Small dimensions (e.g., 128):

βœ… Fewer parameters
βœ… Faster training
❌ Less expressive
❌ Can't capture complex meanings

Large dimensions (e.g., 768):

βœ… More expressive
βœ… Captures nuanced meanings
βœ… Better performance
❌ More parameters
❌ Slower training

768 = Sweet spot for GPT-2! βš–οΈ


πŸ”¬ GPT-2 vs GPT-3 vs GPT-4

Model          Vocab Size    Embed Dim    Parameters
GPT-2 Small    50,257        768          38.6M
GPT-2 XL       50,257        1,600        80.4M
GPT-3          50,257        12,288       617.6M
GPT-4          Undisclosed   Undisclosed  Undisclosed

Embedding parameters scale massively! πŸ“ˆ
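The parameter counts in the table are simply vocabulary size Γ— embedding dimension; a quick sanity check using the figures listed above:

configs = {
    "GPT-2 Small": (50_257, 768),
    "GPT-2 XL":    (50_257, 1_600),
    "GPT-3":       (50_257, 12_288),
}

for name, (vocab_size, embed_dim) in configs.items():
    print(f"{name}: {vocab_size * embed_dim:,} embedding parameters")

# GPT-2 Small: 38,597,376 embedding parameters
# GPT-2 XL:    80,411,200 embedding parameters
# GPT-3:       617,558,016 embedding parameters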


πŸŽ“ Complete Example

import torch
import torch.nn as nn
import tiktoken

# Initialize tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Text
text = "Hello, how are you today?"

# Tokenize
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Create embedding layer (GPT-2 size)
embedding_layer = nn.Embedding(
    num_embeddings=50_257,
    embedding_dim=768
)

# Convert to tensor
input_ids = torch.tensor(token_ids)

# Get embeddings
embeddings = embedding_layer(input_ids)

print(f"Input shape: {input_ids.shape}")
print(f"Embedding shape: {embeddings.shape}")

Output:

Token IDs: [15496, 11, 703, 389, 345, 1909, 30]
Input shape: torch.Size([7])
Embedding shape: torch.Size([7, 768])

7 tokens β†’ 7 vectors of 768 dimensions! βœ…


Embedding Layer vs Linear Layer

πŸ€” Are They The Same?

Surprising fact: Embedding layer = Special case of linear layer!


πŸ”¬ The Math

Linear layer:

output = input @ weight.T

# If input is one-hot encoded:
input = [0, 0, 1, 0]  # 1 at index 2 β†’ selects row 2
weight.T = [[w00, w01, w02],
            [w10, w11, w12],
            [w20, w21, w22],  ← Selected!
            [w30, w31, w32]]

output = [w20, w21, w22]  # Row 2!

Embedding layer:

token_id = 2  # selects row 2
weight = [[w00, w01, w02],
          [w10, w11, w12],
          [w20, w21, w22],  ← Selected!
          [w30, w31, w32]]

output = [w20, w21, w22]  # Row 2!

Same result! πŸŽ‰


πŸ’» Proof in Code

import torch
import torch.nn as nn

vocab_size = 4
embedding_dim = 5
input_ids = [2, 3, 1]

# Method 1: Embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
emb_output = embedding(torch.tensor(input_ids))

# Method 2: Linear layer with one-hot
linear = nn.Linear(vocab_size, embedding_dim, bias=False)
linear.weight = nn.Parameter(embedding.weight.T)  # nn.Linear stores (out, in), so transpose

# One-hot encode
one_hot = torch.zeros(len(input_ids), vocab_size)
one_hot[0, 2] = 1
one_hot[1, 3] = 1
one_hot[2, 1] = 1

linear_output = linear(one_hot)

# Compare
print(torch.allclose(emb_output, linear_output))
# Output: True

Mathematically identical! βœ…


🎯 Why Use Embedding Layer?

Linear layer approach:

❌ Requires one-hot encoding
❌ Many multiplications with 0
❌ Memory inefficient (50,257-dim vectors!)
❌ Computationally wasteful

Embedding layer approach:

βœ… Direct lookup (no one-hot)
βœ… O(1) operation
βœ… Memory efficient
βœ… Optimized implementation

Embedding layer is MUCH faster! ⚑


πŸ“Š Performance Comparison

For GPT-2 (vocab = 50,257):

Method              Per-token input            Speed
Linear + one-hot    50,257-dim float vector    Slow (full matrix multiply)
Embedding lookup    1 integer index            Fast (direct row access)

The lookup skips a 50,257-dimensional matrix multiplication per token, making it orders of magnitude more efficient! πŸš€
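If you want to see the gap yourself, here's a rough timing sketch (exact numbers depend on your hardware; the point is the relative difference between a direct lookup and the equivalent one-hot matrix multiplication):

import time
import torch
import torch.nn as nn

vocab_size, embed_dim, n_tokens = 50_257, 768, 1_024

embedding = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.randint(0, vocab_size, (n_tokens,))

# Direct embedding lookup
start = time.perf_counter()
_ = embedding(token_ids)
lookup_ms = (time.perf_counter() - start) * 1e3

# Equivalent one-hot + matrix multiplication
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
start = time.perf_counter()
_ = one_hot @ embedding.weight
matmul_ms = (time.perf_counter() - start) * 1e3

print(f"Lookup: {lookup_ms:.2f} ms | One-hot matmul: {matmul_ms:.2f} ms")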


Chapter Summary

πŸŽ‰ What We Learned Today

This was a FOUNDATIONAL chapter! Let’s recap:


1. Why Token IDs Aren’t Enough

Problem:
βœ— Token IDs are arbitrary numbers
βœ— Don't capture semantic relationships
βœ— cat (34) and kitten (7421) = unrelated numbers

Solution:
βœ“ Convert to dense vectors
βœ“ Encode semantic meaning
βœ“ Similar words = Similar vectors

2. Why One-Hot Encoding Fails

Issues:
βœ— No semantic similarity (all equally distant)
βœ— Huge dimensionality (50,000-dim vectors!)
βœ— Sparse and inefficient

Embeddings win:
βœ“ Dense representations
βœ“ Semantic relationships preserved
βœ“ Efficient for large vocabularies

3. Vector Embeddings Capture Meaning

Word2Vec magic:
βœ“ King - Man + Woman = Queen
βœ“ Similar words have high similarity scores
βœ“ Distance in vector space = Semantic distance
βœ“ Trained on Google News (100B words)

Proof that embeddings work! πŸŽ‰

4. Building Embedding Layers

# PyTorch implementation
embedding = nn.Embedding(
    num_embeddings=vocab_size,  # 50,257 for GPT-2
    embedding_dim=768           # 768 for GPT-2
)

# Creates weight matrix: (50,257 Γ— 768)
# Total parameters: 38.6 million!

5. Embedding Weight Matrix

Structure:
- Rows = Vocabulary size
- Columns = Embedding dimension
- Each row = One token's embedding vector

GPT-2:
- Matrix shape: (50,257 Γ— 768)
- Initialized randomly
- Trained via backpropagation

6. Lookup Table Operations

Embedding layer = Lookup table

Input: Token ID 3
Operation: Retrieve row 3 from weight matrix
Output: 768-dimensional vector

Efficient: O(1) operation!
Far more efficient than one-hot + a linear layer!

7. Training Embeddings

Process:
1. Initialize weights randomly
2. Forward pass (use embeddings)
3. Compute loss
4. Backward pass (update embeddings!)
5. Repeat for millions of steps

Result:
Random vectors β†’ Meaningful representations!

πŸ“Š Complete Pipeline

Text β†’ Tokens β†’ Token IDs β†’ Embeddings β†’ LLM Training

"Hello world"
      ↓
["Hello", "world"]
      ↓
[15496, 995]
      ↓
[[0.23, -0.45, ..., 0.67],   ← 768-dim
 [-0.12, 0.89, ..., -0.34]]  ← 768-dim
      ↓
Feed to GPT model!

πŸ’‘ Key Takeaways

  1. Token embeddings = Dense vectors that encode meaning
  2. Similar words β†’ Similar vectors (trained via context)
  3. Embedding layer = Efficient lookup table (beats one-hot)
  4. GPT-2 embeddings: 50,257 Γ— 768 = 38.6M parameters
  5. Trained jointly with LLM (via backpropagation)
  6. Foundation of modern NLP (enables semantic understanding)
  7. Next: Positional embeddings! (position also matters!)

🎯 What We Learned (Checklist)

  • [x] Why token IDs fail to capture meaning
  • [x] Problems with one-hot encoding
  • [x] How vectors can encode semantics
  • [x] Word2Vec demonstrations (King - Man + Woman = Queen)
  • [x] Building embedding layers in PyTorch
  • [x] Understanding embedding weight matrices
  • [x] Lookup table operations
  • [x] Training embeddings via backpropagation
  • [x] GPT-2 embedding specifications
  • [x] Embedding vs linear layer comparison

πŸ”œ Next Chapter: Chapter 11

Topic: Positional Embeddings

What we’ll learn:

  • Why position matters in sequences
  • Absolute vs relative position
  • Sinusoidal positional encoding
  • Learned positional embeddings
  • Combining token + positional embeddings
  • GPT’s positional encoding scheme

Because β€œThe cat sat on mat” β‰  β€œSat cat the on mat”! πŸ“


πŸ“ Practice Exercises

Try these before the next chapter:

  1. Load Word2Vec and test your own analogies
  2. Create embedding layer for vocabulary of 1000
  3. Implement lookup for batch of token IDs
  4. Visualize embedding space with PCA/t-SNE (see the starter sketch below)
  5. Compare embedding vs linear layer performance
  6. Train simple embeddings on toy dataset
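For exercise 4, here's a starter sketch (assumes gensim, scikit-learn, and matplotlib are installed; the Google News model is a large download):

import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load('word2vec-google-news-300')

words = ['king', 'queen', 'man', 'woman', 'dog', 'cat', 'apple', 'banana']
vectors = [model[w] for w in words]

# Project 300-dimensional vectors down to 2D for plotting
coords = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(coords, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.title("Word2Vec embeddings projected to 2D with PCA")
plt.show()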

Share your results! πŸ’¬


πŸš€ Take Action Now!

  1. πŸ’» Install gensim - Test Word2Vec examples
  2. πŸ§ͺ Experiment - Try King - Man + Woman yourself!
  3. πŸ“ Code Along - Build embedding layers
  4. ❓ Ask Questions - Comment if stuck
  5. πŸ”– Bookmark - Critical reference material
  6. ⏭️ Get Ready - Next: Positional embeddings!

Quick Reference

Embedding Layer Template:

import torch.nn as nn

# Create embedding
embedding = nn.Embedding(
    num_embeddings=50257,  # Vocabulary size
    embedding_dim=768      # Vector dimension
)

# Use embedding
token_ids = torch.tensor([2, 3, 5, 1])
embeddings = embedding(token_ids)
# Shape: (4, 768)

Word2Vec Usage:

import gensim.downloader as api

# Load model
model = api.load('word2vec-google-news-300')

# Test analogy
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
# β†’ queen

# Check similarity
similarity = model.similarity('cat', 'kitten')
# β†’ high (related words score well above unrelated pairs)

Key Dimensions:

Model          Vocab Size   Embed Dim   Parameters
Tutorial       6            3           18
GPT-2 Small    50,257       768         38.6M
GPT-2 XL       50,257       1,600       80.4M
GPT-3          50,257       12,288      617.6M

Thank You!

You’ve completed Chapter 10 - Token Embeddings! πŸŽ‰

You now understand:

  • βœ… Why embeddings are crucial
  • βœ… How vectors capture meaning
  • βœ… Word2Vec and semantic arithmetic
  • βœ… Building embedding layers
  • βœ… The embedding weight matrix
  • βœ… Training embeddings

Next chapter: Positional Embeddings

The foundation is laid! Let’s build GPT! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Did Word2Vec blow your mind?
  • Which part was most enlightening?
  • Any questions about embeddings?
  • Share your experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 11: Positional Embeddings
Chapter 12: Combining Token + Position
Chapter 13: Self-Attention Mechanism
Chapter 14: Building GPT from Scratch

The journey continues! πŸ’»πŸ”₯


See you in Chapter 11 where we learn positional encoding! πŸš€


Questions? Confused about embeddings? Drop them below! πŸ’ͺ