Chapter 10: Token Embeddings - Converting Words to Meaning-Rich Vectors


πŸ“– Reading Time: 90 minutes
πŸ’» Coding Time: 120 minutes

Welcome to Chapter 10! Today we unlock the secret sauce of LLMs! πŸŽ‰

Journey so far:

  • Tokenization (Chapter 7) βœ…
  • Byte Pair Encoding (Chapter 8) βœ…
  • Data Sampling & Context Windows (Chapter 9) βœ…
  • Token IDs are ready!

Today’s mission:

  • Why token IDs aren’t enough
  • What are token embeddings?
  • How vectors capture meaning
  • Building embedding layers
  • Training embeddings
  • The foundation of GPT!

This is where the REAL magic happens! ✨



Why We Need Token Embeddings

🎯 The LLM Pipeline

Where we are:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 1: Tokenization                       β”‚
β”‚  "This is an example"                       β”‚
β”‚  β†’ ["This", "is", "an", "example"]          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 2: Token IDs                          β”‚
β”‚  ["This", "is", "an", "example"]            β”‚
β”‚  β†’ [101, 202, 303, 404]                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 3: Token Embeddings ← TODAY!          β”‚
β”‚  [101, 202, 303, 404]                       β”‚
β”‚  β†’ [[0.23, -0.45, 0.67, ...], ...]          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Step 4: Feed to LLM Training               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Question: Why not use token IDs directly? πŸ€”


The Problem with Random Token IDs

❌ Token IDs Alone Are Not Enough

Example vocabulary:

Word     Token ID
cat      34
kitten   7421
dog      5
puppy    89
book     33
tablet   9032

Problems:

❌ cat (34) and kitten (7421) are semantically related
   But their token IDs are nowhere near each other!

❌ dog (5) and puppy (89) are similar
   But 5 and 89 are far apart!

❌ cat (34) and book (33) are unrelated
   But their IDs sit right next to each other!

🧠 The Core Issue

Token IDs are just arbitrary numbers. They don’t encode semantic relationships!

Real-world analogy:

πŸ–ΌοΈ Images:
CNNs exploit spatial relationships between pixels
βœ… Eyes are close together
βœ… Ears are close to head
βœ… Position matters!

πŸ’¬ Text:
Token IDs ignore semantic relationships
❌ "cat" and "kitten" = unrelated numbers
❌ Meaning not captured
❌ Huge training inefficiency!

🎯 What We Need

We need to exploit the inherent structure in language:

Just like CNNs exploit spatial structure in images,
we need to exploit SEMANTIC structure in text!

Words have meaning!
Similar words should have similar representations!

Why One-Hot Encoding Fails

πŸ“Š One-Hot Encoding Attempt

Idea: Represent each word as a vector with all 0s and one 1.

Example:

Word    One-Hot Vector
dog     [1, 0, 0, 0]
cat     [0, 1, 0, 0]
puppy   [0, 0, 1, 0]
book    [0, 0, 0, 1]

❌ Problems with One-Hot Encoding

1. No semantic similarity:

dog   = [1, 0, 0, 0]
puppy = [0, 0, 1, 0]

# Distance between dog and puppy
distance = sqrt((1-0)^2 + (0-0)^2 + (0-1)^2 + (0-0)^2)
         = sqrt(2)
         β‰ˆ 1.41

# Distance between dog and book
dog  = [1, 0, 0, 0]
book = [0, 0, 0, 1]

distance = sqrt((1-0)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2)
         = sqrt(2)
         β‰ˆ 1.41

All words are equally distant! No semantic information! ❌


2. Huge dimensionality:

Vocabulary size = 50,000 words
β†’ Each word = 50,000-dimensional vector
β†’ 99.998% of values are 0
β†’ Extremely sparse and inefficient!

3. Orthogonality problem:

Every word vector is perpendicular to every other!
β†’ Dot product always = 0
β†’ No notion of similarity

One-hot encoding completely fails to capture meaning! πŸ’”
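Here's a minimal NumPy sketch (using the same four-word vocabulary) that confirms both failure modes at once: every pair of one-hot vectors sits exactly √2 apart, and every dot product is zero.

import numpy as np

# One-hot vectors for the 4-word vocabulary above
vectors = {
    "dog":   np.array([1, 0, 0, 0]),
    "cat":   np.array([0, 1, 0, 0]),
    "puppy": np.array([0, 0, 1, 0]),
    "book":  np.array([0, 0, 0, 1]),
}

# Related and unrelated pairs are exactly the same distance apart...
print(np.linalg.norm(vectors["dog"] - vectors["puppy"]))  # 1.414...
print(np.linalg.norm(vectors["dog"] - vectors["book"]))   # 1.414...

# ...and every pair is orthogonal (dot product = 0)
print(vectors["dog"] @ vectors["puppy"])  # 0
print(vectors["dog"] @ vectors["book"])   # 0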


Vector Embeddings: The Solution

πŸ’‘ The Big Idea

Represent each word as a DENSE vector where dimensions correspond to semantic features!


🎨 Visual Example

Instead of arbitrary numbers, use features!

Features:

  1. Has tail?
  2. Is edible?
  3. Has four legs?
  4. Makes sound?
  5. Is a pet?

Vector representations:

dog    = [0.9, 0.1, 0.9, 0.8, 0.9]  # High: tail, legs, sound, pet
cat    = [0.9, 0.1, 0.8, 0.7, 0.9]  # High: tail, legs, sound, pet
apple  = [0.0, 0.9, 0.0, 0.0, 0.0]  # High: edible
banana = [0.0, 0.9, 0.0, 0.0, 0.0]  # High: edible

πŸ“Š Visualizing Semantic Space

Has tail?
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
0.9 β”‚   dog β€’                        β”‚
    β”‚       β€’ cat                    β”‚
    β”‚                                β”‚
0.5 β”‚                                β”‚
    β”‚                                β”‚
0.1 β”‚                                β”‚
    β”‚                  β€’ apple       β”‚
0.0 β”‚                  β€’ banana      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         Is edible? β†’

Similar words cluster together! βœ…


🎯 Key Insight

Compare distances:

# dog and cat (similar animals)
dog = [0.9, 0.1, 0.9, 0.8, 0.9]
cat = [0.9, 0.1, 0.8, 0.7, 0.9]
# Small difference in most dimensions!

# dog and banana (unrelated)
dog    = [0.9, 0.1, 0.9, 0.8, 0.9]
banana = [0.0, 0.9, 0.0, 0.0, 0.0]
# Large difference in most dimensions!

Vectors capture semantic meaning! πŸŽ‰
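Here's a small sketch that puts numbers on this, using cosine similarity on the hand-crafted feature vectors above (the values are illustrative, not learned):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog    = np.array([0.9, 0.1, 0.9, 0.8, 0.9])
cat    = np.array([0.9, 0.1, 0.8, 0.7, 0.9])
banana = np.array([0.0, 0.9, 0.0, 0.0, 0.0])

print(cosine_similarity(dog, cat))     # β‰ˆ 0.998 (nearly identical)
print(cosine_similarity(dog, banana))  # β‰ˆ 0.06  (almost unrelated)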


✨ Benefits of Vector Embeddings

βœ… Semantic similarity captured
βœ… Dense representations (not sparse)
βœ… Efficient for large vocabularies
βœ… Can be trained automatically
βœ… Enable arithmetic operations on meaning!

Word2Vec Magic

πŸͺ„ The Most Famous Demo

Can vectors really capture meaning? Let’s test!


🀴 King - Man + Woman = ?

Hypothesis:

King is to Man as Queen is to Woman

King - Man + Woman should β‰ˆ Queen

Why?

King   = Royalty + Masculine
Man    = Masculine
Woman  = Feminine

King - Man + Woman = Royalty + Masculine - Masculine + Feminine
                   = Royalty + Feminine
                   = Queen!

πŸ’» Testing with Pre-trained Word2Vec

import gensim.downloader as api

# Load pre-trained Word2Vec
# Trained on Google News (100 billion words!)
# 300-dimensional vectors
model = api.load('word2vec-google-news-300')

# Test: King - Man + Woman = ?
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

print(result[0])

Output:

('queen', 0.7118)

IT WORKS! "queen" comes back with a cosine similarity of about 0.71! πŸŽ‰


πŸ”¬ More Examples

Test 1: Gender relationships

# uncle - man + woman = ?
result = model.most_similar(
    positive=['uncle', 'woman'],
    negative=['man']
)
# β†’ aunt (0.75)

# nephew - man + woman = ?
result = model.most_similar(
    positive=['nephew', 'woman'],
    negative=['man']
)
# β†’ niece (0.72)

Test 2: Semantic similarity

# How similar are these words?
print(model.similarity('woman', 'man'))      # 0.766
print(model.similarity('king', 'queen'))     # 0.651
print(model.similarity('uncle', 'aunt'))     # 0.738
print(model.similarity('boy', 'girl'))       # 0.884
print(model.similarity('nephew', 'niece'))   # 0.695

# Unrelated words
print(model.similarity('paper', 'water'))    # 0.183

Related words have high similarity! βœ…


Test 3: Finding similar words

# Find words similar to "tower"
print(model.most_similar('tower'))

Output:

[('towers', 0.79),
 ('skyscraper', 0.73),
 ('spire', 0.69),
 ('building', 0.68),
 ('monument', 0.66)]

Semantically related words cluster together! 🎯


πŸ“ Distance Metric

Vector distance = Semantic distance!

import numpy as np

# Get vectors
man = model['man']
woman = model['woman']
nephew = model['nephew']
niece = model['niece']
semiconductor = model['semiconductor']
earthworm = model['earthworm']

# Compute L2 distance
print(np.linalg.norm(man - woman))           # 1.73 (close!)
print(np.linalg.norm(nephew - niece))        # 1.96 (close!)
print(np.linalg.norm(semiconductor - earthworm))  # 5.67 (far!)

Small distance = Similar meaning! πŸ“Š


🎯 Key Takeaway

Well-trained embeddings encode meaning!

βœ… Similar words have similar vectors
βœ… Vector arithmetic = Semantic arithmetic
βœ… Geometric relationships = Meaning relationships
βœ… This is the foundation of modern NLP!

Building Embedding Layers in PyTorch

πŸ› οΈ Implementation Time!

Now let’s build embeddings for LLMs!


πŸ“¦ The Setup

Example sentence:

"quick fox is in the house"

Step 1: Tokenization

tokens = ["quick", "fox", "is", "in", "the", "house"]

Step 2: Create vocabulary (sorted)

vocab = {
    "fox": 0,
    "house": 1,
    "in": 2,
    "is": 3,
    "quick": 4,
    "the": 5
}

Step 3: Get token IDs

# Sentence: "in is the house"
token_ids = [2, 3, 5, 1]

🎯 Goal

Convert token IDs β†’ embedding vectors

Token ID 2 β†’ [ 0.56,  0.23, -0.78]
Token ID 3 β†’ [-0.34,  0.12,  0.90]
Token ID 5 β†’ [-0.67,  0.34, -0.23]
Token ID 1 β†’ [-0.12,  0.89, -0.34]

πŸ’» Creating Embedding Layer

import torch
import torch.nn as nn

# Parameters
vocab_size = 6       # 6 words in vocabulary
embedding_dim = 3    # 3-dimensional vectors

# Create embedding layer
embedding_layer = nn.Embedding(
    num_embeddings=vocab_size,    # Vocabulary size
    embedding_dim=embedding_dim    # Vector dimension
)

print(embedding_layer)

Output:

Embedding(6, 3)

πŸ” Understanding the Parameters

num_embeddings:

Number of unique tokens in vocabulary
GPT-2: 50,257
Our example: 6

embedding_dim:

Dimension of embedding vectors
GPT-2: 768 (Small) up to 1,600 (XL)
Our example: 3

πŸ“Š The Embedding Weight Matrix

# View the embedding weights
print(embedding_layer.weight)

Output:

Parameter containing:
tensor([[ 0.2300, -0.4500,  0.6700],  # Token ID 0
        [-0.1200,  0.8900, -0.3400],  # Token ID 1
        [ 0.5600,  0.2300, -0.7800],  # Token ID 2
        [-0.3400,  0.1200,  0.9000],  # Token ID 3
        [ 0.7800, -0.5600,  0.1200],  # Token ID 4
        [-0.6700,  0.3400, -0.2300]], # Token ID 5
       requires_grad=True)

Shape: (6, 3) = (vocab_size, embedding_dim)


🎯 Key Points

1. Random initialization:

Weights are initialized randomly!
These will be optimized during training.

2. Matrix structure:

Rows = Number of tokens (vocab_size)
Columns = Embedding dimension (embedding_dim)

3. Each row = One token’s embedding:

Row 0 = Embedding for token ID 0
Row 1 = Embedding for token ID 1
...
Row 5 = Embedding for token ID 5

The Embedding Weight Matrix

πŸ“ Matrix Dimensions

For our example:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Token ID 0: [ 0.23, -0.45, 0.67]  β”‚  ← Row 0
β”‚  Token ID 1: [-0.12,  0.89, -0.34] β”‚  ← Row 1
β”‚  Token ID 2: [ 0.56,  0.23, -0.78] β”‚  ← Row 2
β”‚  Token ID 3: [-0.34,  0.12,  0.90] β”‚  ← Row 3
β”‚  Token ID 4: [ 0.78, -0.56,  0.12] β”‚  ← Row 4
β”‚  Token ID 5: [-0.67,  0.34, -0.23] β”‚  ← Row 5
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        ↑        ↑        ↑
      Dim 0    Dim 1    Dim 2

Matrix shape: (6 tokens, 3 dimensions)


πŸ—οΈ GPT-2 Dimensions

GPT-2 Small:

Vocabulary size: 50,257 tokens
Embedding dimension: 768

Embedding weight matrix shape: (50,257 Γ— 768)
Total parameters: 50,257 Γ— 768 = 38,597,376 parameters!

Just for embeddings! 😱


🎯 How It Works

Each token ID maps to a row:

# Token ID 3 β†’ Row 3
embedding_layer.weight[3]
# Output: tensor([-0.34,  0.12,  0.90])

# Token ID 0 β†’ Row 0
embedding_layer.weight[0]
# Output: tensor([ 0.23, -0.45,  0.67])

Simple lookup operation! ✨


Lookup Table Operations

πŸ”Ž Single Token Lookup

# Get embedding for token ID 3
token_id = torch.tensor([3])
embedding = embedding_layer(token_id)

print(embedding)
# Output: tensor([[-0.34,  0.12,  0.90]])

Returns 3D vector for token ID 3! βœ…


πŸ“š Multiple Token Lookup

# Get embeddings for multiple tokens
input_ids = torch.tensor([2, 3, 5, 1])

embeddings = embedding_layer(input_ids)

print(embeddings)
print(embeddings.shape)

Output:

tensor([[ 0.56,  0.23, -0.78],  # Token ID 2
        [-0.34,  0.12,  0.90],  # Token ID 3
        [-0.67,  0.34, -0.23],  # Token ID 5
        [-0.12,  0.89, -0.34]]) # Token ID 1

torch.Size([4, 3])

4 tokens β†’ 4 embedding vectors! πŸŽ‰


🎯 Why β€œLookup Table”?

It’s literally looking up rows!

Input: Token ID 3
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Row 0: [...]           β”‚
β”‚  Row 1: [...]           β”‚
β”‚  Row 2: [...]           β”‚
β”‚  Row 3: [-0.34, 0.12, 0.90] ← Found it!
β”‚  Row 4: [...]           β”‚
β”‚  Row 5: [...]           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       ↓
Output: [-0.34, 0.12, 0.90]

Super efficient! O(1) lookup! ⚑
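You can verify the lookup behaviour with a tiny self-contained sketch (fresh random weights, same 6Γ—3 shape as our toy example): calling the layer returns exactly the rows you would get by indexing the weight matrix directly.

import torch
import torch.nn as nn

embedding_layer = nn.Embedding(num_embeddings=6, embedding_dim=3)
input_ids = torch.tensor([2, 3, 5, 1])

from_layer  = embedding_layer(input_ids)         # forward pass (lookup)
from_matrix = embedding_layer.weight[input_ids]  # direct row indexing

print(torch.equal(from_layer, from_matrix))  # True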


πŸ“Š Batch Processing

# Batch of sentences
# Sentence 1: [2, 3, 1]
# Sentence 2: [5, 0, 4]

batch = torch.tensor([
    [2, 3, 1],
    [5, 0, 4]
])

embeddings = embedding_layer(batch)

print(embeddings.shape)
# Output: torch.Size([2, 3, 3])
#         ↑  ↑  ↑
#         |  |  └─ Embedding dim
#         |  └──── Sequence length
#         └─────── Batch size

Training Embeddings

πŸ‹οΈ How Are Embeddings Trained?

Remember: Embeddings start random!

Initial embeddings = Random noise
Trained embeddings = Meaningful vectors

How do we go from random β†’ meaningful? πŸ€”


🎯 Training Process

Step 1: Random initialization

embedding_layer = nn.Embedding(vocab_size, embedding_dim)
# Weights are random!

Step 2: Forward pass

# Input: Token IDs
input_ids = torch.tensor([2, 3, 5, 1])

# Get embeddings
embeddings = embedding_layer(input_ids)

# Feed to LLM
output = model(embeddings)

# Compute loss
loss = criterion(output, target)

Step 3: Backward pass

# Compute gradients
loss.backward()

# Update embedding weights!
optimizer.step()

Embeddings are updated via backpropagation! ✨
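The snippets above are schematic (model, criterion, and optimizer aren't defined yet). Here's a minimal, runnable sketch of the same idea, using a toy vocabulary and a single linear output head standing in for the rest of the model, to show that one backward pass really does move the embedding weights:

import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 3
embedding = nn.Embedding(vocab_size, embedding_dim)
head = nn.Linear(embedding_dim, vocab_size)  # toy stand-in for the rest of the model

optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(head.parameters()), lr=0.1
)
criterion = nn.CrossEntropyLoss()

input_ids = torch.tensor([2, 3, 5])  # toy context tokens
targets   = torch.tensor([3, 5, 1])  # toy "next token" targets

before = embedding.weight.detach().clone()

logits = head(embedding(input_ids))  # forward pass through the embeddings
loss = criterion(logits, targets)
loss.backward()                      # gradients flow into embedding.weight
optimizer.step()

# Only the looked-up rows (2, 3, 5) receive gradient updates
print((embedding.weight.detach() - before).abs().sum(dim=1))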


πŸ“Š Training Dynamics

Initial (random):

cat   = [ 0.23, -0.45,  0.67]
kitten = [-0.89,  0.12, -0.34]
# Distance: Large! (random)

After training:

cat   = [ 0.78,  0.34,  0.56]
kitten = [ 0.82,  0.31,  0.59]
# Distance: Small! (similar meaning captured)

Training brings similar words closer! 🎯


πŸŽ“ What Guides Training?

Context in sentences!

"The cat sat on the mat"
"The kitten played on the rug"

β†’ "cat" and "kitten" appear in similar contexts
β†’ Their embeddings should be similar
β†’ Gradients push them closer together!

This is self-supervised learning! 🧠


πŸ”₯ Training Details

1. Embeddings are part of the model:

class GPTModel(nn.Module):
    def __init__(self):
        super().__init__()  # Register submodules with nn.Module
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # ... rest of model
    
    def forward(self, input_ids):
        x = self.embedding(input_ids)  # Gradients flow here!
        # ... rest of forward pass
        return x

2. Updated via backpropagation:

# Training loop
for batch in dataloader:
    input_ids, targets = batch
    
    # Forward pass (embeddings used here)
    outputs = model(input_ids)
    
    # Compute loss
    loss = criterion(outputs, targets)
    
    # Backward pass (embeddings updated here!)
    loss.backward()
    optimizer.step()

3. Requires_grad = True:

print(embedding_layer.weight.requires_grad)
# Output: True

# This means gradients will be computed
# and weights will be updated!

GPT-2 Embeddings Deep Dive

πŸ“ GPT-2 Specifications

GPT-2 Small:

Parameter                    Value
Vocabulary size              50,257
Embedding dimension          768
Embedding matrix shape       (50,257 Γ— 768)
Total embedding parameters   38,597,376

That’s 38 million parameters just for embeddings! 😱


πŸ’» Creating GPT-2 Size Embeddings

import torch.nn as nn

# GPT-2 specifications
vocab_size = 50_257
embedding_dim = 768

# Create embedding layer
gpt2_embeddings = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=embedding_dim
)

print(f"Shape: {gpt2_embeddings.weight.shape}")
print(f"Total parameters: {vocab_size * embedding_dim:,}")

Output:

Shape: torch.Size([50257, 768])
Total parameters: 38,597,376

🎯 Why 768 Dimensions?

Trade-offs:

Small dimensions (e.g., 128):

βœ… Fewer parameters
βœ… Faster training
❌ Less expressive
❌ Can't capture complex meanings

Large dimensions (e.g., 768):

βœ… More expressive
βœ… Captures nuanced meanings
βœ… Better performance
❌ More parameters
❌ Slower training

768 = Sweet spot for GPT-2! βš–οΈ


πŸ”¬ GPT-2 vs GPT-3 vs GPT-4

Model          Vocab Size    Embed Dim    Parameters
GPT-2 Small    50,257        768          38.6M
GPT-2 XL       50,257        1,600        80.4M
GPT-3          50,257        12,288       617.6M
GPT-4          Undisclosed   Undisclosed  Undisclosed

Embedding parameters scale massively! πŸ“ˆ
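The parameter counts in the table are simply vocabulary size Γ— embedding dimension; a quick sanity check using the figures listed above:

configs = {
    "GPT-2 Small": (50_257, 768),
    "GPT-2 XL":    (50_257, 1_600),
    "GPT-3":       (50_257, 12_288),
}

for name, (vocab_size, embed_dim) in configs.items():
    print(f"{name}: {vocab_size * embed_dim:,} embedding parameters")

# GPT-2 Small: 38,597,376 embedding parameters
# GPT-2 XL:    80,411,200 embedding parameters
# GPT-3:       617,558,016 embedding parameters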


πŸŽ“ Complete Example

import torch
import torch.nn as nn
import tiktoken

# Initialize tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Text
text = "Hello, how are you today?"

# Tokenize
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Create embedding layer (GPT-2 size)
embedding_layer = nn.Embedding(
    num_embeddings=50_257,
    embedding_dim=768
)

# Convert to tensor
input_ids = torch.tensor(token_ids)

# Get embeddings
embeddings = embedding_layer(input_ids)

print(f"Input shape: {input_ids.shape}")
print(f"Embedding shape: {embeddings.shape}")

Output:

Token IDs: [15496, 11, 703, 389, 345, 1909, 30]
Input shape: torch.Size([7])
Embedding shape: torch.Size([7, 768])

7 tokens β†’ 7 vectors of 768 dimensions! βœ…


Embedding Layer vs Linear Layer

πŸ€” Are They The Same?

Surprising fact: Embedding layer = Special case of linear layer!


πŸ”¬ The Math

Linear layer:

output = input @ weight.T

# If input is one-hot encoded:
input = [0, 0, 1, 0]  # 1 at index 2 β†’ selects row 2
weight.T = [[w00, w01, w02],
            [w10, w11, w12],
            [w20, w21, w22],  ← Selected!
            [w30, w31, w32]]

output = [w20, w21, w22]  # Row 2!

Embedding layer:

token_id = 2  # selects row 2
weight = [[w00, w01, w02],
          [w10, w11, w12],
          [w20, w21, w22],  ← Selected!
          [w30, w31, w32]]

output = [w20, w21, w22]  # Row 2!

Same result! πŸŽ‰


πŸ’» Proof in Code

import torch
import torch.nn as nn

vocab_size = 4
embedding_dim = 5
input_ids = [2, 3, 1]

# Method 1: Embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
emb_output = embedding(torch.tensor(input_ids))

# Method 2: Linear layer with one-hot
linear = nn.Linear(vocab_size, embedding_dim, bias=False)
linear.weight = nn.Parameter(embedding.weight.T)  # nn.Linear stores (out, in), so transpose

# One-hot encode
one_hot = torch.zeros(len(input_ids), vocab_size)
one_hot[0, 2] = 1
one_hot[1, 3] = 1
one_hot[2, 1] = 1

linear_output = linear(one_hot)

# Compare
print(torch.allclose(emb_output, linear_output))
# Output: True

Mathematically identical! βœ…


🎯 Why Use Embedding Layer?

Linear layer approach:

❌ Requires one-hot encoding
❌ Many multiplications with 0
❌ Memory inefficient (50,257-dim vectors!)
❌ Computationally wasteful

Embedding layer approach:

βœ… Direct lookup (no one-hot)
βœ… O(1) operation
βœ… Memory efficient
βœ… Optimized implementation

Embedding layer is MUCH faster! ⚑


πŸ“Š Performance Comparison

For GPT-2 (vocab = 50,257):

Method              Per-token input            Speed
Linear + one-hot    50,257-dim float vector    Slow (full matrix multiply)
Embedding lookup    1 integer index            Fast (direct row access)

The lookup skips a 50,257-dimensional matrix multiplication per token, making it orders of magnitude more efficient! πŸš€
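If you want to see the gap yourself, here's a rough timing sketch (exact numbers depend on your hardware; the point is the relative difference between a direct lookup and the equivalent one-hot matrix multiplication):

import time
import torch
import torch.nn as nn

vocab_size, embed_dim, n_tokens = 50_257, 768, 1_024

embedding = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.randint(0, vocab_size, (n_tokens,))

# Direct embedding lookup
start = time.perf_counter()
_ = embedding(token_ids)
lookup_ms = (time.perf_counter() - start) * 1e3

# Equivalent one-hot + matrix multiplication
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
start = time.perf_counter()
_ = one_hot @ embedding.weight
matmul_ms = (time.perf_counter() - start) * 1e3

print(f"Lookup: {lookup_ms:.2f} ms | One-hot matmul: {matmul_ms:.2f} ms")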


Chapter Summary

πŸŽ‰ What We Learned Today

This was a FOUNDATIONAL chapter! Let’s recap:


1. Why Token IDs Aren’t Enough

Problem:
βœ— Token IDs are arbitrary numbers
βœ— Don't capture semantic relationships
βœ— cat (34) and kitten (7421) = unrelated numbers

Solution:
βœ“ Convert to dense vectors
βœ“ Encode semantic meaning
βœ“ Similar words = Similar vectors

2. Why One-Hot Encoding Fails

Issues:
βœ— No semantic similarity (all equally distant)
βœ— Huge dimensionality (50,000-dim vectors!)
βœ— Sparse and inefficient

Embeddings win:
βœ“ Dense representations
βœ“ Semantic relationships preserved
βœ“ Efficient for large vocabularies

3. Vector Embeddings Capture Meaning

Word2Vec magic:
βœ“ King - Man + Woman = Queen
βœ“ Similar words have high similarity scores
βœ“ Distance in vector space = Semantic distance
βœ“ Trained on Google News (100B words)

Proof that embeddings work! πŸŽ‰

4. Building Embedding Layers

# PyTorch implementation
embedding = nn.Embedding(
    num_embeddings=vocab_size,  # 50,257 for GPT-2
    embedding_dim=768           # 768 for GPT-2
)

# Creates weight matrix: (50,257 Γ— 768)
# Total parameters: 38.6 million!

5. Embedding Weight Matrix

Structure:
- Rows = Vocabulary size
- Columns = Embedding dimension
- Each row = One token's embedding vector

GPT-2:
- Matrix shape: (50,257 Γ— 768)
- Initialized randomly
- Trained via backpropagation

6. Lookup Table Operations

Embedding layer = Lookup table

Input: Token ID 3
Operation: Retrieve row 3 from weight matrix
Output: 768-dimensional vector

Efficient: O(1) operation!
Far more efficient than one-hot + a linear layer!

7. Training Embeddings

Process:
1. Initialize weights randomly
2. Forward pass (use embeddings)
3. Compute loss
4. Backward pass (update embeddings!)
5. Repeat for millions of steps

Result:
Random vectors β†’ Meaningful representations!

πŸ“Š Complete Pipeline

Text β†’ Tokens β†’ Token IDs β†’ Embeddings β†’ LLM Training

"Hello world"
      ↓
["Hello", "world"]
      ↓
[15496, 995]
      ↓
[[0.23, -0.45, ..., 0.67],   ← 768-dim
 [-0.12, 0.89, ..., -0.34]]  ← 768-dim
      ↓
Feed to GPT model!

πŸ’‘ Key Takeaways

  1. Token embeddings = Dense vectors that encode meaning
  2. Similar words β†’ Similar vectors (trained via context)
  3. Embedding layer = Efficient lookup table (beats one-hot)
  4. GPT-2 embeddings: 50,257 Γ— 768 = 38.6M parameters
  5. Trained jointly with LLM (via backpropagation)
  6. Foundation of modern NLP (enables semantic understanding)
  7. Next: Positional embeddings! (position also matters!)

🎯 What We Learned (Checklist)

  • [x] Why token IDs fail to capture meaning
  • [x] Problems with one-hot encoding
  • [x] How vectors can encode semantics
  • [x] Word2Vec demonstrations (King - Man + Woman = Queen)
  • [x] Building embedding layers in PyTorch
  • [x] Understanding embedding weight matrices
  • [x] Lookup table operations
  • [x] Training embeddings via backpropagation
  • [x] GPT-2 embedding specifications
  • [x] Embedding vs linear layer comparison

πŸ”œ Next Chapter: Chapter 11

Topic: Positional Embeddings

What we’ll learn:

  • Why position matters in sequences
  • Absolute vs relative position
  • Sinusoidal positional encoding
  • Learned positional embeddings
  • Combining token + positional embeddings
  • GPT’s positional encoding scheme

Because β€œThe cat sat on mat” β‰  β€œSat cat the on mat”! πŸ“


πŸ“ Practice Exercises

Try these before the next chapter:

  1. Load Word2Vec and test your own analogies
  2. Create embedding layer for vocabulary of 1000
  3. Implement lookup for batch of token IDs
  4. Visualize embedding space with PCA/t-SNE (see the starter sketch below)
  5. Compare embedding vs linear layer performance
  6. Train simple embeddings on toy dataset
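For exercise 4, here's a starter sketch (assumes gensim, scikit-learn, and matplotlib are installed; the Google News model is a large download):

import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load('word2vec-google-news-300')

words = ['king', 'queen', 'man', 'woman', 'dog', 'cat', 'apple', 'banana']
vectors = [model[w] for w in words]

# Project 300-dimensional vectors down to 2D for plotting
coords = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(coords, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.title("Word2Vec embeddings projected to 2D with PCA")
plt.show()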

Share your results! πŸ’¬


πŸš€ Take Action Now!

  1. πŸ’» Install gensim - Test Word2Vec examples
  2. πŸ§ͺ Experiment - Try King - Man + Woman yourself!
  3. πŸ“ Code Along - Build embedding layers
  4. ❓ Ask Questions - Comment if stuck
  5. πŸ”– Bookmark - Critical reference material
  6. ⏭️ Get Ready - Next: Positional embeddings!

Quick Reference

Embedding Layer Template:

import torch.nn as nn

# Create embedding
embedding = nn.Embedding(
    num_embeddings=50257,  # Vocabulary size
    embedding_dim=768      # Vector dimension
)

# Use embedding
token_ids = torch.tensor([2, 3, 5, 1])
embeddings = embedding(token_ids)
# Shape: (4, 768)

Word2Vec Usage:

import gensim.downloader as api

# Load model
model = api.load('word2vec-google-news-300')

# Test analogy
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
# β†’ queen

# Check similarity
similarity = model.similarity('cat', 'kitten')
# β†’ high (related words score well above unrelated pairs)

Key Dimensions:

Model          Vocab Size   Embed Dim   Parameters
Tutorial       6            3           18
GPT-2 Small    50,257       768         38.6M
GPT-2 XL       50,257       1,600       80.4M
GPT-3          50,257       12,288      617.6M

Thank You!

You’ve completed Chapter 10 - Token Embeddings! πŸŽ‰

You now understand:

  • βœ… Why embeddings are crucial
  • βœ… How vectors capture meaning
  • βœ… Word2Vec and semantic arithmetic
  • βœ… Building embedding layers
  • βœ… The embedding weight matrix
  • βœ… Training embeddings

Next chapter: Positional Embeddings

The foundation is laid! Let’s build GPT! πŸš€


πŸ“£ Your Feedback Matters!

Drop a comment:

  • Did Word2Vec blow your mind?
  • Which part was most enlightening?
  • Any questions about embeddings?
  • Share your experiments!

We respond to every comment! πŸ’¬


🎯 Coming Up

Chapter 11: Positional Embeddings
Chapter 12: Combining Token + Position
Chapter 13: Self-Attention Mechanism
Chapter 14: Building GPT from Scratch

The journey continues! πŸ’»πŸ”₯


See you in Chapter 11 where we learn positional encoding! πŸš€


Questions? Confused about embeddings? Drop them below! πŸ’ͺ