Chapter 10: Token Embeddings - The Magic of Meaning-Rich Vectors
Reading Time: 90 minutes
Coding Time: 120 minutes
Welcome to Chapter 10! Today we unlock the secret sauce of LLMs!
Journey so far:
- Tokenization (Chapter 7) ✅
- Byte Pair Encoding (Chapter 8) ✅
- Data Sampling & Context Windows (Chapter 9) ✅
- Token IDs are ready!
Today's mission:
- Why token IDs aren't enough
- What are token embeddings?
- How vectors capture meaning
- Building embedding layers
- Training embeddings
- The foundation of GPT!
This is where the REAL magic happens!
Table of Contents
- Why We Need Token Embeddings
- The Problem with Random Token IDs
- Why One-Hot Encoding Fails
- Vector Embeddings: The Solution
- Word2Vec Magic: King - Man + Woman = Queen
- Building Embedding Layers in PyTorch
- The Embedding Weight Matrix
- Lookup Table Operations
- Training Embeddings
- GPT-2 Embeddings Deep Dive
- Embedding Layer vs Linear Layer
- Chapter Summary
Why We Need Token Embeddings
The LLM Pipeline
Where we are:
┌───────────────────────────────────────────┐
│ Step 1: Tokenization                      │
│ "This is an example"                      │
│   → ["This", "is", "an", "example"]       │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ Step 2: Token IDs                         │
│ ["This", "is", "an", "example"]           │
│   → [101, 202, 303, 404]                  │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ Step 3: Token Embeddings   ← TODAY!       │
│ [101, 202, 303, 404]                      │
│   → [[0.23, -0.45, 0.67, ...], ...]       │
└───────────────────────────────────────────┘
                     ↓
┌───────────────────────────────────────────┐
│ Step 4: Feed to LLM Training              │
└───────────────────────────────────────────┘
Question: Why not use token IDs directly?
The Problem with Random Token IDs
Token IDs Alone Are Not Enough
Example vocabulary:
| Word | Token ID |
|---|---|
| cat | 34 |
| kitten | 917 |
| dog | 5 |
| puppy | 89 |
| book | 210 |
| tablet | 1505 |
Problems:
❌ cat (34) and kitten (917) are semantically related, but their token IDs don't capture this!
❌ dog (5) and puppy (89) are similar, but 5 and 89 are far apart!
❌ cat (34) and book (210) are unrelated, but we can't tell that from the IDs alone!
The Core Issue
Token IDs are just arbitrary numbers. They don't encode semantic relationships!
Real-world analogy:
Images:
CNNs exploit spatial relationships between pixels:
✅ Eyes are close together
✅ Ears are close to the head
✅ Position matters!
Text:
Token IDs ignore semantic relationships:
❌ "cat" and "kitten" = unrelated numbers
❌ Meaning not captured
❌ Huge training inefficiency!
What We Need
We need to exploit the inherent structure in language:
Just like CNNs exploit spatial structure in images,
we need to exploit SEMANTIC structure in text!
Words have meaning!
Similar words should have similar representations!
Why One-Hot Encoding Fails
One-Hot Encoding Attempt
Idea: Represent each word as a vector with all 0s and one 1.
Example:
| Word | One-Hot Vector |
|---|---|
| dog | [1, 0, 0, 0] |
| cat | [0, 1, 0, 0] |
| puppy | [0, 0, 1, 0] |
| book | [0, 0, 0, 1] |
Problems with One-Hot Encoding
1. No semantic similarity:
dog = [1, 0, 0, 0]
puppy = [0, 0, 1, 0]
# Distance between dog and puppy
distance = sqrt((1-0)^2 + (0-0)^2 + (0-1)^2 + (0-0)^2)
= sqrt(2)
≈ 1.41
# Distance between dog and book
dog = [1, 0, 0, 0]
book = [0, 0, 0, 1]
distance = sqrt((1-0)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2)
= sqrt(2)
≈ 1.41
Every pair of words is equally distant! No semantic information!
2. Huge dimensionality:
Vocabulary size = 50,000 words
→ Each word = 50,000-dimensional vector
→ 99.998% of values are 0
→ Extremely sparse and inefficient!
3. Orthogonality problem:
Every word vector is perpendicular to every other!
→ Dot product always = 0
→ No notion of similarity
One-hot encoding completely fails to capture meaning!
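To see this failure concretely, here is a small sketch in plain NumPy, using the 4-word toy vocabulary above, that computes every pairwise distance and dot product between the one-hot vectors:
import numpy as np

# One-hot vectors for the 4-word toy vocabulary above
words = ["dog", "cat", "puppy", "book"]
one_hot = np.eye(len(words))  # identity matrix: row i = one-hot vector for word i

for i in range(len(words)):
    for j in range(i + 1, len(words)):
        dist = np.linalg.norm(one_hot[i] - one_hot[j])  # Euclidean distance
        dot = one_hot[i] @ one_hot[j]                   # dot product
        print(f"{words[i]:>5} vs {words[j]:<5}  distance = {dist:.2f}  dot = {dot:.1f}")

# Every pair: distance = 1.41 (sqrt(2)) and dot product = 0.0
# -> one-hot vectors carry no notion of similarity at all.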
Vector Embeddings: The Solution
The Big Idea
Represent each word as a DENSE vector where dimensions correspond to semantic features!
Visual Example
Instead of arbitrary numbers, use features!
Features:
- Has tail?
- Is edible?
- Has four legs?
- Makes sound?
- Is a pet?
Vector representations:
dog = [0.9, 0.1, 0.9, 0.8, 0.9] # High: tail, legs, sound, pet
cat = [0.9, 0.1, 0.8, 0.7, 0.9] # High: tail, legs, sound, pet
apple = [0.0, 0.9, 0.0, 0.0, 0.0] # High: edible
banana = [0.0, 0.9, 0.0, 0.0, 0.0] # High: edible
Visualizing Semantic Space
  Has tail?
   0.9 │ • dog
       │    • cat
       │
   0.5 │
       │
   0.1 │
       │                      • apple
   0.0 │                      • banana
       └──────────────────────────────→ Is edible?
         0.0                  0.9
Similar words cluster together!
Key Insight
Compare distances:
# dog and cat (similar animals)
dog = [0.9, 0.1, 0.9, 0.8, 0.9]
cat = [0.9, 0.1, 0.8, 0.7, 0.9]
# Small difference in most dimensions!
# dog and banana (unrelated)
dog = [0.9, 0.1, 0.9, 0.8, 0.9]
banana = [0.0, 0.9, 0.0, 0.0, 0.0]
# Large difference in most dimensions!
Vectors capture semantic meaning!
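You can check this numerically with cosine similarity. Below is a quick sketch using the hand-crafted feature vectors from above (the exact numbers only reflect our made-up features, so treat them as illustrative):
import numpy as np

# Toy feature vectors from above: [tail, edible, four legs, sound, pet]
dog    = np.array([0.9, 0.1, 0.9, 0.8, 0.9])
cat    = np.array([0.9, 0.1, 0.8, 0.7, 0.9])
banana = np.array([0.0, 0.9, 0.0, 0.0, 0.0])

def cosine(a, b):
    # 1.0 = pointing the same way (very similar), 0.0 = orthogonal (unrelated)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs cat:    {cosine(dog, cat):.2f}")     # close to 1.0
print(f"dog vs banana: {cosine(dog, banana):.2f}")  # close to 0.0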
Benefits of Vector Embeddings
✅ Semantic similarity captured
✅ Dense representations (not sparse)
✅ Efficient for large vocabularies
✅ Can be trained automatically
✅ Enable arithmetic operations on meaning!
Word2Vec Magic
The Most Famous Demo
Can vectors really capture meaning? Let's test!
King - Man + Woman = ?
Hypothesis:
King is to Man as Queen is to Woman
King - Man + Woman should β Queen
Why?
King = Royalty + Masculine
Man = Masculine
Woman = Feminine
King - Man + Woman = Royalty + Masculine - Masculine + Feminine
= Royalty + Feminine
= Queen!
Testing with Pre-trained Word2Vec
import gensim.downloader as api
# Load pre-trained Word2Vec
# Trained on Google News (100 billion words!)
# 300-dimensional vectors
model = api.load('word2vec-google-news-300')
# Test: King - Man + Woman = ?
result = model.most_similar(
positive=['king', 'woman'],
negative=['man']
)
print(result[0])
Output:
('queen', 0.7118)
IT WORKS! "queen" comes back with a cosine similarity of 0.71!
More Examples
Test 1: Gender relationships
# uncle - man + woman = ?
result = model.most_similar(
positive=['uncle', 'woman'],
negative=['man']
)
# → aunt (0.75)
# nephew - man + woman = ?
result = model.most_similar(
positive=['nephew', 'woman'],
negative=['man']
)
# → niece (0.72)
Test 2: Semantic similarity
# How similar are these words?
print(model.similarity('woman', 'man')) # 0.766
print(model.similarity('king', 'queen')) # 0.651
print(model.similarity('uncle', 'aunt')) # 0.738
print(model.similarity('boy', 'girl')) # 0.884
print(model.similarity('nephew', 'niece')) # 0.695
# Unrelated words
print(model.similarity('paper', 'water')) # 0.183
Related words have high similarity!
Test 3: Finding similar words
# Find words similar to "tower"
print(model.most_similar('tower'))
Output:
[('towers', 0.79),
('skyscraper', 0.73),
('spire', 0.69),
('building', 0.68),
('monument', 0.66)]
Semantically related words cluster together!
Distance Metric
Vector distance = Semantic distance!
import numpy as np
# Get vectors
man = model['man']
woman = model['woman']
nephew = model['nephew']
niece = model['niece']
semiconductor = model['semiconductor']
earthworm = model['earthworm']
# Compute L2 distance
print(np.linalg.norm(man - woman)) # 1.73 (close!)
print(np.linalg.norm(nephew - niece)) # 1.96 (close!)
print(np.linalg.norm(semiconductor - earthworm)) # 5.67 (far!)
Small distance = Similar meaning!
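A note on metrics: the similarity() calls earlier compute cosine similarity, while the snippet above uses raw L2 distance. Reusing the vectors already loaded, here is a small sketch showing that both metrics tell the same qualitative story:
# Cosine similarity compares direction only and ignores vector length,
# which is why it is usually preferred over raw L2 distance for embeddings.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(man, woman))                # high  -> related
print(cosine(nephew, niece))             # high  -> related
print(cosine(semiconductor, earthworm))  # low   -> unrelated

# gensim's built-in helper computes the same quantity:
print(model.similarity('man', 'woman'))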
Key Takeaway
Well-trained embeddings encode meaning!
✅ Similar words have similar vectors
✅ Vector arithmetic = Semantic arithmetic
✅ Geometric relationships = Meaning relationships
✅ This is the foundation of modern NLP!
Building Embedding Layers in PyTorch
Implementation Time!
Now let's build embeddings for LLMs!
The Setup
Example sentence:
"quick fox is in the house"
Step 1: Tokenization
tokens = ["quick", "fox", "is", "in", "the", "house"]
Step 2: Create vocabulary (sorted)
vocab = {
"fox": 0,
"house": 1,
"in": 2,
"is": 3,
"quick": 4,
"the": 5
}
Step 3: Get token IDs
# Sentence: "in is the house"
token_ids = [2, 3, 5, 1]
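Putting Steps 1-3 together as runnable code (a minimal sketch that uses simple whitespace splitting instead of a real tokenizer):
# Minimal sketch of Steps 1-3: whitespace tokenization + sorted vocabulary
sentence = "quick fox is in the house"
tokens = sentence.split()

vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
print(vocab)
# {'fox': 0, 'house': 1, 'in': 2, 'is': 3, 'quick': 4, 'the': 5}

token_ids = [vocab[t] for t in "in is the house".split()]
print(token_ids)  # [2, 3, 5, 1]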
Goal
Convert token IDs → embedding vectors
Token ID 2 → [ 0.56,  0.23, -0.78]
Token ID 3 → [-0.34,  0.12,  0.90]
Token ID 5 → [-0.67,  0.34, -0.23]
Token ID 1 → [-0.12,  0.89, -0.34]
Creating the Embedding Layer
import torch
import torch.nn as nn
# Parameters
vocab_size = 6 # 6 words in vocabulary
embedding_dim = 3 # 3-dimensional vectors
# Create embedding layer
embedding_layer = nn.Embedding(
num_embeddings=vocab_size, # Vocabulary size
embedding_dim=embedding_dim # Vector dimension
)
print(embedding_layer)
Output:
Embedding(6, 3)
Understanding the Parameters
num_embeddings:
Number of unique tokens in vocabulary
GPT-2: 50,257
Our example: 6
embedding_dim:
Dimension of embedding vectors
GPT-2: 768 (Small), 1,600 (XL)
Our example: 3
The Embedding Weight Matrix
# View the embedding weights
print(embedding_layer.weight)
Output (your random values will differ; we'll reuse these illustrative numbers in the examples below):
Parameter containing:
tensor([[ 0.2300, -0.4500,  0.6700],  # Token ID 0
        [-0.1200,  0.8900, -0.3400],  # Token ID 1
        [ 0.5600,  0.2300, -0.7800],  # Token ID 2
        [-0.3400,  0.1200,  0.9000],  # Token ID 3
        [ 0.7800, -0.5600,  0.1200],  # Token ID 4
        [-0.6700,  0.3400, -0.2300]], # Token ID 5
       requires_grad=True)
Shape: (6, 3) = (vocab_size, embedding_dim)
Key Points
1. Random initialization:
Weights are initialized randomly!
These will be optimized during training.
2. Matrix structure:
Rows = Number of tokens (vocab_size)
Columns = Embedding dimension (embedding_dim)
3. Each row = One token's embedding:
Row 0 = Embedding for token ID 0
Row 1 = Embedding for token ID 1
...
Row 5 = Embedding for token ID 5
The Embedding Weight Matrix
Matrix Dimensions
For our example:
┌────────────────────────────────────┐
│ Token ID 0: [ 0.23, -0.45,  0.67]  │ ← Row 0
│ Token ID 1: [-0.12,  0.89, -0.34]  │ ← Row 1
│ Token ID 2: [ 0.56,  0.23, -0.78]  │ ← Row 2
│ Token ID 3: [-0.34,  0.12,  0.90]  │ ← Row 3
│ Token ID 4: [ 0.78, -0.56,  0.12]  │ ← Row 4
│ Token ID 5: [-0.67,  0.34, -0.23]  │ ← Row 5
└────────────────────────────────────┘
                ↑       ↑       ↑
              Dim 0   Dim 1   Dim 2
Matrix shape: (6 tokens, 3 dimensions)
GPT-2 Dimensions
GPT-2 Small:
Vocabulary size: 50,257 tokens
Embedding dimension: 768
Embedding weight matrix shape: (50,257 × 768)
Total parameters: 50,257 × 768 = 38,597,376 parameters!
Just for embeddings!
How It Works
Each token ID maps to a row:
# Token ID 3 β Row 3
embedding_layer.weight[3]
# Output: tensor([-0.34, 0.12, 0.90])
# Token ID 0 β Row 0
embedding_layer.weight[0]
# Output: tensor([ 0.23, -0.45, 0.67])
Simple lookup operation!
Lookup Table Operations
Single Token Lookup
# Get embedding for token ID 3
token_id = torch.tensor([3])
embedding = embedding_layer(token_id)
print(embedding)
# Output: tensor([[-0.34, 0.12, 0.90]])
Returns the 3D vector for token ID 3!
Multiple Token Lookup
# Get embeddings for multiple tokens
input_ids = torch.tensor([2, 3, 5, 1])
embeddings = embedding_layer(input_ids)
print(embeddings)
print(embeddings.shape)
Output:
tensor([[ 0.56, 0.23, -0.78], # Token ID 2
[-0.34, 0.12, 0.90], # Token ID 3
[-0.67, 0.34, -0.23], # Token ID 5
[-0.12, 0.89, -0.34]]) # Token ID 1
torch.Size([4, 3])
4 tokens → 4 embedding vectors!
Why "Lookup Table"?
It's literally looking up rows!
Input: Token ID 3
        ↓
┌──────────────────────────────┐
│ Row 0: [...]                 │
│ Row 1: [...]                 │
│ Row 2: [...]                 │
│ Row 3: [-0.34,  0.12,  0.90] │ ← Found it!
│ Row 4: [...]                 │
│ Row 5: [...]                 │
└──────────────────────────────┘
        ↓
Output: [-0.34, 0.12, 0.90]
Super efficient: an O(1) lookup!
Batch Processing
# Batch of sentences
# Sentence 1: [2, 3, 1]
# Sentence 2: [5, 0, 4]
batch = torch.tensor([
[2, 3, 1],
[5, 0, 4]
])
embeddings = embedding_layer(batch)
print(embeddings.shape)
# Output: torch.Size([2, 3, 3])
#                     │  │  └── Embedding dim
#                     │  └───── Sequence length
#                     └──────── Batch size
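Under the hood, this batched call is just fancy indexing into the weight matrix. A small sketch to convince yourself, reusing the embedding_layer and batch defined above:
# The embedding call is equivalent to indexing rows of the weight matrix
manual = embedding_layer.weight[batch]  # advanced indexing with a (2, 3) index tensor
lookup = embedding_layer(batch)         # the usual way

print(torch.allclose(manual, lookup))   # True
print(manual.shape)                     # torch.Size([2, 3, 3])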
Training Embeddings
How Are Embeddings Trained?
Remember: Embeddings start random!
Initial embeddings = Random noise
Trained embeddings = Meaningful vectors
How do we go from random → meaningful?
Training Process
Step 1: Random initialization
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
# Weights are random!
Step 2: Forward pass
# Input: Token IDs
input_ids = torch.tensor([2, 3, 5, 1])
# Get embeddings
embeddings = embedding_layer(input_ids)
# Feed to the LLM (model, criterion, and target are placeholders here;
# see the full toy example later in this section)
output = model(embeddings)
# Compute loss
loss = criterion(output, target)
Step 3: Backward pass
# Compute gradients
loss.backward()
# Update embedding weights!
optimizer.step()
Embeddings are updated via backpropagation!
Training Dynamics
Initial (random):
cat = [ 0.23, -0.45, 0.67]
kitten = [-0.89, 0.12, -0.34]
# Distance: Large! (random)
After training:
cat = [ 0.78, 0.34, 0.56]
kitten = [ 0.82, 0.31, 0.59]
# Distance: Small! (similar meaning captured)
Training brings similar words closer!
What Guides Training?
Context in sentences!
"The cat sat on the mat"
"The kitten played on the rug"
→ "cat" and "kitten" appear in similar contexts
→ Their embeddings should be similar
→ Gradients push them closer together!
This is self-supervised learning!
Training Details
1. Embeddings are part of the model:
class GPTModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()  # required before registering submodules
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # ... rest of model

    def forward(self, input_ids):
        x = self.embedding(input_ids)  # Gradients flow here!
        # ... rest of forward pass
        return x
2. Updated via backpropagation:
# Training loop
for batch in dataloader:
    input_ids, targets = batch
    # Forward pass (embeddings used here)
    outputs = model(input_ids)
    # Compute loss
    loss = criterion(outputs, targets)
    # Backward pass (embeddings updated here!)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
3. requires_grad is True:
print(embedding_layer.weight.requires_grad)
# Output: True
# This means gradients will be computed
# and weights will be updated!
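To make the loop above concrete, here is a self-contained toy sketch (the tiny model and the made-up dataset are purely illustrative, not the real GPT setup): an embedding layer plus a linear output layer, trained on a few next-token pairs so you can watch the embedding weights move away from their random initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 6, 3

# Toy "model": embedding lookup followed by a linear layer that predicts the next token
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Made-up (input token, next token) pairs for "quick fox is in the house"
inputs  = torch.tensor([4, 0, 3, 2, 5, 1])  # quick, fox, is, in, the, house
targets = torch.tensor([0, 3, 2, 5, 1, 4])  # the token that follows each input (wrapping at the end)

before = model[0].weight.detach().clone()   # snapshot of the random embeddings

for step in range(100):
    logits = model(inputs)                  # shape: (6, vocab_size)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()                         # gradients flow into the embedding weights
    optimizer.step()

after = model[0].weight.detach()
print("Embedding weights changed:", not torch.allclose(before, after))  # True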
GPT-2 Embeddings Deep Dive
GPT-2 Specifications
GPT-2 Small:
| Parameter | Value |
|---|---|
| Vocabulary size | 50,257 |
| Embedding dimension | 768 |
| Embedding matrix shape | (50,257 × 768) |
| Total embedding parameters | 38,597,376 |
That's about 38.6 million parameters just for embeddings!
Creating GPT-2-Size Embeddings
import torch.nn as nn
# GPT-2 specifications
vocab_size = 50_257
embedding_dim = 768
# Create embedding layer
gpt2_embeddings = nn.Embedding(
num_embeddings=vocab_size,
embedding_dim=embedding_dim
)
print(f"Shape: {gpt2_embeddings.weight.shape}")
print(f"Total parameters: {vocab_size * embedding_dim:,}")
Output:
Shape: torch.Size([50257, 768])
Total parameters: 38,597,376
Why 768 Dimensions?
Trade-offs:
Small dimensions (e.g., 128):
✅ Fewer parameters
✅ Faster training
❌ Less expressive
❌ Can't capture complex meanings
Large dimensions (e.g., 768):
✅ More expressive
✅ Captures nuanced meanings
✅ Better performance
❌ More parameters
❌ Slower training
768 = Sweet spot for GPT-2 Small!
GPT-2 vs GPT-3 vs GPT-4
| Model | Vocab Size | Embed Dim | Parameters |
|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | 38.6M |
| GPT-2 XL | 50,257 | 1,600 | 80.4M |
| GPT-3 | 50,257 | 12,288 | 617.6M |
| GPT-4 | Unknown | Unknown | Unknown |
Embedding parameters scale massively!
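The parameter counts in the table are simply vocab_size × embedding_dim. A quick sketch that reproduces them (GPT-4 is omitted because its configuration has not been disclosed):
# Embedding parameter count = vocab_size * embedding_dim
configs = {
    "GPT-2 Small": (50_257, 768),
    "GPT-2 XL":    (50_257, 1_600),
    "GPT-3":       (50_257, 12_288),
}

for name, (vocab_size, embed_dim) in configs.items():
    params = vocab_size * embed_dim
    print(f"{name:12s} {params:>12,} parameters ({params / 1e6:.1f}M)")

# GPT-2 Small   38,597,376 parameters (38.6M)
# GPT-2 XL      80,411,200 parameters (80.4M)
# GPT-3        617,558,016 parameters (617.6M)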
Complete Example
import torch
import torch.nn as nn
import tiktoken
# Initialize tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
# Text
text = "Hello, how are you today?"
# Tokenize
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
# Create embedding layer (GPT-2 size)
embedding_layer = nn.Embedding(
num_embeddings=50_257,
embedding_dim=768
)
# Convert to tensor
input_ids = torch.tensor(token_ids)
# Get embeddings
embeddings = embedding_layer(input_ids)
print(f"Input shape: {input_ids.shape}")
print(f"Embedding shape: {embeddings.shape}")
Output:
Token IDs: [15496, 11, 703, 389, 345, 1909, 30]
Input shape: torch.Size([7])
Embedding shape: torch.Size([7, 768])
7 tokens → 7 vectors of 768 dimensions!
Embedding Layer vs Linear Layer
Are They The Same?
Surprising fact: Embedding layer = Special case of linear layer!
The Math
Linear layer (with a one-hot input):
# nn.Linear computes: output = input @ weight.T
# If weight.T is the (vocab × dim) embedding matrix,
# a one-hot input simply selects one of its rows:
input = [0, 0, 1, 0]            # one-hot for token ID 2
weight.T = [[w00, w01, w02],
            [w10, w11, w12],
            [w20, w21, w22],    ← Selected!
            [w30, w31, w32]]
output = [w20, w21, w22]        # Row 2!
Embedding layer:
token_id = 2                    # select row 2 directly
weight = [[w00, w01, w02],
          [w10, w11, w12],
          [w20, w21, w22],      ← Selected!
          [w30, w31, w32]]
output = [w20, w21, w22]        # Row 2!
Same result!
Proof in Code
import torch
import torch.nn as nn
vocab_size = 4
embedding_dim = 5
input_ids = [2, 3, 1]
# Method 1: Embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
emb_output = embedding(torch.tensor(input_ids))
# Method 2: Linear layer with one-hot
linear = nn.Linear(vocab_size, embedding_dim, bias=False)
linear.weight = nn.Parameter(embedding.weight.T)  # Linear stores weight as (out, in), so transpose!
# One-hot encode
one_hot = torch.zeros(len(input_ids), vocab_size)
one_hot[0, 2] = 1
one_hot[1, 3] = 1
one_hot[2, 1] = 1
linear_output = linear(one_hot)
# Compare
print(torch.allclose(emb_output, linear_output))
# Output: True
Mathematically identical!
Why Use an Embedding Layer?
Linear layer approach:
❌ Requires one-hot encoding
❌ Many multiplications with 0
❌ Memory inefficient (50,257-dim vectors!)
❌ Computationally wasteful
Embedding layer approach:
✅ Direct lookup (no one-hot)
✅ O(1) operation
✅ Memory efficient
✅ Optimized implementation
The embedding layer is MUCH faster!
Performance Comparison
For GPT-2 (vocab = 50,257):
| Method | Memory per token | Speed |
|---|---|---|
| Linear + One-hot | 50,257 × 32 bits (full one-hot vector) | Slow (matrix multiply) |
| Embedding lookup | 32 bits (one integer index) | Fast (direct row access) |
For a 50,257-token vocabulary, the embedding lookup is orders of magnitude more efficient!
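If you want to see the gap yourself, here is a rough, unscientific timing sketch (absolute numbers depend heavily on your hardware; only the relative difference matters):
import time
import torch
import torch.nn as nn

vocab_size, embed_dim, n_tokens = 50_257, 768, 256

embedding = nn.Embedding(vocab_size, embed_dim)
linear = nn.Linear(vocab_size, embed_dim, bias=False)

token_ids = torch.randint(0, vocab_size, (n_tokens,))
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size).float()  # (256, 50257)

def avg_time(fn, repeats=50):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

print(f"Embedding lookup: {avg_time(lambda: embedding(token_ids)) * 1e3:.3f} ms")
print(f"One-hot + Linear: {avg_time(lambda: linear(one_hot)) * 1e3:.3f} ms")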
Chapter Summary
What We Learned Today
This was a FOUNDATIONAL chapter! Let's recap:
1. Why Token IDs Aren't Enough
Problem:
❌ Token IDs are arbitrary numbers
❌ Don't capture semantic relationships
❌ cat (34) and kitten (917) = unrelated numbers
Solution:
✅ Convert to dense vectors
✅ Encode semantic meaning
✅ Similar words = Similar vectors
2. Why One-Hot Encoding Fails
Issues:
❌ No semantic similarity (all equally distant)
❌ Huge dimensionality (50,000-dim vectors!)
❌ Sparse and inefficient
Embeddings win:
✅ Dense representations
✅ Semantic relationships preserved
✅ Efficient for large vocabularies
3. Vector Embeddings Capture Meaning
Word2Vec magic:
✅ King - Man + Woman = Queen
✅ Similar words have high similarity scores
✅ Distance in vector space = Semantic distance
✅ Trained on Google News (100B words)
Proof that embeddings work!
4. Building Embedding Layers
# PyTorch implementation
embedding = nn.Embedding(
num_embeddings=vocab_size, # 50,257 for GPT-2
embedding_dim=768 # 768 for GPT-2
)
# Creates weight matrix: (50,257 × 768)
# Total parameters: 38.6 million!
5. Embedding Weight Matrix
Structure:
- Rows = Vocabulary size
- Columns = Embedding dimension
- Each row = One token's embedding vector
GPT-2:
- Matrix shape: (50,257 × 768)
- Initialized randomly
- Trained via backpropagation
6. Lookup Table Operations
Embedding layer = Lookup table
Input: Token ID 3
Operation: Retrieve row 3 from weight matrix
Output: 768-dimensional vector
Efficient: O(1) operation!
Far more efficient than the one-hot + linear approach!
7. Training Embeddings
Process:
1. Initialize weights randomly
2. Forward pass (use embeddings)
3. Compute loss
4. Backward pass (update embeddings!)
5. Repeat for millions of steps
Result:
Random vectors β Meaningful representations!
Complete Pipeline
Text → Tokens → Token IDs → Embeddings → LLM Training
"Hello world"
    ↓
["Hello", " world"]
    ↓
[15496, 995]
    ↓
[[ 0.23, -0.45, ..., 0.67],   ← 768-dim
 [-0.12,  0.89, ..., -0.34]]  ← 768-dim
    ↓
Feed to GPT model!
Key Takeaways
- Token embeddings = Dense vectors that encode meaning
- Similar words β Similar vectors (trained via context)
- Embedding layer = Efficient lookup table (beats one-hot)
- GPT-2 embeddings: 50,257 × 768 = 38.6M parameters
- Trained jointly with LLM (via backpropagation)
- Foundation of modern NLP (enables semantic understanding)
- Next: Positional embeddings! (position also matters!)
What We Learned (Checklist)
- [x] Why token IDs fail to capture meaning
- [x] Problems with one-hot encoding
- [x] How vectors can encode semantics
- [x] Word2Vec demonstrations (King - Man + Woman = Queen)
- [x] Building embedding layers in PyTorch
- [x] Understanding embedding weight matrices
- [x] Lookup table operations
- [x] Training embeddings via backpropagation
- [x] GPT-2 embedding specifications
- [x] Embedding vs linear layer comparison
Next Chapter: Chapter 11
Topic: Positional Embeddings
What we'll learn:
- Why position matters in sequences
- Absolute vs relative position
- Sinusoidal positional encoding
- Learned positional embeddings
- Combining token + positional embeddings
- GPT's positional encoding scheme
Because "The cat sat on mat" ≠ "Sat cat the on mat"!
Practice Exercises
Try these before the next chapter:
- Load Word2Vec and test your own analogies
- Create embedding layer for vocabulary of 1000
- Implement lookup for batch of token IDs
- Visualize embedding space with PCA/t-SNE
- Compare embedding vs linear layer performance
- Train simple embeddings on toy dataset
Share your results!
Take Action Now!
- Install gensim - Test Word2Vec examples
- Experiment - Try King - Man + Woman yourself!
- Code Along - Build embedding layers
- Ask Questions - Comment if stuck
- Bookmark - Critical reference material
- Get Ready - Next: Positional embeddings!
Quick Reference
Embedding Layer Template:
import torch
import torch.nn as nn
# Create embedding
embedding = nn.Embedding(
num_embeddings=50257, # Vocabulary size
embedding_dim=768 # Vector dimension
)
# Use embedding
token_ids = torch.tensor([2, 3, 5, 1])
embeddings = embedding(token_ids)
# Shape: (4, 768)
Word2Vec Usage:
import gensim.downloader as api
# Load model
model = api.load('word2vec-google-news-300')
# Test analogy
result = model.most_similar(
positive=['king', 'woman'],
negative=['man']
)
# → queen
# Check similarity
similarity = model.similarity('cat', 'kitten')
# → 0.82 (high!)
Key Dimensions:
| Model | Vocab Size | Embed Dim | Parameters |
|---|---|---|---|
| Tutorial | 6 | 3 | 18 |
| GPT-2 Small | 50,257 | 768 | 38.6M |
| GPT-2 XL | 50,257 | 1,600 | 80.4M |
| GPT-3 | 50,257 | 12,288 | 617.6M |
Thank You!
You've completed Chapter 10 - Token Embeddings!
You now understand:
- ✅ Why embeddings are crucial
- ✅ How vectors capture meaning
- ✅ Word2Vec and semantic arithmetic
- ✅ Building embedding layers
- ✅ The embedding weight matrix
- ✅ Training embeddings
Next chapter: Positional Embeddings
The foundation is laid! Let's build GPT!
Your Feedback Matters!
Drop a comment:
- Did Word2Vec blow your mind?
- Which part was most enlightening?
- Any questions about embeddings?
- Share your experiments!
We respond to every comment!
Coming Up
Chapter 11: Positional Embeddings
Chapter 12: Combining Token + Position
Chapter 13: Self-Attention Mechanism
Chapter 14: Building GPT from Scratch
The journey continues!
See you in Chapter 11 where we learn positional encoding!
Questions? Confused about embeddings? Drop them below!