Welcome to The GSM Work

Technical insights, tutorials, and stories from the world of technology

Chapter 10: Token Embeddings - Converting Words to Meaning-Rich Vectors

Master token embeddings from scratch! Learn why raw token IDs fail to capture meaning and how vectors encode semantic relationships (King - Man + Woman ≈ Queen!). Build embedding layers in PyTorch, understand Word2Vec, implement lookup tables, and prepare embeddings for GPT training. Discover why embeddings are the secret sauce of LLMs.

Read more
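As a taste of the chapter's core idea, here is a minimal PyTorch sketch of an embedding layer as a trainable lookup table. The vocabulary size and dimensions are toy values chosen for illustration, not the chapter's exact settings.

```python
import torch

torch.manual_seed(123)

vocab_size = 6   # toy vocabulary, for illustration only
embed_dim = 3    # real GPT models use hundreds of dimensions

# An embedding layer is a trainable lookup table:
# one learned weight row per token ID.
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 3, 5, 1])
vectors = embedding(token_ids)  # fetches one row per ID
print(vectors.shape)            # torch.Size([4, 3])
```

Because the lookup is differentiable, these rows get updated during training until tokens with similar meanings end up with similar vectors.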

Chapter 9: Data Sampling & Context Windows - Preparing Data for LLM Training

Learn how to create input-target pairs for LLM training. Master context windows, sliding windows, batch processing, and PyTorch DataLoaders. Understand auto-regressive training and why next-word prediction needs special data preparation, and implement efficient data pipelines from scratch.

Read more
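To make the sliding-window idea concrete, here is a minimal sketch of a PyTorch dataset that pairs each chunk of tokens with the same chunk shifted one position to the right. Class and parameter names are illustrative, not necessarily the chapter's exact code.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Each input chunk is paired with a target chunk shifted
    one token to the right (next-word prediction)."""
    def __init__(self, token_ids, context_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(20))  # stand-in for real tokenized text
loader = DataLoader(SlidingWindowDataset(token_ids, context_length=4, stride=4),
                    batch_size=2, shuffle=False)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```

A stride equal to the context length gives non-overlapping windows; a smaller stride yields overlapping samples at the cost of some redundancy between batches.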

Chapter 8: Byte Pair Encoding (BPE) - How GPT Tokenizes Text

Master Byte Pair Encoding (BPE) - the tokenization algorithm used in GPT-2, GPT-3, and ChatGPT. Learn why it's superior to word-level and character-level tokenization, build BPE from scratch, understand subword tokenization, and use the tiktoken library. Complete Python implementation with real examples.

Read more
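For a quick preview, the tiktoken library exposes GPT-2's BPE tokenizer in a couple of lines. The sample sentence below is arbitrary; any string will round-trip the same way.

```python
import tiktoken  # pip install tiktoken

# Load GPT-2's pretrained BPE vocabulary and merge rules
tokenizer = tiktoken.get_encoding("gpt2")

text = "Byte Pair Encoding handles unknownWords gracefully."
ids = tokenizer.encode(text)

print(ids)                    # a list of integer token IDs
print(tokenizer.decode(ids))  # decodes back to the original text
```

Note how BPE never raises an unknown-token error: rare words are simply split into smaller subword units that are in the vocabulary.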

Chapter 6: Complete Roadmap - 3 Stages of Building LLMs From Scratch

Your complete roadmap for building LLMs from scratch. Learn the 3-stage process: Data Preparation & Architecture (Stage 1), Pre-training (Stage 2), and Fine-tuning & Deployment (Stage 3). Understand tokenization, attention mechanisms, training loops, and how to build real applications like spam classifiers and chatbots.

Read more

Chapter 5: GPT Architecture - From Transformers to ChatGPT Evolution

Deep dive into GPT architecture. Learn the evolution from GPT-1 to GPT-4, zero-shot vs. few-shot learning, auto-regressive models, unsupervised learning, emergent behavior, and why training a model like GPT-3 was estimated to cost $4.6 million - explained for absolute beginners.

Read more
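Since "auto-regressive" is the key concept here, this is a minimal sketch of the idea: generate one token at a time, feeding each prediction back in as input. The `model` argument is a hypothetical stand-in for any network that returns logits of shape (batch, seq_len, vocab_size).

```python
import torch

def generate(model, token_ids, max_new_tokens, context_length):
    """Auto-regressive decoding: predict the next token from
    everything generated so far, append it, and repeat.
    `model` is a hypothetical placeholder, not a real API."""
    for _ in range(max_new_tokens):
        context = token_ids[:, -context_length:]  # crop to the context window
        with torch.no_grad():
            logits = model(context)
        # Greedy decoding: take the highest-scoring next token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids
```

Real systems usually sample from the logits (with temperature or top-k) rather than always taking the argmax, but the feed-back loop is the same.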