Deconstructing Large Language Models
An interactive journey into the heart of LLMs. Explore the core concepts, visualize the mechanics, and understand the technology that's changing our world, from the underlying math to the hardware that powers it.
Part 1: The Foundations
Before a model can learn, it needs an architecture. We'll break down the revolutionary Transformer model and explore how we turn raw text into numbers the machine can understand.
The Transformer Architecture
The Transformer dispensed with the sequential nature of older models (like RNNs) and embraced parallelism. Its core innovation is the self-attention mechanism, which allows every word in a sentence to look at every other word simultaneously to build contextual understanding.
Self-Attention: Q, K, V
Attention works by creating three vectors for each input token: a Query (what I'm looking for), a Key (what I contain), and a Value (what I will give you). By comparing the Query of one token to the Keys of all others, the model calculates "attention weights"—how much focus to pay to each token—and produces a weighted sum of the Values. Try calculating this by hand in the next section!
The entire operation can be expressed in a single, highly parallelizable matrix equation:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The scaling factor \(\sqrt{d_k}\) is crucial for stabilizing gradients during training.
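The whole formula fits in a few lines of Python. Here is a toy sketch with numpy standing in for a deep-learning framework (shapes chosen for illustration, not a production implementation):

```python
import math
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of the values

Q = np.random.randn(4, 8)   # 4 tokens, d_k = 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query token
```

Note that every token's output is computed in the same matrix multiplication, which is what makes the operation so parallelizable.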
Multi-Head Attention
Instead of one attention mechanism, Transformers use many "heads" in parallel. Each head learns to focus on different types of relationships (e.g., syntactic, semantic), creating a richer, more nuanced understanding of the text.
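The "splitting into heads" is mostly a reshape of the model dimension. A toy numpy sketch (sizes are illustrative):

```python
import numpy as np

d_model, num_heads = 16, 4
head_dim = d_model // num_heads

x = np.random.randn(6, d_model)   # 6 tokens, each a 16-dim vector

# Split the model dimension into 4 heads of size 4 each:
heads = x.reshape(6, num_heads, head_dim).transpose(1, 0, 2)
print(heads.shape)  # (4, 6, 4): each head sees its own slice of every token

# After attention runs independently per head, the heads are concatenated back:
merged = heads.transpose(1, 0, 2).reshape(6, d_model)
assert np.allclose(merged, x)
```

In a real Transformer each head also gets its own learned Q/K/V projections, which is what lets different heads specialize in different relationships.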
Step 1: Tokenization
Models don't see words; they see tokens. Byte-Pair Encoding (BPE) is a clever algorithm that merges frequent character pairs into subword tokens.
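A single BPE merge step can be sketched in plain Python. This is a simplification (real tokenizers such as GPT-2's operate on bytes and apply thousands of learned merges), but it shows the core idea: count adjacent pairs, merge the most frequent one:

```python
from collections import Counter

# A tiny "corpus" of words as symbol sequences (characters, for simplicity)
words = [list("lower"), list("lowest"), list("newer")]

# Count every adjacent symbol pair across the corpus
pairs = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        pairs[(a, b)] += 1

best = max(pairs, key=pairs.get)   # the most frequent pair
print(best, pairs[best])           # ('w', 'e') 3

# Merge that pair into a single token everywhere it occurs
merged_words = []
for w in words:
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == best:
            out.append(w[i] + w[i + 1]); i += 2
        else:
            out.append(w[i]); i += 1
    merged_words.append(out)

print(merged_words[0])  # ['l', 'o', 'we', 'r']
```

Repeating this step thousands of times yields a vocabulary of subword tokens that balances coverage and compactness.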
Step 2: From Tokens to Embeddings
The model maps each token ID to a high-dimensional vector. This "embedding" represents the token's meaning. Click a token ID to see its vector.
Embedding Matrix (Simplified)
Selected Embedding Vector:
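Under the hood, the embedding lookup is nothing more than row indexing into a matrix. A toy sketch (tiny illustrative sizes):

```python
import numpy as np

vocab_size, d_model = 10, 4             # tiny illustrative sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, d_model))

token_ids = [3, 7, 3]                   # a "sentence" of token IDs
vectors = embedding_matrix[token_ids]   # lookup = row indexing, no arithmetic
print(vectors.shape)  # (3, 4): one d_model-dim vector per token

# The same ID always maps to the same embedding:
assert np.array_equal(vectors[0], vectors[2])
```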
Part 2: An LLM Step-by-Step
Let's demystify the math by calculating the core components of a single attention operation by hand. This will build a strong intuition for what's happening under the hood.
Step 1: Calculate an Attention Score
The first step is to measure the similarity between a "Query" vector and a "Key" vector. This is done with a dot product. Multiply corresponding elements and sum the results.
(1.5 * 0.8) + (-0.5 * 1.2) + (2.0 * -1.0) = ?
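Try it by hand first; afterwards you can check your answer with two lines of Python:

```python
q = [1.5, -0.5, 2.0]   # Query vector
k = [0.8, 1.2, -1.0]   # Key vector

score = sum(qi * ki for qi, ki in zip(q, k))   # the dot product
print(round(score, 6))  # the raw attention score
```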
Step 2: Apply Softmax
After calculating scores for all keys, we use the softmax function to convert these raw scores into a probability distribution (they all sum to 1). The formula for a score \(s_i\) in a set of scores \(S\) is given by:
$$ \text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{j \in S} e^{s_j}} $$
Given raw scores (logits): [2.0, 1.0, 0.1]
First, exponentiate each: \(e^{2.0} \approx 7.39\), \(e^{1.0} \approx 2.72\), \(e^{0.1} \approx 1.11\)
Sum of exponentiated values: \(7.39 + 2.72 + 1.11 = 11.22\)
Now, calculate the softmax probability for the first score (2.0):
\(7.39 / 11.22 = ?\)
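Once you have tried it by hand, a short Python check confirms the result (full precision gives 0.659, which the hand calculation rounds to 0.66):

```python
import math

logits = [2.0, 1.0, 0.1]
exps = [math.exp(s) for s in logits]            # exponentiate each score
probs = [e / sum(exps) for e in exps]           # normalize to sum to 1
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

assert abs(sum(probs) - 1.0) < 1e-9  # softmax always yields a valid distribution
```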
Step 3: Compute the Final Value
Finally, we compute a weighted sum of the "Value" vectors. Each Value vector is multiplied by its corresponding attention probability (from softmax), and the results are summed up.
Attention Weights: [0.66, 0.24, 0.10]
Calculate the first dimension of the output vector:
(0.66 * 1) + (0.24 * 0) + (0.10 * 0) = ?
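The same weighted sum in Python, using toy one-hot Value vectors (an assumption chosen so the arithmetic above works out; real Value vectors are dense):

```python
weights = [0.66, 0.24, 0.10]                 # attention probabilities from softmax
values = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # toy one-hot Value vectors

output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(3)]
print(output)  # with one-hot values, the output simply echoes the weights
```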
Part 3: The Training Process
Training breathes life into the model. We'll cover the self-supervised objective that builds knowledge and provide concrete code examples in PyTorch and TensorFlow.
Pre-training: Learning from the Web
LLMs are pre-trained on trillions of words using a self-supervised objective called Causal Language Modeling (CLM). The task is simple: predict the next word. To get good at this, the model must implicitly learn grammar, facts, and reasoning.
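The CLM training pairs are just the text shifted by one position, as this tiny sketch shows:

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal language modeling: at each position, predict the NEXT token
inputs  = tokens[:-1]
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f"given ...{x!r}, predict {y!r}")
```

One sentence of raw text thus yields a training example at every position, with no human labeling required, which is why the web itself can serve as the training set.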
The Training Loop
- Forward Pass: The model gets a text sequence and predicts the next word at each position.
- Loss Calculation: Cross-entropy loss measures how far the model's predicted distribution is from the actual next word.
- Backward Pass: Backpropagation computes the gradient of the loss (the direction of error) for all model weights.
- Parameter Update: The Adam optimizer adjusts the weights in the direction that reduces the loss.
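The four steps map directly onto a PyTorch training loop. This sketch uses a toy embedding-plus-linear model as a stand-in for a Transformer (the loop structure is identical either way):

```python
import torch
import torch.nn as nn

# Toy next-token model over a 10-token vocabulary (stand-in for a Transformer)
torch.manual_seed(0)
vocab_size, d_model = 10, 16
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (32,))   # a random "text" sequence
inputs, targets = tokens[:-1], tokens[1:]      # causal LM: predict the next token

losses = []
for step in range(50):
    logits = model(inputs)             # 1. forward pass
    loss = loss_fn(logits, targets)    # 2. loss calculation (cross-entropy)
    optimizer.zero_grad()
    loss.backward()                    # 3. backward pass (gradients)
    optimizer.step()                   # 4. Adam parameter update
    losses.append(loss.item())

print(f"{losses[0]:.3f} -> {losses[-1]:.3f}")  # loss falls as the model learns
```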
Simulated Training Loss
This chart simulates the training process, where the model's error (loss) decreases as it learns from data over millions of steps.
Implementation in Code
The Transformer architecture can be built from scratch using popular frameworks. Below is a simplified implementation of a single Encoder Layer, the core repeating block of the Transformer's encoder.
import torch.nn as nn

# Note: MultiHeadAttention and PositionWiseFeedForward are custom modules
# assumed to be defined elsewhere.
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Attention sub-layer
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # Add & Norm
        # Feed-forward sub-layer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))    # Add & Norm
        return x
import tensorflow as tf
from tensorflow.keras.layers import Layer, LayerNormalization, Dropout

# Note: MultiHeadAttention and PositionwiseFeedforward are custom layers
# assumed to be defined elsewhere.
class EncoderLayer(Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionwiseFeedforward(d_model, dff)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, x, training, mask):
        # Attention sub-layer
        attn_output = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)        # Add & Norm
        # Feed-forward sub-layer
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)      # Add & Norm
        return out2
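The snippets above rely on attention and feed-forward classes defined elsewhere in the tutorial. For a version that runs standalone, PyTorch's built-in nn.MultiheadAttention and a small nn.Sequential feed-forward block can stand in (a sketch, not a drop-in replica of the code above):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Same Add & Norm structure as above, using built-in PyTorch layers."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_output, _ = self.self_attn(x, x, x)       # self-attention: Q = K = V = x
        x = self.norm1(x + self.dropout(attn_output))  # Add & Norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))    # Add & Norm
        return x

layer = EncoderLayer(d_model=64, num_heads=4, d_ff=256)
out = layer(torch.randn(2, 10, 64))   # (batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 10, 64])
```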
Part 4: The Hardware Engine
LLMs are computationally hungry. Their performance is defined by the hardware they run on, primarily Graphics Processing Units (GPUs).
Visualizing Parallelism: CPU vs. GPU
This animation shows how a GPU's parallel architecture is perfectly suited for matrix operations. A CPU must perform each row calculation one by one, while a GPU can do them all at once.
CPU (Sequential)
GPU (Parallel)
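The same contrast can be seen in code. A loose numpy analogy: computing a matrix product row by row versus in a single vectorized call (the rows are independent, which is exactly what lets a GPU compute them all at once):

```python
import numpy as np

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)

# "CPU-style": compute the product one row at a time, sequentially
rows = [A[i] @ B for i in range(A.shape[0])]
sequential = np.stack(rows)

# "GPU-style": one vectorized call over all rows at once
parallel = A @ B

assert np.allclose(sequential, parallel)  # same result, very different schedule
```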
GPU Evolution for AI
Modern data center GPUs have evolved specifically for AI. This chart compares key metrics across recent NVIDIA architectures.
Part 5: Alignment & Inference
A pre-trained model is a raw diamond. Alignment polishes it into a helpful assistant, and inference optimizations make it fast and efficient.
The 3 Steps of RLHF
Reinforcement Learning from Human Feedback (RLHF) makes models helpful and harmless. Click each step to learn more.
1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. RL Optimization (PPO)
Goal: Teach the model the conversational format.
Process: Fine-tune the base LLM on a small, high-quality dataset of prompt-response pairs curated by humans.
Goal: Create a model that can score responses based on human preferences.
Process: Humans rank several model responses to a prompt. A new "Reward Model" is trained on this preference data to predict which responses humans would prefer.
Goal: Use the Reward Model to further improve the SFT model.
Process: Use Reinforcement Learning (PPO) to optimize the SFT model. The model generates responses, the Reward Model provides a "reward" score, and the SFT model's weights are adjusted to maximize this reward.
KV Caching: The Critical Optimization
This animation shows how the KV Cache speeds up generation. The "Prefill" step processes the prompt once, and the "Decode" steps reuse that information to generate new tokens quickly.
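The cache itself is just a growing list of Key and Value vectors. A toy sketch of prefill and decode (identity "projections" stand in for learned weight matrices):

```python
import numpy as np

d = 8
cached_keys, cached_values = [], []    # the KV cache

def decode_step(x):
    """Process one new token: compute its K and V once, reuse everything cached."""
    cached_keys.append(x)              # toy projections: identity
    cached_values.append(x)
    K = np.stack(cached_keys)          # all keys seen so far, each computed only once
    V = np.stack(cached_values)
    scores = K @ x / np.sqrt(d)        # new token attends to the whole cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

prompt = [np.random.randn(d) for _ in range(5)]
for tok in prompt:                     # "prefill": fill the cache from the prompt
    out = decode_step(tok)
for _ in range(3):                     # "decode": each new token reuses the cache
    out = decode_step(np.random.randn(d))

print(len(cached_keys))  # 8 entries: 5 prompt tokens + 3 generated tokens
```

Without the cache, every decode step would recompute K and V for the entire sequence, which is why KV caching is the critical optimization for generation speed.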
Part 6: Interactive Playground
Get hands-on with core LLM mechanics. Explore how attention works and how different decoding strategies change the model's output.
Self-Attention Explorer
Explore self-attention as a heatmap. Each row represents a "query" token, and columns represent "key" tokens. Darker cells indicate higher attention scores. Hover over a row label to highlight its attention pattern.
Decoding Strategy Comparator
The way a model chooses the next word drastically affects its output. Compare different strategies below.
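Three common strategies, sketched over a toy four-token distribution (the logits are made up for illustration):

```python
import math, random

logits = {"cat": 2.0, "dog": 1.5, "car": 0.5, "the": 0.1}

def softmax(scores, temperature=1.0):
    exps = {w: math.exp(s / temperature) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Greedy: always pick the single most likely token (deterministic, can loop)
greedy = max(logits, key=logits.get)
print(greedy)  # 'cat'

# Temperature sampling: higher temperature flattens the distribution -> more diverse
probs = softmax(logits, temperature=1.5)
sampled = random.choices(list(probs), weights=list(probs.values()))[0]

# Top-k: keep only the k most likely tokens, then sample among them
k = 2
top_k = dict(sorted(logits.items(), key=lambda kv: -kv[1])[:k])
top_k_probs = softmax(top_k)
print(sorted(top_k))  # ['cat', 'dog']: 'car' and 'the' can never be sampled
```

Greedy decoding maximizes per-step probability but tends toward repetitive text; sampling with temperature and top-k trades a little likelihood for variety.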