Deconstructing Large Language Models

An interactive journey into the heart of LLMs. Explore the core concepts, visualize the mechanics, and understand the technology that's changing our world, from the underlying math to the hardware that powers it.

Part 1: The Foundations

Before a model can learn, it needs an architecture. We'll break down the revolutionary Transformer model and explore how we turn raw text into numbers the machine can understand.

The Transformer Architecture

The Transformer dispensed with the sequential nature of older models (like RNNs) and embraced parallelism. Its core innovation is the self-attention mechanism, which allows every word in a sentence to look at every other word simultaneously to build contextual understanding.

Self-Attention: Q, K, V

Attention works by creating three vectors for each input token: a Query (what I'm looking for), a Key (what I contain), and a Value (what I will give you). By comparing the Query of one token to the Keys of all others, the model calculates "attention weights"—how much focus to pay to each token—and produces a weighted sum of the Values. Try calculating this by hand in the next section!

The entire operation can be expressed in a single, highly parallelizable matrix equation:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

The scaling factor \(\sqrt{d_k}\) is crucial for stabilizing gradients during training.
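As a sketch, the equation above maps directly onto a few lines of NumPy (the shapes and values here are illustrative, not from a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of every query to every key
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V                 # weighted sum of the values

# Toy example: 2 tokens, d_k = 3
Q = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(attention(Q, K, V).shape)  # (2, 3): one output vector per token
```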

Multi-Head Attention

Instead of one attention mechanism, Transformers use many "heads" in parallel. Each head learns to focus on different types of relationships (e.g., syntactic, semantic), creating a richer, more nuanced understanding of the text.

Step 1: Tokenization

Models don't see words; they see tokens. Byte-Pair Encoding (BPE) builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a corpus, so common words end up as single tokens while rare words split into smaller pieces.
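A minimal sketch of one BPE merge step, counting adjacent symbol pairs and merging the most frequent one (real tokenizers operate on bytes and learn thousands of merges; the toy corpus here is made up):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency
words = {('l', 'o', 'w'): 5, ('l', 'o', 't'): 3, ('s', 'l', 'o', 'w'): 2}
pair = most_frequent_pair(words)  # ('l', 'o') occurs 10 times, the most
words = merge_pair(words, pair)
print(words)  # 'lo' is now a single symbol in every word
```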

Step 2: From Tokens to Embeddings

The model maps each token ID to a high-dimensional vector. This "embedding" represents the token's meaning. Click a token ID to see its vector.

Embedding Matrix (Simplified)

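In PyTorch, this lookup is just an `nn.Embedding` table indexed by token ID (the vocabulary size, dimension, and IDs below are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 8  # real models use d_model in the thousands
embedding = nn.Embedding(vocab_size, d_model)  # one learnable row per token

token_ids = torch.tensor([15496, 995])  # two example token IDs from a tokenizer
vectors = embedding(token_ids)          # look up one row per ID
print(vectors.shape)  # torch.Size([2, 8])
```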

Part 2: An LLM Step-by-Step

Let's demystify the math by calculating the core components of a single attention operation by hand. This will build a strong intuition for what's happening under the hood.

Step 1: Calculate an Attention Score

The first step is to measure the similarity between a "Query" vector and a "Key" vector. This is done with a dot product. Multiply corresponding elements and sum the results.

Query: [1.5, -0.5, 2.0]
Key: [0.8, 1.2, -1.0]

(1.5 * 0.8) + (-0.5 * 1.2) + (2.0 * -1.0) = ?
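Once you have worked it out by hand, you can check your answer with a one-line dot product:

```python
import numpy as np

query = np.array([1.5, -0.5, 2.0])
key = np.array([0.8, 1.2, -1.0])

score = np.dot(query, key)  # (1.5*0.8) + (-0.5*1.2) + (2.0*-1.0)
print(score)  # ≈ -1.4
```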

Step 2: Apply Softmax

After calculating scores for all keys, we use the softmax function to convert these raw scores into a probability distribution (they all sum to 1). The formula for a score \(s_i\) in a set of scores \(S\) is given by:

$$ \text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{j \in S} e^{s_j}} $$

Given raw scores (logits): [2.0, 1.0, 0.1]

First, exponentiate each: \(e^{2.0} \approx 7.39\), \(e^{1.0} \approx 2.72\), \(e^{0.1} \approx 1.11\)

Sum of exponentiated values: \(7.39 + 2.72 + 1.11 = 11.22\)

Now, calculate the softmax probability for the first score (2.0):

\(7.39 / 11.22 = ?\)
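The same calculation in NumPy, applied to all three scores at once:

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])
exp_scores = np.exp(scores)            # ≈ [7.39, 2.72, 1.11]
probs = exp_scores / exp_scores.sum()  # normalize so the probabilities sum to 1
print(probs.round(2))  # ≈ [0.66, 0.24, 0.10]
```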

Step 3: Compute the Final Value

Finally, we compute a weighted sum of the "Value" vectors. Each Value vector is multiplied by its corresponding attention probability (from softmax), and the results are summed up.

Attention Weights: [0.66, 0.24, 0.10]

Value 1: [1, 0, 0]
Value 2: [0, 1, 0]
Value 3: [0, 0, 1]

Calculate the first dimension of the output vector:

(0.66 * 1) + (0.24 * 0) + (0.10 * 0) = ?
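As a check, the whole weighted sum is a single matrix-vector product; with these identity-like Value vectors the output simply reproduces the attention weights:

```python
import numpy as np

weights = np.array([0.66, 0.24, 0.10])
values = np.array([[1.0, 0.0, 0.0],   # Value 1
                   [0.0, 1.0, 0.0],   # Value 2
                   [0.0, 0.0, 1.0]])  # Value 3

output = weights @ values  # weighted sum of the Value vectors
print(output)  # ≈ [0.66, 0.24, 0.10]
```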

Part 3: The Training Process

Training breathes life into the model. We'll cover the self-supervised objective that builds knowledge and provide concrete code examples in PyTorch and TensorFlow.

Pre-training: Learning from the Web

LLMs are pre-trained on trillions of words using a self-supervised objective called Causal Language Modeling (CLM). The task is simple: predict the next word. To get good at this, the model must implicitly learn grammar, facts, and reasoning.

The Training Loop

  1. Forward Pass: The model gets a text sequence and predicts the next word at each position.
  2. Loss Calculation: Cross-Entropy Loss measures how wrong the prediction was.
  3. Backward Pass: Backpropagation calculates the gradient (direction of error) for all model weights.
  4. Parameter Update: The Adam optimizer adjusts the weights to reduce the loss.
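The four steps above can be sketched in PyTorch; the tiny "model", the random batch, and the hyperparameters here are placeholders, not a real training setup:

```python
import torch
import torch.nn as nn

# Placeholder "model": maps token IDs to next-token logits
vocab_size, d_model = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (4, 32))   # fake batch: 4 sequences of 32 tokens
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # target = the next token at each position

for step in range(3):
    logits = model(inputs)                        # 1. forward pass
    loss = loss_fn(logits.reshape(-1, vocab_size),
                   targets.reshape(-1))           # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()                               # 3. backward pass (gradients)
    optimizer.step()                              # 4. parameter update (Adam)
```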

Simulated Training Loss

This chart simulates the training process, where the model's error (loss) decreases as it learns from data over millions of steps.

Implementation in Code

The Transformer architecture can be built from scratch using popular frameworks. Below is a simplified implementation of a single Encoder Layer, the core repeating block of the Transformer's encoder.


PyTorch (using the built-in nn.MultiheadAttention and an nn.Sequential feed-forward network in place of the custom sub-layers, so the snippet is self-contained):

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        # Built-in multi-head self-attention (batch_first: inputs are [batch, seq, d_model])
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Attention sub-layer (nn.MultiheadAttention returns (output, weights))
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))  # Add & Norm

        # Feed-forward sub-layer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))  # Add & Norm
        return x

TensorFlow (using Keras' built-in MultiHeadAttention and a small Sequential feed-forward network, so the snippet is self-contained):

import tensorflow as tf
from tensorflow.keras.layers import (Dense, Dropout, Layer,
                                     LayerNormalization, MultiHeadAttention)

class EncoderLayer(Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super().__init__()
        # Built-in Keras multi-head attention; key_dim is the per-head dimension
        self.mha = MultiHeadAttention(num_heads=num_heads,
                                      key_dim=d_model // num_heads)
        # Position-wise feed-forward network
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, x, training=False, mask=None):
        # Attention sub-layer (query, value, key all set to x for self-attention)
        attn_output = self.mha(x, x, x, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Add & Norm

        # Feed-forward sub-layer
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Add & Norm
        return out2

Part 4: The Hardware Engine

LLMs are computationally hungry. Their performance is defined by the hardware they run on, primarily Graphics Processing Units (GPUs).

Visualizing Parallelism: CPU vs. GPU

This animation shows how a GPU's parallel architecture is perfectly suited for matrix operations. A CPU must perform each row calculation one by one, while a GPU can do them all at once.
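The difference can be felt even in NumPy: a Python loop produces one output row at a time, while a single vectorized matmul hands every row to an optimized parallel kernel at once — a rough analogy for the CPU-vs-GPU contrast above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((256, 256))
B = rng.random((256, 256))

# "Sequential": compute the result one row at a time
C_rows = np.empty_like(A)
for i in range(A.shape[0]):
    C_rows[i] = A[i] @ B

# "Parallel": one call, all rows at once
C_all = A @ B

print(np.allclose(C_rows, C_all))  # True — same result, computed very differently
```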

CPU (Sequential)


GPU (Parallel)

GPU Evolution for AI

Modern data center GPUs have evolved specifically for AI. This chart compares key metrics across recent NVIDIA architectures.

Part 5: Alignment & Inference

A pre-trained model is a raw diamond. Alignment polishes it into a helpful assistant, and inference optimizations make it fast and efficient.

The 3 Steps of RLHF

Reinforcement Learning from Human Feedback (RLHF) makes models helpful and harmless. Click each step to learn more.

1. Supervised Fine-Tuning (SFT)

Goal: Teach the model the conversational format.

Process: Fine-tune the base LLM on a small, high-quality dataset of prompt-response pairs curated by humans.

2. Reward Model Training

Goal: Create a model that can score responses based on human preferences.

Process: Humans rank several model responses to a prompt. A new "Reward Model" is trained on this preference data to predict which responses humans would prefer.

3. RL Optimization (PPO)

Goal: Use the Reward Model to further improve the SFT model.

Process: Use Reinforcement Learning (PPO) to optimize the SFT model. The model generates responses, the Reward Model provides a "reward" score, and the SFT model's weights are adjusted to maximize this reward.
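The Reward Model in step 2 is commonly trained with a pairwise (Bradley–Terry-style) loss that pushes the score of the human-preferred response above the rejected one. A toy version, using made-up scalar scores in place of a real reward network:

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores, rejected_scores):
    # Maximize P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy reward scores for 3 preference pairs
chosen = torch.tensor([2.0, 0.5, 1.2])
rejected = torch.tensor([1.0, 0.7, -0.3])
print(preference_loss(chosen, rejected))  # small positive loss
```

The loss shrinks as the gap between chosen and rejected scores grows, which is exactly the behavior the ranking data rewards.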

KV Caching: The Critical Optimization

This animation shows how the KV Cache speeds up generation. The "Prefill" step processes the prompt once, and the "Decode" steps reuse that information to generate new tokens quickly.
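A schematic of the idea in NumPy (the projection matrices and token vectors are random stand-ins): the prompt's Keys and Values are computed once during prefill, and each decode step only projects the single newest token before appending it to the cache.

```python
import numpy as np

d_model = 4
rng = np.random.default_rng(0)
W_k = rng.random((d_model, d_model))  # key projection
W_v = rng.random((d_model, d_model))  # value projection

def decode_step(new_token_vec, cache):
    """Project only the newest token; reuse cached K/V for all earlier tokens."""
    cache['K'] = np.vstack([cache['K'], new_token_vec @ W_k])
    cache['V'] = np.vstack([cache['V'], new_token_vec @ W_v])
    return cache

# Prefill: project the whole prompt once
prompt = rng.random((5, d_model))  # 5 prompt tokens
cache = {'K': prompt @ W_k, 'V': prompt @ W_v}

# Decode: each new token adds one row instead of recomputing everything
cache = decode_step(rng.random((1, d_model)), cache)
print(cache['K'].shape)  # (6, 4): 5 cached prompt rows + 1 new row
```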

Part 6: Interactive Playground

Get hands-on with core LLM mechanics. Explore how attention works and how different decoding strategies change the model's output.

Self-Attention Explorer

Explore self-attention as a heatmap. Each row represents a "query" token, and columns represent "key" tokens. Darker cells indicate higher attention scores. Hover over a row label to highlight its attention pattern.

Decoding Strategy Comparator

The way a model chooses the next word drastically affects its output. Compare different strategies below.
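A toy comparison over a single next-token distribution (the vocabulary and logits are illustrative): greedy decoding always takes the argmax, temperature reshapes the distribution before sampling, and top-k samples only from the k most likely tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['the', 'cat', 'sat', 'mat', 'ran']
logits = np.array([2.5, 1.8, 0.9, 0.2, -0.5])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy(logits):
    return vocab[int(np.argmax(logits))]   # deterministic: always the top token

def sample_temperature(logits, temperature=1.0):
    probs = softmax(logits / temperature)  # T < 1 sharpens, T > 1 flattens
    return vocab[rng.choice(len(vocab), p=probs)]

def sample_top_k(logits, k=2):
    top = np.argsort(logits)[-k:]          # indices of the k highest logits
    probs = softmax(logits[top])           # renormalize over just those k
    return vocab[rng.choice(top, p=probs)]

print(greedy(logits))                   # always 'the'
print(sample_temperature(logits, 0.7))  # usually 'the' or 'cat', occasionally others
print(sample_top_k(logits, k=2))        # only ever 'the' or 'cat'
```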