The Illustrated LLM

An Interactive Guide to Foundational Models

By Sreekanth Pannala | June 22, 2025 | Created with assistance from Gemini

First, what is an LLM?

Generative AI is a type of AI that can create new content—like text, images, or code. Its engine is a Neural Network, a system inspired by the human brain that learns patterns from data. A Large Language Model (LLM) is a massive neural network trained on vast amounts of text.

At its core, an LLM is a machine built to do one thing incredibly well: predict the next word in a sequence based on probability.

Part 1: Model Construction

Step 0: Data Collection

The journey begins with enormous amounts of text scraped from the web, books, and code repositories—a significant portion of the digital world. This raw data is filtered down to trillions of high-quality words.

Step 1: Tokenization

Text is broken into numerical IDs called tokens. Each token ID points to a high-dimensional vector (an embedding) that represents its meaning.
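The idea can be sketched in a few lines. This is a toy illustration, not a real tokenizer: the vocabulary and the random embedding table are made up for demonstration, and real LLMs use subword tokenizers with vocabularies of tens of thousands of entries.

```python
import numpy as np

# Toy illustration of tokenization: text -> token IDs -> embedding vectors.
vocab = {"the": 0, "cat": 1, "sat": 2}        # tiny illustrative vocabulary
embeddings = np.random.rand(len(vocab), 4)    # one 4-dim vector per token ID

def tokenize(text):
    """Map whitespace-split words to their numerical token IDs."""
    return [vocab[word] for word in text.split()]

ids = tokenize("the cat sat")   # -> [0, 1, 2]
vectors = embeddings[ids]       # shape (3, 4): one embedding per token
print(ids, vectors.shape)
```

The embedding lookup is just an indexing operation: each token ID selects one row of the embedding table, and those rows are what the network actually processes.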

Step 2: Architecture & Pre-training

The Transformer architecture uses a self-attention mechanism to understand context. During pre-training, the model adjusts trillions of parameters over months by constantly predicting the next word in its vast dataset. This is where it learns grammar, facts, and reasoning.

The Engine: A Neural Network Layer

Each layer in a neural network performs a simple linear algebra operation followed by a non-linear activation.

$$ \text{Output} = \text{activation}(W \cdot \text{Input} + b) $$
Example Calculation:
Input = [0.5, -0.2]
Weights (W) = [[0.7, 0.2], [-0.1, 0.4]]
Bias (b) = [0.1, -0.1]
Result = W · Input + b = [0.41, -0.23] (before the activation is applied)
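The same calculation can be reproduced with NumPy, using ReLU as one common choice of activation (the example above shows only the linear step):

```python
import numpy as np

# Reproduce the worked example above.
x = np.array([0.5, -0.2])                 # Input
W = np.array([[0.7, 0.2], [-0.1, 0.4]])   # Weights
b = np.array([0.1, -0.1])                 # Bias

z = W @ x + b              # linear step: [0.41, -0.23]
out = np.maximum(0, z)     # ReLU activation zeroes out negatives: [0.41, 0.0]
print(z, out)
```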

The Innovation: Self-Attention

Self-attention is a special layer that dynamically weighs the importance of other words in a sequence. It projects the input into three matrices: Query, Key, and Value.

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
A concrete example:
1. Input Embeddings (X) for "AI is"
X = [[1, 0, 1, 0], [0, 1, 0, 1]]
2. Multiply by Weight Matrices (W) to get Q, K, V
Q = [[1,1,2],[1,1,1]], K = [[0,2,1],[2,1,1]], V = [[1,2,3],[1,4,0]]
3. Calculate Scores, Scale by √d_k (here d_k = 3), and Apply Softmax
Scores = QK^T = [[4, 5], [3, 4]]
Weights = softmax(Scores / √3) = [[0.36, 0.64], [0.36, 0.64]]
4. Final Output (Weighted Sum)
Output = Weights · V = [[1.0, 3.28, 1.08], [1.0, 3.28, 1.08]]
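The worked example maps directly onto a few lines of NumPy implementing the attention formula above (the Q, K, V values are taken from the example; a real model would compute them from X via learned weight matrices):

```python
import numpy as np

def softmax(s):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1, 1, 2], [1, 1, 1]], dtype=float)
K = np.array([[0, 2, 1], [2, 1, 1]], dtype=float)
V = np.array([[1, 2, 3], [1, 4, 0]], dtype=float)

d_k = Q.shape[-1]                          # key dimension, 3
scores = Q @ K.T                           # [[4, 5], [3, 4]]
weights = softmax(scores / np.sqrt(d_k))   # ~[[0.36, 0.64], [0.36, 0.64]]
output = weights @ V                       # ~[[1.0, 3.28, 1.08], [1.0, 3.28, 1.08]]
print(weights.round(2))
print(output.round(2))
```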

The Learning Process: Backpropagation

1. Calculate Loss

The model predicts, we compare it to the truth, and calculate a "loss" score representing the error.

The Cross-Entropy loss function is ideal for classification tasks like predicting the next token. It heavily penalizes the model for being confident and wrong, pushing it towards accurate probability distributions.
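For a single next-token prediction, cross-entropy reduces to the negative log of the probability the model assigned to the correct token. A minimal sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true token index."""
    return -np.log(probs[target])

probs = np.array([0.9, 0.05, 0.05])   # model's predicted distribution

# Confident and right -> small loss; confident and wrong -> large loss.
print(cross_entropy(probs, 0))   # ~0.105
print(cross_entropy(probs, 1))   # ~3.0
```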

2. Find the Gradient

Backpropagation uses the chain rule from calculus to find the gradient of the loss for every single weight.

The gradient is a vector that tells us two things for each weight: the direction of the steepest ascent in loss, and the magnitude of that slope. It's the key to knowing how to adjust the weight.

3. Update Weights

We adjust each weight by taking a small step in the opposite direction of its gradient. This is called Gradient Descent.

$$ W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W_{old}} $$

This simple update rule, applied trillions of times, allows the complex web of weights to settle into a configuration that minimizes the overall loss and produces accurate predictions. The learning rate, α, is a critical hyperparameter that controls how large each update step is.
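The update rule can be seen in action on a toy one-parameter loss, L(w) = (w − 3)², whose gradient dL/dw = 2(w − 3) we can write down by hand instead of obtaining via backpropagation:

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, minimized at w = 3.
w = 0.0        # initial weight
alpha = 0.1    # learning rate

for step in range(100):
    grad = 2 * (w - 3)     # gradient of the loss at the current weight
    w = w - alpha * grad   # step opposite the gradient

print(w)   # converges toward 3.0, the minimum of the loss
```

With a learning rate that is too large (here, anything above 1.0), the same loop overshoots and diverges, which is why α is such a sensitive hyperparameter.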

Part 2: Alignment & Feedback (RLHF)

Alignment teaches the model to be helpful and safe using human feedback.

1. Collect Preferences

Humans rank different model responses to the same prompt. This step is crucial and labor-intensive: high-quality preference data is the foundation for a well-behaved model.

2. Train Reward Model

A separate model is trained to predict human preference scores. It learns a complex function mapping any generated text to a single scalar value representing "goodness."

3. Fine-Tune with RL

The LLM is tuned to maximize the score from the Reward Model. Using algorithms like PPO, the LLM explores the space of possible responses, learning a policy that generates text humans are likely to rate highly.
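Step 2 is commonly trained with a pairwise (Bradley–Terry style) objective: the reward model should score the human-preferred response higher than the rejected one. A minimal sketch with placeholder scores (the scalar scores here are made up; in practice they come from the reward model):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score gap)). Shrinks as the reward
    model scores the human-preferred response further above the other."""
    gap = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-gap)))

print(preference_loss(2.0, 0.0))   # small: model agrees with the human ranking
print(preference_loss(0.0, 2.0))   # large: model disagrees, gets penalized
```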

Part 3: Inference & Generation

"Inference" is using the trained model. It produces raw scores (logits) for all possible next tokens. The Softmax function converts these scores into probabilities.
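The logits-to-probabilities step, plus sampling the next token, can be sketched as follows. The logits and the temperature value are illustrative; temperature is one common knob (not mentioned above) that sharpens or flattens the distribution before sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # raw scores over a 4-token vocabulary
temperature = 0.8                          # <1 sharpens, >1 flattens

scaled = logits / temperature
probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
probs /= probs.sum()                       # softmax -> probabilities summing to 1

next_token = rng.choice(len(probs), p=probs)   # sample the next token ID
print(probs.round(3), next_token)
```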


Part 4: Agents & Constraints

An LLM becomes an Agent when it can use tools to accomplish multi-step tasks. To do this reliably, its output must be constrained.

The Agentic Loop

The agent thinks, acts by calling a tool, observes the result, and repeats.

🤔 Thought: I need to find a number first...
🎬 Action: SearchAPI("Messi goals 2023")
🔭 Observation: "Scored 11 goals"
✅ Final Answer: "The square root of 11 is 3.32."
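The loop above can be sketched in code. `search_api` stands in for a real tool (the "SearchAPI" in the example) and is hard-coded here, so this is a single-iteration sketch rather than a full multi-step agent:

```python
import math

def search_api(query):
    """Hypothetical tool; a real agent would call an external search service."""
    return "Scored 11 goals"

def agent(question):
    # Thought: I need to find a number first.
    observation = search_api("Messi goals 2023")            # Action + Observation
    goals = int("".join(c for c in observation if c.isdigit()))
    answer = round(math.sqrt(goals), 2)                     # act on the observation
    return f"The square root of {goals} is {answer}."       # Final Answer

print(agent("What is the square root of Messi's 2023 goal count?"))
```

A full agent framework wraps this think/act/observe cycle in a loop, feeding each observation back into the LLM until it decides it has a final answer.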

Constrained Generation

To use tools, the agent must generate perfectly structured output, such as JSON. We can force its output to follow a strict format (a schema).

// Constraint (JSON Schema)
{ "name": "string", "age": "integer" }

// Guaranteed Valid Output
{ "name": "John Doe", "age": 34 }
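A minimal sketch of checking output against the schema above, using only the standard library. Note the simplification: real constrained decoding restricts the model token-by-token during generation (e.g. with a grammar), guaranteeing validity by construction, whereas this only validates a finished output:

```python
import json

# Expected fields and their Python types, mirroring the schema above.
schema = {"name": str, "age": int}

def is_valid(output_text):
    """Check that output_text parses as JSON and matches the schema's
    field names and types."""
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    return (set(data) == set(schema)
            and all(isinstance(data[k], t) for k, t in schema.items()))

print(is_valid('{"name": "John Doe", "age": 34}'))   # True
print(is_valid('{"name": "John Doe"}'))              # False: missing "age"
```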

The Journey Ahead

The field of Generative AI is evolving at an incredible pace. What was state-of-the-art yesterday is foundational today. The core principles of data, architecture, and feedback, however, remain central to this progress.

Resources for Further Learning

  • arXiv.org - The home of cutting-edge research papers, often released here before formal publication.
  • The Hugging Face Blog - Accessible explanations of new models and techniques from a leader in open-source AI.
  • DeepLearning.AI - In-depth courses from Andrew Ng, a pioneer in the field.
  • Distill.pub - A journal known for outstanding interactive and visual explanations of machine learning concepts.