foundations
Part of the AI system design curriculum
How LLMs Actually Work
A ground-up tour of tokens, embeddings, attention, and why transformers scale.
A large language model does not read your prompt the way you do. It converts text into numbers, multiplies those numbers through tens of billions of learned parameters, and predicts the single most probable next word, over and over until it stops. What makes this feel like intelligence is not magic but three interlocking mechanisms: tokenization, learned vector representations, and the self-attention operation. Understanding how they work together is the foundation for every architectural decision you will make in an AI system.
From Text to Numbers: Tokens and Embeddings
Tokenization
The first step in every LLM pipeline is breaking input text into tokens: discrete integer IDs that the model can process. Tokens are not words. A tokenizer trained with Byte-Pair Encoding might split "tokenization" into ["token", "ization"] and represent a common word like "the" as a single token while splitting a rare technical term into four fragments. Most modern vocabularies contain 32,000-128,000 unique tokens.
This matters for practitioners. A prompt that costs 50 tokens in English might cost 90 in Korean because less common scripts are represented less efficiently. Context window limits are always in tokens, not words, and code is typically denser than prose.
Byte-Pair Encoding starts from a character-level alphabet and iteratively merges the most frequent adjacent symbol pair until the vocabulary reaches a target size. After enough merges, common words map to a single token and rare words decompose gracefully into recognizable subword fragments, so the tokenizer never produces a hard unknown-token error. The vocabulary is a by-product of corpus frequency statistics rather than a manually curated word list, which means it reflects whatever languages and domains dominated the training data.
Token fertility differs across content types in ways that carry real cost implications. English prose averages roughly 0.75 tokens per word. Korean and Thai average closer to 1.5 to 2 tokens per word for equivalent meaning, because fewer of their subword patterns reached the merge threshold in corpora dominated by English. Python source code runs about 3 to 4 characters per token on average: reserved keywords like def and return compress to single tokens, but long variable names and string literals fragment. A 128,000-token context window therefore holds substantially less Chinese or Japanese text than English text of equivalent informational density, and API cost estimates must account for the target language.
Tokenization also surfaces subtle bugs worth knowing about. The string "Bank" and " Bank" (with a leading space) are often different token IDs with distinct embeddings. A model asked to complete text that ends mid-subword may produce garbled output on rare technical terms. Production teams fine-tuning on domain-specific text sometimes expand the tokenizer vocabulary and retrain the embedding layer to improve fertility on specialized terms, accepting the cost of training new embedding parameters from scratch.
Embeddings: The Geometry of Meaning
Once a prompt is tokenized, each ID is converted to a floating-point vector, often 4,096 to 16,384 dimensions. These vectors are not arbitrary: they are learned during training so that semantically related tokens end up close together in the high-dimensional space. The famous example holds: the direction from "man" to "king" is roughly parallel to the direction from "woman" to "queen," because the training corpus encodes that relationship in billions of co-occurrence patterns.
Imagine every concept in the model's vocabulary has coordinates on a vast map. Proximity on that map reflects conceptual similarity. The model doesn't need explicit synonym rules; it absorbs the map's shape from raw text at scale.
The embedding layer is a matrix of shape (vocab_size, d_model). For a model with a 128,000-token vocabulary and d_model of 4,096, this table alone holds roughly 500 million parameters before a single transformer block is counted. Expanding the vocabulary adds parameters to both the input embedding and the output projection layer (they often share weights in smaller models), and those new parameters need sufficient gradient signal to converge. This is why changing the tokenizer mid-training is expensive: the existing embedding geometry is partially invalidated.
What distinguishes transformer embeddings from earlier approaches like Word2Vec is that they are contextual. The hidden state for "bank" at the final transformer layer is shaped by every other token in the context window, not just the token ID in isolation. The same token entering the model produces different hidden-state vectors at layer 32 depending on what surrounds it. The embedding table is the starting point; each transformer block refines and enriches the representation by mixing in context through attention and compressing associations through the FFN.
The Self-Attention Mechanism
Why Attention Changed Everything
Earlier neural sequence models processed tokens one at a time, chaining hidden states the way a relay race passes a baton. Information about the first word had to survive dozens of handoffs to influence the last word. Long-range dependencies degraded unpredictably.
Self-attention replaces that chain with a direct-communication network. Every token simultaneously asks every other token: "How relevant are you to my current meaning?" and receives an answer weighted by learned relevance scores. Distance in the sequence no longer limits influence.
The tradeoff is that the attention score matrix has shape (seq_len, seq_len). Computing and storing it is O(n²) in sequence length. For a 128,000-token context window this matrix, in full fp16 precision, would occupy around 32 GB per layer per batch item before you account for any other activations. Algorithms like FlashAttention avoid materializing the full matrix by computing attention in tiles that fit in on-chip SRAM, reducing memory bandwidth pressure at the cost of some recomputation. This is why FlashAttention is enabled by default in every serious inference stack, and it is a prerequisite for long-context models rather than an optional optimization.
Queries, Keys, and Values
Each token's embedding is linearly projected into three roles:
- Query (Q): what this token is searching for
- Key (K): what this token advertises about itself
- Value (V): the content this token contributes when selected
The attention weight between token i and token j is the dot product of i's query and j's key, divided by the square root of the head dimension, then passed through a softmax. The final output for token i is a weighted sum of all values, where the weights reflect relevance scores.
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
d_k = Q.size(-1)
# Similarity of every query to every key: shape (seq, seq)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
# Normalize to probability weights along the key dimension
weights = F.softmax(scores, dim=-1)
# Weighted blend of values
return torch.matmul(weights, V)The / (d_k ** 0.5) scaling is a subtle but critical detail. Without it, large dimensions cause dot products to grow proportionally, pushing the softmax into near-zero-gradient regions and stalling learning. Scaling by √d keeps the expected magnitude stable regardless of vector size.
To make this concrete: if d_k is 64 and the components of Q and K are drawn from a standard normal distribution, the expected dot product is 0 but the standard deviation is 8. At d_k = 4,096 without scaling, the standard deviation of raw dot products would be 64. The largest logit then dominates the softmax exponentially and every other attention weight collapses toward zero, making the softmax behave like a hard argmax. Gradients through almost every position become zero, and training stalls. Dividing by the square root of d_k normalizes the standard deviation back to 1.0 regardless of head dimension and keeps the softmax in a range where gradients flow through all positions.
In matrix notation, Q has shape (seq_len, d_k), K has shape (seq_len, d_k), and V has shape (seq_len, d_v), where typically d_k == d_v == d_model / n_heads. The output of a single head has shape (seq_len, d_v), and all heads are concatenated to (seq_len, d_model) before a final linear projection mixes the results. Consider a three-token sequence "The cat sat": after projecting queries and keys to d_k = 4 for one head, you compute a 3 by 3 raw score matrix. After scaling by √4 = 2 and applying row-wise softmax, each row sums to 1. The weight at row "sat", column "cat" reflects how strongly "sat" attends to "cat". The output for "sat" is then the weighted sum of all three value vectors, letting the model pull in context from any position.
Multiple Heads in Parallel
A single attention head can only capture one type of relationship at a time. Real language has overlapping structure: subject-verb agreement, pronoun coreference, semantic entailment. Multi-head attention splits the embedding dimension across several independent attention heads, computes attention separately in each, and concatenates the results. One head might track syntactic dependencies; another might specialize in entity co-reference. The architecture discovers these roles without being told.
Each head operates on a slice of d_model. For d_model = 4,096 and n_heads = 32, each head receives d_k = 128 dimensions. The heads run in parallel, so multi-head attention costs roughly the same as a single head over the full dimension while capturing richer relationships across diverse subspaces simultaneously.
The KV Cache
During autoregressive generation, the model produces one new token at a time by appending the predicted token and running another forward pass. Without caching, this would recompute the keys and values for every preceding token in every layer on each step, even though those tokens have not changed. The KV cache stores those tensors in GPU memory so that each new step only needs to compute the query for the newest token and attend against the already-stored keys and values.
The memory footprint follows a simple formula:
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element
The factor of 2 accounts for both K and V. For a model like Llama 2 7B (32 layers, 32 KV heads, head_dim 128, fp16), each token in the cache occupies 2 * 32 * 32 * 128 * 2 = 524,288 bytes, roughly 0.5 MB. A 4,096-token context therefore consumes about 2 GB of VRAM purely for the KV cache, on top of the roughly 14 GB needed to hold the model weights in fp16. At 32,000 tokens the cache reaches 16 GB, exceeding the weight memory itself.
This is why reducing n_kv_heads matters for practical deployment. Llama 3 8B uses 8 KV heads instead of 32, cutting the per-token cache size by 4x. At 128,000 tokens, the difference is 16 GB versus 64 GB, which determines whether long-context inference is feasible on a single high-end GPU or requires a multi-GPU setup.
In concurrent serving, KV cache pressure is almost always the binding constraint on batch size, not model weights. Systems like PagedAttention (the foundation of vLLM) manage KV tensors as memory pages that can be shared between sequences with common prefixes or evicted when needed, enabling much higher concurrency without requiring all cache data to remain resident simultaneously.
Building a Transformer Block
Causal Masking and the Decoder Stack
Almost every deployed generative model (GPT-4, Claude, Llama, Mistral) uses a decoder-only transformer. The defining feature is causal masking: during both training and inference, attention scores for positions in the future are set to negative infinity before the softmax, so each token can only attend to tokens that came before it. This lets the model be trained to predict the next token, the same objective used at inference time.
In implementation the causal mask is a lower-triangular matrix of zeros with the strict upper triangle filled with negative infinity. After adding this mask to the raw attention logits, the softmax maps negative infinity to exactly zero probability, making future-position attention numerically equivalent to not attending at all. During training the entire sequence is available at once, so all positions are computed in parallel; the mask is what enforces the sequential dependency constraint without sacrificing the parallelism that makes transformers trainable at scale.
A single transformer layer applies:
- Masked multi-head self-attention (with residual connection)
- Layer normalization
- A position-wise feed-forward network: two linear layers with a non-linear activation between them (often SwiGLU in modern models)
- Another layer normalization
Stacking 32-128 of these blocks produces the depth that allows the model to build progressively more abstract representations.
The feed-forward sublayer is larger than it looks. For d_model = 4,096, the FFN typically projects up to 4 * d_model = 16,384 intermediate dimensions, or larger in SwiGLU variants which use a gated multiplication that effectively requires three weight matrices instead of two. This expansion-contraction structure is where most parameters in a dense transformer reside: roughly two-thirds of total parameters sit in the FFN layers rather than in the attention projections. The FFN is commonly described as where the model stores factual associations compressed during training, while attention handles context routing and integration.
Position Encodings
Self-attention has no built-in notion of order; swapping two tokens produces the same attention scores unless position is explicitly encoded. Early models injected fixed sinusoidal signals. Modern models use Rotary Position Embedding (RoPE), which encodes relative position by rotating query and key vectors before the dot product. Because it encodes relative rather than absolute position, RoPE generalizes better to sequences longer than those seen during training, which is critical for multi-million-token context windows.
RoPE applies a rotation matrix to pairs of dimensions in Q and K, where the rotation angle is a function of both the token's position and the dimension index. When you compute the dot product of a rotated Q against a rotated K, the result depends only on the relative offset between the two positions, not their absolute indices. This relative-position property is what enables length generalization: a model trained on 8,192-token sequences can, with additional fine-tuning or dynamic scaling of the rotation base frequency, attend sensibly across 128,000-token sequences without relearning the positional logic from scratch.
In practice most models still degrade beyond their training length even with RoPE, because the attention score distribution shifts out of distribution as relative offsets grow large. Techniques like YaRN and LongRoPE address this by rescaling the base frequency to spread positional information more evenly across the extended range. Dynamic NTK interpolation is a simpler variant that can be applied at inference time without retraining.
Why Transformers Scale
The reason the transformer architecture has dominated for nearly a decade comes down to parallelism. During training, all tokens in a sequence are processed simultaneously. The attention matrix is computed in one large matrix multiplication. This contrasts sharply with recurrent models, which could not parallelize across the time dimension. Greater parallelism means more data processed per unit of compute, which translates directly into better models.
Kaplan et al. (2020) showed that language model performance scales as a smooth power law with model size, dataset size, and compute budget. The scaling laws held remarkably well for three orders of magnitude and continue to guide model design today, even as the field has refined them with "inference-optimal" considerations: training a smaller model for much longer than the Chinchilla-optimal point produces a model that is cheaper to serve at scale.
The Chinchilla paper (Hoffmann et al., 2022) revised the Kaplan ratios significantly. Compute-optimal training, by their analysis, allocates roughly one training token per parameter, whereas Kaplan had suggested concentrating compute on larger models with relatively fewer tokens. GPT-3 at 175B parameters trained on 300B tokens was therefore substantially undertrained by Chinchilla standards. Models like Llama 2 and Mistral deliberately overtrain relative to Chinchilla by training a 7B or 13B model on 2 trillion or more tokens, accepting higher training compute in exchange for a smaller model that is cheaper to serve at inference time. The decision depends on your deployment volume: for a team serving billions of inferences, shaving latency and cost per forward pass easily justifies significant extra training spend.
The O(n²) attention complexity is the main barrier to scaling in the sequence dimension. Doubling context length quadruples the attention computation and peak activation memory per layer. FlashAttention addresses this not by reducing asymptotic complexity but by reordering computation so that the full score matrix is never materialized in high-bandwidth DRAM: attention is computed in tiles that fit in on-chip SRAM, exchanging some recomputation for a substantial reduction in memory bandwidth. This enables context lengths that would otherwise exhaust available VRAM and can improve wall-clock throughput by 2 to 4x on typical hardware. Sparse attention variants such as sliding-window or dilated attention reduce complexity to O(n log n) or O(n) at the cost of potentially missing dependencies beyond the window size.
Mixture of Experts
Frontier models push scaling further with Mixture of Experts (MoE). Instead of a single large feed-forward layer in each transformer block, an MoE layer contains multiple parallel expert networks and a learned router. For any given token, only two or three experts activate; the rest are skipped. This means a model with 600 billion total parameters might process each token through only 40 billion active parameters, dramatically reducing inference cost while retaining the capacity that comes with a large parameter count.
The router is a small linear layer that maps the current token's hidden state to a logit for each expert. A top-k gating function selects the k highest-scoring experts, applies softmax over only those k logits to produce a blending weight, and combines the expert outputs as a weighted sum. During training, an auxiliary load-balancing loss penalizes routing distributions that concentrate most tokens on a small subset of experts. Without this auxiliary term the router often collapses early in training, sending nearly all tokens to the same one or two experts while the rest receive almost no gradient signal and fail to specialize.
Expert collapse is a real failure mode. When the router routes most tokens to a few popular experts, the underutilized experts develop weak representations because they are rarely updated. The load-balancing auxiliary loss combats this by encouraging a roughly uniform distribution across experts, but the weight on this loss is sensitive to tune: too high and it dominates the task objective, producing evenly loaded experts that are poorly differentiated. In practice, MoE models also use expert capacity limits that hard-cap how many tokens a single expert can receive per batch, dropping excess tokens rather than overloading one expert.
The total versus active parameter distinction is critical for capacity planning. Total parameters determine the minimum VRAM needed to load the model: a 600B parameter model in fp16 requires at minimum 1.2 TB of GPU memory, which today means 16 or more H100-class GPUs just to hold the weights. Active parameters determine throughput: at 40B active parameters per token, each forward pass does roughly the same FLOPs as a 40B dense model, giving quality closer to the larger model at the serving cost of the smaller one. You cannot run the 600B MoE on hardware sized for a 40B dense model, even though inference compute per token is comparable. Memory to hold all weights is the non-negotiable constraint.
The Full Forward Pass
Putting it together, one forward pass through a decoder LLM looks like this:
- Tokenize input → sequence of integer IDs
- Embed each ID → matrix of shape
(seq_len, d_model) - Add position encodings
- Pass through N transformer blocks (attention → FFN, each with residuals and norms)
- Project the final hidden states through a vocabulary-sized linear layer
- Apply softmax → probability distribution over the next token
- Sample or argmax to select a token, append it, repeat
Every parameter in the model lives in the projection matrices inside those transformer blocks. There is nothing else. The emergent behavior that makes these systems remarkable is entirely a consequence of gradient descent optimizing next-token prediction across trillions of examples.
Interview angle
How would you explain attention to someone who has never seen the math?
What they are probing for: conceptual clarity and ability to communicate technical ideas
Each token in the sequence needs to decide which other tokens are relevant to its current meaning. Attention gives every token three things: a query for what it is looking for, a key for what it broadcasts about itself, and a value for what it contributes. To update token A, you compute a similarity score between A's query and every other token's key, normalize those scores into weights with a softmax, then take a weighted average of all value vectors. The result is a new representation for A that blends in context from the most relevant positions, regardless of where in the sequence those positions appear. The key insight is that this happens for all tokens simultaneously in one matrix multiplication.
Why do we divide attention scores by the square root of d_k? What goes wrong without it?
What they are probing for: mathematical intuition and understanding of gradient flow
When two vectors of dimension d_k have independent standard-normal components, their dot product has variance d_k, so the standard deviation grows as the square root of d_k. At d_k = 4,096 without scaling, raw dot products have a standard deviation of 64, which means the largest logit dominates the softmax exponentially and every other attention weight collapses toward zero. The softmax effectively becomes a hard argmax, and gradients through almost every position become zero, stalling learning. Dividing by the square root of d_k normalizes the standard deviation back to 1.0 regardless of head dimension and keeps the softmax in a range where meaningful gradients flow through all positions.
What is the KV cache, what does it cost in memory, and why does it constrain batch size more than model weights in production?
What they are probing for: systems-level knowledge and inference cost awareness
During autoregressive generation, the keys and values for all prior tokens are fixed, so recomputing them on every step wastes compute. The KV cache stores those tensors in GPU memory so that each step only computes the query for the newest token. The cost per token is roughly 2 times layers times KV heads times head dimension times bytes per element. For a 32-layer model with 32 KV heads and head dimension 128 in fp16, that is about 0.5 MB per token, which reaches 16 GB at a 32,000-token context window and exceeds model weight memory. In concurrent serving, because each user request grows its own independent KV cache as the conversation extends, KV cache pressure exhausts available VRAM long before model weights become the bottleneck.
How does Mixture of Experts reduce per-token compute, and why can you not deploy a 600B MoE model on hardware sized for a 40B dense model?
What they are probing for: architecture tradeoffs and capacity planning reasoning
An MoE layer replaces the single FFN in each transformer block with N independent expert networks plus a learned router that selects the top k experts per token. With k = 2 and 64 experts, each token activates roughly 1/32 of the total FFN capacity, keeping FLOPs per token close to a much smaller dense model. The catch is that all expert weights must reside in GPU memory simultaneously because the router decision is only known at runtime and any expert could be selected for any token. Total parameters set the minimum VRAM floor; active parameters set the throughput and latency profile. A 600B MoE needs hardware that can hold 600B parameters even though each token only traverses 40B of them.
What are the memory implications of extending context length, and what fails first as context grows?
What they are probing for: failure mode reasoning and scaling judgment
The KV cache grows linearly with context length, so a 4x context increase requires 4x the KV cache VRAM on top of fixed model weight memory. Attention computation is O(n squared) in sequence length, though FlashAttention makes this feasible by tiling the computation to avoid materializing the full score matrix in high-bandwidth memory. In practice, the first thing to exhaust is GPU memory for the KV cache, especially under concurrent load. Quality also degrades before the hardware limit is hit: models trained on shorter sequences see the attention score distribution shift out of distribution at large relative offsets, so tokens near the far end of the context window receive less reliable attention. Techniques like YaRN rescale RoPE base frequencies to extend the useful range, but long-context degradation remains an active problem.
What is the practical difference between compute-optimal and inference-optimal training, and when does each matter?
What they are probing for: engineering judgment and deployment economics
Compute-optimal training minimizes loss per unit of training compute by balancing model size and token count, roughly one training token per parameter per the Chinchilla findings. Inference-optimal training accepts higher training compute to produce a smaller model that is cheaper to serve at scale. Llama-series models train a 7B or 13B model on two trillion or more tokens, far beyond the Chinchilla-optimal point, because for a team serving that model to many users the ongoing per-inference cost dominates the one-time training spend. The right choice depends on deployment volume: at low volume the training budget dominates and Chinchilla-optimal sizing makes sense; at high volume, a smaller but overtrained model that shaves latency and compute per forward pass is worth significant extra training investment.