Rasul Alakbarli — 29.12.2025
This tutorial contains my distilled notes from Hugging Face’s Ultrascale Playbook. While I will try to capture the core technical logic of the book, I still highly recommend reading the original, since it covers many details that these notes leave out.
In these notes we will start with some important refreshers, then dive into single-GPU optimization, and only after that move on to multi-GPU training and parallelization methods. We will discuss five main parallelization methods: Data Parallelism (ZeRO-1, ZeRO-2, ZeRO-3), Tensor Parallelism, Context Parallelism, Pipeline Parallelism, and finally Expert Parallelism. Now, let’s begin!
In a GPU (and computer hardware in general), we don't store a number like “12.5” as a single value. We store it as a binary encoding composed of three parts: a sign, an exponent, and a mantissa.
$$ \text{Value} = (-1)^{\text{Sign}} \times (1 + \text{Mantissa}) \times 2^{(\text{Exponent})} $$
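To make this concrete, here is a small Python sketch (my own, not from the playbook) that pulls the three fields out of the FP32 encoding of 12.5. Note that hardware stores the exponent with a bias (127 for FP32), which the formula above absorbs into the Exponent term.

```python
import struct

# Reinterpret the FP32 bit pattern of 12.5 as a 32-bit unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 12.5))[0]

sign     = bits >> 31            # 1 bit
exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
mantissa = bits & 0x7FFFFF       # 23 bits, the fractional part of the mantissa

# Plug the fields back into the formula above.
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(sign, exponent - 127, mantissa / 2**23)  # 0 3 0.5625
print(value)                                   # 12.5
```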
Choosing a format is a trade-off between speed, memory, and numerical stability. Here are some standard formats used in LLM training:
| Format | Mantissa | Exponent | Max Value | Typical Use Case |
|---|---|---|---|---|
| FP32 | 23 bits | 8 bits | $3.4 \times 10^{38}$ | Master Weights, Optimizer States |
| BF16 | 7 bits | 8 bits | $3.4 \times 10^{38}$ | LLM Training Standard |
| FP16 | 10 bits | 5 bits | $65,504$ | Inference, Older Training |
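You can query these limits directly; here is a short sketch (assuming PyTorch is installed) using `torch.finfo`. The `eps` column makes the trade-off visible: BF16 keeps FP32's dynamic range but gives up mantissa bits, while FP16 keeps more precision but overflows past 65,504.

```python
import torch

# Query the numerical limits of the three formats from the table above.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")

# torch.float32   bits=32  max=3.403e+38  eps=1.192e-07
# torch.bfloat16  bits=16  max=3.390e+38  eps=7.812e-03
# torch.float16   bits=16  max=6.550e+04  eps=9.766e-04
```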
To understand why GPU training is often limited by data movement rather than raw math speed, we must look at the GPU memory hierarchy.
GPUs utilize several different memory types, each with a specific trade-off between capacity and speed. It is helpful to visualize this as a pyramid: the closer you get to the "brain" (the tensor cores) where all the calculations happen, the faster the memory becomes, but the less space you have.
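As a rough illustration of why this matters (my own sketch, assuming a CUDA GPU and PyTorch), compare a memory-bound elementwise addition with a compute-bound matrix multiplication on the same tensors: the addition performs only one operation per element but still has to move every byte through VRAM, so its runtime is set by memory bandwidth rather than by the tensor cores.

```python
import torch

a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)

def timed(fn, iters=20):
    # Warm up, then time with CUDA events so we measure GPU time, not Python overhead.
    for _ in range(3):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

add_ms = timed(lambda: a + b)  # memory-bound: ~1 FLOP per element, 3 tensors moved
mm_ms  = timed(lambda: a @ b)  # compute-bound: ~2 * 8192 FLOPs per element moved
print(f"elementwise add: {add_ms:.2f} ms | matmul: {mm_ms:.2f} ms")
```

The matmul performs roughly 16,000 times more arithmetic while touching the same three tensors, yet on most GPUs it is nowhere near 16,000 times slower, because the addition spends most of its time waiting on VRAM rather than computing.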
During training, data is streamed through this hierarchy. The efficiency of our training often depends on how effectively we can keep the tensor cores fed without waiting for data to arrive from the slower VRAM: