Rasul Alakbarli — 29.12.2025
This tutorial contains my distilled notes from Hugging Face’s Ultrascale Playbook. While I will try to capture the core technical logic of the book, I still highly recommend reading the original, since it covers many details that these notes leave out.
In these notes we will start with some important refreshers, then dive into single-GPU optimization, and only after that move on to multi-GPU training and parallelization methods. We will discuss five main parallelization methods: Data Parallelism (ZeRO-1, ZeRO-2, ZeRO-3), Tensor Parallelism, Context Parallelism, Pipeline Parallelism, and finally Expert Parallelism. Now, let’s begin!
In a GPU (and computer hardware in general), we don't store a number like “12.5” as a single value. We store it as a binary encoding composed of three parts: a sign, an exponent, and a mantissa.
$$ \text{Value} = (-1)^{\text{Sign}} \times (1 + \text{Mantissa}) \times 2^{(\text{Exponent})} $$
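To make this concrete, here is a small Python sketch (my own, not from the playbook) that pulls the three fields out of the FP32 encoding of 12.5. Note that hardware stores the exponent with a bias (127 for FP32), which the formula above absorbs into the Exponent term.

```python
import struct

# Reinterpret the FP32 bit pattern of 12.5 as a 32-bit unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 12.5))[0]

sign     = bits >> 31            # 1 bit
exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
mantissa = bits & 0x7FFFFF       # 23 bits, the fractional part of the mantissa

# Plug the fields back into the formula above.
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(sign, exponent - 127, mantissa / 2**23)  # 0 3 0.5625
print(value)                                   # 12.5
```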
Choosing a format is a trade-off between speed, memory, and numerical stability. Here are some standard formats used in LLM training:
| Format | Mantissa | Exponent | Max Value | Typical Use Case |
|---|---|---|---|---|
| FP32 | 23 bits | 8 bits | $3.4 \times 10^{38}$ | Master Weights, Optimizer States |
| BF16 | 7 bits | 8 bits | $3.4 \times 10^{38}$ | LLM Training Standard |
| FP16 | 10 bits | 5 bits | $65,504$ | Inference, Older Training |
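You can query these limits directly; here is a short sketch (assuming PyTorch is installed) using `torch.finfo`. The `eps` column makes the trade-off visible: BF16 keeps FP32's dynamic range but gives up mantissa bits, while FP16 keeps more precision but overflows past 65,504.

```python
import torch

# Query the numerical limits of the three formats from the table above.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")

# torch.float32   bits=32  max=3.403e+38  eps=1.192e-07
# torch.bfloat16  bits=16  max=3.390e+38  eps=7.812e-03
# torch.float16   bits=16  max=6.550e+04  eps=9.766e-04
```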
To understand why GPU training is often limited by data movement rather than raw math speed, we must look at the GPU memory hierarchy.
GPUs utilize several different memory types, each with a specific trade-off between capacity and speed. It is helpful to visualize this as a pyramid: the closer you get to the "brain" (the tensor cores) where all the calculations happen, the faster the memory becomes, but the less space you have.
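As a rough illustration of why this matters (my own sketch, assuming a CUDA GPU and PyTorch), compare a memory-bound elementwise addition with a compute-bound matrix multiplication on the same tensors: the addition performs only one operation per element but still has to move every byte through VRAM, so its runtime is set by memory bandwidth rather than by the tensor cores.

```python
import torch

a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)

def timed(fn, iters=20):
    # Warm up, then time with CUDA events so we measure GPU time, not Python overhead.
    for _ in range(3):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

add_ms = timed(lambda: a + b)  # memory-bound: ~1 FLOP per element, 3 tensors moved
mm_ms  = timed(lambda: a @ b)  # compute-bound: ~2 * 8192 FLOPs per element moved
print(f"elementwise add: {add_ms:.2f} ms | matmul: {mm_ms:.2f} ms")
```

The matmul performs roughly 16,000 times more arithmetic while touching the same three tensors, yet on most GPUs it is nowhere near 16,000 times slower, because the addition spends most of its time waiting on VRAM rather than computing.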
During training, data is streamed through this hierarchy. The efficiency of our training often depends on how effectively we can keep the tensor cores fed without waiting for data to arrive from the slower VRAM: