Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

Table of Contents
- FP8 format explanation
- NVIDIA Blackwell introduces microscaling formats
- Tensor scaling
- Block scaling
- MXFP8
FP8 format explanation
- FP8 is a lower-precision numerical format used in modern LLM training to balance computational efficiency with numerical stability.
- FP8 training benefits from dedicated FP8 Tensor Cores within the NVIDIA H100 architecture.
- It comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits), typically used for weights and activations, and E5M2 (5 exponent bits, 2 mantissa bits), whose wider dynamic range suits gradients (see the sketch below).
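
The two variants can be inspected directly. A minimal sketch, assuming PyTorch 2.1 or newer (which exposes `torch.float8_e4m3fn` and `torch.float8_e5m2`); it casts a BF16 tensor into both FP8 variants and reports their representable range:

```python
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)

for fp8_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(fp8_dtype)
    # E4M3 tops out at 448, E5M2 at 57344: trading mantissa bits for exponent
    # bits buys a wider dynamic range at the cost of precision.
    print(fp8_dtype, "max representable:", info.max)

    # Cast down, then back up to BF16 to view the quantized values.
    # Out-of-range inputs are not handled here; real recipes scale before casting.
    x_fp8 = x.to(fp8_dtype)
    print(x_fp8.to(torch.bfloat16))
```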
NVIDIA Blackwell introduces microscaling formats
- The latest NVIDIA Blackwell GPU architecture adds support for even lower-precision formats such as FP4 and FP6 alongside enhanced FP8 Tensor Cores.
- These microscaling (MX) formats pair narrow element types with finer-grained, per-block scaling, yielding efficiency gains in compute, memory footprint, and bandwidth.
- Compressing values into so few bits must be done cautiously, since quantization error can degrade LLM training convergence; without scaling, small values are simply lost (see the sketch below).
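
Why the caution is needed: values far below FP8's representable range flush to zero unless a scale factor is applied first. A minimal sketch with hypothetical gradient magnitudes (the 448 constant is the E4M3 maximum):

```python
import torch

grads = torch.tensor([3e-5, 1e-4, 2e-3], dtype=torch.float32)

# Naive cast: the two smallest entries underflow to zero in E4M3.
naive = grads.to(torch.float8_e4m3fn).to(torch.float32)

# Scale so the largest magnitude lands near (but safely below) the E4M3 max of 448,
# cast to FP8, then unscale.
scale = 448.0 / (2 * grads.abs().max())
scaled = (grads * scale).to(torch.float8_e4m3fn).to(torch.float32) / scale

print("naive :", naive)    # the small gradients collapse to zero
print("scaled:", scaled)   # all three entries survive with small rounding error
```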
Tensor scaling
- In tensor scaling, a single scaling factor is calculated and applied uniformly across all elements in a tensor.
- The scaling factor is typically computed with a delayed-scaling algorithm that tracks a history of the tensor's absolute maximum (amax) values, so the scale adjusts reactively as the tensor's dynamic range shifts (see the sketch below).
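
A minimal sketch of per-tensor delayed scaling. Deriving one scale from a rolling amax history follows the spirit of common FP8 recipes, but the class name, window size, and margin below are illustrative assumptions, not a fixed specification:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3

class TensorScaler:
    """One scale per tensor, derived from a rolling history of amax values."""

    def __init__(self, history_len: int = 16):
        self.history_len = history_len
        self.amax_history = []

    def quantize_dequantize(self, tensor: torch.Tensor) -> torch.Tensor:
        # Record the current absolute maximum and keep only the recent history.
        self.amax_history.append(tensor.abs().max())
        self.amax_history = self.amax_history[-self.history_len:]

        # A single scale maps the historical amax inside the FP8 range;
        # the factor-of-two margin leaves headroom between amax updates.
        amax = torch.stack(self.amax_history).max()
        scale = (FP8_MAX / 2) / amax

        # Quantize, then dequantize for inspection (real kernels keep FP8 internally).
        fp8 = (tensor * scale).to(torch.float8_e4m3fn)
        return fp8.to(tensor.dtype) / scale

scaler = TensorScaler()
activations = torch.randn(8, 8) * 0.01      # small-magnitude values
print(scaler.quantize_dequantize(activations))
```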
Block scaling
- Block scaling divides tensors into smaller contiguous blocks, with each block assigned a unique scaling factor.
- This approach, exemplified by MXFP8 in the NVIDIA Blackwell architecture, accommodates magnitude variations within a tensor far better than a single per-tensor scale (see the sketch below).
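
A minimal sketch of block scaling along a tensor's last dimension. The 128-element block size and the helper names here are illustrative choices (larger than the 32-element blocks MXFP8 uses, as noted in the next section):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3

def block_quantize(x: torch.Tensor, block: int = 128):
    """Split each row into fixed-size blocks and give each block its own scale."""
    rows, cols = x.shape
    assert cols % block == 0
    blocks = x.reshape(rows, cols // block, block)

    # Per-block scale from that block's own amax, so an outlier in one block
    # does not crush the precision of every other block.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scales = FP8_MAX / amax

    fp8 = (blocks * scales).to(torch.float8_e4m3fn)
    return fp8, scales

def block_dequantize(fp8: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (fp8.to(torch.float32) / scales).reshape(fp8.shape[0], -1)

# Rows whose magnitudes span several orders of magnitude stress a single scale.
x = torch.randn(4, 256) * torch.logspace(-3, 1, 256)
fp8, scales = block_quantize(x)
print((block_dequantize(fp8, scales) - x).abs().max())   # error stays small relative to each block
```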
MXFP8
- MXFP8 in NVIDIA Blackwell applies block scaling at a fine granularity, assigning a shared power-of-two scale factor to each 32-element block within a tensor.
- Other block-scaled FP8 recipes use larger blocks, such as 1×128 sub-channels or 128×128 tiles; in every case, per-block scales handle magnitude variation within the tensor more efficiently than a single tensor-wide scale (see the sketch below).
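
A minimal sketch of MXFP8-style quantization in plain PyTorch: 32-element blocks, each sharing a power-of-two scale in the spirit of the OCP Microscaling E8M0 scale format. This is an approximation for illustration, not the Blackwell hardware path, and the function names are assumptions:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448

def mxfp8_quantize(x: torch.Tensor, block: int = 32):
    """Quantize to E4M3 elements with one power-of-two scale per 32-element block."""
    blocks = x.reshape(-1, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)

    # Power-of-two scale (mimicking a shared E8M0 exponent), rounded up so the
    # block's largest element still fits within the E4M3 range after division.
    exp = torch.ceil(torch.log2(amax / FP8_MAX))
    scales = torch.exp2(exp)

    fp8 = (blocks / scales).to(torch.float8_e4m3fn)
    return fp8, scales

def mxfp8_dequantize(fp8: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (fp8.to(torch.float32) * scales).reshape(shape)

x = torch.randn(2, 64)
fp8, scales = mxfp8_quantize(x)
print(scales.squeeze())                                      # one power-of-two scale per block
print((mxfp8_dequantize(fp8, scales, x.shape) - x).abs().max())
```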