Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

Table of Contents
- FP8 format explanation
- NVIDIA Blackwell introduces microscaling formats
- Tensor scaling
- Block scaling
- MXFP8
FP8 format explanation
- FP8 is a lower-precision numerical format used in modern LLM training to balance computational efficiency with numerical stability.
- FP8 training benefits from dedicated FP8 Tensor Cores within the NVIDIA H100 architecture.
- It comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits), typically used for weights and activations, and E5M2 (5 exponent bits, 2 mantissa bits), whose wider dynamic range suits gradients (see the sketch below).
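
The two variants can be inspected directly. A minimal sketch, assuming PyTorch 2.1 or newer (which exposes `torch.float8_e4m3fn` and `torch.float8_e5m2`); it casts a BF16 tensor into both FP8 variants and reports their representable range:

```python
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)

for fp8_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(fp8_dtype)
    # E4M3 tops out at 448, E5M2 at 57344: trading mantissa bits for exponent
    # bits buys a wider dynamic range at the cost of precision.
    print(fp8_dtype, "max representable:", info.max)

    # Cast down, then back up to BF16 to view the quantized values.
    # Out-of-range inputs are not handled here; real recipes scale before casting.
    x_fp8 = x.to(fp8_dtype)
    print(x_fp8.to(torch.bfloat16))
```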
NVIDIA Blackwell introduces microscaling formats
- The latest NVIDIA Blackwell GPU architecture adds support for even lower-precision formats such as FP4 and FP6 alongside enhanced FP8 Tensor Cores.
- These microscaling (MX) formats pair narrow element types with finer-grained, per-block scaling, yielding efficiency gains in compute, memory footprint, and bandwidth.
- Compressing values into so few bits must be done cautiously, since quantization error can degrade LLM training convergence; without scaling, small values are simply lost (see the sketch below).
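
Why the caution is needed: values far below FP8's representable range flush to zero unless a scale factor is applied first. A minimal sketch with hypothetical gradient magnitudes (the 448 constant is the E4M3 maximum):

```python
import torch

grads = torch.tensor([3e-5, 1e-4, 2e-3], dtype=torch.float32)

# Naive cast: the two smallest entries underflow to zero in E4M3.
naive = grads.to(torch.float8_e4m3fn).to(torch.float32)

# Scale so the largest magnitude lands near (but safely below) the E4M3 max of 448,
# cast to FP8, then unscale.
scale = 448.0 / (2 * grads.abs().max())
scaled = (grads * scale).to(torch.float8_e4m3fn).to(torch.float32) / scale

print("naive :", naive)    # the small gradients collapse to zero
print("scaled:", scaled)   # all three entries survive with small rounding error
```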
Tensor scaling
- In tensor scaling, a single scaling factor is calculated and applied uniformly across all elements in a tensor.
- The scaling factor is typically computed with a delayed-scaling algorithm that tracks a history of the tensor's absolute maximum (amax) values, so the scale adjusts reactively as the tensor's dynamic range shifts (see the sketch below).
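
A minimal sketch of per-tensor delayed scaling. Deriving one scale from a rolling amax history follows the spirit of common FP8 recipes, but the class name, window size, and margin below are illustrative assumptions, not a fixed specification:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3

class TensorScaler:
    """One scale per tensor, derived from a rolling history of amax values."""

    def __init__(self, history_len: int = 16):
        self.history_len = history_len
        self.amax_history = []

    def quantize_dequantize(self, tensor: torch.Tensor) -> torch.Tensor:
        # Record the current absolute maximum and keep only the recent history.
        self.amax_history.append(tensor.abs().max())
        self.amax_history = self.amax_history[-self.history_len:]

        # A single scale maps the historical amax inside the FP8 range;
        # the factor-of-two margin leaves headroom between amax updates.
        amax = torch.stack(self.amax_history).max()
        scale = (FP8_MAX / 2) / amax

        # Quantize, then dequantize for inspection (real kernels keep FP8 internally).
        fp8 = (tensor * scale).to(torch.float8_e4m3fn)
        return fp8.to(tensor.dtype) / scale

scaler = TensorScaler()
activations = torch.randn(8, 8) * 0.01      # small-magnitude values
print(scaler.quantize_dequantize(activations))
```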
Block scaling
- Block scaling divides tensors into smaller contiguous blocks, with each block assigned a unique scaling factor.
- This approach, exemplified by MXFP8 in the NVIDIA Blackwell architecture, accommodates magnitude variations within a tensor far better than a single per-tensor scale (see the sketch below).
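
A minimal sketch of block scaling along a tensor's last dimension. The 128-element block size and the helper names here are illustrative choices (larger than the 32-element blocks MXFP8 uses, as noted in the next section):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3

def block_quantize(x: torch.Tensor, block: int = 128):
    """Split each row into fixed-size blocks and give each block its own scale."""
    rows, cols = x.shape
    assert cols % block == 0
    blocks = x.reshape(rows, cols // block, block)

    # Per-block scale from that block's own amax, so an outlier in one block
    # does not crush the precision of every other block.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scales = FP8_MAX / amax

    fp8 = (blocks * scales).to(torch.float8_e4m3fn)
    return fp8, scales

def block_dequantize(fp8: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (fp8.to(torch.float32) / scales).reshape(fp8.shape[0], -1)

# Rows whose magnitudes span several orders of magnitude stress a single scale.
x = torch.randn(4, 256) * torch.logspace(-3, 1, 256)
fp8, scales = block_quantize(x)
print((block_dequantize(fp8, scales) - x).abs().max())   # error stays small relative to each block
```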
MXFP8
- MXFP8 in NVIDIA Blackwell applies block scaling at a fine granularity, assigning a shared power-of-two scale factor to each 32-element block within a tensor.
- Other block-scaled FP8 recipes use larger blocks, such as 1×128 sub-channels or 128×128 tiles; in every case, per-block scales handle magnitude variation within the tensor more efficiently than a single tensor-wide scale (see the sketch below).
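
A minimal sketch of MXFP8-style quantization in plain PyTorch: 32-element blocks, each sharing a power-of-two scale in the spirit of the OCP Microscaling E8M0 scale format. This is an approximation for illustration, not the Blackwell hardware path, and the function names are assumptions:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448

def mxfp8_quantize(x: torch.Tensor, block: int = 32):
    """Quantize to E4M3 elements with one power-of-two scale per 32-element block."""
    blocks = x.reshape(-1, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)

    # Power-of-two scale (mimicking a shared E8M0 exponent), rounded up so the
    # block's largest element still fits within the E4M3 range after division.
    exp = torch.ceil(torch.log2(amax / FP8_MAX))
    scales = torch.exp2(exp)

    fp8 = (blocks / scales).to(torch.float8_e4m3fn)
    return fp8, scales

def mxfp8_dequantize(fp8: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (fp8.to(torch.float32) * scales).reshape(shape)

x = torch.randn(2, 64)
fp8, scales = mxfp8_quantize(x)
print(scales.squeeze())                                      # one power-of-two scale per block
print((mxfp8_dequantize(fp8, scales, x.shape) - x).abs().max())
```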