Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

Table of Contents

  1. FP8 format explanation
  2. NVIDIA Blackwell introduces microscaling formats
  3. Tensor scaling
  4. Block scaling
  5. MXFP8

FP8 format explanation

  • FP8 is a lower-precision numerical format employed in modern LLMs to balance computational efficiency with numerical stability.
  • FP8 training benefits from dedicated FP8 Tensor Cores within the NVIDIA H100 architecture.
  • It comes in two variants: E4M3 (1 sign, 4 exponent, 3 mantissa bits), typically used for weights and activations, and E5M2 (1 sign, 5 exponent, 2 mantissa bits), whose wider dynamic range suits gradients; see the sketch after this list.
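For concreteness, recent PyTorch releases (2.1+) expose both FP8 variants as tensor dtypes, `torch.float8_e4m3fn` and `torch.float8_e5m2`. The snippet below is a minimal sketch of casting a tensor to FP8 and back to inspect each format's rounding error; it is not a full training recipe, and hardware FP8 matmuls still require Hopper- or Blackwell-class Tensor Cores.

```python
import torch

# FP8 comes in two flavors: E4M3 (more precision) and E5M2 (more range).
x = torch.randn(4, 4, dtype=torch.float32)

x_e4m3 = x.to(torch.float8_e4m3fn)  # 1 sign, 4 exponent, 3 mantissa bits
x_e5m2 = x.to(torch.float8_e5m2)    # 1 sign, 5 exponent, 2 mantissa bits

# Round-trip back to FP32 to compare the quantization error of each format.
print("E4M3 max error:", (x - x_e4m3.to(torch.float32)).abs().max().item())
print("E5M2 max error:", (x - x_e5m2.to(torch.float32)).abs().max().item())
```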

NVIDIA Blackwell introduces microscaling formats

  • The latest NVIDIA Blackwell GPU architecture adds Tensor Core support for even lower-precision formats such as FP4 and FP6, alongside enhanced FP8 Tensor Cores.
  • These microscaling (MX) formats pair narrow element types with fine-grained, per-block scaling, yielding gains in compute throughput, memory footprint, and bandwidth.
  • Compressing values into fewer bits through quantization must be done cautiously to avoid degrading LLM training convergence; the sketch below illustrates the error a narrow format introduces when no scaling is applied.
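To see why, the sketch below rounds values to the nearest representable FP4 number, assuming the standard E2M1 value grid from the OCP Microscaling specification, with no scale factor applied. Anything much larger than 6 simply cannot be represented, which is exactly what per-block scaling is meant to fix.

```python
import torch

# Representable magnitudes of FP4 (E2M1), per the OCP Microscaling spec.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: torch.Tensor) -> torch.Tensor:
    """Round each element to the nearest representable FP4 magnitude,
    keeping the sign. Deliberately applies no scaling."""
    idx = (x.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return torch.sign(x) * FP4_GRID[idx]

x = torch.tensor([0.07, 0.9, 2.4, 17.0])
print(quantize_fp4(x))      # 17.0 collapses to 6.0: it is out of FP4 range
print(x - quantize_fp4(x))  # the error grows sharply for large unscaled values
```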

Tensor scaling

  • In tensor scaling, a single scaling factor is calculated and applied uniformly across all elements in a tensor.
  • The scaling factor is typically derived from a running history of the tensor's maximum absolute values (amax), so it adapts to shifts in the tensor's dynamic range across training iterations; a minimal sketch follows this list.
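A minimal sketch of this per-tensor ("delayed") scaling idea is below. The helper name `update_scale` and the history length are illustrative, not a specific library's API; the point is that one scale, chosen from recently observed max-abs values, maps the whole tensor into FP8's representable range.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def update_scale(amax_history: list[float], current_amax: float,
                 history_len: int = 16) -> float:
    """Per-tensor scaling: one factor for the whole tensor, derived from
    a running window of observed max-abs (amax) values."""
    amax_history.append(current_amax)
    del amax_history[:-history_len]        # keep only the recent window
    return E4M3_MAX / max(amax_history)    # map the largest seen value to FP8 max

amax_history: list[float] = []
x = torch.randn(1024, 1024) * 3.0
scale = update_scale(amax_history, x.abs().max().item())

x_fp8 = (x * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
x_deq = x_fp8.to(torch.float32) / scale    # dequantize with the same scale
```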

Block scaling

  • Block scaling divides tensors into smaller contiguous blocks, with each block assigned a unique scaling factor.
  • This approach, exemplified by MXFP8 in the NVIDIA Blackwell architecture, accommodates magnitude variations within a tensor far better than a single per-tensor scale; see the sketch after this list.
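A rough sketch of block scaling for a 2D weight tensor follows, using an illustrative 128×128 tile size (any block shape that evenly divides the tensor works the same way). Each tile gets its own scale, so a few large outliers no longer force every other value onto a coarse grid.

```python
import torch

E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude

def block_scales(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    """One scale per (block x block) tile, chosen so each tile's max |value|
    maps to the FP8 E4M3 maximum. Assumes both dims are multiples of `block`."""
    rows, cols = w.shape
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3))            # per-tile max magnitude
    return E4M3_MAX / amax.clamp(min=1e-12)        # shape: (rows/block, cols/block)

w = torch.randn(256, 256)
scales = block_scales(w)                           # a 2x2 grid of per-tile scales

# Expand the per-tile scales back to element resolution and quantize.
s = scales.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
w_fp8 = (w * s).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
w_deq = w_fp8.to(torch.float32) / s
```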

MXFP8

  • MXFP8 on NVIDIA Blackwell applies block scaling at fine granularity: each block of 32 contiguous elements shares a single 8-bit power-of-two (E8M0) scale factor, with the elements themselves stored in FP8.
  • Blackwell-era libraries also support coarser block-scaled FP8 layouts, such as 1×128 vector scaling or 128×128 tile scaling, so the scaling granularity can be matched to how much the magnitudes vary within the tensor; the sketch below shows the MX-style 32-element case.
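A simplified sketch of MX-style quantization along the last dimension is below, assuming the OCP MX convention of 32-element blocks with power-of-two (E8M0-style) shared scales; real hardware additionally has specific rounding and scale-encoding rules this sketch ignores.

```python
import torch

E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
BLOCK = 32         # MX block size: 32 contiguous elements share one scale

def mxfp8_quantize(x: torch.Tensor):
    """Quantize the last dim in 32-element blocks, each with a shared
    power-of-two scale. Assumes the last dim is a multiple of 32."""
    blocks = x.reshape(*x.shape[:-1], -1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Power-of-two scale so each block's max value fits under E4M3_MAX.
    scale = torch.exp2(torch.floor(torch.log2(E4M3_MAX / amax)))
    q = (blocks * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

x = torch.randn(4, 128)
q, scale = mxfp8_quantize(x)
x_deq = (q.to(torch.float32) / scale).reshape(x.shape)  # dequantized view
```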