NVIDIA TensorRT 10.0 Upgrades Usability, Performance, and AI Model Support


Table of Contents

  1. Feature Upgrades
  2. Weight Streaming
  3. NVIDIA TensorRT Model Optimizer 0.11
  4. Expanded Support for AI Models
  5. Summary

Feature Upgrades

  • TensorRT 10.0 introduces INT4 Weight-Only Quantization (WoQ) with block quantization and improved memory allocation options.
  • Weight-stripped engines can be built without weights and refitted at runtime with weights from the ONNX model, yielding a much smaller plan file without rebuilding the engine.
  • Weight streaming option enables streaming network weights from host memory to device memory during execution.
  • The TensorRT Model Optimizer exposes Python APIs for model optimizations that accelerate inference.
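To illustrate the block quantization idea behind INT4 WoQ, the pure-Python sketch below (not TensorRT code; the function names are invented for illustration) quantizes each block of weights to the INT4 range [-8, 7] with one shared scale per block:

```python
# Conceptual sketch of INT4 block-wise weight-only quantization.
# Illustration only: TensorRT performs this internally on engine weights.

def quantize_int4_blockwise(weights, block_size=4):
    """Quantize a flat list of float weights to INT4, one scale per block."""
    q, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Map the largest magnitude in the block onto the INT4 range [-8, 7].
        scale = max(abs(w) for w in block) / 7 or 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in block)
    return q, scales

def dequantize_int4_blockwise(q, scales, block_size=4):
    """Reconstruct approximate float weights from INT4 values and block scales."""
    return [q[i] * scales[i // block_size] for i in range(len(q))]

weights = [0.12, -0.50, 0.33, 0.07, 1.20, -0.90, 0.45, 0.02]
q, scales = quantize_int4_blockwise(weights)
approx = dequantize_int4_blockwise(q, scales)
```

Per-block scales are what "block quantization" buys: a single outlier weight only degrades precision within its own block rather than across the whole tensor.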

Weight Streaming

  • TensorRT can stream network weights from host to device memory during execution instead of loading them all into device memory at engine load time, reducing the device memory footprint and allowing models larger than available GPU memory to run.
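The streaming idea can be sketched in plain Python as a fixed device-memory budget that per-layer weights are copied into on demand, with least-recently-used layers evicted when the budget is exceeded. This is only a conceptual model with invented names, not TensorRT's actual mechanism:

```python
from collections import OrderedDict

# Conceptual sketch of weight streaming: layer weights stay in host memory
# and are "copied" into a fixed device budget only when a layer executes.
# Illustration only; TensorRT manages the real budget internally.

class StreamingWeightCache:
    def __init__(self, device_budget_bytes):
        self.budget = device_budget_bytes
        self.device = OrderedDict()  # layer name -> size in bytes, LRU order

    def fetch(self, name, size, host_weights):
        assert name in host_weights, "weights must exist in host memory"
        if name in self.device:              # already resident: mark recently used
            self.device.move_to_end(name)
            return
        while sum(self.device.values()) + size > self.budget:
            self.device.popitem(last=False)  # evict least recently used layer
        self.device[name] = size             # simulate host -> device copy

host = {"layer0": ..., "layer1": ..., "layer2": ...}
sizes = {"layer0": 40, "layer1": 30, "layer2": 50}
cache = StreamingWeightCache(device_budget_bytes=80)
for name in ["layer0", "layer1", "layer2", "layer0"]:  # execution order
    cache.fetch(name, sizes[name], host)
```

The total of the three layers (120 bytes) never fits at once, yet every layer can execute inside the 80-byte budget, which is the point of the feature.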

NVIDIA TensorRT Model Optimizer 0.11

  • A comprehensive library of post-training and training-in-the-loop model optimizations is included.
  • Model Optimizer enables quantization, sparsity, and distillation to reduce model complexity and optimize inference speed.
  • In MLPerf Inference v4.0, post-training sparsity on top of FP8 quantization delivered a 1.3x speedup with TensorRT-LLM running Llama 2 70B on NVIDIA H100.
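NVIDIA GPUs accelerate 2:4 structured sparsity, in which two of every four consecutive weights are pruned to zero. As a conceptual, pure-Python sketch of that pattern (illustration only; Model Optimizer applies sparsity with calibration, not this hypothetical helper):

```python
# Conceptual sketch of 2:4 structured sparsity: in every group of four
# weights, the two smallest-magnitude values are pruned to zero.

def prune_2_4(weights):
    pruned = []
    for start in range(0, len(weights), 4):
        group = list(weights[start:start + 4])
        # Indices of the two smallest-magnitude weights in this group.
        drop = sorted(range(len(group)), key=lambda i: abs(group[i]))[:2]
        for i in drop:
            group[i] = 0.0
        pruned.extend(group)
    return pruned

w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, -0.2, 0.8]
sparse = prune_2_4(w)
# Exactly half the weights become zero, in a hardware-friendly pattern.
```

Because the zeros fall in a fixed 2-of-4 pattern, the GPU's sparse tensor cores can skip them, which is where the speedup on top of FP8 comes from.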

Expanded Support for AI Models

  • TensorRT-LLM, an open-source library for optimizing LLM inference, now supports weight-stripped engines added in TensorRT 10.0.
  • Enhanced support for various AI models and optimizations adds versatility to the TensorRT ecosystem.
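A minimal sketch of why weight-stripped plans are smaller: the serialized plan keeps only the network structure, and weights are matched back in by name at load time. The dict-based `strip_weights`/`refit` helpers below are hypothetical stand-ins for TensorRT's refit workflow, not its API:

```python
# Conceptual sketch of a weight-stripped engine: the serialized plan keeps
# only the network structure; weights are refitted by name at load time.
# Illustration only; TensorRT uses a Refitter on real engines.

def strip_weights(engine):
    """Split an engine into a small plan (structure only) and its weights."""
    plan = {name: None for name in engine}  # weights removed from the plan
    weights = dict(engine)                  # weights stay with the ONNX model
    return plan, weights

def refit(plan, onnx_weights):
    """Fill a stripped plan with weights by name, without rebuilding it."""
    assert set(plan) == set(onnx_weights), "every layer needs matching weights"
    return {name: onnx_weights[name] for name in plan}

engine = {"conv1": [0.1, 0.2], "fc1": [0.3, 0.4, 0.5]}
plan, weights = strip_weights(engine)
refitted = refit(plan, weights)
```

Since LLM weights dominate engine size, shipping the structure-only plan and refitting from the model's own weights is what makes this attractive for TensorRT-LLM deployments.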

Summary

  • NVIDIA TensorRT 10.0 brings significant upgrades, including INT4 quantization, weight streaming, Model Optimizer, and expanded AI model support.
  • These features enhance performance, usability, and model optimization capabilities within the TensorRT framework.