NVIDIA TensorRT 10.0 Upgrades Usability, Performance, and AI Model Support


Table of Contents

  1. Feature Upgrades
  2. Weight Streaming
  3. NVIDIA TensorRT Model Optimizer 0.11
  4. Expanded Support for AI Models
  5. Summary

Feature Upgrades

  • TensorRT 10.0 introduces INT4 Weight-Only Quantization (WoQ) with block quantization and improved memory allocation options.
  • Weight-stripped engines can be built without weights and refitted at runtime with weights from the ONNX model, yielding a much smaller plan file without rebuilding the engine.
  • Weight streaming option enables streaming network weights from host memory to device memory during execution.
  • The TensorRT Model Optimizer exposes Python APIs for model optimizations that accelerate inference.
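To illustrate the block quantization idea behind INT4 WoQ, the pure-Python sketch below (not TensorRT code; the function names are invented for illustration) quantizes each block of weights to the INT4 range [-8, 7] with one shared scale per block:

```python
# Conceptual sketch of INT4 block-wise weight-only quantization.
# Illustration only: TensorRT performs this internally on engine weights.

def quantize_int4_blockwise(weights, block_size=4):
    """Quantize a flat list of float weights to INT4, one scale per block."""
    q, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Map the largest magnitude in the block onto the INT4 range [-8, 7].
        scale = max(abs(w) for w in block) / 7 or 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in block)
    return q, scales

def dequantize_int4_blockwise(q, scales, block_size=4):
    """Reconstruct approximate float weights from INT4 values and block scales."""
    return [q[i] * scales[i // block_size] for i in range(len(q))]

weights = [0.12, -0.50, 0.33, 0.07, 1.20, -0.90, 0.45, 0.02]
q, scales = quantize_int4_blockwise(weights)
approx = dequantize_int4_blockwise(q, scales)
```

Per-block scales are what "block quantization" buys: a single outlier weight only degrades precision within its own block rather than across the whole tensor.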

Weight Streaming

  • TensorRT can stream network weights from host to device memory during execution instead of loading them all into device memory at engine load time, reducing the device memory footprint and allowing models larger than available GPU memory to run.
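The streaming idea can be sketched in plain Python as a fixed device-memory budget that per-layer weights are copied into on demand, with least-recently-used layers evicted when the budget is exceeded. This is only a conceptual model with invented names, not TensorRT's actual mechanism:

```python
from collections import OrderedDict

# Conceptual sketch of weight streaming: layer weights stay in host memory
# and are "copied" into a fixed device budget only when a layer executes.
# Illustration only; TensorRT manages the real budget internally.

class StreamingWeightCache:
    def __init__(self, device_budget_bytes):
        self.budget = device_budget_bytes
        self.device = OrderedDict()  # layer name -> size in bytes, LRU order

    def fetch(self, name, size, host_weights):
        assert name in host_weights, "weights must exist in host memory"
        if name in self.device:              # already resident: mark recently used
            self.device.move_to_end(name)
            return
        while sum(self.device.values()) + size > self.budget:
            self.device.popitem(last=False)  # evict least recently used layer
        self.device[name] = size             # simulate host -> device copy

host = {"layer0": ..., "layer1": ..., "layer2": ...}
sizes = {"layer0": 40, "layer1": 30, "layer2": 50}
cache = StreamingWeightCache(device_budget_bytes=80)
for name in ["layer0", "layer1", "layer2", "layer0"]:  # execution order
    cache.fetch(name, sizes[name], host)
```

The total of the three layers (120 bytes) never fits at once, yet every layer can execute inside the 80-byte budget, which is the point of the feature.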

NVIDIA TensorRT Model Optimizer 0.11

  • A comprehensive library of post-training and training-in-the-loop model optimizations is included.
  • Model Optimizer enables quantization, sparsity, and distillation to reduce model complexity and optimize inference speed.
  • In MLPerf Inference v4.0, post-training sparsity on top of FP8 quantization delivered a 1.3x speedup with TensorRT-LLM running Llama 2 70B on NVIDIA H100.
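NVIDIA GPUs accelerate 2:4 structured sparsity, in which two of every four consecutive weights are pruned to zero. As a conceptual, pure-Python sketch of that pattern (illustration only; Model Optimizer applies sparsity with calibration, not this hypothetical helper):

```python
# Conceptual sketch of 2:4 structured sparsity: in every group of four
# weights, the two smallest-magnitude values are pruned to zero.

def prune_2_4(weights):
    pruned = []
    for start in range(0, len(weights), 4):
        group = list(weights[start:start + 4])
        # Indices of the two smallest-magnitude weights in this group.
        drop = sorted(range(len(group)), key=lambda i: abs(group[i]))[:2]
        for i in drop:
            group[i] = 0.0
        pruned.extend(group)
    return pruned

w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, -0.2, 0.8]
sparse = prune_2_4(w)
# Exactly half the weights become zero, in a hardware-friendly pattern.
```

Because the zeros fall in a fixed 2-of-4 pattern, the GPU's sparse tensor cores can skip them, which is where the speedup on top of FP8 comes from.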

Expanded Support for AI Models

  • TensorRT-LLM, an open-source library for optimizing LLM inference, now supports weight-stripped engines added in TensorRT 10.0.
  • Enhanced support for various AI models and optimizations adds versatility to the TensorRT ecosystem.
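A minimal sketch of why weight-stripped plans are smaller: the serialized plan keeps only the network structure, and weights are matched back in by name at load time. The dict-based `strip_weights`/`refit` helpers below are hypothetical stand-ins for TensorRT's refit workflow, not its API:

```python
# Conceptual sketch of a weight-stripped engine: the serialized plan keeps
# only the network structure; weights are refitted by name at load time.
# Illustration only; TensorRT uses a Refitter on real engines.

def strip_weights(engine):
    """Split an engine into a small plan (structure only) and its weights."""
    plan = {name: None for name in engine}  # weights removed from the plan
    weights = dict(engine)                  # weights stay with the ONNX model
    return plan, weights

def refit(plan, onnx_weights):
    """Fill a stripped plan with weights by name, without rebuilding it."""
    assert set(plan) == set(onnx_weights), "every layer needs matching weights"
    return {name: onnx_weights[name] for name in plan}

engine = {"conv1": [0.1, 0.2], "fc1": [0.3, 0.4, 0.5]}
plan, weights = strip_weights(engine)
refitted = refit(plan, weights)
```

Since LLM weights dominate engine size, shipping the structure-only plan and refitting from the model's own weights is what makes this attractive for TensorRT-LLM deployments.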

Summary

  • NVIDIA TensorRT 10.0 brings significant upgrades, including INT4 quantization, weight streaming, Model Optimizer, and expanded AI model support.
  • These features enhance performance, usability, and model optimization capabilities within the TensorRT framework.