NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma

Table of Contents
- Introduction to NVIDIA TensorRT-LLM & Google Gemma
- Features of TensorRT-LLM for Gemma
- Real-time Performance with H200 Tensor Core GPUs
- Customizing Gemma with NVIDIA NeMo Framework
1. Introduction to NVIDIA TensorRT-LLM & Google Gemma
NVIDIA has collaborated with Google to optimize the Gemma family of open models using TensorRT-LLM. This collaboration lets users develop with large language models (LLMs) using just a desktop with an NVIDIA RTX GPU. Gemma is built from the same research and technology as the Gemini models and offers improved customization options for Python developers.
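As a rough illustration, the sketch below shows how a Gemma checkpoint might be served through TensorRT-LLM's high-level Python API. The LLM and SamplingParams classes ship with recent TensorRT-LLM releases, but the model identifier is a placeholder and parameter names may differ slightly across versions, so treat this as a sketch rather than a drop-in script.

```python
# Minimal sketch: generating text from a Gemma checkpoint with TensorRT-LLM's
# high-level Python API. Assumes a recent tensorrt_llm release that ships the
# LLM/SamplingParams API; the model ID below is a placeholder.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads a cached) TensorRT engine for the model on the local GPU.
llm = LLM(model="google/gemma-2b")  # placeholder Hugging Face model ID

sampling = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)

outputs = llm.generate(
    ["Explain what TensorRT-LLM does in one sentence."], sampling
)
for out in outputs:
    print(out.outputs[0].text)
```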
2. Features of TensorRT-LLM for Gemma
Three key features of TensorRT-LLM enhance the performance of Gemma models: FP8 quantization, the XQA kernel, and INT4 activation-aware weight quantization (INT4 AWQ). Together they reduce the memory footprint of model parameters, increase throughput, and improve inference latency for the most popular LLMs.
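To make the memory argument concrete, here is a simplified sketch of INT4 group-wise weight-only quantization, the core idea behind INT4 AWQ. This is not TensorRT-LLM's implementation and it omits the activation-aware channel scaling that gives AWQ its name; it only shows how 4-bit weights plus per-group scales shrink the weight footprint roughly 4x relative to FP16.

```python
# Simplified illustration of INT4 weight-only, group-wise quantization.
# Real AWQ additionally rescales salient channels using activation statistics
# before quantizing, and packs two 4-bit values per byte; this sketch skips
# both to keep the memory-saving idea visible.
import numpy as np

def quantize_int4_groupwise(weights: np.ndarray, group_size: int = 128):
    """Quantize a [rows, cols] FP16/FP32 weight matrix to INT4 per group."""
    rows, cols = weights.shape
    assert cols % group_size == 0
    w = weights.reshape(rows, cols // group_size, group_size)
    # One scale per group, chosen so the max magnitude maps to 7 (int4 range).
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # 4-bit values
    return q.reshape(rows, cols), scales.squeeze(-1)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    rows, cols = q.shape
    w = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (w * scales[..., None]).reshape(rows, cols)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s)
# 4 bits per weight vs 16 bits for FP16 => roughly 4x smaller weights,
# plus a small overhead for the per-group scales.
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```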
3. Real-time Performance with H200 Tensor Core GPUs
TensorRT-LLM with NVIDIA H200 Tensor Core GPUs delivers exceptional performance for the Gemma 2B and Gemma 7B models. A single H200 GPU can achieve over 79,000 tokens per second on the Gemma 2B model and almost 19,000 tokens per second on the Gemma 7B model. That level of throughput is enough to serve more than 3,000 concurrent users with real-time latency from a single deployed H200 GPU.
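A quick back-of-envelope calculation, using only the figures quoted above, shows why those aggregate rates translate into a real-time experience per user.

```python
# Back-of-envelope check of the concurrency claim, using the quoted figures.
# Purely illustrative arithmetic.
aggregate_tps_gemma_2b = 79_000   # tokens/second on one H200 (quoted above)
aggregate_tps_gemma_7b = 19_000   # tokens/second on one H200 (quoted above)
concurrent_users = 3_000

per_user_2b = aggregate_tps_gemma_2b / concurrent_users  # ~26 tokens/s/user
per_user_7b = aggregate_tps_gemma_7b / concurrent_users  # ~6 tokens/s/user

# Typical human reading speed is only a few words per second, so even the
# lower per-user rate keeps pace with an interactive chat session.
print(f"Gemma 2B: ~{per_user_2b:.0f} tokens/s per user")
print(f"Gemma 7B: ~{per_user_7b:.0f} tokens/s per user")
```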
4. Customizing Gemma with NVIDIA NeMo Framework
Developers can use the NVIDIA NeMo framework to customize Gemma and deploy it in production environments. The framework supports 3D parallelism for training, combining tensor, pipeline, and data parallelism to adapt Gemma models to specific use cases, as sketched below. Get started today with customizing Gemma using the NeMo framework.
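For intuition, the sketch below shows how the three parallelism dimensions compose into a total GPU count. The field names echo NeMo/Megatron-style configuration options but are illustrative only; consult the NeMo documentation for the exact keys in your release.

```python
# Sketch of how 3D parallelism dimensions compose. The attribute names mirror
# common NeMo/Megatron-style config fields, but treat them as illustrative.
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tensor_model_parallel_size: int    # split individual layers across GPUs
    pipeline_model_parallel_size: int  # split the layer stack into stages
    data_parallel_size: int            # replicate the model over data shards

    def world_size(self) -> int:
        # Total GPUs required is the product of the three dimensions.
        return (self.tensor_model_parallel_size
                * self.pipeline_model_parallel_size
                * self.data_parallel_size)

# Example: fine-tuning on 16 GPUs with TP=4, PP=1, DP=4 (hypothetical layout).
cfg = ParallelConfig(tensor_model_parallel_size=4,
                     pipeline_model_parallel_size=1,
                     data_parallel_size=4)
print(cfg.world_size())  # 16
```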