NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma

Table of Contents
- Introduction to NVIDIA TensorRT-LLM & Google Gemma
- Features of TensorRT-LLM for Gemma
- Real-time Performance with H200 Tensor Core GPUs
- Customizing Gemma with NVIDIA NeMo Framework
1. Introduction to NVIDIA TensorRT-LLM & Google Gemma
NVIDIA has collaborated with Google to optimize the Gemma family of open models using TensorRT-LLM. This collaboration lets users develop with large language models (LLMs) using just a desktop with an NVIDIA RTX GPU. Gemma is built from the same research and technology as the Gemini models and offers improved customization options for Python developers.
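As a rough illustration, the sketch below shows how a Gemma checkpoint might be served through TensorRT-LLM's high-level Python API. The LLM and SamplingParams classes ship with recent TensorRT-LLM releases, but the model identifier is a placeholder and parameter names may differ slightly across versions, so treat this as a sketch rather than a drop-in script.

```python
# Minimal sketch: generating text from a Gemma checkpoint with TensorRT-LLM's
# high-level Python API. Assumes a recent tensorrt_llm release that ships the
# LLM/SamplingParams API; the model ID below is a placeholder.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads a cached) TensorRT engine for the model on the local GPU.
llm = LLM(model="google/gemma-2b")  # placeholder Hugging Face model ID

sampling = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)

outputs = llm.generate(
    ["Explain what TensorRT-LLM does in one sentence."], sampling
)
for out in outputs:
    print(out.outputs[0].text)
```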
2. Features of TensorRT-LLM for Gemma
Three key features of TensorRT-LLM enhance the performance of Gemma models: FP8 quantization, the XQA kernel, and INT4 activation-aware weight quantization (INT4 AWQ). Together they reduce the memory footprint of model parameters, increase throughput, and improve inference latency for the most popular LLMs.
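To make the memory argument concrete, here is a simplified sketch of INT4 group-wise weight-only quantization, the core idea behind INT4 AWQ. This is not TensorRT-LLM's implementation and it omits the activation-aware channel scaling that gives AWQ its name; it only shows how 4-bit weights plus per-group scales shrink the weight footprint roughly 4x relative to FP16.

```python
# Simplified illustration of INT4 weight-only, group-wise quantization.
# Real AWQ additionally rescales salient channels using activation statistics
# before quantizing, and packs two 4-bit values per byte; this sketch skips
# both to keep the memory-saving idea visible.
import numpy as np

def quantize_int4_groupwise(weights: np.ndarray, group_size: int = 128):
    """Quantize a [rows, cols] FP16/FP32 weight matrix to INT4 per group."""
    rows, cols = weights.shape
    assert cols % group_size == 0
    w = weights.reshape(rows, cols // group_size, group_size)
    # One scale per group, chosen so the max magnitude maps to 7 (int4 range).
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # 4-bit values
    return q.reshape(rows, cols), scales.squeeze(-1)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    rows, cols = q.shape
    w = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (w * scales[..., None]).reshape(rows, cols)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s)
# 4 bits per weight vs 16 bits for FP16 => roughly 4x smaller weights,
# plus a small overhead for the per-group scales.
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```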
3. Real-time Performance with H200 Tensor Core GPUs
TensorRT-LLM with NVIDIA H200 Tensor Core GPUs delivers exceptional performance for the Gemma 2B and Gemma 7B models. A single H200 GPU can achieve over 79,000 tokens per second on the Gemma 2B model and almost 19,000 tokens per second on the Gemma 7B model. That level of throughput is enough to serve more than 3,000 concurrent users with real-time latency from a single deployed H200 GPU.
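A quick back-of-envelope calculation, using only the figures quoted above, shows why those aggregate rates translate into a real-time experience per user.

```python
# Back-of-envelope check of the concurrency claim, using the quoted figures.
# Purely illustrative arithmetic.
aggregate_tps_gemma_2b = 79_000   # tokens/second on one H200 (quoted above)
aggregate_tps_gemma_7b = 19_000   # tokens/second on one H200 (quoted above)
concurrent_users = 3_000

per_user_2b = aggregate_tps_gemma_2b / concurrent_users  # ~26 tokens/s/user
per_user_7b = aggregate_tps_gemma_7b / concurrent_users  # ~6 tokens/s/user

# Typical human reading speed is only a few words per second, so even the
# lower per-user rate keeps pace with an interactive chat session.
print(f"Gemma 2B: ~{per_user_2b:.0f} tokens/s per user")
print(f"Gemma 7B: ~{per_user_7b:.0f} tokens/s per user")
```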
4. Customizing Gemma with NVIDIA NeMo Framework
Developers can use the NVIDIA NeMo framework to customize Gemma and deploy it in production environments. The framework supports 3D parallelism for training, combining tensor, pipeline, and data parallelism to adapt Gemma models to specific use cases, as sketched below. Get started today with customizing Gemma using the NeMo framework.
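For intuition, the sketch below shows how the three parallelism dimensions compose into a total GPU count. The field names echo NeMo/Megatron-style configuration options but are illustrative only; consult the NeMo documentation for the exact keys in your release.

```python
# Sketch of how 3D parallelism dimensions compose. The attribute names mirror
# common NeMo/Megatron-style config fields, but treat them as illustrative.
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tensor_model_parallel_size: int    # split individual layers across GPUs
    pipeline_model_parallel_size: int  # split the layer stack into stages
    data_parallel_size: int            # replicate the model over data shards

    def world_size(self) -> int:
        # Total GPUs required is the product of the three dimensions.
        return (self.tensor_model_parallel_size
                * self.pipeline_model_parallel_size
                * self.data_parallel_size)

# Example: fine-tuning on 16 GPUs with TP=4, PP=1, DP=4 (hypothetical layout).
cfg = ParallelConfig(tensor_model_parallel_size=4,
                     pipeline_model_parallel_size=1,
                     data_parallel_size=4)
print(cfg.world_size())  # 16
```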