
Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton


In this tutorial, we will learn how to deploy an AI coding assistant using NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server. AI coding assistants, also known as code LLMs, can generate code, fill in missing code, add documentation, and offer problem-solving tips. We will use StarCoder, a code LLM trained on more than 80 programming languages.

To deploy our AI coding assistant, we need basic knowledge of deep learning inference and LLMs, access to Hugging Face, and familiarity with the Transformers library. We will also use the TensorRT-LLM optimization library and NVIDIA Triton with the TensorRT-LLM backend.

The first step is to clone and build the TensorRT-LLM library. Next, we download the StarCoder model weights from Hugging Face, convert them into the TensorRT-LLM checkpoint format, and build an optimized TensorRT engine. We then serve that engine with the Triton Inference Server to create a production-ready deployment of our LLM. The Triton Inference Server backend for TensorRT-LLM leverages the C++ runtime for fast inference execution and includes techniques such as in-flight batching and paged KV caching.
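To make the download step concrete, here is a minimal Python sketch using the huggingface_hub library. The destination directory is an arbitrary choice, and because the bigcode/starcoder repository is gated, it assumes you have already accepted the model license on Hugging Face and have an access token available in the HF_TOKEN environment variable.

```python
# Minimal sketch: fetch the StarCoder checkpoint from Hugging Face.
# Assumes the model license has been accepted on the Hub and HF_TOKEN
# holds a valid access token; the local directory is a hypothetical choice.
import os
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="bigcode/starcoder",          # code LLM trained on 80+ languages
    local_dir="./starcoder_hf",           # hypothetical destination directory
    token=os.environ.get("HF_TOKEN"),     # required because the repo is gated
)
print(f"StarCoder checkpoint downloaded to {model_dir}")
```

From this checkpoint, the TensorRT-LLM conversion and build scripts produce the optimized engine that Triton will serve.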

To set up the Triton Inference Server, we create a model repository that contains the TensorRT-LLM engine along with preprocessing and postprocessing models for tokenization and detokenization. We also configure settings such as the maximum batch size and the fraction of GPU memory reserved for the KV cache so inference runs efficiently. Finally, we launch the Triton server with the TensorRT-LLM backend.
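The sketch below shows one way to assemble such a model repository: copy the backend's configuration templates and substitute concrete values. The paths and placeholder names here are illustrative assumptions, so adapt them to the templates actually shipped with the tensorrtllm_backend repository.

```python
# Minimal sketch: assemble a Triton model repository from config templates.
# All paths and placeholder names below are assumptions for illustration;
# check the templates in your tensorrtllm_backend checkout for the real ones.
import shutil
from pathlib import Path

TEMPLATES = Path("tensorrtllm_backend/all_models/inflight_batcher_llm")  # assumed template dir
REPO = Path("triton_model_repo")                                         # target model repository

# Values to drop into the config templates (hypothetical placeholder names).
SETTINGS = {
    "${tokenizer_dir}": "/models/starcoder_hf",    # tokenizer for pre/postprocessing
    "${engine_dir}": "/models/starcoder_engine",   # compiled TensorRT-LLM engine
    "${max_batch_size}": "8",                      # upper bound on batched requests
    "${kv_cache_free_gpu_mem_fraction}": "0.9",    # GPU memory budget for the KV cache
}

shutil.copytree(TEMPLATES, REPO, dirs_exist_ok=True)

# Replace every placeholder in every model's config.pbtxt.
for config in REPO.glob("*/config.pbtxt"):
    text = config.read_text()
    for placeholder, value in SETTINGS.items():
        text = text.replace(placeholder, value)
    config.write_text(text)

print(f"Model repository prepared at {REPO.resolve()}")
```

With the repository in place, the Triton server is started with the TensorRT-LLM backend pointed at this directory.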

With our AI coding assistant deployed, we can send it prompts to generate code, fill in missing code, and get helpful suggestions, which can greatly speed up the coding process and improve productivity.
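As a closing example, the sketch below sends a prompt to Triton's HTTP generate endpoint. It assumes the server is running locally on port 8000 and that the ensemble model exposes text_input, max_tokens, and text_output fields, as in the tensorrtllm_backend examples; adjust the model name and field names to match your repository.

```python
# Minimal sketch: query the deployed assistant through Triton's HTTP
# generate endpoint. Assumes a local server on port 8000 and an "ensemble"
# model with "text_input"/"max_tokens" inputs and a "text_output" field,
# as in the tensorrtllm_backend examples.
import requests

def complete_code(prompt: str, max_tokens: int = 64) -> str:
    """Send a prompt to the StarCoder ensemble and return the completion."""
    response = requests.post(
        "http://localhost:8000/v2/models/ensemble/generate",
        json={
            "text_input": prompt,
            "max_tokens": max_tokens,
            "bad_words": "",
            "stop_words": "",
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text_output"]

# Example: ask StarCoder to continue a function definition.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
print(complete_code(prompt))
```

The same request can also be issued from the command line with curl or through the tritonclient library if you prefer a dedicated client.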