Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization


NVIDIA AI Blueprint for Video Search and Summarization (VSS)

The NVIDIA AI Blueprint for Video Search and Summarization (VSS) integrates vision language models (VLMs), large language models (LLMs), and retrieval-augmented generation (RAG) to enable analysis of both stored and real-time video. It provides a recipe for long-form video understanding and accelerates the development of video analytics AI agents.

Computer Vision Pipeline

Enhance accuracy by tracking objects in a scene with zero-shot object detection, and by using bounding boxes, segmentation masks, and Set-of-Mark (SoM) prompting to guide vision language models. SoM overlays each frame with visual marks, such as numbered boxes or mask outlines, giving the VLM a predefined set of reference points it can use to ground its descriptions to specific objects.
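
A minimal sketch of the SoM idea, independent of VSS internals: draw a numbered mark on each detection before the frame is sent to a VLM. The detection list, file name, and prompt below are illustrative assumptions, not the blueprint's actual pipeline code.

```python
from PIL import Image, ImageDraw

def overlay_marks(frame: Image.Image, detections: list[dict]) -> Image.Image:
    """Draw a numbered bounding box (the 'mark') on each detected object."""
    marked = frame.copy()
    draw = ImageDraw.Draw(marked)
    for det in detections:
        x0, y0, x1, y1 = det["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(det["id"]), fill="red")
    return marked

# Hypothetical zero-shot detector output for one sampled frame.
detections = [
    {"id": 1, "bbox": (40, 60, 220, 180), "label": "vehicle"},
    {"id": 2, "bbox": (300, 80, 460, 210), "label": "truck"},
]
marked = overlay_marks(Image.open("frame_0001.jpg"), detections)

# The numbered marks let the VLM ground its answer to specific objects.
prompt = "For each numbered object, describe what it is and what it is doing."
```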

Single-GPU Deployment

A single-GPU deployment recipe has been introduced for smaller workloads, using low-memory modes and smaller LLMs. The configuration shares a single GPU across all models, swaps in a smaller LLM (Llama 3.1 8B Instruct), and maintains a separate context for each source with context-aware RAG (CA-RAG).
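
As a rough illustration, those choices reduce to pinning every model to one GPU, selecting the smaller LLM, and enabling low-memory operation. The key names below are hypothetical and do not reflect the blueprint's actual configuration schema.

```python
# Illustrative single-GPU settings; key names are hypothetical.
single_gpu_config = {
    "gpu_assignment": {
        "vlm": 0,        # all models share GPU 0
        "llm": 0,
        "embedding": 0,
    },
    "llm_model": "meta/llama-3.1-8b-instruct",  # smaller LLM for summarization and Q&A
    "low_memory_mode": True,                    # trade throughput for memory footprint
    "ca_rag": {
        "per_source_context": True,             # keep each source's context separate
    },
}
```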

Multi-Stream Processing

The Summarization and Q&A APIs can be called in parallel across threads or processes for different video files or live streams. Each chunk of data is tagged with a unique stream ID, so multiple streams can be processed concurrently without interfering with one another, as in the sketch below.
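
A sketch of that pattern, assuming a REST-style deployment of the blueprint; the endpoint path and payload fields are illustrative, not the exact VSS API.

```python
import concurrent.futures
import requests

VSS_URL = "http://localhost:8100"  # hypothetical VSS endpoint

def summarize(stream_id: str) -> dict:
    """Request a summary for one video file or live stream."""
    resp = requests.post(
        f"{VSS_URL}/summarize",
        json={"id": stream_id, "prompt": "Summarize notable events."},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()

stream_ids = ["warehouse-cam-1", "loading-dock-2", "traffic-feed-3"]

# Each request carries its own stream ID, so chunks from different sources
# never mix; the calls are independent and safe to run in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(stream_ids)) as pool:
    results = list(pool.map(summarize, stream_ids))
```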

Combining Audio and Visual Information

For each chunk, the video description from the VLM, the audio transcript from the automatic speech recognition (ASR) service, and additional metadata such as timestamps are sent to the retrieval pipeline for processing and indexing. Audio processing can be enabled or disabled, and each summarization request can configure its own audio transcription options.
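
One way to picture the fused record for a single chunk; the field names and the final indexing step are assumptions, not the blueprint's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class ChunkDocument:
    stream_id: str
    start_sec: float
    end_sec: float
    vlm_caption: str       # dense caption from the VLM
    asr_transcript: str    # empty when audio is disabled for the request

def build_chunk_document(chunk, caption, transcript, audio_enabled=True):
    return ChunkDocument(
        stream_id=chunk["stream_id"],
        start_sec=chunk["start"],
        end_sec=chunk["end"],
        vlm_caption=caption,
        asr_transcript=transcript if audio_enabled else "",
    )

chunk = {"stream_id": "warehouse-cam-1", "start": 120.0, "end": 180.0}
doc = build_chunk_document(
    chunk,
    caption="A forklift moves two pallets toward the loading dock.",
    transcript="Dock two is clear, bring the next pallet through.",
)
# asdict(doc) is the kind of record a retrieval pipeline would embed and index.
print(asdict(doc))
```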

Computer Vision Integration

Integrating dedicated computer vision models with VLMs enhances video analysis by providing detailed metadata about objects, including their positions, segmentation masks, and tracking IDs. The CV and tracking pipeline in VSS generates this metadata for both videos and live streams, enabling object detection and tracking based on user-specified classes like "vehicle, truck."

Each video chunk is then processed by the VLM, with the sampled frames overlaid with object IDs and segmentation masks so the model's description can reference tracked objects consistently.
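
A sketch of what per-frame CV metadata could look like; the schema is illustrative, since the blueprint's actual metadata format is not shown here.

```python
# Hypothetical per-object CV metadata for one sampled frame.
frame_metadata = {
    "frame_idx": 412,
    "timestamp_sec": 13.7,
    "objects": [
        {
            "tracking_id": 7,              # stable across frames while tracked
            "class": "truck",              # matches the user-specified class list
            "bbox": [512, 240, 890, 600],  # x0, y0, x1, y1 in pixels
            "mask_rle": "...",             # segmentation mask (placeholder)
        },
        {
            "tracking_id": 9,
            "class": "vehicle",
            "bbox": [120, 300, 340, 520],
            "mask_rle": "...",
        },
    ],
}

# Frames sampled for the VLM are overlaid with each object's tracking ID and
# mask, so the generated description can refer to the same object over time.
```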