
Build a Video Search and Summarization Agent with NVIDIA AI Blueprint


NVIDIA AI Blueprint for Video Search and Summarization

Introduction

The NVIDIA AI Blueprint for video search and summarization introduces a reference workflow for long-form video understanding built on vision language models (VLMs), large language models (LLMs), and Graph-RAG techniques. With it, developers can build video analytics AI agents for tasks such as summarization, Q&A, and event detection on live streams.

Video Analytics AI Agent

The blueprint combines VLMs and LLMs with datastores to create a scalable, GPU-accelerated video understanding agent. Its VLM pipeline decodes video chunks, generates embeddings, and uses the VLM to generate responses to user queries.
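
The snippet below is a minimal, self-contained sketch of the "embeddings plus datastore" idea: per-chunk captions are embedded and searched by cosine similarity so the most relevant chunks can be retrieved for a user query. It uses sentence-transformers and an in-memory NumPy index purely for illustration; the blueprint's actual embedding model and datastore are not shown here, and the captions are made up for the example.

```python
# Illustrative retrieval over per-chunk captions: embed the captions,
# embed the query, and return the closest chunk by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local model for the sketch

captions = [
    "Chunk 0 (00:00-00:30): a forklift moves pallets near the loading dock.",
    "Chunk 1 (00:30-01:00): a worker inspects boxes on the conveyor belt.",
]
index = model.encode(captions, normalize_embeddings=True)   # (num_chunks, dim)

query = "When was the forklift active?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = index @ query_vec                                   # cosine similarity
best = int(np.argmax(scores))
print(f"Most relevant chunk: {captions[best]} (score={scores[best]:.2f})")
```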

Video Ingestion

To summarize a video or answer questions about it, the agent first builds a comprehensive index of the important information the video contains. Long videos are divided into chunks, each chunk is analyzed by the VLM to produce a dense caption, and the captions and metadata are then summarized and aggregated by the context-aware retrieval-augmented generation (CA-RAG) module, which uses them to build a knowledge graph.
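
As a rough illustration of the chunking step, the sketch below splits a video timeline into fixed-length, slightly overlapping chunks that can then be captioned one at a time. The chunk duration and overlap values are illustrative choices for this example, not the blueprint's configuration.

```python
# Split a video timeline into overlapping chunks for per-chunk captioning.
from dataclasses import dataclass

@dataclass
class Chunk:
    index: int
    start_s: float   # chunk start time in seconds
    end_s: float     # chunk end time in seconds

def make_chunks(video_duration_s: float, chunk_s: float = 60.0,
                overlap_s: float = 5.0) -> list[Chunk]:
    """Return (start, end) chunk boundaries covering the whole video."""
    chunks, start, i = [], 0.0, 0
    while start < video_duration_s:
        end = min(start + chunk_s, video_duration_s)
        chunks.append(Chunk(i, start, end))
        if end >= video_duration_s:
            break
        start = end - overlap_s   # overlap keeps events at chunk boundaries intact
        i += 1
    return chunks

for c in make_chunks(video_duration_s=200.0):
    print(f"chunk {c.index}: {c.start_s:.0f}s-{c.end_s:.0f}s")
```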

VLM Pipeline

  • Decodes video chunks
  • Generates embeddings
  • Uses the VLM to generate per-chunk responses (see the sketch after this list)
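
The following is a hedged sketch of the per-chunk VLM call: a few frames sampled from one chunk are sent to an OpenAI-compatible vision endpoint, which returns a dense caption. The base URL, API key, model name, and frame file names are placeholders chosen for this example, not the blueprint's actual VLM configuration.

```python
# Send sampled frames from one chunk to a vision model and get a dense caption back.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def caption_chunk(frame_paths: list[str]) -> str:
    """Ask the VLM for a dense caption of one video chunk's sampled frames."""
    content = [{"type": "text",
                "text": "Describe everything happening in these frames in detail."}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

    resp = client.chat.completions.create(
        model="nvidia/vila",          # placeholder VLM model name
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

# Example usage with placeholder frame files sampled earlier in the pipeline.
print(caption_chunk(["chunk0_frame0.jpg", "chunk0_frame1.jpg"]))
```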

Caption Summarization (LLM)

  • An LLM prompt combines the per-chunk VLM captions into a single summary (see the sketch after this list)
  • Graph-RAG techniques extract key insights for summarization, Q&A, and alerts
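
Below is a minimal sketch of the caption-aggregation step: per-chunk captions are joined into a single prompt and an LLM produces the final summary. The endpoint, model name, prompt wording, and captions are assumptions made for this example, not the blueprint's CA-RAG implementation.

```python
# Combine per-chunk captions into one prompt and ask an LLM for the final summary.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

chunk_captions = {
    "00:00-00:30": "A forklift moves pallets near the loading dock.",
    "00:30-01:00": "A worker inspects boxes on the conveyor belt.",
}

prompt = (
    "You are summarizing a long video from per-chunk captions.\n"
    "Combine the captions into a concise summary and list notable events "
    "with their timestamps.\n\n"
    + "\n".join(f"[{ts}] {cap}" for ts, cap in chunk_captions.items())
)

resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",   # placeholder LLM model name
    messages=[{"role": "user", "content": prompt}],
    max_tokens=400,
)
print(resp.choices[0].message.content)
```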

By leveraging these components, the AI agent can provide efficient video summarization, interactive Q&A, and customized alerts on live video streams.