Develop Generative AI-Powered Visual AI Agents for the Edge

Table of Contents

  1. Introduction
  2. Build Visual AI Agents for the Edge using Jetson Platform Services
  3. Prompt Engineering
  4. Integration with Jetson Platform Services and a Mobile App
  5. Conclusion

Introduction

This blog post explores how to build visual AI agents based on vision language models (VLMs) that can run from edge to cloud. It focuses on edge deployments on NVIDIA Jetson Orin using Jetson Platform Services, a suite of prebuilt microservices that provide essential out-of-the-box functionality for building computer vision solutions. The goal is to create a VLM-based visual AI agent application that detects events on live-streaming cameras and sends notifications to the user through a mobile app.

Build Visual AI Agents for the Edge using Jetson Platform Services

By combining VLMs with Jetson Platform Services, we can develop a generative AI-powered application that detects events on live video streams, where the user describes each event in natural language. Pseudocode for this process, a utility library, and full reference examples are available on GitHub.
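The detection flow described above can be sketched as a simple loop. This is a minimal illustration, not the reference implementation from GitHub: the names `monitor_stream` and `fake_vlm` are hypothetical, the VLM is modelled as a plain callable rather than a real service, and notifying only when an alert changes from false to true is an assumption made here to avoid repeated notifications.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in: the VLM is modelled as a callable that takes one
# frame and the alert rules, and returns a true/false state per rule.
VlmFn = Callable[[bytes, List[str]], Dict[str, bool]]


def monitor_stream(frames: Iterable[bytes], alerts: List[str], vlm: VlmFn) -> List[dict]:
    """Evaluate every alert rule on each frame and collect notifications."""
    notifications = []
    previous = {rule: False for rule in alerts}
    for index, frame in enumerate(frames):
        states = vlm(frame, alerts)
        for rule in alerts:
            active = states.get(rule, False)
            # Assumption: notify only on a False -> True transition.
            if active and not previous[rule]:
                notifications.append({"frame": index, "alert": rule})
            previous[rule] = active
    return notifications


def fake_vlm(frame: bytes, rules: List[str]) -> Dict[str, bool]:
    """Toy model for the demo: an alert fires if the frame bytes contain b'fire'."""
    return {rule: (b"fire" in frame) and ("fire" in rule) for rule in rules}


if __name__ == "__main__":
    demo_frames = [b"empty", b"fire", b"fire", b"empty", b"fire"]
    for note in monitor_stream(demo_frames, ["is there a fire?"], fake_vlm):
        print(note)
```

In a real deployment the callable would wrap a request to the VLM service, and the collected notifications would be pushed to the mobile app instead of being returned as a list.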

Prompt Engineering

A VLM is prompted with three main components: a system prompt, a user prompt, and an input frame. Adjusting the system and user prompts tells the VLM how to evaluate alerts on a live stream and how to output the results in a structured format. The user prompt can be supplied through the REST API. It is combined with the system prompt and given to the VLM together with a frame from the input live stream.
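As a rough sketch of this prompt structure, the snippet below assembles the system and user prompts from a list of alert rules and parses a structured reply. The prompt wording, the JSON reply format, and the function names `build_prompt` and `parse_reply` are assumptions made for illustration; the actual prompts used by the reference examples may differ. The input frame is omitted because it would be attached by whichever VLM client is used.

```python
import json
from typing import Dict, List

# Hypothetical system prompt: it asks for a machine-readable answer so the
# reply can be parsed without guessing at free-form text.
SYSTEM_PROMPT = (
    "You are an AI assistant that evaluates alert rules on a camera frame. "
    "Answer only with a JSON object mapping each rule number to true or false."
)


def build_prompt(alert_rules: List[str]) -> Dict[str, str]:
    """Combine the fixed system prompt with the user's numbered alert rules."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(alert_rules))
    return {"system": SYSTEM_PROMPT, "user": f"Alert rules:\n{numbered}"}


def parse_reply(reply: str, alert_rules: List[str]) -> Dict[str, bool]:
    """Map the model's JSON reply back to rules; anything malformed counts as False."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {rule: data.get(str(i)) is True for i, rule in enumerate(alert_rules)}


if __name__ == "__main__":
    rules = ["is there a fire?", "is the door open?"]
    print(build_prompt(rules)["user"])
    print(parse_reply('{"0": true, "1": false}', rules))
```

Treating unparseable replies as "no alert" is one possible design choice; it keeps a malformed model output from triggering a false notification.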

Integration with Jetson Platform Services and a Mobile App

The end-to-end system pairs the VLM-powered visual AI agent with a mobile app. The app accesses the APIs exposed by Jetson Platform Services and the VLM service through the API Gateway. It allows users to set custom alerts in natural language on selected live streams, chat with the VLM about the input live stream, and view the live stream directly in the app using WebRTC from VST (Video Storage Toolkit).
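To make the request flow concrete, here is a hedged sketch of how a client such as the mobile app might build an alert request addressed to the API Gateway. The endpoint path, the payload field names, the host and port, and the stream identifier are all assumptions for illustration and should be checked against the Jetson Platform Services API documentation. The request is only constructed, not sent.

```python
import json
from typing import List
from urllib.request import Request


def build_alert_request(gateway_url: str, stream_id: str, alerts: List[str]) -> Request:
    """Build (but do not send) a POST request that sets alert rules on a stream."""
    # Assumed payload shape: a stream identifier plus a list of rules.
    body = json.dumps({"id": stream_id, "alerts": alerts}).encode("utf-8")
    return Request(
        url=f"{gateway_url.rstrip('/')}/api/v1/alerts",  # hypothetical route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_alert_request(
        "http://jetson.local:30080",  # placeholder gateway address
        "example-stream-id",          # placeholder stream identifier
        ["is there a fire?", "is there smoke?"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect before pointing it at a real device.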

Conclusion

This blog post demonstrates the potential of combining VLMs with Jetson Platform Services to create a Visual AI Agent, highlighting the capabilities of edge computing in developing AI-powered applications for real-time event detection and notification.