NVIDIA Technical Blog, July 1, 2025

Best-in-Class Multimodal RAG: How the Llama 3.2 NeMo Retriever Embedding Model Boosts Pipeline Accuracy

Introduction
Evolution of Multimodal Language Models
Challenges in Document Processing
Llama 3.2 NeMo Retriever Multimodal Embedding Model
Efficient Document Retrieval
DigitalCorpora-767 Dataset
Get Started with NeMo Retriever

Introduction

Multimodal language models, also known as vision language models (VLMs), can process both text and raw images to generate responses. NVIDIA introduced a NeMo Retriever microservice for document image retrieval, built on the Llama 3.2 NeMo Retriever Multimodal Embedding 1B model.

Evolution of Multimodal Language Models

Recent progress in vision language models has brought sophisticated visual processing capabilities and improved results on tasks such as chart question answering (ChartQA). Models such as Gemma 3, PaliGemma, SmolVLM, QwenVL, and LLaVA-1.5 have advanced the field's ability to handle text and image data together.

Challenges in Document Processing

Documents are often complex and require parsing into text for information retrieval tasks. Multimodal information retrieval systems need robust retrieval components like multimodal embedding and ranker models to find relevant information from diverse knowledge bases.

Llama 3.2 NeMo Retriever Multimodal Embedding Model

The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model is a powerful vision embedding model with 1.6B parameters. It supports multimodal information retrieval by embedding both raw page images and text.
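
To make the retrieval step concrete, here is a minimal sketch of how page embeddings are typically matched against a query embedding. The three-dimensional vectors, the page identifiers, and the `rank_pages` helper are all made up for illustration, and cosine similarity is an assumed scoring function; none of this is the model's actual output or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_pages(query_embedding, page_embeddings, top_k=2):
    """Return the top_k (page_id, score) pairs, best match first."""
    scored = [
        (page_id, cosine_similarity(query_embedding, embedding))
        for page_id, embedding in page_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical values); a real embedding model would
# produce much higher-dimensional vectors for each page image.
page_embeddings = {
    "report.pdf#page=1": [0.9, 0.1, 0.0],
    "report.pdf#page=2": [0.1, 0.8, 0.3],
    "slides.pdf#page=4": [0.0, 0.2, 0.9],
}
query_embedding = [0.2, 0.9, 0.2]

top_pages = rank_pages(query_embedding, page_embeddings)
for page_id, score in top_pages:
    print(f"{page_id}: {score:.3f}")
```

Because the query and the pages live in the same vector space, the same ranking routine works whether the query is text and the pages are images, which is the property a multimodal embedding model is meant to provide.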

Efficient Document Retrieval

By adopting a "retrieval in vision space" approach, the NeMo Retriever model embeds raw page images directly, which preserves their visual information. It combines a vision encoder, a large language model, and a linear projection layer to produce embeddings for document retrieval.
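
The data flow through those three components can be sketched in a few lines. This is a toy illustration only: `toy_vision_encoder`, `toy_language_model`, the mean pooling step, the L2 normalization, and the weight values are all assumptions chosen to keep the example runnable, not the model's real layers or parameters.

```python
import math


def toy_vision_encoder(page_pixels, patch_size=4):
    """Stand-in vision encoder: split pixels into patches, one feature vector each."""
    patches = [
        page_pixels[i:i + patch_size]
        for i in range(0, len(page_pixels), patch_size)
    ]
    return [[sum(p) / len(p), max(p), min(p)] for p in patches]


def toy_language_model(patch_features):
    """Stand-in LLM: mix global context into each token, then mean-pool.

    Mean pooling is an assumption; the source does not say how token
    states are reduced to a single vector.
    """
    n = len(patch_features)
    dim = len(patch_features[0])
    global_mean = [sum(tok[d] for tok in patch_features) / n for d in range(dim)]
    contextual = [[tok[d] + global_mean[d] for d in range(dim)] for tok in patch_features]
    return [sum(tok[d] for tok in contextual) / n for d in range(dim)]


def linear_projection(hidden_state, weights):
    """Map the pooled hidden state to the embedding dimension (matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in weights]


def embed_page(page_pixels, weights):
    """Vision encoder -> language model -> linear projection -> unit-length embedding."""
    hidden = toy_language_model(toy_vision_encoder(page_pixels))
    projected = linear_projection(hidden, weights)
    norm = math.sqrt(sum(x * x for x in projected))
    return [x / norm for x in projected]


# Hypothetical 16-pixel "page" and a 2x3 projection matrix.
page_pixels = [i / 16 for i in range(16)]
projection_weights = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]

page_embedding = embed_page(page_pixels, projection_weights)
print(page_embedding)
```

The point of the sketch is the ordering: the page image is never converted to text, so layout, charts, and tables contribute to the final embedding.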

DigitalCorpora-767 Dataset

The DigitalCorpora-767 dataset comprises 767 PDFs with 991 human-annotated questions, covering text, tables, charts, and infographics. This diverse benchmark helps evaluate the performance of retrieval systems.
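
A benchmark of this shape (questions paired with annotated pages) is commonly scored with recall@k. The sketch below assumes that metric and uses invented question and page identifiers; the summary above does not state which metric or data format the actual evaluation uses.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of questions with at least one relevant page in the top k results.

    retrieved: dict of question id -> ranked list of page ids
    relevant:  dict of question id -> set of annotated relevant page ids
    """
    hits = 0
    for question_id, relevant_pages in relevant.items():
        top_k = retrieved.get(question_id, [])[:k]
        if any(page_id in relevant_pages for page_id in top_k):
            hits += 1
    return hits / len(relevant)


# Hypothetical annotations and retrieval results for three questions.
relevant = {
    "q1": {"a.pdf#page=1"},
    "q2": {"c.pdf#page=3"},
    "q3": {"d.pdf#page=5"},
}
retrieved = {
    "q1": ["a.pdf#page=1", "b.pdf#page=2"],
    "q2": ["b.pdf#page=1", "c.pdf#page=3"],
    "q3": ["a.pdf#page=1", "b.pdf#page=2"],
}

recall_at_1 = recall_at_k(retrieved, relevant, k=1)
recall_at_2 = recall_at_k(retrieved, relevant, k=2)
print(f"recall@1 = {recall_at_1:.3f}, recall@2 = {recall_at_2:.3f}")
```

With one annotated page per question, this hit-based definition coincides with standard recall; questions with several relevant pages would need a per-page variant.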

Get Started with NeMo Retriever

Developers can use the NeMo Retriever microservices to build high-accuracy information retrieval pipelines that deliver real-time business insights while prioritizing data privacy and efficiency.