Microsoft Dev Blogs

Multimodal RAG with Vision: From Experimentation to Implementation


In this multimodal RAG scenario, ingestion transforms image content into text by generating detailed descriptions with a multimodal LLM. The goal is to enrich the system's understanding of images and improve responses to image-related queries. The experiments cover aspects such as custom loader configuration, accurate image description generation, and inference with a multimodal LLM. Performance is assessed by systematically testing different configurations against predefined baselines, using specific retrieval and generative metrics.
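As a minimal sketch of the description-generation step, the snippet below builds an OpenAI-style vision chat payload that asks a multimodal LLM to describe an image for indexing. The message shape follows the common `image_url` content-part convention; the function name, prompt text, and the omitted client call are illustrative assumptions, and the actual deployment details would vary.

```python
import base64

def build_image_description_request(image_bytes: bytes, prompt: str) -> list:
    """Build a vision-style chat payload asking an LLM to describe an image.

    The returned messages follow the common OpenAI-style vision format
    (text part + base64 data-URL image part). The actual client call is
    omitted; it depends on your model deployment.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]

messages = build_image_description_request(
    b"\x89PNG...",  # placeholder bytes; load real image content in practice
    "Describe this image in detail so the text can be indexed for retrieval.",
)
```

The generated description would then be chunked and embedded alongside the surrounding document text, so image-related queries can match on it during retrieval.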

The provided Q&A evaluation dataset includes diverse question and answer pairs for accurate evaluation:

  1. Document Title: Ingestion Process

    • Question: What are the steps of the ingestion flow in RAG?
    • Answer: Chunk documents, enrich documents, embed chunks, and persist data
    • Image link: Link to image
    • Type: Vision
  2. Document Title: KNN

    • Question: What types of datasets should exhaustive KNN be used for?
    • Answer: Small to medium-sized datasets, or cases where exact nearest neighbors are required
    • Type: N/A

These examples showcase the structured approach to experimentation and the diverse nature of the questions used for evaluation in the multimodal RAG implementation.
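The dataset entries above can be sketched as simple records, which makes it easy to filter vision questions from text-only ones during evaluation. The class and field names below are assumptions for illustration, not part of the original evaluation harness, and the image link is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalExample:
    """One Q&A pair from the evaluation dataset."""
    document_title: str
    question: str
    answer: str
    image_link: Optional[str] = None  # only present for vision examples
    qa_type: str = "N/A"              # e.g. "Vision" or "N/A"

dataset = [
    EvalExample(
        document_title="Ingestion Process",
        question="What are the steps of the ingestion flow in RAG?",
        answer="Chunk documents, enrich documents, embed chunks, and persist data",
        image_link="https://example.com/ingestion.png",  # hypothetical URL
        qa_type="Vision",
    ),
    EvalExample(
        document_title="KNN",
        question="What types of datasets should exhaustive KNN be used for?",
        answer="Small to medium-sized datasets",
    ),
]

# Vision examples exercise the image-description path of the pipeline;
# the rest evaluate plain text retrieval and generation.
vision_examples = [ex for ex in dataset if ex.qa_type == "Vision"]
```

Splitting the dataset this way lets retrieval and generative metrics be reported separately for image-derived and text-only content.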