Microsoft Dev Blogs

Multimodal RAG with Vision: From Experimentation to Implementation


In this multimodal RAG scenario, ingestion transforms image content into text by generating detailed descriptions with a multimodal LLM. The goal is to enrich the system's understanding of images and improve responses to image-related queries. The experiments cover aspects such as custom loader configuration, accurate image description generation, and inference with a multimodal LLM. Performance is assessed by systematically testing different configurations against predefined baselines, using specific retrieval and generative metrics.
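As a minimal sketch of the description-generation step, the snippet below builds an OpenAI-style vision chat payload that asks a multimodal LLM to describe an image for indexing. The message shape follows the common `image_url` content-part convention; the function name, prompt text, and the omitted client call are illustrative assumptions, and the actual deployment details would vary.

```python
import base64

def build_image_description_request(image_bytes: bytes, prompt: str) -> list:
    """Build a vision-style chat payload asking an LLM to describe an image.

    The returned messages follow the common OpenAI-style vision format
    (text part + base64 data-URL image part). The actual client call is
    omitted; it depends on your model deployment.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]

messages = build_image_description_request(
    b"\x89PNG...",  # placeholder bytes; load real image content in practice
    "Describe this image in detail so the text can be indexed for retrieval.",
)
```

The generated description would then be chunked and embedded alongside the surrounding document text, so image-related queries can match on it during retrieval.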

The provided Q&A evaluation dataset includes diverse question and answer pairs for accurate evaluation:

  1. Document Title: Ingestion Process

    • Question: What are the steps of the ingestion flow in RAG?
    • Answer: Chunk documents, enrich documents, embed chunks, and persist data
    • Image link: Link to image
    • Type: Vision
  2. Document Title: KNN

    • Question: What types of datasets should exhaustive KNN be used for?
    • Answer: Small to medium-sized datasets, or cases where exact nearest neighbors are required
    • Type: N/A

These examples showcase the structured approach to experimentation and the diverse nature of the questions used for evaluation in the multimodal RAG implementation.
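The dataset entries above can be sketched as simple records, which makes it easy to filter vision questions from text-only ones during evaluation. The class and field names below are assumptions for illustration, not part of the original evaluation harness, and the image link is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalExample:
    """One Q&A pair from the evaluation dataset."""
    document_title: str
    question: str
    answer: str
    image_link: Optional[str] = None  # only present for vision examples
    qa_type: str = "N/A"              # e.g. "Vision" or "N/A"

dataset = [
    EvalExample(
        document_title="Ingestion Process",
        question="What are the steps of the ingestion flow in RAG?",
        answer="Chunk documents, enrich documents, embed chunks, and persist data",
        image_link="https://example.com/ingestion.png",  # hypothetical URL
        qa_type="Vision",
    ),
    EvalExample(
        document_title="KNN",
        question="What types of datasets should exhaustive KNN be used for?",
        answer="Small to medium-sized datasets",
    ),
]

# Vision examples exercise the image-description path of the pipeline;
# the rest evaluate plain text retrieval and generation.
vision_examples = [ex for ex in dataset if ex.qa_type == "Vision"]
```

Splitting the dataset this way lets retrieval and generative metrics be reported separately for image-derived and text-only content.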