Intel Tech Blog

Improve your Tabular Data Ingestion for RAG with Reranking


In this post, we explore how to enhance a Retrieval-Augmented Generation (RAG) system by adding a reranker that selects the most relevant context chunks, so a large language model (LLM) can generate better responses.

Data Preparation

  • Data from a PDF on the World's Billionaires is used as the knowledge base.
  • Two paths are followed for text and tabular data extraction and processing.
  • The text path involves data cleaning, chunking, and metadata addition.
  • The tabular path extracts tables and converts them into context chunks using row-based and table-based approaches.
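The two tabular approaches can be sketched as follows. The function names and sample rows below are hypothetical stand-ins, not the post's actual code or data:

```python
# Hypothetical sketch of the two tabular chunking approaches; function
# names and sample rows are illustrative placeholders.

def rows_to_chunks(table_name, rows):
    """Row-based approach: one self-contained context chunk per table row."""
    return [
        f"{table_name}: " + ", ".join(f"{k}: {v}" for k, v in row.items())
        for row in rows
    ]

def table_to_chunk(table_name, rows):
    """Table-based approach: the whole table serialized as a single chunk."""
    header = " | ".join(rows[0].keys())
    body = "\n".join(" | ".join(str(v) for v in row.values()) for row in rows)
    return f"{table_name}\n{header}\n{body}"

sample = [
    {"Rank": 1, "Name": "Person A", "Net worth (USD)": "200B"},
    {"Rank": 2, "Name": "Person B", "Net worth (USD)": "180B"},
]
row_chunks = rows_to_chunks("World's Billionaires", sample)
table_chunk = table_to_chunk("World's Billionaires", sample)
```

Row chunks tend to retrieve well for questions about a single entity, while the table chunk preserves cross-row context.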

Indexing

Unified Context Scenario

  • Uses ChromaDB to create a single collection that holds both text and table chunks, with embeddings generated for semantic search.

Distributed Context Scenario

  • ChromaDB manages two distinct collections: distrctx_context_collection and distrtbl_table_collection.

Retrieval

  • Relevant documents are retrieved from the collections before being passed to the LLM.
  • Data from the World's Billionaires PDF is converted to a retrievable format.
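The retrieve-then-rerank flow that motivates the post can be sketched with stand-in scorers: coarse bag-of-words overlap plays the role of first-stage vector search, and bigram (word-order) overlap plays the role of a cross-encoder reranker. All names and documents here are illustrative:

```python
# Stand-in scorers: word overlap mimics first-stage vector retrieval,
# bigram overlap mimics a cross-encoder reranker. Illustrative only.

def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def bigram_overlap(query, doc):
    def bigrams(text):
        toks = text.lower().split()
        return set(zip(toks, toks[1:]))
    return len(bigrams(query) & bigrams(doc))

def retrieve(query, docs, k=3):
    """First stage: coarse top-k retrieval."""
    return sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)[:k]

def rerank(query, candidates, top_n=1):
    """Second stage: re-score the candidates with a finer signal."""
    return sorted(candidates, key=lambda d: bigram_overlap(query, d),
                  reverse=True)[:top_n]

docs = [
    "2023 in person richest the was who scrambled words containing every query term",
    "the richest person in 2023 had a net worth of over 200B",
    "unrelated text about something else entirely",
]
query = "who was the richest person in 2023"
best = rerank(query, retrieve(query, docs, k=2), top_n=1)
```

Here the scrambled first document wins the coarse first stage, but the reranker promotes the chunk that actually answers the question — the same effect a real cross-encoder reranker has over pure vector similarity.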

Helper Functions for Indexer

  • Helper functions are defined for the indexing stage to organize the data and support later retrieval.

Together, the data preparation, indexing, and retrieval stages optimize the RAG system to supply the LLM with the most accurate and relevant context for generating responses in a Q&A chatbot scenario.