Airbnb Tech Blog, September 10, 2024

Riverbed Data Hydration — Part 1

1. Introduction
2. Notification Pipeline
- Join
- Stitch
3. Data Source Join
- JoinConditionsDag
- JoinResultsDag

1. Introduction

Riverbed Data Hydration aims to keep the read-optimized store up to date with the system-of-record data stores. It helps developers build and manage pipelines that integrate data from multiple sources. This article focuses on the join transformation within the Notification Pipeline and shows how a DAG-like structure lets Riverbed join different data sources in a memory-efficient way.

2. Notification Pipeline

Join

In the Notification Pipeline, data from sources such as Review and User is joined to build materialized view documents that use the review ID as the document ID. The work is split in two: Source Pipelines consume CDC (change data capture) events, and the Notification Pipeline processes these events to construct the materialized view documents.
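
To make the hand-off between the two pipelines concrete, here is a minimal sketch of how a Source Pipeline might turn a CDC event into notifications keyed by review ID. The class names (CdcEvent, NotificationEvent, SourcePipeline), the source names "review" and "user", and the reverse lookup from user to reviews are all assumptions made for illustration; the article does not describe Riverbed's actual event schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical CDC event: the source that changed and the primary key of the changed row.
final class CdcEvent {
    final String source;
    final long primaryKey;

    CdcEvent(String source, long primaryKey) {
        this.source = source;
        this.primaryKey = primaryKey;
    }
}

// Hypothetical notification: "the document with this review ID must be rebuilt".
final class NotificationEvent {
    final long reviewId;

    NotificationEvent(long reviewId) {
        this.reviewId = reviewId;
    }
}

// Assumed Source Pipeline step: map each CDC event to the document IDs it affects.
final class SourcePipeline {
    // Assumed reverse lookup from a user ID to the reviews that user wrote.
    private final Function<Long, List<Long>> reviewIdsByUser;

    SourcePipeline(Function<Long, List<Long>> reviewIdsByUser) {
        this.reviewIdsByUser = reviewIdsByUser;
    }

    List<NotificationEvent> toNotifications(CdcEvent event) {
        List<NotificationEvent> out = new ArrayList<>();
        if ("review".equals(event.source)) {
            // A Review change maps directly to its own document.
            out.add(new NotificationEvent(event.primaryKey));
        } else if ("user".equals(event.source)) {
            // A User change fans out to every review document that embeds that user.
            for (long reviewId : reviewIdsByUser.apply(event.primaryKey)) {
                out.add(new NotificationEvent(reviewId));
            }
        }
        return out;
    }
}
```

In this sketch a change to a single User row can produce several notifications, which is one plausible reason to key everything by the document ID rather than by the changed row.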

Stitch

After data has been fetched from the various sources in response to a Notification event, the Stitch step maps the join results into a Java POJO called StitchModel for further custom processing.
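
As a rough illustration of the Stitch step, the sketch below maps raw join results into a small POJO. Only the name StitchModel comes from the article; the fields, the fromJoinResults factory, and the representation of join results as rows grouped by source name are assumptions, not Riverbed's real model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical StitchModel: one review plus the names of the users joined to it.
final class StitchModel {
    final long reviewId;           // document ID of the materialized view
    final String reviewText;       // taken from the Review source
    final List<String> userNames;  // taken from the User source

    StitchModel(long reviewId, String reviewText, List<String> userNames) {
        this.reviewId = reviewId;
        this.reviewText = reviewText;
        this.userNames = Collections.unmodifiableList(new ArrayList<>(userNames));
    }

    // Assumed input shape: fetched rows grouped by source name ("review", "user").
    static StitchModel fromJoinResults(Map<String, List<Map<String, Object>>> rowsBySource) {
        List<Map<String, Object>> reviews =
                rowsBySource.getOrDefault("review", Collections.emptyList());
        if (reviews.isEmpty()) {
            throw new IllegalArgumentException("join results contain no review row");
        }
        Map<String, Object> review = reviews.get(0);

        List<String> names = new ArrayList<>();
        for (Map<String, Object> user : rowsBySource.getOrDefault("user", Collections.emptyList())) {
            names.add((String) user.get("name"));
        }
        return new StitchModel((Long) review.get("id"), (String) review.get("text"), names);
    }
}
```

Downstream code can then work with typed fields such as reviewText instead of untyped row maps, which is the main benefit a model class of this kind offers.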

3. Data Source Join

JoinConditionsDag

JoinConditionsDag is a directed acyclic graph (DAG) that Riverbed uses to store the relationships among data sources. Each node represents a unique data source, and each edge represents a join condition. In the Notification Pipeline, the root node always represents the metadata of the Notification event.
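
The structure described above could be sketched as follows. This is not Riverbed's implementation: the JoinCondition fields, the join and children methods, and the cycle check are assumptions chosen to show nodes as data sources and edges as join conditions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: nodes are source names, edges are join conditions.
final class JoinConditionsDag {
    // Edge meaning: parentSource.parentField == childSource.childField.
    static final class JoinCondition {
        final String parentSource, parentField, childSource, childField;

        JoinCondition(String parentSource, String parentField,
                      String childSource, String childField) {
            this.parentSource = parentSource;
            this.parentField = parentField;
            this.childSource = childSource;
            this.childField = childField;
        }
    }

    final String root;
    private final Map<String, List<JoinCondition>> edges = new LinkedHashMap<>();

    JoinConditionsDag(String root) {
        this.root = root;
        edges.put(root, new ArrayList<>());
    }

    // Adds an edge; the parent must already be in the graph and no cycle may form.
    JoinConditionsDag join(String parent, String parentField, String child, String childField) {
        if (!edges.containsKey(parent)) {
            throw new IllegalArgumentException("Unknown parent source: " + parent);
        }
        if (reaches(child, parent)) {
            throw new IllegalArgumentException("Edge would create a cycle: " + parent + " -> " + child);
        }
        edges.computeIfAbsent(child, k -> new ArrayList<>());
        edges.get(parent).add(new JoinCondition(parent, parentField, child, childField));
        return this;
    }

    // Join conditions whose parent is the given source.
    List<JoinCondition> children(String source) {
        return Collections.unmodifiableList(edges.getOrDefault(source, Collections.emptyList()));
    }

    // Depth-first search: true if 'to' can be reached from 'from' along existing edges.
    private boolean reaches(String from, String to) {
        if (from.equals(to)) {
            return true;
        }
        for (JoinCondition c : edges.getOrDefault(from, Collections.emptyList())) {
            if (reaches(c.childSource, to)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a review document joined to its author would be declared as new JoinConditionsDag("notification").join("notification", "reviewId", "review", "id").join("review", "authorId", "user", "id"), where all source and field names are illustrative.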

JoinResultsDag

JoinResultsDag stores join results efficiently, especially when the join relationships are complex. Riverbed initializes JoinResultsDag with the Notification event as the root, then retrieves data from each source according to JoinConditionsDag and wraps the results in Cells for further processing.
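
A simplified sketch of that traversal is shown below. Only the names JoinResultsDag and Cell come from the article; JoinEdge, SourceFetcher, and the populate method are hypothetical. The sketch builds a plain tree of Cells and does not attempt to reproduce whatever Riverbed does to keep memory usage low.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Cell: one fetched row plus the child Cells joined from it.
final class Cell {
    final String source;
    final Map<String, Object> row;
    final Map<String, List<Cell>> children = new LinkedHashMap<>();

    Cell(String source, Map<String, Object> row) {
        this.source = source;
        this.row = row;
    }
}

// Hypothetical join condition: parentSource.parentField == childSource.childField.
final class JoinEdge {
    final String parentSource, parentField, childSource, childField;

    JoinEdge(String parentSource, String parentField, String childSource, String childField) {
        this.parentSource = parentSource;
        this.parentField = parentField;
        this.childSource = childSource;
        this.childField = childField;
    }
}

// Hypothetical data access: all rows of 'source' whose 'field' equals 'value'.
interface SourceFetcher {
    List<Map<String, Object>> fetch(String source, String field, Object value);
}

final class JoinResultsDag {
    final Cell root;

    // The Notification event is the root Cell.
    JoinResultsDag(Map<String, Object> notificationEvent) {
        this.root = new Cell("notification", notificationEvent);
    }

    // Walks the join conditions from the root, fetching child rows for each edge.
    void populate(Map<String, List<JoinEdge>> conditions, SourceFetcher fetcher) {
        expand(root, conditions, fetcher);
    }

    private static void expand(Cell parent, Map<String, List<JoinEdge>> conditions,
                               SourceFetcher fetcher) {
        for (JoinEdge edge : conditions.getOrDefault(parent.source, Collections.emptyList())) {
            Object key = parent.row.get(edge.parentField);
            if (key == null) {
                continue; // nothing to join on for this edge
            }
            List<Cell> cells = new ArrayList<>();
            for (Map<String, Object> row : fetcher.fetch(edge.childSource, edge.childField, key)) {
                Cell child = new Cell(edge.childSource, row);
                expand(child, conditions, fetcher); // recurse into the child's own joins
                cells.add(child);
            }
            parent.children.put(edge.childSource, cells);
        }
    }
}
```

Because each Cell keeps its children grouped by source name, a later step such as Stitch can walk from the root to any joined row without re-running the joins.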

Together, JoinConditionsDag and JoinResultsDag let Riverbed perform joins efficiently, starting from the primary source ID provided by the Notification event.