Incremental Processing using Netflix Maestro and Apache Iceberg

Incremental processing is an efficient approach for processing only the new or changed data in a workflow. In this article, Netflix explains how it is building an incremental processing solution using Netflix Maestro and Apache Iceberg. The solution addresses three common challenges faced by dataset owners: data accuracy, data freshness, and backfill.
To address the issue of data accuracy, Netflix uses a lookback window approach, in which each run reprocesses the data within a fixed time window so that late-arriving data is included. However, this approach involves creating manual backfill workflows for each stage of the pipeline, which is time-consuming.
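As a rough illustration of the lookback window idea, the sketch below lists the daily partitions a run would reprocess. The function name `lookback_partitions`, the daily partition granularity, and the ISO date keys are assumptions made for this example; they are not taken from Maestro or from the article.

```python
from __future__ import annotations

from datetime import date, timedelta


def lookback_partitions(run_date: date, lookback_days: int) -> list[str]:
    """Return the daily partition keys a run would reprocess, oldest first.

    Hypothetical helper: the window covers run_date plus the
    `lookback_days` days before it, so late-arriving rows that land in
    those older partitions are picked up again on every run.
    """
    if lookback_days < 0:
        raise ValueError("lookback_days must be non-negative")
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days, -1, -1)
    ]


if __name__ == "__main__":
    # A run for 2023-11-20 with a 2-day lookback rewrites three partitions.
    print(lookback_partitions(date(2023, 11, 20), 2))
```

The cost of the pattern is visible in the return value: every run rewrites the whole window, whether or not any late data actually arrived in it.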
For data freshness, Netflix supports scheduling workflows in a micro-batch fashion with state tracking. This allows workflows to process data at shorter intervals while tracking which data changes have already been handled.
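One simple way to picture micro-batching with state tracking is a watermark that records the last change already processed. The sketch below is only an illustration under that assumption; `WatermarkState`, `take_micro_batch`, and the `(change_id, payload)` log format are hypothetical and do not describe how Maestro stores its state.

```python
from dataclasses import dataclass


@dataclass
class WatermarkState:
    """Hypothetical per-workflow state: highest change id handled so far."""
    last_processed_id: int = 0


def take_micro_batch(state, change_log):
    """Return the changes newer than the watermark, then advance it.

    `change_log` is a list of (change_id, payload) tuples with
    increasing ids. Entries at or below the watermark are skipped, so
    repeated runs never pick up the same change twice.
    """
    pending = sorted(
        (entry for entry in change_log if entry[0] > state.last_processed_id),
        key=lambda entry: entry[0],
    )
    if pending:
        state.last_processed_id = pending[-1][0]
    return pending


if __name__ == "__main__":
    state = WatermarkState()
    log = [(1, "row-a"), (2, "row-b")]
    print(take_micro_batch(state, log))  # first run sees both changes
    log.append((3, "row-c"))
    print(take_micro_batch(state, log))  # next run sees only the new one
```

Because each run touches only the entries above the watermark, the schedule can be tightened without reprocessing data that earlier runs already covered.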
To improve data accuracy, Netflix provides support for processing all late-arriving data, achieving the required accuracy with better performance. It also offers managed backfill support, which automates building, monitoring, and validating backfill workflows. This greatly improves engineering productivity.
The approach involves capturing incremental data changes and tracking their states. Depending on how the target table is derived from the source table, the approach may involve reprocessing all data, reprocessing specific rows, or indicating a range of data to be affected. Netflix uses the append pattern to write new data to the target table.
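The different ways of consuming captured changes can be sketched as three small functions over one change set. This is a simplified illustration, not the article's implementation: the change set is modeled as a list of dictionaries, and the names `rows_to_append`, `keys_to_reprocess`, and `range_to_reprocess` are hypothetical.

```python
def rows_to_append(changes):
    """Append pattern: every captured change row goes to the target as-is."""
    return [dict(row) for row in changes]


def keys_to_reprocess(changes, key_field):
    """Row-level pattern: the distinct keys whose target rows must be rebuilt."""
    return sorted({row[key_field] for row in changes})


def range_to_reprocess(changes, range_field):
    """Range pattern: the (min, max) span of values touched by the changes."""
    values = [row[range_field] for row in changes]
    return (min(values), max(values)) if values else None


if __name__ == "__main__":
    # Hypothetical captured changes, including one late-arriving row.
    changes = [
        {"account_id": 7, "event_date": "2023-11-18", "minutes": 12},
        {"account_id": 3, "event_date": "2023-11-20", "minutes": 45},
        {"account_id": 7, "event_date": "2023-11-20", "minutes": 30},
    ]
    print(len(rows_to_append(changes)))
    print(keys_to_reprocess(changes, "account_id"))
    print(range_to_reprocess(changes, "event_date"))
```

Which function applies depends on how the target table is derived from the source: a direct copy can append, an aggregate per key needs the affected keys, and a windowed computation needs the affected range.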
Netflix Maestro is highly scalable and extensible, allowing it to support existing and new use cases. Apache Iceberg provides the necessary infrastructure for efficient data management and querying.
Overall, the incremental processing solution provided by Netflix Maestro and Apache Iceberg improves data accuracy, data freshness, and backfill capabilities in workflows, leading to better efficiency and productivity.