Building a Spark observability product with StarRocks: Real-time and historical performance analysis

Table of Contents
- Introduction
- Key Components of the New Architecture
- Data Model and Ingestion
- Handling Real-Time and Historical Data
- Query Performance and Optimization
1. Introduction
In this blog post, we explore leveraging StarRocks to build the next generation of the Spark observability platform at Grab. We discuss the architecture, data model, and key features that help overcome previous limitations and provide more value to Spark users.
2. Key Components of the New Architecture
- StarRocks Database: Replaces InfluxDB for real-time and historical data storage.
- Direct Kafka Ingestion: StarRocks ingests data directly from Kafka, eliminating the need for external agents.
- Custom Web Application (Iris UI): Replaces Grafana dashboards, providing a centralised interface with a custom API.
- Superset Integration: Connected directly to StarRocks, offering real-time data access.
- Simplified Offline Data Process: Scheduled backups from StarRocks to S3 for easy data lake integration.
3. Data Model and Ingestion
The Iris observability system monitors job executions and ad-hoc cluster usage. An example of creating a routine load job for cluster worker metrics is provided, showcasing continuous data ingestion from Kafka.
4. Handling Real-Time and Historical Data
The new Iris system uses StarRocks to manage both real-time and historical data efficiently. Features include routine load for near real-time data ingestion, materialised views for pre-calculation and aggregation, and direct querying in StarRocks for historical runs.
5. Query Performance and Optimization
StarRocks significantly improves query performance for Spark observability. Materialised views combine data from various tables, eliminating the need for complex joins during query execution and enhancing overall efficiency.
By incorporating StarRocks into the Spark observability platform, Grab has overcome previous limitations and established a unified, flexible, and efficient solution for real-time and historical performance analysis.