Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)


    Introduction

    As Pinterest's data processing needs grew, the Big Data Platform (BDP) team began evaluating alternatives for its next-generation data processing platform. Candidate platforms had to support containers, run Pinterest's custom Spark fork, take advantage of recent technical improvements, lower operational costs, and improve developer velocity.

    Evaluation of Platforms

    The team conducted a comprehensive evaluation, and Kubernetes emerged as the favored option because of its fine-grained support for container management. Performance tuning was a key part of the evaluation, with levers such as the JDK version adjusted to improve performance.

    Design Challenges

    Building a new platform on Kubernetes and EKS presented several challenges: integrating EKS into the existing environment, finding replacements for Hadoop components, ensuring compatibility with the Pinterest ecosystem, and improving cost efficiency by using Graviton instances.

    Building Alternatives

    Key components of the Hadoop ecosystem, such as the user-facing UI, job submission, and HDFS, needed replacements on the new platform. Pinterest's EKS clusters were augmented with Spark Operator, YuniKorn, and a Remote Shuffle Service.
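To make the role of Spark Operator and YuniKorn more concrete, here is a minimal sketch of the kind of SparkApplication resource that Spark Operator consumes, built as a plain Python dictionary. It is not Pinterest's actual configuration: the namespace, queue name, Spark version, resource sizes, and the helper name build_spark_application are all hypothetical, and the batchScheduler value "yunikorn" assumes an operator release that supports YuniKorn as a batch scheduler. The arm64 node selector only illustrates how a job might be steered onto Graviton nodes.

```python
import json
from typing import Any, Dict


def build_spark_application(
    name: str,
    image: str,
    main_class: str,
    main_jar: str,
    queue: str = "root.default",   # hypothetical YuniKorn queue name
    executors: int = 2,
    arch: str = "arm64",           # arm64 matches Graviton-based nodes
) -> Dict[str, Any]:
    """Return a SparkApplication manifest as a dict (illustrative values only)."""
    # Standard Kubernetes node label for CPU architecture.
    node_selector = {"kubernetes.io/arch": arch}
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "spark-jobs"},  # hypothetical namespace
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": main_jar,
            "sparkVersion": "3.2.0",  # placeholder, not taken from the article
            # Assumes the installed operator accepts YuniKorn as a batch scheduler.
            "batchScheduler": "yunikorn",
            "batchSchedulerOptions": {"queue": queue},
            "driver": {
                "cores": 1,
                "memory": "2g",
                "nodeSelector": node_selector,
            },
            "executor": {
                "instances": executors,
                "cores": 2,
                "memory": "4g",
                "nodeSelector": node_selector,
            },
        },
    }


if __name__ == "__main__":
    manifest = build_spark_application(
        name="example-job",
        image="example.registry/spark:custom",          # hypothetical image
        main_class="com.example.ExampleJob",             # hypothetical class
        main_jar="local:///opt/spark/jars/example.jar",  # hypothetical path
    )
    # The dict can be serialized and handed to whatever submission layer is in use.
    print(json.dumps(manifest, indent=2))
```

A job submission service could generate a resource like this per job and hand it to the Kubernetes API, leaving the operator to create the driver and executor pods and YuniKorn to place them according to queue capacity.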

    Core Application Aspects

    Moka's UI serves as a centralized portal where users can view job status and open either the live Spark UI or the Spark History Server. Job execution details, logs, and system metrics are collected and displayed through tools such as Archer, Statsboard, and custom dashboards.


    Part 2 of this article will delve into more details on the core application-focused aspects of the platform.