Scaling Kafka to Support PayPal’s Data Growth

- Introduction to scaling Kafka at PayPal: Apache Kafka is an open-source distributed event streaming platform used at PayPal for data streaming pipelines, integration, and ingestion. Scaling Kafka was necessary to handle the tremendous growth of PayPal's streaming data while ensuring high availability, fault tolerance, and optimal performance.
- Implementation of cluster management: PayPal introduced improvements to have better control over Kafka clusters and reduce operational overhead. These improvements include the Kafka Config Service and Kafka ACLs.
- Introduction of Kafka Config Service: Before the Kafka Config Service, clients hardcoded the IPs of the brokers they connected to, which made maintenance difficult. The Kafka Config Service offers a plug-and-play model for Kafka clients, reduces operational and support overhead, and helps standardize Kafka configuration across multiple clients.
- Implementation of Kafka ACLs: Before ACLs were introduced, any PayPal application could connect to any existing topic, which posed an operational risk. Applications must now authenticate and be authorized before they can access Kafka clusters and topics. This provides a highly available and secure platform with low latency for business-critical workflows.
- Development of PayPal Kafka Libraries: To keep clusters and client connections secure, PayPal implemented its own Kafka libraries. These libraries provide topic portability across Kafka clusters and environments, reducing the impact of operational changes and improving developer efficiency.
- Introduction of monitoring library: The monitoring library publishes critical metrics for client applications, allowing PayPal to monitor the health of end-user applications. This library supports over 800 applications on the Kafka platform, avoiding overhead for end-users and efficiently handling key-management events such as certificate updates and key rotations.
- Improvement in QA environment: The QA environment was rebuilt on Google Cloud Platform, with brokers spread across multiple zones to achieve high availability. This new platform reduces cloud costs by 75% and improves performance by about 40% compared to the previous setup.
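
The Kafka Config Service idea described above, where clients look up connection details by topic instead of hardcoding broker IPs, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PayPal's implementation: the registry contents, the topic name, the default values, and the function `resolve_client_config` are all hypothetical, and a real service would be queried over the network rather than read from an in-memory dictionary. Only the configuration key names (`bootstrap.servers`, `security.protocol`, `acks`, `compression.type`) are standard Kafka client settings.

```python
# Hypothetical in-memory stand-in for the data a config service would return.
# A real service would be queried over the network; names and values are made up.
TOPIC_REGISTRY = {
    "payments.events": {
        "bootstrap.servers": "broker-a.example:9092,broker-b.example:9092",
        "security.protocol": "SSL",
    },
}

# Assumed platform-wide defaults shared by every client (illustrative values).
STANDARD_DEFAULTS = {
    "acks": "all",
    "compression.type": "snappy",
}


def resolve_client_config(topic, overrides=None):
    """Return the effective Kafka client config for a topic.

    Precedence, lowest to highest: standard defaults, client overrides,
    service-managed connection settings. Connection settings win so that a
    client cannot pin itself to a hardcoded broker address.
    """
    if topic not in TOPIC_REGISTRY:
        raise KeyError(f"unknown topic: {topic}")
    config = dict(STANDARD_DEFAULTS)
    config.update(overrides or {})
    # Connection details always come from the service, never from the client.
    config.update(TOPIC_REGISTRY[topic])
    return config


if __name__ == "__main__":
    effective = resolve_client_config("payments.events", {"acks": "1"})
    for key in sorted(effective):
        print(f"{key} = {effective[key]}")
```

With this shape, moving a topic to different brokers means changing one registry entry; clients pick up the new addresses the next time they resolve their configuration, and tuning options such as `acks` stay under the client's control.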