Kafka cluster healing and workload balancing

Yu Yang | Pinterest engineer, Data Engineering

At Pinterest, we use Kafka as the central message transport for data ingestion, stream processing and more. With a growing user base of more than 175 million people and an ever-expanding graph of more than 100 billion Pins, we currently run more than 1,000 Kafka brokers in the cloud.

At this scale, we encounter Kafka broker failures every week, and sometimes several in a single day. When a broker fails, the on-call engineer needs to replace the dead broker promptly to minimize the risk of data loss from further node failures. We also often need to move workloads among brokers to keep them balanced. Replacing brokers and rebalancing workloads requires carefully creating and editing partition reassignment files and manually executing Kafka script commands. These operations add significant overhead for the team.
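To give a sense of what the manual process involves, here is a minimal sketch of generating a partition reassignment file in the JSON format that Kafka's `kafka-reassign-partitions.sh` tool accepts. The topic name, broker IDs, and the `build_reassignment` helper are hypothetical and chosen only for illustration.

```python
import json


def build_reassignment(moves):
    """Build a Kafka partition reassignment document.

    moves: list of (topic, partition, replica_broker_ids) tuples, where
    replica_broker_ids is the full desired replica list for that partition.
    """
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
            for topic, partition, replicas in moves
        ],
    }


# Hypothetical scenario: broker 3 has died and broker 7 takes over its replicas.
plan = build_reassignment([
    ("events", 0, [1, 2, 7]),
    ("events", 1, [2, 7, 4]),
])
print(json.dumps(plan, indent=2))
```

The resulting file would then be passed to the reassignment tool with `--reassignment-json-file` and `--execute`. Doing this by hand for every affected partition is the overhead described above.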

To scale up Kafka service operations, we built DoctorKafka, a service for Kafka cluster auto-healing and workload balancing. DoctorKafka can detect broker failures and automatically reassign the workload of failed brokers to healthy ones. It can also balance workloads based on configured settings. Today we’re releasing DoctorKafka as an open source project on GitHub. In this post we’ll cover its architecture and how it can be useful to you.

High-level overview

Figure 1. High-level overview of DoctorKafka

Figure 1 shows the high-level overview of DoctorKafka. It’s composed of three parts:

  • A metrics collector that’s deployed on every broker. It regularly collects Kafka process and host metrics and sends them to a Kafka topic. We use Kafka as broker stats storage to simplify DoctorKafka’s setup and reduce its dependency on other systems.
  • A central DoctorKafka service that manages multiple clusters, analyzes broker stats metrics to detect broker failure and executes operation commands for cluster healing or workload balancing. DoctorKafka records the executed commands in another “Action Log” topic.
  • A web UI for browsing Kafka cluster status and viewing execution history. Figure 2 shows the UI for managing two test clusters. Figure 3 is a detailed view of one cluster.

Figure 2. DoctorKafka front page

Note that DoctorKafka only acts when it’s confident about which action to take. Otherwise, it sends an alert.

Figure 3. DoctorKafka cluster view

DoctorKafka in action

The metrics collector runs on each broker and collects Kafka broker metrics on inbound and outbound network traffic as well as the stats of each replica. Figure 4 shows part of the broker stats collected by a metrics collector. Topic partition reassignment usually incurs extra network traffic and distorts the metrics, even with replication quotas configured (available since Kafka 0.10.1). Because of this, the metrics collector explicitly reports whether a topic partition is involved in reassignment when the metric is collected.
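As a rough illustration of why the reassignment flag matters, the sketch below models a broker stats record and filters out replicas whose traffic was distorted by a reassignment. The class and field names are assumptions made for this example, not DoctorKafka’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicaStat:
    # Hypothetical per-replica record; field names are illustrative only.
    topic: str
    partition: int
    bytes_in_per_sec: float
    bytes_out_per_sec: float
    in_reassignment: bool = False  # set by the collector at sampling time


@dataclass
class BrokerStats:
    broker_id: int
    timestamp_ms: int
    replicas: List[ReplicaStat] = field(default_factory=list)


def usable_replica_stats(stats: BrokerStats) -> List[ReplicaStat]:
    """Keep only replicas whose traffic was not inflated by reassignment."""
    return [r for r in stats.replicas if not r.in_reassignment]


sample = BrokerStats(
    broker_id=1,
    timestamp_ms=1_500_000_000_000,
    replicas=[
        ReplicaStat("events", 0, 1200.0, 3400.0),
        ReplicaStat("events", 1, 9000.0, 100.0, in_reassignment=True),
    ],
)
clean = usable_replica_stats(sample)
```

Flagging at collection time means the consumer of the stats does not have to guess later which samples were taken while data was being copied between brokers.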

Figure 4. Broker stats collected by metrics collector

When the central DoctorKafka service starts, it first reads the broker stats from the past 24–48 hours. From these stats, DoctorKafka infers the resource requirements of each replica’s workload. As Kafka workloads are mostly network bound, DoctorKafka focuses only on replica network bandwidth usage.
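The post does not say exactly how the requirement is inferred. One simple possibility, shown here purely as an assumption, is to take the peak inbound and outbound bandwidth observed for each replica over the stats window. The function name and sample layout are hypothetical.

```python
from collections import defaultdict


def infer_replica_requirements(samples):
    """Estimate per-replica bandwidth needs as the peak observed rates.

    samples: iterable of (topic, partition, bytes_in_per_sec, bytes_out_per_sec)
    Returns {(topic, partition): (max_bytes_in, max_bytes_out)}.
    Using the maximum is an assumption for this sketch; a percentile or
    average could be substituted.
    """
    requirements = defaultdict(lambda: (0.0, 0.0))
    for topic, partition, bytes_in, bytes_out in samples:
        cur_in, cur_out = requirements[(topic, partition)]
        requirements[(topic, partition)] = (
            max(cur_in, bytes_in),
            max(cur_out, bytes_out),
        )
    return dict(requirements)


# Hypothetical samples covering two partitions of one topic.
window = [
    ("events", 0, 100.0, 300.0),
    ("events", 0, 250.0, 200.0),
    ("events", 1, 50.0, 75.0),
]
needs = infer_replica_requirements(window)
```

Sizing by peak rather than average is the conservative choice: a replica is only moved to a broker that could absorb its busiest observed period.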

After it starts, DoctorKafka periodically checks the status of each cluster. When it detects a broker failure, it reassigns the workload of the failed broker to other brokers with available bandwidth. If there aren’t enough resources in the cluster to reassign the workload, it sends an alert. Similarly, when DoctorKafka balances workloads, it identifies the brokers whose network traffic exceeds the configured limits, then either moves workload to brokers with less traffic or performs a preferred leader election to shift the traffic. To get more precise workload information, we ignore the broker stats collected during partition reassignment and the network traffic cool-down period.
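The healing step can be pictured as a placement problem. The following is a minimal greedy sketch, not DoctorKafka’s actual algorithm: it places each replica of the failed broker on the healthy broker with the most spare bandwidth, and returns `None` to signal that an alert should be raised when something does not fit. All names and numbers are hypothetical.

```python
def reassign_failed_broker(failed_replicas, capacity, hosts):
    """Greedily place a failed broker's replicas on healthy brokers.

    failed_replicas: {(topic, partition): required_bandwidth}
    capacity:        {broker_id: spare_bandwidth} for healthy brokers only
    hosts:           {(topic, partition): set of broker ids that already
                      hold a replica of that partition}
    Returns {(topic, partition): target_broker_id}, or None if any replica
    cannot be placed (the caller would send an alert instead of acting).
    """
    remaining = dict(capacity)
    plan = {}
    # Place the largest workloads first so they get the widest choice.
    for tp, need in sorted(failed_replicas.items(), key=lambda kv: -kv[1]):
        candidates = [
            broker
            for broker, free in remaining.items()
            if free >= need and broker not in hosts.get(tp, set())
        ]
        if not candidates:
            return None
        target = max(candidates, key=lambda broker: remaining[broker])
        remaining[target] -= need
        plan[tp] = target
    return plan


failed = {("events", 0): 30.0, ("events", 1): 20.0}
hosts = {("events", 0): {2}, ("events", 1): {1}}
plan = reassign_failed_broker(failed, {1: 40.0, 2: 25.0}, hosts)
no_room = reassign_failed_broker(failed, {1: 10.0}, hosts)
```

Returning no plan at all when any replica cannot be placed mirrors the behavior described above: act only when the whole operation is safe, and alert otherwise.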

DoctorKafka has been in production at Pinterest for months and helps us manage more than 1,000 Kafka brokers. We’re happy to release it to the community, and you can find its source code and design documentation on GitHub. Your feedback and comments are very welcome!

Open source is a big priority for engineers at Pinterest. Companies like YouTube, Google, Tinder, Snap and more use our open source technology to power app persistence, image downloading and more. For all our open source projects, check out our GitHub.

If you’re interested in open source or tackling infrastructure challenges like this at scale, join us!

Acknowledgements: Thanks to Jon Parise for helping me through the open source process.