Partition assignments are critical to a typical distributed data system. A partition’s replicas could be in different states. For example, in the above graph, each partition has three replicas; one of them is the Primary replica, while the other two are Backup replicas.
Apache Helix is a generic cluster management framework used for the automatic management of partitioned and replicated distributed systems hosted on a cluster of nodes. Helix automates the reassignment of partitions in the face of node failure and recovery, cluster expansion, and reconfiguration. For a detailed description of the architecture of the Apache Helix system and how it can be used to manage distributed systems, visit the website. You can also read about Helix’s concepts and design philosophy here.
Helix was originally developed within LinkedIn and is now widely used within and outside LinkedIn to manage critical production services that requires fault tolerance and expansion without any downtime. Within LinkedIn, Helix has been used to build and manage many of our distributed data systems and production services, such as: Espresso, LinkedIn’s horizontally-scalable document store for primary data; Venice, our derived data serving platform for merging batch and streaming data sources; Ambry, LinkedIn’s distributed object store; Gobblin, a unified ingestion framework; and our real-time analytics platforms like Pinot. Helix has also been adopted by several companies, notable for Uber’s cross-datacenter Kafka replicator, Instagram’s IGDM, Box’s Box Notes, Turn’s high performance key value store, clustering in jBPM, and Pinterest’s Terrapin.
As we discussed above, the critical function of Helix is to manage the resources and partitions of the data systems. Helix internally employs a rebalancer workflow to determine how to allocate a resource’s partitions to nodes as the cluster topology changes. To manage a set of resources broken into partitions and replicas as nodes become available or unavailable, the Helix rebalancer can be plugged into a different strategy (we call it “rebalance strategy”) to determine the partition-to-node mapping given the current cluster state. The default rebalance strategy Helix had previously was a simple hash-based heuristic strategy.
However, we are moving to data centers with single top-of-the-rack switches, which introduce a single point of failure wherein the loss of a switch effectively means the loss of all machines in that rack. More generally, when running a cluster of nodes across multiple racks or multiple availability zones, the Helix partition placement logic needs to understand the cluster topology in order to perform replica placement such that the loss of a rack or an availability zone does not result in data unavailability or a service disruption.
In addition to topology-aware partition placements, there are a few other requirements we believe are important to Helix’s partition placement logic.
Even distribution of partitions. Partitions for a data resource should be evenly distributed among a cluster of machines, and ideally, replicas in the same state should be evenly distributed within these machines, too. An uneven distribution of partitions (thus data) could result in unbalanced service distribution, e.g., some machines get a higher number of requests than others, which could end up leading to service disruption or lower utilization.
Handle nodes with different processing capability. In our legacy default partition management logic, all machines within the same cluster are treated as having equal capacity. However, in real scenarios, machines in a cluster may not be homogeneous; some of them may be able to handle more traffic and data than others. The partition management logic needs to take into account these capacity differences when distributing the partitions.
Minimized partition movements during topology changes. As nodes become available or unavailable, or during topology changes (adding a node, removing a node, or adding new rack, etc.), the partitions may be reshuffled among other nodes to keep the resources load-balanced across the cluster. However, for a stateful distributed data service such as a data store, the cost of moving a replica to a different node could be high, since it usually involves bootstrapping data from other replicas or a remote backup. Therefore, the partition management logic needs to minimize these movements as much as possible.
In this post, we will discuss the topology-aware rebalance strategy we have recently introduced into Helix’s full-auto rebalancer, as well as how these features achieve the requirements we have just described.
Helix’s rebalancer is one of its most critical components. It is tasked with determining how to allocate resources to nodes as the cluster topology changes. To manage a set of resources broken into partitions and replicas as nodes become available or unavailable, the Helix rebalancer needs to determine the best partition-to-node mapping given the current cluster state, and to move partitions around nodes if needed to meet all the constraints while ensuring overall load distribution across the cluster.
The Helix rebalancer employs a rebalance workflow to compute the partition-node assignment and generate partition movement plans for a Helix-managed cluster. The graph below shows each individual task that will be executed during the rebalance workflow. Some of the important tasks include:
Gather current cluster state: This task is to collect all of the required cluster states, including instance liveness state, current replica state, and system configurations.
Compute IdealState: After collecting the required cluster states, the rebalancer computes the new IdealState for each resource that satisfies all constraints. Helix’s rebalancer is designed to be able to plug in different rebalance strategies when computing the IdealState of the system.
Generate partition movement and state transition plan: In this step, the CurrentState is compared with the IdealState. If there is any difference, Helix will determine the partition movement, and thus the state transition plan, for each resource.
Compute difference: Once the IdealState is computed, Helix looks up the state machine and computes the transitions required to take the system from its current state to IdealState.
Apply constraints: Blindly issuing all the transitions computed in the previous step might result in violating constraints; for example, transitioning a master replica to a slave should happen before its counterpart slave to master transition in another node in order to make sure there is only at most one master at a time. Also, there could be certain priority among different resources and throttling that is set by the application. Helix computes the subset of the transitions such that executing them in any order will not violate the system constraints.
Generate and issue state transition requests: This is the last task in the workflow, where the state transition requests are sent to the participants. On receiving the request, the Helix client agent that is embedded in the participant invokes the appropriate transition handlers provided by the application. Once the transition completes, the success/failure is reflected in the CurrentState. The Helix controller reads the CurrentState and re-runs the rebalance workflow until the CurrentState converges with IdealState. Once they are identical to each other, the system is considered to be in a stable state. Having this simple goal of computing the IdealState and issuing appropriate transitions such that CurrentState converges with IdealState makes Helix controller-generic.