As stated above, Espresso uses targeted jobs for backups. Helix UI is very useful for spotting any abnormalities in the execution of jobs. A user can easily access Helix UI to monitor the status of existing workflows and jobs and perform simple operations. Furthermore, Helix UI supports authentication, operation authorization, and audit logging, which adds a layer of security for distributed applications built with Helix.
Recent performance and stability improvements
Minimizing redundant ZNode creation
IdealState and ExternalView are the metadata reflecting the target state and current state of a Helix-defined resource. In older implementations of Task Framework, a job was treated as a Helix resource, like a DB partition, for example. This meant that an IdealState ZNode and an ExternalView ZNode were generated for every job at creation. This was problematic because Task Framework jobs and generic Helix resources are different in nature: jobs tend to be short-lived, transient entities that are created and removed frequently, while generic Helix resources tend to be enduring entities that stay and continue to serve requests. The creation and removal of so many ZNodes was therefore proving costly and no longer appropriate for jobs. These ZNodes would stay in ZooKeeper until the specified data expiry time (usually one day or longer). It was even worse for recurrent jobs—one set of IdealState/ExternalView ZNodes was created for the original job template, and another set would be created for each scheduled run of the job.

An improvement was made so that the IdealState ZNode of a job is only generated when the job is scheduled to run and is removed immediately once the job completes. In practice, there should be only a few jobs running concurrently for each workflow, so only a small number of IdealState ZNodes exist for each workflow at any given time. In addition, the job's ExternalView ZNode is no longer created by default.
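To make the lifecycle concrete, here is a minimal in-memory sketch of the improved behavior. The class and map below are hypothetical stand-ins (the map plays the role of ZooKeeper), not Helix's actual implementation: a job's IdealState entry is created only when the job is scheduled and deleted as soon as it completes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the improved job IdealState lifecycle.
public class JobZnodeLifecycle {
    // Stands in for ZooKeeper: path -> serialized IdealState
    private final Map<String, String> znodeStore = new HashMap<>();

    public void scheduleJob(String jobName) {
        // The IdealState ZNode is created lazily, at scheduling time
        znodeStore.put("/IDEALSTATES/" + jobName, "RUNNING");
    }

    public void completeJob(String jobName) {
        // The ZNode is removed immediately on completion, instead of
        // lingering until the data expiry time (one day or longer)
        znodeStore.remove("/IDEALSTATES/" + jobName);
    }

    public int liveZnodeCount() {
        return znodeStore.size();
    }
}
```

With this lifecycle, the number of live IdealState ZNodes tracks only the currently running jobs, rather than every job created within the expiry window.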
Problem with scheduling recurring jobs
We discovered that the timer was not stable when scheduling recurring jobs from recurrent workflows. This was mostly because we set a timer for each job when it was added to a recurrent job queue, and maintaining a large set of timers for all current and future jobs was prone to bugs. In addition, during the Helix Controller's leadership hand-off, these timers were not properly transferred to the new leader Controller. This was fixed by having Task Framework set a timer for each workflow instead of for each job. Helix now has far fewer timers to track, and during leadership hand-off, the new leader Controller scans all existing workflows and resets the timers appropriately.
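The per-workflow timer idea can be sketched as follows; the class name and methods are hypothetical, not Helix's actual API. One entry is kept per recurrent workflow regardless of how many jobs it schedules, and a new leader Controller can rebuild the whole registry by scanning the existing workflows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one timer per workflow instead of one per job.
public class WorkflowTimerRegistry {
    // workflow name -> next fire time (epoch millis)
    private final Map<String, Long> timers = new HashMap<>();

    public void registerWorkflow(String workflow, long nextFireTimeMs) {
        // One entry per workflow, no matter how many jobs it schedules;
        // re-registering simply overwrites the previous fire time
        timers.put(workflow, nextFireTimeMs);
    }

    // Invoked by the new leader Controller after a leadership hand-off:
    // rebuild all timers from the surviving workflow configs
    public void resetFromWorkflows(Map<String, Long> existingWorkflows) {
        timers.clear();
        timers.putAll(existingWorkflows);
    }

    public int timerCount() {
        return timers.size();
    }
}
```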
Task metadata accumulation
We observed that task metadata (ZNodes) quickly piled up in ZooKeeper when Task Framework was under heavy load, continuously assigning tasks to nodes. This was affecting performance and scalability. Two fixes were proposed and implemented: periodic purging of job ZNodes and batch reading of metadata. Periodically deleting jobs in terminal states shortened the metadata preservation cycle, which reduced the amount of task metadata present at any given point of execution. Batch reads reduced the volume and frequency of read traffic, eliminating the overhead of redundant reads from ZooKeeper.
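A simplified sketch of the two fixes, using a hypothetical data model rather than Helix's actual metadata classes: jobs in terminal states are dropped on each purge pass, and the remaining states are returned in a single batch call instead of one read per ZNode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of periodic purging and batch reads of job metadata.
public class JobPurger {
    public enum State { IN_PROGRESS, COMPLETED, FAILED }

    // Stands in for the job ZNodes kept in ZooKeeper
    private final Map<String, State> jobs = new HashMap<>();

    public void recordJob(String name, State state) {
        jobs.put(name, state);
    }

    // Periodic purge: drop all jobs in terminal states, shortening the
    // metadata preservation cycle; returns how many were deleted
    public int purgeTerminalJobs() {
        int before = jobs.size();
        jobs.values().removeIf(s -> s != State.IN_PROGRESS);
        return before - jobs.size();
    }

    // Batch read: one call returns a snapshot of all remaining states,
    // instead of issuing one read per job ZNode
    public List<State> batchReadStates() {
        return new ArrayList<>(jobs.values());
    }

    public int liveJobCount() {
        return jobs.size();
    }
}
```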
Next steps for Task Framework
Restructuring metadata hierarchy in ZooKeeper
As discussed in previous sections of this post, the performance of Task Framework is directly affected by the number of ZNodes present in ZooKeeper. Helix stores ZNodes for workflow and job configurations and contexts (the current status of workflows/jobs) in one flattened directory. This means that every data change, in theory, triggers a read of all ZNodes in the directory, which can cause a catastrophic slowdown when the number of ZNodes under the directory is high. Although small improvements like batch reads have alleviated the problem, we identified that the root of the issue is the way ZNodes are stored. In the immediate future, a new ZNode structure will be introduced to Helix so that Task Framework ZNodes reflect the hierarchical nature of workflows, jobs, and tasks. This will greatly reduce ZooKeeper read and write latency, enabling Task Framework to execute more tasks faster.
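The difference between the two layouts can be illustrated with a hypothetical path scheme and a back-of-the-envelope read cost; none of the paths below are Helix's actual ZNode paths. The point is that a hierarchical layout scopes a read to one workflow's subtree, while a flat directory forces reads over every sibling.

```java
// Hypothetical illustration of flat vs. hierarchical metadata layout.
public class MetadataLayout {
    // Flat layout: every workflow/job entry is a sibling in one directory
    public static String flatPath(String entry) {
        return "/TaskFramework/" + entry;
    }

    // Hierarchical layout: ZNodes mirror the workflow -> job nesting
    public static String hierarchicalPath(String workflow, String job) {
        return "/workflows/" + workflow + "/jobs/" + job;
    }

    // Flat layout: a change may trigger a read of the whole directory
    public static int flatReadCost(int totalZnodes) {
        return totalZnodes;
    }

    // Hierarchical layout: only the affected workflow's subtree is read
    public static int hierarchicalReadCost(int znodesInSubtree) {
        return znodesInSubtree;
    }
}
```

For example, with 10,000 job ZNodes in one flat directory, a single change may touch all 10,000, while a hierarchical layout touches only the handful of ZNodes under the affected workflow.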
More advanced distribution algorithm and strategy
Task Framework uses Consistent Hashing to compute an assignment of tasks to available nodes. There are two ways in which task distribution could be improved. First, the Helix Controller currently computes an assignment of tasks in every run of its pipeline. Note that this pipeline is shared with Helix's generic resource management, which means that some runs of the pipeline may have nothing to do with Task Framework, causing the Helix Controller to compute a task assignment for naught. In other words, we have observed a fair amount of redundant computation. Second, Consistent Hashing might not be the appropriate assignment strategy for tasks. Intuitively, matching tasks with nodes should be simple: whenever a node is able to take on a task, it should do so as soon as possible. With Consistent Hashing, you might see some nodes busy executing tasks while other nodes sit idle.
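For illustration, here is a minimal consistent-hashing assigner; this is a generic sketch, not Helix's implementation. Note that the assignment depends only on hash positions on the ring, never on which nodes happen to be idle, which is exactly the imbalance described above.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing sketch: nodes sit on a hash ring, and each
// task is assigned to the first node at or after the task's hash.
public class ConsistentHashAssigner {
    // ring position -> node name
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // Mask to a non-negative ring position
        ring.put(node.hashCode() & 0x7fffffff, node);
    }

    public String assign(String task) {
        int h = task.hashCode() & 0x7fffffff;
        // Walk clockwise to the next node; wrap around if past the end.
        // The choice ignores node load entirely, so a few nodes can end
        // up with most of the tasks while others sit idle.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```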
We have identified the producer-consumer pattern as a more appropriate model for distributing tasks over a set of nodes—the producer being the Controller, and the set of available nodes being the consumers. We believe that this new distribution strategy will greatly increase the scalability of Task Framework.
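A minimal sketch of the proposed model, with hypothetical names: the controller acts as the producer, enqueueing tasks onto a shared queue, and each node is a consumer thread that pulls the next task the moment it is free, so no node idles while work remains.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of producer-consumer task distribution.
public class ProducerConsumerDispatch {
    // Returns the total number of tasks executed across all nodes
    public static int run(int taskCount, int nodeCount) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        // Producer (the controller) enqueues every task up front
        for (int t = 0; t < taskCount; t++) {
            queue.add(t);
        }
        AtomicInteger executed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(nodeCount);
        // Consumers (the nodes): each free node pulls the next task
        for (int i = 0; i < nodeCount; i++) {
            new Thread(() -> {
                while (queue.poll() != null) {
                    executed.incrementAndGet(); // "execute" the task
                }
                done.countDown();
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return executed.get();
    }
}
```

Because each node takes work only when it is ready for more, throughput naturally tracks node availability instead of being fixed by hash positions.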
Task Framework as an independent framework
Helix started as a framework for generic resource/cluster management, and Task Framework was developed by treating jobs as a special type of resource with its own state model. However, some LinkedIn users now use only the Task Framework functionality of Helix. Task Framework has seen its share of growth, and to meet the ever-increasing scalability needs of its users, we have decided that its separation from generic resource management is inevitable.
The separation work is threefold: 1) reduce resource competition; 2) remove needless redundancy; and 3) remove deployment dependency. The work of both Task Framework and the generic resource management framework is managed by a single, central scheduler: the Helix Controller. That means a single Controller runs in one JVM with no isolation, and we cannot prevent a slowdown in resource management from affecting the assignment and scheduling of tasks, and vice versa. In other words, there is resource competition between the two entities. In this sense, we need to split the single Helix Controller into two separate Controllers running independently—a Helix Controller and a Task Framework Controller.
This separation naturally solves the problem of redundant computation in the pipeline mentioned in the previous section: changes in generic resources will no longer trigger unnecessary pipeline runs in the Task Framework Controller. We also plan to extend this separation to the deployment/repository level so that new deployments and rollbacks do not affect each other. We believe this will not only improve the overall performance of both components but also improve the experience for developers.
In this post, we have explored Task Framework, a growing component of Apache Helix that helps you manage distributed tasks. We have also discussed the current limitations of the system and the improvements we are making to position Helix’s Task Framework as a reliable piece of distributed infrastructure here at LinkedIn as well as for others in the open source community.
We would like to thank our engineers Jiajun Wang and Harry Zhang and our SRE, Eric Blumenau. Also, thanks to Yun Sun, Wei Song, Hung Tran, Kuai Yu, and other developers and SREs from LinkedIn's data infrastructure for their invaluable input and feedback on Helix. In addition, we want to thank Kishore Gopalakrishna, other committers from the Apache Helix community, and alumni Eric Kim, Weihan Kong, and Vivo Xu for their contributions. Finally, we would like to acknowledge the constant encouragement and support from our management: Lei Xia, Ivo Dimitrov, and Swee Lim.