Solving Manageability Challenges with Nuage

Matrix: The columns labeled Service1..N represent LinkedIn’s large-scale services, such as Kafka, Samza, Ambry, Venice, and Espresso. Each row represents a generic feature that the Nuage team has built to manage such massive systems efficiently.

Let’s dive deeper into some of the generic features that we have abstracted out and built over the last few years. Each of these features is scalable, fault-tolerant, adaptable, and pluggable across services, and each is used extensively by product teams within LinkedIn.

Resource provisioning
Nuage manages a large fleet of clusters, and a coordinated, aggregated view of these geographically distributed clusters is key to resource provisioning. Nuage provides a few default provisioning algorithms that suit most use cases. The algorithms are also designed to be extended with service-specific provisioning logic. For example, a customization can take into account the granular details of all hosted tenants and their expected rate of expansion when provisioning.
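The idea of a default algorithm that services can override can be sketched as follows. This is a minimal illustration, not Nuage's actual code: the `Cluster` fields, the strategy functions, and the `register_strategy` and `provision` helpers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Cluster:
    # All fields are hypothetical; units are arbitrary (GB here).
    name: str
    capacity_gb: float
    used_gb: float
    growth_gb: float = 0.0  # expected tenant growth over the planning window

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


# A strategy picks a cluster for a request of the given size, or None.
Strategy = Callable[[List[Cluster], float], Optional[Cluster]]


def most_free_space(clusters: List[Cluster], size_gb: float) -> Optional[Cluster]:
    """Default strategy: the fitting cluster with the most free space."""
    fits = [c for c in clusters if c.free_gb >= size_gb]
    return max(fits, key=lambda c: c.free_gb, default=None)


def growth_aware(clusters: List[Cluster], size_gb: float) -> Optional[Cluster]:
    """Service-specific strategy: also reserve room for tenant growth."""
    fits = [c for c in clusters if c.free_gb - c.growth_gb >= size_gb]
    return max(fits, key=lambda c: c.free_gb - c.growth_gb, default=None)


STRATEGIES: Dict[str, Strategy] = {"default": most_free_space}


def register_strategy(service: str, strategy: Strategy) -> None:
    STRATEGIES[service] = strategy


def provision(service: str, clusters: List[Cluster], size_gb: float) -> Optional[Cluster]:
    # Fall back to the default algorithm when a service has no override.
    strategy = STRATEGIES.get(service, STRATEGIES["default"])
    return strategy(clusters, size_gb)


clusters = [
    Cluster("az1", capacity_gb=100, used_gb=40, growth_gb=50),
    Cluster("az2", capacity_gb=100, used_gb=55, growth_gb=5),
]
register_strategy("espresso", growth_aware)
```

With these sample clusters, a service using the default strategy lands on `az1`, which has the most free space today, while the growth-aware override picks `az2`, because most of the free space in `az1` is already earmarked for existing tenants.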

Security (authentication and authorization)
When a new resource is created, Nuage mandates the creation of the necessary access control rules. Who can access the resource and who is allowed to change its rules are both set at creation time. If an unauthorized caller wants to access the resource, Nuage forwards the request to the respective resource owners for approval. This is achieved by an interceptor framework that analyzes all incoming requests and determines whether each request needs a formal review. Formal reviews are created within Nuage and sent to the resource’s owner, along with a ticket that captures the required information. Critical resources are reviewed and approved with further due diligence by the Security Organization. The routers of the different services frequently poll and consume these rules to enforce authorization at scale. Resource owners can also monitor real-time authorization decisions, grouped by the number of callers that have been allowed or denied over the last few weeks. Such real-time stats improve the operability and debuggability of many production services.
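The interceptor decision described above can be sketched roughly as follows. The class names, the `ALLOW` and `REVIEW` outcomes, and the `security-org` reviewer label are assumptions made for illustration; the source does not describe the real interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Resource:
    # Hypothetical shape of a resource and its access control rules.
    name: str
    owners: Set[str]
    allowed_callers: Set[str] = field(default_factory=set)
    critical: bool = False


@dataclass
class Review:
    resource: str
    caller: str
    reviewers: List[str]


class AccessInterceptor:
    """Inspects each request and decides whether a formal review is needed."""

    def __init__(self, resources: Dict[str, Resource]) -> None:
        self.resources = resources
        self.pending: List[Review] = []

    def intercept(self, caller: str, resource_name: str) -> str:
        resource = self.resources[resource_name]
        if caller in resource.owners or caller in resource.allowed_callers:
            return "ALLOW"
        # Unauthorized callers are routed to the owners for approval.
        reviewers = sorted(resource.owners)
        if resource.critical:
            # Assumed label for the extra security review of critical resources.
            reviewers.append("security-org")
        self.pending.append(Review(resource_name, caller, reviewers))
        return "REVIEW"


interceptor = AccessInterceptor({
    "member-db": Resource("member-db", owners={"alice"},
                          allowed_callers={"feed-service"}, critical=True),
})
```

In this sketch a known caller such as `feed-service` is allowed immediately, while any other caller produces a pending `Review` addressed to the owners and, for a critical resource, to the security reviewers as well.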

Workflow management
Since Nuage is a management portal for a spectrum of complex services, some operations can take anywhere from a few minutes to several days to complete. Workflows are triggered either by user requests or by cron jobs, and they often perform CPU- or IO-intensive work. This means that we need a distributed, asynchronous workflow management solution. Nuage uses the Helix Task Framework for workflow management, and a workflow can be represented as a Directed Acyclic Graph (DAG) of jobs. Jobs, in turn, contain one or more tasks, where a task is the most granular runnable entity. The framework provides load distribution, a cap on the number of tasks executed per instance, retries with delayed execution, task prioritization, and instance grouping to assign tasks to a designated group of instances.
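To make the workflow, job, and task hierarchy concrete, here is a small sketch of running a DAG of jobs in dependency order with simple retries. It uses Python's standard `graphlib` module rather than the Helix Task Framework (a Java library), runs everything in one process, and omits the retry delay; the job names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_with_retries(task, max_retries):
    """Run one task, retrying on failure up to max_retries extra times."""
    for attempt in range(max_retries + 1):
        try:
            task()
            return
        except Exception:
            if attempt == max_retries:
                raise


def run_workflow(dag, tasks_by_job, max_retries=2):
    """Run jobs in dependency order; each job runs its tasks sequentially."""
    completed = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Jobs returned together have no unmet dependencies; a distributed
        # framework could run them in parallel. Sorted for a stable order.
        for job in sorted(sorter.get_ready()):
            for task in tasks_by_job.get(job, []):
                run_with_retries(task, max_retries)
            completed.append(job)
            sorter.done(job)
    return completed


log = []
calls = {"allocate": 0}


def allocate():
    # Fails once to exercise the retry path, then succeeds.
    calls["allocate"] += 1
    if calls["allocate"] == 1:
        raise RuntimeError("transient failure")
    log.append("allocate")


# Each job maps to the set of jobs it depends on (hypothetical workflow).
workflow = {
    "validate": set(),
    "allocate": {"validate"},
    "configure_acls": {"validate"},
    "notify": {"allocate", "configure_acls"},
}
tasks_by_job = {
    "validate": [lambda: log.append("validate")],
    "allocate": [allocate],
    "configure_acls": [lambda: log.append("configure_acls")],
    "notify": [lambda: log.append("notify")],
}
order = run_workflow(workflow, tasks_by_job)
```

Here `notify` runs only after both `allocate` and `configure_acls` have completed, and the `allocate` task succeeds on its second attempt.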

Nuage also supports a special kind of workflow that converts any service call into a reviewable approval process. This satisfies additional security checks and allows for human intervention in cases where it is mandated by the in-house security team of LinkedIn.

Quota enforcement
Most services at LinkedIn have pre-defined SLAs and are provisioned with limited capacity. To stay within those SLAs, most complex services adopt quota enforcement. Most applications at LinkedIn are built with the framework, an open source REST framework for building scalable RESTful backend servers. Nuage enables any application to enforce quotas on its callers through a quota throttling system. Product teams configure the quota values via Nuage, and the values are persisted in a datastore. The quota enforcement system is built as a library that is consumed and run by the routers and application servers that need quota enforcement. The library caches the relevant quotas, fetching the initial values from the storage layer during bootstrap. For live updates, Nuage writes to ZooKeeper, and the library, which listens for changes on the relevant ZooKeeper nodes, is notified via a callback and refreshes the target quota almost instantaneously. Throttling can be configured with different parameters, such as the aggregation window, percentiles, and a few others. Enforcement can be based either on the cost of a request or simply on the request count. Adoption within LinkedIn has been huge: most engineering teams have a clearly defined SLA and manage their interservice calls through the quota enforcement system powered by Nuage.
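A rough sketch of such a library is shown below: quotas are loaded once at bootstrap, cached locally, refreshed through a callback, and enforced either by count or by cost. The fixed-window algorithm, the `QuotaEnforcer` class, and the `on_quota_changed` method are assumptions for illustration. In particular, the callback stands in for a ZooKeeper watch, and callers without a configured quota are denied, which is a design choice of this example rather than something the source states.

```python
import time
from typing import Callable, Dict, Tuple


class QuotaEnforcer:
    """Fixed-window throttler with locally cached, live-updatable quotas."""

    def __init__(self, initial_quotas: Dict[str, float],
                 window_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Bootstrap: in a real system these values would be fetched
        # from the storage layer.
        self._quotas = dict(initial_quotas)
        self._window = window_seconds
        self._clock = clock
        self._usage: Dict[str, Tuple[float, float]] = {}  # caller -> (start, used)

    def on_quota_changed(self, caller: str, new_quota: float) -> None:
        """Callback for live updates (stand-in for a ZooKeeper watch)."""
        self._quotas[caller] = new_quota

    def allow(self, caller: str, cost: float = 1.0) -> bool:
        """Return True if the request fits in the caller's current window.

        With the default cost of 1.0 this is count-based enforcement;
        passing a per-request cost gives cost-based enforcement.
        """
        now = self._clock()
        start, used = self._usage.get(caller, (now, 0.0))
        if now - start >= self._window:
            start, used = now, 0.0  # a new aggregation window begins
        if used + cost > self._quotas.get(caller, 0.0):
            self._usage[caller] = (start, used)
            return False
        self._usage[caller] = (start, used + cost)
        return True
```

A caller with a quota of 2 per window is throttled on its third request. If the quota is then raised to 3 through `on_quota_changed`, the next request in the same window is admitted without restarting the server.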

Resource usage profiling and prediction
Understanding the current usage of a database or cluster empowers the management layer in several ways. It plays a key role in capacity management, provisioning new resources, identifying over-provisioned and under-provisioned resources, generating cost-to-serve reports for services, calculating headroom and recommending specific capacity increases or decreases, tracking availability, and reporting overall usage to service SREs on a regular basis. To achieve this, Nuage collects the raw usage data, decorates it with mathematical models tailored to various needs, and then stores the information in a way that allows fast access. This space has many challenges, including profiling accuracy, aggregated and coordinated usage of multi-tenant clusters, and avoiding staleness in cached data. Within LinkedIn, Resource Intelligence is a fairly new initiative focused on the cost and capacity intelligence required to meet our rapid business expansion in a cost-effective way. Nuage also plays a significant role by automatically triggering remedial and cost-cutting actions based on the usage data. These actions include automatically adjusting quotas based on usage patterns, retiring unused resources, and evaluating the resource owner’s capacity needs in terms of business value.
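As a simple illustration of a headroom calculation with a capacity recommendation, consider the sketch below. The formula (unused fraction of capacity at peak usage) and the thresholds are assumptions chosen for the example; the source does not describe the models Nuage actually uses.

```python
from typing import Tuple


def headroom(capacity: float, peak_usage: float) -> float:
    """Fraction of capacity left unused at peak usage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - peak_usage) / capacity


def recommend(capacity: float, peak_usage: float,
              low: float = 0.2, high: float = 0.6,
              target: float = 0.4) -> Tuple[str, float]:
    """Suggest an action and a capacity that restores the target headroom.

    low, high, and target are hypothetical thresholds: below `low` the
    resource is treated as under-provisioned, above `high` as
    over-provisioned.
    """
    h = headroom(capacity, peak_usage)
    if low <= h <= high:
        return ("keep", capacity)
    # Solve (suggested - peak_usage) / suggested = target for suggested.
    suggested = peak_usage / (1.0 - target)
    return ("increase" if h < low else "decrease", suggested)
```

For a resource with capacity 100 and peak usage 90, headroom is 10 percent, so the sketch recommends increasing capacity to roughly 150. With peak usage of 20, it recommends shrinking capacity to roughly 33.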

Cluster metadata management
Nuage manages cluster metadata for a fleet of services, where each cluster is geographically distributed, replicated, fault-tolerant, scalable, and highly available. To achieve these design goals, most services manage their clusters using Apache Helix, a generic cluster management solution. Nuage tracks many cluster attributes, such as the list of all advertising availability zones (AZs), the possible data flows among AZs, the types of supported workloads (small, medium, bursty), the number of hosts, and the maximum available capacity per AZ. These configurations are dynamic and are controlled by the underlying services. Nuage stays up to date by listening to cluster changes, since these parameters directly affect many decisions, including provisioning, reporting, alerting, usage tracking, and visualization of data flows between AZs.
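The attributes listed above, together with the listen-for-changes behavior, could be modeled along these lines. The field names, the `MetadataStore` class, and its methods are hypothetical; a real deployment would receive change notifications from the underlying cluster manager rather than from a direct method call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class ClusterMetadata:
    # Hypothetical representation of the attributes described in the text.
    name: str
    hosts_per_az: Dict[str, int] = field(default_factory=dict)
    max_capacity_per_az: Dict[str, float] = field(default_factory=dict)
    data_flows: Set[Tuple[str, str]] = field(default_factory=set)  # (source AZ, destination AZ)
    workloads: Set[str] = field(default_factory=set)  # e.g. small, medium, bursty

    def availability_zones(self) -> List[str]:
        return sorted(self.hosts_per_az)

    def total_hosts(self) -> int:
        return sum(self.hosts_per_az.values())


class MetadataStore:
    """Keeps the latest metadata per cluster and notifies subscribers."""

    def __init__(self) -> None:
        self._clusters: Dict[str, ClusterMetadata] = {}
        self._listeners: List[Callable[[ClusterMetadata], None]] = []

    def subscribe(self, listener: Callable[[ClusterMetadata], None]) -> None:
        self._listeners.append(listener)

    def on_cluster_change(self, metadata: ClusterMetadata) -> None:
        # Replace the cached copy, then let dependents react, for example
        # provisioning, alerting, or data-flow visualization.
        self._clusters[metadata.name] = metadata
        for listener in self._listeners:
            listener(metadata)

    def get(self, name: str) -> ClusterMetadata:
        return self._clusters[name]
```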

Future extension

We are still shifting Nuage from a reactive model to a more proactive one that relies on automated, data-driven decisions. This requires collecting real-time statistics from different systems within LinkedIn, as well as aggregating them and experimenting with many algorithms to make informed decisions on provisioning, capacity management, headroom calculation for a resource, warning notifications, and failure detection and recovery. As the adoption of Nuage increases, we also want to guarantee its availability, since it is becoming critical to the business. We are exploring new strategies to operate seamlessly even when many dependent components fail. This requires the management layer to automatically initiate self-healing workflows that recover from most failures, and to keep operating during partial system failures.


Special thanks to the stellar team for their tireless contributions: Vishal Gupta, Terry Fu, Ji Ma, Yifang Liu, Changran Wei, Yinlong Su, Darby Perez, Tyler Corley, and Micah Stubbs. Thanks to the leadership team for their continued support and investment in boosting productivity across all engineering teams: Mohamed Battisha, Eric Kim, Parin Shah, Josh Walker, and Swee Lim.

A huge shout-out to all our partner teams who helped us in extending the platform across many services: Kafka, Espresso, Ambry, Quotas, Graph, Samza, Pinot, Venice, DataVault, and Search.
