Rethinking Netflix’s Edge Load Balancing – Netflix TechBlog – Medium

By Mike Smith, Netflix Cloud Gateway

We briefly touched on some of the load balancing improvements we’ve recently been making in our Open Sourcing Zuul 2 post. In this post, we’ll go into more detail on the whys, hows and results of that work.


On the Netflix Cloud Gateway team we are always working to help systems reduce errors, gain higher availability, and improve Netflix’s resilience to failures. We do this because even a low rate of errors at our scale of over a million requests per second can degrade the experience for our members, so every little bit helps.

So we set out to take our learnings from Zuul, plus those from other teams, and improve our load balancing implementation to further reduce errors caused by overloaded servers.


In Zuul, we’ve historically used the Ribbon load balancer with a round-robin algorithm and some filtering mechanisms for blacklisting servers that have a high connection failure rate.

Over the years we’ve gone through a couple of improvements and customizations designed to send less traffic to servers that have recently launched, to attempt to avoid overloading them. These have made significant improvements, but for some particularly troublesome origin clusters, we would still see much higher load-related error rates than desirable.

If all servers in a cluster are overloaded, then there’s little improvement we can make in choosing one server over another, but we often see situations where only a subset of servers are overloaded. For example:

  • Cold servers post-startup (during red-black deployments and auto-scaling events).
  • Servers temporarily slowing/blocking due to staggered dynamic property/script/data updates or large GC events.
  • Bad server hardware. We’ll often see some servers permanently running slower than others, whether due to noisy neighbors or different hardware.

Guiding Principles

It can be useful to have some principles in mind when starting on a project, to help guide us with the flood of small and large decisions we need to make daily while designing software. Here are some that were used on this project.

Work within the constraints of the existing load balancer framework

We had coupled our previous load-balancing customizations to the Zuul codebase, which had precluded us from sharing these with other teams at Netflix . So this time we made a decision to accept the constraints and additional investment needed, and to design with reuse in mind from the get-go. This makes adoption in other systems more straightforward, and reduces the chances of reinventing-the-wheel.

Apply learnings from others

Try to build on top of the ideas and implementations of others. For example, the choice-of-2 and probation algorithms which were previously trialled at Netflix in other IPC stacks.

Avoid Distributed State

Prefer local decision-making to avoid the resiliency concerns, complexities and lag of coordinating state across a cluster.

Avoid Client-Side Configuration and Manual Tuning

Our operational experience over the years with Zuul has demonstrated that situating parts of a services’ configuration into a client service not owned by the same team … causes problems.

One problem is that these client-side configurations tend to either get out of sync with the changing reality of the server-side, or introduce coupling of change-management between services owned by different teams.

For example, the EC2 instance type used for Service X is upgraded, causing fewer nodes to be needed for that cluster. So now the “maximum number of connections per host” client-side configuration in Service Y should be increased to reflect the new increased capacity. Should the client-side change be made first, or the server-side change, or both at the same time? More likely than not, the setting gets forgotten about altogether and causes even more problems.

When possible, instead of configuring static thresholds, use adaptive mechanisms that change based on the current traffic, performance and environment.

When static thresholds are required, rather than have service teams coordinate threshold configurations out to each client, have the services communicate this at runtime to avoid the problems of pushing changes across team boundaries.

Load-Balancing Approach

An overarching idea was that while the best source of data for a servers’ latency is the client-side view, the best source of data on a servers’ utilization is from the server itself. And that combining these 2 sources of data should give us the most effective load-balancing.

We used a combination of mechanisms that complemented each other, most of which have been developed and used by others previously, though possibly not combined in this manner before.

  • A choice-of-2 algorithm to choose between servers.
  • Primarily balance based on the load balancers’ view of a servers’ utilization.
  • Secondarily balance based on the servers’ view of its utilization.
  • Probation and server-age based mechanisms for avoiding overloading newly launched servers.
  • Decay of collected server stats to zero over time.

Combining Join-the-Shortest-Queue with Server-Reported Utilization

We chose to buttress the Join-the-shortest-queue(JSQ) algorithm that is commonly used, with a 2nd algorithm based on the server-reported utilization to try and give us the best of both.

Problems with JSQ
Join-the-shortest-queue works very well for a single load-balancer, but has a significant problem if used in isolation across a cluster of load balancers. The problem is that the load balancers will tend to herd and all choose the same low-utilized servers at the same time, thus overloading them, and then moving on to the next least-utilized and overloading that, and onward …

This can be resolved by using JSQ in combination with the choice-of-2 algorithm. This mostly removes the herding problem, and works well except for some deficiencies around each load balancer not having a complete picture of server utilization.

JSQ is generally implemented by counting the number of in-use connections to a server from only the local load balancer, but when there are 10’s-100’s of load balancer nodes, the local view can be misleading.

A single load balancers’ view can be very different from the actual reality

For example in this diagram, load balancer A has 1 inflight request to server X and 1 to server Z, but none to server Y. So when it receives a new request and has to choose which server is least utilized — from the data it has available to it — it will choose server Y. This is not the correct choice though — server Y is actually the most utilized, as two other load balancers currently have inflight requests to it, but load balancer A has no way of knowing that.

This illustrates how a single load balancers’ viewpoint can be entirely different from the actual reality.

Another problem we experience with relying only on the client-side view, is that for large clusters — particularly when combined with low traffic — a load balancer often has only a few in-use connections to a subset of the servers in a pool of hundreds. So when it’s choosing which server is least-loaded, it can often just be choosing between zero and zero — ie. it has no data on the utilization of either of the servers it’s choosing between, so has to just guess randomly.

One solution to this problem could be to share the state of each load balancers’ inflight counts with all other load balancers … but then you have a distributed state problem to solve.

We generally employ distributed mutable state only as a last resort, as the value gained needs to outweigh the substantial costs involved:

  • Operational overhead and complexity it adds to tasks like deployments and canarying.
  • Resiliency risks related to the blast radius of data corruptions (ie. bad data on 1% of load balancers is an annoyance, bad data on 100% of them is an outage).
  • The cost of either implementing a P2P distributed state system between the load balancers, or the cost of operating a separate database with the performance and resiliency credentials needed to handle this massive read and write traffic.

An alternative simpler solution — and one that we’ve chosen — is to instead rely on the servers reporting to each load balancer how utilized they are …

Server-Reported Utilization
Using each server’s viewpoint on their utilization has the advantage that it provides the aggregate of all load balancers that are using that server, and therefore avoids the JSQ problem of an incomplete picture.

There were 2 ways we could have implemented this — either:

  1. Actively poll for each servers’ current utilization using health-check endpoints.
  2. Passively track responses from the servers annotated with their current utilization data.

We chose the 2nd option, as it was simple, allowed for frequent updating of this data, and avoided the additional load placed on servers of having N load balancers poll M servers every few seconds.

An impact of this passive strategy is that the more frequently a load balancer sends a request to one server, the more up-to-date it’s view of that servers’ utilization. So the higher the RPS, the higher the effectiveness of the load-balancing. But then conversely, the lower the RPS, the less effective the load-balancing.

This hasn’t been an issue for us, but it is likely that for services that receive a low RPS through one particular load balancer (while receiving high RPS through another separate one), actively polling health-checks could be more effective. The tipping point would be where the load balancer is sending a lower RPS to each server than the polling frequency used for the health-checks.

Server Implementation
We implemented this on the server side by simply tracking the inflight request count, converting it to a percentage of the configured max for that server, and writing that out as a HTTP response header:

X-Netflix.server.utilization: <current-utilization>[, target=<target-utilization>]

The optional target-utilization can be specified by a server to indicate what percentage utilization they are intending to operate at under normal conditions. This is then used by the load balancer for some coarse-grained filtering done as described later.

We experimented a little with using metrics other than the inflight count, such as operating system reported cpu utilization and load average, but found them to cause oscillations seemingly due to the lag induced by them being based on rolling averages. Thus, we decided to just stick with the relatively simple implementation of just counting inflight requests for now.

Choice-of-2 Algorithm Instead of Round-Robin

As we wanted to be able to choose between servers by comparing their statistics, the existing simple round-robin implementation was going to have to go.

An alternative within Ribbon that we tried was JSQ combined with a ServerListSubsetFilter to reduce the herding problem of distributed-JSQ. This gave reasonable results, but the resulting request distribution across target servers was still much too wide.

So we instead applied some earlier learnings from another team at Netflix and implemented the Choice-of-2 algorithm. This has the advantage of being simple to implement, keeps cpu cost on the load balancer low, and gives good request distribution.

Choose based on Combination of Factors

To choose between servers, we compare them on 3 separate factors:

  1. Client Health: rolling percentage of connection-related errors for that server.
  2. Server Utilization: most recent score provided by that server.
  3. Client Utilization: current number of inflight requests to that server from this load balancer.

These 3 factors are used to assign scores to each server, and then the aggregate scores are compared to choose the winner.

Using multiple factors like this does make the implementation more complicated, but it hedges against edge-case problems that can occur from relying solely on one factor.

For example, if one server starts failing and rejecting all requests, it’s reported utilization will be much lower — due to rejecting requests being faster than accepting them — and if that was the only factor used, then all the load balancers would start sending more requests to that bad server. The client-health factor mitigates against that scenario.


When randomly choosing the 2 servers to compare, we filter out any servers that are above conservatively configured thresholds for utilization and health.

This filtering is done on each request to avoid the staleness problems of filtering only periodically. To avoid causing high cpu load on the load balancer, we only do a best-effort by making N attempts to find a randomly chosen viable server, and then falling-back to non-filtered servers if necessary.

Filtering like this helps significantly when a large proportion of the pool of servers have persistent problems. As in that scenario, randomly choosing 2 servers will frequently result in 2 bad servers being chosen for comparison, even though there were many good servers available.

This does though have the disadvantage of depending on statically configured thresholds, which we’ve tried to avoid. The results of testing though, convinced us this was worthwhile adding, even with some general-purpose (ie. non service-specific) thresholds used.


For any server that the load-balancer hasn’t yet received a response from, we only allow one inflight request at a time. We filter out these in-probation servers until a response is received from them.

This helps to avoid overloading newly launched servers with a flood of requests, before giving them a chance to indicate how utilized they are.

Server-Age Based Warmup

We use server age to progressively ramp up traffic to newly launched servers over the course of their first 90 seconds.

This is another useful mechanism like probation that adds some caution about overloading servers in the sometimes delicate post-launch state.

Stats Decay

To ensure that servers don’t get permanently blacklisted, we apply a decay rate to all stats collected for use in the load-balancing (currently a linear decay over 30 secs). So for example, if a server’s error-rate goes up to 80% and we stop sending it traffic, then the value we use will decay down to zero over 30 seconds (ie. it will be 40% after 15 secs).

Operational Impacts

Wider Request Distribution

A negative impact of moving away from using round-robin for load-balancing, is that where before we had a very tight distribution of requests across servers in a cluster, we now get a larger delta between servers.

Using the choice-of-2 algorithm has helped to mitigate this a lot (compared to JSQ across all or a subset of servers in a cluster), but it’s not been possible to avoid it entirely.

So this does need to be taken into account on the operational side, particularly for canary analysis where we typically compare absolute values of request counts, error rates, cpu, etc.

Slower Servers Receive Less Traffic

Obviously this is the intended effect, but for teams used to round-robin where traffic is apportioned equally, this has some knock-on effects on the operational side.

As the traffic distribution across origin servers is now dependent on their utilization, if some servers are running a different build that is more or less efficient, then they will receive more or less traffic. So:

  • When clusters are being red-black deployed, if the new server-group has a performance regression, then the proportion of traffic to that group will be less than 50%.
  • The same effect can be seen with canaries — the baseline may receive a different volume of traffic than the canary cluster. So when looking at metrics, its best to look at the combination of RPS and CPU (as for example RPS may be lower in the canary, while CPU is the same).
  • Less effective outlier detection — we typically have automation that monitors for outlier servers (typically a VM thats slow immediately from launch due to some hardware issue) within a cluster and terminates them. When those outliers receive less traffic due to the load-balancing, this detection is more difficult.

Rolling Dynamic Data Updates

Moving from round-robin to this new load balancer has the useful effect of working very well in conjunction with staged rollouts of dynamic data and properties.

Our best practice is to deploy data updates one region(datacenter) at a time to limit the blast radius of unanticipated problems.

Even without any problems caused by the data update itself, the act of a server applying an update can cause a brief load spike (typically GC related). If this spike occurs simultaneously on all servers in a cluster, it can lead to a large spike in throttling and errors propagating upstream. In this scenario there’s little a load balancer can do to help, as all servers are experiencing the high load.

A solution though — when used in combination with an adaptive load balancer like this — is to do rolling data updates across the servers in a cluster. If only a small a proportion of the servers are applying the update at once, then the load balancer can briefly reduce traffic to them, as long as there are enough other servers in the fleet to take the diverted traffic.

Synthetic Load-Testing Results

We used synthetic load-testing scenarios extensively while developing, testing and tuning the different aspects of this load-balancer. These were very useful in verifying the effectiveness with real clusters and networks, as a reproducible step above unit-tests, but without yet using real user traffic.

More details of this testing are listed later in the appendix, but summarizing the main points:

  • The new load balancer, with all features enabled, gave orders of magnitude reduction in throttling and connection errors compared to the round-robin implementation.
  • There was a substantial improvement in average and long-tail latency (a 3x reduction compared to the round-robin implementation).
  • The addition of the server-utilization feature by itself, added significant value, providing one order of magnitude of the error reductions, and the majority of the latency reductions.
Comparison of the Results

Impact on Real Production Traffic

We’ve found the new load balancer to be very effective at distributing as much traffic to each origin server as that server can handle. This has the valuable effect of routing around both intermittently and persistently degraded servers, without any manual intervention, which has avoided significant production problems from causing engineers to be woken in the middle of the night.

It’s difficult to illustrate the impact of this during normal operation, but it can be seen during production incidents, and for some services even during their normal steady-state operation.

During Incidents

A recent incident involved a bug in a service that was causing more and more server threads to block over time — ie. from the point a server launched, a few threads would block each hour until it eventually started to reach its maximum and throttle.

In the below chart of RPS per server, you can see that before 3am there was a wide distribution across the servers. This was due to the servers with higher numbers of blocked threads being sent less traffic by the load balancers. Then after 3:25am autoscaling started launching more servers, each of which received roughly double the RPS of the existing servers — as they did not yet have any blocked threads and therefore could successfully handle more traffic.

Requests/sec per Server

Now if we look at this chart of the rate of errors per server over the same time range, you can see that the distribution of errors across all the servers throughout the incident was fairly evenly distributed, even though we know that some servers had much less capacity than others. This indicates that the load balancers were working effectively, and that all servers were being pushed slightly past their effective capacity, due to too little overall available capacity in the cluster.

Then when autoscaling launched new servers, they were sent as much traffic as they could handle, up to the point that they were erroring at the same low level as the rest of the cluster.

Errors/sec per Server

So in summary, the load-balancing was very effective at distributing traffic to the servers, but in this case not enough new servers were launched to bring the overall error level all the way back down to zero.


We have also seen a significant reduction in just the steady-state noise in some services of servers having blips of throttling due to gc events for a few seconds. That can be seen here where the errors reduced substantially after the new load balancer was enabled:

Load-related error rate in the weeks before and after new load balancer enabled

Alerting Gaps

An unanticipated impact has been to highlight some gaps in our automated alerts. Some existing alerts based on error rates from services, which would previously have fired when progressive problems were only affecting a small subset of a cluster, now fire much later or not at all, as error rates are kept lower. This has meant that teams have not been notified of sometimes large problems affecting their clusters. The solution has been to plug these gaps by adding additional alerts on deviations in utilization metrics rather than just error metrics.


This post isn’t intended as a plug for Zuul — although it is a great system — but more to share and add another datapoint to the wealth of interesting approaches out there in the proxying/service-mesh/load-balancing community. Zuul is a great system to test, implement, and improve these kinds of load balancing schemes; running them with the demands and scale of Netflix gives us the capabilities to prove and improve these approaches.

There are many different approaches that can be taken to improve load-balancing, and this one has worked well for us, producing significant reductions in load-related error rates and much improved balancing of real-world load on servers.

As with any software system though — you should make decisions based on your own organizations’ constraints and goals, and try to avoid chasing the gnat of perfection.

If this kind of work is interesting to you feel free to reach out to me or us here on the Cloud Gateway team at Netflix.

Source link