The producer is adding items to the queue (from multiple threads) and the consumer is taking items from the queue (also from multiple threads).
Now assume that the producer is producing at a constant rate of 900 Queries Per Second (QPS). The capacity of the consumer is 1,000 QPS; that’s how much it can process. How many items are in the queue? Usually about a couple of items; one or two, maybe up to five. It depends on how threads get scheduled.
In such a setup, the maximum size of the queue (its configured capacity) doesn’t need to be big; the queue is acting as a hand-off point.
Now imagine that the producer starts producing at a faster rate than the capacity of the consumer. Say at 1,100 QPS, while the consumer can still only consume 1,000 QPS. How many items are in the queue? If the size of the queue was 10, then the number of items in it is 10. If it’s 1,000, then the number of items grows by 100 every second until it gets full. The queue very quickly becomes full, regardless of size.
If the queue is configured to reject items when full, the producer will receive feedback from the consumer that it can’t process items at the rate they are being produced. This feedback is called “backpressure.”
But in a production system, there is rarely just a single producer, queue, and consumer; there’s almost certainly a chain of these, where a consumer is also a producer for the next link in the chain. In these situations, we want the backpressure to propagate throughout the chain quickly.
Correct management of backpressure is critical to the health of systems at peak load. If little thought has been put into backpressure, then once the system hits its capacity limit, it won’t degrade gracefully even if theoretically it could have handled the peak load. In fact, it might just enter a death spiral.
The most common example would be a queue with an inappropriately large maximum size (or worse, an unbounded queue). The queue can start taking up more and more RAM, triggering GC more frequently (in a system with a GC) and thus reducing the throughput (and thus capacity of the system). The queue then becomes even fuller, the GC triggers even more frequently, and down the spiral of death we go.
The problem isn’t even limited to languages with a GC. Similar issues arise with threadpools with unbounded (or very high) max active thread limits. If thousands of threads are all trying to make progress (and few, if any, are blocked on IO), the capacity of the system plummets due to the cost of context switching (and the side effects of that, like CPU cache thrashing, etc.). Naturally, this further increases the processing backlog.
This doesn’t even mention possible failure to handle the backpressure feedback on the producer side, which brings a separate set of problems.
Auditing our Jetty configuration
Cognizant of the above, and with a hunch that some Jetty-related queues were too big, we started looking into every single Jetty config value we were setting. Some of those configs did stand out as wrong.
First, our Jetty request queue max size was set to 5,000. Experience and intuition told us this was likely wrong. Looking around, Jetty documentation recommends a value between 50 and 500. The documentation makes an important point: make sure the queue is large enough to handle a traffic spike. Unlike our simple example in the backpressure section, producer load in reality can often be uneven.
Now let’s imagine a worst-case scenario: suddenly, 500 queries arrive in the queue at the exact same moment. We know that the client (followfeed-query) has a total request timeout of 400 ms. We also know that our peak capacity for a followfeed-storage machine at this time is about 1,200 QPS. So in that 400 ms budget, we’ll be able to respond to 1,200 queries/s * 0.4 s = 480 queries. The other 20 queries we will still process and respond to, but the client won’t care (it has already moved on because of the timeout).
So that’s roughly the number of queries we can be usefully buffering up. With a request queue size of 5,000, it would take us ~4.17 seconds to respond to all of them (assuming the machine isn’t melting!), and ~4,520 responses would be wasted work.
Thus, we reduced the request queue size to 500. This is still actually too high because followfeed-query won’t sit idly by while 400 ms pass; it will start sending an identical request ~150 ms after the first one (to a separate followfeed-storage replica) if the request hasn’t arrived by that point. Those other replicas are likely to have responded to a large fraction of those 480 queries before the machine with the full queue does. We’ll be further tuning the request queue size in the near future, but a 10x reduction should be immediately helpful.
The second queue that appeared to be too big was Jetty’s acceptQueueSize parameter. We had it overridden to 2,000. Here’s what the Jetty docs have to say about it:
“The size of the pending connection backlog. The exact interpretation is JVM and operating system specific and you can ignore it. Higher values allow more connections to wait pending an acceptor thread. Because the exact interpretation is deployment dependent, it is best to keep this value as the default unless there is a specific connection issue for a specific OS that you need to address.”
You’ll notice that this has an admonition to not touch the default without a very good reason. This piled yet more suspicion on that 2,000 size, because the default is 0. Now 0 clearly looked like a sentinel value, so the question became, “What is the default?” Digging through code, we can see that Jetty passes the provided acceptQueueSize value to ServerSocket.bind() as the value of a backlog parameter. ServerSocket is a standard class in the Java Runtime Environment (JRE), so we can look into its source to see what it does when the provided value is 0 (the JavaDocs state the value is “implementation dependent”).
The source code shows that the actual default value is 50 and is provided to the listen(2) syscall. (For those that want to learn more about this TCP connection queue, here’s a great article.) For our purposes, knowing that the JRE default is 50 (with JRE defaults usually being sensible), and that we were using 2,000 for no obvious reason, was enough to experiment with using the default.
So we packaged up this config change along with the request queue size change and benchmarked them with a copy of production traffic on an isolated machine. This experiment was a success.
Deploying this change to production helped us handle excess load gracefully; we now reject requests before they eat up RAM, and the smaller request queue means it takes us less time to recover. The smaller TCP connection accept queue means we start rejecting connections when under load, which works well because those requests will be retried on other replicas.
But these changes did something else, too: they increased our capacity. We now see far fewer retries attempted from followfeed-query for the same amount of site traffic. The request queue size reduction lowered GC pressure when the machines are under heavy load. The TCP accept queue size reduction provided backpressure to followfeed-query machines that are trying to reconnect to the followfeed-storage machine that was returning errors (and closing connections on the client) when under load.
Over-exuberant retry strategy
As is common with wide-ranging investigations that look for possible causes of production issues, we’ve found plenty of problems that needed addressing besides those listed above.
One of the biggest issues was with our retry strategy in followfeed-query (the broker service in front of followfeed-storage). When a followfeed-query machine sends a request to a followfeed-storage machine hosting a particular partition, it waits for about 150 ms before concluding that something is very wrong (median latency for followfeed-storage is ~8 ms and 99th percentile latency is ~100 ms). At that point, it makes the same call to a separate replica of followfeed-storage hosting the same partition (while still waiting for the older call to complete, just in case it comes back). There’s a third fetch attempt after the second one if that too is still pending after an additional 150 ms.
The reason why this is needed is because of GC pauses; if a followfeed-storage machine hits a long GC pause, the calls that it is working on need to be retried for the sake of reducing user-visible latency.
Now imagine what will happen when a followfeed-storage machine spikes up in latency for a temporary reason (GC pause, loading data from disk into caches, etc.). A part of the call volume it was getting will be redirected (temporarily) to other machines hosting the same partitions. Since we have 4 replicas hosting an identical set of partitions, 3 machines now start taking more traffic. If those machines were near peak operating load, they start having issues as well, causing more latency increases, which causes more retries from followfeed-query, which causes yet more latency increases…this becomes a death spiral.
This is fairly obvious, so naturally, followfeed-query had a system in place to prevent this. There were (well-tuned!) rate-limiters in place for the second and third fetch attempts to stop such a situation for happening. Unfortunately, we realized that the rate-limiters were global, not per-host. In other words, the total number of retries across all the followfeed-storage machines was rate-limited in followfeed-query (and the max retry call volume was capped at ~2.5% of peak non-retry volume). Unfortunately, this meant that if a single set of machines hosting a set of partitions started having issues, there was more than enough “retry call budget” to drown those machines in 2x more requests, effectively DDOS-ing the system.
So, we made the rate-limiters per-host.
After thorough testing of this change, we deployed it everywhere. One of our data centers didn’t have the Jetty configuration changes deployed by that point and one did, so we could make direct comparisons between them (at comparable levels of traffic), since they used the same hardware. For the data center with only the per-host rate-limiter changes, we could see a sizable number of retries getting blocked by the new rate-limiters (and the death spiral was being prevented, as expected). For the data center with both the rate-limiters and the Jetty changes, almost no retries were being blocked. Thus we know the Jetty changes increased capacity.
Other ideas we tried
For context, the original configuration values were set several years ago. Knowledge of why those values were chosen is lost to time. They probably seemed like good choices back then, but the system had never really seen load that exceeded capacity until now; thus, there wasn’t a pressing need to tune those parameters.
We found several other parameters that needed tuning and other places where we could add graceful degradation as a result of our investigation. Some of these issues were hurting peak load behavior, and some these we tuned out of caution.
TCP connection limits in followfeed-query were wrong. We had min/max TCP connections per followfeed-storage host set to 10/100. We only really need 0.4 TCP connections per (followfeed-query, followfeed-storage) machine pair, even for peak load. Reducing the min/max to 1/4 completely removed TCP connection spikes during load tests. The spikes had been 7x normal TCP established connections on each followfeed-storage machine; after the change, the TCP connection count remains flat during peak load.
HTTP request timeouts at the library level were not tuned to match our application-level timeouts. App-level timeouts were 400 ms, while HTTP request timeouts were 3.5 seconds. This means that after a long GC pause, followfeed-storage could still try to process some requests that the client has long since stopped caring about.
We had no request rate-limiting inside followfeed-storage. We know the QPS at which we start failing, so putting a rate-limiter on the server end is useful for graceful degradation and providing backpressure by returning 429 Too Many Requests responses.
Out-of-date GC configuration. We haven’t needed to tune our GC configuration for a long time, and the configuration we had in production wasn’t a great match for today’s workloads. Tuning these parameters helped address GC pauses.
Another critical issue we identified was incorrect metric computation. We discovered that the QoS metrics we showed at the start of the blog post were overcounting errors by 300%; several years ago when they were created, the data produced was correct, but because of architectural changes made since then, the algorithms computing the metrics were out of date.
We’ve rebuilt our QoS metrics computation from scratch so that we account for these changes, add additional accuracy, and include new QoS metrics for an even clearer picture of production behavior.
We’ve also identified future work that will improve our capacity, graceful degradation, and ability to debug issues like this in the future.
Our request orchestration code in followfeed-query has accumulated technical debt over time that makes it very hard to change and reason about. We’ve also learned a lot about the behavior of our system since it was launched several years ago; that knowledge can be put to good use in request orchestration. We expect to be able to improve the reliability of the system here.
The way we distribute data partitions across followfeed-storage machines isn’t ideal for peak load. Today, followfeed-storage machine replicas host the same 20 data partitions (with 4 replicas). Thus if a followfeed-storage machine is having issues, the request volume gets distributed across 3 other machines. If we assigned partitions heterogeneously (where every machine would effectively have a random set of 20 partitions) while still assuring at least 4 machines host each partition, then when a machine starts having issues, the call volume would be distributed to ~80 other machines instead of 3. This would massively reduce the extra load per-machine.
Here’s why it would be 80 machines: one machine is hosting 20 partitions. There are 4 machines hosting each partition. With effectively random partition allocation, the total number of machines involved that host at least one of the 20 affected partitions is 4 * 20 = 80 machines. (Technically 79, because one machine is having problems.) Note: This calculation assumes that no machine is hosting two or more partitions from the affected set.
Missing service-level continuous headroom testing. Our issues with reaching capacity limits were only discovered during site-wide load tests. Ideally, we’d be able to identify these limits well before that point.
We’re still not sure what was the real underlying cause of that blow-up in Jetty RAM usage; we know it’s related to the number of in-flight requests and connections, and working to reduce those did help. We have pulled in the Jetty experts at the company to look at our issue.
But that issue was only one of many problems we needed to address to increase capacity and improve our ability to gracefully degrade at peak load.
Meanwhile, the changes we’ve made have stabilized the system and we’ve successfully passed several load tests, doing great not only at our original data center traffic target but at a target 10% higher as well. We have plenty of follow-up work to improve capacity and pay down accumulated technical debt in our serving path, but that work has already begun.