Scaling Email Infrastructure for Medium Digest | by Penny Shen | Sep, 2020

In other words, the number of events the queue has to process over one day is approaching the maximum number of events the queue can process at its maximum processing rate.

If we go beyond that limit (which we did by the end of October 2019), the queue will stay backed up indefinitely, because every day we add more events than we can process.

The most obvious solution was to bump up the processing rate, but due to digest’s scale, that would require us to significantly increase resources for underlying services, which is costly. We wanted to see if we could optimize our email infrastructure before shelling out the big bucks.

As described earlier, all digest events were processed by the same queue. The problem with this approach is that when the queue gets backed up, a generated digest can’t be passed on to Sendgrid right away, because its send event gets added all the way to the back of the queue. (Although SQS doesn’t guarantee FIFO order, events are still processed roughly in the order they were enqueued.)

The first thing we did was put send events onto their own queue. Since send events are fast compared to most other email events, this new queue is unlikely to get backed up, and a generated digest can be handed to Sendgrid without further delay.
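The routing change can be sketched as follows. This is a minimal illustration, not Medium’s actual code: the real pipeline runs on SQS, and the event-type names and plain in-memory deques here are stand-ins.

```python
from collections import deque

SEND = "send"  # hypothetical event-type name

main_queue = deque()  # generation and other slower email events
send_queue = deque()  # fast send events get a dedicated queue

def enqueue(event):
    """Route send events to their own queue so a generated digest is
    never stuck behind a backlog of slower generation events."""
    target = send_queue if event["type"] == SEND else main_queue
    target.append(event)

enqueue({"type": "generate", "user": 1})
enqueue({"type": "send", "user": 1})
enqueue({"type": "generate", "user": 2})
```

With this split, a backlog on the main queue no longer delays the hand-off to Sendgrid, since the send queue drains almost immediately.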

With this change, we got a decent amount of breathing room, as around 20% of the events on the original queue were send events.

We were off the hook for now, but this was a Band-Aid rather than a solution, as we were still close to our limit. In fact, with an influx of users in the first half of 2020, we reached the limit again by June 2020.

An important best practice for email deliverability is to “sunset” inactive recipients, which means to only send emails to users who have been active with your products in the last few months. Following this rule, we only send digests to users who have either opened an email or visited Medium in the last 45 days.
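The sunset rule above boils down to a single predicate. A minimal sketch, assuming the only signal we need is the user’s last-active timestamp:

```python
from datetime import datetime, timedelta

SUNSET_WINDOW = timedelta(days=45)

def is_active(last_active_at, now):
    """A user counts as active if they opened an email or visited
    Medium within the sunset window (45 days)."""
    return now - last_active_at <= SUNSET_WINDOW
```

Anyone failing this check is considered sunsetted and should not receive a digest.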

Previously, we evaluated whether a user had been sunsetted inside the generation event itself. This meant we were processing one generation event for every user who had ever signed up for a Medium account.

However, most of the generation events we process don’t actually translate to send events. In fact, we only send emails to 1/4 of all users we try generating for.

We check whether we should send each digest at the very beginning of the generation event, so the generation events that don’t become send events (presumably because those users have been sunsetted) are effectively no-ops.

Even though these no-op events are relatively fast, they can still cause the queue to get backed up, because we rate-limit each queue by number of events per second. For example, even if 100% of the events in a batch are no-ops and we finish processing the whole batch in the first 100 ms, the consumer still waits out the full second before picking up the next batch.
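This is why no-ops still consume throughput. A rough sketch of such a fixed-interval consumer loop (the interval parameter and function names are illustrative, not Medium’s actual consumer):

```python
import time

def consume_batches(fetch_batch, process, interval=1.0):
    """Fixed-interval consumer: at most one batch per interval.
    Even if every event in a batch is a no-op and processing finishes
    in ~100 ms, the loop sleeps out the rest of the interval before
    fetching the next batch."""
    while True:
        start = time.monotonic()
        batch = fetch_batch()
        if batch is None:  # queue drained (for this sketch)
            return
        for event in batch:
            process(event)
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
```

Under this scheme, a queue full of no-ops drains no faster than one batch per second, so eliminating the no-ops entirely is the real win.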

Seeing this, the first thing we decided to do was to process events for active users only. We added a field to our user email table and updated it every time a user becomes active. Then, when querying the table to decide whom to generate digests for, we only select users who were active in the last 45 days according to that field.
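The query might look something like the sketch below. The table and column names are made up for illustration, and an in-memory SQLite database stands in for the real store:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical single-table version of the user email table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_email (user_id INTEGER, last_active_at TEXT)")

now = datetime(2020, 6, 1)
conn.executemany(
    "INSERT INTO user_email VALUES (?, ?)",
    [
        (1, (now - timedelta(days=3)).isoformat()),   # active
        (2, (now - timedelta(days=90)).isoformat()),  # sunsetted
    ],
)

# ISO-8601 timestamps sort chronologically, so a plain >= works here.
cutoff = (now - timedelta(days=45)).isoformat()
active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT user_id FROM user_email WHERE last_active_at >= ?", (cutoff,)
    )
]
```

Only the users returned by this query get a generation event enqueued, instead of one event per user ever signed up.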

This doesn’t sound too hard! We put the necessary changes behind a flag and started ramping up the percentage of events processed with the new strategy.

As we ramped to 50%, we started seeing a significant increase of errors from Rex, the recommendations service we rely on for personalizing stories in the digest.

Upon closer inspection, we realized that the request pattern from our offline-processing service to Rex had become more bursty, with sharp peaks every 10 minutes, causing Rex to thrash as it continually scaled up and back down within each interval.

This is because all the no-op events for sunsetted users effectively served as a buffer and smoothed out the request pattern. Without these no-op events, every generation event resulted in an actual request to Rex, and Rex was doing poorly under this new pattern.
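One way to restore the smoothing is to jitter each generation event across the interval. This is a sketch under assumptions: the 10-minute interval comes from the text, but the function name and the use of per-event random delays are illustrative (SQS does support per-message delays of up to 15 minutes, though whether Medium used that mechanism is an assumption).

```python
import random

REX_INTERVAL_SECONDS = 600  # the offline job fires every 10 minutes

def spread_generation_events(user_ids, seed=None):
    """Assign each generation event a random delay within the interval
    so requests to Rex arrive evenly across the window instead of in
    a burst at the top of each 10-minute cycle."""
    rng = random.Random(seed)
    return [(uid, rng.uniform(0, REX_INTERVAL_SECONDS)) for uid in user_ids]
```

Spreading the load this way trades a little per-digest latency for a steady request rate that Rex’s autoscaling can track without thrashing.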

We were able to spread out the generation events so that the request pattern to Rex was more even, reducing thrashing. The difference in the traffic pattern to Rex is shown below:
