We’ve talked a bit about the motivations for building Rex and what language we wanted to use. How does Rex actually work?
The generation of the Medium feed can be described in seven basic steps: aggregating, preprocessing, annotating, ranking, postprocessing, caching, and validating.
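Before walking through each step, here's how the core generation path fits together as one pipeline. This is a sketch with hypothetical names and signatures, not Rex's actual API; caching and validating wrap around this path, so they're noted but not shown.

```python
def generate_feed(user, providers, preprocessors, annotate, score, postprocessors):
    # Aggregating: each provider contributes candidate stories.
    candidates = [story for provider in providers for story in provider(user)]
    # Preprocessing: each preprocessor filters out unsuitable stories.
    for pre in preprocessors:
        candidates = [s for s in candidates if pre(user, s)]
    # Annotating: attach the data each story needs for ranking.
    annotated = [(s, annotate(user, s)) for s in candidates]
    # Ranking: score each story and sort best-first.
    ranked = sorted(annotated, key=lambda pair: score(user, pair[1]), reverse=True)
    feed = [s for s, _ in ranked]
    # Postprocessing: apply business rules to the ranked list.
    for post in postprocessors:
        feed = post(user, feed)
    return feed  # then cached in Redis, and validated on later reads
```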
We source stories we think a user will enjoy, and we understand users may like stories for different reasons. For example, you may always read stories from authors or publications you follow. Or perhaps you really like technology and always want to read stories in the technology topic. For your feed, we have many different story providers, each of which provides you with stories we think you’ll like for a particular reason.
The three stories here were surfaced for the following reasons:
From your network: This story was published in a publication I follow (500ish). Rex sources the top-performing stories from publications I follow (as with topic-based providers, we look at stories in a publication that many users have read and clapped on).
Based on your reading history: Based on the stories I’ve read and clapped on so far, users with a reading history similar to mine have also liked this story. Finding users with a similar reading history to mine and making recommendations based on those is a technique called collaborative filtering, which Rex relies on to find high-quality stories for each user.
Photography: I followed the photography topic, so for my homepage, Rex sources some of the top-performing stories in this topic (that is, the stories in the photography topic that many people have read and clapped on), and adds them into my feed.
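Conceptually, each provider is just a function from a user to a list of (story, reason) pairs. The sketch below uses hypothetical names and toy lookup tables standing in for Rex's real data stores:

```python
def followed_publications_provider(user, follows, top_by_publication):
    """Top-performing stories from publications the user follows."""
    return [(story, "From your network")
            for pub in follows.get(user, [])
            for story in top_by_publication.get(pub, [])]

def topic_provider(user, topics, top_by_topic):
    """Top-performing stories in topics the user follows."""
    return [(story, topic.capitalize())
            for topic in topics.get(user, [])
            for story in top_by_topic.get(topic, [])]
```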
Once we’ve aggregated these high-quality stories for the user, we filter out stories we think may not be suitable for a user at a given time. Maybe we’ve sourced a story the user has already read; there’s no need to show them the same story twice, so we may add a preprocessor that removes stories the user has read before. During the preprocessing step, we use different preprocessors to filter out stories, with each preprocessor filtering for a particular reason.
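A preprocessor can be as simple as a filter over the candidate list. Here's a hypothetical sketch of the already-read example above:

```python
def already_read_preprocessor(read_history):
    """Build a preprocessor that drops stories the user has already read."""
    def keep(user, stories):
        seen = read_history.get(user, set())
        return [s for s in stories if s not in seen]
    return keep
```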
Once we’ve amassed a group of stories we think a user will like, we have to rank them by how much we think a user will like each story. Before we can rank them, we have to fetch a significant amount of data from our data stores to get all of the necessary information (for example, who is the author of a story, what topic is the story in, how many people have clapped on this story, etc.). We calculate most of the features we need for ranking stories via offline Scala jobs and store them in two tables that we query at the time of feed creation. This allows us to minimize the number of I/O calls we’re making when assembling all the necessary data.
There’s information about each particular user ↔ story pair that can’t be calculated offline and has to be checked online (for example, does the user for whom we’re generating a feed follow the author of a particular story?), but precalculating the features lets us do much of the work beforehand.
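One way to picture the annotation step: bulk-precomputed features are fetched per story, then merged with the few checks that have to happen online. All names here are hypothetical:

```python
def annotate(user, stories, offline_features, follows_author):
    """Merge precomputed story features with per-user online checks."""
    annotated = {}
    for story in stories:
        features = dict(offline_features.get(story, {}))  # from offline jobs
        # Online check: does this user follow this story's author?
        features["follows_author"] = follows_author(user, story)
        annotated[story] = features
    return annotated
```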
Once we’ve gathered all the necessary data to rank each story, actually ranking the stories depends on what ranking strategy we use. We first transform the results from the annotation step into an array of numerical values and pass each story and set of values to another Medium microservice that hosts our feed-ranking models. This separate microservice assigns a score to each story, where the score represents how likely we think the user is to read this particular story.
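In outline, ranking turns each story's annotations into a numeric feature vector, scores it, and sorts best-first. In this hypothetical sketch, `score_story` stands in for the call to the model-hosting microservice:

```python
def rank(stories, feature_names, annotations, score_story):
    """Score each story via the (stubbed) model service and sort best-first."""
    scored = []
    for story in stories:
        annotation = annotations[story]
        vector = [float(annotation[name]) for name in feature_names]
        scored.append((score_story(vector), story))  # higher score = more likely read
    scored.sort(reverse=True)
    return [story for _, story in scored]
```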
A lot of great work has gone into building Medium’s model-hosting microservice as well, but that’s a tale for an upcoming Medium Engineering story. 😉
After ranking stories, there are often some “business rules” we may want to apply. Postprocessors apply story-ranking rules that ensure a better user experience. For example, we’ll see the top of a user’s feed dominated at times by a single author, single publication, or single topic. Because we want a user to see a more diverse set of authors, publications, and topics represented, we added a postprocessor that prevents a single entity from dominating the top of a user’s feed.
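A diversity rule like that could be sketched as follows (hypothetical and much simpler than the real rule): cap how many of the top slots any one author can occupy, demoting the overflow below them.

```python
def diversity_postprocessor(feed, author_of, top_n=5, max_per_author=2):
    """Keep any single author from taking more than max_per_author of the top_n slots."""
    top, overflow, counts = [], [], {}
    for story in feed:
        author = author_of[story]
        if len(top) < top_n and counts.get(author, 0) < max_per_author:
            top.append(story)
            counts[author] = counts.get(author, 0) + 1
        else:
            overflow.append(story)  # demoted below the top_n, order preserved
    return top + overflow
```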
Once we’ve finished generating the feed, we store it in Redis, an in-memory data store, for a short period of time. We don’t necessarily want to show a new feed every time the user visits Medium; a feed that reshuffles on every visit could make for a confusing experience for someone returning often in a short span of time.
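In Redis this is essentially a SET with an expiry (SETEX). The stdlib-only sketch below, with hypothetical names, shows the same short-TTL behavior without a Redis dependency:

```python
import time

class FeedCache:
    """A tiny TTL cache mimicking Redis SETEX/GET for cached feeds."""
    def __init__(self):
        self._store = {}

    def set(self, user, feed, ttl_seconds):
        self._store[user] = (feed, time.monotonic() + ttl_seconds)

    def get(self, user):
        entry = self._store.get(user)
        if entry is None:
            return None
        feed, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[user]  # expired: the feed gets regenerated
            return None
        return feed
```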
If we’re reading our feed from the cache, some of the stories in the cached ranked-feed list may no longer be suitable candidates. For example, if I follow and subsequently unfollow a given author, stories by that author should be removed from my cached feed. The validation step filters out such potentially unwanted stories from the cached feed: stories that may have been suitable candidates when we first created it.
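The validation step re-checks each cached story against the user's current state. This hypothetical sketch applies just the two checks mentioned in this post (the real set of validators is presumably larger), for stories that were sourced from followed authors:

```python
def validate_cached_feed(user, cached_feed, follows, read_history):
    """Re-check each cached (story, author) pair against current user state."""
    valid = []
    for story, author in cached_feed:
        if story in read_history.get(user, set()):
            continue  # read since the feed was cached
        if author not in follows.get(user, set()):
            continue  # user unfollowed this author after caching
        valid.append((story, author))
    return valid
```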