Building the Activity Graph, Part I

Co-authors: Val Markovic and Vivek Nelamangala


Serving a feed of relevant, personalized content to 500 million members is a massive undertaking. Accordingly, our feed infrastructure is constantly evolving to take advantage of new relevance models, new features, and more efficient ways of scaling our infrastructure. In this post, we describe the Activity Graph, a new system that allows us to understand deep relationships between members’ content.

The origin of the Activity Graph

The story of the almost year-long project behind LinkedIn’s Activity Graph begins, as these things usually do, with a bug report. We noticed that sponsored content (i.e., an ad) would sometimes show up in the first position in a member’s feed. This is against our internal best practices and something we actively try to avoid; we want the most interesting organic content to be the first thing a member sees, not an ad.

Speaking of “organic content,” let’s define it: it consists of the pieces of member-generated content in the feed, which we call “Activities.” An Activity is defined by three main components: Actor, Verb, and Object. An example in prose would be “Val shared a text post,” or “Vivek liked a comment.” We present these Activities as cards in the feed UI.
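To make the Actor, Verb, Object shape concrete, here is a minimal sketch in Python. The class name, field names, and the member and comment URN formats are assumptions made for illustration; they are not the feed's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One feed Activity: who did what to which object (illustrative fields only)."""

    actor: str  # URN of the member performing the action (hypothetical format)
    verb: str   # the action, e.g. "shared" or "liked"
    obj: str    # URN of the thing acted upon (hypothetical format)

    def describe(self) -> str:
        """Render the Activity as a short prose-like string."""
        return f"{self.actor} {self.verb} {self.obj}"


# "Vivek liked a comment", expressed with made-up identifiers.
example = Activity(actor="urn:li:member:1", verb="liked", obj="urn:li:comment:42")
```

Calling example.describe() joins the three components back into a single readable line, which is roughly what a feed card conveys.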

The root cause of an ad showing up as the first piece of content was that we did actually prepare an organic Activity for the first slot, but that Activity was dropped later on (during a process we call “decoration”) because it was marked by our systems as spam. So the problem stemmed from spam being removed at the last moment before the feed was displayed to the member.

But it wasn’t just spammy organic content we had to worry about. Our policies around spam were changing to introduce the concept of “low-quality” (LQ) content: content that’s not quite spam, but that most LinkedIn members wouldn’t want to see. You can read more about how we classify this kind of content in another post about various strategies for keeping the feed relevant for our members. We needed a way to support these new business rules, and dropping LQ content during decoration (in addition to dropping spam) made the ad-in-first-place issue worse.

This also created other problems beyond a poor member experience; our relevance teams train their machine learning models based on member interactions with the feed, and those models need accurate data. If an Activity is dropped during the decoration step, what the member actually sees when the page is rendered does not match what the model thinks they saw.

The decoration step

It’s important to understand what decoration means for the feed before we continue. The lower layers of the feed stack (the FollowFeed system) deal with identifiers for Activities; an example would be urn:li:activity:123. So when FollowFeed recommends a list of Activities, it will send back a list of URNs. Near the top of the stack, the URNs need to be resolved so that the full data for each Activity is available; but the Activity data itself could be referencing yet more URNs, and those too need to be resolved as a result. This step of recursive URN resolution is called decoration (“deco”).
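The recursive resolution described above can be sketched as follows. This is a simplified illustration, not the real deco system: the in-memory STORE, the record fields, the non-activity URN types, and the decorate function are all hypothetical, and a production system would batch lookups against remote services.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical record store: URN -> raw record whose values may be further URNs.
STORE: Dict[str, Dict[str, Any]] = {
    "urn:li:activity:123": {
        "actor": "urn:li:member:1",
        "verb": "shared",
        "object": "urn:li:share:7",
    },
    "urn:li:member:1": {"name": "Val"},
    "urn:li:share:7": {"text": "Hello, feed", "author": "urn:li:member:1"},
}


def is_urn(value: Any) -> bool:
    """Treat any string starting with 'urn:li:' as a reference to resolve."""
    return isinstance(value, str) and value.startswith("urn:li:")


def decorate(urn: str, store: Dict[str, Dict[str, Any]],
             _path: FrozenSet[str] = frozenset()) -> Any:
    """Recursively replace URNs with the records they point to."""
    if urn in _path:
        return urn  # cycle guard: leave a URN already on the current path as-is
    record = store.get(urn)
    if record is None:
        return urn  # unknown URN: leave it unresolved
    path = _path | {urn}
    return {
        key: decorate(value, store, path) if is_urn(value) else value
        for key, value in record.items()
    }


decorated = decorate("urn:li:activity:123", STORE)
```

After this call, decorated["object"]["author"] holds the full member record rather than a URN, which is the "URNs inside URNs" behavior the paragraph describes.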

If the decoration system sees that any URN that can be transitively reached during decoration has some spam/LQ state attached, it refuses to decorate the top-level record. This is meant to be a safety net.
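As a rough sketch of that safety net, assume the references between URNs are available as a simple adjacency map and the spam/LQ state as a set of flagged URNs. Both structures, the function names, and the sample URNs are hypothetical.

```python
from typing import Dict, List, Set


def reachable(root: str, refs: Dict[str, List[str]]) -> Set[str]:
    """Collect root plus every URN transitively referenced from it."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        urn = stack.pop()
        if urn in seen:
            continue
        seen.add(urn)
        stack.extend(refs.get(urn, []))
    return seen


def should_drop(root: str, refs: Dict[str, List[str]], flagged: Set[str]) -> bool:
    """Refuse the top-level record if anything it reaches is spam/LQ."""
    return not reachable(root, refs).isdisjoint(flagged)


# Hypothetical data: activity 123 reaches a flagged article two hops away.
REFS = {
    "urn:li:activity:123": ["urn:li:share:7"],
    "urn:li:activity:456": ["urn:li:share:8"],
    "urn:li:share:7": ["urn:li:article:9"],
}
FLAGGED = {"urn:li:article:9"}

feed = ["urn:li:activity:123", "urn:li:activity:456"]
survivors = [urn for urn in feed if not should_drop(urn, REFS, FLAGGED)]
```

Here the Activity prepared for the first slot is the one that gets dropped, which is the same mechanism that let an ad end up in first position.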

The solution seems obvious: don’t serve spam/LQ content up the stack in the first place!

Making FollowFeed understand spam and LQ content

LinkedIn’s organic feed is served by FollowFeed. Historically, it has relied on decoration to drop spam and LQ content. To help keep spam/LQ content from even reaching the decoration step, FollowFeed’s indexing system needed to know when an Activity was spam/LQ by ingesting events from LinkedIn’s spam classifiers. But there’s a little bit of nuance here: since decoration drops the top record if it sees that any transitive URN is spam/LQ, we need to do the same thing at the indexing layer.

And now we reach the really hard part: we need to build a graph of all the Activities and how they relate to each other and all the other URNs that they reference. Without this, we just aren’t aware of the transitive relationships between URNs.
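To illustrate why a graph is needed, here is one possible way to answer the question "a classifier just flagged this URN; which records are affected?" It stores reverse edges and walks them breadth-first. This is only a sketch under our own assumptions: the ReferenceGraph class, its method names, and the sample URNs are hypothetical and do not describe the actual Activity Graph implementation.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Set


class ReferenceGraph:
    """Tracks which URNs reference which, indexed by the referenced URN."""

    def __init__(self) -> None:
        # Reverse edges: referenced URN -> set of URNs that point at it.
        self._referrers: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_reference(self, source: str, target: str) -> None:
        """Record that `source` references `target`."""
        self._referrers[target].add(source)

    def affected_by(self, flagged_urn: str) -> Set[str]:
        """Return flagged_urn plus every URN that transitively references it."""
        seen = {flagged_urn}
        queue = deque([flagged_urn])
        while queue:
            current = queue.popleft()
            for referrer in self._referrers.get(current, ()):
                if referrer not in seen:
                    seen.add(referrer)
                    queue.append(referrer)
        return seen


graph = ReferenceGraph()
graph.add_reference("urn:li:activity:123", "urn:li:share:7")
graph.add_reference("urn:li:share:7", "urn:li:article:9")
graph.add_reference("urn:li:activity:456", "urn:li:share:8")

# A spam event arrives for the article; find everything that reaches it.
affected = graph.affected_by("urn:li:article:9")
```

With reverse edges, a single classifier event can be translated into the set of top-level Activities to suppress, without re-walking every Activity's references at serving time.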

Let’s start with an example subgraph:
