We want to find all the paths of maximum length 3 from a starting node (a viewing member) to a destination node (company), and rank them based on the edges’ weights. Note that we only recommend the first degree connections. The first degree connection can be:
Directly connected to company: currently or previously working at the company
Indirectly connected to company: connected to a person who is directly connected to the company.
Since we only recommend first degree connections to our members, these length 3 paths are concatenated into length 2 paths representing the member and their connections.
Mathematically, given a viewer, a first degree member (let’s call him fdMember), and a company, the score of the path can be defined as:
score(viewer, fdMember, company) = f (affm_m(viewer, fdMember), affm_c (fdMember, company))
where affm_m is the affinity score between two members, calculated based on a variety of factors like profile similarity, engagement between members, etc., and affm_c is the affinity score between a member and a company.
Let’s say sdMember denotes a second degree connection of the viewer who is directly connected to fdMember as well as to the company. So, affm_c can be further broken down into:
= directAffinity(# of years of employment, job role, etc.) if fdMember is directly connected to company
= indirectAffinity( fover all j (fdMember, sdMemberj, company)) if fdMember is indirectly connected to company
Function indirectAffinity is responsible for concatenating a length 3 path from a viewer to a company to a length 2 path. Each of the functions f, directAffinity, and indirectAffinity corresponds to either a machine-learned model or a heuristic-based function in our implementation. We can experiment with different functions for f, directAffinity, and indirectAffinity to achieve the highest clickthrough rate (CTR); however, that discussion is out of the scope of this post.
We faced the following challenges in implementing this system, and as you will see throughout the post, these issues are not easy to address at the same time.
Liquidity: To ensure that we could have recommendations for most members, we used historical page view data to calculate the percentage of page views for which we would have at least one recommendation. We defined this percentage as “liquidity.” We found that using only direct connections, we would get around 30% liquidity, but if we extended to indirect connections as well, our liquidity bumps up to 70%.
Cost to Serve (C2S): We have 500 million members and more than 9 million companies on LinkedIn. With this data, we have the candidate sets to provide recommendations for about 500B <viewer, company> tuples (growing each day!). Generating all such recommendations in a batch (Hadoop) job on commodity hardware would take days, especially for computing indirectAffinity. Such a job would also negatively impact other jobs running on the same cluster. If we calculated a maximum of 10 recommendations for all the possible <viewer, company> tuples, it would require around 80TB of storage space in a fast (SSD-based) key, value lookup store. That introduces massive C2S when most of the records in that store will be cold.
Latency: We need to return the recommendations as quickly as possible when a member visits a company or a job page so as to not negatively impact their experience.
Our goal was to build an architecture that provided 70% liquidity, kept storage and computation costs sustainable, and had a 99th percentile latency of less than 500ms. We chose this latency number because it is less than the retrieval time for other items shown on the conversation window. Therefore, loading contextual suggestions in parallel with other items should not increase the overall load time of the conversation windows.
The rest of this post describes how we followed an iterative approach to reach our goals.
Our first attempt at building the system was guided by the principles of Data Jujitsu: before we invested too much in building a perfect system, we wanted to investigate if users liked our product in the first place. Since all the data that we needed for computing recommendations is available offline in our Hadoop cluster, we simply computed the recommendation offline for each start node (viewing member) and the end node (company) pair and pushed the result into our online key-value data store ready to serve. This way we could quickly launch our initial product and learn how it would perform.
But this strategy yielded a massive number of <viewer, company> tuple recommendations. In order to optimize our efforts, we prioritized generating tuples for our most active members and companies, managing to reduce the above result set by 60%. We then decided to generate indirect connection recommendations for only 10% of these members, which further reduced the result set to an acceptable value that required only 1TB of online key, value storage (as compared to 80TB for all the possible tuples).
We also added several other optimizations, described below.
Eliminate massive joins
In our first attempt, our Hadoop jobs were still running for days and not converging. We found out that we were doing a massive join between the member’s connection table (> 1 trillion entries if we include second degrees) and the company employee table (> 1 billion entries) in order to compute the function indirectAffinity. Data skewness leading to long tails in MapReduce jobs made the situation worse. We decided to optimize this join in the following way.