As the year draws to a close, we’re taking a look back at ten of our most popular 2019 articles on the LinkedIn Engineering Blog. Examining the list, it’s clear that topics pertaining to open source and artificial intelligence are some of the most popular, as are posts that look at how we tackle technical challenges at scale. We’re excited to share new progress and updates in all of these areas in the new year, but in the meantime, take a look at what you might have missed in 2019.
#10: Authorization at LinkedIn’s Scale
LinkedIn members entrust us with their personal data, and it’s incumbent upon us to maintain that trust within a safe, professional environment. We use a microservice architecture, which means that hundreds of services are making tens of millions of calls per second, on average, to retrieve, process, and serve data. It’s essential to make sure that data is only being accessed and shared when there’s a valid business case, and we use Access Control Lists (ACLs) to do so. This can become challenging, especially as we grow, because we need to be able to check authorizations quickly and manage a large (and ever-changing) number of ACLs. This blog post explores how we address these challenges through techniques like caching, periodic refreshes of ACL data, and centralized control of ACLs.
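To make the caching and periodic-refresh idea concrete, here is a minimal Python sketch. It is an illustration only, not the implementation described in the post: the `CachedAclChecker` class, the `fetch_acls` callback (standing in for a centrally managed ACL store), and the 60-second refresh interval are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, FrozenSet

# Hypothetical shape: resource name -> set of callers allowed to access it.
AclTable = Dict[str, FrozenSet[str]]


class CachedAclChecker:
    """Answers authorization checks from a local copy of the ACLs.

    The local copy is reloaded from the central store once it is older
    than `refresh_seconds`, so most checks never leave the process.
    """

    def __init__(
        self,
        fetch_acls: Callable[[], AclTable],  # assumed hook into a central ACL store
        refresh_seconds: float = 60.0,       # assumed interval, not from the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_acls = fetch_acls
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._acls = fetch_acls()
        self._loaded_at = clock()

    def _refresh_if_stale(self) -> None:
        # Reload only when the cached copy has aged past the refresh interval.
        if self._clock() - self._loaded_at >= self._refresh_seconds:
            self._acls = self._fetch_acls()
            self._loaded_at = self._clock()

    def is_allowed(self, caller: str, resource: str) -> bool:
        self._refresh_if_stale()
        # Deny by default: a resource with no ACL entry allows nobody.
        return caller in self._acls.get(resource, frozenset())
```

The trade-off in a design like this is staleness: a change made in the central store only takes effect in a given service after its next refresh, in exchange for fast local checks.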
#9: Scaling Machine Learning Productivity at LinkedIn
Historically at LinkedIn, we created bespoke machine learning (ML) systems to help power key components of the member experience. However, we realized that this non-standardized approach was difficult to scale, which is why we created Pro-ML: a program to improve the efficiency of ML work and make it more accessible to a wider range of engineers by building a plug-and-play ecosystem and automating key components. This blog post explores the main layers of Pro-ML: exploring, authoring, training, deploying, and running ML models, along with health assurance and the feature marketplace. It also covers the unique way in which AI teams are organized at LinkedIn.
#8: Pinot Joins Apache Incubator
Four years ago, we open sourced Pinot, a scalable, distributed OLAP data store developed at LinkedIn to deliver real-time, low-latency analytics. In this blog post, we shared that Pinot had entered Apache incubation, an exciting milestone in the project’s development. The post also covers some of the innovations that have been introduced in Pinot since it was open sourced, including a filesystem abstraction that allows users to plug in their preferred storage backend, support for byte-serialized TDigest data, and new indexing techniques.
#7: Community-Focused Feed Optimization
LinkedIn’s feed is a key component of our platform, allowing members to view, and participate in, updates and conversations from other members. As part of our effort to serve members the best content possible, we use machine learning algorithms to determine what each member’s feed should look like. Our system features a two-pass architecture: first pass rankers (FPR) generate candidate posts for the feed, and a second pass ranker (SPR) scores those candidates to determine the final feed composition. In this blog post, we describe updates made to our FPR system in order to achieve multi-objective optimization while maintaining the low latency requirements of our infrastructure. These updates include an upgrade for the ranking engine and new model deployment technology that allows us to update ML models as needed. Additionally, we describe how we rebuilt our ML model for candidate selection to use a single XGBoost tree ensemble.
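The two-pass structure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production system: the function names, the `cheap_score` and `full_score` callbacks, and the post fields are all hypothetical, and the real rankers are ML models rather than hand-written formulas.

```python
from typing import Any, Callable, Dict, Iterable, List

Post = Dict[str, Any]  # hypothetical post record with at least an "id" field


def first_pass(posts: Iterable[Post], cheap_score: Callable[[Post], float], k: int) -> List[Post]:
    """First pass: keep the top-k posts from one source using a cheap score."""
    return sorted(posts, key=cheap_score, reverse=True)[:k]


def two_pass_rank(
    sources: Iterable[Iterable[Post]],
    cheap_score: Callable[[Post], float],
    full_score: Callable[[Post], float],
    k_per_source: int,
    feed_size: int,
) -> List[Post]:
    """Gather candidates from every source, then re-score them for the final feed."""
    candidates: Dict[str, Post] = {}
    for posts in sources:
        for post in first_pass(posts, cheap_score, k_per_source):
            candidates[post["id"]] = post  # de-duplicate across sources by id
    # Second pass: the more expensive score runs only on the surviving candidates.
    return sorted(candidates.values(), key=full_score, reverse=True)[:feed_size]
```

Because the second pass only sees what the first pass lets through, a post that the cheap score ranks poorly never reaches the final scoring step, which is one reason the quality of first pass models matters.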
#6: The AI Behind LinkedIn Recruiter Search and Recommendation Systems
This blog post provides an overview of our model explorations and the architecture used for Talent Search systems at LinkedIn. Talent search has unique domain challenges, including the need to optimize for two-way interest, the importance of personalization, and the complexity of the queries (for instance, queries that have both structured and unstructured fields). The explorations covered include using Gradient Boosted Decision Trees instead of linear models for search ranking and using Generalized Linear Mixed (GLMix) models for entity-level personalization. By using the models and architectures described in this post, we’ve been able to steadily increase key business metrics, including the number of InMails accepted by candidates.
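As a rough illustration of entity-level personalization with a GLMix-style model, the sketch below adds a per-entity coefficient vector on top of a shared global one. It is a toy example under assumptions: the logistic link, the feature layout, the entity IDs, and the `glmix_score` helper are all hypothetical and are not taken from the post.

```python
import math
from typing import Dict, Sequence


def dot(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


def glmix_score(
    features: Sequence[float],
    global_weights: Sequence[float],
    entity_weights: Dict[str, Sequence[float]],
    entity_id: str,
) -> float:
    """Global (shared) effect plus an optional per-entity effect, squashed to (0, 1)."""
    logit = dot(global_weights, features)
    per_entity = entity_weights.get(entity_id)
    if per_entity is not None:
        # Entities with their own coefficients get a personalized adjustment;
        # unseen entities fall back to the global model alone.
        logit += dot(per_entity, features)
    return 1.0 / (1.0 + math.exp(-logit))
```

The fallback branch is the point of the design: an entity with little or no history is still scored by the global model, while entities with enough data get their own correction.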
#5: Data Hub: A Generalized Metadata Search & Discovery Tool
In this blog post, we introduced Data Hub, our latest step in our metadata journey at LinkedIn. It is an evolution of WhereHows, a project we had previously open sourced, and we hope to eventually release Data Hub to the community as well. Data Hub is a generalized metadata search and discovery tool with two main components: a modular UI frontend and a generalized metadata architecture backend. The frontend allows for three types of interactions—search, browse, and view/edit metadata—while the backend has innovations in metadata modelling, ingestion, serving, and indexing. These details are all explored further in the blog post.
#4: Introducing Kafka Cruise Control Frontend
Kafka is a big part of LinkedIn’s tech stack, and over the years, we’ve open sourced several tools to help automate and manage Kafka clusters. One of the most important of these is Kafka Cruise Control, which helps handle the large-scale operational challenges of running Kafka. In this post, we introduced Kafka Cruise Control Frontend (CCFE) as a new open source project that acts as a central dashboard for an entire Kafka ecosystem. It’s a Single Page Web Application that can be deployed with either Cruise Control or any standard webserver and has a simplified UI that provides information about cluster status and allows for administrative action. CCFE features include displays of Kafka cluster load metrics and of the history and status of Cruise Control tasks, as well as the ability to add or remove brokers from clusters.
#3: Building the next version of our infrastructure
In this blog post, we announced that we have begun a multi-year migration of all LinkedIn workloads to the Microsoft Azure public cloud. Senior Vice President of Engineering Mohak Shroff wrote about the journey LinkedIn’s infrastructure has been on as we scaled from 50 million members to more than 660 million today. He also described the ways in which we’ve already begun leveraging Azure, including for machine translation in the feed and keeping inappropriate content off our site. The post closes by looking forward to the innovations and scale that moving to Azure will unlock as we pursue our mission of bringing economic opportunity to every member of the global workforce.
#2: Open Sourcing Brooklin: Near Real-Time Data Streaming at Scale
Brooklin is a distributed service for streaming data in near real-time and at scale, and it has been in use at LinkedIn for three years. It has two main use cases: serving as a streaming bridge (between cloud services, across data centers, etc.) and change data capture. It has even replaced Kafka MirrorMaker, which was used at LinkedIn to mirror Kafka data between clusters and across data centers. In this blog post, we announced that Brooklin had been made available as an open source project for the community and described in greater detail some of its features, particularly those that make it well-suited for Kafka mirroring.
#1: How LinkedIn Customizes Apache Kafka for 7 Trillion Messages per Day
This post marked an exciting milestone for LinkedIn: we announced that the total number of messages handled by LinkedIn’s Kafka deployments had recently surpassed 7 trillion per day. Originally developed at LinkedIn and then released to the open source community, Kafka has seen strong external adoption, but the scale at which it runs at LinkedIn is unusual. As a result, we face various scalability and operability challenges, and to address them, we maintain an internal version of Kafka that’s specifically tailored to LinkedIn. In this blog post, we introduced our version of Kafka and shared the code for our release branches on GitHub. We also discussed some of the details of the release we run in production, the way that we develop patches, and how we make decisions around upstreaming the changes we make.
Thank you to everyone who has contributed to and read the Engineering Blog this year, and a special thanks to our 2019 technical editors: Banu Muthukumar, Chris Ng, Michael Kehoe, Nikolai Avteniev, Paul Southworth, Szczepan Faber, and Val Markovic. Thanks also to Anne Trapasso and Stephen Lynch on the engineering communications team who help make the blog possible.