Scaling Machine Learning Productivity at LinkedIn


We began with a review of our assets, counting hundreds of relevance services and several core learning technologies including tree ensembles, generalized additive mixture ensembles, and deep learning. We then broke down our efforts into a set of layers.

The layers we focused on are:

  • Exploring and authoring
  • Training
  • Deploying
  • Running
  • Health assurance
  • Feature marketplace

The Pro-ML life cycle

Exploring and authoring
The modeling process starts with exploring the problem space, the features, and the data, and then identifying a particular goal—for example, computing the probability of a member clicking on a job. You then evaluate candidate ML algorithms and train a model to achieve that goal. The model needs to be evaluated (e.g., with cross-validation, area under the curve, F-scores) and then iterated on. This often takes many attempts, as there are many hyperparameters to test, especially for deep models (e.g., number of layers, size of layers, types of convolutions).
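
To make the evaluate-and-iterate loop concrete, here is a minimal, generic sketch of a cross-validated hyperparameter search using scikit-learn; the data, parameter grid, metric, and threshold choices are illustrative stand-ins, not LinkedIn's internal tooling.

    # Illustrative only: a generic evaluate-and-retry loop with scikit-learn,
    # not LinkedIn's internal tooling. The dataset is a synthetic stand-in for
    # "probability of a member clicking on a job" training data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

    # A small hyperparameter grid; real searches are usually much larger.
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [3, 5],
        "learning_rate": [0.05, 0.1],
    }

    search = GridSearchCV(
        GradientBoostingClassifier(random_state=42),
        param_grid,
        scoring="roc_auc",   # area under the curve, as mentioned above
        cv=5,                # 5-fold cross-validation
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)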

We selected two strategies to address this. First, we built a domain-specific language (DSL) with IntelliJ bindings to capture the input features, their transformations, the ML algorithms employed, and the output results. Second, we are building a Jupyter notebook integration that allows step-by-step exploration of the data, selection of features, and drafting of the DSL. It also lets you tune model parameters and drive the training.
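
As a purely hypothetical illustration of the kind of information such a DSL captures (input features, their transformations, the algorithm, and the outputs), here is a Python stand-in; it is not the actual Pro-ML DSL syntax, and every name in it is invented for the example.

    # Hypothetical sketch of what a declarative model spec might capture; this
    # is NOT the actual Pro-ML DSL, just an illustration of the same idea.
    model_spec = {
        "name": "job-click-probability",           # hypothetical model name
        "features": [
            {"name": "member_title_match", "transform": "identity"},
            {"name": "job_views_7d", "transform": "log1p"},
            {"name": "seniority", "transform": "one_hot"},
        ],
        "algorithm": {"type": "gradient_boosted_trees", "max_depth": 5},
        "label": "clicked",
        "outputs": ["probability_of_click"],
    }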

Training
Some of LinkedIn's data-driven features are very time-sensitive, and as such, they are computed mostly online (for example, recommending new connections). For most of our products, however, we still use offline training; some teams may train every couple of hours, while other teams have tens of models (or sub-components of a model) that are trained and retrained daily. We rely heavily on our Hadoop systems for offline training. ML developers use our Pro-ML unified training service, to which we continuously add new model types and tools such as hyperparameter tuning. The training service is tightly interconnected with the online serving and feature management ecosystems, which ensures that the same input files are used throughout the system and minimizes the risk of errors. The training service leverages Azkaban and Spark to run the actual training. Once a model passes offline validation, the training library hands the trained artifacts and metadata off to the deployment system.
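
A rough sketch of what an offline training and validation run might look like with Spark is shown below; the paths, feature names, acceptance threshold, and metadata hand-off format are assumptions for illustration only, not the Pro-ML training service itself.

    # Minimal offline-training sketch with PySpark ML; paths, columns, the
    # validation threshold, and the metadata format are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    import json

    spark = SparkSession.builder.appName("job-click-training").getOrCreate()
    train = spark.read.parquet("/data/job_clicks/train")     # hypothetical path
    holdout = spark.read.parquet("/data/job_clicks/holdout")

    assembler = VectorAssembler(
        inputCols=["member_title_match", "job_views_7d"],    # hypothetical features
        outputCol="features",
    )
    gbt = GBTClassifier(labelCol="clicked", featuresCol="features")
    model = Pipeline(stages=[assembler, gbt]).fit(train)

    # Offline validation gate before anything is handed to deployment.
    auc = BinaryClassificationEvaluator(
        labelCol="clicked", metricName="areaUnderROC"
    ).evaluate(model.transform(holdout))

    if auc >= 0.70:  # hypothetical acceptance threshold
        model.write().overwrite().save("/models/job-click/7")        # artifact
        metadata = {"model": "job-click", "version": 7, "auc": auc}  # metadata
        print(json.dumps(metadata))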

Model deployment
Understanding deployment starts by defining what we mean by "ML artifacts." We are interested in the identity, components, versioning, and dependencies of each artifact relative to the others in the system. For example, a model may have a global component in the tens of megabytes and member-specific components totaling gigabytes. Each of these may be created separately with its own version and have dependencies on code (libraries, services) and features. We store this information in a central repository, where it is leveraged for automatic validation (e.g., are all the features available both offline and online?) and for deployment. The target destination for an artifact may be a service, a key-value store, or other infrastructure components. The deployment service provides orchestration, monitoring, and notification to ensure that the desired code and data artifacts are in sync. The deployment system also ties into the experimentation platform to make sure that all active experiments have the required artifacts in the right targets across the overall system.
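
The sketch below illustrates the idea of an artifact record plus an automatic pre-deployment check (are all required features available both offline and online?); the schema, URIs, and feature catalogs are hypothetical, not the actual repository API.

    # Hypothetical artifact record and validation check; not LinkedIn's
    # actual model repository schema or API.
    from dataclasses import dataclass, field

    @dataclass
    class ModelArtifact:
        name: str
        version: str
        components: dict            # component name -> storage location (hypothetical URIs)
        required_features: list
        code_dependencies: list = field(default_factory=list)

    def validate_for_deployment(artifact, offline_features, online_features):
        """Fail fast if any required feature is missing offline or online."""
        missing = [
            f for f in artifact.required_features
            if f not in offline_features or f not in online_features
        ]
        return (len(missing) == 0, missing)

    artifact = ModelArtifact(
        name="job-click",
        version="7.2.0",
        components={"global": "artifact://job-click/7.2.0/global"},
        required_features=["member_title_match", "job_views_7d"],
    )
    ok, missing = validate_for_deployment(
        artifact,
        offline_features={"member_title_match", "job_views_7d"},
        online_features={"member_title_match"},   # job_views_7d not yet served online
    )
    print(ok, missing)   # -> False ['job_views_7d']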

Running
Much of the excitement around AI focuses on the exploring and training steps. This isn't enough for real systems, however. We need to be able to evaluate models in production reliably, efficiently, and operably. This includes offline in systems such as Spark and Pig, nearline in Samza, online in REST services, and deep in our search stack. Historically, teams have written custom scorers for each environment, but this is labor-intensive and error-prone—it is too easy to introduce a small delta between the training and serving environments, leading to difficult-to-diagnose bugs. To address these challenges, we built a custom execution engine called Quasar for running the DSL discussed in the "Exploring and authoring" section above. The engine takes the features from the marketplace (see below) and the coefficients and DSL code from the model deployment system, and then applies the code to the data and coefficients. We have also built a higher-order declarative Java API (ReMix) for defining composable online workflows for query rewriting, feature integration, driving downstream recommendation engines, and blending the results. Finally, we are building a distributed model serving system, driven by Quasar, to federate multiple inference engines, including various versions of TensorFlow Serving and XGBoost.
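
The sketch below illustrates the underlying principle: write the transformation and scoring logic once and run the same code path in batch and online serving, so there is no delta between environments. The transform, model, and coefficients are hypothetical stand-ins; Quasar itself evaluates the authored DSL rather than Python.

    # Sketch of a single shared scoring path used for both batch and online
    # evaluation; the feature transform and linear model are hypothetical.
    import math

    def transform(raw):
        """Shared feature transformation used by both offline and online scoring."""
        return [raw["member_title_match"], math.log1p(raw["job_views_7d"])]

    def score(coefficients, bias, raw):
        """Shared scorer; the same code path serves batch jobs and REST calls."""
        x = transform(raw)
        z = bias + sum(c * v for c, v in zip(coefficients, x))
        return 1.0 / (1.0 + math.exp(-z))   # probability of click

    coefficients, bias = [1.8, 0.4], -2.0    # loaded from the deployed artifact

    # Offline: applied to a batch of rows; online: applied to a single request.
    batch = [{"member_title_match": 1.0, "job_views_7d": 12},
             {"member_title_match": 0.0, "job_views_7d": 3}]
    print([round(score(coefficients, bias, row), 3) for row in batch])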

Health assurance
The processes that produce and update ML artifacts are hard to test and monitor. The health assurance layer of Pro-ML is made up of automated and on-demand services. The automated services ensure that the online and offline features (inputs to the model) are statistically similar. They also validate that the online model behavior is in sync with the expected behavior; for example, that the predicted scores are in line with the precision observed during offline training. If an anomaly is detected, the ML engineer can use the on-demand services to understand the source of the discrepancy, applying replay, store, explore, and perturb techniques to further isolate the problem: is there a bug in the code, is data missing, or should the model simply be retrained?
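
As a simple illustration of the kind of automated check involved, the sketch below compares the offline and online distributions of a single feature with a two-sample Kolmogorov-Smirnov test; the data and alerting threshold are hypothetical, and the real health assurance services are considerably more involved.

    # Illustrative drift check on one feature using a two-sample KS test;
    # the data and the alerting threshold are hypothetical.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    offline_values = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training-time feature
    online_values = rng.normal(loc=0.3, scale=1.0, size=10_000)    # serving-time feature

    statistic, p_value = ks_2samp(offline_values, online_values)
    if p_value < 0.01:   # hypothetical alerting threshold
        print(f"Feature drift detected (KS={statistic:.3f}); trigger on-demand diagnosis")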

Feature marketplace
The output of a system is only as good as the data that goes in. Big data is powering the current AI cycle, and managing it requires a dedicated system. We have tens of thousands of features that need to be produced, discovered, consumed, and monitored. At LinkedIn, we have Frame, a system for describing features both offline and online; it is used by both feature producers and consumers. We publish Frame's metadata about the features in a centralized database/UI system that is also connected to our Model Repository. This allows ML engineers to search for features based on various facets, including the type of feature (numeric, categorical), statistical summary, and current usage in the overall ecosystem.
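
The sketch below illustrates faceted search over published feature metadata; the schema mirrors the facets mentioned above (type, summary statistics, usage) but is hypothetical rather than Frame's actual metadata format.

    # Hypothetical feature-metadata catalog and faceted search; not Frame's
    # actual schema or API.
    features = [
        {"name": "job_views_7d", "type": "numeric", "mean": 4.2, "used_by": ["job-click"]},
        {"name": "seniority", "type": "categorical", "cardinality": 10, "used_by": []},
        {"name": "member_title_match", "type": "numeric", "mean": 0.31,
         "used_by": ["job-click", "job-search"]},
    ]

    def search_features(catalog, feature_type=None, in_use=None):
        """Filter the catalog by feature type and by whether any model consumes it."""
        results = catalog
        if feature_type is not None:
            results = [f for f in results if f["type"] == feature_type]
        if in_use is not None:
            results = [f for f in results if bool(f["used_by"]) == in_use]
        return results

    print([f["name"] for f in search_features(features, feature_type="numeric", in_use=True)])
    # -> ['job_views_7d', 'member_title_match']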

Organizational structure

How are we organizing the work of the AI teams at LinkedIn to help solve the problems outlined above (scale, resources, opportunity, etc.)? Historically, many engineering organizations have been very hierarchical: you have managers, managers reporting to those managers, and then teams of engineers. This isn't how we have structured the Pro-ML initiative.

After a decade of rapid progress and experimentation within the Data organization at LinkedIn, we arrived at an organizational model that closely aligns AI teams with product teams, but maintains the reporting relationship to the parent AI organization. This ensures that researchers can collaborate and share best practices with fellow experts who are working to solve similar hard problems, while still having a dedicated ML team under the product “chain-of-command” that we are supporting.

Organizing the Pro-ML teams

Similarly, the team behind Pro-ML has been organized around five main pillars, each of which supports one of the stages of the model development life cycle. Typically, each pillar has a lead (usually an engineer), a tech lead, and several engineers. Just as with our embedded AI teams across LinkedIn's business lines and practice areas, these engineers come from across the organization, including product engineering, our foundation/tools organization, and infrastructure teams. The Pro-ML team is distributed across the world, and includes engineers in Bangalore, in Europe, and in multiple locations in the United States. We also have a leadership team that helps set the vision for the project and (most importantly) works to eliminate friction so that each of our pillars can stand on its own.

We are now more than one year into our effort to transform artificial intelligence at LinkedIn to make it scale across all of engineering—keeping it fast, efficient, and operable.

Conclusion

Just as software has taken over the world, artificial intelligence is taking over software. AI techniques are finding uses everywhere in software engineering, from detecting fake members to mapping out career paths. Similarly, we are making investments not only in new AI research and in developing the AI skills of our employees, but also in initiatives like Pro-ML that increase the productivity of our engineers.

Pro-ML will increase the number of products that can take advantage of AI and expand the number of teams that are able to train and deploy models. Additionally, it will reduce the time needed for model selection, deployment, etc., and provide automation in key areas like health assurance. Finally, it gives our people more time to do what they do best: finding creative solutions to hard technical problems, using LinkedIn’s unique and highly-structured dataset.

Learn more about Pro-ML and AI at LinkedIn


