An illustration showing the new multi-task setup. Two groups of objectives are trained with two separate neural networks with common inputs.
Below are a few salient features of the setup.
Multi-task setup: Multi-task formulations of machine learning problems allow for joint learning of multiple objectives, exploiting differences and commonalities between objectives to improve performance over separate formulations. We designed our deep learning architecture to jointly train on both passive and active consumption objectives, with a specific focus on community sharing objectives, via our XGBoost-based feature design, described later in this post. In addition to facilitating shared learning among different objectives, this setup also simplifies the modeling paradigm: a single model outputs the response probabilities used in our overall utility function.
Response grouping for shared learning: For optimal transfer learning among objectives, we group objectives into two categories: (1) passive-consumption oriented and (2) active-consumption oriented. Accordingly, we split our deep learning network into two towers, one per category, shown as two differently colored towers in the figure above. We call this the “two-tower multi-task deep learning setup.”
Data sampling and model calibration: For data sampling, we no longer perform custom downsampling per objective, but instead use all recorded interactions, with updates across objectives. Despite fears that this could cause very uncommon objectives to be ignored, with enough data and normalization of features, all of our objectives were well trained and calibrated. This allowed our new deep modeling setup to still interact well with our generalized linear models, since the deep model acts as a well-calibrated fixed effects model on top of which our random effects models train.
Feature space: Our multi-tower architecture is composed of multiple fully connected layers that start with an embedding lookup for sparse input features. The majority of features we have historically built have been sparse, including both categorical and numeric features. We pass all these sparse features into a single XGBoost tree, using the leaves as categorical input features, which we then embed. For high-dimensional features, such as text and images, we use dense embeddings that we feed directly into TensorFlow, relying on batch normalization or projection layers to ensure smooth training.
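The leaf-as-feature idea in the last point can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `TREE`, its thresholds, `NUM_LEAVES`, and `EMBEDDING_DIM` are hypothetical stand-ins for a trained XGBoost tree, and the embedding table is randomly initialized where a real model would learn it. With the XGBoost library itself, leaf indices can be obtained from `Booster.predict(..., pred_leaf=True)`.

```python
import random

# Hypothetical stand-in for one trained tree. Internal nodes test
# features[feature] < threshold; leaves carry an integer leaf id.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"leaf": 0},
    "right": {
        "feature": 1, "threshold": 10.0,
        "left": {"leaf": 1},
        "right": {"leaf": 2},
    },
}
NUM_LEAVES = 3
EMBEDDING_DIM = 4  # assumed size, chosen only for illustration


def leaf_index(tree, features):
    """Walk the tree and return the id of the leaf the example lands in."""
    node = tree
    while "leaf" not in node:
        go_left = features[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def make_embedding_table(num_rows, dim, seed=0):
    """Randomly initialized table; in a real model these rows are trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


EMBEDDINGS = make_embedding_table(NUM_LEAVES, EMBEDDING_DIM)


def embed(features):
    """Treat the leaf id as a categorical feature and look up its embedding."""
    return EMBEDDINGS[leaf_index(TREE, features)]
```

Because only the leaf id reaches the network, two inputs with very different raw magnitudes (for example `[0.9, 3.0]` and `[0.9, 9.99]`) map to the same embedding row, which is why raw feature scale does not matter on this path.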
Multi-task modeling details
As mentioned previously, we had historically been training a separate logistic regression model for each of the above objectives, which limited the shared learning among related objectives. In the spirit of training a joint model, we started with a single model that output probabilities for all the objectives in our utility function. However, we discovered that this approach performed worse than separating the parameter space between passive-consumption and active-consumption oriented objectives, hence the two-tower setup.
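A minimal sketch of what separating the parameter space means in practice: both towers read the same input vector, but each owns its hidden layer and its output heads. Everything here is an illustrative assumption, including the layer sizes, the random initialization, and all objective names other than reshare.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), with W stored row-wise."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class Tower:
    """One tower: its own hidden layer plus one sigmoid head per objective."""

    def __init__(self, rng, n_in, n_hidden, objectives):
        self.objectives = objectives
        self.hidden = init_layer(rng, n_in, n_hidden)
        self.heads = init_layer(rng, n_hidden, len(objectives))

    def predict(self, inputs):
        h = dense(inputs, *self.hidden, relu)
        probs = dense(h, *self.heads, sigmoid)
        return dict(zip(self.objectives, probs))


rng = random.Random(0)
# Objective names are hypothetical apart from "reshare".
passive_tower = Tower(rng, n_in=4, n_hidden=8, objectives=["click", "dwell"])
active_tower = Tower(rng, n_in=4, n_hidden=8, objectives=["like", "comment", "reshare"])


def predict_all(inputs):
    """Common inputs, separate parameters: merge the per-objective probabilities."""
    scores = passive_tower.predict(inputs)
    scores.update(active_tower.predict(inputs))
    return scores
```

Calling `predict_all([0.1, 0.2, 0.3, 0.4])` returns one probability per objective. Within a tower the objectives share the hidden layer, while nothing is shared across towers beyond the inputs.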
We optimize a cross-entropy loss per objective to train a multi-layer network for both towers. We identified the following key challenges and learnings in the model training process:
Model variance: We observed significant variance in model performance, especially for sparse objectives (e.g., reshare), and this variance showed up both in our offline evaluation metrics and in online A/B testing. We identified initialization and the optimizer (such as Adam) as significant contributors to variance in the early stage of training. A warm start routine that gradually increases the learning rate helped overcome the majority of the variance problem.
Model calibration: Our feature and model score monitoring infrastructure (such as ThirdEye) helped identify several model calibration challenges, especially where the deep learning setup interacts with external modeling components. One such challenge was a mismatch in observed-to-expected (O/E) ratios among different objectives compared to our previous setup, and we found that several sampling schemes for negative-response training data affected O/E ratios.
Feature normalization: Our XGBoost-based feature design provides the model with an embedding lookup layer that avoids feature normalization issues during model training. However, as we expanded into embedding-based features, we realized that normalization would play a major role in the training process. Batch normalization and/or a translational layer helped alleviate some of these problems.
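Two of the points above reduce to short formulas. The sketch below shows a linear learning-rate ramp, one common way to implement a gradual increase, and the O/E ratio computed as observed positives over the sum of predicted probabilities. The function names, the linear shape of the ramp, and the parameter values are assumptions made for illustration; the post does not specify the schedule that was used.

```python
def warmup_learning_rate(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to base_lr, then hold it there.

    `step` is the zero-based training step.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


def observed_over_expected(labels, predictions):
    """O/E ratio: number of observed positives / sum of predicted probabilities."""
    expected = sum(predictions)
    if expected == 0:
        raise ValueError("sum of predictions is zero; O/E is undefined")
    return sum(labels) / expected
```

An O/E ratio near 1 means the predicted probabilities match the observed response rate on average. A ratio above 1 means the model underpredicts and a ratio below 1 means it overpredicts, so computing it separately per objective shows which objectives have drifted.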
Analyzing the problem space of efficient scoring
In order to score a deep neural network model across hundreds of features and many objectives, it was important to instrument and measure all aspects of our model inference. For this, we did JVM profiling for our standard modeling stack on Java, TensorFlow profiling using tf-profiler, and instrumentation of system latencies and metrics.
When we investigated the performance of our previous multi-objective models, it became clear that evaluating each model independently added significant costs. Because the models were very mature, they had accumulated manually crafted features over time that were no longer optimal years later. This manual feature engineering has been superseded by more powerful techniques like XGBoost or the neural networks described in this post. Profiling made it clear that while XGBoost feature derivation accounted for a minority of computation time, it was overwhelmingly important in our model compared to hand-crafted features.
Looking into the inference costs of deep learning, the major standouts were the conversion costs from sparse string features to integer-backed tensors, the serialization costs of sending features through gRPC to TensorFlow Serving, and the costly sparse embedding lookups within TensorFlow. Luckily, the LinkedIn feed had recently built out tensors as its standard feature representation, which removed the conversion costs by making tensors first-class citizens for training and evaluation. Inference batch size was an important configuration for us to tune as well. TensorFlow scaled much better with larger batch sizes than our previous setup because it explicitly batched features together for efficient linear algebra calculations within a batch. Raising the batch size allowed for better parallelization within a mini-batch while still allowing separate mini-batches in a feed session to be scored in parallel.
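The batching trade-off described above can be sketched as follows: the items in a feed session are split into mini-batches, each mini-batch is scored in one call, and separate mini-batches run in parallel. `score_batch` is a hypothetical placeholder for a batched model call (for example, one request to a model server), and the thread pool and its worker count are assumptions for illustration rather than the production design.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split items into consecutive mini-batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def score_batch(batch):
    """Hypothetical stand-in for one batched model call.

    Here the "score" is just the mean of each item's feature values.
    """
    return [sum(features) / len(features) for features in batch]


def score_session(items, batch_size, max_workers=4):
    """Score all items of a session, with mini-batches evaluated in parallel."""
    batches = make_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so scores line up with items.
        results = pool.map(score_batch, batches)
        return [score for batch_scores in results for score in batch_scores]
```

A larger `batch_size` means fewer calls, each doing more work at once, while a smaller one produces more mini-batches that can run concurrently. The best value depends on the model and hardware, which is why it has to be tuned empirically.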