Haoyu Chen | Pinterest Labs Research Intern, Shopping Discovery
Pedro Silva | Software Engineer, Shopping Discovery
Somnath Banerjee | Head of Shopping Discovery
People have always come to Pinterest for shopping inspiration, and we’ve made big strides over the years to make that as seamless as possible so Pinners (users) can go from inspiration to purchase, including evolving shoppable Product Pins, improving recommendations and making it easier for merchants to upload their catalogs to curate and feature their products. The vision of the Pinterest Shopping team is to empower Pinners to discover high-quality products and enable them to purchase with ease and confidence. As a key step to fulfill this, we built the Related Products module to recommend Pinners the products we believe they’ll love based on the Pin they’re currently viewing. This module appears as “Shop Similar” and “More to Shop” across the app.
Related Products recommendations can be thought of as two high-level components. The first is candidate generation where, given the current context, Pin, and user, we generate a set of hundreds of products that are relevant to that user at that moment. The second is product ranking: given the products generated in the previous step, we rank them for each Pinner so that the most relevant products in the given context appear at the top, helping Pinners easily discover products they love. In this article, we focus on the second component and explain how we use multi-task learning, calibration, and Bayesian optimization to build a flexible, interpretable, and scalable candidate ranking solution for Related Products recommendations.
At Pinterest, there are different types of engagement: a close-up means tapping on the Pin to take a closer look, a save means saving the Pin to a board, a click means clicking through the Pin to visit the linked website, and a long-click is a click where the user stays offsite for an extended period. Previously, we treated engagement prediction as a binary classification task in which impressions without engagement are negatives and impressions with any engagement are positives. We then chose an importance weight in the loss function for each engagement type according to its business value. For example, we might set a higher weight for long-clicks because they indicate that the user might have made a purchase on the linked website. The loss for a batch of n samples (Eq. 1) is then
where ImportanceWeight(i) is determined by the engagement type of sample i, y⁽ᶦ⁾_binary is the true binary label, which is 1 for an impression with engagement (close-up, save, click, or long-click) and 0 otherwise, and ŷ⁽ᶦ⁾_binary is the predicted score for sample i. Finally, we trained a deep neural network (Figure 2) using historical user, query, and candidate Pin features to minimize this loss, and used the predicted scores to rank and recommend the candidates.
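Eq. 1 itself is not reproduced in this text. A form consistent with the definitions above would be an importance-weighted log loss, L = −(1/n) Σᵢ ImportanceWeight(i) · [y⁽ᶦ⁾ log ŷ⁽ᶦ⁾ + (1 − y⁽ᶦ⁾) log(1 − ŷ⁽ᶦ⁾)]. The log-loss form and the 1/n averaging are assumptions here, and the weight values and engagement names in the following Python sketch are purely illustrative:

```python
import math

# Hypothetical importance weights per engagement type (not the production values).
IMPORTANCE = {"none": 1.0, "closeup": 1.0, "save": 2.0, "click": 2.0, "long_click": 4.0}


def weighted_log_loss(y_true, y_pred, importance_weights, eps=1e-12):
    """Importance-weighted binary log loss, averaged over the batch (assumed form of Eq. 1)."""
    total = 0.0
    for y, p, w in zip(y_true, y_pred, importance_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# One engagement type per impression; "none" marks an impression without engagement.
engagements = ["none", "save", "long_click", "none"]
labels = [0 if e == "none" else 1 for e in engagements]
scores = [0.1, 0.6, 0.8, 0.3]
weights = [IMPORTANCE[e] for e in engagements]
loss = weighted_log_loss(labels, scores, weights)
```

Because the weights are baked into the training loss, changing any value in `IMPORTANCE` changes the objective the network was fit to, which is why trying a new set of weights requires retraining.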
This simplified approach is easy to implement and model agnostic, but it leads to a series of problems:
- We lose information by combining different engagement types into one binary label. If a user both saves and long-clicks a Pin, we have to drop one of the user’s actions, since only one engagement type can be chosen for each sample. An alternative would be to duplicate the training sample, using a different engagement type for each copy.
- The predicted score is not interpretable. It tells us how “engaging” a candidate is but its exact meaning is determined by the importance weights we choose.
- The task of engagement prediction is coupled with business value. If we ever want to try a different set of importance weights, we need to retrain the model, which is detrimental to developer and experimentation velocity.
To deal with all these drawbacks, we experimented with a multi-task learning model.
Multi-task learning model
We kept the feature and fully connected layers from the binary classifier but changed the output head layer. Instead of outputting a single score ŷ⁽ᶦ⁾_binary, the multi-task model outputs four scores, ŷ⁽ᶦ⁾_save, ŷ⁽ᶦ⁾_click, ŷ⁽ᶦ⁾_long-click, and ŷ⁽ᶦ⁾_close-up, one for each engagement type (Figure 2).
The loss function then becomes:
where y⁽ᶦ⁾_t is the label of sample i for engagement type t, and TaskWeight(t) is a tuning parameter used to combine the log losses of the four output heads. We tried both hand-tuning the loss weights and learning them automatically using homoscedastic uncertainty weighting, but the gain was marginal because the four tasks are similar to each other. We therefore ended up using equal weights for the losses, for simplicity.
The key difference between the loss function in Eq. 2 and the one in Eq. 1 is that we no longer lose engagement information. The four output heads can borrow knowledge from each other by sharing the preceding layers, which also helps reduce overfitting compared with fitting a separate model for each engagement type.
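Eq. 2 is likewise not reproduced in this text. Based on the description, it can be sketched as a sum of per-head log losses weighted by TaskWeight(t) and averaged over the batch. The exact normalization, the dictionary-based data layout, and the task names below are assumptions made for illustration:

```python
import math

TASKS = ("save", "click", "long_click", "closeup")


def log_loss(y, p, eps=1e-12):
    """Binary log loss for a single label/prediction pair."""
    p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multi_task_loss(labels, preds, task_weights=None):
    """Batch-averaged, task-weighted sum of per-head log losses (assumed form of Eq. 2).

    labels and preds are lists of dicts keyed by task name, one dict per sample.
    """
    if task_weights is None:
        task_weights = {t: 1.0 for t in TASKS}  # equal weights, as the post settled on
    total = 0.0
    for y_i, p_i in zip(labels, preds):
        for t in TASKS:
            total += task_weights[t] * log_loss(y_i[t], p_i[t])
    return total / len(labels)


# A Pin that was both saved and long-clicked keeps every label; nothing is dropped.
labels = [{"save": 1, "click": 1, "long_click": 1, "closeup": 0}]
preds = [{"save": 0.5, "click": 0.5, "long_click": 0.5, "closeup": 0.5}]
loss = multi_task_loss(labels, preds)
```

Unlike the single binary label, each sample here carries one label per engagement type, so a save and a long-click on the same impression both contribute to training.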
Calibration and interpretability
Now that each output head is trained against a single type of engagement, the predicted score can naturally be interpreted as the probability of that engagement. However, when the label distribution of the evaluation data differs from that of the training data, we still need an extra step, called calibration, to transform the predicted scores into probabilities. This is a common problem with AutoML DNN models. In our case, the problem is mainly caused by negative downsampling in the training data and can be corrected by a simple yet powerful method. Suppose we keep only a fraction of the negatives while keeping all the positives. The calibrated score is then:
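The formula itself is not reproduced in this text. The standard correction for negative downsampling described by Dal Pozzolo et al. (listed in the references) maps a score ŷ to ŷ / (ŷ + (1 − ŷ) / r), where r is the fraction of negatives that was kept. The sketch below assumes this is the method meant; the names `calibrate` and `neg_sample_rate` are illustrative:

```python
def calibrate(score, neg_sample_rate):
    """Undo the bias introduced by keeping only `neg_sample_rate` of the negatives.

    score: prediction from a model trained on the downsampled data, in [0, 1].
    neg_sample_rate: fraction of negatives kept during training, in (0, 1].
    """
    if not 0.0 < neg_sample_rate <= 1.0:
        raise ValueError("neg_sample_rate must be in (0, 1]")
    return score / (score + (1.0 - score) / neg_sample_rate)


# With 10% of negatives kept, a raw score of 0.5 corresponds to odds of 1:10.
calibrated = calibrate(0.5, 0.1)
```

The correction rescales the odds by the sampling rate: downsampling negatives by a factor r inflates the positive-to-negative odds by 1/r, and the formula divides that factor back out.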
The calibration plot in Figure 3 compares how well the probabilistic predictions of the different heads are calibrated. The calibrated score p̂⁽ᶦ⁾_t is almost the same as the fraction of positives (i.e., the true probability), which means the predicted scores are well calibrated after the correction. We also show the Brier score for each head in the legend. The Brier score is the mean squared difference between the predicted probabilities and the observed outcomes, so a lower score means better calibration. The bottom plot in Figure 3 shows the fraction of positive samples in each small bin grouped by their calibrated scores p̂⁽ᶦ⁾_t.
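To make the two diagnostics concrete, here is a small Python sketch of how a Brier score and the points of a calibration plot could be computed for one head. The equal-width binning and the function names are assumptions; the post does not describe how Figure 3 was produced:

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def reliability_bins(labels, probs, n_bins=10):
    """Group samples into equal-width probability bins.

    Returns one (mean predicted probability, fraction of positives, count)
    tuple per non-empty bin; a well-calibrated head has the first two close.
    """
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(labels, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    return [(s / c, pos / c, c) for s, pos, c in bins if c > 0]


labels = [0, 0, 1, 1]
probs = [0.1, 0.2, 0.8, 0.9]
score = brier_score(labels, probs)
curve = reliability_bins(labels, probs, n_bins=2)
```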
To rank the candidates, we still needed a single score. Here, we consider a linear combination of the predicted probabilities as the utility score:
Because the ranking score is calculated after model training, we can finally decouple business value from engagement prediction. If we want to shift the ranking towards a particular engagement type, we only need to tune the utility weights; we do not have to retrain the model.
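As a rough illustration of this decoupling, the sketch below ranks the same calibrated predictions under two different sets of utility weights. The candidate probabilities and weight values are made up for the example:

```python
def utility(probs, weights):
    """Linear combination of per-engagement probabilities."""
    return sum(weights[t] * probs[t] for t in weights)


def rank(candidates, weights):
    """Order candidates by utility score, highest first."""
    return sorted(candidates, key=lambda c: utility(c["probs"], weights), reverse=True)


# Hypothetical calibrated predictions for two candidate products.
candidates = [
    {"id": "A", "probs": {"save": 0.30, "click": 0.05, "long_click": 0.01, "closeup": 0.40}},
    {"id": "B", "probs": {"save": 0.05, "click": 0.20, "long_click": 0.10, "closeup": 0.20}},
]

balanced = {"save": 1.0, "click": 1.0, "long_click": 1.0, "closeup": 1.0}
long_click_heavy = {"save": 1.0, "click": 1.0, "long_click": 10.0, "closeup": 1.0}

order_balanced = [c["id"] for c in rank(candidates, balanced)]
order_long_click = [c["id"] for c in rank(candidates, long_click_heavy)]
```

The model predictions are identical in both calls; only the weights differ, so the ordering can be changed at serving time without touching the trained network.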
Moving to a multi-task architecture significantly improved the engagement metrics in Related Products, with an increase in the propensity and volume of all engagement types compared to the previous binary classifier. Table 1 below summarizes the metric gains observed against the single-head model with the same neural network, where the different positive engagements were combined using importance weights (Eq. 1). These metrics were obtained through an online A/B experiment.
This work unlocks new possibilities, such as adding more tasks to the model to predict the probability of Pins getting hidden, leading to check-out, and having other types of engagements. Adding more tasks could help increase the generalization ability of the model and prevent overfitting. When we have negative engagement types such as hides, we may also consider using a more complex model architecture like soft-sharing of the previous layers. One of the most interesting problems with this new approach is answering how to optimize the utility function weights. We go into more depth on this in the next section.
Utility function and Bayesian optimization
As explained previously, we use a linear combination of the predicted probabilities as the utility score for ranking candidates with the multi-task learning model. It is still hard, however, to pick the optimal utility weights by trying out all possible combinations. The goal is to choose the weight for each engagement type in the utility function such that ordering Pins by the final score ranks them from most to least likely to be engaged with. This is a classic hyperparameter tuning problem, and Bayesian optimization has been used successfully to solve it.
The main objective of using Bayesian optimization is to solve the problem described above with the minimum number of trials. It is most useful when the objective function has an unknown but smooth form and is expensive to evaluate over the whole parameter space. In our case, let’s assume we would like to maximize the long-click area under the precision-recall curve, which is an unknown function f of the utility weights w_t. During Bayesian optimization, f(w_save, w_click, w_long-click, w_close-up) is modeled as a Gaussian process (GP). Suppose we first select some random utility weights w_t and evaluate them to get the corresponding objective values f(w_t). We can then update the model to the posterior GP given the observed (w_t, f(w_t)) pairs (see Figure 4 for an illustration).
The posterior model can be used to predict f for unobserved w_t and to quantify the uncertainty around those predictions (the grey area in Figure 4). The predictions and uncertainty estimates are combined to derive an acquisition function, which is used to pick the next candidate w_t for evaluation. After repeating this process for more steps, we get a better model with less uncertainty and can pick the most promising w_t, the one expected to maximize f.
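For reference, the update and selection steps can be written with the textbook GP posterior. This is the generic form, not something specific to the setup described here: the kernel k, the observation-noise variance σ_n², and the zero prior mean are assumptions, and the upper-confidence-bound acquisition in the last line is only one common choice, since the post does not say which acquisition function was used.

```latex
% Observed weight vectors w^{(1)}, ..., w^{(m)} with objective values
% \mathbf{f} = (f(w^{(1)}), ..., f(w^{(m)}))^\top,
% K_{ij} = k(w^{(i)}, w^{(j)}), \quad (k_*)_i = k(w^{(i)}, w_*).
\begin{aligned}
\mu(w_*)      &= k_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{f}, \\
\sigma^2(w_*) &= k(w_*, w_*) - k_*^\top \left(K + \sigma_n^2 I\right)^{-1} k_*, \\
a_{\mathrm{UCB}}(w_*) &= \mu(w_*) + \kappa\, \sigma(w_*).
\end{aligned}
```

The posterior mean μ gives the prediction for an unobserved weight vector w_*, the posterior standard deviation σ gives the width of the uncertainty band, and κ (a hypothetical tuning constant) trades off exploring uncertain regions against exploiting regions that already look good.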
We use the Ax library to implement Bayesian optimization for utility weight selection. In addition to GP modeling and candidate selection, the Ax library also supports adding constraints on parameters and secondary objectives. Our optimization problem is then characterized as:
Suppose you have a baseline model b that you are trying to outperform with this method. For example, it could be the last production model or the model with hand-picked weights for the utility function. In this case, b_click, b_save, and b_close-up are the performance of each head for the baseline model, and c_click, c_save, and c_close-up are parameters we use to adjust the secondary objectives. For example, the setting c_save = 0.9, c_click = 1, c_close-up = 1 means we are willing to trade a little save performance for better long-click performance, but we do not want to hurt clicks or close-ups. S1 through S5 are the secondary objectives we defined for this optimization. S1 forces the weights to add up to a constant (any constant works), which helps with interpretability and loses nothing because the ranking does not change if we multiply all weights by the same positive number. S2 delimits the ranges of the weights. S3, S4, and S5 address click, save, and close-up performance, which we denote as f_click, f_save, and f_close-up, respectively.
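The formulation itself is not reproduced in this text. One way to write it that is consistent with the description above is the following; the constant C, the bounds w_min and w_max, and the inequality form of S3 through S5 are assumptions inferred from the prose (for example, c_save = 0.9 tolerating a small drop in save performance relative to the baseline):

```latex
\begin{aligned}
\max_{w}\quad
  & f_{\text{long-click}}\left(w_{\text{save}}, w_{\text{click}}, w_{\text{long-click}}, w_{\text{close-up}}\right) \\
\text{s.t.}\quad
  & S_1:\ \sum_{t} w_t = C, \\
  & S_2:\ w_{\min} \le w_t \le w_{\max} \quad \text{for every engagement type } t, \\
  & S_3:\ f_{\text{click}}(w) \ge c_{\text{click}}\, b_{\text{click}}, \\
  & S_4:\ f_{\text{save}}(w) \ge c_{\text{save}}\, b_{\text{save}}, \\
  & S_5:\ f_{\text{close-up}}(w) \ge c_{\text{close-up}}\, b_{\text{close-up}}.
\end{aligned}
```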
Despite our efforts, in online A/B experiments, the multi-task learning model using the utility function with the weights selected using Bayesian optimization did not outperform the model with hand-picked utility weights in engagement propensity or volume. We hypothesize that we have not achieved the optimal utility weights by Bayesian optimization because we only used offline optimization, where we generate and evaluate the sets of weights against samples from past engagement. A natural next step of this work is to further optimize the utility weights by Bayesian optimization on the online metrics, which should lead to a new optimal set of weights. In general, we would need to do at least two more experiments. The first one is to collect the knowledge needed to build a Bayesian model: we generate pseudo-random sets of utility weights and use them to set up the treatment groups. After collecting the metrics from each group and sending them back to the Ax platform, we can then fit the Bayesian model and generate candidate weights for the next experiment. We would then wait for this second experiment to complete and choose the best performing group as a candidate for production if it outperforms the hand-picked utility weights.
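The first of these experiments needs several pseudo-random sets of utility weights, one per treatment group. The post does not say how those sets would be drawn; the sketch below assumes uniform sampling over weights that sum to a fixed constant (matching constraint S1), using the fact that normalized independent exponential draws are uniformly distributed on the simplex. The function name and task names are illustrative:

```python
import random

TASKS = ("save", "click", "long_click", "closeup")


def sample_utility_weights(n_sets, total=1.0, seed=0):
    """Draw n_sets weight vectors that are non-negative and sum to `total`."""
    rng = random.Random(seed)  # fixed seed so treatment groups are reproducible
    weight_sets = []
    for _ in range(n_sets):
        draws = [rng.expovariate(1.0) for _ in TASKS]
        norm = sum(draws)
        weight_sets.append({t: total * d / norm for t, d in zip(TASKS, draws)})
    return weight_sets


# One hypothetical weight set per treatment group in the first experiment.
treatment_weights = sample_utility_weights(n_sets=5)
```

Each dictionary could then configure the utility function for one treatment group, and the metrics observed for that group would become one (weights, objective) observation for fitting the Bayesian model.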
Evolving the Related Products ranking model’s architecture from a single head model to a multi-task neural network has improved engagement across all the engagement types while also providing us with more interpretable outputs, which are not only useful for debugging purposes but also for model performance analysis. This work also allows us to continue exploring and expanding the model to accommodate new tasks and new architectures as our use cases evolve. We also learned that offline Bayesian optimization does not always outperform other weight selection methods. However, the process gives us confidence that the current weights are aligned with our understanding of what Pinners find engaging and inspirational on our shopping surfaces. We’ll follow up this work with an online Bayesian optimization approach, which has proven far more successful for solving similar problems than its offline counterpart.
Acknowledgments: The authors would like to thank the following people for their contributions: Van Lam, Sai Xiao, Onur Gungor, Olafur Gudmundsson, Chen Chen, and Kim Toy.
Caruana, Rich. “Multitask learning.” Machine Learning 28.1 (1997): 41–75.
Zhao, Zhe, et al. “Recommending what video to watch next: a multitask ranking system.” Proceedings of the 13th ACM Conference on Recommender Systems. 2019.
Dal Pozzolo, Andrea, et al. “Calibrating probability with undersampling for unbalanced classification.” 2015 IEEE Symposium Series on Computational Intelligence. IEEE, 2015.
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. “Practical Bayesian optimization of machine learning algorithms.” Advances in Neural Information Processing Systems. 2012.
Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Liebel, Lukas, and Marco Körner. “Auxiliary tasks in multi-task learning.” arXiv preprint arXiv:1805.06334 (2018).
Guo, Chuan, et al. “On calibration of modern neural networks.” Proceedings of the 34th International Conference on Machine Learning. PMLR 70 (2017): 1321–1330.
Multi-task Learning for Related Products Recommendations at Pinterest was originally published in Pinterest Engineering Blog on Medium.