Hypothetical Dagli pipeline for smart replies. Circles represent inputs to the DAG. Arrows connect the result of one node to the input of another.
Inference, Personalization and Diversity
When you receive a message, it’s used, together with the preceding conversation, to predict what your responses might be so we can show you the top few highest-probability candidates. Often, these suggestions have placeholders which are used to personalize the message; for example, the model might predict that “Thanks, RECIPIENT_FIRST_NAME” is a good response to “Just sent you the document”. These placeholders are replaced with the corresponding pieces of information, so (if you’re talking to Jane) what you ultimately see is “Thanks, Jane”.
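A minimal sketch of the placeholder substitution step, in Python. The placeholder syntax (upper-case words joined by underscores), the function name `fill_placeholders`, and the choice to drop a suggestion whose placeholder cannot be filled are all assumptions made for illustration; the production system may handle these differently.

```python
import re

# Assumed placeholder format: upper-case words joined by underscores,
# e.g. RECIPIENT_FIRST_NAME. This format is hypothetical.
PLACEHOLDER = re.compile(r"\b[A-Z]+(?:_[A-Z]+)+\b")


def fill_placeholders(template, values):
    """Return the template with placeholders replaced by values.

    Returns None if any placeholder has no value, so the caller can
    discard the suggestion instead of showing a raw placeholder.
    """
    missing = False

    def substitute(match):
        nonlocal missing
        key = match.group(0)
        if key not in values:
            missing = True
            return key
        return values[key]

    filled = PLACEHOLDER.sub(substitute, template)
    return None if missing else filled


suggestion = fill_placeholders(
    "Thanks, RECIPIENT_FIRST_NAME", {"RECIPIENT_FIRST_NAME": "Jane"}
)
print(suggestion)  # Thanks, Jane
```

Returning None for an unfilled placeholder is one possible design; falling back to a placeholder-free variant such as "Thanks" would be another.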
One potential issue is that there are, for example, many ways to say “yes”: “yeah”, “yup”, “sure”, etc., and if “yeah” is predicted with high probability, “sure” tends to be as well. This hurts the diversity of the smart replies we display; we’d prefer not to show you three different ways to say “yes”, as this precludes us from also suggesting “maybe” or “no” and reduces the chance that at least one of the options is a good suggestion for you. Instead, we use the aforementioned semantic groupings of the candidate replies to check whether all the suggestions have the same meaning; if so, we enforce simple rules (like “no more than two suggestions should be from the same semantic group”) to ensure a more diverse final set of suggestions.
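The “no more than two suggestions from the same semantic group” rule can be sketched as a greedy pass over the ranked candidates. This is a simplified illustration rather than the actual implementation: the function name `diversify`, the `group_of` mapping, and the greedy strategy itself are assumptions.

```python
from collections import Counter


def diversify(ranked_replies, group_of, k=3, max_per_group=2):
    """Pick up to k replies, in rank order, with a per-group cap.

    ranked_replies: candidates sorted from most to least probable.
    group_of: hypothetical mapping from reply text to semantic group.
    """
    chosen = []
    counts = Counter()
    for reply in ranked_replies:
        group = group_of[reply]
        if counts[group] >= max_per_group:
            continue  # this group is already represented often enough
        chosen.append(reply)
        counts[group] += 1
        if len(chosen) == k:
            break
    return chosen


group_of = {"yeah": "YES", "yup": "YES", "sure": "YES",
            "maybe": "MAYBE", "no": "NO"}
ranked = ["yeah", "yup", "sure", "maybe", "no"]
print(diversify(ranked, group_of))  # ['yeah', 'yup', 'maybe']
```

Here “sure” is skipped because two “yes” replies were already selected, which frees a slot for “maybe”.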
Text generation models are typically evaluated by comparing the generated text to one or more “reference” texts using a metric like BLEU or Word Error Rate, and we could potentially use these to evaluate the replies suggested by our models, too. However, these metrics tend not to work well on the kind of very short messages used for smart replies; if the actual reply made by a user was “yep” but we predicted “yes”, either metric would consider this just as bad (or good) as predicting “no”, or “zebra”, or “antidisestablishmentarianism”. While there are more sophisticated metrics that avoid this problem somewhat by considering the synonymy of words, judging the equivalence of texts is a hard problem, and the resulting scores still often do not reflect the real performance of the model as a human would perceive it.
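To make the problem concrete, here is a small sketch of Word Error Rate using its standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` is chosen for illustration, and the whitespace tokenization is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


for predicted in ["yes", "no", "zebra"]:
    print(predicted, word_error_rate("yep", predicted))
```

For a one-word reference, every wrong word is a single substitution, so all three predictions receive the same score of 1.0 regardless of meaning.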
Fortunately, because we know which semantic group each possible candidate reply belongs to, we have an even better (and much simpler) alternative: checking whether the actual and predicted replies belong to the same semantic group. So if the actual reply was “Certainly” and the model predicted “Sure”, we consider that correct because both replies have the same meaning, but a prediction of “Goodbye” would be wrong. While this does not capture the exact connotation (“yep” is less formal than “yes”), it nonetheless allows us to quantify the performance of the model in a way that is both robust and very comprehensible, e.g. “the percentage of times when one of the top three suggestions had the correct meaning”. Such metrics are invaluable both in estimating the quality of the user experience and, especially, in judging whether one model variant should be preferred to another.
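A metric of this kind can be sketched in a few lines. The function name `top_k_group_accuracy`, the data layout, and the decision to count an actual reply with no known semantic group as a miss are assumptions for illustration only.

```python
def top_k_group_accuracy(examples, group_of, k=3):
    """Fraction of examples where a top-k suggestion shares the
    semantic group of the reply the user actually sent.

    examples: list of (actual_reply, ranked_predictions) pairs.
    group_of: hypothetical mapping from reply text to semantic group.
    """
    if not examples:
        raise ValueError("need at least one example")
    hits = 0
    for actual, predictions in examples:
        actual_group = group_of.get(actual)
        if actual_group is None:
            continue  # unknown group: counted as a miss (assumption)
        if any(group_of.get(p) == actual_group for p in predictions[:k]):
            hits += 1
    return hits / len(examples)


group_of = {"Certainly": "YES", "Sure": "YES", "Goodbye": "BYE",
            "Maybe": "MAYBE", "No": "NO"}
examples = [
    ("Certainly", ["Sure", "Goodbye", "Maybe"]),  # hit: "Sure" is YES
    ("Certainly", ["Goodbye", "Maybe", "No"]),    # miss
]
print(top_k_group_accuracy(examples, group_of))  # 0.5
```

Comparing this number across two model variants on the same examples gives a direct, easily explained basis for preferring one over the other.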
Our members send massive numbers of messages every week, and the recipient of any one of them may want to use a smart reply recommendation. This makes serving very difficult: highly computationally intensive recommendations must be produced at the rapid rate at which members demand them. At the same time, we need to ensure that the recommendation engine does not affect the speed of message delivery.