How we used sequence models and LSTM networks to suggest responses for our customer service agents
By Xin (Cindy) Chen, Mengting Wan
At Airbnb, we serve a global community of hosts and guests. An essential part of Airbnb’s business is to provide high quality customer service at scale. Live chat is one of our customer support channels, where customers can chat with our customer service agents to resolve their issues.
It is time-consuming for agents to repeatedly type or copy and paste frequently used messages in their conversations. Not only does this waste time for both our customers and agents, it can also prevent deep engagement with customers when resolving their issues. To address these issues, we introduced Smart Replies into our customer support chat system, where we suggest short responses for agents to use with one click.
The Anatomy of a Conversation
If you are relatively new to conversational AI, the various terms used in literature on modeling conversation dialogues can be confusing. Before diving into the details of our algorithms, it might be helpful to clarify a few terms: utterance, message, turn, and round.
Message/utterance: As shown in Figure 2, one message is one text blurb. One message can contain multiple sentences. Here we treat message and utterance interchangeably.
In this context, a customer initiates a chat conversation by describing their issue in a message, and this creates an inbound ticket. All examples we use start with a customer message.
Turn: A turn is composed of all messages sent consecutively by one interlocutor.
Round: One round is one turn from each of the interlocutors. Because our goal is to suggest responses to the agent for the next round, in our context a round always starts with the agent’s turn and ends with the customer’s turn. The first round (which, for inbound tickets, contains only the customer’s turn) and the last round (which, for most customer support chat conversations, contains only the agent’s turn) are exceptions to this rule.
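To make these definitions concrete, here is a minimal Python sketch that groups a flat list of messages into turns and then into rounds. The `(sender, text)` tuple format and the function names are assumptions made for illustration; they are not taken from our production system.

```python
from itertools import groupby


def group_into_turns(messages):
    """Merge consecutive messages from the same sender into one turn."""
    return [
        (sender, [text for _, text in group])
        for sender, group in groupby(messages, key=lambda m: m[0])
    ]


def group_into_rounds(turns):
    """Group turns into rounds that start with an agent turn and end with a customer turn.

    The opening customer turn of an inbound ticket forms a round on its own,
    and so does a trailing agent turn at the end of the conversation.
    """
    rounds, current = [], {}
    for sender, texts in turns:
        # An agent turn opens a new round, so close the round in progress.
        if sender == "agent" and current:
            rounds.append(current)
            current = {}
        current[sender] = texts
    if current:
        rounds.append(current)
    return rounds


# Hypothetical conversation: (sender, message text).
messages = [
    ("customer", "Hi, I can't log in."),
    ("customer", "It says my password is wrong."),
    ("agent", "Let me look into this."),
    ("agent", "Do you perhaps have another account with us?"),
    ("customer", "Yes, I do."),
    ("agent", "Thanks, that explains it."),
]

turns = group_into_turns(messages)
rounds = group_into_rounds(turns)
for number, conversation_round in enumerate(rounds, start=1):
    print(number, conversation_round)
```

In this example the six messages collapse into four turns and three rounds: the customer's opening turn, one full agent-plus-customer round, and a final round holding only the agent's turn.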
Response Candidate Generation
The first step towards a smart-reply system is to have a pool of candidate responses. We took all agent messages from chat history, anonymized personal information in the messages, and tokenized them into sentences. We vectorized these sentences using a TF-IDF-weighted word2vec model (Mikolov et al., 2013). Then we applied a scalable mini-batch K-means clustering algorithm, followed by a hierarchical clustering algorithm, to the sentence vectors to obtain clusters of sentences with similar semantic meanings. For example, “Give me a moment while I look into your case.” and “Let me look into this.” are clustered together. Similarly, “Do you need any further assistance?” and “Is there anything else I can help you with?” are in the same semantic cluster.
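As a rough illustration of the vectorization step, the sketch below averages word vectors using TF-IDF weights and compares the resulting sentence vectors with cosine similarity. Everything specific here is an assumption: the two-dimensional `WORD_VECS` values are hand-made stand-ins for real word2vec embeddings, and the IDF formula `log(N / df) + 1` is just one common variant. The clustering step is not shown.

```python
import math
from collections import Counter

# Toy 2-d "word2vec" vectors. Hypothetical values for illustration only;
# a real system would load trained embeddings.
WORD_VECS = {
    "look": (1.0, 0.1), "into": (0.9, 0.2), "moment": (0.8, 0.1), "case": (0.9, 0.0),
    "help": (0.1, 1.0), "anything": (0.0, 0.9), "else": (0.2, 0.8),
    "me": (0.5, 0.5), "i": (0.5, 0.5), "you": (0.5, 0.5),
}


def idf_table(tokenized_sentences):
    """Inverse document frequency per word, treating each sentence as a document."""
    n = len(tokenized_sentences)
    df = Counter(w for sent in tokenized_sentences for w in set(sent))
    return {w: math.log(n / count) + 1.0 for w, count in df.items()}


def sentence_vector(tokens, idf, dim=2):
    """TF-IDF-weighted average of the word vectors of the known words."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in Counter(tokens).items():
        if word not in WORD_VECS:
            continue  # skip out-of-vocabulary words
        weight = tf * idf.get(word, 1.0)
        for k in range(dim):
            total[k] += weight * WORD_VECS[word][k]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = [
    "give me a moment while i look into your case".split(),
    "let me look into this".split(),
    "is there anything else i can help you with".split(),
]
idf = idf_table(sentences)
vecs = [sentence_vector(s, idf) for s in sentences]

print("look-into pair:", cosine(vecs[0], vecs[1]))
print("look-into vs. anything-else:", cosine(vecs[0], vecs[2]))
```

With these toy vectors, the two "look into" sentences end up much closer to each other than to the "anything else" sentence, which is the property the clustering step relies on.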
By clustering sentences with similar semantic meanings together, we identified repeated patterns from agent messages. These clusters can also help us avoid redundancy and increase diversity in the recommendations: only responses from distinct semantic clusters are recommended at the same time.
The generated response candidates were finalized by content experts to ensure proper tone and style.
Long and Short are Relative: Sequence Models with LSTMs
After we have the pool of candidate responses, the next task is to recommend the top N responses based on the conversation context. In this post, we will introduce two versions of recommendation algorithms. Both are sequence models (recurrent neural networks, or RNNs) leveraging LSTM units.
Long and Short in LSTM
For the first algorithm, we adopted the sequence-to-sequence model architecture in Figure 4. This kind of sequence-to-sequence model structure, or a variant of it, is used in several smart-reply systems (Kannan et al., 2016). The unit of the sequence is a word, and the entire sequence is one message. The model takes the most recent preceding message and outputs one or more potential responses.
In this architecture, information is passed from one unit to the next. This is essential in all RNN based sequence models. There are different types of RNN cell structures that control how information flows through the chain. In a vanilla RNN unit, only one single state vector is passed from the preceding unit to the next. If the input sequence is long, such as “Thank you for the help, everything has been resolved on our end”, instead of just “Thank you for the help”, memory about early parts of the sequence (long-term memory) could be lost. The predicted output would not be ideal if a good prediction is dependent upon such long-term information.
LSTM (Long Short-Term Memory) networks are a special kind of RNN capable of memorizing both long-term and short-term information from an input sequence. This blog post (Olah, 2015) does an excellent job illustrating why LSTMs are capable of doing so. Put simply, LSTMs have a more complex structure, which allows two state vectors to be passed from the preceding cell to the next, one for short-term memory and the other for long-term memory.
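The two state vectors can be seen in a single step of the standard LSTM cell equations, sketched below in plain Python. This is a generic textbook formulation, not our model code: the parameter names (`Wf`, `bf`, and so on) and the hand-picked weights in the demo are assumptions chosen to make the behavior easy to see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, b, z, activation):
    """Apply activation(W @ z + b) using plain lists."""
    return [
        activation(sum(w * x for w, x in zip(row, z)) + bias)
        for row, bias in zip(W, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step.

    h_prev is the hidden state (short-term memory) and c_prev is the
    cell state (long-term memory); both are passed on to the next step.
    """
    z = h_prev + x  # list concatenation: [h_prev, x]
    f = gate(p["Wf"], p["bf"], z, sigmoid)    # forget gate
    i = gate(p["Wi"], p["bi"], z, sigmoid)    # input gate
    o = gate(p["Wo"], p["bo"], z, sigmoid)    # output gate
    g = gate(p["Wg"], p["bg"], z, math.tanh)  # candidate cell values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Hypothetical parameters (hidden size 1, input size 1): the forget gate is
# biased open and the input gate is biased shut, so the cell state is kept.
params = {
    "Wf": [[0.0, 0.0]], "bf": [10.0],
    "Wi": [[0.0, 0.0]], "bi": [-10.0],
    "Wo": [[0.0, 0.0]], "bo": [0.0],
    "Wg": [[0.0, 0.0]], "bg": [0.0],
}

h, c = [0.0], [1.0]
for _ in range(20):
    h, c = lstm_step([0.5], h, c, params)

print("cell state after 20 steps:", c[0])
print("hidden state after 20 steps:", h[0])
```

With these settings the cell state stays close to its initial value of 1.0 after 20 steps, which illustrates how the gates let an LSTM carry information across a long sequence. In a trained network the gate values are learned and depend on the input at each step.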
It is worth pointing out that the responses directly generated from the sequence-to-sequence model may contain incomplete sentences and have unsatisfying styles. We used the K-nearest neighbor method to find the closest responses from the candidate pool described above and ensured that the recommended responses are from distinct semantic clusters.
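The nearest-neighbor lookup with the distinct-cluster constraint could look roughly like the sketch below. It assumes that the generated response and every candidate have already been embedded in the same vector space, and it uses cosine similarity as the distance measure; the candidate vectors, cluster ids, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_distinct(query_vec, candidates, n=3):
    """Return up to n candidate texts nearest to query_vec, at most one per cluster.

    candidates is a list of (text, vector, cluster_id) tuples.
    """
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    seen_clusters, picks = set(), []
    for text, _, cluster_id in ranked:
        if cluster_id in seen_clusters:
            continue  # a closer response from this cluster is already picked
        seen_clusters.add(cluster_id)
        picks.append(text)
        if len(picks) == n:
            break
    return picks


# Hypothetical candidate pool: (text, toy 2-d embedding, semantic cluster id).
candidates = [
    ("Let me look into this.", (0.95, 0.30), 0),
    ("Give me a moment while I look into your case.", (0.97, 0.25), 0),
    ("Is there anything else I can help you with?", (0.30, 0.95), 1),
    ("Do you need any further assistance?", (0.25, 0.97), 1),
    ("Which payment method would you like to use?", (0.70, 0.70), 2),
]

# Toy embedding of a response generated by the sequence-to-sequence model.
generated_vec = (1.0, 0.2)
picks = top_n_distinct(generated_vec, candidates, n=3)
for text in picks:
    print(text)
```

Sorting the whole pool is fine for a sketch; a large candidate pool would more likely use an approximate nearest-neighbor index.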
The deployed system with this algorithm has resulted in positive impact, and received positive feedback from our agents. However, we also found it has major limitations for our use case:
- Although this method uses LSTM networks, the input sequence is only the most recent preceding message. The long-term memory can only be as long as the input sequence. However, a chat conversation has many rounds, and previous messages contain relevant information, which this method is not able to capture.
- This method is particularly suitable for suggesting reactive responses and answering customer questions. However, it has difficulties in suggesting proactive responses such as investigative questions (e.g., “Do you perhaps have another account with us?”, “Which payment method would you like to use?”). This could also be due to the lack of long term context from earlier parts of the conversation. Investigations are critical steps where customer service agents collect necessary information to resolve customers’ issues, so we designed a second algorithm architecture to solve this problem.
Long and Short in a Conversation
Long and short are relative. While the above model uses LSTMs, it only has context on the most recent preceding message. This is considered short-term context in a conversation, so we designed a new algorithm architecture to carry long-term context from all preceding messages.
As shown in Figure 5, this new model architecture treats all messages in one round as the input to one LSTM unit, so all preceding rounds in the conversation form the sequence. In one round, the dialogue turn embedding of the agent, x(a), and that of the customer, x(c), are concatenated to form the input to the LSTM unit. The output y represents the agent responses in the next round.
We also considered the ticket issue as an additional feature. A ticket issue can be assigned by the customer or the agent at the creation of a ticket, or predicted by a separate issue prediction model. We experimented with concatenating the ticket issue with the turn embeddings in the input layer, concatenating it with the hidden states at the output layer, or both.
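The input-layer variant described above can be sketched as simple vector concatenation. The details below are assumptions made for illustration: the turn embeddings are toy three-dimensional vectors, the ticket issue is one-hot encoded, and the missing agent turn in the first round is filled with zeros.

```python
def one_hot(index, size):
    """One-hot encode a categorical id (here, a hypothetical ticket issue id)."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def round_input(x_agent, x_customer, issue_id=None, num_issues=0):
    """Concatenate x(a) and x(c), optionally followed by the ticket issue feature."""
    features = list(x_agent) + list(x_customer)
    if issue_id is not None:
        features += one_hot(issue_id, num_issues)
    return features


EMBED_DIM = 3   # toy turn-embedding size
NUM_ISSUES = 4  # toy number of ticket issue categories
NO_TURN = [0.0] * EMBED_DIM  # assumed placeholder for a missing turn

# Hypothetical conversation: one (x_agent, x_customer) pair per round.
# The first round of an inbound ticket has no agent turn.
conversation = [
    (NO_TURN, [0.2, 0.7, 0.1]),
    ([0.9, 0.1, 0.0], [0.3, 0.3, 0.4]),
    ([0.5, 0.5, 0.0], [0.1, 0.8, 0.1]),
]
issue_id = 2

# One input vector per round; this sequence is what the round-level LSTM consumes.
sequence = [
    round_input(x_a, x_c, issue_id=issue_id, num_issues=NUM_ISSUES)
    for x_a, x_c in conversation
]
for step_input in sequence:
    print(step_input)
```

Each step input has length 2 * EMBED_DIM + NUM_ISSUES. Concatenating the issue feature at the output layer instead would mean appending the same one-hot vector to the LSTM hidden state before the final prediction layer.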
As the example in Figure 1 shows, this new model architecture has enabled us to suggest proactive responses that lead the flow of the conversation, particularly investigative questions. It also has the ability to carry long-term context from the early part of the conversation.
Challenges and Next Steps
Productionizing the second model poses several challenges. For the first model described above, the entire sequence is served as input at online serving time, so there is no need to cache the hidden states between the LSTM units. In the second model, however, only the preceding round is served as input, so we need to cache the hidden states generated by the preceding unit.
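One way to picture the caching requirement is a per-ticket state store, as in the sketch below. This is not our serving code: the class and method names are hypothetical, the cache is an in-memory dictionary, and `toy_step` is a simple stand-in for the real LSTM step so that the example stays self-contained.

```python
class StatefulRecommender:
    """Keeps the recurrent state of each open ticket between rounds."""

    def __init__(self, step_fn, initial_state):
        self._step = step_fn
        self._initial_state = initial_state
        self._cache = {}  # ticket_id -> state after the latest served round

    def on_round(self, ticket_id, round_features):
        """Serve one round: load the cached state, advance it, store it back."""
        state = self._cache.get(ticket_id, self._initial_state)
        output, new_state = self._step(round_features, state)
        self._cache[ticket_id] = new_state
        return output

    def close_ticket(self, ticket_id):
        """Drop the cached state once the conversation ends."""
        self._cache.pop(ticket_id, None)


def toy_step(features, state):
    """Stand-in for an LSTM step: blends the previous state with the new input."""
    new_state = tuple(0.5 * s + 0.5 * x for s, x in zip(state, features))
    return new_state, new_state


def replay(step_fn, initial_state, all_rounds):
    """Offline-style evaluation: run the whole sequence from scratch."""
    state, output = initial_state, None
    for features in all_rounds:
        output, state = step_fn(features, state)
    return output


recommender = StatefulRecommender(toy_step, initial_state=(0.0, 0.0))
conversation = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy round features

# Online serving: only the newest round is sent with each request.
for features in conversation:
    latest_output = recommender.on_round("ticket-123", features)

print(latest_output)
print(replay(toy_step, (0.0, 0.0), conversation))
recommender.close_ticket("ticket-123")
```

Serving round by round with the cached state gives the same result as replaying the full conversation, while each request carries only the newest round. A production cache would also need expiry and would have to be shared across serving instances.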
Notice that in our second model architecture, multiple messages in the same dialogue turn are concatenated during offline training. However, end-of-turn signals are not always accessible in real time: we do not know whether an interlocutor will send another message in the same turn. We are investigating ways to resolve this problem during online serving.
On average, our agent responses contain 2 to 3 sentences per message. Adoption of the recommendations would increase, and our agents would save effort, if they did not need to manually combine multiple sentences into one message. Currently, we have static rules to combine multiple sentences for the introductions and closings of a conversation. How to combine sentences smartly throughout the conversation based on context is a problem worth solving.
Finally, while the current recommendation user interface uses three hover-over bubbles, it can become cumbersome especially when we combine multiple sentences in one recommendation. We plan to build a set of smart composing and quick access features to help our agents ease the burden of processing text content and reduce the inconvenience of repeated typing, so that they can focus on engaging and helping our guests and hosts.
AAAI DEEP-DIAL Workshop Paper
You can find more details of the model architecture, training process, and evaluation methods in our paper (arXiv: https://arxiv.org/abs/1811.10686). This paper has been accepted for full oral presentation at the AAAI Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL 2019). Please join us at the conference to learn more about it. See you from January 27 to February 1, 2019, in Honolulu, Hawaii!