By Hyungjun Lee, Jacob van Gogh
This blog post details how Lyft predicts riders’ destinations when they open the app.
Destination prediction enables Lyft to provide relevant location suggestions, allowing riders to set their destination with a single tap instead of typing out search queries, and thus making this part of the ride hailing experience effortless.
We tackle the destination recommendation problem using the rider’s historical rides. The main idea is to limit candidate recommendations to addresses where the rider has previously taken a Lyft ride to or from. Within this candidate set, we use an attention mechanism (discussed in more detail below) to determine which locations are most relevant to the current session.
The Candidate Set
Ride destinations are highly personal — each location has different meanings to different riders. For example, a rider’s home is a highly probable destination for that rider, but would not be relevant to most others. We therefore need to tailor predictions specifically for each rider.
We achieve this personalization by restricting the candidate destinations to the locations that appear as either the origin or the destination of a previous ride taken by the rider, filtered by a physical proximity measure. (For example, a rider probably isn’t taking a Lyft to their home in San Francisco when they’re in New York on a business trip.) An added benefit of this approach is that it limits the number of candidate destinations to a reasonable size: in a naive formulation there is an extremely large number of destinations a rider could theoretically travel to, and classification problems with that many candidate labels generally require a candidate generation step for computational reasons.
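As a minimal sketch of this candidate generation step (the function names, data layout, and distance threshold here are our own assumptions for illustration, not Lyft’s actual implementation), one could collect the unique endpoints of past rides and filter them by great-circle distance from the rider’s current location:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_set(history, current_lat, current_lon, max_km=100.0):
    """Unique origins/destinations of past rides within max_km of the rider.

    history: list of dicts with "origin" and "destination" (lat, lon) tuples.
    The max_km cutoff is a hypothetical proximity measure.
    """
    candidates = set()
    for ride in history:
        for lat, lon in (ride["origin"], ride["destination"]):
            if haversine_km(current_lat, current_lon, lat, lon) <= max_km:
                candidates.add((lat, lon))
    return candidates
```

With this filter, a rider opening the app in New York would see none of their San Francisco locations in the candidate set.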
In order to predict which of the previous ride origins and destinations a rider is traveling to, we use various forms of attention. Let’s first review the basics of attention.
Consider a sequence of vectors (keys), K, where the goal is to create a score for each individual vector to use for weighting purposes. Imagine a separate vector (query), q, with the same dimension as the keys. One way to score the keys is to take the dot product similarity between each key and the query. We can then normalize these scores (e.g., with a softmax) to get a standard weighting, which we apply to a sequence of values, V (which must be the same length as our keys, and can even be the keys themselves). Our simplified attention would then be constructed as follows:
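In code, this simplified attention might look like the following numpy sketch (the softmax normalization is our assumption of the standard choice):

```python
import numpy as np

def simple_attention(q, K, V):
    """q: (d,) query; K: (n, d) keys; V: (n, m) values.

    Scores each key by its dot product with the query, normalizes the
    scores with a softmax, and returns the weighted average of the values.
    """
    scores = K @ q                           # (n,) dot-product similarities
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()                 # softmax normalization
    return weights @ V                       # (m,) weighted average of values
```

A query that strongly matches one key pulls the output toward that key’s value.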
What if we want to generate more than one set of weights for our keys? There’s nothing preventing us from stacking a set of queries into a matrix, Q. Finally, research has shown that the performance of this scheme is improved, particularly on longer sequences, by scaling the scores by the square root of the dimension of the keys¹, d, before normalizing, giving us the following attention function:
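A numpy sketch of this scaled dot-product attention with stacked queries:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (q, d) stacked queries; K: (n, d) keys; V: (n, m) values.
    Scores are scaled by sqrt(d) before the row-wise softmax.
    """
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (q, n) scaled similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V             # (q, m) one weighted average per query
```

Each row of the output is the values averaged under that query’s weights; with all-zero queries the weights are uniform and the output is simply the mean of the values.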
We enhance the capabilities of attention by utilizing multi-head attention. As the name suggests, multi-head attention has multiple heads, the outputs of which are concatenated to yield the final multi-head attention output. Each head can be trained to focus on a particular context, instead of a single big attention mechanism being trained to focus on everything.
Specifically, each head applies linear transformations to the query (Q), key (K), and value (V), and applies attention to the transformed matrices:
where W^Q_i, W^K_i, and W^V_i are trainable weights defining the linear transformations. The output of the multi-head attention is the concatenation of the outputs of each head:
where H is the number of heads.
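The per-head projections and the final concatenation can be sketched as follows (a single-head attention reused across heads; the weight shapes are illustrative, not Lyft’s actual dimensions):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over stacked queries."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, heads):
    """heads: list of (W_q, W_k, W_v) trainable weight triples, one per head.

    Each head linearly projects Q, K, and V, applies attention to the
    projected matrices, and the per-head outputs are concatenated along
    the feature axis to form the final output.
    """
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1)
```

With H heads of projection width d/H, the concatenated output recovers the original feature width.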
The final output of the model represents the probability that each location in the candidate set is the destination of the current session. It is produced by an attention layer as a weighted average of the historical origins and destinations, each represented as a one-hot encoded vector. The inputs to this final attention layer are:
- Query: A vector representing the current ride context.
- Key: A sequence of vectors representing the context of each historical ride.
- Value: A sequence of one-hot encoded vectors of historical origins and destinations.
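Because the values are one-hot vectors and the attention weights sum to one, the output of this layer is itself a probability distribution over candidate locations. A sketch of the idea (names and the scaling choice are our assumptions):

```python
import numpy as np

def destination_probs(query, keys, location_ids, num_locations):
    """query: (d,) current ride context; keys: (n, d) historical ride
    contexts; location_ids: length-n candidate-location index for each
    historical origin/destination.

    The values are one-hot rows, so the attention-weighted average is a
    probability distribution over the candidate locations.
    """
    V = np.eye(num_locations)[location_ids]          # (n, num_locations) one-hot
    scores = keys @ query / np.sqrt(keys.shape[-1])  # scaled similarities
    w = np.exp(scores - scores.max())
    w /= w.sum()                                     # softmax weights
    return w @ V                                     # probabilities, sums to 1
```

A candidate location’s probability is just the total attention weight placed on the historical rides that touched it.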
To obtain the current and historical ride contexts that serve as the query and key for the final attention layer, we send the raw feature vectors from the current session and the historical rides together through a series of joint self-attention layers. These layers help frame the “meaning” of each ride in the context of the rides around it.
There are four separate attention mechanisms in the joint self-attention layer: one self-attention each for the current and the historical ride contexts, and two cross-attentions between the two. In self-attention, the query, key, and value are all provided by the same context. In cross-attention, the query comes from one context and the key and value from the other. We add a skip connection on top of the attention layers to obtain the intermediate output, as shown in the diagram above. Finally, the intermediate output is sent through a pointwise feedforward layer with another skip connection to yield the final output:
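A single-head numpy sketch of this wiring (how the four attention outputs and skip connections are combined is our reading of the description, not Lyft’s published code, and the pointwise feedforward would be a trainable network in practice):

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention over stacked queries."""
    d = K.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def joint_self_attention(cur, hist, ff):
    """cur: (1, d) current ride context; hist: (n, d) historical contexts.

    Each side gets a self-attention plus a cross-attention into the other
    side, with a skip connection on top; the intermediate output then goes
    through a pointwise feedforward ff with another skip connection.
    """
    cur_mid = cur + attn(cur, cur, cur) + attn(cur, hist, hist)     # self + cross
    hist_mid = hist + attn(hist, hist, hist) + attn(hist, cur, cur)  # self + cross
    return cur_mid + ff(cur_mid), hist_mid + ff(hist_mid)            # FF + skip
```

Stacking two such layers, as the post describes, lets each ride’s representation be refined in the context of all the others.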
We assume that the ride will originate from the rider’s current location, and the ride will be requested shortly after the app is opened. Thus, the raw ride features are:
- Latitude and longitude of the rider’s current location.
- Request time (i.e., the current time).
- Request time, latitude, and longitude of the origin and destination of each of the rider’s historical rides.
We first send these raw ride features through pointwise feedforward layers. For historical rides, there are two separate pointwise feedforward layers to extract the raw context vectors for the origins and destinations, respectively. The results of these pointwise feedforward networks are fed through two successive joint self-attention layers described above to yield the query (current ride context) and the key (historical ride contexts) for the final attention layer. At this point, we append an additional output class to the historical origin/destination set to account for the possibility that the desired destination of the current session is not in the historical set. A corresponding key vector consisting of all 0s is also appended to the keys. The resulting query, key, and value are sent through the final attention layer to yield the current session destination probabilities.
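The “not in history” class described above amounts to appending one extra output column and an all-zeros key vector, so the model can place probability mass on an unseen destination. A small sketch (names are ours):

```python
import numpy as np

def add_unknown_class(keys, location_ids, num_locations):
    """Append an 'unseen destination' class to the candidate set.

    Adds an all-zeros key vector and a fresh class index, so the final
    attention layer can assign probability to 'not in the historical set'.
    """
    keys_ext = np.vstack([keys, np.zeros((1, keys.shape[1]))])
    ids_ext = np.append(location_ids, num_locations)  # index of the new class
    return keys_ext, ids_ext, num_locations + 1
```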
The diagram below shows the complete view of the network:
The model was trained on 16 million pre-COVID rides for 50 epochs. COVID-19 has drastically changed the ride patterns of Lyft riders, so when evaluating this model’s performance, it was important to ensure that it provided high-quality predictions regardless of the types of rides passengers were taking. We analyzed the performance of the model on a dataset from early 2020, before the US started sheltering in place, and on one from when the effects of COVID-19 were the strongest. One metric of particular interest to us is top-2 accuracy, which measures the percentage of rides in which the correct destination is in the model’s top 2 predictions. Our top 2 suggested destinations are highlighted in the app, so performance in these spots is especially important. Our model improved top-2 accuracy by 8% over the existing methodology.
Lyft conducted a two-week, user-split experiment across all regions comparing this new model to the previous one. Riders who were provided suggestions from the new model used them to set their destinations 3.5% more often than users still on the old model. It’s always exciting when the performance improvements seen in the offline model translate to the live system, but the work doesn’t stop here. This 3.5% increase isn’t as large as the improvement we saw offline. We believe there are optimizations we can make to the way the suggested destinations are displayed that can help us get closer to the offline results.
The team is always striving to leverage our vast amounts of data to improve the Lyft experience. If you’re interested in joining, please check out our job listings.