Capturing Espresso updates
Brooklin, LinkedIn’s stream ingestion service, offers change-capture services from Espresso and Oracle data sources, and has a simple integration with Samza. In our pipeline, Samza listens to change-capture events from Brooklin for the Espresso commute preference updates and sends out Kafka messages for the isochrone requests.
Getting and persisting isochrones
The requests to the Bing Isochrone API go through GaaP, LinkedIn’s Gateway-as-a-Platform. For each commute preference, we make several requests to Bing with different commute times. The individual isochrone responses are later aggregated to form an isochrone gradient.
The high-level Samza fluent API is used to combine multiple Samza tasks into one Samza application. The application handles merging messages from different sources, aggregating the individual isochrone responses to create the isochrone gradient, repartitioning the complete messages by member ID, and finally, determining whether the update is the latest and eligible to be persisted. To determine the order of the responses, the timestamp from the online Espresso update is propagated throughout the process and saved locally in the Samza application.
Venice is chosen as the storage system for our derived data as it offers sub-millisecond latency and supports very high QPS, which allows us to use the data in different latency-sensitive job services.
Solving delayed updates in the Venice store
The asynchronous pipeline can take a couple of minutes to complete, starting from a complete Espresso update to a new isochrone gradient being saved in Venice. To prevent users from seeing stale data while waiting for the async response, we invalidate all reads to the isochrones during this time. For every isochrone read, we query Venice for the isochrone and Espresso for the latest timestamp. We only return the isochrones if the timestamps in the Venice record and Espresso record match.
Having Brooklin and Samza work together facilitates the re-bootstrapping process. Espresso regularly publishes a snapshot of the entire database. Brooklin supports bootstrapping by having two streams, one reading from the snapshot and one reading from the online change-capture stream. When the bootstrapping instance starts, our Samza processor first consumes from the snapshot bootstrap stream and later switches to the change-capture stream. The Brooklin/Samza system consumer seamlessly makes the switch without the client’s knowledge.
In addition, the Venice team is building support for a hybrid store to simultaneously consume from a finite reprocessing job and a non-finite nearline job, similar to how they supported Venice Hybrid for batch push and nearline jobs. Once this work is done, re-bootstrapping becomes an automatic process that can be run anytime on demand to regenerate the isochrones without interfering with the flow for new updates from members.
Working with isochrones
Thanks to cutting-edge optimizations from Bing, we can generate large isochrones for 2-hour drives in under 10 seconds. However, depending on the number of vertices of the multipolygon returned, isochrones can be unwieldy and impractical. Since the public Isochrone API from Bing Maps must serve many different use-cases outside of our own, it was important for us to be able to limit the size of these isochrones ourselves to meet storage and performance requirements, all while maintaining a reasonable tradeoff in accuracy. After investigating the literature on computational geometry and polygon simplification, we narrowed our experimentation down to two popular choices for polyline simplification: the Douglas-Peucker algorithm and the Visvalingam-Whyatt algorithm. After we experimented with and benchmarked the two using the Java Microbenchmarking Harness, we chose Visvalingam-Whyatt for its slightly faster performance, tendency to produce smoother edges, and preservation of gradual changes over longer distances when compared to Douglas-Peucker. Some of these characteristics are noticeable even when comparing the simplification of a small example isochrone in San Francisco’s Golden Gate Park, as illustrated below: