Decomposing network calls on the Lyft mobile apps | by Don Yu | Oct, 2020

When Lyft was first developed, it was built on a monolithic server architecture. Within this architecture, every mobile client relied on a single endpoint to fetch all data pertaining to the user and their ride (the "state of the world"): /users/:user_id/location. Using a 5-second polling loop, the client would send the server the user's current location, and then receive the "state of the world" back in what we called the Universal Object.

The Lyft client apps hit the /users/:user_id/location endpoint every 5 seconds to receive the “state of the world” back.
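To make the legacy setup concrete, here is a minimal sketch of that polling loop. Everything here — LatLng, UniversalObject, UniversalObjectPoller — is an illustrative name chosen for this post, not Lyft's actual API:

```typescript
// Sketch of the legacy loop: every 5 seconds the client sends its current
// location and receives the full "state of the world" back.
interface LatLng { lat: number; lng: number; }

interface UniversalObject {
  driverOnline: boolean;
  earningsCents: number;
  rideStops: string[];
  // ...dozens of other fields covering nearly every screen in the app
}

type FetchStateOfTheWorld = (loc: LatLng) => UniversalObject;

class UniversalObjectPoller {
  static readonly INTERVAL_MS = 5_000;
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(private fetch: FetchStateOfTheWorld) {}

  // One iteration of the loop; start() schedules this every INTERVAL_MS.
  pollOnce(loc: LatLng): UniversalObject {
    return this.fetch(loc);
  }

  start(currentLocation: () => LatLng,
        onUpdate: (uo: UniversalObject) => void): void {
    this.timer = setInterval(
      () => onUpdate(this.pollOnce(currentLocation())),
      UniversalObjectPoller.INTERVAL_MS
    );
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }
}
```

Every screen in the app reads from the one object this loop produces — which is exactly what makes the pattern both convenient and fragile.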

This Universal Object was exactly like the name suggests — it included nearly everything the app needed to render a screen. Need to know whether a driver is online? Check the Universal Object. Need to display the driver’s earnings? Also in the Universal Object. Need ride stop information? You get the point. By introducing a “state of the world” polling loop, Lyft was able to iterate quickly when building new features, since product teams could piggy-back on this polling loop by adding new fields to the Universal Object. Additionally, page loads were seamless and the client apps didn’t need complex data management because all user or ride data was always present.

The majority of the data displayed by the client app came from the Universal Object.

This architecture made a lot of sense when Lyft was first starting out and iteration speed was the biggest priority, but it led to tech debt and numerous resiliency issues as we scaled our user base and transitioned our servers to a microservice architecture in 2017. A single "universal" endpoint introduced a single point of failure and did not leverage the extensibility and independence of microservices. Through a joint effort spanning 13+ engineers across various teams, we were able to decompose the Universal Object into many isolated endpoints.

Over 40 new endpoints now replace /users/:user_id/location to populate data in the client app.

Why we chose to decompose the Universal Object (UO)

There were numerous server and client benefits that motivated this project:

Benefit #1: Improved reliability because the new APIs allow for partial availability and failure isolation.

The Universal Object (UO) polling loop did not have resource isolation, and a single incorrect field on a small portion of the response could prevent the client from parsing the entire payload, thereby leading to a blocked user experience. By decomposing the UO and introducing resource isolation, new endpoints were able to map to a single core data model and remain independent from each other. For example, a bug in the microservice that serves user profile information would no longer be able to prevent the driver from receiving route stop information.
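A sketch of what that isolation looks like client-side, using hypothetical Profile and RouteStops models: each payload is parsed and validated independently, so a malformed field in one response can no longer block the other:

```typescript
// Each decomposed resource is parsed and validated on its own, so a bad
// profile payload fails in isolation instead of sinking the whole response.
interface Profile { name: string; }
interface RouteStops { stops: string[]; }

type Result<T> = { ok: true; value: T } | { ok: false; error: string };

function parseIsolated<T>(raw: string, validate: (v: unknown) => v is T): Result<T> {
  try {
    const parsed: unknown = JSON.parse(raw);
    return validate(parsed)
      ? { ok: true, value: parsed }
      : { ok: false, error: "validation failed" };
  } catch (e) {
    return { ok: false, error: String(e) };
  }
}

const isProfile = (v: unknown): v is Profile =>
  typeof v === "object" && v !== null && typeof (v as Profile).name === "string";

const isRouteStops = (v: unknown): v is RouteStops =>
  typeof v === "object" && v !== null && Array.isArray((v as RouteStops).stops);
```

With the monolithic Universal Object, either failure would have been a failure of the entire payload.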

Benefit #2: Easier to triage user issues.

By reducing the number of downstream service dependencies needed to construct a single endpoint's response payload, we were able to create a far simpler debugging interface server-side. Before we migrated off of Lyft's monolith, the client-facing API did not decompose the Universal Object into smaller pieces, so engineers had to investigate dozens of different microservices to root-cause a bug on a specific field of the Universal Object. By decomposing the "state of the world" endpoint to match the microservice architecture, engineers only needed to investigate 1–2 microservices.

Benefit #3: Reduced costs by polling different resources at different rates and transitioning to push.

Different resources update at different rates. For example, the ETA to the next stop changes far more frequently than the user’s profile information. When we decomposed the Universal Object into separate resources, our clients were able to request updates from different APIs at different rates by adjusting their polling intervals. This saved a ton of compute time on the microservices that previously had to perform redundant calculations every time the Universal Object was requested.
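As a rough illustration of the savings, consider a per-resource polling configuration. The paths and intervals below are hypothetical, chosen only to show how much redundant work disappears when slow-changing data stops riding a 5-second loop:

```typescript
// Hypothetical per-resource polling intervals after decomposition.
interface PollingConfig { path: string; intervalMs: number; }

const POLLING_CONFIGS: PollingConfig[] = [
  { path: "/v1/routes/eta",   intervalMs: 5_000 },   // changes constantly
  { path: "/v1/ride/status",  intervalMs: 15_000 },
  { path: "/v1/user/profile", intervalMs: 300_000 }, // rarely changes
];

// Requests per hour for a given config — useful for comparing against
// polling every resource every 5 seconds (720 requests/hour each).
function requestsPerHour(config: PollingConfig): number {
  return Math.floor(3_600_000 / config.intervalMs);
}
```

Under these illustrative numbers, profile data drops from 720 fetches per hour to 12 — and every microservice behind it does proportionally less redundant work.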

The reduced costs from this change apply both to hosting costs on Lyft's side and to bandwidth costs for our users. Throughout our rollout experiments, we saw lower network request latency as payload sizes decreased. Overall client bandwidth costs also decreased because the client app was able to leverage push streams, poll at varying rates, or switch to a single fetch for non-updating payloads.

How to decompose an endpoint safely and seamlessly

One of the core goals of the project was to avoid breaking the mobile apps or blocking other product teams from feature development. Here are the best practices we followed to ship a mobile app refactor of this size.

Best practice #1: Create client-side abstractions.

When we started client-side work, we wanted to make sure that other developers could start working on top of the new data models immediately. To that end, we first started mapping the existing Universal Object into the various “decomposed” data models on the client before we split the endpoints. At the same time, we deprecated the existing data model fields to prevent new usages.

This approach gave client engineers time to adjust to the new paradigm and to build new features on top of it without worrying about the switch to "decomposed" endpoints, and it made it possible to start testing a client-only decomposition before the underlying APIs were ready server-side.

Feature code depended on a client-side wrapper service that would locally map to the decomposed endpoints or to the Universal Object based on whether a feature flag for A/B testing was enabled.
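A minimal sketch of such a wrapper, with hypothetical names: feature code asks the service for route stops, and a feature flag decides whether the data comes from the decomposed endpoint or is mapped locally out of the Universal Object:

```typescript
// Client-side wrapper: feature code depends on the decomposed RouteStops
// model either way; the flag only changes where the data comes from.
interface LegacyUniversalObject { rideStops: string[]; /* ...many other fields */ }
interface RouteStops { stops: string[]; }

class RouteStopsService {
  constructor(
    private useDecomposedEndpoint: () => boolean,            // A/B feature flag
    private fetchDecomposed: () => RouteStops,               // e.g. GET /v1/routes/stops
    private fetchUniversalObject: () => LegacyUniversalObject // legacy polling payload
  ) {}

  routeStops(): RouteStops {
    if (this.useDecomposedEndpoint()) {
      return this.fetchDecomposed();
    }
    // Locally map the legacy Universal Object into the new data model so
    // feature code is already written against the decomposed shape.
    return { stops: this.fetchUniversalObject().rideStops };
  }
}
```

Because callers only ever see RouteStops, flipping the flag during the A/B test requires no feature-code changes.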

Best practice #2: Shadow server-side.

As we built out the new APIs to replace the Universal Object, we needed to ensure that there were no mismatches between the payloads on the new and legacy APIs. For example, the address of a route stop fetched from the decomposed /v1/routes/stops endpoint would need to exactly match the route stop address fetched from /users/:user_id/location. To that end, we shadowed 1% of all production traffic and tracked mismatch counts on a Wavefront dashboard to monitor potential bugs during rollout.
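A simplified sketch of that shadow comparison — the 1% sample rate comes from our rollout, but the function shape here is an assumption for illustration:

```typescript
// For a sampled slice of traffic, compare the same field from the legacy
// and decomposed endpoints and count mismatches for a monitoring dashboard.
interface ShadowStats { compared: number; mismatched: number; }

function shadowCompare(
  legacyAddress: string,
  decomposedAddress: string,
  stats: ShadowStats,
  sample: () => boolean = () => Math.random() < 0.01 // shadow ~1% of traffic
): ShadowStats {
  if (!sample()) return stats;
  return {
    compared: stats.compared + 1,
    mismatched: stats.mismatched + (legacyAddress === decomposedAddress ? 0 : 1),
  };
}
```

Any sustained non-zero mismatch rate on the dashboard pointed at a bug in one of the new endpoints before users could be affected.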

Best practice #3: Experiment rigorously on the new endpoints.

We launched 28+ A/B experiments over the last year to confirm that the decomposition did not break any important user flows or harm business metrics. While these experiments slowed down the rollout, running isolated experiments helped the team uncover numerous edge cases that affected a small percentage of users.

Is decomposing network calls always the right choice?

Lyft has already started to enjoy the fruits of the decomposition effort. The new decomposed endpoints have a reduced p50 latency of <120ms compared to the p50 latency of >200ms on calls to the original /users/:user_id/location endpoint. One of the biggest impacts from decomposition came from moving new ride request info off of the Universal Object, which ended up reducing the time between when Lyft matches a driver to a ride on our backend and when the driver is notified of the match by over 20%. While Lyft still maintains the legacy endpoint to support older app versions, newer versions of the rider and driver mobile apps no longer fetch the Universal Object.

However, decomposing network calls might not always be the right choice. After decomposition, there is no longer a single source of truth, so the client cannot assume that it always has the same state of the world as the server. There can be lag across the different decomposed polling streams, leaving one piece of information more up-to-date than another (e.g., the driver's location versus the ride's status). There is also additional client complexity: engineers need to hit the specific endpoint that serves the data model they need rather than access a field on a state-of-the-world object.
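One lightweight way to cope with that loss of a single source of truth — a sketch, not Lyft's actual approach — is to tag each decomposed resource with the time it was last refreshed, so feature code can detect how far apart two streams have drifted:

```typescript
// Tag each decomposed resource with its fetch time so callers can detect
// skew between streams (e.g. driver location vs. ride status).
interface Timestamped<T> { value: T; fetchedAtMs: number; }

function stalenessSkewMs<A, B>(a: Timestamped<A>, b: Timestamped<B>): number {
  return Math.abs(a.fetchedAtMs - b.fetchedAtMs);
}
```

Feature code can then decide, per screen, how much skew between streams is acceptable before forcing a refresh.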

A continuous "state of the world" polling loop can be the right architecture when a product is in its early stages, as it was for Lyft. However, decomposition became a necessary adaptation as Lyft's user base grew exponentially and stability became the highest priority. Even if decomposition is not the right architectural decision at the moment, it is important to keep these trade-offs top-of-mind when considering the future scale of an application. We hope that our journey decomposing the Universal Object at Lyft helps readers improve the design and scalability of their own client apps.

Don Yu, Sarah Mazur, Daniel Duan, Pierce Johnson, and the many other engineers involved in this project work on different teams at Lyft (Resilience, Core Services, Client Architecture, Driver App Platform, and more). If you’re passionate about creating resilient client apps or tackling interesting scaling problems like this one, read more about them on our blog or apply to join our team!