Lyft’s Journey through Mobile Networking

In 5 years, the number of endpoints consumed by Lyft’s mobile apps grew to over 500, and the size of our mobile engineering team increased by more than 15x. To scale with this growth, our infrastructure had to evolve dramatically to utilize new advances in modern networking in order to continue to provide benefits for our users. This post describes the journey through the evolution of Lyft’s mobile networking: how it’s changed, what we’ve learned, and why it’s important for us as a growing business.

The early iterations of the Lyft apps used commonly known networking frameworks like URLSession on iOS and OkHttp on Android. All of our APIs were JSON over RESTful HTTP. The workflow for developing an endpoint looked something like the following diagram, where engineers hand-wrote APIs on each platform based on a tech spec:

Original process for hand-creating APIs.

In this world, the burden of consistently writing an API implementation across 3 platforms (iOS, Android, and server) was placed on individual engineers. With a small team and limited codebase sizes, this was an acceptable first approach. However, as our teams started to get larger and our velocity increased, our communication complexity also increased and led to more errors. For any given resource, there were numerous potential points of failure in this implementation process, each stemming from a lack of consistency between platforms and resulting in problems we’ve all likely encountered:

  • An engineer on one platform could write a typo in a JSON mapping.
  • One platform might assume a field is optional while another treats it as non-optional.
  • Different mobile clients may handle HTTP status codes and/or response bodies differently.
  • A change might be made on one platform in a future iteration of an API, but not on others.
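The first two failure modes can be illustrated with a minimal Swift sketch. The payload, field names, and both structs below are hypothetical and are not taken from Lyft's codebase; the sketch only shows how two hand-written mappings of the same JSON can silently disagree.

```swift
import Foundation

// Hypothetical server payload; field names are illustrative only.
let payload = Data(#"{"ride_id": "abc123", "eta_seconds": 300}"#.utf8)

// Hand-written mapping A: correct keys, `etaSeconds` treated as required.
struct RideA: Decodable {
    let rideID: String
    let etaSeconds: Int

    enum CodingKeys: String, CodingKey {
        case rideID = "ride_id"
        case etaSeconds = "eta_seconds"
    }
}

// Hand-written mapping B: a typo in the key combined with an optional field.
// Decoding still succeeds, but the value is silently dropped.
struct RideB: Decodable {
    let rideID: String
    let etaSeconds: Int?

    enum CodingKeys: String, CodingKey {
        case rideID = "ride_id"
        case etaSeconds = "eta_second" // typo: missing trailing "s"
    }
}

let a = try JSONDecoder().decode(RideA.self, from: payload)
let b = try JSONDecoder().decode(RideB.self, from: payload)
// `a.etaSeconds` holds the server value; `b.etaSeconds` is nil, with no error raised.
```

Because neither mapping fails loudly, a mismatch like this tends to surface only as a product bug on one platform.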

In an effort to reduce handwritten code and subsequent programming errors, members of the Android team began adopting YAML-based Swagger for defining APIs.

With Swagger came an in-house code generator that created Java files containing functions that wrapped calls to OkHttp (similar to what Retrofit provides, but without the annotation processing). Android engineers ran this generator locally to create type-safe API methods which could be imported into the codebase.

This change meant the workflow between platforms became asymmetrical:

Workflow with YAML generation for Android.

Swagger made it easier to integrate new APIs on Android and provided some documentation for the APIs exposed to mobile, but we still had inconsistency problems similar to those mentioned previously:

  • No code generation for server or iOS.
  • YAML definitions (and thus Android’s implementations) could differ from the server and/or iOS implementations since there were no strict API contracts between platforms.
  • As the sole consumers of the YAML definitions, Android engineers became the de facto maintainers, and the definitions became outdated.

While Swagger generation attempted to drive consistency and documentation, it didn’t provide enough value since not all platforms were able to leverage the tool. In order to maintain a high bar of consistency for APIs, we needed a single source of truth which all platforms could utilize.

As the number of services and APIs grew, it became clear that we needed a single source of truth for APIs that provided guarantees around the behavior of any given API on every platform.

We envisioned a central collaboration space where any engineer could add an API definition from which consistent interfaces and implementations would be generated for our services and mobile clients. The goals for this project were ambitious:

  1. Create one canonical source of truth for every API definition, respected across all platforms.
  2. Generate ergonomic, consistent, and testable interfaces in each of our supported languages for every definition.
  3. Abstract away the networking implementation details from the generated interfaces, allowing for the underlying implementations to change and evolve over time.

We chose protocol buffers (“protobuf”) as the Interface Definition Language (“IDL”) to be adopted across the company, for the following reasons:

  • Simple and descriptive language for API contracts.
  • Open-source generators are available for most languages.
  • Writing new generators is possible by using protoc-gen-star.
  • Backwards compatible with our existing JSON APIs (with some tweaking), enabling easy migration.
  • Supports an optimized binary format which could be adopted in the future.

Through our IDL system, we aimed to unify the workflow for all engineers regardless of the platform they worked on, and to provide a common collaboration ground for them:

Workflow with IDL and protobuf generation on all platforms.

Protobuf allows for defining RPCs (Remote Procedure Calls, basically endpoints) that accept and return messages (models) over gRPC (a thin message-based streaming protocol built on top of HTTP/2). However, we needed to support all pre-existing Lyft APIs that use JSON over RESTful HTTP, and wanted a way to build future functionality on top of protobuf that was available to mobile clients and transparent to engineers.

To accomplish this, we opted to write custom code generators for Swift and Kotlin. These code generators transform the protobuf language’s AST (Abstract Syntax Tree) provided by protoc-gen-star (an open-source protoc plugin developed by Lyft) to create models and endpoint definitions which can then be consumed by mobile apps.

By writing our own custom generators, we were able to embed additional metadata using protobuf annotations, which enabled back-porting existing RESTful JSON HTTP APIs. Our protobuf definitions look something like this:

Using this simple protobuf definition, our mobile code generators produce a few different things:

Models created by the generators called DTOs (Data Transfer Objects) represent a structure that is common across platforms. Engineers are expected to convert these to/from application-level models for use in their product code, allowing UI/business logic to be separated from the API format.
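As a rough illustration of the DTO boundary, the sketch below converts a DTO to and from an application-level model. `RideDTO`, `Ride`, and their fields are hypothetical names chosen for this example; the generators' real output is not shown in this post.

```swift
import Foundation

// Hypothetical generated DTO: mirrors the wire format field for field.
struct RideDTO: Codable, Equatable {
    let rideID: String
    let etaSeconds: Int64
    let driverName: String?
}

// Hypothetical application-level model used by UI/business logic.
struct Ride: Equatable {
    let id: String
    let eta: TimeInterval
    let driverName: String
}

extension Ride {
    // DTO -> app model: defaults and unit conversions live in one place.
    init(dto: RideDTO) {
        self.id = dto.rideID
        self.eta = TimeInterval(dto.etaSeconds)
        self.driverName = dto.driverName ?? "Unknown driver"
    }

    // App model -> DTO, for building request bodies.
    var dto: RideDTO {
        RideDTO(rideID: id, etaSeconds: Int64(eta), driverName: driverName)
    }
}

let ride = Ride(dto: RideDTO(rideID: "abc123", etaSeconds: 300, driverName: nil))
```

Keeping the conversion in a single initializer means a change to the API format touches one mapping rather than every screen that displays a ride.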

Note: Although the examples below are in Swift, the corresponding Kotlin APIs match 1:1.

The generated APIs include a protocol/interface for the service and its corresponding RPC declarations, allowing them to be easily mocked for testing. A production-ready implementation of the interface is also generated.

These interfaces and implementations utilize Rx patterns for the generated APIs, making combining/transforming them quite simple:

  • Each RPC is exposed as a function that returns an observable containing a native Result type with either a successful response or a typed error.
  • When the observable is subscribed to, the API call is performed, and the result of the call is emitted over the Rx stream.
  • If the stream is disposed by the consumer before it completes, the API call is canceled automatically.
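The three behaviors above can be sketched without any dependencies. In this sketch a hand-rolled `ResultStream` stands in for the Rx observable; it is not RxSwift's API, and `FakeTransport`, `RideResponse`, and `APIError` are hypothetical names used only to make the semantics observable.

```swift
import Foundation

// Hypothetical typed error and response for one RPC.
enum APIError: Error, Equatable { case unavailable }
struct RideResponse: Equatable { let rideID: String }

// Handle returned by `subscribe`; disposing it cancels in-flight work.
final class Disposable {
    private var onDispose: (() -> Void)?
    init(_ onDispose: @escaping () -> Void) { self.onDispose = onDispose }
    func dispose() {
        onDispose?()
        onDispose = nil
    }
}

// Minimal stand-in for an Rx observable that emits a single Result.
struct ResultStream<Value> {
    typealias Observer = (Result<Value, APIError>) -> Void
    // Starts the work and returns a closure that cancels it.
    let start: (@escaping Observer) -> (() -> Void)

    func subscribe(_ observer: @escaping Observer) -> Disposable {
        Disposable(start(observer))
    }
}

// Fake transport that records what happens, standing in for the real stack.
final class FakeTransport {
    private(set) var started = 0
    private(set) var canceled = 0
    private var pending: [Int: (Result<RideResponse, APIError>) -> Void] = [:]

    func getRide(id: String) -> ResultStream<RideResponse> {
        ResultStream<RideResponse>(start: { observer in
            self.started += 1
            let token = self.started
            self.pending[token] = observer
            return {
                // Only counts as a cancellation if the call is still in flight.
                if self.pending.removeValue(forKey: token) != nil {
                    self.canceled += 1
                }
            }
        })
    }

    // Completes every in-flight call with the given result.
    func completeAll(with result: Result<RideResponse, APIError>) {
        let observers = Array(pending.values)
        pending.removeAll()
        observers.forEach { $0(result) }
    }
}

let transport = FakeTransport()
let stream = transport.getRide(id: "abc123") // nothing runs until subscription
let startedBeforeSubscribe = transport.started

var received: [Result<RideResponse, APIError>] = []
let first = stream.subscribe { received.append($0) }
transport.completeAll(with: .success(RideResponse(rideID: "abc123")))

let second = stream.subscribe { received.append($0) }
second.dispose() // canceled before completion, so nothing is emitted
transport.completeAll(with: .failure(.unavailable))
first.dispose()  // no-op: that call already completed
```

Subscribing twice starts two calls, which is the "cold" behavior described above: the work is tied to the subscription, not to the creation of the stream.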

We’ll get more into the implementation details of how the API call is actually performed by the networking stack later in this post, but it’s important to note that these details are hidden away behind the interfaces of the generated APIs. This decision was key to unlocking future improvements to the transport layer.

Lastly, mock classes conforming to the same interfaces are generated and compiled into separate modules that are only visible to test targets. These classes make it easy to mock out and test features that consume the generated APIs.
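A minimal sketch of how such a mock might be used follows. The protocol, mock, and presenter names are hypothetical, and a completion callback replaces the Rx observable to keep the example dependency-free; the actual generated mocks may look different.

```swift
import Foundation

// Hypothetical DTO and error type, standing in for generated code.
struct RideDTO: Equatable { let rideID: String; let etaSeconds: Int }
enum APIError: Error, Equatable { case unavailable }

// Hypothetical generated service interface.
protocol RideServiceAPI {
    func getRide(id: String, completion: @escaping (Result<RideDTO, APIError>) -> Void)
}

// Hypothetical generated mock: records calls and returns what the test configured.
final class MockRideServiceAPI: RideServiceAPI {
    var getRideResult: Result<RideDTO, APIError> = .failure(.unavailable)
    private(set) var getRideCalls: [String] = []

    func getRide(id: String, completion: @escaping (Result<RideDTO, APIError>) -> Void) {
        getRideCalls.append(id)
        completion(getRideResult)
    }
}

// Feature code depends only on the protocol, never on the transport.
final class EtaPresenter {
    private let api: RideServiceAPI
    private(set) var text = ""

    init(api: RideServiceAPI) { self.api = api }

    func load(rideID: String) {
        api.getRide(id: rideID) { [weak self] result in
            switch result {
            case .success(let ride):
                self?.text = "Arriving in \(ride.etaSeconds / 60) min"
            case .failure:
                self?.text = "ETA unavailable"
            }
        }
    }
}

let mock = MockRideServiceAPI()
let presenter = EtaPresenter(api: mock)

presenter.load(rideID: "abc123") // mock defaults to a failure
let failureText = presenter.text

mock.getRideResult = .success(RideDTO(rideID: "abc123", etaSeconds: 300))
presenter.load(rideID: "abc123")
let successText = presenter.text
```

Since the presenter only sees `RideServiceAPI`, the same test exercises both the success and failure paths without touching the network.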

The generated client is consumed by product code and looks something like this:

Using an IDL with centralized definitions and automated code generators established a high bar of consistency between different platforms, made consuming APIs very simple, and created a clear line between the interfaces and their implementations. In other words, it abstracted away how network calls are executed and paved the way for future improvements to the transport layer.

Once we established strict API contracts and clear abstractions through IDL, we considered: How can we change what data is being sent between the client and server in a way that benefits our users? Since protobuf supports an optimized binary format for serializing models, we decided to try it out using an A/B test. To do so, we employed content negotiation as a way to “upgrade” server responses to protobuf if both the client and server support it for a given resource:

Content negotiation for JSON/protobuf between a mobile client and a service.

The above diagram outlines the following workflow:

  1. Client sends a JSON request and indicates it knows how to read both protobuf and JSON responses for a given endpoint using the Accept header.
  2. If the service receiving the request can also read protobuf, middleware running on the service “upgrades” the request from JSON to protobuf using the IDL definition. If not, the original JSON is passed through.
  3. The service responds with protobuf, indicated by the Content-Type header.
  4. If the client originally indicated it was capable of reading protobuf responses, the middleware passes the response through. If not, the response is converted back to JSON.
  5. Client is able to take advantage of the reduced protobuf payload size.
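The decision logic in these steps can be sketched roughly as follows. This is a simplified model, not Lyft's middleware: the function names are hypothetical, `application/x-protobuf` is an assumed media type (the post does not say which one is used), and `Accept` quality values are ignored.

```swift
import Foundation

enum WireFormat: String {
    case json = "application/json"
    case protobuf = "application/x-protobuf" // assumed media type
}

// Step 1: the client advertises every format it can read for this endpoint.
func acceptHeader(clientReadsProtobuf: Bool) -> String {
    let formats: [WireFormat] = clientReadsProtobuf ? [.protobuf, .json] : [.json]
    return formats.map(\.rawValue).joined(separator: ", ")
}

// Steps 2-4 collapsed into one decision: which Content-Type reaches the client.
// The response stays protobuf only if the service supports protobuf for this
// resource AND the client's Accept header lists it; otherwise it is JSON.
func responseContentType(acceptHeader: String, serviceSupportsProtobuf: Bool) -> WireFormat {
    let accepted = acceptHeader
        .split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }
    let clientReadsProtobuf = accepted.contains(WireFormat.protobuf.rawValue)
    return (serviceSupportsProtobuf && clientReadsProtobuf) ? .protobuf : .json
}

let upgraded = responseContentType(
    acceptHeader: acceptHeader(clientReadsProtobuf: true),
    serviceSupportsProtobuf: true
)
let legacyClient = responseContentType(
    acceptHeader: acceptHeader(clientReadsProtobuf: false),
    serviceSupportsProtobuf: true
)
let legacyService = responseContentType(
    acceptHeader: acceptHeader(clientReadsProtobuf: true),
    serviceSupportsProtobuf: false
)
```

Only the case where both sides opt in yields protobuf, which is what allows clients and services to migrate a resource independently.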

This workflow enabled us to migrate resources to protobuf definitions independently on the client and server while iteratively taking advantage of the benefits provided by the binary format. In doing so, we saw reduced response payload sizes (sometimes cutting size by more than 50%), improved success rates, and faster response times on our larger endpoints. Best of all, product teams were able to take advantage of these improvements without any changes to their code, thanks to the IDL abstraction in place.

Average response size savings by resource by using protobuf instead of JSON.
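To give a feel for where the savings come from, the sketch below hand-encodes a tiny hypothetical message using protobuf's varint wire format and compares it with the equivalent JSON text. The message and field numbers are invented for illustration; a message this small and purely numeric overstates the benefit, and real savings depend on the payload (string-heavy responses shrink far less).

```swift
import Foundation

// Encodes an unsigned integer as a protobuf base-128 varint.
func varint(_ value: UInt64) -> [UInt8] {
    var remaining = value
    var bytes: [UInt8] = []
    while remaining >= 0x80 {
        bytes.append(UInt8(remaining & 0x7F) | 0x80) // low 7 bits + continuation bit
        remaining >>= 7
    }
    bytes.append(UInt8(remaining))
    return bytes
}

// Encodes one varint-typed field: tag = (fieldNumber << 3) | wire type 0.
func varintField(number: UInt64, value: UInt64) -> [UInt8] {
    varint(number << 3) + varint(value)
}

// Hypothetical message: ride_id = 123456 (field 1), eta_seconds = 300 (field 2).
let protobufBytes = varintField(number: 1, value: 123_456)
    + varintField(number: 2, value: 300)

// The same data as compact JSON, where field names travel with every response.
let jsonBytes = Array(#"{"ride_id":123456,"eta_seconds":300}"#.utf8)

print("protobuf: \(protobufBytes.count) bytes, JSON: \(jsonBytes.count) bytes")
```

The binary form replaces each field name with a one-byte tag and stores integers in as few bytes as they need, which is the main source of the size difference.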

After deploying content negotiation, the team started asking: Considering that our IDL abstraction and infrastructure allows us to experiment with its implementation, what if we were to change how we send data in addition to what we send?

On the server side, we use Envoy proxy to relay all network traffic into, within, and out of our service mesh. Envoy supports a myriad of protocols and provides a rich feature set that is used throughout our infrastructure to provide consistent behavior between our hundreds of microservices (e.g., auth, protocol negotiation, and stats).

We envisioned a world where Envoy could run on our iOS/Android clients, bringing the power and consistency of the project to mobile and essentially making them simply another node on our service mesh. This would enable us to write core functionality once and share it across both mobile and our services (for example, performing auth handshakes entirely within Envoy), and would allow us to deploy new protocols like QUIC before they’re officially supported by OS providers.

Earlier this year, we announced Envoy Mobile — an iOS/Android library with the goal of bringing Envoy’s functionality to mobile apps. Since then, we’ve begun testing the library in pre-release versions of the Lyft rider app.

Envoy Mobile adds iOS and Android clients as nodes on the Envoy mesh.

We believe that Envoy Mobile will not only become a key piece in Lyft’s networking infrastructure, but that it will revolutionize how we enable network consistency across platforms throughout the industry. All of the work on Envoy Mobile is taking place in open source, and we’ll be publishing more detailed posts about it in the near future. If you’re interested in contributing to Envoy Mobile or testing out the library, you can check out the project on GitHub.

Throughout Lyft’s mobile networking journey, we were able to recognize the value of consistency across platforms and languages, and iterated on our infrastructure to abstract away networking implementation details while making it easy for engineers to interact with APIs. These changes allowed the organization to scale, and provided clear benefits to our end users by optimizing how they communicate with our network.

In the near future, we plan to open source the code generators described in this post. Additionally, we’ll be continuing our efforts to push Envoy Mobile to production. We invite you to check out our roadmap or open a PR if you’re interested in contributing to the library!

This post provided a high-level discussion of APIs at Lyft, but there are numerous topics that we plan to publish more about in the future, including:

  • Deep dives into Envoy Mobile as it evolves.
  • How we enabled streaming and push using IDL.
  • Building protobuf generation for mobile.

If you’re interested in joining the team at Lyft, we’re hiring! Feel free to check out our careers page for more info.
