The making of Edge Gateway, the highly available and scalable self-serve gateway to configure, manage, and monitor APIs of every business domain at Uber.
Evolution of Uber’s API gateway
In October 2014, Uber began scaling in what would turn out to be one of the most impressive growth phases in the company's history. We were growing our engineering teams non-linearly each month and acquiring millions of users across the world.
In this article, we will go through the different phases of the evolution of Uber’s API gateway that powers Uber products. We will walk through history to understand the evolution of architectural patterns that occurred alongside this breakneck growth phase. We will speak of this evolution over three generations of the gateway systems, exploring their challenges and their responsibilities.
First gen: the organic evolution
A 2014 survey of Uber’s architecture would have turned up two key services: dispatch and api. The dispatch service was responsible for connecting a rider with a driver, and the api service was the long-term store for our users and trips. Besides these, fewer than ten microservices supported the critical flows in our consumer app.
Both the rider app and the driver app connected to the dispatch service using a single endpoint hosted at ‘/’. The request body had a special field named “messageType” that determined which RPC command, and therefore which handler, to invoke. The handler responded with a JSON payload.
Among the set of RPC commands, 15 were reserved for critical real-time operations, such as letting driver partners start accepting or rejecting trips and letting riders request trips. A special messageType named ‘ApiCommand’ proxied all requests to the api service with some additional context from the dispatch service.
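The single-endpoint pattern described above can be sketched roughly as follows. This is a minimal illustration in Go, not the dispatch service's actual code: only the “messageType” field and the ‘ApiCommand’ name come from the description above, while the `payload` field, the `RequestTrip` command, and the handler bodies are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical request body. Only the "messageType"
// field name comes from the article; "payload" is an assumption.
type Envelope struct {
	MessageType string          `json:"messageType"`
	Payload     json.RawMessage `json:"payload"`
}

// Handler processes one RPC command and returns a JSON-serializable value.
type Handler func(payload json.RawMessage) (interface{}, error)

// Router maps messageType values to their handlers.
type Router map[string]Handler

// Dispatch decodes the envelope, looks up the handler registered for its
// messageType, and returns the handler's response encoded as JSON.
func (r Router) Dispatch(body []byte) ([]byte, error) {
	var env Envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	h, ok := r[env.MessageType]
	if !ok {
		return nil, fmt.Errorf("unknown messageType %q", env.MessageType)
	}
	resp, err := h(env.Payload)
	if err != nil {
		return nil, err
	}
	return json.Marshal(resp)
}

// newRouter registers two illustrative commands.
func newRouter() Router {
	return Router{
		// "RequestTrip" is a made-up command name for this sketch.
		"RequestTrip": func(p json.RawMessage) (interface{}, error) {
			return map[string]string{"status": "requested"}, nil
		},
		// Stand-in for forwarding the request to the api service.
		"ApiCommand": func(p json.RawMessage) (interface{}, error) {
			return map[string]string{"proxiedTo": "api"}, nil
		},
	}
}

func main() {
	out, err := newRouter().Dispatch([]byte(`{"messageType":"ApiCommand","payload":{}}`))
	fmt.Println(string(out), err)
}
```

One consequence of this design is that routing, and anything keyed on the route such as metrics or access control, has to parse the body before it knows which operation is being called.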
In the context of an API gateway, ‘ApiCommand’ was effectively our gateway into Uber. The first gen was the outcome of the organic evolution of a single monolithic service that started to serve real users and found a way to scale with additional microservices. The dispatch service acted as the mobile interface with public-facing APIs, but it also contained the dispatch system with its matching logic and a proxy that routed all other traffic to other microservices within Uber.
The glory days of this first-generation system did not last much longer, as it had already been in production for a few years. By January 2015 the blueprint of a brand new API gateway (arguably the first true gateway) was bootstrapped, and the first semantically RESTful API, which allowed the Uber rider app to search for a destination location, was deployed and served a few thousand queries per second (QPS). It was a step in the right direction.
Second gen: the all-encompassing gateway
Uber adopted a microservice architecture in its very early days. This architectural decision led to the eventual growth to 2,200+ microservices that powered all of Uber’s products by 2019.
The API gateway layer was named RTAPI, short for Real-Time API. It started out with a single RESTful API in early 2015 and grew to become a gateway with many public-facing APIs powering more than 20 growing portfolios of mobile and web clients. The service lived in a single repository and was broken up into multiple specialized deployment groups as it continued to grow at an exponential rate.
This API gateway was one of the largest NodeJS applications at Uber with some impressive stats:
- many endpoints across 110 logical endpoint groupings
- 40% of engineering had committed code to this layer
- 800,000 req/s at peak
- 1.2 million translations served to localize data for clients
- 50,000 integration tests executed on every diff in under 5 minutes
- For the longest time there was a deploy almost every single day
- ~1M lines of code handling the most critical user flows
- ~20% of the mobile build is code generated from the schemas defined in this layer
- Communicated with 400+ downstream services owned by 100+ teams at Uber
Goals of our second gen
Every piece of infrastructure within the company has a predetermined set of goals to satisfy. Some of the goals started out during the initial design and some were picked up along the way.
Hundreds of teams were building features in parallel. The number of microservices providing foundational functions developed by backend teams was exploding. The frontend and mobile teams were building product experiences at an equally fast pace. The gateway provided the decoupling needed and allowed our apps to continue relying on a stable API gateway and the contracts it provided.
All mobile-to-server communication was primarily HTTP/JSON. Internally, Uber had also rolled out a new protocol built to provide multiplexed, bidirectional transport. At one point every new service at Uber adopted this new protocol, which split the backend services between two protocols. A subset of those services could also only be addressed via peer-to-peer networking. The networking stack at that time was at a very early stage, and the gateway shielded our product teams from the underlying network changes.
All APIs used by the company need a certain set of functionalities that should remain common and robust. We focused on authentication, monitoring (latency, error, payload size), data validation, security audit logging, on-demand debug logging, baseline alerting, SLA measurement, datacenter stickiness, CORS configurations, localization, caching, rate limiting, load shedding and field obfuscation.
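One common way to keep such concerns uniform across endpoints is to wrap every endpoint handler in a shared chain of middleware. The sketch below shows the idea in Go with two of the listed concerns, a placeholder authentication check and latency monitoring. It is an illustration under assumptions, not RTAPI's implementation: the `Metrics` sink, the `/rider/profile` path, and the header check are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Metrics is a toy in-memory sink; a real gateway would emit to a
// metrics system instead.
type Metrics struct {
	mu    sync.Mutex
	Count map[string]int
	Total map[string]time.Duration
}

func NewMetrics() *Metrics {
	return &Metrics{Count: map[string]int{}, Total: map[string]time.Duration{}}
}

// Observe records one request and its latency for the given path.
func (m *Metrics) Observe(path string, d time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.Count[path]++
	m.Total[path] += d
}

// Middleware wraps a handler with one cross-cutting concern.
type Middleware func(http.Handler) http.Handler

// Chain applies middlewares so that the first one listed runs outermost.
func Chain(h http.Handler, ms ...Middleware) http.Handler {
	for i := len(ms) - 1; i >= 0; i-- {
		h = ms[i](h)
	}
	return h
}

// RequireAuth rejects requests that carry no Authorization header.
// This is a placeholder check, not real authentication.
func RequireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// RecordLatency measures how long the wrapped handler takes.
func RecordLatency(m *Metrics) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()
			next.ServeHTTP(w, r)
			m.Observe(r.URL.Path, time.Since(start))
		})
	}
}

// newGateway wraps a hypothetical endpoint with the shared concerns.
func newGateway(m *Metrics) http.Handler {
	endpoint := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"ok":true}`)
	})
	return Chain(endpoint, RecordLatency(m), RequireAuth)
}

// statusFor sends one in-memory request through h and returns the status.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, "/rider/profile", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	m := NewMetrics()
	h := newGateway(m)
	anonymous := statusFor(h, "")
	authed := statusFor(h, "Bearer demo")
	fmt.Println(anonymous, authed, m.Count["/rider/profile"])
}
```

Because the latency middleware sits outermost, rejected requests are measured too, which is usually what monitoring wants. Endpoint authors never touch this code; they only supply the innermost handler.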
During this time, numerous app features adopted the ability to push data from the server to the mobile apps. These payloads were modeled as APIs and received the same “cross-cutting concerns” discussed above. The final push to the app was managed by our streaming infrastructure.
Reduced round trips
The internet has evolved over the last decade by working around various shortcomings of the HTTP stack. Reducing round trips over HTTP is a well-known technique used by frontend applications (remember image sprites, multiple domains for asset downloads, etc.). In a microservices architecture, calls to bits and pieces of microservice functionality are combined at the gateway layer, which “scatter-gathers” data from various downstream services to reduce the round trips between our apps and the backend. This is especially important for our users on low-bandwidth cellular networks in Latam, India, and other countries.
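A scatter-gather step of this kind can be sketched in Go as below: the gateway fans out to several downstream calls concurrently and assembles one response for the app. The `Fetcher` type and the `profile`, `payment`, and `promos` calls are hypothetical stand-ins for real service clients, and returning partial results alongside per-call errors is one possible policy, not necessarily the one Uber used.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Fetcher stands in for a client call to one downstream service.
// Real clients would honor the context's deadline and cancellation.
type Fetcher func(ctx context.Context) (string, error)

// ScatterGather runs all calls concurrently and collects their results
// by name. Failed calls are reported separately so the caller can decide
// whether a partial response is acceptable.
func ScatterGather(ctx context.Context, calls map[string]Fetcher) (map[string]string, map[string]error) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(calls))
		errs    = make(map[string]error)
	)
	for name, call := range calls {
		wg.Add(1)
		go func(name string, call Fetcher) {
			defer wg.Done()
			v, err := call(ctx)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs[name] = err
				return
			}
			results[name] = v
		}(name, call)
	}
	wg.Wait()
	return results, errs
}

// demoCalls returns three hypothetical downstream calls, one of which fails.
func demoCalls() map[string]Fetcher {
	return map[string]Fetcher{
		"profile": func(ctx context.Context) (string, error) { return "rider-42", nil },
		"payment": func(ctx context.Context) (string, error) { return "visa", nil },
		"promos": func(ctx context.Context) (string, error) {
			return "", errors.New("promos unavailable")
		},
	}
}

// gatherDemo runs the demo calls under a one-second overall deadline.
func gatherDemo() (map[string]string, map[string]error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return ScatterGather(ctx, demoCalls())
}

func main() {
	results, errs := gatherDemo()
	fmt.Println(len(results), "results,", len(errs), "errors")
}
```

The app makes one request over the slow cellular link, while the fan-out happens over the fast datacenter network, so total latency is bounded by the slowest downstream call rather than the sum of all of them.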
Backend for the frontend
Development speed is a critical feature of any successful product. Throughout 2016 our infrastructure for new hardware was not Dockerized, and while provisioning new services was easy, hardware allocations were slightly complex. The gateway provided a wonderful place for teams to get their feature started and finished in a single day. This was because it was a system our apps already called into, the service had a flexible development space for writing code, and it had access to the hundreds of microservice clients within the company. The first generation of Uber Eats was completely developed within the gateway. As the product matured, pieces were moved out of the gateway. Numerous features at Uber are built out completely at the gateway layer using functionality from existing microservices.
Challenges with our approach
The gateway's initial workload was largely I/O bound, and a team was dedicated to supporting Node.js. After rounds of reviews, Node.js became the language of choice for this gateway. Over time there were growing challenges with having such a dynamic language and providing a freeform coding space for 1,500 engineers at such a critical layer of the Uber architecture.
At some point, with 50,000 tests running on every new API or code change, it became complicated to reliably build a dependency-based incremental testing framework on top of some of the dynamic loading mechanisms. As other parts of Uber moved on to Golang and Java as the primary supported languages, onboarding new backend engineers onto the gateway and its asynchronous Node.js patterns slowed down our engineers.
The gateway grew quite large. It took on the label of a monorepo (the gateway was deployed as 40+ independent services), and the effort of upgrading the 2,500 npm libraries to a newer version of Node.js kept increasing exponentially. This meant that we could not adopt the latest version of numerous libraries. At this time Uber started adopting gRPC as the protocol of choice, and our version of Node.js did not help in that effort.
There were constant cases of null pointer exceptions (NPEs) that could not be caught during code reviews or shadow traffic. These stalled key gateway deployments for a few days until the NPE was fixed in some unrelated, new, unused API. This further slowed down our engineering velocity.
The code in the gateway grew more complex and was no longer purely I/O bound. A performance regression introduced by a few APIs could slow down the whole gateway.
Two specific goals of the gateway put a lot of stress on this system. “Reduced round trips” and “backend for the frontend” were a recipe for a large amount of business logic to leak into the gateway. At times this leak was by choice and at other times without reason. With over one million lines of code, it was pretty hard to distinguish “reduced round trips” from heavy business logic.
With the gateway being critical infrastructure that keeps our customers moving and eating, the gateway team started to become a bottleneck for product development at Uber. We mitigated this through API-sharded deployments and decentralized reviews, but the bottleneck was not resolved to a satisfactory level.
This is when we had to rethink our next generation strategy for the API gateway.
Third gen: self-service, decentralized, and layered
By early 2018, Uber had completely new business lines with numerous new applications. The number of business lines has only continued to grow: Freight, ATG, Elevate, groceries, and more. Within each line of business, the teams managed their backend systems and their app. We needed the systems to be vertically independent for fast product development. The gateway had to provide the right set of functions to accelerate these teams while avoiding the technical and non-technical challenges mentioned above.
Goals of our third gen
The company was very different from the last time we designed our second gen gateway. Reviewing all the technical and non-technical challenges, we started designing the third generation with a new set of goals.
Separation of concerns
This new architecture encourages the company to follow a tiered approach to product development.
Edge Layer: the true gateway system that provides all the functionalities described in the goals of our second-generation gateway, except “backend for the frontend” and “reduced round trips.”
Presentation Layer: microservices that are tagged specifically to provide the backend for the frontend for their features & products. The approach results in product teams managing their own presentation and orchestration services that fulfill the APIs needed by the consuming apps. The code in these services is catered towards view generation and aggregation of data from many downstream services. There are separate APIs to modify the response catering to the specific consumer. For example, the Uber Lite app might need much less information related to pickup maps compared to the standard Uber rider app. Each of these might involve a varying number of downstream calls to compute the required response payload with some view logic.
Product Layer: these microservices are tagged specifically to provide functional, reusable APIs that describe their product/feature. These might be reused by other teams to compose and build new product experiences.
Domain Layer: contains the leaf-node microservices, each of which provides a single refined functionality for a product team.
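To make the presentation layer's role concrete, the sketch below shapes the same domain data into two different responses depending on the consuming app, in the spirit of the Uber Lite example above. All type names, fields, and the `"lite"` client identifier are hypothetical; the real payloads are not described in this article.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Pickup is hypothetical domain data gathered from downstream services.
type Pickup struct {
	Address    string
	Lat, Lng   float64
	MapTiles   []string
	ETAMinutes int
}

// FullPickupView is what a full-featured client might receive.
type FullPickupView struct {
	Address    string   `json:"address"`
	Lat        float64  `json:"lat"`
	Lng        float64  `json:"lng"`
	MapTiles   []string `json:"mapTiles"`
	ETAMinutes int      `json:"etaMinutes"`
}

// LitePickupView drops the map-related fields for a lightweight client.
type LitePickupView struct {
	Address    string `json:"address"`
	ETAMinutes int    `json:"etaMinutes"`
}

// RenderPickup is the view logic: it picks the response shape based on
// the consuming client. The client identifiers are assumptions.
func RenderPickup(p Pickup, client string) ([]byte, error) {
	switch client {
	case "lite":
		return json.Marshal(LitePickupView{Address: p.Address, ETAMinutes: p.ETAMinutes})
	default:
		return json.Marshal(FullPickupView{
			Address:    p.Address,
			Lat:        p.Lat,
			Lng:        p.Lng,
			MapTiles:   p.MapTiles,
			ETAMinutes: p.ETAMinutes,
		})
	}
}

// demoPickup returns sample data for the example.
func demoPickup() Pickup {
	return Pickup{
		Address:    "1 Market St",
		Lat:        37.79,
		Lng:        -122.39,
		MapTiles:   []string{"tile-a", "tile-b"},
		ETAMinutes: 4,
	}
}

func main() {
	lite, _ := RenderPickup(demoPickup(), "lite")
	full, _ := RenderPickup(demoPickup(), "rider")
	fmt.Println(string(lite))
	fmt.Println(string(full))
}
```

Keeping this shaping logic in a service owned by the product team, rather than in the edge layer, is what lets the edge stay free of custom code.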
Reduced goals of our edge layer
One of the key contributors to the complexity was the ad hoc code in the second gen, which consisted of view generation and business logic. With the new architecture, those two functionalities have been moved out into other microservices that are owned and operated by independent teams on standard Uber libraries and frameworks. The edge layer now operates as a pure edge layer without custom code.
It is important to note that teams that are just starting out can have a single service that satisfies the responsibilities of the presentation, product, and domain layers. As the feature grows, the service can be deconstructed into the different layers.
This architecture provides immense flexibility to start small and arrive at a north star architecture that is consistent across all our product teams.
Technical building blocks
In our effort to move over to the newly envisioned architecture, we needed key technical components to be in place.
The edge layer, originally served by our second-generation gateway system, was replaced by a single standalone Golang service paired with a UI. “Edge gateway” was developed in house as the API lifecycle management layer. All Uber engineers can now visit a UI to create, configure, and modify our product-facing APIs. The UI supports simple configurations like authentication as well as advanced configurations like request transformations and header propagation.
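As a rough illustration of what configuration-driven behavior such as header propagation can look like, the sketch below applies a declarative endpoint description to an inbound request's headers. The `EndpointConfig` shape, its field names, and the header names are all hypothetical; Edge Gateway's actual configuration format is not described in this article.

```go
package main

import (
	"fmt"
	"net/http"
)

// EndpointConfig is a hypothetical declarative description of one edge
// API, of the kind a UI could produce. It is not Edge Gateway's schema.
type EndpointConfig struct {
	Path string
	// PropagateHeaders lists inbound headers copied to the downstream call.
	PropagateHeaders []string
	// RenameHeaders is a simple request transformation: the value of the
	// inbound header (key) is sent downstream under a new name (value).
	RenameHeaders map[string]string
}

// BuildDownstreamHeaders applies the config to the inbound headers.
// Headers not mentioned in the config are dropped.
func BuildDownstreamHeaders(cfg EndpointConfig, inbound http.Header) http.Header {
	out := http.Header{}
	for _, name := range cfg.PropagateHeaders {
		for _, v := range inbound.Values(name) {
			out.Add(name, v)
		}
	}
	for from, to := range cfg.RenameHeaders {
		if v := inbound.Get(from); v != "" {
			out.Set(to, v)
		}
	}
	return out
}

// demoHeaders runs the transformation on a sample inbound request.
func demoHeaders() http.Header {
	cfg := EndpointConfig{
		Path:             "/rider/profile",
		PropagateHeaders: []string{"X-Request-Id"},
		RenameHeaders:    map[string]string{"X-Client-Locale": "X-Locale"},
	}
	inbound := http.Header{}
	inbound.Set("X-Request-Id", "abc")
	inbound.Set("X-Client-Locale", "pt-BR")
	inbound.Set("Cookie", "session=secret")
	return BuildDownstreamHeaders(cfg, inbound)
}

func main() {
	h := demoHeaders()
	fmt.Println(h.Get("X-Request-Id"), h.Get("X-Locale"), h.Get("Cookie") == "")
}
```

Because the behavior is data rather than code, a change to an endpoint becomes a configuration edit that can be validated and rolled out without anyone writing custom logic in the edge layer.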
Given that all the product teams are going to maintain and manage a set of microservices (potentially at each layer of this architecture for their feature/product), the edge layer team collaborated with the language platform team to agree on a standardized service framework named “Glue” to be used across Uber. The Glue framework provides an MVCS framework built on top of the Fx dependency injection framework.
The categories of code in the gateway that fall into the buckets of “reduced round trips” and “backend for the frontend” required a lightweight DAG execution system in Golang. We built an in-house system called Control Flow Framework (CFF) in Golang that allows engineers to develop complex stateless workflows for business logic orchestration in service handlers.
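To give a feel for what DAG execution inside a service handler means, here is a deliberately small, hand-rolled sketch in Go. It is not CFF and does not use CFF's API; the `Task` type, `RunDAG`, and the example workflow are hypothetical. Each task starts as soon as the tasks it depends on have finished, so independent tasks run concurrently. The sketch assumes task names are unique and the graph has no cycles.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is one node in the workflow. It runs after every task named in
// Deps has finished and receives their outputs as inputs.
type Task struct {
	Name string
	Deps []string
	Run  func(inputs map[string]string) string
}

// RunDAG executes tasks concurrently, starting each one as soon as its
// dependencies complete. It assumes unique task names and an acyclic
// graph; unknown dependencies are rejected up front.
func RunDAG(tasks []Task) (map[string]string, error) {
	done := make(map[string]chan struct{}, len(tasks))
	for _, t := range tasks {
		done[t.Name] = make(chan struct{})
	}
	for _, t := range tasks {
		for _, d := range t.Deps {
			if _, ok := done[d]; !ok {
				return nil, fmt.Errorf("task %q depends on unknown task %q", t.Name, d)
			}
		}
	}

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		outputs = make(map[string]string, len(tasks))
	)
	for _, t := range tasks {
		wg.Add(1)
		go func(t Task) {
			defer wg.Done()
			inputs := make(map[string]string, len(t.Deps))
			for _, d := range t.Deps {
				<-done[d] // block until the dependency has finished
				mu.Lock()
				inputs[d] = outputs[d]
				mu.Unlock()
			}
			out := t.Run(inputs)
			mu.Lock()
			outputs[t.Name] = out
			mu.Unlock()
			close(done[t.Name]) // signal dependents
		}(t)
	}
	wg.Wait()
	return outputs, nil
}

// demoWorkflow has two independent tasks and one that joins them.
func demoWorkflow() []Task {
	return []Task{
		{Name: "rider", Run: func(map[string]string) string { return "rider-42" }},
		{Name: "fare", Run: func(map[string]string) string { return "12.50" }},
		{Name: "summary", Deps: []string{"rider", "fare"}, Run: func(in map[string]string) string {
			return in["rider"] + " owes " + in["fare"]
		}},
	}
}

func main() {
	out, err := RunDAG(demoWorkflow())
	fmt.Println(out["summary"], err)
}
```

A production framework would add what this sketch leaves out, such as typed inputs and outputs, error propagation, cancellation, and cycle detection.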
Moving a company that has operated in a particular way for the last few years into a new technical system is always a challenge. This challenge was particularly large as it impacted how 40% of Uber engineering operated regularly. The best way to undertake such an effort is to build consensus and awareness of the goal. There were a few dimensions to focus on.
The centralized team migrated a few high-scale APIs and critical endpoints onto the new stack to validate as many of the use cases as possible and to confirm that external teams could begin migrating their endpoints and logic.
Since there were many APIs, we had to clearly identify ownership. This wasn’t simple, since a large number of APIs have cross-team owned logic. For APIs that were clearly mapped to a certain product/feature, we automatically assigned them but for complex ones we took it case by case and negotiated the ownership.
After mapping endpoints to owning teams, we grouped those teams into larger groups (usually following the company org structure, for example Rider, Driver, Payments, Safety, etc.) and contacted engineering leadership to find engineering and program POCs to lead their teams throughout 2019.
The centralized team trained both the engineering and program leads on ‘how’ to migrate, ‘what’ to look out for, and ‘when’ to migrate. We created support channels where developers from other teams could go for questions and assistance with their migration. An automated centralized tracking system was put in place to hold teams accountable, offer visibility into progress, and update leadership.
Iterating on the strategy
As the migration was proceeding we encountered edge cases and challenged assumptions. A number of times new features were introduced and at other times, we chose not to pollute the new architecture with features that are not relevant at a given layer.
As the migration proceeded, our team kept thinking about the future and the direction in which the technical organization was moving, and we made sure to adjust the technical guidance as we progressed through the year.
Ultimately we were able to execute effectively on our commitments and are well on our way to moving towards a self-serve API gateway and a layered service architecture.
After having spent the time developing and managing three generations of gateway systems at Uber, here are some high level observations about API gateways.
If there is an option, stick with a single protocol for your mobile applications and internal services. Adopting multiple protocols and serialization formats ultimately results in a huge overhead in the gateway systems. Having a single protocol gives you the choice of how feature rich your gateway layer can be. It can be as simple as a proxy layer or an extremely complex and feature-rich gateway that can implement GraphQL using a custom DSL. If there are multiple protocols to translate, the gateway layer is forced to be complex just to achieve the simplest task of routing an HTTP request to a service that speaks another protocol.
Designing your gateway system to scale horizontally is critical. This is especially true for complex gateway systems like our second and third generations. The ability to build independent binaries for API groups was a critical feature that allowed our second-generation gateway to scale horizontally. A single binary would have been too large to run 1,600 complex APIs.
UI-based API configurations are great for incremental changes to an existing API, but creating a new API is usually a multi-step process. For engineers, the UI can sometimes feel slower than working directly in a checked-out codebase.
Our development and migration timeline from the second generation to the third was two years long. It is critical to have continuous investment as engineers transition in and out of the project. Keeping a sustainable momentum, with the north star goal in mind, is essential for the success of such long-running projects.
Finally, every new system need not support all the tech debt features from the old system. Making conscious choices on dropping support is critical for long term sustainability.
Looking back at our evolution of our gateways, one would wonder if we could have skipped a generation and arrived at the current architecture. Anyone yet to begin this journey in a company might also wonder if they should start from a self-service API gateway. It is a hard decision to make, as these evolutions are not standalone decisions. A lot depends on the evolution of supporting systems across the company, like infrastructure, language platforms, product team, growth, size of the product and many more.
At Uber, we have found strong indicators of success with this latest architecture. Our daily API changes in the third-generation system have already surpassed our second-generation numbers. This directly correlates to a faster-paced product development lifecycle. Moving to a Golang-based system has significantly improved our resource utilization and request/core metrics. The latency numbers on most of our APIs have decreased significantly. There is still a long way to go as the architecture matures and older systems are rewritten into the newer tiered architecture during their natural rewrite cycle.