Unlike server-side programming, mobile code cannot be retracted once shipped and can only be updated once a user opts in to an app upgrade. For Uber Engineering, this presents unique challenges when it comes to releasing new features incrementally, fixing bugs, and mitigating outages at scale across our mobile applications.
Uber’s Experimentation Platform (XP) is responsible for ensuring that new mobile features roll out as seamlessly as possible for our users. In its normal mode of operation, XP allows a mobile app to query a back-end service (Treatment Service) to retrieve a set of features (flags) that should be turned on for a particular user. However, this becomes problematic—and buggy!—when the state of flags is queried in the code before the XP payload is returned to the app.
To combat these challenges, we developed XP Background Push to mitigate bugs safely and efficiently in real time. Unlike traditional pull-based models, our push-based tool enables us to proactively force a user’s app into a certain configuration without having to wait for the app to call a back-end service and receive the payload, making it easier and quicker to fix bugs at Uber-scale.
In this article, we discuss how this powerful new tool enables our engineers to deploy more reliable apps for our users by leveraging existing Uber services.
Fixing bugs before XP Background Push
Without XP Background Push, it was difficult to have absolute certainty that we could return users to a working state in the case of a major bug sitting behind a feature flag.
Take the following hypothetical scenario as an example: Uber releases a new feature in App v1, and we later discover that for some users, the feature causes a crash that prevents them from requesting a ride. We immediately incorporate a fix for this bug in App v2, but it will take time to publish v2 to the App Store or Google Play. Moreover, even once the new version is available for download, users must voluntarily upgrade to the patched version, and only users who have downloaded App v2 receive the fix.
Ideally, we would be able to roll back the broken feature in v1 dynamically via our XP to get users back to a working state until they upgrade to the patched v2. In this scenario, new features are hidden behind feature flags and the state of these flags is queried from our Treatment Service, a back-end service that decides which “treatment” to show a user. Treatment Service enables us to control the rollout server-side and roll back buggy features by simply disabling a flag.
This sort of server-side control is extremely powerful and works most of the time. However, there are still some edge cases that cannot be handled with traditional feature flagging, e.g., when the code behind a flag runs before the configuration payload is returned. This matters at Uber because flags are fetched asynchronously to keep app load times quick and seamless. In other words, we prefer to use only APIs that do not block use of the in-app UI.
Consider a scenario that occurs when the request for experiments is non-blocking: a buggy feature gated by a flag is shipped, and the flag is checked in the app before the Treatment Service payload is returned. The flag is turned on for some subset of the population in Treatment Service, but on first launch the flag is disabled on the device because the payload has not yet arrived. When the payload arrives, it is cached for subsequent launches, so from the second launch onward the early check reads the cached, enabled value. The code behind the flag starts causing crashes, and the flag is subsequently disabled in Treatment Service. However, the change never takes effect on affected devices: the flag is checked against the cache before the new payload arrives, and the crash happens before the payload (containing the information needed to disable the flag) is received.
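The sequence above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not Uber's client code: `FlagCache`, `launch`, and the flag name `new_feature` are hypothetical names, and a "crash" is modeled simply as returning before the payload is applied.

```python
from typing import Dict

FLAG = "new_feature"  # hypothetical flag name, for illustration only


class FlagCache:
    """Hypothetical on-device flag cache that persists across launches."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def is_enabled(self, name: str) -> bool:
        # Flags with no cached value default to off, as on a first launch.
        return self._flags.get(name, False)

    def apply_payload(self, payload: Dict[str, bool]) -> None:
        # Called when the asynchronous Treatment Service response arrives.
        self._flags.update(payload)


def launch(cache: FlagCache, server_payload: Dict[str, bool]) -> bool:
    """Simulate one app launch; return True if the buggy code path ran."""
    if cache.is_enabled(FLAG):
        # The early check reads the cached value; the buggy feature runs and
        # the app crashes before the new payload can be applied.
        return True
    # No crash: the payload arrives later and is cached for the next launch.
    cache.apply_payload(server_payload)
    return False


cache = FlagCache()
first = launch(cache, {FLAG: True})    # launch 1: nothing cached yet, flag off
second = launch(cache, {FLAG: True})   # launch 2: cached "on" is read, crash
third = launch(cache, {FLAG: False})   # flag disabled server-side, still crashes
print(first, second, third)
```

In this model, disabling the flag on the server has no effect on the third launch, because the cached value is consulted before the response can overwrite it.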
This situation is far from ideal and can impact the Uber experience for our users. Fortunately, with XP Background Push, bugs like these can be mitigated with the snap of your fingers—or rather, the delivery of a push notification.
Using push notifications to fix bugs
To fix bugs and resolve crashes in these types of scenarios, our XP Background Push uses silent push notifications to enable or disable feature flags. If there is a bug that needs to be fixed, we send out a silent push notification which contains a payload that instructs the app to turn a feature on or off, mitigating poor app performance until the issue is resolved in a future version.
This sounds simple enough in theory but actually requires deliberate and intelligent interaction between services to execute properly. Sending push notifications to a large number of users takes time and puts stress on our infrastructure, so pushing notifications to all users in the case of an outage is neither feasible nor efficient.
Thus, the first and trickiest problem is determining which users are affected by a bug and therefore need to be sent notifications. Since Treatment Service is responsible for evaluating flag status and sending it down to mobile clients, it can determine which flags are enabled on a particular user’s device, or conversely, which user devices a particular flag is enabled on.
XP Background Push workflow
We leverage the fact that Treatment Service determines the status of all flags by logging the evaluation results for all user/device pairs to a Kafka topic. This topic is ingested by a Samza job, and the data is loaded into a Cassandra table for database lookup. The Cassandra table stores the most recent Treatment Service results for all flags across every Uber account, and its pipeline processes over 1,000 events per second.
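To make the lookup concrete, the following sketch replaces the Kafka, Samza, and Cassandra pipeline with an in-memory dictionary. It only illustrates the idea of keeping the most recent evaluation per user/device/flag and then querying by flag. The `Evaluation` record and its field names are assumptions made for illustration, not the actual event schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical shape of one logged Treatment Service evaluation."""
    user_id: str
    device_id: str
    flag: str
    enabled: bool
    timestamp: int


Key = Tuple[str, str, str]  # (user_id, device_id, flag)


def build_latest(events: Iterable[Evaluation]) -> Dict[Key, Evaluation]:
    """Keep only the most recent evaluation for each user/device/flag."""
    latest: Dict[Key, Evaluation] = {}
    for event in events:
        key = (event.user_id, event.device_id, event.flag)
        current = latest.get(key)
        if current is None or event.timestamp >= current.timestamp:
            latest[key] = event
    return latest


def devices_with_flag(latest: Dict[Key, Evaluation], flag: str) -> List[Tuple[str, str]]:
    """Return the (user, device) pairs whose latest result has the flag enabled."""
    return sorted(
        (user, device)
        for (user, device, name), event in latest.items()
        if name == flag and event.enabled
    )


events = [
    Evaluation("u1", "d1", "new_feature", True, 1),
    Evaluation("u1", "d1", "new_feature", False, 2),  # a later result wins
    Evaluation("u2", "d2", "new_feature", True, 1),
    Evaluation("u3", "d3", "other_flag", True, 1),
]
latest = build_latest(events)
affected = devices_with_flag(latest, "new_feature")
print(affected)
```

Here only `u2`/`d2` would need a push for `new_feature`, since the most recent evaluation for `u1`/`d1` already has the flag off.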
As displayed in Figure 2, below, a push process is triggered by engineers who are responsible for mitigating issues within the apps:
In the background, our A/B testing platform, Morpheus, constructs the payload that needs to be sent to users based on the new configuration. This payload is sent to an internal service called GroupPusher, along with a key identifying the flag that needs to be rolled back. GroupPusher is a service that takes a payload and a key identifying a set of users in Cassandra, and sends the payload to all of the users in the identified set. To do so, GroupPusher first pulls the list of affected users from Cassandra based on this key.
GroupPusher next calls Pusher, an internal service that sends the payload to users at a rate of about 3,000 pushes per second. Pusher is responsible for sending push notifications to users, e.g., to let them know that their driver is approaching their pickup location: “Your Uber is arriving now.” Pusher then sends the payload down to the mobile clients via APNs (for iOS) and GCM (for Android).
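The fan-out described above can be sketched as follows. This is an illustration under assumptions rather than the real GroupPusher or Pusher interface: `group_push`, `lookup_users`, `send_batch`, the key format, and the batch size are all hypothetical. The only figure taken from the article is the rate of about 3,000 pushes per second.

```python
from typing import Callable, Dict, List


def group_push(
    key: str,
    payload: Dict[str, bool],
    lookup_users: Callable[[str], List[str]],
    send_batch: Callable[[List[str], Dict[str, bool]], None],
    batch_size: int = 500,
) -> int:
    """Send the payload to every user stored under the key; return the count."""
    users = lookup_users(key)  # stands in for the Cassandra lookup
    sent = 0
    for start in range(0, len(users), batch_size):
        batch = users[start:start + batch_size]
        send_batch(batch, payload)  # stands in for the call to Pusher
        sent += len(batch)
    return sent


def estimated_seconds(user_count: int, pushes_per_second: int = 3000) -> float:
    """Rough delivery time at a steady push rate."""
    return user_count / pushes_per_second


# In-memory stand-ins for the user table and the push service.
table = {"flag:new_feature": ["u1", "u2", "u3"]}
batches: List[List[str]] = []

sent = group_push(
    "flag:new_feature",
    {"new_feature": False},
    lookup_users=lambda k: table.get(k, []),
    send_batch=lambda users, payload: batches.append(list(users)),
    batch_size=2,
)
print(sent, batches, estimated_seconds(3_000_000))
```

The estimate also shows why targeting matters: at about 3,000 pushes per second, a hypothetical audience of 3,000,000 users would take roughly 1,000 seconds to reach, so restricting the push to affected users shortens the time to mitigation.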
This notification is silent to the user, but behind the scenes, the app is being reconfigured. When the payload is received, the flag configuration overwrites any previously existing flag states in the cache, which mitigates the bug until the app can fetch a new Treatment Service payload with the updated configuration.
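A minimal sketch of the client-side step might look like the following. The JSON layout (a top-level `flags` object) and the function name `apply_silent_push` are assumptions made for illustration; the article does not specify the payload format.

```python
import json
from typing import Dict


def apply_silent_push(cache: Dict[str, bool], raw_payload: str) -> Dict[str, bool]:
    """Overwrite cached flag states with the values carried by a silent push."""
    message = json.loads(raw_payload)
    overrides = message.get("flags", {})  # hypothetical payload field
    for name, enabled in overrides.items():
        cache[name] = bool(enabled)  # pushed value replaces the cached one
    return cache


# Cached state left over from an earlier Treatment Service payload.
cache = {"new_feature": True, "other_flag": True}

# The push disables only the flag that gates the buggy code.
apply_silent_push(cache, '{"flags": {"new_feature": false}}')
print(cache)
```

Flags not named in the push keep their cached values, so in this sketch the override touches only what is being rolled back.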
The long-term push for Uber’s XP
XP Background Push was developed as a tool for mitigating outages. Each of Uber’s apps contains hundreds and in some cases thousands of flags responsible for configuring the app. XP Background Push ensures that if an issue arises, our team will be able to address it as long as it is gated by a flag. In turn, this confidence empowers us to make mobile experiences more seamless for our users.
If you are up for the challenge of helping us grow Uber’s XP, consider applying for a role on our team!
AJ Ribeiro is a San Francisco-based software engineer on Uber’s Experimentation Platform team.