We use our own library to capture Java crashes on Android, Breakpad to capture native crashes on Android, and PLCrashReporter to capture iOS crashes. Caching logic on the client holds captured crashes and sends the reports when the network becomes available. We normalize crash data, regardless of platform (iOS or Android), into a crash payload backed by an Avro schema for consistent reporting.
After a client library captures a crash, the crash is sent to the tracking frontend service, which in turn publishes a crash event to Kafka. Our data processing job listens for crash events in the incoming Kafka stream, sanitizes the crash data as it is received, and symbolicates the crashes.
After this process, we produce another Kafka event that is processed by Inception. Inception takes care of deduping exceptions, provides an API server to query exceptions, and also forwards each crash instance to Elasticsearch.
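The post does not describe how Inception decides that two crashes are the same exception. As a rough illustration only, one common approach is to hash the exception type together with the top few stack frames after stripping line numbers. The names `fingerprint` and `Deduper`, the choice of five frames, and the frame format are all assumptions made for this sketch, not Inception's actual logic.

```python
import hashlib
import re
from collections import Counter
from typing import Sequence

# Matches the ":<line>)" suffix in frames such as "com.example.Foo.bar(Foo.java:42)".
_LINE_NUMBER = re.compile(r":\d+\)")


def fingerprint(exception_type: str, frames: Sequence[str], top_n: int = 5) -> str:
    """Return a stable key for a crash: exception type plus its top frames.

    Line numbers are removed so that small source shifts between builds
    do not split one logical exception into many (an assumed policy).
    """
    normalized = [_LINE_NUMBER.sub(")", frame.strip()) for frame in frames[:top_n]]
    key = "\n".join([exception_type, *normalized])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


class Deduper:
    """Counts crash instances per fingerprint (kept in memory for illustration)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def add(self, exception_type: str, frames: Sequence[str]) -> str:
        key = fingerprint(exception_type, frames)
        self.counts[key] += 1
        return key
```

With this scheme, two NullPointerException reports whose top frames differ only in line number fall under one key, while the individual instances can still be forwarded elsewhere for detailed search.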
To get the session count and unique user count, we built a separate processing pipeline that listens for application session events in the Kafka stream and stores the data in Pinot. We chose Pinot, a real-time distributed OLAP datastore, because we need distinct user counts and total session counts for a given time range over a large volume of data.
Harrier is the UI component, which talks to all three data sources (Elasticsearch, Inception, and Pinot) to present information to the user. The UI is also integrated with JIRA to track exceptions.
The complexity of the system increased with multiple backends, some of which were relatively new to us, and we ran into problems while building it. Let’s go over the challenges we faced and how we resolved them.
Challenges and solutions
We saw that some crashes were not being emitted to our data center from the Android app. Our investigation revealed that, on Android, a crash report cannot be sent from a newly spawned thread while the virtual machine is being shut down by the mobile app. The underlying networking stack could spawn new threads to send the report, so those reports were lost. Hence, we started sending the crash report on the crashing thread itself, using HttpURLConnection. If the crash cannot be sent for some reason, it is stored locally on the device and emitted the next time the app opens. We also limit the number of crashes that can be stored on the user's device.
When we added Breakpad to capture native crashes in our Android apps, native crashes stopped showing up in the Google Play developer console. This happens because only one signal handler can be registered for capturing native crashes in an Android app. We are fine with this, however, since we wrote our own tool to symbolicate native crashes on Android.
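The control flow described above (send synchronously, fall back to a capped local store, retry on next launch) can be sketched in a language-neutral way. The following is a Python illustration, not the Android client code: `CrashReporter`, the `send` callback, the default cap of 10, and the drop-oldest eviction policy are assumptions made for the example, and an in-memory deque stands in for on-device storage.

```python
from collections import deque
from typing import Callable, Deque


class CrashReporter:
    """Send a crash on the calling thread; keep a bounded backlog on failure."""

    def __init__(self, send: Callable[[bytes], None], max_stored: int = 10) -> None:
        self._send = send
        # deque(maxlen=...) silently drops the oldest entry when full (assumed policy).
        self._pending: Deque[bytes] = deque(maxlen=max_stored)

    def report(self, payload: bytes) -> bool:
        """Blocking send with no helper threads; store the payload if it fails."""
        try:
            self._send(payload)
            return True
        except OSError:
            self._pending.append(payload)
            return False

    def flush_pending(self) -> int:
        """Call on the next app start; stops at the first failure and keeps the rest."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending_count(self) -> int:
        return len(self._pending)
```

The important property is that `report` never hands work to another thread, so nothing depends on thread creation while the process is going down.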
With respect to symbolication for exceptions from iOS, downloading symbol files for built-in libraries for a given iOS version is a big challenge. The only way to obtain the symbol files is by connecting a device running the desired iOS version. The symbol files get stored in ~/Library/Developer/Xcode/iOS DeviceSupport. If you are missing symbols for an older version of iOS, it gets harder to find a device which runs the old iOS version to download symbols. We utilize our mobile device lab for getting iOS symbols.
Our first line of data processing is a Samza job that does data treatment and symbolication. Early on, the Samza job kept running out of memory, and we had no additional information to debug the issue. We used the YourKit profiler to find the root cause. The profiler report revealed that the job ran out of memory once it held around 16K events in memory, while the default Samza configuration keeps up to 50K events in memory. Crash payloads are huge, with stack traces and other details, so the job failed well before it reached the default limit. We therefore reduced the number of events kept in memory from the default 50K to 10K, which stopped the Samza job from running out of memory.
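The numbers in this paragraph can be checked with a little arithmetic. This sketch uses only the figures quoted in the text; the helper names `overshoot` and `headroom` are made up for the illustration.

```python
DEFAULT_BUFFERED_EVENTS = 50_000  # Samza default cited in the text
OBSERVED_FAILURE_EVENTS = 16_000  # approximate point where the job ran out of memory
CHOSEN_BUFFERED_EVENTS = 10_000   # value the job was reconfigured to


def overshoot(limit: int, failure_point: int) -> float:
    """How many times larger the configured limit is than the failure point."""
    return limit / failure_point


def headroom(limit: int, failure_point: int) -> float:
    """Fraction of the failure point left unused by the configured limit."""
    return 1.0 - limit / failure_point


print(overshoot(DEFAULT_BUFFERED_EVENTS, OBSERVED_FAILURE_EVENTS))  # 3.125
print(headroom(CHOSEN_BUFFERED_EVENTS, OBSERVED_FAILURE_EVENTS))    # 0.375
```

In other words, the default allowed roughly three times more buffered events than the job could hold, while the new limit stays about 37% below the observed failure point.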
On the Elasticsearch side, we removed the daily indexes and switched to weekly or monthly indexes to keep the cluster stable. Every shard of an Elasticsearch index is a Lucene index that consumes resources (memory, file handles, etc.), and with one index per day our cluster ran out of memory and Logstash was not able to ingest records. The switch reduced the total number of indexes by a factor of up to 30 (for monthly indexes). We also moved fatal and non-fatal exceptions into separate indexes to reduce the index size and improve query performance, and we added sharding for non-fatal errors. Together, these changes reduced the query time from 120 seconds to 5 seconds.
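To make the effect of index granularity concrete, here is a small sketch that derives an index name from an event date and counts how many indexes a period produces. The prefixes `crashes-fatal` and `crashes-nonfatal` and the name formats are hypothetical; the post does not give the actual naming scheme.

```python
from datetime import date, timedelta


def index_name(prefix: str, day: date, granularity: str) -> str:
    """Build an index name for the given day at daily, weekly, or monthly granularity."""
    if granularity == "daily":
        return f"{prefix}-{day:%Y.%m.%d}"
    if granularity == "weekly":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    if granularity == "monthly":
        return f"{prefix}-{day:%Y.%m}"
    raise ValueError(f"unknown granularity: {granularity}")


def count_indexes(prefix: str, start: date, days: int, granularity: str) -> int:
    """Number of distinct indexes needed to cover `days` days from `start`."""
    names = {
        index_name(prefix, start + timedelta(days=offset), granularity)
        for offset in range(days)
    }
    return len(names)


june = date(2016, 6, 1)
for granularity in ("daily", "weekly", "monthly"):
    fatal = count_indexes("crashes-fatal", june, 30, granularity)
    nonfatal = count_indexes("crashes-nonfatal", june, 30, granularity)
    print(granularity, fatal + nonfatal)
```

For the 30 days of June 2016, each prefix needs 30 daily indexes, 5 weekly indexes, or a single monthly index, which is where a reduction factor of up to 30 comes from.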
The UI to visualize crashes was powered by Ember and Flask. Initially, our caching logic was to display a snapshot of the system state to the user and update the snapshot whenever someone visited the page. However, this led to the following problems:
Users saw stale data on the first hit;
It did not scale with the amount of data we were receiving;
The system was neither performant nor stable when there was a huge amount of data.
Crash data is stored in two granularities: metadata, which contains the count, and details data, which contains the stack trace. We switched to pagination and removed caching logic from the UI. First, we paginated only the details data and pulled all the metadata at once. This approach failed to scale with the number of unique exceptions received. Finally, we paginated the fetching of metadata as well as the details data to make the UI stable and performant. This also solved all the performance problems that we previously had.
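A minimal sketch of the final approach, where both metadata and details are fetched one page at a time, might look like the following. The `fetch_page` helper, the `query(offset, limit)` callback signature, and the page size of 50 are assumptions for illustration; the real Flask endpoints and backend queries are not shown in the post.

```python
from typing import Any, Callable, Dict, List


def fetch_page(
    query: Callable[[int, int], List[Any]],
    page: int,
    page_size: int = 50,
) -> Dict[str, Any]:
    """Fetch one page from a backend that supports offset/limit queries.

    One extra row is requested so that `has_next` can be computed without
    a separate count query.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    rows = query(offset, page_size + 1)
    return {
        "page": page,
        "items": rows[:page_size],
        "has_next": len(rows) > page_size,
    }


# Stand-in backend: 120 exception metadata records held in a list.
metadata = [{"exception_id": i, "count": 1} for i in range(120)]


def query_metadata(offset: int, limit: int) -> List[Any]:
    return metadata[offset:offset + limit]


first = fetch_page(query_metadata, page=1)
last = fetch_page(query_metadata, page=3)
print(len(first["items"]), first["has_next"])  # 50 True
print(len(last["items"]), last["has_next"])    # 20 False
```

The same helper can serve the details view by passing a different `query` callback, so neither request grows with the total number of unique exceptions.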
Having put a lot of energy and time into making sure that our backends and UI met our needs, it was time to test our system. One of our goals this year is to onboard SREs to support the system that we are building, so we came up with a strategy to test the system and onboard SREs at the same time. Developers take down pieces of our backend pipeline one by one and ensure that alerts are triggered as expected, and then SREs bring the system back up. For symbolication, we also added tracking information so that we get alerts when OS symbols or app symbols are missing. We monitor our systems using inGraphs and have set up automatic alerts that ping the on-call SRE whenever a system is in trouble. The hands-on testing strategy worked great, as the SREs and developers are on the same page about why alerts are triggered and what they indicate. It also helped us fix gaps in communication quickly.
Thinking about it now, we have come a long way, overcoming a lot of obstacles and challenges along the way. We have a reliable client library, a stable backend, and a performant UI.
We have multiple pieces involved in the system and each of them is critical to making this project a success. We would like to specifically thank the following engineers who helped in building the system: Toon Sripatanaskul, Karthik Ramgopal, Neville Carvalho, Sreedhar Veeravalli, Steven Pham, and Jerry Weng. We would like to thank managers Hans Granqvist, David He, and our director Brandon Duncan for supporting the project. Special thanks to Vasanthi Renganathan for helping us keep track of the project and being instrumental in helping us reach the goal. We would also like to thank our Tools SRE folks for supporting our systems.