Liang Ma | Software Engineer, App Foundations
In early 2020, we started seeing significantly elevated out-of-memory (OOM) crashes in the Pinterest iOS app. The incident pushed our Crash-Free Users Rate (CFUR) down from 99% to 96%, which was a steep drop. What was going on?
We improved many systems along the journey, but those learnings could fill a separate blog post. The primary focus of this post is to share with the broader iOS community what we learned from this iOS-specific issue.
For context, in the Pinterest iOS app, we use NSURLSession to talk to Pinterest REST API endpoints. Most endpoints had been on HTTP2 for years, while some had been left behind for varying reasons. After spending a good amount of time troubleshooting, we ruled out the red herrings from the same period and found that when HTTP2 was enabled on certain endpoints, the Pinterest iOS app experienced around 20 times more OOM crashes than usual.
The figures below, which were captured later in a replication test, may give you a sense of how closely the two were tied together:
TL;DR: the culprit was the incorrect use of the HTTPBodyStream pattern in our code. When particular errors are triggered, the app can leak multiple gigabytes of memory within a minute, leading to an OOM crash. More details are discussed below.
It was not straightforward to relate the spike in OOM crashes to HTTP2 and then narrow the root cause down to a generic HTTP issue in the code. In fact, it took us a couple of months to sort that out. Here are a few things we did:
- Duplicate a test endpoint: We duplicated a test endpoint for HTTP2 testing/verification purposes, and an experiment controlled which iOS clients connected to it. Another benefit of this temporary endpoint was that we could later point patched app versions at it to adopt HTTP2 before turning HTTP2 on again for the original endpoint (almost one year later, in this case).
- Contextual logging: We logged memory metadata (total, available, footprint, peak, etc., from task_vm_info) and network metrics (latency, error code, payload size, etc.), along with existing context events. These were useful for finding patterns in the OOMs, such as how memory accumulated and spiked in response to foreground/background mode switches, anomalous network timeouts, and similar events.
- Memory tools by Xcode: These memory tools were the ultimate weapons that we used to narrow down the root cause to exact code:
- Monitor the memory report and be ready to debug the memory graph or profile in Instruments whenever you see an abnormal memory increase.
With the memory graph, it’s much easier to dig into memory allocations and figure out which objects are suspicious. Also, command-line tools (vmmap, leaks, etc.) are very useful in analyzing raw .memgraph files too.
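The memory logging mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not our production logger: the `MemorySnapshot` type and the `captureMemorySnapshot()` function are hypothetical names, and only two of the fields we logged are shown. It reads `phys_footprint` from `task_vm_info` on Apple platforms and falls back to `nil` elsewhere.

```swift
import Foundation

/// Hypothetical snapshot type; the field names are illustrative only.
struct MemorySnapshot {
    let physicalMemory: UInt64  // total device RAM, in bytes
    let footprint: UInt64?      // task_vm_info.phys_footprint, nil if unavailable
}

/// Captures a memory snapshot that can be attached to a log event.
func captureMemorySnapshot() -> MemorySnapshot {
    var footprint: UInt64? = nil
    #if canImport(Darwin)
    var info = task_vm_info_data_t()
    // task_info expects the buffer size expressed in integer_t units.
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size)
    let result = withUnsafeMutablePointer(to: &info) { pointer in
        pointer.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    if result == KERN_SUCCESS {
        footprint = info.phys_footprint
    }
    #endif
    return MemorySnapshot(
        physicalMemory: ProcessInfo.processInfo.physicalMemory,
        footprint: footprint)
}
```

Capturing a snapshot like this next to each network event (for example, on every request completion or timeout) is what makes it possible to line up memory growth against specific requests afterwards.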
Through communication with the team at Apple, I learned about a known issue in which network requests might enter a loop that consumes excessive memory if the HTTP headers are too large. I was able to reproduce an OOM crash with mocked large headers, but fortunately, we didn’t have the large request headers problem in our app. Still, this clue inspired me to dig more deeply into the lower network layer.
As mentioned previously, the root cause was the incorrect use of HTTPBodyStream. In our code, some requests use HTTPBodyStream to provide body data, which works fine in normal cases even if -URLSession:task:needNewBodyStream: is not implemented.
However, according to Figure 6 (taken from the API document of -URLSession:task:needNewBodyStream:):
The delegate method -URLSession:task:needNewBodyStream: must be implemented for things to work properly in all circumstances if the request has HTTPBodyStream set. Oops!
- People might think that after setting HTTPBodyStream on an NSURLRequest, the job is done and Apple's networking stack takes care of everything. They may not realize that they also need to implement -URLSession:task:needNewBodyStream:, since everything works fine in normal cases until an “authentication challenge or recoverable server error” happens.
- Recoverable server errors might not be that rare; for example, Request Timeout failures can be considered recoverable errors. We also found that HTTP redirection is a cause, and it’s guaranteed to trigger the memory problem (combined with the HTTPBodyStream issue above):
- The API doc doesn’t clearly state the consequence of a missing -URLSession:task:needNewBodyStream:, which is not as simple as a failed request. Instead, when the problem is triggered, the request is very likely to run into excessive network hops, eventually causing the OOM issue.
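The requirement described above can be sketched in Swift as follows. This is a minimal illustration under stated assumptions, not the exact fix we shipped: `replayableBody`, `StreamingBodyDelegate`, and `makeStreamedUpload` are hypothetical names, and the body is assumed to be small enough to keep in memory as `Data`. The key idea is that the delegate can hand the system a fresh, unopened stream whenever the body has to be sent again.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Returns a factory that builds a fresh, unopened stream on every call.
func replayableBody(_ data: Data) -> () -> InputStream {
    return { InputStream(data: data) }
}

/// Hypothetical session delegate that can re-create the body of each streamed request.
final class StreamingBodyDelegate: NSObject, URLSessionTaskDelegate {
    private let lock = NSLock()
    private var factories: [Int: () -> InputStream] = [:]

    /// Remember how to rebuild the body for this task.
    func register(_ factory: @escaping () -> InputStream, for task: URLSessionTask) {
        lock.lock(); defer { lock.unlock() }
        factories[task.taskIdentifier] = factory
    }

    // Called when the body must be re-sent, e.g. after a redirect,
    // an authentication challenge, or a recoverable server error.
    func urlSession(_ session: URLSession, task: URLSessionTask,
                    needNewBodyStream completionHandler: @escaping (InputStream?) -> Void) {
        lock.lock()
        let factory = factories[task.taskIdentifier]
        lock.unlock()
        completionHandler(factory?())
    }

    // Drop the factory once the task has finished.
    func urlSession(_ session: URLSession, task: URLSessionTask,
                    didCompleteWithError error: Error?) {
        lock.lock(); defer { lock.unlock() }
        factories[task.taskIdentifier] = nil
    }
}

/// Builds a streamed POST. `session` is assumed to have been created with
/// `delegate` as its delegate; the caller is responsible for calling resume().
func makeStreamedUpload(to url: URL, body: Data, session: URLSession,
                        delegate: StreamingBodyDelegate) -> URLSessionTask {
    let makeBody = replayableBody(body)
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.httpBodyStream = makeBody()
    let task = session.dataTask(with: request)
    delegate.register(makeBody, for: task)
    return task
}
```

A stream that has already been read cannot simply be rewound and handed back, which is why the sketch stores a factory rather than the original stream object.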
You may wonder why the OOMs only happened on certain endpoints. That’s mainly because requests to those endpoints used HTTPBodyStream incorrectly and were called more frequently than requests to other endpoints. They could also continue running in background mode. Added together, these factors made them more likely to run into the errors.
In the end, we were able to verify our fix in the combinations below:
- Strictly follow Apple’s API guidelines, as the cost of fixing such errors in production may be very high.
- OOM crashes are critical and should be part of your app stability score.
- The WWDC iOS memory deep dive video is worth watching.
Acknowledgements: Thanks to Bill Kunz, Jon Parise, Scott Beardsley, iOS platform team & Arpit Diggi for their great support along the way. Kudos to everyone who had been involved.