What we learned from an iOS app OOMs incident | by Pinterest Engineering | Pinterest Engineering Blog | Jun, 2021


Liang Ma | Software Engineer, App Foundations

In early 2020, we started seeing significantly elevated out-of-memory (OOM) crashes in the Pinterest iOS app. The incident pulled our Crash-Free Users Rate (CFUR) down from 99% to 96%, a steep drop. What was going on?

We improved many systems along the journey, but those learnings could be a separate blog post. The primary focus of this post is to share with the broader iOS community what we learned from this iOS-specific issue.

For context, in the Pinterest iOS app, we use NSURLSession to talk to Pinterest REST API endpoints. Most endpoints had been on HTTP2 for years, while some were left behind for varying reasons. After spending a good amount of time troubleshooting, we ruled out several red herrings from the same period and found that when HTTP2 was enabled on certain endpoints, the Pinterest iOS app experienced around 20 times more OOM crashes than usual.

The figures below, which were captured later in a replication test, may give you a sense of how closely the two were tied together:

Figure 1 — when HTTP2 was enabled on that endpoint
Figure 2 — OOMs spiked instantly, in line with the HTTP2 timeline.

TL;DR: the culprit was the incorrect use of the HTTPBodyStream pattern in our code. When particular errors are triggered, the app can leak multiple gigabytes of memory within a minute, which leads to an OOM crash. More details are discussed below.

It was not straightforward to relate the spike in OOM crashes to HTTP2, and then to narrow the root cause down to a generic HTTP issue in our code. In fact, it took us a couple of months to sort that out. Here are a few things we did:

  • Duplicate a test endpoint: We duplicated a test endpoint for HTTP2 testing and verification, and used an experiment to control which iOS clients connected to it. Another benefit of this temporary endpoint was that we could later point patched app versions at it to adopt HTTP2 before we turned HTTP2 on again for the original endpoint (in this case, almost one year later).
Figure 3 — mocked by triggering errors, on the simulator.

With the memory graph, it’s much easier to dig into memory allocations and figure out which objects are suspicious. Also, command-line tools (vmmap, leaks, etc.) are very useful in analyzing raw .memgraph files too.

Figure 4 — generate memory graph
Figure 5 — CFNetwork objects are listed on the top.

Through communication with the team at Apple, I learned about a known issue in which network requests might enter a loop that consumes excessive memory if the HTTP headers are too large. I was able to reproduce an OOM crash with mocked large headers, but fortunately our app didn’t have the large request headers problem. Still, this clue inspired me to dig more deeply into the lower network layer.

As mentioned previously, the root cause is the incorrect use of HTTPBodyStream. In our code, some requests use HTTPBodyStream to provide body data, which works fine in normal cases even if -URLSession:task:needNewBodyStream: is not implemented.

However, according to Figure 6 (taken from the API document of -URLSession:task:needNewBodyStream:):

Figure 6 — API documentation

The delegate method -URLSession:task:needNewBodyStream: must be implemented for a request with HTTPBodyStream set to work properly in all circumstances. Oops!
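To make the requirement concrete, here is a minimal Swift sketch of a delegate that can hand out a fresh body stream on request. It is not the code from our app: `RestreamingDelegate`, `register(_:for:)`, and `makeStreamedTask` are hypothetical names, and the sketch assumes the body can be recreated from a factory closure (for example, by reopening a file or wrapping in-memory data) and that the session was created with this delegate.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on non-Apple platforms
#endif

/// Hypothetical delegate: remembers how to rebuild each task's body stream
/// so that a resent request never reuses a stream that was already consumed.
final class RestreamingDelegate: NSObject, URLSessionTaskDelegate, @unchecked Sendable {
    private let lock = NSLock()
    private var factories: [Int: () -> InputStream?] = [:]  // keyed by taskIdentifier

    /// Call this before `task.resume()`.
    func register(_ makeStream: @escaping () -> InputStream?, for task: URLSessionTask) {
        lock.lock(); defer { lock.unlock() }
        factories[task.taskIdentifier] = makeStream
    }

    // Called when the task has to resend a request that has a body stream,
    // e.g. after an authentication challenge or a recoverable server error.
    func urlSession(_ session: URLSession, task: URLSessionTask,
                    needNewBodyStream completionHandler: @escaping (InputStream?) -> Void) {
        lock.lock()
        let makeStream = factories[task.taskIdentifier]
        lock.unlock()
        // A stream that has been read cannot be rewound, so always build a new one.
        completionHandler(makeStream?())
    }

    // Drop the factory once the task is finished.
    func urlSession(_ session: URLSession, task: URLSessionTask,
                    didCompleteWithError error: Error?) {
        lock.lock(); defer { lock.unlock() }
        factories[task.taskIdentifier] = nil
    }
}

/// Hypothetical helper: creates a POST task whose body comes from a stream.
/// Assumes `session` was created with `delegate` as its delegate.
func makeStreamedTask(session: URLSession, delegate: RestreamingDelegate, url: URL,
                      makeStream: @escaping () -> InputStream?) -> URLSessionDataTask {
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.httpBodyStream = makeStream()      // stream for the first attempt
    let task = session.dataTask(with: request)
    delegate.register(makeStream, for: task)   // stream source for any resend
    return task
}
```

The important design point is that the delegate stores a way to create a stream rather than the stream itself, so every resend gets an unread stream.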

  • People might think that after setting HTTPBodyStream on an NSURLRequest, the job is done and Apple's networking stack will take care of everything. They may not realize that they also need to implement -URLSession:task:needNewBodyStream:, since everything works fine in normal cases until an “authentication challenge or recoverable server error” happens.
Figure 7 — OOMs chart
Figure 8 — HTTP 3xx timeline
  • The API doc doesn’t clearly state the consequence of not implementing -URLSession:task:needNewBodyStream:, which is not as simple as a failed request. Instead, when the problem is triggered, the request is very likely to go through excessive network hops and eventually cause an OOM crash.

You may wonder why the OOMs only happened on certain endpoints. That’s mainly because requests to those endpoints used HTTPBodyStream incorrectly and were made more frequently than requests to other endpoints. They could also keep running while the app was in the background. Taken together, these factors made those requests more likely to run into the errors.

In the end, we were able to verify our fix in the combinations below:

  • Strictly follow Apple’s API guidelines, as the cost of fixing such errors in production may be very high.
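As one illustration of following the guideline, Apple's documentation for the same delegate method notes that it does not need to be implemented when the request body is provided as an NSData object or a file URL. For payloads that already fit in memory, a request can therefore avoid body streams entirely. The sketch below assumes a small JSON payload; `makeJSONRequest` is a hypothetical helper, not code from our app.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms
#endif

/// Hypothetical helper: builds a JSON POST whose body is held as Data,
/// so no body stream (and no needNewBodyStream handling) is involved.
func makeJSONRequest(url: URL, payload: [String: String]) throws -> URLRequest {
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: payload)
    return request
}
```

Streams remain the right tool for large uploads, in which case the delegate method has to be implemented.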

Acknowledgements: Thanks to Bill Kunz, Jon Parise, Scott Beardsley, iOS platform team & Arpit Diggi for their great support along the way. Kudos to everyone who had been involved.

To learn more about Engineering at Pinterest, check out our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.


