API Profiling at Pinterest – Pinterest Engineering – Medium


Anika Mukherji | API Intern

When I walked into Pinterest on the first day of my internship and learned I’d be focusing on profiling the API Gateway service, the core backend service of the Pinterest product, my only thought was “What is profiling?”. Profiling is often shoved aside as a side project or a lower-priority concern, and it wasn’t a topic typically taught in my college CS courses. Essentially, the services come first, and profiling is a distant second. Moreover, profiling is not always seen as a precursor to optimization, which can result in time wasted optimizing code that doesn’t significantly affect performance in production. That being said, profiling is a critical step in the software development process for creating a truly performant system.

Before my arrival at Pinterest, a basic webapp had been built to accompany a regularly scheduled CPU profiling job (and consequent flamegraph generation) for all of our Node and Python services. My primary goal for the summer was to expand this tool for our API Gateway service, with the flexibility to adapt it for other services in the future and eventually to allow profiling of all Pinterest services. The three arms of functionality I worked on were memory profiling, endpoint operational cost calculation, and dead code detection.

Solving for increased optimization

I primarily worked on optimizing optimization, including expanding resource tracking and profiling tooling. In terms of performance in production, our evaluation of resource utilization for the API Gateway service was limited to CPU usage. There was a need for a holistic assessment of which parts of the API Gateway service were performant, and which parts of the codebase needed quality improvement. With that information, developer resources could be allocated to the least performant endpoints, and we could improve the overall process of optimization.

What exactly is profiling?

Software profiling is a type of dynamic program analysis that aims to facilitate optimization by collecting statistics associated with the execution of the software. Common profiling measurements include CPU usage, memory usage, and frequency of function calls. Essentially, profiling scripts are executed in tandem with another running program for a certain duration of time (or for the entire run of the program being profiled), and they output a profile, or summary, of relevant statistics afterwards. The recorded metrics can then be used to evaluate and analyze how the program behaves.

There are two common types of approaches to profiling:

Event-Based Profiling:

  • Track all occurrences of certain events (such as function calls, returns, and thrown exceptions)
  • Deterministic (more accurate)
  • Heavy Overhead (slower, more likely to impact profiled process)
  • Example Python packages include: cProfile/profile, pstats, line_profiler

Statistical Profiling:

  • Sample data by probing call stack periodically
  • Non-deterministic (less accurate, though you can mitigate through stochastic noise reduction)
  • Low Overhead (faster, less likely to impact profiled process)
  • Example Python packages include: vmprof, tracemalloc, statprof, pyflame

We opted for statistical profiling on our production machines because of its lower overhead. If the job runs regularly over long periods of time, accuracy increases with the number of samples collected, without the added response latency that heavy overhead would cause. While profiling is important, it should not itself harm production performance.
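As a rough illustration of the statistical approach, the sketch below samples the call stack of one thread every few milliseconds and counts which function was on top at each sample. It is a toy built on the standard library's `sys._current_frames()`, not a description of how vmprof or pyflame work internally. `StackSampler` and `busy_work` are hypothetical names, and the 5 ms interval is an arbitrary assumption.

```python
import collections
import sys
import threading
import time


class StackSampler:
    """Periodically record which function a target thread is executing."""

    def __init__(self, interval_s=0.005):
        self.interval_s = interval_s          # assumed sampling period
        self.counts = collections.Counter()   # function name -> sample count
        self._target_id = None
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        # Event.wait() returns False on timeout, so this loops once per interval
        # until stop() sets the event.
        while not self._stop.wait(self.interval_s):
            # sys._current_frames() maps thread id -> that thread's top frame.
            frame = sys._current_frames().get(self._target_id)
            if frame is not None:
                self.counts[frame.f_code.co_name] += 1

    def start(self):
        self._target_id = threading.get_ident()  # sample the calling thread
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def busy_work(duration_s):
    """Burn CPU for roughly duration_s seconds (stand-in for real work)."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


sampler = StackSampler()
sampler.start()
busy_work(0.3)
sampler.stop()
print(sampler.counts.most_common(3))
```

The sample counts only approximate where time is spent, which is the accuracy trade-off described above: a longer run collects more samples and gives a steadier picture, while the profiled code itself runs unmodified.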

Memory profiling

TL;DR: tracemalloc to track memory blocks

Our API Gateway service is written in Python, so the most apparent solution was to use an existing Python package to gather memory stack traces. Python 3’s tracemalloc package was the most appealing, with one large problem: we still use Python 2.7. While our Python 3 migration is underway, it’ll be many months until that project is completed. This incompatibility forced us to patch and distribute our own copy of Python, in addition to using the backported pytracemalloc package. It was just another reminder that updating to the latest version of Python is ideal for both performance and access to the latest tooling.
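For readers who haven't used it, here is a minimal sketch of what tracemalloc offers, written against the Python 3 standard library module rather than the patched Python 2.7 build described above. The ~1 MB `payload` allocation and the 10-frame traceback depth are assumptions chosen only for illustration.

```python
import tracemalloc

# Keep up to 10 frames per allocation traceback (illustrative depth).
tracemalloc.start(10)

before = tracemalloc.take_snapshot()
payload = [bytes(1000) for _ in range(1000)]  # allocate roughly 1 MB
after = tracemalloc.take_snapshot()

# Group allocations by source line and rank them by growth between snapshots.
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:3]:
    print(stat)

current_bytes, peak_bytes = tracemalloc.get_traced_memory()
print("current=%d bytes, peak=%d bytes" % (current_bytes, peak_bytes))

tracemalloc.stop()
```

Each entry in `top_stats` points at a file and line number, which is the kind of per-location data a memory flamegraph can be built from.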

The basic approach here was to run a script on a remote node (one of our API production hosts) that sends signals 15 minutes apart, triggering signal handlers (functions registered to execute when a certain signal arrives).

Signals were a fitting choice because they don’t add any overhead when the signal handler isn’t running, and because we don’t want profiling enabled all the time on all machines, since even a 0.1% overhead at scale is expensive. We decided to overload the SIGRTMIN+N signals to start and stop the profiling job when a signal is received. The stack traces are collected, then saved to a temporary file within /tmp/. Another script is run on the remote host to produce a flamegraph, and then all files are saved to a persistent datastore and sourced by our Profiler webapp.
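The start/stop mechanism can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production handler: it uses Python 3's tracemalloc, picks `SIGRTMIN+1` as a hypothetical offset (falling back to `SIGUSR1` on Unix systems without real-time signals), uses a single toggling handler, and sends the signal to itself instead of from a separate script. `toggle_memory_profiling` and `SNAPSHOT_PATH` are made-up names.

```python
import os
import signal
import tempfile
import tracemalloc

# Assumption: the exact SIGRTMIN+N offset is arbitrary here. Unix only.
if hasattr(signal, "SIGRTMIN"):
    TOGGLE_SIGNAL = signal.SIGRTMIN + 1
else:
    TOGGLE_SIGNAL = signal.SIGUSR1

# Hypothetical file name inside the system temp directory (e.g. /tmp/).
SNAPSHOT_PATH = os.path.join(
    tempfile.gettempdir(), "memory_profile_%d.snapshot" % os.getpid()
)


def toggle_memory_profiling(signum, frame):
    """Start tracing on one signal; dump a snapshot and stop on the next."""
    if tracemalloc.is_tracing():
        tracemalloc.take_snapshot().dump(SNAPSHOT_PATH)
        tracemalloc.stop()
    else:
        tracemalloc.start(25)  # 25 frames per traceback, an arbitrary depth


# Must be called from the main thread of the process being profiled.
signal.signal(TOGGLE_SIGNAL, toggle_memory_profiling)

# Demo: play the role of the external script by signalling ourselves.
os.kill(os.getpid(), TOGGLE_SIGNAL)                # start profiling
workload = [str(i) * 10 for i in range(10000)]     # stand-in for real traffic
os.kill(os.getpid(), TOGGLE_SIGNAL)                # stop and write snapshot

snapshot = tracemalloc.Snapshot.load(SNAPSHOT_PATH)
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Until the first signal arrives the process pays nothing for profiling, which is the property that made signals attractive. In a real deployment the sender would be a separate process using the target's PID.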

Operational cost calculations

TL;DR: Finding the expensive endpoints (and their owners!)

The calculation of endpoint operational costs required combining two sorts of data: resource utilization data and request metrics. Our resource utilization information is given in two units, USD and instance hours, and is provided on a monthly basis.

Using request counts, the relative popularity of each endpoint can be calculated. This popularity is used as a weight to divide the total resources used by the API Gateway service. Since most of our request data is in units of requests per minute, I decided to break cost down to that time scale as well. As each API endpoint has an owner, average operational costs for each owning team are also calculated.
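The weighting described above can be sketched in a few lines. Everything concrete here is a made-up assumption for illustration: the endpoint paths, team names, request rates, the monthly bill, and the 30-day month used to convert a monthly figure to a per-minute one.

```python
from collections import defaultdict

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day billing month


def cost_per_endpoint(monthly_cost_usd, requests_per_minute):
    """Split the per-minute service cost across endpoints by request share."""
    total_rpm = sum(requests_per_minute.values())
    if total_rpm <= 0:
        raise ValueError("need at least one request to apportion cost")
    cost_per_minute = monthly_cost_usd / MINUTES_PER_MONTH
    return {
        endpoint: cost_per_minute * rpm / total_rpm
        for endpoint, rpm in requests_per_minute.items()
    }


def average_cost_per_owner(endpoint_costs, owners):
    """Average the per-endpoint cost over the endpoints each team owns."""
    grouped = defaultdict(list)
    for endpoint, cost in endpoint_costs.items():
        grouped[owners[endpoint]].append(cost)
    return {owner: sum(costs) / len(costs) for owner, costs in grouped.items()}


# Hypothetical inputs.
requests_per_minute = {"/v3/pins": 600, "/v3/boards": 300, "/v3/users": 100}
owners = {"/v3/pins": "team-a", "/v3/boards": "team-a", "/v3/users": "team-b"}

endpoint_costs = cost_per_endpoint(43200.0, requests_per_minute)
owner_costs = average_cost_per_owner(endpoint_costs, owners)
print(endpoint_costs)
print(owner_costs)
```

With these numbers the service costs 1 USD per minute, so an endpoint receiving 60% of requests is assigned about 0.60 USD per minute. The same function works for instance hours by passing that total instead of dollars.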

The ability to identify the most costly endpoints, and the engineers and teams to whom they belong, encourages ownership and proactive monitoring of their performance. It’s important to note that these calculated metrics aren’t absolute sources of truth; rather, their significance lies in how they compare to one another. The objective isn’t exact monetary impact but the ability to identify unperformant outliers.

This approach is naïve in that it fails to account for actual CPU time, as well as the distinction between costly handlers (endpoint-specific functions in the API Gateway) and costly requests. For example, requests can trigger asynchronous tasks that aren’t necessarily attributed to the API Gateway service; the same endpoint can have different cost structures with different parameters, as can different handlers; and downstream services’ processing isn’t associated with the given request.

We could address this by creating an integration test rig that runs a set of known production-like requests and measures CPU time spent relative to a baseline for the application. We could further maximize its impact by incorporating it into our continuous integration process, giving developers key insights into the impact of their code changes. Additionally, tracing via a given Request-ID would enable more holistic coverage of our overall architecture.
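Such a rig does not exist yet, so the following is only a sketch of the measurement at its core: time a callable with `time.process_time()` (CPU time of the current process, excluding sleep) and report it as a multiple of a baseline. `cpu_seconds`, `relative_cpu_cost`, and the two stand-in workloads are hypothetical; a real rig would replay production-like requests against handlers instead.

```python
import time


def cpu_seconds(func, repeats=5):
    """Return the best-of-N CPU time, in seconds, for one call of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        func()
        best = min(best, time.process_time() - start)
    return best


def relative_cpu_cost(candidate, baseline, repeats=5):
    """CPU time of candidate expressed as a multiple of the baseline."""
    base = cpu_seconds(baseline, repeats)
    if base <= 0:
        raise ValueError("baseline too cheap to measure with this clock")
    return cpu_seconds(candidate, repeats) / base


# Stand-in workloads: the candidate does about 20x the baseline's work.
def baseline_request():
    return sum(i * i for i in range(50000))


def candidate_request():
    return sum(i * i for i in range(1000000))


ratio = relative_cpu_cost(candidate_request, baseline_request)
print("candidate costs %.1fx the baseline CPU time" % ratio)
```

Taking the minimum over several repeats reduces noise from other activity on the host, and reporting a ratio rather than raw seconds keeps results comparable across machines of different speeds.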

Dead code detection

TL;DR: Uncovering abandoned code (and deleting it)

Unused and unowned code is a problem. Old experiments, old tests, old files, etc. can rapidly clutter repositories and binaries if they’re able to fly under the radar. Discovering which lines of which files are never executed in production is both useful and easily actionable. In pursuit of identifying this dead stuff hiding in our service, I employed a standard Python test coverage tool.

While the primary use of a test coverage tool is to discover which lines of code are missed by unit and integration tests, you can also schedule a job that runs the same tool on a randomly selected production machine to see which lines of code are “missed” by real traffic. As the job is run several times a day, the lines that are missed in all runs for a given day are surfaced. An annotated version of the file is shown to make it easier to see which lines are “dead” and who to contact about whether the code should be removed.
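The "missed in all runs" step amounts to a set intersection. The sketch below assumes each run has already been reduced to a mapping from file name to the set of line numbers that were never executed; that input shape, the file names, and the function name `consistently_missed` are assumptions, and no particular coverage tool's report format is implied. Files absent from any run are skipped, since nothing is known about them for that run.

```python
def consistently_missed(runs):
    """Return {filename: sorted lines} for lines missed in every run.

    runs is a list of dicts mapping a file name to the set of line numbers
    that were not executed during that run.
    """
    if not runs:
        return {}
    dead = {}
    # Only consider files that were reported in every run.
    files_in_all_runs = set.intersection(*(set(run) for run in runs))
    for filename in files_in_all_runs:
        lines = set.intersection(*(run[filename] for run in runs))
        if lines:
            dead[filename] = sorted(lines)
    return dead


# Hypothetical results from three runs on one day.
runs = [
    {"handlers/pins.py": {10, 11, 42}, "handlers/boards.py": {7}},
    {"handlers/pins.py": {11, 42, 90}, "handlers/boards.py": set()},
    {"handlers/pins.py": {11, 42}},
]

dead_lines = consistently_missed(runs)
print(dead_lines)
```

Here lines 11 and 42 of `handlers/pins.py` are the only candidates: line 10 and line 90 each executed in at least one run, and `handlers/boards.py` was not reported in the third run.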

This is a fairly naïve first pass at detecting dead code: the code is run by a multitude of services and jobs, and determining the dead code common to all of them is a more complex problem that still needs to be addressed more carefully. It’s also fairly expensive, as it uses an event-based collection technique rather than statistical sampling.

What’s next

TL;DR: it’s all for optimization

I don’t have much experience with “big data”, but after building these tools and starting to run the jobs regularly, I was bombarded with large influxes of data. My gut reaction was to shove it all into the webapp and leave developers to find what was useful (more is better, right?). However, I quickly learned that while this data made sense to me, as someone who had spent weeks generating it, it was opaque and arguably impenetrable for engineers who never use flamegraphs or don’t have perspective into operational cost. With respect to utility, it was far from optimal.

It came to my attention that the new features I created would most likely have the following primary uses, so these were the key insights to be surfaced:

  • Finding files and functions that use the most memory
  • Engineers finding how expensive their API endpoints are
  • Starting point for cleaning dead code out of our repositories
  • Finding the most popular and costly parts of the API

To spread company-wide awareness of this tool, I held an engineering-wide workshop with flamegraph-reading and other profiling analysis activities. In just two days, two different potential optimizations (both single-line changes) were found and implemented, saving the company a significant amount of annual spend.

At a surface level, these use cases provide a wide range of insights into resource utilization by the API and which parts of the codebase are used less in production. The bird’s-eye view, however, is much more exciting and motivating. Not all parts of the codebase are created equal: some functions will be executed far more often than others. Spending too many hours on rarely executed endpoints is a poor use of developer resources and is the worst possible strategy for optimizing performance. In other words, blind optimization is not really optimization.


