Analyzing distributed trace data – Pinterest Engineering – Medium

Brittany Herr | Pinterest engineering intern 
Naoman Abbas | Pinterest engineer, Visibility

Earlier this year we open sourced Pintrace, a distributed tracing pipeline. A trace follows a request as it moves through a backend, tracking where and how it spends its time. Traces are made up of spans, each of which represents an operation. At Pinterest, there are tens of services and hundreds of network calls per-trace. The information from distributed tracing provides deep visibility into what goes on in our backend and helps identify bottlenecks .

However, one trace isn’t representative of the thousands of traces logged each minute (let alone the millions of requests per minute these traces are sampled from). Additionally, any single trace could have been an outlier or logged incorrectly. When it comes to assessing the overall performance of the Pinterest backend, we need to use trace data in a way that gives us a much bigger picture. That’s why we built the Pintrace Trace Analyzer, which performs an aggregated analysis of traces. Looking at stats from thousands of traces over a longer period of time not only weeds out the outliers/buggy traces, but provides a holistic view of performance.

Waterfall diagram depicting a single trace. Each row is a span: in each span, the light blue block represents the time taken for that span, and the dark blue label has the service name.

Pintrace Trace Analyzer

The power of this tool is its ability to compare two batches of traces — displaying stats for each of the two and highlighting the changes. An unexpected and significant change in a metric can indicate that something’s going wrong in a deployment. We focused on two indicators for Pintrace Trace Analyzer:

  1. Per-service latency: how long a service takes to perform its operation.
  2. Number of network calls: how many times a service was called.

A significant (and unexpected) change in either of these could mean an error in the code which could cause a production issue, resulting in increased wait times for Pinners. Seeing the changes in latency patterns or spikes in network calls backed by information from thousands of traces can shed light on areas of concern in the production code.

How it works

With this tool, an engineer can hone in on these problem areas by comparing two different groups of traces, as specified by the user. “Different” here can mean:

  • Different time periods, i.e. before and after a deployment or incident.
  • Across different devices, e.g. web and iOS.
  • From different countries.
  • Types of requests, e.g. v3_home_feed and v3_get_pin

The Trace Analyzer consists of two main parts–a Spark job (for high-capacity data processing) and a Jupyter notebook UI.

Workflow of Trace Analyzer. Blue components were built in this project and grey components already existed as part of Pintrace.
  1. The user enters two sets of trace specifications into the Jupyter interface.
  2. A Python script posts the Spark job to Chronos with the user parameters.
  3. The Spark job pulls span data from Elasticsearch.
  4. The job runs the analysis and pushes the results to Elasticsearch.
  5. Jupyter pulls the results from Elasticsearch and displays the visual report for the user.

These are a few example diagrams generated by the report:

Overview table comparing two batches of traces.
Histogram showing overall latency for traces in each batch.
Sample traces (trace ID and value) for various percentiles that link to the traces in Pintrace.

What this means for us

This tool more efficiently and effectively targets sources of concern and reduces latency incidents. We ran a comparison report on trace data for a specific date (with bad performance) against a normal performance.

Overview table that shows a huge latency increase.

The below table breaks down number of RPCs per service, we found a spike in number of calls for the service “pinandboardservice.”

This indicated unnecessary calls to an API which slowed down requests. With information like this, we can quickly detect the source of an issue and fix any bugs.

Future steps

We’ve used this tool after incidents to identify where an error occurred and diagnose what happened . We’re eager to continue exploring future extensions with Pintrace as this is only the tip of the distributed tracing iceberg. Some other possible use cases are:

  • Integrating it with our deployment process to detect whether a new version of the app will cause performance issues. This would allow us to use the tool as a preventative measure rather than passively as a debugging tool.
  • Automatically generating reports so engineers can easily check the status of each day’s deployments.
  • Setting up alerts for when latency or number of RPC hits a certain threshold.

The use cases for this tool will expand as the instrumentation of our traces expands. For example, we plan to add versioning or experiment information to each trace, so we can compare stats across these differentiators.

What we’ve implemented so far establishes the base of aggregate trace analysis. By adding extensions to this project, we’ll be able to further leverage trace data to ultimately provide Pinners with a faster experience.

Source link