How we designed our Continuous Integration System to be more than 50% Faster | by Pinterest Engineering | Pinterest Engineering Blog | Feb, 2021

Urvashi Reddy | Software Engineer, Engineering Productivity Team

Earlier this year, the Engineering Productivity team at Pinterest published a blog post called How a one line change decreased our clone times by 99%. In that post, we described how a simple Git configuration sped up clone times in one of our largest repositories at Pinterest. In this post, we’ll talk about how we significantly decreased build times in the CI that serves another major repository at Pinterest. Spoiler alert: it was not a one-line change!

The content covered in this blog was presented at BazelCon 2020. Check out the presentation Designing a Language Agnostic CI with Bazel Queries.

The Engineering Productivity team’s vision is to “build a developer platform that inspires developers to do their best work.” The Continuous Integration (CI) pipelines are an integral part of this platform. They are responsible for validating code changes and producing release artifacts that can be deployed to one of our supported Continuous Delivery platforms. With over 1,000 engineers at the company, our team faces the interesting challenge of providing reliable and fast CI pipelines that serve the major repositories at scale.

To meet that challenge, our team made a few key design choices:

  • Adopt Bazel as our build tool
  • Use language-based monorepos
  • Test and release only the changed code
  • Leverage Bazel’s BUILD file as a contract between CI and developers
  • Create release abstractions with custom Bazel rules
  • Parallelize work as much as possible

Adopting Bazel

Choosing a single multi-language build tool allows our team to create CI workflows that can be applied to any repository using Bazel at Pinterest. Since Bazel is hermetic by design, we can run Bazel targets in CI without needing to configure or manage dependencies on the host machines.

Language based monorepos

Our CI pipelines are triggered when code is committed to a repository, which means having a CI pipeline for every repository. In order to limit the number of CI pipelines and repositories we have to manage, we group our services into one repository per language.

If you’re interested in learning more about the above two choices, check out another BazelCon 2020 presentation from our team called Pinterest’s Journey to a Monorepo.

Test and release only changed code

At scale, running every target within a repository is expensive. Even with Bazel’s cache, running all test targets with bazel test //... still means spending time fetching and loading dependencies for every target. Those network calls to the cache are time consuming and can be avoided entirely if only the minimal set of targets is run.

Additionally, the services within our monorepos vary in commit frequency. Some receive commits daily while others change only sporadically. By running in CI only what’s affected by a change, we can significantly speed up our build times.

So how do we get the minimal set of targets to run? We created a Golang CLI called build-collector to report the targets to run in CI. The CLI takes in a set of commits, looks at the files that were changed, and runs the appropriate Bazel query to find the affected targets. Its output is the list of targets to run. For example, if a couple of source code files were changed, build-collector would run the following query:
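A minimal sketch of how such a query expression could be assembled in Go, assuming the changed files are already available as Bazel labels. The function name and the example labels are hypothetical and are not taken from build-collector itself; only the rdeps and set query functions are standard Bazel.

```go
package main

import (
	"fmt"
	"strings"
)

// rdepsQuery builds a Bazel query expression that selects every target in
// the given universe that transitively depends on any of the changed files.
// Hypothetical helper: the post does not show build-collector's own code.
func rdepsQuery(universe string, changedFiles []string) string {
	return fmt.Sprintf("rdeps(%s, set(%s))", universe, strings.Join(changedFiles, " "))
}

func main() {
	// Hypothetical labels standing in for two changed source files.
	files := []string{"//service/foo:Foo.java", "//service/foo:Bar.java"}

	// The printed expression can be passed as the argument to `bazel query`.
	fmt.Println(rdepsQuery("//...", files))
}
```

Using //... as the universe searches the whole repository for reverse dependencies, which fits the monorepo setup described above.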

The above command uses the rdeps query function to find the reverse dependencies of the source files. The output is a list of targets we can run in CI. In order to get test targets specifically, build-collector wraps the above with a filter using the kind query function:
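As a rough illustration, the filter can be expressed by wrapping any query expression in kind, or alternatively in tests. The helper names and the ".*_test rule" pattern are assumptions made for this sketch; the exact pattern build-collector uses is not given in this post.

```go
package main

import "fmt"

// testTargetsByKind wraps a query expression so that only rules whose kind
// ends in "_test" (java_test, py_test, ...) are kept. The pattern is an
// assumption for illustration.
func testTargetsByKind(expr string) string {
	return fmt.Sprintf("kind(%q, %s)", ".*_test rule", expr)
}

// testTargetsByTests uses Bazel's tests() query function instead, which
// returns the test rules in the expression and expands test_suite targets.
func testTargetsByTests(expr string) string {
	return fmt.Sprintf("tests(%s)", expr)
}

func main() {
	// Hypothetical inner query for one changed file.
	inner := "rdeps(//..., set(//service/foo:Foo.java))"
	fmt.Println(testTargetsByKind(inner))
	fmt.Println(testTargetsByTests(inner))
}
```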

Note: Alternatively, the tests query function can be used to filter for test targets

This is just one type of query that build-collector runs. The full list of queries is explained in the Designing a Language Agnostic CI with Bazel Queries presentation. At this point, you might be wondering how we get release targets. We’ll cover that further below when we talk about our custom release rule implementation.

In our CI script, we call build-collector in the following way:

Here’s an example of build-collector’s output file for test targets. The same JSON schema is used for release targets as well.

The CI script parses the JSON files produced by build-collector and runs the targets in parallel across multiple workers.
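To make the parsing step concrete, here is a small sketch in Go. The JSON shape used here (a single "targets" array of labels) is purely an assumption for illustration; build-collector's real schema may differ, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collectorOutput is a HYPOTHETICAL shape for a build-collector output file.
// The real schema is not shown here.
type collectorOutput struct {
	Targets []string `json:"targets"`
}

// parseTargets decodes the file contents and returns the list of labels.
func parseTargets(data []byte) ([]string, error) {
	var out collectorOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, fmt.Errorf("parsing build-collector output: %w", err)
	}
	return out.Targets, nil
}

func main() {
	// Hypothetical file contents with two test targets.
	sample := []byte(`{"targets": ["//service/foo:foo_test", "//service/bar:bar_test"]}`)

	targets, err := parseTargets(sample)
	if err != nil {
		panic(err)
	}
	for _, t := range targets {
		fmt.Println(t)
	}
}
```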

Leverage Bazel’s BUILD file as a contract between CI and developers

We want Pinterest engineers to focus on feature work and not have to learn too much about how our CI is set up. Given that we also want to run the minimal set of targets in CI, we need a way for developers to communicate which targets are only for local development and which parts of their service should be tested and released in CI. This is where the Bazel BUILD file comes in handy, since it is already where developers define the tests and release artifacts for their service. Developers follow a few simple conventions in the BUILD file so that our CI can figure out exactly what to build.

Those conventions are:

  • Use test rule types for running tests (standard Bazel practice)
  • Use Pinterest custom release rules for generating release artifacts
  • Use supported tags like no-ci to indicate what should not run in the pipelines
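One way the no-ci convention could be honored on the query side is to subtract tagged targets with Bazel's attr function and the except operator. This is a sketch under that assumption, with a hypothetical helper name; the post does not say how build-collector actually applies the tag.

```go
package main

import "fmt"

// excludeTagged wraps a query expression so that targets carrying the given
// tag are removed from the result. Hypothetical helper for illustration.
func excludeTagged(expr, tag string) string {
	// Bazel renders the tags attribute as a list such as [a, b], so the
	// pattern matches the tag as a whole list element rather than a substring.
	pattern := `[\[ ]` + tag + `[,\]]`
	return fmt.Sprintf(`%s except attr("tags", "%s", %s)`, expr, pattern, expr)
}

func main() {
	// All test targets in the repository, minus those tagged no-ci.
	fmt.Println(excludeTagged("tests(//...)", "no-ci"))
}
```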

Create release abstractions with custom Bazel rules

At Pinterest, we support a number of different artifact types that are released to various stores. For example, docker images are sent to an internal registry, jars are published to Artifactory, etc. To support these workflows, we implemented custom Bazel rules for the common use cases at Pinterest. The custom rules help us to create an abstraction over the infrastructure. All developers need to do is indicate what they want to publish within the BUILD file using our custom rules.

A common workflow is creating and publishing docker images that are then referenced when deploying to EC2 or Kubernetes. Below is an example of how engineers can use a custom Bazel rule called container_release to get CI to make their release artifacts available for deployment.

In this example BUILD file, this service has created a docker image using the open source container_image rule. Using the custom container_release rule, the service author can publish the docker image to a Pinterest registry as well as specify what deployment artifacts should be made available to our Continuous Delivery platforms (Teletraan for EC2 deployments and Hermez for Kubernetes workloads).

There are a few benefits we get from implementing custom release rules:

  • It abstracts the infrastructure layers for our developers. They don’t have to be aware of where the deployment artifacts end up and how they are consumed by our CD platforms.
  • Developers control what parts of their service are released via version controlled code
  • We can support dev versions of their release artifacts by controlling where the artifacts are released.

That last benefit is made possible with another release abstraction within the custom release rule implementation. Each Pinterest custom release rule is actually a Bazel macro that generates two custom release rules: artifact_ci_release and artifact_dev_release. Our developers don’t see or interact with these rules directly, but they are used by our CI and local development workflows to ensure that they are run in the right context. For instance, below is the Bazel query build-collector runs to obtain release targets for source code changes:
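Given the rule names above, a plausible form of that query filters the reverse dependencies of the changed files down to artifact_ci_release rules. The sketch below assembles such an expression in Go; the function name and labels are hypothetical, and the exact expression build-collector runs is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// releaseTargetsQuery selects the CI release rules affected by the changed
// files. The rule kind comes from the post; the query shape is an assumption.
func releaseTargetsQuery(universe string, changedFiles []string) string {
	return fmt.Sprintf("kind(%q, rdeps(%s, set(%s)))",
		"artifact_ci_release rule", universe, strings.Join(changedFiles, " "))
}

func main() {
	// Hypothetical label for a changed source file.
	files := []string{"//service/foo:Foo.java"}
	fmt.Println(releaseTargetsQuery("//...", files))
}
```

Because the dev variant has a different rule kind (artifact_dev_release), a filter like this would leave it out of CI runs without developers having to tag anything.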

A further optimization we made here was to control, within the artifact_*_release implementation, the order in which release artifacts are published. For instance, the docker images are published to the registry before we publish the YAML files that reference them. Handling the ordering inside the rule made the querying logic in build-collector fast and straightforward.

Parallelize as much as possible

In order to make sure our CI is running as fast as possible, we want to parallelize running targets wherever possible. We currently achieve this with what we call the Dispatcher Model. The Dispatcher Model is pretty simple: we figure out what targets need to run in CI and dispatch the execution of those targets to workers that run in parallel.

This has a significant benefit when running release targets. If a developer only cares about releasing artifacts from a few services they contribute to, they shouldn’t have to wait for all the other services in the CI build to be finished. Running release targets independently and in parallel provides developers with their release artifacts as soon as they are ready.
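The idea can be sketched with a small worker pool in Go. Here goroutines stand in for CI workers, which in a real pipeline would more likely be separate machines or executors; the function name, target labels, and the run callback are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatch hands each target to one of a fixed number of workers and
// collects the per-target result. Each target finishes independently of
// the others, so a fast target never waits on a slow one.
func dispatch(targets []string, workers int, run func(target string) error) map[string]error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]error, len(targets))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range jobs {
				err := run(t)
				mu.Lock()
				results[t] = err
				mu.Unlock()
			}
		}()
	}

	for _, t := range targets {
		jobs <- t
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Hypothetical release targets reported by build-collector.
	targets := []string{"//service/foo:release", "//service/bar:release", "//service/baz:release"}

	results := dispatch(targets, 2, func(target string) error {
		// A real worker would invoke Bazel here; the sketch only logs.
		fmt.Println("running", target)
		return nil
	})
	fmt.Println(len(results), "targets finished")
}
```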

What about Pull Request (PR) builds?

Our PR pipeline kicks off a CI run every time a new pull request is created. We patch the code changes into a temporary commit and pass that commit to build-collector. This allows us to easily reuse the same CI setup for PR builds as well. The only difference is that we don’t create any release artifacts and instead check that the release targets can compile by running bazel build.

The Results

At the beginning of this year, we invested in the above design choices in a repo called Optimus. Optimus is a monorepo that houses 120+ Java services and holds some of our most critical data platforms at Pinterest. Optimus and its CI pipeline serve 300 monthly active contributors.

At the beginning of this year, Optimus didn’t use build-collector or the release rule abstractions. Instead, we ran all the targets within a service whenever its code changed, and we had granular release rules for publishing release artifacts. At that time, the P50 time for the CI build was 52 minutes and the P90 time was 69 minutes. After migrating to our new CI design, we saw the P50 time drop to 19 minutes and the P90 time drop to 49 minutes within a week. That’s 2.7x faster for P50 and 1.4x faster for P90!

Figure: Chart comparing build times the week before and after the CI migration
Figure: One month distribution of build times with the old CI
Figure: One month distribution of build times with the new CI

Bazel is a powerful tool we can leverage to create a Continuous Integration Platform that works for a variety of use cases at Pinterest. Optimizations like build-collector and the custom release rules build the foundational layer of the platform from which we can create more enhancements that will further improve the velocity and health of our CI builds. Some of the things we’d like to look into next with Bazel are: remote execution, profiling, and a system for automatically detecting and excluding flaky tests.
