How LinkedIn scales compatibility testing


Figure: Simplified workflows for the post-merge validation job (left) and the compatibility testing validation job (right)

Developer productivity challenges at scale

When we started employing this testing procedure back in 2014, it worked well for our scale. At the time, LinkedIn had 1,000 multiproducts; the most widely consumed multiproduct was a library that implemented a core Gradle plugin for Java, called gradle-jvm, which had a few hundred consumers. Today, that same library is consumed by 4,500 multiproducts in its latest major version. LinkedIn’s development workforce has seen a tenfold increase in size in the same timeframe.

In fact, a few years ago, we reached a point where a single code submission to gradle-jvm took 14 hours to complete because of the compatibility testing the CI system performed to validate it. Developers working on this plugin had to submit their code and wait until the next business day for feedback from the CI system.

Furthermore, every time compatibility testing failed, library producers had to dig through the logs of each failed consumer (multiproduct), which was time-consuming. Often, a small percentage of consumers failed not because of the code changes under test, but because of their own non-deterministic (flaky) tests, producing a false positive signal that blocked the producers from publishing a new version of their library.

With these challenges in mind, we decided it was time to revisit our compatibility testing implementation and expand our tooling surrounding it to keep up with our rapid growth. In the rest of this blog post, we explain how we enhanced the debuggability, stability, and performance aspects of compatibility testing across LinkedIn.

Debuggability

Compatibility testing failures can be difficult to troubleshoot. The debugging process requires library producers to dig through build and test logs for each failed consumer (multiproduct), which is time-consuming. At this scale, reliability issues in the infrastructure can also cause failures: transient network issues, hardware malfunctions, or a bad rollout of the CI system that introduces a software regression.

Below are some of the features that the team built in the CI system to assist developers with the debugging process:

  • The ability to reliably determine whether a failure was caused by the infrastructure or by a legitimate issue in the validation. For each failure, actionable and fine-grained errors are displayed on the UI with links to the relevant execution logs and helpful resources (e.g., documentation) that help users understand and fix the issue, or triage it to the right team as needed.
  • A comprehensive one-page failure report displaying a holistic view of compatibility testing results. For each consumer, the report shows the pass/fail result, a summary of the failure cause, and other relevant information that helps producers identify issues more quickly, without having to go through individual logs for failed consumers (see the sketch after this list).
  • Infrastructure support to temporarily store the cloned workspace of failed consumers and make them available for developers to debug.
  • Guidelines and tooling to seamlessly reproduce a consumer failure locally so developers can use their favorite IDE to debug.
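
To make the failure report concrete, here is a minimal, hypothetical sketch in Java of the kind of per-consumer data such a one-page report could aggregate. The CompatibilityReport class, ConsumerResult record, and FailureCategory enum are illustrative names only, not LinkedIn’s actual data model.

    import java.util.List;

    // Hypothetical data model for a one-page compatibility testing failure report.
    public final class CompatibilityReport {

        // Whether a consumer failed because of the change under test or because of the infrastructure.
        enum FailureCategory { NONE, INFRASTRUCTURE, VALIDATION }

        // One row of the report: the consumer name, its outcome, and pointers for debugging.
        record ConsumerResult(String consumer,
                              boolean passed,
                              FailureCategory category,
                              String failureSummary,
                              String logUrl) { }

        private final List<ConsumerResult> results;

        public CompatibilityReport(List<ConsumerResult> results) {
            this.results = List.copyOf(results);
        }

        // Failures that point at the code change under test and need the producer's attention.
        public List<ConsumerResult> legitimateFailures() {
            return results.stream()
                    .filter(r -> !r.passed() && r.category() == FailureCategory.VALIDATION)
                    .toList();
        }

        // Failures that should be triaged to the infrastructure/CI team instead.
        public List<ConsumerResult> infrastructureFailures() {
            return results.stream()
                    .filter(r -> !r.passed() && r.category() == FailureCategory.INFRASTRUCTURE)
                    .toList();
        }
    }

Grouping results this way is what lets a single report page separate the failures a producer must fix from the ones to hand off to the CI team.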

Stability

Compatibility testing can produce a false positive signal when dealing with consumers (multiproducts) that fail not because the code change in the library under test is backwards incompatible, but because of their own non-deterministic (flaky) tests. Such failures (even just a handful) block the publication of a new version of the library and put the burden on the producers to debug the failures in the consumer domain before re-submitting their code changes.

Often, while producers are busy debugging, newer versions of consumers will get published. Hence, by the time producers re-submit their code changes, the state of the system has changed, and a new set of flaky consumers might produce a false positive signal in compatibility testing, requiring further debugging. This loop can be endless, like a game of whack-a-mole. We realized we needed a strategy to break this cycle.

That’s why we built infrastructure support that empowers producers to make their code submissions resilient to consumer failures, while still receiving a reliable signal from compatibility testing. Specifically, a library producer can customize the following configurable parameters (a sketch of how they combine follows this list):

  • Failure threshold. An upper bound on the percent of consumers that are allowed to fail without blocking publication of a new version (by default, zero). Libraries with many direct consumers set this to a small non-zero value, like 4% for gradle-jvm (i.e., 180 out of 4,500 consumers).
  • Ignored consumers. A set of known unstable consumers that may be ignored for each code submission irrespective of the Failure threshold (by default, an empty set).
  • Required consumers. A set of key reliable consumers with stable builds and tests that must pass against each code submission irrespective of the Failure threshold (by default, an empty set).
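
As a rough illustration of how these three parameters could combine to gate the publication decision, here is a minimal sketch in Java. The PublicationGate class and its method names are hypothetical; the post does not describe the CI system’s actual implementation, so treat this only as one plausible interpretation of the rules above.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical gate deciding whether compatibility testing results block publication.
    public final class PublicationGate {

        private final double failureThreshold;        // e.g., 0.04 (4%) for gradle-jvm; 0.0 by default
        private final Set<String> ignoredConsumers;    // known unstable consumers; empty by default
        private final Set<String> requiredConsumers;   // key consumers that must always pass; empty by default

        public PublicationGate(double failureThreshold,
                               Set<String> ignoredConsumers,
                               Set<String> requiredConsumers) {
            this.failureThreshold = failureThreshold;
            this.ignoredConsumers = Set.copyOf(ignoredConsumers);
            this.requiredConsumers = Set.copyOf(requiredConsumers);
        }

        // results maps each consumer name to whether its compatibility test passed.
        public boolean allowsPublication(Map<String, Boolean> results) {
            // A failing (or missing) required consumer always blocks, irrespective of the threshold.
            boolean requiredFailed = requiredConsumers.stream()
                    .anyMatch(consumer -> !results.getOrDefault(consumer, false));
            if (requiredFailed) {
                return false;
            }

            // Ignored consumers are excluded before the failure rate is computed.
            long considered = results.keySet().stream()
                    .filter(consumer -> !ignoredConsumers.contains(consumer))
                    .count();
            long failed = results.entrySet().stream()
                    .filter(entry -> !ignoredConsumers.contains(entry.getKey()) && !entry.getValue())
                    .count();

            // Publish only if the failure rate stays within the configured threshold.
            return considered == 0 || (double) failed / considered <= failureThreshold;
        }
    }

Under this reading, gradle-jvm’s 4% threshold tolerates up to 180 of its 4,500 consumers failing, as long as none of them is a required consumer.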

In certain situations when tactical code changes in a library are needed to fix an urgent issue, the library producers may take an intelligent risk and use an override to bypass compatibility testing. Such code submissions will be marked with an override stamp and audited.

Performance

We profiled and instrumented our CI system’s post-merge validation implementation and identified three main performance bottlenecks. Next, we discuss these bottlenecks and our approaches for addressing them, which reduced the compatibility testing execution time for gradle-jvm from 14 hours to 2 hours.

Job scheduler
Our CI system uses our internal job orchestration service, Orca, to run jobs. Specifically, our post-merge validation is implemented as a job, which in turn runs the compatibility testing jobs. Given our bounded machine resources, we limit the number of concurrent compatibility testing runs per post-merge to 300.

Our original implementation triggered jobs in batches, waiting for all jobs in a batch to complete before starting the next one. However, we observed that a handful of long-running jobs in each batch would often bottleneck the scheduler, preventing it from kicking off more jobs.

We enhanced the scheduler to trigger a new job as soon as a running job completes, so that at any given point in time, we maximize our resource utilization. This new scheduling approach reduced the compatibility testing execution time for gradle-jvm from 14 hours to 7 hours. The difference between the two scheduling approaches is illustrated below.
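
To sketch the difference in code (assuming, purely for illustration, that each compatibility testing job can be modeled as a Runnable; Orca’s actual API is not shown here), the original approach waits for the slowest job in each batch before submitting more work, while the enhanced approach backfills a slot the moment any job finishes:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Illustrative comparison of the two scheduling strategies (not Orca's actual API).
    public final class CompatibilitySchedulers {

        private static final int CONCURRENCY_LIMIT = 300;

        // Original approach: run up to 300 jobs, wait for the slowest one, then start the next batch.
        // A few long-running jobs leave most of the slots idle until the whole batch drains.
        public static void runInBatches(List<Runnable> jobs) throws InterruptedException {
            for (int i = 0; i < jobs.size(); i += CONCURRENCY_LIMIT) {
                List<Runnable> batch = jobs.subList(i, Math.min(i + CONCURRENCY_LIMIT, jobs.size()));
                ExecutorService pool = Executors.newFixedThreadPool(batch.size());
                batch.forEach(pool::submit);
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.DAYS); // block until the entire batch completes
            }
        }

        // Enhanced approach: a fixed pool of 300 workers pulls the next job as soon as any
        // running job completes, keeping every slot busy until the queue is empty.
        public static void runWithBackfill(List<Runnable> jobs) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(CONCURRENCY_LIMIT);
            jobs.forEach(pool::submit);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }

In the batched version, a single slow consumer keeps the remaining slots idle until its batch drains; in the backfilling version, the 300 slots stay saturated until the job queue is empty.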


