Productivity at Scale: How We Improved Build Time by 400% at LinkedIn


Co-authors: Szczepan Faber, Yiming Wang, Mihir Gandhi, Deep Majumder, and Irina Issayeva

Introduction

At LinkedIn, we have 500+ microservices (3,500 endpoints), which make up various aspects of the software ecosystem. This translates to a significant number of web applications. In 2012, we added the Play Framework to our tech stack to build microservices and web applications. Since then, over 330 Play apps have been created, and hundreds of engineers work on them daily. One of these Play apps is a service that supports the entire API traffic from the LinkedIn website and the mobile clients, handling over 130,000 requests per second.

While the operational metrics for these Play apps were great, we noticed that developer happiness and developer productivity had come under strain. Having to wait one hour to set up a development environment seemed like an egregious waste of time. Refreshing the IDE (IntelliJ IDEA) for one of our largest Play apps used to take quite a while. Given that many engineers actively work on the application every day, that wasted time translated to hundreds of hours of non-productive build time. Slow IDE refresh was just one of the symptoms of a larger problem. Local development was also a pain due to slow “hot reload,” which took minutes instead of seconds.

Additionally, LinkedIn engineering aims for “3×3” deployment, i.e., releasing 3 times per day with no more than 3 hours between committing code and seeing it in production. With slow build times and the consequent loss of developer productivity, achieving 3×3 became difficult.

One source of the productivity loss was the build system. Play Framework uses SBT as its default build system. SBT had served us well for a couple of years, but the growing size of our applications and LinkedIn’s scale started to push the build system to its limits. So, we started to look for alternatives and eventually decided to move to Gradle.

A year later, as of this post, our largest Play app takes less than 5 minutes for IDE refresh, and build times are down from 60 minutes to 15 minutes. Here’s the story of how we got to this point.

All was not well

For a long time, we used vanilla Play, which comes bundled with SBT. As the size and scale of our Play apps grew, we started to encounter a few hard-to-ignore problems that made us look for alternatives.

  • Scalability: On the larger Play apps, build times were as high as 45 minutes to 1 hour. Dependency refresh and consequently IDE setup took an equally long time. The build system was a clear bottleneck and would not scale with our increasing code base.

  • Productivity: On some of our larger applications, setting up and updating the development environment took close to an hour. Since a large number of engineers working on these apps need to update their development environment about once per day, this added up to a large number of lost hours every day and a serious drain on productivity.

  • Maintainability: LinkedIn is primarily a Java shop, with more than 10K applications and services written in Java. However, build scripts for SBT are written in Scala. SBT syntax is very sophisticated and, arguably, dense. The multitude of overloaded operators can easily get overwhelming and is a readability nightmare. Maintenance was bound to be a challenge in the long run.

Early struggles

As of late 2014, LinkedIn had already been using Gradle extensively to build Java, Android, C/C++, and iOS applications. It was only natural that Gradle was considered as an option to replace SBT. So, LinkedIn sponsored Gradle to support Play applications. In an engineering blog post from 2015, we talked about LinkedIn’s plan to gradually phase out SBT and use Gradle as the primary tool for building Play applications.

In 2015, we made our first attempt to replace SBT with Gradle, but failed to do so. The amount of effort needed was daunting and prevented us from gaining momentum. At the time, nobody else outside of LinkedIn seriously considered using Gradle for Play. The community indicated serious gaps in a forum ticket titled “Why we are not using the Gradle Play plugin.” We couldn’t learn from others, so we had to pave the way!

SBT has always been deeply integrated with Play. The framework’s documentation, best practices, plugin system, and hot reload workflows were tied to SBT. Moreover, teams at LinkedIn developed significant build customizations on top of SBT. For all of these reasons, we were not making progress on the “Play on Gradle” project, and the pressure was building. Play codebases at LinkedIn were growing, which further exacerbated the productivity issues.

In late 2016, we restarted the project with a different execution strategy. Instead of working off the long list of Play on Gradle gaps, we started hacking and slashing to take the first app to production. Operating in a “start-up” mode was a game changer. We quickly identified the gaps that were of the highest priority, as well as the problems that we needed to take to Gradle, Inc. (the company behind Gradle) or to Lightbend (the company behind Play Framework). Having working applications (built with Gradle) in production gave us validation and data points to get more buy-in for the project. Boosting productivity by 4x was a great incentive for teams to chip in to the common effort. In early 2017, we had the first set of apps in production, and our team grew. We moved from a “start-up” mode into execution proper, getting more teams migrated every month, improving Gradle plugins as we pushed forward, and cleaning up the tech debt accumulated in the early phase of the project.

A formidable journey

The move to Gradle was a complex one spanning several quarters. The transition had to be smooth and relatively painless for the hundreds of engineers who would be impacted by it. The project spanned two years, growing from a couple of engineers into an effort that now includes engineers from multiple teams. As of the writing of this post, we are almost at the finish line. Most of our important Play applications have been migrated to Gradle, and the last few are on track to complete migration within the next few months.

The transition was an arduous one, owing to many factors:

  • Active development: The repositories that were being migrated from SBT to Gradle-based build systems were under active development. The migration had to be transparent to the engineers working on them. We had to ensure Gradle and SBT could co-exist in the same application while the migration was in progress, so that SBT could continue to be used while we added support for Gradle builds. The work was akin to changing the engine of a moving car!

  • Application variety: We have numerous kinds of applications built on the Play framework, such as UI frontends, mid-tiers, and API frontends, among others. Each of these has its own complexities and idiosyncrasies.

  • Existing infrastructure: Since LinkedIn had been using Gradle for a long time (for other, non-Play applications), several Gradle plugins already existed that needed to be accounted for during the transition. Fixes, patches, and rewrites were needed to reuse these plugins with Play applications, but maintaining backward compatibility was essential too. In addition, LinkedIn has its own layer of repository management and internal frameworks, which adds another layer of complexity.

  • Variations in build scripts: There wasn’t a “one-size-fits-all” solution for migrating build scripts. Every application had its own customizations, so the migration required a cross-organization collaboration to be executed effectively. LinkedIn was one of the first adopters of the Gradle Play plugin, and therefore, as with any new product, we encountered bugs that needed fixing. When possible, we made bug fixes in-house and raised pull requests to merge them back with Gradle open source so that they could benefit the larger Gradle community.

Overall, this was a complex, cross-organizational project that involved picking the right proof-of-concept repositories, working with the owners of those repositories to understand their build structure, improving the Gradle plugin for Play, and working with individual teams to migrate their custom build scripts.

Worth it!

We have seen some amazing results post-migration. Some of these include:

  • Productivity boost: Faster dependency resolution in Gradle helped improve productivity across the board. These productivity gains help us bring delightful features and fixes faster to our members. Additionally, IDE refresh times improved by 4x-6x on some of our applications. The developer happiness that resulted from this was echoed multiple times during our internal surveys.

  • Faster builds: Default support for incremental builds means that engineers can often skip full builds during local development. Builds became 1.5x to 4x faster (150% to 400%), depending on the size of the application code. We’ve seen over 1,000 hours of productivity gains per quarter across various stages of the build lifecycle.

  • Maintainability: Gradle is compatible with JVM languages like Groovy, Java, and Kotlin. We develop most of our plugins with Java and integrate them seamlessly with the build scripts. This is ideal for long-term maintainability, as Java is the predominant language at LinkedIn.

  • Scalability: We ensured that every Gradle task used in LinkedIn’s Play-based applications would support the necessary scale by making tasks incremental. Techniques such as using a pathing JAR to handle large classpaths set us up to scale gracefully with increasing code size. Gradle also gives us an excellent platform to further scale the build systems with build caching.
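
For readers unfamiliar with the pathing JAR technique mentioned in the scalability point, here is a minimal sketch in Java. It is an illustration under our own assumptions, not the code used in LinkedIn's or Gradle's plugins; the class name `PathingJar` and the method `create` are hypothetical. The idea is that instead of passing a very long classpath on the command line, where it can run into operating system limits on command length, you write a small JAR that contains only a manifest. The manifest's `Class-Path` attribute lists the real entries, and that single JAR goes on the classpath.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
public class PathingJar {

    /**
     * Writes a JAR that contains nothing but a manifest whose Class-Path
     * attribute lists the given entries as space-separated file URLs.
     */
    public static Path create(Path jarFile, List<Path> classpath) throws IOException {
        String classPathValue = classpath.stream()
                // toUri() percent-encodes spaces and appends '/' to existing
                // directories, which is how directories are told apart from JARs.
                // Assumption: absolute file: URLs are acceptable for the target JVM.
                .map(entry -> entry.toAbsolutePath().toUri().toString())
                .collect(Collectors.joining(" "));

        Manifest manifest = new Manifest();
        Attributes attributes = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the attributes are not written.
        attributes.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attributes.put(Attributes.Name.CLASS_PATH, classPathValue);

        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            // Intentionally empty: the manifest is the only content needed.
        }
        return jarFile;
    }

    // Usage: java PathingJar <output.jar> <classpath entry>...
    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: PathingJar <output.jar> <classpath entry>...");
            return;
        }
        List<Path> entries = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            entries.add(Paths.get(args[i]));
        }
        Path jar = create(Paths.get(args[0]), entries);
        System.out.println("Wrote pathing JAR: " + jar.toAbsolutePath());
    }
}
```

A process would then be launched with `-cp` pointing at the generated JAR rather than at the full list of entries. The trade-off is one more generated file to keep in sync with the real classpath, so a build would typically regenerate it whenever the dependency set changes.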


