Build in the Cloud: Distributing Build Outputs

This is the fourth post in a four-part series describing how we use the cloud to scale building and testing of software at Google. The series elaborates on a presentation given during the Pre-GTAC 2010 event in Hyderabad. To get a sense of the scale and the types of problems we are solving in Engineering Tools at Google, take a look at this post.

In the previous post, we saw how distributing build load across many machines and reusing the results of previous build actions can produce extremely fast build times. This distribution and reuse itself exposed other performance bottlenecks. Specifically, a full build of a large project may produce several gigabytes of output files, all of which need to be shipped from the cloud back to the developer’s local machine. This taxes both our networks and the developer’s local disk, whose limited bandwidth and high latency slow down the build.

Furthermore, a developer typically does not need access to all of the outputs of a build. She may be interested only in the final binary that results from a build, not all of the intermediate object files. Another example is when the developer uses the distributed build system to build and execute a test in the cloud; in this case, she does not need local access to any build outputs at all, but simply an indication that the test passed or failed. It is thus unnecessary and wasteful to always ship all build outputs from the cloud back to the developer’s local machine. Instead, it is sufficient to ship back only those build outputs that are actually accessed by the developer on demand.

To address these bottlenecks, our distributed build system writes all build outputs that it generates to a persistent, distributed storage system in the cloud. This storage system can handle sustained read and write rates many times those of a local disk, and the network connectivity between the distributed build system and this storage system has significantly lower latency and higher bandwidth than the connection to the developer’s local machine. The build outputs are stored by content fingerprint, providing a straightforward mechanism for indexing and retrieval of the files. (This cloud storage system also serves as a secondary caching layer for the distributed build system, providing an additional performance benefit in instances when build outputs fall out of the primary caching layer.) Using this distributed storage system in the cloud, we are able to make builds more than twice as fast as distributed builds that store all build outputs on the developer’s local disk.
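The essence of content-fingerprint storage is that a file's cryptographic hash serves as its storage key, so identical outputs are stored once and retrieval is a simple lookup. A minimal in-memory sketch (the class name and dict-backed store are illustrative, not Google's actual system, which uses SHA-style fingerprints over a distributed store):

```python
import hashlib

class ContentAddressedStore:
    """Toy in-memory stand-in for a content-addressed cloud store."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The fingerprint of the content *is* the storage key, so
        # identical build outputs are deduplicated automatically.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = ContentAddressedStore()
key = store.put(b"compiled object file contents")
assert store.get(key) == b"compiled object file contents"
```

Because the key is derived from the content, any consumer holding a fingerprint can verify the bytes it receives by re-hashing them, which also gives the integrity checks mentioned later for free.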

With cloud-based storage of build outputs providing substantial reductions in build times, the remaining challenge is to still make build outputs available to the developer’s local machine, on demand, once the build completes. We achieve this by writing a custom file system to provide a view of the build outputs. As we described in our first blog post, the File System in Userspace (FUSE) kernel module is a convenient way to implement such a file system as a user-space daemon. As individual build steps complete, the distributed build system informs the daemon about the creation of new build outputs, providing both a relative file path and a content fingerprint. Then, when the developer accesses one of the files on these paths, the daemon downloads the file from the cloud, using the content fingerprint as the key, and serves it via FUSE to the developer (caching downloaded files on the developer’s machine to speed up subsequent accesses). The content fingerprint itself is also made available to the build system as an extended attribute via the FUSE layer, which, as we indicated in our second blog post, is necessary for the build system to perform incremental builds.
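The daemon's core bookkeeping can be sketched independently of FUSE plumbing: a map from output paths to fingerprints that is populated as build actions finish, with file bytes fetched lazily on first read and cached locally. All names here are hypothetical; the real daemon exposes this behavior through FUSE file operations and extended attributes rather than Python methods:

```python
class OutputFileSystem:
    """Sketch of the lazy-download logic behind the FUSE daemon."""

    def __init__(self, cloud_store):
        self._cloud = cloud_store   # fingerprint -> bytes (the cloud store)
        self._paths = {}            # relative output path -> fingerprint
        self._local_cache = {}      # fingerprint -> bytes, cached locally

    def notify_output(self, path, fingerprint):
        # Called by the build system as each action completes.
        # Only metadata arrives here; no file bytes are transferred yet.
        self._paths[path] = fingerprint

    def get_fingerprint(self, path):
        # Served as an extended attribute, so incremental builds can
        # compare outputs without downloading their contents.
        return self._paths[path]

    def read(self, path):
        fp = self._paths[path]
        if fp not in self._local_cache:
            # First access: fetch from the cloud by fingerprint.
            self._local_cache[fp] = self._cloud[fp]
        return self._local_cache[fp]

cloud = {"fp-1234": b"final binary bytes"}
fs = OutputFileSystem(cloud)
fs.notify_output("bin/my_app", "fp-1234")
assert fs.get_fingerprint("bin/my_app") == "fp-1234"
assert fs.read("bin/my_app") == b"final binary bytes"
```

The key property is that `notify_output` is cheap metadata-only traffic; bytes cross the network only when `read` is actually called, which is exactly the on-demand behavior the post describes.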

In addition to serving build outputs from the cloud and build output metadata from the local machine, the local daemon performs many additional tasks important for the functionality of the system. For example, build outputs are kept in the cloud only as long as they are referenced by at least one developer’s local machine. The daemon periodically extends leases for all build outputs present in any build on a developer’s workstation. As another example, the performance of the virtual file system would be dismal if large build outputs were fetched sequentially once they were requested by the developer; instead, the daemon downloads chunks of the file in parallel from multiple storage servers in the cloud. The daemon is also responsible for local data integrity checks, local disk management and data eviction, and many other maintenance tasks.
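The parallel-fetch idea above can be illustrated with a short sketch: split a file into fixed-size chunks, fetch each chunk concurrently (standing in for requests to multiple storage servers), and reassemble them in order. Chunk size, worker count, and function names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # tiny for illustration; real chunks would be much larger

def fetch_chunk(blob, offset):
    # Stand-in for a ranged read against one of several storage servers.
    return blob[offset:offset + CHUNK_SIZE]

def parallel_download(blob, workers=4):
    offsets = range(0, len(blob), CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so chunks reassemble correctly
        # even though they are fetched concurrently.
        chunks = pool.map(lambda off: fetch_chunk(blob, off), offsets)
    return b"".join(chunks)

data = b"a large build output split into chunks"
assert parallel_download(data) == data
```

With real network fetches, issuing the ranged reads in parallel keeps multiple connections busy at once, so a large output downloads at roughly the aggregate bandwidth of the storage servers rather than that of a single sequential stream.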

The picture below summarizes how Google’s build system works as a whole, taking into account the four blog posts in this series:

The developer invokes the build system client on her workstation, specifying one or more targets to build. The client then coordinates and dispatches individual build actions to execute in the cloud, reading the necessary source code metadata from the FUSE daemon to provide as input to the cloud execution system. The build outputs of these actions (such as binaries) are then stored in our cloud storage system and downloaded on demand by the user through the FUSE daemon. The end result is a build system that can deliver build results to developers in seconds or minutes, as opposed to hours.

This concludes our four-part series of blog posts describing how we use the cloud to scale building at Google. In this series, we covered cloud-based source control, how our build system works, and how we distribute both build actions and build outputs to the cloud. We encourage readers to comment on these posts, and to comment on any additional topics they would like to see covered in this blog.

– Milos Besta, Yevgeniy Miretskiy and Jeff Cox