Basic Infrastructure Patterns – Zenefits Engineering

In his first email to Zenefits employees as our new CEO, David Sacks emphasized three core values, and value number one was “Operate with integrity.” As the platform counted on by tens of thousands of small businesses across America, Zenefits has an obligation and a commitment to reliability and integrity at every level – from customer service and compliance to the technical infrastructure supporting our product. In today’s post, infrastructure engineer David Karapetyan explores his approach to creating a reliable, sustainable, and scalable product infrastructure.

Here are some basic patterns I’ve noticed come up over and over again while working on build, CI, and deployment-related things.


Pipelines

The pipeline is the bedrock on which pretty much everything else is built. A properly designed pipeline takes well-defined inputs and either produces well-defined outputs or halts and indicates exactly what failed. The defining characteristic of the pipeline is its compositional nature. If you’ve ever played with functional languages then this shouldn’t be surprising, since a pipeline is just another incarnation of a function (in the mathematical sense).

You don’t need anything fancy to build a useful pipeline. Most of the time I get away with pretty vanilla bash and for the times when bash isn’t enough I use Ruby and rake. You can pick your favorite language and make-like tool. What follows are some other basic patterns that can be used to enhance the functionality of the basic pipeline.

Caching (with Hashing)

If you know what a hashmap is, then this pattern is the hashmap carried over to the file system for managing well-defined collections of files, like node modules, and it comes in very handy for optimizing build/test pipelines. If your build pipeline is indeed a pipeline, i.e. a composition of deterministic steps, then using this pattern is like using memoization at the filesystem level.

Modern web applications have a lot of dependencies, both for the front-end and the back-end. Fortunately, modern software development best practices encourage being very explicit about those dependencies. Ruby has bundler and Gemfile/Gemfile.lock, Node has npm and package.json/npm-shrinkwrap.json, Python has pip and requirements.txt, Elixir has mix and mix.exs, etc. The defining characteristic of these things is that they unambiguously describe what the application depends on. So if somewhere in your build pipeline you take files a, b, c and produce an output file or directory d, then that’s basically a function application f(a, b, c) = d, and we can memoize that step of the pipeline by taking the output and storing it under a file named by hash(a, b, c). Next time we reach the same step we can just hash those files and re-use the output if the hash matches. I’ll show a concrete example shortly.

You’re probably thinking: that’s great, but what is the point of caching something that is a one-time operation anyway? Well, when you’re building things in Jenkins or some other CI environment you might not have the luxury of incremental development, or the price of doing things incrementally might be more trouble than it’s worth: there is all sorts of state that can stick around and muck up your pipeline, and keeping things around between runs might not even be possible (e.g. when using spot instances in AWS for worker pools). I have also yet to see cache management that works properly and doesn’t unnecessarily fill up the disk by clogging up /tmp or ${HOME}. In an ideal world this would not be a problem, but in the real world it often is, and the easiest thing to do is side-step the cache management provided by the package manager.

My go-to trick in these cases is to localize the installation directory for the dependencies (if the package manager doesn’t do it already), hash the file that describes the dependencies (Gemfile.lock, npm-shrinkwrap.json, etc.), and then generate a tar.xz file, named by that hash, that contains all the installed dependencies. This is the abstract pattern from a paragraph or so back. There is just one input file we’re concerned with here, and for concreteness’ sake it is npm-shrinkwrap.json; the function we are applying is npm install, which gives us node_modules. So we take node_modules, make a tar, and store it under the hash. This doesn’t necessarily get rid of the transfer overhead, but it does save the computational overhead of compiling modules and checking each module one by one to see if it is in the cache, installing it if it is, or downloading and installing it if not. In my experience unpacking a single tar file to restore node_modules always ends up being faster than re-installing everything, even when things are cached locally.

Once the tar.xz file is generated I can stow it away locally or somewhere in S3 and keep re-using it as long as the hash matches. This means that even when I blow away the local environment I don’t have to go over the network to re-download my dependencies (well, I do if I put things in S3, but then I only have to go to one place instead of 300) and can simply unpack a tar.xz file to get things back in working order. I might also need to twiddle things by calling build scripts, because some packages put various files somewhere other than the project directory when they’re installed, but this is usually easy enough to fix.

Here is some pseudo-bash to demonstrate the point:
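(A sketch rather than our exact script: the cache directory and the install command are parameters you would fill in for your own pipeline, and sha256sum and xz are assumed to be available.)

```shell
# cache_deps: restore a directory from a content-addressed tarball,
# or build it and stow the tarball away for next time.
#   $1 = manifest file (e.g. npm-shrinkwrap.json)
#   $2 = directory the install step produces (e.g. node_modules)
#   $3 = cache directory
#   $4 = command to run on a cache miss (e.g. "npm install")
cache_deps() {
  local manifest="$1" out_dir="$2" cache_dir="$3" install_cmd="$4"
  local hash tarball
  # Any change to the manifest produces a new cache key.
  hash="$(sha256sum "$manifest" | cut -d' ' -f1)"
  tarball="${cache_dir}/${out_dir}-${hash}.tar.xz"
  if [ -f "$tarball" ]; then
    tar -xJf "$tarball"             # cache hit: unpack and we're done
  else
    eval "$install_cmd"             # cache miss: do the real work
    mkdir -p "$cache_dir"
    tar -cJf "$tarball" "$out_dir"  # stow the result under the hash
  fi
}

# Typical use in a build step:
# cache_deps npm-shrinkwrap.json node_modules /var/cache/deps "npm install"
```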

We currently use almost this exact pattern for node modules, and build failures caused by npm are now a long-forgotten memory. Other things still fail, but at least it’s not npm. The same principles can be applied to other package management systems, though unlike in the npm case you might need to dig through some documentation to figure out where exactly modules are stored when they are installed.

Retries and Fallbacks

So you have a pipeline with well-defined inputs and outputs at each step, and you cache various artifacts so that there is no unnecessary work, but this still doesn’t mean you have a foolproof pipeline. Again, doing the obvious thing will get you 90% of the way there, and the other 10% should be rare enough that human intervention is not a high cost to pay.

Suppose somewhere in your pipeline you are fetching something from S3, e.g. a tar.xz that contains a bunch of node modules. Since this requires the network, and the network can and very likely will fail, you want to make it a bit more robust. The simplest thing to do is just re-try a few times.
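A minimal sketch of the retry wrapper (the S3 command and bucket in the comment are illustrative, not ours):

```shell
# fetch_with_retries: run a command up to N times before giving up.
#   $1 = number of attempts, remaining args = the command to run
fetch_with_retries() {
  local attempts="$1"; shift
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0                     # success: stop retrying
    fi
    echo "attempt ${i}/${attempts} failed, retrying..." >&2
    sleep 1                        # a real script might back off exponentially
  done
  return 1                         # out of attempts: let a human look at it
}

# e.g.
# fetch_with_retries 3 aws s3 cp "s3://build-cache/node_modules-${hash}.tar.xz" .
```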

The above piece of code will work, but there are a few more things we can do to make things even more reliable.

If this is a persistent environment that you control, or at least have enough control over to make sure some folders and files stick around for some period of time, then instead of going to S3 every time you should first check locally and re-use whatever already exists (assuming, of course, that when you download something from S3 you stow it away locally).
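Sketched out, with the actual download command left as a parameter (the S3 invocation in the comment is just an example):

```shell
# fetch_cached: look in the local cache first; only run the (network)
# fetch command on a local miss, stowing the result for next time.
#   $1 = file name, $2 = local cache dir,
#   remaining args = command that downloads the file into the cache dir
fetch_cached() {
  local name="$1" cache_dir="$2"; shift 2
  if [ -f "${cache_dir}/${name}" ]; then
    echo "local cache hit: ${name}" >&2  # no network needed
  else
    mkdir -p "$cache_dir"
    "$@"                                 # e.g. an aws s3 cp wrapped in retries
  fi
  tar -xJf "${cache_dir}/${name}"        # either way, unpack from the cache
}

# e.g. (bucket name made up):
# fetch_cached "node_modules-${hash}.tar.xz" /var/cache/deps \
#   aws s3 cp "s3://build-cache/node_modules-${hash}.tar.xz" /var/cache/deps/
```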

This is better because we don’t have to worry about the network if we already have things locally, but there is one more thing we can do in case we can’t find things locally or get them from S3: fall back to doing the work from scratch.
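Putting the three tiers together might look like this (the bucket name and default cache location are made up for illustration; swap in your own):

```shell
# restore_deps: local cache, then S3, then a clean install from the
# package manager as a last resort.
#   $1 = hash of the dependency manifest
#   $2 = local cache dir (optional)
restore_deps() {
  local hash="$1"
  local cache_dir="${2:-/var/cache/deps}"
  local tarball="node_modules-${hash}.tar.xz"
  mkdir -p "$cache_dir"
  if [ -f "${cache_dir}/${tarball}" ]; then
    tar -xJf "${cache_dir}/${tarball}"                # best case: local hit
  elif aws s3 cp "s3://build-cache/${tarball}" "${cache_dir}/" 2>/dev/null; then
    tar -xJf "${cache_dir}/${tarball}"                # S3 hit, now cached locally too
  else
    npm install                                       # last resort: full install
    tar -cJf "${cache_dir}/${tarball}" node_modules   # seed the caches for next time
    aws s3 cp "${cache_dir}/${tarball}" "s3://build-cache/${tarball}" 2>/dev/null || true
  fi
}
```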

This is pretty good now. We have retries and fallback mechanisms in case things fail and we avoid unnecessary work as much as possible. Win all around.

Pipeline in a Loop

In control theory there are notions of stable and unstable feedback loops. When designing control loops for servers, the loops must always be stable and fail as safely as possible (which basically means leaving the system as is and alerting someone about what happened). Auto-scaling groups in AWS are a basic example of such control mechanisms, although I don’t know what they offer in terms of keeping the loops stable.

I’m currently working on such a loop for controlling a pool of Jenkins workers. The workload can vary throughout the day, and there is no reason to keep 100 workers around when only 10 are required to clear the work queue. Since we are not trying to be very fancy we can get away with a very basic setup. Some pseudo-Ruby to demonstrate:
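(A sketch under assumptions: the queue and scaler objects, the capacity constant, and the interval are all illustrative stand-ins; the real thing would talk to Jenkins and the AWS APIs.)

```ruby
# A minimal control loop for a pool of CI workers. The queue object is
# assumed to answer queue_depth, and the scaler to answer worker_count
# and set_worker_count; the numbers are made up for illustration.
class WorkerPool
  MIN_WORKERS = 10
  MAX_WORKERS = 100
  JOBS_PER_WORKER = 5 # rough capacity of a single worker

  def initialize(queue, scaler)
    @queue = queue
    @scaler = scaler
  end

  # One iteration: observe, decide, act. Clamping the result is what
  # keeps the loop from running away in either direction.
  def step
    desired = (@queue.queue_depth / JOBS_PER_WORKER.to_f).ceil
    desired = desired.clamp(MIN_WORKERS, MAX_WORKERS)
    @scaler.set_worker_count(desired) if desired != @scaler.worker_count
    desired
  end

  def run(interval: 60)
    loop do
      begin
        step
      rescue => e
        # Fail safe: leave the pool as is and tell a human.
        warn "control loop error: #{e.message}"
      end
      sleep interval
    end
  end
end
```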

Hopefully you get the gist of the idea. Generalizing a little bit, here are the steps for the basic state-synchronizing pipeline in a loop:

1. Observe the current state of the system (e.g. the depth of the work queue and the number of live workers).
2. Compute the desired state from what you observed.
3. Apply whatever changes take the system from the current state to the desired state, failing safely (leave things as is and alert someone) if anything goes wrong.
4. Sleep for a bit and repeat.

– David Karapetyan
Thanks to Leaf Pell for reviewing drafts of this.
