In an earlier post, we described how Zenefits uses Duplo as a deployment engine and PaaS for hosting microservices. In this blog post we present the CI/CD challenges that arose and how Duplo’s Katkit component was used to address them.
Since adopting microservices, we have had a few hundred docker containers linked to several AWS services, deployed and managed in a self-service way by application teams with almost zero administrative overhead. An important missing piece was an automated CI/CD platform to validate, test, and deploy code across multiple environments: from developer sandbox, to staging, and finally to production. In its absence, we were using two AWS accounts (dev and prod), each with a Duplo deployment, where developers could deploy their service and manually run tests from their laptops. Docker images were built manually and pushed to a registry. There was no tracking of which tests were run and no formal integration environment between services; things were coordinated manually between teams.
Existing CI/CD For a Monolith
One advantage of a monolithic architecture is that it is easy to test. There is no concept of integration because everything is deployed as a single unit from a single repository. We use industry SaaS solutions for CI like CircleCI and Buildkite. Builds happen on a fleet of containers where users can execute arbitrary code. We bring up the monolith and run a set of concurrent tests that includes everything from unit tests to full end-to-end tests. All pull requests go through this process before merging into the master branch.
The test infrastructure does not necessarily capture the application’s interactions with the real production infrastructure i.e. AWS, databases, etc. In a monolith these interactions are limited so the setup is optimized for fast execution of the test suite rather than for parity between CI and CD. To cover the gaps, after the tests pass and before deploying to production we deploy to a staging environment in our AWS account and run another suite of tests. Finally, the monolithic application is rolled out to production as a debian package with a set of fabric scripts. Overall the scheme works.
CI/CD Requirements for Microservices
The above approach doesn’t quite work for microservices: there are more moving pieces, and hence more requirements for the CI/CD platform. Many of these requirements apply to a monolithic architecture as well:
1. Parity Between CI and Production Environment
In the monolith case, it is tolerable to have a CI (continuous integration) environment that differs from the CD (continuous deployment) environment because the hosting infrastructure is simpler: there are no multiple entities talking over the network, nothing is deployed independently, platform interactions are limited, and configuration tends to be more static. But with microservices the deployment topology is more complex. There are security groups, routing policies, ELBs, IAM policies and several other factors across multiple services that need to come together beyond code correctness.
Thus we need a CI/CD environment that is built on top of our real hosting environment i.e. Duplo and AWS.
2. Continuous and Independent Deployment
Microservices should be developed and deployed independently. This requires an environment where services in development can test against stable copies of their peers. In addition, since much of Zenefits’ code is a monolith, the services need to interact with the monolith’s stable version and vice-versa.
3. User Defined Workflows
The above two requirements mean that CI is no longer a simple pipeline. It consists of multiple steps. For example, first run unit tests, then build a docker image, and finally push artifacts to an artifact store. Next, take the resulting artifact(s), deploy them to an environment, and run a further set of commands to produce more artifacts or validate existing ones. Then repeat all of the previous steps for different environments, with slight variations for each. For example, in a dev environment we may not run any integration tests, while in a production environment we may not run any tests at all. These workflows also differ from service to service. Thus we need a system where workflows can be customized per service and per environment.
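To make this concrete, per-service and per-environment workflows can be declared as data. The sketch below is illustrative only: the step names echo the conventions we use, but the schema and `steps_for` helper are invented for this example, not Katkit's actual format.

```python
# Illustrative only: per-environment workflow definitions for one service.
# The schema here is hypothetical, not Katkit's actual configuration format.
WORKFLOWS = {
    "dev": [
        {"name": "PRE_DEPLOY_BUILD", "params": {"UT": True}},
        {"name": "DEPLOY", "params": {}},
        # dev skips integration tests entirely
    ],
    "staging": [
        {"name": "PRE_DEPLOY_BUILD", "params": {"UT": True}},
        {"name": "DEPLOY", "params": {}},
        {"name": "POST_DEPLOY_VERIFICATION", "params": {"INT_TEST": True}},
    ],
    "production": [
        {"name": "DEPLOY", "params": {}},  # production runs no tests at all
    ],
}

def steps_for(env):
    """Return the ordered step names for a given environment."""
    return [step["name"] for step in WORKFLOWS[env]]
```

The point of keeping workflows as data is that each team can vary the steps per environment without changing the platform itself.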
4. Long Running Environments
Zenefits is a SaaS business: every change we make to our software applies to a brownfield environment, i.e. it is deployed into a pre-existing environment with pre-existing customer data. Thus, to validate our changes, we need to deploy and test them in an environment with long-lived customer data. Such an environment typically hosts internal or alpha-feature customers to whom we provide a lower SLA. For compliance reasons, it is imperative that it be hosted on our infrastructure and subject to rigorous security review.
5. Administrative Policies
We need a system where we can set administrative policies for code review to block bad merges and environment upgrades. We need to keep an audit trail of test and deployment activity, so that when mistakes happen, we can determine the root cause and take corrective actions.
Build as a Short-lived Microservice
There are several CI solutions in the industry like CircleCI, Buildkite and Jenkins. One of the core pieces of the implementation is fleet management i.e. scheduling of workers on hosts. Each CI run should be isolated from the next, except for shared caches. Docker and LXC containers are lightweight solutions where each build is implemented as a set of containers that are cleaned up after the build is over. We realized that we already had a fleet manager in Duplo.
CI/CD Stack for Microservices
At a high level a CI stack consists of the following layers:
- Fleet Management to schedule and manage build agents
- Controller which has these sub-components:
a) Queuing System that manages build requests. Invokes the fleet manager to add, monitor and terminate builds.
b) Source repository integration, e.g. GitHub. The controller receives notifications about new changes and reports results back as PR comments.
- User interface and authorization framework.
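In outline, the controller layer ties the queuing system to the fleet manager. The sketch below is a deliberate simplification with invented names (`launch`, `status`, `terminate` on a fleet object), not an actual implementation:

```python
from collections import deque

class Controller:
    """Toy sketch of a CI controller: queues build requests and drives a
    fleet manager. All method and field names here are illustrative."""

    def __init__(self, fleet):
        self.queue = deque()
        self.fleet = fleet  # exposes launch(req) -> id, status(id), terminate(id)

    def enqueue(self, repo, sha):
        self.queue.append({"repo": repo, "sha": sha})

    def run_once(self):
        """Pop one request, run it to completion, and return the final status."""
        req = self.queue.popleft()
        build_id = self.fleet.launch(req)
        while self.fleet.status(build_id) == "RUNNING":
            pass  # a real controller would poll with backoff and a timeout
        final = self.fleet.status(build_id)
        self.fleet.terminate(build_id)  # build agents are short-lived
        return final
```

The essential idea is that the controller never runs builds itself; it only schedules them onto the fleet, watches their status, and cleans them up.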
The CD stack needs to provide a deployment pipeline where users can define how code is rolled out across multiple environments. Additionally, at Zenefits we have a layer that enforces code-commit policies such as code reviews. Throughout this post, “environment” means development, staging (or beta), and production.
Figure 1 shows our CI/CD stack. It’s a layered system, built out bottom up:
1. Duplo: the platform for hosting services. It is code agnostic: it can deploy a number of docker images and AWS artifacts for a given tenant and is ignorant of the code running inside the containers.
Duplo doubles as a build fleet management system: each build is itself a microservice whose name and replica count are generated per build, and whose container environment carries the git repo, SHA, and build parameters. As these builds are for microservices, the build itself generates and pushes docker images, hence the build containers require access to docker. We don’t do Docker-in-Docker; instead we mount the host docker socket into the build container (/var/run/docker.sock:/var/run/docker.sock), which enables us to reuse the docker cache. Persistent host volumes are also mounted so artifacts can be cached across builds.
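As a rough illustration of what such a build-container spec looks like, the function below assembles one with the docker socket and a cache volume mounted. The field names are hypothetical, chosen for this sketch; they are not Duplo's actual API.

```python
def build_container_spec(repo, sha, params, cache_dir="/var/cache/builds"):
    """Construct a throwaway build-container spec (hypothetical field names,
    not Duplo's actual API). The host docker socket is bind-mounted so the
    build can build and push images without Docker-in-Docker, and a
    persistent host volume is mounted to cache artifacts across builds."""
    env = {"GIT_REPO": repo, "GIT_SHA": sha}
    env.update(params)
    return {
        "name": f"build-{sha[:8]}",
        "replicas": 1,
        "environment": env,
        "volumes": {
            "/var/run/docker.sock": "/var/run/docker.sock",  # reuse host docker cache
            cache_dir: "/cache",  # artifact cache shared across builds
        },
    }

spec = build_container_spec(
    "github.com/acme/svc", "deadbeefcafe", {"PHASE": "PRE_DEPLOY_BUILD"}
)
```

Since each build gets a fresh container but shares the host's docker daemon and cache volume, builds stay isolated while image layers and artifacts are reused.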
2. Katkit: implements the “controller”, user interface and authorization functions described above. This layer in the stack allows the user to convert code (git repository) to its microservice form deployed in Duplo.
Katkit has a generic workflow engine. A workflow is an execution of an ordered series of arbitrary code modules. Each code module is called a workflow step or phase. It is left to the user what function to implement in each module. Developers define their CI/CD workflow. A common one used by our developers is:
```
[
  {
    "Name": "PRE_DEPLOY_BUILD",         /* run unit tests, build and push docker image */
    "BuildParams": "PHASE=PRE_DEPLOY_BUILD, UT=True"
  },
  {
    "Name": "DEPLOY",                   /* deploy the code as a microservice in Duplo */
    "BuildParams": "PHASE=DEPLOY, IMAGE_ONLY=true"
  },
  {
    "Name": "POST_DEPLOY_VERIFICATION", /* run post-deployment tests */
    "BuildParams": "PHASE=POST_DEPLOY_VERIFICATION, ENV=STAGE, INT_TEST=True, UT=False"
  }
]
```
If a workflow fails, developers can retry specific steps and skip others.
Each step is a short-lived microservice. Different steps for the same sha have different parameters, which are used to decide which code paths to execute. Katkit launches each step's microservice by calling Duplo, and sets an environment variable called “LOG_PATH” pointing at an S3 path. The build container is expected to write the build’s current and final status to that path. Once the file indicates completion, or the build times out, Katkit deletes the microservice. Artifacts are stored in the same path as well.
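The LOG_PATH contract between Katkit and a build container can be sketched as two small functions. This is an illustration only: a local file stands in for the S3 path, and the status format shown here is invented for the example.

```python
import json
import os
import tempfile

def write_status(log_path, state, artifacts=None):
    """Build-container side: report current/final status at LOG_PATH.
    (Illustrative format; a local file stands in for the real S3 path.)"""
    with open(log_path, "w") as f:
        json.dump({"state": state, "artifacts": artifacts or []}, f)

def poll_status(log_path):
    """Katkit side: read the status file; a missing file means the build
    container hasn't reported yet."""
    if not os.path.exists(log_path):
        return "PENDING"
    with open(log_path) as f:
        return json.load(f)["state"]

# A build reporting success along with the artifact it produced:
log_path = os.path.join(tempfile.mkdtemp(), "status.json")
write_status(log_path, "PASSED", artifacts=["image:sha-deadbeef"])
```

Because the container only ever writes to a path it was handed, the build code needs no knowledge of Katkit itself; the file is the entire interface.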
3. Sauron: Sauron sits above Katkit and links multiple environments of a service into a single CI/CD pipeline. When a developer raises a Github PR, Sauron shepherds the code through each environment by triggering the Katkit tenant for each environment. For example, a PR is first deployed and tested in a sandbox environment, then merged into the master branch. Next, the code change is deployed and tested in a Beta environment. Finally, the change is rolled out to production. Developers specify the environments and approval workflows for their service. In case of failed tests, the developer would log in to Katkit and Duplo for further investigation. We’ll cover Sauron in more detail in a future blog post.
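Conceptually, Sauron's promotion logic reduces to walking an ordered list of environments and stopping at the first failure. The sketch below is deliberately simplified and uses invented names; the real system also handles approvals, merges, and retries.

```python
# Per-service, developer-defined pipeline order (illustrative names).
ENVIRONMENTS = ["sandbox", "beta", "production"]

def promote(sha, run_in_env):
    """Shepherd a change through each environment in order.
    run_in_env(env, sha) deploys and tests the sha in that environment,
    returning True on success. Stop at the first failing environment."""
    for env in ENVIRONMENTS:
        if not run_in_env(env, sha):
            return {"sha": sha, "reached": env, "status": "FAILED"}
    return {"sha": sha, "reached": ENVIRONMENTS[-1], "status": "ROLLED_OUT"}
```

Recording how far a change got (`reached`) is what lets a developer know which environment to log in to for investigation when tests fail.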
Duplo focuses extensively on multi-tenancy: each microservice is its own tenant, managed by the respective development team at Zenefits. Katkit retains the same model, so a user with access to a tenant in Duplo has access to that same tenant in Katkit.
– Venkat Thiruvengadam