CI Automation at Zenefits – Zenefits Engineering


At Zenefits, it’s extremely important for us to uphold a high bar for code quality. Bugs and errors are costly – faulty decisions made by the application directly impact people’s lives, such as their healthcare or payroll. To maintain that high bar, merges into the master branch must meet strict quality criteria.

Initially, engineers assumed this responsibility and controlled the merges themselves. This allowed code to move quickly through the pipeline. However, as the engineering organization grew, we needed automated systems that enforce merge criteria to maintain a consistently high quality bar.

Hence, the infrastructure team at Zenefits built Sauron, a service that helps manage the development workflow and software lifecycle. Sauron oversees all code that flows through the pipeline from commit to CI to deploy.

Desired Workflow

Initially, Zenefits followed an ad-hoc workflow, which we sought to refine and automate with Sauron. The development lifecycle is separated into two parts. The pre-merge workflow is everything that happens to an isolated code review before it is approved into the master branch. The post-merge workflow is everything that happens after merge, ending with the feature in production. For Zenefits’ primary code base, the desired workflow looked like this:

System Design

Zenefits uses GitHub pull requests as the unit of review for staging code changes. Sauron evaluates events on pull requests as they are updated and triggers automated actions when necessary. Because a pull request can be updated to point to any code revision, it is not suitable as the primary identifier for tracking state in the system. Instead, we use git commit hashes (SHAs), since they are immutable and tied to a specific code revision.

We evaluated two approaches for building Sauron: poll-based workers and an event-driven system. The polling approach is simpler, but has its drawbacks: polling the state of every PR is expensive and can be very slow, and between polling intervals, events which would have been captured in the event stream may be lost. The event-driven model is more efficient, but introduces complexity, e.g. how do we manage race conditions or dropped events?

The final design is event-driven and uses distributed queues to buffer events and mitigate outages in the system. In the case of a widespread outage, we resort to poll-based backfills to re-synchronize the system. Below is a high-level diagram of what the system looks like today:

[Figure: Sauron Workflows architecture overview]

The overall system is composed of three main parts:

GitHub Bot: Automating Code Reviews

Sauron integrates with GitHub by receiving events via webhooks, performing transformations for downstream consumption, and enqueuing them to SQS (Simple Queue Service). For example, when an assigned reviewer comments to approve a PR, Sauron forwards an ‘LGTM’ event to the workflow engine.
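As a rough illustration of the transformation step, the sketch below turns a simplified review payload into a compact event keyed by SHA and puts it on a queue. The payload fields are modeled loosely on GitHub's pull request review webhook. The output event format, the rule for what counts as an 'LGTM', and the in-memory queue standing in for SQS are all assumptions made for this example, not Sauron's actual schema.

```python
import json
from collections import deque
from typing import Optional

# In-memory stand-in for the SQS queue (an assumption for this sketch).
event_queue: deque = deque()


def transform_review_event(payload: dict) -> Optional[dict]:
    """Map a simplified review payload to a workflow event, or None to skip it."""
    review = payload.get("review", {})
    if payload.get("action") != "submitted":
        return None
    if review.get("state", "").lower() != "approved":
        return None
    return {
        "type": "LGTM",
        "repo": payload["repository"]["full_name"],
        "pr": payload["pull_request"]["number"],
        # State is tracked per SHA, not per pull request.
        "sha": payload["pull_request"]["head"]["sha"],
        "actor": review["user"]["login"],
    }


def handle_webhook(body: str) -> bool:
    """Parse a webhook body and enqueue the derived event, if there is one."""
    event = transform_review_event(json.loads(body))
    if event is None:
        return False
    event_queue.append(json.dumps(event))
    return True
```

Keeping the transformation a pure function of the payload makes it easy to replay stored webhooks during a backfill.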

Ideally, code reviews should be assigned to the engineers who have the most context on the changes. Each directory should have an owner who is responsible for the overarching module, with each subdirectory assigned to somebody with a deeper understanding of that code. Having a system that maintains levels of ownership incentivizes owners to uphold high standards in the codebase they’re responsible for.

We implemented a basic rule engine in Sauron to facilitate this code review process. Whenever a pull request is created or updated, we scan the diff and evaluate it against the configured rules. If the diff matches any rule, the pull request is routed to the corresponding reviewers. Rules can also go further, for example by ensuring that a specific in-house feature is used properly or by flagging large Django migrations for further review by a DBA. This lets us detect issues early and reduce the overhead of maintaining code quality.
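A minimal sketch of such a rule engine, assuming rules are glob patterns over changed file paths with an optional size threshold. The `Rule` shape, the paths, the reviewer names, and the 200-line threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Rule:
    pattern: str                # glob matched against each changed file path
    reviewers: Tuple[str, ...]  # who to request when the rule matches
    min_lines: int = 0          # only match when at least this many lines changed


# Hypothetical rules: nested ownership plus a check for large migrations.
RULES = [
    Rule("payroll/*", ("payroll-owners",)),
    Rule("payroll/tax/*", ("tax-team",)),
    Rule("*/migrations/*.py", ("dba",), min_lines=200),
]


def reviewers_for(diff: Dict[str, int], rules: Iterable[Rule] = RULES) -> Set[str]:
    """Return reviewers to request; diff maps file path -> changed line count."""
    matched: Set[str] = set()
    for path, changed_lines in diff.items():
        for rule in rules:
            # fnmatchcase's "*" also matches "/", so a directory rule
            # covers everything beneath that directory.
            if fnmatchcase(path, rule.pattern) and changed_lines >= rule.min_lines:
                matched.update(rule.reviewers)
    return matched
```

Because every matching rule contributes reviewers, a change under `payroll/tax/` in this sketch is routed to both the module owner and the subdirectory owner.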

CI/CD Workflow Engine

When building Sauron, we wanted the flexibility to modify workflows on demand without having to deploy code changes. Additionally, because Zenefits has been steadily moving towards microservices via Duplo, Sauron needed to support a diverse set of merge and deploy pipelines.

Therefore, we modeled the workflow engine as a finite-state machine (FSM). Each repository describes its own FSM using a YAML-based configuration file to define the states, inputs, and transition functions. For a state S, Sauron waits for status checks defined in the state’s dictionary object to propagate through the system to determine whether it can transition to the next state. The transition function identifies which state we transition to given the current state S and the set of status check inputs ($\delta_{\text{success/failure}} : S \times \text{status\_checks} \to S_{\text{success/failure}}$). Once we receive successful status events from the set of status checks, we transition to the next state. Note that the state machine can also transition backwards or perform a no-op. When a CI build or deploy fails, Sauron comments on the pull request and resets to a previous state so the developer can diagnose the problem before moving forward again.
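To make the transition function concrete, here is a small sketch in Python. The state names, check names, and the dictionary layout are hypothetical. They stand in for whatever a repository's configuration declares and are not Sauron's real format.

```python
from typing import Dict

# Hypothetical workflow: each state lists the checks it waits for and where
# to go when they all succeed or when any of them fails.
WORKFLOW = {
    "code_review": {
        "checks": {"lgtm", "ci"},
        "on_success": "ready_to_merge",
        "on_failure": "code_review",
    },
    "ready_to_merge": {
        "checks": {"label"},
        "on_success": "merged",
        "on_failure": "code_review",  # a backwards transition
    },
    "merged": {"checks": set(), "on_success": "merged", "on_failure": "merged"},
}


def transition(workflow: dict, state: str, statuses: Dict[str, str]) -> str:
    """Return the next state for the current state and reported check statuses.

    statuses maps a check name to "success", "failure", or "pending".
    Checks that have not reported yet are treated as pending.
    """
    spec = workflow[state]
    required = spec["checks"]
    if not required:
        return state  # terminal state, nothing to wait for
    results = [statuses.get(check, "pending") for check in required]
    if any(result == "failure" for result in results):
        return spec["on_failure"]
    if all(result == "success" for result in results):
        return spec["on_success"]
    return state  # no-op while some checks are still pending
```

In this sketch a failure takes priority over pending checks, so a failed build moves the SHA back immediately instead of waiting for the remaining checks.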

Here is a rough sketch of what a state machine YAML file looks like:

The above example is a simple case; Sauron verifies that the pull request has been reviewed, CI tested, and labeled before merging. If necessary, we can easily add intermediate states to the configuration file. For example, if a repository wants to deploy and test in a staging environment prior to merge:

The workflow engine is responsible for driving the configured workflow, as specified in YAML. When Sauron polls an event from SQS, it first queries for the repository configuration and status of the given SHA from DynamoDB. If the transition is allowed, the SHA transitions to the next state by updating the state field in the DB. Moreover, if the next state specifies an activity to be performed as a side effect, like triggering a build or deploying a SHA, the activity is stored in the activities DB table to be executed.
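The sketch below walks through that loop for a single event. Plain dictionaries and a list stand in for the DynamoDB tables, and the state names, event types, and activity names are hypothetical. It is meant to show the order of operations, not the real storage schema.

```python
from typing import Optional

# In-memory stand-ins for the DynamoDB tables (assumptions for this sketch).
# Each state names the event that advances it and an optional activity to
# run as a side effect when a SHA enters that state.
REPO_CONFIG = {
    "org/repo": {
        "review": {"advance_on": "LGTM", "next": "staging", "activity": None},
        "staging": {
            "advance_on": "STAGING_TESTS_PASSED",
            "next": "merge",
            "activity": "deploy_staging",
        },
        "merge": {"advance_on": None, "next": None, "activity": "merge_pr"},
    }
}
SHA_STATE: dict = {}   # (repo, sha) -> current state name
ACTIVITIES: list = []  # pending side effects, executed by a separate component


def handle_event(event: dict, initial_state: str = "review") -> Optional[str]:
    """Apply one queued event; return the new state, or None if nothing changed."""
    key = (event["repo"], event["sha"])
    config = REPO_CONFIG[event["repo"]]
    current = SHA_STATE.get(key, initial_state)
    spec = config[current]
    if spec["advance_on"] is None or spec["advance_on"] != event["type"]:
        return None  # transition not allowed from the current state
    next_state = spec["next"]
    SHA_STATE[key] = next_state
    activity = config[next_state]["activity"]
    if activity:
        # Record the side effect instead of running it inline.
        ACTIVITIES.append(
            {"repo": event["repo"], "sha": event["sha"],
             "name": activity, "status": "pending"}
        )
    return next_state
```

Because an event that does not match the current state is ignored, a duplicate delivery of the same event in this sketch neither moves the SHA twice nor records the activity twice.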

By automating the side effects of a state, we can streamline pull requests from development to deploy with little to no manual intervention. Manual intervention is only necessary when a part of the pipeline fails.

Activity Supervisor

Sauron integrates with a variety of services with differing levels of availability, communication models (push vs. pull), and failure modes (dropped requests, timeouts, etc.). The Activity Supervisor isolates these failures from the core workflow state management, while tolerating a wide range of failures in downstream systems.

The system persists the state of every activity and ensures that activities are never dropped, even when there is an outage. Activities in the database are queried and executed by the activity supervisor. In the above YAML, after a SHA transitions from the Code Review state to the Deploy state, as a side effect, Sauron automatically deploys the SHA to the associated staging environment in Duplo.
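One way to picture this is a loop over persisted activity records, where a failure updates the record instead of losing the work. The sketch below assumes a list of dictionaries in place of the activities table and a simple attempt limit. The record fields, handler registry, and retry policy are hypothetical.

```python
from typing import Callable, Dict, List


# Hypothetical handler; a real one would call a deploy or CI service.
def deploy_staging(sha: str) -> None:
    print(f"deploying {sha} to staging")


HANDLERS: Dict[str, Callable[[str], None]] = {"deploy_staging": deploy_staging}


def run_pending(
    activities: List[dict],
    handlers: Dict[str, Callable[[str], None]] = HANDLERS,
    max_attempts: int = 3,
) -> None:
    """Try each pending activity once; failed ones stay pending for a later pass."""
    for activity in activities:
        if activity["status"] != "pending":
            continue
        activity["attempts"] = activity.get("attempts", 0) + 1
        try:
            handlers[activity["name"]](activity["sha"])
            activity["status"] = "done"
        except Exception as exc:
            # A downstream failure must not stop the loop or drop the record.
            activity["last_error"] = str(exc)
            if activity["attempts"] >= max_attempts:
                activity["status"] = "failed"  # left for manual intervention
```

Calling `run_pending` periodically retries whatever is still pending, so an outage in one downstream service delays its activities without affecting the others.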

Sauron currently maintains support for the following internal services:

Duplo/Katkit – Deployment and CI platform for microservices

Gondor – CI service generating test plans and reports in the monolith

Buildkite – CI workflow engine for orchestrating builds

Frodo – Deployment service for staging and production environments in the monolith

Because the Activity Supervisor is independent of the services it manages, we can onboard new services as they are built. Integration is as simple as implementing a new activity.

All Together

The system comes together and achieves most of the goals we set out in the initial proposal. We also built a lightweight Python API and web app to expose Sauron’s state to engineers and make configuring workflows a bit friendlier. Here are a few screen captures of Sauron in action:

What’s Next – Sauron Hub

We’ve had great success with managing workflows through Sauron. All code merges in our monolith are driven by Sauron, which has played an important role in minimizing production incidents. Additionally, microservices now have access to a self-service CI/CD platform.

As our team continues to grow, there are other areas in engineering services where we think Sauron-like automation could be a good fit. As such, we are leveraging the platform that we implemented for Sauron workflows to support new scenarios:

How do we track the ever-expanding technical universe at Zenefits? The Ownership Service scrapes our various systems and allows manual input to track who owns API endpoints, EmberJS routes, Django applications, products, tests, etc.

Is my code deployed? Do I need to do any code reviews? Rather than forcing developers to periodically poll various services to see the state of their change, we built Zenotify, a service that configures push notifications for all the events in the development lifecycle, and curates an event stream for engineers.

What changes are contained in this deployment? Which PMs will care? Sauron Graph will track the many related items in the engineering process: “JIRA-456 was addressed by PR 4402, which was authored by Jane and reviewed by Mike. The code was deployed to production by Bidhan in deployment X214.” By tracking relationships in one place, we can quickly find all the relevant information without clicking into 20 different services, and enrich other tools by providing this information through an API.

How can we improve the CI process? What is the bottleneck? With all this data flowing through Sauron, we are in a unique position to surface insights about developer productivity. Our Analytics Reports will help us answer those questions.

I need to access X because of Y – who can approve this exception request? Zenefits follows strict processes to protect customer data and ensure system stability. As time goes on, we are accumulating a number of approval processes for various scenarios: approving a hotfix, or granting temporary access to customer data or to compute resources for an investigation task. Approval Workflows will consolidate the creation and management of these processes.

We are calling this new constellation of features ‘Sauron Hub’. Much like Zenefits itself is the operating system for small business, we hope Sauron will become the hub of engineering work at Zenefits. This is the architecture we are moving towards:

[Figure: Sauron Hub architecture]

Conclusion

As you can see, we are taking serious steps to improve code quality through well-defined workflows. This is a direct reflection of our three leadership principles: operating with integrity, putting the customer first, and making Zenefits a great place to work. We’ve learned that by automating engineering processes we reduce human error and encourage developers to do the right thing.

We look forward to applying these lessons to new scenarios. We think that Sauron Hub will enable us to grow our engineering team, manage the chaos of software development, and reduce pain points for developers.

– Jonathan Lafleche & Brian Bao


