AWS Orchestration, Container Management, and PAAS for Microservices – Zenefits Engineering


In this blog post, we describe the Zenefits case study in adopting Duplo software as the platform for hosting their microservices in AWS. Zenefits is one of Duplo’s biggest customers.

Microservices is a software architecture where a complex application is implemented as a set of smaller independent processes or “services”. The services talk to each other through an HTTP API (typically REST) or through data stores like S3 or DynamoDB combined with a notification mechanism. Recent innovations in provisioning and delivery (e.g., Docker) have made microservices a convenient way to manage complexity.

Zenefits started out as a monolithic MVC application that ran into the usual challenges of development and deployment inefficiencies. Work is underway to refactor this application into multiple smaller services. Additionally, new services are being added as microservices. Zenefits is hosted on AWS. Our services use many AWS services, primarily EC2, RDS, S3, DynamoDB, SQS, ELB, Lambda and SWF. Companies at a scale similar to Zenefits often implement the microservices pattern, with hundreds of services coming together to form a single customer-facing application.

PROBLEM STATEMENT

1. Self-service and programmable infrastructure

AWS does a fantastic job of providing individual features but does not provide higher-level abstractions built on top of them. This forces application developers to worry about unnecessary implementation details and re-invent the same basic patterns for deploying and managing their applications or, worse yet, to offload those details onto an operations or infrastructure team. A better approach is to provide an intent-based model.

Let’s take two examples to describe this problem:

Lack of Self Service: An application team wants to host a service that needs an RDS MySQL backend. Without a self-service platform, the developer has no choice but to cut an infrastructure request ticket. The infrastructure engineer then grabs this ticket and creates the EC2 and RDS instances by providing the values for security groups, VPC, parameter group, username, password, etc. The infrastructure engineer most likely uses an automation tool like Terraform or CloudFormation. Once the artifacts are created, the engineer updates the ticket with the RDS username, password and endpoint. The application developer then transcribes the information into a configuration file somewhere, builds the application image, and opens another ticket for deploying the application. If the service has to be torn down, yet another ticket is created that in all likelihood does not reference the old ticket with all the required information, leading to some back and forth between the application developer and the infrastructure engineer. In this process, it is entirely possible that some resources are not properly de-provisioned.

Lack of Declarative Service Model: The same application above exposes a website on port 80, needs a DNS name, and is expected to serve the information over HTTPS. This requires the infrastructure engineer to create an ELB, set appropriate security groups, point to the instances where the service is deployed, go to Route 53 for a DNS name, go to GoDaddy to get a certificate for HTTPS termination, add the certificate to IAM and get its ARN, and finally add the ARN to the ELB. It is almost ironic that each one of these services (minus GoDaddy) is an AWS service, yet the admin still has to perform each of these individual steps. An automation tool like Terraform, Chef or Puppet can automate the process somewhat, “given the configuration”, but that configuration has to be generated per deployment. Elastic Beanstalk goes a step further but still falls short. There are a lot of small pieces and too many places to make mistakes in this entire process. If more services share AWS resources like subnets, security groups, etc., then the burden of composing a unified configuration is on the human.

Let us note the difference between an automated and a programmable infrastructure. In an automated infrastructure, tools like Terraform, Chef and Puppet are used to apply an arbitrary set of configurations to AWS resources. The configuration itself is manually generated and is usually unique for each deployment because general-purpose configurations are even harder to properly write, test, and validate. Additionally, the user invoking them needs to have cross-cutting access privileges to each of the AWS services involved. This breaks self-service.
We need a programmable infrastructure where users deploy their services by describing the intent in terms of the application, without having to worry about the underlying infrastructure details. The platform itself would “generate or compose” the entire configuration by combining the application needs with static lower-level infrastructure policies set by the infrastructure administrator. Such an infrastructure can be invoked by application teams without any knowledge of, or access to, the underlying AWS resources.
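
The “generate or compose” step can be illustrated with a minimal Python sketch. This is not Duplo’s actual code: the names InfraPolicy, HostIntent and compose_host_config, the field names, and the choice of the first subnet are all assumptions made for illustration. It only shows the idea that the tenant states intent in application terms while the platform fills in the static values the administrator configured once.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InfraPolicy:
    """Static, administrator-owned values (hypothetical shape)."""
    vpc_id: str
    subnet_ids: Tuple[str, ...]
    admin_security_groups: Tuple[str, ...]


@dataclass(frozen=True)
class HostIntent:
    """What a tenant asks for, in application terms only (hypothetical shape)."""
    tenant: str
    os: str
    cpu: int
    memory_gb: int


def compose_host_config(intent: HostIntent, policy: InfraPolicy) -> dict:
    """Merge tenant intent with the static policy into one low-level config."""
    return {
        # Values supplied by the tenant.
        "os": intent.os,
        "cpu": intent.cpu,
        "memory_gb": intent.memory_gb,
        # Values injected by the platform; the tenant never sees or sets these.
        "vpc_id": policy.vpc_id,
        "subnet_id": policy.subnet_ids[0],  # assumption: simply pick the first subnet
        "security_groups": list(policy.admin_security_groups),
        "tags": {"Tenant": intent.tenant},
    }


policy = InfraPolicy("vpc-1", ("subnet-a", "subnet-b"), ("sg-admin",))
config = compose_host_config(HostIntent("payroll", "linux", 2, 8), policy)
```

Because the policy object is the only place where VPC and subnet identifiers live, the tenant-facing call needs no AWS-level privileges of its own in this model.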

2. Docker Container Management

For container management there is ECS, but it is rather primitive for the following reasons:

  • It does not allow running two containers with the same service port on the same host.
  • Rolling upgrades cannot be configured with custom application health probe URLs that do not go through the ELB, even though a service could be internal only.
  • There is no runtime for containers, and a discovery service has to be built out-of-band. Configuration updates to containers cannot be done without restarting the container.
  • There is no control over scheduling, for example to place containers on spot instances.
  • We configure and orchestrate monitoring and logging tools like SignalFx and Sumo Logic by controlling the Docker container names and mapping them to self-service accounts. We tie the lifecycle of the containers of these services to the host’s lifecycle. ECS does not have the proper API hooks for this.
  • No flat container networking across hosts.

Given all of the above, instead of working around ECS with custom schedulers, networking, runtime injection, ECS host lifecycle management, etc., it became easier to implement a native Docker container management solution in Duplo with all of this built in. It allows us to easily add features as we evolve and gives us more control over this most fundamental part of our service architecture.

3. Platform as a Service: Container Runtime, Rolling Upgrades and Secrets storage

We need applications to interact with the platform to discover secret keys, neighbors and other such information instead of hard-coding the information in their respective configuration files. During rolling upgrades the platform should interact with custom application URLs to perform health checks. We need update domains so that a batch of containers can be updated in one go to increase deployment speed.
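
A rolling upgrade with update domains can be sketched as follows. This is an illustrative Python sketch, not Duplo’s implementation: the function names, the round-robin assignment of containers to domains, and the stop-on-first-failure behaviour are assumptions. The is_healthy callback stands in for a probe against a custom application URL.

```python
from typing import Callable, List


def split_into_update_domains(containers: List[str], domain_count: int) -> List[List[str]]:
    """Assign containers round-robin to at most `domain_count` update domains."""
    if domain_count < 1:
        raise ValueError("domain_count must be at least 1")
    domains: List[List[str]] = [[] for _ in range(domain_count)]
    for index, container in enumerate(containers):
        domains[index % domain_count].append(container)
    return [domain for domain in domains if domain]  # drop empty domains


def rolling_upgrade(
    containers: List[str],
    domain_count: int,
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade one update domain at a time, in one batch per domain.

    After each batch, every container in it is health-checked. The upgrade
    stops with RuntimeError at the first unhealthy domain, so later domains
    keep running the old version.
    """
    upgraded: List[str] = []
    for domain in split_into_update_domains(containers, domain_count):
        for container in domain:
            upgrade(container)
        unhealthy = [c for c in domain if not is_healthy(c)]
        if unhealthy:
            raise RuntimeError(f"health check failed for: {unhealthy}")
        upgraded.extend(domain)
    return upgraded
```

Fewer domains mean larger batches and a faster deployment, at the cost of more containers being exposed to a bad release before the health check can stop it.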

4. Orchestrate Log Aggregation, Monitoring and Billing tools

Zenefits uses Sumo Logic, SignalFx, New Relic and Cloudability for this purpose. In the absence of a common platform, each application team is responsible for deploying and configuring these tools themselves. Services carry access keys for these tools and call their APIs to inject application diagnostics data. The monitoring of the tenant’s infrastructure resources, like EC2 instances, ELBs, etc., is done separately by the infrastructure administrator. These two sets of metrics don’t come together.
Application teams typically have no idea what their respective usage cost is. Even the administrator only has the full picture, and breaking down the cost is a long, tedious manual process because resources are not tagged appropriately when the configuration is human generated.

DUPLO SOLUTION

Duplo addresses the above mentioned problems. It provides the self service platform that implements:

  • Programmable infrastructure with a declarative interface by orchestrating AWS resources.
  • Native docker container management
  • Platform-as-a-service
  • Orchestration and implicit provisioning of log aggregation, monitoring and billing tools on a per-tenant basis

Each application team has an account in Duplo and is called a tenant. Tenants deploy, monitor and debug their services without any administrator intervention. PAAS is an optional feature, i.e. applications do not have to call any of the PAAS APIs if they don’t need them. The demo gives a short overview of Duplo:

 

IMPLEMENTATION DETAIL

Duplo is influenced by Microsoft Azure’s Service Model and PAAS-based approach. Windows has been one of the largest software platforms in the world, with millions of applications built on it. The learnings from Windows are reflected in Azure’s approach to cloud services. I was an early engineer in Azure’s compute and networking team. So, Duplo is essentially an attempt to build an Azure-like service model and PAAS on top of AWS IAAS and get the best of both worlds.

Self-service and Programmable Infrastructure

Base AWS infrastructure is instantiated with Terraform. Terraform manages the creation of the VPC, subnets, NAT gateways, parameter groups and some administrative security groups that are shared by all Duplo teams, a.k.a. tenants. The IDs of the base AWS infrastructure resources created via Terraform are then configured in Duplo as a static infrastructure policy that does not change.
The following describes how the examples from the problem statement are handled in Duplo in a self-service way with a declarative application interface.
When a tenant account is created, the Duplo platform “implicitly” creates a security group, a secret key and an IAM profile tied to the tenant’s name. Tenants first create hosts in Duplo by specifying the OS, CPU, memory and region; Duplo injects the values for subnets, VPC, administrative security groups and other lower-level infrastructure details. The instances are implicitly placed in the tenant’s security group and IAM profile created above, and they are tagged with the tenant’s name. Duplo implicitly deploys Sumo Logic and SignalFx containers on each host added, to capture logs and metrics. Tenants can also create RDS, S3, ElastiCache and other resources through Duplo, and these are also implicitly tied to the tenant’s security group and IAM role.
The tenant then deploys the service via a declarative service model that looks like: {"Name": "Website", "DockerImage": "nginx:latest", "Replicas": "3", "ExternalLBPort": "80", "DNS": "Foobar.zenefits.com"}. Duplo creates these containers on the hosts, allocates a unique port on the EC2 host for each container, and maps it to the container port 80. It then maps the ELB to the hosts on the allocated ports and sets the security group on the ELB. Subnet and VPC values are picked up implicitly from the static config set during infrastructure provisioning. A wildcard certificate, picked from the static config, is used for SSL termination at the ELB. Finally, Duplo calls the Route 53 API to program the DNS name.
The platform manages the lifecycle of all resources and transparently performs garbage collection on unused resources.
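
The host port allocation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Duplo’s scheduler: the function names, the 10000 to 19999 host port range, the round-robin placement across hosts, and the container_port parameter are all hypothetical. What it shows is why several containers listening on the same service port can share one host: each gets its own host port, and the ELB is pointed at those host ports.

```python
import json
from typing import Dict, List, Set


def allocate_host_port(used: Set[int], start: int = 10000, end: int = 20000) -> int:
    """Return the lowest free port in [start, end) and mark it as used."""
    for port in range(start, end):
        if port not in used:
            used.add(port)
            return port
    raise RuntimeError("no free host port in range")


def place_service(service: dict, hosts: Dict[str, Set[int]], container_port: int = 80) -> List[dict]:
    """Spread replicas round-robin over hosts, one unique host port per container.

    `hosts` maps a host name to the set of host ports already in use on it.
    """
    names = sorted(hosts)
    if not names:
        raise ValueError("tenant has no hosts")
    placements = []
    for replica in range(int(service["Replicas"])):
        host = names[replica % len(names)]
        placements.append({
            "service": service["Name"],
            "host": host,
            "host_port": allocate_host_port(hosts[host]),
            "container_port": container_port,
        })
    return placements


# The service model from the text, parsed as JSON.
service = json.loads(
    '{"Name": "Website", "DockerImage": "nginx:latest", "Replicas": "3",'
    ' "ExternalLBPort": "80", "DNS": "Foobar.zenefits.com"}'
)
hosts = {"host-a": set(), "host-b": set()}
placements = place_service(service, hosts)
```

With two hosts and three replicas, host-a ends up running two containers that both listen on container port 80 but are reachable on different host ports.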

Orchestrate Log aggregation, Monitoring and Billing Tools

Every resource created through the Duplo platform is tagged with a prefix, which is the tenant name. We point SignalFx to our AWS account and have a template dashboard configuration with a set of commonly useful charts like CPU, network, host count, etc., filtered by <TenantName>-*. For every tenant, Duplo implicitly creates a unique dashboard by calling the SignalFx API and replacing the value of <TenantName> in the template file. Thus, when tenants log in to their account, they get a SignalFx dashboard by default.
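
The template substitution step can be sketched in Python. The template contents, the DASHBOARD_TEMPLATE and render_dashboard names, and the tenant-name validation rule are assumptions for illustration; the real dashboard format is whatever the SignalFx API expects, and the API call itself is deliberately left out.

```python
import json
import re

# Hypothetical template; only the <TenantName> placeholder comes from the text.
DASHBOARD_TEMPLATE = """
{
  "name": "<TenantName> overview",
  "charts": [
    {"title": "CPU", "filter": "<TenantName>-*"},
    {"title": "Network", "filter": "<TenantName>-*"},
    {"title": "Host count", "filter": "<TenantName>-*"}
  ]
}
"""


def render_dashboard(template: str, tenant: str) -> dict:
    """Replace every <TenantName> placeholder and parse the result as JSON."""
    # Assumption: tenant names are alphanumeric plus hyphens, so substituting
    # them into the template cannot break the JSON syntax.
    if not re.fullmatch(r"[A-Za-z0-9-]+", tenant):
        raise ValueError(f"invalid tenant name: {tenant!r}")
    return json.loads(template.replace("<TenantName>", tenant))


dashboard = render_dashboard(DASHBOARD_TEMPLATE, "payroll")
```

The resulting dictionary is what would be sent to the monitoring service when the tenant is created.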

For billing, we use a similar approach with Cloudability.

For log aggregation, Duplo automatically deploys a Sumo Logic collector container on each host, with the collector name set to <tenant name> + <Host Name>. The collector is mapped to the Docker logs folder on the local host. Thus, all application logs are automatically uploaded to Sumo Logic, partitioned by tenant name.

There are more interesting things we do with the platform, such as grouping tenants and applying administrative policies to them, and reducing cost by using spot instances while keeping services oblivious to it. I will leave those for future blog posts.

– Thiruvengadam Venketesan


