Open-sourcing StateService: Automating recovery of third-party services after a major outage

At Facebook, our services are designed to recover automatically from a major outage, such as the loss of a data center due to a natural disaster. Most of our production services are built in-house, and all of them run in containers. The third-party services we use in our corporate infrastructure, however, run on virtual machines (VMs), which can be challenging to recover because their deployment procedures may include steps that require coordination across several VMs. Previously, we would have had to intervene manually in such third-party service deployments, which increased the time it took to recover from an outage.

To automate these deployments and decrease our recovery time, we developed StateService, a state machine as a service that directs the state of a VM through complex deployment processes. Here at Facebook, our corporate infrastructure teams use StateService to significantly reduce manual effort when deploying services. Today, we are open-sourcing StateService for use by engineering and ops teams.

StateService offers two improvements over existing approaches. First, it is self-documenting: the individual states become part of configuration management (CM) software, such as a Chef cookbook. Second, by replaying the states that were previously applied to a VM (or a group of VMs), StateService returns services to their last-known state.

StateService works with CM software, and Chef in particular, to deploy services. We use a state machine, expressed in YAML, to describe the states that one or more VMs can enter and how and when each transitions to another state (as seen in the image below). Each state can represent one step or a sequence of steps — e.g., waiting for an event to occur during deployment or performing the same action on a subset of VMs.

Figure: StateService describes the states (1-4) that one or more VMs can enter. Actions that previously had to be performed manually (A, B, and C) can now be programmed to occur automatically in a sequence of state transitions and on specific VMs. Unlike configuration management software, which only ensures that these actions are performed, StateService also ensures that A occurs before B and B occurs before C.
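
To make the idea of a YAML-described state machine concrete, here is a minimal Ruby sketch. The YAML layout, the state names, and the StateMachine class are all hypothetical and chosen for illustration; the schema StateService actually uses may look different.

```ruby
require 'yaml'

# Hypothetical YAML layout (an assumption, not StateService's real schema).
# Each state lists the states it is allowed to transition to.
MACHINE_YAML = <<~YAML
  initial: provisioned
  states:
    provisioned:
      next: [configured]
    configured:
      next: [joined]
    joined:
      next: [serving]
    serving:
      next: []
YAML

# Illustrative helper that answers "may a VM move from one state to another?"
class StateMachine
  attr_reader :initial

  def initialize(definition)
    @initial = definition.fetch('initial')
    @states = definition.fetch('states')
  end

  # True only when `to` is listed as a successor of `from`.
  def can_transition?(from, to)
    @states.fetch(from).fetch('next').include?(to)
  end
end

machine = StateMachine.new(YAML.safe_load(MACHINE_YAML))
machine.can_transition?('provisioned', 'configured') # => true
machine.can_transition?('provisioned', 'serving')    # => false
```

Because each state names its permitted successors, skipping a step is rejected, which is the ordering guarantee that configuration management alone does not provide.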

StateService exposes its state machine as a web service over HTTP, so Chef resources can ask, “Is this machine in State A now?” StateService answers each request with an HTTP status code: 200 (“Yes”) or 406 (“No”). Chef’s execute resource interprets the response as success (“Yes”) or failure (“No”), which determines whether the next step is allowed to proceed on the VM that made the request. The Chef resource then sends another request to StateService to cause a state transition or to increment a value associated with the current state. As an example, the code below uses Chef’s only_if guard clause to query StateService: Chef executes the command and updates StateService with the machine’s new state only if StateService returns a “Yes.”

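# Command that changes the machine (path elided here).
change_command = './'
execute 'change_machine' do
  cwd home_dir
  # Run the change, then report the new state to StateService.
  command "#{change_command} && curl -K didChangeMachine.curl"
  # Guard: run only if StateService answers "Yes" to the query.
  only_if 'curl -K canChangeMachine.curl', :cwd => home_dir
end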
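
The query-then-update exchange can be modeled in a few lines of plain Ruby. This is only a sketch of the protocol described above: StateTracker, its method names, and the state names are hypothetical, it keeps state in memory, and it does not speak HTTP. It only returns the status codes that a service like StateService would send.

```ruby
# Hypothetical in-memory stand-in for the service side of the exchange.
class StateTracker
  HTTP_OK = 200             # "Yes"
  HTTP_NOT_ACCEPTABLE = 406 # "No"

  def initialize(order)
    @order = order          # allowed sequence of states
    @current = Hash.new(0)  # VM name => index into @order
  end

  # Query: may this VM enter `state` next?
  def can_enter(vm, state)
    @order[@current[vm] + 1] == state ? HTTP_OK : HTTP_NOT_ACCEPTABLE
  end

  # Update: record the transition, refusing out-of-order reports.
  def did_enter(vm, state)
    status = can_enter(vm, state)
    @current[vm] += 1 if status == HTTP_OK
    status
  end

  # The state the VM is currently in.
  def state_of(vm)
    @order[@current[vm]]
  end
end

tracker = StateTracker.new(%w[provisioned configured joined serving])
tracker.can_enter('vm1', 'joined')     # => 406, 'configured' must come first
tracker.did_enter('vm1', 'configured') # => 200
```

A string guard in Chef passes when the command exits with status 0. Note that curl exits with 0 for a 406 response unless its --fail option is set, for example in the file passed with -K, so a curl-based guard would presumably rely on that option.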

These resources and the state machine become part of a Chef cookbook, producing a collection of states that can be replayed to recover from a disastrous event.
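
Replaying states can be pictured as walking an ordered log of what was applied to each VM. The sketch below is an assumption about how such a log might work; ReplayLog and its methods are hypothetical names, and this post does not describe how StateService actually stores or replays states.

```ruby
# Hypothetical per-VM log of applied states, kept in application order.
class ReplayLog
  def initialize
    @applied = Hash.new { |hash, vm| hash[vm] = [] }
  end

  # Remember that `state` was applied to `vm`.
  def record(vm, state)
    @applied[vm] << state
  end

  # Yield each recorded state in its original order and return the
  # last one, which is the VM's last-known state (nil if none).
  def replay(vm)
    states = @applied.fetch(vm, [])
    states.each { |state| yield state }
    states.last
  end
end

log = ReplayLog.new
%w[provisioned configured joined].each { |state| log.record('vm1', state) }
log.replay('vm1') { |state| puts "re-applying #{state}" } # => "joined"
```

In a real recovery, the block passed to replay would re-run the Chef resources associated with each state on the rebuilt VM.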

StateService is a time-saver for engineering and ops teams that want to automate complex deployment procedures. At Facebook, StateService reduces manual effort and allows us to recover third-party services rapidly after a major outage. In the future, we will explore how to integrate StateService with other CM software, such as Ansible and Puppet.
