When I joined Lyft as a software engineer in August, I was pleased to learn that Lyft has automated security updates to Docker images. The crux of the solution is to generate pull requests to child Docker images when parent images change. The pull requests cascade down to the children of those child images until all images are up to date. This process begins at base image creation and completes when all images are updated. In this post, we will dive deeper into the problem and talk about the challenges your organization will likely face when rolling out such automation. You will learn how to overcome several of these challenges and discover additional opportunities for improvement.
Using Secure Base Images
Base images in the Docker ecosystem are images that don’t have any antecedents. In other words, these are images that are built “FROM scratch”, referring to the FROM declaration you are required to make in your Dockerfile. Bits are normally injected into such images via some sort of bootstrapped chroot; you have to start somewhere, after all! Taking a peek at how the maintainer bootstraps an image is always a good idea. Below is an example of what a professionally bootstrapped Dockerfile looks like.
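As a sketch, the official Debian images follow roughly this shape; the root filesystem tarball is produced ahead of time by the maintainer's bootstrapping tooling:

```dockerfile
# Start from the empty image; nothing exists until we add it.
FROM scratch

# Unpack a pre-built minimal root filesystem into /.
ADD rootfs.tar.xz /

CMD ["bash"]
```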
Bootstrapping your own base images can be a challenging task. You need to generate a chroot, pull a minimal set of packages with their dependencies from a Linux distribution of your choice, and then create a tar archive that can be loaded into your scratch image. Usually, distributions provide their users with tooling to make this process easier such as debootstrap or alpine-chroot-install.
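The steps above can be sketched as a small script. This is a sketch, not a production pipeline: it assumes a Debian-based host with debootstrap and Docker installed, running as root, and the suite and tag names are hypothetical.

```shell
#!/bin/sh
# Sketch: build a minimal Debian base image from scratch using debootstrap.
set -eu

bootstrap_base_image() {
    suite="$1"        # e.g. a Debian release codename
    image_tag="$2"    # e.g. "myorg/debian-base:latest" (hypothetical)
    rootfs="$(mktemp -d)"

    # Pull a minimal set of packages and their dependencies into a chroot.
    debootstrap --variant=minbase "$suite" "$rootfs"

    # Archive the chroot and load it as a Docker image in one step.
    tar -C "$rootfs" -c . | docker import - "$image_tag"

    rm -rf "$rootfs"
}

# Invoke explicitly, e.g.:
#   bootstrap_base_image bookworm myorg/debian-base:latest
```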
If you have strict air-gap security requirements whereby you cannot pull anything from public Docker registries, you can use your distribution’s bootstrapping tools to automate making your own base images. The advantage you get is that you can pull distribution packages from your own trusted sources. Taking the scratch approach adds cost, however, because you might need to staff a team that pulls together the contents of these secure sources. In essence, you need to invest time and effort to redo what distributions already do for us.
To avoid reinventing the wheel, you can use public images. However, you should take precautions to make sure that you are using public images that are safe to use. How can you find out if a public image is safe to use? You can employ this checklist:
- Does the vendor announce security advisories actively?
- Are the images updated regularly?
- Do the vendor’s images depend on any images from other vendors?
Generally, the answer to the first two questions should be “yes” and the last one “no”. If the answer to the last question is “yes”, then you will need to evaluate whether that dependent (or parent) vendor fits the same checklist. You will also need to know whether new images from the parent vendor cause the child vendor’s images to be updated.
Once you’ve picked a trustworthy vendor, things get a little tricky. The vendor controls the cadence of updates and versioning scheme of the images that they publish. They also more than likely will not be sending you pull requests. Even if they were, your images might be in private repositories, and thus off limits to any of their automated tooling that can trigger your builds.
To wrest back control over versioning, Lyft creates intermediate repositories, each containing a Dockerfile that depends on a public image. This also allows us to build continuous integration (CI) phases that scan the incoming image for vulnerabilities, upload it to our own registry, and add anything else we’d like all child images to have. We recommend making it an administrative operation to enable a CI pipeline that pulls from a public registry. This way you guarantee that your service owners will depend on these images and receive the pull requests that contain the security updates from the public vendor.
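An intermediate repository’s Dockerfile might look something like this. It is only a sketch: the base image choice, certificate path, and internal registry name are all hypothetical.

```dockerfile
# Intermediate image: pins a public base image and layers on
# organization-wide additions.
FROM ubuntu:22.04

# Hypothetical example of an org-wide addition: trust the internal CA.
COPY internal-ca.crt /usr/local/share/ca-certificates/internal-ca.crt
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && update-ca-certificates \
 && rm -rf /var/lib/apt/lists/*

# CI for this repository would scan the built image for vulnerabilities
# and push it to the private registry, e.g. (hypothetical):
#   registry.internal.example.com/base/ubuntu:22.04
```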
The next problem that needs to be handled, whether you are using an intermediate repository or depending directly on a public image, is determining whether the parent image changed. If the parent image changed, that likely means there is a security update, and we should kick off the Docker image update cascade described in the blog post linked above. At Lyft, we kick off our intermediate image pipeline on a regular time cadence. This keeps things simple and reduces the time that our software is left vulnerable.
There are two ways we can do better. The first is to have software that polls a descriptive tag such as “latest” and checks whether the digest of that tag has changed. If a change is detected, we kick off the intermediate image pipeline. This accounts for cases where the vendor has to release security patches out of cycle, perhaps for a high-risk vulnerability. With the time-cadence scheme, we can always manually kick off an intermediate repository build if we have engineers or managers subscribed to important security announcements.
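The polling approach might be sketched as follows. This assumes skopeo and jq are available for resolving a tag’s digest without pulling the image; the state-file path in the comment is hypothetical.

```shell
#!/bin/sh
# Sketch: poll a tag's digest and detect changes.
set -eu

fetch_digest() {
    # Resolve the current digest of a tag without pulling the image.
    skopeo inspect "docker://$1" | jq -r .Digest
}

digest_changed() {
    # Compare a previously recorded digest with the current one.
    # Succeeds (returns 0) when they differ.
    [ "$1" != "$2" ]
}

# Example driver (not run here; the state file path is hypothetical):
#   current="$(fetch_digest docker.io/library/ubuntu:latest)"
#   if digest_changed "$(cat /var/lib/updater/ubuntu.digest)" "$current"; then
#       echo "parent changed; trigger intermediate image pipeline"
#   fi
```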
We can of course also depend on the vendor’s security announcements, which usually come in via email. Ideally, by the time those emails are sent, there is a new image that we can pull into our intermediate image. However, that is not always the case for every distribution. Distributions often prioritize their package repositories over their Docker images. If the images do not get updated, you should consider placing an update command in your intermediate base image, e.g. “apt-get upgrade”. Adding the extra upgrade to your Dockerfile never hurts, although it adds an additional layer. It is more important to err on the side of caution and ensure you have the packages with security updates.
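That safety net might look like this in an intermediate Dockerfile (a sketch; the base image and cleanup step are assumptions):

```dockerfile
FROM debian:stable-slim

# Pull in any security updates that have landed in the package
# repositories but are not yet baked into the published image.
RUN apt-get update \
 && apt-get -y upgrade \
 && rm -rf /var/lib/apt/lists/*
```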
Another challenge is that security announcements are usually granular down to the package level, so you’ll need to map package names to whether or not they exist in a given image. Perhaps we can work with Docker image vendors to have security announcements targeted at Docker images specifically, and to have those images updated consistently whenever any of the packages they contain change.
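One way to do that mapping is to list the packages installed in an image and check advisory package names against it. A sketch for Debian-based images follows; the image name in the comment is hypothetical.

```shell
#!/bin/sh
# Sketch: map an advisory's package name to whether it exists in an image.
set -eu

manifest_has_package() {
    # Reads a newline-separated package list on stdin and checks for
    # an exact match of the given package name.
    grep -qx "$1"
}

image_has_package() {
    # List installed packages inside a Debian-based image and check
    # whether the named package is among them.
    docker run --rm "$1" dpkg-query -W -f '${Package}\n' \
        | manifest_has_package "$2"
}

# Example (not run here; image name is hypothetical):
#   if image_has_package myorg/debian-base:latest openssl; then ...
```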
In summary, there are three possible starting points for your organization:
- Build your base images from scratch
- Create intermediary images that are based on the public ones
- Rely directly on publicly available base images
Here is a chart that maps the pros and cons of these starting points:

| Starting point | Pros | Cons |
| --- | --- | --- |
| Build from scratch | Full control over contents; packages come from your own trusted sources; satisfies air-gap requirements | You redo what distributions already do; may require a dedicated team |
| Intermediate images based on public ones | Control over update cadence and versioning; room for vulnerability scanning and org-wide additions | Extra repositories and CI pipelines to build and maintain |
| Rely directly on public images | Least effort to get started | No control over cadence or versioning; vendors cannot send pull requests to private repositories |
Enforcing the Cascade
The other challenge that you’ll likely run into in a large engineering organization is service owners not merging pull requests generated by tools such as dockerfile-image-update in a timely manner. The longer it takes people to merge those changes, the longer things remain vulnerable in production.
One way to resolve this problem is to automatically merge pull requests that have been left open. The timer starts when the pull request is made, and service owners can see in the pull request comments when the merge will happen. You might face a challenge where pull requests cannot be merged because they haven’t passed tests. Consider force merging those to signal to developers that they need to prioritize security updates. If they are in a crunch, they can always revert the auto-merged pull request or simply close it before it gets merged. In a way, you are forcefully subscribing people to security updates, while allowing simple levers to temporarily unsubscribe.
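The deadline check at the heart of such auto-merging can be sketched as below. The two-week merge window is an assumption, and the actual merge would go through your hosting provider's API (e.g. `gh pr merge` for GitHub).

```shell
#!/bin/sh
# Sketch: decide whether an open security pull request has aged past
# the auto-merge deadline.
set -eu

MERGE_WINDOW_SECONDS=$((14 * 24 * 60 * 60))   # hypothetical: 14 days

should_auto_merge() {
    created_epoch="$1"   # when the pull request was opened
    now_epoch="$2"
    [ $((now_epoch - created_epoch)) -ge "$MERGE_WINDOW_SECONDS" ]
}

# A scheduled job could run this per open pull request, post the
# computed merge date as a comment, and merge once it returns true.
```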
We have covered two challenges that my employer has run into. The first is how to safely make use of public images. The second is how to keep security updates moving forward. Driving the cascade further, into deployment systems, is the next big body of work. Imagine how much toil would be eliminated if your Kubernetes pod.yaml could be safely updated and deployed for each freshly updated image. Exploring this aspect is a topic for a future blog post.
I recently spoke about this topic at Jenkins World 2018 and will be speaking at All Things Open and Devoxx later this year. As always, if you know of tooling or ways we can solve automated security update challenges better, please let us know!
Big shout out and appreciation goes out to Anthony Sottile who provided feedback and edits for this post. Also great thanks to Aneesh Agarwal and Brian Witt for designing, maintaining, and scaling the update system at Lyft.