Developer Experience Lessons Operating a Serverless-like Platform At Netflix — Part II

By Ludovic Galibert, Vasanth Asokan and Sangeeta Narayanan

In Part 1 of this series, we outlined key learnings the Edge Developer Experience team gained from operating the API dynamic scripting platform, which provides a serverless or FaaS-like experience for client application developers. We addressed the concerns around getting code ready for production deployment. Here, we look at what it takes to deploy that code safely and operate it on an ongoing basis.


Simplicity and abstraction of operations come with reduced control; have you evaluated the tradeoffs?

When a script is deployed to our platform, the rollout is completed within a fixed time interval. This predictability is useful, especially for automated workflows that involve a dependency on the script being fully rolled out; an example being mobile app rollouts gated by the deployment of a script with new functionality. Additionally, we implemented a realtime notification system that allows developers to progressively monitor the state of their deployment. This enables further optimization of such workflows.

Given the dynamic nature of serverless scheduling, application instances are more vulnerable to cold start delays caused by JIT-ing or connection priming. This unpredictable latency jitter is typically not an issue if there is a steady stream of requests coming in from clients, but it is more pronounced for applications that do not receive enough traffic to keep their instances “warm”. New application deployments are also susceptible to this issue. To mitigate this, we designed a feature whereby scripts can be shipped with a set of “warmup” handlers as part of the image. Right after instance provisioning, the platform executes these warmup handlers so that JIT-ing and connection priming are performed for the covered code paths before an instance starts processing requests. Typical monolithic application units are in control of the provisioning lifecycle, and thus contain infrastructure to perform such startup tuning. It is important to ensure that off-the-shelf serverless platforms provide hooks to retain this facility.
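The warmup idea can be sketched in a few lines of Python. This is a minimal illustration and not our platform's API: the Instance class, its methods and the two handlers are hypothetical names, and the choice to record a failing handler rather than block startup is an assumption made for the example.

```python
from typing import Callable, List

# Hypothetical type: a warmup handler takes no arguments and returns nothing.
WarmupHandler = Callable[[], None]


class Instance:
    """Toy application instance that accepts traffic only after warmup."""

    def __init__(self, warmup_handlers: List[WarmupHandler]) -> None:
        self.warmup_handlers = warmup_handlers
        self.ready = False
        self.warmup_failures: List[str] = []

    def provision(self) -> None:
        # Run every warmup handler before the instance is marked ready.
        # Assumption: a failing handler is recorded but does not block startup.
        for handler in self.warmup_handlers:
            try:
                handler()
            except Exception as exc:
                self.warmup_failures.append(f"{handler.__name__}: {exc}")
        self.ready = True

    def handle(self, request: str) -> str:
        if not self.ready:
            raise RuntimeError("instance is not ready to take traffic")
        return f"handled {request}"


calls: List[str] = []


def prime_connections() -> None:
    calls.append("connections")  # stand-in for opening downstream connections


def exercise_hot_path() -> None:
    calls.append("hot_path")  # stand-in for invoking code so it gets JIT-compiled


instance = Instance([prime_connections, exercise_hot_path])
instance.provision()
```

The point of the sketch is the ordering: the handlers run as part of provisioning, so the first real request never pays for priming.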

Another provisioning consideration is multi-region (datacenter) deployments. Initial versions of our platform only supported instant global deployments. While this simplified the experience for our users by abstracting the notion of regions from them, it deprived them of the ability to test their deployment in stages across regions. More importantly it was a potential availability risk if a bad deployment was rolled out globally. In the end, we evolved our platform to support both global and regional deployments. Users have the option to choose their deployment schedules by region.

Figure: Staged regional deployments to avoid global issues
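A staged regional rollout can be thought of roughly as follows. This is a sketch under assumptions: staged_rollout, deploy and is_healthy are hypothetical names, the region identifiers are only examples, and a real system would wait and observe metrics between stages rather than check health immediately.

```python
from typing import Callable, List


def staged_rollout(
    regions: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy region by region and stop at the first unhealthy region.

    Returns the regions where the new version was deployed and judged healthy.
    """
    completed: List[str] = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so a bad deployment never reaches the remaining regions.
            break
        completed.append(region)
    return completed


regions = ["us-west-2", "us-east-1", "eu-west-1"]  # example region names

attempted_ok: List[str] = []
all_regions = staged_rollout(regions, attempted_ok.append, lambda region: True)

attempted_bad: List[str] = []
halted = staged_rollout(
    regions, attempted_bad.append, lambda region: region != "us-east-1"
)
```

In the second call the rollout stops after the second region, so the third region keeps running the previous version.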

As a final gate before production, canary deployments and multivariate testing are key techniques to gain confidence at scale and reduce the risk associated with a new deployment. These capabilities are built into our deployment and routing layers. Users can deploy reduced-size baseline (current production code) and canary (new pre-release code) versions of the application, and the metrics flowing from each (such as CPU load and latency) are tagged as coming from the baseline or the canary. This allows a comparison of application behavior between versions prior to full rollout.
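To make the baseline versus canary comparison concrete, here is a deliberately simple scoring sketch. It is not the analysis our platform performs: canary_score, the 10% tolerance and the "lower is better" rule for every metric are assumptions chosen for illustration, and the sample values are made up.

```python
from statistics import mean
from typing import Dict, List


def canary_score(
    baseline: Dict[str, List[float]],
    canary: Dict[str, List[float]],
    tolerance: float = 0.10,
) -> float:
    """Return the fraction of metrics on which the canary is acceptable.

    Assumption: every metric is "lower is better", and the canary passes a
    metric if its mean is at most `tolerance` worse than the baseline mean.
    """
    passed = 0
    for name, baseline_samples in baseline.items():
        baseline_mean = mean(baseline_samples)
        canary_mean = mean(canary[name])
        if canary_mean <= baseline_mean * (1 + tolerance):
            passed += 1
    return passed / len(baseline)


# Made-up samples, tagged by the deployment that produced them.
baseline_metrics = {"cpu_load": [0.50, 0.52], "latency_ms": [100.0, 110.0]}
canary_metrics = {"cpu_load": [0.51, 0.53], "latency_ms": [150.0, 160.0]}

score = canary_score(baseline_metrics, canary_metrics)
```

Here the canary is fine on CPU load but clearly worse on latency, so only one of the two metrics passes.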

To sum up, all the techniques and best practices that help reduce risk of deployments of services are equally applicable to and necessary for script deployments and serverless functions. Also, keep in mind that the simplicity of a serverless experience comes at the cost of control over the application container and its scheduling. For use cases requiring precise control over when the application is ready to take traffic, this could be an important consideration.

Operational Insights

Smaller, lightweight application units are more vulnerable to system noise; how do you smooth out the jitter?

Our website team redesigned the Netflix website experience in 2015. As part of a modular design, they chose to break down the data access scripts they deployed to our scripting platform into fine-grained units, which can be thought of as “nano-services”. Over time, this led to an order of magnitude increase in the number of scripts they run.

Figure: Insights for a fraction of the website scripts in pre-production

Through the lifecycle of this exercise, we observed a few interesting things:

  1. The increased telemetry was great when needed, but a burden to monitor and optimize around continuously. More application units resulted in more dashboards and alerts. What was meant to be “information” quickly became overwhelming “data” that in the best case caused fatigue or in the worst case faded into noise (for a great discussion on this topic, see Owning Attention).
  2. These fine grained units also meant that the composition of the final application was much more distributed. With many more moving parts, it was no longer useful to deal with an application unit’s health in isolation. Instead, the website team favored starting from a more composite view of the system health. As long as business and client metrics remained unaffected, per unit health was ignored.

Based on the above experience, we believe that in order to reliably operate applications composed of smaller units, the core concept of increased abstraction should be extended to operational insight and workflows well beyond today’s levels. Here are some ideas that are influencing our next generation platform that we think would be beneficial more broadly.

  1. Low-level telemetry must always be tied together with higher order business and system metrics to provide a composite picture of application health. A clear signal on application health influences the action that needs to be taken as well as the urgency.
  2. Upstream and downstream health context becomes much more important. If an application is misbehaving as a result of issues in a dependency, the operational response changes from debugging or diagnostics to information dissemination.
  3. Issues most often correlate to changes, and with smaller units, the velocity of changes is higher. Thus as part of alerting, it is key to include the context of what changed in the connected parts in addition to the application itself.
  4. Most applications are deployed across multiple datacenter regions, which adds another operational dimension. Providing a combined view and control plane, and yet allowing per-datacenter drill-ins is useful.
  5. In the same vein, enabling composite operations across application units (a.k.a. an operator view and control plane) could be critical. If an issue affects multiple sibling units, possibly all of them will need remedial action. For example, if a platform or library bug is discovered and needs to be patched, the ability to efficiently track or update multiple related application units all at once becomes highly desirable.
  6. Automatic diagnostics and remediations for common issues may allow the service to continue functioning well enough that the issue can be addressed at a more convenient time.
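As a rough illustration of the first three ideas, the sketch below combines a unit-level alert with business metrics, dependency health and recent changes to suggest a response. Everything here is hypothetical: the HealthContext fields, the Action values and the triage rules are assumptions made for the example, not a description of our tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    IGNORE = "ignore for now"
    NOTIFY = "inform dependency owners"
    INVESTIGATE = "debug the application unit"


@dataclass
class HealthContext:
    unit_alerting: bool            # low-level telemetry for one unit is firing
    business_metrics_ok: bool      # composite business and client metrics
    dependency_unhealthy: bool     # upstream or downstream health context
    recent_changes: List[str] = field(default_factory=list)


def triage(ctx: HealthContext) -> Action:
    """Pick a response from the composite picture, not the unit alert alone."""
    if not ctx.unit_alerting or ctx.business_metrics_ok:
        # Per-unit noise is tolerated while business metrics are unaffected.
        return Action.IGNORE
    if ctx.dependency_unhealthy:
        # The response shifts from debugging to information dissemination.
        return Action.NOTIFY
    return Action.INVESTIGATE


noisy_unit = HealthContext(True, True, False)
broken_dependency = HealthContext(True, False, True)
bad_deploy = HealthContext(True, False, False, ["script v42 deployed"])
```

In the last case the recent_changes list is what an alert would attach, so the person investigating starts from what changed.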

Taken together, these innovations help provide a more holistic view and control plane, powered by automation. It is important to note that these considerations are not unique to serverless — they just get amplified. Overall, the key is to allow developers to outsource more of the operations to tools with confidence.

Lifecycle Management

Large applications now become numerous smaller ones. What are the implications of this increase in dimensionality?

As described earlier, breaking up applications into nano-services implies an increase in the number of application units. A compounding factor is the increase in deployed versions that have to be maintained indefinitely, as is often the case with consumer-facing applications. The combination of these two factors could mean a drastic increase in independent deployments. A prominent example of this is Android fragmentation, which makes it necessary to maintain multiple application versions in order to run on old devices that cannot upgrade to newer versions of Android. Google is now trying to address the fragmentation problem at the core of Android.

So what about maintenance? In Part 1 we mentioned frequent developer commits coupled with CI/CD runs in pre-production environments resulting in a long trail of short-lived deployments. These unused deployments consume resources, carry a maintenance overhead and result in unnecessary cost. Our first attempt to address this problem was to provide accurate and actionable usage reports, as well as self-protecting limits and switches in the platform against accidental over-subscription and abuse. We soon realized that manual cleanup was not only tedious but also error-prone: versions that were still taking meaningful traffic were sometimes accidentally removed. This presented an automation opportunity for hands-off application lifecycle management. Developers are asked to specify upfront when a particular version can be safely sunset, based on traffic falling below a threshold for a minimum number of days. Versions that meet this condition are automatically cleaned up by an out-of-band system that evaluates eligibility. Additional safety checks and the ability to quickly resurrect deleted versions help reduce the risk associated with these maintenance operations.
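The sunset rule described above (traffic below a threshold for a minimum number of days) can be expressed as a small predicate. This is a sketch under assumptions: eligible_for_sunset and its parameters are hypothetical names, the input is assumed to be one request count per day in chronological order, and the numbers are made up.

```python
from typing import List


def eligible_for_sunset(
    daily_requests: List[int], threshold: int, min_days: int
) -> bool:
    """Return True if each of the most recent `min_days` days saw fewer
    than `threshold` requests.

    Assumption: with less than `min_days` of history the version is kept,
    which errs on the side of not deleting something still in use.
    """
    if min_days <= 0 or len(daily_requests) < min_days:
        return False
    return all(count < threshold for count in daily_requests[-min_days:])


fading_version = [500, 40, 3, 0, 1]   # traffic has tailed off
active_version = [0, 0, 50]           # a recent spike of real traffic
new_version = [1, 2]                  # not enough history yet

fading_ok = eligible_for_sunset(fading_version, threshold=10, min_days=3)
active_ok = eligible_for_sunset(active_version, threshold=10, min_days=3)
new_ok = eligible_for_sunset(new_version, threshold=10, min_days=3)
```

Only the first version qualifies. The other two are kept, which mirrors the intent of avoiding accidental removal of versions that still matter.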

A variation of the maintenance problem is the requirement to update applications to address security vulnerabilities, performance issues or end-of-life for libraries. While automated reporting is helpful in surfacing applications that need attention, updates are typically a tedious, manual process and are not always performed in a timely manner. An idea we are pursuing here is to facilitate automatic upgrades. Our goal is to apply the updates to an application unit, run it through the canary process and based on the canary score, provide a push button way for the update to be rolled out. We believe this feature will provide a significant productivity win for developers, especially as the number of deployments increases.

Serverless makes it easy to do fire-and-forget deployments, but it brings with it increased maintenance considerations. Features designed to eliminate toil become increasingly important even at reasonable scale.


Our experience tells us that serverless architectures go a long way in simplifying the process of developing scalable applications in a rapid and cost effective manner. From an operational perspective, they introduce different considerations such as the loss of control over the execution environment and the complexity of managing many smaller deployment units, resulting in the need for much more sophisticated insights and observability solutions. We see the industry headed in this direction and are eagerly looking forward to the innovations in this space.

If you have opinions on serverless or want to engage in conversations related to developer experience, we’d love to hear from you! And if you want to help Netflix engineers in their quest to delight millions of customers worldwide, come join our team!
