Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis. At Netflix, we’re pushing the boundaries even further, reimagining what a notebook can be, who can use it, and what they can do with it. And we’re making big investments to help make this vision a reality.
In this post, we’ll share our motivations and why we find Jupyter notebooks so compelling. We’ll also introduce components of our notebook infrastructure and explore some of the novel ways we’re using notebooks at Netflix.
If you’re short on time, we suggest jumping down to the Use Cases section.
Data powers Netflix. It permeates our thoughts, informs our decisions, and challenges our assumptions. It fuels experimentation and innovation at unprecedented scale. Data helps us discover fantastic content and deliver personalized experiences for our 130 million members around the world.
Making this possible is no small feat; it requires extensive engineering and infrastructure support. Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms. To support these use cases at such scale, we’ve built an industry-leading data platform which is flexible, powerful, and complex (by necessity). We’ve also built a rich ecosystem of complementary tools and services, such as Genie, a federated job execution service, and Metacat, a federated metastore. These tools simplify the complexity, making it possible to support a broader set of users across the company.
User diversity is exciting, but it comes at a cost: the data platform — and its ecosystem of tools and services — must scale to support additional use cases, languages, access patterns, and more. To better understand this problem, consider 3 common roles: analytics engineer, data engineer, and data scientist.
Generally, each role relies on a different set of tools and languages. For example, a data engineer might create a new aggregate of a dataset containing trillions of streaming events — using Scala in IntelliJ. An analytics engineer might use that aggregate in a new report on global streaming quality — using SQL and Tableau. And that report might lead to a data scientist building a new streaming compression model — using R and RStudio. On the surface, these seem like disparate, albeit complementary, workflows. But if we delve deeper, we see that each of these workflows has multiple overlapping tasks:
data exploration — occurs early in a project; may include viewing sample data, running queries for statistical profiling and exploratory analysis, and visualizing data
data preparation — iterative task; may include cleaning, standardizing, transforming, denormalizing, and aggregating data; typically the most time-intensive task of a project
data validation — recurring task; may include viewing sample data, running queries for statistical profiling and aggregate analysis, and visualizing data; typically occurs as part of data exploration, data preparation, development, pre-deployment, and post-deployment phases
productionalization — occurs late in a project; may include deploying code to production, backfilling datasets, training models, validating data, and scheduling workflows
To help our users scale, we want to make these tasks as effortless as possible. To help our platform scale, we want to minimize the number of tools we need to support. But how? No single tool could span all of these tasks; what’s more, a single task often requires multiple tools. When we add another layer of abstraction, however, a common pattern emerges across tools and languages: run code, explore data, present results.
As it happens, an open source project was designed to do precisely that: Project Jupyter.
Project Jupyter began in 2014 with a goal of creating a consistent set of open-source tools for scientific research, reproducible workflows, computational narratives, and data analytics. Those tools translated well to industry, and today Jupyter notebooks have become an essential part of the data scientist toolkit. To give you a sense of its impact, Jupyter was awarded the 2017 ACM Software Systems Award — a prestigious honor it shares with Java, Unix, and the Web.
To understand why the Jupyter notebook is so compelling for us, consider the core functionality it provides:
- a messaging protocol for introspecting and executing code which is language agnostic
- an editable file format for describing and capturing code, code output, and markdown notes
- a web-based UI for interactively writing and running code as well as visualizing outputs
The Jupyter protocol provides a standard messaging API to communicate with kernels that act as computational engines. The protocol enables a composable architecture that separates where content is written (the UI) and where code is executed (the kernel). By isolating the runtime from the interface, notebooks can span multiple languages while maintaining flexibility in how the execution environment is configured. If a kernel exists for a language that knows how to communicate using the Jupyter protocol, notebooks can run code by sending messages back and forth with that kernel.
Backing all this is a file format that stores both code and results together. This means results can be accessed later without needing to rerun the code. In addition, the notebook stores rich prose to give context to what’s happening within the notebook. This makes it an ideal format for communicating business context, documenting assumptions, annotating code, describing conclusions, and more.
Of our many use cases, the most common ways we’re using notebooks today are: data access, notebook templates, and scheduling notebooks.
Notebooks were first introduced at Netflix to support data science workflows. As their adoption grew among data scientists, we saw an opportunity to scale our tooling efforts. We realized we could leverage the versatility and architecture of Jupyter notebooks and extend it for general data access. In Q3 2017 we began this work in earnest, elevating notebooks from a niche tool to a first-class citizen of the data platform.
From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data — all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook. Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across the entire Data Platform.
Today, notebooks are the most popular tool for working with data at Netflix.
As we expanded platform support for notebooks, we began to introduce new capabilities to meet new use cases. From this work emerged parameterized notebooks. A parameterized notebook is exactly what it sounds like: a notebook which allows you to specify parameters in your code and accept input values at runtime. This provides an excellent mechanism for users to define notebooks as reusable templates.
Our users have found a surprising number of uses for these templates. Some of the most common ones are:
- Data Scientist: run an experiment with different coefficients and summarize the results
- Data Engineer: execute a collection of data quality audits as part of the deployment process
- Data Analyst: share prepared queries and visualizations to enable a stakeholder to explore more deeply than Tableau allows
- Software Engineer: email the results of a troubleshooting script each time there’s a failure
One of the more novel ways we’re leveraging notebooks is as a unifying layer for scheduling workflows.
Since each notebook can run against an arbitrary kernel, we can support any execution environment a user has defined. And because notebooks describe a linear flow of execution, broken up by cells, we can map failure to particular cells. This allows users to describe a short narrative of execution and visualizations that we can accurately report against when running at a later point in time.
This paradigm means we can use notebooks for interactive work and smoothly move to scheduling that work to run recurrently. For users, this is very convenient. Many users construct an entire workflow in a notebook, only to have to copy/paste it into separate files for scheduling when they’re ready to deploy it. By treating notebooks as a logical workflow, we can easily schedule it the same as any other workflow.
We can schedule other types of work through notebooks, too. When a Spark or Presto job executes from the scheduler, the source code is injected into a newly-created notebook and executed. That notebook then becomes an immutable historical record, containing all related artifacts — including source code, parameters, runtime config, execution logs, error messages, and so on. When troubleshooting failures, this offers a quick entry point for investigation, as all relevant information is colocated and the notebook can be launched for interactive debugging.
Supporting these use cases at Netflix scale requires extensive supporting infrastructure. Let’s briefly introduce some of the projects we’ll be talking about.
nteract is a next-gen React-based UI for Jupyter notebooks. It provides a simple, intuitive interface and offers several improvements over the classic Jupyter UI, such as inline cell toolbars, drag and droppable cells, and a built-in data explorer.
Papermill is a library for parameterizing, executing, and analyzing Jupyter notebooks. With it, you can spawn multiple notebooks with different parameter sets and execute them concurrently. Papermill can also help collect and summarize metrics from a collection of notebooks.
Commuter is a lightweight, vertically-scalable service for viewing and sharing notebooks. It provides a Jupyter-compatible version of the contents API and makes it trivial to read notebooks stored locally or on Amazon S3. It also offers a directory explorer for finding and sharing notebooks.
Titus is a container management platform that provides scalable and reliable container execution and cloud-native integration with Amazon AWS. Titus was built internally at Netflix and is used in production to power Netflix streaming, recommendation, and content systems.
We’ll explore this infrastructure more deeply in future blog posts. For now, we’ll focus on three fundamental components: storage, compute, and interface.
The Netflix Data Platform relies on Amazon S3 and EFS for cloud storage, which notebooks treat as virtual filesystems. This means each user has a home directory on EFS, which contains a personal workspace for notebooks. This workspace is where we store any notebook created or uploaded by a user. This is also where all reading and writing activity occurs when a user launches a notebook interactively. We rely on a combination of
[workspace + filename] to form the notebook’s namespace, e.g.
/efs/users/kylek/notebooks/MySparkJob.ipynb. We use this namespace for viewing, sharing, and scheduling notebooks. This convention prevents collisions and makes it easy to identify both the user and the location of the notebook in the EFS volume.
We can rely on the workspace path to abstract away the complexity of cloud-based storage from users. For example, only the filename of a notebook is displayed in directory listings, e.g.
MySparkJob.ipynb. This same file is accessible at
~/notebooks/MySparkJob.ipynb from a terminal.
When the user schedules a notebook, the scheduler copies the user’s notebook from EFS to a common directory on S3. The notebook on S3 becomes the source of truth for the scheduler, or source notebook. Each time the scheduler runs a notebook, it instantiates a new notebook from the source notebook. This new notebook is what actually executes and becomes an immutable record of that execution, containing the code, output, and logs from each cell. We refer to this as the output notebook.
Collaboration is fundamental to how we work at Netflix. It came as no surprise then when users started sharing notebook URLs. As this practice grew, we ran into frequent problems with accidental overwrites caused by multiple people concurrently accessing the same notebook . Our users wanted a way to share their active notebook in a read-only state. This led to the creation of commuter. Behind the scenes, commuter surfaces the Jupyter APIs for
/api/contents to list directories, view file contents, and access file metadata. This means users can safely view notebooks without affecting production jobs or live-running notebooks.
Managing compute resources is one of the most challenging parts of working with data. This is especially true at Netflix, where we employ a highly-scalable containerized architecture on AWS. All jobs on the Data Platform run on containers — including queries, pipelines, and notebooks. Naturally, we wanted to abstract away as much of this complexity as possible.
A container is provisioned when a user launches a notebook server. We provide reasonable defaults for container resources, which works for ~87.3% of execution patterns. When that’s not enough, users can request more resources using a simple interface.
We also provide a unified execution environment with a prepared container image. The image has common libraries and an array of default kernels preinstalled. Not everything in the image is static — our kernels pull the most recent versions of Spark and the latest cluster configurations for our platform. This reduces the friction and setup time for new notebooks and generally keeps us to a single execution environment.
Under the hood we’re managing the orchestration and environments with Titus, our Docker container management service. We further wrap that service by managing the user’s particular server configuration and image. The image also includes user security groups and roles, as well as common environment variables for identity within included libraries. This means our users can spend less time on infrastructure and more time on data.
Earlier we described our vision for notebooks to become the tool of choice for working with data. But this presents an interesting challenge: how can a single interface support all users? We don’t fully know the answer yet, but we have some ideas.
We know we want to lean into simplicity. This means an intuitive UI with a minimalistic aesthetic, and it also requires a thoughtful UX that makes it easy to do the hard things. This philosophy aligns well with the goals of nteract, a React-based frontend for Jupyter notebooks. It emphasizes simplicity and composability as core design principles, which makes it an ideal building block for the work we want to do.
One of the most frequent complaints we heard from users is the lack of native data visualization across language boundaries, especially for non-Python languages. nteract’s Data Explorer is a good example of how we can make the hard things simpler by providing a language-agnostic way to explore data quickly.
You can see Data Explorer in action in this sample notebook on MyBinder. (please note: it may take a minute to load)
We’re also introducing native support for parametrization, which makes it easier to schedule notebooks and create reusable templates.
Although notebooks are already offering a lot of value at Netflix, we’ve just begun. We know we need to make investments in both the frontend and backend to improve the overall notebook experience. Our work over the next 12 months is focused on improving reliability, visibility, and collaboration. Context is paramount for users, which is why we’re increasing visibility into cluster status, kernel state, job history, and more. We’re also working on automatic version control, native in-app scheduling, better support for visualizing Spark DataFrames, and greater stability for our Scala kernel. We’ll go into more detail on this work in a future blog post.
Open Source Projects
Netflix has long been a proponent of open source. We value the energy, open standards, and exchange of ideas that emerge from open source collaborations. Many of the applications we developed for the Netflix Data Platform have already been open sourced through Netflix OSS. We are also intentional about not creating one-off solutions or succumbing to “Not Invented Here” mentality. Whenever possible, we leverage and contribute to existing open source projects, such as Spark, Jupyter, and pandas.
The infrastructure we’ve described relies heavily on the Project Jupyter ecosystem, but there are some places where we diverge. Most notably, we have chosen nteract as the notebook UI for Netflix. We made this decision for many reasons, including alignment with our technology stack and design philosophies. As we push the limits of what a notebook can do, we will likely create new tools, libraries, and services. These projects will also be open sourced as part of the nteract ecosystem.
We recognize that what makes sense for Netflix does not necessarily make sense for everyone. We have designed these projects with modularity in mind. This makes it possible to pick and choose only the components that make sense for your environment, e.g. Papermill, without requiring a commitment to the entire ecosystem.
As a platform team, our responsibility is to enable Netflixers to do amazing things with data. Notebooks are already having a dramatic impact at Netflix. With the significant investments we’re making in this space, we’re excited to see this impact grow. If you’d like to be a part of it, check out our job openings.
Phew! Thanks for sticking with us through this long post. But we’ve just scratched the surface of what we’re doing with notebooks. In our next post, we’ll explore more deeply the architecture behind notebook scheduling. Until then, here are some ways to learn more about what Netflix is doing with data and how we’re doing it:
We’re also thrilled to sponsor this year’s JupyterCon. If you’re attending, check out one of the 5 talks by our engineers, or swing by our booth to talk about Jupyter, nteract, or data with us.