How to enable data scientists to stop managing ETL pipelines and get back to doing data science: Part I


Part I: How to build tools that you can hand off to data scientists

 

Summary

Business moves fast and data science runways are long. What can we do to remove friction and iterate on data science models as quickly as possible? In this two-part blog series, we describe how Wayfair builds Jupyter-based tooling that empowers data scientists to focus on core data science tasks, rather than spending their time managing boilerplate ETL pipelines. In this first installment, we describe how we build tools that data science consumers can maintain and extend with minimal support. 

 

Introduction

A data science project is, at its core, a machine learning and modeling task that requires the expertise of a data scientist. Ideally, data scientists would spend all of their time on those core data science tasks. But before you can even start doing exploratory data analysis (EDA), you first need to get your data into your environment. Unfortunately, data scientists spend a lot of their time managing ETL pipelines. And while the “transform” portion of a pipeline often does require data science expertise to implement, the “extract” and “load” steps do not. These steps are fairly boilerplate, but building them, and especially learning how to construct them, can be labor-intensive. The alternative is to arrange for costly engineering support and multiple rounds of back-and-forth defining and verifying the data science requirements. And this is all BEFORE you can start doing true data science work and begin your project in earnest.

What can we do to free up our data scientists to focus on data science? What can we do to minimize the amount of effort you need to spend BEFORE you get to the actual data science?

This two-part blog series will discuss how Wayfair develops lightweight, Jupyter-friendly tools that automate the construction of these pipelines. These tools not only allow new data scientists to onboard onto existing projects on day one, they also allow data scientists to self-serve new projects by implementing a small handful of functions, with little effort and only occasional engineering support.

In this blog post, we demonstrate how we achieve this by using the model-view-controller (MVC) architectural design pattern as a metaphor. In the second installment, we will describe how we leverage Jupyter’s notebook-level state in our tools to provide both a low-level API for internal development and a high-level API for external consumption.

In Section 1, we begin a case study of an actual computer vision project. In Section 2, we introduce the MVC framework. In Section 3, we conclude our case study by demonstrating how we leverage this framework to create an automated, configurable pipeline builder that helps data scientists onboard new projects in days rather than weeks.

 

Section 1

Case Study: RoSE—Room and Style Estimator

At Wayfair we believe that everyone deserves to live in a home they love. Part of that is finding décor that matches your personal “style.” Are you mid-century modern with a hint of glam? How about traditionally rustic? We at Wayfair use RoSE, the Room and Style Estimator developed by our team of computer vision data scientists, to quantify the “style” of a room based on an image, so we can better understand what our customer is looking for and, more importantly, help her find it.

If you are a Wayfair customer, you might have noticed that we have various primary style tags on our website, such as modern, rustic, and cottage/country. These stylistic terms are loosely defined and highly subject to the customer’s interpretation. Moreover, it is often hard for customers to verbalize their style preferences, but once they see an image they can easily point out their likes and dislikes. Motivated by this, the goal of RoSE is to learn image-based features and estimate style from a given image in which multiple items are seen in a room context. We represent an image with the stylistic features extracted from RoSE. The distances between these features help us recommend room and product images to our customers that are tailored to their stylistic preferences. When we first rolled out this new feature, we observed a significant increase in customer engagement, especially in the landing experience quiz.

RoSE is a VGG network [1] trained on over 800k room images, each of which is tagged with a style label by 10 different experts. When you begin development on a project like this, you must first: 

  1. collate all of those images and labels into one location (single host or distributed), and
  2. create a queue of training batches that fills more quickly than your GPU(s) consumes from it.

These tasks vary little between computer vision projects, and they do not require data science expertise: they are rote and can be automated. However, as you train your model, decisions begin to arise that do require that expertise. How do you want to sample your images? How do you want to process them before training (scaling, padding, perturbing, etc.)? Do you need to construct training instances, e.g. pairs for a Siamese network? These are core tasks for data scientists.
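To make this concrete, here is a minimal sketch of the kind of per-image transform a data scientist might own. The function, its parameters, and the specific choices (bilinear scaling, zero padding, a random flip) are illustrative assumptions, not RoSE’s actual preprocessing.

```python
import random

import numpy as np
from PIL import Image


def preprocess_image(path, target_size=(224, 224), train=True):
    """Hypothetical per-image transform: scale, pad to a square, and lightly
    perturb at training time. The exact choices are model-specific."""
    img = Image.open(path).convert("RGB")

    # Scale down so the longer edge fits the target, preserving aspect ratio.
    img.thumbnail(target_size, Image.BILINEAR)

    # Pad onto a square canvas so every image has the same shape.
    canvas = Image.new("RGB", target_size, (0, 0, 0))
    offset = ((target_size[0] - img.width) // 2,
              (target_size[1] - img.height) // 2)
    canvas.paste(img, offset)

    # A simple perturbation (horizontal flip), applied only during training.
    if train and random.random() < 0.5:
        canvas = canvas.transpose(Image.FLIP_LEFT_RIGHT)

    return np.asarray(canvas, dtype=np.float32) / 255.0
```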

 

 

Fig. 1: The computer vision ETL pipeline divided into four steps.

 

We break down the pipeline into four steps:

  1. Get image resource locations and other metadata: 
    • In this case, this takes the form of a query.
  2. Pre-process images and copy locally:
    • We first copy images to the host where we will train. We also pre-process the images here, e.g. scaling, padding, perturbing, etc.
  3.  Build training instances and labels:
    • In the case of RoSE we explored two approaches
      1. Trying to predict which style won the majority of votes from our experts.
      2. Trying to predict which of two images won more votes for a given style.
    • These each require us to construct separate datasets. For the latter, care must be taken to not allow images to leak into both the train and test splits.
  4. Perform inference / training:
    • We provide Keras generators that run in a background process and produce image batches more quickly than they are consumed (see the sketch after this list).

Steps two and three are the core data science tasks that require domain expertise. We want to empower our data scientists to own this central phase of the process via configuration files and by implementing a small handful of additional functions in their modeling code. Steps one and four are the same from project to project. A data scientist needs to write the query, but otherwise this can and should be handled “automagically.”

 

Section 2

The model-view-controller (MVC) Framework

In this section, we pause our case study to describe the MVC framework and how we will use it.

Jupyter is the bread-and-butter productivity platform of most data scientists. Therefore, if our goal is to empower data scientists, our first requirement is that the tools we develop be useful in a notebook. That means that whatever we build needs to be something you can import and then use interactively. 

Fig. 2: Hoover is designed to power experiments in Jupyter notebooks.

 

Our solution is inspired by the model-view-controller (MVC) framework. To understand the MVC pattern, let’s look at an example. Since MVC is commonly used for web applications, we consider making a purchase on an e-commerce site. Our mission is to delight our customer with a seamless checkout experience: we want her to click a button and feel confident that her order has been placed and will be delivered soon.

 

Fig. 3: A graphic breakdown of the MVC framework.

 

Internally, the MVC framework separates this into three components:

  • Controller: how the user tells the application what to do. In our case, this is the checkout button.
  • Model: the part of the application that understands how to do it, i.e. the backend integrations with the order processing platform.
  • View: how the application tells the user what has been done, such as loading a confirmation page or modal and triggering a confirmation email.

 

Note that not all controller “actions” require views and/or models to complete. For example, consider an action that simply issues a select query and persists the results to a csv. First, we create a Data Access Object (DAO) database utility, which provides the user a single, streamlined process for running their query against arbitrary databases (row and columnar databases, key-value stores, etc.) without having to manage connections and cursors. Since issuing a select query and writing a csv neither manipulates data nor presents it to the user, we do not need to create a model or view. Instead, the controller uses the DAO directly.
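A minimal sketch of what such a DAO might look like is below; the class, its constructor, and the sqlite3 stand-in are illustrative assumptions, not the actual utility.

```python
import csv
import sqlite3  # stand-in driver; the real utility targets several database types


class DataAccessObject:
    """Illustrative DAO: hides connection and cursor management behind one call."""

    def __init__(self, connection_factory):
        # e.g. connection_factory = lambda: sqlite3.connect("metadata.db")
        self._connect = connection_factory

    def select(self, query, params=()):
        conn = self._connect()
        try:
            cursor = conn.cursor()
            cursor.execute(query, params)
            columns = [col[0] for col in cursor.description]
            return columns, cursor.fetchall()
        finally:
            conn.close()


def export_to_csv(dao, query, path):
    """A controller action that needs neither a model nor a view:
    run the select and persist the raw rows to a csv."""
    columns, rows = dao.select(query)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
```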

Compare the MVC framework to an ETL pipeline. In each case, there are three components: one dealing with input, one dealing with output, and a central component that acts on the data itself [2].

Fig. 4: We use the MVC framework to isolate the data science part of data science.

 

Just like data scientists and their models, the MVC “model” understands how to manipulate data. Further, just as with data scientists, where that data comes from and what happens to it afterwards is not a concern of the MVC “model.”
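Sketched as code, the mapping described in [2] looks roughly like the skeleton below. The class names are only a rendering of the metaphor (they build on the hypothetical DAO above), not Hoover’s actual interfaces.

```python
class EtlModel:
    """The MVC "model" is the transform: the only piece that needs data science expertise."""
    def transform(self, records):
        raise NotImplementedError  # e.g. build training instances and labels


class EtlView:
    """The MVC "view" is the load: persist or present the transformed data."""
    def render(self, instances):
        raise NotImplementedError  # e.g. write instances to a local cache


class EtlController:
    """The MVC "controller" handles the extract and orchestrates the other two."""
    def __init__(self, dao, model, view):
        self.dao, self.model, self.view = dao, model, view

    def run(self, query):
        _, records = self.dao.select(query)         # extract
        instances = self.model.transform(records)   # transform
        self.view.render(instances)                 # load
```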

 

Section 3

Case Study (Part 2): Hoover—Image Ingestion Pipeline Client

Now that we have an understanding of the MVC framework, let’s resume our case study. In this section we will describe how we use the MVC framework as a metaphor to structure our code.

Recall that our goal is to build an image-ingestion pipeline builder for RoSE that data scientists will be able to maintain and extend to new projects with minimal support. The workflow for creating such a tool is as follows:

  1. Separate the process into configurable components,
  2. Identify the components that require data science expertise, and
  3. Map these components onto the MVC framework.

Fig. 5: Mapping the ETL pipeline steps onto (metaphorical) controller actions.

 

The first and second steps were discussed in the first part of our case study. Here we focus on the third step.

We begin by creating a “controller action” for each step in the pipeline process. To create a dataset, a data scientist will import the controller into their notebook and perform those actions in order.

 

Fig. 6: Hoover can perform all pipeline building tasks in a few simple commands.
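Since Fig. 6 is a screenshot, here is a hedged approximation of what such a notebook session might look like. Every name below (the module, the controller, its actions, and the config classes) is hypothetical rather than Hoover’s actual API.

```python
# Hypothetical notebook session -- names are illustrative, not Hoover's real API.
from hoover import PipelineController
from rose_project.configs import MetadataConfig, ImageConfig, DatasetConfig

controller = PipelineController(project="rose")

controller.fetch_metadata(MetadataConfig(query_path="queries/room_images.sql"))   # step 1
controller.process_images(ImageConfig(target_size=(224, 224)))                    # step 2
controller.build_dataset(DatasetConfig(strategy="pairwise", test_fraction=0.1))   # step 3
train_generator = controller.training_generator(batch_size=64)                    # step 4
```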

 

Note that:

  1. All actions are controlled by configuration classes, and
  2. Models [3] and configuration files are stored in project-specific repos, not in the Hoover package or repository.

These choices ensure that a data scientist can perform an experiment using existing models without doing anything more than importing the package and creating new configs and queries or modifying existing ones.
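As an illustration of the first point, a configuration class versioned in the project repo might be as small as the sketch below; the fields shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ImageConfig:
    """Illustrative project-level configuration; lives in the project repo, not in Hoover."""
    image_root: str = "/mnt/images/rose"        # where pre-processed images are cached locally
    target_size: Tuple[int, int] = (224, 224)   # resize/pad target for the network input
    augment: bool = True                        # apply training-time perturbations
```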

So how does Hoover enable data scientists to self-serve the onboarding of new projects?

The first and last components of Hoover are strictly engineering exercises, while the middle steps require data science expertise to manipulate the data directly. But only certain aspects of these steps require data science decision making. While every project requires a data scientist to, for example, decide which transformations to apply to images, actually moving the images around the network and performing the transformations is as boilerplate as the generator. This means that rather than asking data scientists to create whole new models, we can continue abstracting away technical details. The end result is that a data scientist only has to extend the base models and implement a handful of functions. In our Hoover example, all a data scientist needs to provide are these functions:

Image Processing

  • Input and output formatting 
  • Converting image resource descriptors into network and file paths
  • The transformations to be applied

Dataset Building 

  • Data construction (e.g. constructing image pairs or triples for Siamese networks) and sampling
  • Label construction (e.g. one-hot encoding)

While the Hoover package ships with a few simple models, e.g. a one-hot encoder, these functions are specific to each project and live alongside the modeling code in project repos. This preserves a clean separation of concerns between data science modeling code and boilerplate pipeline maintenance.
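Putting it together, onboarding a new project might look something like the sketch below. The class names, hook signatures, paths, and label set are hypothetical stand-ins for the real ones.

```python
import numpy as np

STYLES = ["modern", "rustic", "cottage/country"]  # illustrative label set


class RoseImageProcessor:  # would extend a hypothetical Hoover base image-processor class
    def resolve_path(self, resource_descriptor):
        # Convert an image resource descriptor from the metadata query into a file path.
        return f"/mnt/images/rose/{resource_descriptor['image_id']}.jpg"

    def transform(self, image_array):
        # Project-specific transformations; copying and batching stay in the base class.
        return image_array.astype("float32") / 255.0


class RoseDatasetBuilder:  # would extend a hypothetical Hoover base dataset-builder class
    def build_instances(self, metadata, test_fraction=0.2, seed=0):
        # Split *images* first so no image leaks into both train and test,
        # then form pairwise comparisons within each split.
        rng = np.random.default_rng(seed)
        image_ids = sorted({m["image_id"] for m in metadata})
        rng.shuffle(image_ids)
        cut = int(len(image_ids) * (1 - test_fraction))
        train_ids, test_ids = set(image_ids[:cut]), set(image_ids[cut:])
        # Pair construction within each split is project-specific and omitted here;
        # the point is that the split happens at the image level, before pairing.
        return train_ids, test_ids

    def build_labels(self, instance):
        # e.g. one-hot encode the winning style for an instance.
        label = np.zeros(len(STYLES), dtype="float32")
        label[STYLES.index(instance["style"])] = 1.0
        return label
```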

 

Conclusion

Data scientists spend too much of their time re-creating boilerplate ETL pipelines by rote. In this blog post we showed how we at Wayfair use the model-view-controller (MVC) architectural design pattern to develop lightweight, Jupyter-friendly tools for data pipeline abstraction. Hoover empowers data scientists to onboard onto existing projects on day one, and to self-serve the creation of new ETL pipelines with minimal engineering support, allowing them to spend more of their time doing data science instead of software engineering. Most importantly, this has shaved weeks (sometimes months!) off the time it takes to onboard new data scientists and begin new data science projects.

 

Endnotes & References

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[2] In fact, consider that one could implement an ETL pipeline in the MVC framework by providing the “controller” a list of things to “extract,” implementing the “transform” using the “models”, and implementing the “load” as a “view.”

[3] Here we mean machine learning models (inference and training code, learned weights, etc.), not MVC models.


