An engineer’s perspective on engineering and data science collaboration for data products


Use data and SQL as the universal language

Data scientists at Coursera operate in R and Python, while engineers write Scala. There are a few viable approaches to bridging this difference when developing data products.

One way is to train data scientists in Scala and engineers in R and Python. This approach is common in smaller organizations where individuals wear many hats.

The pros of this approach are that coordination costs are minimal and flexibility is maximized. Engineers and data scientists jointly define and redefine the collaboration model on an as-needed basis for each data product. But engineers and data scientists with cross-trained skills are hard to find. This strategy also punishes high-performing engineers and data scientists who prefer to focus on their domain of expertise.

A second way is to have data scientists own the model prototyping phase and engineers own the model productionizing phase. This approach is common in larger organizations that can afford to hire for specialized roles such as machine learning engineers.

The advantage of this approach is the efficiency that comes with specialization: domain expertise and industry best practices have emerged around the ML engineering field. However, ownership questions arise because machine learning engineers must interface with both front-end product engineers and data scientists to productionize a data product. Striking the right headcount balance among data scientists, machine learning engineers, and product engineers is another challenge.

A third way is to use data and SQL as the intermediary. In this approach, data is the lingua franca among data scientists and engineers. We’ve had good success with this approach in the past few years.

A benefit of this approach is that SQL + data is a constrained interface that requires minimal training to operate. Data is dumb: it carries no hidden state, assumptions, or nuances, so it is easy to inspect, visualize, and debug using SQL, and easy to collaborate around. Furthermore, this approach tightens the iteration loop, as data scientists can iterate on a model from end to end. We think this approach works for the majority of cases, but we recognize there are scenarios where data is not an ideal interface. The two main scenarios are when we need to encode stateful operations in data, and when precomputation of results is onerous. In practice, we’ve found these scenarios to be infrequent and not first-order concerns.
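To make the "easy to inspect and debug" point concrete, here is a minimal sketch of the kind of sanity check either side could run against a model's output table. It uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the `model_output` table, its columns, and the assumption that scores should fall in [0, 1] are all hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for a warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_output (user_id TEXT, item_id TEXT, score REAL)")
conn.executemany(
    "INSERT INTO model_output VALUES (?, ?, ?)",
    [
        ("u1", "course_a", 0.9),
        ("u1", "course_a", 0.9),   # duplicate (user, item) pair
        ("u2", "course_b", 1.7),   # score outside the assumed [0, 1] range
        ("u3", "course_c", None),  # missing score
    ],
)


def sanity_report(conn):
    """Count common data problems with plain SQL, no model code required."""
    duplicated_pairs = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT user_id, item_id FROM model_output"
        "  GROUP BY user_id, item_id HAVING COUNT(*) > 1"
        ")"
    ).fetchone()[0]
    missing_scores = conn.execute(
        "SELECT COUNT(*) FROM model_output WHERE score IS NULL"
    ).fetchone()[0]
    out_of_range = conn.execute(
        "SELECT COUNT(*) FROM model_output WHERE score < 0 OR score > 1"
    ).fetchone()[0]
    return {
        "duplicated_pairs": duplicated_pairs,
        "missing_scores": missing_scores,
        "out_of_range": out_of_range,
    }


report = sanity_report(conn)
print(report)
```

Every check is an ordinary query over rows, so an engineer can run it without knowing how the model was trained, and a data scientist can run it without reading any service code.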

To use data and SQL as the universal language, we’ve had to build out and democratize our data warehouse, solve the problem of who writes ETLs (answer: everyone), and provide interfaces, libraries, and tools to make the data and SQL ubiquitous across the data science and engineering organizations.

“Based on your recent activity” recommendations module

An example of engineering and data science collaborating at the data boundary is our recommendations module infrastructure. It is a system that produces recommendations at various degrees of personalization. Recommendation modules range from fully personalized to the user (e.g., “Based on your recent activity”), through partially personalized (e.g., “Because you viewed Machine Learning”), to generic cold-start recommendations.

Algorithms generating the recommendations range from matrix factorization to regression to rule-based queries. Whatever the algorithm, data is an effective encapsulation: a combination of results, scores, and metadata serves as the internal API. It has the characteristics of a good API: it’s easy for engineers to consume, easy for data scientists to produce, and sufficiently powerful for our use cases.
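As a rough illustration of "results, scores, and metadata" acting as an internal API, the sketch below shows one producer function and one consumer function that share nothing but a table. It is not Coursera's actual schema or code: the `recommendations` table, its columns, the module names, and the helpers `publish` and `top_items` are hypothetical, and sqlite3 again stands in for the warehouse.

```python
import json
import sqlite3

# Hypothetical schema: every name here is illustrative only.
SCHEMA = """
CREATE TABLE recommendations (
    user_id  TEXT NOT NULL,
    module   TEXT NOT NULL,  -- e.g. 'recent_activity'
    item_id  TEXT NOT NULL,  -- the result
    score    REAL NOT NULL,  -- the ranking signal
    metadata TEXT            -- JSON, e.g. which algorithm produced the row
);
"""


def publish(conn, rows):
    """Producer side: a data scientist writes model output as plain rows."""
    conn.executemany(
        "INSERT INTO recommendations VALUES (?, ?, ?, ?, ?)",
        [(u, m, i, s, json.dumps(meta)) for u, m, i, s, meta in rows],
    )


def top_items(conn, user_id, module, limit=2):
    """Consumer side: an engineer reads the highest-scoring items with SQL."""
    cursor = conn.execute(
        "SELECT item_id FROM recommendations "
        "WHERE user_id = ? AND module = ? "
        "ORDER BY score DESC, item_id LIMIT ?",
        (user_id, module, limit),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
publish(conn, [
    ("u1", "recent_activity", "course_a", 0.91, {"algo": "matrix_factorization"}),
    ("u1", "recent_activity", "course_b", 0.42, {"algo": "matrix_factorization"}),
    ("u1", "recent_activity", "course_c", 0.77, {"algo": "matrix_factorization"}),
    ("u1", "cold_start", "course_d", 0.50, {"algo": "rule_based"}),
])
print(top_items(conn, "u1", "recent_activity"))
```

The consumer never needs to know whether a row came from matrix factorization or a rule-based query; swapping the algorithm only changes what the producer writes, not how the rows are read.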

Using data and SQL as the universal language results in:

  • Clear boundaries of focus between engineers as data consumers and data scientists as data producers
  • An understandable and debuggable interface
  • A common language between data scientists and engineers when collaborating on shared concerns


