Empowering Data Science with Data Engineering Education

Data engineering education enables data scientists to better interface with engineering and ensures higher data quality.

Authors: Michelle Du, Yu Guo, Jeff Feng

Since launching Data University (Data U) at Airbnb two years ago, our data education program has made significant progress towards our overarching vision of empowering every employee to make data-informed decisions. To date, 55 volunteer faculty members from across the Data Science & Engineering organizations have taught over 400 courses to thousands of participants. The content taught in these courses has helped provide the foundation for data-informed decision making to countless employees across the organization.

While Data University has had a significant impact on the data skills of our workforce and is establishing a culture rooted in data, we have identified the need for team-specific trainings, in addition to the Data U offerings, that address the unique needs of specific functions and business units. Last month, we shared our ‘Data U Intensive’ trainings with our Experiences Business Unit and the Public Policy Team. In this post, we share what we learned from up-leveling our Data Science Team with data engineering skills to increase their overall effectiveness. The program has received great feedback internally, and we would like to pass those learnings on to the broader community.

Why Equip Data Scientists with Engineering Education?

Data scientists working within the consumer tech industry are operating in an environment where mobile / web products and data infrastructure are becoming increasingly sophisticated. As a result, it has become almost impossible to perform the data science function effectively without being versed in certain aspects of engineering.

The Data Science Team at Airbnb believes that equipping the team with the engineering perspective is tremendously beneficial to producing high-quality data, as well as improving how we log and run experiments. The chart below on the Data Science Hierarchy of Needs shows how important a solid data engineering foundation is for building out effective metrics, analytics, experimentation, and machine learning systems.

Figure 1: The pyramid of data needs illustrated by Monica Rogati

The solution we have introduced is the Engineering Empowered Data Science (EEDS) program. The goal of the program is to equip data scientists with engineering knowledge that is specific to data system design, data quality improvement, and productivity. We designed the EEDS program hoping to address the challenges below:

1. Knowledge of the Airbnb data platform is compartmentalized. We had limited end-to-end documentation or training on how data is generated, stored, and computed across the platform:

  • The lack of context for data scientists makes it challenging to communicate feedback when incidents occur or when sharing feature requests about the data platform
  • Gaps in engineering knowledge often result in inefficient processing of data, improper use of computation resources, as well as problematic analysis or model results
  • The incomplete understanding of data means a significant amount of time is spent on ad hoc investigation by data scientists

2. New data scientists generally do not possess sufficient knowledge of Airbnb’s engineering system to foresee and prevent upstream issues in logging, even though they may have lots of experience with downstream data issues.

3. Documentation and tutorials for internal packages and tools are sometimes limited or siloed. We have created a lot of leverage by automating common tasks in internal packages such as Airpy (a Python toolkit for accessing, extracting, manipulating, and plotting data from Airbnb data sources) and Rbnb (a collection of R functions and R packages that are essential for practicing data science at Airbnb). However, tips for creating user-defined functions (UDFs), coding best practices, and creating a quick demo site for a data product could benefit a wider audience.

4. There are no third-party educational resources that fit Airbnb’s unique ecosystem that we could directly leverage for data engineering and data science education.

We believe that these challenges are broadly applicable in tech companies, and we feel strongly that the solutions and learnings we have through the EEDS program should be shared with the broader data science community.

Learning Objectives for the EEDS Program

With those challenges in mind, we built the Engineering Empowered Data Science program to up-level our data scientists’ skills with essential engineering knowledge. The class spans two full days and has several learning goals:

  • Empower data scientists with a deeper understanding of the entire data system, to better collaborate with engineers when diagnosing experiment results and implementing data products.
  • Equip data scientists to leverage modern logging infrastructure and to access the logged data using simple SQL queries.
  • Disseminate best practices in automation, data products, and ML: ensure data scientists understand how to write simple, readable, performant, and maintainable code to increase productivity and algorithm code quality.

EEDS Format & Content

The two-day program consists of a series of 30–60 minute sessions in the format of lectures and workshops on Airbnb’s engineering platforms, design principles, and best practices for data science. Data science attendees gain a deep understanding of Airbnb’s data system, how data is generated/stored/computed/monitored, as well as internal productivity tooling tips.

The faculty for the program consists of:

  • Engineers and Technical PMs who are experts in data infrastructure, end-to-end data systems, logging, compute frameworks, monitoring, anomaly detection and alerting systems. They mostly come from our Infrastructure Engineering Team.
  • Data scientists who have built a solid understanding of, and practical use cases for, data infrastructure, data products, machine learning, or automation tools.

The current curriculum covers 7 core areas:

  • Data Systems at Airbnb is a crash course on Airbnb’s data infrastructure, covering considerations for data scientists. This helps data scientists obtain a better understanding of how data flows and where the potential break points are. It also helps prevent data pipelines from breaking and improves communication of issues across the data platform.
  • Compute Frameworks is a session on compute observability and Hive best practices. This helps the data scientists understand computation resource allocation, as well as the dos and don’ts when writing queries in general at Airbnb.
  • Logging Best Practices is a session on logging best practices & schema design, followed by in-depth demonstrations and a hands-on workshop. This helps to ensure the quality of data from its generation and reduces the ETL effort needed downstream.
  • Experimentation is an overview of the Experiment Reporting Framework (our experimentation platform), and how to use it to monitor experiments. This helps data scientists understand all the features supported by the framework as well as how to deliver a well-designed experiment that produces reliable experiment results to better support product development.
  • Anomaly Detection is an overview of the anomaly detection framework. This helps data scientists understand how to use our in-house anomaly detection framework to build consistent and scalable solutions for monitoring anomaly events, allowing users to take early action when such events happen.
  • Productivity Workshop is a hands-on workshop on helpful tool tips. One example is how to create your own Hive user-defined functions (UDFs) in either Java or Python. This helps improve the efficiency of MapReduce jobs derived from SQL.
  • ML Tooling Workshop is a series of hands-on workshops on building ML models using Airbnb’s ML platform called Bighead. Knowledge of how to leverage the platform and tools effectively and how to avoid common pitfalls helps to empower data scientists with greater productivity and success when building ML models.
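
To give a flavor of the Productivity Workshop material, here is a minimal sketch of one common way to write a Python "UDF" for Hive: a streaming script that Hive calls through its TRANSFORM clause. This is an illustration only, not the workshop's actual content. The columns (user_id, email), the helper names, and the email-domain logic are all hypothetical.

```python
"""Sketch of a Hive streaming script (hypothetical example, not Airbnb code).

Assumed Hive usage, with a made-up table and file name:
    ADD FILE email_domain.py;
    SELECT TRANSFORM(user_id, email)
           USING 'python email_domain.py'
           AS (user_id, email_domain)
    FROM users;
"""
import sys
from typing import Iterable, Iterator, TextIO

# Hive streaming passes NULL values as the literal two-character string \N.
HIVE_NULL = r"\N"


def normalize_email_domain(email: str) -> str:
    """Return the lower-cased domain of an email, or HIVE_NULL if malformed."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return HIVE_NULL
    return domain.lower()


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Map tab-separated (user_id, email) rows to (user_id, email_domain)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that do not have exactly two columns
        user_id, email = fields
        yield f"{user_id}\t{normalize_email_domain(email)}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Entry point for Hive: read rows from stdin, write rows to stdout."""
    for row in transform(stdin):
        stdout.write(row + "\n")


# In the deployed script, Hive feeds rows on stdin, so the file would end with:
#     if __name__ == "__main__":
#         main()

# Local demo on in-memory rows, no Hive needed.
sample_rows = ["1\tann@Example.com\n", "2\tnot-an-email\n"]
for out_row in transform(sample_rows):
    print(out_row)
```

Keeping the row logic in a plain function such as transform means it can be unit-tested locally before the script is ever attached to a Hive query.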

The Data Systems at Airbnb and Logging Best Practices courses are the foundation for the entire curriculum. They are tremendously helpful as they establish the foundational understanding of Airbnb’s data ecosystem (Figure 2 below) and how data scientists can help improve data quality with the best practices in mind.

Figure 2: A high-level illustration of the Data Infrastructure at Airbnb


As of today, over 50 Airbnb data scientists have participated in the training and we have received great feedback on the EEDS program. Over 90% of the students felt it was a highly impactful use of their time and that they have learnt something new and helpful for their day-to-day work.

The EEDS program has also created significant leverage for Airbnb. Teaching best practices in data and engineering has helped us ensure high data quality with upfront design, thus helping us avoid fixing problems down the line. Additionally, the program provides critical knowledge that enables data science to effectively collaborate with engineering. Last but not least, the investment we have made in employee growth and continuing education provides our data scientists with valuable learnings that will stay with them throughout their career.

Airbnb has a tradition of investing in employees and providing learning for everyone. The EEDS program is targeted at data scientists, and it connects well with our broader data education initiative, Data University. It also enriches our data science onboarding program by delving deeper into the technical aspects of the role after new hires have had a chance to sample Airbnb’s data and tools. Attendees of the course have shared that the material has helped them build more empathy with other functions such as data engineering, data infrastructure, and product engineering, as well as foster better cross-functional collaboration.

We hope our learnings can help other organizations scale their internal education efforts to empower data scientists to do even greater work within a tech environment. The curriculum that we share here provides learning recommendations not only for data scientists who are already in the industry, but also for those who are hoping to join. Understanding the additional skills needed to work as a data scientist in the tech industry, compared with more traditional fields such as statistics and research, will help bridge the gap and encourage greater diversity in our field. Going forward, EEDS is part of the regular offerings of Data University at Airbnb, and we will continue to iterate on the content as our platform and tools evolve.


We would like to offer special thanks to Elena Grewal and Ricardo Bion for their tremendous support for the program; Gurer Kiratli, Lauren Chircus, Gabe Lyons, Reid Andersen, Alice Beard, Varant Zanoyan, Alfredo Luque, Andrew Hoh, Tingting Ma, Jian Chen, Hao Wang, Mihajlo Grbovic, Pai Liu, Cindy Chen, David Dolphin, and Jingjing Duan for being part of our awesome faculty team and for coming up with and continuously refining our course contents; Ellen Huynh for helping organize the program and for her useful feedback; and the entire Data Science team for the support and feedback. We’d also like to thank Navin Sivanandam, Xiaohan Zeng, and Jamie Stober for their kind help in proofreading!
