Data Quality at Airbnb. Part 2 — A New Gold Standard | by Vaughn Quoss | Airbnb Engineering & Data Science | Nov, 2020

This certification process is followed on a project-by-project basis for individual data models, which comprise a set of data tables and metrics that correspond to a specific business concept or project feature. Example data models at Airbnb cover subjects such as Active Listings, Customer Service Tickets, and Guest Growth Accounting. While there is no perfect set of criteria to define the boundaries of a given data model, aggregating our data tables, pipelines, and metrics at this level of abstraction allows us to more effectively organize, architect, and maintain our offline data warehouse.

While this post won’t describe each step of the certification process in detail, the following sections provide an overview of the most important components of the process.

Broad Stakeholder Input

The certification process is set up to encourage participation from stakeholders across all teams that consume Midas data models. A major goal of certification is ensuring the data models we build meet the data needs of users across the company, rather than just the needs of the team building the model. The certification process gives data consumers from every team the option to sign on as reviewers of new data model designs, and we have found that small requests or feedback early in the design process save substantial time by reducing the need for future revisions.

Prior to Midas, these cross-functional, cross-team partnerships were often difficult to form organically. The formal structure provided by a certification process helps streamline collaboration on data design across the company.

Design Specs

The contents of a design spec are best illustrated with examples. The following figures depict condensed and simplified examples from the design spec for Airbnb’s Active Listings data model.

The spec opens with a description of individual and team data model owners, as well as the relevant design reviewers.

Fig 5: Owners and reviewers are formalized in the heading for each Midas design spec.

The first section of the spec describes the headline metrics included in the data model, along with plain-text business definitions and specific details relevant to interpreting the metrics.

Fig 6: An example Metric Definitions section from a Midas design spec.
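A metric definition of this kind can be thought of as structured data: a name, a plain-text business definition, and the caveats needed to interpret it. The following is a minimal sketch in Python; the `MetricDefinition` class, field names, and the example values are hypothetical illustrations, not Airbnb's actual spec format.

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    """One headline metric, as it might appear in a design spec."""
    name: str
    definition: str                 # plain-text business definition
    unit: str
    caveats: list = field(default_factory=list)  # notes on interpreting the metric

# Hypothetical entry in the spirit of the Active Listings spec.
active_listings = MetricDefinition(
    name="active_listings",
    definition=("Count of listings that are bookable on the platform "
                "as of the snapshot date."),
    unit="listings",
    caveats=["Excludes listings suspended or deactivated by their host."],
)
```

Capturing definitions as structured records rather than free text makes it possible to render them into documentation and to cross-check them against what is implemented downstream.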

The following section provides a summary of the pipeline used to build the data tables included in the model. This summary includes a simple diagram of input and output tables, an overview of the pipeline's SLA criteria, context on how to backfill historical data, and a short disaster-recovery playbook.

Fig 7: Example pipeline overview section from a Midas design spec.
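The pipeline-overview section can likewise be captured as data: inputs, outputs, an SLA, and backfill and recovery notes. The sketch below is a hypothetical, condensed representation (the table names, SLA value, and `sla_met` helper are illustrative assumptions, not Airbnb's actual tooling):

```python
from datetime import timedelta

# Hypothetical, condensed pipeline summary mirroring the spec's
# pipeline-overview section.
pipeline_spec = {
    "name": "active_listings_daily",
    "inputs": ["core_data.listings", "core_data.reservations"],
    "outputs": ["metrics.active_listings"],
    "sla": timedelta(hours=12),  # output must land within 12h of the partition close
    "backfill": "Re-run the pipeline per date partition; outputs are idempotent.",
    "disaster_recovery": "Restore from the last verified partition, then backfill forward.",
}

def sla_met(landed_after: timedelta, spec: dict) -> bool:
    """Check whether an output partition landed within the spec's SLA."""
    return landed_after <= spec["sla"]
```

A machine-readable summary like this can feed monitoring directly, so the SLA documented in the spec and the SLA that pages an on-call engineer never drift apart.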

The overview of the data pipeline is followed by documentation for the table schemas that will be built.

Fig 8: Example table schema details from a Midas design spec.
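One way to keep schema documentation honest is to require a type and a plain-text description for every column, and to check that requirement mechanically. A minimal sketch, with hypothetical table and column names:

```python
# Hypothetical schema entry for one output table, in the spirit of the
# spec's table-schema section: every column carries a type and a
# plain-text description.
schema = {
    "table": "metrics.active_listings",
    "columns": [
        {"name": "ds",         "type": "DATE",    "description": "Snapshot date (partition key)."},
        {"name": "listing_id", "type": "BIGINT",  "description": "Unique listing identifier."},
        {"name": "is_active",  "type": "BOOLEAN", "description": "Whether the listing is bookable on ds."},
    ],
}

def undocumented_columns(schema: dict) -> list:
    """Return the names of columns whose description is missing or empty."""
    return [c["name"] for c in schema["columns"] if not c.get("description")]
```

A check like `undocumented_columns` could run in review tooling so a spec cannot be approved with bare, undescribed columns.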

Finally, the spec provides an overview of the data quality checks that will be built into the data model’s pipeline for validation (as discussed further below).

Fig 9: Example data quality checks section from a Midas design spec.

The examples above cover the main design spec sections, but are shown in substantially condensed and simplified form. In reality, descriptions of metric and pipeline details are much longer, and some of the more complex design specs exceed 20 pages in length. While this level of documentation requires a large upfront time investment, it ensures data is architected correctly, provides a vehicle for design input from multiple stakeholders, and reduces dependency on the specialized knowledge of a handful of data experts.

Data Validation

Certified data models are validated in two ways:

  1. Automated checks are built into the data pipeline by a Data Engineer and described in the design spec. These checks are required for certified data, and cover basic sanity checks, definitional testing, and anomaly detection on new data generated by the pipeline.
  2. One-off validation checks against historical data are run by a Data Scientist and documented in a separate validation report. That report summarizes the checks performed, and links to shared data workbooks (e.g., Jupyter notebooks) with code and queries that can be used to re-run the validation whenever a data model is updated. This work covers checks that cannot be easily automated in the data pipeline, including more detailed anomaly detection on historical time series, and comparisons against existing data sources or metrics expected to be consistent with the new data model.
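The automated checks described above can be sketched as simple predicates over a new partition's rows. The functions below are a hypothetical illustration of the three categories named in the spec (sanity checks, key integrity, and naive count-based anomaly detection); the names, signatures, and the 25% tolerance are assumptions, not Airbnb's actual check framework.

```python
def check_not_empty(rows: list) -> bool:
    """Sanity check: the new partition must contain data."""
    return len(rows) > 0

def check_no_null_keys(rows: list, key: str) -> bool:
    """Sanity check: the primary key must never be NULL."""
    return all(r.get(key) is not None for r in rows)

def check_within_band(today_count: int, history: list, tolerance: float = 0.25) -> bool:
    """Naive anomaly detection: today's row count must fall within
    +/- tolerance of the trailing historical mean."""
    mean = sum(history) / len(history)
    return abs(today_count - mean) <= tolerance * mean
```

In a real pipeline, checks like these would run after each partition lands and fail the run (or page an owner) before bad data propagates downstream.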

As with the design specs, this level of validation and documentation requires a large upfront investment, but it substantially reduces data inaccuracies and future bug reports, and makes it easy to refresh the validation as the data model evolves.

Certification Reviews

There are four distinct reviews in the Midas process:

  1. Spec Review: Review the proposed design spec for the data model, before implementation begins.
  2. Data Review: Review the pipeline’s data quality checks and validation report.
  3. Code Review: Review the code used to generate the data pipeline.
  4. Minerva Review: Review the source of truth metric definitions implemented in Minerva, Airbnb’s metrics service.

Collectively, these reviews cover engineering practices and data accuracy across all data assets, and ensure certified data models meet the Midas promise: a gold standard for end-to-end data quality.

Bugs and Change Requests

Midas certification does not come without challenges. In particular, quality takes time. Requirements for documentation, reviews, and input from a broad set of stakeholders mean building a data model to Midas standards is much slower than building uncertified data. Re-architecting data models at scale also requires substantial staffing from data and analytics engineering experts (we’re hiring!), and entails costs for teams to migrate to the new data sources.

Offline data is a key technology asset for Airbnb, and this investment is warranted. Certified data models serve as the shared foundation for all data applications, spanning business reporting, product analytics, experimentation, and machine learning and AI. Investing in data quality improves the value of each of these applications, and will improve data-informed decisions at Airbnb for years to come.
