This certification process is followed on a project-by-project basis for individual data models, which comprise a set of data tables and metrics that correspond to a specific business concept or project feature. Example data models at Airbnb cover subjects such as Active Listings, Customer Service Tickets, and Guest Growth Accounting. While there is no perfect set of criteria to define the boundaries of a given data model, aggregating our data tables, pipelines, and metrics at this level of abstraction allows us to more effectively organize, architect, and maintain our offline data warehouse.
While this post won’t describe each step of the certification process in detail, the following sections provide an overview of the most important components of the process.
Broad Stakeholder Input
An important feature of the process is the cross-functional partnerships it formalizes. Every Midas model requires both a Data Engineering owner and a Data Science owner, who share responsibility for the data model design and provide expert input from their respective functions. Cross-functional input is pivotal to ensuring certification can address the full scope of data quality dimensions, which span technical implementation concerns as well as requirements for effective business usage and downstream data applications.
Furthermore, the process is set up to encourage participation from stakeholders across all teams that consume Midas models. A major goal of certification is ensuring the data models we build meet the data needs of users across the company, rather than just the needs of the team building the model. The certification process gives data consumers from every team the option to sign on as reviewers of new data model designs, and we have found that small requests or feedback early in the design process save substantial time by reducing the need for future revisions.
Prior to Midas, these cross-functional, cross-team partnerships were often difficult to form organically. The formal structure provided by a certification process helps streamline collaboration on data design across the company.
Design Specs
The first step in the Midas process is writing a design spec, which serves as both a technical contract describing the pipeline, tables, and metrics that will be built, as well as the primary ongoing documentation for the data model. Design specs follow a shared template with standardized sub-sections. Collectively, these specs form a library of documentation for Airbnb’s offline data assets. This documentation represents a high-value deliverable, as it reduces dependency on data producers’ specialized knowledge, eases future iteration on existing data models, and simplifies transition of data assets between owners.
The contents of a design spec are best illustrated with examples. The following figures depict condensed and simplified examples from the design spec for Airbnb’s Active Listings data model.
The spec opens with a description of individual and team data model owners, as well as the relevant design reviewers.
The first section of the spec describes the headline metrics included in the data model, along with plain-text business definitions and specific details relevant to interpreting the metrics.
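To make this concrete, a headline-metric entry might pair a name with its plain-text business definition and interpretation caveats. The sketch below is purely illustrative: the field names, owner handles, and rendering helper are hypothetical, not the actual Midas spec format.

```python
# Hypothetical sketch of one headline-metric entry from a design spec.
# All field names and values here are illustrative, not Airbnb's format.
active_listings_metric = {
    "name": "active_listings",
    "definition": (
        "Count of listings that are bookable on the platform "
        "as of the end of the reporting date."
    ),
    "interpretation_notes": [
        "Snapshot metric: computed per date, not cumulative.",
        "Excludes listings suspended or delisted as of the snapshot date.",
    ],
    "owners": {"data_engineering": "jdoe", "data_science": "asmith"},
}

def render_metric_doc(metric: dict) -> str:
    """Render a metric entry as a plain-text documentation snippet."""
    notes = "\n".join(f"  - {n}" for n in metric["interpretation_notes"])
    return f"{metric['name']}: {metric['definition']}\nNotes:\n{notes}"

print(render_metric_doc(active_listings_metric))
```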
The following section provides a summary of the pipeline used to build the data tables included in the model. This summary includes a simple diagram of input and output tables, an overview of pipeline SLA criteria, context on how to backfill historical data, and a short disaster recovery playbook.
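As a minimal sketch of how backfill instructions might be operationalized, the helper below splits a historical date range into bounded daily-partition batches so a backfill can be run in chunks. This is an assumption about one reasonable approach, not the actual Midas tooling.

```python
from datetime import date, timedelta

def backfill_partitions(start: date, end: date, batch_days: int = 30):
    """Yield (batch_start, batch_end) date ranges covering [start, end],
    so a historical backfill can be run in bounded chunks rather than
    one monolithic job."""
    cur = start
    while cur <= end:
        batch_end = min(cur + timedelta(days=batch_days - 1), end)
        yield cur, batch_end
        cur = batch_end + timedelta(days=1)

# Example: backfill two months of daily partitions in 30-day batches.
batches = list(backfill_partitions(date(2021, 1, 1), date(2021, 2, 28)))
```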
The overview of the data pipeline is followed by documentation for the table schemas that will be built.
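A schema section could be sketched roughly as follows, with each table listing its partition key and per-column descriptions. The table and column names below are hypothetical stand-ins, not the real Active Listings schema.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    description: str

@dataclass
class TableSchema:
    table: str
    partition_key: str
    columns: list = field(default_factory=list)

# Hypothetical schema entry; table and column names are illustrative only.
dim_active_listings = TableSchema(
    table="core_data.dim_active_listings",
    partition_key="ds",
    columns=[
        Column("listing_id", "BIGINT", "Unique identifier for the listing."),
        Column("is_active", "BOOLEAN", "Whether the listing is bookable on ds."),
        Column("ds", "STRING", "Partition date (YYYY-MM-DD)."),
    ],
)
```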
Finally, the spec provides an overview of the data quality checks that will be built into the data model’s pipeline for validation (as discussed further below).
The examples above cover the main design spec sections, but are shown in substantially condensed and simplified form. In reality, descriptions of metric and pipeline details are much longer, and some of the more complex design specs exceed 20 pages in length. While this level of documentation requires a large upfront time investment, it ensures data is architected correctly, provides a vehicle for design input from multiple stakeholders, and reduces dependency on the specialized knowledge of a handful of data experts.
Data Validation
After a design spec has been written and the data pipeline built, the resulting data needs to be validated. Validation relies on two groups of data quality checks:
- Automated checks are built into the data pipeline by a Data Engineer, and described in the design spec. These checks are required for certified data, and cover basic sanity checks, definitional testing, and anomaly detection on new data generated by the pipeline.
- One-off validation checks against historical data are run by a Data Scientist and documented in a separate validation report. That report summarizes the checks performed, and links to shared data workbooks (e.g., Jupyter notebooks) with code and queries that can be used to re-run the validation whenever a data model is updated. This work covers checks that cannot be easily automated in the data pipeline, including more detailed anomaly detection on historical time series, and comparisons against existing data sources or metrics expected to be consistent with the new data model.
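The automated checks described above can be sketched in a few lines. The example below shows a basic sanity check and a simple row-count anomaly check against a trailing baseline, assuming rows arrive as Python dicts; it is a minimal illustration, not Airbnb's actual check framework.

```python
def check_no_nulls(rows, column):
    """Sanity check: fail if any row has a NULL in a required column."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls == 0

def check_row_count_anomaly(today_count, trailing_counts, tolerance=0.25):
    """Anomaly check: today's row count must fall within `tolerance`
    (as a fraction) of the trailing average."""
    baseline = sum(trailing_counts) / len(trailing_counts)
    return abs(today_count - baseline) <= tolerance * baseline

# Toy partition of two rows passes both checks.
rows = [{"listing_id": 1, "is_active": True},
        {"listing_id": 2, "is_active": False}]
assert check_no_nulls(rows, "listing_id")
assert check_row_count_anomaly(102, [95, 100, 105])
```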
As with the design specs, this level of validation and documentation requires a larger upfront investment, but substantially reduces data inaccuracies and future bug reports, and makes refreshing the validation easy when the data model evolves in the future.
Certification Reviews
Certification reviews are a major component of the Midas process. These third-party reviews are performed by recognized data experts at the company, who are designated as either Data Architects or Metrics Architects. By performing Midas reviews, architects serve as gatekeepers of the company’s data quality.
There are four distinct reviews in the Midas process:
- Spec Review: Review the proposed design spec for the data model, before implementation begins.
- Data Review: Review the pipeline’s data quality checks and validation report.
- Code Review: Review the code used to generate the data pipeline.
- Minerva Review: Review the source of truth metric definitions implemented in Minerva, Airbnb’s metrics service.
Collectively, these reviews cover engineering practices and data accuracy across all data assets, and ensure certified data models meet the Midas promise: a gold standard for end-to-end data quality.
Bugs and Change Requests
Lastly, though not part of the initial pipeline development process, the Midas initiative has improved our ability to manage offline data bugs and change requests. Organizing offline data into discrete data models and clarifying ownership allowed us to formalize company-wide processes for addressing requests from data consumers. Employees can now file tickets for bugs and change requests through a simple form, a workflow that was not feasible before ownership was formalized.
The Midas initiative has allowed us to define a comprehensive standard for data quality shared across the company. Midas-certified data assets are guaranteed to be accurate, reliable, and cost-efficient, with consistent operational support, and backed by detailed user documentation. As the size of the company and our data warehouse continue to grow at rapid pace, the certification process ensures we are able to provide data consumers with a consistent guarantee for data quality at scale.
Midas certification does not come without challenges. In particular, quality takes time. Requirements for documentation, reviews, and input from a broad set of stakeholders mean building a data model to Midas standards is much slower than building uncertified data. Re-architecting data models at scale also requires substantial staffing from data and analytics engineering experts (we’re hiring!), and entails costs for teams to migrate to the new data sources.
Offline data is a key technology asset for Airbnb, and this investment is warranted. Certified data models serve as the shared foundation for all data applications, spanning business reporting, product analytics, experimentation, and machine learning and AI. Investing in data quality improves the value of each of these applications, and will improve data-informed decisions at Airbnb for years to come.