Note that this configuration communicates the following to Data Sentinel:
- The intent to validate the values of the dataset fields employee_id, email_address, and age.
- A command to perform one or more corresponding data checks for each field.
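The intent above can be restated as a minimal Python sketch. The field names come from the text, but the check names and the shape of the structure are illustrative assumptions, not Data Sentinel's actual configuration format:

```python
# Hypothetical validation configuration: each entry names a dataset field
# and the one or more data checks to run against it.
# The check names here are illustrative, not Data Sentinel's real checks.
validation_config = {
    "employee_id":   ["assert_unique", "assert_not_null"],
    "email_address": ["assert_not_null", "assert_matches_pattern"],
    "age":           ["assert_in_range"],
}

# The configuration expresses, per field, which checks to perform.
for field, checks in validation_config.items():
    print(f"{field}: {len(checks)} check(s)")
```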
Given the configuration and dataset, Data Sentinel executes the corresponding data validation job. First, Data Sentinel loads a subset of the dataset, containing only the specified fields to be validated, into a special Apache Spark data structure whose contents are distributed across the underlying Spark cluster. Next, Data Sentinel parses the configuration and generates an execution plan: a sequence of optimized SQL queries. With these queries, Data Sentinel scans the dataset to mine two bodies of knowledge:
- Dataset profile: Statistical summaries of the dataset fields specified in the configuration
- Validation report: Results of data checks specified in the configuration
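To make the planning step concrete, here is a minimal sketch of how a validator might turn a list of fields into a single profiling SQL query. The query shape and statistics chosen are assumptions for illustration, not Data Sentinel's actual execution plan:

```python
def build_profile_query(table: str, fields: list) -> str:
    """Build one SQL query that profiles every specified field in a
    single scan: total row count, plus non-null and distinct counts
    per field. Profiling only the specified fields mirrors loading
    only the subset of columns to be validated."""
    exprs = ["COUNT(*) AS row_count"]
    for f in fields:
        exprs.append(f"COUNT({f}) AS {f}_non_null")
        exprs.append(f"COUNT(DISTINCT {f}) AS {f}_distinct")
    return f"SELECT {', '.join(exprs)} FROM {table}"

# Hypothetical table and field names, matching the example in the text.
query = build_profile_query(
    "employees", ["employee_id", "email_address", "age"]
)
```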
For efficiency, Data Sentinel first computes the dataset profile, then uses the profile's statistical summaries to compute the validation report.
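This profile-then-report ordering can be illustrated with a small sketch: once per-field statistics exist, checks such as completeness and uniqueness become cheap lookups against the profile rather than fresh scans of the data. The statistic and check names here are illustrative:

```python
def compute_profile(rows, fields):
    """One pass over the data: per-field row, non-null, and distinct counts."""
    profile = {f: {"count": 0, "non_null": 0, "distinct": set()} for f in fields}
    for row in rows:
        for f in fields:
            value = row.get(f)
            profile[f]["count"] += 1
            if value is not None:
                profile[f]["non_null"] += 1
                profile[f]["distinct"].add(value)
    return profile

def validation_report(profile):
    """Derive check results from the profile alone, with no second scan."""
    report = {}
    for f, stats in profile.items():
        report[f] = {
            "complete": stats["non_null"] == stats["count"],
            "unique": len(stats["distinct"]) == stats["non_null"],
        }
    return report

# Toy data: employee_id has a duplicate; email_address has a null.
rows = [
    {"employee_id": 1, "email_address": "a@x.com"},
    {"employee_id": 2, "email_address": "a@x.com"},
    {"employee_id": 2, "email_address": None},
]
profile = compute_profile(rows, ["employee_id", "email_address"])
report = validation_report(profile)
```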
After Data Sentinel computes the dataset profile and validation report, users have several options for next steps. They can simply examine the contents of the profile and report. If one or more of the specified data checks fail, users can instruct Data Sentinel to block the dataset from flowing further downstream in the workflow of jobs. Alternatively, other programs and software systems can process the profile and report further.
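The blocking behavior described above amounts to a gate between workflow stages. The following sketch shows one way such a gate could work; the function and exception names are hypothetical, not Data Sentinel's API:

```python
class DataQualityError(Exception):
    """Raised to stop a dataset from flowing to downstream jobs."""

def gate_downstream(report):
    """Let the dataset flow downstream only if every data check in
    the validation report passed; otherwise block the workflow by
    raising an error that lists the failed checks."""
    failed = [
        (field, check)
        for field, checks in report.items()
        for check, passed in checks.items()
        if not passed
    ]
    if failed:
        raise DataQualityError(f"Dataset blocked; failed checks: {failed}")
    return report

# A report with one failed check blocks downstream processing.
report = {"employee_id": {"unique": True}, "age": {"in_range": False}}
try:
    gate_downstream(report)
except DataQualityError as err:
    print(err)
```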
Data Sentinel adoption at LinkedIn
With data mining and software engineering techniques, Data Sentinel has identified bugs in development and production workflows by flagging poor-quality data, and has prevented software systems at LinkedIn from consuming bad data. It has also caught more insidious issues, such as data skew and duplicate examples in datasets, which can degrade data analytics and statistical machine learning models.
Some success stories from our teams include:
- Leveraging Data Sentinel to discover duplicated work anniversary data and duplicated primary keys in organization data.
- Helping a team that works with member and jobs data discover duplicated data, then intervene to prevent the corrupted data from being pushed to a database.
- Helping a team that works with recruiter data discover duplicate records, preventing biased machine learning models from being learned from the corrupted data.
Following its widespread success and added value at LinkedIn, Data Sentinel continues to undergo active development to expand its capabilities. These exciting efforts include the following:
- Implementing more data mining methods based on AI, statistics, and machine learning to perform data checks
- Discovering and recommending data checks for users
- Validating data in an online streaming fashion (as opposed to the current offline batch-processing approach)
- Leveraging self-driving database techniques, as referenced in this paper, to improve the performance of data validation jobs
At LinkedIn, we are excited to continue pushing the frontiers of data mining, data management, and software engineering to address data quality problems. We hope that Data Sentinel will not only raise awareness around the importance of data quality, but also inspire concepts of “testing coverage” and health metrics for datasets to be incorporated into software engineering and big data analytics.