When I started my journey at LinkedIn ten years ago, the company was just beginning to experience extreme growth in the volume, variety, and velocity of our data. Over the next few years, my colleagues and I in LinkedIn’s data infrastructure team built out foundational technology like Espresso, Databus, and Kafka, among others, to ensure that LinkedIn would survive and thrive through the next wave of growth. A few years later, I became the tech lead for what was then a pretty small “data analytics infrastructure” team that ran and supported LinkedIn’s Hadoop usage, and also maintained a hybrid data warehouse spanning Hadoop and Teradata.
One of the first things I noticed was how often people were asking around for the “right dataset” to use for their analysis. It made me realize that, while we had built highly scalable specialized data storage, streaming capabilities, and cost-efficient batch computation capabilities, we were still wasting time simply finding the right dataset to analyze.
Data discovery: One problem, many solutions
Fast forward to today and we’re living in the golden age of data. When a data scientist joins a data-driven company, they expect to find a data discovery tool (i.e., data catalog) that they can use to figure out which datasets exist at the company, and how they can use these datasets to test new hypotheses and generate new insights. Most data scientists don’t really care about how this tool actually works under the hood, as long as it enables them to be productive.
In fact, there are numerous data discovery solutions available: a combination of proprietary software available for purchase, open source software contributed by a particular company, and software built in-house. In the past few years, LinkedIn, Airbnb, Lyft, Spotify, Shopify, Uber, and Facebook have all shared details of their own data discovery solutions. This raises the question: how do these platforms differ, and which option is best for companies thinking of adopting one of these tools?
The architecture of your data catalog will influence how much value your organization can truly extract from your data. Additionally, catalogs are sticky, taking a long time to integrate and implement at a company. As a result, it’s important to choose your data discovery solution carefully.
In this post, I will describe three generations of architectures that the industry has produced so far for data discovery tools, and explain where along this spectrum many of the best-known options fall. The same progression is mirrored in the evolution of DataHub’s architecture at LinkedIn as we adopted the latest best practices: it was first open sourced and shared with the world as WhereHows in 2016, and then completely rewritten and re-shared with the open source community in 2019 as DataHub.
Hopefully, this post will help you make the best decision possible as you choose your own data discovery solution.
What is a data catalog?
Before we dive into the different architectures, let’s get our definitions in order. One of the simplest definitions for a data catalog I’ve found is from the Oracle website: “Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.”
Thirty years ago, a data asset was likely a table in an Oracle database. In a modern enterprise, though, we have a dazzling array of different kinds of assets that make up the landscape: tables in relational databases or in NoSQL stores, streams in your favorite stream store, features in your AI system, metrics in your metrics platform, dashboards in your favorite visualization tool, etc. The modern data catalog is expected to contain an inventory of all these kinds of data assets and enable data workers to be more productive at getting things done with those assets.
Why do you need a catalog?
Before you decide to buy or adopt a specific data catalog solution or build your own, you should first ask what things you want to enable for your enterprise with a data catalog. A related and important question concerns what kinds of metadata you want to store in your data catalog, because that directly influences the kinds of use cases you can enable.
Here are a few common use cases and a sampling of the kinds of metadata they need:
- Search and Discovery: Data schemas, fields, tags, usage information
- Access Control: Access control groups, users, policies
- Data Lineage: Pipeline executions, queries, API logs, API schemas
- Compliance: Taxonomy of data privacy/compliance annotation types
- Data Management: Data source configuration, ingestion configuration, retention configuration, data purge policies (e.g., for GDPR “Right To Be Forgotten”), data export policies (e.g., for GDPR “Right To Access”)
- AI Explainability, Reproducibility: Feature definition, model definition, training run executions, problem statement
- Data Ops: Pipeline executions, data partitions processed, data statistics
- Data Quality: Data quality rule definitions, rule execution results, data statistics
One interesting observation is that each individual use case often brings in its own special metadata needs, and yet also requires connectivity to existing metadata brought in by other use cases. We’ll refer back to this insight as we dive into the different architectures of these data catalogs and their implications for your success.
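To make this observation concrete, here is a minimal Python sketch of one catalog entry receiving metadata from several use cases. It is an illustration only, not how any particular catalog is implemented: the `DatasetRecord` class, the `attach` helper, the `page_views` dataset, and every field and annotation name are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One catalog entry (hypothetical model, for illustration only)."""
    name: str
    # Metadata grouped by the use case that contributed it.
    metadata: dict[str, dict] = field(default_factory=dict)


# The whole "catalog" is just an in-memory dict keyed by dataset name.
catalog: dict[str, DatasetRecord] = {}


def attach(dataset: str, use_case: str, values: dict) -> None:
    """Add or update the metadata that one use case owns for a dataset."""
    record = catalog.setdefault(dataset, DatasetRecord(dataset))
    record.metadata.setdefault(use_case, {}).update(values)


# Each use case brings its own kind of metadata for the same dataset.
attach("page_views", "search_and_discovery",
       {"fields": ["member_id", "url"], "tags": ["tracking"]})
attach("page_views", "compliance",
       {"annotations": {"member_id": "MEMBER_ID"}})
attach("page_views", "data_management", {"retention_days": 90})


def fields_with_compliance(dataset: str) -> dict:
    """Join schema fields (discovery) with privacy annotations (compliance)."""
    record = catalog[dataset]
    fields = record.metadata.get("search_and_discovery", {}).get("fields", [])
    annotations = record.metadata.get("compliance", {}).get("annotations", {})
    return {name: annotations.get(name) for name in fields}
```

The compliance view only works because it can reach the schema fields that the search and discovery use case brought in, which is the kind of connectivity described above.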
First-generation architecture: Monolith everything
The figure below describes the first generation of metadata architectures. It is typically a classic monolith frontend (maybe a Flask app) with connectivity to a primary store for lookups (typically MySQL/Postgres), a search index for serving search queries (typically Elasticsearch), and, for generation 1.5 of this architecture, maybe a graph index for handling graph queries for lineage (typically Neo4j) once you hit the limits of relational databases for “recursive queries.”
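As a rough illustration of the “recursive queries” mentioned above, the following sketch stores lineage edges in a relational table and walks them with a recursive common table expression. It uses Python's built-in `sqlite3` module as a stand-in for MySQL/Postgres; the `lineage` table, its columns, the dataset names, and the `upstream_of` helper are all assumptions made for the example, not part of any specific catalog.

```python
import sqlite3

# In-memory SQLite database standing in for the primary relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (upstream TEXT, downstream TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?)",
    [
        ("raw_events", "cleaned_events"),
        ("cleaned_events", "daily_sessions"),
        ("member_profiles", "daily_sessions"),
        ("daily_sessions", "engagement_dashboard"),
    ],
)


def upstream_of(dataset: str) -> list:
    """Return every direct and transitive upstream of a dataset, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(name) AS (
            -- Base case: direct upstreams of the requested dataset.
            SELECT upstream FROM lineage WHERE downstream = ?
            UNION
            -- Recursive step: upstreams of anything already found.
            -- UNION (not UNION ALL) drops duplicates, so cycles terminate.
            SELECT l.upstream
            FROM lineage AS l
            JOIN ancestors AS a ON l.downstream = a.name
        )
        SELECT name FROM ancestors ORDER BY name
        """,
        (dataset,),
    ).fetchall()
    return [row[0] for row in rows]


print(upstream_of("engagement_dashboard"))
```

Every level of the lineage graph costs another self-join on the same table, which is one reason a dedicated graph index can become attractive once lineage grows deep and dense.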