As the operator of the world’s largest professional network and the Economic Graph, LinkedIn’s Data team is constantly working on scaling its infrastructure to meet the demands of our ever-growing big data ecosystem. As the data grows in volume and richness, it becomes increasingly challenging for data scientists and engineers to discover the data assets available, understand their provenance, and take appropriate actions based on the insights. To help us continue scaling productivity and innovation in data alongside this growth, we created a generalized metadata search and discovery tool, Data Hub.
To increase the productivity of LinkedIn’s data team, we had previously developed and open sourced WhereHows, a central metadata repository and portal for datasets. The type of metadata stored includes both technical metadata (e.g., location, schema, partitions, ownership) and process metadata (e.g., lineage, job execution, lifecycle information). WhereHows also featured a search engine to help locate the datasets of interest.
Since our initial release of WhereHows in 2016, there has been growing interest in the industry in improving the productivity of data scientists by using metadata. For example, tools developed in this space include Airbnb’s Dataportal, Uber’s Databook, Netflix’s Metacat, Lyft’s Amundsen, and most recently Google’s Data Catalog. At LinkedIn, we have also been busy expanding our scope of metadata collection to power new use cases while preserving fairness, privacy, and transparency. However, we came to realize WhereHows had fundamental limitations that prevented it from meeting our evolving metadata needs. Here is a summary of the lessons we learned from scaling WhereHows:
- Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.
- General is better than specific: WhereHows is strongly opinionated about what the metadata for a dataset or a job should look like. This results in an opinionated API, data model, and storage format. A small change to the metadata model leads to a cascade of changes up and down the stack. It would have been more scalable had we designed a general architecture that is agnostic to the metadata model it stores and serves. This in turn would have allowed us to focus on onboarding and evolving strongly opinionated metadata models without worrying about the lower layers of the stack.
- Online is as important as offline: Once the metadata has been collected, it’s natural to want to analyze that metadata to derive value. One simple solution is to dump all the metadata to an offline system, like Hadoop, where arbitrary analyses can be performed. However, we soon discovered that supporting offline analyses alone wasn’t enough. There are many use cases, such as access control and data privacy handling, that must query against the latest metadata online.
- Relationships really matter: Metadata often conveys important relationships (e.g., lineage, ownership, and dependencies) that enable powerful capabilities such as impact analysis, data rollup, and better search relevance. It is critical to model all these relationships as first-class citizens and support efficient analytical queries over them.
- Multi-center universe: We realized that it is not enough to simply model metadata centered around a single entity (a dataset). There is an entire ecosystem of data, code, and human entities (datasets, data scientists, teams, code, microservice APIs, metrics, AI features, AI models, dashboards, notebooks, etc.) that needs to be integrated and connected through a single metadata graph.
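To make the first, second, and fourth lessons more concrete, here is a minimal Python sketch of a push-based, model-agnostic metadata store that treats relationships as first-class edges. It is an illustration under stated assumptions, not Data Hub's actual API: the names `MetadataEvent`, `Relationship`, `MetadataStore`, the `DownstreamOf` edge type, and the `dataset:...` identifiers are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class MetadataEvent:
    """Hypothetical message a metadata provider pushes to the central store."""
    urn: str                 # entity identifier, e.g. "dataset:page_views"
    aspect: str              # name of the metadata facet, e.g. "ownership"
    value: dict[str, Any]    # opaque payload; the store never inspects it


@dataclass(frozen=True)
class Relationship:
    """Hypothetical typed edge between two entities."""
    source: str
    kind: str                # e.g. "DownstreamOf"
    destination: str


class MetadataStore:
    """Model-agnostic store: keyed by (urn, aspect), unaware of payload shape."""

    def __init__(self) -> None:
        self._aspects: dict[tuple[str, str], dict[str, Any]] = {}
        # (destination, kind) -> set of sources pointing at that destination
        self._incoming: dict[tuple[str, str], set[str]] = defaultdict(set)

    def push(self, event: MetadataEvent,
             relationships: Iterable[Relationship] = ()) -> None:
        """Providers call this; the store does no crawling of its own."""
        self._aspects[(event.urn, event.aspect)] = event.value
        for rel in relationships:
            self._incoming[(rel.destination, rel.kind)].add(rel.source)

    def get(self, urn: str, aspect: str) -> dict[str, Any] | None:
        """Online lookup of the latest value pushed for one aspect."""
        return self._aspects.get((urn, aspect))

    def impacted_by(self, urn: str, kind: str = "DownstreamOf") -> set[str]:
        """Breadth-first walk: every entity that transitively depends on urn."""
        seen = {urn}
        queue = deque([urn])
        while queue:
            current = queue.popleft()
            for source in self._incoming.get((current, kind), ()):
                if source not in seen:
                    seen.add(source)
                    queue.append(source)
        return seen - {urn}


store = MetadataStore()
store.push(MetadataEvent("dataset:page_views", "ownership", {"owners": ["alice"]}))
store.push(
    MetadataEvent("dataset:daily_views", "schema", {"fields": ["day", "views"]}),
    relationships=[Relationship("dataset:daily_views", "DownstreamOf",
                                "dataset:page_views")],
)
store.push(
    MetadataEvent("metric:weekly_views", "definition", {"unit": "count"}),
    relationships=[Relationship("metric:weekly_views", "DownstreamOf",
                                "dataset:daily_views")],
)

print(sorted(store.impacted_by("dataset:page_views")))
# ['dataset:daily_views', 'metric:weekly_views']
```

Because the store only keys on the entity identifier and the aspect name, adding a new entity type (a metric, in this sketch) requires no change to the storage layer, and the impact-analysis query works across entity types without special cases.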
Meet Data Hub
About a year ago, we went back to the drawing board and re-architected WhereHows from the ground up based on these learnings. At the same time, we realized the growing need within LinkedIn for a consistent search and discovery experience across various data entities, along with a metadata graph that connects them together. As a result, we decided to expand the scope of the project to build a fully generalized metadata search and discovery tool, Data Hub, with an ambitious vision: connecting LinkedIn employees with data that matters to them.
We broke the monolithic WhereHows stack into two distinct stacks: a Modular UI frontend and a Generalized Metadata Architecture backend. The new architecture enabled us to rapidly expand our scope of metadata collection beyond just datasets and jobs. At the time of writing, Data Hub already stores and indexes tens of millions of metadata records that encompass 19 different entities, including datasets, metrics, jobs, charts, AI features, people, and groups. We also plan to onboard metadata for machine learning models and labels, experiments, dashboards, microservice APIs, and code in the near future.
The Data Hub web app is how most users interact with the metadata. The app is written using the Ember framework and runs atop a Play middle tier. To make development scalable, we leverage various modern web technologies, including ES9, ES.Next, TypeScript, Yarn with Yarn Workspaces, and code quality tools like Prettier and ESLint. The presentation, control, and data layers are modularized into packages so that specific views in the app are built from a composition of relevant packages.
Component service framework
In applying a modular UI infrastructure, we’ve built the Data Hub web app as a series of cohesive, feature-aligned components that are grouped into installable packages. This package architecture employs Yarn Workspaces and Ember add-ons at the foundation, and is componentized using Ember’s components and services. You can think of this as a UI that’s built using small building blocks (i.e., components and services) to create larger building blocks (i.e., Ember add-ons and npm / Yarn packages) that, when put together, eventually constitute the Data Hub web app.
With components and services at the core of the app, this framework allows us to pull apart different aspects and put together other features in the application. Additionally, segmentation at each layer provides a highly customizable architecture: consumers can scale or streamline their applications by adopting only the features, or onboarding only the new metadata models, relevant to their domain.
Interacting with Data Hub
At the highest level, the frontend provides three types of interactions: (1) search, (2) browse, and (3) view/edit metadata. Here are some example screenshots from the actual app: