Open Sourcing DataHub: LinkedIn’s Metadata Search and Discovery Platform


WhereHows is now DataHub!

LinkedIn’s metadata team previously introduced DataHub (successor of WhereHows), LinkedIn’s metadata search and discovery platform, and shared plans to open source it. Shortly after that announcement, we released an alpha version of DataHub and shared it with the community. Since then, we have continuously contributed to the repo and worked with interested users to add the most requested features and resolve issues. Now, we are proud to announce the official release of DataHub on GitHub.

Open source approaches

WhereHows, LinkedIn’s original data discovery and lineage portal, started as an internal project; the metadata team open sourced it in 2016. Since then, the team has always maintained two different codebases (one for open source, and the other for LinkedIn’s internal use) because not all product features developed for LinkedIn’s use cases were generally applicable to a broader audience. Also, WhereHows had some internal dependencies (infrastructure, libraries, etc.) that were not open sourced. WhereHows went through many iterations and development cycles in the following years, which made keeping the two codebases in sync a big challenge. The metadata team tried different approaches over the years to keep internal and open source development in sync with each other.

First attempt: “Open source first”
Initially, we followed an “open source first” development model, where the main development took place in the open source repo and changes were pulled in for internal deployment. The problem with this approach was that code was always pushed to GitHub before it was fully validated internally. Until the changes from the open source repo were pulled in and a new internal deployment took place, we would not discover any production issues. In the case of a bad deployment, it was also very hard to identify the culprit, because changes were pulled in batches.

Also, this model decreased the productivity of the team when developing new features that needed fast iterations, because it forced all changes to be pushed to the open source repo first and only then brought into the internal repository. To reduce turnaround time, a necessary fix or change could be made first in the internal repository, but this became a huge pain point when it came to merging those changes back to the open source repo, because the two repositories had gotten out of sync.

This model is much easier to implement for generic frameworks, libraries, or infrastructure projects than it is for full-stack custom web applications. Also, this model is perfect for projects that start out as open source from day one, but WhereHows had started out as a completely internal web app. It was really difficult to cleanly abstract all internal dependencies, which is why we needed to keep an internal fork, but keeping an internal fork and developing primarily in open source did not quite work for us.

Second attempt: “Internal first”
As a second attempt, we switched to an “internal first” development model, where the main development takes place internally and changes are pushed to open source on a regular basis. Although this model is best suited for our use case, it has inherent challenges. Directly pushing every diff to the open source repo and then trying to resolve merge conflicts later is an option, but it’s time consuming. Developers will mostly avoid doing it with every code check-in. As a result, it gets done much less frequently, in batches, which increases the pain of resolving merge conflicts later.

Third time’s the charm!
The two failed attempts mentioned above left the WhereHows GitHub repository stale for a long time. The team continued to iterate on the product features and architecture, so LinkedIn’s internal version of WhereHows quickly became much improved compared with the open source one. It even had a new name, DataHub. Learning from the previous failed attempts, the team decided to devise a scalable long-term solution.

For any new open source project, LinkedIn’s open source team advises and supports a development model where building blocks/modules of the project are fully developed in open source. Versioned artifacts are deployed to a public repository and then brought back to LinkedIn’s internal artifactory using External Library Request (ELR). Following this development model is not only good for the open source community, but also results in a more modular, extensible, and pluggable architecture.

However, achieving that state for a mature internal application like DataHub will take a significant amount of time. It also precludes the possibility of open sourcing a fully working implementation before all internal dependencies are completely abstracted out. Therefore, we’ve developed tooling that helps us make open source contributions faster and much less painful in the interim. This decision benefits both the metadata team (the developer of DataHub) and the open source community. The following sections discuss this new approach.

Automating open source contributions

The metadata team’s latest approach for open sourcing DataHub is to develop a tool that automatically syncs the internal codebase and the open source repository. High-level features of this tooling include:

  1. Syncing of LinkedIn code to/from open source, similar to rsync

  2. License header generation, similar to Apache Rat

  3. Auto-generation of open source commit logs from internal commit logs

  4. Preventing internal changes that break open source build via dependency testing
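To make item 2 more concrete, here is a minimal sketch of what license header generation could look like. It is not LinkedIn’s tooling: the header text, the `COMMENT_PREFIX` table, and the `with_license_header` function are all hypothetical names chosen for illustration, and the real tool may work quite differently.

```python
from pathlib import Path

# Hypothetical header text; the post does not say which text the tool emits.
LICENSE_HEADER = "Copyright (c) Example Corp. Licensed under the Apache License, Version 2.0."

# Hypothetical mapping from file extension to line-comment prefix.
COMMENT_PREFIX = {".py": "# ", ".java": "// ", ".ts": "// "}


def with_license_header(filename: str, content: str) -> str:
    """Return content with a license header prepended, unless one is already there."""
    prefix = COMMENT_PREFIX.get(Path(filename).suffix)
    if prefix is None:
        return content  # unknown file type: leave the file untouched
    header_line = prefix + LICENSE_HEADER
    # Only inspect the first few lines so the check stays cheap on large files.
    if header_line in content.splitlines()[:5]:
        return content  # already has the header, so the operation is idempotent
    return header_line + "\n\n" + content
```

Making the function idempotent matters for a sync tool that runs repeatedly: files that already carry the header pass through unchanged, so repeated runs do not produce spurious diffs.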

The following subsections discuss in detail the features above that pose interesting challenges.

Source code syncing
As opposed to the open source version of DataHub, which is a single GitHub repo, LinkedIn’s version of DataHub is a combination of multiple repos (known internally as multiproducts). DataHub’s frontend, metadata models library, metadata store backend service, and streaming jobs sit in different repositories within LinkedIn. However, for an easier experience for open source users, we have a single repository for the open source version of DataHub.
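The many-repos-to-one-repo layout can be pictured as a path mapping. The sketch below is only an illustration under stated assumptions: the internal repo names, the target directories, and the idea of an internal-only path prefix are all hypothetical, since the post does not describe the actual mapping rules.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical internal repo names mapped to hypothetical open source subdirectories.
REPO_TO_SUBDIR = {
    "datahub-frontend": "frontend",
    "metadata-models": "models",
    "metadata-store": "backend",
    "metadata-jobs": "jobs",
}

# Hypothetical path prefixes that should never leave the internal codebase.
INTERNAL_ONLY_PREFIXES = ("internal/",)


def open_source_path(repo: str, relative_path: str) -> Optional[str]:
    """Map a file in an internal repo to its path in the single open source repo.

    Returns None when the file should not be synced at all.
    """
    subdir = REPO_TO_SUBDIR.get(repo)
    if subdir is None or relative_path.startswith(INTERNAL_ONLY_PREFIXES):
        return None
    return str(PurePosixPath(subdir) / relative_path)
```

For example, `open_source_path("datahub-frontend", "src/app.ts")` yields `frontend/src/app.ts`, while any path under `internal/` or in an unlisted repo is skipped.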


