Solving the data integration variety problem at scale, with Gobblin


In the above diagram, the new multistage flow uses shared components that can be leveraged to integrate with many other REST API and OData data sources, whereas the complex job requires a vendor-specific connector. The streamlined multistage flow dramatically reduces the work needed to apply it to hundreds of objects, eliminating many one-off implementations previously deemed “essential” for handling special objects, such as the retry handling illustrated above.
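To make the contrast concrete, below is a minimal sketch of the composition idea, assuming a generic protocol interface, a generic format interface, and a shared retry wrapper. The class names and property keys are illustrative only and are not the actual DIL or Gobblin APIs.

```java
// Hypothetical sketch: protocol and format as independent, composable components.
// Names are illustrative, not the actual DIL/Gobblin classes or configuration keys.

import java.io.InputStream;
import java.util.Iterator;
import java.util.Properties;

/** Protocol side: knows how to talk to an endpoint (REST, OData, SFTP, ...). */
interface SourceConnection {
  InputStream fetch(String request) throws Exception;
}

/** Format side: knows how to turn a payload into records (JSON, CSV, Avro, ...). */
interface RecordExtractor<T> {
  Iterator<T> extract(InputStream payload) throws Exception;
}

/** Shared behavior applied uniformly, instead of per-connector one-off retry code. */
final class RetryingConnection implements SourceConnection {
  private final SourceConnection delegate;
  private final int maxAttempts;

  RetryingConnection(SourceConnection delegate, int maxAttempts) {
    this.delegate = delegate;
    this.maxAttempts = maxAttempts;
  }

  @Override
  public InputStream fetch(String request) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return delegate.fetch(request);
      } catch (Exception e) {
        last = e; // retry generically rather than inside a vendor-specific connector
      }
    }
    throw new Exception("fetch failed after " + maxAttempts + " attempts", last);
  }
}

/** A job becomes a composition chosen by configuration, not a new connector class. */
final class MultistageJob<T> {
  private final SourceConnection connection;
  private final RecordExtractor<T> extractor;

  MultistageJob(Properties config, SourceConnection connection, RecordExtractor<T> extractor) {
    int maxAttempts = Integer.parseInt(config.getProperty("retry.max.attempts", "3"));
    this.connection = new RetryingConnection(connection, maxAttempts);
    this.extractor = extractor;
  }

  Iterator<T> run(String request) throws Exception {
    return extractor.extract(connection.fetch(request));
  }
}
```

Under this kind of decomposition, onboarding another REST or OData object is a matter of configuration rather than writing and maintaining a new connector.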

Last but not least, none of this genericity and decomposition would work without a fine-grained state store, through which each generic step and work unit is managed both independently and collectively, allowing them to be restarted or reset separately. 
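As a rough illustration of what “fine-grained” means here, the sketch below keeps one state record per work unit so that a single failed unit can be reset and re-run without disturbing the rest of the job. The types are hypothetical and do not reflect Gobblin’s actual state-store classes.

```java
// Hypothetical sketch of a fine-grained state store: one state record per work unit.
// Names are illustrative, not the actual Gobblin state-store API.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

enum UnitStatus { PENDING, COMMITTED, FAILED }

/** Per-work-unit state: which job it belongs to and how far it progressed (watermarks). */
record WorkUnitState(String jobName, String workUnitId, long lowWatermark,
                     long highWatermark, UnitStatus status) {}

final class FineGrainedStateStore {
  private final Map<String, WorkUnitState> store = new ConcurrentHashMap<>();

  private static String key(String jobName, String workUnitId) {
    return jobName + "/" + workUnitId;
  }

  /** Persist the outcome of one work unit independently of its siblings. */
  void put(WorkUnitState state) {
    store.put(key(state.jobName(), state.workUnitId()), state);
  }

  Optional<WorkUnitState> get(String jobName, String workUnitId) {
    return Optional.ofNullable(store.get(key(jobName, workUnitId)));
  }

  /** Reset a single failed unit so only that slice of data is re-pulled. */
  void reset(String jobName, String workUnitId) {
    store.remove(key(jobName, workUnitId));
  }
}
```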

Benefits for the open source community

At LinkedIn, we have used DIL across dozens of use cases to greatly reduce operational complexity while allowing us to scale as a company. The direct benefits of the design are quicker time-to-market and a dramatic reduction in the lead time to onboard a new business initiative. The design also delivers high ROI, as fewer resources are required to build and maintain the connector libraries.

While these benefits at LinkedIn alone have been very significant, consider the tens of companies that use Gobblin as their data integration platform: there could be hundreds or thousands of customized connectors being developed and used privately, which means DIL could have a far greater impact within the community. For those looking to adopt this framework for their data integration needs, the standardized design will greatly reduce upfront investment, accelerating adoption. 

Conclusion

DIL’s generic components, along with its multistage architecture and granular state store, greatly standardize and simplify data integration, applying the same principles that microservices use to replace monolithic applications. The new architecture decouples data source protocols from data formats, and its protocol- and format-agnostic design has proven scalable in complex integration environments. At LinkedIn, we have replaced 89 independently maintained connectors with generic connectors. We believe this architecture will greatly simplify Gobblin’s connector repository, which will benefit the open source community. 

Future work

So far at LinkedIn, we have applied the above design to a few dozen data ingestion use cases, positively impacting hundreds of valuable datasets. In the next few quarters, we plan to apply the same principles to outbound data.

Compared to ingestion, outbound egress involves additional requirements around data security and compliance. Furthermore, we are targeting more usage patterns where users initiate data pulls from their end, which will require the framework to serve data in more diverse ways.

At the time of this writing, Gobblin is being promoted to a top-level Apache project (TLP), and this work is being considered as a sub-repository under the upcoming Apache Gobblin project.

Lastly, while the current implementation of this design is based on the Gobblin framework, the same design principles could be applied to other frameworks as well. 

Acknowledgments

This work is heavily inspired by our partners in Marketing (LMS/DSO), Sales (Sales Productivity and LSS), and Artificial Intelligence (AI). A number of engineering teams contributed to the development of the connector framework. For all their hard work and support, we give thanks to Yogesh Keshetty, Eric Song, Pravin Boddu, Zhisheng Zhou, Haoji Liu, Alex Li, Varun Bharill, Yan Nan, Harry Tang, Steven Chuang, Varuni Gupta, Cliff Leung, Jimmy Hong, Sudarshan Vasudevan, Zhixiong Chen, Azeem Ahmed, Jyothi Krishnamurthi, Elma Liu, Carlos Flores, Susan Sumida, and others who provided feedback to this blog.
