Bridging Offline and Nearline Computations with Apache Calcite

Figure 1. The traditional lambda architecture

This is a great idea, but a major pain point of the Lambda architecture is that it requires users to develop two different code bases for carrying out the same logic: batch scripts, often in declarative languages like SQL, and nearline code, often in procedural languages like Java. Many problems arise due to this duplication—developers have to do twice the work, and it is painful to make sure the two code bases are consistent during code update and maintenance.

Several years ago, Apache Flink was introduced to unify both batch processing and stream processing in a single engine and provide a better streaming model with a guarantee of accuracy. However, migrating an existing computation platform into a new technology like Flink is not trivial work.

In this blog post, we show how we address the limitations of the Lambda architecture by maintaining a single (existing) batch codebase and building a technology to auto-generate streaming API code from batch logic.

Lambda architecture with a single batch code base at LinkedIn

The majority of our use cases for big data computation are for existing batch jobs, written in Pig, Hive, or Spark scripts and deployed in production. But at some point, our users realized they needed fresh results. We addressed this problem by auto-generating streaming code from batch logic and employing the Lambda architecture to transparently deliver merged results to our users.

Technically, our Lambda architecture is very similar to the above traditional Lambda architecture; however, our users only need maintain a single code base, either in Pig, Hive, or Spark. This code base serves as the single source of truth to specify what users want to compute. If they need fresh results, they just need to turn on the nearline flag.

In our Lambda architecture, we use Apache Kafka to deliver new events, which are then either consumed by Samza jobs in the streaming layer or ingested in HDFS via Apache Gobblin and then processed by Azkaban flows in the batch layer. Batch jobs in Azkaban can be scheduled at different time granularities, including monthly, weekly, daily, and hourly. We use Pinot, a realtime distributed OLAP datastore developed at LinkedIn, for the serving database. Pinot separately stores both nearline results and offline results, and makes them appear as a single view, called a hybrid table, to users. When a client makes a query to a Pinot table, Pinot will automatically figure out the latest timestamp where offline results are available, then fetch and merge appropriate results from both sources together, and return the merged results to the client.

Source link