Introducing Apache Pinot 0.3.0 | LinkedIn Engineering


Areas of improvement

Working closely with some of Pinot’s power users helped us realize what it takes to run Pinot—especially in cases where the ecosystem was different from ours. One of the most important goals of the Pinot 0.3.0 release was to improve Pinot’s ease of use and extensibility. To achieve this objective, we identified the following four areas of focus:

Restrictive pluggability
Our big data analytics ecosystem is built on technologies such as Kafka, Hadoop, Avro, and Apache ORC. To be compatible with other stacks, version 0.1.x of Pinot included ways to plug in streams other than Kafka, but the implementation was still closely tied to Hadoop and Avro. While both are widely adopted, we recognized that certain use cases would have equally good or even better alternatives to Hadoop or Avro. Having compile-time dependencies on these technologies made it hard to integrate with other systems (e.g., S3, GCS, ADLS) and to ingest data in different formats (e.g., Parquet, ORC, Thrift).

Lack of cloud-native support
The foundational development work on Pinot predates most of LinkedIn’s public cloud integrations. Naturally, some of the tooling around Pinot wasn’t built to embrace cloud-native technologies such as blob stores, containers, Docker, and Kubernetes. This made it harder for the community to deploy and operate Pinot in the cloud.

Limited SQL support
One of the reasons users choose Pinot is its query execution speed. At LinkedIn, Pinot handles over 120K queries per second while ensuring millisecond latency, pushing the limits of OLAP scale. To guarantee this latency SLA, we limited support to a subset of SQL syntax and deviated from standard SQL semantics: for instance, we changed the GROUP BY behavior to order results on multiple metrics in a single query, and features such as joins and expressions in filters were not supported. With SQL being the popular choice for analytics, these deviations of the Pinot Query Language (PQL) from SQL syntax and semantics made it difficult for users to interact with Pinot.
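As a sketch of that divergence: a PQL aggregation group-by query used a TOP clause and implicitly ordered each aggregation’s groups by the aggregated value, whereas standard SQL requires the ordering and limit to be stated explicitly. The table and column names below are hypothetical, chosen only for illustration:

```sql
-- PQL (pre-0.3.0 semantics): each aggregation returns its own groups,
-- implicitly ordered by the aggregated value; TOP bounds the group count
SELECT COUNT(*), SUM(clicks)
FROM pageViews
GROUP BY country
TOP 10

-- Standard SQL: ordering and limit are explicit, and the group-by
-- columns appear in the select list
SELECT country, COUNT(*), SUM(clicks)
FROM pageViews
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 10
```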

Decentralized documentation
Better documentation was one of the most common pieces of feedback we received from users. While Pinot did have ample documentation, it was developer-centric and not as friendly to users who wanted to try Pinot out. Pinot was built to power internal data analytics products—such as Who Viewed My Profile, Talent Analytics, Company Analytics, and many more—while being easy to operate. At LinkedIn, we continue to operate Pinot as a service for all verticals and have invested heavily in making it highly available and operable. Complex operations such as adding nodes, provisioning new tables, making config and schema changes, or rebalancing a workload can be performed without any downtime. The issue was that users did not know that these features, necessary to operate Pinot at scale, were already built, and they often cited the missing documentation for these operations as a pain point.

New in Apache Pinot 0.3.0

Once we identified these priority areas of improvement, it was time to tackle them one by one to move towards a better Pinot for the community. 

Introducing a plug-in architecture
Creating a plug-in architecture was not a quick fix—we had to completely overhaul Pinot’s code layout (modules and their dependencies). Over time, Pinot’s core module (pinot-core) had become a behemoth that had swallowed tons of dependencies, ranging from external systems such as S3, ADLS, Hadoop, and Spark, to data formats such as Avro and Parquet. Simplifying the layout was important to make it easy for future contributors to add support for further system integrations. The first order of business was to abstract these interfaces out of the core module and provide them as pluggable implementations. The graph below shows just how complex our original inter-module dependencies were.
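The shape of that refactoring can be sketched with a small registry pattern: the core depends only on an interface, and each backend registers a factory for itself. This is an illustrative sketch, not Pinot’s actual SPI—the names FileStore, FileStoreRegistry, and LocalFileStore are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// The core engine codes against this interface only; it never takes a
// compile-time dependency on S3, GCS, or ADLS client libraries.
interface FileStore {
    String scheme();            // e.g. "file", "s3", "gcs"
    byte[] read(String path);
}

// One concrete backend, shipped as a plug-in module in this sketch.
class LocalFileStore implements FileStore {
    public String scheme() { return "file"; }
    public byte[] read(String path) {
        // Stubbed out for illustration; a real backend would do I/O here.
        return ("contents of " + path).getBytes();
    }
}

// Plug-ins register factories keyed by URI scheme; the core looks them up.
class FileStoreRegistry {
    private static final Map<String, Supplier<FileStore>> FACTORIES = new HashMap<>();

    static void register(String scheme, Supplier<FileStore> factory) {
        FACTORIES.put(scheme, factory);
    }

    static FileStore forScheme(String scheme) {
        Supplier<FileStore> factory = FACTORIES.get(scheme);
        if (factory == null) {
            throw new IllegalArgumentException("No plug-in for scheme: " + scheme);
        }
        return factory.get();
    }
}

public class PluginDemo {
    public static void main(String[] args) {
        // A deployment registers only the backends it actually uses.
        FileStoreRegistry.register("file", LocalFileStore::new);
        FileStore store = FileStoreRegistry.forScheme("file");
        System.out.println(new String(store.read("/tmp/segment-0")));
    }
}
```

With this kind of seam in place, each concrete integration can live in its own module, so a deployment pulls in only the jars for the systems it needs.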
