Building a better and faster Beam Samza runner

Co-authors: Yixing Zhang, Bingfeng Xia, Ke Wu, and Xinyu Liu

Since the Beam Samza runner was developed at LinkedIn in 2018, we have grown to 100+ Samza Beam jobs running in production. As our usage grew, we wanted to better understand how the Samza runner performs compared to other runners and to identify areas for improvement. For stream processing platforms, better performance generally translates into supporting more use cases and reducing costs.

To do this, we used the Beam Nexmark suite to benchmark the performance of the Beam Samza runner. Using async-profiler, we identified performance bottlenecks in the Samza runner, which led us to four key improvements that collectively improved the Beam Samza runner's benchmark score by 10x.

In this blog post, we will discuss the results of the benchmark suite and the optimizations implemented along the way. Similar steps can be applied to benchmark and improve the performance of other Beam runners.


Overview of Beam and Samza

Apache Beam is an open source, unified programming model for both batch and stream processing. It provides a rich, portable API layer for building sophisticated data-parallel processing pipelines that can be executed across a variety of execution engines, or runners, such as Flink, Samza, and Spark.
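As a minimal sketch of that portability, the pipeline below counts elements and can be submitted to any of those runners purely by changing the `--runner` option at launch time. The class name and input values are illustrative, not from a production job:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class RunnerAgnosticDemo {
  // Builds a runner-agnostic transform graph: the same code can run on the
  // Direct, Flink, Samza, or Spark runner without modification.
  static PCollection<KV<String, Long>> buildCounts(Pipeline p) {
    return p
        .apply(Create.of(Arrays.asList("beam", "samza", "beam")))
        .apply(Count.perElement());
  }

  public static void main(String[] args) {
    // The execution engine is chosen via pipeline options at launch time,
    // e.g. --runner=SamzaRunner or --runner=FlinkRunner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    buildCounts(p);
    p.run().waitUntilFinish();
  }
}
```

The transform graph is built against the Beam SDK only; no runner-specific types appear in the pipeline code.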

Apache Samza is a distributed stream processing framework with large-scale state support. Samza applications can be built locally and deployed to either YARN clusters or standalone clusters that use ZooKeeper for coordination. Samza uses RocksDB to support large-scale state, backed by changelogs for durability, and performs incremental checkpointing using Kafka. As of today, Samza powers thousands of streaming applications inside LinkedIn, processing over 3 trillion messages per day.

In 2018, we developed the Beam Samza runner to leverage the unified data processing API, advanced model, and multi-language support of Apache Beam. It has been widely adopted at LinkedIn for processing complex streaming use cases such as sliding window aggregations and multi-stream session window joins.
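To make one of those use cases concrete, here is a hedged sketch of a sliding-window aggregation in the Beam model: counting events per key over a 5-minute window that slides every minute. The class name, keys, and timestamps are illustrative; a real job would read timestamped events from Kafka rather than `Create`:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class SlidingWindowDemo {
  // Counts occurrences of each element in 5-minute windows advancing
  // every minute; each element lands in five overlapping windows.
  static PCollection<KV<String, Long>> countPerWindow(Pipeline p) {
    return p
        // Illustrative timestamped events at t = 0 min and t = 2 min.
        .apply(Create.timestamped(Arrays.asList(
            TimestampedValue.of("memberA", new Instant(0)),
            TimestampedValue.of("memberA",
                new Instant(Duration.standardMinutes(2).getMillis())))))
        .apply(Window.into(
            SlidingWindows.of(Duration.standardMinutes(5))
                          .every(Duration.standardMinutes(1))))
        .apply(Count.perElement());
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    countPerWindow(p);
    p.run().waitUntilFinish();
  }
}
```

Because the windowing is expressed in the Beam model, the same pipeline runs unchanged on the Samza runner, which keeps the window state in its configured state backend.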

Benchmarking with Nexmark

We knew the Beam Samza runner performed well for our use cases in production, but we wanted to find out how it performs more generally and in comparison to other runners, so we leveraged the Beam Nexmark suite for benchmarking. If you're not familiar with it, Nexmark is a suite of Beam pipelines, inspired by the "continuous data stream" queries in the Nexmark research paper, that has been used to benchmark the performance of the Beam SDK and its underlying runners since the Beam 2.2.0 release. Using async-profiler, a low-overhead sampling profiler for Java applications, we identified performance bottlenecks in the Samza runner and improved its Nexmark benchmark score.

For the performance tests, we ran the Beam Nexmark suite on a dedicated Linux machine with the following hardware:

  • Intel® Xeon® processor (1.8 GHz, 16 cores)
  • 64 GB DDR4-2666 ECC SDRAM (4x16 GB)
  • 512GB SATA SSD

Nexmark settings
We ran the Nexmark benchmark suite in streaming mode with the Direct, Flink, and Samza runners. We configured the Flink and Samza runners' state store with RocksDB, a high-performance persistent key-value store; most LinkedIn services' state is much larger than the available memory, so we use RocksDB as the state backend at LinkedIn. We used the default in-memory state backend for the Direct runner, since it doesn't support RocksDB.
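As a rough sketch of what the RocksDB setup involves on the Samza side, the snippet below assembles the kind of Samza store configuration that backs state with RocksDB and a Kafka changelog. The store name, config key layout, and changelog stream name are assumptions for illustration; the exact keys the Beam Samza runner wires up may differ:

```java
import java.util.HashMap;
import java.util.Map;

public class RocksDbStoreConfig {
  // Assembles Samza-style store properties (store/key names are illustrative).
  static Map<String, String> rocksDbStore(String storeName) {
    Map<String, String> config = new HashMap<>();
    // Back the store with RocksDB rather than an in-memory map,
    // so state can grow well beyond available memory.
    config.put("stores." + storeName + ".factory",
        "org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory");
    // A Kafka changelog makes the RocksDB state durable and restorable
    // after container failures.
    config.put("stores." + storeName + ".changelog",
        "kafka." + storeName + "-changelog");
    return config;
  }
}
```

For the benchmark, the Flink runner's RocksDB state backend was configured through the analogous Flink options rather than Samza store properties.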

We set the bundle size to one for all runners, which means each element is processed as its own bundle. When we ran the benchmark, the Samza runner did not yet support bundling.

Benchmark results
The following are the Nexmark benchmark results for the Direct, Flink, and Samza runners across the 15 Nexmark queries, labeled Q0 through Q14. In this chart, the X-axis represents each query and the Y-axis represents the throughput of that query in QPS (higher is better). For each query, the four bars represent, from left to right: the Direct runner, the Flink runner, the Samza runner before optimization, and the Samza runner after optimization.
