Bridging batch and stream processing for the Recruiter usage statistics dashboard


Discrepancies across metrics computed from streaming engine vs. computed from batch engine

To validate the consistency of metrics, we compute metrics over the same day from both batch and streaming engines and measure the resulting discrepancies. The above chart shows the observed differences between the two engines with and without implementing event deduplication. There are 11 metrics used for the dashboard—among them, searchesperformed and activedays are non-additive, while the remaining are additive.

Without the deduplication logic, the discrepancies are already small, with an average of 0.05% for all metrics. The deduplication logic helps reduce the discrepancies to be negligible, with an average of 0.003%, meaning we only produce 3 incorrect counts for every 100,000 events. Among those 11 metrics, there were no discrepancies for the following three metrics: newtags, newsavedsearches, and newsavedsearchalerts.

However, we are aware that there are still discrepancies—albeit very small—across the two computation engines. We believe that it will take much more effort to eliminate those remaining discrepancies as both our streaming engine (Apache Samza) and our messaging system (Apache Kafka) have not yet guaranteed the exact-once semantic while data is copied across multiple clusters.

Conclusion

In this blog post, we showed a prototype of applying our single codebase Lambda architecture to build a near real-time dashboard that helps recruiters utilize our product better. We also presented our work in making the dashboard consistent over time by significantly reducing the discrepancies of metrics computed from the batch and streaming engines. We’re always committed to improving our product, and delivering more value to our members and customers. However, we have experienced through our production that the operating challenges of the Lambda architecture still remained as we needed two different teams, one for operating the batch engine and the other for operating the streaming engine, to serve one service. This can incur unnecessary friction when operating the service in production. Alternative approaches to combine streaming computation and batch computation into a single engine, like Apache Flink, can be a promising approach to solve this problem.

Acknowledgements

We would like to thank Plaban Dash, Kexin Fei, and Ping Zhu for their important contributions to build the offline and nearline flows. Thanks to support from LinkedIn leadership, Harsha Badami Nagaraj, Ameya Kanikar, Divyakumar Menghani, Vasanth Rajamani, Eric Baldeschwieler, and Kapil Surlaker; and valuable feedback from Shirshanka Das, Xinyu Liu, Maneesh Varshney, and Mayank Shrivastava to make this blog post better.



Source link