Analyzing anomalies with ThirdEye | LinkedIn Engineering

Co-authors: Yen-Jung Chang, Yang Yang, Xiaohui Sun, and Tie Wang

At LinkedIn, ThirdEye is the backbone of our monitoring toolkit. We use it to track a variety of metrics, whether they relate to production infrastructure and AI model performance or to business impact, such as page views or click counts. It’s a key quality assurance system because it provides rules-based or model-based anomaly detection to reduce false alarms, along with multiple interactive root cause analysis tools that help our engineers narrow down the cause of an anomaly. In fact, it has successfully detected several anomalies that could otherwise have slipped through the cracks and significantly impacted the member experience.

In previous blog posts, we focused on the early steps of anomaly detection with ThirdEye: real-time alerting and collaborative analysis, and creating smart alerts. However, receiving an alert notification is only the first step for a user. In this blog post, we focus on the behind-the-scenes functionality of ThirdEye that analyzes multi-dimensional time series data and, through a dimension heatmap, helps our engineers understand why these anomalies happened.

Data cube

In modern systems, data is usually aggregated or summarized along multiple dimensions so that users can understand the specific impact on different subpopulations. For example, when looking at total page views, business analysts typically want to know how page views change across different countries, platforms, etc. The table below shows a hypothetical example of such breakdowns. This kind of multi-dimensional breakdown is known as a data cube, which enables users to slice the data and gain a better understanding of its variations. At LinkedIn, such data cubes are pre-aggregated and stored in a real-time OLAP engine called Pinot.
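To make the idea of a data cube concrete, here is a minimal Python sketch that rolls up a page view metric over every combination of two dimensions. It is only an illustration under stated assumptions: the rows, the dimension names, the `(ALL)` placeholder, and the `build_cube` and `breakdown` helpers are all hypothetical, and this is not how Pinot stores or pre-aggregates data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical raw rows: (country, platform, page_views). All values are invented.
ROWS = [
    ("US", "desktop", 120),
    ("US", "mobile", 80),
    ("IN", "desktop", 40),
    ("IN", "mobile", 60),
]
DIMENSIONS = ("country", "platform")
ALL = "(ALL)"  # placeholder meaning "this dimension is rolled up"


def build_cube(rows, dimensions):
    """Sum the metric for every subset of dimensions (a full roll-up)."""
    cube = defaultdict(int)
    n = len(dimensions)
    for *values, metric in rows:
        # Keep k of the n dimensions and roll the remaining ones up to ALL.
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                key = tuple(values[i] if i in kept else ALL for i in range(n))
                cube[key] += metric
    return dict(cube)


def breakdown(cube, dimensions, dimension):
    """Return {value: total} for one dimension, with all others rolled up."""
    idx = dimensions.index(dimension)
    result = {}
    for key, total in cube.items():
        others_rolled_up = all(v == ALL for i, v in enumerate(key) if i != idx)
        if key[idx] != ALL and others_rolled_up:
            result[key[idx]] = total
    return result


cube = build_cube(ROWS, DIMENSIONS)
print(cube[(ALL, ALL)])                         # 300
print(breakdown(cube, DIMENSIONS, "country"))   # {'US': 200, 'IN': 100}
print(breakdown(cube, DIMENSIONS, "platform"))  # {'desktop': 160, 'mobile': 140}
```

Slicing by a single dimension is then just a lookup into the pre-aggregated totals, which is the property that makes this kind of structure convenient for comparing subpopulations during root cause analysis.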
