At LinkedIn, we have many different monitoring systems—each with its own role and granularity—ranging from quarterly reports about the business as a whole to the lowest levels of system-specific latency and availability. However, these systems don’t operate in isolation—sometimes, issues or changes flagged by one system go on to cause problems in another area. While system-specific monitoring is valuable and necessary, we also need an overall platform that lets us see how the whole LinkedIn ecosystem is working in concert. Additionally, when issues arise, we need an integrated solution for real-time alerting and collaborative analysis. To solve this problem, we created ThirdEye.
ThirdEye is a comprehensive platform for real-time monitoring of metrics that covers a wide variety of use-cases. LinkedIn relies on ThirdEye to monitor site performance, track member growth, understand adoption of new features, flag sustained attempts to circumvent system security, and many other areas. ThirdEye provides a shared infrastructure for outlier detection and user-interactive data analysis of various system and business metrics. ThirdEye connects to a large number of data sources to gather information and learns over time to generate more relevant detection and analysis results through user interaction.
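ThirdEye’s detection methods are far richer than any single rule, but the basic idea of flagging a metric that deviates from its recent baseline can be sketched with a rolling z-score. The function name, window size, and threshold below are illustrative assumptions, not ThirdEye internals:

```python
# Minimal sketch of baseline-deviation outlier detection on a metric
# time series. Window and threshold values are hypothetical defaults.
from statistics import mean, stdev

def detect_outliers(series, window=7, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the trailing `window` of observations."""
    outliers = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            outliers.append(i)
    return outliers

# Steady daily signups with one sudden drop:
signups = [100, 102, 98, 101, 99, 103, 100, 40, 101, 100]
print(detect_outliers(signups))  # → [7]  (the drop to 40)
```

A production system would additionally model seasonality and trend, which is part of what ThirdEye learns over time from user feedback.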
ThirdEye builds upon Apache Pinot, which recently entered the Apache Incubator. For ThirdEye, we leverage Pinot’s powerful slice-and-dice capabilities to analyze high-dimensional data on demand and provide real-time insights into the vast sets of business metrics generated at LinkedIn.
A typical use case for ThirdEye at LinkedIn is answering questions from executives across the company about deviations in growth metrics, such as member signups and page views. Small changes in growth can be attributed to everything from regional holidays, to minor configuration issues, to outages of entire data centers, so it’s important to have a single platform that provides visibility into all possible causes. Thanks to ThirdEye, we are able to supply potential root causes on demand without needing to farm these questions out to a large number of specialized analysts and coordinate between multiple responses. Additionally, we’re able to surface relevant outliers hidden beneath the surface, such as ongoing shifts in member attention across different sub-products that might cancel each other out in aggregate.
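To illustrate how slices that “cancel each other out in aggregate” can still be surfaced, here is a minimal sketch that ranks dimension values by their contribution to an overall metric change between a baseline and a current period. The function and the data are hypothetical examples, not ThirdEye’s actual root-cause algorithm:

```python
# Hypothetical slice-and-dice comparison: rank dimension slices by how
# much each one contributes to the overall change in a metric.

def rank_contributions(baseline, current):
    """Return (dimension_value, delta) pairs sorted by absolute change."""
    deltas = {k: current.get(k, 0) - baseline.get(k, 0)
              for k in set(baseline) | set(current)}
    return sorted(deltas.items(), key=lambda kv: -abs(kv[1]))

# Page views by country: the aggregate barely moves (+5), but two
# slices move sharply in opposite directions and nearly cancel out.
baseline = {"US": 1000, "DE": 500, "IN": 800}
current  = {"US": 1010, "DE": 320, "IN": 975}
print(rank_contributions(baseline, current))
# → [('DE', -180), ('IN', 175), ('US', 10)]
```

In practice this kind of comparison runs across many dimensions at once, which is where Pinot’s on-demand aggregation over high-dimensional data becomes essential.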
Over time, ThirdEye improves its automated detection and analysis capabilities through incremental user feedback and the addition of domain knowledge and data sources. It provides common components out of the box, and becomes more effective as different teams integrate their data and expand ThirdEye’s knowledge graph of system and metric dependencies.
Why another monitoring system?
Anomaly detection and root cause analysis are common problems for data science, site reliability, and engineering teams. One way or another, each team creates a monitoring solution for its area of responsibility. These solutions are usually application- or domain-specific libraries that don’t generalize to other use cases; they’re built around a very specific set of business rules for detection and analysis. As a result, different teams often spend large amounts of time redundantly developing similar solutions. Each of these monitoring systems typically comes with its own data ingestion pipeline, its own limitations on processing capabilities, and, of course, its own ongoing maintenance requirements. Even when there is a desire to consolidate multiple systems, the original systems rarely scale to new use cases because they were never designed to process large amounts of data efficiently or to support different detection methods.
For example, one of our teams took several weeks to notice and diagnose a drop in page impressions in a specific part of LinkedIn’s feed. The issue was ultimately traced to a new security feature that was interfering with the timely serving of recommendations. The existing monitoring infrastructure was integrated with site performance tracking, and, separately, the site performance team had integrated their monitoring with the teams responsible for feature rollouts. However, these integrations used aggregated, top-level metrics and lacked an in-depth, end-to-end view of the system. Ironically, independent of the team’s ongoing investigation, one of their engineers was evaluating the costs and benefits of connecting their data feeds to ThirdEye. In the course of that evaluation, ThirdEye revealed the problem.
Architecture and design
We built ThirdEye from the ground up as a monitoring and analysis platform with a robust foundation in federated data processing across numerous batch and streaming data sources (Figure 1). ThirdEye leverages high-dimensional time series data from systems such as Apache Pinot and RocksDB for quantitative analysis, and integrates with numerous event data sources for correlation analysis and root-cause inference. All of this is done at user-interactive speeds and is suitable for real-time monitoring. Teams can easily connect their own data sources and then immediately leverage the entire pool of operational metrics at LinkedIn for detection, analysis, and implementation of team-specific business logic. This way, ThirdEye provides value to many different areas of business and responsibility, while centralizing the underlying knowledge base, infrastructure, and operations.
Another critical design aspect of ThirdEye is the tight integration of online monitoring and offline analysis capabilities. ThirdEye has real-time analysis features similar to MacroBase, and allows our users to investigate anomalies via dashboarding utilities comparable to Adobe Analysis Workspace, Google Stackdriver, and Amazon CloudWatch. Rather than treating analysis as an isolated feature, ThirdEye integrates analysis and detection as an iterative process. Our users explore data interactively while ThirdEye dynamically adapts to the user’s current focus to generate context-sensitive recommendations and detect outliers in potentially related metrics and events on the fly.