So, for example, a regression on a “high” traffic page is classified as P0 if this week’s PLT is more than 30% above the baseline (the high water mark). The regression is resolved only when the PLT drops to less than 5% above the baseline (the low water mark). Note that the thresholds are revised periodically, so the values shown here are examples to illustrate the overall concept.
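The two water marks act as a hysteresis band: a page has to get much worse to open a regression, and has to recover almost fully to close it. Here is a minimal Python sketch of that idea, using the example thresholds above. The function names, the millisecond units, and the weekly replay are assumptions made for illustration; they are not the actual LinkedIn implementation.

```python
# Illustrative thresholds only: the real values are revised periodically.
HIGH_WATER_MARK = 0.30  # open when PLT is more than 30% above baseline
LOW_WATER_MARK = 0.05   # resolve when PLT is less than 5% above baseline


def relative_increase(plt_ms: float, baseline_ms: float) -> float:
    """Fractional PLT increase over the baseline (0.30 means 30% slower)."""
    return (plt_ms - baseline_ms) / baseline_ms


def next_status(is_open: bool, plt_ms: float, baseline_ms: float) -> bool:
    """Return whether the regression is open after this week's measurement."""
    delta = relative_increase(plt_ms, baseline_ms)
    if not is_open:
        # A closed regression opens only above the high water mark.
        return delta > HIGH_WATER_MARK
    # An open regression stays open until PLT falls below the low water mark.
    return not (delta < LOW_WATER_MARK)


def replay(weekly_plt_ms, baseline_ms):
    """Apply next_status week by week (hypothetical helper), starting closed."""
    is_open = False
    history = []
    for plt_ms in weekly_plt_ms:
        is_open = next_status(is_open, plt_ms, baseline_ms)
        history.append(is_open)
    return history
```

With a 1000 ms baseline and weekly PLTs of 1100, 1350, 1200, 1040, and 1200 ms, the regression opens in week two, stays open in week three even though the PLT has dropped below the high water mark, and closes in week four. The gap between the two thresholds keeps a ticket from flapping open and closed on small week-to-week changes.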
We open internal tickets for each regression and assign them to page owners. Page ownership is a widely used concept at LinkedIn. Each page has exactly one engineering owner. The engineering owner (usually the manager of a team) can use their team’s on-call process and pass ownership of fixing the regression to the on-call engineer.
Did it work?
Although the ticket process has been in place for a while, it initially struggled to get traction. We would have hundreds of regression tickets open for weeks without any resolution, and the project was marked RED or YELLOW for a long time. In short, the process was not working.
Specific problems we observed were:
It was not made clear who owned solving the issue
Page owners, who ideally should own solving the regression, did not always have the right tools or knowledge needed to understand the issue
Intermittent measurement issues would create false positives and erode trust in the process
The Performance Team was acting like an enforcer and not like an owner, going against LinkedIn’s values
Midway through last year, we stepped back and analyzed the situation holistically. Here’s what we realized:
We had taken an extremely engineering-focused approach by automating as much as possible, and
We had preemptively solved a scaling issue by distributing all the work to page owners.
Any such large-scale distributed system needs a bake-in period. Instead of rolling out a brand-new system and process to all of engineering, we needed to perfect the process by targeting a smaller subset of pages first. We had also put too much trust in automation without establishing a proper feedback mechanism from humans.
Making it work by “Acting Like an Owner”
We now have a Performance engineer assigned to all P0 and P1 tickets to triage them first. While the system has improved, we still occasionally find measurement issues (e.g., performance timing markers moved, data processing failed, etc.). Performance engineers triage the ticket to make sure it is not a measurement issue (and thus a false positive). They also triage to see if this is due to either a global degradation or a local, page-specific, but known problem. Finally, they attempt a basic root cause analysis. Then, they comment on the ticket detailing their findings.
We expect the page owner to drive the ticket from this point on. Performance engineers will help with further investigation of the problem, or with potential unrelated optimization opportunities for the page, but ownership of the ticket rests with the page owner.
Once the Performance engineer transfers ownership of the ticket to the page owner, the page owner is also responsible for satisfying the SLAs. For example, for P0 regressions they need to investigate the high PLT and come up with a plan to get the PLT back under baseline within two business days.
Fixing the root cause versus improving the page
It is important to note that since we are talking about slow leaks, root cause analysis in many cases doesn’t find a single culprit. More often than not, it is a combination of multiple issues that causes the page to degrade. Hence, page owners are always encouraged to think of optimization opportunities beyond the reasons why the page has regressed.
Where are we now?
We see about four P0 and P1 regressions every week right now for all of LinkedIn. This is in contrast to over 20 P0 and P1 regressions per week last year. As we get better at resolving P1 and P2 regressions, we hope to see P0 regressions less often.
We also noticed that the regression mechanism has become a catch-all for site speed problems. While we have world-class monitoring systems like EKG, XLNT (A/B testing), and Luminol, they all tend to have some false negatives in order to reduce false positives. Anything these systems miss tends to be caught by our regression process.
We want to retain the human touch, but simplify the work that Performance engineers and page owners do to get to the root cause (if any). So we are developing more automation around better root cause detection using multiple data sources to point the investigator in the right direction:
What other known events happened when the page went into regression? E.g., deployments, A/B test ramps, global issues, etc.
Tie together different pieces of data. E.g., make use of server side call fanout data, A/B test analysis, etc.
Correlate different metrics within Real User Monitoring (RUM) to understand the root cause better. E.g., a page load time increase driven by an increase in TCP connection time could point to an issue with a PoP.
Find out which combination of dimensions in the data set might have caused the regression. E.g., a bad browser release may have caused a regression.
Better integration of regression management with how bugs are handled by each team. E.g., assign regressions directly to the on-call engineer on the page owner’s team.
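As a rough illustration of the dimension analysis mentioned above, the Python sketch below compares median PLT per dimension value between a baseline period and the regression period, and ranks the values by how much they slowed down. The sample layout (dicts with a `plt_ms` field), the dimension names, and the function name are hypothetical. It also looks at one dimension at a time rather than at combinations, which is a simplification of what the list describes.

```python
from collections import defaultdict
from statistics import median


def rank_dimension_values(baseline, current, dimensions):
    """Rank (dimension, value) pairs by relative median PLT increase.

    `baseline` and `current` are lists of RUM-like samples, each a dict
    holding the given dimensions and a "plt_ms" field (assumed format).
    Returns (dimension, value, relative_increase) tuples, worst first.
    """
    def medians(samples):
        buckets = defaultdict(list)
        for sample in samples:
            for dim in dimensions:
                buckets[(dim, sample[dim])].append(sample["plt_ms"])
        return {key: median(values) for key, values in buckets.items()}

    base = medians(baseline)
    cur = medians(current)

    ranked = []
    # Only compare values that were observed in both periods.
    for dim, value in base.keys() & cur.keys():
        before = base[(dim, value)]
        after = cur[(dim, value)]
        ranked.append((dim, value, (after - before) / before))

    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Made-up data: browser "B" got slower, browser "A" did not.
baseline = [
    {"browser": "A", "pop": "X", "plt_ms": 1000},
    {"browser": "A", "pop": "Y", "plt_ms": 1000},
    {"browser": "B", "pop": "X", "plt_ms": 1000},
    {"browser": "B", "pop": "Y", "plt_ms": 1000},
]
current = [
    {"browser": "A", "pop": "X", "plt_ms": 1000},
    {"browser": "A", "pop": "Y", "plt_ms": 1000},
    {"browser": "B", "pop": "X", "plt_ms": 1500},
    {"browser": "B", "pop": "Y", "plt_ms": 1500},
]
ranked = rank_dimension_values(baseline, current, ["browser", "pop"])
```

In this made-up data set, browser "B" ranks first because its median PLT rose by 50%, while both PoPs show a smaller, evenly spread increase. That pattern would point the investigator toward the browser rather than the network. A real system would also need to account for sample sizes and traffic mix shifts, which this sketch ignores.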
Finally, some teams are instituting their own processes to reduce load further on the Performance engineers.
Slow leaks are a menace. LinkedIn has come up with tools and processes to catch them automatically. Fixing these regressions still needs human involvement, and so it cannot be done by a central team alone. Our process scales because it lets the centralized Performance Team use their domain knowledge and expertise with site speed and RUM data to triage regressions, while distributing the work of resolving them across engineering to page owners and their teams.
I want to acknowledge Sanjay Dubey, Badri Sridharan, David He, Anant Rao, Haricharan Ramachandra, and Vasanthi Renganathan, who initiated this work with me almost three years back. Also thanks to Brandon Duncan, who helped the Performance Team realize the importance of acting like an owner. And thanks to all the product engineering VPs and Directors who championed site speed along the way. A good culture starts at the top.
This work would not have been possible without significant design and dev work by Steven Pham, Dylan Harris, Ruixuan Hou, Sreedhar Veeravelli, and many others on the Performance Team. Also, thanks to all the performance engineers for putting hours into these regression tickets to help triage them. And finally, thanks to all the page owners for their feedback along the way and for keeping the bar high for site speed at LinkedIn!