Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping

By Gang Su & Ian Yohai

From living rooms in Bogota, to morning commutes in Tokyo, to beaches in Los Angeles and dorms in Berlin, Netflix strives to bring joy to over 139 million members around the globe and connect people with stories they’ll love. Every bit of the customer experience is imbued with innovation, right from the very first encounter with Netflix during the signup process — whether it be on mobile, tablet, laptop or TV. We strive to bring the best experience to our customers through experimentation by continuously learning from data and refining our product. In the customer acquisition area, we aim to make the signup process as accessible, smooth, and intuitive as possible.

There are numerous challenges in experimentation at scale. But believe it or not, even with millions of global daily visitors and state-of-the-art A/B testing infrastructure, we still wish we had larger samples to test more innovative ideas. There are many benefits to ending experiments early when possible. To name just a few:

  • We could run more tests in the same amount of time, improving our chances of finding better experiences for our customers.
  • We could rapidly test the waters to identify the best areas to invest in for future innovation.
  • If we could, in a principled way, end an experiment earlier when a sizable effect is detected, we could bring more delight to our customers sooner.

On the other hand, there are some risks associated with running short experiments:

  • Very often tests are allocated for much longer than the minimum required time determined by power analysis, in order to mitigate potential seasonal fluctuations (e.g., time of day, day of week, week over week), identify any diminishing novelty effect, or account for treatment effects which may take longer to manifest.
  • Holidays and special events, such as new title launches, may attract non-representative audiences. This may render the test results less generalizable.
  • Improperly calling experiments early (such as by HARKing or p-hacking) may substantially inflate false positive rates and consequently lead to wasted business effort.

So in order to develop a scientific framework for faster product innovation through experimentation, there are two key questions we would like to answer: 1) How much, if at all, does seasonality impact our experiments? 2) If seasonality is not a great concern, how can we end experiments early in a scientifically principled way?

Detecting Seasonal Effects with Meta Analysis

While seasonality is perceived to render short tests less generalizable, not all tests should be equally vulnerable. For example, if we experiment on the look and feel of a ‘continue’ button, Monday visitors should not have drastic differences in aesthetic preference compared with Friday visitors. On the other hand, a background image featuring a new original TV series may be more compelling during the time of launch when visitors may have higher awareness and intent to join. The key, then, is to identify the tests with time-invariant treatment effects and run them more efficiently. This requires a mix of technical work and experience.

The secret sauce we used here is meta-analysis, a simple yet powerful method of analyzing related analyses. We adopted this methodology to identify time-varying treatment effects. One frequent application of this method in healthcare is to combine results from independent studies to boost power and improve estimates of treatment effects, such as the efficacy of a new drug. At a high level:

  • If outcomes from independent studies are consistent, as shown on the left side of the following chart, the data can be fitted with a fixed-effect model to generate a more confident estimate. The treatment effects of the five individual tests were all statistically insignificant but directionally negative; when pooled together, the model produces a more precise estimate, as shown in the fixed-effect row.
  • By contrast, if outcomes from independent studies are inconsistent, as shown on the right side of the chart, with both positive and negative treatment effects, meta-analysis will appropriately acknowledge the higher degree of heterogeneity. It will switch to a random-effects model, which produces the wider confidence intervals shown in the future prediction interval row.
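The fixed-effect pooling described above can be sketched with classic inverse-variance weighting: each study is weighted by the inverse of its squared standard error, so more precise studies count for more. The effect sizes and standard errors below are made up for illustration and are not Netflix data.

```python
import math

def fixed_effect_pool(effects, std_errs):
    """Pool independent estimates with inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Five directionally negative but individually insignificant tests,
# mirroring the left side of the chart (hypothetical numbers).
effects = [-0.4, -0.2, -0.5, -0.3, -0.1]
std_errs = [0.30, 0.25, 0.35, 0.28, 0.27]

pooled, pooled_se = fixed_effect_pool(effects, std_errs)
# The pooled standard error is smaller than every individual one,
# which is exactly the power boost meta-analysis provides.
print(pooled, pooled_se)
```

Note that while each individual test fails to clear the usual 1.96 z-threshold, the pooled estimate can, which is what the fixed-effect row of the chart conveys.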

More details can be found in this reference. The model-fitting process (i.e. fixed-effect model versus random-effect model) can be leveraged to test whether heterogeneous treatment effects are present across time dimensions (e.g., time of day, day of week, week over week, pre-/post-event). We conducted a comprehensive retrospective study in A/B tests on the signup flow. As expected, we found most tests do not demonstrate strong heterogeneous treatment effects over time. Therefore, we could have ended some tests early, innovated more, and brought an even better experience to our prospective customers sooner.
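One common way to formalize the fixed-effect versus random-effects decision is Cochran's Q statistic and the derived I² index: Q measures weighted dispersion of the study effects around the pooled estimate, and I² estimates the fraction of variation due to genuine heterogeneity (a rule of thumb treats I² above roughly 50% as substantial). This is a hedged sketch of that check; the two data sets are illustrative time-of-week slices, not Netflix results.

```python
def heterogeneity(effects, std_errs):
    """Return Cochran's Q and the I^2 index for independent estimates."""
    weights = [1.0 / se ** 2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Day-of-week slices with consistent effects: low heterogeneity,
# so a fixed-effect model is adequate.
consistent = heterogeneity([-0.4, -0.2, -0.5, -0.3, -0.1],
                           [0.30, 0.25, 0.35, 0.28, 0.27])

# Slices with mixed signs: high heterogeneity, pointing to a
# random-effects model and a time-varying treatment effect.
mixed = heterogeneity([0.8, -0.7, 0.6, -0.9, 0.5],
                      [0.30, 0.25, 0.35, 0.28, 0.27])
print(consistent, mixed)
```

In a retrospective study like the one described above, running this check across time slices of each test flags the minority of tests whose effects genuinely vary over time.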

End Experiments Early with Optimal Stopping

Assuming that a treatment effect is both time-invariant (evaluated by meta-analysis) and sufficiently large, we can apply various optimal-stopping strategies to end tests early. Naively, we could constantly peek at experiment dashboards, but this will inflate false positives when we mistakenly believe a treatment effect is present. There are scientific methodologies to control for false positives (Type I errors) with peeking (or, more formally, interim analyses). Several methods have been assessed in our retrospective study, such as Wald’s Sequential Probability Ratio Tests (SPRT), Sequential Triangular Testing, and Group Sequential Testing (GST). GST demonstrated the best performance and practical value in our study; it is widely used in clinical trials, in which samples are accumulated over time in batches, which is a perfect fit for our use case. This is roughly how it works:

  • Before a test starts, we decide the minimum required running time and the number of interim analyses.
  • GST then allocates the total tolerable Type I error (for example, 0.05) across the interim analyses, such that the per-analysis error budgets sum to the overall Type I error. As a result, each interim test is more conservative than a naive peek.
  • A test can be stopped immediately whenever it becomes statistically significant. This often happens when the observed treatment effect size is substantially larger than expected.
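The allocation step above is often implemented with an alpha-spending function. As one illustration (an assumption on our part, not necessarily the exact design Netflix uses), the O'Brien-Fleming-type spending function of Lan and DeMets spends almost no alpha at early looks and saves most of the budget for the final analysis, with five equally spaced looks and an overall alpha of 0.05:

```python
import math
from statistics import NormalDist

def obf_spending(alpha, information_fractions):
    """Cumulative alpha spent at each look, O'Brien-Fleming-type
    (Lan-DeMets) spending function for a two-sided test."""
    norm = NormalDist()
    z = norm.inv_cdf(1.0 - alpha / 2.0)
    return [2.0 * (1.0 - norm.cdf(z / math.sqrt(t)))
            for t in information_fractions]

fractions = [0.2, 0.4, 0.6, 0.8, 1.0]   # five equally spaced looks
cumulative = obf_spending(0.05, fractions)
incremental = [b - a for a, b in zip([0.0] + cumulative, cumulative)]
# Early looks spend almost no alpha, so stopping early requires a
# very large observed effect; the final look keeps most of the 0.05.
print(cumulative, incremental)
```

The steep early boundaries are why, as noted above, an early stop usually happens only when the observed effect is substantially larger than expected.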

The following chart illustrates the critical values and the individual and cumulative alpha spends from a GST design with five interim analyses. By adopting this strategy, we could have saved substantial time in running some experiments and obtained very accurate point estimates of the treatment effects much sooner, albeit with slightly wider confidence intervals and a small inflation of the estimated treatment effects. It works best when we would like to do a quick test of ideas and the accuracy of the magnitude of the treatment effect is less critical, or when we need to end a test prematurely due to a severe negative impact.

The following chart illustrates a successful GST early stop alongside the fixed sample size (FSS) full stop determined by power analysis. Since the observed effect size was sufficiently large, we could have stopped the test earlier with a similar point estimate.
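The FSS baseline in that comparison comes from a standard power analysis. A minimal sketch, using the classic two-proportion formula with hypothetical baseline and lift values (these numbers are illustrative, not Netflix conversion rates), shows why a large true effect lets a sequential design stop long before the FSS is reached:

```python
import math
from statistics import NormalDist

def fss_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Visitors per arm needed to detect p_control -> p_treatment
    with a two-sided test at the given alpha and power."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)
    z_beta = norm.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# A 1-point lift needs far more samples than a 3-point lift; when the
# realized effect is big, interim z-statistics cross the GST boundary
# well before the FSS full stop.
small_lift = fss_per_arm(0.10, 0.11)
large_lift = fss_per_arm(0.10, 0.13)
print(small_lift, large_lift)
```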

Building Decision Support into Our Experimentation Platform

Now that our initial research is complete, we are actively building meta-analysis, optimal stopping, heterogeneous treatment effect detection, and much more into the larger Netflix Experimentation and Causal Inference platform. We hope such features will accelerate our current experimentation workflow, expedite product innovation, and ultimately bring the best experience and delight to our customers. This is an ongoing journey, and if you are passionate about our mission and our exciting work, join our all-star team!

Special thanks to the support from Randall Lewis, Colin McFarland and the Science and Analytics team at Netflix. Teamwork makes the dream work!