Selection Bias in Online Experimentation – Airbnb Engineering & Data Science – Medium

What Is It?

As a statistician who moved into data science in industry less than two years ago, I find it inspiring to see the almost universal adoption of experimentation to guide product and business decisions. Statistical inference and hypothesis testing are ingrained in our daily work, and an appreciation of randomness informs much of our decision making. However, the era of big data and A/B testing at scale presents new challenges to our old methodologies.

If you have experience running A/B tests on a large-scale experimentation platform, you might have seen this before: the numbers don’t always add up. Here is an example. Over a couple of months, our team at Airbnb ran several experiments sequentially. Six of the experiments were launched to all users after showing a statistically significant lift on our target metric, with the exception of a small holdout group (split-holdout in the graph). After this spurt of successful experiments, we tallied their effects on our target metric and found this:

In the bottoms-up calculation, each number is the point estimate of the effect on the target metric, as measured by each successful A/B experiment. Naively, both the bottoms-up sum and the result of the “meta-experiment” of holdout are valid measurements of the aggregated total effect. How do we explain the gap between the two approaches?

Your first impression is probably, “Wait, are these effects truly additive?” YES, because we designed the process as a series of experiments conducted one after another. Let’s also assume that the percentage lift of each change is small enough that the cumulative effect can be found by either addition or multiplication (log(1+x) ≈ x when x is small). On the other hand, the answer is also NO, as it does not take a lot of domain knowledge to understand that there are many potential issues:

  1. Variance in the experiment results – Each estimate comes with its own confidence interval. The two numbers, 7.2% and 4%, may not be significantly different from each other once the variance of each is taken into account.
  2. Seasonal effect in experiments – Travel is highly seasonal and Airbnb is no exception. Although the experiments were run sequentially, so they could be summed up, their effects might have changed over time.
  3. Short term versus long term effects – Sometimes the individual experiments are only good at measuring the short term effects. Stacking them together does not necessarily lead to an accurate estimation of the long term effect estimated by the holdout experiment.
  4. Cannibalization between control and treatment – Even if the estimate in each experiment is unbiased, the effect itself can be different when the sample sizes are at different ratios because of the potential interaction of people within each experiment group. We call this “cannibalization” at Airbnb, and it is not uncommon in other experimentation setups as well. We have studied ways to address this effect, but they are out of scope for this post.
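The additivity assumption made earlier (log(1+x) ≈ x when x is small) is easy to check numerically. The sketch below uses hypothetical lifts, not the figures from our experiments, and compares the simple sum with the compounded effect.

```python
import math

# Hypothetical per-experiment lifts as fractions (illustrative, not real data).
lifts = [0.010, 0.005, 0.020, 0.008]

additive = sum(lifts)                              # bottoms-up sum of the lifts
compounded = math.prod(1 + x for x in lifts) - 1   # lifts applied one after another
log_sum = sum(math.log1p(x) for x in lifts)        # sum of log(1 + x)

print(f"additive   : {additive:.4%}")
print(f"compounded : {compounded:.4%}")
print(f"sum of logs: {log_sum:.4%}")
```

For lifts of this size the three numbers differ by well under a tenth of a percentage point, so the gap between a bottoms-up sum and a holdout measurement cannot be blamed on adding instead of multiplying.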

Each of the above issues could be a separate article. However, besides all of these plausible factors that could cause the gap, there is a more fundamental one that exists in almost all online experimentation platforms: selection bias, or what we call the Winner’s Curse.

Are You Suffering From Winner’s Curse?

The Winner’s Curse is a phenomenon in common value auctions where the winner tends to overpay for the value of the item. Analogously, the observed effects of selected experiments tend to overestimate the true effects of experiments. Here is a simplified illustration of the idea. Suppose we run 10 experiments, with the same fixed standard deviation of 1%. We collect enough samples in each experiment and observe their effects individually, as shown in the graph below in the first row. We also pretend not to know the underlying true effects, which are represented in the second row. Everything is in percentage scale.

We conduct Student’s t-tests for each experiment individually using the observed results. Because the standard deviation is 1, the test statistic is just the observation itself, and at a Type-I error level of α = 0.05 it needs to be greater than 1.96 to be significant. Three experiments are significant, as circled in red:

If we add up the observed effects from these three experiments, the total effect would be 2.7% + 2.6% + 3.3% = 8.6%. However, since we know the underlying true effects, we see that the aggregation of observed results is indeed larger than the sum of the true effects: 1% + 1% + 4% = 6%. The upward bias is 2.6 percentage points in this case.

In fact, the above illustration is not just an unlucky example, and the reason behind it can be explained formally with a few simplifying assumptions. Suppose that X₁, …, Xₙ are random variables defined on the same probability space, and each Xᵢ follows a distribution with finite mean aᵢ and finite variance σᵢ² (the distributions are not necessarily identical). We regard aᵢ as the unknown true effect and usually estimate it by the unbiased estimate Xᵢ.

Consider a vanilla setting of A/B testing, where we conduct a simple two-sample t-test for each experiment. We then select the “significant” experiments whose test statistic is larger than a threshold, which is equivalent to requiring that the observed effect exceed a positive threshold. Suppose we use significance level αᵢ, usually set at 0.05, for each experiment i. For now, let us assume that σᵢ² is known. We choose experiments such that Xᵢ/σᵢ > bᵢ, where bᵢ is the cut-off from the reference distribution for significance level αᵢ.

Let us define the set of significant experiments A = {i | Xᵢ/σᵢ > bᵢ}. Then the total true effect of A is T_A = ∑_{i∈A} aᵢ. If we add up the observed effects of the significantly positive experiments, the total estimated effect of A is S_A = ∑_{i∈A} Xᵢ. Because A is a random set, E[S_A] ≠ E[T_A] in general. The total true effect T_A is also random, so we define the expected total true effect as E[T_A] = E[∑_{i∈A} aᵢ].

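We can show that E[S_A] ≥ E[T_A]. In fact, writing membership in A with an indicator function 1{·},

E[S_A − T_A] = E[∑_{i∈A} (Xᵢ − aᵢ)] = ∑_{i=1}^{n} E[(Xᵢ − aᵢ) · 1{Xᵢ/σᵢ > bᵢ}] ≥ 0.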

All the summands are nonnegative because the mean of a mean-zero distribution truncated from below is never negative, and therefore the selection bias is never negative. As an illustration, we can plot the individual terms in the above summation as a function of the true effect aᵢ under the Normal distribution, with σᵢ = 1 and bᵢ set to the usual cutoff of roughly two standard deviations, i.e. 1.96:

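We can also plot the bias in terms of effect and p-values. Under this Normal model each term equals σᵢ·φ(bᵢ − aᵢ/σᵢ), where φ is the standard Normal density, so the bias is not monotone in the true effect: it is largest when the true effect sits right at the selection threshold (aᵢ/σᵢ = bᵢ) and decays toward zero as the true effect moves away from the threshold in either direction.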
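The shape of each summand E[(Xᵢ − aᵢ)·1{Xᵢ/σᵢ > bᵢ}] can be checked numerically. The sketch below assumes Xᵢ is Normal with mean aᵢ and standard deviation σᵢ, in which case the summand has the closed form σᵢ·φ(bᵢ − aᵢ/σᵢ), and compares that closed form with a plain Monte Carlo estimate. The function names are ours and purely illustrative.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, used for its density phi

def bias_term(a, sigma=1.0, b=1.96):
    """Closed form of E[(X - a) * 1{X / sigma > b}] for X ~ Normal(a, sigma^2)."""
    return sigma * STD_NORMAL.pdf(b - a / sigma)

def bias_term_monte_carlo(a, sigma=1.0, b=1.96, draws=100_000, seed=0):
    """Simulation estimate of the same expectation, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x = rng.gauss(a, sigma)
        if x / sigma > b:        # the experiment is selected
            total += x - a       # its estimation error enters the sum
    return total / draws

for a in (-1.0, 0.0, 1.0, 1.96, 3.0, 5.0):
    print(f"a = {a:5.2f}  closed form = {bias_term(a):.4f}  "
          f"simulated = {bias_term_monte_carlo(a):.4f}")
```

The printed values rise as the true effect approaches the 1.96 threshold and fall again beyond it, which is why experiments that are only marginally significant contribute the most to the Winner’s Curse.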

Now, it is easy to see that we have a way to quantify the total bias E[S_A − T_A] if there is an unbiased estimate for each term within the summation.

Winner’s curse occurs in the process of “selection”, where we pick the winning experiments and estimate their total effect by summing the individual observed effects of these selected experiments.

A reasonable simulation is better than a thousand words. We can show that the bottoms-up estimates (y-axis) are indeed substantially greater than the true effects (x-axis). Suppose that we have n = 30 experiments and measure their effects as incremental percentages. For each i, we sample aᵢ from a truncated Gaussian distribution, aᵢ ~ Zᵢ | (−1.5 < Zᵢ < 2) where Zᵢ ~ N(0.2, 0.7²), and σᵢ² from the inverse gamma distribution with shape parameter 3 and scale parameter 1. A prior centered slightly above zero is chosen for the true effects because we would like more of them to be positive in the simulation, a reasonable assumption for a product team seeking to improve a metric. In almost all of the 1000 simulation instances, the naive bottoms-up estimate lies above the 45-degree diagonal line on which it would equal the true effect.
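For readers who want to reproduce the flavor of this simulation, here is a minimal sketch. The priors for aᵢ and σᵢ² follow the description above; the Normal noise around aᵢ, the 1.96 threshold, the random seed, and the function names are our assumptions for illustration, so the numbers will not match the original plot exactly.

```python
import random

def sample_true_effect(rng):
    """Draw a_i from N(0.2, 0.7^2) truncated to (-1.5, 2) by rejection sampling."""
    while True:
        z = rng.gauss(0.2, 0.7)
        if -1.5 < z < 2.0:
            return z

def simulate_once(rng, n=30, b=1.96):
    """Return (total true effect, bottoms-up estimate) over the selected set A."""
    total_true, total_observed = 0.0, 0.0
    for _ in range(n):
        a = sample_true_effect(rng)
        # If G ~ Gamma(shape=3, scale=1), then 1/G ~ InverseGamma(shape=3, scale=1).
        sigma = (1.0 / rng.gammavariate(3.0, 1.0)) ** 0.5
        x = rng.gauss(a, sigma)      # observed effect; Normal noise is an assumption
        if x / sigma > b:            # selection rule: significantly positive
            total_true += a
            total_observed += x
    return total_true, total_observed

rng = random.Random(2024)
results = [simulate_once(rng) for _ in range(1000)]

share_over = sum(obs > true for true, obs in results) / len(results)
mean_gap = sum(obs - true for true, obs in results) / len(results)
print(f"share of runs with bottoms-up > true total: {share_over:.1%}")
print(f"average overestimate (percentage points):   {mean_gap:.2f}")
```

Plotting the second element of each pair in `results` against the first gives a scatter of bottoms-up estimates against true totals, to be compared with the 45-degree line.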

Is Selection Bias A Real Problem?

This might sound familiar to you if you have heard about “multiple testing”. Admittedly, we are testing multiple hypotheses: by using a universal p-value threshold, we’re highly likely to have a few winning experiments even when all of them are just pure noise. Instead of using the standard p-value threshold of 0.05 (somewhat arbitrary as well), we can adjust it intelligently to control the family-wise error rate or the false discovery rate. There are well-studied methods, such as the Bonferroni correction and the Benjamini-Hochberg procedure, to address this problem.
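As a reminder of what these two corrections do, here is a small self-contained sketch. The p-values are hypothetical and the function names are ours; in practice one would more likely rely on an existing statistics library than on hand-rolled code.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / m (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank                 # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Hypothetical p-values from ten experiments (illustrative only).
p_values = [0.001, 0.008, 0.012, 0.030, 0.041, 0.049, 0.20, 0.35, 0.60, 0.90]

print("unadjusted (p <= 0.05):", sum(p <= 0.05 for p in p_values))
print("Bonferroni            :", sum(bonferroni(p_values)))
print("Benjamini-Hochberg    :", sum(benjamini_hochberg(p_values)))
```

Both procedures change which experiments are declared winners, but neither changes the observed effect sizes of those winners.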

However, what we are trying to explain is more than false positives. There are actually two aspects of selection bias we hope to address. First, even when there are no false positives, i.e. all the experiments we run indeed have positive true effects, the measurement of the aggregated effect will be an overestimate. In addition, the experiments not selected, i.e. those with i ∉ A, also contribute to the overall bias.

For a single selected experiment, the observed result in the A/B test is expected to overstate its true effect. Usually the sample mean is an unbiased estimate of the population mean. However, because we become interested in this particular experiment only after observing that it is significantly positive, we are actually looking at an estimate conditional on the experiment having been selected, that is, on its observation exceeding a given threshold. Hence an upward bias is introduced.

Second, in the above formulation we can see that even the non-significant experiments contribute to the bias, which is a little counterintuitive. Why are we interested in estimating E[T_A] in particular? If one focuses only on the selected experiments, one can correct for T_A | A; instead, we are integrating out the selection set and measuring the bias of the process as a whole. This is a fundamentally different correction, and we think it represents the actual experimentation process better. Let us consider an intuitive example with the following two cases:

  1. One runs 1 experiment, which turns out to be significant with effect X₁= 1 and p-value ~ 0.001.
  2. One runs 1000 experiments, of which only the first is significant, with effect X₁ = 1 and p-value ~ 0.001.

Intuitively, if we estimate the total true effect as 1, case 2) carries a higher risk of overestimating the total true effect because of selection bias. Correcting according to E[T_A] takes this risk into account. On the other hand, if we condition on the fact that only the first experiment is selected, the bias corrections are the same for cases 1) and 2).

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.

Okay, What Shall We Do Then?

Is this something you do as a team for attribution, i.e. declaring the total impact based on experimentation results? If so, you should consider accounting for the bias or doing your best to mitigate it through better experiment design.

With the above formulation, we have come up with a straightforward unbiased estimate of the Winner’s Curse bias. If we assume the true parameters aᵢ and σᵢ² are known, one can derive the bias E[S_A − T_A] to be

Since aᵢ and σᵢ are usually unknown, we use the estimates Xᵢ and Wᵢ, where Wᵢ is the estimated standard deviation of Xᵢ, to define the bias estimate
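To make the adjustment concrete, here is a minimal sketch under one explicit assumption: each Xᵢ is Normal, so that the bias with known parameters is the sum over all n experiments of σᵢ·φ(bᵢ − aᵢ/σᵢ), and the plug-in estimate replaces aᵢ with Xᵢ and σᵢ with Wᵢ. Whether this matches the exact formula implemented in our tooling should not be read off this snippet; the inputs are hypothetical and the function names are ours.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, used for its density phi

def estimated_bias(effects, std_errors, thresholds):
    """Plug-in bias estimate: sum over ALL experiments (selected or not)
    of W_i * phi(b_i - X_i / W_i). Assumes Normal estimates."""
    return sum(w * STD_NORMAL.pdf(b - x / w)
               for x, w, b in zip(effects, std_errors, thresholds))

def bottoms_up(effects, std_errors, thresholds):
    """Sum of observed effects over the selected set A = {i : X_i / W_i > b_i}."""
    return sum(x for x, w, b in zip(effects, std_errors, thresholds)
               if x / w > b)

# Hypothetical observed lifts (%), their standard errors, and launch thresholds.
effects = [1.2, 0.3, 2.1, -0.4, 0.9]
std_errors = [0.5, 0.4, 0.8, 0.6, 0.5]
thresholds = [1.96] * len(effects)

naive = bottoms_up(effects, std_errors, thresholds)
bias = estimated_bias(effects, std_errors, thresholds)
adjusted = naive - bias

print(f"bottoms-up estimate: {naive:.2f}%")
print(f"estimated bias     : {bias:.2f}%")
print(f"adjusted estimate  : {adjusted:.2f}%")
```

Note that the bias sum runs over every experiment in the attribution set, not only the launched ones, which is how the non-significant experiments enter the correction.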

Subtracting this bias estimate from the bottoms-up estimate provides the bias-adjusted estimate of the aggregated effect. We then built a feature in ERF, our Experiment Reporting Framework, that uses this formula to calculate the bias automatically. We can choose the set of experiments subject to selection and specify the metric to be analyzed as well as the selection rules. One can also build a confidence interval for the bias-adjusted estimate using the bootstrap method. In the earlier example, we adjusted for the bias to get a total effect of 5.3% rather than 7.2%. As you can see, the confidence interval is also quite large.

This de-biasing method relies on very few assumptions, especially compared to Bayesian methods, which require specific knowledge of the priors. Over the course of developing the method and implementing it in ERF, we learned several lessons.

  • Be aware of the selection process and communicate it well with the team. As we showed before, the bias is positive, so debiasing means a smaller estimated aggregated effect than the bottoms-up method gives. Moreover, notice that if all the experiments are equally promising, the more experiments we start, the higher the risk of introducing upward bias into the bottoms-up estimate. It is not always easy to be objective and accept the overestimation when every experiment is part of the team effort.
  • Set up experiments with clear hypotheses and launch conditions, and then apply the bias adjustment method. Trial and error comes at the cost of inaccurate measurement if no extra data is collected. Beyond just experimenting with ideas quickly, we also want to do it smartly. To apply the bias adjustment method, we would like to have a simple selection rule that can be written as a threshold bᵢ for each experiment. Keep in mind that these thresholds need to be specified without looking at experimentation data. Therefore, it is critical to clarify the hypothesis and launch condition before starting any experiment.
  • Determine the set of experiments for aggregation with caution. The attribution set {1, …, n} determines where the bias will come from, hence it is essential to decide on the set properly beforehand. Telling where the selection process is happening is non-trivial and depends on our own judgement. As a rule of thumb, we should consider the series of experiments one team works on towards the same goal, along the same idea, over a fixed time. For example, the Search Team has been running experiments with different messages on the search page to encourage guests to make a booking decision.
  • Consider other ways to obtain more accurate measurements. For example, setting up a global holdout has been an effective way for us to achieve better estimation, not only to mitigate selection bias, but also to address other factors like seasonality, long-term versus short-term effects etc. Similarly, with the benefit of more data, one could also set up a new experiment just testing the selected features as a bundle to get an accurate measurement free of selection bias. In reality, we don’t conduct this often as it requires extra engineering work and slows down product development.

Final Thoughts

Measurement plays a crucial role in data-informed decision making. When online experiments are costly and have to be performed efficiently, we inevitably use the same data for both inference and model selection. There has been a long-running discussion in both academia and industry around “p-hacking” and similar ideas. An extensive literature tries to tackle this problem in various applications, such as econometrics and genome-wide association studies. Our approach, although it makes simplifying assumptions about the selection rule, is a quick and effective way to account for selection bias without many additional assumptions or prior knowledge, especially in large-scale online experimentation platforms.

In practice, various characteristics of online experiments make application of the theoretical work very challenging. Not every decision-making process follows the same rules. It is only when we keep both statistical rigor and practical concerns in mind that we will be able to move the frontier of product development forward. On the Data Science team at Airbnb, we’re excited about how many interesting problems and undefined opportunities await us in the future.
