Budget-split testing: A trustworthy and powerful approach to marketplace A/B testing

Co-authors: Min Liu, Vangelis Dimopoulos, Elise Georis, Jialiang Mao, Di Luo, and Kang Kang

The LinkedIn ecosystem drives member and customer value through a series of marketplaces (e.g., the ads marketplace and the talent marketplace). We maximize that value by making data-informed product decisions via A/B testing. Traditional A/B tests on our marketplaces, however, are often statistically biased and under-powered. To mitigate this, we developed “budget-split” testing, which makes marketplace A/B tests more trustworthy and more powerful. Read on to learn about the problem, the solution, and the results, using the ads marketplace as a running example. For more technical details, please refer to the paper “Trustworthy Online Marketplace Experimentation with Budget-split Design.”

Problems with marketplace A/B testing

To add some important context, modern online ad marketplaces use auction-based models for ad assignment. Advertisers set an objective, an audience, a campaign budget, and a bidding strategy for each ad campaign. Each “result” (a member click, view, etc., depending on the objective) consumes a portion of the campaign budget, and the campaign runs for its set duration until it ends or its budget is exhausted. As a consequence, the revenue generated by a campaign can never exceed its budget.
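The budget cap can be sketched in a few lines. This is a deliberately simplified illustration, not how the ads marketplace actually prices results: it assumes a fixed cost per result, whereas in a real auction the price varies from result to result. The function name and parameters are hypothetical.

```python
def campaign_revenue(budget, cost_per_result, results_demanded):
    """Revenue from one campaign under a hard budget cap.

    Assumption (for illustration only): every result costs the same amount.
    """
    # Number of results the budget can pay for in full.
    affordable = int(budget // cost_per_result)
    # The campaign stops delivering once the budget is used up.
    delivered = min(results_demanded, affordable)
    return delivered * cost_per_result


# Demand below the cap: revenue grows with the number of results.
print(campaign_revenue(budget=100.0, cost_per_result=2.0, results_demanded=30))
# Demand above the cap: revenue is pinned at the budget.
print(campaign_revenue(budget=100.0, cost_per_result=2.0, results_demanded=80))
```

Once demand exceeds what the budget can pay for, additional results no longer add revenue, which is the property the rest of this post relies on.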

When running A/B tests on the ads marketplace, we noticed two types of problems:

  1. When testing a new ad feature, we’d often see a strong metric impact in our experiment, but wouldn’t observe the same level of impact once the feature was launched to the entire marketplace.

  2. Many tests required an unacceptably long time to achieve statistically significant results. 

The first problem exemplified cannibalization bias, while the second stemmed from insufficient statistical power.

Cannibalization bias
We can illustrate cannibalization bias with a hypothetical example (note: real-world manifestations of this bias are less extreme than this hypothetical). Suppose that we want to test how a new ad feature (e.g., improving the match between ads and members) impacts ad impressions and revenue. Prior to our experiment, let’s say all ad campaigns were already spending 100% of their budgets (i.e., no new feature can increase ads revenue further). If we test our new feature in a traditional A/B test and observe an increase in the number of ad impressions, the test would also show a corresponding increase in revenue for the treatment group. Once we launch the feature to the entire marketplace, however, we won’t see that same increase in revenue, because all campaigns were already spending 100% of their budgets. So why did our A/B test lead us to the wrong conclusion?
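A small numeric sketch makes the discrepancy in this hypothetical concrete. It rests on assumptions that are ours, not measurements: a single campaign whose budget is shared by the control and treatment members, spend divided between the two arms in proportion to the impressions each arm demands, and a made-up 20% increase in impression demand from the new feature. All names and numbers are illustrative.

```python
def arm_revenue(budget, demand_control, demand_treatment):
    """Split one campaign's capped spend between the two arms.

    Assumption: both arms draw on the same budget, and spend is
    divided in proportion to each arm's impression demand.
    """
    total_demand = demand_control + demand_treatment
    spend = min(budget, total_demand)  # revenue cannot exceed the budget
    return (spend * demand_control / total_demand,
            spend * demand_treatment / total_demand)


budget = 100.0       # campaign budget, at 1 unit of spend per impression
base_demand = 100.0  # impressions demanded per arm without the feature
feature_lift = 1.2   # hypothetical: the feature raises demand by 20%

# Traditional A/B test: only the treatment arm gets the feature.
control_rev, treatment_rev = arm_revenue(
    budget, base_demand, base_demand * feature_lift)
observed_lift = treatment_rev / control_rev - 1

# Full launch: every member gets the feature, but the budget is unchanged.
revenue_before = min(budget, 2 * base_demand)
revenue_after = min(budget, 2 * base_demand * feature_lift)
true_lift = revenue_after / revenue_before - 1

print(f"lift measured in the A/B test: {observed_lift:.1%}")
print(f"lift after full launch:        {true_lift:.1%}")
```

Under these assumptions the treatment arm's extra revenue is taken out of the control arm's share of the same budget, so the test reports a lift even though total revenue before and after launch is identical.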
