Previously, we’ve posted about the importance we put in Etsy’s experimentation systems for our decision-making process. In a continuation of that theme, this post will dive deep into an interesting edge case we discovered.
We ran an A/B test which required a 5% control variant and 95% treatment variant rather than the typical split of 50% for control and treatment variants. Based on the nature of this particular A/B test, we expected a positive change for conversion rate, which is the percent of users that make a purchase.
At the conclusion of the A/B test, we had some unexpected results. Our A/B testing tool, Catapult, showed the treatment variant “losing” to the control variant. Catapult was showing a negative change in conversion rate when we’d expect a positive rate of change.
Due to these unexpected negative results, the Data Analyst team investigated why this was happening. This quote summarizes their findings:
The control variant “benefited” from double-bucketing because given its small size (5% of traffic), receiving an infusion of highly engaged browsers from the treatment provided an outsized lift on its aggregate performance.
With the double-bucketed browsers excluded, the true change in conversion rate is positive, which is the result we expected from the A/B test. Just 0.02% of the total browsers in the A/B test were double-bucketed. This small percentage of the total browsers had a large impact on the A/B test results. This post will cover the details of why that occurred.
Definition of Double-bucketing
So what exactly is double-bucketing?
In an A/B test, a user is shown either the control or treatment experience. The process to determine which variant the user falls into is called ‘bucketing’. Normally, a user experiences only the control or only the treatment; however, in this A/B test, there was a tiny percentage of users who experienced both variants. We call this error in bucketing ‘double-bucketing’.
Typical user 50/50 bucketing for an A/B test puts ½ of the users into the control variant and ½ into the treatment variant. Those users stay in their bucketed variant. We calculate metrics and run statistical tests by summing all the data for the users in each variant.
However, the double-bucketing error we discovered would place the last 2 users in both control and treatment variants, as shown below. Now those users’ data is counted in both variants for statistics on all metrics in the experiment.
How browsers are bucketed
Before discussing the cases of double-bucketing that we found, it helps to have a high-level understanding of how A/B test bucketing works at Etsy.
For etsy.com web requests, we use a unique identifier from the user’s browser cookie which we refer to as “browser id”. Using the string value from the cookie, our clickstream data logic, named EventPipe, sets the browser id property on each event.
Bucketing is determined by a hash. First we concatenate the name of the A/B test and the browser id. The name of the A/B test is referred to as the “configuration flag”. That string is hashed using SHA-256 and then converted to an integer between 0 and 99. For a 50% A/B test, if the value is < 50, the browser is bucketed into the treatment variant. Otherwise, the browser is in the control variant. Because the hashing function is deterministic, the user should be bucketed into the same variant of an experiment as long as the browser cookie remains the same.
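The hashing steps above can be sketched in a few lines of Python. This is a minimal illustration, not Etsy's actual implementation: the function names are hypothetical, and the post does not say how the SHA-256 digest is converted to an integer between 0 and 99, so taking the digest modulo 100 is an assumption.

```python
import hashlib


def bucket_value(config_flag: str, browser_id: str) -> int:
    """Map (configuration flag, browser id) to a stable integer in [0, 99]."""
    # Concatenate the A/B test name and the browser id, then hash the result.
    key = (config_flag + browser_id).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Assumption: modulo 100 is one common way to reduce a hash to 0-99.
    return int(digest, 16) % 100


def assign_variant(config_flag: str, browser_id: str,
                   treatment_percent: int = 50) -> str:
    """Return "treatment" if the bucket value falls below the threshold."""
    if bucket_value(config_flag, browser_id) < treatment_percent:
        return "treatment"
    return "control"
```

Because the hash is deterministic, calling `assign_variant` twice with the same flag and browser id always returns the same variant. Passing `treatment_percent=95` would model the 5% control test described in this post.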
EventPipe adds the configuration flag and bucketed variant information to the “ab” property on events.
For an A/B test’s statistics in Catapult, we filter by the configuration flag and then group by the variant.
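To make the filter-then-group step concrete, and to show how double-bucketed browsers surface in that data, here is a small sketch. The event shape (a `browser_id` key and an `ab` dictionary mapping configuration flag to variant) and the flag name `new_checkout` are assumptions for illustration; the real EventPipe schema is not described in this post.

```python
from collections import defaultdict


def variants_by_browser(events, config_flag):
    """Filter events by configuration flag and collect the variants seen per browser."""
    seen = defaultdict(set)
    for event in events:
        variant = event.get("ab", {}).get(config_flag)
        if variant is not None:
            seen[event["browser_id"]].add(variant)
    return seen


def double_bucketed(events, config_flag):
    """Return the browser ids that appear in more than one variant."""
    grouped = variants_by_browser(events, config_flag)
    return {browser for browser, variants in grouped.items() if len(variants) > 1}


# Hypothetical events: browser b3 shows up in both variants.
events = [
    {"browser_id": "b1", "ab": {"new_checkout": "control"}},
    {"browser_id": "b2", "ab": {"new_checkout": "treatment"}},
    {"browser_id": "b3", "ab": {"new_checkout": "treatment"}},
    {"browser_id": "b3", "ab": {"new_checkout": "control"}},
    {"browser_id": "b4", "ab": {"other_test": "control"}},
]
```

With these sample events, `double_bucketed(events, "new_checkout")` returns `{"b3"}`. A naive group-by on variant would count b3's data in both the control and the treatment totals.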
This bucketing logic is consistent and has worked well for our A/B testing for years. Although occasionally some experiments wound up with small numbers of double-bucketed users, we didn’t detect a significant impact until this particular A/B test with a 5% control.
Some Example Numbers (fuzzy math)
We’ll use some example numbers with some fuzzy math to understand how the conversion rate was affected so much by only 0.02% double-bucketed browsers.
For most A/B tests, we do 50/50 bucketing between the control and treatment variants. For this A/B test, we did a 5% control which puts 95% in the treatment.
If we start with 1M browsers, our 50% A/B test has 500K browsers in both control and treatment variants. Our 5% control A/B test has 50K browsers in the control variant and 950K in the treatment variant.
Let’s assume a 10% conversion rate for easy math. For the 50% A/B test, we have 50K converted browsers in both the control and treatment variant. Our 5% control A/B test has 5K converted browsers in the control variant and 95K in the treatment variant.
For the next step, let’s assume 1% of the converting browsers are double-bucketed. When we add the double-bucketed browsers from the opposite variant to both the numerator and denominator, we get a new conversion rate. For our 50% A/B test, that is 50,500 converted browsers out of 500,500 in each of the control and treatment variants. The new conversion rate is slightly off from the expected conversion rate, but only by about 0.1 percentage points (roughly 10.1% instead of 10%).
For our 5% control A/B test, the treatment variant’s number of converted browsers only increased by 50 browsers from 95,000 to 95,050. The treatment variant’s new conversion rate still rounds to the expected 10%.
But for our 5% control A/B test, the control variant’s number of converted browsers jumps from 5000 to 5950 browsers. This causes a huge change in the control variant’s conversion rate – from 10% to 12% – while the treatment variant’s conversion rate was unchanged.
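The fuzzy math above can be reproduced with a short script. This is only a sketch of the arithmetic in this section, using the same illustrative numbers (1M browsers, 10% conversion rate, 1% of converting browsers double-bucketed); the function and parameter names are made up for the example.

```python
def observed_rate(browsers, converted, leaked_in):
    """Conversion rate once converting browsers from the other variant are also counted."""
    # Double-bucketed browsers are added to both numerator and denominator.
    return (converted + leaked_in) / (browsers + leaked_in)


def simulate(total=1_000_000, control_share=0.05, base_rate=0.10, double_rate=0.01):
    """Return (control_rate, treatment_rate) after double-bucketing."""
    control = round(total * control_share)
    treatment = total - control
    control_conv = round(control * base_rate)
    treatment_conv = round(treatment * base_rate)
    # 1% of each variant's converting browsers also land in the opposite variant.
    leak_to_control = round(treatment_conv * double_rate)
    leak_to_treatment = round(control_conv * double_rate)
    return (
        observed_rate(control, control_conv, leak_to_control),
        observed_rate(treatment, treatment_conv, leak_to_treatment),
    )


control_5, treatment_95 = simulate(control_share=0.05)
control_50, treatment_50 = simulate(control_share=0.50)
```

For the 5% control test this gives 5,950 / 50,950 (about 11.7%) for control and 95,050 / 950,050 (about 10.0%) for treatment. For the 50/50 test both variants come out at 50,500 / 500,500 (about 10.1%), so the distortion is symmetric and largely cancels out.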
Cases of Double-bucketing
Once we understood that double-bucketing was causing these unexpected results, we started digging into what cases led to double-bucketing of individual browsers. We found two main cases. Since conversion rates were being affected, unsurprisingly both cases involved checkout.
- Checkout from new device
- Checkout from Pattern (individual seller sites hosted by Etsy on a different domain)
Checkout from new device
When browsing etsy.com while signed out, you can add listings to your cart.
Once you click the “Proceed to checkout” button, you are prompted to sign in. You get a sign in screen similar to this.
After you sign in, if we have never seen your browser before, then we email you a security alert that you’ve been signed in from a new device. This is a wise security practice and pretty standard across the internet.
Many years ago, we were doing A/B testing on emails which were all sent from offline jobs. Gearman is our framework for running offline jobs based on http://gearman.org. In Gearman, we have no access to cookies and thus cannot get the browser id, but we do have the email address. So override logic was added deep in email template logic to bucket by email address rather than by browser id.
This worked perfectly for A/B testing in emails sent by Gearman, but the logic applied to all emails, not just those sent by Gearman. The security email is sent by the sign in request, not by Gearman, yet the override still switched the bucketing ID from the browser id to the user’s email address. As a result, the same browser could be bucketed into two different variants: once using the browser id and once using the email address.
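To see why switching the bucketing ID matters, the sketch below hashes the same experiment name with two different identifiers for each simulated user and counts how often the resulting variants disagree. Everything here is hypothetical (the flag name, the ids, and the modulo-100 reduction of the SHA-256 digest); it only illustrates that two unrelated identifiers are bucketed independently.

```python
import hashlib


def variant_for(config_flag, bucketing_id, treatment_percent=95):
    """Bucket on an arbitrary id; assumes the hash is reduced with modulo 100."""
    digest = hashlib.sha256((config_flag + bucketing_id).encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_percent else "control"


def count_mismatches(config_flag, n=10_000):
    """Count simulated users whose browser id and email address bucket differently."""
    mismatches = 0
    for i in range(n):
        browser_id = f"browser-{i}"      # hypothetical cookie value
        email = f"user{i}@example.com"   # hypothetical email address
        if variant_for(config_flag, browser_id) != variant_for(config_flag, email):
            mismatches += 1
    return mismatches


mismatches = count_mismatches("new_checkout")
```

If the two hashes behave independently, about 2 × 0.05 × 0.95 ≈ 9.5% of simulated users end up in different variants under a 5/95 split. In the real experiment only browsers that triggered the new-device security email went through the override, which is consistent with the much smaller share of double-bucketed browsers reported above.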
Since we are no longer using that email system for A/B testing, we were able to simply remove the override call.
Checkout from Pattern
Pattern is Etsy’s tool that sellers use to create personalized, separate websites for their businesses. Pattern shops allow listings to be added to your cart while on the shop’s patternbyetsy.com domain.
The checkout occurs on the etsy.com domain instead of the patternbyetsy.com domain. Since the value from the user’s browser cookie is what we bucket on and we cannot share cookies across domains, we have two different hashes used for bucketing.
In order to attribute conversions to Pattern, we have logic to override the browser id with the value from the patternbyetsy.com cookie during the checkout process on etsy.com. This override logic works for attributing conversions; however, during sign in, some bucketing happens prior to the execution of the override logic by the controllers.
For this case, we chose to remove bucketing data for Pattern visits as this override caused the bucketing logic to put the same user into both the control and treatment variants.
Here is a dashboard of double-bucketed browsers per day that helped us track our fixes of double-bucketing.
- Dashboards are good
Use dashboards to show the problem, track progress and monitor that problems don’t reoccur.
- Overrides bite
Both of the cases involved use of override functions. Often, overrides are a quick and convenient way to solve a local problem, but in a complex system, they can have unintended consequences that manifest in unexpected ways that may not be immediately apparent.
- Cross-domain is a difficult problem set
- Data quality issues are tricky to debug
Data quality issues are not uncommon in complex data systems, and it can be a challenge to fully understand the impact. An issue that may seem benign in one case can become significant for a slightly different case. In addition, finding the cause of a data quality issue can be tricky and require significant investigation. As mentioned above, we used data analysis and monitoring to identify patterns and narrow in on the diagnosis.