Project Lighthouse — Part 2: Measurement with anonymized data | by Sid Basu | Airbnb Engineering & Data Science | Sep, 2020

In June, the Airbnb Anti-Discrimination product team announced Project Lighthouse, an initiative with the goal to measure and combat discrimination when booking or hosting on Airbnb. We launched this project in partnership with Color Of Change, the nation’s largest online racial justice organization with millions of members, as well as with guidance from other leading civil rights and privacy rights organizations.

At the core of Project Lighthouse is a novel system to measure discrepancies in people’s experiences on the Airbnb platform that could be a result of discrimination and bias. This system is built to measure these discrepancies with perceived race data that is not linked to individual Airbnb accounts. By conducting this analysis, we can understand the state of our platform with respect to inclusion, and begin to develop and evaluate interventions that lead to more equitable outcomes on Airbnb’s platform.

In the first post of this series, we provided some broader context on Project Lighthouse and introduced the privacy model of p-sensitive k-anonymity, which is one of the tools we use to protect our community’s data. In this post, we will focus on evaluating how effective the anonymized data we derive is for measuring the impact of our product team’s interventions.

These blog posts are intended to serve as an introduction to the methodology underlying Project Lighthouse, which is described in greater detail in our technical paper. By publicly sharing our methodology, we hope to help other technology companies systematically measure and reduce discrimination on their platforms.

Like most other product teams in the technology industry, the Airbnb Anti-Discrimination team runs A/B tests, where users are randomly assigned to a control or treatment group, to measure the impact of its interventions. For example, we could use an A/B test to measure the impact of promoting Instant Book on booking conversion rates. Similarly, we could also use an A/B test to understand the impact of obscuring guest profile photos until a booking is confirmed on metrics like booking and cancellation rates.

Some of the topics we are most interested in understanding are whether a gap in acceptance rates exists between demographic groups, and whether a particular intervention affects that gap. We can answer both questions by using A/B testing.

Consider a hypothetical A/B test that we analyze for impact on guests perceived as “race 1” and “race 2”.¹ In such a test all users are randomly assigned to either the control or treatment group; one’s perceived race does not affect this assignment. Perceived race only becomes relevant when we analyze the test’s impact, after the A/B test has concluded. Figure 1 shows some possible results from this hypothetical test. In this example, the acceptance rate in the control group is 67% for guests perceived as “race 1” and is 75% for guests perceived as “race 2”. The baseline difference in acceptance rates can be found by comparing acceptance rates in the control group, and is 75% − 67% = 8 percentage points.

Suppose the intervention increases the acceptance rate for guests perceived as “race 1” to 70% and leaves the acceptance rate for guests perceived as “race 2” unchanged at 75%. The gap in acceptance rates within the treatment group is calculated as 75% − 70% = 5 percentage points. Therefore, we conclude that the intervention has reduced the gap in acceptance rates between guests perceived as “race 1” and “race 2” by 8 − 5 = 3 percentage points.
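The arithmetic above can be written out directly (the rates are the hypothetical Figure 1 values, with the fictional “race 1”/“race 2” labels):

```python
# Hypothetical acceptance rates from the Figure 1 example.
control = {"race 1": 0.67, "race 2": 0.75}
treatment = {"race 1": 0.70, "race 2": 0.75}

control_gap = control["race 2"] - control["race 1"]        # 8 percentage points
treatment_gap = treatment["race 2"] - treatment["race 1"]  # 5 percentage points
gap_reduction = control_gap - treatment_gap                # 3 percentage points
```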

As discussed in our previous post, we utilize the privacy model of p-sensitive k-anonymity to protect user data while computing potential gaps in the Airbnb experience (for this example, acceptance rates) between different demographic groups. Enforcing this privacy model sometimes requires us to modify data by changing or removing values from analysis. This can diminish our ability to accurately estimate metrics, such as acceptance rates, and the impact of our A/B tests on them.

To ensure that we can accurately measure the impact of our interventions, we conducted a study of the impact of anonymization on data utility, the usefulness of data for analysis. More precisely, we were concerned with statistical power, the probability that we observe a statistically significant change in a metric when an intervention is effective. Both larger sample sizes and effect sizes generally lead to more statistical power. Having adequate statistical power for our tests allows us to measure the impact of our interventions with confidence.

We focused our efforts on understanding how enforcing p-sensitive k-anonymity might affect the statistical power of our A/B tests. The goal of this analysis is to understand how changing certain parameters, such as the value of k chosen in enforcing k-anonymity, affects statistical power when measuring the impact of our interventions on reservation acceptance rates by demographic group.

The main tool that we use to understand the impact of anonymization on measurement is a simulation-based power analysis. To better understand how such an analysis works, let’s first go over how we would statistically analyze an A/B test’s impact on differences between acceptance rates between demographic groups.

Suppose, for sake of discussion only, we had a dataset where each row represented a reservation request on Airbnb and the columns were:

  • accept: 1 if the reservation request was accepted, 0 otherwise
  • treatment: whether the guest was in the control or treatment group of the A/B test; we can encode this to take the value 0 in the control group and 1 in the treatment group²
  • perceived race: the guest’s perceived race; we can encode this to take the value 0 for guests perceived to be “race 1” and 1 for guests perceived to be “race 2”³

Then, we could run a linear regression of the form:

accept = a + b_obs · perceived_race + c_obs · treatment + d_obs · (perceived_race × treatment) + ε

Here, the coefficient a would be the acceptance rate for guests perceived to be “race 1” in the control group, a + b_obs would be the acceptance rate for guests perceived to be “race 2” in the control group, and a + c_obs would be the acceptance rate for guests perceived to be “race 1” in the treatment group. For the purposes of our analysis, we are primarily concerned with the coefficient d_obs,⁴ which gives us the A/B test’s impact on the difference in acceptance rates between guests perceived as “race 1” and “race 2”.
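Because both covariates are binary, this interaction regression is saturated, so its fitted coefficients can be recovered directly from the four group means. A minimal pure-Python sketch on simulated data (the rates are the hypothetical Figure 1 values; all variable names are illustrative, not our production code):

```python
import random

random.seed(0)

# rates[(perceived_race, treatment)]: race 1 is coded 0, race 2 is coded 1,
# as in the post; the intervention lifts race 1's rate from 0.67 to 0.70.
rates = {(0, 0): 0.67, (1, 0): 0.75, (0, 1): 0.70, (1, 1): 0.75}

# Each row is (perceived_race, treatment, accept).
rows = []
for _ in range(200_000):
    r, t = random.randint(0, 1), random.randint(0, 1)
    rows.append((r, t, 1 if random.random() < rates[(r, t)] else 0))

def cell_mean(r, t):
    vals = [a for (ri, ti, a) in rows if ri == r and ti == t]
    return sum(vals) / len(vals)

# Saturated regression coefficients, expressed as differences of cell means:
a = cell_mean(0, 0)                                   # race-1 control rate
b_obs = cell_mean(1, 0) - a                           # control-group gap
c_obs = cell_mean(0, 1) - a                           # treatment effect, race 1
d_obs = (cell_mean(1, 1) - cell_mean(0, 1)) - b_obs   # change in the gap
```

With these parameters, a should come out near 0.67 and d_obs near −0.03, i.e., the gap shrinks by about 3 percentage points.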

Suppose we also could conduct many A/B tests where we knew the true impact of the intervention on acceptance rates. We could then estimate statistical power by running the above regression after each test and recording whether d_obs was statistically significantly different from zero. That is, we would estimate the “probability of finding a statistically significant effect” by the “fraction of tests where we found a statistically significant effect”. While we are not able to do this analysis with actual⁵ A/B tests, we can simulate data and follow a similar process. This is the idea at the heart of a simulation-based power analysis.

The core step of a simulation-based power analysis is the simulation of a single A/B test. To do this, we generate a synthetic dataset where each row represents a hypothetical reservation request. We randomly generate perceived race labels and control/treatment group assignments for each row. We model acceptance as a Bernoulli random variable, with the probability of acceptance, p, given as:

p = a_1 · R + a_2 · (1 − R) + d_true · R · T,

where R = 1 if the guest is perceived as “race 1” and R = 0 if they are perceived as “race 2”.

Continuing with the example in Figure 1, we set a_1 = 0.67, a_2 = 0.75 and d_true = 0.03.⁶ Here, T = 0 if the user is in the control group of the A/B test and T = 1 if they are in the treatment group. This gives us the following acceptance rates:

  • Guests perceived as “race 1”: 67% in the control group and 70% in the treatment group
  • Guests perceived as “race 2”: 75% in both the control and treatment groups

We can then anonymize this dataset and analyze the anonymized data using the regression detailed above, recording our results. Repeating this process many times (at least 1,000, in our case) allows us to estimate the statistical power of our tests.
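Putting these pieces together, the simulation loop can be sketched in pure Python. This omits the anonymization step (which is specific to our p-sensitive k-anonymity pipeline) and uses a simple two-sided z-test on the difference-in-differences; both choices, and all names, are illustrative:

```python
import math
import random

def simulate_power(n, a1, a2, d_true, runs=1000, seed=0):
    """Estimate power to detect a change of d_true in the acceptance-rate gap.

    Each run simulates n reservation requests, computes the observed
    difference-in-differences d_obs, and tests it against zero with a
    two-sided z-test at the 5% level. Power is the fraction of runs in
    which the test rejects.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        # counts[(race, treatment)] = [accepted, total]; race 1 is coded 0.
        counts = {cell: [0, 0] for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]}
        for _ in range(n):
            r, t = rng.randint(0, 1), rng.randint(0, 1)
            p = a1 + d_true * t if r == 0 else a2  # intervention lifts race 1
            cell = counts[(r, t)]
            cell[0] += rng.random() < p
            cell[1] += 1
        means = {c: acc / tot for c, (acc, tot) in counts.items()}
        d_obs = (means[(1, 1)] - means[(0, 1)]) - (means[(1, 0)] - means[(0, 0)])
        se = math.sqrt(sum(means[c] * (1 - means[c]) / counts[c][1] for c in counts))
        rejections += abs(d_obs) > 1.96 * se
    return rejections / runs
```

For example, `simulate_power(200_000, 0.67, 0.75, 0.03, runs=1000)` mirrors the Figure 1 parameters, though at that scale a vectorized implementation would be preferable in practice.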

One of the benefits of a simulation-based power analysis is that we can vary different aspects of our hypothetical experiment setups to understand their impact on data utility. In our case, we are interested in understanding the impact of the following factors:

  • The value of k, used in enforcing p-sensitive k-anonymity.
  • The value of N, the number of reservation requests in the A/B test.
  • The intervention’s efficacy in reducing the difference in acceptance rates between guests perceived as “race 1” and “race 2”. We will call this the true effect size (d_true in the previous section).

To this end, we can fix k, N and the true effect size and run the simulation 1,000 times to get a distribution of the observed effect size (d_obs in the previous section). We can then repeat this exercise for different values of k, N and the true effect size to study how they affect the distribution of d_obs. For example, we can compute the fraction of simulation runs where we detect a statistically significant effect, and use that as our estimate of statistical power.

Figure 2 summarizes the main results of this analysis. The horizontal axis represents the true effect size (d_true), while the vertical axis represents our simulation-based estimate of statistical power. Each line represents the relationship between true effect size and statistical power for a specific value of k. We also shade the area in the graph where statistical power is below 80%, as it is a best practice to run A/B tests which have at least 80% power.

The first thing that we notice in Figure 2 is that statistical power increases with the true effect size. This is our empirical evidence that it is “easier to detect larger effects”, and a useful sanity check to do in any simulation-based power analysis. Secondly, we see that enforcing anonymity leads to a mild decrease in statistical power, depending on the value of k. For k = 5 or 10, this decrease is within 5–10 percent relative to identifiable data (k = 1). On the other hand, for k = 100, the relative decrease is 10–20 percent, depending on the true effect size.

Another way to look at these results is to analyze the minimum detectable effect, the smallest true effect size for which we have 80% power for various values of k and N. Figure 3 plots the sample size on the horizontal axis and the minimum detectable effect on the vertical axis. The different lines demarcate different values of k.

Similar to Figure 2, Figure 3 shows that there is an increase in minimum detectable effect that gets larger as k increases. A higher minimum detectable effect is undesirable since it means that we can only detect a larger change with the same amount of statistical power. However, the figure shows how increasing the sample size can compensate for this. For example, we need to analyze 200,000 reservation requests in an A/B test to detect an effect size of 1.75 percentage points with identifiable data (k = 1). When we use p-sensitive k-anonymous data, with k = 5, this increases to 250,000 reservation requests. Practically speaking, this means that we can run our A/B tests for longer so that they include more reservation requests, leading them to have adequate statistical power.
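As a rough analytic cross-check on these simulation results, the minimum detectable effect for a difference-in-differences of proportions can be approximated with the standard power formula. This is a sketch assuming four equal cells, a common acceptance rate p, a two-sided 5% test, and 80% power; the simulation-based estimates in Figure 3 account for details this ignores, such as anonymization and unequal group sizes:

```python
import math

def approx_mde(n_total, p=0.71, z_alpha=1.96, z_power=0.84):
    """Analytic minimum detectable effect for a difference-in-differences
    of four proportions, with n_total requests split into equal cells."""
    n_cell = n_total / 4
    se = math.sqrt(4 * p * (1 - p) / n_cell)  # standard error of d_obs
    return (z_alpha + z_power) * se
```

One property this formula makes explicit: doubling the sample size shrinks the minimum detectable effect by a factor of √2, consistent with the downward-sloping curves in Figure 3.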

In summary, our simulation-based power analysis demonstrates that we can use p-sensitive k-anonymous data to measure the impact of our interventions to reduce discrepancies in the Airbnb experience by guest perceived race. While enforcing anonymity leads to up to a 20% decrease in statistical power, depending on the value of k, running tests for longer to obtain larger sample sizes can compensate for this.

It is important to note that our A/B test analysis workflow differs considerably from the one typically employed in the technology industry. Each analysis we conduct now requires a significant amount of pre-work to ensure that we have p-sensitive k-anonymous data. We also run A/B tests for longer than we would if we used identifiable data.

Nevertheless, our findings show that it is possible to audit online platforms for large-scale gaps in the user experience while at the same time protecting our community’s privacy. We hope that our work can serve as a resource for other technology companies who would also like to systematically measure and reduce discrimination on their platforms. Our publicly-available technical paper describes the topics covered in these posts, as well as our methods for enforcing p-sensitive k-anonymity in more detail. Our landing page has a more general overview of Project Lighthouse.

Project Lighthouse represents the collaborative work of many people both within and external to Airbnb. The Airbnb Anti-Discrimination team is: Sid Basu, Ruthie Berman, Adam Bloomston, John Campbell, Anne Diaz, Nanako Era, Benjamin Evans, Sukhada Palkar, and Skyler Wharton. Within Airbnb, Project Lighthouse also represents the work of Crystal Brown, Zach Dunn, Janaye Ingram, Brendon Lynch, Margaret Richardson, Ann Staggs, Laura Rillos, and Julie Wenah. We would also like to extend a special thanks to Laura Murphy and Conrad Miller for their continuing support and guidance throughout the project.

We know that bias, discrimination, and systemic inequities are complex and longstanding problems. Addressing them requires continued attention, adaptation, and collaboration. We encourage our peers in the technology industry to join us in this fight, and to help push us all collectively towards a world where everyone can belong.

This analysis is currently being conducted in our United States community. Perceived race data used in Project Lighthouse is not linked to individual Airbnb accounts. Additionally, the data collected for Project Lighthouse will be handled in a way that protects people’s privacy and will be used exclusively for anti-discrimination work. You can read more about this in the first blog post or in the Airbnb resource center.

[1] I’m using the fictional labels “race 1” and “race 2” for the sake of exposition. The framework presented here can be extended to analyze gaps in acceptance rates between multiple (>2) perceived racial identities.

[2] This encoding can be reversed to 1 for the control group and 0 for the treatment group without loss of generality. The only effect on the analysis would be that c_obs would become the control minus treatment acceptance rate, instead of the treatment minus control acceptance rate.

[3] Similar to the treatment variable, this encoding can also be reversed.

[4] Here, obs is shorthand for observed.

[5] There are several reasons why we cannot do this with actual A/B tests. Firstly, to do so would require us having product interventions where we knew what the exact true impact on the difference in acceptance rates was, which we don’t. Secondly, the procedure we outline would require individual-level perceived race labels, which would violate our privacy commitments.

[6] To relate these values to the regression equation above, a estimates a_1 and a + b_obs estimates a_2.
