How Airbnb Battles Chargebacks while Minimizing Impact to Good Guests
On any given night, nearly two million people are staying in Airbnb listings in 191 countries around the world. The rapid growth of our global community is predicated on one thing: trust.
We take a comprehensive approach to trust, which consists of both robust proactive measures and reactive support, but today I want to focus on some of the work we do ahead of time — and often in the background — to prevent fraudsters from using stolen credit cards on our site.
In this post, I’ll walk through how we leverage machine learning, experimentation, and analytics to identify and block fraudsters while minimizing impact on the overwhelming majority of good users. First, I’ll introduce our use of machine-learning (ML) models to trigger frictions targeted at blocking fraudsters. Then, I’ll outline how we choose the model’s threshold by minimizing a loss function, and dive into each term in the loss function: the costs of false positives, false negatives, and true positives. Finally, I’ll walk through a numerical example comparing the optimization of blocking transactions versus applying a friction.
What We’re Fighting: Chargebacks
Like all online businesses, Airbnb faces fraudsters who attempt to use stolen credit cards. When the true cardholder realizes their card has been stolen and notices unauthorized charges on their bill, the credit card company issues what’s called a “chargeback,” and the merchant (in our case, Airbnb) returns the money.
Unlike some of our competitors, we absorb the cost of these chargebacks and do not pass any financial liability to our hosts. In order to better protect our community and reduce our own exposure to chargeback costs, we actively work to block stolen credit cards from being used in the first place.
We detect financial fraud in a number of ways, but our workhorse method uses machine-learning (ML) models trained on past examples of confirmed good and confirmed fraudulent behavior. Because no model can be perfect, we will always have false positives: “good” events that a model or rule classifies as “bad” (above the threshold). In some situations, we block actions outright, but in most situations we allow the user the opportunity to satisfy an additional verification called a friction. A friction is ideally something that blocks a fraudster, yet is easy for a good user to satisfy.
To stop the use of stolen credit cards, our chargeback model triggers a number of frictions to ensure that the guest is in fact authorized to use that card, including micro-authorization (placing two small authorizations on the credit card, which the cardholder must identify by logging into their online banking statement), 3-D Secure (which allows credit card companies to directly authenticate cardholders via a password or SMS challenge), and billing-statement verification (requiring the cardholder to upload a copy of the billing statement associated with the card).
Optimizing the Model Threshold
We train the chargeback model on positive (fraud) and negative (non-fraud) examples from past bookings, with the goal of predicting the probability that a booking is fraudulent. Because fraud is extremely rare, this is an imbalanced classification problem with scarce positive labels. We characterize our model’s performance at identifying fraudulent versus good bookings at various thresholds in terms of the true-positive rate and false-positive rate, then evaluate the total cost associated with each threshold using a loss function that depends on those rates.
Specifically, our goal is to minimize the overall loss function L, which we can write as:
L = FP * G * V + FN * C + TP * (1 - F) * C
In this equation, FP is the number of false positives, G is the dropout rate of good users when exposed to the friction, V is the good user lifetime value, FN is the number of false negatives, C is the cost of a fraud event, TP is the number of true positives, and F is the friction’s efficacy against fraudsters. (The cost of a “true negative,” i.e., the model correctly identifying a good booking as good, is zero, so it does not appear in the loss function.) In the following sections, we’ll examine how we estimate each of these terms.
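As a minimal sketch, the loss can be written as a small function, assuming it is the sum of the three cost terms derived in the sections that follow (FP * G * V, FN * C, and TP * (1 - F) * C). The function name and the numbers in the usage line are hypothetical, chosen only for illustration.

```python
def expected_loss(fp, fn, tp, g, v, c, f):
    """Total loss L = FP*G*V + FN*C + TP*(1-F)*C (hypothetical helper)."""
    false_positive_cost = fp * g * v         # good users who drop out at the friction
    false_negative_cost = fn * c             # fraud that scored below the threshold
    true_positive_cost = tp * (1.0 - f) * c  # fraud that got past the friction
    return false_positive_cost + false_negative_cost + true_positive_cost


# Made-up counts and parameters, not real data.
print(expected_loss(fp=100, fn=5, tp=20, g=0.1, v=1.0, c=10.0, f=0.95))
```

True negatives contribute nothing, so they are not an argument of the function.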
Cost of a False Positive
If we incorrectly apply friction to a good booking (a false positive), we incur a cost because there is a chance the good user will not complete the friction and then will not use Airbnb. This probability of the good user dropping out is given by G.
For each good guest that is unable to complete the friction and drops from the funnel, we approximate that the guest has churned for life — that is, we lose their entire lifetime value V. We won’t go into the details of how we calculate lifetime values in this post, but it is a concept common to many businesses.
The loss from good users dropping out is given by multiplying the number of false positives by the expected loss from each false positive: FP * G * V.
To measure the good-user dropout rate G for each friction, we run an A/B test using our Experiment Reporting Framework. We assign users with low model scores (who are very unlikely to be fraudsters) to the experiment at the same stage in the funnel where we will apply the friction against fraudsters.
What do we measure in these good user dropout experiments? For simplicity’s sake, we tend to choose a single, end-of-funnel metric. For anti-chargeback frictions, we measure the number of guests who successfully complete a booking. We are looking to find the good-user friction dropout rate G, which we’ll define as: 1 - (success rate in friction group)/(success rate in control group).
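The definition of G above can be sketched as follows. The function name, argument names, and counts are hypothetical; the only thing taken from the text is the formula G = 1 - (success rate in friction group)/(success rate in control group).

```python
def good_user_dropout(control_successes, control_users,
                      treatment_successes, treatment_users):
    """Estimate G from booking counts in a good-user dropout experiment."""
    control_rate = control_successes / control_users        # no friction
    treatment_rate = treatment_successes / treatment_users  # friction applied
    return 1.0 - treatment_rate / control_rate


# Made-up counts for a 95% control / 5% treatment split.
g = good_user_dropout(control_successes=4750, control_users=9500,
                      treatment_successes=225, treatment_users=500)
print(f"estimated G = {g:.3f}")
```

Using rates rather than raw counts matters here because the two groups are deliberately different sizes.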
Good user dropout experiments are costly to run. We need to cause enough good users to drop out that we can measure how big that dropout is with a reasonable degree of confidence, which means some good guests won’t end up booking! To minimize the total number of good users exposed to the friction, while still measuring G to within a given confidence interval, we use highly imbalanced assignments such as 95% control (no friction) / 5% treatment (friction). To see why, consider that we can use the Delta method to calculate the variance on a ratio metric, which to first order and for independent groups is:
Var(G) ≈ (𝜇_t/𝜇_c)² × (𝜎_t²/𝜇_t² + 𝜎_c²/𝜇_c²)
In this equation 𝜇_c and 𝜇_t are the means of the control and treatment statistics respectively, 𝜎_c² and 𝜎_t² are their respective variances (that is, the variances of the two sample means, which shrink as each group grows), and G = 1-𝜇_t/𝜇_c. Since 𝜎_t² is set by the costly treatment (friction) group size, we want 𝜎_c² to be as small as possible, which is achievable by increasing the control group size with an imbalanced experiment. In many cases, if G≪1 then the imbalanced experiment allows us to expose roughly half as many users to friction compared to a 50/50 experiment to achieve the same confidence interval on G.
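To make the "roughly half as many users" claim concrete, here is a small sketch. It assumes the standard first-order Delta-method variance for a ratio of two independent sample means, Var(G) ≈ (𝜇_t/𝜇_c)² × (𝜎_t²/𝜇_t² + 𝜎_c²/𝜇_c²), and it assumes a binary success metric so that the variance of each group mean is p(1-p)/n. The function name, success rates, and sample sizes are all hypothetical.

```python
def dropout_variance(p_control, p_treatment, n_control, n_treatment):
    """Delta-method variance of G = 1 - mu_t/mu_c (independent groups)."""
    # Variance of each sample mean for a binary (booked / did not book) metric.
    var_control = p_control * (1.0 - p_control) / n_control
    var_treatment = p_treatment * (1.0 - p_treatment) / n_treatment
    ratio = p_treatment / p_control
    return ratio ** 2 * (var_treatment / p_treatment ** 2
                         + var_control / p_control ** 2)


# Small true dropout (G = 0.02): 50% vs 49% booking success.
balanced = dropout_variance(0.50, 0.49, n_control=10_000, n_treatment=10_000)
imbalanced = dropout_variance(0.50, 0.49, n_control=95_000, n_treatment=5_000)
print(f"50/50 with 10,000 treated: Var(G) = {balanced:.2e}")
print(f"95/5  with  5,000 treated: Var(G) = {imbalanced:.2e}")
```

With these made-up inputs the two variances come out close to each other even though the imbalanced design exposes half as many users to the friction; the price is a much larger total sample.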
The disadvantage of running an imbalanced experiment is that the statistic takes longer to converge (i.e., requires a larger total sample size) than a 50/50 experiment. The exact treatment fraction needs to be tailored to the number of users eligible for the experiment each day, the expected magnitude of G, and the amount of time we are willing to run the experiment.
Cost of a False Negative
Next, we need to know the cost of false negatives — that is, the cost of a fraudulent event that scored below the model’s threshold. The total loss from false negatives is given simply by multiplying the number of false negatives by the cost of each fraud event: FN * C.
Airbnb absorbs all costs associated with chargebacks, and we never pass them through to our hosts. Thus, the total cost is the full amount of the payment made by the fraudster, plus overhead associated with processor fees and decline rates.
Cost of a True Positive
A true positive is when the model correctly identifies a fraudster with a score above the threshold. Here we apply friction to achieve our ultimate goal: preventing this fraudster from using Airbnb. If the friction successfully blocks the fraudster, we have achieved this goal and have no loss.
However, no friction is perfect, and if the fraudster somehow manages to pass the friction, then we do incur a loss. The total loss from true positives is given by TP * (1 - F) * C, where F is the fraudster dropout rate when challenged by the friction.
We measure F using another A/B test, where we assign risky users who score above the model threshold. This time the imbalance is flipped around because not applying friction to a fraudster is very expensive. We subject all high-risk events that skipped the friction due to the experiment to a manual review in order to prevent harm to our community and minimize loss. We track a metric measuring fraud events and compare the number of successful fraud events in the friction group (hopefully small!) to the number of fraud events caught by manual review in the control group to get the fraudster dropout rate F. A value of F=1 would signify that the friction is 100% effective at blocking fraudsters, so we want F to be as close to 1 as possible.
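One way to turn that comparison into a number is sketched below. It assumes F is computed analogously to G, as one minus the ratio of per-event fraud rates, so that the unequal group sizes cancel out; the text does not spell out the exact normalization, so treat this as an assumption. The function name and counts are hypothetical.

```python
def fraudster_dropout(friction_fraud_successes, friction_events,
                      control_fraud_caught, control_events):
    """Estimate F from a high-risk experiment (assumed rate-ratio form)."""
    # Fraud that got through despite the friction, per event in that group.
    friction_rate = friction_fraud_successes / friction_events
    # Fraud caught by manual review in the no-friction group, per event.
    control_rate = control_fraud_caught / control_events
    return 1.0 - friction_rate / control_rate


# Made-up counts: most high-risk events get the friction, a few skip it.
f = fraudster_dropout(friction_fraud_successes=19, friction_events=9500,
                      control_fraud_caught=20, control_events=500)
print(f"estimated F = {f:.3f}")
```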
A few comments:
- One possible friction is to outright deny or block a transaction or event, as many financial companies do. This could be considered a friction with F=1 and G=1.
- An ideal friction would have F=1 and G=0. If we had a friction like this, we would apply it to 100% of events since there is no cost to applying it to good users. Unfortunately frictions like this are hard to come by!
Example: Comparing Blocking a Transaction versus Applying a Friction
Let’s walk through a fictional example. First, let’s imagine that we trained a machine-learning model on past examples of fraud. For this example we’ll use a dummy ROC curve defined by TPR = [1 - (1 - FPR)⁵]^(1/5), i.e., the fifth root of 1 - (1 - FPR)⁵.
Let’s say that 1% of events are attempted by fraudsters. Each instance of fraud has a fixed cost C=10, and each good user has a value V=1. If we use our model to block transactions directly (Figure 2(a)), the loss function is minimized at 4.5 by blocking roughly 1% of transactions. If we instead use the model to trigger a friction that is F=95% effective against fraudsters with a good-user dropout rate G=10%, then total loss is minimized to 2 by applying friction to 11% of transactions. We can cut the total loss by more than half by using the friction rather than hard-blocking transactions!
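The comparison can be explored with a simple sweep over the dummy ROC curve. This is a sketch under stated assumptions: loss is computed per 100 events, the threshold is parameterized by the false-positive rate, and all function names are hypothetical. Because the normalization behind the figures is not given in the text, the absolute loss values printed here may not match the numbers quoted above; the sketch is meant to show the qualitative comparison.

```python
def tpr_from_fpr(fpr):
    """Dummy ROC curve: TPR = [1 - (1 - FPR)^5]^(1/5)."""
    return (1.0 - (1.0 - fpr) ** 5) ** 0.2


def loss_per_100_events(fpr, g, f, fraud_share=0.01, c=10.0, v=1.0):
    """L = FP*G*V + FN*C + TP*(1-F)*C for 100 events at a given FPR."""
    n_fraud = 100.0 * fraud_share
    n_good = 100.0 - n_fraud
    tp = n_fraud * tpr_from_fpr(fpr)
    fn = n_fraud - tp
    fp = n_good * fpr
    return fp * g * v + fn * c + tp * (1.0 - f) * c


def best_operating_point(g, f, fraud_share=0.01, steps=10_000):
    """Grid-search the FPR that minimizes loss; returns (loss, fpr, actioned)."""
    best = None
    for i in range(steps + 1):
        fpr = i / steps
        loss = loss_per_100_events(fpr, g, f, fraud_share)
        # Share of all events that score above the threshold.
        actioned = (1.0 - fraud_share) * fpr + fraud_share * tpr_from_fpr(fpr)
        if best is None or loss < best[0]:
            best = (loss, fpr, actioned)
    return best


block = best_operating_point(g=1.0, f=1.0)       # hard block: F=1, G=1
friction = best_operating_point(g=0.10, f=0.95)  # friction: F=0.95, G=0.10
print(f"block:    loss={block[0]:.2f}, actioned={block[2]:.1%}")
print(f"friction: loss={friction[0]:.2f}, actioned={friction[2]:.1%}")
```

A grid search is used instead of a numerical optimizer to keep the example dependency-free; the loss curve is smooth enough that a fine grid is adequate here.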
Note that we could go a step further by hard-blocking the riskiest events and applying friction to the medium-risk events. This approach maps onto a two-dimensional optimization of the same loss function above.
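A sketch of that two-dimensional search is below, reusing the dummy ROC curve and the example parameters (C=10, V=1, F=0.95, G=0.10, 1% fraud). It assumes events above the stricter threshold are blocked (treated as F=1, G=1) and events between the two thresholds get the friction. The function names, the per-100-events normalization, and the coarse grid are illustrative assumptions.

```python
def tpr_from_fpr(fpr):
    """Dummy ROC curve: TPR = [1 - (1 - FPR)^5]^(1/5)."""
    return (1.0 - (1.0 - fpr) ** 5) ** 0.2


def two_tier_loss(block_fpr, friction_fpr, g=0.10, f=0.95,
                  fraud_share=0.01, c=10.0, v=1.0):
    """Loss per 100 events when blocking above one threshold and
    applying friction between the two (block_fpr <= friction_fpr)."""
    n_fraud = 100.0 * fraud_share
    n_good = 100.0 - n_fraud
    tpr_block = tpr_from_fpr(block_fpr)
    tpr_any = tpr_from_fpr(friction_fpr)
    # Good users: blocked ones are lost outright, frictioned ones drop at rate G.
    good_loss = n_good * (block_fpr * v + (friction_fpr - block_fpr) * g * v)
    # Fraud below both thresholds gets through untouched.
    missed_loss = n_fraud * (1.0 - tpr_any) * c
    # Fraud in the friction band gets through with probability 1 - F.
    passed_loss = n_fraud * (tpr_any - tpr_block) * (1.0 - f) * c
    return good_loss + missed_loss + passed_loss


def best_two_tier(steps=200):
    """Grid-search both thresholds; returns (loss, block_fpr, friction_fpr)."""
    best = None
    for i in range(steps + 1):
        for j in range(i, steps + 1):
            loss = two_tier_loss(i / steps, j / steps)
            if best is None or loss < best[0]:
                best = (loss, i / steps, j / steps)
    return best


loss, block_fpr, friction_fpr = best_two_tier()
print(f"loss={loss:.2f}, block FPR={block_fpr:.3f}, friction FPR={friction_fpr:.3f}")
```

Setting `block_fpr` to zero recovers the friction-only case, so the two-tier optimum can never be worse than it on the same grid.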
Fighting fraud is an adversarial business by nature. The more aggressively we apply frictions, the less likely fraudsters are to try attacking again, and the framework described above doesn’t explicitly account for this feedback loop. For this reason, we tend to act slightly more aggressively than the optimal point on our loss function curve. We also avoid updating our operating point too quickly when we see fraud rates drop, though we do act and update quickly if rates rise.
The fact is that fraudsters will never stop looking for new ways around our defenses. Machine learning and targeted frictions are just one of the many ways we work to keep our community safe, and our team is constantly working to improve our systems in order to earn your trust and stay ahead of those who might be looking to take advantage of our community.