It is common in the internet industry to develop the algorithms that power online products using historical data. An algorithm that improves evaluation metrics on historical data is then tested in online A/B tests against the one currently in production to assess the lift in the key performance indicators (KPIs) of the business. We refer to metrics calculated from an algorithm’s new predictions and historical ground truth as offline evaluation metrics. In many cases, offline evaluation metrics differ from business KPIs. For example, a ranking algorithm, such as the one that powers search pages on Etsy.com, typically optimizes for relevance by predicting the purchase or click probabilities of items. It can be tested offline (in offline A/B tests) on rank-aware evaluation metrics such as normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), or mean average precision (MAP), which are calculated from the algorithm’s predicted ranks on a test set of historical purchase or click-through feedback from users. Most e-commerce platforms, however, treat sitewide gross merchandise sales (GMS) as their business KPI and test for it online. There are various reasons not to optimize directly for business KPIs offline or to use them as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty. Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners: it is not clear which of the available offline evaluation metrics is the north star that should guide the development of algorithms in order to optimize business KPIs.
The challenge essentially asks for the causal effects of increasing offline evaluation metrics on business KPIs: for example, how business KPIs would change for a 10% increase in an offline evaluation metric with all other conditions remaining the same (ceteris paribus). The north star should be the offline evaluation metric that has the greatest causal effect on business KPIs. Note that business KPIs are affected by numerous internal and external factors, especially macroeconomic conditions and the sociopolitical environment. This means that we cannot predict future business KPIs from changes in offline evaluation metrics alone. Because offline evaluation metrics are the only lever we can optimize to affect business KPIs, we instead use historical data to infer how business KPIs would change in response to changes in our offline evaluation metrics, holding all other internal and external factors fixed. Our task is therefore causal inference rather than prediction.
Our approach is to introduce online evaluation metrics, the online counterparts of offline evaluation metrics, which measure the performance of online products (see Figure 1). This allows us to decompose the problem into two parts: the first part is the consistency between changes of offline and online evaluation metrics, the second part is the causality between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs). The first part is solved by the offline A/B test literature through counterfactual estimators of offline evaluation metrics. Our work focuses on the second part. The north star should be the offline evaluation metric whose online counterpart has the greatest causal effects on business KPIs. Hence, the question becomes how business KPIs would change for a 10% increase in an online evaluation metric ceteris paribus.
Figure 1: The Causal Path from Algorithm Trained Offline to Online Business
Note: Offline algorithms power online products, and online products contribute to the business.
Why do we focus on causality? Before answering this question, let’s think about another interesting question: thirsty crow vs. talking parrot, which one is more intelligent (see Figure 2)?
In Aesop’s Fables, a thirsty crow found a pitcher with water at the bottom. The water was beyond the reach of its beak, so it intentionally dropped pebbles into the pitcher, which caused the water to rise to the top. A talking parrot, by contrast, cannot really talk. After being fed a simple phrase tons of times (big data and machine learning), it can only mimic the speech without understanding its meaning.
The crow is obviously more intelligent than the parrot. The crow understood the causality between dropped pebbles and rising water and thus leveraged the causality to get the water. Beyond big data and machine learning (talking parrot), we want our artificial intelligence (AI) system to be as intelligent as the crow. After understanding the causality between evaluation metric lift and GMS lift, our system can leverage the causality, by lifting the evaluation metric offline, to achieve GMS lift online (see Figure 3). Understanding and leveraging causality are key topics in current AI research (see, e.g., Bergstein, 2020).
Figure 3: Understanding and Leveraging Causality in Artificial Intelligence
Causal Meta-Mediation Analysis
Online A/B tests are a popular way to measure the causal effects of online product changes on business KPIs. Unfortunately, they cannot directly tell us the causal effects of increasing offline evaluation metrics on business KPIs. To compare the business KPIs caused by different values of an online evaluation metric in an online A/B test, we would need to fix the metric at different values for the treatment and control groups. Take the ranking algorithm as an example. If we could fix the online NDCG of the search page at 0.22 and 0.2 for the treatment and control groups respectively, then we would know how sitewide GMS would change for a 10% increase in online NDCG at 0.2 ceteris paribus. However, this experimental design is impossible, because most online evaluation metrics depend on users’ feedback and thus cannot be directly manipulated.
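The ideal but infeasible comparison in this example can be written compactly. The notation is our own shorthand for illustration, not taken from the paper: Y(m) stands for the sitewide GMS that would be observed if online NDCG could be fixed at the value m.

```latex
% Y(m): sitewide GMS if online NDCG were fixed at m (hypothetical notation)
\[
  \frac{0.22 - 0.20}{0.20} = 10\%,
  \qquad
  \Delta_{\mathrm{GMS}}
    = \mathbb{E}\left[ Y(0.22) \right] - \mathbb{E}\left[ Y(0.20) \right].
\]
```

The difficulty is that no experiment can assign m directly, so neither expectation is observed from a conventional treatment/control split.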
We address the question by developing a novel approach: causal meta-mediation analysis (CMMA). We model the causality between online evaluation metrics and business KPIs with a dose-response function (DRF) in the potential outcomes framework. The DRF originates in medicine and describes the magnitude of an organism’s response to different doses of a stimulus. Here we use it to describe the value of a business KPI at different values of an online evaluation metric. Unlike doses of a stimulus, values of online evaluation metrics cannot be directly manipulated. However, they can differ between treatment and control groups in experiments whose treatments are not algorithms, such as user interface/user experience (UI/UX) design or marketing. This could be due to the “fat hand” nature of online A/B tests: a single intervention can change many causal variables at once. A change to the tested feature, even when it is not an algorithm, can induce users to change how they engage with algorithm-powered online products, so that the values of online evaluation metrics change. For instance, in an experiment on UI design, users might change their search behavior because of the new design, so that online NDCG, which depends on search interaction, changes even though the ranking algorithm does not (see Figure 5). Such cases suggest that online evaluation metrics can be mediators that partially transmit the causal effects of treatments on business KPIs in experiments where the treatments are not necessarily algorithm-related. Hence, we formalize the problem as the identification, estimation, and testing of the mediator DRF.
Figure 5: Directed Acyclic Graph of Conceptual Framework
Our novel approach, CMMA, combines mediation analysis and meta-analysis to solve for the mediator DRF. It relaxes common assumptions in the causal mediation literature: sequential ignorability (in the linear structural equation model) or complete mediation (in the instrumental variable approach). It also extends meta-analysis to causal mediation, whereas the existing meta-analysis literature only learns the distribution of average treatment effects. Our extensive simulations show that CMMA outperforms other methods in the literature in terms of unbiasedness and the coverage of confidence intervals. CMMA uses only experiment-level summary statistics (i.e., meta-data) from many existing experiments, which makes it easy to implement and to scale up. It can be applied to any experimental-unit-level evaluation metric or any combination of such metrics. Because it solves the causality problem for one product by leveraging experiments across all products, CMMA could be particularly useful in real applications for a new product that has been shipped online but has had few A/B tests.
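To make "experiment-level summary statistics" concrete, the sketch below runs a plain least-squares meta-regression across synthetic experiments. It is a deliberately simplified stand-in and not the CMMA estimator: it ignores sampling noise in the per-experiment summaries and any direct treatment effect that varies across experiments. All names and numbers (`delta_metric`, `delta_kpi`, `TRUE_SLOPE`, `N_EXPERIMENTS`) are hypothetical.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic experiment-level summaries (NOT real data): one entry per A/B test.
rng = random.Random(0)
TRUE_SLOPE = 2.0      # hypothetical effect of the online metric on the KPI
N_EXPERIMENTS = 200   # arbitrary number of past experiments

# Treatment-minus-control difference in the average online metric per experiment.
delta_metric = [rng.gauss(0.0, 0.01) for _ in range(N_EXPERIMENTS)]
# Treatment-minus-control difference in the average KPI per user per experiment.
delta_kpi = [TRUE_SLOPE * d + rng.gauss(0.0, 0.001) for d in delta_metric]

slope, intercept = ols_slope_intercept(delta_metric, delta_kpi)
print(f"estimated slope: {slope:.3f}, intercept: {intercept:.5f}")
```

The point of the sketch is only the data shape: each past experiment contributes one pair of treatment-versus-control differences, so no user-level data is needed once the summaries exist.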
We apply CMMA to the three most popular rank-aware evaluation metrics (NDCG, MRR, and MAP) to show which one has the greatest causal effect on sitewide GMS for the ranking algorithms that power search products.
User-Level Rank-Aware Metrics
We redefine the three rank-aware metrics (NDCG, MAP, MRR) at the user level. The three metrics are originally defined at the query level in the test-collection evaluation of the information retrieval (IR) literature. Because the search engine on Etsy.com is an online product for users, the computation can be adapted to the user level. We include search sessions with no interaction or no feedback in the metric calculation, in accordance with the calculation of online business KPIs in online A/B tests, which always includes visits and users with no interaction or no feedback. Specifically, the three metrics are constructed as follows:
- Query-level metrics are computed using rank positions on the search page and user conversion status (binary relevance). Queries without a conversion have zero values.
- The user-level metric is the average of the query-level metrics across all queries the user issues (including queries not associated with a conversion). Users who do not search or do not convert have zero values.
All three metrics are defined at rank position 48, the lowest position on the first page of search results on Etsy.com.
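The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production computation: each query is represented as a list of 1-based rank positions of purchased items, positions beyond 48 are ignored, and average precision is normalized by the number of purchased items within the cutoff (the text does not specify the normalization). All function names are hypothetical.

```python
import math
from statistics import mean

K = 48  # rank cutoff from the text: last position on the first results page


def ndcg_at_k(positions, k=K):
    """NDCG with binary relevance; positions are 1-based ranks of purchases."""
    hits = [p for p in positions if p <= k]
    if not hits:
        return 0.0  # query without a conversion
    dcg = sum(1.0 / math.log2(p + 1) for p in hits)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, len(hits) + 1))
    return dcg / ideal


def rr_at_k(positions, k=K):
    """Reciprocal rank of the first purchased item within the cutoff."""
    hits = [p for p in positions if p <= k]
    return 1.0 / min(hits) if hits else 0.0


def ap_at_k(positions, k=K):
    """Average precision, normalized by purchases within the cutoff (assumption)."""
    hits = sorted(p for p in positions if p <= k)
    if not hits:
        return 0.0
    return mean((i + 1) / p for i, p in enumerate(hits))


def user_level(queries, metric):
    """Average a query-level metric over all of a user's queries.

    queries: one list of purchase positions per query; an empty inner list
    is a query without a conversion. A user with no queries scores zero.
    """
    if not queries:
        return 0.0
    return mean(metric(q) for q in queries)


# Hypothetical user: one query with purchases at ranks 1 and 3, one without.
example_user = [[1, 3], []]
for name, fn in [("NDCG", ndcg_at_k), ("MRR", rr_at_k), ("MAP", ap_at_k)]:
    print(name, user_level(example_user, fn))
```

Averaging over all queries, including those without a conversion, is what pulls user-level values far below typical query-level values.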
To demonstrate CMMA, we randomly selected 190 experiments from 2018 and ran CMMA on the summary results of each experiment (e.g., the average user-level NDCG in the treatment and control groups). Figure 6 shows the results. The vertical axis indicates elasticity: the percentage change in average GMS per user for a 10% increase in an online rank-aware metric, with all other conditions that can affect GMS remaining the same. Lifts in all three rank-aware metrics have positive causal effects on average GMS per user, but the effects are not equal: each metric has a different elasticity, and the elasticity varies with the metric’s current value. For example, suppose the current values of NDCG, MAP, and MRR of the search page are 0.00210, 0.00156, and 0.00153 respectively. A 10% increase in MRR will then cause a higher lift in GMS than a 10% increase in either of the other two metrics ceteris paribus, as indicated by the red dashed lines, and thus MRR should be the north star that guides our development of algorithms. Because all three metrics are computed from the same input data, they are highly correlated, and thus the differences are small. As new IR evaluation metrics are continuously developed in the literature, we will run CMMA on more of them for comparison in the future.
Figure 6: Elasticity from Average Mediator DRF Estimated by CMMA
Note: Elasticity means the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same.
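The elasticity in Figure 6 can be read as a simple calculation on a dose-response function. The sketch below assumes a toy linear DRF, `toy_drf`, which is purely illustrative and not an estimated curve; only the starting value 0.00210 comes from the text.

```python
def elasticity(drf, m, lift=0.10):
    """Percentage change in drf when the metric rises from m to m * (1 + lift)."""
    base = drf(m)
    return 100.0 * (drf(m * (1.0 + lift)) - base) / base


def toy_drf(m):
    """Toy DRF (hypothetical): average GMS per user at online metric value m."""
    return 1.0 + 50.0 * m


current_ndcg = 0.00210  # example value quoted in the text
pct_change = elasticity(toy_drf, current_ndcg)
print(f"GMS per user changes by {pct_change:.3f}% for a 10% lift in the metric")
```

Because the elasticity is evaluated at a specific starting value, the same DRF can yield different elasticities at different metric values.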
We implement CMMA to identify the north star of rank-aware metrics for search-ranking algorithms. It has helped product and engineering teams to achieve efficiency and success in terms of business KPIs in online A/B tests. We published CMMA at KDD 2020. Interested readers can refer to our paper for more details.