At LinkedIn, our imperative is to create economic opportunity for every member of the global workforce, something that would be impossible to accomplish without leveraging AI at scale. We help members and customers make decisions by providing them with the most relevant insights based on the available data (e.g., the job listings that might be a good fit for their skills, or content that might be most relevant to their career). Along with the rest of the industry, our AI models use both implicit and explicit feedback in order to make these predictions.
News headlines and academic research have emphasized that widespread societal injustice based on human biases can be reflected both in the data that is used to train AI models and in the models themselves. Research has also shown that models affected by these societal biases can ultimately serve to reinforce those biases and perpetuate discrimination against certain groups. Sadly, these biases persist even in models used to inform high-stakes decisions in fields such as criminal justice and health care, owing to a range of complex historical and social factors.
At LinkedIn, we are working toward creating a more equitable platform by avoiding harmful biases in our models and ensuring that people with equal talent have equal access to job opportunities. In this post, we share the methodology we’ve developed to detect and monitor bias in our AI-driven products as part of our product design lifecycle.
Today’s announcement is the latest in a series of broader R&D efforts to avoid harmful bias on our platform, including Project Every Member. It is also a logical extension of our earlier efforts in fairness, privacy, and transparency in our AI systems, as well as “diversity by design” in LinkedIn Recruiter. Furthermore, there are additional company-wide efforts that extend beyond the scope of product design to help address these issues and close the network gap.
Towards fairness in AI-driven product design
There are numerous definitions of fairness for AI models, including disparate impact, disparate treatment, and demographic parity, each of which captures a different aspect of fairness to users. Continuously monitoring deployed models and determining whether their performance is fair according to these definitions is an essential first step towards providing a fair member experience.
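To make one of these definitions concrete, here is a minimal sketch (plain Python, not LiFT's API) of demographic parity: it compares the rate of positive predictions across groups, and a gap near zero suggests the model treats the groups similarly in this respect. The group names and data are hypothetical.

```python
# Illustrative sketch of demographic parity (not LiFT code):
# compare the positive-prediction rate across subgroups.
def positive_rate(predictions):
    """Fraction of examples that received a positive prediction (1)."""
    return sum(predictions) / len(predictions)

def demographic_parity_gap(preds_by_group):
    """Max difference in positive-prediction rates across groups; 0 = parity."""
    rates = [positive_rate(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)

# Hypothetical binary predictions for two subgroups.
groups = {
    "group_a": [1, 1, 0, 1, 0],  # 60% positive
    "group_b": [1, 0, 0, 0, 1],  # 40% positive
}
print(demographic_parity_gap(groups))  # gap of roughly 0.2
```

Other definitions, such as equalized odds, condition this comparison on the true label, which is why a single toolkit supporting many metrics is useful.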
Although several open source libraries tackle such fairness-related problems (FairLearn, IBM Fairness 360 Toolkit, ML-Fairness-Gym, FAT-Forensics), these either do not specifically address large-scale problems (and the inherent challenges that come with such scale) or they are tied to a specific cloud environment. To this end, we developed and are now open sourcing the LinkedIn Fairness Toolkit (LiFT), a Scala/Spark library that enables the measurement of fairness, according to a multitude of fairness definitions, in large-scale machine learning workflows.
Introducing the LinkedIn Fairness Toolkit (LiFT)
The LinkedIn Fairness Toolkit (LiFT) library has broad utility for organizations that wish to conduct regular analyses of the fairness of their own models and data.
- It can be deployed in training and scoring workflows to measure biases in training data, evaluate different fairness notions for ML models, and detect statistically significant differences in their performance across different subgroups. It can also be used for ad hoc fairness analysis or as part of a large-scale A/B testing system.
- The metrics currently supported measure: various distances between observed and expected probability distributions; traditional fairness metrics (e.g., demographic parity, equalized odds); and fairness measures that capture a notion of skew, such as the Generalized Entropy Index, Theil’s indices, and Atkinson’s index.
- LiFT also introduces a novel metric-agnostic permutation testing framework that detects statistically significant differences in model performance (as measured according to any given assessment metric) across different subgroups. This testing methodology will appear at KDD 2020.
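As an illustration of the inequality measures listed above, here is a sketch (plain Python, not LiFT's implementation) of the Generalized Entropy Index. Following the common formulation from the fairness literature, each individual's "benefit" is defined as prediction − label + 1, and the index is zero exactly when all benefits are equal:

```python
# Illustrative sketch (not LiFT code) of the Generalized Entropy Index,
# an inequality measure over per-individual "benefits".
def generalized_entropy_index(benefits, alpha=2.0):
    """GE(alpha) = 1 / (n * alpha * (alpha - 1)) * sum((b_i / mu)^alpha - 1).

    Equals 0 when every benefit b_i equals the mean mu; grows with inequality."""
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1 for b in benefits) / (n * alpha * (alpha - 1))

# Benefit b_i = prediction - label + 1: a false positive gets 2,
# a correct prediction gets 1, and a false negative gets 0.
preds = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1]
benefits = [p - y + 1 for p, y in zip(preds, labels)]
print(generalized_entropy_index(benefits))  # 0.2 for this toy example
```

Unlike group metrics such as demographic parity, this family of measures captures inequality across individuals, which is why LiFT supports both kinds.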
In the remainder of this post, we will provide a high-level overview of various aspects of LiFT’s design, then delve into the details of our permutation testing methodology and discuss how it overcomes the limitations of conventional permutation tests (and other fairness metrics). Finally, we’ll share our thoughts on future work.
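For background, here is a minimal sketch of a *conventional* permutation test for a performance gap between two subgroups; LiFT's KDD 2020 framework refines this idea, so this is only the textbook baseline it builds on, not LiFT's method. Under the null hypothesis that group membership does not affect performance, shuffling group labels should produce gaps at least as large as the observed one reasonably often:

```python
# Conventional permutation test (illustrative baseline, not LiFT's variant):
# is the observed gap in mean per-example correctness between two subgroups
# larger than chance relabeling would explain?
import random

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """p-value for the observed |mean(a) - mean(b)| under group-label shuffling.

    scores_a / scores_b: per-example correctness (1 = correct, 0 = wrong)."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group membership
        gap = abs(sum(pooled[:n_a]) / n_a
                  - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if gap >= observed:
            extreme += 1
    return extreme / n_permutations
```

A small p-value indicates the performance difference between subgroups is unlikely under random group assignment. The mean-correctness statistic here is just one choice; LiFT's metric-agnostic framework generalizes the test to any assessment metric.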
The LinkedIn Fairness Toolkit (LiFT)
To enable deployments in web-scale ML systems, we built LiFT to be:
- Flexible: It is usable for exploratory analyses (e.g., with Jupyter notebooks) and can be deployed in production ML workflows as well. The library comprises bias measurement components that can be integrated into different stages of an ML training and serving system.
- Scalable: Computation can be distributed over several nodes to scale bias measurement to large datasets. It leverages Apache Spark to ensure that it can operate on datasets stored on distributed file systems while achieving data parallelism and fault tolerance. Utilizing Spark also provides compatibility with a variety of offline compute systems, ML frameworks, and cloud providers, for maximum flexibility.
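The data-parallel pattern behind this scalability can be sketched in plain Python (Spark performs the equivalent aggregation across executors; this is not LiFT code): each shard computes per-group sufficient statistics independently, and the small per-shard summaries are then merged, so group-level rates never require collecting raw data on one node.

```python
# Map-reduce-style aggregation of per-group sufficient statistics,
# mirroring what Spark distributes across partitions (illustrative only).
from collections import defaultdict

def partial_stats(records):
    """records: iterable of (group, prediction) pairs from one data shard."""
    stats = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, pred in records:
        stats[group][0] += pred
        stats[group][1] += 1
    return dict(stats)

def merge_stats(a, b):
    """Combine two shards' summaries; associative, so merging parallelizes."""
    merged = {g: list(v) for g, v in a.items()}
    for g, (pos, tot) in b.items():
        merged.setdefault(g, [0, 0])
        merged[g][0] += pos
        merged[g][1] += tot
    return merged

shard1 = [("a", 1), ("a", 0), ("b", 1)]
shard2 = [("a", 1), ("b", 0), ("b", 0)]
stats = merge_stats(partial_stats(shard1), partial_stats(shard2))
rates = {g: pos / tot for g, (pos, tot) in stats.items()}
print(rates)  # positive-prediction rate per group
```

Because the merge step is associative, the same code works whether the shards live on one machine or thousands of Spark partitions.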
To enable its use in ad hoc exploratory settings as well as in production workflows and ML pipelines, LiFT is designed as a reusable library at its core, with wrappers and a configuration language meant for deployment. This provides users with multiple interfaces to interact with the library, depending on their use case.