The Statistical Modeling System Powering LinkedIn Salary


Introduction

For most job seekers, salary (or, more broadly, compensation) is a crucial consideration in choosing a new job opportunity. Indeed, according to a survey of over 5,000 job seekers in the United States and Canada, more candidates (74%) want to see salary information than any other feature in a job posting. At the same time, job seekers struggle to learn the salaries associated with different jobs, given the dearth of reliable sources of compensation data. The LinkedIn Salary product was designed with the goal of providing compensation insights to the world’s professionals, thereby helping them make more informed career decisions.

With its structured information, including the work experience, educational history, and skills associated with over 500 million members, LinkedIn is in a unique position to collect compensation data from its members at scale and provide rich, robust insights covering different aspects of compensation while still preserving member privacy. For instance, we can provide insight into the distribution of base salary, bonus, equity, and other types of compensation for a given profession, how these factors vary based on things like location, experience, company size, and industry, and which locations, industries, or companies pay the most.

In addition to helping job seekers understand their economic value in the marketplace, the compensation data has the potential to help us better understand the monetary dimensions of the Economic Graph, which includes companies, industries, regions, jobs, skills, and educational institutions, among other things.

The availability of compensation insights along different demographic dimensions can lead to greater transparency, shedding light on the extent of compensation disparity and thereby helping stakeholders, including employers, employees, and policy makers, take steps to address pay inequality.

Further, products such as LinkedIn Salary can improve efficiency in the labor marketplace by reducing the asymmetry of compensation knowledge and by serving as market-perfecting tools for workers and employers. Such tools have the potential to help students make good career choices by taking expected compensation into account, and to encourage workers to learn skills needed for obtaining well-paying jobs, thereby helping reduce the skills gap.

In this post, we will describe the overall design and architecture of the statistical modeling system underlying the LinkedIn Salary product. We will also focus on unique challenges we have faced, such as the simultaneous need for user privacy, product coverage, and robust, reliable compensation insights, and will describe how we addressed these challenges using mechanisms such as outlier detection and Bayesian hierarchical smoothing.

Problem setting

In the publicly launched LinkedIn Salary product, members can explore compensation insights by searching for different titles and locations. For a given title and location, we present quantiles (the 10th percentile, median, and 90th percentile) and histograms for base salary, bonus, and other types of compensation. We also present more granular insights on how pay varies with factors such as location, experience, education, company size, and industry, and on which locations, industries, or companies pay the most.
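To make the displayed statistics concrete, here is a minimal Python sketch of how the 10th percentile, median, 90th percentile, and a fixed-width histogram could be computed for one cohort. The function names, the bin width, and the interpolation method are illustrative assumptions, not the production implementation.

```python
import statistics
from collections import Counter


def summarize_cohort(base_salaries):
    """Return the 10th percentile, median, and 90th percentile of a cohort.

    Hypothetical helper: the interpolation method is an assumption.
    """
    if len(base_salaries) < 2:
        raise ValueError("need at least two entries to compute quantiles")
    # quantiles(n=10) returns the 9 cut points that split the data into deciles.
    deciles = statistics.quantiles(base_salaries, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "median": statistics.median(base_salaries),
        "p90": deciles[8],
    }


def histogram(values, bin_width):
    """Count entries per bin, keyed by the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    salaries = list(range(100_000, 210_000, 10_000))  # made-up example data
    print(summarize_cohort(salaries))
    print(histogram(salaries, bin_width=50_000))
```

Showing the 10th and 90th percentiles rather than the minimum and maximum keeps the displayed range from being driven by a single extreme entry.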

The compensation insights shown in the product are based on compensation data that we have been collecting from LinkedIn members. We designed a give-to-get model based on the following data collection process. First, cohorts (such as User Experience Designers in the San Francisco Bay Area) with a sufficient number of LinkedIn members are selected. Within each cohort, emails are sent to a random subset of members, requesting them to submit their compensation data (in return for aggregated compensation insights later). Once we collect sufficient data, we get back to the responding members with the compensation insights and reach out to the remaining members in those cohorts, promising corresponding insights immediately upon submission of their compensation data.

Considering the sensitive nature of compensation data and the desire to preserve the privacy of LinkedIn’s members, we designed our system to protect against data breaches and against inference of any particular individual’s compensation data from the outputs of the system. Our methodology for achieving this goal, through a combination of techniques such as encryption, access control, de-identification, aggregation, and thresholding, is described in our IEEE PAC 2017 paper. Next, we will highlight the key data mining and machine learning challenges for the salary modeling system (see our ACM CIKM 2017 paper for more details).

Modeling challenges

Modeling on de-identified data: Due to privacy requirements, the salary modeling system has access only to cohort-level data containing de-identified compensation submissions (e.g., salaries for UX Designers in the San Francisco Bay Area) and is limited to those cohorts having at least a minimum number of entries. Each cohort is defined by a combination of attributes, such as title, country, region, company, and years of experience, and contains de-identified compensation entries obtained from individuals who all share these same attributes. Within a cohort, each individual entry consists of values for different compensation types, such as base salary, annual bonus, sign-on bonus, commission, annual monetary value of vested stocks, and tips, and is available without an associated user name, ID, or any attributes other than those that define the cohort. Consequently, our modeling choices are limited, since we have access only to the de-identified data and therefore cannot, for instance, build prediction models that make use of more discriminating features not available due to de-identification.
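The cohort structure described above can be sketched as a simple group-and-threshold step. This is only an illustration under stated assumptions: the field names, the `build_cohorts` helper, and the `MIN_COHORT_SIZE` value are hypothetical, and the actual minimum threshold is not given in this post.

```python
from collections import defaultdict

# Hypothetical value: the real minimum cohort size is not stated in this post.
MIN_COHORT_SIZE = 5


def build_cohorts(entries, cohort_keys, min_size=MIN_COHORT_SIZE):
    """Group de-identified entries by cohort attributes and drop small cohorts.

    Each entry is assumed to carry only the cohort-defining attributes and
    compensation values (no user name or ID).
    """
    cohorts = defaultdict(list)
    for entry in entries:
        key = tuple(entry[k] for k in cohort_keys)
        cohorts[key].append(entry["base_salary"])
    # Thresholding: only cohorts with enough entries are exposed to modeling.
    return {
        key: sorted(values)
        for key, values in cohorts.items()
        if len(values) >= min_size
    }


if __name__ == "__main__":
    entries = [
        {"title": "UX Designer", "region": "SF Bay Area", "base_salary": s}
        for s in (100_000, 110_000, 120_000)
    ]
    entries.append(
        {"title": "UX Designer", "region": "Seattle", "base_salary": 95_000}
    )
    print(build_cohorts(entries, ("title", "region"), min_size=3))
```

In this sketch the single Seattle entry is dropped, since a one-entry cohort would reveal an individual's compensation.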

Evaluation: In contrast to other member-facing products, such as job recommendations, we face unique evaluation and data quality challenges with our salary product. Since members themselves may not have a good perception of the true compensation range, they may not be in a position to evaluate whether the compensation insights displayed are accurate. Consequently, it is not feasible to perform online A/B testing to compare the compensation insights generated by different models. Further, there are very few reliable and easily available ground truth datasets in the compensation domain, and even when available (e.g., the BLS OES dataset), mapping such datasets onto LinkedIn’s taxonomy is inevitably noisy.

Outlier detection: As the quality of the insights depends on the quality of the submitted data, detecting and pruning potential outlier entries is crucial. Such entries could arise due to either mistakes or misunderstandings during submission, or due to intentional falsification (such as someone attempting to game the system). We needed a solution to this problem that would work even during the early stages of data collection, when outlier detection is more challenging, and there may not be sufficient data across related cohorts to compare.
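As one illustration of pruning within a single cohort, the sketch below applies the standard box-and-whisker (Tukey fence) rule. It is an example of the general idea rather than a description of the method the system uses; the `prune_outliers` name and the multiplier `k` are assumptions.

```python
import statistics


def prune_outliers(values, k=1.5):
    """Drop entries outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences).

    Illustrative only: with very few entries the quartiles are unreliable,
    so small cohorts are returned unchanged.
    """
    if len(values) < 4:
        return list(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


if __name__ == "__main__":
    # Made-up data: the last entry could be a typo (an extra zero).
    submitted = [90_000, 95_000, 100_000, 105_000, 110_000, 1_000_000]
    print(prune_outliers(submitted))
```

A rule like this needs only the cohort's own entries, but it becomes unreliable when a cohort has very few entries, which is the early-stage difficulty noted above.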

Robustness and stability: While some cohorts may have a large sample size, many cohorts typically contain very few (< 20) data points each. Given the desire to have data for as many cohorts as possible, we needed to ensure that the compensation insights were robust and stable even when data was sparse. That is, for such cohorts, the insights should be reliable, and not too sensitive to the addition of a new entry. We faced a similar challenge when it came to reliably inferring the insights for cohorts with no data at all.
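The stability requirement can be illustrated with a simple shrinkage estimator, in the spirit of Bayesian smoothing: the mean log salary of a sparse cohort is pulled toward the mean of a broader parent cohort (say, the same title across the whole country). This is a minimal sketch, assuming a pseudo-count prior; `smoothed_log_mean`, `prior_strength`, and the choice of parent cohort are hypothetical and not taken from the production system.

```python
import math


def smoothed_log_mean(cohort_salaries, parent_log_mean, prior_strength=5.0):
    """Shrink a cohort's mean log salary toward a parent cohort's mean.

    prior_strength acts like a number of pseudo-observations located at
    parent_log_mean (an assumed, illustrative prior).
    """
    n = len(cohort_salaries)
    if n == 0:
        # No data at all: fall back entirely on the parent cohort.
        return parent_log_mean
    sample_log_mean = sum(math.log(s) for s in cohort_salaries) / n
    return (prior_strength * parent_log_mean + n * sample_log_mean) / (
        prior_strength + n
    )


if __name__ == "__main__":
    parent = math.log(100_000)  # made-up parent cohort estimate
    sparse = smoothed_log_mean([150_000], parent)
    dense = smoothed_log_mean([150_000] * 200, parent)
    print(round(math.exp(sparse)), round(math.exp(dense)))
```

With one entry the estimate stays close to the parent value, so a single new submission cannot swing it much; with many entries the cohort's own data dominates.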

Our problem can thus be stated as follows. How do we design the salary modeling system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we compute robust, reliable compensation insights from de-identified compensation data (to preserve member privacy), while meeting product requirements such as coverage?

LinkedIn Salary modeling system design and architecture

Our system consists of both an online component that uses a service-oriented architecture for retrieving compensation insights corresponding to the query from the user-facing product, and an offline component for processing de-identified compensation data and generating compensation insights.


