We generated an initial list of candidate work experience descriptions from LinkedIn member profiles. We then applied a series of hard filters, including members’ privacy preferences, to produce a cleaner set of viable candidates. It’s important to note that LinkedIn only considers profiles whose privacy settings are set to public. Additionally, members can opt out, in which case their data is excluded at this filtering step.
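The hard-filtering step can be pictured with a minimal sketch. The `Candidate` record, its `is_public` and `opted_out` flags, and the empty-description check are all assumptions made for illustration; the actual filters and data model are not described in this post.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    member_id: str
    description: str
    is_public: bool   # hypothetical flag: profile privacy is set to public
    opted_out: bool   # hypothetical flag: member opted out of this use


def apply_hard_filters(candidates):
    """Keep non-empty descriptions from public profiles of members who did not opt out."""
    return [
        c for c in candidates
        if c.is_public and not c.opted_out and c.description.strip()
    ]
```

Because each check is a plain boolean condition, further hard filters could be added to the same predicate without changing the rest of the pipeline.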
An early error analysis revealed that about 8% of the errors came from work experience descriptions that described a company or product rather than someone’s work experience. It wouldn’t be useful to surface these descriptions in the Resume Assistant product, so we trained a binary text classifier on LinkedIn company descriptions (which were given the “company” label) and LinkedIn member work experience descriptions (which were given the “work experience description” label). Even though the member work experience data was noisy, in that it contained a significant number of “company” descriptions, the model was able to generalize successfully. All work experience examples predicted as “company” were then filtered out of the data.
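As a rough illustration of this filter, the sketch below trains a tiny bag-of-words Naive Bayes classifier on the two labels and drops anything predicted as “company”. Naive Bayes is a stand-in chosen for brevity; the post does not say which model family was used, and the class and function names here are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def drop_company_descriptions(classifier, descriptions):
    """Remove every description the classifier labels as "company"."""
    return [d for d in descriptions if classifier.predict(d) != "company"]
```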
Once we removed the examples that we definitely did not want to surface in the product, we needed to rank the remaining descriptions. We used a gradient-boosted decision tree classifier that predicted “good” or “bad” labels given a work experience description. To obtain a ranking score, we simply used the distribution over the classes returned by the classifier: the score returned by the ranker is the probability of the “good” label being assigned, given the input text.
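The scoring rule described above can be sketched independently of the classifier itself. Here `predict_distribution` is a hypothetical callable that returns a mapping from class label to probability; with a library such as scikit-learn, that distribution would typically come from a classifier’s `predict_proba` method.

```python
def ranking_score(class_distribution):
    """The ranking score is the probability assigned to the "good" label."""
    return class_distribution["good"]


def rank_descriptions(descriptions, predict_distribution):
    """Return (score, description) pairs sorted from best to worst."""
    scored = [(ranking_score(predict_distribution(d)), d) for d in descriptions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Using the class probability rather than the hard label is what turns a binary classifier into a ranker: two descriptions that are both labeled “good” can still be ordered by how confident the model is.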
Data for training the model was created by an in-house linguist team (details of the annotation task are described below) and consisted of a label (good/bad) derived from human judgements about the quality of the work experience descriptions. Because we had a very small training set, we avoided overfitting by training a very simple model with only a few features, based mostly on the structural characteristics of the text.
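To give a sense of what structural features might look like, here is a small sketch. The specific features (word count, line count, bullet count, presence of digits) are assumptions chosen for illustration; the post does not list the features that were actually used.

```python
import re


def structural_features(text):
    """Compute a few illustrative structural features of a description."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    bullet_lines = [ln for ln in lines if re.match(r"\s*[-*•]", ln)]
    return {
        "num_words": len(text.split()),
        "num_lines": len(lines),
        "num_bullets": len(bullet_lines),
        "has_digit": int(any(ch.isdigit() for ch in text)),
    }
```

A handful of low-dimensional features like these gives a small training set far fewer ways to be memorized than a full vocabulary of word features would.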
Much of the effort in this project was dedicated to coming up with effective ways to evaluate the model output. Once the product launched, we would have data from users that we could use to evaluate and improve our model. Pre-launch, however, we needed a way to evaluate how our models were doing, so we devised a task for human annotators to judge work experience descriptions. Even post-launch, we continue to use this human evaluation task because it provides a complementary validation of model output to pair with what we get from tracking data and other user feedback.
We needed to establish what constituted high-quality text in the Resume Assistant context. As well as being asked to give an overall judgement (on a four-point scale), annotators were asked a number of additional questions relating to particular elements of a work experience description (for example: “Does the description contain examples of achievements? Does the description contain any quantification of results?”). We aggregated the answers to produce a quality score for each example. We found that the additional questions helped to make the final annotations more consistent and resulted in a more useful, fine-grained ranking. They also allowed us to tune the final quality score based on what was deemed most important in terms of quality from a product perspective.
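One plausible way to aggregate such annotations is a weighted combination, sketched below. The coding of the four-point scale as 1 to 4, the yes/no form of the additional questions, the weights, and the 50/50 split between the overall judgement and the questions are all assumptions; the actual aggregation formula is not given in this post.

```python
def quality_score(overall, answers, weights, overall_weight=0.5):
    """Combine an overall judgement and yes/no answers into a score in [0, 1].

    overall: overall judgement, assumed to be coded 1 (worst) to 4 (best).
    answers: dict mapping question id to True/False.
    weights: dict mapping question id to its (hypothetical) importance.
    """
    overall_part = (overall - 1) / 3  # map the four-point scale onto [0, 1]
    total_weight = sum(weights.values())
    question_part = sum(weights[q] for q, yes in answers.items() if yes) / total_weight
    return overall_weight * overall_part + (1 - overall_weight) * question_part
```

Raising the weight of a question such as quantification of results would push descriptions that contain numbers higher in the ranking, which is the kind of product-driven tuning described above.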
Training data generation for ranking model
Early versions of our data pipeline used heuristic methods for ranking work experience descriptions. We were able to use the annotated data created during the evaluation of these pipelines to create a small training set, which we used to train the ranking model. The training set was augmented with additional randomly selected data, annotated by human annotators in the same fashion as described above.
The manual evaluations by our linguist team resulted in a quality score for each work experience description in a sample. Quality scores range from 0 to 1: a score of 0 indicates a very poor-quality description, and 1 is the best possible score. For each evaluation, we selected a number of job titles and ran the model on all the data in the knowledge base (i.e., the LinkedIn Economic Graph) to retrieve the top k work experience examples for each title. These top k results were then evaluated. Each histogram below displays the results for one such evaluation run. Improvements in data quality over time can be observed in the gradual improvement of the distribution of quality scores in the histograms for the three models displayed here.
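The top k retrieval step used to build each evaluation sample can be sketched as follows. The `(title, score, description)` tuple layout and the function name are hypothetical, and the sketch works on an in-memory list rather than the full knowledge base.

```python
import heapq
from collections import defaultdict


def top_k_per_title(scored_examples, k):
    """Return the k highest-scoring (score, description) pairs for each job title.

    scored_examples: iterable of (title, score, description) tuples.
    """
    by_title = defaultdict(list)
    for title, score, description in scored_examples:
        by_title[title].append((score, description))
    return {
        title: heapq.nlargest(k, items, key=lambda item: item[0])
        for title, items in by_title.items()
    }
```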