Pensieve: An embedding feature platform

Figure 2: Architecture of our platform, divided into pillars

In this post, we highlight the novel subcomponents of this platform, taking a deep dive into Pensieve modeling and the Nearline Embedding Serving Framework.

Pensieve model

Model input
The LinkedIn Knowledge Graph is an important source of features across all AI models at LinkedIn, defining relationships between entities such as members, uploaded resumes, job postings, titles, skills, companies, and geolocations. Building the relationship edges, using explicit information provided by members and through inference models, is known internally as “Feature Standardization.” Titles, skills, companies, and geolocations related to members, uploaded resumes, and job postings are anonymized to ID values and used as sparse categorical input features.

However, company and geolocation features have extremely high cardinality, often in the millions. Training on such high-dimensional inputs results in larger models and slower convergence. We combat this by subsetting the features before use, leveraging the observation that many job/member feature pairs tend to co-occur. For example, members in a particular geolocation often prefer to apply for jobs in a nearby location with lucrative opportunities. We can model these co-occurrences as a weighted bipartite graph G = (U, V, E), where

U := Set of values of a member feature (e.g., member geolocation)

V := Set of values of the corresponding job feature (e.g., job geolocation)

E := {(u, v, w) | u ∈ U, v ∈ V, w = fraction of co-occurrences of (u, v) feature values}
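
To make the definitions of U, V, and E concrete, here is a minimal Python sketch that builds such a graph from a list of observed (member feature value, job feature value) pairs. The function name `build_cooccurrence_graph`, the input format, and the location codes are hypothetical, and the sketch assumes that w is the share of all observed pairs accounted for by (u, v); the post does not spell out the exact normalization.

```python
from collections import Counter


def build_cooccurrence_graph(pairs):
    """Build a weighted bipartite graph (U, V, E) from observed pairs.

    `pairs` is an iterable of (member_value, job_value) tuples, for example
    (member geolocation, job geolocation) taken from job applications.
    Assumption: the edge weight w is the count of (u, v) divided by the
    total number of observed pairs.
    """
    pairs = list(pairs)
    if not pairs:
        return set(), set(), []

    counts = Counter(pairs)   # number of times each (u, v) pair was seen
    total = len(pairs)        # normalizer for the co-occurrence fraction

    U = {u for u, _ in counts}  # member-side feature values
    V = {v for _, v in counts}  # job-side feature values
    E = [(u, v, count / total) for (u, v), count in counts.items()]
    return U, V, E


# Hypothetical observations: (member geolocation, job geolocation)
observed = [("sf", "sf"), ("sf", "sj"), ("sf", "sf"), ("nyc", "nyc")]
U, V, E = build_cooccurrence_graph(observed)
for u, v, w in sorted(E):
    print(u, v, w)
```

With this normalization the weights over all edges sum to 1, so heavier edges correspond directly to the feature pairs that co-occur most often in the data.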

From this bipartite graph G, we choose the subgraph G’(U’, V’, E’) by optimizing
