The Content Ingestion team at LinkedIn primarily focuses on discovering content across the web and ingesting it into the LinkedIn content ecosystem. Not only do we ingest content whenever a member shares a URL, but we also proactively search for interesting content that our members could enjoy.
Given the team’s focus, we’ve created a tool called Post Inspector for external content providers and internal teams at LinkedIn. It provides insight into how we extract metadata, so that content providers can easily optimize the sharing experience of their content on the LinkedIn platform. For instance, when members share a link on the platform, the Content Ingestion team’s services are tasked with finding the metadata to populate the shared post’s title, image, and content provider. Metadata is essentially a bird’s-eye view of the content: it gives you an idea of what the content is about, but is not the content itself.
The internet and unstructured data
The great thing about the internet is that there is no standardized format that people must follow when publishing content online. This has allowed for a lot of innovation, and new forms of content have popped up over the years. However, since web pages have unpredictable formats, large players in the internet industry have come up with ways to understand pages across the web in order to do things such as build helpful search engines or show content previews across platforms.
There have been several efforts in the past decade to introduce some structure to the web. For instance, schema.org and the Open Graph protocol are two of the major initiatives to allow web content providers to add helpful markup to their pages so that search engines and other web companies can better interpret the content.
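To make the idea of helpful markup concrete, here is a minimal sketch of reading Open Graph tags out of a page using only Python’s standard library. The class name, function name, and sample page are our own illustrative assumptions; this is not how LinkedIn’s services are implemented, only a small example of what the markup looks like to a consumer.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from a page.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html):
    """Return a dict of Open Graph properties found in the given HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Assumed sample page; the URL is a placeholder.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="An Example Article" />
<meta property="og:image" content="https://example.com/cover.png" />
<meta name="description" content="Not an Open Graph tag" />
</head><body></body></html>
"""

og_tags = extract_open_graph(SAMPLE_PAGE)
```

Only the two `og:` properties are collected; the plain `description` tag is ignored because it is not Open Graph markup.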
Why Post Inspector?
When possible, our services use the content’s structured metadata, which can come in several forms, including Open Graph tags or oEmbed tags. However, we can’t expect that every piece of content on the web adheres to these protocols, nor can we expect that the provided metadata meets the protocols’ specifications for each metadata property. For instance, the provided image could be too large or too small, a title could be too long, and so on. Consequently, we need to have backup ways to interpret pages based on what’s available when extracting the metadata from the content, and we need to validate the data we are given to pick the best candidate for each metadata property (title, image, description, and so on).
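The validate-then-fall-back idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Candidate` type, the source labels, the priority ordering, and the 150-character title limit are hypothetical and do not reflect LinkedIn’s actual rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    value: str
    source: str  # hypothetical labels, e.g. "open_graph" or "html_fallback"


# Hypothetical limit chosen for this example only.
MAX_TITLE_LENGTH = 150


def title_is_valid(candidate: Candidate) -> bool:
    """A title is usable if it is non-empty and not too long."""
    text = candidate.value.strip()
    return 0 < len(text) <= MAX_TITLE_LENGTH


def pick_best(
    candidates: Sequence[Candidate],
    is_valid: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Return the first valid candidate; candidates are in priority order."""
    for candidate in candidates:
        if is_valid(candidate):
            return candidate
    return None


# Structured metadata comes first, followed by a fallback from the page itself.
title_candidates = [
    Candidate(value="x" * 200, source="open_graph"),  # too long, so rejected
    Candidate(value="An Example Article", source="html_fallback"),
]

chosen = pick_best(title_candidates, title_is_valid)
```

In this sketch the Open Graph title fails validation, so the fallback is chosen. Recording the `source` alongside each value is what would let a tool explain why a given value won.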
This additional complexity of handling unstructured data and validating existing structured metadata makes it harder to reason about why we are extracting certain values. Our team had to answer a lot of questions regarding content metadata, such as why we chose a certain image over the one that was specified in the structured metadata. Often, the reason was that the specified image didn’t meet the image property’s criteria, so our service picked a better candidate. Up until now, the reason why certain values were chosen over others was largely a mystery to most people outside our team. That had to change, because we want to help create the best content sharing experience for our members, in a highly scalable way.
At LinkedIn, we care a lot about both our members’ and publishers’ success, so we have teams across the company that work closely with publishers to help us provide our members with a variety of content, displayed in a way that makes the content fit seamlessly into the feed experience.
Without Post Inspector, people didn’t have clear visibility into content metadata requirements. Each time a client had an issue with how their content was being displayed, our team would have to answer with implementation details about why exactly certain values were picked.
We believe that the knowledge required to solve a problem should match the problem’s complexity. This effectively means that if a publisher, or any member for that matter, wants to improve how their content looks on LinkedIn, it should be extremely straightforward. Through Post Inspector, we want to empower our members and publishers alike, so that anyone can easily know what they can do to help their content gain more traction on LinkedIn. No one should have to know the fine details of the LinkedIn architecture to figure out why their image is smaller than expected, or is not showing up.