Presenting an Open Source Toolkit for Lightweight…


By Aasish Pappu, Roi Blanco, and Amanda Stent

What’s the first thing you want to know about any kind of text document (like a Yahoo News or Yahoo Sports article)? What it’s about, of course! That means you want to know something about the people, organizations, and locations mentioned in the document. Systems that automatically surface this information are called named entity recognition and linking systems. They are among the most useful components in text analytics, as they are required for a wide variety of applications, including search, recommender systems, question answering, and sentiment analysis.

Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. A major challenge is to detect entities accurately, in new languages, at scale, with limited labeled data, while consuming limited resources (memory and processing power).

After researching and implementing solutions to enhance our own personalization technology, we are pleased to offer the open source community Fast Entity Linker, our unsupervised, accurate, and extensible multilingual named entity recognition and linking system, along with datapacks for English, Spanish, and Chinese.

For broad usability, our system links text entity mentions to Wikipedia. For example, in the sentence “Yahoo is a company headquartered in Sunnyvale, CA with Marissa Mayer as CEO,” our system would identify three entities: Yahoo (the organization), Sunnyvale, CA (the location), and Marissa Mayer (the person).

On the algorithmic side, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data structures and aggressive hashing functions.
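The memory-saving idea can be sketched roughly as follows. This is a toy illustration, not FEL's actual data structures: the bucket count and quantization step are invented. Surface forms are hashed into a fixed number of buckets so the embedding table stays bounded, and vector components are stored as single bytes instead of full floats:

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # fixed table size regardless of vocabulary growth

def bucket(token: str) -> int:
    """Map a token to a stable bucket id via a strong hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

def quantize(vec, step=0.05):
    """Store each float as one byte (vs. 8 bytes for a Python float)."""
    return bytes(max(-128, min(127, round(x / step))) + 128 for x in vec)

def dequantize(blob, step=0.05):
    """Recover an approximation of the original vector."""
    return [(b - 128) * step for b in blob]

v = [0.12, -0.4, 0.33]
q = quantize(v)
print(len(q))  # 3 bytes for a 3-dimensional vector
```

The trade-off is a small, bounded reconstruction error (at most half the quantization step per component) in exchange for an 8x reduction in storage.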

Entity embeddings are vector-based representations that capture how entities are referred to in context. We train entity embeddings using Wikipedia articles, and use hyperlinked terms in the articles to create canonical entities. The context of an entity and the context of a token are modeled using the neural network architecture in the figure below, where entity vectors are trained to predict not only their surrounding entities but also the global context of the word sequences contained within them. In this way, one layer models entity context, and the other layer models token context. We connect these two layers using the same technique that (Le and Mikolov ‘14) used to train paragraph vectors.

Architecture for training word embeddings and entity embeddings simultaneously. Ent represents entities and W represents their context words.
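The core of the joint training idea can be sketched in a few lines. This is a deliberately tiny illustration, not FEL's implementation: the two-document corpus, the dimensionality, and the learning rate are all invented, and it shows only the paragraph-vector-style step where an entity vector is trained to predict its context words against randomly sampled negatives:

```python
import math
import random

random.seed(0)
DIM = 8

# Invented toy corpus: each entity paired with a few context words.
corpus = [
    ("ENT/Yahoo", ["company", "headquartered", "sunnyvale"]),
    ("ENT/Fox_News", ["cable", "news", "channel"]),
]

words = sorted({w for _, ctx in corpus for w in ctx})
vec = {name: [random.uniform(-0.5, 0.5) for _ in range(DIM)]
       for name in words + [e for e, _ in corpus]}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(epochs=200, lr=0.1):
    for _ in range(epochs):
        for entity, context in corpus:
            for w in context:
                # one positive pair and one random negative per word
                negative = random.choice([x for x in words if x != w])
                for target, label in ((w, 1.0), (negative, 0.0)):
                    e, t = vec[entity], vec[target]
                    g = lr * (label - sigmoid(dot(e, t)))
                    for i in range(DIM):
                        e[i], t[i] = e[i] + g * t[i], t[i] + g * e[i]

train()

def score(entity, word):
    """Probability-like affinity between an entity and a word."""
    return sigmoid(dot(vec[entity], vec[word]))
```

After training, each entity vector sits close to the words of its own context and far from words it was sampled against, which is the property the disambiguation stage later exploits.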

Search click-log data provides very useful signals for disambiguating partial or ambiguous entity mentions. For example, if searchers for “Fox” tend to click on “Fox News” rather than “20th Century Fox,” we can use this data to identify “Fox” in a document. To disambiguate entity mentions and ensure a document has a consistent set of entities, our system supports three entity disambiguation algorithms:

*Currently, only the Forward-Backward algorithm is available in our open source release; the other two will be made available soon!
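The click-log signal amounts to an empirical prior P(entity | mention). A minimal sketch, with invented click counts standing in for real logs:

```python
from collections import defaultdict

# Invented click log: (query mention, clicked entity) pairs.
clicks = [
    ("fox", "Fox_News"), ("fox", "Fox_News"), ("fox", "Fox_News"),
    ("fox", "20th_Century_Fox"),
]

prior = defaultdict(lambda: defaultdict(int))
for mention, entity in clicks:
    prior[mention][entity] += 1

def click_prior(mention, entity):
    """P(entity | mention) estimated from click counts."""
    total = sum(prior[mention].values())
    return prior[mention][entity] / total if total else 0.0

def most_likely(mention):
    """Highest-clicked candidate for a mention, if any."""
    cands = prior[mention]
    return max(cands, key=cands.get) if cands else None

print(most_likely("fox"))  # → Fox_News
```

On its own this prior just picks the popular candidate; the disambiguation algorithms below combine it with document context so the popular choice can be overridden.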

These algorithms are particularly helpful in accurately linking entities when a popular candidate is NOT the correct candidate for an entity mention. In the example below, these algorithms leverage the surrounding context to accurately link Manchester City, Swansea City, Liverpool, Chelsea, and Arsenal to their respective football clubs.

Ambiguous mentions that could refer to multiple entities are highlighted in red. For example, Chelsea could refer to Chelsea Football team or Chelsea neighborhood in New York or London. Unambiguous named entities are highlighted in green.
Examples of the candidate retrieval process in entity linking for the ambiguous and unambiguous mentions referred to above. The correct candidate is highlighted in green.
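The Forward-Backward pass over candidate entities can be sketched as follows. This is an illustration of the general technique, not FEL's code: the mentions, candidate priors, and pairwise coherence scores are all invented. Emissions are per-candidate priors, transitions are coherence scores between adjacent candidates, and the posterior marginals favor a mutually consistent set of entities:

```python
mentions = ["Chelsea", "Arsenal"]
candidates = {
    "Chelsea": ["Chelsea_F.C.", "Chelsea,_Manhattan"],
    "Arsenal": ["Arsenal_F.C."],
}
# Invented scores: the neighborhood has the higher standalone prior.
prior = {"Chelsea_F.C.": 0.4, "Chelsea,_Manhattan": 0.6, "Arsenal_F.C.": 1.0}
coherence = {
    ("Chelsea_F.C.", "Arsenal_F.C."): 0.9,      # both football clubs
    ("Chelsea,_Manhattan", "Arsenal_F.C."): 0.1,
}

def forward_backward(mentions):
    # forward pass: prior * coherence-weighted mass from the left
    fwd = [{c: prior[c] for c in candidates[mentions[0]]}]
    for i in range(1, len(mentions)):
        fwd.append({
            c: prior[c] * sum(fwd[-1][p] * coherence[(p, c)]
                              for p in candidates[mentions[i - 1]])
            for c in candidates[mentions[i]]
        })
    # backward pass: coherence-weighted mass from the right
    bwd = [dict() for _ in mentions]
    bwd[-1] = {c: 1.0 for c in candidates[mentions[-1]]}
    for i in range(len(mentions) - 2, -1, -1):
        bwd[i] = {
            c: sum(coherence[(c, n)] * prior[n] * bwd[i + 1][n]
                   for n in candidates[mentions[i + 1]])
            for c in candidates[mentions[i]]
        }
    # pick the candidate with the largest posterior marginal per mention
    result = []
    for i, m in enumerate(mentions):
        post = {c: fwd[i][c] * bwd[i][c] for c in candidates[m]}
        result.append(max(post, key=post.get))
    return result

print(forward_backward(mentions))  # → ['Chelsea_F.C.', 'Arsenal_F.C.']
```

Note that “Chelsea” resolves to the football club even though the Manhattan neighborhood has the higher prior: the coherence with the unambiguous “Arsenal” mention overrides the popular candidate, which is exactly the behavior described above.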

At this time, Fast Entity Linker is one of only three freely available multilingual named entity recognition and linking systems (the others are DBpedia Spotlight and Babelfy). In addition to a stand-alone entity linker, the software includes tools for creating and compressing word/entity embeddings and datapacks for different languages from Wikipedia data. As an example, the datapack containing information from all of English Wikipedia is only ~2GB.

The technical contributions of this system are described in two scientific papers:

There are numerous possible applications of the open-source toolkit. One of them is attributing sentiment to entities detected in the text, as opposed to the entire text itself. For example, consider the following actual review of the movie “Inferno” from a user on MetaCritic (revised for clarity): “While the great performance of Tom Hanks (wiki_Tom_Hanks) and company make for a mysterious and vivid movie, the plot is difficult to comprehend. Although the movie was a clever and fun ride, I expected more from Columbia (wiki_Columbia_Pictures).”  Though the review on balance is neutral, it conveys a positive sentiment about wiki_Tom_Hanks and a negative sentiment about wiki_Columbia_Pictures.

Many existing sentiment analysis tools collate the sentiment value associated with the text as a whole, which makes it difficult to track sentiment around any individual entity. With our toolkit, one could automatically extract “positive” and “negative” aspects within a given text, giving a clearer understanding of the sentiment surrounding its individual components.
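A minimal sketch of what this could look like, assuming entities have already been linked (the wiki_* identifiers mimic the review example above) and using a tiny invented sentiment lexicon scored per sentence:

```python
# Invented lexicon; a real system would use a trained sentiment model.
LEXICON = {"great": 1, "vivid": 1, "clever": 1, "fun": 1,
           "difficult": -1, "expected more": -1}

# Linked sentences: (text, entities mentioned in it), paraphrasing the review.
sentences = [
    ("the great performance of wiki_Tom_Hanks made a vivid movie",
     ["wiki_Tom_Hanks"]),
    ("I expected more from wiki_Columbia_Pictures",
     ["wiki_Columbia_Pictures"]),
]

def entity_sentiment(sentences):
    """Attribute each sentence's lexicon score to the entities it mentions."""
    scores = {}
    for text, entities in sentences:
        s = sum(v for term, v in LEXICON.items() if term in text)
        for e in entities:
            scores[e] = scores.get(e, 0) + s
    return scores

print(entity_sentiment(sentences))
```

Even this crude scheme separates the positive signal around wiki_Tom_Hanks from the negative signal around wiki_Columbia_Pictures, which a document-level score would average away.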

Feel free to use the code, contribute to it, and come up with additional applications; our system and models are available at https://github.com/yahoo/FEL.


