The volume of literature produced on the topic of COVID-19 is daunting, so much so that scientists can’t keep up and need help finding relevant papers and building correlations.
Enter COVIDScholar.com. The search engine uses natural language processing techniques to scan, search, synthesize, draw insights and make connections.
A group of materials scientists at Lawrence Berkeley National Laboratory (Berkeley Lab), who usually spend their time researching high-performance materials for thermoelectrics or battery cathodes, built the text mining tool. Their quest to develop text and data mining techniques that can help answer high-priority questions related to COVID-19 stems from the White House’s March 16 call to action.
At the time, the COVID-19 Open Research Dataset (CORD-19), a body of scholarly literature about COVID-19, SARS-CoV-2 and the coronavirus group, was the most extensive machine-readable coronavirus literature collection available for data and text mining, with more than 29,000 articles.
Once the Berkeley Lab team set to work, its prototype was up and running within a week; after a month the tool had collected more than 61,000 research papers. About 8,000 were specifically about COVID-19, and the balance covered related topics, such as other viruses and pandemics in general. The team estimates that 200 new articles on the coronavirus are published every day. “Within 15 minutes of the paper appearing online, it will be on our website,” said Amalie Trewartha, a postdoctoral fellow who is one of the lead developers.
Ready for Public Use
The tool went live this week when the Berkeley Lab team released an upgraded version that allows the user to search for “related papers” and sort articles using machine-learning-based relevance tuning. COVIDScholar will also recommend similar abstracts and automatically sort papers in subcategories, such as testing or transmission dynamics, allowing users to do specialized searches.
The developers built automated scripts to grab new papers (including preprint papers), clean them up and make them searchable. At the most basic level, COVIDScholar acts as a simple search engine, albeit a highly specialized one that the developers tout as the largest single-topic literature collection on COVID-19.
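As a rough illustration of what cleaning papers and making them searchable can involve, here is a minimal sketch in Python. It is not COVIDScholar's code: the function names (`clean`, `build_index`, `search`) and the two sample entries are hypothetical placeholders, and the inverted index shown is just one common way to support keyword search.

```python
import re
from collections import defaultdict


def clean(raw_text):
    """Strip markup-like tags, collapse whitespace and lowercase the text."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)
    return re.sub(r"\s+", " ", no_tags).strip().lower()


def build_index(papers):
    """Map each word to the set of paper IDs whose cleaned text contains it."""
    index = defaultdict(set)
    for paper_id, raw_text in papers.items():
        for token in re.findall(r"[a-z0-9]+", clean(raw_text)):
            index[token].add(paper_id)
    return index


def search(index, query):
    """Return the IDs of papers that contain every word in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical placeholder entries, not real papers.
papers = {
    "paper-1": "<p>Transmission   dynamics of a respiratory virus</p>",
    "paper-2": "<p>Diagnostic testing for a respiratory virus</p>",
}
index = build_index(papers)
print(search(index, "respiratory virus"))
print(search(index, "testing"))
```

A production system would add ranking, deduplication and scheduled fetching of new papers; the sketch only shows the clean-then-index idea.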
The team of artificial intelligence experts will now train its algorithms to look for unnoticed connections between concepts. “You can use the generated representations for concepts from the machine learning models to find similarities between things that don’t actually occur together in the literature, so you can find things that should be connected but haven’t been yet,” said John Dagdelen, a UC Berkeley graduate student and Berkeley Lab researcher who is one of the lead developers.
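The idea Dagdelen describes can be pictured with a small sketch. It assumes each concept has already been turned into a vector by some machine learning model; the three toy vectors, the concept names, the 0.8 threshold and the `suggest_links` helper are all hypothetical and are not taken from COVIDScholar. Cosine similarity is used here as one common way to compare such vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest_links(embeddings, cooccurring, threshold=0.8):
    """List concept pairs that look similar but never appear together."""
    names = sorted(embeddings)
    suggestions = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if frozenset((first, second)) in cooccurring:
                continue  # already connected in the literature
            score = cosine_similarity(embeddings[first], embeddings[second])
            if score >= threshold:
                suggestions.append((first, second, score))
    return sorted(suggestions, key=lambda item: item[2], reverse=True)


# Hypothetical toy vectors standing in for learned concept representations.
embeddings = {
    "concept_a": [0.9, 0.1, 0.0],
    "concept_b": [0.8, 0.2, 0.1],
    "concept_c": [0.0, 0.1, 0.9],
}
# Hypothetical record of which concept pairs already occur together.
cooccurring = {frozenset(("concept_a", "concept_c"))}

suggestions = suggest_links(embeddings, cooccurring)
for first, second, score in suggestions:
    print(f"{first} <-> {second}: {score:.2f}")
```

With these toy values, only the pair `concept_a` and `concept_b` is suggested: the two vectors point in nearly the same direction, yet the pair is absent from the co-occurrence record.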
Further on, the team plans to work with researchers in Berkeley Lab’s Environmental Genomics and Systems Biology Division and UC Berkeley’s Innovative Genomics Institute to improve COVIDScholar’s algorithms. The idea is to synthesize systems in a way that will allow researchers to discover new connections within their data, said Dagdelen.
Not From Left Field
The entire tool runs on the supercomputers of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science user facility located at Berkeley Lab. The online search engine and portal are powered by the Spin cloud platform at NERSC.
Chalk up to experience the speed with which the team was able to iterate on ideas. The group spent three years doing natural language processing for materials science and built a similar tool called MatScholar, a project supported by the Toyota Research Institute and Shell.
Last year the team published a paper in Nature that showed how an algorithm with no training in materials science could recommend materials for functional applications several years before their discovery.