How NLP Cuts Through the Noise of 200K COVID-19 Research Papers

A Deluge of Information

When the entire world’s attention shifted to COVID-19 in 2020, suddenly the topic of scientific research became important to just about everyone.


On January 31st, 2020, a preprint* titled “Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag” was published on the platform bioRxiv. The paper claimed to find similarities between proteins of the virus behind COVID-19 and those of HIV. The authors quickly retracted it after swift criticism from the scientific community. While there was almost no mention of this research in the news, it received an incredible amount of attention on Twitter, going viral and gaining momentum despite its lack of rigor.

Within four hours, the paper had been shared over 200 times on Twitter and before the day was over, it had been shared over 30K times.

The retracted preprint went viral within hours of publication.

Despite the retraction, the paper continues to circulate on Twitter: in September 2021, more than 18 months after it was published, it received 200 shares.

Despite its retraction, the paper continues to be shared on Twitter.

As the COVID case rate and death rate rose exponentially, so did the amount of work being published. By the end of February 2020, 612 papers had been published, and by the end of March, the number was more than 2,800. Tens of thousands more would follow. 

Cumulative research graph through time.
COVID-19 research grew exponentially for the first months of the pandemic and has continued to grow linearly.
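To put the early growth in perspective, here is a rough back-of-the-envelope sketch of the doubling time implied by the two counts quoted above (612 papers at the end of February and more than 2,800 at the end of March). It assumes pure exponential growth between those two points and a 31-day gap, which is a simplification for illustration, not a fitted model.

```python
import math

def doubling_time(count_start, count_end, days_between):
    """Days needed to double, assuming exponential growth between two counts."""
    growth_factor = count_end / count_start
    return days_between * math.log(2) / math.log(growth_factor)

# Counts taken from the text; "more than 2,800" is treated as exactly 2,800,
# so the result is an upper bound on the doubling time.
days = doubling_time(count_start=612, count_end=2800, days_between=31)
print(f"Implied doubling time: about {days:.1f} days")
```

Under these assumptions, the number of published papers was doubling roughly every two weeks during March 2020.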

At Primer, we build NLP technology that helps humans make sense of massive unstructured data sources, such as scientific research, so we knew we could be a resource to the frontline workers who were fighting to contain the pandemic and save lives. On April 6, 2020, we launched COVID-19 Primer, a comprehensive information-tracking site to help researchers, frontline workers, patients, and others make sense of the flood of information pouring in from experts around the world.

For example, Madeline Grade, an emergency-medicine physician and researcher at the University of California, San Francisco, used the site early on in the pandemic when “every aspect of care was changing on a daily basis.” Grade was inundated with information and needed to create daily protocol updates for the university’s hospital. “Amid that chaos,” she says, “the Primer app was actually a really amazing way to cut through the noise” [Nature.com].

Another frontline worker, Dr. Zev Waldman, a clinician developing guidelines for the care of COVID-19 patients at his hospital, used the site to search the COVID-19 literature, both preprints and published work, for relevant pediatric subtopics. Accessing the most discussed papers allowed him to get a quick sense of what was new, trending, and potentially relevant, as did the daily briefing, a summary of the most important new papers.

Understanding 18 Months of Research

With NLP, we can structure hundreds of thousands of documents and aggregate the information you care most about into a more digestible form. One such NLP technique is topic modeling. The COVID-19 Primer site extracts topics and maps each paper to its topics. Through this lens, we can see the story of the pandemic in the rise and fall of topical research over time.
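As a rough illustration of the "map papers to topics" step, here is a minimal sketch that assigns an abstract to whichever topic's seed keywords it mentions most often. This is a simple keyword-overlap stand-in, not a real topic model and not Primer's method. The topic names, keyword sets, and function names are all hypothetical, chosen only for the example.

```python
import re
from collections import Counter

# Hypothetical topics and seed keywords, for illustration only.
TOPIC_KEYWORDS = {
    "vaccines": {"vaccine", "vaccination", "immunization", "dose"},
    "testing": {"test", "testing", "pcr", "antigen", "assay"},
    "mental health": {"anxiety", "depression", "stress", "loneliness"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def assign_topic(abstract, topic_keywords=TOPIC_KEYWORDS):
    """Return the topic whose keywords occur most often, or None if none occur."""
    counts = Counter(tokenize(abstract))
    scores = {
        topic: sum(counts[word] for word in words)
        for topic, words in topic_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(assign_topic("PCR testing and antigen assay accuracy in children"))
print(assign_topic("Anxiety and depression during lockdown"))
```

A production topic model would learn the topics from the documents themselves rather than rely on a hand-written keyword list, but the output has the same shape: each paper ends up with a topic label that can be counted over time.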

Early Focus

In the very early days of the pandemic, most research focused on understanding the biology of the virus and its epidemiology. By April 2020, many large cities in the US and around the world had imposed lockdown measures, an indication that research on the topic “Model of Epidemic and Control the Spread” was reaching policy makers. Soon after, the share of research dedicated to the disease’s spread and genetic makeup dropped off. It’s important to note that two factors may have inflated the share of early research dedicated to these topics: previous research into epidemic spread and spike proteins made them more accessible, and the total volume of research was relatively low compared to just a few months later.

Early research was disproportionately focused on understanding the disease’s makeup and epidemiology.

Emerging Research

As the world began to realize the pandemic would not end quickly, and as social distancing and school closures continued into the summer and the new school year, more research was dedicated to mental health issues. Similarly, vaccine research ramped up steadily through 2020 and into 2021. As the vaccines became widely accessible, research into vaccine hesitancy also increased.

Research into vaccines and vaccine hesitancy has grown as the pandemic has continued.

Steady Players

Two research topics that held steady as a proportion of all COVID-19 research were public health and testing.

Research dealing with public health and testing has stayed steady throughout the pandemic.

Here are the six topics in relation to each other, along with their absolute volumes. Not shown are the 50-plus other topics of COVID-19 research from the past two years.

Taken together, we can see the shift in focus of COVID-19 research topics over time.
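The proportions discussed in this section can be computed from nothing more than a list of (month, topic) pairs. The sketch below shows one way to do it. The function name and the sample records are made up for illustration and are not Primer data.

```python
from collections import Counter, defaultdict

def topic_share_by_month(papers):
    """Map each month to {topic: fraction of that month's papers}."""
    counts = defaultdict(Counter)
    for month, topic in papers:
        counts[month][topic] += 1
    shares = {}
    for month, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[month] = {t: n / total for t, n in topic_counts.items()}
    return shares

# Made-up records for illustration only.
papers = [
    ("2020-03", "epidemiology"), ("2020-03", "epidemiology"),
    ("2020-03", "testing"), ("2020-03", "vaccines"),
    ("2021-03", "vaccines"), ("2021-03", "vaccines"),
    ("2021-03", "testing"), ("2021-03", "epidemiology"),
]
shares = topic_share_by_month(papers)
for month in sorted(shares):
    print(month, shares[month])
```

Looking at shares rather than raw counts is what makes a "steady" topic visible: its absolute volume can grow along with everything else while its fraction of each month's papers stays roughly flat, as testing does in this toy data.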

Our Work Ahead

While cases in the United States and Europe are decreasing, the COVID-19 pandemic is far from over. Primer will continue to serve researchers and frontline workers by maintaining the COVID-19 Primer site as long as it is needed. We welcome inquiries from researchers, journalists, and educators who want access to this data or want to use Primer’s platform to quickly build custom models that track emerging topics. Disseminating scientific research is only one application of NLP. If you have large amounts of text data that need structuring and want to learn more about NLP, please contact [email protected].

*Preprints are preliminary reports of work that have not been certified by peer review and published in a journal. They provide a mechanism for rapidly communicating research with the scientific community.