Research investigates the origin of the COVID-19 virus using artificial intelligence (AI)



The coronavirus responsible for the COVID-19 pandemic has spread worldwide with unprecedented speed and lethality, killing hundreds of thousands of people and forcing entire national populations to self-quarantine.

The virus, formally named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is believed to be zoonotic, but its origin remains uncertain. A new paper reports on the use of artificial intelligence (AI) to help solve the puzzle of the virus's origin. The paper was published on the preprint server bioRxiv* in May 2020.

AI has been widely used during the pandemic, with applications ranging from rapid diagnostics to contact tracing and drug simulation. Its ability to quickly compare, classify, and relate data has made it an invaluable tool. In fact, researchers believe that AI may provide the key to developing a vaccine against the virus.

Using AI-assisted cluster analysis to trace the origin of SARS-CoV-2

To find the origin of the virus, the team decided to compare its genome with those of pre-existing organisms. They downloaded 334 complete genomic sequences of the virus from the GenBank database, using samples taken worldwide: 258 from the United States, 49 from China and the remaining 27 from other countries. For each set, they used the first published complete mapping of each country’s virus sequence.

They also selected reference genomic sequences, such as those of alpha and beta coronaviruses, from GenBank and Virus-Host DB. The genome sequences of the Guangxi and Guangdong pangolin coronaviruses were downloaded from the GISAID database.

Study: origin of the new coronavirus (COVID-19): a computational biology study using artificial intelligence. Image Credit: 2630ben / Shutterstock

In total, three sets of reference genomes were selected at different taxonomic levels for use with a supervised decision tree method that has been recommended for classifying new pathogens. The method steps through the taxonomic ranks from highest to lowest, looking for where the SARS-CoV-2 genome fits at the genus level and below, and for its closest relatives.

The reference genomes at each taxonomic level were fed to the AI, along with the sequences of the viral genome.

The AI-based analysis was performed with unsupervised clustering methods: a hierarchical clustering algorithm in conjunction with density-based spatial clustering of applications with noise (DBSCAN). Two steps are involved: first the algorithms are used to cluster the reference sequences alone, and then the same parameter values are used to cluster a mixture of reference genomes and SARS-CoV-2 sequences.

In other words, the method first shows the reference sequences with which the SARS-CoV-2 sequences are grouped. Second, the settings are changed to observe the corresponding changes in the formed groups. This will help collect the closest sequences to compare the similarities between genomes.
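The two-step procedure can be illustrated with a minimal Python sketch. It is not the study's code: the toy sequences, the names `ref_A1` to `ref_B2`, the mismatch-fraction distance and the cutoff of 0.3 are all assumptions made for illustration, and a simple single-linkage grouping stands in for the hierarchical clustering and DBSCAN algorithms the study reports using.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster(seqs, cutoff):
    """Single-linkage grouping: any two sequences within `cutoff` share a group."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links up to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)


# Hypothetical toy data, not real genomes.
references = {
    "ref_A1": "AAAAAAAAAA",
    "ref_A2": "AAAAAAAAAC",
    "ref_B1": "CCCCCCCCCC",
    "ref_B2": "CCCCCCCCCG",
}
query = {"query": "AAAAAAAACC"}
CUTOFF = 0.3  # assumed value, reused unchanged in both steps

# Step 1: cluster the reference sequences alone.
step1 = cluster(references, CUTOFF)
# Step 2: cluster references plus the query with the same cutoff.
step2 = cluster({**references, **query}, CUTOFF)

print(step1)
print(step2)
```

In this toy example the references split into an "A" group and a "B" group in step 1, and the query joins the "A" group in step 2, which is the kind of result the method reads as a taxonomic assignment.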

What did the study find?

By progressively tightening the search parameters, the team moved from high to low levels of taxonomic classification. Starting with the first reference set, comprising viruses of 12 major classes at the highest level, the team found that the virus clustered with the Riboviria group, represented by the MERS virus (responsible for the 2012 MERS outbreak). Based on these data, they concluded that the coronavirus probably belongs to the realm Riboviria.

At the next level, they analyzed the grouping of SARS-CoV-2 against 12 virus families within Riboviria. The results show that the viral genome is grouped with the family Coronaviridae. This family has four genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus. SARS-CoV-2 belongs to the genus Betacoronavirus.

Within this genus, among 37 reference sequences, SARS-CoV-2 is grouped with the subgenus Sarbecovirus. This group consists mainly of SARS coronaviruses and bat coronaviruses, but also includes five pangolin sequences from Guangxi and one from Guangdong.

Interestingly, the study found that the amount of variation in the genetic code of the 334 samples, compared to the reference samples, was virtually constant for all samples, which were collected in sixteen countries over a three-month time period.

With narrower cutoff parameters, SARS-CoV-2 continued to be grouped with Sarbecovirus, even as this group split into two. With a very low limit, SARS-CoV-2 is grouped with only 2 viruses based on full genome analysis: Bat CoV RaTG13 and Guangdong pangolin CoV.

By further narrowing the search, the AI found only one virus that it was able to group with SARS-CoV-2: bat CoV RaTG13. This could mean that bats are the most likely reservoir host for SARS-CoV-2.
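The effect of tightening the cutoff can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's method or data: the 20-letter sequences, the labels `bat_like`, `pangolin_like`, `sars_like` and `mers_like`, the mismatch-fraction distance and the cutoff values are all hypothetical, chosen so that the groups shrink in a similar order to the one reported.

```python
def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def companions(query, refs, cutoff):
    """References that end up in the query's single-linkage group at `cutoff`."""
    seqs = dict(refs, query=query)
    seen, frontier = {"query"}, ["query"]
    while frontier:
        current = frontier.pop()
        for name, seq in seqs.items():
            if name not in seen and p_distance(seqs[current], seq) <= cutoff:
                seen.add(name)
                frontier.append(name)
    return sorted(seen - {"query"})


# Hypothetical toy sequences, not real genomes.
query = "ACGTACGTACGTACGTACGT"
refs = {
    "bat_like": "ACGTACGTACGTACGTACGA",       # 1 mismatch with the query
    "pangolin_like": "TTGTACGTACGTACGTACGT",  # 2 mismatches
    "sars_like": "ACGTTGCACCGTACGTACGT",      # 5 mismatches
    "mers_like": "TTTTTTTTTTTTTTTTTTTT",      # 15 mismatches
}

# Tighten the cutoff step by step and watch the query's group shrink.
for cutoff in (0.30, 0.12, 0.06, 0.01):
    print(cutoff, companions(query, refs, cutoff))
```

With these toy values the query is grouped with three references at the loosest cutoff, then with two, then with `bat_like` alone, and finally with nothing, which shows how the closest relative is the last one left before the group dissolves.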

Greater horseshoe bat (Rhinolophus ferrumequinum). Image credit: ATTILA Barsan / Shutterstock

However, with an even lower cutoff, the AI did not group the virus with any other organism. Does this mean that the virus cannot have originated from bats or pangolins?

The study says this is a “moot question” because SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV) are about as similar to each other as, for example, bat coronaviruses originating in the same host.

SARS-CoV-2 Probably from Bat or Pangolin CoV

They conclude: “Therefore, it is considered highly likely that SARS-CoV-2 originated from the same host with Bat CoV RaTG13 or Guangdong pangolin CoV, which is either bat or pangolin, respectively.” The study shows AI’s ability to make sense of large volumes of data and pick out meaningful and useful patterns. The hope is that the same power can be harnessed to develop an effective vaccine against SARS-CoV-2.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and therefore should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.


