Data analysis identifies the “mother” of all SARS-CoV-2 genomes



Temple researchers have identified the progenitor genome of the coronavirus.

In the field of molecular epidemiology, the worldwide scientific community has been working steadily to solve the riddle of the early history of SARS-CoV-2.

Since the first SARS-CoV-2 infection was detected in December 2019, tens of thousands of its genomes have been sequenced worldwide, revealing that the coronavirus is mutating, albeit slowly, at a rate of about 25 mutations per genome per year.

But despite major efforts, no one has yet identified the first case of human transmission, or “patient zero,” of the COVID-19 pandemic. Finding such a case is needed to better understand how the virus may have jumped from its animal host to infect humans, as well as how the SARS-CoV-2 genome has mutated over time and spread globally.

“The SARS-CoV-2 virus has already infected more than 35 million people around the world,” said Sudhir Kumar, director of the Institute for Genomics and Evolutionary Medicine at Temple University. “We set out to find the common ancestor of all these infections, which we call the progenitor genome.”

This progenitor genome is the mother of all the SARS-CoV-2 coronaviruses that are infecting people today.

In the absence of patient zero, Kumar and his Temple University research team may now have found the next best thing to aid the worldwide molecular epidemiology detective work. “We set out to reconstruct the progenitor genome using a large dataset of coronavirus genomes obtained from infected individuals,” said Sayaka Miura.

They found that the “mother” of all SARS-CoV-2 genomes and its early offspring strains subsequently mutated and spread to dominate the worldwide pandemic. “We have now reconstructed the progenitor genome and mapped where and when the earliest mutations took place,” said Kumar, corresponding author of the preprint study.

In doing so, their work has provided new insight into the early mutational history of SARS-CoV-2. For example, their study reports that a mutation in the SARS-CoV-2 spike protein (D614G), often implicated in increased infectivity and spread, occurred after many other mutations, weeks after COVID-19 began. “It is almost always found alongside many other protein mutations, so its role in increased infectiousness remains difficult to establish,” said Sergei Pond, senior co-author of the study.

In addition to the findings about the early history of SARS-CoV-2, Kumar’s group has developed intuitive mutational fingerprints to quickly identify the strains and sub-strains infecting an individual or colonizing a region of the world.

Putting a pandemic in order

To identify the progenitor genome, they used a mutational order analysis technique, which relies on a clonal analysis of mutant strains and on the frequency with which pairs of mutations appear together in SARS-CoV-2 genomes.
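
The idea of ordering mutations by how often pairs of them occur together can be illustrated with a toy sketch. This is not the team's actual method, only a simplified stand-in: it assumes each genome has already been reduced to the set of variant labels it carries (the labels `m1`, `a1` and `b1` are hypothetical), and it calls variant A earlier than variant B when nearly every genome carrying B also carries A, while A is also seen without B.

```python
from itertools import permutations


def infer_pairwise_order(genomes, threshold=0.95):
    """Return (earlier, later) variant pairs inferred from nested co-occurrence.

    genomes: list of sets of variant labels, one set per genome (hypothetical input).
    A is called earlier than B when at least `threshold` of the genomes that
    carry B also carry A, and A additionally occurs in genomes without B.
    """
    variants = sorted(set().union(*genomes))
    carriers = {v: sum(v in g for g in genomes) for v in variants}
    order = []
    for a, b in permutations(variants, 2):
        both = sum(a in g and b in g for g in genomes)
        if both / carriers[b] >= threshold and carriers[a] > both:
            order.append((a, b))
    return order


# Toy dataset: m1 arises first, a1 arises on the m1 background, b1 on top of both.
toy_genomes = [set(), {"m1"}, {"m1"}, {"m1", "a1"}, {"m1", "a1"}, {"m1", "a1", "b1"}]
print(infer_pairwise_order(toy_genomes))
```

The tolerance below 100% is a design choice in this sketch: it lets a few genomes with sequencing errors break perfect nesting without overturning the inferred order.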

First, Kumar’s team sifted through data on almost 30,000 complete genomes of SARS-CoV-2, the virus that causes COVID-19. In total, they analyzed 29,681 SARS-CoV-2 genomes, each containing at least 28,000 bases of sequence data. The genomes were sampled between 24 December 2019 and 7 July 2020 and represent 97 countries and regions around the world.

“Many previous attempts at analyzing such large datasets were not successful because of the focus on building an evolutionary tree of SARS-CoV-2,” Kumar says. “This coronavirus evolves too slowly, the number of genomes to analyze is too large, and the data quality of the genomes is highly variable. I immediately saw parallels between the properties of these genetic data from the coronavirus and the genetic data from the clonal spread of another nefarious disease, cancer.”

Kumar’s group has developed and investigated many techniques for analyzing genetic data from tumors in cancer patients. They adapted and extended those techniques and built a trail of mutations that automatically traced back to the progenitor. “Basically, the genome before the first mutation occurred was the progenitor,” Kumar said. “The mutation tracing approach works beautifully and predicts the ‘major strains’ phylogeny of SARS-CoV-2. It is a great example of how big data coupled with biologically informed data mining reveals important patterns.”

Progenitor genome

Kumar’s team inferred the sequence of the progenitor (mother) genome of all SARS-CoV-2 genomes (proCoV2). In the proCoV2 genome, they identified 170 non-synonymous substitutions (mutations that cause an amino acid change in a protein) and 958 synonymous substitutions compared to the genome of a closely related coronavirus, RaTG13, found in a Rhinolophus affinis bat. While the intermediate animal between bats and humans is still unknown, this amounts to a 96.12% sequence similarity between the proCoV2 and RaTG13 sequences.
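
The distinction between synonymous and non-synonymous substitutions can be made concrete with a short sketch. The code below is only an illustration and is not taken from the study: it builds the standard genetic code and classifies a single codon change by whether the encoded amino acid changes. The function name and the example codons are assumptions chosen for demonstration.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change as 'synonymous' or 'non-synonymous'.

    RNA codons written with U are accepted and converted to T.
    """
    ref_aa = CODON_TABLE[ref_codon.upper().replace("U", "T")]
    alt_aa = CODON_TABLE[alt_codon.upper().replace("U", "T")]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


# GAT (aspartate, D) to GGT (glycine, G) changes the amino acid.
print(classify_substitution("GAT", "GGT"))
# CTT and CTC both encode leucine, so the protein is unchanged.
print(classify_substitution("CTT", "CTC"))
```

A genome that differs from proCoV2 only by changes of the second kind would encode exactly the same proteins.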

Next, they identified 49 single nucleotide variants (SNVs) that occurred with a frequency greater than 1% in their dataset. These were investigated further to examine their mutational patterns and global spread.
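
A frequency cutoff of this kind can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes the genomes are already aligned to a reference of equal length, counts frequency against all genomes in the set, and ignores ambiguous bases. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def common_snvs(reference, genomes, min_freq=0.01):
    """Return {(position, base): frequency} for variants seen above min_freq.

    Assumes every genome is pre-aligned and the same length as `reference`.
    Positions are 1-based. Bases other than A, C, G, T (such as N or gaps)
    are treated as missing data and never counted as variants.
    """
    counts = Counter()
    for seq in genomes:
        for pos, (ref_base, base) in enumerate(zip(reference, seq), start=1):
            if base != ref_base and base in "ACGT":
                counts[(pos, base)] += 1
    total = len(genomes)
    return {snv: n / total for snv, n in counts.items() if n / total > min_freq}


# Toy data: 200 aligned sequences, two variant sites, one ambiguous base.
reference = "ACGTAC"
genomes = ["ACGTAC"] * 195 + ["ATGTAC"] * 3 + ["ACGAAC"] + ["ACGTAN"]
print(common_snvs(reference, genomes))
```

In the toy data, the variant at position 2 is carried by 3 of 200 genomes (1.5%) and passes the filter, while the variant at position 4 (0.5%) does not.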

“A mutational tree predicts a strain tree,” Kumar said. “You can also build the strain tree first and then predict the order of mutations. However, that approach is heavily affected by sequence quality. When the mutation rate is low, it becomes difficult to distinguish between errors due to low quality and actual mutations. The approach we took is much more robust against sequencing errors, because the analysis of pairs of positions across the genomes is more informative.”

An earlier timeline emerges

When the inferred proCoV2 sequence was compared with the genomes in their collection and no full match was found at the nucleotide level, Kumar’s research team knew that the assumed timeline for the start of the pandemic was off.

“This progenitor genome had a sequence different from the reference sequence, which was first observed in China and deposited in the GISAID SARS-CoV-2 database,” Kumar said.

The closest matches were genomes sampled 12 days after the earliest available virus sample, which dates from 24 December 2019. Multiple matches were found on all sampled continents, and were detected as late as April 2020 in Europe. Overall, 120 of the genomes that Kumar’s group analyzed contained only synonymous differences from proCoV2. That is, all of their proteins were identical in amino acid sequence to the corresponding proCoV2 proteins. A majority of these protein-level matches (80 genomes) were from coronaviruses sampled in China and other Asian countries.

These spatiotemporal patterns suggest that proCoV2 already possessed the full repertoire of protein sequences needed to infect, spread, and persist in the global human population.

They found that the proCoV2 virus and its early descendants arose in China, based on the earliest mutations of proCoV2 and their locations. Furthermore, they showed that a population of strains with as many as six mutational differences from proCoV2 existed in China at the time the first COVID-19 cases were detected. With SARS-CoV-2 estimated to mutate about 25 times per year, this means that the virus must already have been infecting people for several weeks before the December 2019 cases.
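
The reasoning from mutation counts to elapsed time can be checked with back-of-the-envelope arithmetic. The sketch below is not the authors' dating method: it assumes a strictly constant rate of 25 mutations per genome per year along a single lineage and a 52-week year, so it gives only a rough expected time for the most divergent strain.

```python
MUTATIONS_PER_YEAR = 25  # rate quoted in the article
WEEKS_PER_YEAR = 52      # simplifying assumption


def weeks_to_accumulate(n_mutations, rate_per_year=MUTATIONS_PER_YEAR):
    """Expected weeks for one lineage to accumulate n_mutations at a constant rate."""
    return n_mutations / rate_per_year * WEEKS_PER_YEAR


for n in (1, 3, 6):
    print(n, "mutation(s):", round(weeks_to_accumulate(n), 1), "weeks")
```

Under these assumptions, six differences correspond to roughly 12.5 weeks, and even a single difference to about two weeks, which is consistent with the virus circulating for several weeks before the first detected cases.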

Mutational signatures

Because there was evidence of many mutations that predate those found in the reference genome, Kumar’s group had to come up with a new nomenclature of mutational signatures to classify SARS-CoV-2, and accounted for them by introducing a series of Greek letter symbols, one to represent each.

For example, they found that the μ and α variants of SARS-CoV-2 genomes emerged before the first reports of COVID-19. This strongly suggests that some sequence diversity existed in the ancestral SARS-CoV-2 populations. All 17 genomes sampled in China in December 2019, including the designated SARS-CoV-2 reference genome, carry all three μ and three α variants. Interestingly, six genomes sampled in China and the United States in January 2020 carry μ variants but not α variants. Therefore, the earliest sampled genomes (including the designated reference) were not the progenitor strains.

It also predicts that the progenitor genome had offspring that were spreading around the world during the earliest phase of COVID-19. It was ready to infect from the very beginning.

“The progenitor had all the capability it needed to spread,” said Sergei Pond. “There is very little evidence of selection on the lineage between bats and humans, although there is strong selection on coronaviruses in bats.”

Hitchhiking mutations

Furthermore, they found surprising evidence that other mutations always accompanied the D614G spike protein mutation.

“A lot of people are interested in the mutation in the spike protein because of its functional properties,” Kumar said. “But what we are observing is that, in addition to the spike protein change, there are additional changes in the genome that always accompany the spike protein change (D614G). We call these the beta group of mutations, and the spike mutation is one of them. While the spike is changing, it’s best to remember that other changes may be involved as well. Alternatively, these changes may simply be hitchhiking; we cannot say yet.”

“Interestingly, there were many other mutations in the genomes containing the spike protein mutation. What we call the epsilon mutations occur on the background of the spike mutation, and they replace arginine residues in a very important protein, the nucleocapsid (N) protein. The epsilon mutations are widespread in Europe and are always seen with the spike protein mutation. Subsequently, the epsilon mutations began to dominate in both Europe and Asia.”

Global spread

Altogether, they identified seven major evolutionary lineages that arose after the onset of the pandemic, some of which arose in Europe and North America after the origin of the ancestral lineages in China.

“The Asian strains founded the whole pandemic,” Kumar said. “But over time, it is the sub-strain carrying the epsilon mutations, which may have arisen outside China (first seen in Central and Eastern Europe), that is infecting Asia much more.”

Their mutation-based analyses also established that North American coronaviruses carry very different genome signatures from those prevalent in Europe and Asia.

“This is a dynamic process,” Kumar said. “Clearly, there are very different pictures of the spread painted by the emergence of new mutations, the three epsilons, gamma, and delta, which we found to occur after the spike protein mutation. We need to find out whether any functional properties of these mutations have accelerated the pandemic.”

Next steps

Going forward, they will continue to refine their results as new data become available.

“More than 100,000 SARS-CoV-2 genomes have now been sequenced,” Pond said. “The strength of this approach is that the more data you have, the more easily you can determine the precise frequency of individual mutations and of pairs of mutations,” Kumar says. “The frequency and history of the variants that are produced, the single nucleotide variants or SNVs, can be told very well with more data. Therefore, our analyses infer a credible root for the SARS-CoV-2 phylogeny.”

Their results are updated automatically online as new genomes are reported (these now exceed 1,000,000 samples and can be found at http://igem.temple.edu/COVID-19).

“These findings and our intuitive mutational fingerprints of SARS-CoV-2 strains address the daunting challenge of understanding how, when and why COVID-19 emerged and spread, which is a prerequisite for creating remedies for this pandemic through the efforts of science, technology, public policy and medicine,” Kumar said.

Reference: “An evolutionary portrait of the progenitor SARS-CoV-2 and its dominant offshoots in COVID-19 pandemic” by Sudhir Kumar, Qiqing Tao, Steven Weaver, Maxwell Sanderford, Marcos A. Caraballo-Ortiz, Sudip Sharma, Sergei L. K. Pond and Sayaka Miura, 29 September 2020, bioRxiv.
DOI: 10.1101/2020.09.24.311845