Researchers at the HSE Tikhonov Moscow Institute of Electronics and Mathematics (MIEM), in cooperation with colleagues from the University of California, Santa Cruz (UCSC), and the European Bioinformatics Institute (EMBL-EBI), have developed software to model the spread of the COVID-19 global pandemic. Called the Viral Genealogy Simulator (VGsim), it is the fastest simulator of viral genealogies in the world. For more details about this scalable simulator, read the preprint on medRxiv. The code is freely available on GitHub.
Around 180 million coronavirus cases have been reported globally to date, and the number of SARS-CoV-2 genomes in the GISAID open database exceeds 2 million. Methods to analyze this data are being actively developed. Each method should be carefully verified for its sensitivity to assumptions about population homogeneity, to the random effects that frequently occur in reality, and so on. The Viral Genealogy Simulator is designed to verify methods that analyze the genetic data of the coronavirus.
‘Since we are unable to simulate the conditions of the pandemic in a lab, computer modelling appears to be the only way to carry out such verification. Various scenarios are run multiple times to prove the reliability of a method and its sensitivity to the approximations that mathematical models make compared to the real world. It is much the same as in physics,’ says Vladimir Shchur, Head of the International Laboratory of Statistical and Computational Genomics, HSE University.
Genetic sequences of viruses can be used to track the course of their spread by building their genealogy and the tree of infections. These trees are a rich source of information, including information about the evolution of pathogens. A simulator that can generate such gigantic trees helps researchers check that their methods work correctly.
The simulator can generate trees with tens or even hundreds of millions of genome samples for the world’s population. ‘Neutral’ (evolutionarily irrelevant) mutations can then be added to these trees using other software developed by one of the authors. Together, the two programs generate synthetic viral genomes that link the genealogy to the pandemic dynamics.
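As a rough illustration of how neutral mutations might be placed on a simulated tree, the Python sketch below gives each branch a Poisson-distributed number of mutations proportional to its length. This is a generic textbook approach and not necessarily the algorithm used by the software mentioned above. The tree encoding (`parent`, `branch_length`), the function names and the parameter values are hypothetical and chosen only for this example.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def add_neutral_mutations(parent, branch_length, rate, genome_length, seed=0):
    """Place neutral mutations on every branch of a tree.

    parent: dict mapping each non-root node to its parent (assumed encoding).
    branch_length: dict mapping each non-root node to the length of the
        branch above it.
    rate: expected number of mutations per unit of branch length.
    Returns a dict: node -> list of mutated genome positions on its branch.
    """
    rng = random.Random(seed)
    mutations = {}
    for node in parent:
        n = poisson(rng, rate * branch_length[node])
        mutations[node] = [rng.randrange(genome_length) for _ in range(n)]
    return mutations


def mutations_from_root(node, parent, mutations):
    """Collect the mutations on the path from the root down to `node`."""
    path = []
    while node in parent:
        path.append(node)
        node = parent[node]
    collected = []
    for branch in reversed(path):  # branches closest to the root come first
        collected.extend(mutations[branch])
    return collected


# Toy tree: root -> n0 -> {B, D} and root -> C (illustrative values only).
parent = {"n0": "root", "C": "root", "B": "n0", "D": "n0"}
branch_length = {"n0": 1.0, "C": 3.0, "B": 2.0, "D": 2.0}
mutations = add_neutral_mutations(parent, branch_length, rate=0.5,
                                  genome_length=30000)
```

Calling `mutations_from_root("B", parent, mutations)` then lists every position at which sample B differs from the root sequence, which is all a synthetic genome needs under this simplified model.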
The simulator relies on the SIR model, in which the population is divided into three groups: Susceptible, Infectious, and Recovered. This model was proposed over 100 years ago. The researchers have modified it to account for various types of immunity and for population structure, that is, a number of regions or countries between which migration takes place.
The user can enter population-related data such as the names of countries or regions. The frequency of contacts is set for each country, region or population. This frequency can reflect population density or cultural factors such as mask wearing, which acts as an additional barrier to contact.
The researchers point out that the traditional SIR model does not take migration into account. The modifications needed to include it follow more contemporary models.
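To make the idea of an SIR model with per-population contact rates and migration more concrete, here is a minimal Python sketch. It is not the VGsim code. It assumes an event-by-event (Gillespie-style) simulation, treats migration as infectious visitors from one population adding to the infection pressure in another, and leaves out the immunity types mentioned above. All names and numbers are hypothetical.

```python
import random


def simulate_structured_sir(sizes, contact_rate, recovery_rate, migration,
                            infected, t_max, seed=0):
    """Event-by-event simulation of SIR in several linked populations.

    sizes[i]: number of people in population i (kept constant).
    contact_rate[i]: transmission rate inside population i; lower values
        can stand for sparse populations or barriers such as masks.
    recovery_rate: rate at which an infectious person recovers.
    migration[i][j]: strength of visits from population i to population j.
    infected[i]: initial number of infectious people in population i.
    Returns a list of (time, S, I, R) snapshots, one per event.
    """
    rng = random.Random(seed)
    k = len(sizes)
    S = [sizes[i] - infected[i] for i in range(k)]
    I = list(infected)
    R = [0] * k
    t = 0.0
    history = [(t, tuple(S), tuple(I), tuple(R))]
    while sum(I) > 0:
        events = []  # (rate, kind, population in which the event happens)
        for i in range(k):
            events.append((recovery_rate * I[i], "recover", i))
            # Infection pressure in i: local cases plus visitors from elsewhere.
            pressure = I[i] + sum(migration[j][i] * I[j]
                                  for j in range(k) if j != i)
            events.append((contact_rate[i] * S[i] * pressure / sizes[i],
                           "infect", i))
        events = [e for e in events if e[0] > 0]
        if not events:
            break
        total = sum(e[0] for e in events)
        t += rng.expovariate(total)  # waiting time until the next event
        if t > t_max:
            break
        # Pick the event with probability proportional to its rate.
        _, kind, i = rng.choices(events, weights=[e[0] for e in events])[0]
        if kind == "infect":
            S[i] -= 1
            I[i] += 1
        else:
            I[i] -= 1
            R[i] += 1
        history.append((t, tuple(S), tuple(I), tuple(R)))
    return history


history = simulate_structured_sir(
    sizes=[1000, 1000],
    contact_rate=[0.5, 0.25],  # the second population has fewer contacts
    recovery_rate=0.1,
    migration=[[0.0, 0.01], [0.01, 0.0]],
    infected=[5, 0],
    t_max=200.0,
)
```

With the migration matrix set to all zeros, the second population in this example never sees a case, which is exactly the limitation of the traditional model noted above.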
Having generated the dynamics of the infection spread, the researchers built a genealogical tree of samples. ‘This is a common approach in genomics: we follow the evolution not from the past to the present, but backwards in time. Based on the dynamics generated, we create a tree, building relationships between the virtual samples that appeared in this virtual pandemic,’ Vladimir Shchur explains.
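The backward-in-time idea can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions, not the simulator's actual procedure: each host is infected at most once, all samples are taken after the last recorded event, and diversity within a host is ignored. The log format `(time, infector, infectee)` and the function name are hypothetical.

```python
def build_genealogy(transmissions, sampled_hosts):
    """Build a tree of sampled lineages by walking a transmission log backward.

    transmissions: list of (time, infector, infectee) tuples (assumed format).
    sampled_hosts: hosts from whom a virus sample was taken; host names are
        assumed not to clash with the internal node names "n0", "n1", ...
    Returns (parent, node_time, roots): `parent` maps each node to its
    ancestor, `node_time` gives the time of each merge, and `roots` lists the
    lineages that remain once the log is exhausted.
    """
    lineage_in_host = {host: host for host in sampled_hosts}  # tree leaves
    parent, node_time = {}, {}
    # Latest events first: time runs backward.
    for time, infector, infectee in sorted(transmissions,
                                           key=lambda e: e[0], reverse=True):
        lineage = lineage_in_host.pop(infectee, None)
        if lineage is None:
            continue  # no sampled virus descends from this infection
        other = lineage_in_host.get(infector)
        if other is None:
            # The lineage simply moves back into the infector.
            lineage_in_host[infector] = lineage
        else:
            # Two sampled lineages meet in the infector: a common ancestor.
            node = "n%d" % len(node_time)
            parent[lineage] = node
            parent[other] = node
            node_time[node] = time
            lineage_in_host[infector] = node
    return parent, node_time, sorted(lineage_in_host.values())


# A infects B at time 1 and C at time 2; B infects D at time 3.
log = [(1.0, "A", "B"), (2.0, "A", "C"), (3.0, "B", "D")]
parent, node_time, roots = build_genealogy(log, sampled_hosts=["B", "C", "D"])
# parent == {"D": "n0", "B": "n0", "n0": "n1", "C": "n1"}
```

Only infections that lead to a sampled host are ever touched, so the cost depends on the number of samples and recorded events rather than on the full history of every lineage. This is one reason backward construction is attractive for very large simulated pandemics.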
The genealogy of the hypothetical samples in this virtual laboratory helps researchers understand where a strain originated and how it spread. The simulator can be used to verify the accuracy of the currently available methods for building such genealogical trees.