Machine Learning-Driven Multiscale Modelling of Genomic Evolution of SARS-CoV-2

Danilo Perez Jr, New York University


Public health genomic surveillance efforts have produced hundreds of thousands of SARS-CoV-2 genomes to better understand the virus at the root of the ongoing COVID-19 pandemic. One of the key questions that emerges from these large-scale sequence collections is whether some mutations are inherently more pathogenic (i.e., increase transmissibility) and whether we can predict emerging pathogenic mutants that warrant attention. In addition, it would be extremely useful to address questions about emerging resistance mutations (to drug treatments) and epitope selection (for vaccine design). Furthermore, though there is a wealth of genomic data, its coverage is heterogeneously distributed, with many countries unable to access resources for viral sequencing, raising the question of whether the next variant of concern may emerge where we simply are not looking. The emergence of highly successful machine learning techniques, particularly transformer architectures, presents an interesting opportunity to answer these questions from such large collections. We focus on the well-known S-gene, which encodes the infamous surface glycoprotein or “spike”, a hotspot for many of the mutations that have produced ever more infectious variants of concern, and define a large language model (LLM) of this gene’s evolution. By proposing novel genomic/proteomic sequences, we expect the LLM to facilitate the identification of phenotypically advantageous variations that may not yet have emerged or established dominance, offering crucial forewarning that can inform public policy stakeholders as soon as possible.

Abstract Author(s): Danilo Trinidad Perez-Rivera, Alex Brace, Max Zvyagin, Arvind Ramanathan PhD