Real-time Genotype Compression in Forward-time Population Genetic Simulations

Troy Ruths, Rice University

Forward-time population genetic simulators are critical research tools in evolutionary biology, as demonstrated by both the growing number of available simulators and the collection of high-impact studies that employ them. These simulators allow for in-silico testing of evolutionary hypotheses that would otherwise be intractable in the laboratory setting. However, these simulations are potentially data-heavy: over simulated time, populations evolve novel genotypes through mutations and other biological processes. The representation of a genotype may range from small (a handful of SNPs) to very large (3 billion base pairs or an interaction network). For simulations with either large populations or large genotype data structures, the data footprint can quickly exceed the memory capacity of a single compute node.

In this work, we develop a novel and general method for addressing the memory issue inherent in forward-time simulations by compressing, in real time, active and ancestral genotypes. We propose an efficient algorithm called Greedy-Load, which can both keep the memory footprint below a threshold and be implemented in any current simulator. Because compression improves the performance of the memory hierarchy, compressed simulations ran faster in practice than non-compressed ones. We simulated both large (100 MB sequence) and complex (1,000 gene pathways) genotypes without causing a memory crash. We believe our algorithm provides a major enhancement to the scalability of population genetic simulators, making possible new levels of genotype complexity, parallelization, and usability.
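To make the idea of compressing active and ancestral genotypes under a memory threshold concrete, the following is a minimal sketch and not the Greedy-Load algorithm itself, whose details this abstract does not give. It assumes a genotype can be stored implicitly as a list of point mutations relative to its parent, and that at most `max_explicit` non-root genotypes hold a full sequence at any time. The names `Genotype`, `CompressedPool`, `mutate`, `decode`, and `load`, as well as the oldest-first eviction rule, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Mutation = Tuple[int, str]  # (position, new base); an assumed representation


@dataclass
class Genotype:
    parent: Optional["Genotype"]
    mutations: List[Mutation] = field(default_factory=list)
    sequence: Optional[str] = None  # full copy, present only when explicit


class CompressedPool:
    """Hypothetical store: genotypes are kept as deltas from their parent,
    and only a bounded number hold an explicit (uncompressed) sequence."""

    def __init__(self, root_sequence: str, max_explicit: int):
        self.max_explicit = max_explicit
        # The root is always explicit so every genotype can be decoded.
        self.root = Genotype(parent=None, sequence=root_sequence)
        self.explicit: List[Genotype] = []  # non-root explicit genotypes, oldest first

    def mutate(self, parent: Genotype, mutations: List[Mutation]) -> Genotype:
        # A new genotype costs only the size of its mutation list.
        return Genotype(parent=parent, mutations=list(mutations))

    def decode(self, g: Genotype) -> str:
        # Walk up to the nearest explicit ancestor, then replay mutations downward.
        chain = []
        node = g
        while node.sequence is None:
            chain.append(node)
            node = node.parent
        seq = list(node.sequence)
        for n in reversed(chain):
            for pos, base in n.mutations:
                seq[pos] = base
        return "".join(seq)

    def load(self, g: Genotype) -> str:
        # Return the full sequence, caching it if the threshold allows.
        if g.sequence is not None:
            return g.sequence
        seq = self.decode(g)
        if self.max_explicit > 0:
            g.sequence = seq
            self.explicit.append(g)
            # Enforce the threshold by dropping the oldest explicit copies.
            while len(self.explicit) > self.max_explicit:
                self.explicit.pop(0).sequence = None
        return seq


pool = CompressedPool("AAAAAAAA", max_explicit=2)
a = pool.mutate(pool.root, [(0, "C")])
b = pool.mutate(a, [(3, "G")])
c = pool.mutate(b, [(0, "T"), (7, "C")])
for g in (a, b, c):
    pool.load(g)
print(pool.decode(c), len(pool.explicit))
```

In this sketch the number of explicit sequences never exceeds `max_explicit`, so the footprint of full copies is bounded, while an evicted genotype remains recoverable by replaying mutations from its nearest explicit ancestor. A real simulator would also need a policy for choosing which genotypes to keep explicit; the oldest-first rule here is only a placeholder.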

Abstract Author(s): Troy Ruths and Luay Nakhleh