Understanding how variation across genomes shapes the properties of biomolecules, cells and organisms is a foundational question in biomedicine and biotechnology. Although evolution provides many examples of any given gene, organizing this genetic variation and applying the resulting knowledge to heterogeneous problems in biology remain challenging.
Generative models are useful tools for making sense of complex, high-dimensional data. One class of these models, pairwise undirected graphical models, has been very successful in solving the three-dimensional structures of proteins and RNA, as well as in predicting the effects of mutations in an unsupervised manner. However, we would like to increase the power of these generative models so that they capture higher-order interactions while remaining computationally tractable.
We fit a deep, directed latent variable model – a variational autoencoder – to natural sequence variation as a global probability model for gene families. We find we can approximately double the improvements in mutational effect prediction over shallow pairwise models in a purely unsupervised manner. We propose extending these methods for diverse biomedical and engineering applications.
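For orientation, a variational autoencoder is usually fit by maximizing a lower bound on the log-likelihood of each observed sequence. The expression below is the standard textbook form of that bound, given as a sketch rather than as the exact objective used in this work; the symbols x (an aligned sequence), z (the latent variables), θ (decoder parameters) and φ (encoder parameters) are notation introduced here for illustration. The second line shows one common way, assumed here rather than stated above, to score a mutant sequence against the wild type with such a probability model.

```latex
\begin{align}
\log p_\theta(x) &\;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \\
\Delta(x^{\mathrm{mut}}) &\;=\; \log \frac{p_\theta(x^{\mathrm{mut}})}{p_\theta(x^{\mathrm{wt}})}
\end{align}
```

Because the exact log-likelihood is intractable for a model of this kind, the log-ratio in the second line would in practice be approximated, for example by substituting the lower bound from the first line for each term.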
