Predicting Phage-host Interactions Using Alternating Minimization

Joy Yang, Massachusetts Institute of Technology

Work on phage-host interactions over the last century has yielded important insights into the central dogma and has led to impactful technologies such as restriction enzymes, Hfr conjugation, transposon mutagenesis and CRISPR-Cas. Still, much remains to be discovered about the genetic features that enable a phage to infect a particular host. Analysis of phage genomes has revealed that the number of phage orthologous groups continues to grow with each newly sequenced phage, with no sign of saturation. On average, approximately 70 percent of the open reading frames in a phage genome are unannotated hypothetical proteins.

Sampling of closely related bacterial strains with differing phage infection profiles can elucidate infection mechanisms. The Polz lab maintains the Nahant Collection, a rich dataset of 243 Vibrio strains that have been challenged by 241 unique phages, all with sequenced genomes. This is the largest phylogenetically resolved host range cross test available to date. Gleaning mechanistic insights from this data is a complex statistical problem, as infection specificities involve interacting proteins between organisms.

With approximately 1,000 phage protein clusters and 10,000 bacterial protein clusters, there are 10,000,000 possible pairwise interaction terms for approximately 58,000 observations. With centered and scaled predictors, which make the matrix dense, storing the full observation-by-interaction matrix would take approximately 4.7 terabytes. We propose using alternating minimization to exploit these interaction terms without ever forming this matrix explicitly. This reduces the memory requirement to around 14 GB. While we tackle a specific biological dataset here, the class of problems involving inference on bipartite graphs is very general.
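The idea can be illustrated with a minimal sketch. The abstract does not specify the model or loss, so everything below is an assumption made for illustration: a rank-1 bilinear model, y_ij ≈ (p_i · u)(h_j · v), fitted under squared loss, where p_i and h_j are the protein-cluster feature vectors of phage i and host j. The function names (`dense_interaction_bytes`, `alternating_minimization`) are hypothetical. Fixing v turns the problem into ordinary least squares in u, and vice versa, so only the two per-organism feature matrices are ever held in memory.

```python
import numpy as np


def dense_interaction_bytes(n_obs, n_phage_feats, n_host_feats, bytes_per_value=8):
    """Bytes needed to store every observation x interaction term explicitly.

    Assumes one dense 8-byte float per entry (an assumption, not stated in the text).
    """
    return n_obs * n_phage_feats * n_host_feats * bytes_per_value


def alternating_minimization(Y, P, H, n_iters=20, seed=0):
    """Fit Y[i, j] ~ (P[i] @ u) * (H[j] @ v) by alternating least squares.

    Y: (n_phages, n_hosts) infection outcomes
    P: (n_phages, a) phage features; H: (n_hosts, b) host features
    The implied interaction weights are the rank-1 matrix outer(u, v),
    which is never materialized. Illustrative assumption, not the authors' model.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[1])
    u = np.zeros(P.shape[1])
    losses = []
    for _ in range(n_iters):
        # Step 1: hold v fixed. With z = H v, the best host-side-weighted
        # phage score vector is Y z / (z . z); regress it on P to get u.
        z = H @ v
        u = np.linalg.lstsq(P, Y @ z / max(z @ z, 1e-12), rcond=None)[0]

        # Step 2: hold u fixed and solve the symmetric problem for v.
        w = P @ u
        v = np.linalg.lstsq(H, Y.T @ w / max(w @ w, 1e-12), rcond=None)[0]

        # Track the squared-error objective after each full sweep.
        resid = Y - np.outer(P @ u, H @ v)
        losses.append(float(np.sum(resid ** 2)))
    return u, v, losses
```

Under these assumptions, `dense_interaction_bytes(243 * 241, 1_000, 10_000)` gives about 4.7 × 10^12 bytes, consistent with the figure quoted above if every host-phage pair is counted as one observation. Each half-step can only lower the squared-error objective or leave it unchanged, so the recorded losses should be non-increasing up to floating-point error. A higher-rank or regularized variant would follow the same alternating pattern.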

Abstract Author(s): Joy Yang, Libusha Kelly, Philippe Rigollet, Martin Polz