Assessing Models That Control for Phylogeny for Inferring Genetic Effect in Bacterial Datasets

Joy Yang, Massachusetts Institute of Technology

The movement toward open biological data has produced tens of thousands of publicly available bacterial genomes, an amazing (but terrifying) resource. Biologists have long regressed phenotypes against genotypes, as in genome-wide association studies, but pooled bacterial genomes have a more complex structure.

The sampling may be nonrandom, as different sequencing centers may be specifically interested in different bacteria. In addition, during a pathogenic outbreak, isolates of the same strain may be sequenced intensively from many patients so that small changes in the bacterial genome can be used to track the spread of disease. And on a broader scale, genomic data are correlated through shared evolutionary history.

Statistical inference that does not account for phylogenetic relationships often yields spurious correlations and confidence intervals that are too narrow. Over the years, a few techniques for tackling this problem have emerged, such as phylogenetic versions of generalized least squares (GLS) and generalized linear mixed models (GLMM).

We will examine the effect of these techniques on one particular data set, where the “phenotype” in question is known to be associated with certain “genotypes.” In particular, <em>Clostridium difficile</em> toxigenicity is known to be associated with various Tcd genes on a pathogenicity locus as well as the binary toxin gene. Various other genes (mostly associated with phage) also have a “significant effect” on the toxigenicity of the bacteria, but may not, in reality, affect toxigenicity.

Here we compare the confidence intervals for the effect size obtained with ordinary least squares (OLS) against those obtained with the techniques mentioned above, specifically asking: Does the model in question allow genes known to be associated with toxigenicity to have a positive effect? And by how much does the confidence interval shift for the effect of genes that may be confounded by phylogeny?
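
As a rough illustration of this kind of comparison, the sketch below fits OLS and a GLS model to a toy data set and reports approximate 95% confidence intervals for a gene's effect. Everything in it is an assumption made for illustration, not the analysis described here: the two-clade covariance matrix `C`, the gene presence pattern, the simulated phenotype, and the helper names `ols_fit` and `gls_fit` are all hypothetical, and the intervals use a normal approximation.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: return coefficients and their standard errors."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se

def gls_fit(X, y, C):
    """GLS assuming error covariance proportional to C.

    Whitens X and y with the Cholesky factor of C, then applies OLS.
    """
    L = np.linalg.cholesky(C)                 # C = L @ L.T
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    return ols_fit(Xw, yw)

# Hypothetical toy phylogeny: two clades of 10 isolates each.
# Isolates in the same clade share covariance 0.8; clades are independent.
k = 10
n = 2 * k
block = 0.2 * np.eye(k) + 0.8 * np.ones((k, k))
C = np.block([[block, np.zeros((k, k))],
              [np.zeros((k, k)), block]])

# Hypothetical gene that tracks clade membership almost perfectly,
# so its apparent effect is confounded with phylogeny.
gene = np.array([1.0] * 9 + [0.0] + [0.0] * 9 + [1.0])
X = np.column_stack([np.ones(n), gene])

# Phenotype with NO true gene effect, only phylogenetically correlated noise.
rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), C)

for name, (beta, se) in {"OLS": ols_fit(X, y),
                         "GLS": gls_fit(X, y, C)}.items():
    lo, hi = beta[1] - 1.96 * se[1], beta[1] + 1.96 * se[1]
    print(f"{name}: gene effect = {beta[1]:.3f}, approx. 95% CI = [{lo:.3f}, {hi:.3f}]")
```

When `C` is the identity matrix, `gls_fit` reduces to `ols_fit`, which is a convenient sanity check. In a real analysis, `C` would instead be derived from the estimated phylogeny (for example, from shared branch lengths under an assumed evolutionary model).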

Abstract Author(s): Joy Yang, Martin Polz, Eric Alm