Solving 1,000 Puzzles with 8 Billion Small Pieces: Assembling Draft Genomes from the Cow Rumen Metagenome

Aleah Caulin, University of Pennsylvania

Next-generation sequencing technologies, applied to metagenomics and single-cell genomics, provide two culture-independent ways of investigating the genetic and functional potential of complex microbial communities. At DOE’s Joint Genome Institute, more than 1 terabase of sequence data has been generated from a microbial community adherent to switchgrass in the cow rumen. Assembling individual genomes from these data, which comprise 8 billion short reads drawn from a mixture of more than 1,000 species, has proven difficult. One of the biggest challenges is that no suitable reference genomes exist for assessing assembly quality. Here we propose to integrate single-cell genomics and metagenomics to assemble a set of high-quality reference genomes de novo. We developed a three-step strategy: assemble each single amplified genome (SAG) into contigs, bin the cells that represent the same species, and use these bins to recruit sequence reads from the metagenome. The recruited reads extend the initial contigs to create a more complete draft genome for each species represented by the isolated cells. The metagenome reads that are not recruited by the SAGs are filtered by abundance and assembled into contigs. A hierarchical clustering algorithm is then applied to bin these contigs based on sequence features such as tetranucleotide frequency. Generating these draft genomes is the first step toward assembling many novel, uncultured organisms from complex communities, and the resulting genomes will serve as a foundation for the comprehensive study of community composition, structure, and dynamics.

Abstract Author(s): Aleah Caulin, Rob Egan, Jeff Froula, Zhong Wang
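
The final binning step described in the abstract, hierarchical clustering of contigs by tetranucleotide frequency, can be illustrated with a minimal sketch. The abstract does not specify the distance measure, linkage rule, or stopping criterion, so the Euclidean distance, average linkage, and distance threshold used here are assumptions, and the function names (tetra_freq, cluster_contigs) and toy contigs are hypothetical. The sketch is meant for a handful of short sequences, not for metagenome-scale data.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_freq(seq):
    """Return the normalized 256-dimensional tetranucleotide frequency vector."""
    counts = [0] * 256
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows with non-ACGT symbols are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def distance(u, v):
    """Euclidean distance between two frequency vectors (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_contigs(contigs, threshold):
    """Average-linkage agglomerative clustering of contigs.

    contigs: dict mapping contig name to sequence.
    threshold: merging stops once the two closest bins are farther apart
               than this value (an assumed stopping rule).
    Returns a list of bins, each a list of contig names.
    """
    vectors = {name: tetra_freq(seq) for name, seq in contigs.items()}
    bins = [[name] for name in contigs]

    def linkage(a, b):
        # Mean pairwise distance between members of the two bins.
        total = sum(distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(bins) > 1:
        d, i, j = min(
            (linkage(bins[i], bins[j]), i, j)
            for i in range(len(bins))
            for j in range(i + 1, len(bins))
        )
        if d > threshold:
            break
        bins[i] = bins[i] + bins[j]  # merge the closest pair of bins
        del bins[j]
    return bins


if __name__ == "__main__":
    # Hypothetical toy contigs; real contigs would be far longer.
    toy = {
        "contig_a": "ACGT" * 50,
        "contig_b": "ACGT" * 40 + "ACGA",
        "contig_c": "GGCC" * 50,
    }
    print(cluster_contigs(toy, threshold=0.2))
```

With these toy inputs, the two contigs built from the same repeat fall into one bin and the third stays separate. On real data the threshold would need tuning, and short contigs give noisy frequency vectors, so a minimum contig length is commonly imposed before binning.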