De Novo Short Read Genome Sequencing

Ariella Sasson, Rutgers University


One of the most significant advances in biology has been the ability to sequence the DNA of organisms. While the technology to sequence genomes has been around for decades, large-scale sequencing has only recently been made possible by the development of modern computing techniques, such as shotgun sequence assemblers like ARACHNE. The Sanger method, long the dominant approach and considered the gold standard of DNA sequencing, remains costly even though it has been in use for over 20 years. Even with the various improvements in technique and automation that have been made, sequencing a large genome is still time-consuming and expensive. Even now that the Human Genome Project has been declared complete, problems remain with the current shotgun method: intractable regions, stretches of repetitive sequence in the chromosomes that leave gaps in the genome assembly, remain unsequenced. New whole-genome sequencing technologies are needed to reach the goal of the $1,000 genome. The next generation of sequencing technologies is now emerging, generating reads that are far cheaper but also far shorter (25 to 100 bp instead of 800 to 1000 bp), which presents new computational problems and opportunities. Although greater coverage depths thus become affordable (50-200x instead of 2-10x), de novo sequence assembly with these shorter reads is significantly more complex. The question is whether an accurate de novo assembly of a genome can be computed at acceptable computational cost. First, memory costs become an issue when dealing with so many reads; second, the short read length means that the assembler must cope with numerous ambiguous overlaps. In addition, the assembler must be able to correct sequencing errors and assemble reads that contain mismatches.
A few assemblers have been developed or modified to assemble short read sequences; however, each has its limitations, and while some have shown success on smaller bacterial artificial chromosomes (BACs), none has attempted larger genomes because of the computational expense. The goal of the project is to create a next-generation sequence assembly algorithm that optimally combines multiple types of sequence data, namely micro-read sequences (15-35 bp), mate-pair sequences, short read sequences (100 bp), Sanger sequences (1000 bp), and previously sequenced genomes of closely related organisms, in order to sequence a completely unknown genome. Combining these data types can overcome some of the disadvantages of any single methodology and potentially increase the speed and reduce the error of genome assembly.
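
The two computational concerns raised above, the sheer number of reads at high coverage and the ambiguous overlaps caused by short reads, can be illustrated with a minimal Python sketch. It is not the assembler described in this abstract. All function names, the toy genome, and the idealized error-free tiling of reads are assumptions made purely for illustration; coverage is taken to be total read bases divided by genome length.

```python
def reads_needed(genome_length_bp, read_length_bp, coverage):
    """Smallest read count N with N * read_length / genome_length >= coverage.

    All arguments are integers so the ceiling is computed exactly.
    """
    total_bases = coverage * genome_length_bp
    return (total_bases + read_length_bp - 1) // read_length_bp  # integer ceiling


def tile_reads(genome, read_length):
    """Idealized, error-free reads: one read starting at every genome position."""
    return [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]


def overlap_successors(reads, min_overlap):
    """Map each distinct read to the other reads whose prefix equals
    its last min_overlap bases (its candidate right-hand neighbours)."""
    distinct = sorted(set(reads))
    by_prefix = {}
    for read in distinct:
        by_prefix.setdefault(read[:min_overlap], []).append(read)
    return {
        read: [s for s in by_prefix.get(read[-min_overlap:], []) if s != read]
        for read in distinct
    }


def count_ambiguous(reads, min_overlap):
    """Number of distinct reads with more than one candidate successor."""
    successors = overlap_successors(reads, min_overlap)
    return sum(1 for candidates in successors.values() if len(candidates) > 1)


# Hypothetical toy genome: the 7 bp motif GATTACA occurs twice,
# each time flanked by different sequence.
TOY_GENOME = "ACGTTGCA" + "GATTACA" + "CCGGAATT" + "GATTACA" + "TTAGGCAT"

if __name__ == "__main__":
    # Assumed round figure of 3 Gbp for a large genome.
    print("Sanger-like, 800 bp at 10x:", reads_needed(3_000_000_000, 800, 10))
    print("Short reads, 35 bp at 50x: ", reads_needed(3_000_000_000, 35, 50))

    # Reads shorter than the repeat cannot be placed unambiguously;
    # reads that span the repeat can.
    print("Ambiguous 6 bp reads: ", count_ambiguous(tile_reads(TOY_GENOME, 6), 5))
    print("Ambiguous 12 bp reads:", count_ambiguous(tile_reads(TOY_GENOME, 12), 11))
```

Under the assumed 3 Gbp genome, 10x coverage with 800 bp reads requires 37.5 million reads, whereas 50x coverage with 35 bp reads requires roughly 4.3 billion, which is where the memory concern comes from. In the toy genome, the 6 bp read ATTACA ends inside the repeat and has two candidate successors, while 12 bp reads span the whole repeat and leave no ambiguity. This is one way to see why mixing short reads with longer Sanger reads or mate pairs may help.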

Abstract Author(s): Ariella Sasson, Todd P. Michael, & Anirvan Sengupta