From Short Read Sequences to an Assembled Genome: Is De Novo Assembly Possible?
One of the most significant advances in biology has been the ability to sequence the DNA of organisms. Even after the completion of the human genome, intractable regions of that genome remain unfinished. Next-generation, high-throughput short-read sequencing technologies are now available and can generate millions of short reads per run. Although these technologies allow greater coverage depth, de novo assembly from shorter reads is significantly more complex than resequencing, and handling the reads presents new computational problems and opportunities. Identifying repetitive regions, coping with sequencing errors, and manipulating millions of short reads simultaneously are some of the difficulties that must be overcome. Motivated by these complexities, and working with short reads from the Waksman SOLiD sequencing platform, I explore the problem of de novo assembly in two ways: theoretically and practically. For the theoretical exploration, simulations examine the interactions and influences of key elements of assembly. The parameters and trends learned from these simulations are then compared with the results of actual assembly attempts using some of the currently available de novo assemblers (Velvet and SOPRA). For the practical exploration, a pipeline has been developed to facilitate de novo assembly of complex genomes. This pipeline manipulates the short-read and mate-pair outputs of the SOLiD sequencers with the aim of optimizing assembly of the genome of interest.