New on ASCR Discovery: Assembling a Flood

A cascade of genetic information is drowning science. Sequencing machines decode plant and animal DNA so quickly that computers can’t analyze the resulting data fast enough. That is especially true for genome assembly, the final step in compiling the strings of letters that represent every creature’s genetic instructions.

Now researchers from Lawrence Berkeley National Laboratory and the University of California, Berkeley, have developed an assembly program so efficient they believe it could handle output from all the world’s sequencers on just part of one supercomputer.

The team will report on its new program in a paper presentation Nov. 17 at the SC15 supercomputing conference in Austin, Texas.

The efficient code will help geneticists advance science in a range of areas, from finding bacterial genes that produce drugs, biofuels and more, to identifying mutations behind cancer and other diseases.

Gene sequencing analyzes an organism’s DNA, which is composed of four chemical bases represented by the letters A, C, G and T. The letters are instructions for cells to produce proteins, life’s workhorse molecules.

To rapidly sequence the billions of bases in a single organism’s genes, scientists break copies of a DNA strand into short pieces, or reads. Machines distribute the reads to multiple sequencers that decode them into the bases. Today’s sequencers can decode a genome in hours. To put the reads back in the correct order, high-performance computing (HPC) systems match overlaps in their letter patterns. That assembly step can take from a day to a week, depending on the computer’s speed and the genome’s size.
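To make the idea of matching overlaps concrete, here is a minimal Python sketch of a greedy overlap merge. It is a toy illustration only, not the method used by any production assembler: the function names, the `min_len` threshold and the example reads are all assumptions made for this example, and the all-pairs comparison would be far too slow for real sequencing data.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` letters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, -1, -1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            # No remaining pair overlaps by at least min_len letters.
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads, made up for this example.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
assembled = greedy_assemble(reads)
print(assembled)
```

Because every sequence is compared with every other on each pass, the work grows very quickly with the number of reads, which hints at why assembling billions of bases is a job for HPC systems.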

Researchers at the DOE Joint Genome Institute (JGI) at Berkeley Lab developed their own assembly pipeline program, called Meraculous (taking its name from k-mer, a DNA sequence containing a set number, k, of bases). Meraculous produces high-quality results but can take up to two days to assemble a 3-billion-base human genome, says Evangelos Georganas, a Berkeley Lab researcher and UC Berkeley graduate student in electrical engineering and computer science. Assembling bigger genomes, like those of the wheat plant or pine tree, can take as much as a week.
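For readers unfamiliar with the term, a k-mer is simply every run of k consecutive bases in a sequence. The short Python sketch below lists and counts them. The function names and example reads are hypothetical, and the sketch says nothing about how Meraculous itself processes k-mers, which this article does not describe.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return every substring of length k in `seq`, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count how often each k-mer appears across a collection of reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Hypothetical reads, made up for this example.
example_counts = count_kmers(["ATGGC", "TGGCA"], k=3)
print(example_counts)
```

A k-mer that appears in more than one read, such as TGG here, is a sign that those reads may cover the same stretch of DNA.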

But by rewriting Meraculous, Georganas and a team from JGI, UC Berkeley and Berkeley Lab can now assemble a human genome in as little as 8.4 minutes. And that’s using just part of Edison, a Cray XC30 supercomputer at the lab’s National Energy Research Scientific Computing Center (NERSC).

But assembling the human genome in less than 10 minutes isn’t what’s important, Georganas says. “Being able to provide new ways to do science” is the real achievement. Assembling multiple genomes in an hour and experimenting with varying parameters lets geneticists easily test different assumptions. “That, at the end of the day, would make them better understand the results they get.”

Read more about HipMer (high-performance Meraculous) at ASCR Discovery, a website highlighting research supported by the Department of Energy’s Advanced Scientific Computing Research program.

Image caption: DNA. Illustration courtesy of the National Human Genome Research Institute.