HaploPool: Accurately and Quickly Phasing Pooled DNA Samples

Bonnie Kirkpatrick, University of California, Berkeley

The genome, in the form of DNA, encodes most of the information for the development and function of living organisms. Over the last decade, large quantities of human genomic data have become available. These data specifically identify genetic variation between people. To find the portions of the genome responsible for certain traits, we would like to identify associations that relate genetic variation between people to differences in traits. In the long term, knowledge of these associations will lead to personalized medicine, including assessments of disease risk and drug response.

An association study often involves these steps: 1) design, 2) sample collection, 3) data collection, 4) phasing, 5) association, and 6) power analysis. The number of samples collected and the methods used for data collection largely determine the cost of the study. When performing a study with good statistical power, collecting data from the requisite number of samples may cost millions of dollars. We focus on improving the accuracy and data-efficiency of algorithms that phase data from pooled and unpooled DNA samples. We introduce a new algorithm, HaploPool, which can quickly and cost-effectively perform phasing on both types of data. After developing our algorithms, we examine the trade-offs, in terms of accuracy, of different data collection methods.
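
To illustrate why phasing is a nontrivial inference step, the following toy sketch enumerates every set of haplotypes consistent with an observed genotype. It is not the HaploPool algorithm. It is a brute-force illustration under assumed conventions: alleles are coded 0/1, a genotype is the per-SNP count of allele 1 summed over all haplotypes in the sample, and a pool of k individuals contributes 2k haplotypes. The function name `consistent_haplotype_sets` is hypothetical.

```python
from itertools import combinations_with_replacement, product


def consistent_haplotype_sets(genotype, num_haplotypes):
    """Enumerate multisets of 0/1 haplotypes whose per-SNP sums equal `genotype`.

    Assumed encoding (illustrative only): genotype[i] is the number of
    copies of allele 1 at SNP i across all haplotypes in the sample.
    """
    num_snps = len(genotype)
    all_haplotypes = list(product((0, 1), repeat=num_snps))
    target = tuple(genotype)
    solutions = []
    # Brute force: exponential in the number of SNPs, for illustration only.
    for combo in combinations_with_replacement(all_haplotypes, num_haplotypes):
        sums = tuple(sum(h[i] for h in combo) for i in range(num_snps))
        if sums == target:
            solutions.append(combo)
    return solutions


# One unpooled individual (2 haplotypes), heterozygous at two SNPs.
unpooled = consistent_haplotype_sets((1, 1), num_haplotypes=2)

# A pool of two individuals (4 haplotypes) observed at the same two SNPs.
pooled = consistent_haplotype_sets((2, 2), num_haplotypes=4)
```

In this toy setting, the unpooled genotype already admits more than one explanation, and the pooled genotype admits more still. A phasing method must choose among such explanations, typically by using information shared across many samples.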

We compared our method to four state-of-the-art programs for phasing. HaploPool is consistently more efficient, with a speed improvement of 86% over the next fastest method, and is at least as accurate as previous methods for pooled data. We also demonstrated that one can obtain equal accuracy at lower cost by using pooled genotype data.

Abstract Author(s): Bonnie Kirkpatrick, Carlos Santos Armendariz, Richard M. Karp, and Eran Halperin