FASTERp: A Feature Array Search Tool for Estimating Resemblance of Protein Sequences

Presenter:

Derek Macklin

University:

Stanford University

Program:

CSGF

Year:

2014

Metagenome sequencing efforts have produced a pool of billions of genes that can be mined for enzymes with desirable biochemical traits. However, homology search against a rapidly growing database of this size is becoming computationally impractical. Here we present our pilot efforts to develop a novel alignment-free algorithm for homology search. Specifically, we represent each protein as a feature vector that denotes the presence or absence of short k-mers in the protein sequence. Similarity between feature vectors is then computed using the Tanimoto score, a similarity coefficient that can be computed rapidly on bit-string representations of the feature vectors. Preliminary results indicate good correlation with optimal alignment algorithms (Spearman r of 0.87, about 1 million proteins from Pfam), as well as with heuristic algorithms such as BLAST (Spearman r of 0.86, about 1 million proteins). Furthermore, a prototype of FASTERp implemented in Python runs approximately four times faster than BLAST on a small-scale data set (about 1,000 proteins). We are now optimizing and scaling FASTERp to enable rapid homology searches against billion-protein databases, and thereby more comprehensive gene annotation efforts.
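The feature-vector idea in the abstract can be illustrated with a minimal Python sketch. This is not the FASTERp implementation: the k-mer length (k = 2 here), the 20-letter amino acid alphabet, the use of Python integers as bit strings, and the function names kmer_index, feature_bits, and tanimoto are all assumptions chosen for illustration. The sketch sets one bit per k-mer that occurs in a sequence, then computes the Tanimoto score as the number of shared set bits divided by the number of bits set in either vector.

```python
from itertools import product

# Assumption: the 20 standard amino acids; the abstract does not state the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_index(k):
    """Map every possible k-mer over the alphabet to a bit position."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}


def feature_bits(seq, k, index):
    """Return an int whose set bits mark which k-mers are present in seq."""
    bits = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip k-mers containing non-standard residues
            bits |= 1 << pos
    return bits


def tanimoto(a, b):
    """Tanimoto score of two bit vectors: |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty vectors
    return bin(a & b).count("1") / union


if __name__ == "__main__":
    k = 2  # hypothetical k-mer length, not taken from the abstract
    index = kmer_index(k)
    query = feature_bits("ACDE", k, index)    # k-mers: AC, CD, DE
    subject = feature_bits("ACDF", k, index)  # k-mers: AC, CD, DF
    print(tanimoto(query, subject))           # 2 shared / 4 total = 0.5
```

Because presence or absence is stored as bits, the score reduces to a bitwise AND, a bitwise OR, and two population counts, which is what makes it cheap to evaluate across many sequence pairs.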

Program Review:

2014 Annual Program Review
