Machine Learning in Computational Genomics: Building Structural Protein Phylogenies

Sarah Moussa, University of California, Berkeley


Machine learning methods such as kernel feature selection and dimensionality reduction can be applied to the problem of constructing phylogenetic trees for protein evolution. Sequence-based methods have proven successful at detecting evolutionary relationships between closely related proteins; however, they may break down when proteins are distant relatives. In such cases, the sequence similarity may be too low for sequence alignment techniques to detect, and the evolutionary relationship may be missed. Because protein structure is more conserved than sequence (remote homologs are more likely to share structural characteristics than to have appreciable sequence similarity), our approach builds learning methods on structural features of proteins in addition to sequence information. Recent advances in semidefinite optimization for kernel learning make it possible to learn an optimal kernel from the data as a convex combination of many simpler kernels derived from various similarity metrics. Details of our kernel construction and subsequent classification results will be presented.
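
To make the idea of a kernel formed as a convex combination of simpler kernels concrete, the following is a minimal sketch and not the method described in this abstract. It assumes two hypothetical feature sets (random stand-ins for structural and sequence descriptors), two illustrative base kernels (RBF and linear), and fixed hand-picked weights; in the approach outlined above the weights would instead be learned from data by semidefinite optimization, which is not shown here. All function and variable names are illustrative.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))


def linear_kernel(X):
    """Linear kernel matrix (Gram matrix) for the rows of X."""
    return X @ X.T


def combine_kernels(kernels, weights):
    """Convex combination: weights are non-negative and sum to 1."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(wi * K for wi, K in zip(w, kernels))


# Hypothetical data: 6 proteins, with made-up feature vectors.
rng = np.random.default_rng(0)
structure_features = rng.normal(size=(6, 4))  # assumed structural descriptors
sequence_features = rng.normal(size=(6, 3))   # assumed sequence descriptors

K_struct = rbf_kernel(structure_features, gamma=0.5)
K_seq = linear_kernel(sequence_features)

# Fixed illustrative weights; a learned combination would replace these.
K = combine_kernels([K_struct, K_seq], [0.7, 0.3])

# A convex combination of positive semidefinite kernels stays
# positive semidefinite, so the smallest eigenvalue should not be
# meaningfully negative.
min_eig = np.linalg.eigvalsh(K).min()
```

Because each base kernel matrix is positive semidefinite and the weights are non-negative, the combined matrix `K` is itself a valid kernel and could be passed to any kernel-based classifier.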

Abstract Author(s): Sarah Moussa