Grouping Proteins According to Structural Features

Bonnie Kirkpatrick, University of California, Berkeley

The Protein Data Bank (PDB) contains more than 32,000 experimentally solved protein structures. The Structural Classification of Proteins (SCOP) database, a manual classification of these structures, cannot keep pace with the rapid growth of the PDB. We will provide an automatic classification of proteins that reflects the manual classification. We use structurally derived features to cluster related proteins into groups. Each cluster is characterized by its fingerprint, the maximal set of features shared by its members.

Given a group of proteins and a target protein, the LGA algorithm[1] creates a structural alignment of each protein to the target. We use a Gaussian mixture model to cluster the proteins according to the structural regions they share with the target. Taking each protein in turn as the target yields an ensemble of clusterings, that is, multiple partitions of the same set of proteins. Discrepancies among the partitions are resolved by grouping together proteins that clustered together in many of them.
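
The step of resolving discrepancies across partitions can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how agreement is measured, so the sketch assumes a simple co-association rule in which two proteins are merged when they share a cluster in at least a chosen fraction of the partitions. The function name `consensus_groups`, the parameter `min_fraction`, and the protein identifiers are all hypothetical.

```python
from itertools import combinations


def consensus_groups(partitions, min_fraction=0.5):
    """Merge proteins that co-cluster in at least `min_fraction` of the partitions.

    `partitions` is a list of dicts, one per target protein, each mapping a
    protein identifier to the cluster label it received in that partition.
    The co-association rule is an assumption made for illustration only.
    """
    proteins = sorted(partitions[0])
    parent = {p: p for p in proteins}  # union-find forest

    def find(p):
        # Follow parent links to the root, compressing the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(proteins, 2):
        together = sum(1 for part in partitions if part[a] == part[b])
        if together / len(partitions) >= min_fraction:
            parent[find(a)] = find(b)  # link the two groups

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())


# Three partitions of four hypothetical proteins; labels are only
# meaningful within a single partition.
example = [
    {"p1": 0, "p2": 0, "p3": 1, "p4": 1},
    {"p1": 0, "p2": 0, "p3": 0, "p4": 1},
    {"p1": 1, "p2": 1, "p3": 2, "p4": 2},
]
print(consensus_groups(example))
```

Because merging is transitive here, a low `min_fraction` can chain loosely related proteins into one group; a stricter threshold or a different linkage rule may be preferable in practice.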

The test data consists of PDB structures with resolutions ranging from 0.54Å (X-ray structures) to greater than 15Å (electron microscopy). Despite this noise, our clustering method is robust, detecting relationships at the level of the SCOP superfamily with 88% accuracy and a low false-positive rate.

Future work involves predicting the family and superfamily to which a new structure belongs. Comparing a new structure with a cluster's structural fingerprint determines whether the structure belongs to that cluster. Initial results indicate that a fingerprint derived directly from a SCOP family can predict membership with almost complete accuracy.
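
As a rough illustration of fingerprint-based membership, the sketch below assumes that each protein's structural features can be represented as a set of labels, that a fingerprint is the intersection of its members' feature sets, and that membership is decided by how much of the fingerprint a new structure covers. The abstract specifies none of these details; the function names, the `min_coverage` parameter, and the feature labels are hypothetical.

```python
def fingerprint(feature_sets):
    """Return the features shared by every member of a cluster."""
    sets = [set(s) for s in feature_sets]
    return set.intersection(*sets) if sets else set()


def matches(fp, features, min_coverage=1.0):
    """Check whether a structure covers enough of a cluster fingerprint.

    `min_coverage` is the required fraction of fingerprint features that
    must be present; an empty fingerprint never matches.
    """
    if not fp:
        return False
    return len(fp & set(features)) / len(fp) >= min_coverage


# Hypothetical feature labels for two members of one cluster.
family = [{"helix_A", "sheet_B", "loop_C"}, {"helix_A", "sheet_B", "loop_D"}]
fp = fingerprint(family)
print(matches(fp, {"helix_A", "sheet_B", "loop_E"}))
print(matches(fp, {"helix_A"}))
```

Requiring full coverage is the strictest choice; lowering `min_coverage` would tolerate features missed because of low-resolution data, at the cost of more false positives.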

Abstract Author(s): Kirkpatrick, B., Zhou, C., and Zemla, A.