NoFold: A Novel Method for Finding RNA Structure Motifs Without Folding or Alignment

Sarah Middleton, University of Pennsylvania


RNA molecules often fold into complex secondary structures through self-basepairing. Interestingly, certain structures appear again and again across different RNAs and often confer a common function or regulatory signal on those RNAs. These modular, recurring structures are called motifs, and they are of great interest to biologists because of the insight they can give into the function and regulation of an otherwise uncharacterized RNA. However, identifying motifs computationally has been challenging due to the inaccuracy of in silico structure prediction and the slowness of pairwise alignment. Here I present a novel method for identifying structure motifs in large RNA data sets that does not require individual structure prediction or pairwise alignment of the RNA sequences. This approach is based on the idea of defining the distance between two objects in terms of their respective distances to a set of empirical examples or models, which in our case consists of 1,973 covariance models from the Rfam structure database. In this way, we can measure the structural similarity between two RNAs without actually predicting their structures or aligning them to each other. Using this as a basis, we developed an unsupervised clustering pipeline called NoFold to automatically identify and annotate structure motifs in large sequence databases. We demonstrate that NoFold can simultaneously identify multiple structure motifs with an average sensitivity of 0.80 and a precision of 0.98, and that it generally exceeds the performance of existing methods. We apply NoFold to identify motifs enriched in dendritically localized mRNAs in neurons, which would be good candidates for the elusive “zipcode” signals that mark these RNAs for transport. In total, we identified 213 enriched motifs, including both known and novel structures.
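
The distance idea described above can be illustrated with a minimal sketch. This is not the NoFold implementation: the scores below are made-up numbers standing in for how well each RNA matches each reference model, only four hypothetical models are used instead of the 1,973 Rfam covariance models, and the per-model normalization and Euclidean distance are assumptions chosen for illustration. The names `zscore_columns`, `euclidean`, and `scores` are likewise hypothetical.

```python
import math


def zscore_columns(score_matrix):
    """Rescale each model's scores (one column) to zero mean and unit variance."""
    n_rows = len(score_matrix)
    n_cols = len(score_matrix[0])
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in score_matrix]
        mean = sum(column) / n_rows
        variance = sum((x - mean) ** 2 for x in column) / n_rows
        std = math.sqrt(variance) or 1.0  # constant column: avoid dividing by zero
        for i in range(n_rows):
            normalized[i][j] = (score_matrix[i][j] - mean) / std
    return normalized


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical scores of three RNAs against four reference models (assumed data).
scores = {
    "rna_a": [12.0, -3.0, 0.5, 7.0],
    "rna_b": [11.5, -2.5, 0.0, 6.5],
    "rna_c": [-4.0, 9.0, 8.0, -1.0],
}

names = list(scores)
features = dict(zip(names, zscore_columns([scores[n] for n in names])))

# Two RNAs are compared only through their score profiles against the models,
# never by folding them or aligning them to each other.
d_ab = euclidean(features["rna_a"], features["rna_b"])
d_ac = euclidean(features["rna_a"], features["rna_c"])
print(f"d(a, b) = {d_ab:.3f}, d(a, c) = {d_ac:.3f}")
```

In this toy example, `rna_a` and `rna_b` have similar score profiles and so end up closer to each other than to `rna_c`. Feature vectors of this kind could then be passed to any standard clustering method to group RNAs that share a candidate motif.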

Abstract Author(s): Sarah A. Middleton, Junhyong Kim