The Shape of RNA: can secondary structure alone predict RNA family?

Yan Karklin, Carnegie Mellon University

The development of automated tools for the detection and classification of functional RNAs is an important problem in computational biology. Machine learning methods, such as stochastic context-free grammars and neural networks, have been applied to the classification of primary RNA nucleotide sequences, but none of them takes full advantage of the information present in the secondary structure. Here, we present a discriminative learning approach that exploits the topology and structural characteristics of folded RNAs and achieves classification accuracy that compares well with previously reported results. We extended the recently introduced dual graph representation of RNA secondary structure; we then used kernel methods developed specifically for learning from labelled graphs to construct a Support Vector Machine (SVM) classifier that learns the topological characteristics specific to the secondary structures of different RNA families. The classifier discriminated RNAs in the Rfam database from random nucleotide sequences with high accuracy: 98% true positives (TP) and 98% true negatives (TN) for tRNAs, 91%/95% for RNase, 85%/91% for Intron_gpII, and 80%/85% for a collection of riboswitch families. We also trained a multi-class SVM on tRNA, miRNA, RNase, and 5S rRNA using the “one vs. others” framework (Q = 0.82). A similar multi-class SVM trained on five varieties of riboswitch RNAs showed consistent discrimination between the classes (Q = 0.60).

Although the performance of the algorithm depended on the accuracy of the predicted folding, the results are encouraging and suggest that this approach could significantly improve the automated discovery and characterization of functional RNAs.
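To make the idea of a kernel on labelled graphs concrete, the following is a minimal Python sketch of one well-known member of that family, a bounded-length walk kernel computed on the direct product graph. The abstract does not state which graph kernel was used, so this is an illustration under assumptions rather than the authors' method. The function names (`walk_kernel`, `_adjacency`), the vertex labels "S" and "L", and the two toy graphs are all hypothetical.

```python
from itertools import product


def _adjacency(labels, edges):
    """Build adjacency lists for an undirected graph (repeated edges are kept)."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        if u != v:  # a self-loop is stored once
            adj[v].append(u)
    return adj


def walk_kernel(g1, g2, max_len=3):
    """Count pairs of walks (one per graph) with identical vertex-label sequences.

    Each graph is a (labels, edges) pair: labels maps vertex -> label and
    edges is a list of undirected (u, v) pairs. Walks of 0..max_len edges
    are counted. This is an illustrative kernel, not the one from the abstract.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    adj1 = _adjacency(labels1, edges1)
    adj2 = _adjacency(labels2, edges2)

    # Vertices of the direct product graph: vertex pairs whose labels match.
    pairs = [(u, v) for u, v in product(labels1, labels2)
             if labels1[u] == labels2[v]]
    pair_set = set(pairs)

    # counts[p] = number of matching walk pairs of the current length ending at p.
    counts = {p: 1 for p in pairs}
    total = sum(counts.values())  # walks of length 0
    for _ in range(max_len):
        nxt = {p: 0 for p in pairs}
        for (u, v), c in counts.items():
            for u2 in adj1[u]:
                for v2 in adj2[v]:
                    if (u2, v2) in pair_set:
                        nxt[(u2, v2)] += c
        counts = nxt
        total += sum(counts.values())
    return total


# Hypothetical toy graphs: vertices stand for structural elements,
# and the labels "S"/"L" are invented categories for illustration only.
g_a = ({0: "S", 1: "S", 2: "L"}, [(0, 1), (1, 2)])
g_b = ({0: "S", 1: "L"}, [(0, 1)])

print(walk_kernel(g_a, g_b, max_len=1))  # 5
```

A matrix of such kernel values over all pairs of training graphs could then be passed to any SVM implementation that accepts a precomputed kernel, with one binary classifier trained per family in the "one vs. others" setting.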

Abstract Author(s): Yan Karklin, Richard F. Meraz, Stephen R. Holbrook