Thomson Reuters
 

 ScienceWatch
Chuong B. Do talks with ScienceWatch.com and answers a few questions about this month's New Hot Paper in the field of Computer Science.
Article Title: CONTRAfold: RNA secondary structure prediction without physics-based models
Authors: Do, CB; Woods, DA; Batzoglou, S
Journal: BIOINFORMATICS
Volume: 22
Issue: 14
Page: E90-E98
Year: JUL 2006
* Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA.

 Why do you think your paper is highly cited?

Functional noncoding RNA genes are an important class of genomic elements, which perform numerous catalytic and regulatory roles in living cells. The function of RNA genes is dictated by their secondary structure, i.e., the patterns of base-pairings that form between nucleotides of an RNA molecule.

This paper describes CONTRAfold (CONditional TRAining for RNA secondary Structure Prediction), a novel machine-learning approach to predicting the secondary structure of a single RNA sequence. It makes significantly more accurate predictions than all previous methods.

Over the last several decades, the most accurate methods have relied on physics-based models, whose energy terms were measured through laborious experiments. Our approach is the first competitive method that allows automated estimation of parameters without the need for direct experimental measurements.

 Does it describe a new discovery, methodology, or synthesis of knowledge?

In most computational approaches to RNA secondary structure prediction, the energy of a structure is modeled as the summation of local interaction terms describing small portions of the global base-pairing configuration; the predicted RNA secondary structure is the one achieving the minimum free energy. Obtaining the free energies for each type of local interaction term that could occur in an RNA secondary structure, however, is a difficult endeavor, often involving carefully calibrated optical melting experiments.
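
To make the idea of "energy as a sum of local terms, minimized over structures" concrete, here is a minimal sketch. It is a deliberately simplified, Nussinov-style dynamic program in which the only local term is a per-base-pair energy. The numbers in `PAIR_ENERGY` and the `MIN_LOOP` constraint are illustrative assumptions, not measured thermodynamic parameters, and real energy models include many more term types (stacking, loops, bulges).

```python
# Toy illustration only: hypothetical per-pair "energies", not measured values.
PAIR_ENERGY = {
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}
MIN_LOOP = 3  # assumed minimum number of unpaired bases enclosed by a pair


def min_energy(seq):
    """Return the minimum total pair energy over nested structures of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    # best[i][j] = minimum energy achievable on the subsequence seq[i..j]
    best = [[0.0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base i is left unpaired.
            candidate = best[i + 1][j]
            # Case 2: base i pairs with some base k further along.
            for k in range(i + MIN_LOOP + 1, j + 1):
                energy = PAIR_ENERGY.get((seq[i], seq[k]))
                if energy is None:
                    continue
                inside = best[i + 1][k - 1]
                outside = best[k + 1][j] if k + 1 <= j else 0.0
                candidate = min(candidate, energy + inside + outside)
            best[i][j] = candidate
    return best[0][n - 1]


print(min_energy("GGGAAACCC"))
```

In this toy model the hairpin formed by three G-C pairs around the `AAA` loop is the lowest-energy structure for `GGGAAACCC`. The difficulty described above lies in obtaining trustworthy values for the table of local terms, which this sketch simply makes up.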

In this paper, we adapt an existing probabilistic modeling technique, known as conditional log-linear models (CLLMs), to the problem of modeling RNA secondary structure. Unlike previous applications of machine learning to the problem of RNA secondary structure prediction, our model uses parameters which closely mirror the local interaction terms of thermodynamics-based models. Using discriminative learning techniques, we estimate these parameters directly from databases of RNAs with known structure, without relying on optical melting data.
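
As a rough illustration of what a conditional log-linear model computes, the sketch below scores each candidate structure by a weighted sum of feature counts and normalizes over the candidates. It then forms the usual log-likelihood gradient, observed feature counts minus expected feature counts. The function names, the feature names (`gc_pair`, `hairpin`), and the explicit list of candidates are all assumptions made for the example. A practical implementation would presumably sum over all structures with dynamic programming rather than enumerate them.

```python
import math


def cllm_probs(weights, candidates):
    """Conditional probability of each candidate structure given the sequence.

    candidates: list of dicts mapping feature name -> count, one per structure.
    weights:    dict mapping feature name -> learned weight (missing = 0).
    """
    scores = [
        sum(weights.get(name, 0.0) * count for name, count in feats.items())
        for feats in candidates
    ]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_gradient(weights, candidates, true_index):
    """Gradient of log P(true structure | sequence) with respect to weights."""
    probs = cllm_probs(weights, candidates)
    grad = {}
    # Observed feature counts from the known structure.
    for name, count in candidates[true_index].items():
        grad[name] = grad.get(name, 0.0) + count
    # Minus the expected feature counts under the current model.
    for p, feats in zip(probs, candidates):
        for name, count in feats.items():
            grad[name] = grad.get(name, 0.0) - p * count
    return grad


# Hypothetical feature counts for two candidate structures of one sequence.
candidates = [{"gc_pair": 3}, {"gc_pair": 1, "hairpin": 1}]
print(cllm_probs({}, candidates))
print(log_likelihood_gradient({}, candidates, true_index=0))
```

Following this gradient raises the weights of features that occur more often in the known structure than the model currently expects. In that sense the parameters are estimated discriminatively from structural data rather than measured.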

 Would you summarize the significance of your paper in layman's terms?

CONTRAfold is not the first method to use machine-learning techniques for estimating RNA secondary structure models from structural databases. Past applications of machine learning to the problem of RNA secondary structure determination focused on generative probabilistic grammar-based models of RNA secondary structure (in particular, stochastic context-free grammars or SCFGs).

These methods, however, failed to reach the accuracies of thermodynamics-based models. As a result, machine-learning methods have often been considered second-rate approaches to parameter estimation, useful only in specialized circumstances where more general RNA folding models do not apply.

By demonstrating that effective parameter estimation from databases of RNAs with known structure is in fact possible, CONTRAfold provides a promising alternative to the thermodynamics-based modeling techniques that have dominated RNA secondary structure modeling for several decades.

 How did you become involved in this research, and were there any problems along the way?

Before this work, our research dealt with applications of CLLMs to pairwise protein sequence alignment. Based on our successes in protein alignment, we expected that similar techniques could also succeed in the problem of RNA secondary structure prediction.

Initially, we worked with grammar-based models of RNA, which were, at the time, the machine-learning technique of choice for RNA secondary structure modeling. Grammars turned out to be overly restrictive and cumbersome for incorporating all the various types of local interactions needed in modeling RNA secondary structures, and our first attempts at applying our discriminative learning algorithms gave disappointing results.

While working on these models, however, we realized that we could build a CLLM using the parameterization of local interaction terms in existing RNA thermodynamic models as a starting point. By constructing our model in this way, our algorithm would closely mirror the scoring scheme of the existing state-of-the-art methods while retaining the flexibility to learn new parameters via discriminative machine learning. This key insight led to the CONTRAfold program.

 Where do you see your research leading in the future?

We believe that discriminative machine-learning techniques hold much promise for computational analysis of RNAs, beyond structure prediction. In particular, we are interested in developing models for identifying novel candidates for functional noncoding RNAs in whole genomes. For this task, we are looking at extensions of the learning algorithms used in CONTRAfold for distinguishing functional structured RNAs from nonfunctional transcripts. This supervised learning approach differs from more standard computational screens in which the free energy of RNAs is compared to the free energies for randomly shuffled RNA sequences.
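
For contrast, the shuffling-based screen mentioned above can be sketched as a z-score: the energy of the real sequence is compared with the energies of randomly shuffled versions of it. This is only a sketch under stated assumptions. `energy_fn` stands in for whatever folding-energy routine is available, and the function name, the default of 100 shuffles, and the simple mononucleotide shuffle are choices made for this example.

```python
import random
import statistics


def shuffle_z_score(seq, energy_fn, n_shuffles=100, seed=0):
    """Z-score of seq's energy against energies of shuffled copies of seq.

    energy_fn is a caller-supplied function mapping a sequence to an energy.
    A strongly negative score suggests the sequence is more stable than
    random sequences with the same base composition.
    """
    rng = random.Random(seed)
    native = energy_fn(seq)
    shuffled_energies = []
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)  # mononucleotide shuffle: composition is preserved
        shuffled_energies.append(energy_fn("".join(bases)))
    mean = statistics.mean(shuffled_energies)
    spread = statistics.pstdev(shuffled_energies)
    if spread == 0:
        return 0.0  # all shuffles scored the same; no signal
    return (native - mean) / spread
```

Passing an energy function that depends only on base composition, for example `lambda s: -float(s.count("G"))`, always gives a score of 0.0, because shuffling preserves composition. A supervised approach would instead learn a classifier from labeled examples of functional and nonfunctional transcripts.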

Chuong B. Do
Department of Computer Science
Stanford University
Stanford, CA, USA

Keywords: functional noncoding RNA genes, CONTRAfold, CONditional TRAining for RNA secondary Structure Prediction, RNA secondary structure prediction.

2008 : September 2008 - New Hot Papers : Chuong B. Do