
Presented by: Jeff Bonis CISC841 - Bioinformatics

Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without Loss of Accuracy, by Zasha Weinberg and Walter L. Ruzzo.



  1. Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without Loss of Accuracy • Zasha Weinberg and Walter L. Ruzzo • Presented by: Jeff Bonis, CISC841 - Bioinformatics

  2. What Are Non-Coding RNAs (ncRNA)? • “functional molecules that do not code for proteins” • Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements • Over 100 known ncRNA families

  3. Secondary Structure of ncRNAs • Conserved across a family, and therefore useful for identifying homologs • Secondary structure is functionally important to RNAs • Base pairing is important in pattern searching • e.g. 16S rRNA, part of the small subunit of the prokaryotic ribosome

  4. What Techniques Exist? • Two models that predict homologs in ncRNA families • Covariance Models (CMs) • Easy RNA Profile IdentificatioN (ERPIN) - http://tagc.univ-mrs.fr/erpin/ • Both use multiple alignment of family members with secondary structure annotation • Statistical model is built from this multiple alignment • Display high sensitivity and low specificity

  5. What about ERPIN? • A DP algorithm matches the statistical profile onto a target database and returns the solutions and their scores • Cannot take into account non-consensus bulges in helices (caused by indels) • Needs user-specified score thresholds, which compromises accuracy

  6. CMs • “specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures.” • Can’t accommodate pseudoknots • Very slow algorithm

  7. Which model should be improved? • The Covariance Model (CM) is chosen because its limitation, the inability to model pseudoknots, costs little: pseudoknots contain little information anyway • Goal: address the slow speed without sacrificing accuracy • CMs are used in Rfam - http://rfam.wustl.edu • Rfam searches an 8 gigabase genome DB called RFAMSEQ • A tRNA search takes over a year on a Pentium 4 (P4) • Rfam covers over 100 ncRNA families

  8. Previous improvements on speed • BLAST-based heuristic • Known members are BLASTed against RFAMSEQ • The CM is run on the resulting set • BLAST misses family members, especially where sequence conservation is low • tRNAscan-SE - http://www.genetics.wustl.edu/eddy/tRNAscan-SE/ • Uses 2 heuristic-based programs for tRNA searches • The CM is used on the resulting set • May miss tRNAs that CMs would find

  9. How to improve sensitivity? • The authors previously developed rigorous filters that keep 100% of the hits the CM would find • Filters are based on profile HMMs • A profile HMM is built from the CM, then run on the DB • Much of the DB is filtered out, and the CM runs on the remaining set • The HMM filter is based on sequence conservation • Scans efficiently for 126 of 139 ncRNA families in Rfam • The other 13 display low sequence conservation but strong conservation of secondary structure, which an HMM can’t take into account • Heuristic methods also miss these ncRNAs

  10. How can these special biological situations be accounted for? • The authors propose 3 innovations to overcome these setbacks • 2 techniques include secondary structure information in filtering, at the expense of CPU time • Sub-CMs: hybrid filtering composed of CMs and profile HMMs • Store-pair: uses additional HMM states for modeling key base pairs • The third technique helps reduce scan time • Runs filters in series, quickest first and most selective last • Choosing the series is a shortest path problem

  11. Results • The techniques worked for 11 of the 13 previously missed Rfam families • Also found new hits missed by BLAST • For tRNAscan-SE, provided a rigorous scan for 3 of 4 CMs, finding missed hits • 100 times faster than the raw CM on average • Uncovers members missed by heuristics

  12. What are CMs anyway? • “statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment” • Described in terms of stochastic context-free grammars (SCFGs) • Transformational grammars: • Rules: of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides • Terminals: symbols in the actual string (nucleotides) • Non-terminals: abstract symbols (states) • Parse: series of steps to obtain the final output • Example: • RNA molecules CAG or GAC • S1 -> c S2 g | g S2 c; S2 -> a • Parse: S1 -> c S2 g -> cag
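The toy grammar on this slide can be expanded mechanically. The following is a minimal Python sketch, not the authors' implementation; the `RULES` table layout and the helper names `derive` and `accepts` are assumptions made for illustration.

```python
# Toy SCFG from the slide: S1 -> c S2 g | g S2 c ; S2 -> a
# Each rule is (left nucleotide, child state or None, right nucleotide).
RULES = {
    "S1": [("c", "S2", "g"), ("g", "S2", "c")],
    "S2": [("a", None, "")],
}

def derive(state):
    """Return every string the given non-terminal (state) can generate."""
    results = []
    for left, child, right in RULES[state]:
        # A rule without a child state ends the derivation at this point.
        inner = [""] if child is None else derive(child)
        results.extend(left + mid + right for mid in inner)
    return results

def accepts(sequence, start="S1"):
    """True if the grammar can generate the sequence from the start state."""
    return sequence.lower() in derive(start)

print(derive("S1"))    # ['cag', 'gac']
print(accepts("CAG"))  # True
print(accepts("CAC"))  # False
```

Because the left and right nucleotides are emitted by the same rule, the grammar can only produce strings whose ends pair up (c with g), which is what lets a CM capture base pairing.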

  13. How are CMs used? • Each rule is assigned a probability • Rules more consistent with the family have higher probability • The probability of a parse is the product of the probabilities of the rules it uses • CMs use log-odds ratios and sum the scores instead of multiplying • The CM Viterbi algorithm requires a window length as input, which upper-bounds a family member’s length and affects scan time
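The relationship between multiplying rule probabilities and summing log-odds scores can be checked numerically. This is a small sketch with made-up probabilities; the `family` and `background` tables and the rule labels are hypothetical and do not come from the paper or from any real CM.

```python
import math

# Hypothetical rule probabilities under the family model and under a
# background (null) model. All numbers are invented for illustration.
family = {"S1->cS2g": 0.6, "S2->a": 0.9}
background = {"S1->cS2g": 0.0625, "S2->a": 0.25}

# The parse S1 -> c S2 g -> cag uses these two rules.
parse = ["S1->cS2g", "S2->a"]

# Probability of the parse: product of the probabilities of its rules.
prob = math.prod(family[r] for r in parse)
null_prob = math.prod(background[r] for r in parse)

# Log-odds score: sum of per-rule log2(family / background).
score = sum(math.log2(family[r] / background[r]) for r in parse)

# Summing log-odds gives the log of the ratio of the two products.
assert math.isclose(score, math.log2(prob / null_prob))
print(f"parse probability = {prob:.3f}, log-odds score = {score:.2f} bits")
```

Working with sums of logs avoids numerical underflow on long sequences and makes a positive score mean "more like the family than like background".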

  14. How are profile HMMs and CMs combined? • Given a CM, a profile HMM is created whose Viterbi score upper-bounds the CM’s Viterbi score • This guarantees 100% sensitivity relative to the CM • Filtering: • At each nucleotide position in the database sequences, the HMM is used to compute an upper bound on the CM score • A CM scan is applied to all subsequences whose upper bound exceeds some threshold • Subsequences below the threshold are filtered out • Profile HMMs are represented by regular grammars, which cannot emit paired nucleotides, e.g. • CM: S1 -> a S2 u | c S2 g; S2 -> ε • HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> g | u • The CM is expanded into left and right HMM states
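The filtering idea can be sketched on a toy model. This is an illustration under stated assumptions, not the authors' construction: the scores are invented, the "CM" only scores the first and last nucleotide of a window, and the HMM scores are chosen by a simple row maximum, which is one valid (but not necessarily tight) way to obtain an upper bound. All names (`PAIR_SCORE`, `LEFT`, `RIGHT`, `filtered_scan`) are hypothetical.

```python
# Hypothetical log-odds scores for the paired rule S1 -> x S2 y.
NUCS = "acgu"
PAIR_SCORE = {(x, y): -4.0 for x in NUCS for y in NUCS}
PAIR_SCORE[("a", "u")] = 2.0
PAIR_SCORE[("c", "g")] = 3.0

# An HMM scores the left and right nucleotides independently. Taking the
# row maximum guarantees LEFT[x] + RIGHT[y] >= PAIR_SCORE[(x, y)].
LEFT = {x: max(PAIR_SCORE[(x, y)] for y in NUCS) for x in NUCS}
RIGHT = {y: 0.0 for y in NUCS}

def cm_score(window):
    """Toy CM: first and last nucleotides pair; the middle is unscored."""
    return PAIR_SCORE[(window[0], window[-1])]

def hmm_bound(window):
    """Toy HMM: an upper bound on cm_score for the same window."""
    return LEFT[window[0]] + RIGHT[window[-1]]

def filtered_scan(sequence, width, threshold):
    """Run the toy CM only on windows whose HMM bound reaches the threshold."""
    hits, cm_runs = [], 0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if hmm_bound(window) < threshold:
            continue  # the CM score cannot reach the threshold: filtered out
        cm_runs += 1
        if cm_score(window) >= threshold:
            hits.append((start, window))
    return hits, cm_runs

hits, cm_runs = filtered_scan("gcagauucag", width=3, threshold=2.0)
print(hits, cm_runs)
```

Since the bound never underestimates the CM score, any window the CM would report survives the filter; the saving comes from the windows on which the expensive CM never has to run.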

  15. How can these be supplemented? • Selecting an optimal series of filters • A filter’s filtering fraction (fraction of the DB left over) and run time are estimated by running the filter on a training sequence • Goal: minimize expected total CPU time • Assumptions: • estimated fractions and CPU times are constant for all training sequences • a filter’s fraction is not affected by previously run filters • The optimal sequence of filters is found by solving a shortest path problem on a graph • Nodes are the filters and the CM • Edge weights are CPU times
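One way to read the shortest path formulation is sketched below. The slide only states that nodes are the filters and the CM and that edge weights are CPU times; the specific edge weight used here (fraction of the database left after the current stage times the per-nucleotide time of the next stage), the `start` node, and the function name `cheapest_filter_series` are assumptions for illustration, as are the example numbers.

```python
import heapq

def cheapest_filter_series(filters, cm_time):
    """filters: {name: (fraction_left, time_per_nt)}; returns (cost, path)."""
    # The hypothetical "start" node still has the whole database (fraction 1.0).
    fraction = {"start": 1.0, **{n: f for n, (f, _) in filters.items()}}
    run_time = {"CM": cm_time, **{n: t for n, (_, t) in filters.items()}}

    def edges(node):
        # Assumed weight: (fraction still left) * (time of the next stage).
        for nxt in run_time:
            # Only move to the CM or to a strictly more selective filter.
            if nxt == "CM" or fraction[nxt] < fraction[node]:
                yield nxt, fraction[node] * run_time[nxt]

    # Dijkstra's algorithm from "start" to "CM".
    queue = [(0.0, ["start"])]
    done = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == "CM":
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, weight in edges(node):
            heapq.heappush(queue, (cost + weight, path + [nxt]))
    raise ValueError("no path to CM")

# Made-up measurements: (fraction of DB left over, relative time per nucleotide).
filters = {"hmm": (0.01, 1.0), "store_pair": (0.001, 5.0)}
cost, path = cheapest_filter_series(filters, cm_time=1000.0)
print(cost, path)
```

With these numbers the cheap HMM filter runs first, the slower but more selective filter runs on what remains, and the CM sees only the final fraction, which matches the "quickest first, most selective last" ordering.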

  16. Sub-CM technique • Exploits info in hairpins (bulges and internal loops) • Much info is stored in short hairpins that need only part of the CM’s states • The grammar contains both HMM and CM parts • The window length of the sub-CM is crucial • HMMs are created manually after sub-CMs are found • Automating this is a future project

  17. Store-pair technique • An HMM with extra states can reflect base pairs • S1L[C] -> g S1L[C] has score negative infinity • 5 states are added per HMM state, but this can be reduced
