Computational ncRNA gene finding (& ncRNA structure prediction)
Presentation Transcript

  1. Computational ncRNA gene finding (& ncRNA structure prediction) ncRNA structure prediction (& computational ncRNA gene finding) Liming Cai (BINF8210@UGA, Fall 2010)

  2. Non-coding RNAs • Functions other than coding for proteins, e.g., structural, catalytic, and regulatory roles • Functional RNAs = ncRNAs + UTR motifs • (-) No strong statistical features, such as ORFs or polyadenylation signals, of the kind seen in protein-coding genes • (+) Transcribed ncRNA molecules can fold into secondary and tertiary structures (more conserved than sequences)

  3. Sources of ncRNAs • Non-coding RNA genes encode RNAs, e.g., miRNAs, and the roX1 and roX2 RNAs in male Drosophila melanogaster • In introns and intergenic regions, e.g., snoRNAs • In 5’ and 3’ UTRs, e.g., regulatory motifs (functional RNAs)

  4. Functions of ncRNAs • rRNAs and tRNAs • RNA maturation: snRNAs recognize splice sites • RNA modification: snoRNAs guide the conversion of uridine to pseudouridine • Regulation of gene expression and translation: e.g., miRNAs • DNA replication: e.g., telomerase RNAs, which serve as the template for addition of telomeric repeats • Etc.

  5. Classes of ncRNAs (Bompfunewerer et al., 2005)

  6. Some ncRNA databases • Rfam (280,000 regions of 379 families) • NONCODE (109 transitional classes and 9 groups) • RNAdb (800 mammalian ncRNAs, excluding tRNAs, rRNAs and snRNAs) • Arabidopsis Small RNA Project (ASRP) • Etc.

  7. ncRNA gene finding strategies • Computational predictive methods • cDNA cloning to enrich ncRNAs • Detecting new transcripts with oligonucleotide microarrays

  8. ncRNA gene finding: a computational challenge • ncRNA genes do not have significant statistical signals • Large in number • Diverse in length, from 20 nt to 22,000 nt • Not sure what to look for • Computationally intensive - Simply no good method - Methods compromising accuracy

  9. Difficulty of discovering ncRNAs in genomes • Unlike protein-coding genes: no strong statistical sequence signals (no ORF, no polyadenylation) [tRNA gene, transcribed to tRNA sequence, folded into 3D structure]

  10. Computational ncRNA gene finding methods • Specific (custom-designed) ncRNA search and annotation (e.g., tRNAscan, methylation-guide snoRNA, miRNA, tmRNA) • Reconfigurable search systems (e.g., Infernal, ERPIN, RNATOPS, FastR) - mechanism to profile the target ncRNA (structure); needs training data • De novo ncRNA gene detection with - base composition (e.g., G+C %) - structure fold (e.g., RNAz) • Comparative analysis (e.g., QRNA, EvoFold) - consensus structure • ncRNA “holy grail”?

  11. Review literature in computational ncRNA gene finding and annotation • A. Laederach (2007) Informatics challenges in structured RNA, Briefings in Bioinformatics, 8(5): 294-303. • S. Eddy (2001) Non-coding RNA genes and the modern RNA world, Nature Reviews Genetics, 2(12): 919-929. • S. Griffiths-Jones (2007) Annotating noncoding RNA genes, Annual Review of Genomics and Human Genetics, 8: 279-298. • Machado-Lima et al. (2008) Computational methods in noncoding RNA research, Journal of Mathematical Biology, 56: 15-49.

  12. [506 miRNAs: comparison between NUPACK and Triple] [499 tRNAs: comparison between NUPACK and Triple] Data from Bonnet et al., 2004

  13. [499 tRNAs: comparisons between HG, Triple, and NUPACK] [499 tRNAs: comparison between HG and NUPACK] Data from Bonnet et al., 2004

  14. [tRNA unfolding pathway] [Doudna et al., 1999] What is in this lecture? • RNA secondary structure prediction 1. ab initio structure prediction 2. consensus structure prediction 3. structural model-based prediction • But why just secondary structure?

  15. Tertiary structure: • Less understood non-canonical interactions • Only a small number of resolved structures Secondary structure: • (Well understood) canonical base pairs • Scaffolding for tertiary structure • Well studied, many known structures Conclusion: measuring ncRNA secondary structure may be a feasible solution for ncRNA gene finding

  16. What else is in this lecture? • ncRNA gene finding and annotation 4. structural profile-based ncRNA gene annotation 5. comparative analysis-based ncRNA gene finding 6. ab initio ncRNA gene detection

  17. Base pairings of RNAs • Base pairings allow RNA to fold • Watson-Crick base pairs: A-U, C-G • Wobble pair: G-U • Together these are called canonical pairs for secondary structure Note: all 16 base pairs (including non-canonical ones) are possible in RNA tertiary structure

  18. [Chemical structures of cytosine, guanine, uracil, and adenine along the strand 5’-u-u-c-c-g-a-a-g-c-u-c-a-a-c-g-g-g-a-a-a-u-g-a-g-c-u-3’]

  19. Secondary structure is important to tertiary structure

  20. Stems in nested or parallel pattern • stem (double helix): stacked base pairs • loop: strand of unpaired bases [example sequence drawn with nested and parallel stems]

  21. Stems in crossing patterns • Pseudoknots: crossing patterns of stems [example sequence drawn with crossing stems]
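The difference between nested or parallel stems and crossing stems can be made concrete with a small check on base-pair positions. The following is a minimal Python sketch, not something taken from the slides: the function name `has_pseudoknot` and the representation of a structure as a list of (i, j) index pairs are assumptions made for illustration.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    `pairs` is a list of (i, j) positions of paired bases
    (hypothetical representation, chosen for this sketch).
    """
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k. The pairs cross when k falls inside
        # (i, j) but its partner l falls outside.
        if i < k < j < l:
            return True
    return False

nested = [(0, 9), (1, 8), (2, 7)]   # one stem of stacked pairs
parallel = [(0, 4), (5, 9)]         # two stems side by side
crossing = [(0, 6), (3, 9)]         # stems that interleave

print(has_pseudoknot(nested), has_pseudoknot(parallel), has_pseudoknot(crossing))
```

Nested and parallel pairs never satisfy i < k < j < l, which is the property that pseudoknot-free folding algorithms rely on.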

  22. RNA secondary structure elements: pseudoknot, stem, interior loop, single-stranded region, bulge loop, junction (multiloop), hairpin loop [Image: Wuchty]

  23. RNA stem-loop (pseudoknot-free) structure example

  24. RNA secondary structure prediction 1. Ab initio structure prediction, to predict the structure of a single sequence 2. Consensus structure prediction, to predict the structure shared by more than one sequence 3. Statistical model-based prediction and alignment, to search for desirable structures in genomes or databases

  25. 1. ab initio structure prediction • Forming hydrogen bonds (base pairs) lowers the free energy contained in the molecule. • The lower the free energy, the more stable the folded structure.

  26. ab initio structure prediction (cont’) • Consider only canonical base pairs A-U, C-G, and G-U. Base pairings reduce the amount of free energy contained in the molecule. • Maximizing the number of base pairs would minimize the free energy in the molecule. (Only an approximate model)

  27. ab initio structure prediction (cont’) • But how to count? An RNA can be very long, and there may be many possible ways to form base pairs. E.g., in …ACGGUACGUC… there are conflicting pairs, such as two alternative A-U pairs or two alternative G-C pairs, because each base can pair with at most one partner. Even the number of non-conflicting combinations of base pairs is exponentially large.
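To get a feel for how quickly the number of candidate structures grows, the combinations can be counted instead of listed. The sketch below is illustrative only and rests on stated assumptions: it counts sets of canonical pairs that are non-crossing (pseudoknot-free), which is narrower than the "non-conflicting" combinations mentioned above, and the function name `count_nested_structures` is made up for this example.

```python
from functools import lru_cache

CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def count_nested_structures(seq):
    """Count sets of non-crossing canonical base pairs, including the empty set.

    Assumption for this sketch: no minimum loop length, so adjacent
    bases may pair.
    """
    @lru_cache(maxsize=None)
    def count(i, j):
        if i >= j:                      # empty or single base: one structure
            return 1
        total = count(i + 1, j)         # base i left unpaired
        for k in range(i + 1, j + 1):   # base i paired with base k
            if (seq[i], seq[k]) in CANONICAL:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)

# Growth on a toy repeat sequence (chosen only for illustration).
for repeats in (1, 2, 4, 8):
    print(repeats * 2, count_nested_structures("GC" * repeats))
```

Counting takes polynomial time even though the count itself grows very fast, which is the same observation that makes the dynamic-programming approach to folding possible.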

  28. ab initio structure prediction (cont’) • Four cases for a subsequence from position i to position j: (1) head paired with tail (2) tail is unpaired (3) head is unpaired (4) two subfolds, split at a position k between i and j

  29. Looking at shorter (e.g., very short) subsequences in a long sequence ACGGU…ACGUC
      • Subsequences of length 1: A, C, G, G, U, …, A, C, G, U, C
        # of base pairs:          0, 0, 0, 0, 0, …, 0, 0, 0, 0, 0
      • Subsequences of length 2: AC, CG, GG, GU, …, AC, CG, GU, UC
        # of base pairs:          0,  1,  0,  1,  …, 0,  1,  1,  0
      • Subsequences of length 3: ACG, CGG, GGU, …, UAC, ACG, CGU, GUC
        # of base pairs: ? E.g., for GUC:
        (1) G-C + U --> 1+0 = 1 (head-tail)
        (2) G + UC  --> 0+0 = 0 (head unpaired)
        (3) GU + C  --> 1+0 = 1 (tail unpaired)
        (4) GU + C  --> 1+0 = 1 (split)
        (5) G + UC  --> 0+0 = 0 (split)

  30. Examine a slightly longer sequence …ACGGUACGU…, from position i to position j ==> max of {cases 1, 2, 3, 4}
      1. Head-tail paired: count = 1 + max count in subsequence CGGUACG (i+1 to j-1)
      2. Head unpaired: count = max count in subsequence CGGUACGU (i+1 to j)
      3. Tail unpaired: count = max count in subsequence ACGGUACG (i to j-1)
      4. Split (why needed, and where to split?): for ACGGUACGU with k = i+2 ==> ACG + GUACGU, count = max count in ACG (i to k) + max count in GUACGU (k+1 to j)

  31. Ab initio structure prediction (cont’) • Maximizing the number of base pairs (Nussinov et al., 1978) • Simple model: each canonical base pair (i, j) scores 1
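The four cases can be collected into a single recurrence. The notation below is a reconstruction and not copied from the slide: C(i, j) denotes the maximum number of base pairs in the subsequence from i to j, and δ(i, j) is an assumed symbol for the simple scoring model (1 if bases i and j form a canonical pair, 0 otherwise).

```latex
\[
C(i,j) = \max
\begin{cases}
C(i+1,\,j-1) + \delta(i,j) & \text{(1) head paired with tail} \\
C(i+1,\,j)                 & \text{(2) head unpaired} \\
C(i,\,j-1)                 & \text{(3) tail unpaired} \\
\max\limits_{i \le k < j} \bigl[\, C(i,k) + C(k+1,\,j) \,\bigr] & \text{(4) split at } k
\end{cases}
\qquad C(i,i) = 0, \quad C(i,\,i-1) = 0 .
\]
```

Filling C for all subsequences in order of increasing length takes O(n^2) table entries with an O(n) split search each, hence O(n^3) time overall.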

  32. Base-pair count matrix for GGGAAAUCC (Ci,j = 0 when i=j)
           G  G  G  A  A  A  U  C  C
        G  0  0  0  0  0  0  1  2  3
        G     0  0  0  0  0  1  2  3
        G        0  0  0  0  1  2  2
        A           0  0  0  1  1  1
        A              0  0  1  1  1
        A                 0  1  1  1
        U                    0  0  0
        C                       0  0
        C                          0
      Labeled cells: GGGAAAUCC = 3, GAAAUC = 2, AAUC = 1, AU = 1

  33. Example 2: ACGGUU
      • Subsequence of length 0: empty sequence, 0 pairs
      • Subsequences of length 1: A, C, G, G, U, U --> 0, 0, 0, 0, 0, 0 pairs
      • Subsequences of length 2: AC, CG, GG, GU, UU --> 0, 1, 0, 1, 0 pairs
      • Subsequences of length 3: ACG, CGG, GGU, GUU --> 1, 1, 1, 1 pairs
      • Subsequences of length 4: ACGG, CGGU, GGUU --> 1, 2, 2 pairs
      • Subsequences of length 5: ACGGU, CGGUU --> 2, 2 pairs
      • Subsequence of length 6: ACGGUU --> 3 pairs
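The table-filling procedure in these examples can be sketched in a few lines of Python. This is a minimal illustration of base-pair maximization under the simple model (one point per canonical pair), not the lecture's own code: the function name `nussinov_table` is an assumption, and, as in the slides' counts, no minimum hairpin loop length is enforced, so adjacent C and G count as a pair.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def nussinov_table(seq):
    """Return the matrix C where C[i][j] is the max number of base pairs in seq[i..j]."""
    n = len(seq)
    C = [[0] * n for _ in range(n)]          # C[i][i] = 0; lower triangle unused
    for length in range(2, n + 1):           # shorter subsequences first
        for i in range(n - length + 1):
            j = i + length - 1
            inner = C[i + 1][j - 1] if i + 1 <= j - 1 else 0
            # Case 1: head paired with tail (adds 1 only for a canonical pair)
            best = inner + (1 if (seq[i], seq[j]) in CANONICAL else 0)
            # Cases 2 and 3: head unpaired, tail unpaired
            best = max(best, C[i + 1][j], C[i][j - 1])
            # Case 4: split into two subfolds seq[i..k] and seq[k+1..j]
            for k in range(i, j):
                best = max(best, C[i][k] + C[k + 1][j])
            C[i][j] = best
    return C

table = nussinov_table("ACGGUU")
print(table[0][5])   # max number of base pairs for the whole sequence
```

Only the count is computed here; recovering an actual structure would need a traceback through the table, which is left out to keep the sketch short.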

  34. Prediction Algorithm Web Server • http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna-form1.cgi • Sample sequence: (1) tRNA GGGGUCAUAGCUCAGUUGGUAGAGCGCUACAAUGGCAUUGUAGAGGUCAGCGGUUCGAUCCCGCUUGGCUCCACCA (2) a part of tmRNA CCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGA • Simple matrix, • simple matrix with G-U pair • Complex matrix Rfam database: http://www.sanger.ac.uk/Software/Rfam/

  35. Thermodynamic energy based structure prediction • The energy minimization algorithm predicts the secondary structure by minimizing the free energy (ΔG) • ΔG is calculated as the sum of individual contributions of: • loops • base pairs • secondary structure elements
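The additive energy model can be written compactly. The notation is an assumption introduced here for illustration: S is a candidate secondary structure of the sequence, and E(S) is the set of structural elements it decomposes into (stacked base pairs, hairpin loops, bulges, interior loops, multiloops).

```latex
\[
\Delta G(S) \;=\; \sum_{e \,\in\, \mathcal{E}(S)} \Delta G(e),
\qquad
S^{*} \;=\; \arg\min_{S} \, \Delta G(S) .
\]
```

Because each term depends only on one local element, the minimum can be found by dynamic programming over subsequences, in the same spirit as base-pair maximization.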

  36. Free-energy values (kcal/mol at 37 °C) • Energies of stems calculated as stacking contributions between neighboring base pairs

  37. Free-energy values (kcal/mol at 37 °C)

  38. Zuker’s algorithm MFOLD: computing loop dependent energies

  39. Assumptions in such algorithms • Most likely structure corresponds to energetically most stable structure • Energy associated with any position is only influenced by local sequence and structure • Structure formed does not produce pseudoknots

  40. RNA structure prediction web servers • MFOLD http://www.bioinfo.rpi.edu/applications/mfold/rna/form1.cgi • RNAfold ( a part of Vienna Package) http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi Examples: GCTTACGACCATATCACGTTGAATGCACGC CATCCCGTCCGATCTGGCAAGTTAAGCAAC GTTGAGTCCAGTTAGTACTTGGATCGGAGA CGGCCTGGGAATCCTGGATGTTGTAAGCT

  41. RNA pseudoknot (tmRNAs) Bacterial tmRNA consensus structure (Felden et al. 2001. NAR 29) terminates translation errors

  42. Functions of pseudoknots (TMV 3’ UTR) • Promotes efficient translation • Binds EF1A, cooperates with 5’ UTR (Leathers et al. 1993, MCB 13; Zeenko et al. 2002, JVI 76)

  43. Pseudoknots drastically increase computational complexity

  44. RNA pseudoknot prediction web servers • Pknots-RG: http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/ • Pknots-RE (the first pseudoknot prediction algorithm) • Kinefold: http://kinefold.curie.fr/cgi-bin/form.pl • ILM http://cic.cs.wustl.edu/RNA/

  45. Computational complexity issues • Pseudoknot-free structures: O(n^3) CPU time • Pseudoknots: NP-hard; restricted cases O(n^5) • Heuristics added: O(n^4) • Difficult to use for searching RNA structures in genomes

  46. 2. Consensus structure prediction • Covariation in RNAs: • Variations in RNA sequence maintain base-pairing patterns for secondary structures • When one base of a pair changes, the base it pairs with must also change to maintain the same structure
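Covariation can be checked column by column in an alignment. The sketch below is a simplified illustration, not the method used by any particular tool: the function name `pairing_support` and the two-sequence toy alignment are assumptions, and real consensus-structure methods use more careful statistics than this.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def pairing_support(alignment, i, j):
    """Summarize how well columns i and j of an alignment support a base pair.

    Returns (fraction of sequences with a canonical pair at i/j,
             number of distinct canonical pair types observed).
    More than one pair type with full support suggests compensating changes.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    canonical = [p for p in pairs if p in CANONICAL]
    return len(canonical) / len(pairs), len(set(canonical))

# Toy alignment of two equal-length sequences (illustrative only).
alignment = ["GGGGGCAACCCC",
             "AUCCGAAAGGAU"]

print(pairing_support(alignment, 0, 11))  # outermost columns
print(pairing_support(alignment, 4, 7))   # columns inside the loop
```

Columns where every sequence pairs but the pair type differs (G-C in one sequence, A-U in another) are the signal that sequence has changed while structure has been kept.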

  47. Structure alignments (example)
      query: GGGGGCAACCCC (query RNA structure: a stem of four G•C pairs closing the loop GCAA)
      A:     AUCCGAAAGGAU (structural homolog: a stem of pairs A•U, U•A, C•G, C•G closing the loop GAAA)
      B:     CCUAGAAAGGAU (nonhomologous: placed on the same fold, the stem positions read C/U, C/A, U/G, A/G)
      Primary sequence alignment scoring: A = -6, B = -6
      Structure + sequence alignment scoring: A = +11, B = -6