Prediction of > 3000 novel human microRNAs … Martin Reczko ICS/IMBB Bioinformatics Program Biomedical Informatics Lab

Presentation Transcript


  1. Prediction of > 3000 novel human microRNAs … Martin Reczko ICS/IMBB Bioinformatics Program Biomedical Informatics Lab Institute for Computer Science – FORTH

  2. Rfam/miRBase 7.1 (October 2005)

     ID    #miRNAs  name
     -------------------------------------------
     aga        42  A. gambiae (MOZ2)
     ame        26  A. mellifera (AMEL2.0)
     ath       117  A. thaliana (RefSeq entries)
     cbr        82  C. briggsae (cb25.agp8)
     cel       115  C. elegans (WormBase WS140)
     cfa         6  C. familiaris (BROADD1)
     dme        78  D. melanogaster (BDGP4)
     dps        73  D. pseudoobscura (DPSE2.0)
     dre       293  D. rerio (WTSI Zv5)
     fru       130  F. rubripes (FUGU2.0)
     gga       122  G. gallus (WASHUC1)
     hsa       325  H. sapiens (NCBI35)
     mmu       255  M. musculus (NCBIM34)
     osa       123  O. sativa (TIGR 3.0)
     ptr        67  P. troglodytes (CHIMP1)
     rno       189  R. norvegicus (RGSC3.4)
     tni       131  T. nigroviridis (TETRAODON7)
     zma        95  Z. mays (TIGR AZM4)
     ebv         5  Epstein Barr virus (EMBL:V01555.1)
     hcmv        8  Human cytomegalovirus (Refseq:NC_001347.2)
     kshv       11  Kaposi sarcoma associated herpesvirus (EMBL:U75698.1)
     mghv        9  Mouse gammaherpesvirus 68 (EMBL:U97553.1)

     Source: microrna.sanger.ac.uk
     Used: 227 from miRBase 6.0

  3. Negative examples: 3’ UTRs, ~9 Mbases (http://www.ensembl.org/BioMart/)

  4. Conservation: MultiZ alignments

     11111111111111111111111111111111111111110111111111111111111101111111111111111110111111111111111111111111 0
     11111011111111111111111111111111111111010111111111111111111111111111110111110110111111111111111111111111 1
     11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 2
     11111101111111111111111111111111111111111101101011111111111111111111111111111110111111101111111111111111 3
     11101001101101111111111111111111111110011111011011111111011011111111001111011100111111101111111111111111 4
     11100001101101111111111111111111111110010100001011111111011001111111000111010000111111101111111111111111 5

     Conservation rules: number of 1’s above >= 120, at least one stretch of 12 1’s
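The conservation rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: it assumes that the 1's are counted over all rows together and that the stretch of 12 consecutive 1's may occur in any single row. The slide does not spell out either point. The function name and parameters are hypothetical.

```python
def passes_conservation(rows, min_ones=120, min_run=12):
    """Check the slide's two conservation rules on 0/1 strings.

    Assumptions (not stated on the slide): the count of 1's is taken
    over all rows combined, and the run of `min_run` consecutive 1's
    may occur in any one row.
    """
    total_ones = sum(row.count("1") for row in rows)
    has_run = any("1" * min_run in row for row in rows)
    return total_ones >= min_ones and has_run


# Two hypothetical rows: 104 + 20 = 124 ones, and a long run of 1's.
print(passes_conservation(["1" * 104, "1" * 20 + "0" * 84]))
# Plenty of 1's in total, but never 12 in a row.
print(passes_conservation(["10" * 52] * 3))
```

With these inputs the first call accepts the window and the second rejects it, because alternating 1's and 0's never form a stretch of 12.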

  5. Genome-wide prediction pipeline. Process windows of 104 nt along genome:
     1. Fast filtering using composition and palindromes
     2. Comparative analysis with other genomes (BLASTZ)
     3. Approximate secondary structure prediction (stem-loop) using a novel dynamic programming algorithm
     4. Feature extraction and classification (SVMs)
     5. Filter conserved secondary structures

  6. ’Fast’ rules:
     • No window containing an unknown base
     • No windows with complete repeat regions
     • Gain: 40% reduction in analyzed size; sensitivity 100% -> 98.4%
       (lost: hsa-mir-151, hsa-mir-370, hsa-mir-422a, hsa-mir-513-1, hsa-mir-513-2)
     • Single nt composition, both strands:
         A   max 43%     min 9%
         C   max 38%     min 10.6%
         G   max 45%     min 11%
         T   max 40%     min 9.3%
     • Single nt composition, single strands:
         A   max 37.5%   min 9%
         C   max 38%     min 10.6%
         G   max 43.8%   min 12.5%
         T   max 40%     min 12.7%

  7. More ’fast’ rules:
     • Double nt composition, single strands:
         AA   max 15.4%   min 0%
         AC   max 10.7%   min 0%
         AG   max 14.2%   min 1%
         AT   max 16.1%   min 0%
         CA   max 14.7%   min 0%
         CC   max 18.3%   min 0%
         CG   max 15.8%   min 0%
         CT   max 16.4%   min 1.3%
         GA   max 11.9%   min 0%
         GC   max 17.6%   min 0%
         GG   max 19.3%   min 1%
         GT   max 13.4%   min 1.4%
         TA   max 15.7%   min 0%
         TC   max 15.6%   min 1.1%
         TG   max 18.8%   min 2.9%
         TT   max 25.8%   min 0%
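A composition filter of this kind might look like the following Python sketch. It uses only the single-strand, single-nucleotide limits from slide 6; the dinucleotide limits above could be added the same way. The function name and the choice to reject any window with a non-ACGT character are illustrative assumptions, not the author's implementation.

```python
# Single-strand limits from slide 6, as (min, max) fractions of the window.
LIMITS = {
    "A": (0.09, 0.375),
    "C": (0.106, 0.38),
    "G": (0.125, 0.438),
    "T": (0.127, 0.40),
}


def passes_composition(window):
    """Return True if every base frequency lies within its limits.

    Windows containing any character other than A, C, G, T are rejected,
    mirroring the 'no window containing unknown base' rule.
    """
    window = window.upper()
    if not window or any(base not in LIMITS for base in window):
        return False
    n = len(window)
    return all(lo <= window.count(base) / n <= hi
               for base, (lo, hi) in LIMITS.items())


print(passes_composition("ACGT" * 26))  # balanced 104 nt window
print(passes_composition("A" * 104))    # pure poly-A window
```

Because the check is a handful of character counts per window, it is cheap enough to run before any alignment or folding step.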

  8. >= 4 nt palindrome rule: hash table with 4^4 = 256 entries:

     Hash-key    occurred at position   rev.comp
     ---------------------------------------
     000 AAAA      3                    255
     001 AAAC      0                    254
     002 AAAG      0                    253
     003 AAAU      4                    252
     004 AACA      0                    251
     005 ...       ...
     254 UUUG      0                    001
     255 UUUU     60                    000
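The table suggests a base-4 encoding of each 4-mer (A=0, C=1, G=2, U=3), under which the complementary 4-mer always has key 255 minus the original key, as the listed pairs 000/255 and 001/254 show. The Python sketch below illustrates one way to use such a 256-entry table to detect an inverted repeat of at least 4 nt. The search strategy (first occurrence per key, non-overlapping partner upstream) is an assumption for illustration; the slide does not describe the actual lookup procedure.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "U": 3}
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}


def kmer_key(kmer):
    """Encode a k-mer as a base-4 integer (AAAA -> 0, UUUU -> 255 for k=4)."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_CODE[base]
    return key


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def has_inverted_repeat(window, k=4):
    """True if some k-mer is followed downstream, without overlap,
    by its reverse complement (a potential stem of k base pairs)."""
    first_pos = [None] * (4 ** k)  # one table entry per possible k-mer
    n_kmers = len(window) - k + 1
    for i in range(n_kmers):
        key = kmer_key(window[i:i + k])
        if first_pos[key] is None:
            first_pos[key] = i
    for i in range(n_kmers):
        partner = kmer_key(reverse_complement(window[i:i + k]))
        j = first_pos[partner]
        if j is not None and j + k <= i:
            return True
    return False


print(kmer_key("AAAC"), 255 - kmer_key("AAAC"), kmer_key("UUUG"))
print(has_inverted_repeat("GGGGAAAACCCC"))
```

The sketch expects an RNA alphabet (A, C, G, U) and raises `KeyError` on any other character, so unknown bases would need to be filtered out beforehand.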

  9. microRNA computational prediction pipeline (flow diagram): 2 851 352 871 bases -> inverted repeats, composition -> cross-species conservation -> RNA secondary structure prediction -> energy + structural features -> SVM -> SS-conservation -> novel microRNAs: microarray verification

  10. Prediction features (from predicted secondary structure and comparative analysis):
      1. Stem_Length
      2. GC_Content
      3. Stem_BPs
      4. maxLinHelix
      5. MatureCons
      6. MatureOppositeCons
      7. ArmCons
      8. SS_Energy
      9. MatureBPs
      10. MatureEnergyProfile
      => 10 features for SVM classification

  11. Histogram for feature: stem length

  12. Histogram for feature: GC content

  13. Histogram for feature: #base pairs in stem

  14. Feature: longest ‘linear’ helix (examples shown: maxlinhelix = 18 nt, maxlinhelix = 26 nt)

  15. Histogram for feature: longest ‘linear’ helix

  16. Features related to mature region: slide a window of 23 nt from 0 to 15 nt away from the loop, calculate the ‘mature’ feature at all positions and keep the prediction with the highest score
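The sliding-window search described here can be sketched as follows. As a stand-in for the 'mature' features, the example scores each window by its number of conserved bases; the input list, the function name, and the tie-breaking rule (first best offset wins) are illustrative assumptions rather than details given on the slide.

```python
def best_mature_window(conserved, window=23, max_offset=15):
    """Slide a `window`-nt window from offset 0 to `max_offset` and
    return (offset, score) of the highest-scoring position.

    `conserved` is a hypothetical list of 0/1 flags along one arm of the
    hairpin, with index 0 adjacent to the loop. Returns (0, -1) if the
    arm is shorter than one window.
    """
    best_offset, best_score = 0, -1
    for offset in range(max_offset + 1):
        segment = conserved[offset:offset + window]
        if len(segment) < window:
            break  # ran off the end of the arm
        score = sum(segment)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Hypothetical arm: 5 unconserved, 23 conserved, 10 unconserved positions.
arm = [0] * 5 + [1] * 23 + [0] * 10
print(best_mature_window(arm))
```

Any other per-window feature (paired bases, stacking energy profile) could replace `sum(segment)` without changing the search loop.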

  17. Histogram for feature: #conserved bases in mature region

  18. Histogram for feature: #conserved bases in mature region (on opposite strand)

  19. Histogram for feature: #conserved bases in both arms of the stem

  20. Histogram for feature: secondary structure minimal free energy

  21. Histogram for feature: #paired bases in mature region

  22. Mature region: average stacking energy

  23. Histogram for feature: correlation with average mature energy profile in mature region

  24. Learning with Support Vector Machines: training data, test data; ‘soft-margin’ hyperplanes, cost parameter C

  25. Training with the libsvm-2.6 package by C.-C. Chang & C.-J. Lin (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Modification: optimize Matthews correlation, not % correct
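For reference, the Matthews correlation coefficient is computed from the four confusion-matrix counts. The Python sketch below uses the standard formula with the column names of the results table (cp, cn, fp, fn); it is not the modified libsvm code, which the slides do not show.

```python
import math


def matthews_correlation(cp, cn, fp, fn):
    """Matthews correlation from correct positives/negatives (cp, cn)
    and false positives/negatives (fp, fn). Returns 0.0 when the
    denominator vanishes (a common convention, assumed here)."""
    denom = math.sqrt((cp + fp) * (cp + fn) * (cn + fp) * (cn + fn))
    if denom == 0:
        return 0.0
    return (cp * cn - fp * fn) / denom


# Counts from the test-set row at threshold 0.30 (slide 27).
print(f"{matthews_correlation(88, 56706, 18, 4):+.4f}")
```

With roughly 600 negatives per positive in the test set, percent correct stays near 99.9% for almost any threshold, which is presumably why the correlation was preferred as the optimization target.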

  26. Importance of features with ‘knockout’ retraining:

      All features: Cross Validation Accuracy = 87.2728%

      Feature ‘knockout’            Cross Validation Accuracy
      ss-energy                     75.4618%  ***
      stem-start                    84.6784%
      stem-end                      84.409%
      loop-length                   85.2758%
      loop-start                    82.3163%
      # base-pairs                  82.3909%
      GC-content                    76.4124%  **
      higher arm conservation       86.3902%
      lower arm conservation        84.97%
      loop conservation             85.0393%
      # GU pairs                    84.0942%
      length of longest bulge       85.4047%

  27. Test-set results for various SVM thresholds

      Q      SENS   SPEC   CORR     cp  cn     fp   fn  threshold
      ---------------------------------------------------------------------
      99.60  96.74  28.16  +0.5208  89  56497  227   3  0.010000
      99.76  95.65  39.82  +0.6163  88  56591  133   4  0.020000
      99.83  95.65  48.09  +0.6776  88  56629   95   4  0.030000
      99.86  95.65  54.32  +0.7203  88  56650   74   4  0.040000
      99.87  95.65  55.00  +0.7248  88  56652   72   4  0.050000
      99.92  95.65  67.18  +0.8012  88  56681   43   4  0.100000
      99.94  95.65  75.21  +0.8479  88  56695   29   4  0.150000
      99.95  95.65  78.57  +0.8667  88  56700   24   4  0.200000
      99.96  95.65  82.24  +0.8868  88  56705   19   4  0.250000
      99.96  95.65  83.02  +0.8909  88  56706   18   4  0.300000 ***
      99.96  94.57  85.29  +0.8979  87  56709   15   5  0.350000
      99.97  94.57  86.14  +0.9024  87  56710   14   5  0.400000
      99.97  92.39  87.63  +0.8996  85  56712   12   7  0.450000
      99.97  91.30  90.32  +0.9080  84  56715    9   8  0.500000
      99.97  88.04  91.01  +0.8950  81  56716    8  11  0.550000
      99.96  85.87  90.80  +0.8828  79  56716    8  13  0.600000
      99.96  85.87  91.86  +0.8880  79  56717    7  13  0.650000
      99.97  85.87  94.05  +0.8985  79  56719    5  13  0.700000
      99.96  82.61  93.83  +0.8802  76  56719    5  16  0.750000
      99.96  80.43  96.10  +0.8790  74  56721    3  18  0.800000
      99.96  80.43  96.10  +0.8790  74  56721    3  18  0.849999
      99.96  77.17  97.26  +0.8662  71  56722    2  21  0.899999
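The Q, SENS and SPEC columns can be recomputed from the four counts in each row. The sketch below assumes Q is the overall percentage correct and SENS is cp/(cp+fn); SPEC is taken as cp/(cp+fp), the definition given on slide 32, which is what is usually called precision. The function name is hypothetical.

```python
def table_metrics(cp, cn, fp, fn):
    """Return (Q, SENS, SPEC) in percent for one row of the table.

    Q    = (cp + cn) / all examples   (assumed: percent correct)
    SENS = cp / (cp + fn)
    SPEC = cp / (cp + fp)             (as defined on slide 32)
    """
    total = cp + cn + fp + fn
    q = 100.0 * (cp + cn) / total
    sens = 100.0 * cp / (cp + fn)
    spec = 100.0 * cp / (cp + fp)
    return q, sens, spec


# Row at threshold 0.30: cp 88, cn 56706, fp 18, fn 4.
q, sens, spec = table_metrics(88, 56706, 18, 4)
print(f"{q:.2f} {sens:.2f} {spec:.2f}")
```

Under these definitions the row at threshold 0.30 is reproduced to two decimals.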

  28. < 3 weeks on ~40 AMD-242-Opterons (ICS-FORTH)

  29. Hg17-scan results for various SVM thresholds

      precursor     #candidates             hit-rate
      sensitivity   (incl. known miRNAs)
      ----------------------------------------------
      95.1%         96699                   16 ppm
      90.3%         45231                   7.6 ppm
      85.9%         23025                   3.9 ppm
      80.6%         14429                   2.4 ppm
      75.7%          9732                   1.6 ppm
      70.9%          6912                   1.2 ppm
      ---------------------------------------
      Total nt processed: 5976557831
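The hit-rate column appears to be the number of candidates per million nucleotides processed. That reading is an inference from the numbers rather than something the slide states; the short Python check below shows it reproduces the listed values.

```python
TOTAL_NT = 5_976_557_831  # "Total nt processed" from the slide


def hit_rate_ppm(candidates, total_nt=TOTAL_NT):
    """Candidates per million processed nucleotides (assumed definition)."""
    return candidates / total_nt * 1e6


for candidates in (96699, 45231, 23025, 14429, 9732, 6912):
    print(candidates, f"{hit_rate_ppm(candidates):.1f} ppm")
```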

  30. Secondary structure conservation. From RNAfold-library: structure-structure comparison:

        Null,   H,   B,   I,   M,   S,   E
      -------------------------------------
      {    0,   2,   2,   2,   2,   1,   1}   Null
      {    2,   0,   2,   2,   2, INF, INF}   H
      {    2,   2,   0,   1,   2, INF, INF}   B
      {    2,   2,   1,   0,   2, INF, INF}   I
      {    2,   2,   2,   2,   0, INF, INF}   M
      {    1, INF, INF, INF, INF,   0, INF}   S
      {    1, INF, INF, INF, INF, INF,   0}   E

      'H' hairpin loop, 'I' interior loop, 'B' bulge, 'M' multi-loop, 'S' stack, 'E' external elements
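To make the matrix easier to read, the sketch below stores it as a lookup table in Python. It only encodes the costs listed on the slide; the interpretation that the Null row and column give the cost of inserting or deleting an element, and the function name, are assumptions. The full structure comparison algorithm is not reproduced here.

```python
INF = float("inf")
SYMBOLS = ["Null", "H", "B", "I", "M", "S", "E"]

# Costs copied from the slide; row and column order follows SYMBOLS.
COST = [
    [0,   2,   2,   2,   2,   1,   1],    # Null
    [2,   0,   2,   2,   2,   INF, INF],  # H
    [2,   2,   0,   1,   2,   INF, INF],  # B
    [2,   2,   1,   0,   2,   INF, INF],  # I
    [2,   2,   2,   2,   0,   INF, INF],  # M
    [1,   INF, INF, INF, INF, 0,   INF],  # S
    [1,   INF, INF, INF, INF, INF, 0],    # E
]


def edit_cost(a, b):
    """Cost of turning element type `a` into `b` ('Null' = no element)."""
    return COST[SYMBOLS.index(a)][SYMBOLS.index(b)]


print(edit_cost("B", "I"))     # bulge <-> interior loop is the cheapest change
print(edit_cost("H", "S"))     # loops can never be relabeled as stacks
print(edit_cost("Null", "S"))  # assumed: cost of inserting or deleting a stack
```

The matrix is symmetric, and the INF entries mean that stacked pairs and loop elements are never exchanged for one another.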

  31. Secondary structure conservation vs. SVM scores

  32. Probe-design for experimental verification (RNA-RNA chip):
      • 2 probes with 60 nt for each candidate
      • end of 5' probes reach 75% into the hairpin-loop
      • 3' probes start after 50% of the hairpin-loop
      • sensitivity detecting mature miRNA: 86 %
      • Chip in preparation at University of Toronto
      Estimate for the number of true miRNAs:
        Q: 99.96  SENS: 85.87  SPEC: 91.86  CORR: +0.8880  cp 79  cn 56717  fp 7  fn 13  th 0.67
        spec = cp/(cp+fp) = cp/nhits => (expected cp) = spec*nhits = 0.9168*7664 = 7026
      All predictions are available!
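The estimate can be retraced with a few lines of Python. Note that the counts on the slide (cp 79, fp 7) give spec = 79/86, about 0.9186, which matches the SPEC value of 91.86, whereas the product on the slide uses 0.9168. The sketch computes both variants without deciding which one was intended; the function name is hypothetical.

```python
def expected_true_hits(spec, nhits):
    """Expected number of true miRNAs among `nhits` predictions,
    following the slide: (expected cp) = spec * nhits."""
    return spec * nhits


cp, fp = 79, 7
nhits = 7664

spec_from_counts = cp / (cp + fp)
print(f"spec from counts: {spec_from_counts:.4f}")
print(f"expected cp with slide value 0.9168: {expected_true_hits(0.9168, nhits):.0f}")
print(f"expected cp with spec from counts:  {expected_true_hits(spec_from_counts, nhits):.0f}")
```

Either way the estimate is on the order of 7000 true miRNAs among the 7664 predictions, under the assumption that the test-set precision carries over to the genome-wide scan.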

  33. Just the tip of an iceberg
      • tiling window expression analysis of mouse: 30 % of the genome is transcribed!
      • mRNA genes are 3% of the truth…

  34. Acknowledgments:
      • Artemis Hatzigeorgiou, Praveen Sethupathy, Molly Megraw, Karol Szafranski (Center for Bioinformatics, School of Medicine, University of Pennsylvania)
      • Yannis Tollis, Panayiota Poïrazi, Anastasis Oulas, Alkiviadis Simeonidis
      • Angelos Bilas, Michalis Flouris (Advanced Computing Systems, Computer Architecture and VLSI Systems Lab, ICS-FORTH)
