
Exploiting Genome Comparison for Gene Structure Prediction in Plants

Michael Brent, Ping Hu. Topics: performance improvement from an explicit intron length model (single gene set and genome set); splice donor model for both GT/AG and GC/AG introns.



Presentation Transcript


  1. Exploiting Genome Comparison for Gene Structure Prediction in Plants. Michael Brent, Ping Hu

  2. Performance Improvement on single gene set: Explicit Intron Length Model

  3. Performance Improvement on genome set: Explicit Intron Length Model

  4. Splice Donor Model for both GT/AG and GC/AG introns
     • GC/AG introns represent:
       • 252/33350 ≈ 0.76% in Arabidopsis
       • 0.7% of total human pre-mRNA introns
       • ~0.6% in C. elegans (Nucleic Acids Research 30(15):3360-3368)
       • 27/2034 = 1.3% in crypto
     • The old model cannot predict GC/AG introns

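  5. Decision Tree Model for GT/GC Donors
     Donor sites, split successively (numbers in parentheses are positions relative to the exon/intron boundary; Ğ, Ă, Ť appear to mark "not G", "not A", "not T"):
     • NNNG(1)T(2)NNNN vs. NNNG(1)C(2)NNNN
     • NNNGTNNG(5)N vs. NNNGTNNĞ(5)N
     • NNG(-1)GTNNGN vs. NNĞ(-1)GTNNGN
     • NA(-2)GGTNNGN vs. NĂ(-2)GGTNNGN
     • NAGGTNNGT(6) vs. NAGGTNNGŤ(6)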
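The splits on this slide can be read as a small decision tree over a 9-base donor window (3 exon bases followed by 6 intron bases). The sketch below is only one possible reading: it assumes that GC donors form a single leaf, that GT donors are split in the order +5, -1, -2, +6, and that the marked letters mean "not that base". The function name `donor_leaf` and the leaf labels are hypothetical, not taken from the slides.

```python
def donor_leaf(window: str) -> str:
    """Assign a 9-base donor window to a leaf of the (assumed) decision tree.

    Index 0..2 are exon positions -3..-1; index 3..8 are intron positions +1..+6.
    """
    w = window.upper()
    if len(w) != 9 or w[3] != "G" or w[4] not in ("T", "C"):
        raise ValueError("expected 9 bases with G at +1 and T or C at +2")

    # First split: GT donor versus GC donor (assumed to be a leaf on its own).
    if w[4] == "C":
        return "GC"

    # Remaining splits apply to GT donors only, in the order shown on the slide.
    if w[7] != "G":                 # position +5
        return "GT / not G(+5)"
    if w[2] != "G":                 # position -1
        return "GT / G(+5) / not G(-1)"
    if w[1] != "A":                 # position -2
        return "GT / G(+5) / G(-1) / not A(-2)"
    if w[8] != "T":                 # position +6
        return "GT / G(+5) / G(-1) / A(-2) / not T(+6)"
    return "GT / G(+5) / G(-1) / A(-2) / T(+6)"


if __name__ == "__main__":
    for site in ("AAGGTAAGT", "CAGGCAAGT", "CAGGTAAAT"):
        print(site, "->", donor_leaf(site))
```

In a model of this kind, each leaf would typically get its own position-specific base frequencies, so that dependencies between positions are captured by the path taken through the tree; whether the authors' model does exactly this is not stated on the slide.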

  6. Arabidopsis Performance Improvement for GT/GC donor Model

  7. GC/GT Donor Sites Prediction

  8. Breakdown of Arabidopsis Predictions
     Total prediction: 30634 / Total annotation: 28581
     • Identical to ann: 15063
     • Not identical to ann: 15588
       • Overlap with confirmed ann: 3246
       • Not overlap with confirmed ann: 12342
         • Not overlap with any ann: 4394
         • Overlap with other ann: 7948
           • Same start/Same stop: 2358
           • Diff start/Same stop: 2770
           • Same start/Diff stop: 1879
           • Diff start/Diff stop: 941
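The counts on this slide appear to form a hierarchy in which each category is split into sub-categories. The sketch below checks that reading arithmetically; the parent/child grouping is an assumption inferred from the numbers, and the names `counts`, `splits`, and `residual` are hypothetical.

```python
# Counts copied from the slide.
counts = {
    "total prediction": 30634,
    "identical": 15063,
    "not identical": 15588,
    "overlap confirmed": 3246,
    "not overlap confirmed": 12342,
    "not overlap any": 4394,
    "overlap other": 7948,
    "same start/same stop": 2358,
    "diff start/same stop": 2770,
    "same start/diff stop": 1879,
    "diff start/diff stop": 941,
}

# Assumed parent -> children grouping (not stated explicitly on the slide).
splits = {
    "total prediction": ["identical", "not identical"],
    "not identical": ["overlap confirmed", "not overlap confirmed"],
    "not overlap confirmed": ["not overlap any", "overlap other"],
    "overlap other": [
        "same start/same stop",
        "diff start/same stop",
        "same start/diff stop",
        "diff start/diff stop",
    ],
}


def residual(parent: str) -> int:
    """Parent count minus the sum of its children (0 means the split is exact)."""
    return counts[parent] - sum(counts[child] for child in splits[parent])


if __name__ == "__main__":
    for parent in splits:
        print(f"{parent}: residual {residual(parent)}")
```

Under this grouping the three lower splits sum exactly, while "identical" plus "not identical" comes to 30651, which is 17 more than the stated total of 30634; the slide does not explain the difference.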

  9. First Experiment Result [gel image: marker lane (M) with ladder bands labeled 2000, 1650, 1000, 850, 650, 500, 400, 300, 200, 100; sample lanes 1-12]

  10. Result : Genomic contamination

  11. Rice Annotation Data Set
     • TIGR data set:
       • Most annotations were based on FGENESH
       • Manually curated contigs: 3171 genes
       • May still have been influenced by FGENESH
     • GenBank cDNA-confirmed data set:
       • Genes with full-length cDNA downloaded from GenBank
       • Total: 1084 mRNA and 443 DNA
       • After filtering out bad genes with an in-frame stop codon: 341 DNA left
     • Limitations of this data set:
       • UTR and intergenic regions are very limited
       • All on the positive strand; small
     • Other data sets are all from automatic pipelines
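The filtering step for the cDNA-confirmed set (dropping genes with an in-frame stop codon) could look roughly like the sketch below. It assumes each gene is available as a spliced coding sequence that starts in frame 0 and may end with a terminal stop codon, which is not checked; the function name `has_in_frame_stop` and the toy sequences are hypothetical and do not come from the authors' pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(cds: str) -> bool:
    """Return True if a stop codon occurs in frame 0 before the last complete codon."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # Skip the final complete codon: it is expected to be the terminal stop.
    for k in range(n_codons - 1):
        if seq[3 * k : 3 * k + 3] in STOP_CODONS:
            return True
    return False


if __name__ == "__main__":
    # Toy sequences for illustration only.
    genes = {
        "clean": "ATGAAACCCTAA",
        "premature_stop": "ATGTAAAAATAA",
        "out_of_frame_only": "ATGATAAAATAA",  # "TAA" appears, but not in frame 0
    }
    kept = [name for name, cds in genes.items() if not has_in_frame_stop(cds)]
    print(kept)
```

Only codons read in frame from the first base are examined, so a stop triplet that straddles two codons does not disqualify a gene.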

  12. Performance Improvement on Rice TIGR Manually Curated Data

  13. Performance Improvement on Rice GenBank cDNA-Confirmed Data
