
Gene Structure and Identification

Gene Structure and Identification. Eukaryotic Genes and Genomes Gene Finding. Complex Genome DNA. ~10% highly repetitive ( Mbp) NOT GENES ~25% moderate repetitive ( Mbp) Some genes ~25% exons and introns ( Mbp) 40%=? Regulatory regions Intergenic regions.

cstarnes

Presentation Transcript


  1. Gene Structure and Identification Eukaryotic Genes and Genomes Gene Finding Chuck Staben

  2. Complex Genome DNA • ~10% highly repetitive ( Mbp) • NOT GENES • ~25% moderately repetitive ( Mbp) • Some genes • ~25% exons and introns ( Mbp) • 40% = ? • Regulatory regions • Intergenic regions • How to tell?

  3. Eukaryotic Gene Expression: Engraved on your Brain!

  4. Yeast: ORFs = genes! What don’t you find this way?
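
The "ORFs = genes" shortcut for yeast can be sketched as a scan for long open reading frames. This is a minimal sketch, not the presenter's method: it assumes the standard stop codons, scans the forward strand only, and uses a hypothetical `min_codons` threshold; `find_orfs` is an illustrative name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the ATG; end is the index just past the
    stop codon. min_codons is an assumed, adjustable length cutoff.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop, ATG included
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs
```

By construction this misses anything that is not one uninterrupted reading frame: genes split by introns, genes shorter than the cutoff, and (in this sketch) genes on the reverse strand.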

  5. Eukaryotes, cont’d • “Large” eukaryotes: intron average = ; exon average = ; promoter/enhancer: where/how arranged?; genome sparse • Fungi: introns; promoter/enhancer: where?; genome dense or sparse?

  6. Intron Prevalence

  7. Intron Size

  8. Exon Size

  9. Fungi: Sew together exons • ORF regions • consensus sequences • domain/polypeptide matches

  10. Exon/Intron Structure • CCACATTgtn(30-10,000)an(5-20)agCAGAA • …_______________... • ...ProHisSerGlu...

  11. Alternative Splice • CCACATTgtn(30-10,000)an(5-20)agcagAA • ...CCACATTAA... • ...ProHis_____ • Rules for alternative splicing?

  12. Codon Bias/Nucleotide Frequency: useful? • Bias=0.97 means______ • Bias=0.03 means______

  13. Consensus Sequences • Promoter sites • Intron/Exon • Transcription Termination/PolyA • Translation initiation • Position Weight Matrices

  14. Finding Functional Sequences • Known Consensus Sequences • Consensus Sequence Generation • Functional Tests

  15. Consensus Inference • Position Weight Matrices • Sequence Logos • Hidden Markov Models • ProfileScan
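
A position weight matrix can be sketched in a few lines. This is a minimal illustration under stated assumptions: the aligned `SITES` are hypothetical, the background is uniform (0.25 per base), and a pseudocount of 1 is an arbitrary choice; `build_pwm`, `score`, and `best_hit` are illustrative names.

```python
import math

# Hypothetical aligned binding sites, all the same length (assumption).
SITES = ["CCATGG", "CTATGG", "CCATGA", "ACATGG"]

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds for a window as wide as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))
```

For example, `best_hit(build_pwm(SITES), "TTTTCCATGGTTTT")` reports the window starting at index 4. Each position is scored independently, so a matrix like this cannot capture correlated positions.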

  16. Translation Initiation Sites

  17. Functional Assay • CCATGG 100 • CCCTGG 0 • CCTTGG 5 • CCATAG 0 • CTATGG 90 • CCATGA 85 • Conservation • Correlated positions
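
One way to read the assay values is to ask how much activity each single-base change costs. The sketch below copies the six sequence/value pairs from the slide; treating the numbers as relative activity with CCATGG as the 100% reference is an assumption, and `position_effects` is an illustrative name.

```python
# Sequence -> assay value, copied from the slide (units assumed to be % activity).
ASSAY = {"CCATGG": 100, "CCCTGG": 0, "CCTTGG": 5,
         "CCATAG": 0, "CTATGG": 90, "CCATGA": 85}

def position_effects(assay, reference="CCATGG"):
    """Map 0-based position -> list of activity losses for single-base variants."""
    effects = {}
    for site, activity in assay.items():
        diffs = [i for i, (ref, alt) in enumerate(zip(reference, site)) if ref != alt]
        if len(diffs) == 1:  # only single substitutions are interpretable here
            effects.setdefault(diffs[0], []).append(assay[reference] - activity)
    return effects
```

On this data the large losses fall at positions 2 and 4 (the A and the G of the central ATG), while changes at positions 1 and 5 cost little.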

  18. Splicing Consensus • Vert: A64G73GTA62A68G84T63…Y80NY80Y87R75AY95…C65AGNN • Fungi: GTRNGT(N){30-1000}CTRAC(N){5-15}YAG • Alternate splicing!??
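
The fungal pattern can be turned into a regular expression by expanding the IUPAC codes. This is a minimal sketch that assumes GTRNGT(N){30-1000}CTRAC(N){5-15}YAG is the consensus the slide labels as fungal; the helper names are illustrative, and the non-greedy quantifiers are a design choice to prefer the shortest candidate intron.

```python
import re

# IUPAC nucleotide codes used in the slide's pattern.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Expand an IUPAC string such as 'GTRNGT' into a regex fragment."""
    return "".join(IUPAC[ch] for ch in pattern)

def fungal_intron_regex(min_len=30, max_len=1000, min_tail=5, max_tail=15):
    """5' site, spacer, branch site, spacer, 3' site."""
    return re.compile(
        iupac_to_regex("GTRNGT")
        + "[ACGT]{%d,%d}?" % (min_len, max_len)
        + iupac_to_regex("CTRAC")
        + "[ACGT]{%d,%d}?" % (min_tail, max_tail)
        + iupac_to_regex("YAG")
    )

def find_candidate_introns(seq):
    """Return (start, end) spans of non-overlapping matches in an uppercase sequence."""
    return [m.span() for m in fungal_intron_regex().finditer(seq.upper())]
```

A consensus match is only a candidate: the pattern occurs by chance in long sequences, and real introns can deviate from it, which is where the false positives and false negatives come from.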

  19. Linguistic Approach • Non-repetitive DNA!! • Long ORF • Similar to known protein • ORF extended by “reasonable” splices • ORF begins with “good” ATG • Promoter/terminator flanks • Looks like a duck...

  20. DATABASE SEARCH • BLASTN • What? • Limitations? • BLASTX/TBLASTX • BLASTX does? • TBLASTX? • www.ncbi.nlm.nih.gov

  21. Protein Database Matches • Great for the “known” • What about the unknown???

  22. Transcript Initiation • Basal Promoters • Enhancers/Silencers/Regulatory Sites • Boundary elements? • Transcription Initiation • Prokaryotes vs Eukaryotes • Organism-to-Organism

  23. Basal Promoter Analysis (Myers and Maniatis, Genes VI, 831) • Diagram: GC, CAAT, and TATA elements upstream of +1 • ATATAA -30 TBP • GGCCAATC -75 CTF/NF1 • GCCACACCC -90 SP1

  24. mRNA processing • Exon/Intron • Alternate splicing • Polyadenylation/Cleavage • Stability

  25. Poly A sites • Metazoans • AATAAA • Yeast: different

  26. Translation • Initiation site • (Frameshifting) • Translational regulatory elements • upstream ORFs • translational enhancers

  27. Translation Sites • Initiate at 5’-ATG • upstream ORF…regulatory • (Frameshifting) • Translation enhancers….

  28. Integrated Genefinding • Linguistic approach (our discussion) • Probabilistic approaches • Discriminant analyses • MARKOV MODELS

  29. Tools-WWW • GRAIL II: integrated gene parsing • GenLang • GENIE • HMMGene (lock ESTs, etc.) • GENSCAN • GENEMARK HMM Probabilities

  30. Hidden Markov Models • Probabilistic Models • Applicable to linear sequences • P(all states)=1, infer probabilities of all states from observed (hidden states unobserved) • Work best when local correlations unimportant • Genefinding, phylogeny, secondary structure, genetic mapping • Work best with “Training Set” • Quantitative probabilities
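
Inferring hidden states from an observed sequence can be illustrated with a toy two-state model decoded by the Viterbi algorithm. Everything numeric here is an assumption made for illustration: the states, the start, transition, and emission probabilities are invented, not trained, and real gene finders use far richer state structures.

```python
import math

# Toy model (all probabilities are invented for illustration).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Most probable hidden-state path for a non-empty ACGT string."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            origin = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            best[s] = (prev[origin] + math.log(TRANS[origin][s])
                       + math.log(EMIT[s][base]))
            pointers[s] = origin
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy numbers, a GC-rich stretch followed by an AT-rich stretch, such as `viterbi("GCGCGCGCATATATAT")`, is labelled coding and then noncoding. In practice the probabilities would be estimated from a training set of annotated sequences.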

  31. Accuracy Assessment • PP = predicted coding • PN = predicted non-coding • AP = “real” positives • AN = “real” negatives • TP = number correct positive • TN = number correct negative • FP = number false positive • FN = number false negative • Sn = TP/AP • Sp = TP/PP • AC = ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) / 2 - 1
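
The three measures can be computed directly from the four counts. A minimal sketch, using AP = TP + FN and PP = TP + FP; the function name `accuracy` is illustrative, and the counts in the usage note are made up.

```python
def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) as defined on the slide.

    Assumes every denominator is non-zero.
    """
    sn = tp / (tp + fn)  # Sn = TP/AP, with AP = TP + FN
    sp = tp / (tp + fp)  # Sp = TP/PP, with PP = TP + FP
    total = sn + sp + tn / (tn + fp) + tn / (tn + fn)
    ac = total / 2 - 1   # rescales the average of the four ratios to [-1, 1]
    return sn, sp, ac
```

With hypothetical counts of 80 true positives, 880 true negatives, 20 false positives, and 20 false negatives, Sn and Sp are both 0.8 and AC is about 0.78. A perfect prediction gives AC = 1. Note that Sp as defined here, TP/PP, is the quantity called precision in other fields.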

  32. Accuracy Levels • DNA Sequence Error Rate!??

  33. NEXT • Regulatory Sequences • Known Consensus Sequences • Consensus Sequence Generation • Functional (Lab) Data • Real examples

  34. Gene Regulatory Sequences • Functional sites • Consensus • Experimental tests • Inferred sites • Transcriptome analysis

  35. Regulatory Sites • Transcript initiation • mRNA processing • Translation sites

  36. Regulatory Factors • lacI, trpR, CAP, araC…. • GAL4, NDT80… • Known from experiment • Infer from genome? • Infer from expression data?

  37. EUKARYOTES • More complex signals • More genes • More dispersed signals • Combinatoric regulation common

  38. Enhancer Elements • Octamer OCT1, OCT2 • Name some… • False +, False -

  39. Consensus Sequence Databases • WWW-based • TFD (transcription factor database) • BCM Search launcher

  40. Transcriptome Analyses • Microarray transcription analysis • MEME analysis of clusters • More later....

  41. Practical Gene Finding • Use ALL tools • Comparative • BLASTN, BLASTX • Predictive: Stitch together a consensus • HMM, GRAIL… • ORF finders • Findpatterns (and WWW pattern searches) • cDNA OR protein OR genetic evidence

  42. FRAMES-aldolase gene

  43. If aldolase is so tough, how do you really do it? Combine DNA sequence with other data!

  44. Infer Promoter, Enhancer Test in cis Genome-cDNA P DNA sequencing Align (GAP) cDNA

  45. Comparative Genomics • Conservation of coding regions • Identification of transcription signals • “words” in common
