Explore the complexity of gene identification in genomic data, covering gene structures, human genes, microbial work, and methodologies like Markov models. Learn about tools like GLIMMER and GENSCAN, and the comparison between the Celera and public genome approaches.
IDENTIFICATION OF GENES IN GENOMIC DATA: TOO MUCH OF A GOOD THING
RELEVANT NUMBERS • E. coli genome • Four million base pairs • 4000 genes • Human genome • Three billion base pairs • 25,000 genes
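The contrast in these numbers is easier to see as gene density. The short Python sketch below only does arithmetic on the figures from the slide; the name `bp_per_gene` is made up for this illustration.

```python
# Figures are the round numbers quoted on the slide.
genomes = {
    "E. coli": {"base_pairs": 4_000_000, "genes": 4_000},
    "Human": {"base_pairs": 3_000_000_000, "genes": 25_000},
}


def bp_per_gene(base_pairs, genes):
    """Average number of genomic base pairs per gene."""
    return base_pairs / genes


density = {name: bp_per_gene(**g) for name, g in genomes.items()}

for name, value in density.items():
    print(f"{name}: about {value:,.0f} bp per gene")
```

On these round numbers the human genome has roughly 120 times more sequence per gene than E. coli, which is one way to read "too much of a good thing".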
GENE STRUCTURES • E. coli - WYSIWYG • Human - Life is complicated • Exons and Introns • Alternative splicing • Alternative starts and stops
MICROBIAL WORK • Not trivial but more direct than human • Many approaches • GRAIL a first attempt • Web based searches • Markov most common
General Markov Model • Probabilistic model • Uses adjacent base(s) to predict current base • Order of model depends on number of bases examined • Sum the log probabilities for each base (equivalent to multiplying the probabilities) • High score wins (recognized as gene)
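A minimal sketch of the scoring idea, assuming a first-order model (one preceding base). Multiplying many small probabilities underflows quickly, so this sketch sums logarithms instead, here as log-odds against a uniform background. The transition table `CODING` holds invented values for illustration only; it is not estimated from real genes.

```python
import math

# Hypothetical P(current base | previous base) for "coding" sequence.
# Each row sums to 1. Values are illustrative assumptions.
CODING = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "G": {"A": 0.30, "C": 0.25, "G": 0.25, "T": 0.20},
    "T": {"A": 0.15, "C": 0.30, "G": 0.35, "T": 0.20},
}
BACKGROUND = 0.25  # uniform background probability for any base


def log_odds_score(seq, model=CODING, background=BACKGROUND):
    """Sum log2(P_model / P_background) over each base given the one before it."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log2(model[prev][cur] / background)
    return score


print(log_odds_score("ATGGCAGGA"))
```

A positive total means the sequence looks more like the training genes than like the background; a higher-order model would condition on more preceding bases in the same way.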
Second Order • P(i | i-2, i-1) • 64 possibilities • kth order needs 4^(k+1) probabilities • DNA actually needs six models (six reading frames), so six times all probabilities • Need known genes to determine real probabilities (train the model)
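The six reading frames come from three codon offsets on each strand. A small sketch of how they can be enumerated; the function names are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to the list of complete codons in that frame."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("ATGAAACCC").items():
    print(label, codons)
```

A frame-aware model keeps a separate probability table per frame, which is where the factor of six comes from.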
Example • 5th order model needs 24,576 probabilities • 4096 hexamers x 6 frames • Do these occur frequently enough in identified coding regions to give good probabilities?
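The counts on this slide follow from 4^(k+1) probabilities per frame for a kth order model, times six frames. A quick check in Python (the function name is hypothetical):

```python
def markov_parameter_count(order, alphabet_size=4, frames=6):
    """Return (probabilities per frame, probabilities across all frames)."""
    per_frame = alphabet_size ** (order + 1)
    return per_frame, per_frame * frames


for k in (2, 5, 8):
    per_frame, total = markov_parameter_count(k)
    print(f"order {k}: {per_frame:,} per frame, {total:,} over six frames")
```

The count grows by a factor of four with each extra order, which is why the amount of training data becomes the limiting factor.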
Scan sequence and determine scores for each region based on probabilities • Scores above threshold declared genes because they are like previously identified genes • Implies a hidden relationship
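The train-then-scan procedure can be sketched as follows. This is a toy under stated assumptions: a second-order model, add-one pseudocounts, a uniform background, a fixed window, and an arbitrary threshold. The training and query sequences are invented, and real gene finders are considerably more elaborate.

```python
import math
from collections import Counter


def train(sequences, order=2, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from known genes."""
    context_counts, kmer_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            kmer_counts[context + seq[i]] += 1
            context_counts[context] += 1

    def prob(context, base):
        # Pseudocounts keep unseen k-mers from getting probability zero.
        return (kmer_counts[context + base] + pseudocount) / (
            context_counts[context] + 4 * pseudocount
        )

    return prob


def window_score(seq, prob, order=2):
    """Log-odds of the window under the model versus a uniform background."""
    return sum(
        math.log2(prob(seq[i - order:i], seq[i]) / 0.25)
        for i in range(order, len(seq))
    )


def scan(seq, prob, window=12, step=3, threshold=2.0, order=2):
    """Return (start, end, score) for every window scoring above the threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        s = window_score(seq[start:start + window], prob, order)
        if s > threshold:
            hits.append((start, start + window, round(s, 2)))
    return hits


# Toy data: one made-up "known gene" and a query with a gene-like middle.
prob = train(["ATGGCTGCTGCTGCTGCTTAA"])
query = "T" * 12 + "GCT" * 4 + "T" * 12
hits = scan(query, prob)
print(hits)
```

Windows overlapping the middle of the query score high only because they resemble the training sequence, which is the sense in which a high score means "like previously identified genes".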
GLIMMER and GLIMMERM • Use interpolated Markov models • If enough oligomers are available in the training data, will score up to 8th order • M version works for small eukaryotes • Plasmodium falciparum • Arabidopsis thaliana • Adds information about splice sites
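The idea behind interpolation is to lean on long contexts when they were seen often in training and fall back toward shorter contexts when they were not. The sketch below uses a simple count-based weight, n / (n + trust). That weighting is an assumption chosen for illustration and is not GLIMMER's actual scheme; all names are hypothetical.

```python
from collections import Counter


def count_contexts(sequences, max_order):
    """Count (context, base) pairs for every context length from 0 to max_order."""
    counts, totals = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(max_order + 1):
                if i - k < 0:
                    break
                context = seq[i - k:i]
                counts[(context, seq[i])] += 1
                totals[context] += 1
    return counts, totals


def interpolated_prob(context, base, counts, totals, trust=10.0):
    """Blend estimates from order 0 up to len(context), weighted by how often each context was seen."""
    p = 0.25  # uniform fallback when nothing is known
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]
        n = totals[ctx]
        if n == 0:
            break  # longer contexts are unseen too, so stop here
        weight = n / (n + trust)  # more observations, more trust
        p = weight * counts[(ctx, base)] / n + (1 - weight) * p
    return p


counts, totals = count_contexts(["GCTGCTGCTGCT"], max_order=2)
for b in "ACGT":
    print(b, round(interpolated_prob("GC", b, counts, totals), 3))
```

Because each step is a weighted average of two probability distributions, the four base probabilities still sum to one at every order.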
GENSCAN • Combines multiple approaches • 5th order Markov model • Staden weight matrix to model cis sites • Poly A site • TATA and INR • CAP site • Translation termination sites • Maximal Dependence Decomposition • Donor and acceptor splice site modeling
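A weight matrix scores a candidate site position by position, using base frequencies taken from aligned known sites. The sketch below is a generic log-odds weight matrix with pseudocounts; the TATA-like training sites are invented, and it does not reproduce the program's own matrices or its Maximal Dependence Decomposition.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0):
    """Build per-position log2-odds weights from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / 0.25)
            for b in "ACGT"
        })
    return matrix


def score_site(candidate, matrix):
    """Sum the weight of each base at its position."""
    return sum(col[b] for b, col in zip(candidate, matrix))


def best_site(seq, matrix):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(matrix)
    return max(
        (score_site(seq[i:i + width], matrix), i)
        for i in range(len(seq) - width + 1)
    )


# Hypothetical aligned sites, for illustration only.
matrix = build_weight_matrix(["TATAAA", "TATATA", "TATAAA", "TATAAG"])
score, start = best_site("GGCGCTATAAAGGCGC", matrix)
print(start, round(score, 2))
```

A plain weight matrix treats positions as independent, which is the limitation that dependence-aware splice site models are meant to address.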
A Tale of Two Genomes • Private genome: Celera • Public genome: Rest of the world • Draft only (except chromosomes 21 and 22)
Celera • Otto-Refseq • Gold standard – full length, curated cDNAs • Otto-Homology • ESTs • cDNAs • Mouse-human genomic similarities • Known proteins
Celera • De novo • GENSCAN • GRAIL • FGENESH • Manual curation
Celera results • De novo prediction of 76,400 genes (58,000 appeared to be new) • 21,350 supported by some other evidence • Otto homologies identify 17,764 other genes • Total is 39,114
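The total on this slide is the supported de novo predictions plus the Otto homology genes. A quick arithmetic check using only the numbers above (variable names are made up for this example):

```python
de_novo_predictions = 76_400
supported_de_novo = 21_350
otto_homology = 17_764

total = supported_de_novo + otto_homology
supported_fraction = supported_de_novo / de_novo_predictions

print(f"total genes: {total:,}")
print(f"de novo predictions with other support: {supported_fraction:.1%}")
```

On these figures, a bit more than a quarter of the de novo predictions had supporting evidence.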
Public • De novo using Ensembl • Merge predictions with predictions from GENIE • Merge results with known genes in databases • Eliminate bacterial sequences