1 / 42

Gene Recognition

Gene Recognition. Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov. Reading. GENSCAN EasyGene SLAM Twinscan Optional: Chris Burge’s Thesis. DNA. transcription. RNA. translation. Protein. Gene expression. CCTGAGCCAACTATTGATGAA. CCU GAG CCA ACU AUU GAU GAA.

frey
Télécharger la présentation

Gene Recognition

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

  2. Reading • GENSCAN • EasyGene • SLAM • Twinscan Optional: Chris Burge’s Thesis

  3. DNA transcription RNA translation Protein Gene expression CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE

  4. Gene structure intron1 intron2 exon2 exon3 exon1 transcription splicing translation exon = coding intron = non-coding

  5. Exon 3 Exon 1 Exon 2 Intron 1 Intron 2 5’ 3’ Stop codon TAG/TGA/TAA Start codon ATG Finding genes Splice sites

  6. Approaches to gene finding • Homology • BLAST, Procrustes. • Ab initio • Genscan, Genie, GeneID. • Hybrids • GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

  7. HMMs for single species gene finding: Generalized HMMs

  8. T A A T A T G T C C A C G G G T A T T G A G C A T T G T A C A C G G G G T A T T G A G C A T G T A A T G A A Exon1 Exon2 Exon3 GHMM for gene finding This is a HMM similar to the GENSCAN HMM duration

  9. Elements of the HMM • Duration of states – length distributions of • Exons • Introns • Signals at state transitions • ATG • Stop Codon TAG/TGA/TAA • Exon/Intron and Intron/Exon Splice Sites • Emissions • Coding potential and frame at exons • Intron emissions

  10. 1. Durations of states

  11. Better way to do it: negative binomial • EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A • Negative binomial with n = 3

  12. 2. Splice Site Detection Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996) (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

  13. Donor site 5’ 3’ Position % 2. Splice Site Detection

  14. 2. Splice Site Detection • WMM: weight matrix model = PSSM (Staden 1984) • WAM: weight array model = 1st order Markov (Zhang & Marr 1993) • MDD: maximal dependence decomposition (Burge & Karlin 1997) • Decision-tree algorithm to take pairwise dependencies into account • For each position I, calculate Si = ji2(Ci, Xj) • Choose i* such that Si* is maximal and partition into two subsets, until • No significant dependencies left, or • Not enough sequences in subset • Train separate WMM models for each subset G5G-1 G5G-1 A2 G5G-1 A2U6 G5 All donor splice sites not G5 G5 not G-1 G5G-1 not A2 G5G-1A2 not U6

  15. atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

  16. 3. Coding potential Amino Acid SLC DNA codons Isoleucine I ATT, ATC, ATA Leucine L CTT, CTC, CTA, CTG, TTA, TTG Valine V GTT, GTC, GTA, GTG Phenylalanine F TTT, TTC Methionine M ATG Cysteine C TGT, TGC Alanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCG Threonine T ACT, ACC, ACA, ACG Serine S TCT, TCC, TCA, TCG, AGT, AGC Tyrosine Y TAT, TAC Tryptophan W TGG Glutamine Q CAA, CAG Asparagine N AAT, AAC Histidine H CAT, CAC Glutamic acid E GAA, GAG Aspartic acid D GAT, GAC Lysine K AAA, AAG Arginine R CGT, CGC, CGA, CGG, AGA, AGG Stop codons Stop TAA, TAG, TGA

  17. 4. GENSCAN’s hidden weapon • C+G content is correlated with: • Gene content (+) • Mean exon length (+) • Mean intron length (–) • These quantities affect parameters of model • Solution • Train parameters of model in four different C+G content ranges!

  18. Results of GENSCAN • On the initial test dataset (Burset & Guigo) • 80% exact exon detection • 10% partial exons • 10% wrong exons • In general • Most accurate single sequence-based gene predictor • In practice it overpredicts human genes by ~2x

  19. Comparison-based Methods

  20. Comparison of 1196 orthologous genes(Makalowski et al., 1996) • Sequence identity between genes in human/mouse • exons: 84.6% • protein: 85.4% • introns: 35% • 5’ UTRs: 67% • 3’ UTRs: 69% • 27 proteins were 100% identical.

  21. Human Mouse Human-mouse homology

  22. Not always: HoxA human-mouse

  23. 50 . : . : . : . : . : 247 GGTGAGGTCGAGGACCCTGCA CGGAGCTGTATGGAGGGCA AGAGC |: || ||||: |||| --:|| ||| |::| |||---|||| 368 GAGTCGGGGGAGGGGGCTGCTGTTGGCTCTGGACAGCTTGCATTGAGAGG 100 . : . : . : . : . : 292 TTC CTACAGAAAAGTCCCAGCAAGGAGCCACACTTCACTG |||----------|| | |::| |: ||||::|:||:-|| ||:| | 418 TTCTGGCTACGCTCTCCCTTAGGGACTGAGCAGAGGGCT CAGGTCGCGG 150 . : . : . : . : . : 332 ATGTCGAGGGGAAGACATCATTCGGGATGTCAGTG ---------------||||||||||||||||||||||:|||||||||||| 467 TGGGAGATGAGGCCAATGTCGAGGGGAAGACATCATTTGGGATGTCAGTG 200 . : . : . : . : . : 367 TTCAACCTCAGCAATGCCATCATGGGCAGCGGCATCCTGGGACTCGCCTA |||||:||||||||:||||||||||||||:|| ||:|||||:|||||||| 517 TTCAATCTCAGCAACGCCATCATGGGCAGTGGAATTCTGGGGCTCGCCTA Alignment

  24. Twinscan • Twinscan is an augmented version of the Gencscan HMM. I E transitions duration emissions ACUAUACAGACAUAUAUCAU

  25. Twinscan Algorithm • Align the two sequences (eg. from human and mouse) • Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

  26. Twinscan Algorithm • Run Viterbi using emissions ek(b) where b  { A-, A:, A|, …, T| } Note: Emission distributions ek(b) estimated from real genes from human/mouse eI(x|) < eE(x|): matches favored in exons eI(x-) > eE(x-): gaps (and mismatches) favored in introns

  27. Example Human: ACGGCGACUGUGCACGU Mouse: ACUGUGAC GUGCACUU Alignment: ||:|:|||-||||||:| Input to Twinscan HMM: A| C| G: G| C: G| A| C| U- G| U| G| C| A| C| G: U| Recall, eE(A|) > eI(A|) eE(A-) < eI(A-) Likely exon

  28. HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

  29. 1 - 2 M P(xi, yj) 1-  - 2 1-  - 2   I P(xi) J P(yj)     A Pair HMM for alignments BEGIN I M J END

  30. Exon3 Exon1 Exon2 Intron1 Intron2 5’ 3’ CNS CNS CNS Cross-species gene finding [human] [mouse] Exon = coding Intron = non-coding CNS = conserved non-coding

  31. Generalized Pair HMMs

  32. Ingredients in exon scores • Splice site detection (VLMM) • Length distribution (generalized) • Coding potential (codon freq. tables) • GC-stratification

  33. Exon GPHMM 1.Choose exon lengths (d,e). 2.Generate alignment of length d+e. e d

  34. The SLAM hidden Markov model

  35. TBLASTX SLAM SLAM CNS SGP-2 VISTA Twinscan RefSeq Genscan Example: HoxA2 and HoxA3

  36. length seq1 no. states length seq2 max duration Computational complexity

  37. Approximate alignment Reduces TU -factor to hT

  38. Measuring Performance • Definition: • Sensitivity (SN): (# correctly predicted)/(# true) • Specificity (SP): (# correctly predicted)/(# total predicted)

  39. Measuring Performance

More Related