1 / 53

Applications of HMMs

Applications of HMMs. Yves Moreau 2003-2004. Overview. Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes vs. eukaryotes Gene prediction by homology GENSCAN. GGWWRGdy.ggkkqLWFPSNYV IGWLNGynettgerGDFPGTYV PNWWEGql..nnrrGIFPSNYV

ania
Télécharger la présentation

Applications of HMMs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Applications of HMMs Yves Moreau 2003-2004

  2. Overview • Profile HMMs • Estimation • Database search • Alignment • Gene finding • Elements of gene prediction • Prokaryotes vs. eukaryotes • Gene prediction by homology • GENSCAN

  3. GGWWRGdy.ggkkqLWFPSNYV IGWLNGynettgerGDFPGTYV PNWWEGql..nnrrGIFPSNYV DEWWQArr..deqiGIVPSK-- GEWWKAqs..tgqeGFIPFNFV GDWWLArs..sgqtGYIPSNYV GDWWDAel..kgrrGKVPSNYL -DWWEArslssghrGYVPSNYV GDWWYArslitnseGYIPSTYV GEWWKArslatrkeGYIPSNYV GDWWLArslvtgreGYVPSNFV GEWWKAkslsskreGFIPSNYV GEWCEAgt.kngq.GWVPSNYI SDWWRVvnlttrqeGLIPLNFV LPWWRArd.kngqeGYIPSNYI RDWWEFrsktvytpGYYESGYV EHWWKVkd.algnvGYIPSNYV IHWWRVqd.rngheGYVPSSYL KDWWKVev..ndrqGFVPAAYV Profile HMM • Hidden Markov model for the modeling of protein families and for multiple alignment • Example • Part of the alignment of the SH3 domain • Two conserved regions separated by a variable region

  4. Bgn End Profile HMMs • Hidden Markov Models for multiple alignments • Match, insert, and delete states Deletion Insertion Match

  5. Silent deletion states • Deletions could be modeled by shortcut jumps between states • Problem: number of transitions grows quadratically • Other solution: use parallel states that do not produce any symbol (silent state)

  6. GGWWRGdy.ggkkqLWFPSNYV IGWLNGynettgerGDFPGTYV PNWWEGql..nnrrGIFPSNYV DEWWQArr..deqiGIVPSK-- GEWWKAqs..tgqeGFIPFNFV GDWWLArs..sgqtGYIPSNYV GDWWDAel..kgrrGKVPSNYL -DWWEArslssghrGYVPSNYV GDWWYArslitnseGYIPSTYV GEWWKArslatrkeGYIPSNYV GDWWLArslvtgreGYVPSNFV GEWWKAkslsskreGFIPSNYV GEWCEAgt.kngq.GWVPSNYI SDWWRVvnlttrqeGLIPLNFV LPWWRArd.kngqeGYIPSNYI RDWWEFrsktvytpGYYESGYV EHWWKVkd.algnvGYIPSNYV IHWWRVqd.rngheGYVPSSYL KDWWKVev..ndrqGFVPAAYV Corresponding profile HMM .85 HMM from multiple alignment Multiple alignment (+ conserved columns) Parameter estimation = estimation with known paths

  7. .33 .85 New profile HMM Pseudocounts • Zero probabilities in HMM causes the rejection of sequences containing previously unseen residues • To avoid this problem, add pseudocounts (add extra counts as if prior data was available)

  8. Database search with profile HMM • The estimated model can be used to detect new members of the protein family in a sequence database (more sensitive than PSI-BLAST) • For each sequence in the database, we compute P(x, p* | M) (Viterbi) or P(x | M) (forward-backward) • In practice we work with log-odds (w.r.t. the random model P(x | R))

  9. Alignment to profile HMM • Through Viterbi (search for the best alignment path), we can align sequences w.r.t a profile HMM • Training sequences • Database matches

  10. Multiple alignment with profile HMM • If the sequences are not aligned, it is possible to train a profile HMM to align them • Initialization: choose the length of the profile HMM • Length of profile HMM is number of match states  sequence length • Training: estimate the model via Viterbi training or Baum-Welch training • Heuristics to avoid local minimas • Multiple alignment: use Viterbi decoding to align sequences

  11. Extensions • More sophisticated pseudocounts are possible • Dirichlet mixtures • Different types of local alignments can be done with HMMs • Methods are available to weigh sequences in function of evolutionary distances

  12. Protein families • PFAM • http://www.sanger.ac.uk/Software/Pfam/search.shtml • Collection of protein families and protein domains • Provides multiple alignment of the protein families for the domains • Provides the domain organization of proteins • Provides profile HMMs of the domains

  13. Software for profile HMMs • SAM: University of California Santa Cruz • http://www.cse.ucsc.edu/research/compbio/sam.html • Web service: http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html (takes time) • Hmmer (‘hammer’): Washington University, St. Louis • http://genome.wustl.edu/eddy/hmmer.html

  14. Gene finding

  15. Overview • Elements of gene prediction • Prokaryotes vs. eukaryotes • Gene prediction by homology • GENSCAN

  16. DNA makes RNA makes proteins

  17. Evidence for gene prediction • Sources of evidence (positive and negative) • Sequence similarity to known genes (e.g., found by BLASTX) • Statistical measure of codon bias • Template matches to functional sites (e.g., splice site) • Similarity to features not likely to overlap coding sequence (e.g., Alu repeats) • The structure must respect the biological grammar (promoter, exon, intro, ...)

  18. Search by signal vs. search by content • Search by signal • Detect short signals in the genome • E.g., splice site, signal peptide, glycosylation site • Neural networks can be useful here • Search by content • Detect extended regions in the genome • e.g., coding regions, CpG islands • Hidden Markov Models are useful here • Gene finding algorithms combine both

  19. Probabilistic prediction vs. homology • Hidden Markov Models can be used to predict genes • Homology to a known gene is also a strong method for detecting genes • More and more gene prediction packages combine both approaches

  20. Search by signal vs. content

  21. Transcription start and stop -35 region TATA box Translation start and stop Open Reading Frames Shine-Delgarno motif Start ATG/GTG Stop TAA/TAG/TGA Stem-loops Operon Signals in prokaryotes

  22. Problems for prokaryotes • Short genes are hard to detect • Operons • Overlapping genes

  23. Transcription Promotor/enhancer/silencer TATA box Introns/exons Donor/acceptor/branch PolyA Repeats Alu, satellites CpG islands Cap/CCAAT&GC boxes Translation 5’ and 3’ UTR Kozak consensus Start ATG Stop TAA/TAG/TGA Signals in eukaryotes

  24. Open reading frames • Translate the sequence into the six possible reading frames • Check for start and stop codons

  25. Codon bias • In coding sequences, genomes have specific biases for the use of codons encoding the same amino acid

  26. Coding potential • Most coding potentials are based on analysis of codon usage • The HMMs keeps track of some kind of average coding potential around each position • The increase and decrease of the coding potential will “push” the HMM in and out of the exons

  27. Promoter region • Promoter region contains the elements that control the expression of the gene • Prediction of the promoter region (e.g., prediction of the TATA-box) is difficult

  28. Intron-exon splicing • Consensus • 5’ Donor • (A,C)AG/GT(A,G)AGT • 3’ Acceptor • TTTTTNCAG/GCCCCC • Branch • CT(G,A)A(C,T) • Neural networks can predict splice sites; they can detect complex correlation between positions in a functional site

  29. Gene prediction by homology

  30. Gene prediction by homology • Coding regions evolve more slowly than noncoding ones (conserved by natural selection because of their functional role) • Not only the protein sequence but also the gene structure can be conserved • Use standard homology methods • Gene syntax must be respected

  31. Gene prediction by homology

  32. Procrustes • Find potentially related with BLASTX (= model sequences) • Find all possible blocks (exons) on the basis of acceptor/donor location • Look which blocks can be aligned with model sequences • Look for best alignment of blocks with the query sequence

  33. Gene prediction by homology • Advantages • Recognition of short exons and atypical exons • Correct assembly of complex genes (> 10 exons) • Disadvantages • Genes without known homologs are missed • Good homologs necessary for the prediction of the gene structure • Very sensitive to sequencing errors

  34. GENSCAN

  35. GENSCAN • GENSCAN was used for the annotation of the human genome in the Human Genome Project • Gene prediction with Hidden Semi-Markov Models • Different models in function of GC-content (<43% G+C, 43-50%, 50-57%, >57%)

  36. Typical gene structure

  37. Signal: human splice site • 5’ splice site • 3’ splice site

  38. Hidden semi-Markov model

  39. Example • Nodes of HSMM • Position-weight matrix (signal) • Higher-order position-weight matrix • HMM (content)

  40. Architecture of GENSCAN

  41. Training of HSMM Viterbi algorithm for HSMMs Viterbi algorithm

  42. Gene structure prediction • Current performance on exon prediction is acceptable • However, grouping the correct exons into the genes is still problematic • In many cases, a significant proportion of the predicted genes will not be correct

  43. CpG islands • In mammalians, CpG islands have higher G+C and CG dinucleotide content than the rest of the DNA • CpG islands arise in active regions where no deactivation by methylation takes place (CG dinucleotides in methylated regions disappear by deamination) • CpG islands may be used as gene markers in mammalians

  44. Repeats • Repeats make up a large part of the human genome • Alu repeats • Long Interspersed Elements (LINEs) • Short Interspersed Elements (SINEs) • Important to mask repeats when searching for genes

  45. Promoter, enhancers, and silencers

  46. Promotor, enhancers en silencers

  47. Polyadenylation signal • Polyadenylation (cleavage of pre-mRNA 3' end and synthesis of poly-(A) tract) is a very important early step of pre-mRNA processing • The most well-known signal involved in this process is AATAAA, located 15-20 nucleotides upstream from the poly-(A) site (site of cleavage) • Real AATAAA signals can differ from AATAAA consensus sequence. The most frequent natural variant, ATTAAA, is nearly as active as the canonical sequence.

  48. Problem: alternative splicing

More Related