
Developing and Using Special Purpose Hidden Markov Model Databases




  1. Developing and Using Special Purpose Hidden Markov Model Databases Martin Gollery Associate Director of Bioinformatics University of Nevada, Reno Mgollery@unr.edu

  2. Today’s Tutorial • Instructor: Martin Gollery • Associate Director of Bioinformatics, University of Nevada, Reno • Consultant to several organizations • Formerly with TimeLogic • Developed several HMM databases

  3. Hidden Markov Models • What HMMs are • Which HMM programs are commonly used • What HMM databases are available • Why you would use one DB over another • Integrated resources: InterPro and more • How you can build your own HMM DB • Problems with building your own • Live demonstration

  4. Hidden Markov Models-What are they, anyway? • Statistical description of a protein family's consensus sequence • Conserved regions receive highest scores • Can be seen as a Finite State Machine

  5. Representation of Family Members • yciH KDGII • ZyciH KDGVI • VCA0570 KDGDI • HI1225 KNGII • sll0546 KEDCV

  6. Representation of gaps in Family Members • yciH KDGII • ZyciH KDGVI • VCA0570 KDGDI • HI1225 KNGII • sll0546 KED-V

  7. For maximum sensitivity: no residue at any position should have a zero probability, even if it was not seen in the training data.
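One simple way to guarantee non-zero probabilities is to add a pseudocount to every residue before normalizing. The sketch below applies add-one smoothing to the second column of the five family members shown earlier (D, D, D, N, E). This is an illustration only: HMMER itself uses Dirichlet mixture priors by default, not flat pseudocounts, and `column_probabilities` is a hypothetical helper name.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one alignment column.

    Assumption: a flat pseudocount is added to every amino acid, so that
    residues never observed in the training data still get a small,
    non-zero probability. Gap characters are not handled here.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}

# Second column of yciH, ZyciH, VCA0570, HI1225 and sll0546.
probs = column_probabilities("DDDNE")
print(probs["D"], probs["N"], probs["W"])
```

Tryptophan (W) never appears in the column but still receives a small probability, so a sequence carrying W at this position is penalized instead of being ruled out entirely.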

  8. Start with an MSA…
     CLUSTAL W (1.7) multiple sequence alignment
     yciH      KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
     ZyciH     KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
     VCA0570   KDGDIEIQGDVRDQLKTLLESKGHKVKLAGG
     HI1225    KNGIIEIQGEKRDLLKQLLEQKGFKVKLSGG
     sll0546   KEDCVEIQGDQREKILAYLLKQGYKAKISGG
     PA4840    KDGVVEIQGEHVELLIDELLKRGFKAKKSGG
     AF0914    KNGVIELQGNHVNRVKELLIKKGFNPERIKT
               *:. :*:**: : : * :* : :

  9. Hidden Markov Models
     HMMER2.0
     NAME  example2
     DESC  Small example for demonstration purposes
     LENG  31
     ALPH  Amino
     COM   hmmbuild example2 example2.aln
     NSEQ  7
     DATE  Wed Jan 08 13:33:06 2003
     HMM       A      C      D      E      F      G      H      I      K …
       1   -3217  -3413  -3082  -2664  -4291  -3257  -2104  -4231   3883 …
       2   -1938  -3859   2747   1592  -4024  -1857  -1206  -3953  -1455 …
       3   -2160  -3144   1834   -953  -4284   3247  -2013  -4362  -2365 …
       4   -1255   2750    436  -2789  -1273  -2972  -2049   1510  -2543 …
       5   -2035  -1558  -4660  -4320  -2085  -4409  -4229   3081  -4224 …
       6   -3264  -3765  -1447   3822  -4535  -2948  -2636  -4814  -2810 …
       7   -2423  -1951  -4843  -4395  -1156  -4544  -3680   3291  -4151 …
       8   -3220  -3396  -2530  -2667  -3851  -3171  -2735  -4442  -2277 …
       9   -3196  -3194  -3915  -4259  -4867   3789  -4005  -5414  -4591 …
      10   -1923  -3837   2743   2134  -4005  -1854  -1196  -3929  -1434 …
      11    -999  -2164   -952   -353  -2483  -1909   3321  -2139   1730 …
      12   -1629  -1909  -2827  -2102  -2279  -2588  -1442  -1012   -488 …
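The header of such a file is a list of tag and value pairs, which makes it easy to inspect a model before searching with it. The following is a minimal sketch that reads the tags shown on this slide; `parse_hmm_header` is a hypothetical helper, and it assumes the header ends at the line that begins with the `HMM` tag.

```python
def parse_hmm_header(text):
    """Collect tag/value pairs from the header of a HMMER2 ASCII model.

    Assumption: every header line is "<TAG> <value>", and the header
    stops at the "HMM" line that labels the score columns.
    """
    header = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue
        if parts[0] == "HMM":
            break  # start of the score table
        if len(parts) == 2:
            header[parts[0]] = parts[1].strip()
    return header

example = """HMMER2.0
NAME  example2
DESC  Small example for demonstration purposes
LENG  31
ALPH  Amino
COM   hmmbuild example2 example2.aln
NSEQ  7
DATE  Wed Jan 08 13:33:06 2003
HMM        A      C      D
"""

header = parse_hmm_header(example)
print(header["NAME"], header["LENG"], header["NSEQ"])
```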

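  10. Emission Probabilities • What is the likelihood that sequence X was emitted by HMM Y? • Likelihood is calculated by multiplying the emission probability of each residue at each position by each of the transition probabilities along the path • In practice the scores are stored as logarithms, so the multiplication becomes an addition of per-position and per-transition scores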
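In log space this calculation is a running sum. The sketch below adds the match-state scores from the HMMER2.0 excerpt shown earlier for a five-residue query. It assumes the integers in that file are log2-odds scores scaled by 1000, and it ignores transition scores, insert states and all positions after the fifth, so it is a simplification and not HMMER's scoring routine. Only the D, G, I and K columns of the excerpt are copied, and `score_bits` is a hypothetical helper name.

```python
# Match-state scores for model positions 1-5, copied from the excerpt
# (columns D, G, I and K only).
# Assumption: values are log2-odds scores multiplied by 1000.
MATCH_SCORES = [
    {"D": -3082, "G": -3257, "I": -4231, "K": 3883},
    {"D": 2747, "G": -1857, "I": -3953, "K": -1455},
    {"D": 1834, "G": 3247, "I": -4362, "K": -2365},
    {"D": 436, "G": -2972, "I": 1510, "K": -2543},
    {"D": -4660, "G": -4409, "I": 3081, "K": -4224},
]

def score_bits(sequence):
    """Sum match-state scores along an ungapped path, in bits.

    Transition scores are deliberately left out to keep the sketch short.
    """
    total = sum(column[residue] for column, residue in zip(MATCH_SCORES, sequence))
    return total / 1000.0

print(score_bits("KDGII"))  # 14.468
print(score_bits("KKKKK"))  # -6.704
```

The family-like query scores well above zero, while a sequence that only matches the conserved K at position 1 is pushed negative by the other four positions.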

  11. Plan7 from Outer Space (Well, from St. Louis, anyway!)

  12. HMMs vs. BLAST • Position-specific scoring vs. a general substitution matrix • Example: dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG compared with KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG has 80% BLAST similarity, but misses highly conserved regions • HMM scoring emphasizes important locations • Clearer score cutoffs • However, it is MUCH slower!

  13. HMM programs • HMMer - Sean Eddy, Wash U • SAM - Haussler, UCSC • Wise tools - Birney, EBI • SledgeHMMer - Subramaniam, SDSC • Meta-MEME - Noble & Bailey • PSI-BLAST - NCBI • SPSpfam - Southwest Parallel Software • Ldhmmer - Logical Depth • DeCypherHMM - TimeLogic

  14. What exactly do you want? • Are you searching thousands of sequences with one or a few models? • Use hmmsearch • Searching a few sequences with thousands of models? • Use hmmpfam • Thousands of sequences vs. Thousands of models? • Use an accelerator, if you do it very often
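The choice above can be written down as a small helper that builds the command line. This is a sketch only: both HMMER2 programs take the HMM file first and the sequence file second, but the cutoff of 1000 for "thousands", the file names and the function name `build_search_command` are assumptions made for illustration. The function returns the argument list without running anything.

```python
def build_search_command(hmm_file, seq_file, n_models, n_sequences, many=1000):
    """Pick hmmsearch or hmmpfam following the rule of thumb on this slide.

    Returns an argv list, or None when both sides are large and an
    accelerator is the better option. `many` is an assumed cutoff.
    """
    if n_models >= many and n_sequences >= many:
        return None  # thousands vs. thousands: consider an accelerator
    if n_models > n_sequences:
        program = "hmmpfam"    # few sequences, many models
    else:
        program = "hmmsearch"  # many sequences, few models
    return [program, hmm_file, seq_file]

# Hypothetical file names.
print(build_search_command("kinase.hmm", "proteome.fasta", 1, 30000))
print(build_search_command("Pfam_ls", "query.fasta", 7868, 5))
print(build_search_command("Pfam_ls", "proteome.fasta", 7868, 30000))
```

The returned list can be passed to `subprocess.run` on a machine where HMMER2 is installed.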

  15. HMM databases • PFAM • TIGRFAM • Superfamily • SMART • Panther • PRED-GPCR

  16. HMM databases at the CFB • COGfam • KinFam • HydroHMMer • NVfam-pro • NVfam-arc • NVfam-fun • NVfam-pln

  17. PFAM • From Sanger, WashU, KI, INRA • Version 17 has 7868 families • Most widely used HMM database • Good annotation team

  18. PFAM • PFAM-A is hand curated • From high quality multiple Alignments • PFAM-B is built automatically from ProDom • Generated using the Domainer algorithm • ProDom is built from SP/TREMBL

  19. PFAM • Pfam-ls = global alignments with respect to the model, so a match must span the whole model • Pfam-fs = local alignments (fragment mode), so that matches may include only part of the model • Both the -ls and -fs versions are local with respect to the sequence

  20. PFAM • Note ‘type’ annotation • Labeled TP • Family • Domain • Repeat • Motif

  21. TIGRFAMs • Available at (www.tigr.org/TIGRFAMs/) • Organized by functional role • Equivalogs: a set of homologous proteins that are conserved with respect to function since their last common ancestor • Equivalog domains: domains of conserved function

  22. TIGRFAMs • 2453 models in release 4.1 • Complementary to PFAM, so run both • Part of the Comprehensive Microbial Resource (CMR)

  23. TIGRFAMs TIGRfam and PFAM alignments for Pyruvate carboxylase. The thin line represents the sequence. The bars represent hit regions.

  24. SuperFamily • By Julian Gough, formerly MRC, now Riken GSC • www.supfam.org • Provides structural (and hence implied functional) assignments to protein sequences at the superfamily level • Built from SCOP (Structural Classification of Proteins) database, which is built from PDB • Available in HMMer, SAM, and PSI-BLAST formats

  25. SuperFamily • 1447 SCOP Superfamilies • Each represented by a group of HMMs • Over 8500 models total • Table provides comparison to GO, Interpro, PFAM

  26. SMART • Simple Modular Architecture Research Tool • Version 3.4 contains 654 HMMs • Emphasis on mobile eukaryotic domains • smart.embl-heidelberg.de • Annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues

  27. SMART • Use for signaling domains or extracellular domains • Normal and Genomic mode

  28. PRED-GPCR • Papasaikas et al, U of Athens • 265 HMMs in 67 GPCR families • Based on TiPs Pharmacological classification. • Filters with CAST • signatures regularly updated • Entire system redone each year

  29. PRED-GPCR webserver

  30. Panther • Protein ANalysis THrough Evolutionary Relationships • Family and subfamily: families are evolutionarily related proteins; subfamilies are related proteins with the same function • Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase • Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis. • Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.

  31. Panther • (Thomas et al., Genome Research 2003; Mi et al. NAR 2005) • 6683 protein families • 31,705 functionally distinct protein subfamilies.

  32. Panther • Due to the size, searches could be slow • First, BLAST against consensus seqs • Then, search against models represented by those hits • With an accelerator, you don’t have to do that…
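The two-step strategy is a general prefilter pattern: a cheap comparison against one consensus sequence per family narrows the field, and the expensive model search runs only on the survivors. The sketch below shows the control flow with toy stand-ins. Shared k-mer counting replaces both BLAST and the HMM search here, and the family names and sequences are invented for the example, so none of this reflects Panther's actual pipeline.

```python
def two_stage_search(query, families, prefilter, full_score, keep=2):
    """Rank only the families whose consensus passes a cheap prefilter."""
    # Stage 1: cheap comparison against one consensus sequence per family.
    ranked = sorted(families, key=lambda fam: prefilter(query, families[fam]),
                    reverse=True)
    shortlist = ranked[:keep]
    # Stage 2: expensive scoring restricted to the shortlist.
    return sorted(((full_score(query, fam), fam) for fam in shortlist),
                  reverse=True)

def shared_kmers(a, b, k=3):
    """Toy similarity: number of distinct k-mers the two strings share."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

# Hypothetical consensus sequences, one per family.
families = {
    "famA": "KDGVIEIQGDKRDLLKSLLEAKG",
    "famB": "MSTNPKPQRKTKRNTNRRPQDVK",
    "famC": "KNGIIEIQGEKRDLLKQLLEQKG",
}

calls = []  # records which families reach the expensive stage

def toy_full_score(query, fam):
    calls.append(fam)
    return shared_kmers(query, families[fam], k=2)

hits = two_stage_search("KDGVIEIQGDKRDLL", families, shared_kmers, toy_full_score)
print(hits)
print(calls)
```

Only two of the three families reach the second stage, which is where the time saving comes from when the database holds tens of thousands of models.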

  33. Panther • So, how does it perform? • I took 3451 Arabidopsis proteins with no hit to PFAM, Superfamily, SMART or TIGRfam • Ran them against Panther • Found 160 significant hits!

  34. COG-HMMs • Clusters of Orthologous Groups of proteins • www.ncbi.nlm.nih.gov/cog/ • Each COG is from at least 3 lineages • Ancient conserved domain • 4873 alignments available • Alignments from NCBI, HMMs from me at mgollery@unr.edu

  35. CDD • Conserved Domain Database (NCBI) • PSI-BLAST profiles are similar to HMMs • 10991 PSSMs - SMART + COG + KOG + Pfam + CD • Runs with RPS-BLAST • Much faster searches

  36. KinFam • Kinfam- models represent 53 different classes of PKs • Assigns Kinase Class and Group • Based on Hanks’ classification scheme • Database is small, so searches are fast

  37. KinFam • Categorizes Kinase data • Available for download from bioinformatics.unr.edu
      RANK  SCORE   QF  TARGET|ACCESSION  E_VALUE   DESCRIPTION
      1     852.93  1   KinFam||ptkgrp15  9.3e-256  Fibroblast GF recept
      2     479.14  1   KinFam||ptkgrp14  3.1e-143  Platelet derived GF
      3     423.33  1   KinFam||ptkother  1.9e-126  Other membrane-span

  38. HydroHmmer • Hydrohmmer finds LEAs, other hydrophilin classes • Small target size makes for very fast searches

  39. NVFAMs • HMMs reflect the training data • Specific training sets provide better results • So… use archaeal data to study archaea, fungal data to study fungi, etc. • Designed for use with PFAM, not stand-alone • Recent redesign, name change

  40. NVFAMs • NVFAM-pro used to study E. faecalis • Demonstrated higher scores, better alignments • However, PFAM had more total hits • P. falciparum used as negative control • PFAM showed better scores and alignments, as predicted • Automated design by Garrett Taylor; scripts are available! • Contact me for input, collaboration, or help to build your own

  41. Which database to use? One comparison test (your results may vary…) • Compare 563 I. pini sequences to COGhmm, PFAM, PFAMfrag, SMART, TIGRfam, TIGRfamfrag, Superfamily • COGs: 9 • PFAM: 22 • PFAMfrag: 57 • SMART: 4 • Superfamily: 30 • TIGRfam: 6 • TIGRfamfrag: 12

  42. Integrated Resources • InterProscan • MAGPIE • PANAL • Make your own!

  43. InterPro • Database built from PFAM, Prints, Prosite, SuperFamily, ProDom, SMART, TIGRFAMs, PANTHER, PIRsf, Gene3D & SP/TrEMBL • Version 10.0 • Nearly 12,000 entries • http://www.ebi.ac.uk/interpro/ • InterProScan can be installed locally

  44. InterProScan • Splits up big jobs & reassembles them • Works with SGE, PBS, LSF • A free analysis pipeline! • Provides GO mappings • Written in Perl, so it’s easy to modify • Average 4 min. per NT sequence per CPU

  45. InterPro release 10.0 contains 11972 entries, representing 3079 domains, 8597 families, 228 repeats, 27 active sites, 21 binding sites and 20 post-translational modification sites. Overall, there are 7521179 InterPro hits from 1466570 UniProt protein sequences. A complete list is available from the ftp site.
      DATABASE             VERSION  ENTRIES
      SWISS-PROT           46.5     180652
      PRINTS               37.0     1850
      TrEMBL               29.5     1689375
      Pfam                 17.0     7868
      PROSITE patterns     18.45    1800
      PROSITE preprofiles  N/A      120
      ProDom               2004.1   1522
      InterPro             10.0     11972
      SMART                4.0      663
      TIGRFAMs             4.1      2454
      PIRSF                2.52     962
      PANTHER              5.0      438
      SUPERFAMILY          1.65     1160
      Gene3D               3.0      117
      GO Classification    N/A      18705

  46. Modifying InterProScan • Two ways to add your own HMM database to InterProScan: • Modify the Perl scripts • Concatenate your models onto PFAM • Similarly, if you are looking for a specific target, delete all the other models to speed up searches
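Both tricks are simple text operations, because a HMMER2 ASCII library is a flat file in which each model starts with a `HMMER2.0` line and ends with a `//` line. The sketch below concatenates two libraries and then keeps only selected models by name. The stub records, model names and helper names are hypothetical, and real records contain full score tables between the header and the `//` terminator.

```python
def split_models(text):
    """Split HMMER2 ASCII flat-file text into one string per model."""
    models, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":  # record terminator
            models.append("".join(current))
            current = []
    return models

def model_name(model):
    """Return the value of the NAME tag, or None if it is missing."""
    for line in model.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "NAME":
            return parts[1].strip()
    return None

def concatenate(*libraries):
    """Join libraries, making sure each one ends with a newline."""
    return "".join(lib if lib.endswith("\n") else lib + "\n" for lib in libraries)

def keep_only(text, wanted):
    """Drop every model whose NAME is not in `wanted`."""
    return "".join(m for m in split_models(text) if model_name(m) in wanted)

# Stub records standing in for real models (hypothetical names).
pfam = "HMMER2.0\nNAME  pkinase\nLENG  3\n//\nHMMER2.0\nNAME  SH2\nLENG  3\n//\n"
mine = "HMMER2.0\nNAME  my_family\nLENG  3\n//"

combined = concatenate(pfam, mine)
print([model_name(m) for m in split_models(combined)])

targeted = keep_only(combined, {"my_family"})
print([model_name(m) for m in split_models(targeted)])
```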
