
Identification of Protein Domains

Identification of Protein Domains. Eden Dror Menachem Schechter. Computational Biology Seminar 2004. Overview: Introduction to protein domains. Classification of homologs. Representing a domain (PSSM, HMM). Internet resources (Pfam, SMART, PROSITE, InterPro). Research example.


Presentation Transcript


  1. Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004

  2. Overview • Introduction to protein domains. • Classification of homologs. • Representing a domain. • PSSM • HMM • Internet resources • Pfam • SMART • PROSITE • InterPro • Research example.

  3. Protein domains • A discrete portion of a protein assumed to fold independently, and possessing its own function. • Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.

  4. Protein domains • The assumption: The domain is the fundamental unit of protein structure and function. • Protein family – all proteins containing a specific domain.

  5. What can we learn from them? • Common ancestry & homology information for a set of proteins. • Homology can imply properties of a protein, such as function & localization. • Therefore, domains can be used to assign a new protein to a family and so infer its function.

  6. Classification of homologs • Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes. • Homologous genes can arise in two major ways: • Gene duplication (in the same species). • Speciation (splitting of one species into two).

  7. Classification of homologs

  8. Classification of homologs • Orthologs: two genes from two different species that derive from a single gene in the last common ancestor of the species. • Paralogs: two genes that derive from a single gene that was duplicated within a genome.

  9. Classification of homologs • (Figure: gene pairs labeled "ortho" and "para".)

  10. Classification of homologs • Inparalogs - paralogs that evolved by gene duplication after the speciation event. • Outparalogs - paralogs that evolved by gene duplication before the speciation event.

  11. Classification of homologs • (Figure: when comparing human with worm, gene pairs labeled "in-para" and "out-para".)

  12. What can we learn from them? • Orthologous proteins are evolutionary, and typically functional, counterparts in different species. • Paralogous proteins are important for detecting lineage-specific adaptations. • Both can reveal information about a specific species or a set of species.

  13. Protein domains – summary • By identifying domains we can: • Infer the function & localization of a protein. • Learn about a specific species. • Learn about a set of species as a group.

  14. Domain representation • Different methods to represent (model) domains: • Patterns (regular expressions). • PSSM (Position specific score matrix). • HMM (Hidden Markov model).
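The simplest of the three representations, a pattern, can be tried out with an ordinary regular expression. The sketch below uses the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P} as its example; the toy sequence and the function name are made up for illustration.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}:
# Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping occurrences be reported as well.
PATTERN = re.compile(r"(?=(N[^P][ST][^P]))")

def find_pattern(sequence):
    """Return (start, matched_text) for every occurrence (0-based start)."""
    return [(m.start(), m.group(1)) for m in PATTERN.finditer(sequence)]

# Toy sequence, invented for this example.
hits = find_pattern("MKNASANPSGNGTW")
print(hits)
```

A pattern either matches or it does not; PSSMs and HMMs replace that yes/no answer with a score.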

  15. PSSM • Position specific score matrix • Score matrix representing the score for having each amino acid in a given position in a specific sequence. • Based on the independent probabilities P(a|i) of observing amino acid a in position i.
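As a rough illustration of how such a matrix could be built, the sketch below estimates P(a|i) from an ungapped multiple alignment and converts it to log-odds scores. The uniform background, the pseudocount of 1 and the toy alignment are assumptions made for the example, not values from the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0, background=None):
    """Log-odds PSSM (one dict per column) from equal-length, ungapped sequences."""
    if background is None:
        # Assumption: uniform background; real matrices use observed frequencies.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        column = {}
        for a in AMINO_ACIDS:
            p = (counts[a] + pseudocount) / total       # estimate of P(a|i)
            column[a] = math.log2(p / background[a])    # log-odds score
        pssm.append(column)
    return pssm

# Toy alignment, invented for this example.
toy_alignment = ["ACDW", "ACEW", "GCDW", "ACDF"]
pssm = build_pssm(toy_alignment)
print(round(pssm[1]["C"], 2), round(pssm[1]["A"], 2))
```

Residues that are conserved in a column get positive scores; residues never seen there get negative ones.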

  16. PSSM: Example

  17. PSSM: Identifying a domain • Given a sequence and a PSSM: • Run over all positions. • Score each sub-sequence according to the matrix.
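The two steps above can be sketched as a sliding window. The three-column matrix, the default score for unlisted residues and the query sequence are all hypothetical.

```python
def scan(sequence, pssm, default=-2.0):
    """Score every window of len(pssm) residues; return a list of (start, score)."""
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position scores; unlisted residues get the default score.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        results.append((start, score))
    return results

# Hypothetical 3-column score matrix.
toy_pssm = [
    {"A": 2.0, "G": 1.0},
    {"C": 3.0},
    {"D": 2.0, "E": 1.0},
]

scores = scan("MKACDLG", toy_pssm)
best_start, best_score = max(scores, key=lambda t: t[1])
print(best_start, best_score)
```

In practice a window is reported as a domain hit only if its score exceeds some threshold; choosing that threshold is outside this sketch.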

  18. HMM: Hidden Markov Model • Markov model: a way of describing a process that goes through a series of states. • Each state has a probability of transitioning to each of the other states. • xi is the random variable for the state at step i. • (Figure: chain of states x1, x2, x3, x4.)

  19. HMM: Markov Model • Example: states ∈ {0,1}. • (Figure: three example assignments of 0/1 values to x1, x2, x3, x4.)

  20. HMM: Markov Model • Transition matrix: (Figure: matrix over the states x; its entries are not preserved in this transcript.)

  21. HMM: Markov Model • State transition example: • States are the nucleotides A, T, G, C.
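A minimal sketch of a Markov model over the four nucleotides: a transition table and the probability it assigns to a short sequence. The transition values and the uniform initial distribution are invented for illustration; they are not the numbers from the slide.

```python
import math

# Hypothetical transition probabilities P(next | current); each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
# Assumption: every nucleotide is equally likely as the first state.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_log_probability(seq):
    """log P(seq) = log P(first state) + sum of log transition probabilities."""
    logp = math.log(INITIAL[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(TRANSITIONS[prev][cur])
    return logp

p = math.exp(sequence_log_probability("ACG"))
print(p)  # 0.25 * P(C|A) * P(G|C)
```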

  22. HMM: Hidden Markov Model • Hidden Markov model: • Each state x emits an output y with a specific probability. • We only know the output (observations). • Thus, the states are hidden. • (Figure: chain of states x1, x2, x3, x4, each emitting an output y1, y2, y3, y4.)

  23. HMM: Hidden Markov Model • Example: states ∈ {0,1}, outputs ∈ {0,1}. • (Figure: two example assignments of 0/1 values to states x1..x4 and outputs y1..y4.)

  24. HMM: Hidden Markov Model • Emission matrix: (Figure: matrix indexed by state x and output y; its entries are not preserved in this transcript.)

  25. HMM: What can we do with it? • Given (A, B), the transition and emission matrices, we can compute: • The probability of a given sequence of states and outputs. • The probability of a given output sequence. • The most likely sequence of states that generated a given output sequence.
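The three computations can be sketched for a two-state, binary-output HMM like the one in the earlier example. All numbers are hypothetical, and the initial distribution PI is an extra assumption the slides do not mention. The second function is the standard forward algorithm and the third the Viterbi algorithm.

```python
# States and outputs are both in {0, 1}. All values are invented.
PI = [0.5, 0.5]                   # assumed initial state distribution
A = [[0.9, 0.1], [0.2, 0.8]]      # A[s][t] = P(next state t | state s)
B = [[0.8, 0.2], [0.3, 0.7]]      # B[s][y] = P(output y | state s)
STATES = (0, 1)

def joint_probability(states, outputs):
    """P(states, outputs): product of transition and emission terms."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p

def forward(outputs):
    """P(outputs), summed over all hidden state paths."""
    f = [PI[s] * B[s][outputs[0]] for s in STATES]
    for y in outputs[1:]:
        f = [sum(f[s] * A[s][t] for s in STATES) * B[t][y] for t in STATES]
    return sum(f)

def viterbi(outputs):
    """Most likely state path and its joint probability."""
    v = [PI[s] * B[s][outputs[0]] for s in STATES]
    back = []
    for y in outputs[1:]:
        ptr, nxt = [], []
        for t in STATES:
            best = max(STATES, key=lambda s: v[s] * A[s][t])
            ptr.append(best)
            nxt.append(v[best] * A[best][t] * B[t][y])
        back.append(ptr)
        v = nxt
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):      # follow the back-pointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1], max(v)

obs = [0, 1, 1, 0]
print(forward(obs), viterbi(obs))
```

For long sequences the products underflow, so real implementations work in log space or rescale at each step.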

  26. HMM: What can we do with it? • Learning: • Given state and output sequences calculate the most probable (A, B). • Easy when the states are known. • Otherwise: use a training algorithm.
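When the states are known, the most probable parameters come down to counting transitions and emissions and normalising each row. The sketch below does that with a pseudocount of 1 (an assumption, added to avoid zero probabilities); the training pair is a toy example. When the states are hidden, an iterative method such as Baum-Welch is the usual choice of training algorithm, which this sketch does not cover.

```python
def estimate_parameters(state_seqs, output_seqs, n_states=2, n_outputs=2, pseudocount=1.0):
    """Estimate (A, B) by counting, given paired state and output sequences."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for states, outputs in zip(state_seqs, output_seqs):
        for s, y in zip(states, outputs):
            emit[s][y] += 1            # state s emitted output y
        for s, t in zip(states, states[1:]):
            trans[s][t] += 1           # state s was followed by state t

    def normalise(rows):
        return [[c / sum(row) for c in row] for row in rows]

    return normalise(trans), normalise(emit)

# Toy training data: one state sequence and its outputs.
A_hat, B_hat = estimate_parameters([[0, 0, 1, 1]], [[0, 0, 1, 1]])
print(A_hat)
print(B_hat)
```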

  27. HMM: Profile HMM • Use HMM to represent sequence families. • A particular type of HMM suited to modeling multiple alignments. • (Assume we have a multiple alignment).

  28. HMM: Trivial profile HMM • We begin with ungapped regions. • Each position corresponds to a state. • Transitions are of probability 1.

  29. HMM: Trivial profile HMM • Let ei(a) be the independent probability of observing amino acid a in position i. • The probability of a new sequence x of length L, according to the model M: $P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$

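  30. HMM: Trivial profile HMM • We can score the sequence x with a log-odds ratio, summed over its positions i = 1..L: $S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$ • where $q_a$ is the probability of amino acid a under a random model.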

  31. HMM: Trivial profile HMM • Consider the values $\log \frac{e_i(a)}{q_a}$. • They behave like elements in a score matrix. • The trivial profile HMM is equivalent to a PSSM.
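The equivalence can be checked numerically. The sketch below scores a sequence two ways: as the log-odds of the model probability against a random model, and as a sum of per-position matrix entries log(e_i(a)/q_a). The two-letter alphabet and all probabilities are hypothetical, chosen only to keep the example small.

```python
import math

# Hypothetical random-model probabilities q_a over a toy two-letter alphabet.
Q = {"A": 0.5, "B": 0.5}
# Hypothetical emission probabilities e_i(a) for a 3-position trivial profile HMM.
E = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}, {"A": 0.5, "B": 0.5}]

def hmm_log_odds(x):
    """log( P(x | model) / P(x | random) ); all transitions have probability 1."""
    p_model = math.prod(E[i][a] for i, a in enumerate(x))
    p_random = math.prod(Q[a] for a in x)
    return math.log(p_model / p_random)

# The same information arranged as a position specific score matrix.
PSSM = [{a: math.log(col[a] / Q[a]) for a in col} for col in E]

def pssm_score(x):
    """Sum of matrix entries, one per position."""
    return sum(PSSM[i][a] for i, a in enumerate(x))

print(hmm_log_odds("ABA"), pssm_score("ABA"))
```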

  32. HMM: profile HMM • Let’s untrivialize by allowing for gaps: insertions and deletions. • Start off with the PSSM HMM.

  33. HMM: profile HMM • Handling insertions: • Introduce new states Ij that match insertions after position j. • These states emit with the random-model (background) probabilities.

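  34. HMM: profile HMM • The score of a gap of length k, where $a_{ST}$ is the probability of the transition from state S to state T and Mj is the match state for position j: $\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$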
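A small sketch of that gap score, assuming the insert state emits with the background probabilities so that emissions contribute nothing to the log-odds. A gap of length k then costs one transition into the insert state, k-1 self-loops, and one transition back to the next match state. The parameter names and the transition values are hypothetical.

```python
import math

def gap_log_score(k, a_match_to_insert, a_insert_to_insert, a_insert_to_match):
    """Log-score of an insertion of k residues after match position j."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return (math.log(a_match_to_insert)               # enter the insert state
            + (k - 1) * math.log(a_insert_to_insert)  # stay for k-1 more residues
            + math.log(a_insert_to_match))            # return to the next match state

# Hypothetical transition probabilities.
for k in (1, 2, 3):
    print(k, gap_log_score(k, 0.1, 0.5, 0.5))
```

Each extra residue adds the same amount, log of the self-loop probability, so the cost has the shape of an affine gap penalty: one charge to open the gap and a constant charge to extend it.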

  35. HMM: profile HMM • Handling deletions: • Introduce silent states Dj. • These states do not emit.

  36. HMM: profile HMM • The complete profile HMM:

  37. Internet resources • Databases of protein families. • Family information and identification. • Considerations: • Type of representation (pattern, PSSM, HMM). • Choice of seed multiple alignment proteins. • Quality control. • Database features (links, annotations, views). • Database Specificity (organism, functions).

  38. Pfam: Home

  39. Pfam • Protein families database of alignments and HMMs • Uses profile-HMMs to represent families. • For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures

  40. Pfam: Databases • Two databases: • Pfam-A – curated multiple alignments. • Grows slowly. • Quality controlled by experts. • Pfam-B – automatic clustering (ProDom derived). • Complements Pfam-A. • New sequences instantly incorporated. • Unchecked: false positives, etc.

  41. Pfam: Features • Search by: Sequence, keyword, domain, taxonomy. • Browsing by family or genome. • Evolutionary tree

  42. Pfam: Construction • Sources of seed alignments: • Pfam-B families. • Published articles. • 'Domain hunting' studies. • Occasionally, entries from other databases (e.g. MEROPS for peptidases).

  43. Pfam: Domain information

  44. Pfam: Domain organization

  45. Pfam: Multiple alignment

  46. Pfam: HMM logo

  47. Pfam: Species distribution

  48. Pfam: Genome comparison

  49. PROSITE • Database of protein families. • Matching according to simple patterns or PSSM profiles. • Browsing all proteins of a specific family. • Latest release contains 1696 protein families.

  50. PROSITE: Features • Comprehensive domain documentation. • All profile matches checked by experts. • Specificity/sensitivity: • Specificity: true-pos/all-pos, i.e. true-pos/(true-pos + false-pos). • Sensitivity: true-pos/(true-pos + false-neg).
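The two ratios can be computed directly from match counts. The sketch below follows the slide's definitions, reading "all-pos" as all reported matches (true plus false positives); under that reading the slide's specificity is the quantity often called precision elsewhere. The counts are invented.

```python
def specificity_sensitivity(true_pos, false_pos, false_neg):
    """Return (specificity, sensitivity) as defined on the slide."""
    specificity = true_pos / (true_pos + false_pos)   # true-pos / all reported positives
    sensitivity = true_pos / (true_pos + false_neg)   # true-pos / all real family members
    return specificity, sensitivity

# Hypothetical counts for one pattern: 90 correct hits, 10 wrong hits, 30 missed members.
spec, sens = specificity_sensitivity(true_pos=90, false_pos=10, false_neg=30)
print(spec, sens)
```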
