
Protein Evolution: SARS coronavirus as an example

CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS August 2004. Protein Evolution: SARS coronavirus as an example.


Presentation Transcript


  1. CZ5225 Methods in Computational Biology. Lecture 2-3: Protein Families and Family Prediction Methods. Prof. Chen Yu Zong. Tel: 6874-6877. Email: csccyz@nus.edu.sg. http://xin.cz3.nus.edu.sg. Room 07-24, level 7, SOC1, NUS. August 2004

  2. Protein Evolution: SARS coronavirus as an example

  3. SARS Coronavirus. A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

  4. SARS Infection. How the SARS coronavirus enters a cell and reproduces

  5. Protein Evolution Generation of different species

  6. Protein Families • Sequence alignment-based families. • Based on Principle of Sequence-structure-function-relationship. • Derived by multiple sequence alignment • Database: PFAM (Nucleic Acids Res. 30:276-280) • Structure-based families. • Derived by visual inspection and comparison of structures • Database: SCOP (J. Mol. Biol. 247, 536-540) • Functional Families. • Databases: • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360) • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349) • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49) • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411) • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295) • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415) • Drug side-effect targets: DART (Drug Safety 26: 685-690)

  7. Protein Families. Sequence families ≠ Structural families ≠ Functional families • Sequence similar, structure different • Sequence different, structure similar • Sequence similar, function different (distantly related proteins) • Sequence different, function similar. Homework: find examples

  8. Protein Family Prediction Methods. Sequence alignment-based families: • Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190. Structure-based families: • Visual inspection and comparison of structures. Functional families: • Statistical learning methods: • Neural network: ProtFun (Bioinformatics, 19:635-642) • Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)

  9. Sequence Comparison as a Mathematical Problem. Example: Sequence a: ATTCTTGC; Sequence b: ATCCTATTCTAGC. Best alignment (one gap): ATTCTTGC against ATCCTATTCTAGC. Bad alignment (two gaps): AT TCTT GC against ATCCTATTCTAGC. Many alignments can be constructed => which is the best?

  10. How to rate an alignment? • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
      C  -  -  -  T  T  A  A  C  T
      C  G  G  A  T  C  A  -  -  T
      +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
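The scoring rule on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names `column_score` and `alignment_score` are illustrative and not part of the lecture, and '-' is assumed to be the gap symbol.

```python
def column_score(x: str, y: str, match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Score a single alignment column; '-' marks a gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))


# The alignment from the slide: four matches, one mismatch, five gap symbols.
print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```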

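  11. Alignment Graph. Sequence a: CTTAACT; Sequence b: CGGATCAT. [Figure: alignment graph with CGGATCAT along the top and CTTAACT down the side; a path through the graph corresponds to the alignment C---TTAACT / CGGATCA--T.]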

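  12. An optimal alignment: the alignment of maximum score • Let A = a1a2…am and B = b1b2…bn. • Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj • With proper initializations, Si,j can be computed as follows:
      $$S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$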
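A table of the Si,j values can be filled in row by row. The sketch below assumes the scoring scheme from slide 10 (match +8, mismatch -5, gap -3) and initializes the first row and column with accumulated gap penalties, which is the usual choice for global alignment; the function name `global_alignment_table` is illustrative.

```python
def global_alignment_table(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return the (m+1) x (n+1) table S, where S[i][j] is the best score
    for aligning the prefix a[:i] with the prefix b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + w,  # a_i aligned with b_j
                S[i - 1][j] + gap,    # a_i aligned with a gap
                S[i][j - 1] + gap,    # b_j aligned with a gap
            )
    return S


table = global_alignment_table("CTTAACT", "CGGATCAT")
print(table[3][5])    # 5
print(table[-1][-1])  # 14, the optimal global score S_{7,8}
```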

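  13. Computing Si,j [Figure: cell (i, j) of the alignment matrix is reached from (i-1, j-1) with w(ai, bj), from (i-1, j) with w(ai, -), and from (i, j-1) with w(-, bj); the final score is Sm,n.]

  14. Initializations [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side.]

  15. S3,5 = ? [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side.]

  16. S3,5 = ? [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side, with the optimal score marked.]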

  17. C  T  T  A  A  C  -  T
      C  G  G  A  T  C  A  T
      +8 -5 -5 +8 -5 +8 -3 +8 = 14
      [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side.]
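An alignment achieving the optimal score can be recovered by tracing back through the filled table from Sm,n to S0,0. The sketch below is one possible implementation, not the lecture's own code: the function name `global_align` and the tie-breaking order (diagonal first, then gap in b, then gap in a) are assumptions, so other equally good alignments may exist.

```python
def global_align(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Trace back from the bottom-right cell, collecting columns in reverse order.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + w:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])   # a_i against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")        # b_j against a gap
            bottom.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(score)
print(row_a)
print(row_b)
```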

  18. Global Alignment vs. Local Alignment • global alignment: • local alignment:

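  19. An optimal local alignment • Si,j: the score of an optimal local alignment ending at ai and bj • With proper initializations, Si,j can be computed as follows:
      $$S_{i,j} = \max \begin{cases} 0 \\ S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$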
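The local variant differs from the global one in two ways: every cell is floored at 0, and the answer is the largest value anywhere in the table rather than the bottom-right cell. A minimal sketch, assuming the same +8 / -5 / -3 scoring scheme and a first row and column of zeros; the function name `local_alignment_score` is illustrative.

```python
def local_alignment_score(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Return the best local alignment score between a and b."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                    # start a fresh local alignment here
                S[i - 1][j - 1] + w,  # a_i aligned with b_j
                S[i - 1][j] + gap,    # a_i aligned with a gap
                S[i][j - 1] + gap,    # b_j aligned with a gap
            )
            best = max(best, S[i][j])
    return best


print(local_alignment_score("CTTAACT", "CGGATCAT"))  # 18
```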

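  20. Local alignment. Match: 8; Mismatch: -5; Gap symbol: -3. [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side.]

  21. Local alignment
      A  -  C  -  T
      A  T  C  A  T
      +8 -3 +8 -3 +8 = 18 (the best score)
      [Figure: alignment matrix with CGGATCAT along the top and CTTAACT down the side.]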

  22. Multiple sequence alignment (MSA) • The multiple sequence alignment problem is to align more than two sequences simultaneously.
      Seq1: GCTC    Seq2: AC    Seq3: GATC
      GC-TC
      A---C
      G-ATC

  23. How to score an MSA? • Sum-of-Pairs (SP-score):
      Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
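The sum-of-pairs score adds up the pairwise scores of every pair of rows in the MSA. The slide does not give numeric weights, so this sketch reuses the +8 / -5 / -3 scheme from the pairwise slides and assumes that a column where both rows have a gap contributes 0, which is a common convention; both choices are assumptions, as are the function names.

```python
from itertools import combinations


def pair_score(row_a: str, row_b: str, match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Score two rows taken from the same MSA."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # assumption: a gap-gap column scores 0
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows) -> int:
    """Sum-of-pairs score: add pair_score over every unordered pair of rows."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # (-3) + 18 + (-3) = 12
```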

  24. Functional Classification by SVM • A protein is classified as either belonging (+) or not belonging (-) to a functional family • By screening against all families, the function of the protein can be identified (example: SVMProt) • What is SVM? Support vector machines are a statistical machine learning method that learns from examples and classifies objects into one of two classes. • Advantages of SVM: accommodates diversity among class members (no member is excluded for being dissimilar to the others); uses sequence-derived physico-chemical features as the basis for classification; suitable for functional family classification.
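To make the two-class idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. It is not SVMProt's implementation: the training procedure, the hyperparameters (`lr`, `lam`, `epochs`), the function names, and the two-dimensional toy feature vectors are all assumptions chosen for illustration.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=50):
    """Fit weights w and bias b so that sign(w.x + b) separates labels +1 / -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample is inside the margin: move the boundary away from it.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Sample is safely classified: apply only the regularization shrink.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 (belongs to the family) or -1 (does not belong)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical 2-D feature vectors: +1 = family member, -1 = non-member.
X = [(2.0, 2.0), (-2.0, -2.0), (3.0, 1.0), (-1.0, -3.0)]
y = [1, -1, 1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
print(predict(w, b, (4.0, 3.0)), predict(w, b, (-3.0, -4.0)))
```

Screening a protein against many families, as the slide describes, would amount to training one such classifier per family and reporting the families whose classifier returns +1.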

  25. SVM References • C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line). • R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy). • S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy). • Online lecture notes

  26. Introduction to Machine Learning • Goal: • To “improve” (gaining knowledge, enhancing computing capability) • Tasks: • Forming concepts by data generalization. • Compiling knowledge into compact form • Finding useful explanations for valid concepts. • Clustering data into classes. • Reference: • Machine Learning in Molecular Biology Sequence Analysis. • Internet links: • http://www.ai.univie.ac.at/oefai/ml/ml-resources.html

  27. Introduction to Machine Learning • Category: • Inductive learning. • Forming concepts from data without a lot of knowledge from domain (learning from examples). • Analytic learning. • Use of existing knowledge to derive new useful concepts (explanation based learning). • Connectionist learning. • Use of artificial neural networks in searching for or representing of concepts. • Genetic algorithms. • To search for the most effective concept by means of Darwin’s “survival of the fittest” approach.

  28. Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:

  29. Machine Learning Methods Analytic learning:

  30. Machine Learning Methods Neural network:

  31. Machine Learning Methods Genetic algorithms: Pattern Strength Classification

  32. SVM

  33. SVM

  34. SVM

  35. SVM

  36. SVM

  37. SVM

  38. SVM

  39. SVM

  40. SVM

  41. SVM

  42. SVM

  43. SVM for Classification of Proteins. How to represent a protein? • Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties: • amino acid composition • hydrophobicity • normalized van der Waals volume • polarity • polarizability • charge • surface tension • secondary structure • solvent accessibility • Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties. Nucleic Acids Res., 31: 3692-3697
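The first feature in the list, amino acid composition, is the simplest to compute. The sketch below expresses it as the percentage of each of the 20 standard residues; the percentage scale, the alphabetical residue ordering, and the function name are assumptions rather than details taken from the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence: str):
    """Return a 20-element list: percentage of each standard residue in the sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * c / total for c in counts]


vector = amino_acid_composition("AACD")
print(vector[:4])  # percentages for A, C, D, E
```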

  44. SVM for Classification of Proteins Descriptors for amino acid composition of protein: C=(53.33, 46.67) T=(51.72) D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0) Nucleic Acids Res., 31: 3692-3697
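The three descriptors can be computed from a sequence whose residues have been encoded into property classes. In the sketch below, C is the percentage of each class, T is the percentage of adjacent positions that switch between two classes, and D is the position (as a percentage of sequence length) of the first, 25%, 50%, 75%, and 100% occurrence of each class. The slide does not show the encoded sequence; the 30-letter two-class string used here is an illustrative input chosen because it yields the values on the slide, and the rounding-down rule for fractional ranks is likewise an assumption.

```python
from itertools import combinations


def ctd_descriptors(encoded: str, classes: str):
    """Return (composition, transition, distribution) lists of percentages."""
    n = len(encoded)
    composition = [round(100.0 * encoded.count(c) / n, 2) for c in classes]

    transition = []
    for p, q in combinations(classes, 2):
        # Count neighbouring positions that change from p to q or from q to p.
        changes = sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {p, q})
        transition.append(round(100.0 * changes / (n - 1), 2))

    distribution = []
    for c in classes:
        positions = [i for i, x in enumerate(encoded, start=1) if x == c]
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        # Ranks of the first, 25%, 50%, 75% and 100% occurrence (rounded down, at least 1).
        ranks = [1] + [max(1, len(positions) * pct // 100) for pct in (25, 50, 75, 100)]
        distribution.extend(round(100.0 * positions[r - 1] / n, 2) for r in ranks)
    return composition, transition, distribution


C, T, D = ctd_descriptors("AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE", "AE")
print(C)  # [53.33, 46.67]
print(T)  # [51.72]
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```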

  45. CZ5225 Methods in Computational Biology: Assignment 1 • Project 1: Protein family classification by SVM • Construction of training and testing datasets • Generating feature vectors • SVM classification and analysis • Write a report and include a softcopy of your datasets • Project 2: Develop a program for pairwise sequence alignment using a simple scoring scheme • Write the code in any programming language • Test it on a few examples (such as estrogen receptor and progesterone receptor) • Can you extend your program to multiple alignment? • Write a report and include a softcopy of your program
