CZ5225 Methods in Computational Biology
Lecture 2-3: Protein Families and Family Prediction Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: firstname.lastname@example.org
Web: xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
August 2004
SARS Coronavirus • A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)
SARS Infection • How the SARS coronavirus enters a cell and reproduces
Protein Evolution Generation of different species
Protein Families
• Sequence alignment-based families
  • Based on the principle of the sequence-structure-function relationship
  • Derived by multiple sequence alignment
  • Database: PFAM (Nucleic Acids Res. 30:276-280)
• Structure-based families
  • Derived by visual inspection and comparison of structures
  • Database: SCOP (J. Mol. Biol. 247, 536-540)
• Functional families
  • Databases:
    • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
    • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
    • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
    • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
    • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
    • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
    • Drug side-effect targets: DART (Drug Safety 26: 685-690)
Protein Families
Sequence families ≠ Structural families ≠ Functional families
• Sequence similar, structure different
• Sequence different, structure similar
• Sequence similar, function different (distantly related proteins)
• Sequence different, function similar
Homework: find examples
Protein Family Prediction Methods
• Sequence alignment-based families:
  • Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-1531; JMB 301, 173-190
• Structure-based families:
  • Visual inspection and comparison of structures
• Functional families:
  • Statistical learning methods:
    • Neural network: ProtFun (Bioinformatics, 19:635-642)
    • Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)
Sequence Comparison as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
Many alignments can be constructed => which is the best?
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
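The column-by-column scoring above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `w` and `alignment_score` are chosen here for convenience, and the gap symbol is assumed to be `-`.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # scoring scheme from the slide


def w(x, y):
    """Score a single alignment column."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```

The same function can rate any candidate alignment, which turns "which alignment is best?" into a maximization problem.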
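Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment grid with sequence b (C G G A T C A T) across the columns and sequence a (C T T A A C T) down the rows; a path through the grid corresponds to the alignment below]
C---TTAACT
CGGATCA--T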
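An optimal alignment: the alignment of maximum score
• Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n.
• S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j
• With proper initializations, S_{i,j} can be computed as follows:
$$S_{i,j} = \max \{\, S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$$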
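Computing S_{i,j}
[Figure: cell (i, j) of the score matrix has three incoming edges: the diagonal edge from (i-1, j-1) weighted w(a_i, b_j), the vertical edge from (i-1, j) weighted w(a_i, -), and the horizontal edge from (i, j-1) weighted w(-, b_j); the optimal alignment score is S_{m,n}, in the bottom-right cell]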
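Initializations
• S_{0,0} = 0, S_{i,0} = -3i, S_{0,j} = -3j (each leading gap symbol scores -3)
[Figure: score matrix with the first row and first column filled in; columns C G G A T C A T (sequence b), rows C T T A A C T (sequence a)]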
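S_{3,5} = ?
[Figure: score matrix with columns C G G A T C A T (sequence b) and rows C T T A A C T (sequence a); cell (3, 5) is the one to be filled in, and the bottom-right cell holds the optimal score]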
 C  T  T  A  A  C  -  T
 C  G  G  A  T  C  A  T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
[Figure: filled score matrix; columns C G G A T C A T (sequence b), rows C T T A A C T (sequence a)]
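A minimal Python sketch of the dynamic-programming computation of S_{i,j}, with a traceback that recovers one optimal alignment. It assumes the scoring scheme used throughout the lecture (match +8, mismatch -5, gap symbol -3); the function names and the tie-breaking order in the traceback (diagonal, then vertical, then horizontal) are choices made here, not taken from the slides.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column; '-' denotes a gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_align(a, b):
    """Return (optimal score, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned entirely against gap symbols.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])
    # Fill: each cell takes the best of its three predecessors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
            )
    # Traceback from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


print(global_align("CTTAACT", "CGGATCAT"))
# (14, 'CTTAAC-T', 'CGGATCAT')
```

Other optimal alignments with the same score may exist; a different tie-breaking order in the traceback could return one of those instead.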
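Global Alignment vs. Local Alignment • global alignment: aligns the two sequences over their entire lengths • local alignment: aligns the highest-scoring pair of substrings, one from each sequence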
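An optimal local alignment
• S_{i,j}: the score of an optimal local alignment ending at a_i and b_j
• With proper initializations (S_{i,0} = S_{0,j} = 0), S_{i,j} can be computed as follows:
$$S_{i,j} = \max \{\, 0,\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$$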
Local alignment
• Match: 8 • Mismatch: -5 • Gap symbol: -3
[Figure: local alignment score matrix; columns C G G A T C A T (sequence b), rows C T T A A C T (sequence a)]
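Local alignment
 A  -  C  -  T
 A  T  C  A  T
+8 -3 +8 -3 +8 = 18 (the best score)
[Figure: filled local alignment score matrix; columns C G G A T C A T (sequence b), rows C T T A A C T (sequence a)]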
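The local variant differs from the global one in two ways: every cell is floored at 0, and the traceback starts from the highest-scoring cell and stops when it reaches a 0. The sketch below assumes the same scoring scheme (match 8, mismatch -5, gap symbol -3); the function name and the tie-breaking order are illustrative choices, not taken from the slides.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best local score, aligned segment of a, aligned segment of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                      # start a fresh local alignment here
                S[i - 1][j - 1] + sub,  # align a_i with b_j
                S[i - 1][j] + gap,      # align a_i with a gap symbol
                S[i][j - 1] + gap,      # align b_j with a gap symbol
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    # Traceback from the best cell until a zero-score cell is reached.
    i, j = best_cell
    row_a, row_b = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_align("CTTAACT", "CGGATCAT"))
# (18, 'A-C-T', 'ATCAT')
```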
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC

GC-TC
A---C
G-ATC
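How to score an MSA?
• Sum-of-Pairs (SP-score): the score of the MSA is the sum of the scores of the pairwise alignments it induces (rows separated by "/")

Score(GC-TC / A---C / G-ATC) = Score(GC-TC / A---C) + Score(GC-TC / G-ATC) + Score(A---C / G-ATC)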
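A short Python sketch of the SP-score. The slide does not say which pairwise scoring scheme to use or how to treat a column where both rows have a gap; the sketch assumes the earlier scheme (match 8, mismatch -5, gap symbol -3) and, as a common convention, scores a gap-gap column as 0. Both are assumptions made here for illustration.

```python
from itertools import combinations


def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one column of a pairwise alignment induced by the MSA."""
    if x == "-" and y == "-":
        return 0  # assumed convention: a gap-gap column contributes nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def pair_score(row_a, row_b):
    """Score of the pairwise alignment formed by two MSA rows."""
    return sum(w(x, y) for x, y in zip(row_a, row_b))


def sp_score(msa):
    """Sum-of-Pairs score: add up the scores of all row pairs."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all MSA rows must have equal length")
    return sum(pair_score(r, s) for r, s in combinations(msa, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # (-3) + 18 + (-3) = 12
```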
Functional Classification by SVM
• A protein is classified as either belonging (+) or not belonging (-) to a functional family
• By screening against all families, the function of the protein can be identified (example: SVMProt)
• What is SVM? Support vector machines: a statistical machine learning method that learns from examples and classifies objects into one of two classes.
• Advantages of SVM: accommodates diverse class members (members of a family need not resemble one another); uses sequence-derived physico-chemical features as the basis for classification; suitable for functional family classification.
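To make the "+ / -" decision concrete, here is a toy linear SVM trained by batch sub-gradient descent on the regularized hinge loss. It is only a sketch: the 2-D feature vectors are made up, the hyperparameters `lam`, `lr`, and `epochs` are arbitrary, and this is not SVMProt's implementation, which may differ in every respect (for example in its choice of kernel and solver).

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(hinge loss) by sub-gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for k in range(d):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (belongs to the family) or -1 (does not belong)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Hypothetical 2-D feature vectors: +1 = family member, -1 = non-member.
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Screening a protein against all families then amounts to running one such binary classifier per family and reporting the families that return +1.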
SVM References
• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes
Introduction to Machine Learning • Goal: • To “improve” (gaining knowledge, enhancing computing capability) • Tasks: • Forming concepts by data generalization. • Compiling knowledge into compact form • Finding useful explanations for valid concepts. • Clustering data into classes. • Reference: • Machine Learning in Molecular Biology Sequence Analysis. • Internet links: • http://www.ai.univie.ac.at/oefai/ml/ml-resources.html
Introduction to Machine Learning
• Categories:
  • Inductive learning
    • Forming concepts from data without much domain knowledge (learning from examples).
  • Analytic learning
    • Using existing knowledge to derive new useful concepts (explanation-based learning).
  • Connectionist learning
    • Using artificial neural networks to search for or represent concepts.
  • Genetic algorithms
    • Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.
Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:
Machine Learning Methods Analytic learning:
Machine Learning Methods Neural network:
Machine Learning Methods Genetic algorithms: [Figure labels: Pattern, Strength, Classification]
SVM for Classification of Proteins
How to represent a protein?
• Each sequence is represented by a feature vector assembled from encoded representations of tabulated residue properties:
  • amino acid composition
  • hydrophobicity
  • normalized van der Waals volume
  • polarity
  • polarizability
  • charge
  • surface tension
  • secondary structure
  • solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res., 31: 3692-3697
SVM for Classification of Proteins Descriptors for amino acid composition of protein: C=(53.33, 46.67) T=(51.72) D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0) Nucleic Acids Res., 31: 3692-3697
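The C, T, and D values above can be reproduced with a short Python sketch. Two things here are assumptions rather than content of the slide: the 30-residue, two-letter toy sequence, chosen because it yields exactly the numbers shown, and the rounding rule for the D descriptor (the position of the first residue of a class, then of the residues at floor(25%), floor(50%), floor(75%), and 100% of that class's count). C is the percentage of each class, T the percentage of adjacent pairs that switch between two classes, and D the chain positions, as a percentage of sequence length, at which those marker residues occur.

```python
from itertools import combinations


def ctd(seq, classes):
    """Composition, transition, and distribution descriptors (percentages)."""
    n = len(seq)
    # C: share of each class in the sequence.
    comp = [100.0 * seq.count(c) / n for c in classes]
    # T: share of adjacent pairs that switch between classes p and q.
    pairs = list(zip(seq, seq[1:]))
    trans = [
        100.0 * sum(1 for x, y in pairs if {x, y} == {p, q}) / (n - 1)
        for p, q in combinations(classes, 2)
    ]
    # D: relative chain position of the first, 25%, 50%, 75%, 100% member.
    dist = []
    for c in classes:
        pos = [i + 1 for i, x in enumerate(seq) if x == c]  # 1-based positions
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)  # class absent from the sequence
                continue
            k = max(1, int(frac * len(pos)))  # assumed rounding: floor, min 1
            dist.append(100.0 * pos[k - 1] / n)
    return comp, trans, dist


toy = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"  # hypothetical sequence, 16 A and 14 E
C, T, D = ctd(toy, "AE")
print([round(v, 2) for v in C])  # [53.33, 46.67]
print([round(v, 2) for v in T])  # [51.72]
print([round(v, 2) for v in D])
# [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```

In practice each residue would first be mapped to its property class (for example by hydrophobicity) before calling `ctd`, and the descriptors for all properties would be concatenated into one feature vector.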
CZ5225 Methods in Computational Biology
Assignment 1
• Project 1: Protein family classification by SVM
  • Construction of training and testing datasets
  • Generating feature vectors
  • SVM classification and analysis
  • Write a report and include a softcopy of your datasets
• Project 2: Develop a program for pair-wise sequence alignment using a simple scoring scheme
  • Write the code in any programming language
  • Test it on a few examples (such as estrogen receptor and progesterone receptor)
  • Can you extend your program to multiple alignment?
  • Write a report and include a softcopy of your program