Protein Evolution: SARS coronavirus as an example

CZ5225 Methods in Computational Biology
Lecture 2-3: Protein Families and Family Prediction Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
August 2004

Protein Evolution: SARS coronavirus as an example

SARS Coronavirus
A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

SARS Infection
How the SARS coronavirus enters a cell and reproduces

Protein Evolution Generation of different species

Protein Families
• Sequence alignment-based families
  • Based on the principle of the sequence-structure-function relationship
  • Derived by multiple sequence alignment
  • Database: PFAM (Nucleic Acids Res. 30:276-280)
• Structure-based families
  • Derived by visual inspection and comparison of structures
  • Database: SCOP (J. Mol. Biol. 247, 536-540)
• Functional families
  • Databases:
    • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
    • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
    • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
    • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
    • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
    • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
    • Drug side-effect targets: DART (Drug Safety 26: 685-690)

Protein Families
Sequence families ≠ Structural families ≠ Functional families
• Sequence similar, structure different
• Sequence different, structure similar
• Sequence similar, function different (distantly related proteins)
• Sequence different, function similar
Homework: find examples

Protein Family Prediction Methods
Sequence alignment-based families:
• Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190
Structure-based families:
• Visual inspection and comparison of structures
Functional families:
• Statistical learning methods:
  • Neural network: ProtFun (Bioinformatics, 19:635-642)
  • Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)

Sequence Comparison as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap in sequence a):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps in sequence a):
AT TCTT GC
ATCCTATTCTAGC
Construction of many alignments => which is the best?

How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
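The scoring rule above can be sketched in a few lines of Python. The function names `w` and `alignment_score` are illustrative choices, not names from the lecture; the weights are the ones given on the slide.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # weights from the slide


def w(x, y):
    """Weight of one alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column weights of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12, matching the slide
```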

Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment graph for sequences a and b, with a path representing the alignment below.]
C---TTAACT
CGGATCA--T

An optimal alignment: the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows.

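Computing Si,j
[Figure: the three edges entering cell (i, j) of the alignment graph, weighted w(ai, bj), w(ai, -), and w(-, bj); the optimal score is Sm,n.]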
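The recurrence itself did not survive in this transcript. A standard form consistent with the three edge weights named above is sketched here. The boundary conditions are an assumption (the usual ones for a per-symbol gap weight), not something recovered from the slide.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j)
\end{cases}
\qquad
S_{0,0} = 0,\quad
S_{i,0} = S_{i-1,0} + w(a_i, -),\quad
S_{0,j} = S_{0,j-1} + w(-, b_j)
```

The score of an optimal alignment of the full sequences is then S_{m,n}.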

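Initializations
[Figure: score matrix with rows labelled by sequence a (CTTAACT) and columns by sequence b (CGGATCAT), showing the initialized first row and first column.]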

S3,5 = ?
[Figure: the same score matrix for CTTAACT (rows) and CGGATCAT (columns), with the optimal score marked.]

 C  T  T  A  A  C  -  T
 C  G  G  A  T  C  A  T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
[Figure: score matrix for CTTAACT (rows) and CGGATCAT (columns).]
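As a hedged sketch, the table for this example can be filled with the standard dynamic-programming recurrence (best of a diagonal, vertical, or horizontal step). The function name `global_alignment_matrix` and the first-row/first-column initialization by accumulated gap weights are assumptions, not taken from the slides.

```python
def global_alignment_matrix(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the global-alignment score table S, where S[i][j] is the best
    score for aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing costs only gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,  # align a_i with b_j
                S[i - 1][j] + gap,      # align a_i with a gap
                S[i][j - 1] + gap,      # align b_j with a gap
            )
    return S


S = global_alignment_matrix("CTTAACT", "CGGATCAT")
print(S[3][5])  # 5
print(S[7][8])  # 14, the optimal score shown above
```

Recovering the alignment itself would need a traceback from S[m][n], which this sketch leaves out.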

Global Alignment vs. Local Alignment
• global alignment:
• local alignment:

An optimal local alignment
• Si,j: the score of an optimal local alignment ending at ai and bj
• With proper initializations, Si,j can be computed as follows.
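The local-alignment recurrence is also missing from this transcript. The usual form adds a zero option to the global recurrence so that an alignment may start anywhere. It is given here as a sketch, with the zero boundary as an assumption.

```latex
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j)
\end{cases}
\qquad
S_{i,0} = S_{0,j} = 0
```

The best local score is the maximum of S_{i,j} over all cells, not necessarily S_{m,n}.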

Local alignment
• Match: 8
• Mismatch: -5
• Gap symbol: -3
[Figure: local-alignment score matrix for CTTAACT (rows) and CGGATCAT (columns).]

Local alignment
 A  -  C  -  T
 A  T  C  A  T
+8 -3 +8 -3 +8 = 18 (the best score)
[Figure: local-alignment score matrix for CTTAACT (rows) and CGGATCAT (columns).]
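A minimal sketch of the local variant, assuming the standard zero-floored recurrence and a zero first row and column. The name `local_alignment_score` is illustrative.

```python
def local_alignment_score(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, (i, j)), where (i, j) is the table cell at which
    the best local alignment ends."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                      # start a new local alignment here
                S[i - 1][j - 1] + sub,  # align a_i with b_j
                S[i - 1][j] + gap,      # align a_i with a gap
                S[i][j - 1] + gap,      # align b_j with a gap
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell


print(local_alignment_score("CTTAACT", "CGGATCAT"))  # (18, (7, 8))
```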

Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
Alignment:
GC-TC
A---C
G-ATC

How to score an MSA?
• Sum-of-Pairs (SP-score): the MSA score is the sum of the scores of the pairwise alignments it induces.
Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
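The SP-score can be sketched as a sum over all pairs of rows. The slide does not state which weights to use, so this example assumes the pairwise weights from earlier (match 8, mismatch -5, gap -3) and additionally assumes that a column pairing two gap symbols scores 0. The name `sp_score` is illustrative.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Score the pairwise alignment induced by two rows of an MSA."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # assumption: a gap-gap column contributes 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def sp_score(msa):
    """Sum-of-pairs score: add up pair_score over every pair of rows."""
    return sum(pair_score(r, s) for r, s in combinations(msa, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # -3 + 18 + -3 = 12 under the assumed weights
```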

Functional Classification by SVM
• A protein is classified as either belonging (+) or not belonging (-) to a functional family.
• By screening against all families, the function of this protein can be identified (example: SVMProt).
• What is SVM? Support vector machines: a statistical machine learning method that learns from examples and classifies objects into one of two classes.
• Advantages of SVM:
  • Accommodates diversity among class members.
  • Uses sequence-derived physico-chemical features as the basis for classification.
  • Suitable for functional family classification.
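To make the two-class idea concrete, here is a minimal linear SVM trained by batch sub-gradient descent on the regularised hinge loss. This is a teaching sketch, not SVMProt's implementation: it uses a linear kernel only, and the 2-D points, the hyperparameters (`lam`, `lr`, `epochs`), and the function names are all hypothetical. A real classifier would be trained on sequence-derived feature vectors.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + mean(hinge loss) by batch sub-gradient
    descent. Labels in y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for k in range(d):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (member of the family) or -1 (non-member)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: two well-separated clusters.
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```

Screening one protein against many families, as described above, would amount to training one such classifier per family and collecting the families that return +1.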

SVM References
• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes

Introduction to Machine Learning
• Goal:
  • To “improve” (gaining knowledge, enhancing computing capability)
• Tasks:
  • Forming concepts by data generalization
  • Compiling knowledge into compact form
  • Finding useful explanations for valid concepts
  • Clustering data into classes
• Reference:
  • Machine Learning in Molecular Biology Sequence Analysis
• Internet links:
  • http://www.ai.univie.ac.at/oefai/ml/ml-resources.html

Introduction to Machine Learning
• Categories:
  • Inductive learning
    • Forming concepts from data without much domain knowledge (learning from examples)
  • Analytic learning
    • Using existing knowledge to derive new useful concepts (explanation-based learning)
  • Connectionist learning
    • Using artificial neural networks to search for or represent concepts
  • Genetic algorithms
    • Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach

Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:

Machine Learning Methods Analytic learning:

Machine Learning Methods Neural network:

Machine Learning Methods Genetic algorithms: Pattern Strength Classification

SVM

SVM for Classification of Proteins
How to represent a protein?
• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
  • amino acid composition
  • hydrophobicity
  • normalized van der Waals volume
  • polarity
  • polarizability
  • charge
  • surface tension
  • secondary structure
  • solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res., 31: 3692-3697

SVM for Classification of Proteins
Descriptors for amino acid composition of protein:
C = (53.33, 46.67)
T = (51.72)
D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
Nucleic Acids Res., 31: 3692-3697
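The shape of these descriptors (two C values, one T value, ten D values) suggests a sequence encoded into two classes. The sketch below computes C, T, and D for such an encoding. The sequence behind the slide's numbers is not in this transcript, so the input here is a hypothetical toy string. The rule used for the D quantiles (take residue number floor(fraction × count), at least the first) is an assumed convention, and the single T value assumes exactly two classes.

```python
import math


def ctd_descriptors(encoded, classes):
    """Composition, transition and distribution (all in percent) for a
    sequence already encoded as class labels, e.g. 'AABBABAB'."""
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two residues")
    # 1-based positions of each class along the sequence.
    positions = {c: [i + 1 for i, ch in enumerate(encoded) if ch == c]
                 for c in classes}
    # C: share of residues belonging to each class.
    composition = [100.0 * len(positions[c]) / n for c in classes]
    # T: share of adjacent pairs whose class changes (two-class assumption).
    changes = sum(1 for x, y in zip(encoded, encoded[1:]) if x != y)
    transition = [100.0 * changes / (n - 1)]
    # D: sequence position (as % of length) of the first, 25%, 50%, 75%
    # and 100% residue of each class.
    distribution = []
    for c in classes:
        pos = positions[c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                distribution.append(0.0)
                continue
            k = max(1, math.floor(frac * len(pos)))  # assumed rounding rule
            distribution.append(100.0 * pos[k - 1] / n)
    return composition, transition, distribution


C, T, D = ctd_descriptors("AABBABAB", "AB")  # hypothetical toy input
print(C)  # [50.0, 50.0]
print(D)  # [12.5, 12.5, 25.0, 62.5, 87.5, 37.5, 37.5, 50.0, 75.0, 100.0]
```

Concatenating C, T, and D for every property in the list gives one fixed-length feature vector per protein, whatever the sequence length.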

CZ5225 Methods in Computational Biology
Assignment 1
• Project 1: Protein family classification by SVM
  • Construction of training and testing datasets
  • Generating feature vectors
  • SVM classification and analysis
  • Write a report and include a softcopy of your datasets
• Project 2: Develop a program for pair-wise sequence alignment using a simple scoring scheme.
  • Write the code in any programming language
  • Test it on a few examples (such as estrogen receptor and progesterone receptor)
  • Can you extend your program to multiple alignment?
  • Write a report and include a softcopy of your program
