7. Bioinformatics. Why carry out sequence analysis? Pairwise alignment dotplot Multiple sequence alignment consensus sequence profile Similarity secondary structure prediction Applications. PC session Protein sequence analysis Biocomputing. Primary etc structure X-ray crystallography
 
                
7. Bioinformatics • Why carry out sequence analysis? • Pairwise alignment • dotplot • Multiple sequence alignment • consensus sequence • profile • Similarity • secondary structure prediction • Applications
PC session • Protein sequence analysis • Biocomputing • Primary etc structure • X-ray crystallography • Structural genomics • Homology modelling • Protein Structure • Physiological • Biochemical • Chemical (prodrugs) • Targeting and delivery Rational Drug Discovery Lead compound • QSAR • History • Objectives • Limitations • Statistics • Steric • Electrostatics • Hydrophobic • PC sessions • Molecular modelling • Theory • Drug structure • Drug conformation • Docking • De Novo ligand design • PC sessions • 3D QSAR • CoMFA
7.1. Sequence Analysis - why?
• Identify new proteins that could be potential drug targets, especially GPCRs
  • Database query: give me all adrenergic receptor sequences
  • Result: 10 rat sequences, 7 human sequences - conclusion? Probably 3 more human sequences are still to be found
• Understand the overall function of a newly identified protein
  • A protein shows some similarity to another, well-understood protein - conclusion? The new protein may have a similar function
• Identify basic structural features
  • A protein contains 7 hydrophobic stretches of ~26 amino acids - conclusion? It is probably a G-protein coupled receptor
  • A protein contains 12 hydrophobic stretches of ~26 amino acids - conclusion? It is probably a transporter
• Identify the important residues in a protein
  • All class A amine-like G-protein coupled receptors (e.g. adrenergic, serotonin (5HT), dopamine, histamine, muscarinic) contain a conserved D (aspartate, Asp) on helix 3 that is involved in binding all known drugs
  • At some sequence positions there are key differences between similar receptors that can be exploited to design subtype-specific drugs
• Sequence alignment can be used in homology modelling
  • Build a structural model of a protein from its sequence alignment with a protein of known structure
7.2. DNA sequences XX SQ Sequence 2032 BP; 461 A; 543 C; 501 G; 527 T; 0 other; cagagcgcaagctggaactggctgaactgacaggcactgcgagcccagagtagccccgga 60 gctgagtgcaccacgcacccctaccacacccacacccacccacggccgctgaatgagtct 120 tccaggtgctcgcttgctgcccgcagcgccccgccggaggtccgctcgctgagggcggct 180 ggtgcgccggcagcctgtgcgctcacctgccagcctgcgcgccatggggcagcccgggaa 240 ccgcagcgtctttttgctggcgcccaacgcaagccacgcgccggaccaaaacgtcacgct 300 ggaacgggacgaggcctgggttgtgggcatgggcatcctcatgtcgcttattgtcctggc 360 catcgtgtttggaaacgtgctagtcatcacagccattgccaagtttgagcgtctccagac 420 ggtcaccaactacttcatcacctccctggcctgtgctgacctggtcatgggcctggcagt 480 ggtgccctttggggcctgccacatcctcatgaaaatgtggacttttggcaacttctggtg 540 tgagttttggacttccattgacgtgttatgcgtcacggccagcattgagaccttgtgcgt 600 gatcgctgtggatcgctacttagccatcacgtcacccttcaagtatcagtgcctgctgac 660 caagaataaggcccgggtggtcattttgatggtgtggatcgtgtctggccttacctcctt 720 cttacccattcagatgcactggtaccgggccagccacaaggaagccatcaactgctatgc 780 taaggaaacctgctgtgacttcttcacgaaccaaccctatgccattgcctcctccattgt 840 gtccttctaccttcccctggtggtcatggtcttcgtctactccagggtgttccaggtggc 900 caaaaggcagctccagaagatcgacaaatctgagggccgcttccatgcccaaaacgtcag 960 tcaagtggagcaggatgggcggagcggtctaggacaacgcaggacctccaagttctactt 1020 gaaggaacacaaagccctcaagactttaggcattatcatgggcactttcaccctgtgctg 1080 gctgcccttcttcattgtcaacattgtgcacgtgatcaaggataacctcatccgtaagga 1140 aatatacatccttctaaactggttgggctacatcaactccgctttcaatccccttatcta 1200 ctgccggagcccagatttcaggattgccttccaggagcttctctgcctgcgcaggtcttc 1260 attgaaggcctatgggaatggctgctccagcaacagcaatgacaggactgactacacagg 1320 ggaacagagtggatatcacctgggggaggagaaagacagtgaacttctgtgtgaagaccc 1380 cccaggcaccgaaaactttgtgaaccagcaaggtactgtgcccagtgatagcattgattc 1440 acaagggaggaattgtagtacaaatgactcactgctgtaatgccggttttctacttttta 1500 agacaccccttctccccagtaccctgcaacaaaacactaaacagactatttaacttgagt 1560 ctaataaatttagaataaagttgtacagagatgtgcaggaggaaagatatccttctgcct 1620 ttttattttttatttttttaagttgtaacaaaatatatttgagtaactgtttcttgtaca 1680 gttcagttcctctttgcctggaacttgttaagtttatgtctgaagggcttcagtctcaaa 1740 
ggacctggggctgctatgttttgatgacttttcctgcatatctacctcattgatcaagta 1800 ttaggggtaatatattgctgctggtaatttgtatctgaaggagaccttccttcctgcacc 1860 cttggactggaagatactgagtctctcggacctttcgctgtgaacatggactctcctcgc 1920 ccctcttatttgctcaaacggggtgttgtaggcagggacttgaggggcagctttggttgt 1980 tttcctgagcaaagtctaaagtttacagtaaataaattgtttgaccatgaaa 2032
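The SQ header of the entry above declares the sequence length and base composition, and the running totals at the end of each sequence line (60, 120, ...) allow a quick consistency check. The following is a minimal sketch of such a check, assuming the header follows the layout shown on the slide; the function names are hypothetical and only the first sequence line is included for brevity.

```python
import re
from collections import Counter

# Header and first sequence line copied from the entry above.
SQ_LINE = "SQ Sequence 2032 BP; 461 A; 543 C; 501 G; 527 T; 0 other;"
FIRST_LINE = "cagagcgcaagctggaactggctgaactgacaggcactgcgagcccagagtagccccgga 60"


def parse_sq_header(line):
    """Return (declared length, {base: declared count}) from an SQ header line."""
    length = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    counts = {base: int(n)
              for n, base in re.findall(r"(\d+)\s+(A|C|G|T|other)\b", line)}
    return length, counts


def count_bases(block):
    """Count bases in a block of sequence text, ignoring digits and spaces."""
    return Counter(ch.upper() for ch in block if ch.isalpha())


length, declared = parse_sq_header(SQ_LINE)
print(length, sum(declared.values()))        # declared length and sum of base counts
print(sum(count_bases(FIRST_LINE).values()))  # bases on the first line
```

Running `count_bases` over the whole sequence and comparing the result with `declared` would confirm that no bases were lost when the entry was copied.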
7.3. Identity and similarity
Align 2 sequences, ADGVLIIQVG and ADGVLIQVG. There are 2 alternatives:

ADGVLIIQVG        ADGVLIIQVG
||||||        or  |||||| |||
ADGVLIQVG         ADGVLI-QVG
Score 6           Score 9 (with a gap)

9 is higher, so the gapped alignment is the better one.

Comparing sub-sequences of A (400 residues) and B (650 residues):
• If A and B are identical in the regions that match, then alignment is straightforward, even if it is necessary to insert gaps
• Generally the sub-sequences are not identical, and so we need a measure of similarity rather than identity
[Figure: matching regions of A and B for cases (i) and (ii)]
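The two scores quoted for ADGVLIIQVG and ADGVLIQVG can be reproduced by counting identical columns. This is a minimal sketch, assuming a score of 1 per identical column and no penalty for the gap; `identity_score` is a hypothetical helper name.

```python
def identity_score(top, bottom):
    """Count columns where both rows carry the same residue (a gap never matches)."""
    return sum(1 for x, y in zip(top, bottom) if x == y and x != "-")


# Alternative 1: no gap, the shorter sequence simply runs out one column early.
print(identity_score("ADGVLIIQVG", "ADGVLIQVG"))   # 6

# Alternative 2: one gap inserted after ADGVLI.
print(identity_score("ADGVLIIQVG", "ADGVLI-QVG"))  # 9
```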
2.4. G-protein coupled receptors
[Figure: a Gs-coupled receptor in the plasma membrane (exterior above, cytosol below) with stimulatory and inhibitory ligands, Gα (GDP-bound) and Gβγ subunits, and stimulation of AC; shown with the X-ray structure of rhodopsin]
2.4. GPCR CXCR4 chemokine N-terminal (start) sequences
• Same sequence in different organisms, different sequences in the same organism; note the different lengths
• Note the poor alignment at the start (< 40) and the good alignment in the helix (> 40), including the well-conserved N at position 57 (a well-known GPCR motif)
2.4. Notes on previous alignment
• Note examples of different sequence, same organism
• Note the well-conserved (largely green) helical regions (~185-210, 225-247)
• Note the less well-conserved loop region (~215) between transmembrane helix 6 (TM6) and TM7
• Find the conserved CWXP motif and NPXXY motif: CWLP is at position 199 and NPXXY is at position 241
• Are the alternatives to C (position 199) and N (position 241) what you would expect from the amino acid structure (see below)? Yes, the alternatives are similar
• The identification of such motifs is an indication that a new sequence is a GPCR
• Can you see groups of sequences that are more similar to each other? If these are highly similar subtypes of the same receptor (e.g. Neurokinin receptor subtype 1 (NK1R), NK2R and NK3R) it could be difficult to design a drug to bind to one and not the other
• Note the predominance of green hydrophobic residues in the transmembrane regions (roughly positions 198-210 (TM6) and 222-248 (TM7)) and of red/blue hydrophilic residues in the loops (~211-221 and ~249+). For the full colour code examine the alignment itself
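Motifs such as CWXP and NPXXY can be located automatically by treating X as "any residue". The sketch below uses the last 50 residues of the sequence listed in section 7.14 as a test case; `find_motif` is a hypothetical helper, and the positions it reports are counted within this fragment, not in the numbering of the alignment discussed above.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for each occurrence of a motif.

    In the pattern, X stands for any amino acid.
    """
    regex = re.compile(pattern.replace("X", "."))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Last 50 residues of the sequence in section 7.14.
fragment = "FLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMN"

print(find_motif(fragment, "CWXP"))   # [(4, 'CWLP')]
print(find_motif(fragment, "NPXXY"))  # [(42, 'NPVIY')]
```

Finding both motifs in the expected order is an indication, not proof, that a new sequence is a GPCR.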
2.4. Sequence alignment and subtype-specificity
• This position is N in beta-adrenergic receptors and F in alpha-adrenergic receptors
• We know from SDM and structure that it is in the binding site
• Beta-selective ligands such as propranolol have an OH group to interact with this N; alpha-adrenergic ligands are more hydrophobic at this point
• 5HT receptors also have N at this position and so promiscuously bind propranolol
• Knowledge of sequence can therefore be used to design specificity and reduce side-effects
3.4. What does an alignment mean?
• From the Homstrad database: superposition of 1oft and 1bip, 4-oxalocrotonate tautomerase from Pseudomonas sp and Pseudomonas putida, 60 residues, %ID = 76%. At position 6, 1oft has a Y and 1bip has an H
• 1tig and 2ife, translation initiation factor IF-3 from Bacillus stearothermophilus and Escherichia coli. Gap: the red chain is longer
3.4. What does an alignment mean?
• 2mbr and 1hsk, Diphospho-N-acetylenolpyruvylglucosamine reductase and UDP-N-acetylenolpyruvoylglucosamine reductase from Escherichia coli and Staphylococcus aureus
• The gap here is because the blue loop is longer than the red loop at this point
7.6. Pairwise alignment: the Dotplot
Align the sequences using the dotplot
Dotplot – unrelated sequences
• These sequences, ASRAILFYLLLIDD and HLWDSAGGQNSTSP, are not related
• There is no serious diagonal line
• There will inevitably be some dots, as there are only 20 amino acids
• A single dot (1 identical residue) does not mean there is an alignment
• Is there a weak alignment in the following?

ASRAILFYLLLIDD---------
---------HLWDSAGGQNSTSP

• Probably not; even this looks like it has arisen by chance
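A dotplot is easy to generate by machine, which also makes it simple to confirm that the unrelated pair has no diagonal runs. The sketch below marks identical residues with X and reports the longest unbroken diagonal; the function names are hypothetical, and the second comparison uses the related pair HIWDSGGAQQSSSD / HLWDSAGGQNSTSP from the next slide.

```python
def dotplot(seq_a, seq_b):
    """Grid of booleans: rows follow seq_a, columns follow seq_b, True = identical."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Text version of the dotplot with X for a dot and . for an empty cell."""
    lines = ["  " + seq_b]
    for residue, row in zip(seq_a, dotplot(seq_a, seq_b)):
        lines.append(residue + " " + "".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


def longest_diagonal(seq_a, seq_b):
    """Length of the longest unbroken run of dots along any diagonal."""
    best = 0
    run = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, a in enumerate(seq_a, 1):
        for j, b in enumerate(seq_b, 1):
            if a == b:
                run[i][j] = run[i - 1][j - 1] + 1  # extend the diagonal run
                best = max(best, run[i][j])
    return best


print(render("ASRAILFYLLLIDD", "HLWDSAGGQNSTSP"))
print(longest_diagonal("ASRAILFYLLLIDD", "HLWDSAGGQNSTSP"))  # 1: isolated dots only
print(longest_diagonal("HIWDSGGAQQSSSD", "HLWDSAGGQNSTSP"))  # 3: the WDS run
```

Only exact identities are plotted here, so similar residues such as I and L leave breaks in a diagonal that a human reader would extend by eye.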
Alignments from dotplots – simple cases
• The following dotplot has been determined; note the diagonal lines
• Consider whether the short diagonal regions can be extended
• The alignment is therefore

HIWDSGGAQQSSSD
|:|||:|:|:|:|
HLWDSAGGQNSTSP

• The %ID = 8*100/14 = 57%
• This can only be worked out from the alignment; it cannot be worked out from the dotplot
• Note that in this case some of the non-identical amino acids, e.g. {I,L}, {G,A}, are very similar, hence the : symbol
• The D and the P at the end are not at all similar, but they should not be left out of the alignment
Dotplots - continued
• Alignments do not always start in the top left hand corner
• The alignment is therefore

YLHIWDSGGAQQSSSDD
  |:|||:|:|:|:|
--HLWDSAGGQNSTSP-

• The %ID = 8*100/14 = 57% (based on the 2nd sequence), or 8*100/17 = 47% (based on the 1st)
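The dependence of %ID on the chosen denominator can be checked directly from the alignment. This is a minimal sketch in which `percent_identity` is a hypothetical helper that reports the value relative to each sequence's ungapped length.

```python
def percent_identity(top, bottom):
    """Return (matches, %ID relative to top, %ID relative to bottom)."""
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    len_top = len(top.replace("-", ""))        # ungapped length of 1st sequence
    len_bottom = len(bottom.replace("-", ""))  # ungapped length of 2nd sequence
    return matches, 100.0 * matches / len_top, 100.0 * matches / len_bottom


matches, vs_first, vs_second = percent_identity("YLHIWDSGGAQQSSSDD",
                                                "--HLWDSAGGQNSTSP-")
print(matches)           # 8
print(round(vs_second))  # 57 (8 matches over 14 residues)
print(round(vs_first))   # 47 (8 matches over 17 residues)
```

Because the two figures differ, it is worth stating which length was used whenever a %ID is quoted.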
Dotplots: alignments with gaps
This dotplot shows two diagonal lines, giving two clear local alignments:

HLWDSA       AGAQQSTS
||||||  and  ||:|:|||
HLWDSA       AGGQNSTS

Joining these together gives

HLWDSAFFAGAQQSTS      HLWDSAFFAGAQQSTS
||||||   |:|:|||  or  |||||   ||:|:|||
HLWDSA---GGQNSTS      HLWDS---AGGQNSTS

We have to decide between them, as we can't use the A twice. I chose the 1st; you might choose the 2nd.
%ID = 11*100/16 = 69%
7.6. For you to align using a dot plot
D4DR_HUMAN RERKAMRVLP VVVGAFLLCW TPFFVVHITQ
ACM1_HUMAN KEKKAARTLS AILLAFILTW TPYNIMVLVS
• Hint: you need some squared paper!
• The correct answer is obvious, but you need to do the exercise so you can check out the alternatives
• The correct answer can be found at http://tinyGRAP.uit.no/famin.html (last checked 2001); the sequences are part of helix 6
7.7. Pairwise alignment: Completed Dotplot
[Figure: completed dotplots for identical sequences, highly similar sequences, and different but related sequences, with the C-terminus and a gap marked]
The alignment is

EGPRPDSSAGGSSAG      EGPRPDSSAGGSSAG
|||:||||||       or  |||:||     ||||  or?
EGPKPDSSAG           EGPKPD-----SSAG

%ID = 9*100/10 = 90% (9 matches over a length of 10 residues)
%ID = 9*100/15 = 60% (9 matches over a length of 15 residues)
7.8. Global alignment v local alignment
Global alignment
• The essence is to score 1 for each X on the dot plot, 0 otherwise
• The aim is to find the highest scoring route (from the alternatives) through the entire grid, starting from the C-terminus, essentially by joining up diagonal lines in the dotplot
• A gap penalty is introduced for jumping between parallel lines, as this corresponds to creating a gap
• The Needleman and Wunsch algorithm is the best known of this kind
Local alignment
• Similar to the above, but only fragments are considered, because only parts of the protein may be similar
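The global alignment idea can be sketched as a small dynamic programming routine in the style of Needleman and Wunsch. This is an illustrative sketch under stated assumptions, not the published algorithm in every detail: it scores 1 for a match and 0 for a mismatch as described above, and it assumes a simple penalty of -1 per gap position (the `gap` value and the function name are choices made for this example).

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the end of both sequences to recover one optimal route.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ADGVLIIQVG", "ADGVLIQVG")
print(score)   # 8: nine matches minus one gap position
print(top)
print(bottom)
```

Several routes can share the best score (the gap can sit opposite either I), so the printed alignment is one optimal answer among equals. A local alignment method would additionally allow the route to start and stop anywhere in the grid.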
7.9. Database searching
• In database searching we effectively carry out lots of pairwise comparisons, but this has to be much faster than an ordinary pairwise alignment
• Fasta
  • Searches for identical words of ~2 residues, with tricks to find the best way to join the matching words together. An alignment will be produced if enough matching words are found
  • Output from the program includes
    • query sequence, i.e. the one entered
    • name of database searched (e.g. SWISS-PROT)
    • program name + literature reference to be cited
    • list of hits (often ~50), incl. unique database identifier (e.g. A1AA_RAT) & ID code (e.g. P23944)
    • E-value: a low value indicates that virtually no matches with a similar score could be expected by chance. Look for a value less than 0.01 or preferably 0.001
    • alignment
• BLAST
  • The distinction is that BLAST looks for fixed length hits and extends them if possible
  • The resulting high scoring pairs (HSPs) form the basis of the alignment
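The word-matching step that makes Fasta-style searching fast can be illustrated with a lookup table of 2-residue words. This is only a sketch of the idea, assuming a word length of 2; the real programs do considerably more (rescoring, joining regions, statistics), and `ktuple_diagonals` is a hypothetical name.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count shared k-residue words on each diagonal.

    The diagonal is the offset (subject index - query index); many hits on
    one diagonal suggest a region worth aligning properly.
    """
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)   # where each word occurs in the query
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


# HLWDSA occurs in the second sequence shifted by two residues.
print(ktuple_diagonals("HLWDSA", "YLHLWDSA"))                # Counter({2: 5})

# The unrelated pair from the dotplot slide shares no 2-residue words at all.
print(ktuple_diagonals("ASRAILFYLLLIDD", "HLWDSAGGQNSTSP"))  # Counter()
```

Because the query is indexed once, each database sequence is scanned in a single pass, which is much cheaper than filling in a full alignment grid for every entry.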
7.10. 5 Amino acid groups - arrange in groups
Gly, G   Val, V   Tyr, Y   Arg, R   Asp, D
Cys, C   Ala, A   Trp, W   Lys, K   Glu, E
Met, M   Ile, I   Phe, F   Ser, S   Asn, N
Pro, P   Leu, L   His, H   Thr, T   Gln, Q
Key: S small; H hydrophobic; A aromatic; + positive; - negative or similar polar; C cysteine
[Figure: answer overlay labelling each of the 20 amino acids with its group (5 S, 4 H, 3 A, 3 +, 4 -, 1 C)]
Other groupings are possible
7.11. Similarity
• Above left: identity matrix, as used in the dotplot
• Above right: part of the Dayhoff mutation matrix, based on observed mutations in aligned proteins
• W is rarer than L, and so a W match scores 17 rather than 6
• F is like Y, so pairing F with Y still scores 7
• W and V are very different, hence -6
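The difference between scoring by identity and scoring by similarity can be seen with a tiny lookup table. The sketch below contains only the four Dayhoff values quoted on the slide, so it is a fragment for illustration and not a usable matrix; the function names are hypothetical.

```python
# Only the values quoted above; a real Dayhoff matrix covers all 20 x 20 pairs.
DAYHOFF_FRAGMENT = {
    ("W", "W"): 17,
    ("L", "L"): 6,
    ("F", "Y"): 7,
    ("W", "V"): -6,
}


def similarity_pair(a, b, matrix=DAYHOFF_FRAGMENT):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this fragment of the matrix")


def identity_pair(a, b):
    """Identity matrix as used in the dotplot: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0


def alignment_score(top, bottom, scorer):
    """Sum the pair scores over the columns of an ungapped alignment."""
    return sum(scorer(x, y) for x, y in zip(top, bottom))


print(alignment_score("WLF", "WLY", identity_pair))    # 2
print(alignment_score("WLF", "WLY", similarity_pair))  # 30 = 17 + 6 + 7
print(similarity_pair("V", "W"))                       # -6
```

With the identity matrix the F/Y column contributes nothing, whereas the similarity matrix rewards it almost as much as an exact L match.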
7.12. Multiple sequence alignment
• Two main perspectives
  • 1st: based on comparison of amino acid sequences, taking into account amino acid properties
  • 2nd: takes into account secondary or tertiary structure
• Which is the best alignment below? (H denotes helix)

HHHHHH     HHHHH        HHHHHH     HHHHH
EGPRPDSSAGGSSAGAPD      EGPRPDSSAGGSSAGAPD
|||:|.||||              |||:|.     ||||
EGPKPQSSAG-----APD      EGPKPQ-----SSAGAPD

• The first creates gaps in secondary structure (not so good); the second is better
• General strategy
  • Pair-wise alignment of all sequences
  • Produce a phylogenetic tree to group similar sequences (as right)
  • Similar sequences are aligned first, more distantly related ones later
  • Gaps in related sequences guide the position of gaps in others
  • The alignment may not be optimal and may need manual adjustment
  • A similarity matrix (e.g. Dayhoff PAM 250, BLOSUM 60) rather than an identity matrix is used in alignment
  • Different methods (e.g. Clustal (ordinary method), T-Coffee, profile methods in Clustal) may give different alignments, so think carefully about an alignment
7.13. Profile methods in multiple sequence alignment

Position   1     2     3     4     5     6     7     8     9     10
Consensus  y     d     g     G     A/I   V/L   v     e     A     t
Profile    0.6Y  0.6D  0.8G  1.0G  0.4A  0.4V  0.6V  0.6E  1.0A  0.8T
           0.4F  0.4E  0.2-        0.4I  0.4L  0.4-  0.4Q

The remaining 0.2 at each of positions 5, 6 and 10 is made up of 0.2V, 0.2- and 0.2- (one entry each).

• Consensus sequence
  • In multiple sequence alignment the consensus sequence gives the usual amino acid at a particular position:
  • Shown as upper case if only one amino acid is present, e.g. A at position 9
  • Shown as lower case if the majority are one amino acid, e.g. y at position 1
  • If equal numbers, show all residues present, e.g. V/L at position 6
• Profile
  • Fraction of each amino acid at each position
  • At position 1, 3/5 Y and 2/5 F, so the profile is 0.6Y, 0.4F
7.13. Profile methods in multiple sequence alignment
The profile
• Sometimes it is useful to align sequences against the profile, especially if they are very different from each other
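The consensus and profile rules can be applied mechanically to the columns of an alignment. The slides do not list the five aligned sequences, so the alignment in this sketch is hypothetical, chosen only so that its columns reproduce the fractions given for the first positions (3/5 Y and 2/5 F at position 1, and so on); the function names are also hypothetical, and "majority" is taken to mean the single most common residue.

```python
from collections import Counter


def profile(alignment):
    """Fraction of each residue (or gap) in every column of the alignment."""
    n = len(alignment)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*alignment)]


def consensus(alignment):
    """Apply the slide's rules: upper case, lower case, or tied residues."""
    symbols = []
    for column in zip(*alignment):
        counts = Counter(column)
        top = max(counts.values())
        tied = sorted(res for res, c in counts.items() if c == top)
        if len(counts) == 1:
            symbols.append(tied[0].upper())   # only one residue present
        elif len(tied) == 1:
            symbols.append(tied[0].lower())   # one residue is the most common
        else:
            symbols.append("/".join(tied))    # equal numbers: show all tied
    return symbols


# Hypothetical five-sequence alignment (not taken from the slides).
alignment = ["YDGGA", "YDGGA", "YEGGI", "FDGGI", "FE-GV"]

print(consensus(alignment))   # ['y', 'd', 'g', 'G', 'A/I']
print(profile(alignment)[0])  # {'Y': 0.6, 'F': 0.4}
```

Scoring a new sequence against these per-column fractions, instead of against any single member, is the basis of aligning to the profile.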
7.14. Secondary structure prediction: predicting transmembrane helices membrane inside cell LAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMN
7.14. Prediction from Hidden Markov Model
# Sequence Length: 243
# Sequence Number of predicted TMHs: 7
# Sequence Exp number of AAs in TMHs: 156.33216
# Sequence Exp number, first 60 AAs: 40.14445
# Sequence Total prob of N-in: 0.00006
# Sequence POSSIBLE N-term signal sequence
Sequence TMHMM2.0 outside 1 9
Sequence TMHMM2.0 TMhelix 10 29
Sequence TMHMM2.0 inside 30 41
Sequence TMHMM2.0 TMhelix 42 64
Sequence TMHMM2.0 outside 65 78
Sequence TMHMM2.0 TMhelix 79 101
Sequence TMHMM2.0 inside 102 120
Sequence TMHMM2.0 TMhelix 121 143
Sequence TMHMM2.0 outside 144 152
Sequence TMHMM2.0 TMhelix 153 175
Sequence TMHMM2.0 inside 176 186
Sequence TMHMM2.0 TMhelix 187 209
Sequence TMHMM2.0 outside 210 218
Sequence TMHMM2.0 TMhelix 219 241
Sequence TMHMM2.0 inside 242 243
• This is a highly sophisticated prediction based on hydrophobicities and known observations etc
• From http://www.sbc.su.se/internal.html
• The web is extremely important in bioinformatics
• Similar programs can predict helices, sheet and turn etc in globular proteins
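The prediction output above is simple enough to read into a program, for example to count the helices or to check that the segments cover the whole sequence. This is a minimal sketch that assumes the five-column layout shown on the slide; `parse_segments` is a hypothetical helper.

```python
# Segment lines copied from the prediction output above.
TMHMM_OUTPUT = """\
Sequence TMHMM2.0 outside 1 9
Sequence TMHMM2.0 TMhelix 10 29
Sequence TMHMM2.0 inside 30 41
Sequence TMHMM2.0 TMhelix 42 64
Sequence TMHMM2.0 outside 65 78
Sequence TMHMM2.0 TMhelix 79 101
Sequence TMHMM2.0 inside 102 120
Sequence TMHMM2.0 TMhelix 121 143
Sequence TMHMM2.0 outside 144 152
Sequence TMHMM2.0 TMhelix 153 175
Sequence TMHMM2.0 inside 176 186
Sequence TMHMM2.0 TMhelix 187 209
Sequence TMHMM2.0 outside 210 218
Sequence TMHMM2.0 TMhelix 219 241
Sequence TMHMM2.0 inside 242 243
"""


def parse_segments(text):
    """Return a list of (location, start, end) from the segment lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and not line.startswith("#"):
            _name, _program, location, start, end = fields
            segments.append((location, int(start), int(end)))
    return segments


segments = parse_segments(TMHMM_OUTPUT)
helices = [(start, end) for location, start, end in segments if location == "TMhelix"]

print(len(helices))                          # 7 predicted transmembrane helices
print([end - start + 1 for start, end in helices])  # helix lengths in residues
print(segments[-1][2])                       # 243, the sequence length
```

Seven helices that alternate between outside and inside loops is the pattern expected for a G-protein coupled receptor.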
8.1. Drug targeting and delivery
• Physical approaches: microspheres
  • Drugs are enclosed in biodegradable particles that are delivered to fine capillaries where they get stuck; inject upstream of the target
• Biochemical approaches:
  • Raise an antibody to a specific antigen, e.g. cell markers on tumour cells, then link the drug to the antibody
  • There are still problems as antibodies are large; it is preferable to use an antibody fragment as it is then distributed more easily
  • The drug must still get inside the cell, so it must be attached via a labile linkage
8.2. The lead
• Finding a lead, so that a major drug development project can start
  • Serendipity
  • High throughput screening
    • e.g. testing compounds from a company's own database
    • combinatorial chemistry
    • using libraries specifically designed using molecular modelling etc for a given target
• Properties of a lead
  • not just active in the primary screen
    • screen must be validated statistically
    • passed secondary tests to avoid false positives
  • show promise in a cascade of tests agreed for its selection
  • must be active in vivo
  • must be patentable: not too similar to a competitor's product
• Other desirable properties of a lead
  • potent enough for efficacy at a convenient dose
  • selective within receptor class (e.g. adrenergic ligand selective for α v β, subtype 1 v subtype 2)
  • selective between classes, e.g. a subtype 1 antagonist doesn't act at 5-HT receptors
  • toxicity: good therapeutic index, not mutagenic
  • active orally; reasonable duration of activity; stable
  • need to determine whether metabolites possess activity; are there species anomalies?
• QSAR can start once we have a lead