EX3. Sequence Alignment: BLAST and Psi-BLAST. Outline. Pairwise alignment : Alignment with gaps Global alignment Local alignment Blast: NCBI BLAST web server NCBI PSI-BLAST web server BLAST through Chimera. Introduction. The Limits of Sequence Similarity. Introduction.
EX3 Sequence Alignment: BLAST and PSI-BLAST
Outline • Pairwise alignment: • Alignment with gaps • Global alignment • Local alignment • BLAST: • NCBI BLAST web server • NCBI PSI-BLAST web server • BLAST through Chimera
Introduction The Limits of Sequence Similarity
Introduction Example: Aligning Two Globins Human Hemoglobin (HH): VLSPADKTNVKAAWGKVGAHAGYEG Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG
Introduction Example: Aligning Two Globins (HH) VLSPADKTNVKAAWGKVGAHAGYEG (SWM) VLSEGEWQLVLHVWAKVEADVAGHG • No Gaps: • Percent identity: 36 • Percent similarity: 40
Introduction Example: Aligning Two Globins (HH) VLSPADKTNVKAAWGKVGAH-AGYEG (SWM) VLSEGEWQLVLHVWAKVEADVAGH-G • With Gaps: • Gaps: 2 • Percent identity: 45.833 (instead of 36 without gaps) • Percent similarity: 54.167 (instead of 40 without gaps)
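The identity figures on these two slides can be reproduced with a short calculation. The sketch below is only an illustration, not the tool used to produce the slides. The function name `percent_identity` is hypothetical, and it assumes identity is counted over columns in which neither sequence has a gap, which is the convention that gives 45.833 for the gapped alignment.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of gap-free columns in which both aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))  # 36.0 45.833
```

Percent similarity is not computed here because it depends on which residue pairs the chosen scoring matrix treats as similar.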
Introduction How are gaps created? Indels are rare in evolution. They vary in size from one base pair to a section of a chromosome. • Insertion • Deletion
Introduction Types of Gap Penalties • Once a gap is created, it is easy to extend: • Gap open – penalty for the first residue in a gap • Gap extension – penalty for each additional residue in a gap. Conclusion: gap opening and gap extension should be penalized differently; gap opening gets the higher penalty.
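The two-level penalty described on this slide is often called an affine gap penalty. A minimal sketch follows, assuming the definition given above (the first gapped residue costs the opening penalty, each further residue costs the extension penalty). The function name and the default values 10 and 0.5 are illustrative assumptions, not values taken from the slides.

```python
def affine_gap_penalty(gap_length: int,
                       gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty for one gap of `gap_length` residues.

    The first residue costs `gap_open`; every additional residue
    costs `gap_extend` (both are illustrative defaults).
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0.0
    return gap_open + (gap_length - 1) * gap_extend


# One gap of five residues is much cheaper than five separate one-residue gaps.
print(affine_gap_penalty(5))      # 12.0
print(5 * affine_gap_penalty(1))  # 50.0
```

With these numbers a single long gap is preferred over many scattered gaps, which matches the observation that indels are rare but can be long.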
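Introduction Protein scoring matrices
Introduction Protein scoring matrices (roughly equivalent PAM and BLOSUM matrices, ordered from closer to more distant sequences)
• PAM100 = BLOSUM90 (closer sequences)
• PAM120 = BLOSUM80
• PAM160 = BLOSUM60
• PAM200 = BLOSUM52
• PAM250 = BLOSUM45 (distant sequences)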
Introduction Scoring • The final score of the alignment is the sum of the positive scores and the penalty scores: Alignment score = + Identities + Similarities - Substitutions (from the scoring matrix) - Gap insertions - Gap extensions (from the gap penalties)
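To make the sum concrete, here is a small sketch that scores an already-aligned pair of sequences. Everything numeric in it is an assumption for illustration: the toy `pair_score` (+2 identity, +1 similar, -1 other substitution) stands in for a real PAM or BLOSUM matrix, the similarity groups are hypothetical, and the gap penalties 5 and 1 are arbitrary.

```python
# Hypothetical similarity groups; a real analysis would use PAM or BLOSUM scores.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                  set("DE"), set("ST"), set("NQ")]


def pair_score(a: str, b: str) -> int:
    """Toy substitution score: identity +2, similar residues +1, otherwise -1."""
    if a == b:
        return 2
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def alignment_score(seq_a: str, seq_b: str,
                    gap_open: int = 5, gap_extend: int = 1,
                    gap: str = "-") -> int:
    """Sum substitution scores and subtract gap-open / gap-extension penalties."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous = None  # which sequence carried a gap in the previous column
    for a, b in zip(seq_a, seq_b):
        current = "a" if a == gap else "b" if b == gap else None
        if current is None:
            score += pair_score(a, b)
        elif current == previous:
            score -= gap_extend  # continuing an existing gap
        else:
            score -= gap_open    # first residue of a new gap
        previous = current
    return score


print(alignment_score("AC--T", "ACGGT"))                      # 0
print(alignment_score("ADLGAVFALCDRYFQ", "ADLGRTQN-CDRYYQ"))  # 10
```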
Pairwise Alignment Local vs. Global
• Global alignment – finds the best alignment across the whole of the two sequences.
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• Local alignment – finds regions of similarity in parts of the sequences.
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
Pairwise Alignment Global: Needleman & Wunsch (1970) • Involves an iterative matrix method of calculation Needleman, S. B. and Wunsch, C. D., 1970
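The "iterative matrix method" can be sketched as a dynamic-programming table in which each cell holds the best score for aligning two prefixes. The version below is a simplified illustration, not the EMBOSS implementation: it assumes a flat match/mismatch score instead of a substitution matrix and a single linear gap penalty instead of separate open and extension penalties, and it returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align the two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
print(needleman_wunsch_score("ACGT", "AGT"))   # 2
```

The bottom-right cell is the score of the best end-to-end alignment; recovering the alignment itself requires tracing back through the table.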
Pairwise Alignment http://www.ebi.ac.uk/Tools/psa/emboss_needle/
Pairwise Alignment Local: Smith & Waterman (1981) • Makes an optimal alignment of the best segment of similarity between two sequences • Suited to sequences that contain regions that are highly similar • Use when one sequence is short and the other is very long • Can return a number of aligned segments Smith, T.F. and Waterman, M.S., 1981
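The local method uses the same kind of table as the global one, with two changes: a cell is never allowed to fall below zero, and the answer is the highest cell anywhere in the table. The sketch below is a simplified illustration under assumed scores (flat match/mismatch, linear gap penalty); it returns only the best local score and is not the EMBOSS implementation.

```python
def smith_waterman_score(seq_a: str, seq_b: str,
                         match: int = 1, mismatch: int = -1,
                         gap: int = -1) -> int:
    """Best local alignment score with a linear gap penalty."""
    cols = len(seq_b) + 1
    previous = [0] * cols  # previous row of the table
    best = 0
    for a in seq_a:
        current = [0] * cols
        for j in range(1, cols):
            pair = match if a == seq_b[j - 1] else mismatch
            current[j] = max(0,                       # start a fresh local alignment
                             previous[j - 1] + pair,  # extend diagonally
                             previous[j] + gap,       # gap in seq_b
                             current[j - 1] + gap)    # gap in seq_a
            best = max(best, current[j])
        previous = current
    return best


# Only the shared ADLG segment contributes; the unrelated flanks are ignored.
print(smith_waterman_score("WWWADLGWWW", "KKKADLGKKK"))  # 4
```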
Pairwise Alignment http://www.ebi.ac.uk/Tools/psa/emboss_water/
Pairwise Alignment http://www.ebi.ac.uk/emboss/align/
BLAST/PSI-BLAST • BLAST- search your sequence against a sequence database • PSI-BLAST- search a PSSM against a sequence database
BLAST (Basic Local Alignment Search Tool) • Goal: A fast search for homologues in a huge database. BLAST is a heuristic method: it avoids an explicit search of the entire matrix by discarding most irrelevant sequences. Key concept: Homologous sequences are expected to contain short ungapped segments, with substitutions but without gaps. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) “Basic local alignment search tool” J. Mol. Biol. 215: 403-410
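The key concept can be illustrated with a toy seeding step: index the short words of the query, then look for the same words in a database sequence. This is a deliberate simplification and not the BLAST algorithm itself. It finds exact word matches only, whereas BLAST also accepts similar words that score above a threshold and then extends each seed. The function name `find_seeds` and the word size of 3 are assumptions for illustration.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, word_size: int = 3):
    """Return sorted (query_pos, subject_pos) pairs of identical words."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Scan the subject once and report every word that also occurs in the query.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return sorted(seeds)


# The two globin fragments from the earlier slides share a single exact 3-mer, "VLS".
print(find_seeds("VLSPADKTNVKAAWGKVGAHAGYEG",
                 "VLSEGEWQLVLHVWAKVEADVAGHG"))  # [(0, 0)]
```

A database sequence with no seed at all can be discarded without filling in an alignment matrix, which is where the speed-up comes from.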
PSI-BLAST 1. Run a standard protein-protein BLAST search. 2. Build a position-specific scoring matrix (PSSM or profile) from a multiple alignment of the sequences returned with low Expect values. 3. Run a BLAST search with the PSSM as query. 4. Refine the PSSM by adding the new database sequences. 5. Stop when no more matches to new database sequences are found. Otherwise, repeat from step 3.
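The control flow of this iteration can be sketched without any real search engine. In the sketch below, `search` and `build_profile` are hypothetical callables supplied by the caller (they stand in for the BLAST search and the PSSM construction), and the toy data at the bottom exists only to show the loop terminating. It is not the PSI-BLAST implementation.

```python
def iterative_profile_search(query, search, build_profile, max_rounds=5):
    """Repeat search -> rebuild profile until a round finds nothing new.

    `search(model)` returns an iterable of hit names; `build_profile(query,
    included)` returns the model for the next round. Both are placeholders.
    """
    included = set()
    model = query            # round 1 searches with the plain query sequence
    rounds_run = 0
    for _ in range(max_rounds):
        rounds_run += 1
        new_hits = set(search(model)) - included
        if not new_hits:     # converged: no new database sequences were found
            break
        included |= new_hits
        model = build_profile(query, included)
    return included, rounds_run


# Toy stand-ins: a "profile" is just the set of names it was built from.
LINKS = {"query": {"hom1"}, "hom1": {"hom2"}, "hom2": set()}


def toy_search(model):
    names = {model} if isinstance(model, str) else model
    found = set()
    for name in names:
        found |= LINKS.get(name, set())
    return found


def toy_profile(query, included):
    return {query} | included


hits, rounds = iterative_profile_search("query", toy_search, toy_profile)
print(sorted(hits), rounds)  # ['hom1', 'hom2'] 3
```

In the toy run, `hom2` is reachable only through `hom1`, which mirrors how a profile can pick up relatives that the original query alone would miss.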
PSSM: Position-Specific Scoring Matrix • Given a query sequence: • Align all sequences above a certain similarity • Each cell (i,j) represents the probability of residue i being at position j of the multiple alignment.
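A minimal sketch of the cell definition above: for each column of a multiple alignment, count how often each residue occurs and divide by the number of residues in that column. This shows raw frequencies only. It assumes gaps are ignored, and it leaves out the sequence weighting, pseudocounts and log-odds conversion that a production PSSM involves. The function name is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Return one {residue: frequency} dict per alignment column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for j in range(length):
        # Residues observed at position j, ignoring gap characters.
        residues = [seq[j] for seq in alignment if seq[j] != gap]
        counts = Counter(residues)
        total = len(residues)
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


profile = column_frequencies(["ACG", "ATG", "A-G"])
print(profile[1])  # {'C': 0.5, 'T': 0.5}
```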
General Issues • Where? (to find homologues) • Structural templates - search against the PDB • Sequence homologues - search against SwissProt or UniProt • How long? (length of homologues) • Fragments - short homologues (less than 50-60% of the query’s length) = relatively bad alignment • Ensure your sequences contain the wanted domain(s) • N- and C-termini tend to vary in length between homologues
General Issues • From whom? (which species the sequence belongs to) • Don’t care, all homologues are welcome • Orthologues/paralogues may be helpful • Sequences from distant/close species provide different types of information • Which method? (BLAST/PSI-BLAST) • Depends…
General Issues • Which method? (BLAST/PSI-BLAST) • BLAST: • identifying the query sequence • finding protein sequences similar to the query • PSI-BLAST: • finding very distantly related proteins • finding new members of a protein family • building a custom position-specific scoring matrix • when BLAST gives poor results
No “Miracle solution” Each protein is a different story; adjust parameters: • BLAST - E-value, substitution matrix, gap penalties, database, word length… • PSI-BLAST - BLAST parameters + PSSM inclusion threshold (or choose sequences manually), number of rounds…
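The length criterion above can be applied as a simple filter on the hit list. The sketch below assumes hits are given as a mapping from an identifier to a sequence length; the identifiers, the lengths and the 0.6 cutoff are made-up illustration values, with 0.6 taken from the upper end of the 50-60% range on the slide.

```python
def filter_by_length(query_length: int, hit_lengths: dict,
                     min_fraction: float = 0.6) -> dict:
    """Keep hits whose length is at least `min_fraction` of the query length."""
    cutoff = min_fraction * query_length
    return {name: length for name, length in hit_lengths.items()
            if length >= cutoff}


# Hypothetical hits for a 273-residue query; the cutoff is 0.6 * 273 = 163.8.
hits = {"hitA": 270, "hitB": 120, "hitC": 164}
print(filter_by_length(273, hits))  # {'hitA': 270, 'hitC': 164}
```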
The Query Protein Name: Dihydrodipicolinate reductase Enzyme reaction: Molecular process: Lysine biosynthesis (early stages) Organism: E. coli Sequence length: 273 aa
The Query Protein Query: DAPB_ECOLI
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAGKTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQAIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTALAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGERLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi BLASTp
BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi Query Sequence Database BLASTp Run
BLAST NCBI As many as possible Evalue threshold Matrix
BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi Mark all Mark only wanted
BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
PSI-BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi Query Sequence Database Run PSI-BLAST
PSI-BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi Pre-calculated PSSM Threshold for inclusion in PSSM
PSI-BLAST NCBI http://www.ncbi.nlm.nih.gov/blast/Blast.cgi Run next round Not found in previous round Include sequence in the PSSM