1 / 62

Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA

Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA. Introduction to bioinformatics Stinus Lindgreen stinus@binf.ku.dk Bioinformatics Centre, University of Copenhagen. Outline of the lecture. Scoring an alignment: BLOSUM and PAM Scoring matrices for nucleotides

parry
Télécharger la présentation

Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pairwise alignment 2:Scoring matrices and gaps BLAST, BLAT and FASTA Introduction to bioinformatics Stinus Lindgreen stinus@binf.ku.dk Bioinformatics Centre, University of Copenhagen

  2. Outline of the lecture • Scoring an alignment: BLOSUM and PAM • Scoring matrices for nucleotides • Treatment of gaps • The BLAST algorithm • Position Specific Scoring Matrices • PSI-BLAST and other variants • BLAT • FASTA

  3. Definition of scoring system To build an alignment we need a scoring system • Aligning two different residues indicates a substitution • How likely is it to replace A with B? • A likely alignment should receive a large score Create a substitution matrix • A 20×20 symmetrical table (matrix) • Look up the score for aligning two amino acids • The diagonal represents conservation

  4. Considerations Conservative substitutions • Prefer changes that respect structure and function • E.g. hydrophobicity, size, chemical properties etc. Frequencies • How often do specific residues occur? • Scores should weigh the rare ones higher Evolution • How distant are the sequences? • Ancient divergence  more changes

  5. The creation Approach 1: • Make an evolutionary model • Use closely related sequences • Extrapolate to greater distances • Done in the PAM family Approach 2: • Look at related sequences • Observe actual mutations in motifs • Use sets of different overall identity • Done in the BLOSUM family

  6. The PAM250 matrix[1] [1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of protein sequence and structure, 5:3

  7. The PAM family Defining characteristics • PAM: Point (or percent) accepted mutations • One PAM unit: 1 change per 100 residues (~ 10M ys) • Not the probability that a residue A is something else • What would PAM250 mean in that case? • But: The number of expected changes overall • Including changes A  B  A

  8. Construction of PAM matrices The general idea: • Assume evolution is independent • Describe ”one step of evolution” • The next step will follow PAMx: Scoring matrix for x evolutionary steps • Very similar sequences: Low x • Very divergent sequences: Large x What is an evolutionary step?

  9. Evolutionary model Assume independence • between positions • over time • P(AB)=P(BA) Consequences: • How a sequence looks tomorrow depends • ONLY on how it looks today • Not on how it looked yesterday • Known as a Markov chain • The context does not matter • Evolution ”reversible”

  10. Building PAM1 • Find set of very similar sequences (1% ID) • Make global alignment (71 groups) • Count the substitutions (1572 changes) • Calculate weights

  11. From counts to scores The simplified version • Count mutation frequencies • The respective probabilities • Normalize by mutability of respective amino acid • Makes comparisons possible • Multiply to wanted evolutionary distance x • Still probabilities • Calculate log-odds score for (AB+BA)/2 • You have the PAMx matrix!

  12. Why log-odds? • Summation instead of multiplication • log(A·B)=log(A)+log(B) • S>0: More often than per chance • S=0: Number expected by chance • S<0: Less often than per chance

  13. The PAM1 matrix Highly identical sequences  1 evolutionary step • PAM1 describes likelihood of changes • Diagonal close to 1, off-diagonal close to 0 • PAM2=PAM1×PAM1 • PAM3=PAM1×PAM2 … • Markov chain property (independence) • Greater divergence through matrix multiplication • Large PAM: More even values

  14. PAM problems Good and bad things • Based on evolution  • Assumes independence  • Global data  evolution of proteins  • Based on small dataset  • Extrapolates to greater divergence  information loss & error growth  • Evolutionary model partly verified  • But still simplified…  • Mutations happen at the nucleotide level 

  15. The BLOSUM matrix[2] [2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad, Sci., 89:10915–10919s

  16. The BLOSUM family Look at conserved motifs in proteins • Scan databases for motifs • No gaps in alignment  a BLOCK • One protein can contain many motifs • Make groups of motifs • ~2000 blocks, >500 protein families • The BLOCKS database • BLOSUM: BLOCKS substitution matrices

  17. The BLOSUM idea Don’t extrapolate data • Collect divergent dataset • Observe mutations directly • Local alignment of sequences • Evolution treated implicitly • Still assumes independence between sites

  18. Building BLOSUMx Substitution scores for sequences of x %ID • Create sets of blocks of x %ID • Very similar sequences are grouped and weighted • Avoid bias due to numbers • Count the substitutions • Calculate log-odds scores for all pairs • You have the BLOSUMx matrix!

  19. Some general notes • BLOSUM62 has proved to work for a range of similarities • Other values (e.g. 30, 45 and 80) also used • Today: BLOSUM used more than PAM

  20. BLOSUM problems Good and bad things • Evolution not considered explicitly  • Known related proteins  • Assumes star-shaped phylogeny  • Based on conserved blocks  / • Local alignment 

  21. The matrix matters • Example: BLAST using BLOSUM45 and BLOSUM80 • Note number of hits • Note difference in scores NCBI BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

  22. Nucleotide scoring matrices Normally just simple 1/-1 cost scheme • Problematic simplification • Especially for structural sequences (e.g. ncRNA) • Consider evolution (PAM-like matrix) • Use actual frequencies of nucleotides as background • Weigh transitions and transversions differently

  23. Gap penalties We can score pairs of nucleotides/amino acids • What about indels and gaps? Two main strategies Linear gap penalty: G(n)=a·n • Penalize each gap the same (cf. last lecture) Affine gap penalty: G(n)=O+E·n • More evolutionary sound • Opening a gap costs more than extending it • Standard today

  24. Gaps: Things to consider How large should O and E be? • Depends on the scoring matrix used • Normally good values are given • Too large: No gaps created • Too small: Too many gaps created Should end gaps be penalized? • Local alignment: No • Global: Probably

  25. Heuristics So, now we know all about scoring an alignment • Let us use Smith-Waterman or Needleman-Wunsch • Against a database or a full-length genome? Bad idea! Instead: Heuristic method • Guaranteed to give a good solution fast • Not necessarily the optimal alignment

  26. The BLAST algorithm Basic Local Alignment Search Tool You have all used it, but how does it work? • Used to find statistically significant local alignments between a query and a database • Many versions (protein-protein, nucleotide-nucleotide,nucleotide-protein …) • Fast, widely used, good reason to know the algorithm

  27. BLAST overview The six steps • Filtering of low complexity regions (optional) • Compile list of relevant words • Scan database sequences • Extend hits to HSPs • Calculate E-value for significant hits • Report hits and alignments

  28. BLAST step 1 Filtering • Some sequences contain low complexity regions • Give rise to many random hits • Remember dotplots • Filter out by replacing with Xs

  29. BLAST step 2 Compiling word list • Word length L=3 (protein) or L=11 (nucleotides) • Choose score threshold T (usually 11) • Find all words of length L in query sequence • One for each position • Compare to all possible length L words (8000/~4.2M) • For each position in query, remove words below T • Limited to appr. 50 words per position

  30. The word list APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF… APL PLS LSA SAD ADQ DQA… Word list, position 1 APL: APC,APS,APT,APE… Remove words w with score(APL,w)<T

  31. BLAST step 3 Scan database • Store words for each position in efficient search tree • Scan each sequence in database • Remember exact word hits • If query length is 100, scan for appr. 50·100 hits

  32. Scanning APL APA  LPL  IIAPLIGNESNAPAVQTLVGQLPLSHKARG… Perfect match between word and database is a hit

  33. BLAST step 4 Extend hits (BLAST2) • HSP: High-scoring Segment Pair • Find two hits on same diagonal with distance ≤A • Connect them (ungapped alignment) • Extend using gaps, matches and mismatches • Extension continues while score increases • Stop when score drops X below highest score Original BLAST • Extend all single word hits (higher T needed)

  34. BLAST extension • Find diagonal hits • Extend alignment

  35. BLAST step 5 Calculate E-scores • Compile list of HSPs scoring more than S • Let x be the score of a HSP • The probability of seeing this score by chance P(score≥x) • The expectation of seeing this score in the database E≈1-e-P(score≥x)D • Rather complicated calculations • The Karlin-Altschul equation (written in many ways)

  36. BLAST step 6 Report the results • List the hits (sorted by E-value) • Graphical representation • Show Smith-Waterman alignment of the HSPs

  37. Using BLAST Settings • Query sequence • Sub-sequence • Database  important • Conserved Domain search: Additional info • What organisms to search • Scoring matrix • Normally leave the rest as defaults • But you can change, well, anything

  38. Understanding BLAST Output • Database searched and query used • Number of hits • Color-coded diagram • Magnitude of score • Relation between hits (if any) • Hits sorted by E-value • Alignments • Additional info by clicking a link

  39. BLAST summary • Heuristic local alignment • Finds word matches • Extends locally • Might miss optimal solutions • Fast • Lower E-value  better result • Many hits between query and sequence possible • Remember: Use a proper scoring matrix!

  40. Position Specific Scoring Matrix Another brief detour: PSSM or Profile • Useful to represent (part of) a multiple alignment • Represents the information in a motif • For each position: What are the frequencies of the characters? (Nucleotides or amino acids) • Frequencies can give the probabilities • Find most probable hit to a PSSM in a sequence • Compare to the background probability (i.e. the overall frequencies) • Pseudocounts to avoid 0 probabilities

  41. PSSM example 123456789 --------- AAGGTAAGT TGTGTGAGT CAGGTATAC ATGGTAACT TAGGTACTG TAGGTCATT ACAGTCAGT CAGGTTGGA TCCGTAAGT GAGGTAAAC | 1 2 3 4 5 6 7 8 9 -|-------------------- A| 3 6 1 0 0 6 7 2 1 C| 2 2 1 0 0 2 1 1 2 G| 1 1 7 10 0 1 1 5 1 T| 4 1 1 0 10 1 1 2 6 Just divide by 10 to get probabilities Assume equal background distribution P(A)=P(C)=P(G)=P(T)=0.25

  42. PSSM score Comparing sequence CAGGTAGTC to the PSSM • Probability given the model 0,2·0,6·0,7·1,0·1,0·0,6·0,1·0,2·0,2=0,0002016 • Probability given the background 0,259=0,000003815 • log-odds score (bits)

  43. PSI-BLAST Position-Specific-Iterated BLAST • Designed to find distant homologs • More sensitive than BlastP • Use PSSM instead of just sequence comparison

  44. The PSI-BLAST algorithm • Perform standard BlastP search • Create PSSM based on query and best hits • Search database again for hits to the PSSM • Incorporate new hits (i.e. update frequencies) • Iterate (repeat) until convergence • End result: More distant homologs found • Slower than standard BLAST (of course)

  45. PSI-BLASTing a protein • Output at first iteration, 2nd iteration, … • When to stop? • In this case: 7 iterations PSI-BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

  46. Other BLAST versions • MegaBLAST • Assumes high similarity, longer sequences • BLAST2SEQUENCES • Blast on only two sequences (local alignments) • PHI-BLAST • Pattern Hit Initiated BLAST – searches for motif • WU-BLAST • BLAST from Washington University, not NCBI

  47. BLAST variants • Nucleotide vs nucleotide: blastn • Protein vs protein: blastp • Translated sequence vs protein database: blastx • Protein sequence vs translated database: tblastn • Translated sequence vs translated database: tblastx • Find divergent protein sequences: PSI-BLAST

  48. The BLAT algorithm BLAST-Like Alignment Tool • Designed for the genome projects • Local alignments between long sequences • Speed important! • BLAST turned upside-down

  49. What BLAT does How does it work? • Index the database (length 11 words) • Find hits in the query • Opposite the BLAST strategy • Extend hits to HSPs • Useful when the database does not often change • Too time (and space) consuming for normal BLAST

  50. The FASTA algorithm Similar to BLAST … but different • Fast All (since it works with all alphabets) • Not as widely used as BLAST – but one of the first • Works in a step-wise fashion • Locate word hits, extend using Smith-Waterman

More Related