Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA

Pairwise alignment 2:Scoring matrices and gaps BLAST, BLAT and FASTA Introduction to bioinformatics Stinus Lindgreen stinus@binf.ku.dk Bioinformatics Centre, University of Copenhagen

Outline of the lecture • Scoring an alignment: BLOSUM and PAM • Scoring matrices for nucleotides • Treatment of gaps • The BLAST algorithm • Position Specific Scoring Matrices • PSI-BLAST and other variants • BLAT • FASTA

Definition of scoring system To build an alignment we need a scoring system • Aligning two different residues indicates a substitution • How likely is it to replace A with B? • A likely alignment should receive a large score Create a substitution matrix • A 20×20 symmetrical table (matrix) • Look up the score for aligning two amino acids • The diagonal represents conservation

Considerations Conservative substitutions • Prefer changes that respect structure and function • E.g. hydrophobicity, size, chemical properties etc. Frequencies • How often do specific residues occur? • Scores should weigh the rare ones higher Evolution • How distant are the sequences? • Ancient divergence  more changes

The creation Approach 1: • Make an evolutionary model • Use closely related sequences • Extrapolate to greater distances • Done in the PAM family Approach 2: • Look at related sequences • Observe actual mutations in motifs • Use sets of different overall identity • Done in the BLOSUM family

The PAM250 matrix[1] [1] M.O. Dayhoff: Survey of new data and computer methods of analysis (1978), Atlas of protein sequence and structure, 5:3

The PAM family Defining characteristics • PAM: Point (or percent) accepted mutations • One PAM unit: 1 change per 100 residues (~ 10M ys) • Not the probability that a residue A is something else • What would PAM250 mean in that case? • But: The number of expected changes overall • Including changes A  B  A

Construction of PAM matrices The general idea: • Assume evolution is independent • Describe ”one step of evolution” • The next step will follow PAMx: Scoring matrix for x evolutionary steps • Very similar sequences: Low x • Very divergent sequences: Large x What is an evolutionary step?

Evolutionary model Assume independence • between positions • over time • P(AB)=P(BA) Consequences: • How a sequence looks tomorrow depends • ONLY on how it looks today • Not on how it looked yesterday • Known as a Markov chain • The context does not matter • Evolution ”reversible”

Building PAM1 • Find set of very similar sequences (1% ID) • Make global alignment (71 groups) • Count the substitutions (1572 changes) • Calculate weights

From counts to scores The simplified version • Count mutation frequencies • The respective probabilities • Normalize by mutability of respective amino acid • Makes comparisons possible • Multiply to wanted evolutionary distance x • Still probabilities • Calculate log-odds score for (AB+BA)/2 • You have the PAMx matrix!

Why log-odds? • Summation instead of multiplication • log(A·B)=log(A)+log(B) • S>0: More often than per chance • S=0: Number expected by chance • S<0: Less often than per chance

The PAM1 matrix Highly identical sequences  1 evolutionary step • PAM1 describes likelihood of changes • Diagonal close to 1, off-diagonal close to 0 • PAM2=PAM1×PAM1 • PAM3=PAM1×PAM2 … • Markov chain property (independence) • Greater divergence through matrix multiplication • Large PAM: More even values

PAM problems Good and bad things • Based on evolution  • Assumes independence  • Global data  evolution of proteins  • Based on small dataset  • Extrapolates to greater divergence  information loss & error growth  • Evolutionary model partly verified  • But still simplified…  • Mutations happen at the nucleotide level 

The BLOSUM matrix[2] [2] S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from protein blocks (1992), Proc. Natl. Acad, Sci., 89:10915–10919s

The BLOSUM family Look at conserved motifs in proteins • Scan databases for motifs • No gaps in alignment  a BLOCK • One protein can contain many motifs • Make groups of motifs • ~2000 blocks, >500 protein families • The BLOCKS database • BLOSUM: BLOCKS substitution matrices

The BLOSUM idea Don’t extrapolate data • Collect divergent dataset • Observe mutations directly • Local alignment of sequences • Evolution treated implicitly • Still assumes independence between sites

Building BLOSUMx Substitution scores for sequences of x %ID • Create sets of blocks of x %ID • Very similar sequences are grouped and weighted • Avoid bias due to numbers • Count the substitutions • Calculate log-odds scores for all pairs • You have the BLOSUMx matrix!

Some general notes • BLOSUM62 has proved to work for a range of similarities • Other values (e.g. 30, 45 and 80) also used • Today: BLOSUM used more than PAM

BLOSUM problems Good and bad things • Evolution not considered explicitly  • Known related proteins  • Assumes star-shaped phylogeny  • Based on conserved blocks  / • Local alignment 

The matrix matters • Example: BLAST using BLOSUM45 and BLOSUM80 • Note number of hits • Note difference in scores NCBI BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

Nucleotide scoring matrices Normally just simple 1/-1 cost scheme • Problematic simplification • Especially for structural sequences (e.g. ncRNA) • Consider evolution (PAM-like matrix) • Use actual frequencies of nucleotides as background • Weigh transitions and transversions differently

Gap penalties We can score pairs of nucleotides/amino acids • What about indels and gaps? Two main strategies Linear gap penalty: G(n)=a·n • Penalize each gap the same (cf. last lecture) Affine gap penalty: G(n)=O+E·n • More evolutionary sound • Opening a gap costs more than extending it • Standard today

Gaps: Things to consider How large should O and E be? • Depends on the scoring matrix used • Normally good values are given • Too large: No gaps created • Too small: Too many gaps created Should end gaps be penalized? • Local alignment: No • Global: Probably

Heuristics So, now we know all about scoring an alignment • Let us use Smith-Waterman or Needleman-Wunsch • Against a database or a full-length genome? Bad idea! Instead: Heuristic method • Guaranteed to give a good solution fast • Not necessarily the optimal alignment

The BLAST algorithm Basic Local Alignment Search Tool You have all used it, but how does it work? • Used to find statistically significant local alignments between a query and a database • Many versions (protein-protein, nucleotide-nucleotide,nucleotide-protein …) • Fast, widely used, good reason to know the algorithm

BLAST overview The six steps • Filtering of low complexity regions (optional) • Compile list of relevant words • Scan database sequences • Extend hits to HSPs • Calculate E-value for significant hits • Report hits and alignments

BLAST step 1 Filtering • Some sequences contain low complexity regions • Give rise to many random hits • Remember dotplots • Filter out by replacing with Xs

BLAST step 2 Compiling word list • Word length L=3 (protein) or L=11 (nucleotides) • Choose score threshold T (usually 11) • Find all words of length L in query sequence • One for each position • Compare to all possible length L words (8000/~4.2M) • For each position in query, remove words below T • Limited to appr. 50 words per position

The word list APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF… APL PLS LSA SAD ADQ DQA… Word list, position 1 APL: APC,APS,APT,APE… Remove words w with score(APL,w)<T

BLAST step 3 Scan database • Store words for each position in efficient search tree • Scan each sequence in database • Remember exact word hits • If query length is 100, scan for appr. 50·100 hits

Scanning APL APA  LPL  IIAPLIGNESNAPAVQTLVGQLPLSHKARG… Perfect match between word and database is a hit

BLAST step 4 Extend hits (BLAST2) • HSP: High-scoring Segment Pair • Find two hits on same diagonal with distance ≤A • Connect them (ungapped alignment) • Extend using gaps, matches and mismatches • Extension continues while score increases • Stop when score drops X below highest score Original BLAST • Extend all single word hits (higher T needed)

BLAST extension • Find diagonal hits • Extend alignment

BLAST step 5 Calculate E-scores • Compile list of HSPs scoring more than S • Let x be the score of a HSP • The probability of seeing this score by chance P(score≥x) • The expectation of seeing this score in the database E≈1-e-P(score≥x)D • Rather complicated calculations • The Karlin-Altschul equation (written in many ways)

BLAST step 6 Report the results • List the hits (sorted by E-value) • Graphical representation • Show Smith-Waterman alignment of the HSPs

Using BLAST Settings • Query sequence • Sub-sequence • Database  important • Conserved Domain search: Additional info • What organisms to search • Scoring matrix • Normally leave the rest as defaults • But you can change, well, anything

Understanding BLAST Output • Database searched and query used • Number of hits • Color-coded diagram • Magnitude of score • Relation between hits (if any) • Hits sorted by E-value • Alignments • Additional info by clicking a link

BLAST summary • Heuristic local alignment • Finds word matches • Extends locally • Might miss optimal solutions • Fast • Lower E-value  better result • Many hits between query and sequence possible • Remember: Use a proper scoring matrix!

Position Specific Scoring Matrix Another brief detour: PSSM or Profile • Useful to represent (part of) a multiple alignment • Represents the information in a motif • For each position: What are the frequencies of the characters? (Nucleotides or amino acids) • Frequencies can give the probabilities • Find most probable hit to a PSSM in a sequence • Compare to the background probability (i.e. the overall frequencies) • Pseudocounts to avoid 0 probabilities

PSSM example 123456789 --------- AAGGTAAGT TGTGTGAGT CAGGTATAC ATGGTAACT TAGGTACTG TAGGTCATT ACAGTCAGT CAGGTTGGA TCCGTAAGT GAGGTAAAC | 1 2 3 4 5 6 7 8 9 -|-------------------- A| 3 6 1 0 0 6 7 2 1 C| 2 2 1 0 0 2 1 1 2 G| 1 1 7 10 0 1 1 5 1 T| 4 1 1 0 10 1 1 2 6 Just divide by 10 to get probabilities Assume equal background distribution P(A)=P(C)=P(G)=P(T)=0.25

PSSM score Comparing sequence CAGGTAGTC to the PSSM • Probability given the model 0,2·0,6·0,7·1,0·1,0·0,6·0,1·0,2·0,2=0,0002016 • Probability given the background 0,259=0,000003815 • log-odds score (bits)

PSI-BLAST Position-Specific-Iterated BLAST • Designed to find distant homologs • More sensitive than BlastP • Use PSSM instead of just sequence comparison

The PSI-BLAST algorithm • Perform standard BlastP search • Create PSSM based on query and best hits • Search database again for hits to the PSSM • Incorporate new hits (i.e. update frequencies) • Iterate (repeat) until convergence • End result: More distant homologs found • Slower than standard BLAST (of course)

PSI-BLASTing a protein • Output at first iteration, 2nd iteration, … • When to stop? • In this case: 7 iterations PSI-BLAST >GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR. MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM

Other BLAST versions • MegaBLAST • Assumes high similarity, longer sequences • BLAST2SEQUENCES • Blast on only two sequences (local alignments) • PHI-BLAST • Pattern Hit Initiated BLAST – searches for motif • WU-BLAST • BLAST from Washington University, not NCBI

BLAST variants • Nucleotide vs nucleotide: blastn • Protein vs protein: blastp • Translated sequence vs protein database: blastx • Protein sequence vs translated database: tblastn • Translated sequence vs translated database: tblastx • Find divergent protein sequences: PSI-BLAST

The BLAT algorithm BLAST-Like Alignment Tool • Designed for the genome projects • Local alignments between long sequences • Speed important! • BLAST turned upside-down

What BLAT does How does it work? • Index the database (length 11 words) • Find hits in the query • Opposite the BLAST strategy • Extend hits to HSPs • Useful when the database does not often change • Too time (and space) consuming for normal BLAST

The FASTA algorithm Similar to BLAST … but different • Fast All (since it works with all alphabets) • Not as widely used as BLAST – but one of the first • Works in a step-wise fashion • Locate word hits, extend using Smith-Waterman

Pairwise alignment 2: Scoring matrices and gaps BLAST, BLAT and FASTA