Sequence Alignment

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: 123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP Position within the alignment = columns

Sequence Alignment The alignment is a hypothesis: • The positions with identical nt/AA were present in the common ancestor • Differences represent the nt/AA that have diverged since the common ancestor Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

From extant sequences to evolution

Constructing and evaluating alignments • how to identify regions of sequence similarity between two sequences? • How to evaluate the degree of similarity? • What is the biological significance of the alignment?

Dot Plots: visualization of an alignment • Take two English words: • place the two sequences on vertical and horizontal axes of graph • put dots wherever there is a match • diagonal line is the region of identity • local alignment • THISSEQUENCE and THATSEQUENCE

Alignments reveal insertions and deletions seq1 THIS---SEQUENCE seq2 THISISASEQUENCE a gap in Seq1 accounts for the insertion of ISA into Seq2

Alignments reveal substitutions THISSEQUENCE THATSEQUENCE • How many substitutions? • Are all substitutions equal? • If these were real AA sequences in two extant organisms, how can we determine whether they reflect evolutionary ancestry? • Would two unrelated sequence share this level of identity?

Substitutions Need a methods for evaluating the likelihood of - A to V (alanine to valine) - R to F (Arginine to Phenylalanine)

Scoring schemes to assess similarity • Percent identity = number of identical amino acids • Percent similarity (biochemical equivalence) • Substitution matrices • value assigned based on the probability of substitution • score the alignment 123456789…. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP

Substitution matrices How might one construct a scoring matrix? • what types of sequence events should we consider? • DNA level? Transition vs. transversion • Amino acid level? Biochemical equivalence • Using known proteins? • comparing protein homologs • post hoc determination of probabilities • should we use the same substitution matrix for two very closely related proteins vs. proteins that diverged long ago? • Probability of substitutions increases over time • Probability that multiple substitutions occurred in a single position

Substitution matrix Each cell represents the likelihood of substitution of each possible pair of amino acids • THISSEQUENCE • THATSEQUENCE • 581145505695 Sum up the score = 52

PAM and BLOSUM matrices for AA sequences Most protein alignment matrices are empirically derived: • PAM Scoring Matrices • compared full length of closely related proteins • Measured the frequency of all possible substitution pairs • BLOSUM Scoring Matrices • compared highly conserved regions of proteins • blocks

How to score gaps? • THISISASEQUENCE vs THATSEQUENCE? • THISISASEQUENCE • TH----ATSEQUENCE • THA---TSEQUENCE • TH---ATSEQUENCE • TH-A-T-SEQUENCE • Scoring the alignment must take into account • 1) Substitutions • 2) Gaps • Gap penalties: • 1) start a new gap (-4) • 2) extend an existing gap (-1) • Score all, choose highest score More than one possible alignment

Alignments • Finding regions of sequence identity or similarity • Inserting gaps to reflect indels • Scoring the possible alignments to find the optimal alignment by

Common tools that produce alignments • BLAST to identify similar sequences, given a query sequence • ClustalW to align two or more sequences across their entire length

the blast algorithm • Uses PAM or BLOSUM matrix • divides query sequence into short strings, called words • searches through the database to find subject sequences that contain similar words • When finds similar words, it extends and scores the alignment • Output consists of all subject sequences that align to the query at or above a threshold score • If no words are similar, then no alignment

BLAST Algorithm divide entire length into words (segments of X length)

Extend hits one base at a time S is the alignment score: If it falls below a threshold, the extension processes ends

HSPs are Aligned Regions • High scoring segment pairs = the original word match plus the extension • high scoring = score of the alignment above threshold • segment = the region of the query sequence aligned to the subject • pair = alignment between two sequences (query and subject) • BLAST often produces several short HSPs rather than a single aligned region

BLAST Results report local alignments • Query was the entire protein sequence (position 1 to 749) • Score,E-value, Identities, Positives, Gaps >gi|17556182|ref|NP_497582.1| Predicted CDS, phosphatidylinositol transfer protein [Caenorhabditis elegans] Score = 283 bits (723), Expect = 8e-75 Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%) Query: 48 KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101 K+ RV+LP+SV+EYQVGQL+SVAEASK P++ +G+ KGQYT Sbjct: 70 KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129 Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160 HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P Sbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189 Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN H L+ + E V I+IA+ + L S D + P+KF+S KTGRGPL N Sbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249 Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279 WK + P MCAYKLVTV FKW+G Q VEN+ H Q RLF+ FHR++FCW+DKW Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304 Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309 LTM DIR +E + +++L+E R+ V+GM Sbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334

BLAST Statistics • E-value is equivalent to a P value • smaller numbers are more significant • 1e-4 = 1 x 10-4 = 0.0004 • 1e-50 = 1 x 10-50 • E-value is calculated from the alignment score (S) • how many alignments of that score would likely occur by chance if you query a database of that size? • if GenBank contains 10 million sequences, there is a good probability that the sequence “MAGAV” will occur multiple times in sequences that are NOT evolutionarily related • The E-value represents the likelihood that the observed alignment is due to chance alone

Interpretation of output • very low E-values (e-100) represent sequences that are very close to being identical • moderate E-values are related genes (homologs) • long list of gradually declining of E-values indicates a large gene family • you must examine the results when e-value is in the 10-4 to -5 range • examine sequences • a few AA matches in a long sequence? • many AA matches in a very short sequence?

Evaluating Blast results • Alignment: • colored bar alignments (region of alignment, score along length) • sequence alignments (region of alignment, AA information) • Exploring potential function from significant blast hits • use accession link to go to the record page for each hit • published papers • full sequence information • annotation • Blast is linked to a protein domain tool

Sequence Alignment