Sequence similarity Analysis

Sequence similarity Analysis Benny Shomer, December 2005

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation.
Homology: Similarity attributed to descent from a common ancestor.
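As a rough illustration of the identity definition, the sketch below counts identical columns of a pairwise alignment and divides by the alignment length, gaps included. The function name `percent_identity` and the toy sequences are invented for this example and are not taken from any program discussed here.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows hold the same non-gap residue.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * identical / len(aligned_a)


# Toy alignment (hypothetical): 4 identical columns out of 6.
print(round(percent_identity("ACDE-F", "ACDQGF"), 1))
```

Similarity (the "Positives" of a report) would additionally count columns whose substitution score is positive, which requires a substitution matrix.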

Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 (TfR2).

>gi|20140567|sp|Q07891|TFR1_CRIGR Transferrin receptor protein 1 (TfR1) (TR) (TfR) (Trfr)
Length = 757
Score = 540 bits (1392), Expect = e-152
Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)

Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136
+ ++ F F++GY+ + R + C G+S ++ E++ RLY
Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123

Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196
W+DL+ + + L DTI+Q S R AGS L I KL VW D H
Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183

Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256
YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D
Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240

Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316
+DL+ I+FA+KV NAQ F A GVLIY + F P + ++
Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291

Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376
+++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++
Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351

Identity, Similarity, Homology

Objective: Discover as many (subject) sequences as possible that are similar to the query sequence.

Finding Distant Relatives
For proteins, finding distant relatives is a difficult task. Distant protein family members may share <20% amino acid identity(!).

Query: >gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa]
Length = 509
Score = 405 bits (1043), Expect = e-111
Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)

Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER 120
NL+ G + H + ++YGP+ + G+V + PE + K++
Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN 99

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK 347
D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N +
Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 330

Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406
L+ EV L+ +P LKASIKE+LRLH P+ + + R D +
Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390

Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464
Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C G
Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Main Search Algorithms
Smith-Waterman (SW) (SSEARCH / MPsrch)
• Dynamic programming based optimal local alignment algorithm.
• Most sensitive in detecting distantly related proteins.
• Usually runs on MasPar (1024-16384 processors on a typical MP2 machine).
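The dynamic programming idea behind Smith-Waterman can be sketched in a few lines. This is a minimal score-only version under simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, and a linear gap penalty. The function name and the parameter values are illustrative only, and the code is not the SSEARCH or MPsrch implementation.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a vs. b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # local alignment: never go below zero
                prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Every cell of the matrix is filled, which is why the method is optimal but slow compared with the heuristics that follow.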

Main Search Algorithms
BLAST and FASTA: two program families which are heuristics. Both reduce computation time by sacrificing some sensitivity.
Both reduce the size of the problem by:
• pre-selecting sequences thought to share significant similarity with the query
• locating similarity regions inside those sequences.
Definition (heuristic): A rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood. Unlike algorithms, heuristics do not guarantee optimal, or even feasible, solutions and are often used with no theoretical guarantee.

Main Search Algorithms Compared
[Figure: BLAST, FASTA and SW ranked by Speed and by Sensitivity in three panels: Compare Full Length Protein; Compare Nucleic acids; Compare Short Protein Segments.]

DNA vs. Protein
• Protein similarity search is generally more sensitive than DNA.
• Proteins ignore silent mutations.
• Protein substitution matrices are better.
• Where appropriate, prefer translating DNA into protein and comparing against a protein database.
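The point about silent mutations can be made concrete with a small sketch. It assumes the standard genetic code and uses two made-up DNA fragments that differ at every third codon position yet translate to the same peptide. The helper names are invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_a = "CTTGGAAAACGT"   # hypothetical fragment
dna_b = "CTCGGGAAGCGA"   # same codons with silent third-position changes
print(identity(dna_a, dna_b))                          # DNA-level identity
print(identity(translate(dna_a), translate(dna_b)))    # protein-level identity
```

At the DNA level the two fragments look only partly identical, while the translated peptides are the same.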

FASTA Written By Bill Pearson in 1990

FASTA: How do they do it?
Step 1: The goal: quickly locate ungapped similarity regions between the query sequence and the database sequences.

FASTA: How do they do it?
Step 1:
• Determine the length of a word, called a “k-tuple” (for proteins usually 1-3 and for DNA 4-6).
• Pre-compute all possible k-tuples and build a lookup hash (for instance, assume k = 3: there are 20^3 = 8000 possible k-tuples).
{ ARR:[ ], ARN:[ ], ARD:[ ]... ANR:[ ], ANN:[ ], AND:[ ]... RNN:[ ], RND:[ ], RNC:[ ] }

FASTA: How do they do it?
Step 1:
• Now, slide the k-tuple along the query sequence and record each k-tuple and its position in the hash structure.
01234567890123456789012345
NTLGTEIAIEDQICQGLKLTFDTTFS
{ NTL:[0], TLG:[1], LGT:[2], GTE:[3], TEI:[ ], EIA:[ ]... IAI:[ ], AIE:[ ], IED:[ ] }
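A minimal sketch of this indexing step, assuming k = 3 and a plain dictionary as the hash structure. The function name `index_ktuples` is invented for the example, and only k-tuples that actually occur in the query are stored rather than all 8000.

```python
from collections import defaultdict


def index_ktuples(query: str, k: int = 3) -> dict:
    """Map every k-tuple of the query to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return dict(table)


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"   # query sequence from the slide
lookup = index_ktuples(query)
for word in ("NTL", "TLG", "LGT", "GTE"):
    print(word, lookup[word])
```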

FASTA: How do they do it?
Step 1:
• Next, slide the k-tuple along the next subject sequence and record its position and its offset from the query in the hash. Each entry holds [query position, subject position, offset].
01234567890123456789012345
QICQGLKLTNTLGTEIAIEDFDTTFS
{ NTL:[0,9,9], TLG:[1,10,9], LGT:[2,11,9], GTE:[3,12,9], TEI:[ ], EIA:[ ]... IAI:[ ], AIE:[ ], IED:[ ] }
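The offset bookkeeping can be sketched as follows: every shared k-tuple votes for the diagonal given by subject position minus query position, and diagonals with many votes are candidate similarity regions. This is an illustrative sketch with invented names, not the FASTA source.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, subject: str, k: int = 3) -> Counter:
    """Count shared k-tuples per diagonal (offset = subject pos - query pos)."""
    lookup = defaultdict(list)
    for q in range(len(query) - k + 1):
        lookup[query[q:q + k]].append(q)

    diagonals = Counter()
    for s in range(len(subject) - k + 1):
        for q in lookup.get(subject[s:s + k], ()):
            diagonals[s - q] += 1
    return diagonals


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"     # sequences from the slides
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
for offset, count in diagonal_hits(query, subject).most_common():
    print(offset, count)
```

For the two slide sequences the densest diagonal is offset 9, the offset recorded for NTL, TLG, LGT and GTE above.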

FASTA: How do they do it?
Step 1: [Figure: k-tuple matches plotted with the Query on one axis and the Subject on the other.]

FASTA: How do they do it?
Step 2:
• Select the 10 best regions.
• Evaluate those regions using either the PAM or BLOSUM substitution matrix.
• The score of the best region (*) is called init1.
Note: Since the second bullet is applied only after the first, the 10 selected regions may not be the most similar ones in cases where there are many conservative substitutions and few identities.
[Figure: Query vs. Subject plot; the best region is marked *.]
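A hedged sketch of the rescoring idea: walk along one diagonal and keep the best-scoring ungapped segment. A toy scoring function (+5 for identity, -4 otherwise) stands in for a PAM or BLOSUM matrix, and all names are invented for this example.

```python
def toy_score(x: str, y: str) -> int:
    """Stand-in for a substitution matrix lookup (assumption: +5 / -4)."""
    return 5 if x == y else -4


def best_segment_on_diagonal(query: str, subject: str, offset: int,
                             score=toy_score) -> int:
    """Score of the best ungapped segment on the diagonal s = q + offset."""
    best = running = 0
    for q in range(len(query)):
        s = q + offset
        if not 0 <= s < len(subject):
            continue                      # outside the subject sequence
        running = max(0, running + score(query[q], subject[s]))
        best = max(best, running)
    return best


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
region_scores = {d: best_segment_on_diagonal(query, subject, d) for d in (9, -11, 0)}
init1 = max(region_scores.values())      # score of the best single region
print(region_scores, init1)
```

With a real matrix, conservative substitutions would also score positively, which is what makes this rescoring more informative than counting identities.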

FASTA: How do they do it?
Step 3:
1. Consider only regions with a score above a certain threshold.
2. Attempt to join the selected regions.
3. Score by summing the scores of the regions and subtracting a penalty for the “join” areas (similar to a gap penalty). This score is called initn.
[Figure: Query vs. Subject plot showing the joined regions; the best region is marked *.]

FASTA: How do they do it?
Step 4: Apply a “Banded Smith-Waterman” algorithm.
Use dynamic programming to calculate a local alignment, but restrict the region of the matrix to a band centered around the diagonal with the best init1 score (*). The score of this alignment is called the opt score. It is the score used to rank the alignments.
Note: Since the process is confined to a selected band out of the entire matrix, the alignment may be sub-optimal.
[Figure: Query vs. Subject plot with the band around the best diagonal (*).]
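The band restriction can be sketched by filling only the cells whose diagonal lies within a fixed distance of a chosen center diagonal. The scoring scheme (flat match/mismatch, linear gaps), the parameter names `center` and `width`, and the function name are assumptions made for this illustration.

```python
def banded_local_score(a: str, b: str, center: int, width: int,
                       match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Local alignment score using only cells with |(j - i) - center| <= width."""
    best = 0
    prev = {}                                  # row i-1, keyed by column j
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i + center - width)
        hi = min(len(b), i + center + width)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,
                prev.get(j - 1, 0) + sub,      # same diagonal, always in band
                prev.get(j, 0) + gap,          # cells outside the band count as 0
                curr.get(j - 1, 0) + gap,
            )
            best = max(best, curr[j])
        prev = curr
    return best


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
print(banded_local_score(query, subject, center=9, width=2))
```

Only about (2 * width + 1) cells per row are computed, so anything that would require leaving the band is never found.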

FASTA Result Evaluation
Z score: Simply put, FASTA calculates an average score for each length range of sequences in the database and plots them onto a score vs. length regression line. The Z score is the number of standard deviations of a given real score from the theoretical regression line.

FASTA Result Evaluation
[Figure: scores plotted against sequence length. Labels: Z, Score, Regression, Length, STD; one hit is marked *.]

FASTA Result Evaluation
E value: The number of matches (query/subject) with random sequences of the same length range that would be expected, by chance, to reach a Z score greater than or equal to z.
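The Z score idea can be sketched with an ordinary least-squares fit. The sketch assumes the regression is done against the logarithm of the sequence length and that the standard deviation is taken over all residuals; the data and function names are invented, and any further rescaling a real program applies is left out.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def z_scores(lengths, scores):
    """Distance of every score from the regression line, in standard deviations."""
    xs = [math.log(n) for n in lengths]           # assumption: regress on ln(length)
    slope, intercept = fit_line(xs, scores)
    residuals = [s - (slope * x + intercept) for x, s in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sd for r in residuals]            # fails if the fit is perfect (sd == 0)


# Hypothetical library: five unrelated sequences and one clear outlier.
lengths = [100, 200, 300, 400, 500, 600]
scores = [30, 34, 36, 38, 39, 90]
for n, z in zip(lengths, z_scores(lengths, scores)):
    print(n, round(z, 2))
```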

opt E()
< 20 1412 0:=
22 95 0:= one = represents 2289 library sequences
24 254 1:*
26 719 30:*
28 2260 323:*
30 6209 1960:*==
32 13248 7578:===*==
34 25551 20550:========*===
36 40898 42205:==================*
38 64075 69749:============================ *
40 91829 97294:========================================= *
42 115365 118931:===================================================*
44 131747 131192:=========================================================*
46 137309 133622:==========================================================*=
48 134659 127927:=======================================================*===
50 127192 116734:==================================================*=====
52 105873 102629:============================================*==
54 85048 87663:======================================*
56 69881 73226:===============================*
58 57673 60117:==========================*
60 46861 48698:=====================*
62 36549 39041:================ *
64 28272 31049:=============*
66 22568 24541:==========*
68 17633 19303:========*
70 13099 15127:======*
72 10981 11820:=====*
74 8757 9216:====*
76 6154 7173:===*
78 4772 5575:==*
80 3467 4329:=*
82 2873 3312:=*
84 2217 2623:=*
86 1679 2030:*
88 1268 1571:* inset = represents 16 library sequences
90 883 1215:*
92 752 940:* :=======================================*
94 507 728:* :================================ *
96 498 563:* :================================ *
98 282 436:* :================== *
100 284 337:* :================== *
102 205 261:* :============= *
104 121 202:* :======== *
106 127 156:* :======== *
108 96 121:* :====== *
110 65 93:* :=====*
112 51 72:* :====*
114 83 56:* :===*==
116 59 43:* :==*=
118 35 33:* :==*
>120 195 26:* :=*===========

Columns: Score range (opt); Number of optimized scores in the range; Number of random scores expected to be in the range (E()).
= : Actual score distribution
* : Expected score distribution
Watch the 80-110 range.

FASTA Result Evaluation

116 59 43:* :==*=
118 35 33:* :==*
>120 195 26:* :=*===========
454171735 residues in 1422690 sequences
statistics extrapolated from 60000 to 1422511 sequences
Expectation_n fit: rho(ln(x))= 4.1847+/-0.000201; mu= 24.1751+/- 0.012
mean_var=64.3387+/-13.596, 0's: 152 Z-trim: 160 B-trim: 3848 in 1/64
Lambda= 0.159896
Kolmogorov-Smirnov statistic: 0.0187 (N=29) at 52
FASTA (3.45 Mar 2002) function [optimized, MD_40 matrix (18:-23)] ktup: 2
join: 37, opt: 25, open/ext: -10/-2, width: 16
Scan time: 206.400

Kolmogorov-Smirnov statistic: An evaluation of the fit of the data to the expected curve.
< 0.1: Excellent agreement.
> 0.2: Repeat the analysis with higher gap penalties.

FASTA Result Evaluation

>>UNIPROT:Q7XB42 Prunus dulcis cyp74C5 gene for cytochrome P450
initn: 2015 init1: 1741 opt: 2734 Z-score: 3396.1 bits: 637.8 E(): 3.3e-181
Smith-Waterman score: 2755; 61.245% identity (64.346% ungapped) in 498 aa overlap (1-491:1-481)

10 20 30 40 50
Sequen MSSVSSKYPAIASSS-DNESCKPLLQVREIPGDYGFPFFGAIKDRYDYYYSLGADEFFRT
::: :: .::: .: :: :::::: :::: :::::::.:. : .::.:
UNIPRO MSSSSS-----SSSSPNNLPLKP------IPGDYGWPFFGHIKDRYDYFYNQGRYDFFKT
10 20 30 40

60 70 80 90 100 110
Sequen KSLKYNSTIFRTNMPPGPFIAKDPKVIVLLDAISFPILFDCSKVEKKNVLDGTYMPSTDF
. :: ::.:::::::: :: .:::: :::: :::: :: .:: ...:::::::::: .
UNIPRO RIEKYQSTVFRTNMPPGILIASNPKVIALLDAKSFPIIFDNTKVLRRDVLDGTYMPSTAY
50 60 70 80 90 100

120 130 140 150 160 170
Sequen FGGYRPCAFLDPSEPSHATHKGFYLSIISKLHTQFIPIFENSVSLLFQNLEIQISKDGKA
:::: ::.::::::.::: : .. . . :: ::: :..: : .: ::: : ::::::
UNIPRO TGGYRVCAYLDPSEPNHATLKSYFAALLASQHTKFIPLFQSSTSDMFLNLEAQLSKDGKA
110 120 130 140 150 160

E(): 3.3e-181

FASTA: Versions of the program
[T]FASTA[X/Y/S/F]
FASTA: DNA vs. DNA or Protein vs. Protein.
T: Translate a DNA database in all 6 reading frames for comparison with a Protein query.
X/Y: For situations where DNA sequences are likely to contain errors (such as ESTs).
• X: Allow frameshifts only between codons.
• Y: Allow frameshifts also within codons.
F: Analyze a set of fragments resulting from electrophoresis band cleavage and sequencing.
S: Analyze data from Mass Spectrometry analysis of Proteins.
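Translating in all 6 reading frames means three frames on the given strand and three on its reverse complement. The sketch below assumes the standard genetic code and uses invented helper names and a made-up DNA fragment; it ignores ambiguity codes and frameshift handling.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate from the first base; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict:
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, peptide in six_frames("ATGGCCTAA").items():   # hypothetical fragment
    print(name, peptide)
```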

FASTA: In Practice
http://www.ebi.ac.uk/fasta33/
Options: Program Type; Select Database; Select Matrix; Adapt Gap Penalty; Select k-tuple size; Both Strands?; View Histogram?; How Many To View?; Interactivity.
Matrices:
• PAM: Evolutionary Distance.
• MDM: Developed by Janet Thornton’s group on the basis of the PAM methodology.
• BLOSUM: Performs better with local alignments.
k-tuple: Speed vs. Sensitivity.

FASTA: In Practice
Distant Relatives vs. Garbage
• How far do we start with?
• Where in the sequence?
• What size ranges to look into?

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

BLAST: Basic Local Alignment Search Tool
For each position of the query sequence, take a word of length w (3 for proteins, 11 for DNA):
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
*  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

BLAST
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA
Scoring the word NDQ against NDQ: 6, 6+6, 6+6+5=17

BLAST
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA
[NDQ:17]
Scoring the word NDQ against NBZ (same BLOSUM62 matrix): 6, 6+4, 6+4+3=13
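The two worked sums can be reproduced with a few BLOSUM62 entries copied from the matrix shown on the slides. Only the pairs needed for this example are stored; the dictionary and function names are invented for the illustration.

```python
# Selected BLOSUM62 entries, copied from the matrix above (symmetric pairs).
BLOSUM62_SUBSET = {
    ("N", "N"): 6, ("D", "D"): 6, ("Q", "Q"): 5,
    ("D", "B"): 4, ("Q", "Z"): 3,
}


def pair_score(x: str, y: str) -> int:
    """Look up a residue pair in either order."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def word_score(word_a: str, word_b: str) -> int:
    """Sum of position-by-position substitution scores of two equal-length words."""
    return sum(pair_score(x, y) for x, y in zip(word_a, word_b))


print(word_score("NDQ", "NDQ"))   # 6 + 6 + 5
print(word_score("NDQ", "NBZ"))   # 6 + 4 + 3
```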

BLAST
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA
Generate a list of all possible combinations for this word (e.g. for protein words of 3 amino acids, it is a list of 8000 possible combinations), each with its score:
[NDQ:17, NBZ:13, NCA:2, NEE:10, BDZ:13, BBZ:11, ...:.. ]

BLAST
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA
[NDQ:17, NBZ:13, NCA:2, NEE:10, BDZ:13, BBZ:11, ...:.. ]
Weed out all word combinations having a score below a pre-determined cutoff score T (currently 11). The resulting list of words scoring at or above the T cutoff is called a “neighbors list”.
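A small sketch of building the neighbors list for the query word NDQ. It enumerates all 8000 three-letter words over the 20 standard amino acids, so the ambiguity codes B and Z used in the slide example are left out. Only the N, D and Q rows of BLOSUM62 are needed, and they are copied from the matrix shown earlier. The names are invented for this illustration.

```python
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 rows for N, D and Q against the 20 standard residues, in the order of AA.
ROWS = {
    "N": [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
}


def neighbors(word: str, threshold: int) -> dict:
    """All words scoring >= threshold against `word` (letters must be in ROWS)."""
    kept = {}
    for candidate in product(AA, repeat=len(word)):
        total = sum(ROWS[w][AA.index(c)] for w, c in zip(word, candidate))
        if total >= threshold:
            kept["".join(candidate)] = total
    return kept


neighbor_list = neighbors("NDQ", threshold=11)
for w, s in sorted(neighbor_list.items(), key=lambda item: -item[1])[:5]:
    print(w, s)
```

Raising T shrinks the list and speeds up the search at the cost of sensitivity; lowering it does the opposite.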

BLAST
SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA
[NDQ:17, NBZ:13, BDZ:13, BBZ:11, ...:.. ]
For each sequence in the database…
SCPFFGAIKDRYDYYYSLKPNDQVREIPGDGAYFG
GDYGFPFFGAIKDRYPBDZVREIDLGAYYYSCKPS
KPNBZVREIPGDYGFPFFSCGAIKDSLGAYDRYYY
PNDQVREIPGDYSCPBBZAKGYDYYFIKDRYSLGA
LGAFGAISCKGEIPGFPFYSKDRYDYPRDQVRYYD
VREIPGDYGBBZPNDDRYDQFPFFGAYSLGAIKYY
QVREISCKPDYGFNRZNDPGGAIKDRYYSLGYDYA
Each word match is called a “hit”.

BLAST
For each sequence that resulted in hits: we now have many neighbor hits, and we need a method to screen which hits can serve as seeds for a gapped local alignment.
1. Plot all neighbor hits on the sequence with their respective distances. Identify all diagonals.
2. Find candidates for extension of the alignment. Requirement: two hits (or more) on the same diagonal, within a pre-determined distance A, can be used as a seed for extension.
Many unrelated or overlapping hits are filtered out this way, which saves expensive dynamic programming time.
[Figure: hits plotted on diagonals; the distance A and a seed (*) are marked.]
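The two-hit screening can be sketched by grouping hits per diagonal and keeping pairs that do not overlap and lie within the distance A. Hits are given as (query position, subject position) pairs; the function name, the parameter names and the example hits are invented for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def two_hit_seeds(hits, max_distance: int, word_len: int = 3):
    """Return (diagonal, first_q, second_q) for every qualifying pair of hits.

    A pair qualifies when both hits share a diagonal, do not overlap
    (distance >= word_len) and are at most max_distance apart.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        for first, second in combinations(sorted(positions), 2):
            if word_len <= second - first <= max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)


# Hypothetical hits: three on diagonal 5, one isolated hit on diagonal -17.
example_hits = [(0, 5), (4, 9), (20, 3), (1, 6)]
print(two_hit_seeds(example_hits, max_distance=10))
```

The isolated hit never produces a seed, so no extension work is spent on it.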

BLAST: Generating HSPs
Extend the hit without gaps in both directions. Stop extending when the total score drops by more than X below the maximal score obtained so far. Only segments with a score >= S are kept. Such segments are called HSPs (High Scoring Segment Pairs).
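A minimal sketch of the ungapped extension with an X drop-off. A toy scoring function (+5 for identity, -4 otherwise) stands in for a substitution matrix, and the function and parameter names are invented; the caller would still compare the returned score with S.

```python
def toy_score(x: str, y: str) -> int:
    """Stand-in for a substitution matrix lookup (assumption: +5 / -4)."""
    return 5 if x == y else -4


def extend_ungapped(query, subject, q_start, s_start, word_len, x_drop,
                    score=toy_score):
    """Extend a word hit in both directions without gaps.

    Returns (segment score, query start, query end) with the end exclusive.
    """
    def extend(q, s, step):
        best = running = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break                      # dropped more than X below the maximum
            q += step
            s += step
        return best, best_len

    seed = sum(score(query[q_start + i], subject[s_start + i])
               for i in range(word_len))
    right, right_len = extend(q_start + word_len, s_start + word_len, +1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return seed + left + right, q_start - left_len, q_start + word_len + right_len


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
# Seed: the word GTE at query position 3 and subject position 12.
print(extend_ungapped(query, subject, 3, 12, word_len=3, x_drop=10))
```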

BLAST: Score-limited Gapped Extension
Apply a modified (score-limited) Smith-Waterman algorithm. Explore the dynamic programming matrix in both directions, but:
• Start off from the middle point of the highest scoring neighbor hit within the HSP.
• Restrict the search for the optimal path such that the score does not drop by more than X below the maximal score already obtained.

BLAST: E value
E value (of an alignment having a score S): The number of times one expects to find alignments with a score >= S when comparing a random sequence vs. a random database (having the same lengths and compositions).
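The slide gives the definition only. One standard way to turn a raw score into an expectation is the Karlin-Altschul formula E = K·m·n·e^(-λS), with the bit score S' = (λS - ln K) / ln 2 and E = m·n·2^(-S'). In the sketch below the values λ = 0.267 and K = 0.041 are assumed, typical-looking parameters for gapped BLOSUM62 searches, and the query length m and database size n are hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score: float, m: float, n: float,
                    lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041        # assumed statistical parameters
m, n = 800, 5e8                 # hypothetical query length and database size

bits = bit_score(1392, LAMBDA, K)
print(round(bits, 1))
print(expect_from_raw(1392, m, n, LAMBDA, K))
print(expect_from_bits(bits, m, n))
```

The two E-value routes agree by construction; doubling the database size n doubles E for the same score.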

Filtering Low Complexity

Filtering Low Complexity
The Problem: Regions of low complexity or sequence repeats tend to generate high scores that do not reflect real sequence similarity.

Filtering Low Complexity
The Solution:
• SEG for proteins
• DUST for DNA

Filtering Low Complexity: SEG
Note: Masking is applied to the query sequence only, not to the database sequences.
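To give a feel for what masking does, the sketch below replaces residues in low-entropy windows with X. It is a deliberately simplified stand-in based on Shannon entropy in a sliding window, not the SEG or DUST algorithm; the window size, the entropy threshold and the example sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2,
                        mask_char: str = "X") -> str:
    """Mask every residue covered by a window whose entropy is below the threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Hypothetical query with a serine-rich stretch in the middle.
query = "MKTAYIAKQR" + "S" * 16 + "QISFVKSHFS"
print(mask_low_complexity(query))
```

Residues next to the repeat can be masked as well, because any window dominated by the repeat falls below the threshold.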

BLAST: Versions of the program
[t]BLAST[x/n/p]
t: Translate a DNA database in all 6 reading frames for comparison with a Protein query.
x: Translate a nucleotide query in all 6 reading frames for comparison with a Protein database.
p: Comparison is against a Protein database.
n: Comparison is against a Nucleotide database.

BLAST: Special Versions
MEGABLAST: Specifically designed to efficiently find long alignments between very similar sequences. MEGABLAST uses longer words in the comparison process.
Discontiguous MEGABLAST: Better at finding nucleotide sequences that are similar, but not identical, to your nucleotide query.

BLAST: Special Versions
"Search for short nearly exact matches": Simply a regular BLAST, but with the parameters pre-set for optimally finding significant matches to short segments such as PCR primers. It uses a shorter word (7), turns off filtering and allows a higher expect threshold.
Will be discussed later on:
• PSI-BLAST: Position-Specific Iterated BLAST
• RPS-BLAST: Reverse Position Specific BLAST
• CDART: Conserved Domain Architecture Retrieval Tool

BLAST http://www.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/BLAST/ BLAST
