Sequence Similarity

Cathal Seoighe

Why search databases?
• To find a gene related to a newly sequenced gene
• To infer a function
• To find genes in an unannotated stretch of sequence
• To locate false priming sites for a set of PCR oligonucleotides
• To discover sets of related genes

What is a homologous sequence?
• In molecular biology, two sequences are homologous if they are derived from a common ancestor.
• Homologous sequences are often similar to one another.
• The term "homologous proteins" is also sometimes used for proteins that are similar in their folding or their structure.


Dotplots
• Diagonals
• Parallels
• Breaks
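A dotplot can be sketched in a few lines: put one sequence along each axis and mark every cell where the two residues match. This is only a minimal illustration with exact residue matching and no window or threshold filtering; the function name `dotplot` is hypothetical and not taken from any particular program.

```python
def dotplot(seq_a, seq_b):
    """Return one text row per residue of seq_a; '*' marks a match with seq_b."""
    return ["".join("*" if x == y else "." for y in seq_b) for x in seq_a]


# A region shared by both sequences shows up as a diagonal run of '*'.
for row in dotplot("GATTACA", "GATTACA"):
    print(row)

# A repeated segment gives parallel diagonals: "AC" occurs twice in "ACAC".
for row in dotplot("ACAC", "AC"):
    print(row)
```

In this toy form, an insertion or deletion in one sequence would shift the diagonal sideways, which is what appears as a break.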

DNA vs. Protein searches
• DNA is composed of 4 characters (A, G, C, T), so on average at least 25% of the aligned residues of any 2 unrelated sequences will be identical by chance.
• Protein sequence is composed of 20 characters (amino acids), so far fewer positions match by chance and the selectivity of the comparison is improved.
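The chance-identity figure can be checked with a short calculation. If residues occur independently with frequencies p_i, the probability that two unrelated aligned residues are identical is the sum of p_i squared. The sketch below assumes independent positions; the uniform and skewed frequencies are illustrative values, not measured compositions.

```python
def chance_identity(freqs):
    """Expected fraction of identical positions between two unrelated sequences."""
    return sum(p * p for p in freqs)


dna_uniform = chance_identity([0.25] * 4)            # four equally frequent bases
protein_uniform = chance_identity([0.05] * 20)       # twenty equally frequent amino acids
dna_skewed = chance_identity([0.4, 0.1, 0.1, 0.4])   # illustrative AT-rich composition

print(round(dna_uniform, 3), round(protein_uniform, 3), round(dna_skewed, 3))
```

Equal base frequencies give the 25% minimum; any skew in composition pushes the chance identity of DNA higher, while a 20-letter alphabet keeps it much lower.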

DNA vs. Protein searches
• What should we use to search for similarity: the nucleotide or the protein sequences?
• If we have a nucleotide sequence, should we search the DNA databases only, or should we translate it to protein and search protein databases?
Note that by translating into an amino acid sequence we lose information, since the genetic code is degenerate: two or more codons can be translated to the same amino acid.

Use nucleotide sequences to search for close relatives, and protein sequences to detect more distant homology.
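The degeneracy of the genetic code can be made concrete by counting how many codons encode each amino acid. The sketch below builds the standard genetic code from its usual compact string form (bases in the order T, C, A, G); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of different codons that translate to each amino acid.
degeneracy = Counter(CODON_TABLE.values())

for aa in "LSRMW*":
    print(aa, degeneracy[aa])

# Two different DNA sequences that give the same protein after translation.
print(CODON_TABLE["CTT"], CODON_TABLE["CTG"])
```

Because several codons collapse onto one amino acid, silent nucleotide differences disappear on translation, which is the information loss mentioned above.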

Specificity and sensitivity
Definitions:
• Sensitivity: the ability to detect "true positive" matches. The most sensitive search finds all true matches, but might return lots of false positives.
• Specificity: the ability to reject "false positive" matches. The most specific search returns only true matches, but might have lots of false negatives.
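These definitions correspond to the usual ratios: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below computes both for a database search, assuming the set of true homologs is known (as in a benchmark); the function name and the toy identifiers are hypothetical.

```python
def sensitivity_specificity(reported, true_homologs, database):
    """Compute (sensitivity, specificity) for one search.

    Assumes at least one true homolog and at least one non-homolog
    exist in the database, so neither denominator is zero.
    """
    reported, true_homologs, database = set(reported), set(true_homologs), set(database)
    tp = len(reported & true_homologs)              # true matches that were found
    fn = len(true_homologs - reported)              # true matches that were missed
    fp = len(reported - true_homologs)              # unrelated sequences reported
    tn = len(database - reported - true_homologs)   # unrelated sequences rejected
    return tp / (tp + fn), tn / (tn + fp)


# Toy benchmark: 100 sequences, 10 true homologs, 8 of them found plus 2 false hits.
database = range(100)
true_homologs = range(10)
reported = list(range(8)) + [50, 51]
sens, spec = sensitivity_specificity(reported, true_homologs, database)
print(round(sens, 3), round(spec, 3))
```

Loosening the reporting threshold typically raises sensitivity at the cost of specificity, and tightening it does the reverse.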

Approaches to searching a database
• Align each sequence in the database with the query sequence
• Assign a score depending on how well the sequence pair aligns; this is the score of that subject sequence
• Determine the statistical significance of the match
• e.g. Smith-Waterman (ssearch)

In fact, the simple scheme above describes the most accurate way to search for similar sequences
• Scores are determined by scoring matrices and gap penalties
• Examples of scoring matrices: PAM (Point Accepted Mutation), BLOSUM (BLOcks SUbstitution Matrix)
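The exhaustive approach can be illustrated with a small Smith-Waterman local alignment scorer. This is a simplified sketch, not the ssearch implementation: it assumes a flat match/mismatch score instead of a PAM or BLOSUM matrix and a linear gap penalty, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + s,     # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score the query against every subject and rank by score (highest first)."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(search("ACACACTA", {"s1": "AGCACACA", "s2": "GGGGGGGG"}))
```

Each comparison fills a table of size len(a) × len(b), which is why running it against every database sequence is expensive.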

Scoring Matrices

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R  -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N  -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D  -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C  -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E  -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G   0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H  -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I  -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L  -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K  -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M  -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F  -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P  -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S   1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W  -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y  -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V   0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5

BLOSUM50 scoring matrix

Problem
• Each alignment of a sequence pair takes a lot of computation
• This search strategy takes too long to run against a large database

Heuristic Methods
• Most common: BLAST (Basic Local Alignment Search Tool), FASTA
• Both are about 50 times faster than the exhaustive methods
• Both work by finding short exact matches (words) and then extending them to produce alignments (HSPs, high-scoring segment pairs, in the case of BLAST)
• The word size (k-tuple) can determine sensitivity (smaller word size = more sensitive)
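The word-matching step shared by these heuristics can be sketched as follows: index every word of length k in the query, then scan the subject for identical words. This illustrates the seeding idea only, under the assumption of exact word matches; it is not the actual BLAST or FASTA code, and the function name is hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, k):
    """Return (query_pos, subject_pos) for every exact word match of length k."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)       # where each query word occurs
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


hits_k3 = word_hits("ACGTAC", "TTACGT", k=3)
hits_k2 = word_hits("ACGTAC", "TTACGT", k=2)
print(hits_k3)
print(len(hits_k2), "hits with k=2 versus", len(hits_k3), "with k=3")
```

Hits that share the same value of subject_pos minus query_pos lie on one diagonal and are the candidates for extension. A smaller k produces more seeds, which is why it is more sensitive but slower.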

Comparison of BLAST and FASTA
• FASTA is theoretically more sensitive to distantly related sequences than BLAST
• FASTA does not filter low-complexity regions
• Significance of a BLAST hit is worked out theoretically
• Significance of a FASTA hit is determined from the distribution of hits in the database
• The minimum word length for FASTA is 1

Output of FastA

score    obs    exp   ("=" observed, "*" expected)
< 20     222      0 :*
  22      30      0 :*
  24      18      1 :*
  26      18     15 :*
  28      46    159 :*
  30     207    963 :*
  32    1016   3724 := *
  34    4596  10099 :==== *
  36    9835  20741 :========= *
  38   23408  34278 :==================== *
  40   41534  47814 :=================================== *
  42   53471  58447 :============================================ *
  44   73080  64473 :====================================================*=======
  46   70283  65667 :=====================================================*====
  48   64918  62869 :===================================================*==
  50   65930  57368 :===============================================*=======
  52   47425  50436 :======================================= *
  54   36788  43081 :=============================== *
  56   33156  35986 :============================ *
  58   26422  29544 :====================== *
  60   21578  23932 :================== *
  62   19321  19187 :===============*
  64   15988  15259 :============*=
  66   14293  12060 :=========*==
  68   11679   9486 :=======*==
  70   10135   7434 :======*==
  72    8957   5809 :====*===
  74    7728   4529 :===*===
  76    6176   3525 :==*===
  78    5363   2740 :==*==
  80    4434   2128 :=*==
  82    3823   1628 :=*==
  84    3231   1289 :=*=
  86    2474    998 :*==
  88    2197    772 :*=
  90    1716    597 :*=
  92    1430    462 :*=  :===============*========================
  94    1250    358 :*=  :============*===========================
  96     954    277 :*   :=========*=======================
  98     756    214 :*   :=======*===================
 100     678    166 :*   :=====*==================
 102     580    128 :*   :====*===============
 104     476     99 :*   :===*=============
 106     367     77 :*   :==*==========
 108     309     59 :*   :==*========
 110     287     46 :*   :=*========
 112     206     36 :*   :=*======
 114     161     28 :*   :*=====
 116     144     21 :*   :*====
 118     127     16 :*   :*====
>120     886     13 :*   :*==============================

BLAST Programs
• BLAST is actually a family of programs
• BLASTN - Nucleotide query searching a nucleotide database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
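The "6 frames" used by BLASTX, TBLASTN and TBLASTX are the three reading frames on each strand. The sketch below shows one way to produce them with the standard genetic code; it is an illustration only, not how BLAST itself is implemented, and the function names and frame labels (+1 to +3, -1 to -3) are assumptions of this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations of a DNA sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frames("ATGGCC"))
```

Each of the six translations is then compared as a protein sequence, which is why translated searches cost several times more than BLASTN or BLASTP.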

Recent Improvements in BLAST
• BLAST 2.0
• Treatment of gaps
• PSI-BLAST
• Short nearly exact matches
N.B. You can restrict a BLAST search to a particular organism or set of organisms using the advanced BLAST options.

Interpreting BLAST results
• Bit score
  – calculated from the frequency of each aligned amino acid pair compared to the frequency of the same pair in random sequences
• E-value
  – an indication of statistical significance: the number of hits with at least this score expected by chance
N.B. The bit score is always the same for a given query and a given subject, but the E-value depends on the database being queried.

BLAST E-values Relationship between bit-score and E-value
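In Karlin-Altschul statistics, the relationship is E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total length of the database. The sketch below evaluates that formula directly; it ignores the effective-length corrections that BLAST applies, so its numbers will not exactly match real BLAST output, and the lengths used are made-up values.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 100              # residues in the query (illustrative)
small_db = 1_000_000            # total residues in a small database (illustrative)
large_db = 100_000_000          # total residues in a larger database (illustrative)

for bits in (30, 40, 50):
    print(bits, e_value(bits, query_length, small_db), e_value(bits, query_length, large_db))
```

The same bit score gives a larger E-value in a larger database, and each additional bit halves the E-value.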
