Sequence Comparison

Sequence Comparison Pair-wise Similarities

Sequence Comparison • Graphical Alignments • Compare • DotPlot • Pairwise alignments • BestFit • Gap

Similarity vs. Homology • Similarity • Two sequences which resemble each other • Can be measured • Degrees of similarity exist • Homology • Two sequences which are similar due to common evolutionary origin • Must be inferred • All or none

Paralogous vs. Orthologous Relationships • Both imply homology • Orthologs • Sequences that have evolved from a common ancestor following speciation • Paralogs • Sequences that have evolved within a single line of descent following gene duplication

Match Criterion • Is there a similarity between two sequences? • Identical symbols (nucleotides or amino acids) • Related symbols (amino acids) • Do gaps/rearrangements allow for a higher degree of similarity?

Dot Plots • Allow comparison of two sequences in all registers • Produces a graph (Dotplot) of sequence similarities • The human brain interprets the results

GCG DotPlots • Compare • Compares the sequences • Output is a text table containing the comparison information • DotPlot • Produces a graph of Compare's results

Simple 1:1 DotPlot • [Figure: the sequence SEQUENCEANALYSISPRIMER plotted against itself; a dot marks every position where the symbol on one axis is identical to the symbol on the other]
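A 1:1 dot plot of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not how Compare or DotPlot are implemented; the function names `dot_plot` and `render` are made up for the example, and only the first word of the slide's sequence is used to keep the output small.

```python
def dot_plot(seq_x, seq_y):
    """Return the set of (i, j) index pairs where seq_x[i] == seq_y[j]."""
    return {
        (i, j)
        for i, a in enumerate(seq_x)
        for j, b in enumerate(seq_y)
        if a == b
    }


def render(seq_x, seq_y, points):
    """Draw the plot as text, with seq_y running bottom-to-top as on the slide."""
    lines = []
    for j in range(len(seq_y) - 1, -1, -1):
        row = "".join("*" if (i, j) in points else " " for i in range(len(seq_x)))
        lines.append(f"{seq_y[j]} {row}")
    lines.append("  " + seq_x)
    return "\n".join(lines)


x = "SEQUENCE"
y = "SEQUENCE"
points = dot_plot(x, y)
print(render(x, y, points))
```

Every identical symbol produces a dot, so the repeated E's scatter dots off the main diagonal. That scatter is the background "noise" the following slides deal with.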

Stringency and Specificity • Degree to which a program's parameters are set to detect more distant similarities • Degree to which a program's parameters are set to exclude unrelated "background" similarities or "noise"

High Stringency • Low background noise • Only relatively close matches detected

Low Stringency • High background noise • Distant relationships detected

Word Match Comparisons • Identifies short, perfect matches (words) • ktup (k-tuple) • Fast • 1,000 times faster than window/stringency comparison • Less sensitive than window/stringency

Word DotPlot -WordSize=2 (or /WordSize=2) • [Figure, built up over three slides: SEQUENCEANALYSISPRIMER plotted against itself using a word size of 2; the dots fall along the main diagonal]
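The word (k-tuple) comparison can be sketched as follows. The speed comes from indexing every word of one sequence in a lookup table, so each word of the other sequence is found by a single lookup rather than by scanning. The function `word_matches` is a hypothetical name for this illustration and is not GCG's implementation.

```python
from collections import defaultdict


def word_matches(seq_x, seq_y, word_size=2):
    """Return sorted (i, j) start positions where both sequences share an identical word."""
    # Index every word of seq_y by its start position.
    index = defaultdict(list)
    for j in range(len(seq_y) - word_size + 1):
        index[seq_y[j:j + word_size]].append(j)

    # Look up each word of seq_x in the index.
    hits = []
    for i in range(len(seq_x) - word_size + 1):
        for j in index.get(seq_x[i:i + word_size], ()):
            hits.append((i, j))
    return sorted(hits)


seq = "SEQUENCEANALYSISPRIMER"
hits = word_matches(seq, seq, word_size=2)
off_diagonal = [(i, j) for i, j in hits if i != j]
print(len(hits), "word hits,", len(off_diagonal), "off the main diagonal")
```

With a word size of 2 no two-letter word occurs twice in this sequence, so only the diagonal remains. Calling `word_matches(seq, seq, word_size=1)` brings back the background of single-symbol matches.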

Window/Stringency Comparisons • Identifies a given number of matches (stringency) • Over a given range (window) • Slow • High sensitivity

Window DotPlot 4/2 • [Figure, built up over several slides: SEQUENCEANALYSISPRIMER plotted against itself with a window of 4 and a stringency of 2; successive frames step the window through the comparison and show the number of matches found in it (for example 4/4, 0/4, 2/4)]
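A window/stringency comparison can be sketched like this. It is a simplified illustration that counts identities inside the window; the scoring detail and the name `window_matches` are assumptions for the example, not a description of Compare's internals.

```python
def window_matches(seq_x, seq_y, window=4, stringency=2):
    """Return (i, j) window start positions with at least `stringency` identities.

    The window is laid along a diagonal: seq_x[i + k] is compared with
    seq_y[j + k] for k = 0 .. window - 1.
    """
    hits = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            score = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if score >= stringency:
                hits.append((i, j))
    return hits


seq = "SEQUENCEANALYSISPRIMER"
strict = window_matches(seq, seq, window=4, stringency=4)   # high stringency
relaxed = window_matches(seq, seq, window=4, stringency=2)  # low stringency
print(len(strict), "points at 4/4;", len(relaxed), "points at 4/2")
```

Every pair of window positions is examined, which is why this approach is slow compared with the word lookup. Lowering the stringency from 4 to 2 admits off-diagonal points such as EQUE against ENCE (2 matches out of 4), which is the trade between sensitivity and background described on the stringency slides.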

Symbol Comparison Tables (Scoring Matrices) • What is a match? • Define match values for all GCG symbols • Nucleotides • Amino acids • Located in GenRunData:*.cmp

Nucleotide Tables • Programs use different tables depending on the alignment algorithm in use • Matches and mismatches receive different values

compardna.cmp • Compare • Match=1 • Mismatch=0 • Ambiguity symbols with any overlap between the sets of nucleotides are considered matches

!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by COMPARE for the comparison of nucleic acid sequences. This table scores a match for any overlap between any IUB nucleic acid ambiguity symbols EXCEPT X/N.
February 20, 1996 14:33 ..
   A  B  C  D  G  H  K  M  R  S  T  U  V  W  Y
A  1  0  0  1  0  1  0  1  1  0  0  0  1  1  0
B  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1
C  0  1  1  0  0  1  0  1  0  1  0  0  1  0  1
D  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1
G  0  1  0  1  1  0  1  0  1  1  0  0  1  0  0
H  1  1  1  1  0  1  1  1  1  1  1  1  1  1  1
K  0  1  0  1  1  1  1  0  1  1  1  1  1  1  1
M  1  1  1  1  0  1  0  1  1  1  0  0  1  1  1
R  1  1  0  1  1  1  1  1  1  1  0  0  1  1  0
S  0  1  1  1  1  1  1  1  1  1  0  0  1  0  1
T  0  1  0  1  0  1  1  0  0  0  1  1  0  1  1
U  0  1  0  1  0  1  1  0  0  0  1  1  0  1  1
V  1  1  1  1  1  1  1  1  1  1  0  0  1  1  1
W  1  1  0  1  0  1  1  1  1  0  1  1  1  1  1
Y  0  1  1  1  0  1  1  1  0  1  1  1  1  1  1
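The "any overlap is a match" rule can be expressed directly from the standard IUB/IUPAC ambiguity codes. The sketch below is an illustration with a made-up function name, `overlap_score`; it treats U as T and leaves out N and X, as the COMPARE table does.

```python
# Standard IUB/IUPAC nucleotide ambiguity codes (U is treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def overlap_score(a, b, match=1, mismatch=0):
    """Score `match` if the two symbols share at least one nucleotide."""
    return match if set(IUB[a]) & set(IUB[b]) else mismatch


for pair in [("A", "R"), ("A", "Y"), ("M", "K"), ("T", "U"), ("S", "W")]:
    print(pair, overlap_score(*pair))
```

The pairs printed here agree with the corresponding entries of the table above: A/R and T/U match, while A/Y, M/K and S/W do not.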

nwsgapdna.cmp • Gap • Match=10 • Mismatch=0 • Gap penalties • Gap Create • Gap Extend

!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by GAP for the comparison of nucleic acid sequences.
{ GAP_CREATE 50 GAP_EXTEND 3 }
   A  B  C  D  G  H  K  M  N  R  S  T  U  V  W  X  Y
A 10  0  0 10  0 10  0 10 10 10  0  0  0 10 10 10  0
B  0 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
C  0 10 10  0  0 10  0 10 10  0 10  0  0 10  0 10 10
D 10 10  0 10 10 10 10 10 10 10 10 10 10 10 10 10 10
G  0 10  0 10 10  0 10  0 10 10 10  0  0 10  0 10  0
H 10 10 10 10  0 10 10 10 10 10 10 10 10 10 10 10 10
K  0 10  0 10 10 10 10  0 10 10 10 10 10 10 10 10 10
M 10 10 10 10  0 10  0 10 10 10 10  0  0 10 10 10 10
N 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
R 10 10  0 10 10 10 10 10 10 10 10  0  0 10 10 10  0
S  0 10 10 10 10 10 10 10 10 10 10  0  0 10  0 10 10
T  0 10  0 10  0 10 10  0 10  0  0 10 10  0 10 10 10
U  0 10  0 10  0 10 10  0 10  0  0 10 10  0 10 10 10
V 10 10 10 10 10 10 10 10 10 10 10  0  0 10 10 10 10
W 10 10  0 10  0 10 10 10 10 10  0 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Y  0 10 10 10  0 10 10 10 10  0 10 10 10 10 10 10 10
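To get a feel for the GAP_CREATE and GAP_EXTEND values in the table header, here is a small sketch of an affine gap penalty. It assumes the penalty is the creation value charged once plus the extension value for every gap symbol; that formula and the function name `gap_penalty` are assumptions made for illustration, so check the program documentation for the exact rule.

```python
def gap_penalty(length, gap_create=50, gap_extend=3):
    """Affine gap penalty (assumed form): creation cost once, plus extension per gap symbol."""
    if length <= 0:
        return 0
    return gap_create + gap_extend * length


MATCH = 10  # match value from nwsgapdna.cmp

for length in (1, 4, 10):
    cost = gap_penalty(length)
    print(f"gap of {length}: penalty {cost} (worth {cost / MATCH:.1f} matches)")
```

Under this assumption, opening a gap costs about as much as five matches gain, while lengthening an existing gap is cheap. A few long gaps are therefore favoured over many short ones.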

swgapdna.cmp • BestFit • Match=10 • Mismatch=-9 • Negative numbers prevent extension of an alignment once the sequences diverge

!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences.
February 20, 1996 14:35 ..
{ GAP_CREATE 50 GAP_EXTEND 3 }
   A  B  C  D  G  H  K  M  N  R  S  T  U  V  W  X  Y
A 10 -9 -9 10 -9 10 -9 10 10 10 -9 -9 -9 10 10 10 -9
B -9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
C -9 10 10 -9 -9 10 -9 10 10 -9 10 -9 -9 10 -9 10 10
D 10 10 -9 10 10 10 10 10 10 10 10 10 10 10 10 10 10
G -9 10 -9 10 10 -9 10 -9 10 10 10 -9 -9 10 -9 10 -9
H 10 10 10 10 -9 10 10 10 10 10 10 10 10 10 10 10 10
K -9 10 -9 10 10 10 10 -9 10 10 10 10 10 10 10 10 10
M 10 10 10 10 -9 10 -9 10 10 10 10 -9 -9 10 10 10 10
N 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
R 10 10 -9 10 10 10 10 10 10 10 10 -9 -9 10 10 10 -9
S -9 10 10 10 10 10 10 10 10 10 10 -9 -9 10 -9 10 10
T -9 10 -9 10 -9 10 10 -9 10 -9 -9 10 10 -9 10 10 10
U -9 10 -9 10 -9 10 10 -9 10 -9 -9 10 10 -9 10 10 10
V 10 10 10 10 10 10 10 10 10 10 10 -9 -9 10 10 10 10
W 10 10 -9 10 -9 10 10 10 10 10 -9 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Y -9 10 10 10 -9 10 10 10 10 -9 10 10 10 10 10 10 10
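Why a negative mismatch value stops an alignment from running on can be shown with a deliberately simplified sketch: it looks for the best-scoring ungapped stretch along a single diagonal. This is not BestFit's algorithm (which also handles gaps); the function name `best_local_segment` and the two test sequences are invented for the illustration.

```python
def best_local_segment(seq_x, seq_y, match=10, mismatch=-9):
    """Best-scoring ungapped stretch on the main diagonal.

    Returns (score, start, end_exclusive). The running score is reset
    whenever it drops to zero or below, as in a local alignment.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for k, (a, b) in enumerate(zip(seq_x, seq_y)):
        running += match if a == b else mismatch
        if running <= 0:
            running, start = 0, k + 1
        elif running > best[0]:
            best = (running, start, k + 1)
    return best


# Two conserved 4-base islands separated by 5 diverged positions (hypothetical data).
x = "ACGT" + "AAAAA" + "TGCA"
y = "ACGT" + "CCCCC" + "TGCA"

print(best_local_segment(x, y, mismatch=-9))  # stops at the first island
print(best_local_segment(x, y, mismatch=0))   # runs across the diverged region
```

With a mismatch value of -9, the diverged region wipes out the score of the first island, so the reported segment stays within a conserved island. With a mismatch value of 0, nothing is lost by crossing the diverged region, and the segment spans the whole sequence.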

Amino Acid Tables • Measure of similarity between amino acids • Not a simple match/mismatch relationship • Values vary depending on degree of relatedness • Based on evolution, chemistry, or structure

PAM250 Dayhoff Matrix • Based on evolutionary relationships • Derived empirically by comparing amino acid usage between closely related proteins • At least 85% identical

PAM • Accepted Point Mutations • PAM-1 Matrix • 1 "evolutionary" event • Allow 1 residue out of 100 to change • 1% Difference • What is the probability that a given residue will change to any other?

PAM250 Matrix • Allow 250 "evolutionary" events per 100 residues • About 80% difference (repeated changes at the same position keep this below 100%) • Account for more distant relationships • Can construct any PAM-N matrix
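The step from PAM-1 to PAM-N amounts to applying the PAM-1 mutation probabilities N times, i.e. raising the PAM-1 probability matrix to the Nth power. The sketch below shows this with a toy PAM-1 in which every residue is equally likely to change into any of the other 19. That uniform matrix is an assumption made purely for illustration; the real Dayhoff probabilities are far from uniform, so the toy numbers will not reproduce the slide's figure for PAM250.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def toy_pam1(n_states=20, change=0.01):
    """Toy PAM-1: each residue changes with probability 1%, uniformly to any other."""
    off = change / (n_states - 1)
    return [[1.0 - change if i == j else off for j in range(n_states)] for i in range(n_states)]


pam1 = toy_pam1()
pam250 = mat_pow(pam1, 250)
unchanged_1 = pam1[0][0]
unchanged_250 = pam250[0][0]
print(f"PAM-1:   {unchanged_1:.2%} of positions unchanged")
print(f"PAM-250: {unchanged_250:.2%} of positions unchanged (toy model)")
```

Even after 250 events per 100 positions, a fraction of positions still shows the original residue, because many events hit positions that have already changed or change a residue back.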

pam250.cmp • Matches vary depending on the degree of conservation of any particular amino acid • A - A: 2 • W - W: 17 • Mismatches: vary depending on degree of relatedness between amino acids • Phe - Tyr: 7 • Leu - Ile: 2 • Cys - Leu: -6

PAM250 amino acid substitution matrix.
{ GAP_CREATE 12 GAP_EXTEND 4 }
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  Z
A  2  0 -2  0  0 -4  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6 -3  0
B  0  2 -4  3  2 -5  0  1 -2  1 -3 -2  2 -1  1 -1  0  0 -2 -5 -3  2
C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8  0 -5
D  0  3 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7 -4  3
E  0  2 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7 -4  3
F -4 -5 -4 -6 -5  9 -5 -2  1 -5  2  0 -4 -5 -5 -4 -3 -3 -1  0  7 -5
G  1  0 -3  1  0 -5  5 -2 -3 -2 -4 -3  0 -1 -1 -3  1  0 -1 -7 -5 -1
H -1  1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3  0  2
I -1 -2 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5 -1 -2
K -1  1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3 -4  0
L -2 -3 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2 -1 -3
M -1 -2 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4 -2 -2
N  0  2 -4  2  1 -4  0  2 -2  1 -3 -2  2 -1  1  0  1  0 -2 -4 -2  1
P  1 -1 -3 -1 -1 -5 -1  0 -2 -1 -3 -2 -1  6  0  0  1  0 -1 -6 -5  0
Q  0  1 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5 -4  3
R -2 -1 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2 -4  0
S  1  0  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2 -3  0
T  1  0 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5 -3 -1
V  0 -2 -2 -2 -2 -1 -1 -2  4 -2  2  2 -2 -1 -2 -2 -1  0  4 -6 -2 -2
W -6 -5 -8 -7 -7  0 -7 -3 -5 -3 -2 -4 -4 -6 -5  2 -2 -5 -6 17  0 -6
Y -3 -3  0 -4 -4  7 -5  0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2  0 10 -4
Z  0  2 -5  3  3 -5 -1  2 -2  0 -3 -2  1  0  3  0  0 -1 -2 -6 -4  3
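Using such a table amounts to looking up one value per aligned pair and adding them up. The sketch below scores a short ungapped alignment using only the five PAM250 values quoted on the pam250.cmp slide; the two peptides and the helper names `pair_score` and `ungapped_score` are made up for the example.

```python
# The PAM250 entries quoted on the pam250.cmp slide (the table is symmetric).
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("W", "W"): 17,
    ("F", "Y"): 7,
    ("L", "I"): 2,
    ("C", "L"): -6,
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Look up the score for a residue pair in either order."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def ungapped_score(pep_x, pep_y, table=PAM250_SUBSET):
    """Sum the table values over an ungapped alignment of equal-length peptides."""
    if len(pep_x) != len(pep_y):
        raise ValueError("peptides must have the same length")
    return sum(pair_score(a, b, table) for a, b in zip(pep_x, pep_y))


# Hypothetical peptides chosen so that every pair is in the subset above.
total = ungapped_score("AWFLC", "AWYIL")
print(total)  # 2 + 17 + 7 + 2 - 6 = 22
```

Note how the conserved tryptophan contributes far more than the conserved alanine, and how the conservative Phe/Tyr replacement still scores well while Cys/Leu is penalised.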

PAM Problems • Constructed in 1978 • Dataset much less complete than today • Used mainly small, globular proteins • Set of proteins used much more closely related than most relationships people are attempting to identify • Assumes all positions are equally mutable • Real proteins have both conserved and unconserved positions

BLOSUM Tables • Blocks substitution matrix • Derived from aligned Blocks of related sequences • 2000 blocks • 500 different protein groups • Created from an all vs. all comparison of the protein database

BLOSUM Reference • Amino acid substitution matrices from protein blocks. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA 89: 10915-10919.

Blocks • Aligned, ungapped conserved region of a protein family • Calculate the frequency with which any amino acid can appear at each position • Compute the probability that any amino acid can substitute for any other
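How substitution scores can be derived from a block may be sketched as a log-odds calculation: count the residue pairs observed in each column, compare the observed pair frequency with the frequency expected from the residue abundances alone, and take the logarithm. The sketch follows the general scheme of the BLOSUM paper in simplified form; it omits the clustering of similar sequences and the rounding to integers, and the tiny block and function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Return {pair: 2 * log2(observed / expected)} for every observed pair."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Residue frequencies implied by the pair frequencies.
    freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            freq[a] += q
        else:
            freq[a] += q / 2
            freq[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        expected = freq[a] * freq[b] if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = 2 * math.log2(q / expected)
    return scores


# Hypothetical block: four aligned sequences, two columns.
block = ["AL", "AL", "AI", "SL"]
scores = log_odds(block)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Pairs that never occur together in a column (A with L here) get no score in this sketch; with a realistic number of blocks they would receive negative values.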

BLOSUM Advantages • Frequencies obtained from protein blocks constructed regardless of evolutionary distance • Blocks represent regions of conserved sequence similarities • Conservation due to functional constraints • Calculated frequencies reflect functional constraints • Much larger data set used than for the PAM matrix

BLOSUM62 Table • Default table for almost all amino acid comparisons • FastA and TFastA use blosum50 • Many other blosum tables are available • In GenMoreData

BLOSUM62 amino acid substitution matrix.
{ GAP_CREATE 12 GAP_EXTEND 4 }
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5

StructGapPep.cmp • Alternative table • Based upon amino acid substitutions after superposition of homologous protein structures • Closely related amino acids have alpha-carbon atoms close to one another after superposition of the structures • Useful for finding weak similarities between proteins (?)

Other Tables • Genetic Code Matrix • How many nucleotide changes required to switch between amino acids • Chemical Similarity • Side Chain
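The genetic code matrix idea can be illustrated by computing, for two amino acids, the smallest number of nucleotide changes that turns a codon for one into a codon for the other. The sketch uses the standard genetic code; the function names are made up, and how such counts are turned into table values is not covered here.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]


def min_changes(aa_x, aa_y):
    """Minimum number of nucleotide changes between any codon of aa_x and any of aa_y."""
    return min(
        sum(p != q for p, q in zip(cx, cy))
        for cx in codons_for(aa_x)
        for cy in codons_for(aa_y)
    )


for pair in [("D", "E"), ("F", "L"), ("M", "W"), ("W", "D")]:
    print(pair, min_changes(*pair))
```

Asp/Glu and Phe/Leu are a single nucleotide change apart, Met/Trp needs two, and Trp/Asp needs all three positions to change.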

Compare

Compare • Compares two sequences for regions of similarity • Uses either a word or window/stringency (default) comparison • Produces a table of overlapping points of similarity • DotPlot plots the points on a graph
