Sequence Comparison
Sequence Comparison Pair-wise Similarities
Sequence Comparison • Graphical Alignments • Compare • DotPlot • Pairwise alignments • BestFit • Gap
Similarity vs. Homology • Similarity • Two sequences which resemble each other • Can be measured • Degrees of similarity exist • Homology • Two sequences which are similar due to common evolutionary origin • Must be inferred • All or none
Paralogous vs. Orthologous Relationships • Both imply homology • Orthologs • Sequences that have evolved from a common ancestor following speciation • Paralogs • Sequences that have evolved within a single line of descent following gene duplication
Match Criterion • Is there a similarity between two sequences? • Identical symbols (nucleotides or amino acids) • Related symbols (amino acids) • Do gaps/rearrangements allow for a higher degree of similarity?
Dot Plots • Allow comparison of two sequences in all registers • Produces a graph (Dotplot) of sequence similarities • The human brain interprets the results
GCG DotPlots • Compare • Compares the sequences • Output is a text table containing the comparison information • DotPlot • Produces a graph of Compare's results
Simple 1:1 DotPlot • [Figure: dot plot of the sequence SEQUENCEANALYSISPRIMER on the horizontal axis against the same sequence on the vertical axis; a dot marks every position where the two symbols are identical]
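The simple 1:1 dot plot can be sketched in a few lines of Python. This is only an illustration of the idea, not the GCG Compare/DotPlot implementation; the function names `dot_matrix` and `render` are made up for this example.

```python
def dot_matrix(seq_x, seq_y):
    """Cell [i][j] is True when symbol i of seq_y equals symbol j of seq_x."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Draw the plot as text, one row per symbol of seq_y."""
    rows = dot_matrix(seq_x, seq_y)
    lines = []
    # Print seq_y from its last symbol to its first so that the
    # main diagonal rises from the lower left to the upper right.
    for i in range(len(seq_y) - 1, -1, -1):
        cells = " ".join("•" if hit else " " for hit in rows[i])
        lines.append(f"{seq_y[i]} {cells}".rstrip())
    lines.append("  " + " ".join(seq_x))
    return "\n".join(lines)


if __name__ == "__main__":
    s = "SEQUENCEANALYSISPRIMER"
    print(render(s, s))
```

Every identical pair of symbols produces a dot, which is why a 1:1 plot is noisy: the diagonal is visible, but so is every chance match between single letters.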
Stringency and Specificity • Degree to which a program's parameters are set to detect more distant similarities (sensitivity) • Degree to which a program's parameters are set to exclude unrelated “background” similarities or “noise” (specificity)
High Stringency • Low background noise • Only relatively close matches detected
Low Stringency • High background noise • Distant relationships detected
Word Match Comparisons • Identifies short, perfect matches (words) • ktup (k-tuple) • Fast • 1,000 times faster than window/stringency comparison • Less sensitive than window/stringency
Word DotPlot -WordSize=2 (/WordSize=2) • [Figure, two animation frames: word-match dot plot of SEQUENCEANALYSISPRIMER against itself with a word size of 2; only the main diagonal is marked]
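A word (k-tuple) comparison can be sketched by indexing every word of one sequence and looking up the words of the other. This is a minimal illustration, not the GCG algorithm; `word_hits` and its `word_size` parameter are names chosen for this example.

```python
from collections import defaultdict


def word_hits(seq_x, seq_y, word_size=2):
    """Return sorted (x, y) start positions of identical words of length word_size."""
    # Index every word of seq_y by its start position.
    index = defaultdict(list)
    for j in range(len(seq_y) - word_size + 1):
        index[seq_y[j:j + word_size]].append(j)
    # Look up each word of seq_x; only perfect matches are found.
    hits = []
    for i in range(len(seq_x) - word_size + 1):
        for j in index.get(seq_x[i:i + word_size], []):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    s = "SEQUENCEANALYSISPRIMER"
    for x, y in word_hits(s, s, word_size=2):
        print(x, y, s[x:x + 2])
```

The lookup replaces the all-against-all symbol comparison, which is where the speed comes from; the price is that a word containing a single mismatch is never reported.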
Window/Stringency Comparisons • Identifies a given number of matches (stringency) • Over a given range (window) • Slow • High sensitivity
Window DotPlot 4/2 • [Figure, five animation frames: a window is moved across the plot of SEQUENCEANALYSISPRIMER against itself; the frames are labelled 4/4, 0/0, 0/4, 4/4 and 2/4, and a dot is drawn where the window meets the stringency]
Window DotPlot 4/2 • [Figure: dot plot at window/stringency 4/2]
Window DotPlot 4/1 • [Figure: dot plot at window/stringency 4/1]
Window DotPlot 4/3 • [Figure: dot plot at window/stringency 4/3]
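The window/stringency idea can be sketched as follows. The sketch assumes that 4/2 means a window of 4 and a stringency of 2, counts only identical symbols as matches, and reports a hit at the start of the window; a real program may score with a comparison table and may place the dot elsewhere in the window. `window_hits` is a name made up for this example.

```python
def window_hits(seq_x, seq_y, window=4, stringency=2):
    """Return (x, y, matches) for every window pair with matches >= stringency."""
    hits = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Count identical symbols along the diagonal inside the window.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j, matches))
    return hits


if __name__ == "__main__":
    s = "SEQUENCEANALYSISPRIMER"
    for stringency in (1, 2, 3):
        print(f"4/{stringency}:", len(window_hits(s, s, 4, stringency)), "dots")
```

Lowering the stringency from 3 to 1 can only add dots, never remove them, which is the trade-off between background noise and the detection of distant relationships described on the stringency slides.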
Symbol Comparison Tables (Scoring Matrices) • What is a match? • Define match values for all GCG symbols • Nucleotides • Amino acids • Located in GenRunData:*.cmp
Nucleotide Tables • Programs use different tables depending on the alignment algorithm in use • Matches and mismatches receive different values
compardna.cmp • Compare • Match=1 • Mismatch=0 • Ambiguity symbols with any overlap between the sets of nucleotides are considered matches
!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by COMPARE for the comparison of nucleic acid sequences. This table scores a match for any overlap between any IUB nucleic acid ambiguity symbols EXCEPT X/N.
February 20, 1996 14:33 ..
  A B C D G H K M R S T U V W Y
A 1 0 0 1 0 1 0 1 1 0 0 0 1 1 0
B 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
C 0 1 1 0 0 1 0 1 0 1 0 0 1 0 1
D 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
G 0 1 0 1 1 0 1 0 1 1 0 0 1 0 0
H 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
K 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1
M 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1
R 1 1 0 1 1 1 1 1 1 1 0 0 1 1 0
S 0 1 1 1 1 1 1 1 1 1 0 0 1 0 1
T 0 1 0 1 0 1 1 0 0 0 1 1 0 1 1
U 0 1 0 1 0 1 1 0 0 0 1 1 0 1 1
V 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1
W 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1
Y 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1
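The rule "any overlap between the sets of nucleotides is a match" can be written down directly from the standard IUB/IUPAC ambiguity codes. The sketch below treats U as T and leaves out X and N, as the table header states; `IUB_SETS`, `compare_score` and `table_row` are names chosen for this example, not part of GCG.

```python
# Standard IUB/IUPAC nucleotide ambiguity codes (U is treated as T here).
IUB_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}
SYMBOLS = "ABCDGHKMRSTUVWY"  # column order used by the table


def compare_score(a, b):
    """1 if the two symbols share at least one nucleotide, else 0."""
    return 1 if set(IUB_SETS[a]) & set(IUB_SETS[b]) else 0


def table_row(symbol):
    """One row of the match table as a string of 0/1 digits."""
    return "".join(str(compare_score(symbol, other)) for other in SYMBOLS)


if __name__ == "__main__":
    print("  " + " ".join(SYMBOLS))
    for symbol in SYMBOLS:
        print(symbol + " " + " ".join(table_row(symbol)))
```

For example, M (A or C) and K (G or T) share no nucleotide and score 0, while R (A or G) and D (A, G or T) overlap and score 1.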
nwsgapdna.cmp • Gap • Match=10 • Mismatch=0 • Gap penalties • Gap Create • Gap Extend
!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by GAP for the comparison of nucleic acid sequences.
{ GAP_CREATE 50 GAP_EXTEND 3 }
   A  B  C  D  G  H  K  M  N  R  S  T  U  V  W  X  Y
A 10  0  0 10  0 10  0 10 10 10  0  0  0 10 10 10  0
B  0 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
C  0 10 10  0  0 10  0 10 10  0 10  0  0 10  0 10 10
D 10 10  0 10 10 10 10 10 10 10 10 10 10 10 10 10 10
G  0 10  0 10 10  0 10  0 10 10 10  0  0 10  0 10  0
H 10 10 10 10  0 10 10 10 10 10 10 10 10 10 10 10 10
K  0 10  0 10 10 10 10  0 10 10 10 10 10 10 10 10 10
M 10 10 10 10  0 10  0 10 10 10 10  0  0 10 10 10 10
N 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
R 10 10  0 10 10 10 10 10 10 10 10  0  0 10 10 10  0
S  0 10 10 10 10 10 10 10 10 10 10  0  0 10  0 10 10
T  0 10  0 10  0 10 10  0 10  0  0 10 10  0 10 10 10
U  0 10  0 10  0 10 10  0 10  0  0 10 10  0 10 10 10
V 10 10 10 10 10 10 10 10 10 10 10  0  0 10 10 10 10
W 10 10  0 10  0 10 10 10 10 10  0 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Y  0 10 10 10  0 10 10 10 10  0 10 10 10 10 10 10 10
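How the match, mismatch and gap values combine into an alignment score can be sketched with a global (Needleman-Wunsch style) dynamic program with separate create and extend penalties. Several things here are assumptions made for illustration: a gap of length L is charged GAP_CREATE + L × GAP_EXTEND, end gaps are penalized like any other gap, only plain A/C/G/T identity is scored instead of the full table, and `global_score` is a made-up name. The actual Gap program may differ on each of these points.

```python
NEG = float("-inf")


def global_score(a, b, match=10, mismatch=0, gap_create=50, gap_extend=3):
    """Best global alignment score with create/extend gap penalties."""
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] aligned to b[j-1]
    # X: alignment ends with a[i-1] opposite a gap
    # Y: alignment ends with b[j-1] opposite a gap
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_create + gap_extend * i)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_create + gap_extend * j)
    open_cost = gap_create + gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - open_cost,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - open_cost)
            Y[i][j] = max(M[i][j - 1] - open_cost,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - open_cost)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(global_score("ACGT", "ACGT"))  # four matches
    print(global_score("ACGT", "ACT"))   # three matches and one gap
```

With a creation penalty five times the match value, a gap only pays off when it brings more than five additional matches into register.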
swgapdna.cmp • BestFit • Match=10 • Mismatch=-9 • Negative numbers prevent extension of an alignment once the sequences diverge
!!NA_SCORING_MATRIX_RECT 1.0
Default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences.
February 20, 1996 14:35 ..
{ GAP_CREATE 50 GAP_EXTEND 3 }
   A  B  C  D  G  H  K  M  N  R  S  T  U  V  W  X  Y
A 10 -9 -9 10 -9 10 -9 10 10 10 -9 -9 -9 10 10 10 -9
B -9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
C -9 10 10 -9 -9 10 -9 10 10 -9 10 -9 -9 10 -9 10 10
D 10 10 -9 10 10 10 10 10 10 10 10 10 10 10 10 10 10
G -9 10 -9 10 10 -9 10 -9 10 10 10 -9 -9 10 -9 10 -9
H 10 10 10 10 -9 10 10 10 10 10 10 10 10 10 10 10 10
K -9 10 -9 10 10 10 10 -9 10 10 10 10 10 10 10 10 10
M 10 10 10 10 -9 10 -9 10 10 10 10 -9 -9 10 10 10 10
N 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
R 10 10 -9 10 10 10 10 10 10 10 10 -9 -9 10 10 10 -9
S -9 10 10 10 10 10 10 10 10 10 10 -9 -9 10 -9 10 10
T -9 10 -9 10 -9 10 10 -9 10 -9 -9 10 10 -9 10 10 10
U -9 10 -9 10 -9 10 10 -9 10 -9 -9 10 10 -9 10 10 10
V 10 10 10 10 10 10 10 10 10 10 10 -9 -9 10 10 10 10
W 10 10 -9 10 -9 10 10 10 10 10 -9 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Y -9 10 10 10 -9 10 10 10 10 -9 10 10 10 10 10 10 10
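Why negative mismatch values stop an alignment from extending can be seen in a local (Smith-Waterman style) dynamic program, where a running score is never allowed to drop below zero. The sketch below is an illustration under assumptions, not the BestFit implementation: it scores plain identity only, charges a gap of length L as GAP_CREATE + L × GAP_EXTEND, and uses the made-up name `local_score`.

```python
NEG = float("-inf")


def local_score(a, b, match=10, mismatch=-9, gap_create=50, gap_extend=3):
    """Best local alignment score; 0 means no segment scores above zero."""
    n, m = len(a), len(b)
    # M: local alignment ending with a[i-1] aligned to b[j-1] (floored at 0)
    # X / Y: local alignment ending with a gap opposite a[i-1] / b[j-1]
    M = [[0] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    open_cost = gap_create + gap_extend
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets a new alignment start anywhere.
            M[i][j] = max(0, max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s)
            X[i][j] = max(M[i - 1][j] - open_cost, X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - open_cost, Y[i][j - 1] - gap_extend)
            best = max(best, M[i][j])
    return best


if __name__ == "__main__":
    # The shared core ACGT is found; the flanking mismatches are excluded
    # because each of them would lower the score by 9.
    print(local_score("TTTACGTAAA", "GGGACGTCCC"))
```

With a mismatch value of 0, as in the table used by Gap, extending through diverged flanks would cost nothing, so the best-scoring segment would no longer have sharp ends.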
Amino Acid Tables • Measure of similarity between amino acids • Not a simple match/mismatch relationship • Values vary depending on degree of relatedness • Based on evolution, chemistry, or structure
PAM250 Dayhoff Matrix • Based on evolutionary relationships • Derived empirically by comparing amino acid usage between closely related proteins • At least 85% identical
PAM • Accepted Point Mutations • PAM-1 Matrix • 1 "evolutionary" event • Allow 1 residue out of 100 to change • 1% Difference • What is the probability that that residue will change to any other?
PAM250 Matrix • Allow 250 "evolutionary" events • 80% Difference • Account for more distant relationships • Can construct any PAM-N matrix
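The step from PAM-1 to PAM-N is repeated multiplication of the PAM-1 probability matrix by itself. The sketch below uses a deliberately artificial PAM-1 in which all 20 amino acids are equally frequent and every substitution is equally likely; these numbers are not Dayhoff's data, and `toy_pam1`, `mat_pow` and `fraction_changed` are names made up for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def toy_pam1(size=20, change=0.01):
    """Artificial PAM-1: each residue changes with probability 1%,
    spread evenly over the other residues (an assumption, not real data)."""
    off = change / (size - 1)
    return [[1.0 - change if i == j else off for j in range(size)]
            for i in range(size)]


def fraction_changed(m):
    """Expected fraction of positions that differ, for equal residue frequencies."""
    size = len(m)
    return 1.0 - sum(m[i][i] for i in range(size)) / size


if __name__ == "__main__":
    pam1 = toy_pam1()
    pam250 = mat_pow(pam1, 250)
    print(f"PAM-1:   {fraction_changed(pam1):.1%} different")
    print(f"PAM-250: {fraction_changed(pam250):.1%} different")
```

The point of the exercise is that 250 accepted mutations per 100 residues do not mean 250% difference: repeated and reverse changes at the same position keep the observed difference below 100%. The toy model will not reproduce the 80% quoted on the slide exactly, since real substitution rates and amino acid frequencies are far from uniform.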
pam250.cmp • Matches vary depending on the degree of conservation of any particular amino acid • A - A: 2 • W - W: 17 • Mismatches: vary depending on degree of relatedness between amino acids • Phe - Tyr: 7 • Leu - Ile: 2 • Cys - Leu: -6
PAM250 amino acid substitution matrix.
{ GAP_CREATE 12 GAP_EXTEND 4 }
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  Z
A  2  0 -2  0  0 -4  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6 -3  0
B  0  2 -4  3  2 -5  0  1 -2  1 -3 -2  2 -1  1 -1  0  0 -2 -5 -3  2
C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8  0 -5
D  0  3 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7 -4  3
E  0  2 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7 -4  3
F -4 -5 -4 -6 -5  9 -5 -2  1 -5  2  0 -4 -5 -5 -4 -3 -3 -1  0  7 -5
G  1  0 -3  1  0 -5  5 -2 -3 -2 -4 -3  0 -1 -1 -3  1  0 -1 -7 -5 -1
H -1  1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3  0  2
I -1 -2 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5 -1 -2
K -1  1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3 -4  0
L -2 -3 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2 -1 -3
M -1 -2 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4 -2 -2
N  0  2 -4  2  1 -4  0  2 -2  1 -3 -2  2 -1  1  0  1  0 -2 -4 -2  1
P  1 -1 -3 -1 -1 -5 -1  0 -2 -1 -3 -2 -1  6  0  0  1  0 -1 -6 -5  0
Q  0  1 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5 -4  3
R -2 -1 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2 -4  0
S  1  0  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2 -3  0
T  1  0 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5 -3 -1
V  0 -2 -2 -2 -2 -1 -1 -2  4 -2  2  2 -2 -1 -2 -2 -1  0  4 -6 -2 -2
W -6 -5 -8 -7 -7  0 -7 -3 -5 -3 -2 -4 -4 -6 -5  2 -2 -5 -6 17  0 -6
Y -3 -3  0 -4 -4  7 -5  0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2  0 10 -4
Z  0  2 -5  3  3 -5 -1  2 -2  0 -3 -2  1  0  3  0  0 -1 -2 -6 -4  3
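Using such a table amounts to looking up one value per aligned pair of symbols and adding them. The sketch below hard-codes only the five PAM250 values quoted on the pam250.cmp slide, so it is a tiny subset for illustration; `PAM250_SUBSET`, `pair_score` and `ungapped_score` are names made up here.

```python
# Only the entries quoted on the pam250.cmp slide; a real table has all pairs.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("W", "W"): 17,
    ("F", "Y"): 7,
    ("I", "L"): 2,
    ("C", "L"): -6,
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Symmetric lookup: the table stores each pair once, in sorted order."""
    return table[tuple(sorted((a, b)))]


def ungapped_score(seq_a, seq_b, table=PAM250_SUBSET):
    """Sum of table values over two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, table) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # A-A (2) + W-W (17) + F-Y (7) + L-I (2) + C-L (-6)
    print(ungapped_score("AWFLC", "AWYIL"))
```

Note how the conserved tryptophan contributes far more than the conserved alanine, and how the Phe/Tyr mismatch still scores higher than the alanine match.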
Problems with PAM • Constructed in 1978 • Dataset much less complete than today • Used mainly small, globular proteins • Set of proteins used much more closely related than most relationships people are attempting to identify • Assumes all positions are equally mutable • Real proteins have both conserved and unconserved positions
BLOSUM Tables • Blocks substitution matrix • Derived from aligned Blocks of related sequences • 2000 blocks • 500 different protein groups • Created from an all vs. all comparison of the protein database
BLOSUM Reference • Amino acid substitution matrices from protein blocks. Henikoff, S. and Henikoff, J. G. (1992). Proc. Natl. Acad. Sci. USA 89: 10915-10919.
Blocks • Aligned, ungapped conserved region of a protein family • Calculate the frequency with which any amino acid can appear at each position • Compute the probability that any amino acid can substitute for any other
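The calculation from a block to substitution scores can be sketched as follows: count every pair of residues within each column, turn the counts into observed pair frequencies, and compare them with the frequencies expected if residues paired at random. Scores are the log-odds in half-bit units, as in Henikoff and Henikoff (1992). The block below is invented toy data, the clustering of similar sequences used for the real tables is left out, and `pair_counts` and `log_odds` are names chosen for this example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Rounded 2*log2(observed/expected) for every observed residue pair."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}
    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_block = ["AAS", "AAS", "AAS", "AAS", "AAS", "ASS"]  # invented data
    for pair, score in sorted(log_odds(toy_block).items()):
        print(pair, score)
```

In the toy block the identities A/A and S/S come out positive and the rarely observed A/S substitution comes out negative, which is the same pattern as the diagonal and off-diagonal values of a real BLOSUM table. Pairs that never occur in the block get no score in this sketch.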
BLOSUM Advantages • Frequencies obtained from protein blocks constructed regardless of evolutionary distance • Blocks represent regions of conserved sequence similarities • Conservation due to functional constraints • Calculated frequencies reflect functional constraints • Much larger data set used than for the PAM matrix
BLOSUM62 Table • Default table for almost all amino acid comparisons • FastA and TFastA use blosum50 • Many other blosum tables are available • In GenMoreData
BLOSUM62 amino acid substitution matrix.
{ GAP_CREATE 12 GAP_EXTEND 4 }
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
StructGapPep.cmp • Alternative table • Based upon amino acid substitutions after superposition of homologous protein structures • Closely related amino acids have alpha-carbon atoms close to one another after superposition of the structures • Useful for finding weak similarities between proteins (?)
Other Tables • Genetic Code Matrix • How many nucleotide changes required to switch between amino acids • Chemical Similarity • Side Chain
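A genetic code matrix can be derived from the standard genetic code by asking, for each pair of amino acids, how few nucleotide changes turn a codon of one into a codon of the other. The sketch below assumes the standard code (written in the usual T, C, A, G codon order) and uses the made-up name `min_changes`; how such a count is converted into table values is not covered here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to its list of codons.
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(codon)


def min_changes(aa1, aa2):
    """Fewest nucleotide substitutions needed to switch between two amino acids."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


if __name__ == "__main__":
    for pair in (("F", "L"), ("M", "W"), ("F", "K")):
        print(pair, min_changes(*pair))
```

Phe and Leu are one change apart (TTT to TTA), Met and Trp need two (ATG to TGG), and Phe and Lys need all three positions to change.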
Compare • Compares two sequences for regions of similarity • Uses either a word or window/stringency (default) comparison • Produces a table of overlapping points of similarity • DotPlot plots the points on a graph