The choice of scoring values (substitution matrix) matters
• Scoring matrices appear in all analyses involving sequence comparison.
• The choice of matrix can strongly influence the outcome of the analysis.
• Scoring matrices implicitly represent a particular theory of evolution.
• Understanding the theory underlying a given scoring matrix can aid in making the proper choice.

Different principles for substitution matrices
• Identity matrix.
• Genetic code matrix: the score is based on the minimum number of base changes required to convert one amino acid into another.
• Physical/chemical characteristics: attempt to quantify some physical or chemical attribute of the residues and assign weights (arbitrarily) based on the similarity of the residues.
• Log-odds matrices, e.g. PAM250 and BLOSUM62:
$S_{ij} = \log \frac{q_{ij}}{p_i p_j}$
$S_{ij}$ is the log-odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance. $q_{ij}$ is the frequency with which residues i and j are observed to align in sequences known to be related; these frequencies are derived from a "transition probability matrix". $p_i$ and $p_j$ are the frequencies of occurrence of residues i and j in the set of sequences.
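The log-odds definition above can be tried out with a few lines of Python. This is only a minimal sketch: the function name `log_odds_score`, the choice of base 2, and the frequencies are illustrative assumptions, not values from PAM or BLOSUM.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=1.0, base=2.0):
    """Return scale * log_base(q_ij / (p_i * p_j)).

    q_ij: observed frequency of the aligned pair (i, j) in related sequences.
    p_i, p_j: background frequencies of residues i and j.
    """
    return scale * math.log(q_ij / (p_i * p_j), base)

# Hypothetical frequencies, chosen only to show the sign of the score.
p_i = p_j = 0.05                                # chance expectation p_i * p_j = 0.0025
print(log_odds_score(0.01, p_i, p_j))           # pair seen more often than chance: positive
print(log_odds_score(0.0025, p_i, p_j))         # pair seen as often as chance: about zero
print(log_odds_score(0.000625, p_i, p_j))       # pair seen less often than chance: negative
```

A positive score means the pair is aligned more often in related sequences than chance predicts, a negative score means it is aligned less often.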

PAM matrices: how did Margaret Dayhoff construct them?
1. Align sequences that are at least 85% identical (this minimizes ambiguity in the alignments and the number of coincident mutations).
2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.
3. Count the replacements "accepted" by natural selection in all pairwise comparisons (each $A_{ij}$ is the number of times amino acid j was replaced by amino acid i in all comparisons).
4. Compute the amino acid mutability $m_j$, i.e. the propensity of a given amino acid j to be replaced.

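PAM construction, continued
5. Combine the data from steps 3 and 4 to produce a Mutation Probability Matrix, M, for one PAM of evolutionary distance (1 PAM = one accepted point mutation per 100 residues), according to the following formulae:
$M_{ij} = \frac{\lambda m_j A_{ij}}{\sum_k A_{kj}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda m_j$
where $\lambda$ is a proportionality constant chosen so that, on average, 1 residue in 100 changes.
6. Calculate the Log Odds Matrix for similarity scoring. Divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of the replacement residue:
$R_{ij} = \frac{M_{ij}}{f_i}$
R is the Relatedness Odds Matrix and $f_i$ is the frequency of residue i. The Log Odds Matrix, $S_{ij}$, is calculated from the relatedness odds matrix by taking the logarithm of each $R_{ij}$ and multiplying by 10: $S_{ij} = 10 \log_{10} R_{ij}$.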
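The construction steps can be sketched on a toy three-letter alphabet. Everything numeric here is hypothetical: the counts `A`, the `occurrences`, and the alphabet are invented for illustration. The sketch uses the usual Dayhoff form $M_{ij} = \lambda m_j A_{ij} / \sum_k A_{kj}$ with $\lambda$ scaled to 1 change per 100 residues, and it assumes that higher PAM distances are obtained by repeated multiplication of the 1 PAM matrix.

```python
import math

ALPHABET = "ABC"  # toy alphabet standing in for the 20 amino acids

# Hypothetical accepted-replacement counts A[i][j] (symmetric, diagonal unused).
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
# Hypothetical number of occurrences of each residue in the aligned data.
occurrences = [5000, 3000, 2000]

n = len(ALPHABET)
total = sum(occurrences)
f = [c / total for c in occurrences]                         # frequencies f_i
col_sum = [sum(A[i][j] for i in range(n)) for j in range(n)]
m = [col_sum[j] / occurrences[j] for j in range(n)]          # mutabilities m_j

# Scale factor: sum_j f_j * lam * m_j = 0.01, i.e. 1 change per 100 residues.
lam = 0.01 / sum(f[j] * m[j] for j in range(n))

def pam1():
    """Mutation probability matrix for 1 PAM; column j describes the fate of residue j."""
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                M[i][j] = lam * m[j] * A[i][j] / col_sum[j]
        M[j][j] = 1.0 - lam * m[j]
    return M

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(k):
    """Extrapolate to k PAM by multiplying the 1 PAM matrix with itself k times."""
    M1 = pam1()
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, M1)
    return result

def log_odds(M):
    """S_ij = 10 * log10(M_ij / f_i), rounded to the nearest integer."""
    return [[round(10 * math.log10(M[i][j] / f[i])) for j in range(n)]
            for i in range(n)]

for row in log_odds(pam1()):
    print(row)
```

Because every later matrix is a power of the 1 PAM matrix, any sampling error in the counts is carried along into the extrapolated matrices.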

PAM 250 substitution matrix

Limitations of the PAM model
Assumptions in the PAM model:
• Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
• The sequences being compared have average amino acid composition.
Sources of error in the PAM model:
• Many sequences depart from average composition.
• Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
• Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
• The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.

BLOSUM (Blocks Substitution Matrix) substitution matrices
1. The starting data are conserved blocks from the Blocks database.
• Aligned, ungapped sequences.
• Widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring, highly related sequences.
2. Replacements are counted by straightforward counting of all pairs of aligned residues, $f_{ij}$.
• The observed frequency of each pair is $q_{ij} = f_{ij} / (\text{total number of residue pairs})$.
• This includes cases of i = j (i.e. no replacement observed).
• The expected frequency of each pair is essentially the product of the frequencies of each residue in the data set.

BLOSUM (Blocks Substitution Matrix) substitution matrices, continued
3. Sequences in a block that are similar above a threshold percentage are clustered, and the members of a cluster count fractionally toward the final tally.
• Reduces the number of identical pairs (AA, SS, TT, etc. matches) in the final tallies.
• Somewhat analogous to increasing the PAM distance.
• If the clustering threshold is 80%, the final matrix is BLOSUM 80.
• Clustering at 62% reduces the number of blocks contributing to the table by 25%; still, $1.25 \times 10^6$ pairs contributed!
• The least frequent amino acid pair replacement was observed 2369 times!
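The counting and clustering steps can be sketched as follows. This is a simplified illustration, not the Blocks database procedure itself. The function names, the single-linkage clustering, the factor 2 for unordered mixed pairs in the expected frequency, and the half-bit scaling (2 × log2) are assumptions made for the sketch, and the four-sequence block is invented.

```python
import math
from collections import defaultdict
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions in two aligned, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(seqs, threshold):
    """Single-linkage clustering: sequences at or above the threshold share a label."""
    labels = list(range(len(seqs)))
    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels

def pair_counts(seqs, threshold):
    """Weighted counts f_ij of aligned residue pairs between different clusters."""
    labels = cluster(seqs, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    f = defaultdict(float)
    for i, j in combinations(range(len(seqs)), 2):
        if labels[i] == labels[j]:
            continue  # pairs inside one cluster do not contribute
        weight = 1.0 / (size[labels[i]] * size[labels[j]])  # fractional count
        for x, y in zip(seqs[i], seqs[j]):
            f[tuple(sorted((x, y)))] += weight
    return dict(f)

def log_odds(f):
    """Scores 2 * log2(q_ij / e_ij), with e_ij built from the residue frequencies."""
    total = sum(f.values())
    q = {pair: count / total for pair, count in f.items()}
    p = defaultdict(float)
    for (x, y), value in q.items():
        if x == y:
            p[x] += value
        else:
            p[x] += value / 2
            p[y] += value / 2
    scores = {}
    for (x, y), value in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(value / expected))
    return scores

block = ["ACDA", "ACDA", "ACEA", "SCEA"]   # hypothetical block, not real data
print(pair_counts(block, threshold=1.0))   # the two identical sequences share one vote
print(log_odds(pair_counts(block, threshold=1.0)))
```

Lowering the threshold merges more sequences into clusters, which removes identical pairs from the tally in the same way that a larger PAM distance lowers the diagonal of a PAM matrix.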

BLOSUM 62

BLOSUM and PAM: a comparison

FASTA and BLAST: searching the databases for related sequences
Searching the databases with a rigorous Smith-Waterman algorithm is resource-intensive (but possible). FASTA and BLAST give faster searches with lower resource use by taking shortcuts. Both pre-screen the sequences in the database, so that only sequences that look interesting (that appear to resemble the query sequence) are processed further.

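How FASTA works (ktup = 1)
Sequence s, positions 1 to 11: H A R F Y A A Q I V L
Hash table for s (residue: positions): A: 2, 6, 7; F: 4; H: 1; I: 9; L: 11; Q: 8; R: 3; V: 10; Y: 5; others: ...
Sequence t, positions 1 to 8: V D M A A Q I A
Offsets (position in s minus position in t) for each residue of t: V: +9; D: none; M: none; A: -2, +2, +3; A: -3, +1, +2; Q: +2; I: +2; A: -6, -2, -1
Offset vector: one counter for each offset from -7 to +10; every offset found above increments its counter.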

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996
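The hash table and offset vector from the figure can be reproduced with a short Python sketch. The function names `build_lookup` and `offset_vector` are made up for this illustration, and a plain dictionary stands in for whatever table structure FASTA uses internally.

```python
from collections import Counter, defaultdict

def build_lookup(s, ktup=1):
    """Map each k-tuple of s to the 1-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(s) - ktup + 1):
        table[s[pos:pos + ktup]].append(pos + 1)
    return table

def offset_vector(s, t, ktup=1):
    """Count, for each offset (position in s minus position in t), the k-tuple hits."""
    table = build_lookup(s, ktup)
    offsets = Counter()
    for pos in range(len(t) - ktup + 1):
        for s_pos in table.get(t[pos:pos + ktup], []):
            offsets[s_pos - (pos + 1)] += 1
    return offsets

s = "HARFYAAQIVL"
t = "VDMAAQIA"
offsets = offset_vector(s, t)
for offset in sorted(offsets):
    print(f"{offset:+d}: {offsets[offset]}")
```

With these two sequences, offset +2 collects four hits, more than any other offset. It corresponds to the segment AAQI, which the two sequences share along one diagonal.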

FASTA, continued
FASTA then joins two or more k-tuples if they do not lie too far apart; together they form a region, which can be viewed as a local alignment without gaps. The 5 best regions from the previous phase are then rescored with PAM120 or PAM250. This is the first measure of similarity between the two sequences and is called the initial score in the results file. One such score is computed for every sequence in the database. The optimized score is then computed in the manner of Smith-Waterman, but restricted to cells in a band around the initial alignment.
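The banded optimization step can be illustrated with a small Smith-Waterman variant that only fills cells near a chosen diagonal. This is a sketch under stated assumptions: a simple match/mismatch score replaces the PAM matrix, the gap penalty is linear, and the function name and default parameters are invented for the example.

```python
def banded_smith_waterman(s, t, offset, band=3, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) with |(i - j) - offset| <= band.

    i and j are 1-based positions in s and t; offset is the diagonal (i - j)
    around which the band is centred.
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)       # cells outside the band stay at 0
        lo = max(1, i - offset - band)
        hi = min(len(t), i - offset + band)
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align s[i] with t[j]
                          prev[j] + gap,       # gap in t
                          curr[j - 1] + gap)   # gap in s
            best = max(best, curr[j])
        prev = curr
    return best

s = "HARFYAAQIVL"
t = "VDMAAQIA"
print(banded_smith_waterman(s, t, offset=2, band=1))   # band around the AAQI diagonal
print(banded_smith_waterman(s, t, offset=9, band=0))   # only the diagonal through V/V
```

The work grows with the sequence length times the band width rather than with the product of both sequence lengths, which is where the saving over a full Smith-Waterman search comes from.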

FASTA: choosing the k-tuple value
For DNA searches ktup is 4 to 6; for protein searches it is 1 or 2. The choice of ktup affects the result:
• A low ktup increases sensitivity, i.e. the ability to find distant relatives.
• A high ktup increases selectivity, i.e. the ability to reject false positives.

Variants of FASTA

FASTA results

Parameters that indicate how good our database hits are
• init1: score of the highest-scoring initial region.
• initn: sum of the initial scores of the joined regions minus a joining penalty for each gap.
• opt: score of the optimal alignment of the region.
• Z: measure of how unusual the original match is. If the score is S, $Z = (S - \mathrm{mean}) / \mathrm{sd}$.
• P: probability that the alignment is no better than random.
• E(n): expected number of sequences giving the same Z-score or better if the database is probed with a random sequence. $E = P \cdot n$, where n is the database size.
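The Z and E formulas above translate directly into code. In this sketch the mean and standard deviation are taken from a list of scores against unrelated sequences, which is an assumption about where the background distribution comes from. The numbers are invented.

```python
import statistics

def z_score(score, background_scores):
    """Z = (S - mean) / sd, with mean and sd taken from the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (score - mean) / sd

def e_value(p, database_size):
    """E = P * n: expected number of equally good hits in a database of n sequences."""
    return p * database_size

# Hypothetical scores of a query against unrelated database sequences.
background = [21, 25, 19, 23, 22, 24, 20, 26]
print(z_score(60, background))
print(e_value(1e-6, 500_000))
```

The same P value gives a larger E value in a larger database, which is why E is the more useful number when judging a database hit.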

Evaluating the results
Z-score:
• Z > 5: significant
P:
• $P < 10^{-100}$: exact match
• $10^{-100} < P < 10^{-50}$: nearly identical sequences
• $10^{-50} < P < 10^{-10}$: closely related, certain homology
• $10^{-5} < P < 10^{-1}$: usually distant relatives
• $P > 10^{-1}$: probably not a significant hit
E:
• E < 0.02: probably homologous sequences
• 0.02 < E < 1: homology cannot be ruled out
• E > 1: chance?

How BLAST (Basic Local Alignment Search Tool) works
• BLAST makes a list of all three-letter words (subsequences) in the query protein (for the sequence MEFGALLY... these are MEF, EFG, FGA, GAL, and so on).
• Using BLOSUM62, it identifies for each of these words the words that give a score above a certain threshold (the neighborhood word score threshold), about 50 new words for each original word.
• Each sequence in the database is then searched for exact matches to each of the 50 words for each position in the query sequence.
• The hits are then extended until the score begins to drop. The result is a longer aligned sequence stretch called an HSP (high-scoring segment pair).
• HSPs that are suitably placed relative to each other are joined.

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996
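The word list, neighborhood words, seeding, and extension steps can be sketched in Python. This is not BLAST's actual implementation. A toy score (+5 for identity, -1 otherwise) stands in for BLOSUM62, and the threshold of 9, the drop-off limit `x_drop`, the subject sequence, and all function names are assumptions chosen for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b, match=5, mismatch=-1):
    """Toy residue score; a stand-in for a BLOSUM62 lookup."""
    return match if a == b else mismatch

def words(query, w=3):
    """All overlapping words of length w with their 0-based start positions."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]

def neighborhood(word, threshold):
    """All words scoring at least the threshold against the given word."""
    return ["".join(cand) for cand in product(AMINO_ACIDS, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, cand)) >= threshold]

def find_seeds(query, subject, w=3, threshold=9):
    """Exact matches between subject words and neighborhood words of the query."""
    index = {}
    for q_pos, word in words(query, w):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(q_pos)
    return [(q_pos, s_pos)
            for s_pos in range(len(subject) - w + 1)
            for q_pos in index.get(subject[s_pos:s_pos + w], [])]

def extend(query, subject, q_pos, s_pos, w=3, x_drop=4):
    """Ungapped extension in both directions; stop once the running score
    falls more than x_drop below the best score seen so far."""
    best = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    left, right = 0, w                      # HSP spans [left, right) from the seed start
    running, k = best, w
    while q_pos + k < len(query) and s_pos + k < len(subject):
        running += score(query[q_pos + k], subject[s_pos + k])
        if running > best:
            best, right = running, k + 1
        elif best - running > x_drop:
            break
        k += 1
    running, k = best, -1
    while q_pos + k >= 0 and s_pos + k >= 0:
        running += score(query[q_pos + k], subject[s_pos + k])
        if running > best:
            best, left = running, k
        elif best - running > x_drop:
            break
        k -= 1
    return best, query[q_pos + left:q_pos + right], subject[s_pos + left:s_pos + right]

query, subject = "MEFGALLY", "KKEFGAILYQ"   # subject is a made-up database sequence
for q_pos, s_pos in find_seeds(query, subject):
    print(q_pos, s_pos, extend(query, subject, q_pos, s_pos))
```

Allowing the running score to dip slightly before giving up lets the extension pass over an isolated mismatch, such as L against I in this example, and continue into the matching residues behind it.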

BLAST results

BLAST results, continued

Variants of BLAST
• blastp compares an amino acid query sequence against a protein sequence database.
• blastn compares a nucleotide query sequence against a nucleotide sequence database.
• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Note that tblastx is extremely slow and CPU-intensive.
• PSI-BLAST (Position-Specific Iterated BLAST) uses an iterative search in which sequences found in one round of searching are used to build a score model (profile) for the next round. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (and further) BLAST search, and the results of each iteration are used to refine the profile. This iterative strategy results in increased sensitivity.

The human genome

Horizontal gene transfer?

Probable vertebrate-specific acquisition of bacterial genes

But no...

But no, continued

Phylogenetic analysis

What went wrong?
"A different methodological reason for several of the genes in the human genome report being considered as bacteria–vertebrate HGTs, was that phylogenetics was not the analytical approach, and that the conclusions were instead derived largely from top BLAST hit results. In several instances the top BLAST hit was indeed a bacterial species, whereas further down the list of significant BLAST hits one finds a non-vertebrate eukaryote. When such sequences were properly aligned, the resulting phylogenetic trees often supported the monophyly of eukaryotes with the non-vertebrate eukaryote at the base."

ClustalW alignment

The conclusion
"Most of our analyses and phylogenetic topologies are highly consistent with the view that vertebrates and bacteria share these loci through common ancestry, involving a succession of non-vertebrate eukaryote intermediates. A further point arising from our analysis is that the evolutionary relationships among proteins cannot be concluded solely from the ranking of database hits in homology searches (for example, BLAST reports). This is not a new conceptual point (see refs 7, 12, 13), but one that seems to have been overlooked in this instance. Phylogenetic analysis must be a central component of any protein family or genome annotation effort. Importantly, phylogenetic reconstruction is critical to synthesizing, from the growing wealth of sequence data, a more comprehensive view of genome evolution."
