Multiple Sequence Alignment (MSA) and Phylogeny. Clustal X.
Input: multiple-sequence FASTA file

>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQ
VRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECL
ISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQ
VRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEAL
ISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQL
QGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQ
VRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCL
ISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNREL
QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQ
VRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCL
ISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGEL
QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL
. . .
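To make the input format concrete, here is a minimal sketch of how a multiple-sequence FASTA file can be read: each record starts with a ">" header line, and the sequence lines that follow are joined until the next header. The function name `parse_fasta` and the toy records are illustrative assumptions; this is not how Clustal X itself reads its input.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made-up short records, not the full trypsin sequences).
example = """>seq1 first toy record
MNPFLILAF
VGAAVAV
>seq2 second toy record
MNPLLILAF
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq))
```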
The Newick tree format is used to represent trees as strings.
[Figure: tree with leaves A, C, B, D]
In Newick format: ((A,C),(B,D));
Each pair of parentheses () encloses a clade in the tree, and the commas separate the members of that clade.
The semicolon ";" is always the last character.
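The nesting rule can be illustrated with a small recursive parser. This is a sketch that assumes the simplest form of Newick (leaf names only, no branch lengths, no internal-node labels); the names `parse_newick` and `leaves` are hypothetical helpers, not part of any tool mentioned here.

```python
def parse_newick(newick):
    """Parse a simple Newick string (leaf names only) into nested tuples."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    text = text[:-1]
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
            return tuple(children)  # one clade = one tuple
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos].strip()  # a leaf name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(node):
    """Return the leaf names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


tree = parse_newick("((A,C),(B,D));")
print(tree)          # (('A', 'C'), ('B', 'D'))
print(leaves(tree))  # ['A', 'C', 'B', 'D']
```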
How robust is our tree?
• We need some statistical way to estimate the confidence in the tree topology
• But we don’t know anything about the tree topology distribution or parameters
• The only data source we have is our data (MSA)
• So, we must rely on our own resources: “pull up by your own bootstraps”
Bootstrap (and jackknife)
Jackknife
1. We create n (typically 100-1000) new MSAs (pseudo-data sets) by randomly sampling half of the characters (random sampling without replacement). We do not change the number of sequences, just the number of positions!

POS: 52316
1 : TATTT
2 : CATTT
3 : CACTT
N : AACTT

POS: 18745
1 : TTTAT
2 : TAACC
3 : TAACC
N : TGGGA

POS: 18394
1 : TTGTA
2 : TAGAC
3 : TAAAC
N : TGAGG
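The sampling step can be sketched as follows. The toy alignment, the helper names `jackknife_columns` and `take_columns`, and the fixed random seed are assumptions made for illustration only.

```python
import random


def jackknife_columns(n_cols, rng):
    """Pick half of the column indices at random, without replacement."""
    return rng.sample(range(n_cols), n_cols // 2)


def take_columns(msa, cols):
    """Build a pseudo-MSA that keeps every sequence but only the given columns."""
    return ["".join(seq[c] for c in cols) for seq in msa]


# Toy alignment: 4 sequences, 10 positions (made up for this sketch).
msa = ["ATCTGTAGTA", "ATCTGTAGTC", "ACTTATGGTC", "ACCTAAGGTT"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [
    take_columns(msa, jackknife_columns(len(msa[0]), rng))
    for _ in range(100)  # n = 100 pseudo-data sets
]
print(replicates[0])
```

Each replicate still has four sequences, but only five of the ten positions, and no position appears twice.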
Jackknife
2. We reconstruct a tree from each data set, using the same method used for reconstructing the original tree.
[Figure: the three pseudo-data sets from step 1 (POS: 52316, 18745, 18394), each shown with the tree of Sp1-Sp4 reconstructed from it]
Jackknife
3. For each node in our original tree, we count the number of times it appeared in the jackknife analysis.
In 67% of the data sets, the node Sp1+Sp2 was found.
[Figure: original tree of Sp1-Sp4 with support values 67% and 100% on its internal nodes, next to the trees reconstructed from the jackknife data sets]
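The counting in step 3 can be sketched like this. Trees are written as nested tuples of leaf names, and a node is identified by the set of leaves below it (a rooted simplification; real tools usually compare unrooted splits). The three replicate trees are made up so that the percentages echo the 67% and 100% above; `clades` and `support` are hypothetical helper names.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def support(original, replicates):
    """Percentage of replicate trees containing each clade of the original tree."""
    original_clades = clades(original)
    root = max(original_clades, key=len)  # the root clade holds every leaf
    replicate_clades = [clades(t) for t in replicates]
    return {
        clade: 100.0 * sum(clade in rc for rc in replicate_clades) / len(replicates)
        for clade in original_clades
        if clade != root  # the root is trivially present in every tree
    }


original = ((("Sp1", "Sp2"), "Sp3"), "Sp4")
replicate_trees = [  # toy trees, one per pseudo-data set
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp3"), "Sp2"), "Sp4"),
]

values = support(original, replicate_trees)
for clade in sorted(values, key=len):
    print("+".join(sorted(clade)), round(values[clade]))
```

Here Sp1+Sp2 appears in two of the three toy trees (about 67%), while Sp1+Sp2+Sp3 appears in all of them (100%).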
Bootstrap
The same as jackknife, but instead of sampling K/2 positions without replacement, we sample K positions with replacement (K is the number of positions in the MSA).
Bootstrap
1. Resample K positions n times

POS: 12345…K
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
N : ACCTA…T

POS: 11244…K
1 : AATTT…T
2 : AATTT…G
3 : AACTT…T
N : AACTT…T

POS: 47789…K
1 : TTTAT…T
2 : TAACC…G
3 : TAACC…T
N : TGGGA…T

POS: 15578…K
1 : AGGTA…T
2 : AGGAC…G
3 : AAAAC…A
N : AAAGG…C
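A minimal sketch of the resampling step, assuming a made-up toy alignment and hypothetical helper names (`bootstrap_columns`, `take_columns`). The only difference from the jackknife is that K column indices are drawn with replacement, so some positions repeat and others drop out.

```python
import random


def bootstrap_columns(n_cols, rng):
    """Draw n_cols column indices at random, with replacement."""
    return [rng.randrange(n_cols) for _ in range(n_cols)]


def take_columns(msa, cols):
    """Build a pseudo-MSA from the given column indices (repeats allowed)."""
    return ["".join(seq[c] for c in cols) for seq in msa]


# Toy alignment: 4 sequences, K = 10 positions (made up for this sketch).
msa = ["ATCTGTAGTA", "ATCTGTAGTC", "ACTTATGGTC", "ACCTAAGGTT"]
K = len(msa[0])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [take_columns(msa, bootstrap_columns(K, rng)) for _ in range(100)]
print(replicates[0])
```

Every pseudo-data set has the same number of sequences and the same length K as the original alignment.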
Bootstrap
2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree
[Figure: the three resampled data sets from step 1 (POS: 11244…K, 47789…K, 15578…K), each shown with the tree of Sp1-Sp4 reconstructed from it]
Bootstrap
3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis
[Figure: original tree of Sp1-Sp4 with support values 67% and 100% on its internal nodes, next to the trees reconstructed from the bootstrap data sets]
• The jackknife method is less general than bootstrap
• Jackknife explores the data differently
• Jackknife is easier to apply to complex sampling schemes
Bootstrap values on NJPlot
Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb. You might have to reopen the tree…