500 likes | 615 Vues
Robust Alignment of Drosophila Genomes Lior Pachter EECS Joint Colloquium, October 5th 2005. What is genomics?. GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC. GLGEYLFERLTLKHD *. What is genomics?. GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAA GAGCCTTCTGAGC.
E N D
Robust Alignment of Drosophila Genomes Lior Pachter EECS Joint Colloquium, October 5th 2005
What is genomics? . . . . . . . . . . . . . . . GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC
. . . . GLGEYLFERLTLKHD* What is genomics? . . . . . . . . . . . . . . . GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC
What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACAC . . . . . . . . .
What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACAC . . . . . . .
What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACAC microRNA . . . . . . . . ATGGAGT . . . . . . .
TTCCCTAG--------CAAGTACCTCA------------------TTCCCTAG--------CAAGTACCTCA------------------TTCCCTAG--------CAAGTACCTCA------------------TTCCTTAGACTCTTAGCAAGTACCTCA------------------TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACACGACTCT----TTTTAGCAAGTACCTCAAAATATTTAATTAAA-AC ACTCTT----TTTTAGCAAGTACCTCAAGAATTACAATTAAATAT let-7 . . . . . . . . ATGGAGT Grun et al. microRNA target predictions across seven Drosophila species and comparison to mammalian targets, PloS Computational Biology, June 2005 Lall et al. A genome wide map of conserved microRNA targets in C. Elegans, submitted to Cell, 2005. What is comparative genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACAC
The Drosophila Genome Project • 1911 Genetic Mapping in Drosophila • Sturtevant and Morgan • 2000 Drosophila melanogaster genome sequenced • Celera and LBNL publish Drosophila genome in Science • 2003 Proposal for Drosophila as a model system for comparative genomics • Clark, Gibson, Kaufman, McAllister, Myers, O’Grady • 2005 Twelve Drosophila genomes sequenced • Consortium involving Agencourt, Broad Institute, Baylor College Medicine, • Washington University St. Louis and the Venter Institute.
Drosophila Projects • Transposable Element Annotation • A. Caspi and L. Pachter, Identification of transposable elements using multiple alignments • of related genomes, Genome Research, in press. • Multiple Sequence Alignment • C. Dewey and L. Pachter, Whole Genome Mapping, in preparation. • A.S. Schwartz, E.W. Myers and L. Pachter, Alignment metric accuracy, submitted. • N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, • Genome Research 14 (2004), p 693--699. • Gene Finding • S. Chatterji and L. Pachter, Multiple organism gene finding by collapsed Gibbs sampling, • Journal of Computational Biology, 12 (2005), p 599--608. • S. Chatterji and L. Pachter, GeneMapper: Evidence based multiple organism gene finding, • in preparation.
Drosophila Projects • Transposable Element Annotation • A. Caspi and L. Pachter, Identification of transposable elements using multiple alignments • of related genomes, Genome Research, in press. • Multiple Sequence Alignment • C. Dewey and L. Pachter, Whole Genome Mapping, in preparation. • A.S. Schwartz, E.W. Myers and L. Pachter, Alignment metric accuracy, submitted. • N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, • Genome Research 14 (2004), p 693--699. • Parametric alignment • Gene Finding • S. Chatterji and L. Pachter, Multiple organism gene finding by collapsed Gibbs sampling, • Journal of Computational Biology, 12 (2005), p 599--608. • S. Chatterji and L. Pachter, GeneMapper: Evidence based multiple organism gene finding, • in preparation.
Available Drosophila whole genome multiple alignments • MAVID • http://hanuman.math.berkeley.edu/kbrowser • MULTIZ • http://genome.ucsc.edu/ (currently no D. erecta)
Alignment of an exon DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC ****** * ****** ** ** ** ***** **** ** ** ** ** ****** * ** Alignment of an intron DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** **
Alignment of an exon DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC ****** * ****** ** ** ** ***** **** ** ** ** ** ****** * ** Alignment of an intron droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** **
How is an alignment made from the sequences? Given two sequences of lengths n,m: >dm2.chr2L CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC >dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC n=50 m=62 ? dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG dm2.chr2L TATGGACTCAC dp3.chr4_group3 TGT--ACTTAC Note that the length of an alignment is at least max(n,m) and at most n+m.
dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG dm2.chr2L TATGGACTCAC dp3.chr4_group3 TGT--ACTTAC DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroMel_4_ ATTCTATGGACTCAC DroPse_1_ ------TGTACTTAC Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).
#M=31, #X=22, #G=3, #S=12 #M=27, #X=18, #G=3, #S=28 dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG dm2.chr2L TATGGACTCAC dp3.chr4_group3 TGT--ACTTAC DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroMel_4_ ATTCTATGGACTCAC DroPse_1_ ------TGTACTTAC Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S). 2(#M+#X)+#S=112 so #X,#G and #S suffice to specify a summary. This notation follows Chapter 7 (Parametric Sequence Alignment) by Colin Dewey and Kevin Woods in the book Algebraic Statistics for Computational Biology.
The summary of an alignment is a point in 3 dimensional space. For example, the two alignments just shown correspond to the points: (22,3,12) (18,3,28) In the example of our two sequences there are 379522884096444556699773447791552717765633 different alignments, but only 53890 different summaries. So we don’t need to plot that many points. But 53890 is still quite a large number. Fortunately, there are only 69 vertices on the convex hull of the 53890 points. That is something we can draw…
>mel CTGCGGGATTAGGGGTCATTAGAGTGCCGA AAAGCGAGTTTATTCTATGGAC >pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGA GGAGAGGCCATCATCGTGTAC 49 #x=24, #S=10, #G=2 There are eight alignments that have this summary. For the sequences: the alignment polytope is:
mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
mel CTGCGGGATTAGGGGTCATTAGAGT===------===GCCGAAAAGCGAGTTTATTCTA=TGGAC pse CTGGAAGAGTTTTGATTAGTAG===GGGATCCATGGGGGCGAGGAGAGGCCATCATC==GTGTAC Consensus at a vertex
49 #x=24, #S=10, #G=2 The vertices of the polytope have special significance. Given parameters for a model, e.g. the default parameters for MULTIZ: M = 100, X = -100, S = -30, G = -400 the summary is the result of maximizing the linear form -200*(#X)-400*(#G)-80*(#S) over the polytope. Thus, the vertices of the polytope correspond to optimalalignments.
Needleman-Wunsch Alignment What is usually done, is that a single set of parameters is specified (M = 100, X = -100, S = -30, G = -400 is a standard default) and then theoptimal vertex is identified using dynamic programming. An alignment optimal for the vertex is then selected. The running time of the algorithm is O(nm) [Needleman-Wunsch, 1970, Smith-Waterman, 1981] and it requires O(n+m) space [Hirschberg 1975] . Standard scoring schemes are: Parameters Model M,X,S Jukes-Cantor with linear gap penalty M,X,S,G Jukes-Cantor with affine gap penalty M,XTS,XTV,S,G Kimura-2 parameter with affine gap penalty
Available Drosophila whole genome multiple alignments • MAVID • http://hanuman.math.berkeley.edu/kbrowser • MULTIZ • http://genome.ucsc.edu/ (currently no D. erecta)
DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** ** MAVID N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004) p 693--699
droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** ** MULTIZ Blanchette et al., Aligning multiple sequences with the threaded blockset aligner, Genome Research 14 (2004) p 708--715
droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 -----ACTTAC dm2.chr2L TATGGACTCAC droMoj1.contig_2959 TATTGACTCAC dp3.chr4_group3 TGT--ACTTAC droSim1.chr2L TATGGACTCAC droVir1.scaffold_6 GGTCCACTCAC droYak1.chr2L CATAAACTCAC *** **
DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** **
droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** **
DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** **
One (possibly wrong) alignment is not enough: the history of parametric inference • 1992: Waterman, M., Eggert, M. & Lander, E. • Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA89, 6090-6093 • 1994: Gusfield, D., Balasubramanian, K. & Naor, D. • Parametric optimization of sequence alignment, Algorithmica12, 312-326. • 2003: Wang, L., Zhao, J. • Parametric alignment of ordered trees, Bioinformatics, 19 2237-2245. • 2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. • Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms, 2 271-287. XPARAL by Kristian Stevens and Dan Gusfield
Whole Genome Parametric AlignmentColin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods • Mathematics and Computer Science • Parametric alignment in higher dimensions. • Faster new algorithms. • Deeper understanding of alignment polytopes. • Biology • Whole genome parametric alignment. • Biological implications of alignment parameters. • Alignment with biology rather than for biology.
Whole Genome Parametric AlignmentColin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods • Mathematics and Computer Science • Parametric alignment in higher dimensions. • Faster new algorithms. • Deeper understanding of alignment polytopes. • Biology • Whole genome parametric alignment. • Biological implications of alignment parameters. CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT- CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT analysis
Whole Genome Parametric AlignmentColin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods • Mathematics and Computer Science • Parametric alignment in higher dimensions. • Faster new algorithms. • Deeper understanding of alignment polytopes. • Biology • Whole genome parametric alignment. • Biological implications of alignment parameters. CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT- CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT analysis
(#X #S #G)[#alignments] 40 (15,16,16)[1080] 41 (17,30,2)[4] 42 (18,14,5)[4] 43 (18,16,4)[56] 44 (20,10,6)[16] 45 (20,10,7)[24] 46 (23,8,6)[6] 47 (23,8,8)[165] 48 (24,8,3)[38] 49 (24,10,2)[8] 50 (25,8,2)[24] 51 (25,62,3)[2] 52 (28,48,2)[1] 53 (29,8,1)[6] Finding the polytope is called parametric inference. This polytope took 3 seconds to compute using the beneath-beyond method [Grünbaum, Convex Polytopes, 1967].
>mel CTGCGGGATTAGGGGTCATTAGAGTGCCGA AAAGCGAGTTTATTCTATGGAC >pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGA GGAGAGGCCATCATCGTGTAC 49 #x=24, #S=10, #G=2 corresponds to the monomial 8X24S10G2 ? How do we build the polytope for Associated to every pair of sequences is a polynomial built from the “summaries” of the alignments. For example:
Newton polytope for positions [1,i] and [1,j] in each sequence Minkowski sum Polytope propagation Convex hull of union NPi,j = S*NPi-1,j+S*NPi,j-1+(X or M)*NPi-1,j-1 A C A T T A G A A A G A T T A C C A C A
Complexity of polytope propagation Theorem: The number of vertices of an alignment polytope for two sequences of length n and m is O((n+m)d(d-1)/(d+1)) where d is the number of free parameters in the scoring scheme. Examples: Parameters Model Vertices M,X,S Jukes-Cantor with linear gap penalty O(n+m)2/3 M,X,S,G Jukes-Cantor with affine gap penalty O(n+m)3/2M,XTS,XTV,S,G K2P with affine gap penalty O(n+m)12/5 L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138--16143. L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16132--16137.
Inference functions Definition: Given two integers n and m and a scoring scheme for sequence alignment, an inference function assigns to every pair of sequences of lengths n and m respectively, an (optimal) alignment. Remark: The number of inference functions could, in principle, be doubly exponential in n+m. This is because the number of alignments is the Delannoy number D(n,m), which is exponential in n+m, and the number of sequence pairs is 4n+m.
Few inference functions theorem Theorem (S. Elizalde 2005): The number of inference functions for two parameter alignment model with two sequences of length n isQ(n2). Proof (outline): The number of inference functions is the number of vertices of the Minkowski sum of the Newton polytopes of the observations. The Newton polytopes are all lattice polytopes, and therefore have few non-parallel edges. The number of vertices of the Minkowski sum is at most where m is the number of non-parallel edges and d is the dimension of the polytopes.
Algebraic Statistics • -- A language for unifying and developing many of the algorithms for biological sequence analysis -- • The few inference functions theorem • Polytope propagation • Phylogenetic tree reconstruction • Evolutionary models • Maximum likelihood estimation • Mutagenic tree models
The future ATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAGTGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATTACCTGAGGTCAGGAGTTTAAGACCAGCCTGGCCAACATGGTAAAACCCCATCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAAAATCACTTGAACCGGGAGGCGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAGCCTGGGCAACAGAACGAGATGCTGTCACAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGGTTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGCCTTCTTCATCTAATTTAATATTACTAACAAGATCTATTGTTCAAGCCAAAACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAATCCATTAGCAAGTCCTCTCTTCAAACCCATCCCAAACCAACCTTGTTTTTAACCATCTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTACCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCCTTGTGTAATCTCTGCCTCCACACATAATTTGCAAATTACTCCAGCTACACTGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTTCGCCCAGCCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGATTACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGTTTTTGAGATGGAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCGATCTCAGCTCACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGCCTCCCGTGTAGCTGGGATTAGAGGCACGCGCCACCACGCTGGGCAAATTTTTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTTGACCTCCCAAAGTGCTGGGATTACAGGAATGAGCCACCACACCCAGCCGTGCCCAGCTAATTTTTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGCCTCGGCCTCCTAGAGTGCTGGAATTACAGGTGTGAGCCACTGTGCCCGAACCTTTTATCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAGTGCAGTGATGCGATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGTGATTCTCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGATTACAGGCACACACCACCACACCATGCTAGTTTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAGCCACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAGCTCTGGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGAGGTGGTTGGATCACTTGAGTTCAGGAGTTTGAGACCAGCCTGACCAACGTGGTAAAACCCCATCTCTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCACGCGTGCCTGTAATACCAGCTACTCAGGAGGCCAAGGCAGGAGAATTGCTTGAACCCAGGAGACAGAGTTTGCAGTGAGCCAAGATTGTGCCACTGCACTCCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACAGAAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTATTCCTCTGCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTCTTGTTATTTAGGTCTCAGCTCAATTTTTTCAGAAAGGCTTTCCCTGGCCTCCTTAAACGAAAGTAATCAACAACCTTTGACAGCTAATACTATTCCACTGTTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCATCTTTATTGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGTCTTATTTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATAAATAGCCACTGAATAAAAGTTAAAACCAACAAAAATAATCATTTAATTAATTATGAATACATCGAATTGTGCACAATAGTTTATAAAATTACTTTTTTTTTTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGCTGGAGTGCAGTGGTGCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTCGACTAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCCACCTGCCTTGGCCACTCAAAGTGCTGGGATTATAGGCATGAGCCACCACGCCTGGCCTATAAAATTACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTCTCAGATGGCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTACTTCTCACTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTATAAACAGATACATTTGCATGTATATATATGTATATATGCATATGCACACACACACTTTTTTTTTTTTAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAGAGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTCCTGGGCTCAAGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTAAGACTACAGAAGCACACTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCACTATGTTGCCCAGGCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGTCTCAGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGACTGCATATTAATATGAAGAGCTTTTCTTCAACAACAGTGAACAGTTTTCTACAAAGGTATATGCAAGTGGGCCCACTTCTTGTTCTTATGAATCTTTTCTTTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACTGTTTCTTTTGAAATCTAGAACAAATGAGAACAGAGGATATCCTGGTTTGCGCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGGTTGGAGTGCAGTGGCACGATCTTGGCTCATTGCAACCTCCACCTCCCGGGTTCAAGAGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCGCATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTCACCATGTTGCCCAGGCTGATCTCGAACTCCTGAGCTCAGGCAATCTGCCTGTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCGATTTTTTTTTTCTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGTTAGAGTGCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCAAGTGCCACCAAGCCCGGCTAATTTTGTATTTTTAGTAGAAACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAGCCTCCCAAAGCGCTGGGATTATAGGTATGAGCCACAGTGCAGGCCTGCATAATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAAAGTGGTGCATTGATGGAAGGAAGCAAATACATTTTTAACTATATGACTGAATGAATATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAGACAGGTTACATACAGCAGAAGAACGTGCTCTTTTCACGGAGATAGAGAGGTCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTCCCTCTAAAGCCACAGCCCTTTAATAAGGCTTGTAGCAGCAGTTTCCTTCTGGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCACATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATTGCTAGGAGCAGATTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCAGCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTAGTATTACAATTAAAGACCTTTTGGTAACTCAGACTCAGCATCAGCAAAAACCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAGGGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTACCCATTTTCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAACTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACAAAATCAGGAAGTGCTGTCCTAAGAAGCTAAAGAGCCTCAGTTTTTTACACTCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATCAGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGGATATTTGGAAGTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCACTCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCTCCATTATAGTGTTCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGATTAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAAGAAAACAGTATTGTTGAAAACCTAAATCTGAAATGTCAATGTAATGACAAATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTCCCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATTTCAGAAAGCAGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGGCCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACAGGAGAAATGTATTTAACTAAATCCTAAACACAGCAGGCTATCTGCAAGCAGCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGAAGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTAACATTTATTACACTTAAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAGTTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTTTTTTTGAGATGCAGTCTTGCTCTGTCAGCCAGGCTGGAGTGCAGTGGCTGGATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGATGCCCGCCACCATGCCCGGCTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCCAAAGCGCTGGGATTACAGGCCTGAGCCACCACGCTTGGCATCTTTTTACCTTTCATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGACATTTTTCTCTAAAATAGTAGTTTGTAATAACTTGTTCTGGCACCTGGCTCCAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGGAAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCCAAGGGGTTTAGATCCTCTCCTTAAGTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGGCGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGTGGATCACCTGAGGTCGGGAGTCCAAGACCAGCCTGACTGACATGGAAAAACCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGGGCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAACAAAAACAAAAGAATCATCATCAAGTGAACTGGAACACATCCAGAGAACTAATTTTGTTAGAAAGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCACTCGTCTGCTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTGTTCCAGATAACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCTTATTTCCTTAAATCTTCTCATTTCTTCTTCAAAAGACAGTATTTCAAGTTACTTTTATGTATCTTTACCATCTACCTCTGGATAAACACTCTCCAATTTGTCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCTTTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCTGGATCTGTACTTCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATAAGCATTTTTAGCAGCCCTGAAATGTGTACTCAGATTTAGTTTATAGTCAACTAAAAACACCCAGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCTTAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCCCTTTTTTGCTTTTTCATATGTTAAGACTATCTCGGAGCTGTTATCAGACTT