Algebraic Statistics for Computational Biology Lior Pachter & Bernd Sturmfels
What is Biology? The study of living organisms. What is Statistics? The science concerned with the collection, organization, analysis and interpretation of data. What is Algebra? The part of mathematics that deals with generalized arithmetic.
Customers who bought this book might also buy… What is Algebraic Statistics? There is no dictionary definition yet. The term was coined by European statisticians interested in applying Gröbner bases to the design of experiments. Their book is: G. Pistone, E. Riccomagno and H. Wynn, “Algebraic Statistics: Computational Algebra in Statistics”. CRC Press, 2000.
New book: Algebraic Statistics for Computational Biology
Edited by Lior Pachter and Bernd Sturmfels
Cambridge University Press, Summer 2005

Table of Contents

Part I - Introduction to the four themes
1. Statistics
2. Computation
3. Algebra
4. Biology

Part II - Studies on the four themes
5. Parametric Inference
6. Polytope Propagation on Graphs
7. Parametric Sequence Alignment
8. Bounds for Optimal Sequence Alignment
9. Inference Functions
10. Geometry of Markov Chains
11. Equations Defining Hidden Markov Models
12. The EM Algorithm for Hidden Markov Models
13. Homology Mapping with Markov Random Fields
14. Mutagenic Tree Models
15. Catalog of Small Trees
16. The Strand Symmetric Model
17. Extending Tree Models to Split Networks
18. Small Trees and Generalized Neighbor Joining
19. Tree Construction Using Singular Value Decomposition
20. Application of Interval Methods to Phylogenetics
21. Analysis of Point Mutations in Vertebrate Genomes
22. Ultra-Conserved Elements in Vertebrate and Fly Genomes
Who is this girl? Her name is DiaNA. She makes DNA sequences. TAGAGACGGGGGTTTCACAATGTTGGCCA
The human genome
- Consists of 2.8 billion DNA bases.
- Sequenced in 2001 and finished in 2004.
- Contains genes:
  - these are subsequences which code for protein.
  - estimated number of genes: 20,000-25,000.
  - genes make up less than 5% of the genome.
Example: Breast-ovarian cancer susceptibility gene (BRCA1)
>hg17_dna range=chr17:38464686-38473085 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=noneATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAGTGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATTACCTGAGGTCAGGAGTTTAAGACCAGCCTGGCCAACATGGTAAAACCCCATCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAAAATCACTTGAACCGGGAGGCGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAGCCTGGGCAACAGAACGAGATGCTGTCACAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGGTTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGCCTTCTTCATCTAATTTAATATTACTAACAAGATCTATTGTTCAAGCCAAAACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAATCCATTAGCAAGTCCTCTCTTCAAACCCATCCCAAACCAACCTTGTTTTTAACCATCTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTACCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCCTTGTGTAATCTCTGCCTCCACACATAATTTGCAAATTACTCCAGCTACACTGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTTCGCCCAGCCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGATTACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGTTTTTGAGATGGAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCGATCTCAGCTCACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGCCTCCCGTGTAGCTGGGATTAGAGGCACGCGCCACCACGCTGGGCAAATTTTTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTTGACCTCCCAAAGTGCTGGGATTACAGGAATGAGCCACCACACCCAGCCGTGCCCAGCTAATTTTTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGCCTCGGCCTCCTAGAGTGCTGGAATTACAGGTGTGAGCCACTGTGCCCGAACCTTTTATCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAGTGCAGTGATGCGATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGTGATTCTCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGATTACAGGCACACACCACCACACCATGCTAGTTTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAGCCACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAGCTCTGGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGAGGTGGTTGGATCACTTGAGTTCAGGAGTTTGAGACCAGCCTGACCAACGTGGTAAAACCCCATCTCTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCACGCGTGCCTGTAATACCAGCTACTCAGGAGGCCAAGGCAGGAGAATTGCTTG
AACCCAGGAGACAGAGTTTGCAGTGAGCCAAGATTGTGCCACTGCACTCCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACAGAAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTATTCCTCTGCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTCTTGTTATTTAGGTCTCAGCTCAATTTTTTCAGAAAGGCTTTCCCTGGCCTCCTTAAACGAAAGTAATCAACAACCTTTGACAGCTAATACTATTCCACTGTTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCATCTTTATTGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGTCTTATTTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATAAATAGCCACTGAATAAAAGTTAAAACCAACAAAAATAATCATTTAATTAATTATGAATACATCGAATTGTGCACAATAGTTTATAAAATTACTTTTTTTTTTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGCTGGAGTGCAGTGGTGCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTCGACTAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCCACCTGCCTTGGCCACTCAAAGTGCTGGGATTATAGGCATGAGCCACCACGCCTGGCCTATAAAATTACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTCTCAGATGGCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTACTTCTCACTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTATAAACAGATACATTTGCATGTATATATATGTATATATGCATATGCACACACACACTTTTTTTTTTTTAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAGAGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTCCTGGGCTCAAGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTAAGACTACAGAAGCACACTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCACTATGTTGCCCAGGCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGTCTCAGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGACTGCATATTAATATGAAGAGCTTTTCTTCAACAACAGTGAACAGTTTTCTACAAAGGTATATGCAAGTGGGCCCACTTCTTGTTCTTATGAATCTTTTCTTTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACTGTTTCTTTTGAAATCTAGAACAAATGAGAACAGAGGATATCCTGGTTTGCGCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGGTTGGAGTGCAGTGGCACGATCTTGGCTCATTGCAACCTCCACCTCCCGGGTTCAAGAGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCGCATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTCACCATGTTGCCCAGGCTGATCTCGAACTCCTGAGCTCAGGCAATCTGCCTGTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCGATTTTTTTTTTCTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGTTAGAGTGCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCAAGTGCCACC
AAGCCCGGCTAATTTTGTATTTTTAGTAGAAACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAGCCTCCCAAAGCGCTGGGATTATAGGTATGAGCCACAGTGCAGGCCTGCATAATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAAAGTGGTGCATTGATGGAAGGAAGCAAATACATTTTTAACTATATGACTGAATGAATATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAGACAGGTTACATACAGCAGAAGAACGTGCTCTTTTCACGGAGATAGAGAGGTCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTCCCTCTAAAGCCACAGCCCTTTAATAAGGCTTGTAGCAGCAGTTTCCTTCTGGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCACATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATTGCTAGGAGCAGATTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCAGCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTAGTATTACAATTAAAGACCTTTTGGTAACTCAGACTCAGCATCAGCAAAAACCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAGGGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTACCCATTTTCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAACTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACAAAATCAGGAAGTGCTGTCCTAAGAAGCTAAAGAGCCTCAGTTTTTTACACTCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATCAGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGGATATTTGGAAGTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCACTCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCTCCATTATAGTGTTCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGATTAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAAGAAAACAGTATTGTTGAAAACCTAAATCTGAAATGTCAATGTAATGACAAATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTCCCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATTTCAGAAAGCAGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGGCCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACAGGAGAAATGTATTTAACTAAATCCTAAACACAGCAGGCTATCTGCAAGCAGCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGAAGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTAACATTTATTACACTTAAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAGTTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTTTTTTTGAGATGCAGTCTTGCTCTGTCAGCCAGGCTGGAGTGCAGTGGCTGGATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGATGCCCGCCACCATGCCCGGCTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGGAT
GGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCCAAAGCGCTGGGATTACAGGCCTGAGCCACCACGCTTGGCATCTTTTTACCTTTCATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGACATTTTTCTCTAAAATAGTAGTTTGTAATAACTTGTTCTGGCACCTGGCTCCAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGGAAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCCAAGGGGTTTAGATCCTCTCCTTAAGTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGGCGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGTGGATCACCTGAGGTCGGGAGTCCAAGACCAGCCTGACTGACATGGAAAAACCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGGGCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAACAAAAACAAAAGAATCATCATCAAGTGAACTGGAACACATCCAGAGAACTAATTTTGTTAGAAAGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCACTCGTCTGCTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTGTTCCAGATAACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCTTATTTCCTTAAATCTTCTCATTTCTTCTTCAAAAGACAGTATTTCAAGTTACTTTTATGTATCTTTACCATCTACCTCTGGATAAACACTCTCCAATTTGTCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCTTTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCTGGATCTGTACTTCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATAAGCATTTTTAGCAGCCCTGAAATGTGTACTCAGATTTAGTTTATAGTCAACTAAAAACACCCAGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCTTAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCCCTTTTTTGCTTTTTCATATGTTAAGACTATCTCGGAGCTGTTATCAGACTTTTTTCCTGAAAAACTCTCAACAATACTCAAACTAGGTGTTACATGAAGCTGGGGTCTCCAGGTTTTGCCTCACTTGTTCTTTCTTTTGTTGTTGTTGAGACAGAGTCTCACTCTGTCGCCAGGCTGGAGTGCAGTGGCAGGATCTCAGCTGACTGCAACCTCAGCCTCCAGAGTTCAAGCAATTCTTCTGTGTCAGCCTCCCAAGTAGCTGGGATTACAGGTGCACACCACCACGCCCAGCCA
Another example of annotation
INPUT: ..t..r…o..p..i..c..a..a..l...g..e..e..t..r..y..
OUTPUT: ..t..r…o..p..i..c..a..a..l...g..e..e..t..r..y..
Annotation is the labeling of the input sequence, in this case with 3 colors: keep, change, delete.
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
Data summary
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
Statistics question: Are the two sequences independent?
Algebra question: Is the 4x4 matrix close to rank 1?
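The 4x4 matrix itself did not survive in this transcript. As a minimal sketch, assuming it is the table of base-pair counts obtained by comparing the two 81-base sequences position by position (no gaps), it could be rebuilt as follows. The function names are hypothetical, and the Pearson chi-square statistic is only one possible way to measure how far the table is from rank 1; the slides do not say which measure was used.

```python
from collections import Counter

BASES = "acgt"

# The two sequences from the slide, compared position by position (assumption: no gaps).
seq1 = "tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag"
seq2 = "tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag"


def count_matrix(x, y):
    """4x4 table: entry [i][j] counts positions with BASES[i] in x and BASES[j] in y."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    pairs = Counter(zip(x.lower(), y.lower()))
    return [[pairs[(r, c)] for c in BASES] for r in BASES]


def independence_fit(table):
    """Rank-1 table with the same row and column sums (expected counts under independence)."""
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_sums] for r in row_sums]


def chi_square(table):
    """Pearson chi-square distance between the table and its rank-1 fit."""
    fit = independence_fit(table)
    return sum((o - e) ** 2 / e
               for obs_row, fit_row in zip(table, fit)
               for o, e in zip(obs_row, fit_row) if e > 0)


table = count_matrix(seq1, seq2)
for base, row in zip(BASES, table):
    print(base, row)
print("chi-square:", round(chi_square(table), 2))
```

A table of exactly rank 1 gives a chi-square value of 0, so larger values indicate a table further from the independence model.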
The independence model
• $m = 16$ observable states $\{A, C, G, T\}^2$
• $d = 6$ unknown parameters
• $\theta = (a_A, a_C, a_G, a_T, b_A, b_C, b_G, b_T)$ where $a_A + a_C + a_G + a_T = b_A + b_C + b_G + b_T = 1$
Independence means probabilities factor: $p_{AG} = \mathrm{prob}(A, G) = a_A b_G$
The model is the polynomial map
• $(a, b) \mapsto a^T b$
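The polynomial map of the independence model can be sketched in a few lines. The parameter values below are hypothetical, chosen only for illustration. The check relies on a standard fact: a matrix has rank at most 1 exactly when all of its 2x2 minors vanish.

```python
from itertools import combinations


def independence_map(a, b):
    """The map (a, b) -> a^T b: a 4x4 matrix with entries p_ij = a_i * b_j."""
    return [[ai * bj for bj in b] for ai in a]


def max_abs_2x2_minor(p):
    """Largest absolute 2x2 minor of p; it is zero exactly when p has rank at most 1."""
    rows, cols = range(len(p)), range(len(p[0]))
    return max(abs(p[i][k] * p[j][l] - p[i][l] * p[j][k])
               for i, j in combinations(rows, 2)
               for k, l in combinations(cols, 2))


# Hypothetical parameter values; each vector sums to 1.
a = [0.1, 0.2, 0.3, 0.4]
b = [0.25, 0.25, 0.25, 0.25]

p = independence_map(a, b)
print("total probability:", sum(map(sum, p)))
print("largest 2x2 minor:", max_abs_2x2_minor(p))
```

Because each of the two parameter vectors is constrained to sum to 1, only 3 + 3 = 6 of the 8 coordinates are free, which matches d = 6.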
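Models for discrete data
A statistical model is a parameterized family of probability distributions, given by a map $\Theta \to \Delta$ with $\Theta \subseteq \mathbb{R}^d$ and $\Delta \subseteq \mathbb{R}^m$.
d = number of parameters
m = number of observable states
$\Theta$ = the parameter space
$\Delta$ = probability simplex on the m states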
The geometry of maximum likelihood estimation
[Figure labels: parameter space, data, probability simplex]
Observed data
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagagatc
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
Hidden data
tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCctaaagagatc
tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCACCCctgaagagaag
** * ***** *** ** * **** *** ** **** * *********** ******* * ******** ******
Example: n=5, m=4
gttta-
gt--gc
**
[Figure: alignment graph from start to finish, labeled by the letters g t t t a and g t g c]
The alignment problem is to find the shortest path in the alignment graph. This is solved with dynamic programming and is known in computational biology as the Needleman-Wunsch algorithm.
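The shortest-path formulation can be sketched as follows for the example sequences gttta and gtgc. The edge lengths are assumptions, since the slide does not give them: a match costs 0, a mismatch costs 1 and a gap costs 1. The function name and signature are illustrative only.

```python
def needleman_wunsch(x, y, mismatch=1, gap=1):
    """Shortest path in the alignment graph of x and y, with one optimal alignment."""
    n, m = len(x), len(y)

    def step(i, j):
        # Length of the diagonal edge into node (i, j): 0 for a match.
        return 0 if x[i - 1] == y[j - 1] else mismatch

    # dist[i][j] = length of the shortest path from start (0, 0) to node (i, j).
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j - 1] + step(i, j),
                             dist[i - 1][j] + gap,
                             dist[i][j - 1] + gap)

    # Trace one shortest path back from finish (n, m) to start (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + step(i, j):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = needleman_wunsch("gttta", "gtgc")
print(score)
print(row1)
print(row2)
```

With these unit costs the shortest path has length 3. Other cost choices can select a different optimal alignment, such as the one with four gaps shown on the slide.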
The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of a map whose coordinates are polynomials with one term for each path in the alignment graph. The logarithms of the 33 parameters give the edge lengths for the shortest path problem on the alignment graph.
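To get a feel for "one term for each path", the sum over all paths of the product of edge weights can be computed by the same kind of recursion used for shortest paths. This is a simplified sketch and not the pair hidden Markov model itself: it assumes only three hypothetical weights (one per edge direction) instead of the 33 parameters. With all weights equal to 1 it counts the paths, which are the terms of each coordinate polynomial.

```python
def sum_over_paths(n, m, diag=1, down=1, right=1):
    """Sum over all start-to-finish paths of the product of edge weights.

    The alignment graph is the (n+1) x (m+1) grid with diagonal, vertical and
    horizontal edges. The three weights are a hypothetical simplification.
    """
    f = [[0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1  # the empty path
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0
            if i > 0:
                total += f[i - 1][j] * down
            if j > 0:
                total += f[i][j - 1] * right
            if i > 0 and j > 0:
                total += f[i - 1][j - 1] * diag
            f[i][j] = total
    return f[n][m]


# Number of alignments (paths) for the example with n=5, m=4.
print(sum_over_paths(5, 4))
```

For n=5 and m=4 this gives 681 paths, a Delannoy number, so even this small example has coordinate polynomials with hundreds of terms.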
General Mathematical Framework
• Statistical models are algebraic varieties.
• Algebraic varieties can be tropicalized.
• Tropicalized models are useful for MAP inference in statistics.
L. Pachter and B. Sturmfels, Tropical Geometry of Statistical Models, Proceedings of the National Academy of Sciences, Volume 101:46 (2004), pp. 16132-16137.
L. Pachter and B. Sturmfels, Parametric Inference for Biological Sequence Analysis, Proceedings of the National Academy of Sciences, Volume 101:46 (2004), pp. 16138-16143.
2.1. Tropical arithmetic and dynamic programming In tropical algebraic geometry, varieties are piecewise linear…
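As a small illustration of tropical arithmetic, the sketch below uses the min-plus convention, where tropical addition is the minimum and tropical multiplication is ordinary addition. Evaluating a polynomial with these operations gives a minimum of linear functions, which is piecewise linear. The function names and the example coefficients are hypothetical.

```python
def tropical_add(x, y):
    """Tropical sum in the min-plus convention: the minimum."""
    return min(x, y)


def tropical_mul(x, y):
    """Tropical product: ordinary addition."""
    return x + y


def tropical_polynomial(coeffs, x):
    """Evaluate c_0 (+) c_1 (*) x (+) c_2 (*) x^2 ... tropically, i.e. min_k (c_k + k*x)."""
    value = float("inf")   # neutral element for tropical addition
    power = 0              # tropical x^0 is 0, the neutral element for tropical multiplication
    for c in coeffs:
        value = tropical_add(value, tropical_mul(c, power))
        power = tropical_mul(power, x)
    return value


# Hypothetical example: min(0, 1 + x, 3 + 2x), a piecewise linear function of x.
for x in (0, -2, -5):
    print(x, tropical_polynomial([0, 1, 3], x))
```

Replacing the sum with min and the product with plus in a sum-over-paths recursion turns it into a shortest-path recursion, which is the link to dynamic programming named in the section title.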
Comparative Genomics
Human tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Chimp tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Mouse tcccagatcagttcgt---atcaggtacccacCACATTCAGAAGTCTTCTTTCTTGGATAGACCGGACCA
Rat   tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCA
Dog   tttctgattcgtttgtaacattgagtacctacCTCATCTAGTATCTTTCTTTCTTTAATAGACTGGGTTA
* * * ** ** ** **** *** ** ** * ********* ****** * *
A phylogenetic tree on 5 taxa.
Comparative Genomics
The Petersen graph parametrizes trees on 5 taxa.
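One standard way to see the correspondence, sketched here under the usual description of the Petersen graph, is to take the 10 two-element subsets of the 5 taxa as vertices and join two subsets when they are disjoint. Each edge then names the two cherries of an unrooted binary tree on 5 taxa. The helper function is hypothetical; it uses the known count (2n-5)!! of unrooted binary trees on n labeled leaves.

```python
from itertools import combinations

taxa = range(1, 6)

# Vertices: the 10 pairs of taxa. Edges: pairs of disjoint vertices.
vertices = [frozenset(pair) for pair in combinations(taxa, 2)]
edges = [(u, v) for u, v in combinations(vertices, 2) if not (u & v)]


def count_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


degrees = {v: sum(v in edge for edge in edges) for v in vertices}

print(len(vertices), "vertices,", len(edges), "edges")
print(count_unrooted_binary_trees(5), "trees on 5 taxa")
```

Both counts come out to 15: the graph has 15 edges and there are 15 unrooted binary trees on 5 taxa.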
Trees are Ubiquitous in Biology Fig. 1. Y Chromosome of D. pseudoobscura Is Not Homologous to the Ancestral Drosophila Y Antonio Bernardo Carvalho and Andrew G. Clark, Science, January 7 2005.
[Figure: leaf labels 1 5 2 4 3 | 1 5 4 2 3 | 1 4 2 5 3 | 1 3 2 4 5]
Conclusion
Algebra, discrete mathematics and statistics are relevant for genomics.
[Figure labels: Organ system (digestive), Organ (liver), Tissue (liver sinusoid), Cell (hepatocyte), Organelle (nucleus), Molecule (DNA): TAGAGACGGGGGTTTCACAATGTTGGCCA]
Algebraic Statistics for Computational Biology Group Department of Mathematics, U.C. Berkeley http://math.berkeley.edu/~lpachter/ascb_book/ Photo courtesy of Robert Fisher Lawrence Hall of Science March 7th, 2005