Minimal Model for the Growth and Evolution of Genomes

Minimal Model for the Growth and Evolution of Genomes. International Symposium on Frontiers of Science, Tsinghua University, Beijing, 2002 June 17-19. HC LEE, National Central University, Computational Biology Laboratory.

Presentation Transcript


  1. Minimal Model for the Growth and Evolution of Genomes • International Symposium on Frontiers of Science, Tsinghua University, Beijing, 2002 June 17-19 • HC LEE, National Central University, Computational Biology Laboratory

  2. Plan of Presentation • The Human Genome Project • Life Science in silico • Some statistical properties of genomes • Models for evolution and growth of genomes • Some preliminary results • Discussion

  3. The Book of Life

  4. Genome - Book of Life written in four letters • DNA - a polymer of nucleotides • Nucleotide - backbone + base • Four types of bases: A, C, G, T (the four letters) • Gene - coded sequence of bases • Genome - set of all genes; set of all chromosomes (packaged pairs of DNA strands with double-helix structure) CBL@NCU

  5. The Human Genome Project • 1984 to 1986 - first proposed at US DOE • 1988 - endorsed by US National Research Council • creation of genetic, physical and sequence maps of the human genome • parallel efforts in key model organisms: bacteria, yeast, worms, flies and mice • development of supporting technology • ethical, legal and social issues (ELSI) • 1990 - Human Genome Project (NHGRI) • Later - UK, France, Japan, Germany, China

  6. Genome data exploded after 1995 • Growth of sequenced genome data (GenBank, as of 2002 January 13) • Figure axis: millions of sequences

  7. First working draft of Human Genome • Sequencing of first working draft of Human Genome published in 2001 February • Nature 409, February 15, 860-921 (2001) • Science 291, February 16, 1304-1351 (2001)

  8. Many completed genomes • 1995-2002 - Bacteria (about 75 organisms): 0.5-5 Mb, hundreds to 2000 genes • 1996 April - Yeast (Saccharomyces cerevisiae): 12 Mb, 5,500 genes • 1998 Dec. - Worm (Caenorhabditis elegans): 97 Mb, 19,000 genes • 2000 March - Fly (Drosophila melanogaster): 137 Mb, 13,500 genes • 2000 Dec. - Mustard (Arabidopsis thaliana): 125 Mb, 25,498 genes • 2001 Feb. - Human (Homo sapiens): 3000 Mb, 35,000~40,000 genes

  9. New way to do Life Science Research • in vivo (in the living organism) • in vitro (in the test tube) • in silico (in the computer)

  10. Life Science in silico • [biology] + [computer-science] + [math & physics] + [sequence data] = Life Science in silico • “It is much easier to teach biology to people from a math, physics or computer-science background than to teach a biologist how to code well.” - Nature, February 15, 2001, p963

  11. Structure of genome is complex • Many levels – genes, intergenic region, regulatory sections • Gene – network of introns and exons • Genome – network of genes • Random mutation • Genes are products of “blind watchmaker” • Once made, gene is repeatedly copied • paralogues, orthologues and pseudogenes • Genes are protected against rapid mutation

  12. Genome as text • Genome is a text of four letters - A, C, G, T • Frequencies of k-mers characterize the whole genome • E.g. counting frequencies of 7-mers with a “sliding window”: N(GTTACCC) = N(GTTACCC) + 1
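
The sliding-window count described on this slide can be sketched as follows. This is a minimal illustration, not the code used for the talk; the function name `count_kmers` and the toy sequence are assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer by sliding a window of width k one base at a time."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        counts[window] += 1  # N(window) = N(window) + 1
    return counts


# Hypothetical toy sequence (not real genome data) containing GTTACCC twice.
toy_sequence = "GTTACCCGTTACCCA"
counts = count_kmers(toy_sequence, 7)
print(counts["GTTACCC"], sum(counts.values()))
```

A sequence of length L yields L - k + 1 windows, so for a 1 Mb genome and k = 6 the 4096 possible 6-mers share about one million counts.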

  13. 8-mer portraits of bacterial genomes • Hao, Lee, Zhang (2000)

  14. Constructing Tree of Life with k-mers (16S rRNA, 35 organisms) • LF Luo, FM Ji, LC Hsieh, HC Lee (2001) • Black tree: distribution of 8-mers. Red tree: sequence alignment. • Figure labels: Bacteria, A. aeolicus, T. maritima, Eukarya, Archaea

  15. Textual statistics of genome almost random but NOT TRIVIALLY so • Distribution of GC content • GC content across genome correlated with density of coding regions • Distribution of frequencies of k-mers • Characterizes whole genomes • Same in coding and non-coding regions • Typically 10~15 times wider than normal

  16. Typing the Book of Life

  17. If genome grows randomly by single nucleotide then distribution is Poisson • Poisson: $P(f=k) = \lambda^k e^{-\lambda}/k!$, $\langle f \rangle = \lambda$, $D$ (standard deviation) $= \langle f \rangle^{1/2}$ • Gamma: $G(f) = f^{a-1} e^{-f/b} / (b^a\,\Gamma(a))$, $\langle f \rangle = ab$, $D = a^{1/2} b$ • Random single nucleotide: D = 15.5 • E. coli: a = 3.05, b = 80.0; D = 140
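
As a quick check of the numbers on this slide, the sketch below evaluates the two width formulas for 6-mers in a genome normalized to 1 Mb. The helper names are hypothetical; only the formulas and the parameter values a = 3.05, b = 80.0 come from the slide.

```python
import math


def poisson_std(mean: float) -> float:
    """Standard deviation of a Poisson distribution: D = <f>^(1/2)."""
    return math.sqrt(mean)


def gamma_mean_std(a: float, b: float) -> tuple[float, float]:
    """Mean and standard deviation of a gamma distribution: <f> = a*b, D = a^(1/2)*b."""
    return a * b, math.sqrt(a) * b


# Mean 6-mer frequency when 1,000,000 positions are shared among 4^6 = 4096 6-mers.
mean_6mer = 1_000_000 / 4**6

print(round(mean_6mer, 1), round(poisson_std(mean_6mer), 1))
print(gamma_mean_std(3.05, 80.0))
```

With the same mean of about 244, the gamma fit for E. coli is roughly nine times wider than the Poisson expectation for random single-nucleotide growth.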

  18. Non-uniform nucleotide composition breaks the n-mer Poisson distribution into n+1 peaks • Given [at]/[cg] = 70/30: if the mean frequency is 244, then the mean frequency of 6-mers with k a or t's and 6-k c or g's is $f_k = 244\,(0.7)^k (0.3)^{6-k}/(0.5)^6$ • k = 0, 1, 2, 3, 4, 5, 6 gives $f_k$ = 11.4, 26.6, 62.0, 144, 337, 787, 1837 • Figure: number of 6-mers vs frequency of 6-mers, random single nucleotide compared with M. jannaschii, peaks labelled by $f_k$
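
The peak positions quoted on this slide follow directly from the formula for $f_k$. The sketch below recomputes them and also checks that averaging over all 4096 6-mers recovers the overall mean of 244; the function name and the consistency check are additions for illustration.

```python
import math


def class_mean_frequency(k: int, n: int = 6, p_at: float = 0.7,
                         overall_mean: float = 244.0) -> float:
    """Mean frequency of n-mers containing k a/t bases and n-k c/g bases."""
    return overall_mean * p_at**k * (1.0 - p_at)**(n - k) / 0.5**n


peaks = [class_mean_frequency(k) for k in range(7)]

# Number of distinct 6-mers in each class: choose the k a/t positions,
# then pick one of two letters at each of the 6 positions.
kmers_per_class = [math.comb(6, k) * 2**6 for k in range(7)]
weighted_mean = sum(c * f for c, f in zip(kmers_per_class, peaks)) / 4**6

for k, f in enumerate(peaks):
    print(k, round(f, 1))
```

The seven classes are what split the single Poisson peak into n + 1 = 7 peaks when the base composition is skewed.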

  19. Standard deviation of distribution of GC content in Human Genome 15 times wider than normal • The Human Genome International consortium, Nature 409 (2001) 860-921

  20. Distribution is broad gamma rather than Poisson • Narrow Poisson - few objects (k-mers) distributed into many boxes (long genome) • Gamma - power-law rise and exponential tail • Broad gamma - many objects distributed into few boxes • many more entries (k-mers, GC contents, etc.) with very high or very low frequencies • Problem: number of k-mers ($4^k$, k < 10) much less than genome length (> 1 M for bacteria) • E.g. $1\,\mathrm{M}/4^6 = 1{,}000{,}000/4096 \approx 244$

  21. How does a genome evolve and grow? • Evolve by random mutation • replacement, insertion, deletion • Plus selection • affects only coding regions • not globally important • Cannot grow by random mutation alone • Otherwise Poisson distribution • Must grow to long length while retaining statistical characteristics of SHORT genome

  22. The genome mutates and copies itself • 50%, probably much more, of human genome composed of repeats • Many traces of repeats obliterated by mutation • Lower organisms may have longer genomes • Five types of repeats • transposable elements; processed pseudogenes; simple k-mer repeats; segmental duplications (10-300 kb); (large) blocks of tandemly repeated sequences

  23. A Conjecture on Genome Growth • Random early growth • Followed by • random self-copying and • random mutation • Self-copying - strategy for retaining and multiple usage of hard-to-come-by coded sequences (i.e. genes)

  24. The Model • The genome grows by random single base addition from nothing to an initial length much shorter than final length • Thereafter the genome evolves by random mutation and random self-copying, with a fixed frequency ratio

  25. The Model (continued) • Mutation is standard single point mutation: replacement, insertion and deletion • Random self-copying • random selection of site of copied segment • weighted random selection of length of copied segment • random selection of insertion site of copied segment
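
The two-phase model on these slides can be sketched as a small simulation. This is an illustrative reading under stated assumptions, not the code used for the talk: the three mutation types are taken as equally likely, the copied-segment length is drawn from an exponential weight with scale s (following the relation P(l)/P(l') = exp{-(l-l')/s} quoted elsewhere in the talk), a segment is capped at the current genome length, and each self-copy is preceded by exactly h mutations. All function names are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(genome: list, rng: random.Random) -> None:
    """One point mutation: replacement, insertion or deletion (assumed equally likely)."""
    kind = rng.choice(("replace", "insert", "delete"))
    if kind == "replace":
        genome[rng.randrange(len(genome))] = rng.choice(BASES)
    elif kind == "insert":
        genome.insert(rng.randrange(len(genome) + 1), rng.choice(BASES))
    elif len(genome) > 1:  # never delete the last remaining base
        del genome[rng.randrange(len(genome))]


def self_copy(genome: list, s: float, rng: random.Random) -> None:
    """Copy a random segment (exponentially weighted length, scale s) to a random site."""
    length = min(len(genome), max(1, round(rng.expovariate(1.0 / s))))
    start = rng.randrange(len(genome) - length + 1)
    segment = genome[start:start + length]
    site = rng.randrange(len(genome) + 1)
    genome[site:site] = segment


def grow_genome(initial_length: int, final_length: int, h: int, s: float,
                rng: random.Random) -> str:
    """Phase 1: random single-base growth. Phase 2: h mutations per self-copy."""
    genome = [rng.choice(BASES) for _ in range(initial_length)]
    while len(genome) < final_length:
        for _ in range(h):
            mutate(genome, rng)
        self_copy(genome, s, rng)
    return "".join(genome[:final_length])


# Small toy run; the talk uses an initial length of 1000 and a final length of 1 million.
toy_genome = grow_genome(200, 3000, h=20, s=100.0, rng=random.Random(1))
print(len(toy_genome))
```

The list-based genome keeps the sketch short but makes insertions and deletions linear in genome length, so a full 1 Mb run would call for a faster data structure.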

  26. Selecting copied segment length

  27. An explicit one-parameter model • l is the copied segment length • 0 < y < 1 is a random number • l =
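
The expression for l is incomplete here. One candidate that is consistent with the exponential weighting $P(l)/P(l') = \exp\{-(l-l')/s\}$ quoted later in the talk is inverse-transform sampling with the single parameter s. The derivation below is offered under that assumption and is not necessarily the formula shown on the slide.

```latex
P(l) = \frac{1}{s}\, e^{-l/s}, \qquad
F(l) = \int_0^{l} P(l')\, dl' = 1 - e^{-l/s}, \qquad
y = 1 - F(l) \;\Rightarrow\; l = -s \ln y, \quad 0 < y < 1 .
```

Because y is uniform on (0, 1), so is 1 - F(l), and the resulting lengths have mean $\langle l \rangle = s$.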

  28. Some results • Distribution of 6-mer frequency • Starting genome length 1000 • Final genome length 1 million • Mutation to self-copying event ratio 100 < h < 4000 • Length scale for copied segments 2500 < s < 100 K • Compared with E. coli (4.5 Mbp), B. subtilis (4.2 Mbp), M. jannaschii (1.7 Mbp) (all normalized to 1 Mbp)

  29. E. coli, [at]/[cg] = 50/50 • E. coli vs mutation + repeat (ratio 500:1, sigma = 15K): D = 140, 144 • E. coli vs random: D = 140, 15.5 • Figure axes: number of 6-mers vs frequency of 6-mers

  30. B. subtilis, [at]/[cg] = 60/40 • B. subtilis vs mutation + repeat (ratio 600:1, sigma = 15K): D = 167, 169 • B. subtilis vs random: D = 167, 79 • Figure axes: number of 6-mers vs frequency of 6-mers

  31. M. jannaschii, [at]/[cg] = 70/30 • M. jannaschii vs mutation + repeat (ratio 600:1, sigma = 15K): D = 320, 321 • M. jannaschii vs random: D = 320, 265 • Figure axes: number of 6-mers vs frequency of 6-mers

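  32. Gamma distribution reproduces higher moments

      | Organism / model                  | [at]/[gc] | a    | b    | D(2) | D(3) | D(4) | D(5) |
      |-----------------------------------|-----------|------|------|------|------|------|------|
      | E. coli                           | 50/50     |      |      | 140  | 147  | 213  | 252  |
      | gamma distribution                |           | 3.05 | 80.0 | 140  | 146  | 208  | 243  |
      | random w/o self-copy (Poisson)    |           |      |      | 15.6 | 3.6  | 20.7 | 10   |
      | w/ self-copy (h = 500, s = 15K)   |           |      |      | 144  | 148  | 212  | 247  |
      | B. subtilis                       | 60/40     |      |      | 168  | 223  | 316  | 400  |
      | gamma distribution                |           | 2.12 | 115  | 168  | 186  | 261  | 310  |
      | random w/o self-copy (Poisson/7)  |           |      |      | 79   | 68   | 109  | 117  |
      | w/ self-copy (h = 600, s = 15K)   |           |      |      | 169  | 194  | 266  | 311  |
      | M. jannaschii                     | 70/30     |      |      | 320  | 465  | 650  | 810  |
      | gamma distribution                |           | 0.58 | 418  | 320  | 439  | 609  | 767  |
      | random w/o self-copy (Poisson/7)  |           |      |      | 264  | 369  | 500  | 603  |
      | w/ self-copy (h = 600, s = 15K)   |           |      |      | 321  | 462  | 635  | 783  |

      Gamma distribution: $D(x) = x^{a-1} b^{-a} \exp(-x/b)/\Gamma(a)$ • $D(n) = \langle (x - \langle x \rangle)^n \rangle^{1/n}$ • $\langle x \rangle = 244 = ab$ • $D(2) = a^{1/2} b$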
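
The gamma-distribution rows of this table can be reproduced from a and b alone, using the standard central moments of a gamma distribution (second: $ab^2$, third: $2ab^3$, fourth: $3a(a+2)b^4$, fifth: $4a(5a+6)b^5$). The sketch below does this and also gives an empirical version of D(n) for a list of observed k-mer frequencies. The function names are hypothetical, and the sign handling for odd moments is an assumption, since the slide does not say how negative odd moments are treated.

```python
import math


def gamma_width_moments(a: float, b: float) -> dict:
    """D(n) = (n-th central moment)^(1/n) for a gamma distribution with shape a, scale b."""
    return {
        2: math.sqrt(a) * b,
        3: (2 * a) ** (1 / 3) * b,
        4: (3 * a * (a + 2)) ** (1 / 4) * b,
        5: (4 * a * (5 * a + 6)) ** (1 / 5) * b,
    }


def empirical_width_moment(values: list, n: int) -> float:
    """D(n) estimated from data; a negative odd moment keeps its sign (an assumption)."""
    mean = sum(values) / len(values)
    central = sum((v - mean) ** n for v in values) / len(values)
    return math.copysign(abs(central) ** (1 / n), central)


e_coli = gamma_width_moments(3.05, 80.0)
print({n: round(d) for n, d in e_coli.items()})
```

Applying `empirical_width_moment` to the 4096 observed 6-mer counts of a genome would give the organism rows, assuming the counts are first normalized to a 1 Mb genome as in the talk.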

  33. Result sensitive to values of two parameters • Mutation to self-copying event ratio h • bacterial genomes, 200 < h ~ 0.04 s < 800 • If h >> 800 (at s ~ 15K) • too many mutations • gets long genome with Poisson distribution • If h << 200 (at s ~ 15K) • too much self-copying • too few mutations • gets multiple copies of random short (initial) genome (distribution too wide)

  34. Mutation to self-copy ratio is 500 +/- 100 • Figure panels: h = 100, 250, 500, 2000, 4000 • Mutation/self-copy = h • Scale of repeat length = s = 15K • $P(l)/P(l') = \exp\{-(l-l')/s\}$ • [at]:[cg] = 70:30 (genome-like)

  35. Result sensitive to values of two parameters (cont’d) • Length scale s for copied segments • s ~ 10 K to 25 K for bacterial genomes • If s << 5 K (at h ~ 600) • genome grows too slowly • too many mutations • gets long genome with Poisson distribution • If s >> 25 K (at h ~ 600) • genome grows too quickly • too few mutations • gets multiple copies of random short (initial) genome (distribution too wide)

  36. Scale of repeat length cannot be too short • Figure panels: s = 0.5K, 2.5K, 15K, 50K, 1000K • Scale of repeat length = s • $P(l)/P(l') = \exp\{-(l-l')/s\}$ • Mutation/self-copy = h = 500 • [at]:[cg] = 70:30 (genome-like)

  37. Summary & Discussion • Genomes have overabundances of extremely frequent and extremely rare oligos (EFEROs) • Genomes have statistical properties of very SHORT sequences • Suggests genome grew by mutation and random self-copying • Minimal model with two parameters – length scale and event ratio - explains frequency of occurrence of k-mers (oligos) very well

  38. Darwinian gradualism or Punctuated equilibrium? • Palaeontologists have long debated the mode of evolution • Gradual evolution of Classical Darwinism (species varieties; Dawkins et al.) • Change by spurts, as in “Punctuated Equilibrium” (Burgess shale, missing link; Gould et al.) • Minimal model already accommodates two competing modes of evolution • Mutation - Classical Darwinism • Self-copying of long sequences - Punctuated Equilibrium • Seems Nature uses both modes

  39. A peek at the Universal Ancestor • Since extremely frequent and extremely rare oligos (EFEROs) are remnants of the early sequence, they characterize the common ancestor of phylogenetically related genomes • Should be able to use the set of EFEROs in whole genomes to construct phylogenetic trees of whole genomes • At each node of the tree would be an ancestor sequence characterized by a set of EFEROs • The ancestor of Life would be characterized by the minimum set of EFEROs

  40. Outlook • Punctuated equilibrium • more evidence in textual detail? • Time scale for evolution • time scale from mutation to self-copy ratio? • length scale of repeats verifiable? • more textual detail needed to refine model? • Universal Ancestor • Can we build a good tree using EFEROs? • Does a Universal Ancestor exist in terms of its EFEROs? • If so, can we reconstruct the sequence of the Universal Ancestor?

  41. The End • Thank you, everyone! • All computation by 謝立青 • CBL webpage: www.phy.ncu.edu.tw/hclee/index_eng.htm • Preprint: www.phy.ncu.edu.tw/hclee/preprints/gro_prsub.pdf
