1 / 353

Genome Reconstruction: A Puzzle with a Billion Pieces

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau & Pavel Pevzner University of California-San Diego. Outline. Introduction to Genome Sequencing The Newspaper Problem DNA Chips: A First Shot at Sequencing with Short Reads Two Mathematical Detours

peigi
Télécharger la présentation

Genome Reconstruction: A Puzzle with a Billion Pieces

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University of California-San Diego

  2. Outline • Introduction to Genome Sequencing • The Newspaper Problem • DNA Chips: A First Shot at Sequencing with Short Reads • Two Mathematical Detours • Introduction to Graph Theory • Euler’s Theorem • ECP vs. HCP and Algorithmic Complexity • From Euler and Hamilton to Fragment Assembly • De Bruijn and a Final Solution to Fragment Assembly • Generalizing Fragment Assembly

  3. Section 1: Introduction to Genome Sequencing

  4. What Is Genome Sequencing? • A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A,T,G, and C. • A human genome has roughly 3 billion nucleotides. • Genome sequencing is the process of determining the sequence of nucleotides that make up a genome. ...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATGATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGATGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTATGACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGACTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACTGCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGACTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTCAGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…

  5. What Is Genome Sequencing? • Different people have slightly different genomes: all humans share 99.9% of the same genetic code. • The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc. CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

  6. Species Sequencing vs. Individual Genome Sequencing • Species Sequencing: Determine the “consensus genome” of an entire species.

  7. Species Sequencing vs. Individual Genome Sequencing • Individual Sequencing: Determine how an individual differs from its species.

  8. Why Would We Want to Sequence a Genome? • Species genome sequencing: • Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for braindevelopment). • Reveal evolutionaryrelationships betweenspecies. • Determine the geneticmakeup of ourevolutionary ancestors.

  9. Why Would We Want to Sequence a Genome? • Individual genome sequencing: • Unearth the genetic basis of many diseases. • Forensics applications. • Example: In 2010, 6-year old Nicholas Volker became the first human being to be saved because of genome sequencing. • Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries. • Genome sequencing revealed a rare mutation in a gene linked to a defect inhis immune system. • This led doctors to use advancedimmunotherapy, which saved the child.

  10. Brief History of Genome Sequencing • Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods. • 1980: They share the Nobel Prize in Chemistry. • Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence the human genome. Walter Gilbert Frederick Sanger

  11. Brief History of Genome Sequencing • 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome. • 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal. Francis Collins Craig Venter

  12. Brief History of Mammalian Genome Sequencing • 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.

  13. Brief History of Mammalian Genome Sequencing • 2000s: Many more mammalian genomes are sequenced.

  14. The Arrival of Personal Genomics • 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude. • 2010: The market for sequencing machines takes off. • Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000. • Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month. • Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center. • 23andMe offers partial genome sequencing for $499. • Many universities introduce new courses in which students study their own genomes.

  15. The Future of Genome Sequencing • 2010s?: Genome sequencing will hopefully continue to bloom. • The $1,000 human genome may arrive as early as in 2012. • Hopefully, sequencing an individual genome will soon become as routine as an X-ray.

  16. What Makes Genome Sequencing So Difficult? • When we read a book, we can read the entire book one letter at a time from the beginning to the end. • However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. • Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads. • But we have no idea which genomic positions these reads come from! • We must figure out how to put the reads back together to assemble a genome.

  17. Section 2: The Newspaper Problem and Genome Sequencing

  18. The Newspaper Problem

  19. The Newspaper Problem

  20. The Newspaper Problem

  21. The Newspaper Problem

  22. The Newspaper Problem

  23. The Newspaper Problem

  24. The Newspaper Problem as an “Overlap Puzzle” • The newspaper problem is not the same as a jigsaw puzzle: • We have multiple copies of the same edition of a newspaper. • Plus, some pieces of paper got blown to bits in the explosion. • Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. • This gives us a giant overlap puzzle!

  25. Sequencing is Harder than Newspaper Problem • In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper.) • However, the “language” of DNA remains largely unknown.

  26. Sequencing is Harder than Newspaper Problem • There are lots of repeated substrings in every genome (50% of human genome is formed by repeats). • Example: GCTTis repeated 4 times in the following: • AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG • Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult tosolve (even with just 16 pieces).

  27. Sequencing a Genome: Lab + Computation • Read Generation (Experimental):Generate many reads from multiplecopies of the same genome. • Fragment Assembly (Computational):Use these reads to algorithmicallyput the genome back together.

  28. Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies

  29. Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation

  30. Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads

  31. Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads Fragment Assembly

  32. Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads Fragment Assembly Sequenced Genome …GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…

  33. Section 3: DNA Chips: A First Shot at Sequencing with Short Reads

  34. DNA Chips: From an Idea to a New Industry • 1989:RadojeDrmanac, AndreyMirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation. • Key Idea: Generate all k-mers (see below) from the genome in the hope that they can be assembled to reconstruct the genome. • 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.” • 2000: Arrays are a multi-billion dollar industry Mirzabekov Drmanac Southern k-mer: A string of length k (in an alphabet of 4 nucleotides)

  35. DNA Chips: Implementation • Synthesizea distinct k-mer in each of 4k cells in the array. • Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment. • DNA will hybridize witha k-mer if it contains thecomplement of that k-mer. • Use a spectroscope todetermine which sites emitlight …the complementsof these sites will reveal thek-mers in the unknownDNA fragment = our reads!

  36. DNA Chips: Illustration

  37. DNA Chips: Example • What are our reads?

  38. DNA Chips: Example • What are our reads? CAT

  39. DNA Chips: Example • What are our reads? CAT ||| ATG

  40. DNA Chips: Example • What are our reads? CAT ATG

  41. DNA Chips: Example • What are our reads? CAT ATG

  42. DNA Chips: Example • What are our reads? CAT ATG

  43. DNA Chips: Example • What are our reads? CAT ATG

  44. DNA Chips: Example • What are our reads? • So 3-mer ATG mustoccur in the genome! ATG

  45. Red 3-mers Must Occur in the Genome • What are our reads? CAC  GTG CGC  GCG • CAT  ATG TGC  GCA ACG  CGT ATT  AAT CCA  TGG GCA  TGC GCC  GGC TTG  CAA

  46. Red 3-mers Must Occur in the Genome • What are our reads? • CAC CGC  GCG • CAT  ATG

  47. Red 3-mers Must Occur in the Genome • What are our reads? • CAC  GTG CGC  GCG • CAT  ATG

  48. Red 3-mers Must Occur in the Genome • What are our reads? • CAC  GTG • CGC • CAT  ATG • TGC  GCA • ACG  CGT • ATT  AAT • CCA  TGG • GCA  TGC • GCC  GGC • TTG  CAA

  49. Red 3-mers Must Occur in the Genome • What are our reads? • CAC  GTG • CGC  GCG • CAT  ATG • TGC  GCA • ACG  CGT • ATT  AAT • CCA  TGG • GCA  TGC • GCC  GGC • TTG  CAA

  50. Red 3-mers Must Occur in the Genome • What are our reads? • CAC  GTG • CGC  GCG • CAT  ATG • TGC

More Related