4.66k likes | 5.36k Vues
Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau & Pavel Pevzner University of California-San Diego. Outline. Introduction to Genome Sequencing The Newspaper Problem DNA Chips: A First Shot at Sequencing with Short Reads Two Mathematical Detours
E N D
Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University of California-San Diego
Outline • Introduction to Genome Sequencing • The Newspaper Problem • DNA Chips: A First Shot at Sequencing with Short Reads • Two Mathematical Detours • Introduction to Graph Theory • Euler’s Theorem • ECP vs. HCP and Algorithmic Complexity • From Euler and Hamilton to Fragment Assembly • De Bruijn and a Final Solution to Fragment Assembly • Generalizing Fragment Assembly
What Is Genome Sequencing? • A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A,T,G, and C. • A human genome has roughly 3 billion nucleotides. • Genome sequencing is the process of determining the sequence of nucleotides that make up a genome. ...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATGATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGATGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTATGACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGACTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACTGCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGACTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTCAGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…
What Is Genome Sequencing? • Different people have slightly different genomes: all humans share 99.9% of the same genetic code. • The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc. CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
Species Sequencing vs. Individual Genome Sequencing • Species Sequencing: Determine the “consensus genome” of an entire species.
Species Sequencing vs. Individual Genome Sequencing • Individual Sequencing: Determine how an individual differs from its species.
Why Would We Want to Sequence a Genome? • Species genome sequencing: • Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for braindevelopment). • Reveal evolutionaryrelationships betweenspecies. • Determine the geneticmakeup of ourevolutionary ancestors.
Why Would We Want to Sequence a Genome? • Individual genome sequencing: • Unearth the genetic basis of many diseases. • Forensics applications. • Example: In 2010, 6-year old Nicholas Volker became the first human being to be saved because of genome sequencing. • Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries. • Genome sequencing revealed a rare mutation in a gene linked to a defect inhis immune system. • This led doctors to use advancedimmunotherapy, which saved the child.
Brief History of Genome Sequencing • Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods. • 1980: They share the Nobel Prize in Chemistry. • Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence the human genome. Walter Gilbert Frederick Sanger
Brief History of Genome Sequencing • 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome. • 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal. Francis Collins Craig Venter
Brief History of Mammalian Genome Sequencing • 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.
Brief History of Mammalian Genome Sequencing • 2000s: Many more mammalian genomes are sequenced.
The Arrival of Personal Genomics • 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude. • 2010: The market for sequencing machines takes off. • Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000. • Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month. • Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center. • 23andMe offers partial genome sequencing for $499. • Many universities introduce new courses in which students study their own genomes.
The Future of Genome Sequencing • 2010s?: Genome sequencing will hopefully continue to bloom. • The $1,000 human genome may arrive as early as in 2012. • Hopefully, sequencing an individual genome will soon become as routine as an X-ray.
What Makes Genome Sequencing So Difficult? • When we read a book, we can read the entire book one letter at a time from the beginning to the end. • However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. • Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads. • But we have no idea which genomic positions these reads come from! • We must figure out how to put the reads back together to assemble a genome.
The Newspaper Problem as an “Overlap Puzzle” • The newspaper problem is not the same as a jigsaw puzzle: • We have multiple copies of the same edition of a newspaper. • Plus, some pieces of paper got blown to bits in the explosion. • Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. • This gives us a giant overlap puzzle!
Sequencing is Harder than Newspaper Problem • In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper.) • However, the “language” of DNA remains largely unknown.
Sequencing is Harder than Newspaper Problem • There are lots of repeated substrings in every genome (50% of human genome is formed by repeats). • Example: GCTTis repeated 4 times in the following: • AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG • Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult tosolve (even with just 16 pieces).
Sequencing a Genome: Lab + Computation • Read Generation (Experimental):Generate many reads from multiplecopies of the same genome. • Fragment Assembly (Computational):Use these reads to algorithmicallyput the genome back together.
Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies
Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation
Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads
Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads Fragment Assembly
Sequencing a Genome: Illustration Multiple (Unsequenced) Genome Copies Read Generation Reads Fragment Assembly Sequenced Genome …GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…
Section 3: DNA Chips: A First Shot at Sequencing with Short Reads
DNA Chips: From an Idea to a New Industry • 1989:RadojeDrmanac, AndreyMirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation. • Key Idea: Generate all k-mers (see below) from the genome in the hope that they can be assembled to reconstruct the genome. • 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.” • 2000: Arrays are a multi-billion dollar industry Mirzabekov Drmanac Southern k-mer: A string of length k (in an alphabet of 4 nucleotides)
DNA Chips: Implementation • Synthesizea distinct k-mer in each of 4k cells in the array. • Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment. • DNA will hybridize witha k-mer if it contains thecomplement of that k-mer. • Use a spectroscope todetermine which sites emitlight …the complementsof these sites will reveal thek-mers in the unknownDNA fragment = our reads!
DNA Chips: Example • What are our reads?
DNA Chips: Example • What are our reads? CAT
DNA Chips: Example • What are our reads? CAT ||| ATG
DNA Chips: Example • What are our reads? CAT ATG
DNA Chips: Example • What are our reads? CAT ATG
DNA Chips: Example • What are our reads? CAT ATG
DNA Chips: Example • What are our reads? CAT ATG
DNA Chips: Example • What are our reads? • So 3-mer ATG mustoccur in the genome! ATG
Red 3-mers Must Occur in the Genome • What are our reads? CAC GTG CGC GCG • CAT ATG TGC GCA ACG CGT ATT AAT CCA TGG GCA TGC GCC GGC TTG CAA
Red 3-mers Must Occur in the Genome • What are our reads? • CAC CGC GCG • CAT ATG
Red 3-mers Must Occur in the Genome • What are our reads? • CAC GTG CGC GCG • CAT ATG
Red 3-mers Must Occur in the Genome • What are our reads? • CAC GTG • CGC • CAT ATG • TGC GCA • ACG CGT • ATT AAT • CCA TGG • GCA TGC • GCC GGC • TTG CAA
Red 3-mers Must Occur in the Genome • What are our reads? • CAC GTG • CGC GCG • CAT ATG • TGC GCA • ACG CGT • ATT AAT • CCA TGG • GCA TGC • GCC GGC • TTG CAA
Red 3-mers Must Occur in the Genome • What are our reads? • CAC GTG • CGC GCG • CAT ATG • TGC