New Sequencing Technologies & Diploid Personal Genomes

New Sequencing Technologies & Diploid Personal Genomes. George Church, Thu 27-Apr-2006, 9:30-11, Broad-MPG. Thanks to: NHGRI Seq Tech 2004: Agencourt, 454, Microchip; 2005: Nanofluidics, Network, VisiGen, Affymetrix, Helicos, Solexa-Lynx

‘Next Generation’ Technology Development (company: our role)
Multi-molecule: Affymetrix: Software; Gorfinkel: Polony to Capillary; 454 LifeSci: Paired ends, emulsion; Lynx/Solexa: Multiplexing & polony; Agencourt: Seq by Ligation (SbL)
Single molecules: Helicos Biosci: SAB, cleavable fluors; Pacific Biosci: -; Agilent: Nanopores; Visigen Biotech: -; Complete Genomics: SbL

Sequencing components • Applications & goals • Cost, accuracy, continuity goals • Source, consent, ELSI • Sample prep • Technology development, deployment, scaling • Software: data acquisition to interpretation • Human interface, education

Sequencing applications • Environment (genetic): maternal, allergens, microbes • Small mutations: whole genome vs targeted • DNA copy number & rearrangements (paired ends) • Exons conserved &/or mutable regions • Haplotype: LD &/or causative combinations in cis • RNA Digital Analysis of Gene Expression (by counting) • RNA splicing (that arrays can’t handle) • Proteomics: MS, Ab, aptamers • Metabolomics: MS, Ab, aptamers • Microbial evolution resequencing (needs consensus accuracy) • Cancer resequencing • Gene synthesis by sequencing (needs raw accuracy) • DNA methylation

Why single chromosome sequencing? (or single cell or single particle?) (1) When we only have one cell as in Preimplantation Genetic Diagnosis (PGD) or environmental samples (2) Sequence relations >100 kbp (haplotypes) (3) Prioritizing or pooling (rare) species based on an initial DNA screen (4) Anything relating 2 or more chromosomes (in a cell or virus) (5) Cell-cell interactions (e.g. predator-prey, symbionts, commensals, parasites, etc)

Sequencing/genotyping on single human chromosomes Method#1: ‘in situ’ haplotyping 153Mbp Zhang et al. Nature Genet. Mar 2006

Sequencing/genotyping on single human chromosomes Method#2: Chromosome dilution library QC: Reverse-FISH of amplicons Amplicon 19 Amplicon 6q

Single chromosome molecule sequencing • How? • Isothermal Strand Displacement Amplification from a single chromosome (Ploning) • Shotgun sequencing on the amplicon • Challenges • Non-specific amplification competes with a single template molecule • Amplicons have high-order DNA structures, which create problems in sequencing library construction

Single cell chromosome molecule sequencing: S1 nuclease digestion; DNA pol I nick translation; Phi-29 debranching. Reduce chimeras when cloning from SDA plones: from 19% to 6%

Single cell chromosome molecule sequencing: ploning & sequencing 2.5 Mbp molecules. Plone amplification errors: < 1.7×10^-5

Integrated Polony Sequencing Pipeline (open source hardware, software, wetware): In vitro paired tag libraries → Bead polonies via emulsion PCR → Enrich amplified beads → Monolayer gel immobilization → SBE or SBL sequencing (epifluorescence & flow cell) → SOFTWARE: Images → Tag Sequences; Tag Sequences → Genome. Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum, Wang, Zhang, Mitra, Church (2005) Science 309:1728.

Paired-end libraries (figure: library construction workflow). Step labels: Shear or Nla III digest; select + dilute, ligate; amplify (hRCA); digest (Mme I); ligate (L, R, M); amplify; ePCR. Shendure, Porreca, et al. (2005) Science 309:1728; Margulies et al. (2005) Nature 437:376.

Distribution of Distances Between Mate-Paired Tags (figure: frequency vs distance (bp), axis marks at 1.0 kb and 2.0 kb). Distance: 980 ± 96 bp. 10.7 bp FT.

ePCR bead: 4 positions for paired-end anchor 'primers' (figure labels: Tag 1, Tag 2, L, M, R, 5’, 3’; reads of 7 bp, 7 bp, 6 bp, 6 bp). Each yields 6 to 7 bp of contiguous sequence. 34 bp new sequence per 135 bp amplicon.

Sequencing by Ligation (SBL) with fluorescent combinatorial 9-mers
Probe                     Excitation (nm)   Emission (nm)
5’-Cy5-nnnnAnnnn-3’       647               700
5’-Cy3-nnnnGnnnn-3’       555               605
5’-TR-nnnnCnnnn-3’        572               630
5’-Cy3+Cy5-nnnnTnnnn-3’   555               700
5'PO4 ACUCAUC… (3’)…TAGAGT????????????????TGAGTAG…(5’)
Shendure, Porreca, et al. (2005) Science 309:1728

Automation Schematic: microscope & xyz stage; flow-cell; HPLC autosampler (96 wells); syringe pump; temperature control

Off the Shelf Instrumentation: $140,000. Mitra, Shendure, Porreca

Image Collection & Data Processing: 514 raster positions x 4 images per cycle; 26 cycles of sequencing; 2 additional image sets for object-finding algorithms; 54996 images (1000 x 1000, 14-bit); 100 GBytes; 5M reads; $500 per run. Porreca et al.
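As a rough check on the data volume quoted above, the sketch below multiplies the image count by the pixel count. It assumes, and this is not stated on the slide, that each 14-bit pixel is stored in a 16-bit (2-byte) word with no compression.

```python
# Rough data-volume check for one polony sequencing run.
# Assumption (not on the slide): each 14-bit pixel occupies 2 bytes on disk.
IMAGES = 54996                  # images per run, as quoted on the slide
PIXELS_PER_IMAGE = 1000 * 1000  # 1000 x 1000 pixels
BYTES_PER_PIXEL = 2             # assumed 16-bit storage for 14-bit data


def run_size_bytes(images=IMAGES, pixels=PIXELS_PER_IMAGE,
                   bytes_per_pixel=BYTES_PER_PIXEL):
    """Return the uncompressed size of all images in bytes."""
    return images * pixels * bytes_per_pixel


size_gb = run_size_bytes() / 1e9
print(f"{size_gb:.0f} GB")  # prints "110 GB"
```

Under that storage assumption the result is of the same order as the 100 GBytes on the slide.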

Open Source Readmapper: v2.0 (Gary Gao, Sasha Wait), v1.0 (Shendure, Porreca et al)
Approach 1: Hash all the reads (n). Scan genome (m), and for each window: Does current window exist in hash? If so, move downstream, scan d positions & test hash for membership. Cost: m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads
Approach 2: Hash all possible reads from genome (m). Scan the reads (n), and for each: Does it occur in the hash? If so, does the second exist? If so, take union (k). Cost: n * k = 10 hours, 1 node, 1.6e6 reads
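The "hash all the reads, then scan the genome" strategy can be sketched as follows. This is a minimal illustration, not the Readmapper code: the function name `map_paired_reads`, its arguments, exact matching, forward-strand-only search, and equal-length first tags are all simplifying assumptions.

```python
from collections import defaultdict


def map_paired_reads(genome, pairs, max_dist):
    """Map (tag1, tag2) pairs by hashing the reads and scanning the genome once.

    Returns {pair index: [(tag1_pos, tag2_pos), ...]}.
    Simplifications: exact matches, forward strand only, equal-length first tags.
    """
    # Hash all the reads (n): first tag -> indices of the pairs that carry it.
    by_first_tag = defaultdict(list)
    for idx, (tag1, _tag2) in enumerate(pairs):
        by_first_tag[tag1].append(idx)

    tag_len = len(pairs[0][0]) if pairs else 0
    hits = defaultdict(list)
    # Scan the genome (m): test every window for membership in the hash.
    for pos in range(len(genome) - tag_len + 1):
        window = genome[pos:pos + tag_len]
        for idx in by_first_tag.get(window, ()):
            tag2 = pairs[idx][1]
            # Move downstream and scan up to max_dist (d) positions for the mate.
            start = pos + tag_len
            stop = min(start + max_dist, len(genome) - len(tag2) + 1)
            for mate_pos in range(start, stop):
                if genome[mate_pos:mate_pos + len(tag2)] == tag2:
                    hits[idx].append((pos, mate_pos))
    return dict(hits)


placements = map_paired_reads("ACGTTTGACCATGGCAAT",
                              [("ACGT", "CATG"), ("GGGG", "CCCC")],
                              max_dist=10)
print(placements)  # {0: [(0, 9)]}
```

The genome is read once and each hit triggers at most d downstream tests, which is where the m + (n * d) cost comes from.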

Error quantitation: 6X consensus <3E-7 [>Q65, 99.99997%]; median raw Polony = 3E-3 (99.7%); 454 raw = 4E-2 (96%). Shendure, Porreca et al, 2005
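The Q value and percentages above follow from the standard Phred relation Q = -10 log10(p). A small sketch to reproduce them (the helper names are illustrative only):

```python
import math


def phred(error_rate):
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10 * math.log10(error_rate)


def accuracy_percent(error_rate):
    """Per-base accuracy as a percentage."""
    return 100 * (1 - error_rate)


for label, rate in [("6X consensus", 3e-7),
                    ("raw Polony", 3e-3),
                    ("raw 454", 4e-2)]:
    print(f"{label}: Q{phred(rate):.1f}, {accuracy_percent(rate):.5f}%")
```

An error rate of 3E-7 comes out just above Q65, in line with the ">Q65" on the slide.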

Cost vs consensus error rate
Columns: ABI; 454 (Sep05); Polony (Sep05); Polony (Feb 06)
$/kb @ 4E-5: $7; $9; 0.8; 0.07
$/3e9 @ 1X: 3M; 300K; $30K
Paired ends: yes; no; yes
Device $: 300K; 500K; 140K

Why low error rates? Goal of genotyping & resequencing → Discovery of variants, e.g. cancer somatic mutations ~1E-6 (or lab evolved cells)
Consensus error rate   Method             Total errors (E. coli)   Total errors (Human)
1E-4                   Bermuda/Hapmap     500                      600,000
4E-5                   454 @40X           200                      240,000
3E-7                   Polony-SbL @6X     0                        1800
1E-8                   Goal for 2006      0                        60
Also, effectively reduce (sub)genome target size by enrichment for exons or common SNPs to reduce cost & # false positives.
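The totals above are the consensus error rate multiplied by the genome size. The sketch below reproduces them under two assumed sizes that are not on the slide but are implied by its first row: about 5E6 bp for E. coli and 6E9 bp for a diploid human genome.

```python
# Expected total errors = consensus error rate x genome size.
# Genome sizes are assumptions inferred from the 1E-4 row (500 and 600,000 errors).
ECOLI_BP = 5e6
HUMAN_DIPLOID_BP = 6e9

RATES = {
    "Bermuda/Hapmap": 1e-4,
    "454 @40X": 4e-5,
    "Polony-SbL @6X": 3e-7,
    "Goal for 2006": 1e-8,
}


def expected_errors(error_rate, genome_bp):
    """Expected number of consensus errors across a genome of genome_bp bases."""
    return error_rate * genome_bp


for name, rate in RATES.items():
    ecoli = expected_errors(rate, ECOLI_BP)
    human = expected_errors(rate, HUMAN_DIPLOID_BP)
    print(f"{name}: E. coli {ecoli:.2f}, human {human:.0f}")
```

For Polony-SbL the product for E. coli is 1.5 rather than the 0 listed; the slide's 0 may be an observed count rather than an expectation, though that reading is a guess.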

Mutation Discovery in Engineered/Evolved E.coli Shendure, Porreca, et al. (2005) Science 309:1728

Sequence monitoring of evolution (optimize small molecule synthesis/transport). Sequence trp-. Reppas, Lin & Church

ompF - non-specific transport channel. Can increase import & export capability simultaneously • Glu-117 → Ala (in the pore) • Charged residue known to affect pore size and selectivity • Promoter mutation at position (-12): AAAGAT → CAAGAT (figure: positions -12 to -6) • Makes -10 box more consensus-like

Co-evolution of mutual biosensors, sequenced across time & within each time-point. 3 independent lines of Trp/Tyr co-culture frozen. OmpF: 42 R->G, L, C; 113 D->V; 117 E->A. Promoter: -12 A->C; -35 C->A. Lrp: 1bp deletion, 9bp deletion, 8bp deletion, IS2 insertion, R->L in DBD. Heterogeneity within each time-point reflecting colony heterogeneity.

Mixture of wild & 2kb Inversion (pin) (figure: proximal tag placement vs distal tag placement, 1,206k to 1,210k; incorrect distance; Red = same strand, Black = opposite strand). Using paired ends, rearrangement & copy-number detection is >1000X easier than point mutation detection (6X vs 6000X)
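A minimal sketch of how mate-paired tags can flag a rearrangement: a pair is suspicious if the two tags map at an unexpected distance or in an unexpected relative orientation. The 980 ± 96 bp distance is taken from the mate-pair distribution slide; the 3-standard-deviation cutoff, the default expected orientation, and the function name are assumptions made for illustration.

```python
def classify_pair(pos1, pos2, same_strand, expect_same_strand=True,
                  mean=980.0, sd=96.0, n_sd=3.0):
    """Label one mapped mate pair as concordant or discordant.

    expect_same_strand and n_sd are illustrative assumptions; set them to
    match the library actually used.
    """
    distance = abs(pos2 - pos1)
    distance_ok = abs(distance - mean) <= n_sd * sd
    orientation_ok = same_strand == expect_same_strand
    if distance_ok and orientation_ok:
        return "concordant"
    if not orientation_ok:
        return "orientation-discordant"  # candidate inversion
    return "distance-discordant"         # candidate insertion or deletion


print(classify_pair(1_206_000, 1_207_000, same_strand=True))
print(classify_pair(1_206_000, 1_207_000, same_strand=False))
print(classify_pair(1_206_000, 1_210_000, same_strand=True))
```

A handful of consistent discordant pairs at one locus can reveal a rearrangement, which is why far less coverage is needed than for calling a point mutation.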

Human Diplome Sequencing Strategies (open source hardware, software, wetware): Diplome chromosome dilution shotgun (0.01X $300); Exons & conserved 3% (6X $9K); 40K RNA diplome (10X MIP pool $20); 1M Causative Genome Changes CGCs (10X MIP pool $20). Methods: Strand displacement amplification (ploning); Polony sequencing 7E8 pixels; Chip Genotyping/Haplotyping. Personal Genome Project (ELSI)

Padlock, Molecular Inversion Probes (MIPs). Causative Genomic Changes (CGCs, e.g. conserved 3%) (not restricted to Single Nucleotides or Polymorphisms >1%). (Figure labels: L, R, optional multiplex tag, universal primers, genomic DNA, alternative alleles CG, CA, TG.) Hardenbol .. Landegren Davis et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol. 2003 21:673-8. “10,000 targeted SNPs genotyped in a single tube assay.” Genome Res. 2005 15:269. Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human Genetic Disease. Genome Biol. 4:R72. (CG to CA, TG)

MIPs for VDJ Polonies, over the whole field of human T-cells. 1 TRAC + 2 TRBC primers, cDNA xxx. 47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos, or 47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos. In situ RCA or PCR for each T-cell. Polony sequencing of tag &/or gap fill (e.g. 18 to 33bp in CDR3) (two tags per cell sufficient?) http://www.infobiogen.fr/services/chromcancer/Genes/TCRBID24.html
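The two oligo counts above can be checked directly. The sketch reads the 46-segment V count as the beta-chain V segments (TRBV), which is an interpretation based on its pairing with TRBJ and TRBC; the variable names are illustrative.

```python
# Segment counts as given on the slide.
TRAV, TRAJ, TRAC = 47, 50, 1   # alpha chain: V, J, C
TRBV, TRBJ, TRBC = 46, 13, 2   # beta chain: V, J, C (46 read as TRBV, an interpretation)

# One MIP per V-J combination in each chain.
vj_oligos = TRAV * TRAJ + TRBV * TRBJ
# One MIP per V-C combination in each chain (capturing across the J region).
vc_oligos = TRAV * TRAC + TRBV * TRBC

print(vj_oligos, vc_oligos)  # 2948 139
```

Pairing each V segment with a constant-region primer instead of every J segment cuts the pool roughly twentyfold.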

Human subjects consent: “Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.” http://www.hapmap.org/downloads/elsi/CEPH_Reconsent_Form.pdf YRI = Yoruba, Ibadan, Nigeria; JPT = Japan, Tokyo; CHB = China (Han) Beijing; CEU = CEPH (N&W Europe) Utah

Is anonymity in genomics realistic? http://arep.med.harvard.edu/PGP/Anon.htm
(1) Re-identification after “de-identification” using other public data. Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998).
(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the U. Washington Medical Center -- 4000 files (names, conditions, etc, 2000).
(3) Combination of surnames from genotype with geographical info. An anonymous sperm donor was traced on the internet in 2005 by his 15 year old son, who used his own Y chromosome genealogy to access surname relations.
(4) Inferring phenotype from genotype. Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known & the list is growing.
(5) Unexpected self-identification. An example of this at Celera undermined confidence in the investigators. Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.
(6) A tiny amount of DNA data in the public domain with a name leverages the rest. This would allow the vast amount of DNA data in the HapMap (or other study) to be identified. This can happen for example in court cases even if the suspect is acquitted.
(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.
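Item (1) is a record-linkage attack: two datasets are joined on the quasi-identifiers they share. The sketch below illustrates the idea on made-up records. Every name, date, zip code, field name, and the `reidentify` helper are hypothetical and stand in for no real dataset.

```python
def reidentify(deidentified, public, keys=("birth_date", "gender", "zip")):
    """Link records on shared quasi-identifiers; keep only unambiguous matches."""
    # Index the public list by its quasi-identifier tuple.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for record in deidentified:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate means the record is re-identified
            matches[record["record_id"]] = names[0]
    return matches


# Hypothetical toy data: a "de-identified" medical table and a public voter list.
medical = [
    {"record_id": "r1", "birth_date": "1950-01-02", "gender": "M", "zip": "00001"},
    {"record_id": "r2", "birth_date": "1980-05-06", "gender": "F", "zip": "00002"},
]
voters = [
    {"name": "Person A", "birth_date": "1950-01-02", "gender": "M", "zip": "00001"},
    {"name": "Person B", "birth_date": "1980-05-06", "gender": "F", "zip": "00002"},
    {"name": "Person C", "birth_date": "1980-05-06", "gender": "F", "zip": "00002"},
]
print(reidentify(medical, voters))  # {'r1': 'Person A'}
```

Record r2 stays ambiguous only because two people share its birth date, gender, and zip code; the fewer people who share a combination, the less protection removing names provides.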

"Open-source" Personal Genome Project (PGP) • Harvard Medical School IRB Human Subjects protocol • submitted Sep-2004, approved Aug-2005 renewed Feb-2006. • Start with 3 highly-informed individuals consenting to non-anonymous genomes & extensive phenotypes (medical records, imaging, omics). • Cell lines in Coriell NIGMS Repository • G M Church GM (2005) The Personal Genome Project • Nature Molecular Systems Biology doi:10.1038/msb4100040 • Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource.N Engl J Med. 10;353(19):2074-7.

Discussion: Ascertainment bias vs. risk of disclosure without consent. It is likely that less-privileged citizens ‘might be’ less likely to volunteer & will be more likely to volunteer due to higher financial risk. These same people ‘might be’ even less likely to volunteer if the data might become public. These same folks might be especially impacted socially if identifying (genome and/or phenome) data were to get out after they were assured that it would not.

Proposal for multi-tiered (re)consent of subjects in genomic studies. Five categories:
1) Withdrawal from studies due to new information on risks (all data destroyed).
2) Highest security (possibly higher than the original study): encryption, aggressive de-identification, only expert access with IRB-approval of each person, not whole teams. Consent form clearly states the risks (see previous slides).
3) Medium security, similar to current practice, but consented as above. IRB approval for teams to download de-identified data.
4) Open-PGP-type security. Click-through agreement. IRB-approval only for data collection, not for data reading.
5) Fully open. No IRB approval; full web access, e.g. subject initiated.
