10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics
Genome Exome Transcriptome Metagenome
Differences between … • Individuals in populations • Child and parents • Cancer and host genome • Large pedigrees of animals • Bacterial populations inside individuals • Bacterial populations in the world
Real world problems … • What is wrong with this newborn child? • Why are these cells cancerous and what should we do about it? • We have 6,000 individuals in 1,500 families with cleft palate – what causes this?
Real world problems … • There is a hard-to-treat infectious disease in a hospital ward – where did it come from and is it the same as the one at another hospital? • Is this water safe to drink? • …
Human Genome: 3 billion nucleotides • Exome: 30 million nucleotides
Differences between human genomes - SNPs • ACGTTAGTGA • ACGTTAGTGA • ACGTTCGTGA • ACGTTGGTGA • ~ 1 / 1,000 • 3,000,000 nt
Differences between human genomes - MNPs • ACGTTAGTGA • ACGTTAGTGA • ACGTTCAGA • ACGTTGTGA
Differences between human genomes - indels • ACGTTAGTGA • ACGTTAGTGA • ACGTTGTGA • ACGTTGGTGA • ~ 1 / 10,000 • 300,000
Differences between human genomes - inserts • ACGTTAGTGA • ACGTTAGTGA • TTAGGACCCA • Up to 1,000,000 nt • total 3,000,000 nt
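The totals quoted on the SNP and indel slides follow from the 3 billion nt genome size and the per-site rates. A minimal arithmetic sketch (the function name `expected_sites` is illustrative, not from the talk):

```python
GENOME_NT = 3_000_000_000  # human genome size quoted in the slides


def expected_sites(genome_nt: int, one_in: int) -> int:
    """Expected number of differing sites when one site in `one_in` differs."""
    return genome_nt // one_in


snps = expected_sites(GENOME_NT, 1_000)     # SNPs at ~1 / 1,000
indels = expected_sites(GENOME_NT, 10_000)  # indels at ~1 / 10,000
print(snps, indels)
```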
REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG
Solving the Jigsaw • Indexing • Alignment • SNP/MNP/Indel calling Mapping
Indexing • ACGTTAGTGAAG • ACGTTCGTGAAG • ACGTTAGTGAAG • ACGTTCGTGAAG • 4.5 billion
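The slides do not say which index structure is used. One common approach, shown here purely as an assumption, is a hash table from every k-mer of the reference to the positions where it occurs. The function name `build_kmer_index` and the choice `k=5` are illustrative; the toy reference is the two 12-nt sequences from the slide joined together.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the reference to the sorted list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


# Toy reference: the two 12-nt sequences from the slide, concatenated.
reference = "ACGTTAGTGAAG" + "ACGTTCGTGAAG"
index = build_kmer_index(reference, k=5)
print(index["GTGAA"])  # start positions of the k-mer GTGAA
```

Looking up a k-mer taken from a read then gives candidate positions in constant expected time, at the cost of memory proportional to the reference length.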
Aligning • ACGTTAGTGAAG • ACGTTCGTGAAG • 1.6 billion
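Once an index has proposed candidate positions, each candidate still has to be checked against the read. The sketch below assumes the simplest possible scoring, a mismatch count with no gaps, which is not necessarily what the talk's aligner does. The names `hamming`, `best_alignment` and `max_mismatches` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(reference, read, candidates, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in candidates:
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        distance = hamming(window, read)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (pos, distance)
    return best


reference = "ACGTTAGTGAAG" + "ACGTTCGTGAAG"
read = "ACGTTCGTGAAG"
print(best_alignment(reference, read, candidates=[0, 12]))
```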
Cutting Edge Run • Human genome (3 billion nt) • 1 billion reads of 100 nt, coverage of 30 • Indexing + Aligning in 27 minutes
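The coverage figure can be checked from the other numbers on the slide: total sequenced bases divided by genome size. The calculation gives about 33, in line with the slide's round figure of 30. The throughput line simply divides the read count by the 27-minute wall-clock time; the function name is illustrative.

```python
def mean_coverage(num_reads: int, read_length_nt: int, genome_nt: int) -> float:
    """Average number of reads covering each genome position."""
    return num_reads * read_length_nt / genome_nt


coverage = mean_coverage(1_000_000_000, 100, 3_000_000_000)
reads_per_second = 1_000_000_000 / (27 * 60)  # indexing + aligning time combined
print(round(coverage, 1), round(reads_per_second))
```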
2 sockets × 4 cores × 2 hyperthreads = 16 • 48 GB RAM • 10 computers • 1 TB disk/genome = 500 GB + 200 GB + 200 GB + 0.3 GB • × thousands of genomes
100 nt 100 nt Paired End Reads 100 - 1,000 nt Index Align Index Align Match
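The "Match" step can be read as pairing up the independent alignments of the two ends. The sketch below assumes that "100 - 1,000 nt" is the allowed gap between the two 100 nt reads; if the slide means total fragment length instead, the bounds would need adjusting. All names and the example hit positions are hypothetical.

```python
def match_pairs(left_hits, right_hits, read_length=100, min_gap=100, max_gap=1000):
    """Keep (left, right) position pairs whose separation fits the expected gap."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            gap = right - (left + read_length)  # unsequenced bases between the ends
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs


# Made-up candidate positions for each end of one read pair.
print(match_pairs(left_hits=[5_000, 90_000], right_hits=[5_600, 200_000]))
```

A read end that maps ambiguously on its own can often be placed uniquely this way, because only one of its candidate positions has a mate at a plausible distance.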
Solving the Jigsaw without the picture • Indexing • Alignment Assembly
Assembly • ACGTTCGTGAAG • TAGTGAAGAATT • ACGTTCGTGAAG • TAGTGAAGAATT • ACGTT?GTGAAGAATT
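The assembly slide shows two reads being merged into one sequence, with a "?" where they disagree. A minimal sketch of that single merge step follows, assuming a greedy search for the longest suffix/prefix overlap that tolerates a small number of mismatches. The function name and the `min_overlap` and `max_mismatches` parameters are illustrative, and a real assembler does far more than this.

```python
def merge_reads(left, right, min_overlap=4, max_mismatches=1):
    """Merge two reads on their longest acceptable overlap; '?' marks disagreements."""
    for offset in range(len(left) - min_overlap + 1):
        overlap_len = min(len(left) - offset, len(right))
        pairs = list(zip(left[offset:offset + overlap_len], right[:overlap_len]))
        mismatches = sum(a != b for a, b in pairs)
        if mismatches <= max_mismatches:
            merged = "".join(a if a == b else "?" for a, b in pairs)
            # At most one of the two tails below is non-empty.
            return (left[:offset] + merged
                    + left[offset + overlap_len:] + right[overlap_len:])
    return None  # no acceptable overlap found


print(merge_reads("ACGTTCGTGAAG", "TAGTGAAGAATT"))
```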
SNP calling • 15A 4C • 5A 2C • Bayesian statistics (SNPs 1/1,000) • 1A 2C • 15A 13C: AC heterozygous SNP • 31A 42C: throw it out
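To make the Bayesian step concrete, here is a minimal sketch of a diploid genotype posterior computed from allele counts. It is not the talk's actual model. It assumes a flat per-base error rate of 0.01, a prior of 1/1,000 for any non-reference genotype split evenly over the nine alternatives, and that the reference base is A. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def genotype_posteriors(counts, ref, snp_rate=1 / 1000, error=0.01):
    """Posterior probability of each genotype given observed base counts."""
    hom_ref = ref + ref
    log_post = {}
    for genotype in GENOTYPES:
        if genotype == hom_ref:
            prior = 1 - snp_rate
        else:
            prior = snp_rate / (len(GENOTYPES) - 1)  # assumed: even split
        total = math.log(prior)
        for base, n in counts.items():
            # Each read samples one of the two alleles with probability 1/2.
            p = sum(0.5 * ((1 - error) if base == allele else error / 3)
                    for allele in genotype)
            total += n * math.log(p)
        log_post[genotype] = total
    top = max(log_post.values())  # subtract the max before exponentiating
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


def call(counts, ref):
    """Most probable genotype and its posterior."""
    post = genotype_posteriors(counts, ref)
    best = max(post, key=post.get)
    return best, post[best]


for counts in ({"A": 15, "C": 4}, {"A": 1, "C": 2}, {"A": 15, "C": 13}):
    print(counts, call(counts, ref="A"))
```

With few reads the prior dominates and the posterior stays spread out, which is why low-count sites such as 1A 2C need the statistics rather than a simple majority vote.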
Multiple technologies and read lengths • Lane → Mapping → SAM → Calibration → SNP calling → Filtering → Complex regions → VCF (SNPs, MNPs, indels)
SNP calling - Diploid Bayesian • SAM, Calibration, Genome statistics • Error model, Priors • Bayesian Model over A, C, G, T, A:C, A:G, A:T, C:G, C:T, G:T with log posteriors (23.1, 43.2, …) • Counts filter • Ambiguity filter • Insert, adjacent SNPs, inserts: complex region calling • Simple isolated SNP • SNPs, indels, MNPs → VCF
Complex Region Calling Genome Aligned Reads Modified Genome Probabilistic realignment through all paths for each read against each modified genome
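"Realignment through all paths" can be illustrated with a forward-style dynamic program that sums the probability of every gapped alignment between a read and a candidate (modified) genome window. The model below is deliberately simplified and is an assumption, not the talk's model: a single state, a fixed gap probability, and a flat error rate. The sequences are the region around the AAGAAT insert in the pileup slide with the gap characters removed.

```python
def path_sum_likelihood(read, haplotype, error=0.01, gap=0.01):
    """Sum over all global alignments of read vs haplotype (simplified model)."""
    n, m = len(read), len(haplotype)
    step = 1 - 2 * gap  # probability of a diagonal (aligned base) move
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                same = read[i - 1] == haplotype[j - 1]
                emit = (1 - error) if same else error / 3
                total += f[i - 1][j - 1] * step * emit
            if i > 0:
                total += f[i - 1][j] * gap  # read base with no haplotype partner
            if j > 0:
                total += f[i][j - 1] * gap  # haplotype base skipped by the read
            f[i][j] = total
    return f[n][m]


reference_window = "GCTGCAATCTGCA"       # unmodified genome
modified_window = "GCTGCAAGAATAATCTGCA"  # genome with the AAGAAT insert applied
read = "GCTGCAAGAATAATCTGCA"

print(path_sum_likelihood(read, reference_window))
print(path_sum_likelihood(read, modified_window))
```

Comparing these sums across the candidate genomes, read by read, is one way to decide which hypothesis the reads support. For 100 nt reads the products underflow easily, so a practical version would work in log space.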
Comparing twins 3,000,000 SNPs Do any of them differ between the twins? 15A 4C 3A 10C 3G
Gene DNA mRNA protein
Copy Number Variants • Varying levels of extraction of reads across genome (use differences) • Locate boundaries (as accurately as possible) • Extract number of variants • Use in combination with calling SNPs
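A minimal sketch of the depth-based idea in the bullets above: average the read depth in fixed windows, convert each window to a copy-number estimate relative to the expected depth, and report boundaries where the estimate changes. The window size, the rounding rule, the function name and the depth values are all assumptions made for illustration.

```python
def copy_number_segments(depth, expected_depth, window=3, ploidy=2):
    """Return (start, end, copy_number) segments from per-position read depth."""
    segments = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        copies = round(ploidy * mean_depth / expected_depth)
        if segments and segments[-1][2] == copies:
            # Same estimate as the previous window: extend that segment.
            segments[-1] = (segments[-1][0], start + len(chunk), copies)
        else:
            segments.append((start, start + len(chunk), copies))
    return segments


# Made-up depths: a duplicated stretch in the middle of a 30x region.
depth = [30, 31, 29, 30, 28, 32, 45, 44, 46, 46, 45, 44, 30, 29, 31]
print(copy_number_segments(depth, expected_depth=30))
```

Boundaries found this way are only accurate to the window size; refining them is where the "as accurately as possible" bullet comes in.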
Metagenomics or what is living on you • Mapping reads back onto a database of known bacteria/viruses • Many are ambiguous • Many don’t map at all • Estimate frequency of each species • Remove human “contamination”
TS1
0.389 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183 gi|187734516|ref|NC_010655.1| Akkermansia muciniphila ATCC BAA-835
0.145 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.037 gi|119025018|ref|NC_008618.1| Bifidobacterium adolescentis ATCC 15703
TS4
0.428 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.149 gi|60650141|ref|NC_006873.1| Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036 gi|238922432|ref|NC_012781.1| Eubacterium rectale ATCC 33656
TS25
0.752 gi|29611500|ref|NC_004703.1| Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073 gi|150002608|ref|NC_009614.1| Bacteroides vulgatus ATCC 8482
0.041 gi|121999251|ref|NC_008790.1| Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020 gi|58036264|ref|NC_004307.2| Bifidobacterium longum NCC2705
0.018 gi|189438863|ref|NC_010816.1| Bifidobacterium longum DJO10A
Metagenomics • Map reads to database • Estimate most likely frequencies: a hill climbing estimation problem • Can anything be done about unmapped reads?
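The slide does not name the hill-climbing method. Expectation-maximisation is one standard way to climb the likelihood when reads map ambiguously, and it is used here only as a stand-in. Each ambiguous read is shared among its candidate species in proportion to the current frequency estimates, then the frequencies are re-estimated from those shares. The species labels and read hits are toy data, and the function name is hypothetical.

```python
def estimate_frequencies(read_hits, species, iterations=100):
    """EM estimate of species frequencies from per-read lists of candidate species."""
    freq = {s: 1 / len(species) for s in species}
    for _ in range(iterations):
        expected = {s: 0.0 for s in species}
        for hits in read_hits:
            weight = sum(freq[s] for s in hits)
            if weight == 0:
                continue  # unmapped read, or only zero-frequency candidates
            for s in hits:
                expected[s] += freq[s] / weight  # fractional read assignment
        total = sum(expected.values())
        freq = {s: expected[s] / total for s in species}
    return freq


# Toy data: 3 reads unique to species_a, 1 unique to species_b, 2 ambiguous.
read_hits = [["species_a"]] * 3 + [["species_b"]] + [["species_a", "species_b"]] * 2
freq = estimate_frequencies(read_hits, ["species_a", "species_b"])
print(freq)
```

In this toy case the ambiguous reads end up split 3:1, matching the ratio set by the uniquely mapped reads. Unmapped reads are skipped, so the question on the slide about what to do with them remains open in this sketch.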
How do we get there? • Software engineering (500,000 lines of code) • Algorithms • Bayesian statistics • Testing: calibration/simulation/analysis
How do we get there? • Performance optimization: algorithms; disk I/O and compression; parallel execution; optimization for memory size; optimization for cache size; targeted code optimization