VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch

VL Algorithmische BioInformatik (19710) WS2015/2016Woche 7 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Including slides by Serafim Batzoglou

Vorlesungsthemen Part 5: Secondary Structures (4) 11. Obtaining Secondary Structure from Sequence 12. Predicting Secondary Structures Part 6: Tertiary Structures (4) 13. Modeling Protein Structure 14. Analyzing Structure-Function Relationships Part 7: Cells and Organisms (8) 15. Proteome and Gene Expression Analysis 16. Clustering Methods and Statistics 17. Systems Biology Part 1: Background Basics (4) 1. The Nucleic Acid World 2. Protein Structure 3. Dealing with Databases Part 2: Sequence Alignments (3) 4. Producing and Analyzing Sequence Alignments 5. Pairwise Sequence Alignment and Database Searching 6. Patterns, Profiles, and Multiple Alignments Part 3: Evolutionary Processes (3) 7. Recovering Evolutionary History 8. Building Phylogenetic Trees Part 4: Genome Characteristics (4) 9. Revealing Genome Features 10. Gene Detection and Genome Annotation 2 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Eugene Myers • Was Professor at University of California, Berkeley • Now at MPI Dresden • His BLAST paper (1990) was cited more that 41.000 times (among the most highly cited papers ever) • Inventor of suffix array data structure “If you take a sequence and just run a gene prediction program on it, the programs don’t usually do very well. But if you take human and mouse sequence, and compare them against each other — looking for similar regions — you get better predictions. And the more genomes we have, the better it will get.” 3 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Two (main) Paradigms of Biology Molecular Paradigm Evolution Paradigm 4 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Comparison of 1196 orthologous genes • Sequence identity between genes in human/mouse • exons: 84.6% • protein: 85.4% • introns: 35% • 5’ UTRs: 67% • 3’ UTRs: 69% • 27 proteins were 100% identical. Makalowski et al., 1996 5 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Full Genomes are available 6 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Full Genomes are available http://genomesonline.org/ 7 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Sequencing Growth Cost of one human genome • 2004: $30,000,000 • 2008: $100,000 • 2010: $10,000 • 2013: $2,000 (today) 8 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Sequencing Applications 100 million species Each individualhas different DNA http://www.1000genomes.org/ Within individual, some cells have different DNA(e.g. cancer) http://cancergenome.nih.gov/ 9 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Typical Bioinformatics Flow Chart 1a. Sequencing 6. Gene & Protein expression data 1b. Analysis of nucleic acid seq. 7. Drug screening 2. Analysis of protein seq. (e.g. gene finding) 3. Molecular structure prediction Ab initio drug design OR Drug compound screening in database of molecules 4. Molecular interaction 8. Genetic variability 5. Metabolic and regulatory networks 10 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 So, if a new species/individual is available there are two tasks: • Generategenomicsdata • SequencingandAssembly • Comparetoavailabledata (genomes) • Improvegenefinder • Evolution of organisms? • Why do certain bacteria cause diseases while others do not? • Identification and prioritization of drug targets 11

Genome Assembly 12 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Steps to Assemble a Genome Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig ..ACGATTACAATAGGTT.. 13 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Shotgun Sequencing genomic segment cut many times at random (Shotgun) Get one or two reads from each segment 14 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Reconstructing the Sequence (Fragment Assembly) reads Cover region with high redundancy Overlap & extend reads to reconstruct the original genomic region 15 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Definition of Coverage Length of genomic segment: G Number of reads: N Length of each read: L Definition:Coverage C = N * L / G How much coverage is enough? Lander-Waterman model: Prob[ not covered bp ] = e-C Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides C 16 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Repeats Bacterial genomes: 5% Mammals: 50% Repeat types: • Low-Complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats (a1…ak)N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons • SINE(Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 106 copies • LINE(Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies • LTRretroposons(Long Terminal Repeats (~700 bp) at each end) cousins of HIV • Gene Families genes duplicate & then diverge (paralogs) • Recent duplications ~100,000-long, very similar copies 17 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT Sequencing and Fragment Assembly 3x109 nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions 18 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads 19 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT A R B D R C Sequencing and Fragment Assembly 3x109 nucleotides ARB, CRD or ARD, CRB ? 22 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT Sequencing and Fragment Assembly 3x109 nucleotides 23 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Strategies for whole-genome sequencing • Hierarchical – Clone-by-clone • Break genome into many long pieces • Map each long piece onto the genome • Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat • Online version of (1) – Walking • Break genome into many long pieces • Start sequencing each piece with shotgun • Construct map as you go Example: Rice genome • Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog 24 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

cut many times at random Whole Genome Shotgun Sequencing genome forward-reverse paired reads known dist 25 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Fragment Assembly(in whole-genome shotgun sequencing) 26 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm 27 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Steps to Assemble a Genome 1. Find overlapping reads 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence ..ACGATTACAATAGGTT.. 28 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 K-mer based overlap

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016 Sorting k-mers

1. Find Overlapping Reads (read, pos., word, orient.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca (word, read, orient., pos.) aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca aaactgcagtacggatct aaactgcag aactgcagt … gtacggatct tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … actgcagta ctgcagtac gtacggatctactacaca gtacggatc tacggatct … ctactacac tactacaca 35 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

T GA TACA | || || TAGA TAGT 1. Find Overlapping Reads • Find pairs of reads sharing a k-mer, k ~ 24 • Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC • Caveat: repeats • A k-mer that occurs N times, causes O(N2) read/read comparisons • ALU k-mers could cause up to 1,000,0002 comparisons • Solution: • Discard all k-mers that occur “too often” • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available 36 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA 37 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

1. Find Overlapping Reads • Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A correlated errors— probably caused by repeats  disentangle overlaps replace T with C TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA In practice, error correction removes up to 98% of the errors TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA 38 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

2. Merge Reads into Contigs • Overlap graph: • Nodes: reads r1…..rn • Edges: overlaps (ri, rj, shift, orientation, score) Reads that come from two regions of the genome (blue and red) that contain the same repeat Note: of course, we don’t know the “color” of these nodes 39 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

repeat region 2. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries Unique Contig Overcollapsed Contig 40 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

2. Merge Reads into Contigs • Remove transitively inferable overlaps • If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2) r r1 r2 r3 41 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

2. Merge Reads into Contigs 42 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Overlap graph after forming contigs Unitigs: Gene Myers, 95 43 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Repeats, errors, and contig lengths • Repeats shorter than read length are easily resolved • Read that spans across a repeat disambiguates order of flanking regions • Repeats with more base pair diffs than sequencing error rate are OK • We throw overlaps between two reads in different copies of the repeat • To make the genome appear less repetitive, try to: • Increase read length • Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate  decreases effective repeat content  increases contig length 44 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

3. Link Contigs into Supercontigs Normal density Too dense  Overcollapsed Inconsistent links  Overcollapsed? 45 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

3. Link Contigs into Supercontigs Find all links between unique contigs Connect contigs incrementally, if  2 forward-reverse links supercontig (aka scaffold) 46 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

3. Link Contigs into Supercontigs • Fill gaps in supercontigs with paths of repeat contigs • Complex algorithmic step • Exponential number of paths • Forward-reverse links 47 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

De Brujin Graph formulation • Given sequence x1…xN, k-mer length k, Graph of 4k vertices, Edges between words with (k-1)-long overlap 48 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter) 49 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

Genome Comparison 50 Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch

VL Algorithmische BioInformatik (19710) WS2015/2016 Woche 7 - Mittwoch

Presentation Transcript

9. Algorithmische Skelette