Celera Assembler

Arthur L. Delcher
Senior Research Scientist, CBCB, University of Maryland

Whole Genome Shotgun Sequencing
Slides by Art Delcher, Mike Schatz, and Adam Phillippy
Center for Bioinformatics and Computational Biology, Univ. of Maryland

Shotgun DNA Sequencing (Technology)
• DNA target sample
• SHEAR
• SIZE SELECT, e.g., 10Kbp ± 8% std.dev.
• LIGATE & CLONE into vector
• SEQUENCE from primers: end reads (mates), 550bp

Whole Genome Shotgun Sequencing
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
• Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, short (2Kbp) and long (10Kbp): ~35 million reads for human.
• Collect another 20X in clone coverage of 50Kbp end-sequence pairs (BAC 5’ and 3’ ends): ~1.2 million pairs for human.
• Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

Clone-by-Clone Genome Sequencing
• Steps: Target, Physical Mapping, Minimum Tiling Set (~33,000 BACs for human), Shotgun Assembly
– 2 separate processes
– clone libraries unstable, maps hard to complete
– sequencing libraries must be made for every clone
+ assembly problem ‘easy’ and well understood

Celera’s Sequencing Factory (circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents

Human Data (April 2000)
• Collected 27.27 million reads = 5.11X coverage
• 21.04 million reads are paired (77%) = 10.52 million pairs
• 2Kbp library: 5.045 M pairs, 98.6% true, <6% std.dev.
• 10Kbp library: 4.401 M pairs, 98.6% true, <8% std.dev.
• 50Kbp library: 1.071 M pairs, 90.0% true, <15% std.dev.
• Validated against finished Chrom. 21 sequence
• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
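The coverage figures on this slide follow from simple arithmetic, sketched below. The function names and the 2.9 Gbp genome size are assumptions for illustration; the slide does not state the genome size or mean read length it used, so the results only approximate its numbers.

```python
def sequence_coverage(num_reads, mean_read_len_bp, genome_size_bp):
    """Average number of times each base is covered by read sequence."""
    return num_reads * mean_read_len_bp / genome_size_bp


def clone_coverage(libraries, genome_size_bp):
    """Average number of clones (inserts) spanning each base.

    libraries: iterable of (num_pairs, mean_insert_bp) tuples.
    """
    return sum(pairs * insert for pairs, insert in libraries) / genome_size_bp


# Pair counts and insert sizes are taken from the slide.
LIBRARIES = [(5.045e6, 2_000), (4.401e6, 10_000), (1.071e6, 50_000)]

# ASSUMPTION: genome size is not given on the slide.
ASSUMED_GENOME_BP = 2.9e9

total_pairs = sum(pairs for pairs, _ in LIBRARIES)
clone_cov = clone_coverage(LIBRARIES, ASSUMED_GENOME_BP)
```

Under the assumed genome size, the clone coverage comes out at roughly 37X, close to the 38.7X quoted; the difference depends on the genome size chosen.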

Pairs Give Order & Orientation
• Contig: assembly without pairs results in contigs (consensus of 15-30Kbp built from reads) whose order and orientation are not known.
• Scaffold: pairs, especially groups of corroborating ones (2-pair), link the contigs into scaffolds where the size of gaps is well characterized (mean & std.dev. is known).
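One way the gap mean and standard deviation could be derived from corroborating pairs is sketched below. This is an illustration under stated assumptions, not the assembler's actual procedure: each mate pair gives one gap estimate (library insert mean minus the parts of the insert lying inside the two contigs), and independent estimates are combined by inverse-variance weighting. All function and parameter names are hypothetical.

```python
import math


def gap_from_mate(insert_mean, insert_std, a_to_end, b_to_end):
    """Gap estimate from one mate pair linking contig A to contig B.

    a_to_end: bp from the outer end of the read in A to A's gap-facing end.
    b_to_end: bp from the outer end of the read in B to B's gap-facing end.
    Returns (gap_mean, gap_std); the std is inherited from the library.
    """
    return insert_mean - a_to_end - b_to_end, insert_std


def combine_gap_estimates(estimates):
    """Inverse-variance weighted mean of independent (gap, std) estimates.

    ASSUMPTION: estimates are independent and roughly normal.
    """
    weights = [1.0 / (std * std) for _, std in estimates]
    total = sum(weights)
    mean = sum(w * gap for w, (gap, _) in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)
```

With two corroborating 10Kbp pairs (std.dev. 800bp) giving gaps of 3000bp and 3200bp, the combined estimate is 3100bp with a smaller standard deviation than either pair alone, which is why groups of pairs characterize gaps better than single links.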

Anatomy of a WGS Assembly
[Figure labels: chromosome with STS markers; STS-mapped scaffolds; contigs; gaps (mean & std. dev. known); read pairs (mates); consensus; reads (of several haplotypes); SNPs; external “reads”]

Whole Genome Shotgun Assembly
• WGS Sequencing
• WGS Assembly
• Performance

Assembler Design Philosophy
• Detect repeats so as to avoid being misled by them; leave them for last.
• Make first-order use of mate pairs: first to circumnavigate and later to fill in repeats.
• Make all the sure moves first: tiered phases that get progressively more aggressive.
• Output a complete audit trail of the evidence for the assembly.

Assembly Pipeline (circa 2006)
Trim & Screen
• Reads (typically 800bp) are quality-trimmed so that the average error rate is 0.5%, with 1 in 1000 reads having more than 2% error. Average trim length is 500-900bp, depending on the genome (590bp for human in year 2000).
• Contaminant and vector sequence is removed.
• Repeat screening keeps run time and overlap graph size reasonable, e.g. 10^6 overlaps per Alu read must be avoided.
• Now we dynamically limit repetitive overlaps in the overlap phase.
• The gatekeeper program vets inputs and assigns IDs.
• Reads are stored in a compressed, random-access binary store.
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II

Assembly Pipeline
• Find all overlaps ≥ 40bp allowing 6% mismatch.
• An overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap.
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
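The overlap criterion (at least 40bp, up to 6% mismatch) can be illustrated with a brute-force sketch. This is not the Celera overlapper: it assumes same-strand reads, counts substitutions only (no insertions or deletions), and tries every overlap length, which a production overlapper would avoid by seeding. The function name is hypothetical.

```python
def dovetail_overlap(a, b, min_len=40, max_mismatch_rate=0.06):
    """Length of the longest suffix-of-a / prefix-of-b overlap, or 0.

    ASSUMPTIONS: both reads are on the same strand and differ only by
    substitutions, so overlapping bases can be compared position by position.
    """
    # Try the longest candidate first so the first hit is the longest overlap.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * length:
            return length
    return 0
```

Note that the function cannot tell a true overlap from a repeat-induced one; both satisfy the same criterion, which is why the later pipeline stages are needed.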

Assembly Pipeline
• Compute all “overlap consistent” sub-assemblies: Unitigs (Uniquely Assembled Contig)
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II

Overlap Graph
• Edge types: regular dovetail, prefix dovetail, suffix dovetail
• Edges are annotated with deltas of overlaps
[Figure: example read pairs A, B for each edge type]

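The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps: if A overlaps B, B overlaps C, and A overlaps C, the A-to-C overlap is implied by the other two and is removed.
[Figure: reads A, B, C before and after removing the A-to-C edge]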
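Step 1 can be sketched on a toy directed graph. This is a simplification for illustration: real overlap graphs also track read orientation and overhang lengths, and the published reduction checks that the lengths are consistent before dropping an edge. Here an edge A to C is dropped whenever some B has edges A to B and B to C. The function name is hypothetical.

```python
def remove_transitive_edges(graph):
    """Drop every edge a->c for which some b has a->b and b->c.

    graph: dict mapping each read to the set of reads it overlaps into.
    ASSUMPTION: orientation and overlap lengths are ignored.
    Decisions are made on the original graph; removals go to a copy.
    """
    reduced = {node: set(succs) for node, succs in graph.items()}
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, ()):
                if c in succs and c != a and c != b:
                    reduced[a].discard(c)
    return reduced
```

For the three-read case on the slide, `{"A": {"B", "C"}, "B": {"C"}, "C": set()}` reduces to a simple chain A, B, C.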

The Unitig Reduction
2. Collapse “Unique Connector” Overlaps:
[Figure: reads A and B collapsed across their unique connecting edge; edge labels 412, 352, 45]

Unitigs: Definition
• Unitig = Uniquely Assemble-able Contig
• A chordal subgraph with no conflicting edges.
[Figure: overlap graph with two conflicting edges marked]

Unitig Theorem (Myers, JCB ‘95)
(1) Remove contained fragments
(2) Remove transitively inferred edges
(3) Collapse into unitigs
(*) Restore transitively inferred (t.i.) edges between unitig ends.
THM: Shortest Common Superstring of unitigs = Shortest Common Superstring of reads
Caveat: SCS is not the right objective for assembly.

Revised Unitigger Algorithm
• Preceding algorithm is computationally expensive.
• Current unitigger finds the “best” overlap on each end of each read: its “best buddy”.
• Unitigs are chains of mutually unique best buddies: adjacent reads are best buddies of each other and of no other read.
• This takes time and space linear in the number of reads.
• In rare cases results are different from graph reduction.
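The best-buddy chaining described above can be sketched as follows. This is a toy model, not the unitigger's code: it assumes all reads are on one strand and that the best overlap on each end has already been chosen and is supplied in two dictionaries (hypothetical names `best_next` for the 3' end and `best_prev` for the 5' end).

```python
from collections import Counter


def build_unitigs(reads, best_next, best_prev):
    """Chain reads whose adjacent members are mutually unique best buddies.

    best_next[r] / best_prev[r]: the read with the best overlap on r's
    3' / 5' end, or None. ASSUMPTION: single strand, overlaps precomputed.
    """
    # How many reads name a given read as their best buddy on each end.
    next_votes = Counter(v for v in best_next.values() if v is not None)
    prev_votes = Counter(v for v in best_prev.values() if v is not None)

    def linked(a, b):
        # a and b pick each other, and no other read picks either of them.
        return (best_next.get(a) == b and best_prev.get(b) == a
                and next_votes[b] == 1 and prev_votes[a] == 1)

    def extend(start, seen):
        chain = [start]
        seen.add(start)
        while True:
            nxt = best_next.get(chain[-1])
            if nxt is None or nxt in seen or not linked(chain[-1], nxt):
                return chain
            chain.append(nxt)
            seen.add(nxt)

    unitigs, seen = [], set()
    # First pass: start only at reads with no linked predecessor.
    for r in reads:
        prev = best_prev.get(r)
        if r not in seen and not (prev is not None and linked(prev, r)):
            unitigs.append(extend(r, seen))
    # Second pass: anything left lies on a cycle of linked reads.
    for r in reads:
        if r not in seen:
            unitigs.append(extend(r, seen))
    return unitigs
```

Each read is visited a constant number of times, which matches the linear time and space noted on the slide. A read R that two different reads both choose as their best 3' buddy breaks the chain, so R starts a new unitig.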

Branch Point Extension
• A branch point is a repeat boundary reflected on an underlying sequence read.
• Compare the peers of a read to detect branch pts.
• Consider the graph without repeat-full edges and recompute unitigs.
• Makes sure you get a read-length into each repeat-induced gap (most Alu-sized elements are resolved).
[Figure: read A on the genome with its peers B, C, D]

Bubble Smoothing
[Figure: bubble in the unitig graph; edge labels 412, 352, 486, 245]

Identifying Unique DNA Stretches
• Arrival rate statistic (A-stat) is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
• A-stat above +10: definitely unique
• A-stat below -10: definitely repetitive
• In between: don’t know
[Figure: read arrival intervals on a repetitive DNA unitig versus a unique DNA unitig; distributions of A-stat for unique and for repetitive unitigs]
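A minimal sketch of such a statistic, assuming read starts arrive as a Poisson process. With λ the expected number of arrivals in a unitig if it is unique (unitig length times total reads divided by genome size) and k the observed arrivals, the log-odds of unique versus 2-copy is ln[P(k | λ) / P(k | 2λ)] = λ − k·ln 2. The use of natural logarithms, the function names, and the way the ±10 cutoffs are applied are assumptions here, not details given on the slide.

```python
import math


def a_stat(rho_bp, num_arrivals, total_reads, genome_size_bp):
    """Log-odds (natural log) that a unitig is unique rather than 2-copy.

    rho_bp: length of the unitig interval in which reads arrive.
    num_arrivals: reads observed in that interval.
    ASSUMPTION: Poisson read arrivals at rate total_reads / genome_size_bp.
    """
    expected_if_unique = rho_bp * total_reads / genome_size_bp
    return expected_if_unique - num_arrivals * math.log(2)


def classify(a, threshold=10.0):
    """Map an A-stat to the three classes shown on the slide."""
    if a > threshold:
        return "unique"
    if a < -threshold:
        return "repetitive"
    return "unknown"
```

For example, with one read start per 100bp on average, a 10Kbp unitig holding the expected 100 reads scores well above +10, while the same unitig holding 200 reads scores well below -10.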

Assembly Pipeline
• Fill repeat gaps with doubly anchored positive unitigs (Unitig>0)
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II

Assembly Pipeline
• Stones: fill repeat gaps with assembled, singly anchored reads
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II

Surrogates
• Stones containing more than 1 read are added to contigs as consensus sequence only, without underlying reads. These are called “surrogates”.
• This allows repeat unitigs to be put in multiple positions in the assembly, but leaves regions without underlying read coverage.
• We later attempt to resolve surrogates by assigning reads from the original repeat unitig to the separate surrogate copies, based on mate pairs.
