Scaffolding Large Genomes Using Integer Linear Programming
This presentation describes an approach to scaffolding large genomes using Integer Linear Programming (ILP). In de novo genome assembly, scaffolding determines contig orientations, ordering, and relative distances from paired-read information. The problem is represented as an undirected multi-graph, and an ILP maximizes the number (or weight) of consistent read pairs with the goal of recreating the true scaffolds. Graph decomposition and non-serial dynamic programming are used to handle large assemblies, and proposed future directions include structural variation and scaffolding before assembly.
Presentation Transcript
Scaffolding Large Genomes Using Integer Linear Programming • James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu* • *University of Connecticut; Georgia State University
De-novo Assembly Paradigm • The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! (Figure labels: no scaffold: gene XYZ; scaffold: 5’ UTR, gene XYZ, 3’ UTR)
Why Scaffolding? • Biologist: There are holes in my genes! • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! (Figure labels: 5’ UTR, gene XYZ, 3’ UTR, before and after Sanger sequencing)
Read Pairs • Paired read construction (figure labels: 2kb, R1, R2, same strand and orientation) • Informative reads • Align each read against the contigs • Only accept uniquely mapped reads • Use the non-unique reads later • Both reads in a pair must map to different contigs
Linkage Information • Two contigs are adjacent if: • A read pair spans the contigs • State (A, B, C, D) • Depends on the orientation of the reads • Order of contigs is arbitrary • Each read pair can be “consistent” with one of the four states (Figure: possible states A, B, C, D for reads R1 and R2 on contig i and contig j, 5’ to 3’)
The Scaffolding Problem • Given • Contigs • Paired reads • Find • Orientation • Ordering • Relative distance • Goal • Recreate true scaffolds • Possible Objectives • Unweighted • Max number of consistent read pairs • Weighted • Each state is weighted: • Overlap with repeat • Deviation from expected distance • …
Graph Representation • Using the input we can define a scaffolding graph G = (V, E) • V, set of contigs • E, set of read pairs linking two contigs • This is an undirected multi-graph • Assume it is connected
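As a minimal sketch of the graph on this slide, assume each uniquely mapped read pair has already been reduced to a hypothetical tuple `(contig_a, contig_b, state)`, with the state recorded relative to the sorted order of the two contig names. The tuple layout and the function name are illustrative and do not come from the talk. The multi-graph is then a map from a contig pair to the list of read-pair states seen between them, one list entry per parallel edge.

```python
from collections import defaultdict

def build_scaffold_graph(read_pairs):
    """Group read pairs into an undirected multi-graph.

    read_pairs: iterable of (contig_a, contig_b, state) tuples, where state is
    one of "A", "B", "C", "D" (a hypothetical encoding of the four linkage
    states, assumed to be expressed relative to the sorted contig pair).
    Returns {(contig_lo, contig_hi): [state, ...]}.
    """
    graph = defaultdict(list)
    for a, b, state in read_pairs:
        if a == b:
            continue  # both reads in a pair must map to different contigs
        key = tuple(sorted((a, b)))  # undirected: (c1, c2) and (c2, c1) share a key
        graph[key].append(state)
    return dict(graph)

pairs = [("c1", "c2", "A"), ("c2", "c1", "A"), ("c2", "c3", "B"), ("c3", "c3", "A")]
graph = build_scaffold_graph(pairs)
```

Counting the entries per state in each list gives the unweighted support for that state between the two contigs.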
Integer Linear Program Formulation • Variables • Contig orientation • Pairwise contig consistency • Contig pair state • Objective • Maximize weight of consistent pairs
Constraints • Pairwise orientation • Mutual exclusivity • Forbid 2- and 3-cycles explicitly
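The slide's own symbols are not preserved in this transcript, so the following is only one plausible way to write such a formulation, not the authors' exact model. All names are assumptions: $S_i$ is the orientation of contig $i$, $S_{ij}$ indicates that contigs $i$ and $j$ have different orientations, $A_{ij}, B_{ij}, C_{ij}, D_{ij}$ select the state of the pair, and $w^{X}_{ij}$ is the number (or total weight) of read pairs between $i$ and $j$ consistent with state $X$. Which states pair with equal orientations (here A and D) is likewise an assumption, and the cycle-forbidding constraints are omitted.

```latex
\begin{align}
\max \quad & \sum_{(i,j)\ \text{linked}} \left( w^{A}_{ij} A_{ij} + w^{B}_{ij} B_{ij} + w^{C}_{ij} C_{ij} + w^{D}_{ij} D_{ij} \right) \\
\text{s.t.} \quad & S_{ij} \le S_i + S_j, \qquad S_{ij} \le 2 - S_i - S_j, \\
& S_{ij} \ge S_i - S_j, \qquad S_{ij} \ge S_j - S_i, \\
& A_{ij} + D_{ij} \le 1 - S_{ij}, \qquad B_{ij} + C_{ij} \le S_{ij}, \\
& S_i,\ S_{ij},\ A_{ij},\ B_{ij},\ C_{ij},\ D_{ij} \in \{0, 1\}.
\end{align}
```

The four inequalities on $S_{ij}$ are the standard linearization of $S_{ij} = S_i \oplus S_j$. The two inequalities on the state variables tie each state to the relative orientation and, taken together, allow at most one state per contig pair.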
Graph Decomposition: Articulation Points (Figure labels: articulation point, solve)
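To make the decomposition step concrete, here is a minimal sketch of finding articulation points with the classic depth-first low-link method. It is a textbook routine, not code from the talk, and the adjacency-set input format is an assumption. Parallel edges of the multi-graph can be collapsed beforehand because they do not change which nodes are cut vertices.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    adj: {node: set(neighbors)} for a simple undirected graph.
    """
    disc, low, points = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot reach above u, so removing u disconnects it
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # the DFS root is a cut vertex only if it has two or more DFS children
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points

# Two triangles (A-B-C and D-E-F) joined by the single edge C-D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}
cut_vertices = articulation_points(adj)
```

The recursion is fine for a small example; a graph with many thousands of contigs would likely need an iterative DFS to stay within Python's recursion limit.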
Graph Decomposition: 2-cuts (Figure: a 2-cut, with orientation labels + +, - -, + -, - +)
Non-Serial Dynamic Programming • SPQR-tree to schedule decomposition • Traverse tree using DFS • NSDP utilizes solutions of previous stage in current stage
Post Processing • ILP solution • May have cycles • Not a total ordering for each connected component • Bipartite matching • Objectives: • Max weight • Max cardinality • Max cardinality / Max weight (Figure: ILP solution over contigs A to F drawn as a bipartite graph of outgoing and incoming copies, giving the ordering B D F A E C)
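The matching step can be sketched as follows, assuming the ILP solution has been reduced to a hypothetical map from each contig to its candidate successors. Each contig gets an outgoing copy on the left side and an incoming copy on the right, and a maximum-cardinality matching leaves every contig with at most one successor and at most one predecessor. This uses the standard augmenting-path (Kuhn) algorithm and covers only the max-cardinality objective; the max-weight variants on the slide would need a weighted method instead.

```python
def max_cardinality_matching(successors):
    """Augmenting-path bipartite matching between outgoing and incoming copies.

    successors: {contig: [candidate next contigs]} (hypothetical input format).
    Returns {contig: chosen_next_contig}.
    """
    match_of_incoming = {}  # incoming copy -> outgoing copy currently matched to it

    def try_augment(u, visited):
        for v in successors.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # take v if it is free, or if its current partner can be re-routed
            if v not in match_of_incoming or try_augment(match_of_incoming[v], visited):
                match_of_incoming[v] = u
                return True
        return False

    for u in successors:
        try_augment(u, set())
    return {u: v for v, u in match_of_incoming.items()}

# A first grabs C, then B forces A to be re-routed to B.
ilp_edges = {"A": ["C", "B"], "B": ["C"], "C": ["D"]}
chosen = max_cardinality_matching(ilp_edges)
```

In this example the result is the single chain A, B, C, D. A matching by itself does not rule out cycles, so any remaining cycle would still have to be broken separately.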
Testing Framework Venter Genome • 4x Assembly
Testing Metrics • Computer Scientists • Finding Scaffold = Binary Classification Test • n contigs, try to predict n-1 adjacencies • TP, FP, TN, FN, Sensitivity, PPV • Biologists (main focus) • N50 (length-weighted median scaffold size: scaffolds at least this long cover half of the total length; gaps are ignored) • TP50 • Break scaffolds at incorrect edges, then find N50
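The metrics above can be sketched in a few lines. The input formats are assumptions made for illustration: a scaffold is an ordered list of `(contig_id, length)` tuples, and an adjacency is a `frozenset` of two contig ids. N50 is computed here as the largest length such that scaffolds at least that long cover half of the total, with gaps ignored because only contig lengths are summed.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def tp50(scaffolds, true_adjacencies):
    """Break each scaffold at adjacencies absent from the reference, then take N50."""
    pieces = []
    for scaffold in scaffolds:
        current, previous = 0, None
        for contig, length in scaffold:
            if previous is not None and frozenset((previous, contig)) not in true_adjacencies:
                pieces.append(current)  # incorrect edge: close the current piece
                current = 0
            current += length
            previous = contig
        if scaffold:
            pieces.append(current)
    return n50(pieces)

def sensitivity_ppv(predicted, truth):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP), over adjacency sets."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv

scaffolds = [[("c1", 100), ("c2", 200), ("c3", 300)], [("c4", 400)]]
truth = {frozenset(("c1", "c2")), frozenset(("c3", "c4"))}
predicted = {frozenset(("c1", "c2")), frozenset(("c2", "c3"))}

scaffold_n50 = n50([sum(length for _, length in s) for s in scaffolds])
scaffold_tp50 = tp50(scaffolds, truth)
sens, ppv = sensitivity_ppv(predicted, truth)
```

In this toy case the N50 is 600, but breaking the first scaffold at the incorrect c2-c3 edge drops the TP50 to 300, which is the penalty TP50 is meant to expose.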
Conclusions • Success • ILP solves scaffolding problem! • NSDP works. • Improvements • Finalize large test cases (then publish?!) • Practical considerations (read style, multi-libraries, merge ctgs) • Future Work • Where else can I apply NSDP? • Scaffold before assembly?? • Structural Variation??