310 likes | 442 Vues
This paper presents an innovative approach to genome scaffolding using Integer Linear Programming (ILP) based on maximum likelihood estimation. Written by researchers from the University of Connecticut and Georgia State University, it details the importance of scaffolding in de-novo assembly, addressing gaps in sequenced genes and structural variations. The method enhances the assembly of contigs through effective linkage information from paired reads, leading to accurate contig orientation and ordering. The approach demonstrates significant utility in large-scale genomic projects, including metagenomics.
E N D
ILP-based maximum likelihood genome scaffolding James Lindsay Ion Mandoiu University of Connecticut HamedSalooti Alex Zelikovsky Georgia State University
De-novo Assembly Genome Reads Sequencing Assembly Scaffolds Scaffolding Contigs
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! No scaffold gene XYZ Scaffold 5’ UTR gene XYZ 3’ UTR
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! Biologist: There are holes in my genes! 5’ UTR gene XYZ 3’ UTR Sanger Sequencing 5’ UTR gene XYZ 3’ UTR
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap Filling • Structural variation!
Read Pairs Informative Reads Paired Read Construction • Align each read against the contigs • Uniquely mapped reads • Save repetitive • Both reads in a pair must map to different contigs 2kb 2kb same strand and orientation R2 R1
Linkage Information Possible States • Two contigs are adjacent if: • A read pair spans the contigs • State (A, B, C, D) • Depends on orientation of the read • Order of contigs is arbitrary • Each read pair can be “consistent” with one of the four states 5’ 3’ R2 R1 A B C D contigi contig j
Scaffolding Problem Input • Contigs • Linkage information from paired reads Output • Contig orientation • Contig ordering • Relative distance between contigs Objective: Find the longest and most accurate scaffolds
Existing Tools Bambus2 GRASS OPERA MIP SCAPRA SOPRA SGA SOAPDENOVO SSPACE
Scaffolding Graph read pairs Scaffolding graph: G = (V, E, w) V = set of all contigs E = set of pairs of contigs connected with mapped read pairs of a particular state (A,B,C, or D) Edge weight = probability of read pairs being correctly aligned: • Amount of repeat overlap • Contig coverage dissimilarity contigs
Structure of Scaffold Graph • Paired reads have upper bound • Contigs have minimum size • Upper bound of # contigs spanned Scaffold graph has bounded width (Opera) 2kb 2kb
Elimination Order: SPQR-tree Bi-connected Elimination order Tri-connected
Maximum Likelihood Contig Orientation • Given read pair r, let pr be probability of r being correctly aligned • For given orientation O, let RO be subset of all reads R agreeing with O • Probability of O being correct is estimated as • Log likelihood • Equivalent to maximize
Integer Linear Program Formulation • Variables: • Binary Si= 0 if i-thcontig keeps default orientation, =1 if it is flipped • Binary Sij= 0 if contigs in (i,j)-edge are both flipped or both not, = 1 otherwise • Binary Aij=1 if the edge (i,j) in state A, = 0 otherwise (similarlyBij,Cij, Dij) • Weight of edge • Objective
i i i i ILP Constraints j j j j i i j j k k k k • Connecting Si and Sij • Connecting Sijand Aij • Forbidding 2-cycles • Forbidding 3-cycles
? Non-Serial Dynamic Programming ? 2-Component A ? ? ILP_A ? ? • 1-cuts Collapsing 1-Cut Splitting 1-Cut ? 1-Cut ? ? ? ? ? ILP_B ? ? 2-Component B ?
Non-Serial Dynamic Programming • 2-cuts 3-Component A ? ILP_A + ILP_A ILP_A ? ? Collapsing 2-cut Splitting 2-Cut ? ? 2-Cut ? ? ILP_B ? ? ? 3-Component B
A A B A B B Total Ordering via Bipartite Matching A C F A B F B C C D E C D E C D D ILP Bipartite Bipartite Total Output graph matching order E D E E F F F
Gap Estimation via QP CONTIG i CONTIG i+1 CONTIG i+2 CONTIG i+3 • Maximum likelihood gap estimation following Opera and previous scaffolders
Results • Verification on simulation • Simulate contigs from real draft assemblies (GAGE) & real reads • Control error rate • Actual orientation, order and position is known • Evaluation on real assembly • Draft Human genome & real reads • Alignment based evaluation is challenging
GAGE Simulation • Step: • Staph, Rhodo and Chr14 assemblies • Mix all contig sizes, sample uniformly at random to build simulated contigs (or gaps) • Align reads against simulated contigs • Control error rates • Repeat 10x • Metrics: • Binary classification (n-1) real edges • Corrected N50 (break edges at bad points) • Runtime • Scaffold size distribution
Conclusions • Powerful and flexible ILP for genome scale problems • NSDP is a viable solution for big problems • ILP will probably work well in Metagenomicsdomain • Validation through N50, gene content, • Future work: • Validate through recent framework from Genome Biology 03/ 2014 • http://genomebiology.com/2014/15/3/R42 • http://genomebiology.com/2014/15/3/R42/table/T1
Scaffolding NSDP Algorithm Nonserial Dynamic Programming (NSDP) • Compute ILP solution in stages such that each uses results from the previous stage. • Use SPQR-tree to determine variable elimination order stack = [root of SPQR-tree] visited = {empty}while stack is not empty do p = stack.pop() foreach child q of p do if p not in visited then stack.push(p) endif endfor if p is root then solfinal = ILP(p) else (s,t) = getcut(p, parent(p)) sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1) sol01 = ILP(p, s=0, t=1), sol00 = ILP(p, s=1, t=1) endif endwhile