
Scaffolding Large Genomes Using Integer Linear Programming

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*. ACM-BCB 2012. *University of Connecticut; Georgia State University.



Presentation Transcript


  1. Scaffolding Large Genomes Using Integer Linear Programming
     James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
     ACM-BCB 2012
     *University of Connecticut; Georgia State University

  2. De-novo Assembly Paradigm
     Shotgun sequencing of the genome yields short reads; de novo assembly turns the reads into short contigs; scaffolding joins the contigs into scaffolds.

  3. Why Scaffolding?
     Figure labels: No scaffold (gene XYZ); Scaffold (5’ UTR, gene XYZ, 3’ UTR).
     • Annotation
     • Comparative biology
     • Re-sequencing and gap filling
     • Structural variation!

  4. Why Scaffolding?
     Biologist: There are holes in my genes!
     Figure labels: 5’ UTR, gene XYZ, 3’ UTR; Sanger Sequencing; 5’ UTR, gene XYZ, 3’ UTR.
     • Annotation
     • Comparative biology
     • Re-sequencing and gap filling
     • Structural variation!

  5. Why Scaffolding?
     • Annotation
     • Comparative biology
     • Re-sequencing and gap filling
     • Structural variation!

  6. Fragmented Genomes
     Massive Sequencing Projects
     • I5k: 5000 insect and arthropod species
     • G10k: 10,000 vertebrate species
     Effects of Read Length
     • Dog Genome: 7.5x Sanger, N50: 180Kb
     • Chicken Genome: 6x Illumina, N50: 12Kb
     • Human Genome: 100x Illumina, N50: 24Kb

  7. The Scaffolding Problem
     Given: contigs, paired reads
     Find: orientation, ordering, relative distance
     Goal: recreate true scaffolds

  8. Paired Reads
     Paired Read Construction (figure labels: 2kb, 10kb, 2kb; reads R1 and R2, 100b each)
     Paired Read Styles: same strand and orientation, or different strand and orientation (figure labels: Mate Pair, Paired End)

  9. Linkage Information
     Possible States (mate pair): A, B, C, D (figure: reads R1 and R2 placed 5’ to 3’ on contig i and contig j)
     • Two contigs are adjacent if a read pair spans the contigs
     • State (A, B, C, D) depends on orientation of the read
     • Order of contigs is arbitrary
     • Each read pair can be “consistent” with one of the four states
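The state assignment on this slide can be sketched as a lookup on the strands of the two read alignments. This is a minimal sketch under stated assumptions: the transcript does not preserve which strand combination corresponds to which letter, so the table `STATE_BY_STRANDS`, the function `link_state`, and the flag `reads_on_same_strand` are hypothetical names and the A to D labeling is an illustrative convention, not necessarily the authors'.

```python
# Hypothetical labeling of the four states by the strands on which
# R1 (on contig i) and R2 (on contig j) align. The real assignment of
# letters to strand combinations is not given in the transcript.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}


def flip(strand):
    """Return the opposite strand symbol."""
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    return "-" if strand == "+" else "+"


def link_state(strand_r1, strand_r2, reads_on_same_strand=True):
    """Return the state a read pair is consistent with.

    For libraries whose two reads come from different strands, the second
    read is flipped first so both library styles share one lookup table.
    """
    if not reads_on_same_strand:
        strand_r2 = flip(strand_r2)
    return STATE_BY_STRANDS[(strand_r1, strand_r2)]
```

Each read pair maps to exactly one key of the table, which matches the slide's statement that a pair is consistent with one of the four states.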

  10. Scaffolding Graph
      Nodes: nodes are contigs
      Edges: adjacent contigs have 4 edges (one for each state), weighted by overlap with repetitive region
      Figure labels: State A, contig i, contig j
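One way to hold the graph described on this slide is a dictionary keyed by contig pair, with one accumulated weight per state. The sketch below assumes each link arrives already labeled with a state and a weight (for example, a weight reduced when the read overlaps a repetitive region) and that every pair is reported in one fixed order. The input tuple format and the name `build_scaffolding_graph` are hypothetical.

```python
from collections import defaultdict

STATES = ("A", "B", "C", "D")


def build_scaffolding_graph(links):
    """Sum link weights per contig pair and state.

    links: iterable of (contig_i, contig_j, state, weight) tuples
    (a hypothetical input format). Returns
    {(contig_i, contig_j): {"A": w, "B": w, "C": w, "D": w}},
    i.e. four weighted edges for every adjacent pair.
    """
    graph = defaultdict(lambda: dict.fromkeys(STATES, 0.0))
    for contig_i, contig_j, state, weight in links:
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        graph[(contig_i, contig_j)][state] += weight
    return dict(graph)
```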

  11. Integer Linear Program Formulation
      Variables: contig orientation; adjacent contig consistency; contig pair state
      Objective: maximize weight of consistent pairs
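To make the objective concrete, the sketch below maximizes the weight of consistent pairs by brute force on a toy instance. It is not an ILP solver and not the authors' code. It assumes, for illustration only, that states A and D require the two contigs to share the same flip status while B and C require opposite flips, and it ignores ordering and cycle constraints. The function name `best_orientation` is hypothetical.

```python
from itertools import product

# Assumed compatibility between pair states and relative orientation.
SAME_ORIENTATION_STATES = ("A", "D")
OPPOSITE_ORIENTATION_STATES = ("B", "C")


def best_orientation(contigs, weights):
    """Enumerate all flip assignments and keep the heaviest one.

    weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} (missing states count 0).
    For each pair at most one state is selected (the best compatible one),
    mirroring the idea that the four states are mutually exclusive.
    """
    best_score, best_assignment = float("-inf"), None
    for flips in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, flips))
        score = 0.0
        for (i, j), w in weights.items():
            if orient[i] == orient[j]:
                allowed = SAME_ORIENTATION_STATES
            else:
                allowed = OPPOSITE_ORIENTATION_STATES
            score += max(w.get(state, 0.0) for state in allowed)
        if score > best_score:
            best_score, best_assignment = score, orient
    return best_score, best_assignment
```

Enumeration is exponential in the number of contigs, which is why the talk turns to an ILP and to graph decomposition for large genomes.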

  12. Constraints: Pairwise Orientation
      Variables: contig orientation; adjacent contig consistency; contig pair state

  13. Constraints: State Variables
      Variables: contig orientation; adjacent contig consistency; contig pair state

  14. Constraints: Mutual Exclusivity
      Variables: contig orientation; adjacent contig consistency; contig pair state
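The transcript names the variables and the three constraint groups but does not preserve their symbols. The block below is one possible encoding, written only to show how the pieces could fit together. Every symbol is an assumption: $S_i \in \{0,1\}$ marks whether contig $i$ is flipped, $S_{ij}$ is 1 when contigs $i$ and $j$ have different flip status, $A_{ij}, B_{ij}, C_{ij}, D_{ij}$ select the state of the pair, $w^{X}_{ij}$ is the weight of state $X$, and $E$ is the set of adjacent pairs. The assumption that A and D go with equal orientation and B and C with opposite orientation is illustrative.

```latex
\begin{align}
\max \quad & \sum_{(i,j) \in E} \left( w^{A}_{ij} A_{ij} + w^{B}_{ij} B_{ij}
             + w^{C}_{ij} C_{ij} + w^{D}_{ij} D_{ij} \right) \\
\text{s.t.} \quad
& S_{ij} \le S_i + S_j, \qquad S_{ij} \le 2 - S_i - S_j, \\
& S_{ij} \ge S_i - S_j, \qquad S_{ij} \ge S_j - S_i, \\
& A_{ij} + D_{ij} \le 1 - S_{ij}, \qquad B_{ij} + C_{ij} \le S_{ij}, \\
& S_i,\ S_{ij},\ A_{ij},\ B_{ij},\ C_{ij},\ D_{ij} \in \{0, 1\}.
\end{align}
```

The first four inequalities are the standard linearization of $S_{ij} = S_i \oplus S_j$ (pairwise orientation). The next two tie the state variables to $S_{ij}$, and adding them gives $A_{ij} + B_{ij} + C_{ij} + D_{ij} \le 1$, so at most one state is selected per pair (mutual exclusivity).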

  15. Constraints
      Forbid 2 Cycles
      Forbid 3 Cycles
      *larger cycles are broken at the end

  16. Largest Connected Component

  17. Graph Decomposition: Articulation Points
      Figure labels: articulation point, solve
      MIP, Salmela 2011
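Articulation points are the nodes whose removal disconnects a component, so the graph can be split there and the pieces handled separately. The sketch below finds them with the classic depth-first search using discovery times and low-links. It assumes a simple undirected graph given as an adjacency dictionary, and the recursive form is meant for small examples only, since Python's recursion limit applies to large graphs.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected simple graph.

    adj: {node: set_of_neighbors}; every node must appear as a key.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if that subtree
                # cannot reach above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point iff it has 2+ children.
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points
```

For two triangles joined by a single edge, the two endpoints of that edge are the articulation points, and each triangle is a biconnected component.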

  18. Largest Biconnected Component

  19. Non-Serial Dynamic Programming
      A technique that exploits the sparsity of the scaffolding graph by computing the solution in stages, each stage incorporating the results of the previous stages. Inspired by (Neumaier, 06).

  20. Non-Serial Dynamic Programming
      2-cut; orientation combinations of the two cut nodes: (+, +), (+, -), (-, +), (-, -)

  21. Non-Serial Dynamic Programming
      Objective Modification (figure labels: the four orientation combinations + +, + -, - +, - -)
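The two-stage idea can be sketched on a toy orientation-only model: first solve the part of the graph hidden behind a 2-cut once for each of the four orientation combinations of the cut nodes, then solve the rest of the graph with those four values added to its objective. This is a minimal sketch under stated assumptions, not the authors' implementation: a brute-force enumerator stands in for the ILP, each edge carries only a `"same"` and a `"diff"` weight, and the names `solve` and `solve_with_two_cut` are hypothetical.

```python
from itertools import product


def edge_score(w, orient_u, orient_v):
    """Weight earned by one edge under the given pair of orientations."""
    return w["same"] if orient_u == orient_v else w["diff"]


def solve(nodes, edges, fixed=None, bonus=None):
    """Brute-force maximum over the orientations of the non-fixed nodes.

    bonus: optional ((a, b), table) adding table[(orient[a], orient[b])]
    to the objective; this is the objective modification of stage 2.
    """
    fixed = fixed or {}
    free = [n for n in nodes if n not in fixed]
    best = float("-inf")
    for bits in product((0, 1), repeat=len(free)):
        orient = {**fixed, **dict(zip(free, bits))}
        score = sum(edge_score(w, orient[u], orient[v])
                    for (u, v), w in edges.items())
        if bonus is not None:
            (a, b), table = bonus
            score += table[(orient[a], orient[b])]
        best = max(best, score)
    return best


def solve_with_two_cut(nodes, edges, cut, inner):
    """Eliminate the `inner` nodes separated from the rest by the 2-cut."""
    a, b = cut
    inner = set(inner)
    inner_edges = {e: w for e, w in edges.items()
                   if e[0] in inner or e[1] in inner}
    outer_edges = {e: w for e, w in edges.items() if e not in inner_edges}
    # Stage 1: best inner value for each of the 4 cut orientations.
    table = {}
    for orient_a, orient_b in product((0, 1), repeat=2):
        table[(orient_a, orient_b)] = solve(
            inner | {a, b}, inner_edges, fixed={a: orient_a, b: orient_b})
    # Stage 2: solve the remainder with the table folded into the objective.
    outer_nodes = [n for n in nodes if n not in inner]
    return solve(outer_nodes, outer_edges, bonus=(cut, table))
```

Because the inner nodes interact with the rest of the graph only through the two cut nodes, the staged result equals the optimum of the undecomposed problem while each stage enumerates fewer variables.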

  22. SPQR-tree Based Implementation
      • SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
      • DFS of SPQR-tree is used to schedule elimination order for NSDP

  23. Post Processing
      ILP Solution (figure: contigs A to F, each shown with an outgoing and an incoming side)
      • May have cycles
      • Not a total ordering for each connected component
      Bipartite matching
      • Objectives:
      • Max weight
      • Max cardinality
      • Max cardinality / Max weight
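The matching step can be pictured as pairing each contig's outgoing side with at most one other contig's incoming side, so every contig ends up with at most one successor and one predecessor. The sketch below covers only the max cardinality objective, using the augmenting-path method (Kuhn's algorithm). The max weight variants listed on the slide are not shown, and the input format and the name `max_cardinality_matching` are hypothetical.

```python
def max_cardinality_matching(successors):
    """Choose at most one successor per contig and one predecessor per contig.

    successors: {contig: [candidate successor contigs]} taken from the
    adjacencies of an ILP solution (hypothetical input format).
    Returns {contig: chosen_successor} of maximum size.
    """
    predecessor_of = {}  # successor -> contig currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor
            # can be re-routed to another successor.
            if v not in predecessor_of or try_assign(predecessor_of[v], seen):
                predecessor_of[v] = u
                return True
        return False

    for contig in successors:
        try_assign(contig, set())
    return {u: v for v, u in predecessor_of.items()}
```

The result is a set of paths and possibly cycles. A matching alone does not rule out cycles, so any that remain still have to be broken afterwards.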

  24. GAGE Framework
      • Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet
      • Scaffolded using: SILP (our method), Opera, MIP, Bambus2

  25. Testing Metrics
      • TPN50: break each scaffold at incorrect edges, then find N50
      • N50: the length L such that sequences of length L or longer contain at least 50% of the total assembled length
      • Binary Classification: given n contigs in a scaffold, how many of the n-1 adjacencies can you predict?
      • PPV
      • Sensitivity
      • MCC
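These metrics can be sketched in a few lines. The N50 and MCC formulas are the standard ones. The TPN50 helper follows the slide's description and assumes, for simplicity, that a scaffold piece's length is the sum of its contig lengths with gaps ignored. The function names and input formats are hypothetical.

```python
import math


def n50(lengths):
    """Length L such that sequences of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp_n50(scaffolds):
    """Break scaffolds at incorrect adjacencies, then compute N50.

    scaffolds: list of (contig_lengths, adjacency_is_correct) where the
    second list has one boolean per consecutive contig pair (n - 1 items).
    """
    pieces = []
    for lengths, correct in scaffolds:
        current = lengths[0]
        for length, ok in zip(lengths[1:], correct):
            if ok:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from adjacency prediction counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc
```

Breaking at incorrect edges before computing N50 keeps a scaffolder from being rewarded for long but wrongly joined scaffolds.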

  26. Results

  27. Results

  28. Results

  29. Results

  30. Conclusions
      • Success
        • ILP solves scaffolding problem!
        • NSDP works
      • Improvements
        • Include SOAPdenovo, Allpaths-LG scaffolds in comparison
        • Look at parameter effects
        • Practical considerations (read style, multi-libraries, merge contigs)
      • Future Work
        • Where else can I apply NSDP?
        • Scaffold before assembly … promising
        • Structural Variation??

  31. Questions?
