1 / 47

Spooky halloween haplotype assembly

Spooky halloween haplotype assembly. CSCI2820: Medical Bioinformatics Derek Aguiar. http://paradigmthrift.blogspot.com/2010/06/here-kitty-kitty.html. Important bits from last class. Data generation and preparation Sequencing, quality control, read alignment to a reference genome

rodd
Télécharger la présentation

Spooky halloween haplotype assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Spooky halloween haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar http://paradigmthrift.blogspot.com/2010/06/here-kitty-kitty.html

  2. Important bits from last class • Data generation and preparation • Sequencing, quality control, read alignment to a reference genome • Problem optimizations • Minimum fragment removal • remove minimum number of vertices in the fragment conflict graph to make it bipartite Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1

  3. Important bits from last class • Problem optimizations • Minimum SNP removal • Maximum independent set in the SNP conflict graph • Minimum error correction • Correct the minimum number of base call errors such that the fragment conflict graph is bipartite Schwartz 2010

  4. Important bits from last class • Levy et al. 2007 algorithm • Greedily seed haplotypes • Extend haplotype seeds with maximum overlapping fragments • Iterative refinement in terms of variants and reads

  5. Important bits from last class • HapCUT algorithm • Compute large cuts in a graph which correspond to improvements in the haplotype solution • Recall: Graph cuts • Partition of the vertices of a graph into two disjoint subsets A and B such that at least one edge exists from a vertex in set A to a vertex in set B s5 s1 s7 s3 s4 s6 s2

  6. Today • HapCompass • Evaluating the quality of a haplotype assembly • Polyploid and cancer haplotype assembly

  7. Devising a new graph model • What are the disadvantages of using reads for vertices (fragment conflict graph) ? • Number of reads >> number of variants • What are the disadvantages of using a local maximum cut approximation step (hapCUT) ? • Maximum cut difficult to approximate and expensive computation, best known approximation ratio is 0.878 (and not practical implementation)

  8. The compass graph GC • Vertices are variants and edges connect vertices spanned by reads • Edges are phased based on overlapping read content • Edge weights: w(si,sj) is the difference between 00 or 11 fragments and 01 or 10 fragments. • Note: Because each edge represents the phasing between two SNPs, and a SNP has exactly two alleles, there are only two possible phasings: or • ( ) 00 11 • ( ) 01 10

  9. Compass graph construction

  10. Key properties of compass graphs Compass Graph s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2

  11. Key properties of compass graphs Compass Graph s6 (I) If w(si,sj) > 0 then the phasing is else if w(si,sj) < 0 then the phasing is -2 • ( ) 00 11 -2 s5 3 • ( ) 4 01 10 s1 s3 -2 2 -3 s2 s4 -2

  12. Key properties of compass graphs Compass Graph s6 10 01 (II) Unique pairwise phasings of edges can be extended to paths, that is, the phasing is transitive among the SNPs along a path. E.g. the path (s6,s5,s3) defines the phasing -2 -2 s5 00 11 3 4 • ( ) s1 s3 100 011 -2 2 -3 s2 s4 -2

  13. Key properties of compass graphs Compass Graph s6 -2 Lemma A cycle with an odd number of negative weight edges is conflicting. (III) A non-conflicting cycle is a simple cycle that has a unique phasing for any path. (IV) A conflicting cycle is a simple cycle that does not. -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2

  14. Intuition – Cycling through trees • View cycles as vectors of edges • Cycle space is the collection of all cycles • Cycle basis is  maximal set of linearlyindependent cycles  v8 v7 v4 v1 v3 v5 v6 v2

  15. Optimization • Minimum weighted edge removal (MWER): Given a compass graph G, remove a set of edges of minimum weight such that G is conflict-free.

  16. Basic HapCompass Algorithm Compass Graph Compute simple cycles s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2

  17. Basic HapCompass Algorithm Compass Graph Remove an edge from a conflicting cycle s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2

  18. Basic HapCompass Algorithm Compass Graph If graph still has conflicts, iterate. s6 -2 -2 s5 3 4 s1 s3 2 -3 s2 s4 -2

  19. Phasings of compass graphs Conflict-free Compass Graph s6 -2 Example spanning tree Theorem Every spanning tree of a compass graph is a conflict-free graph. Every spanning tree of a conflict-freecompass graph has the same unique phasing. SNPs -2 s5 s1 s2 s3 s4 s5 s6 edges {s1,s3} {s1,s2} {s5,s6} {s3,s4} {s3,s5} 1 1 3 1 1 4 s1 s3 1 0 1 0 2 -3 1 1 haplotype 1 1 1 0 1 0 s2 s4 -2

  20. HapCompass Algorithm (diploid) input: Sequence reads, variant calls output: 2 haplotypes GC← maximum spanning tree cycle basis CC ← set of conflicting cycles in respect to GC forcinCCdo: resolve(c) GC← rebuild() output: haplotype assembly according to maximum weight spanning tree of Gg Lemma The graph Gcremains connected after each resolve(c). Runs in O(m(m-n+1)2+ (m-n+1)(m log n)) time, m=|E|, n=|V|

  21. Advanced local step – set cover Compass Graph Two conflicting cycles s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2

  22. Set cover 2 4 3 2 s3 s1 s3 s2 s4 s2 2

  23. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

  24. Evaluation • How can we evaluate a haplotype assembly? true haplotypes computed haplotypes

  25. Evaluation • Error type I: disconnected assemblies computed haplotypes 4 s5 s1 4 8 s3 s4 8 9 s6 s2 9 7 true haplotypes

  26. Evaluation • Error type II: base call error computed haplotypes 4 s5 s1 4 7 s3 s4 -1 9 s6 s2 9 1 true haplotypes

  27. Evaluation • Error type III: phase switch computed haplotypes 4 s5 s1 4 -1 s3 s4 8 9 s6 s2 9 7 true haplotypes

  28. Evaluation s99 • Error type IV: bad alignment computed haplotypes 1 4 s5 s1 4 3 s3 s4 8 9 s6 s2 9 7 true haplotypes

  29. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (1) Completeness • Error(s) type I • Not all algorithms use all of the data • However, longer haplotypes does not necessarily mean better

  30. Haplotype assembly completeness

  31. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (2) Minimum error correction • Error(s) type I, II, III, IV

  32. minimum error correction • Counts the number of bit flips in the reads to construct the haplotypes • The distance d(r,h) between a read r and a haplotype h is just the Hamming distance (number of positions they differ) • Define the distance d of a read r and the computed haplotypes h and h’: d(r,h,h’)=min(d(r,h),d(r,h’)) • So, the total distance is the sum over all read fragments

  33. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (3) Haplotype reconstruction rate • Error(s) type I, II

  34. reconstruction rate • Using the same distance function now counting the distance between two haplotypes, the reconstruction rate is the percentage of correctly computed bases • Requires knowledge of the true haplotypes • Given the computed haplotypes h and h’, known haplotypes g and g’, and n SNPs

  35. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (4) Haplotype switch error • Error(s) type II, III, IV

  36. haplotype switch error • Counts the number of haplotype switches when comparing computed haplotypes to true haplotypes • Requires knowledge of true haplotypes • Was originally developed for haplotype phasing and not well suited for haplotype assembly assembled haplotypes true haplotypes SE = 4 1 1 0 1 – – 1 0 1 1 0 0 1 0 – – 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

  37. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (5) Fragment mapping phase relationship • Error(s) type I, II, III, IV • Fragment mapping phase relationship (FMPR) counts the number of pairwise phase relationships in the reads that are not present in the solution 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 FMPR = 2 or 2/6 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 true haplotypes

  38. evaluation • 6 main metrics for evaluating accuracy of haplotypes • (6) Boolean fragments mapped • Error(s) type I, II, III, IV • Boolean fragment mapping (BFM) counts the percentage of fragments that map with errors 1 1 1 1 1 0 BFM = 1 or 1/2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 true haplotypes http://b3ta.hnldesign.nl/519-muhahaha-lolshark.html

  39. Polyploidy k=2 k=4,6,8 k=3 k=4 k=4

  40. GC haplotype phasing ## 11 11 01 11 01 10 00 00 10 00 2 ## v3 v0 v1 v2

  41. G`C haplotype phasing ## 10 11 11 11 11 ## 01 00 00 00 00 ## v3 v0 v1 v2 00 00 00 00 00

  42. Edge decidability • New to the haplotype assembly model • Phasings of edges are no longer disjoint triploid diploid 11 01 11 01 00 10 00 10 00 00 shared haplotypes={} shared haplotypes={00}

  43. Polyploid edge decidability • Edge phasing is decided by maximum likelihood

  44. Probability of read • Calculating P(r | se, pi)? • Assume read may be generated from any of the haplotypes of pi with equal probability.

  45. Disjoint paths • Given k source and sink pairs in a graph, find k pairwise disjoint paths between each source-sink pair that are • node disjoint • edge disjoint • If k is not part of the input, NP-complete (hard) • If k is part of the input, polynomial time solutions exist (although not practical)

  46. 2 v3 v1 v2 GC subgraph(no conflict) The chain graph is a trellis graph. 11 11 11 v1v2 v2v3 v3v1 00 00 00 s1 t1 The chain graph 11 11 11 • Vertex for each 2 variant phased haplotype organized in levels • Edges connect valid haplotype extensions (shared variant allele) Disjoint paths in trellis graphs are polynomially solvable and algorithms are fast in practice. 00 00 00 s0 t0

  47. GC subgraph(conflict) v0 v1 v2 v0v1v1v2v2v0 01 11 11 10 00 00 s1 t1 The chain graph 10 11 11 • Vertex for each 2 variant phased haplotype organized in levels • Edges connect valid haplotype extensions (shared variant allele) 01 00 00 s0 t0

More Related