HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data

Presentation Transcript

  1. HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data • By Derek Aguiar and Sorin Istrail (Brown University) • Journal of Computational Biology, June 2012 • Presented by KWOK Tsz Piu (Bill), 19/12/2013

  2. Introduction • Genetic variation is present in the form of single nucleotide polymorphisms (SNPs), insertions/deletions, inversions, translocations, copy number variations, etc. • Because SNPs are abundant in the human genome and high-throughput genotyping technologies are now available, SNPs have become the marker of choice for understanding human genetic variation.

  3. Introduction • The human genome contains a pair of DNA sequences, one from each parent, called haploid sequences or haplotypes • Haplotypes differ at SNPs, insertions, deletions, … • SNPs are single-bp mutations (~0.1%; non-uniform) • SNP positions contain one of two possible alleles • … ataggtccCtatttcgcgcCgtatacacgggActata … • … ataggtccGtatttcgcgcTgtatacacgggTctata … • … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Haplotype: description of the SNP alleles on one chromosome • 0 for the major allele, 1 for the minor allele • Diploids: two homologous copies of each autosomal chromosome • One inherited from the mother and one from the father • Genotype: description of the alleles on both chromosomes • 0 - both chromosomes contain the major allele; • 1 - both chromosomes contain the minor allele; • 2 - the chromosomes contain different alleles • Example: genotype 021200210 corresponds to haplotypes 011000110 and 001100010 (one genotype and two haplotypes per individual)
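
The encoding on this slide can be checked mechanically. The following is a minimal sketch (the function name `genotype` is an illustrative choice, not something from the presentation) that derives the genotype string from two haplotype strings using the 0/1/2 convention above.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes (strings of '0'/'1') into a genotype string.

    Convention from the slide: 0 = both major, 1 = both minor,
    2 = the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP positions")
    # Equal alleles keep their value ('0' or '1'); unequal alleles become '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


example = genotype("011000110", "001100010")
print(example)
```

Run on the slide's two haplotypes, this reproduces the slide's genotype. Note that the mapping is not invertible: every `2` in a genotype leaves two possible phasings, which is why haplotypes have to be assembled from reads.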

  5. Goal of Haplotype assembly • Reconstruct the two haplotypes from the aligned sequence fragments

  6. Goal of Haplotype assembly • Sequence reads are sampled from haploid fragments

  7. Gene‐Disease Association Studies • Haplotypes increase power of association

  8. Haplotype assembly problem • In the absence of errors in the sequenced reads, the correct haplotype assembly is unique. • In practice, the problem becomes finding the haplotype assembly that optimizes a chosen objective function • E.g., minimize the number of conflicts with the sequenced reads (minimum error correction, MEC).
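
To make the MEC objective concrete, here is a small sketch of how a candidate assembly could be scored. It assumes only heterozygous SNPs are considered, so the second haplotype is the complement of the first; the fragment representation (a dict from SNP index to observed allele) and the name `mec_score` are assumptions made for illustration.

```python
def mec_score(fragments, h1):
    """Count the allele calls that must be corrected so every fragment
    agrees with one of the two haplotypes.

    fragments: list of dicts {snp_index: allele (0 or 1)}
    h1: list of 0/1 alleles; h2 is taken to be its complement
        (assumption: heterozygous sites only).
    """
    h2 = [1 - a for a in h1]
    total = 0
    for frag in fragments:
        mismatches_h1 = sum(frag[i] != h1[i] for i in frag)
        mismatches_h2 = sum(frag[i] != h2[i] for i in frag)
        # Each fragment is assigned to whichever haplotype it fits better.
        total += min(mismatches_h1, mismatches_h2)
    return total


reads = [{0: 0, 1: 1}, {1: 0, 2: 0}, {0: 0, 1: 1, 2: 0}]
score = mec_score(reads, [0, 1, 1])
print(score)
```

In this toy input the first two reads match a haplotype exactly and the third needs one correction, so the score is 1. A score of 0 corresponds to the error-free case in the first bullet.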

  9. Input

  10. Input

  11. Compass Graph • Weight = number of reads supporting the 00/11 phasing - number of reads supporting the 01/10 phasing • Positive => suggests the 00/11 phasing • Negative => suggests the 01/10 phasing • Zero (or small absolute value) => both phasings are plausible.
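
A sketch of how such edge weights could be computed from reads follows. It assumes the weight between two SNPs is the number of reads supporting the 00/11 phasing minus the number supporting the 01/10 phasing, and that each read is a dict from SNP index to allele; both the representation and the name `compass_weights` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def compass_weights(fragments):
    """Return {(i, j): weight} for every SNP pair covered by at least one read.

    A read adds +1 to an edge when its two alleles agree (00 or 11)
    and -1 when they differ (01 or 10).
    """
    weights = defaultdict(int)
    for frag in fragments:
        for i, j in combinations(sorted(frag), 2):
            weights[(i, j)] += 1 if frag[i] == frag[j] else -1
    return dict(weights)


reads = [{0: 0, 1: 0}, {0: 1, 1: 1}, {1: 0, 2: 1}, {0: 0, 1: 1}]
weights = compass_weights(reads)
print(weights)
```

Here SNPs 0 and 1 get two votes for 00/11 and one for 01/10 (weight 1), while SNPs 1 and 2 get a single 01/10 vote (weight -1). Pairs never covered by the same read get no edge at all.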

  12. Compass Graph

  13. Properties of compass graph • There is a unique phasing between two SNPs s_i and s_j if and only if, for any two simple edge-disjoint paths p and q in G_C between s_i and s_j, the number of negative edges on p plus the number of negative edges on q is even, and neither p nor q includes a 0-weight edge. • Example paths: S1->S2->S4 and S1->S3->S4

  14. Definitions • A conflicting cycle is a simple cycle that: • contains an odd number of negative edges • or has at least one 0-weight edge • A compass graph G_C with no conflicting cycles is happy • A happy graph can be uniquely phased • Observation: • Every spanning tree of a compass graph is a happy graph
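
The definition of a conflicting cycle translates directly into a short check. This sketch takes the edge weights along one simple cycle as a list; the function name is an illustrative choice.

```python
def is_conflicting(cycle_weights):
    """True if a simple cycle with these edge weights is conflicting:
    it contains a 0-weight edge or an odd number of negative edges."""
    if any(w == 0 for w in cycle_weights):
        return True
    negatives = sum(w < 0 for w in cycle_weights)
    return negatives % 2 == 1


print(is_conflicting([3, -2, -1, 4]))  # two negative edges
print(is_conflicting([3, -2, 1]))      # one negative edge
print(is_conflicting([3, 0, 2]))       # contains a 0-weight edge
```

The parity rule reflects that each negative edge flips the relative phase once, so walking around a cycle with an odd number of flips would return to the starting SNP with the opposite allele. A tree has no cycles, which is why every spanning tree is trivially happy.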

  15. Problem formulations • Target: • Remove conflicting cycles with Minimum weighted edge removal (MWER)

  16. Problem formulations • Target: • Remove conflicting cycles with Minimum weighted edge removal (MWER)

  17. Algorithm 1 • 1. Remove all 0-weight edges from G_C. • 2. Construct a maximum spanning tree T. • 3. Mark all conflicting cycles. • 4. Repeat 4.1 and 4.2 until G_C is happy: • 4.1 Randomly select a conflicting cycle and remove the edge e on the cycle whose weight is closest to 0. • 4.2 Re-mark the conflicting cycles. • 5. Output the phasing corresponding to any spanning tree of G_C. • With m = |E_C| and n = |V_C|, time complexity: O(m(m-n+1)^2 + (m-n+1)(m log n))
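
The steps above can be sketched in plain Python as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes the spanning tree is maximal in absolute weight, that "conflicting cycles" are checked on the fundamental cycles induced by the tree, and that the tree is rebuilt after every removal. All function names are hypothetical.

```python
import random


def spanning_forest(nodes, edges):
    """Kruskal by descending |weight|; returns (tree_edges, non_tree_edges)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree, rest = [], []
    for u, v in sorted(edges, key=lambda e: -abs(edges[e])):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
        else:
            rest.append((u, v))
    return tree, rest


def tree_path(tree, src, dst):
    """Edges on the tree path from src to dst (iterative DFS)."""
    adj = {}
    for u, v in tree:
        adj.setdefault(u, []).append((v, (u, v)))
        adj.setdefault(v, []).append((u, (u, v)))
    stack, seen = [(src, [])], {src}
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt, e in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [e]))
    return []


def conflicting_cycles(nodes, edges):
    """Fundamental cycles (tree path + one non-tree edge) with odd negatives."""
    tree, rest = spanning_forest(nodes, edges)
    bad = []
    for u, v in rest:
        cycle = tree_path(tree, u, v) + [(u, v)]
        if sum(edges[e] < 0 for e in cycle) % 2 == 1:
            bad.append(cycle)
    return tree, bad


def phase(nodes, edges, seed=0):
    rng = random.Random(seed)
    edges = {e: w for e, w in edges.items() if w != 0}   # step 1
    tree, bad = conflicting_cycles(nodes, edges)         # steps 2-3
    while bad:                                           # step 4
        cycle = rng.choice(bad)
        del edges[min(cycle, key=lambda e: abs(edges[e]))]
        tree, bad = conflicting_cycles(nodes, edges)
    adj = {v: [] for v in nodes}                         # step 5
    for u, v in tree:
        adj[u].append((v, edges[(u, v)]))
        adj[v].append((u, edges[(u, v)]))
    hap = {}
    for root in nodes:
        if root in hap:
            continue
        hap[root], stack = 0, [root]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in hap:
                    # Positive edge keeps the allele, negative edge flips it.
                    hap[nxt] = hap[node] if w > 0 else 1 - hap[node]
                    stack.append(nxt)
    return hap, edges


hap, kept = phase([1, 2, 3, 4], {(1, 2): 5, (2, 4): 4, (1, 3): -3, (3, 4): 1})
print(hap, kept)
```

In the toy graph, the cycle S1-S2-S4-S3 has one negative edge, so it is conflicting; the weakest edge (3, 4) is removed and the remaining tree gives one haplotype (the other is its complement). Checking only fundamental cycles is sufficient for the parity condition, because the parity of negative edges on any cycle is determined by the parities on a cycle basis.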

  18. Algorithm 1

  19. Algorithm 1

  20. Improvement • Idea: • Prefer removing edges that lie on multiple conflicting cycles • Formulate the problem as a set cover problem: • Sets: edges (each edge covers the conflicting cycles it lies on) • Elements: conflicting cycles • Target: find the collection of edges (sets) of minimum weight such that they cover all of the conflicting simple cycles (elements) • Example: Universe = {1, 2, 3, 4, 5} (5 elements); Sets = {{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}}; Best cover = {{1, 2, 3}, {4, 5}}
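
The slide does not say how the set cover instance is solved. As one possibility, the sketch below uses the standard greedy heuristic for weighted set cover (pick the set with the lowest weight per newly covered element), which is an approximation and an assumption here, not necessarily what HapCompass does. The function name and the unit weights are illustrative.

```python
def greedy_set_cover(universe, sets, weights=None):
    """Return indices of the chosen sets, in the order they were picked.

    Raises ValueError if some element cannot be covered by any set.
    """
    weights = weights or [1] * len(sets)
    uncovered = set(universe)
    chosen = []
    while uncovered:
        candidates = [i for i in range(len(sets)) if uncovered & sets[i]]
        if not candidates:
            raise ValueError("universe cannot be covered")
        # Lowest cost per newly covered element wins.
        best = min(candidates, key=lambda i: weights[i] / len(uncovered & sets[i]))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


universe = {1, 2, 3, 4, 5}
sets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
chosen = greedy_set_cover(universe, sets)
print([sorted(sets[i]) for i in chosen])
```

On the slide's example with unit weights, the heuristic picks {1, 2, 3} and then {4, 5}, which matches the best cover shown. In the haplotype setting each set would be an edge, its weight the absolute edge weight, and its elements the conflicting cycles passing through it.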

  21. Results • Real data: 1000 Genomes data, chr 22 of NA12878 • FMPR: number of mismatches between each fragment and the haplotypes • BFM: number of fragments that do not perfectly match the haplotypes • Block size = number of SNPs

  22. Results • Simulated data: • Chr 22, NA12878 • 10M simulated reads, error rate = 0.05, read length = 100bp

  23. Results

  24. Conclusion • Haplotype assembly is becoming increasingly important • The cost of sequencing is decreasing • More genome-wide and whole-exome studies are being conducted • A new haplotype assembly algorithm • New formulation of the graph • Some useful observations that make the algorithm work • Quality of SNP calls and sequence base call scores will be included in the future.

  25. Thank you!