Optimization Problems for Polymorphisms of Single Nucleotides

juana

Presentation Transcript


  1. Optimization Problems for Polymorphisms of Single Nucleotides

  2. Polymorphisms A polymorphism is a feature

  3. Polymorphisms A polymorphism is a feature - common to everybody

  4. Polymorphisms A polymorphism is a feature - common to everybody - not identical in everybody

  5. Polymorphisms A polymorphism is a feature - common to everybody - not identical in everybody - the possible variants (alleles) are just a few

  6. Polymorphisms A polymorphism is a feature - common to everybody - not identical in everybody - the possible variants (alleles) are just a few E.g. think of eye-color

  7. Polymorphisms A polymorphism is a feature - common to everybody - not identical in everybody - the possible variants (alleles) are just a few E.g. think of eye-color Or blood-type for a feature not visible from outside

  8. At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

  9. At DNA level, a polymorphism is a sequence of nucleotides varying in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)

  10. At DNA level, a polymorphism is a sequence of nucleotides varying in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP) atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

  11. At DNA level, a polymorphism is a sequence of nucleotides varying in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP) atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  12. - SNPs are the predominant form of human variation - On average one every 1,000 bases - Used for drug design, disease studies, forensics, evolutionary studies... atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  13. - Multimillion-dollar SNP consortium project - 1st step: build maps of several thousand SNPs - Goal: associate SNPs (or groups of SNPs) to genetic diseases atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  14. HOMOZYGOUS: same allele on both chromosomes atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  16. HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  18. HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  19. HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgt atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac

  20. HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites ct cg ag at at at ct ag ag cg ag ag ag cg
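
The step from the full sequences to the two-letter haplotypes above can be sketched in a few lines. This is only an illustration, assuming the 14 sequences are already aligned and of equal length; the names `population`, `snp_sites` and `haplotypes` are hypothetical and not part of the presentation.

```python
# The 14 aligned chromosome copies shown on the slides, in slide order.
population = """
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
""".split()


def snp_sites(seqs):
    """Return the 0-based positions at which the sequences do not all agree."""
    return [j for j in range(len(seqs[0])) if len({s[j] for s in seqs}) > 1]


def haplotypes(seqs):
    """Keep, for every sequence, only its content at the SNP sites."""
    sites = snp_sites(seqs)
    return ["".join(s[j] for j in sites) for s in seqs]


print(snp_sites(population))   # positions of the two SNPs
print(haplotypes(population))  # one two-letter haplotype per chromosome copy
```

On this data the two varying positions are the 6th and the 26th base, and the resulting list matches the haplotypes listed on slide 20.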

  21. HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles HAPLOTYPE: chromosome content at SNP sites GENOTYPE: “union” of 2 haplotypes ct OcE cg ag OaE at at OaOt at ct EE ag ag EOg cg ag ag OaOg OgE ag cg

  22. CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact). Call them 1 and O. Also, call * the fact that a site is heterozygous. HAPLOTYPE: string over {1, O}. GENOTYPE: string over {1, O, *}. ct OcE cg ag OaE at at OaOt at ct EE ag ag EOg cg ag ag OaOg OgE ag cg

  23. CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact). Call them 1 and O. Also, call * the fact that a site is heterozygous. HAPLOTYPE: string over {1, O}. GENOTYPE: string over {1, O, *}. o1 o* oo 1o 1* 11 11 11 11 o1 ** 1o 1o *o oo 1o 1o *o *o 1o oo
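
The "union" of two haplotypes into a genotype can be written down directly from the definitions above. The sketch below is a minimal illustration, assuming haplotypes are Python strings over the characters `1` and `O` (capital letter O, as in the slide text); the function name `genotype` is hypothetical.

```python
def genotype(h1, h2):
    """Combine two haplotypes site by site.

    A site where both haplotypes agree is homozygous and keeps its value;
    a site where they differ is heterozygous and is written '*'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


print(genotype("O1", "OO"))  # second site is heterozygous
print(genotype("11", "11"))  # both sites homozygous
print(genotype("O1", "1O"))  # both sites heterozygous
```

Note that the map is not invertible: a genotype with two or more `*` sites is compatible with several haplotype pairs, which is the ambiguity the population problem has to resolve.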

  24. THE HAPLOTYPING PROBLEM. Single individual: given genomic data of one individual, determine 2 haplotypes (one per chromosome). Population: given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual). For the individual problem, the input is erroneous haplotype data, from sequencing. For the population problem, the input is ambiguous genotype data, from screening. The objective is led by Occam's razor: find a minimum explanation of the observed data under a given hypothesis (a.k.a. the parsimony principle).

  25. Theory and Results. Single individual: - Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02) - Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02) - NP-hardness for general gapped haplotyping (LBILS 01). Population: - APX-hardness (Gusfield 00) - Reduction to graph-theoretic model and I.P. approach (Gusfield 01) - New formulations and disease detection (L, Ravi, Rizzi 02) - Exact algorithms for min-size solution (L, Serafini 2011) - Heuristics (Tininini, L, Bertolazzi 2010)

  26. The Single-Individual Haplotyping problem

  27. Shotgun Assembly of a Chromosome [ Webber and Myers, 1997] ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA fragmentation ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA sequencing TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA assembly ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

  28. MAIN ERROR SOURCES -Sequencing errors: ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT CATTGGAAC AATGGAACGGA -Contaminants

  29. Given errors, the data may be inconsistent with exactly 2 haplotypes. Hence, the assembler is unable to build 2 chromosomes. PROBLEM: find and remove the errors so that the data becomes consistent with exactly 2 haplotypes.

  30. The data: a SNP matrix ACTGAAAGCGA ACTAGAGACAGCATG ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC AGCATG ACTGAAAGCGAACTAGAGACAGCATG ACTGATAGCGTAGAGTCA ACTGTCGACTAGACATG ACTGACGATCCATCGTCAGC ACTGAAAATCGATCAGCATG 11O OO1 1 11 1 O

  31. SNP matrix M: rows are fragments 1,..,m; columns are SNPs ("snips") 1,..,n.

          1 2 3 4 5 6 7 8 9
      1   - - - O 1 1 O O -
      2   - O - O 1 - - - 1
      3   1 1 O 1 1 - - - -
      4   O O 1 - - - - O -
      5   - - - - - - - 1 O
      6   - - - - O O O 1 -

  32. SNP matrix as in slide 31 (fragments 1,..,m by SNPs 1,..,n). Fragment conflict: two fragments that disagree at a SNP covered by both can't be on the same haplotype.

  33. SNP matrix as in slide 31 (fragments 1,..,m by SNPs 1,..,n). Fragment conflict: can't be on same haplotype. Fragment Conflict Graph GF(M): one node per fragment, one edge per pair of conflicting fragments. [Figure: GF(M) on nodes 1-6] We have 2 haplotypes iff GF is BIPARTITE.
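
A minimal sketch of the fragment conflict graph and of the bipartiteness test, assuming the 6 x 9 matrix of slide 31 is transcribed as strings over `1`, `O` and `-`. The names `in_conflict`, `conflict_graph` and `two_coloring` are illustrative, and the breadth-first 2-coloring is a standard technique rather than anything prescribed by the slides.

```python
from collections import deque

# Rows of the SNP matrix of slide 31; M[0] is fragment 1.
M = [
    "---O11OO-",
    "-O-O1---1",
    "11O11----",
    "OO1----O-",
    "-------1O",
    "----OOO1-",
]


def in_conflict(r, s):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


def conflict_graph(rows):
    """Adjacency sets of GF(M), with fragments numbered from 1."""
    n = len(rows)
    adj = {i: set() for i in range(1, n + 1)}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if in_conflict(rows[i - 1], rows[j - 1]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def two_coloring(adj):
    """Return a 2-coloring {node: 0 or 1} if the graph is bipartite, else None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: no split into 2 haplotypes
    return color


GF = conflict_graph(M)
print(GF)
print(two_coloring(GF))  # None here: fragments 1, 3 and 6 conflict pairwise
```

When a coloring exists, the two color classes are the two groups of fragments, one per haplotype.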

  34. SNP matrix as in slide 31 (fragments 1,..,m by SNPs 1,..,n). PROBLEM (Fragment Removal): make GF bipartite. [Figure: GF(M) on nodes 1-6]

  35. PROBLEM (Fragment Removal): make GF bipartite. In the SNP matrix of slide 31, removing fragment 6 leaves two conflict-free groups of fragments, and each group merges into one haplotype:

          1 2 3 4 5 6 7 8 9
      1   - - - O 1 1 O O -
      2   - O - O 1 - - - 1
      4   O O 1 - - - - O -
          O O 1 O 1 1 O O 1     (haplotype from fragments 1, 2, 4)

      3   1 1 O 1 1 - - - -
      5   - - - - - - - 1 O
          1 1 O 1 1 - - 1 O     (haplotype from fragments 3, 5)
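
For a matrix this small, fragment removal can be checked by exhaustive search. The sketch below is a brute-force illustration only (exponential in the number of fragments), not the polynomial algorithm discussed later; all function names are hypothetical, and the matrix is the one of slide 31 transcribed as strings.

```python
from itertools import combinations

M = {1: "---O11OO-", 2: "-O-O1---1", 3: "11O11----",
     4: "OO1----O-", 5: "-------1O", 6: "----OOO1-"}


def in_conflict(r, s):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


def split_in_two(rows):
    """Split fragments into 2 conflict-free groups, or return None."""
    color = {}
    for start in rows:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in rows:
                if v != u and in_conflict(rows[u], rows[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return None
    return ([f for f in rows if color[f] == 0],
            [f for f in rows if color[f] == 1])


def min_fragment_removal(rows):
    """All smallest sets of fragments whose removal makes GF bipartite."""
    labels = sorted(rows)
    for size in range(len(labels) + 1):
        found = [set(c) for c in combinations(labels, size)
                 if split_in_two({f: rows[f] for f in labels if f not in c})]
        if found:
            return found


def merge(rows, group):
    """Merge conflict-free fragments into one haplotype ('-' if uncovered)."""
    columns = zip(*(rows[f] for f in group))
    return "".join(next((x for x in col if x != "-"), "-") for col in columns)


print(min_fragment_removal(M))
kept = {f: r for f, r in M.items() if f != 6}
groups = split_in_two(kept)
print(groups, [merge(kept, g) for g in groups])
```

On the matrix as transcribed here, the search reports two optimal solutions of size one: removing fragment 6, as on the slide, or removing fragment 3. Removing fragment 6 gives the groups {1, 2, 4} and {3, 5} and the two haplotypes shown on the slide.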

  36. Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph: NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]; O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]; not O(|V|^ε)-approximable for some ε [Lund and Yannakakis, 1993]. Are there cases of M for which GF(M) is easier? YES: the gapless M. ---O11OO1O1O1OO1--- (gapless); ---O11OO---O1OO1--- (1 gap); ---O11--1O----O1--- (2 gaps)
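
The notion of gap used in these examples can be made concrete with a short sketch. It assumes a fragment is a string in which `-` marks an uncovered SNP, and it treats a gap as a maximal run of `-` strictly between the first and last covered SNPs; the names `count_gaps` and `total_gap_length` are illustrative.

```python
import re


def count_gaps(fragment):
    """Number of maximal runs of '-' between the first and last covered SNP."""
    body = fragment.strip("-")  # leading and trailing '-' are not gaps
    return len(re.findall(r"-+", body))


def total_gap_length(fragment):
    """Total number of '-' entries lying inside gaps."""
    return fragment.strip("-").count("-")


for row in ("---O11OO1O1O1OO1---", "---O11OO---O1OO1---", "---O11--1O----O1---"):
    print(row, count_gaps(row), total_gap_length(row))
```

A matrix is gapless when `count_gaps` is 0 for every row.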

  37. Why gaps? Sequencing errors (don't call with low confidence): ---OO11?11--- ===> ---OO11-11---

  38. Why gaps? Sequencing errors (don't call with low confidence): ---OO11?11--- ===> ---OO11-11---. Celera's mate pairs: attcgttgtagtggtagcctaaatgtcggtagaccttga attcgttgtagtggtagcctaaatgtcggtagaccttga

  39. THEOREM: For a gapless M, the Min Fragment Removal problem is solvable in polynomial time. NOTE: M does not need to be gapless. It is enough that its columns can be sorted so that it becomes gapless (Consecutive Ones Property, Booth and Lueker, 1976).

  40. An O(nm + n^3) D.P. algo

          1 2 3 4 5 6 7 8 9
      1   - O O 1 1 O O - -
      2   - - 1 O 1 1 O - -
      3   - - - 1 1 O - - -
      4   - - - - O O 1 O -
      5   - - - - - 1 O 1 O

  41. An O(nm + n^3) D.P. algo (matrix as in slide 40). LFT(i) and RGT(i) mark the leftmost and rightmost covered SNP of row i. Sort the rows according to LFT.

  42. An O(nm + n^3) D.P. algo (matrix as in slide 40, with LFT(i), RGT(i) per row; sort according to LFT).

      D(i; h, k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

      D(i; h, k) = D(i-1; h, k)       if (i, k compatible and RGT(i) <= RGT(k)) or (i, h compatible and RGT(i) <= RGT(h))
                 = 1 + D(i-1; h, k)   otherwise

      OPT is min over h, k of D(n; h, k) and can be found in time O(nm + n^3)
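
The preprocessing used by the dynamic program (computing LFT and RGT and sorting the rows by LFT) can be sketched as follows. This covers only the preprocessing, not the recurrence itself. It assumes LFT(i) and RGT(i) denote the leftmost and rightmost covered column of row i, counted from 1, which is an interpretation of the slide labels; the function names are hypothetical.

```python
# Rows of the gapless example matrix of slide 40.
rows = [
    "-OO11OO--",
    "--1O11O--",
    "---11O---",
    "----OO1O-",
    "-----1O1O",
]


def lft(row):
    """1-based index of the leftmost covered SNP of a row."""
    return next(j for j, x in enumerate(row, start=1) if x != "-")


def rgt(row):
    """1-based index of the rightmost covered SNP of a row."""
    return len(row) - next(j for j, x in enumerate(reversed(row)) if x != "-")


def is_gapless(row):
    """True if every SNP between LFT and RGT is covered."""
    return "-" not in row[lft(row) - 1:rgt(row)]


def sort_by_lft(matrix):
    """Order the fragments by their leftmost covered SNP."""
    return sorted(matrix, key=lft)


print([(lft(r), rgt(r)) for r in rows])
print(all(is_gapless(r) for r in rows))
print(sort_by_lft(rows[::-1]) == rows)  # the slide's rows are already sorted
```

With the rows in this order, a fragment i can only overlap earlier fragments that extend at least to LFT(i), which is why the recurrence above tracks RGT(h) and RGT(k).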

  43. WITH GAPS... Th: NP-hard if 2 gaps per fragment. Proof (simple): use the fact that for every G there is an M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (in each row, max 3 non-gap entries, hence max 2 gaps).

  44. WITH GAPS... Th: NP-hard if 2 gaps per fragment. Proof (simple): use the fact that for every G there is an M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (in each row, max 3 non-gap entries, hence max 2 gaps). Th: NP-hard even if 1 gap per fragment. Proof: technical, reduction from MAX2SAT.

  45. WITH GAPS... Th: NP-hard if 2 gaps per fragment. Proof (simple): use the fact that for every G there is an M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (in each row, max 3 non-gap entries, hence max 2 gaps). Th: NP-hard even if 1 gap per fragment. Proof: technical, reduction from MAX2SAT. But gaps must be long for the problem to be difficult: we have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gap length L.

  46. What for MFR with gaps? Why not ILP...

  47-50. What for MFR with gaps? Why not ILP... [Figure, built up over four slides: example graph on nodes 0-5 annotated with the fractional values 1/2, 1/3, 1/4, 1/2 and, on the last two slides, 5/12]
