Alessandra Godi

Solving the Haplotyping Inference Parsimony problem using a polynomial class representative formulation and a set covering formulation. Alessandra Godi, Martine Labbé. Université Libre de Bruxelles; IASI (CNR), Roma. AIRO Winter 2007, Cortina d’Ampezzo, February 5th-9th, 2007

The alphabet of life… DNA structure: double helix (Watson-Crick). Basic unit: the nucleotide, made of a sugar, a phosphate and a base (A, G, T, C). • Base pairs (A-T, G-C) are complementary

Human Chromosomes. In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. Humans have 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes. Each chromosome includes hundreds of different genes.

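Human Chromosomes. [Figure: the mother carries chromosome copies M1 and M2, the father carries P1 and P2; each child receives one maternal copy CM and one paternal copy CP.]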

Chromosomes
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGAT
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGAT
AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGAT

Chromosomes. A single ‘copy’ of a chromosome is called a haplotype, while a description of the mixed data on the two ‘copies’ is called a genotype. For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect, whereas genotype data is easy to collect.

SNPs. All humans are about 99.9% identical at the DNA level. Where does diversity come from? Polymorphism. A SNP (Single Nucleotide Polymorphism) is a site in the genome where two different nucleotides appear with sufficient frequency in the population (say, each with a frequency of 5% or more).

SNP (Single Nucleotide Polymorphism). [Figure: six aligned copies of a chromosome that share the segments AATATATCG, TCCGTATACCTA, GGGGTGTGTGTAC, TGCTAGCACGCG and TGTGTAATATACG and differ only at the isolated SNP sites between them.]

SNP (Single Nucleotide Polymorphism). [Figure: the chromosome copies restricted to the four sites SNP 1, SNP 2, SNP 3, SNP 4.] For one individual: SNP 1 heterozygous, SNP 2 heterozygous, SNP 3 homozygous, SNP 4 homozygous. Haplotype 1: A G A C. Haplotype 2: T T A C. Genotype: A/T T/G A C.

SNP: encoding. [Figure: the alleles at SNP 1 to SNP 4 of each chromosome copy rewritten as 0/1 values.] Haplotype 1: 0 1 1 0. Haplotype 2: 1 0 1 0. Genotype: 0/1 1/0 1 0, encoded as 2 2 1 0.

Haplotyping of a population. Given a set of genotypes $G$ (strings in $\{0,1,2\}^n$), find a set of “generating” haplotypes $H$ (strings in $\{0,1\}^n$). Each genotype corresponds to one individual.

The DNA sequence is a linear arrangement of four different molecules, called nucleotides or bases: A, T, C, G. The bases are paired with each other by hydrogen bonds. The GENOME is the set of genetic information contained in the DNA sequence of each living organism.

DNA accounts for the differences between individuals of the same species. What makes us different from each other is called polymorphism.

At DNA level: a Polymorphism is a nucleotide sequence which varies within a chromosome population: Single Nucleotide Polymorphism (SNP)
atcagattagttagggcacaggacggac
atccgattagttagggcacaggacgtac
atcagattagttagggcacaggacgtac
atccgattagttagggcacaggacggac
atcagattagttagggcacaggacgtac
atccgattagttagggcacaggacgtac
atcagattagttagggcacaggacgtac
atcagattagttagggcacaggacggac
atcagattagttagggcacaggacggac
atccgattagttagggcacaggacggac

HOMOZYGOUS: same allele on both chromosomes. HETEROZYGOUS: different alleles. HAPLOTYPES: chromosome at SNP level. [Figure: the ten chromosome sequences, which differ only at two sites (a/c and g/t).]

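HOMOZYGOUS: same allele on both chromosomes. HETEROZYGOUS: different alleles. HAPLOTYPES: chromosome at SNP level. GENOTYPES: “union” of two haplotypes.
Haplotype pair and resulting genotype (O = homozygous, E = heterozygous):
(ag, at): Oa E
(ct, cg): Oc E
(at, at): Oa Ot
(ct, ag): E E
(ag, cg): E Og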

CODING: each SNP has only 2 possible values in a biological population. Let us call them ‘0’ and ‘1’. Moreover, let ‘2’ denote a heterozygous site. [Figure: the haplotype pairs and genotypes of the five individuals, about to be recoded.]

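CODING: each SNP has only 2 possible values in a biological population. Two haplotypes are combined site by site through $\oplus : \{0,1\} \times \{0,1\} \to \{0,1,2\}$, with $0 \oplus 0 = 0$, $1 \oplus 1 = 1$, $0 \oplus 1 = 1 \oplus 0 = 2$.
Recoded individuals:
$01 \oplus 00 = 02$
$10 \oplus 11 = 12$
$00 \oplus 00 = 00$
$10 \oplus 01 = 22$
$01 \oplus 11 = 21$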
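
The coding rule can be sketched in a few lines of Python. This is only an illustration: the function name `combine` and the string representation of haplotypes and genotypes are assumptions, not part of the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Combine two haplotypes (strings over '01') into a genotype (string over '012').

    Site by site: equal alleles are kept (homozygous site),
    different alleles become '2' (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Illustrative haplotype pairs, one pair per individual.
pairs = [("01", "00"), ("10", "11"), ("00", "00"), ("10", "01"), ("01", "11")]
genotypes = [combine(h1, h2) for h1, h2 in pairs]
print(genotypes)  # ['02', '12', '00', '22', '21']
```

Note that the operation is symmetric, so the order of the two haplotypes does not matter.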

HAPLOTYPING of a population. Given a set $G$ of genotypes (strings in $\{0,1,2\}^n$), find a set $H$ of generator haplotypes (strings in $\{0,1\}^n$). Each genotype corresponds to one individual.
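
To make the problem statement concrete, here is a minimal brute-force sketch in Python that looks for a smallest generating set on a tiny instance. It is an exhaustive search for illustration only, not the ILP approach of the talk; all function names are hypothetical, and the running time is exponential in the number of candidate haplotypes.

```python
from itertools import combinations, product


def combine(h1, h2):
    """Site-wise combination: equal alleles are kept, different alleles give '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def compatible_haplotypes(g):
    """All haplotypes that can appear in some pair generating genotype g."""
    options = [("0", "1") if c == "2" else (c,) for c in g]
    return {"".join(t) for t in product(*options)}


def explains(haplotypes, g):
    """True if some pair of the given haplotypes (possibly twice the same) generates g."""
    return any(combine(a, b) == g for a in haplotypes for b in haplotypes)


def min_haplotype_set(genotypes):
    """Smallest set of haplotypes generating every genotype (exhaustive search)."""
    candidates = sorted(set().union(*(compatible_haplotypes(g) for g in genotypes)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(explains(subset, g) for g in genotypes):
                return set(subset)
    return set()


# A small three-genotype instance.
print(min_haplotype_set(["021", "002", "012"]))
```

On this instance every genotype has a single heterozygous site, so each one forces its unique pair of haplotypes and four haplotypes are needed.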

HAPLOTYPING of a population: State of the Art. Perfect Phylogeny (Bafna, Gusfield, Yooseph 02). Estimation of haplotype frequencies (probabilistic studies: Fallin-Schork, 00). Parsimony Objective (Gusfield 02, Brown 05).

HAPLOTYPING of a population: Parsimony Objective (NP-hard). Combinatorial methods (Gusfield 2002, Brown 2004, Lancia-Rizzi 2002): exponential and polynomial ILP formulations. Rule-based methods (HAPINFER, Clark 1990): starting from genotypes, haplotypes are inferred. Statistical methods (PHASE, Stephens 2004; HAPLOTYPER, Niu 2001; GERBIL, Shamir 2005).

HAPLOTYPING of a population: our approach to the problem by using ILP. A new polynomial formulation: a formulation using class representatives. A new exponential formulation: • a pure set covering model obtained by applying the Fourier-Motzkin procedure to Gusfield's (2002) model • a branch and cut procedure to decrease the number of constraints

A new polynomial formulation. Main idea: class representatives. Let $G = \{g_1, g_2, \dots, g_m\}$ be the genotypes, each of length $n$, with index sets $K = \{1, 2, \dots, m\}$ and $K' = \{1', 2', \dots, m'\}$, and let $I = \{h_1, \dots, h_q\}$ be a solution of the problem. Each haplotype induces an ordered subset of genotypes, and each genotype belongs to exactly two of these subsets. The genotype with the smallest index identifies the subset; the prime appears if the corresponding index has already been used.
$h_1 \to \{g_i, g_j, g_k, \dots\} = S_i$
$h_2 \to \{g_i, g_l, g_r, g_s, \dots\} = S_{i'}$
$h_3 \to \{g_k, g_l, g_s, g_t, \dots\} = S_k$
…

A new polynomial formulation. VARIABLES: for all $k \in K$ and all $i, j \in K \cup K'$, $y^k_{\{i,j\}} = 1$ if genotype $g_k$ belongs to two subsets of genotypes, one whose smallest-index genotype is $i$ and the other whose smallest-index genotype is $j$; $y^k_{\{i,j\}} = 0$ otherwise.

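A new polynomial formulation. Let us note that some $y$ variables do not exist, for instance $y^2_{\{1',2'\}} = 0$. If $y^2_{\{1',2'\}} = 1$, then $S_1 = \{g_1, \dots\}$, $S_{1'} = \{g_1, g_2, \dots\}$, $S_2 = \{g_2, \dots\}$ and $S_{2'} = \{g_2, \dots\}$, so $g_2$ would belong to three subsets instead of two: absurd. Example: $g_1 = 021$, $g_2 = 002$, $g_3 = 012$; $h_1 = 001 \to \{g_1, g_2\} = S_1$ and $h_2 = 011 \to \{g_1, g_3\} = S_{1'}$, hence $y^1_{\{1,1'\}} = 1$.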
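
The labelling of subsets by class representatives can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name, the 1-based indexing, the rule "take the first pair of solution haplotypes that explains each genotype" and the full four-haplotype solution used in the example are all assumptions. It also assumes every genotype has at least one heterozygous site, so that it belongs to exactly two subsets.

```python
from itertools import product


def combine(h1, h2):
    """Site-wise combination: equal alleles are kept, different alleles give '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def class_representatives(genotypes, solution):
    """Label the subset induced by each haplotype with its smallest genotype index.

    Returns (classes, y): classes maps a label such as "1" or "1'" to
    (haplotype, member indices); y maps each genotype index k to the pair of
    labels {i, j} for which y^k_{i,j} = 1.
    """
    # Assumption: use the first pair of solution haplotypes that explains g_k.
    resolution = {}
    for k, g in enumerate(genotypes, start=1):
        pair = next(((a, b) for a, b in product(solution, repeat=2)
                     if combine(a, b) == g), None)
        if pair is None:
            raise ValueError(f"genotype {g} is not explained by the solution")
        resolution[k] = pair

    classes, label_of, used = {}, {}, set()
    for h in solution:
        members = sorted(k for k, pair in resolution.items() if h in pair)
        if not members:
            continue  # haplotype not used by any genotype
        rep = str(members[0])
        label = rep if rep not in used else rep + "'"  # prime if index already used
        used.add(label)
        classes[label] = (h, members)
        label_of[h] = label

    y = {k: tuple(sorted(label_of[h] for h in set(pair)))
         for k, pair in resolution.items()}
    return classes, y


classes, y = class_representatives(["021", "002", "012"],
                                   ["001", "011", "000", "010"])
print(classes)
print(y)
```

With this solution the objective value is 4, since the labels 1, 1', 2 and 3 are in use.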

A new polynomial formulation. VARIABLES: for all $i \in K \cup K'$, $x_i = 1$ if the solution contains a subset of genotypes having genotype $i$ as its smallest-index genotype, and $x_i = 0$ otherwise. For all $i \in K \cup K'$ and all $p \in SNP$, $z_{i,p} \in \{0,1\}$ is the value of the $p$-th coordinate of the haplotype explaining the subset of genotypes used in the solution that has genotype $i$ as its smallest-index genotype. OBJECTIVE FUNCTION: $\min \sum_{i \in K \cup K'} x_i$

A new polynomial formulation. CONSTRAINTS:
1. $x_i \ge x_{i'}$ for all $i \in K$, $i' \in K'$
2. $\sum_{i,j \in K \cup K',\ i \le k,\ j \le k} y^k_{\{i,j\}} \ge 1$ for all $k \in K$

A new polynomial formulation. CONSTRAINTS:
3. $\sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}} \le x_i$ for all $k \in K$, $i \in K \cup K'$
3a. $y^k_{\{k,k'\}} \le x_{k'}$ for all $k \in K$ (case $i = k'$)

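A new polynomial formulation. CONSTRAINTS:
4a. $z_{i,p} = 0$ for all $i \in K \cup K'$ and all $p \in SNP$ s.t. $g_i(p) = 0$
4b. $z_{i,p} = 1$ for all $i \in K \cup K'$ and all $p \in SNP$ s.t. $g_i(p) = 1$
4c. $z_{i,p} + z_{j,p} = 1$ for all $\{i,j\} \subseteq K \cup K'$ and all $p \in SNP$ s.t. $g_i(p) = 2$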

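A new polynomial formulation. CONSTRAINTS:
5. $z_{i,p} \le 1 - \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}} - \sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}}$ for all $k \in K$, all $p \in SNP$ with $g_k(p) = 0$, and all $i \in K \cup K'$
5a. $y^k_{\{k,k'\}} + z_{k',p} \le 1$ for all $k \in K$ (case $i = k'$) and all $p \in SNP$ with $g_k(p) = 0$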

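A new polynomial formulation. CONSTRAINTS:
6. $z_{i,p} \ge \sum_{j \in K \cup K',\ j < i} y^k_{\{i,j\}} + \sum_{j \in K \cup K',\ j \ge i} y^k_{\{i,j\}}$ for all $k \in K$, all $p \in SNP$ with $g_k(p) = 1$, and all $i \in K \cup K'$
6a. $z_{k',p} \ge y^k_{\{k,k'\}}$ for all $k \in K$ (case $i = k'$) and all $p \in SNP$ with $g_k(p) = 1$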

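A new polynomial formulation. CONSTRAINTS:
7. $z_{i,p} + z_{j,p} \ge y^k_{\{i,j\}}$ for all $k \in K$, all $p \in SNP$ with $g_k(p) = 2$, and all $i, j \in K \cup K'$
7a. $z_{i,p} + z_{j,p} \le 2 - y^k_{\{i,j\}}$ for all $k \in K$, all $p \in SNP$ with $g_k(p) = 2$, and all $i, j \in K \cup K'$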

Preliminary results

From Gusfield’s formulation (2002)… Let $G$ be the genotype set and $\hat{H}$ the set of haplotypes which are compatible with some genotype in $G$. For each $g \in G$, $P_g = \{(h_1,h_2) \text{ with } h_1,h_2 \in \hat{H} \mid h_1 \oplus h_2 = g\}$. INTEGER VARIABLES: $x_h = 1$ if $h$ is chosen, 0 otherwise; $y_{h_1,h_2} = 1$ if $(h_1,h_2)$ is selected, 0 otherwise.

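From Gusfield’s formulation (2002)… OBJECTIVE FUNCTION: $\min \sum_{h \in \hat{H}} x_h$. CONSTRAINTS:
1. $\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1$ for all $g \in G$
2. $y_{h_1,h_2} \le x_{h_1}$ for all $(h_1,h_2) \in P_g$, $g \in G$
3. $y_{h_1,h_2} \le x_{h_2}$ for all $(h_1,h_2) \in P_g$, $g \in G$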
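
The sets $P_g$ drive the size of this model, so it may help to see how they can be enumerated. The sketch below is illustrative only; the function name `resolving_pairs` and the string encoding are assumptions. A genotype with $k \ge 1$ heterozygous sites has $2^{k-1}$ unordered pairs, which is why the number of $y$ variables grows exponentially.

```python
from itertools import product


def resolving_pairs(g):
    """Unordered haplotype pairs (h1, h2) whose site-wise combination is genotype g."""
    het = [p for p, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]  # homozygous everywhere: the haplotype pairs with itself
    pairs = []
    # Fix the first heterozygous site of h1 to '0' so that each unordered
    # pair is listed exactly once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for p, b in zip(het, ("0",) + bits):
            h1[p] = b
            h2[p] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(resolving_pairs("022"))  # [('000', '011'), ('001', '010')]
print(len(resolving_pairs("2222")))  # 8
```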

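…to a new set covering formulation by using the Fourier-Motzkin procedure: $\min \sum_{h \in \hat{H}} x_h$ s.t. $\sum_{(h_1,h_2) \in P_g,\ h = h_1 \vee h = h_2} x_h \ge 1$ for all $g \in G$ (set-covering constraints, one for each way of picking $h$ from every pair of $P_g$), $x \in \{0,1\}^{|\hat{H}|}$. Facets and Valid Inequalities: Genotype Structure + Basic SC theory.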
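
As a sanity check on the set covering model, the following Python sketch generates every set covering inequality of one genotype by picking one haplotype from each resolving pair, and tests whether a selection of haplotypes satisfies them. The function names are hypothetical, and the explicit enumeration is only practical for genotypes with very few heterozygous sites.

```python
from itertools import product


def resolving_pairs(g):
    """Unordered haplotype pairs (h1, h2) whose site-wise combination is genotype g."""
    het = [p for p, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for p, b in zip(het, ("0",) + bits):
            h1[p] = b
            h2[p] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def covering_inequalities(g):
    """Each inequality is the set S of haplotypes in a row sum_{h in S} x_h >= 1."""
    return {frozenset(choice) for choice in product(*resolving_pairs(g))}


def satisfied(selected, inequalities):
    """True if the selected haplotypes hit every covering inequality."""
    return all(selected & row for row in inequalities)


rows = covering_inequalities("022")
print(len(rows))  # 4
print(satisfied({"000", "011"}, rows))  # True: a complete pair is selected
print(satisfied({"000", "001"}, rows))  # False: no complete pair is selected
```

A selection satisfies all rows of a genotype exactly when it contains both haplotypes of at least one resolving pair, which is the condition the $y$ variables expressed in the original model.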

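Set-covering for HIP. $N$ is the set of SNPs and $F = \{p \in N : g(p) \in \{0,1\}\}$ is the set of fixed sites of $g$, so $N \setminus F$ is its set of free sites. [Figure: genotype $g$ split into fixed, free and fixed blocks.] Proposition:
1. The polytope $H_{SC}$ is full-dimensional iff $|N \setminus F| \ge 2$ for all $g \in G$.
2. $x_j \ge 0$ is a facet of $H_{SC}$ iff, for all $g \in G$ for which there exists $h_i$ s.t. $h_j \oplus h_i = g$, we have $|N \setminus F| \ge 3$.
3. $x_j \le 1$ is a facet for all $j$.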

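Set-covering for HIP. Inequality: $\sum_{i \in S} x_i \ge 1$. $N$ is the set of SNPs; $F = \{p \in N : g(p) \in \{0,1\}\}$; $F' = \{p \in N : g'(p) \in \{0,1\}\}$; $C = (N \setminus F') \cap F$. [Figure: genotypes $g$ and $g'$ drawn one above the other, each split into fixed and free blocks; $C$ collects the sites that are free in $g'$ and fixed in $g$.]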

Set-covering for HIP. Theorem: let us consider a genotype $g$ and a subset $S$ of haplotypes which are associated with a minimal set covering inequality $\sum_{h \in S} x_h \ge 1$. This inequality is facet defining iff for each genotype $g' \ne g$ one of the following conditions holds:
• $|C| = |(N \setminus F') \cap F| \ge 3$
• $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') \ne \emptyset$
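
The two conditions of the theorem depend only on which sites are fixed or free in $g$ and $g'$, so they are easy to check mechanically. The sketch below assumes the reading $C = (N \setminus F') \cap F$ with thresholds "at least 3" and "exactly 2 with a shared free site"; the function names are hypothetical, and the check says nothing about whether the inequality is minimal, which the theorem takes as given.

```python
def free_sites(g):
    """Positions where the genotype is heterozygous, i.e. N \\ F."""
    return {p for p, c in enumerate(g) if c == "2"}


def facet_condition(g, g_other):
    """Check the condition of the theorem for one genotype g' != g."""
    all_sites = set(range(len(g)))
    fixed_g = all_sites - free_sites(g)        # F
    free_other = free_sites(g_other)           # N \ F'
    c = free_other & fixed_g                   # C = (N \ F') intersected with F
    shared_free = free_sites(g) & free_other   # (N \ F) intersected with (N \ F')
    return len(c) >= 3 or (len(c) == 2 and bool(shared_free))


def is_facet_defining(g, genotypes):
    """True if every other genotype satisfies one of the two conditions."""
    return all(facet_condition(g, other) for other in genotypes if other != g)


print(facet_condition("0220", "2222"))  # True: |C| = 2 and a shared free site
print(facet_condition("0220", "0000"))  # False: C is empty
```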

Set-covering for HIP. NOTE: in the following cases,
1st case: $|C| = |(N \setminus F') \cap F| = 2$ and $(N \setminus F) \cap (N \setminus F') = \emptyset$
2nd case: $|C| = |\{p\}| = 1$
3rd case: $C = \emptyset$
the set covering inequality is dominated by another one that can be defined by using a SEQUENTIAL LIFTING procedure.

Set-covering for HIP: main idea. To overcome the exponential size of the formulation: • Add only set-covering inequalities which are facet-defining • Add them within a branch and cut procedure

Set-covering for HIP: a branch and cut procedure. Let $x^*$ be a fractional solution of a subproblem of the original one, and let $g$ be a genotype with resolving pairs $(h_1,h_2)$, $(h_3,h_4)$, $(h_5,h_6)$, $(h_7,h_8)$. All set covering inequalities associated with $g$ have the following structure: $x_{1 \text{ or } 2} + x_{3 \text{ or } 4} + x_{5 \text{ or } 6} + x_{7 \text{ or } 8} \ge 1$

Set-covering for HIP: a branch and cut procedure. We want to find a set covering inequality of $g$ that is violated by $x^*$. Such an inequality exists if and only if $\min\{x^*_1,x^*_2\} + \min\{x^*_3,x^*_4\} + \min\{x^*_5,x^*_6\} + \min\{x^*_7,x^*_8\} < 1$. If it exists, we have found a set covering inequality which cuts off $x^*$. We choose to add it to the system only if it is facet-defining.
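
The separation step can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `separate`, the dictionary representation of $x^*$ and the tolerance `eps` are not from the slides, and the facet-defining test is left out.

```python
def separate(pairs, x_star, eps=1e-9):
    """Look for a set covering inequality of one genotype that is violated by x_star.

    pairs: the resolving pairs (h1, h2) of the genotype.
    x_star: dict mapping each haplotype to its fractional value.
    Returns the haplotypes S of a violated row sum_{h in S} x_h >= 1, or None.
    """
    # The smallest left-hand side over all rows picks the cheaper haplotype of each pair.
    lhs = sum(min(x_star[h1], x_star[h2]) for h1, h2 in pairs)
    if lhs < 1 - eps:
        return [h1 if x_star[h1] <= x_star[h2] else h2 for h1, h2 in pairs]
    return None


pairs = [("h1", "h2"), ("h3", "h4"), ("h5", "h6"), ("h7", "h8")]
x_star = {"h1": 0.5, "h2": 0.1, "h3": 0.2, "h4": 0.9,
          "h5": 0.0, "h6": 1.0, "h7": 0.3, "h8": 0.3}
print(separate(pairs, x_star))  # ['h2', 'h3', 'h5', 'h7']
```

Because each haplotype appears in only one pair of a genotype, the pairwise minima give the most violated row directly, so separation for one genotype takes time linear in the number of its pairs.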
