Linear Reduction for Haplotype Inference

Linear Reduction for Haplotype Inference. Alex Zelikovsky, joint work with Jingwu He. WABI 2004

Outline • SNP, haplotypes and genotypes • Haplotype Inference • Linear reduction method • Improvements • Experimental results • Conclusions & future work

Human Genome and SNPs • Length of the human genome ≈ 3 × 10^9 base pairs • Difference between any two people ≈ 0.1% of the genome ≈ 3 × 10^6 base pairs • Total number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7 • SNPs are mostly bi-allelic, e.g., • two variants (alleles) out of 4 possible (A, C, T, G), written A/C • having a nucleotide at a certain position or missing it, written A/- • Major allele = more frequent allele = wild type, as opposed to the SNP • Minor allele ("snip") frequency should be biologically considerable, e.g., over 1% • There are more SNPs with lower frequency

Haplotype and Disease Association • Deafness inheritance → moral problems • SNPs contribute to risk factors of complex diseases: • having a certain SNP increases the chance of having diabetes 10 times • but the association is too "fragile" for doctors (3 × 10^-6 → 30 × 10^-6) • combinations of SNPs = haplotypes are responsible for diseases • International HapMap project: http://www.hapmap.org • SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides • HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs • Unfortunately, not as much is known about SNP combinations

Haplotypes and Genotypes • Two haplotypes per individual: (0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0) and (1 1 1 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0) • Genotype for the individual: (2 2 1 1 2 2 1 0 0 1 1 0 0 1 2 2 0 0) • Diploid organisms = two different "copies" of each chromosome = recombined copies of the parents' chromosomes • Too expensive to examine the two versions of a chromosome separately • Much cheaper to obtain genotype (mixed) data than haplotype (separated) data • Haplotype = description of a single copy (0 = wild type, 1 = minor allele) • Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
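
The 0/1/2 genotype encoding on this slide can be checked with a minimal Python sketch. The function name `genotype_from_haplotypes` is a hypothetical helper introduced here for illustration; the vectors are the ones shown on the slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Mix two 0/1 haplotypes into one 0/1/2 genotype.

    Equal alleles keep their value (0 = 00, 1 = 11);
    differing alleles become 2 (heterozygous site, 01).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide.
h1 = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
h2 = [1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0]

genotype = genotype_from_haplotypes(h1, h2)
print(genotype)
```

The result matches the genotype row of the slide; going in the other direction (genotype to haplotypes) is the hard part, since the 2's do not say which copy carries which allele.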

Haplotype Inference Problem • Haplotype Inference (HI) Problem: • Given: n genotype vectors (entries 0, 1 or 2) • Find: n pairs of haplotype vectors, one pair per genotype, explaining the genotypes • For an individual genotype with h heterozygous sites there are 2^(h-1) possible haplotype pairs explaining this genotype • This is hopeless without a genetic model • Parsimonious models → minimize the number of haplotypes
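
To see where the count 2^(h-1) comes from, the following sketch enumerates every unordered haplotype pair explaining one genotype by brute force. `explaining_pairs` is a hypothetical name; the approach is only meant for tiny examples, since the count grows exponentially in h (the formula applies for h ≥ 1).

```python
from itertools import product


def explaining_pairs(genotype):
    """Enumerate all unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Each heterozygous site gets 0 on one copy and 1 on the other.
    for bits in product((0, 1), repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het, bits):
            h1[site], h2[site] = bit, 1 - bit
        # frozenset makes (h1, h2) and (h2, h1) the same pair.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


pairs = explaining_pairs([2, 1, 2, 0, 2])  # h = 3 heterozygous sites
print(len(pairs))
```

Each of the 2^h assignments is counted twice (once per ordering of the pair), which leaves 2^(h-1) distinct pairs.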

Computational Haplotype Inference Problem • Assumptions: • small number of repeated mutations • small number of recombinations • If the data allow, explain them with mutations only (perfect phylogeny) • This is possible when there are no 4-gamete rule violations: • for any pair of SNPs, at most 3 of the 4 combinations (00/01/10/11) are present • Fastest implemented algorithm: DPPH • Known programs for general data (with possible 4-gamete rule violations): • PHASE, HAPLOTYPER, HAP, set-cover based, etc.
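
The 4-gamete rule can be sketched as a pairwise check over SNP columns. This sketch assumes the haplotypes are already known (rows of 0/1 values); checking genotypes directly requires more care and is not shown. `four_gamete_violations` is a hypothetical helper name.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return all site pairs (i, j) where 00, 01, 10 and 11 all occur."""
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


consistent = [[0, 0, 1], [0, 1, 1], [1, 0, 0]]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(four_gamete_violations(consistent))   # no pair shows all four gametes
print(four_gamete_violations(conflicting))  # sites 0 and 1 violate the rule
```

The check is quadratic in the number of sites, which is one reason removing redundant sites first can pay off.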

Reducing the Set of SNPs • Often many columns corresponding to SNP sites are analogous: one column can be obtained from another by swapping 0's and 1's • One of such columns can be dropped, same as for two equal columns • What would be a generalization? • If one site is "dependent" on (can be reconstructed from) k other sites, then drop this dependent site: it does not carry any additional useful information • General reduction method: • Encoding: reduce the number of sites by removing dependent sites • Infer site-reduced haplotypes for the site-reduced genotypes using a known haplotype inference method • Decoding: reconstruct the dependent SNPs from the sites of the reduced haplotypes • Main requirement for the reduction method: it should be fast

Linear Dependence of SNPs • Consider linear dependence: • To make analogous sites linearly dependent, change notation: 0/1 → -1/1 for haplotypes • Also for genotypes: 0/1/2 → -1/1/0, so a genotype is the half-sum of its two explaining haplotypes (and thus linearly dependent on them) • Keep only linearly independent SNPs (tag SNPs); all other SNPs can be reconstructed as linear combinations of them • Equivalent factorization problem: find a representation G = I_X × H
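
The change of notation can be verified on a small example. The sketch below recodes haplotypes (0/1 to -1/1) and genotypes (0/1/2 to -1/1/0) and checks the half-sum property; the dictionary and function names are illustrative choices, not part of the original method description.

```python
HAP_CODE = {0: -1, 1: 1}          # haplotype alleles: 0/1 -> -1/1
GEN_CODE = {0: -1, 1: 1, 2: 0}    # genotype values: 0/1/2 -> -1/1/0


def recode_haplotype(h):
    return [HAP_CODE[x] for x in h]


def recode_genotype(g):
    return [GEN_CODE[x] for x in g]


h1 = [0, 1, 1, 0]
h2 = [1, 1, 0, 0]
g = [2, 1, 2, 0]  # genotype explained by h1 and h2 in 0/1/2 notation

a, b = recode_haplotype(h1), recode_haplotype(h2)
# Sums are always -2, 0 or 2, so integer division by 2 is exact.
half_sum = [(x + y) // 2 for x, y in zip(a, b)]

print(half_sum)
print(recode_genotype(g))
```

In the new notation a column obtained by swapping 0's and 1's is simply the negated column, so analogous sites become linearly dependent, as the slide intends.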

Factorization Problem • Factorization problem: • Given a 0/1/-1 genotype matrix G • Find a representation G = I_X × H, where I_X = graph incidence matrix (exactly two 1's in each row) and H = -1/1 haplotype matrix • Solution: • Factorize G = T × (E_T | C), where T = tags = basis of the columns of G • Solve the factorization for T: T = I_X × H' • Finally, G = (I_X × H') × (E_T | C) = I_X × (H' × (E_T | C)) = I_X × H
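
The column factorization G = T × (E_T | C) can be sketched with exact Gaussian elimination over the rationals. This is one possible way to pick tag columns (greedily, left to right) and is an assumption of this sketch, not necessarily the procedure used in the talk; `factorize_columns` and `rebuild` are hypothetical names.

```python
from fractions import Fraction


def factorize_columns(G):
    """Pick tag columns and express every column of G through them.

    Returns (tags, coeffs), where coeffs[j] maps tag index t to the
    coefficient of column t in the representation of column j.
    """
    n_rows, n_cols = len(G), len(G[0])
    basis = []  # (pivot row, reduced vector, its combination of tag columns)
    tags, coeffs = [], []
    for j in range(n_cols):
        v = [Fraction(G[i][j]) for i in range(n_rows)]
        combo = {}  # part of column j already explained by tag columns
        for pivot, b, b_combo in basis:
            if v[pivot] != 0:
                f = v[pivot] / b[pivot]
                v = [x - f * y for x, y in zip(v, b)]
                for t, c in b_combo.items():
                    combo[t] = combo.get(t, 0) + f * c
        pivot = next((i for i, x in enumerate(v) if x != 0), None)
        if pivot is None:
            coeffs.append(combo)  # dependent column: a column of C
        else:
            # The reduced vector equals column j minus the explained part.
            b_combo = {t: -c for t, c in combo.items()}
            b_combo[j] = Fraction(1)
            basis.append((pivot, v, b_combo))
            tags.append(j)
            coeffs.append({j: Fraction(1)})  # tag column: a column of E_T
    return tags, coeffs


def rebuild(G, coeffs):
    """Recompute every column of G from the tag columns only."""
    return [[sum(c * row[t] for t, c in coeffs[j].items())
             for j in range(len(row))] for row in G]


G = [[1, 0, 1],
     [0, -1, -1]]
tags, coeffs = factorize_columns(G)
print(tags, coeffs[2])
```

For this G the third column is the sum of the first two, so only the two tag columns need to be passed to the haplotype inference tool.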

Linear Encoding Algorithm

Linear Decoding Algorithm

Graph-Based Decoding • Extend the haplotype graph X_r obtained from the HI algorithm to X_m for all m sites • Very often the graphs X_r and X_m are isomorphic, but not always • Consider an example: • g1 = (1, 0, 1) and g2 = (0, -1, -1) • reduced set = (1, 0) and (0, -1) • The corresponding reduced haplotype graph has 3 vertices, while X_m has 4 vertices • A simple remedy is to split a vertex when decoding finds an error
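
The example on this slide can be traced in code. In the sketch below the third site equals the sum of the first two, so decoding a reduced haplotype by that linear combination can produce a value other than -1 or 1. The fallback used here (copy the genotype value at a homozygous site, which effectively splits the shared vertex per genotype) is an assumption about what "split the vertices" means, and `decode_site` is a hypothetical helper.

```python
def decode_site(reduced_pair, genotype, site, combo):
    """Values of a dropped site for the two haplotypes explaining genotype."""
    values = [sum(c * h[t] for t, c in combo.items()) for h in reduced_pair]
    if all(v in (-1, 1) for v in values):
        return values
    # Decoding error: at a homozygous site both copies equal the genotype.
    if genotype[site] != 0:
        return [genotype[site]] * 2
    raise ValueError("heterozygous site cannot be resolved this way")


# Reduced haplotype pairs, as an HI tool might return them for sites 0 and 1.
reduced = {
    (1, 0, 1): ((1, 1), (1, -1)),       # explains g1 reduced to (1, 0)
    (0, -1, -1): ((1, -1), (-1, -1)),   # explains g2 reduced to (0, -1)
}
combo = {0: 1, 1: 1}  # site 2 = site 0 + site 1

reduced_vertices = {h for pair in reduced.values() for h in pair}
full_vertices = set()
for g, pair in reduced.items():
    third = decode_site(pair, g, 2, combo)
    for h, v in zip(pair, third):
        full_vertices.add(h + (v,))

print(len(reduced_vertices), len(full_vertices))
```

The reduced haplotype (1, -1) is shared by both genotypes but decodes to (1, -1, 1) for g1 and (1, -1, -1) for g2, which is why the full graph has one more vertex than the reduced one.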

Handling Imperfect Phylogeny • The genotype data may show inconsistency with the perfect phylogeny model, i.e., 4-gamete rule violations • We could choose h independent columns without such violations • The algorithm proceeds in a greedy manner

Experimental Results • Table 1 shows that the runtime advantage of Linearly Reduced (LR) DPPH grows fast with test case size and reaches a factor of 60 for the largest instances • In all test cases, if DPPH finds a unique solution, so does LR DPPH, and the solution is identical • Tables 2 and 3 show that the running time is drastically reduced compared to the original PHASE, while the quality measure is not larger • Tables 4 and 5 show the same advantage when using Linearly Reduced HAPLOTYPER instead of the original HAPLOTYPER • The last two data sets are real data: Drosophila haplotypes and a human chromosome

Experimental Results

Conclusions and Future Work • Our method significantly speeds up popular haplotype inference tools such as DPPH, HAPLOTYPER and PHASE in all cases without compromising quality • We even reach a 50-fold speedup over DPPH • Future work includes implementing the algorithm for handling imperfect phylogeny • We are going to investigate an application of the suggested linear reduction to finding a small number of representative sites sufficient to distinguish all haplotypes
