
Linear Reduction for Haplotype Inference

Linear Reduction for Haplotype Inference. Alex Zelikovsky joint work with Jingwu He. WABI 2004. Outline. SNP, haplotypes and genotypes Haplotype Inference Linear reduction method Improvements Experimental results Conclusions & future work. Human Genome and SNP.


Presentation Transcript


  1. Linear Reduction for Haplotype Inference. Alex Zelikovsky, joint work with Jingwu He. WABI 2004

  2. Outline
     • SNPs, haplotypes and genotypes
     • Haplotype Inference
     • Linear reduction method
     • Improvements
     • Experimental results
     • Conclusions & future work

  3. Human Genome and SNP
     • Length of the human genome ≈ 3 × 10^9 base pairs
     • Difference between any two people ≈ 0.1% of the genome ≈ 3 × 10^6 base pairs
     • Total number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
     • SNPs are mostly bi-allelic, e.g.,
         • two variants (alleles) out of 4 possible (A, C, T, G) = A/C
         • having a nucleotide in a certain position or missing it = A/-
     • Major allele = more frequent allele = wild type vs. SNP
     • Minor allele (snip) frequency should be biologically considerable, e.g., over 1%
     • There are many more SNPs with lower frequency

  4. Haplotype and Disease Association
     • Deafness inheritance → moral problems
     • SNPs contribute to risk factors of complex diseases:
         • having a certain SNP increases the chance of having diabetes 10 times
         • but the association is too “fragile” for doctors: 3 × 10^-6 → 30 × 10^-6
         • combinations of SNPs = haplotypes are responsible for diseases
     • International HapMap project: http://www.hapmap.org
         • SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides.
         • HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs
     • Unfortunately, not as much is known about SNP combinations

  5. Haplotypes and Genotypes
     Two haplotypes per individual:
         0 1 1 1 0 0 1 1 0
         1 1 0 1 0 0 1 0 0
     Genotype for the individual:
         2 1 2 1 0 0 1 2 0
     • Diploid organisms = two different “copies” of each chromosome = recombined copies of the parents’ chromosomes
     • Too expensive to examine the two versions of a chromosome separately
     • Much cheaper to obtain genotype (mixed) data than haplotype (separated) data
     • Haplotype = description of a single copy (0 = wild type, 1 = minor allele)
     • Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
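The 0/1/2 coding on this slide can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is hypothetical, and the sample vectors are one reading of the slide's figure rather than data confirmed by the authors.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (0 = 00, 1 = 11, 2 = 01)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles are coded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Assumed example vectors (a reading of the figure on this slide).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype_from_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Going in this direction is trivial; the inference problem discussed next is the reverse direction, which is ambiguous.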

  6. Haplotype Inference Problem
     • Haplotype Inference (HI) Problem:
         • Given: n genotype vectors (entries 0, 1 or 2)
         • Find: n pairs of haplotype vectors, one pair per genotype, each pair explaining its genotype
     • For an individual genotype with h heterozygous sites there are 2^(h-1) possible haplotype pairs explaining this genotype
     • This is hopeless without a genetic model
     • Parsimonious models → minimize the number of haplotypes
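The 2^(h-1) count can be checked by brute force. The sketch below is only an illustration of the counting argument, not an inference method from the talk; `explaining_pairs` is a hypothetical name.

```python
from itertools import product


def explaining_pairs(genotype):
    """Yield every unordered pair of 0/1 haplotypes explaining a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # A fully homozygous genotype is explained by two identical haplotypes.
        yield list(genotype), list(genotype)
        return
    # Fix the first heterozygous site of h1 to 0 so each unordered pair appears once.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site] = b
            h2[site] = 1 - b
        yield h1, h2


pairs = list(explaining_pairs([2, 1, 2, 0, 2]))
print(len(pairs))  # 4, i.e. 2 ** (3 - 1) for h = 3 heterozygous sites
```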

  7. Computational Haplotype Inference Problem
     • Assumptions:
         • small number of repeated mutations
         • small number of recombinations
     • If the data allow, explain them with mutations only (perfect phylogeny)
     • This is possible when there are no 4-gamete rule violations:
         • for any pair of SNPs, at most 3 of the 4 combinations (00/01/10/11) are present
     • Fastest implemented algorithm: DPPH
     • Known programs for general data (with possible 4-gamete rule violations):
         • PHASE, HAPLOTYPER, HAP, set-cover based, etc.
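As a rough illustration of the 4-gamete rule, the following sketch tests a set of 0/1 haplotypes (not genotypes) pair of sites by pair of sites. The function name is hypothetical and the sample data are made up for the example.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return the site pairs (i, j) at which all four gametes 00, 01, 10, 11 occur."""
    n_sites = len(haplotypes[0]) if haplotypes else 0
    bad = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            bad.append((i, j))
    return bad


# Made-up haplotypes: sites 0 and 1 show all four combinations.
haps = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
print(four_gamete_violations(haps))  # [(0, 1)]
```

An empty result means the haplotypes are compatible with a perfect phylogeny under this pairwise test.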

  8. Reducing the Set of SNPs
     • Often many columns corresponding to SNP sites are analogous: one column can be obtained from another by swapping 0’s and 1’s
     • One such column can be dropped, the same as for two equal columns
     • What would be the generalization?
     • If one site is “dependent” on (can be reconstructed from) k other sites, then drop this dependent site: it does not carry any useful additional information
     • General reduction method:
         • Encoding: reduce the number of sites by removing dependent sites
         • Infer site-reduced haplotypes for the site-reduced genotypes using a known haplotype inference method
         • Decoding: reconstruct dependent SNPs from the sites of the reduced haplotypes
     • Main requirement for the reduction method: it should be fast
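The simplest case on this slide, dropping equal and analogous (0/1-swapped) columns, might look like the sketch below. It assumes 0/1/2 genotype rows, where swapping 0 and 1 leaves the heterozygous code 2 unchanged; `drop_analogous_columns` is a hypothetical name.

```python
def drop_analogous_columns(genotypes):
    """Keep one representative of each group of equal or 0/1-swapped columns.

    Returns (kept, source): kept lists the retained column indices, and
    source maps each dropped column to (kept column, needs_swap).
    """
    swap = {0: 1, 1: 0, 2: 2}
    first_seen = {}  # column contents -> index of the kept column
    kept, source = [], {}
    for j in range(len(genotypes[0])):
        col = tuple(row[j] for row in genotypes)
        flipped = tuple(swap[v] for v in col)
        if col in first_seen:
            source[j] = (first_seen[col], False)
        elif flipped in first_seen:
            source[j] = (first_seen[flipped], True)
        else:
            first_seen[col] = j
            kept.append(j)
    return kept, source


G = [[0, 1, 0, 2],
     [1, 0, 1, 2],
     [2, 2, 2, 0]]
kept, source = drop_analogous_columns(G)
print(kept)    # [0, 3]
print(source)  # {1: (0, True), 2: (0, False)}
```

The `source` map is what a decoder would need in order to restore the dropped sites after inference on the kept columns.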

  9. Linear Dependence of SNPs
     • Consider linear dependence:
         • To make analogous sites linearly dependent, change the notation for haplotypes: 0/1 → -1/1
         • Also for genotypes: 0/1/2 → -1/1/0, so that a genotype is the half-sum of its two explaining haplotypes (and thus depends linearly on them)
     • Keep only linearly independent SNPs (tag SNPs); all other SNPs can be reconstructed using linear combinations
     • Equivalent factorization problem: find a representation G = I_X × H
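A quick check of the recoding, assuming the 0/1/2 conventions from the earlier slide; the helper names are hypothetical.

```python
def recode_haplotype(h):
    """Map 0/1 alleles to -1/+1."""
    return [2 * a - 1 for a in h]


def recode_genotype(g):
    """Map 0/1/2 genotype codes to -1/+1/0."""
    return [0 if v == 2 else 2 * v - 1 for v in g]


h1 = [0, 1, 1, 0]
h2 = [1, 1, 0, 0]
g = [2, 1, 2, 0]  # genotype explained by h1 and h2 in the 0/1/2 coding

half_sum = [(a + b) // 2 for a, b in zip(recode_haplotype(h1), recode_haplotype(h2))]
print(half_sum == recode_genotype(g))  # True
```

In this coding a 0/1-swapped column is simply the negated column, which is why analogous sites become linearly dependent.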

  10. Factorization Problem
     • Factorization problem
         • Given a 0/1/-1 genotype matrix G
         • Find a representation G = I_X × H, where I_X = graph incidence matrix (exactly two 1’s in each row) and H = -1/1 haplotype matrix
     • Solution:
         • Factorize G = T × (E_T | C), where T = tags = basis of the columns of G
         • Solve the factorization for T: T = I_X × H’
         • Finally G = (I_X × H’) × (E_T | C) = I_X × (H’ × (E_T | C)) = I_X × H
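The transcript does not include the encoding and decoding algorithms themselves, so the following is only a sketch of one way to obtain a column basis T and the coefficients that express every column in it, using exact Gaussian elimination. The choice of elimination and the names `rref`, `encode` and `decode` are assumptions, not the authors' implementation.

```python
from fractions import Fraction


def rref(M):
    """Reduced row echelon form over the rationals; returns (R, pivot_columns)."""
    A = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(A), len(A[0])
    pivots, r = [], 0
    for c in range(n_cols):
        p = next((i for i in range(r, n_rows) if A[i][c] != 0), None)
        if p is None:
            continue  # column c depends on earlier pivot columns
        A[r], A[p] = A[p], A[r]
        pv = A[r][c]
        A[r] = [v / pv for v in A[r]]
        for i in range(n_rows):
            if i != r and A[i][c] != 0:
                f = A[i][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return A, pivots


def encode(G):
    """Return (tags, coeffs) with G[:, j] = sum_k coeffs[j][k] * G[:, tags[k]]."""
    R, tags = rref(G)
    coeffs = [[R[k][j] for k in range(len(tags))] for j in range(len(G[0]))]
    return tags, coeffs


def decode(reduced_haplotype, coeffs):
    """Extend a haplotype on the tag sites to all sites by the same linear combinations."""
    return [sum(c * x for c, x in zip(col, reduced_haplotype)) for col in coeffs]


# Made-up -1/1/0 genotype matrix whose third column is the negated first column.
G = [[0, 1, 0],
     [1, -1, -1],
     [-1, 0, 1]]
tags, coeffs = encode(G)
print(tags)                                    # [0, 1]
print(decode([1, -1], coeffs) == [1, -1, -1])  # True
```

Purely linear decoding can produce values other than -1 and 1 for a dependent site, which is the kind of error that the graph-based decoding slide addresses.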

  11. Linear Encoding Algorithm

  12. Linear Decoding Algorithm

  13. Graph-Based Decoding
     • Extend the haplotype graph X_r obtained from the HI algorithm on the reduced sites to X_m for all m sites
     • Very often the graphs X_r and X_m are isomorphic, but not always
     • Consider an example:
         • g1 = (1, 0, 1) and g2 = (0, -1, -1)
         • reduced set = (1, 0) and (0, -1)
         • The corresponding reduced haplotype graph has 3 vertices, while X_m has 4 vertices
     • The simple way is to split a vertex when we find an error
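The slide's example can be replayed in a few lines. This sketch assumes the first two sites are the tags and fills a non-tag site only when the genotype is homozygous there; resolving heterozygous non-tag sites is not described in the transcript and is left open. The name `extend_pairs` is hypothetical.

```python
def extend_pairs(genotypes, tags, reduced_pairs):
    """Extend each reduced haplotype pair to all sites of its genotype.

    Tag sites are copied from the reduced haplotypes. A non-tag site is filled
    from the genotype when it is homozygous (-1 or 1); a heterozygous non-tag
    site (0) stays None because this sketch does not resolve its phase.
    """
    extended = []
    for g, (a, b) in zip(genotypes, reduced_pairs):
        full_a, full_b = [None] * len(g), [None] * len(g)
        for k, site in enumerate(tags):
            full_a[site], full_b[site] = a[k], b[k]
        for site, value in enumerate(g):
            if site not in tags and value != 0:
                full_a[site] = full_b[site] = value
        extended.append((tuple(full_a), tuple(full_b)))
    return extended


genotypes = [(1, 0, 1), (0, -1, -1)]   # g1 and g2 from the slide
tags = [0, 1]                          # assumed tag sites
# Reduced haplotype pairs explaining (1, 0) and (0, -1); they share (1, -1).
reduced_pairs = [((1, 1), (1, -1)), ((1, -1), (-1, -1))]

reduced_vertices = {h for pair in reduced_pairs for h in pair}
full_vertices = {h for pair in extend_pairs(genotypes, tags, reduced_pairs) for h in pair}
print(len(reduced_vertices), len(full_vertices))  # 3 4
```

The shared reduced haplotype (1, -1) extends to (1, -1, 1) for g1 and to (1, -1, -1) for g2, so that vertex has to be split.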

  14. Handling Imperfect Phylogeny
     • The genotype data may show inconsistency with the perfect phylogeny model, i.e., 4-gamete rule violations
     • We could choose h independent columns without such violations
     • The algorithm selects them in a greedy manner

  15. Experimental Results
     • In Table 1, our results show that the runtime advantage of Linearly Reduced (LR) DPPH grows fast with test case size and reaches a factor of 60 for the largest instances.
     • In all test cases, if DPPH finds a unique solution, so does LR DPPH, and the solution is identical.
     • In Tables 2 and 3, the running time is drastically reduced compared to the original PHASE, while the measured solution quality is not worse.
     • In Tables 4 and 5, we see the same advantage when using Linearly Reduced HAPLOTYPER instead of the original HAPLOTYPER.
     • The last two datasets are real data: Drosophila haplotypes and a human chromosome.

  16. Experimental Results

  17. Experimental Results

  18. Conclusions and Future Work
     • Our method significantly speeds up popular haplotype inference tools such as DPPH, HAPLOTYPER and PHASE in all cases, without compromising quality.
     • We even reach a 50× speedup over DPPH.
     • Future work includes implementing the algorithm for handling imperfect phylogeny.
     • We are going to investigate an application of the suggested linear reduction to finding a small number of representative sites sufficient to distinguish all haplotypes
