Optimal Tag SNP Selection for Haplotype Reconstruction

Presentation Transcript


  1. Optimal Tag SNP Selection for Haplotype Reconstruction. Jin Jun and Ion Mandoiu, Computer Science & Engineering Department, University of Connecticut

  2. Approaches to Phasing. Motivation and Contributions
     • To reduce prohibitively expensive haplotyping costs, a two-stage methodology has recently been proposed [3]
       • Pilot study
         • All SNPs of interest are genotyped in a small sample of the population
         • Common haplotypes are inferred using statistical methods
         • A set of tag SNPs is selected for the population study
       • Population study
         • Tag SNPs are genotyped in the remaining population
         • Statistical methods are used to infer haplotypes over the tag SNPs
         • Haplotypes over the tag SNPs are extrapolated to full haplotypes
     • We propose novel tag SNP selection methods based on integer linear programming. Our methods
       • Allow computing the complete tradeoff curve between genotyping cost and reconstruction accuracy
       • Yield improved reconstruction accuracy by taking haplotype frequencies into account

  3. Background
     • A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the four possible nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals.
     • Diploid organisms such as humans carry two non-identical copies of each chromosome. A description of the SNPs on one chromosome copy is called a haplotype.
     • At present, it is prohibitively expensive to determine an individual's haplotypes directly, but it is relatively easy to obtain the conflated SNP information of the two copies, the so-called genotype.
     • The genotyping cost depends on the number of SNPs typed. To reduce this cost, a small set of SNPs (tag SNPs) that predicts the remaining SNPs is needed.
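The relation between haplotypes and genotypes described above can be illustrated with a small Python sketch. The 0/1/2 encoding (0 and 1 for the two homozygous cases, 2 for a heterozygous site) is an assumption made for illustration; the slides do not specify an encoding. The sketch shows why phasing is needed: different haplotype pairs can conflate to the same genotype.

```python
from itertools import combinations_with_replacement


def genotype(h1, h2):
    """Conflate two 0/1 haplotypes into one genotype.

    Assumed encoding: the shared allele where both copies agree,
    2 where they differ (heterozygous site).
    """
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def compatible_pairs(g, haplotypes):
    """List all unordered pairs of candidate haplotypes that explain genotype g."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if genotype(h1, h2) == g
    ]


# Hypothetical candidate haplotypes over two SNPs.
candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]
# The doubly heterozygous genotype (2, 2) is explained by two different pairs.
ambiguous = compatible_pairs((2, 2), candidates)
```

Here `ambiguous` contains both `((0, 0), (1, 1))` and `((0, 1), (1, 0))`, so the genotype alone does not determine the haplotype pair.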

  4. Previous Work on Tag SNPs
     • Bafna et al. [1]: Informative SNP Set Problem
       • Find a set of k SNPs with maximum "informativeness"
     • Sebastiani et al. [5]: Best Enumeration SNP Tags (BEST)
       • Generates all optimum fully informative tag SNP sets
       • Limitation: worst-case runtime grows exponentially*
     • Barzuza et al. [2]: Phasing Tagging SNP problem
       • Find the minimum number of SNPs for which every two distinct haplotype pairs yield distinct (XOR) genotypes
       • Limitation: in practice, many pairs of haplotypes will give the same genotype even if all SNPs are used as tags
     • Halperin et al. [4]: Genotype Tagging SNPs
       • Find a set of k SNPs allowing the most accurate genotype reconstruction
     * Running BEST on the n x n identity matrix

  5. Optimum Fully Informative Tag SNP Sets by Integer Programming
     • Given: haplotypes h1, h2, …, hm over n SNPs
     • Find: minimum number of tag SNPs
     • Such that: every two distinct haplotypes differ in at least one tag SNP
     • Integer program formulation
       • 0/1 variable xj for every SNP j
         • xj = 1 if SNP j is selected as a tag SNP
         • xj = 0 otherwise
     • Can be solved efficiently using general-purpose solvers such as CPLEX
     • In practice significantly faster than BEST
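The integer program itself did not survive in the transcript. One formulation consistent with the Given/Find/Such that statement above is the following sketch, where the notation h_i(j) for the allele of haplotype h_i at SNP j is an assumption introduced here:

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{j=1}^{n} x_j \\
\text{subject to} \quad & \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \;\geq\; 1
      && \text{for all } 1 \leq i < i' \leq m \text{ with } h_i \neq h_{i'} \\
                        & x_j \in \{0, 1\} && \text{for all } 1 \leq j \leq n
\end{align*}
```

Each constraint requires that at least one selected SNP separates the pair h_i, h_i', so any feasible solution is a fully informative tag set and the objective picks the smallest one.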

  6. Tag SNP Selection and Haplotype Reconstruction Flow
     [Flow diagram. Pilot study: population sample, phasing, sample haplotypes (with frequencies), tag selection, tag SNP set. Population study: remaining population, genotype (tag SNPs), phasing, haplotype pairs (tag SNPs), extrapolation, haplotype pairs (all SNPs).]

  7. Tag SNP Selection for Haplotype Reconstruction
     • Reconstruction errors
       • Haplotypes not represented in the sample population
         • Cannot be reconstructed!
         • Minimized by choosing a large enough sample
       • Incorrectly inferred haplotypes over tag SNPs
         • Minimized by using accurate haplotype inference (phasing) methods
         • We use PHASE [6] for phasing sample genotypes as well as population genotypes over tag SNPs
       • Incorrect haplotype extrapolation
         • Our extrapolation procedure
           • Find the sample haplotype with minimum Hamming distance
           • Break ties according to the frequency of sample haplotypes (the most frequent haplotypes are given preference)
     Informal problem definition
     • Given: sample haplotypes and frequencies
     • Find: K tag SNPs maximizing reconstruction accuracy
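The extrapolation procedure on this slide (minimum Hamming distance, ties broken by frequency) can be sketched in Python as follows. The function names, the 0/1 allele encoding, and the data layout (a list of full sample haplotypes with a parallel list of frequencies) are assumptions made for illustration, not the authors' implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def extrapolate(tag_haplotype, tag_positions, sample, freqs):
    """Extend a haplotype known only at the tag SNPs to a full haplotype.

    Picks the sample haplotype whose alleles at tag_positions are closest
    (in Hamming distance) to tag_haplotype; among equally close candidates
    the one with the highest frequency wins.
    """
    def rank(i):
        restricted = [sample[i][j] for j in tag_positions]
        # Smaller distance first, then larger frequency.
        return (hamming(restricted, tag_haplotype), -freqs[i])

    best = min(range(len(sample)), key=rank)
    return sample[best]


# Hypothetical pilot-study sample over four SNPs, with estimated frequencies.
sample = [(0, 0, 1, 1), (0, 1, 1, 0), (1, 1, 0, 0)]
freqs = [0.5, 0.3, 0.2]
# SNPs 0 and 2 are the tags; the first two sample haplotypes both match (0, 1),
# so the more frequent one is returned.
full = extrapolate((0, 1), [0, 2], sample, freqs)
```

Note that a haplotype absent from the sample can never be returned, which is the "cannot be reconstructed" error listed on the slide.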

  8. ILP Formulation (1)
     • Integer program formulation similar to that for the fully informative tag SNP problem
       • 0/1 variable xj set to 1 iff SNP j is selected as a tag SNP
       • Only K SNPs can be selected
       • 0/1 variable yi,i' set to 1 iff haplotypes hi, hi' are distinguished by at least one selected SNP
     • Objective is to maximize informativeness, i.e., the number of pairs of haplotypes distinguished by the selected SNPs
     • This formulation is referred to as ILP1
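The formula for ILP1 is missing from the transcript. A formulation matching the variables and objective listed on this slide might look like the sketch below; h_i(j) (the allele of haplotype h_i at SNP j) is notation assumed here, and the budget is written as an inequality, which is one reading of "only K SNPs can be selected".

```latex
\begin{align*}
\text{maximize}   \quad & \sum_{1 \leq i < i' \leq m} y_{i,i'} \\
\text{subject to} \quad & \sum_{j=1}^{n} x_j \;\leq\; K \\
                        & y_{i,i'} \;\leq\; \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j
      && \text{for all } 1 \leq i < i' \leq m \\
                        & x_j,\; y_{i,i'} \in \{0, 1\}
\end{align*}
```

Because the objective is maximized, each y_{i,i'} is set to 1 whenever the constraint allows it, that is, exactly when some selected SNP distinguishes h_i from h_i'.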

  9. ILP Formulation (2)
     • Reconstruction accuracy can be improved by considering haplotype frequencies
     • ILPf: ILP with frequencies
       • Select K tag SNPs maximizing the total probability of distinguished pairs of haplotypes
       • The probability of each haplotype in the population is estimated from the initial sample using the frequencies computed by PHASE
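The frequency-weighted objective can be illustrated with a small Python sketch. Two things here are assumptions rather than part of the slides: the weight of a haplotype pair is taken to be the product of the two haplotype frequencies, and the ILP solver is replaced by exhaustive search over all K-subsets, which is only practical for very small instances.

```python
from itertools import combinations


def distinguished_weight(tags, haplotypes, freqs):
    """Total weight of haplotype pairs that differ in at least one tag SNP.

    Assumed pair weight: product of the two haplotype frequencies.
    """
    total = 0.0
    for i, k in combinations(range(len(haplotypes)), 2):
        if any(haplotypes[i][j] != haplotypes[k][j] for j in tags):
            total += freqs[i] * freqs[k]
    return total


def best_tags(haplotypes, freqs, K):
    """Exhaustively pick the K SNP indices maximizing the weighted objective."""
    n = len(haplotypes[0])
    return max(
        combinations(range(n), K),
        key=lambda tags: distinguished_weight(tags, haplotypes, freqs),
    )


# Hypothetical sample: SNP 0 and SNP 1 each separate two pairs of haplotypes,
# but SNP 0 separates the two most frequent haplotypes.
haplotypes = [(0, 0, 1), (1, 0, 1), (0, 1, 1)]
freqs = [0.6, 0.3, 0.1]
chosen = best_tags(haplotypes, freqs, 1)
```

With equal weights the two candidate SNPs in this example tie on the number of distinguished pairs; with the frequencies above, the weighted objective prefers SNP 0.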

  10. Experimental Setup
      Datasets and parameters: We used synthetic datasets generated following the methodology in [3] for 2 populations (European and West African) on 2 regions (IL8 and 5q31). For each of the 4 population/region combinations, we used the haplotypes and frequencies inferred in [3] from the real data to generate 5 datasets containing between 200 and 1000 individuals. For each dataset, we picked 5 random samples of size 5 times the number of SNPs. We ran our algorithm using predetermined block sizes of 10 and 20. Random selections of tag SNPs (Rand) were performed for comparison.

  11. Phasing Accuracy (%)

  12. Error Analysis
      • Correct haplotype pairs
        • Single-Correct: inferred haplotype pair over tag SNPs is compatible with a single pair of sample haplotypes
        • Multi-Correct: inferred haplotype pair over tag SNPs is compatible with multiple pairs of sample haplotypes, and the most frequent is correct
      • Incorrect haplotype pairs
        • Missing: one or both real haplotypes are not present in the sample population
        • Wrong Short: incorrectly inferred haplotypes over tag SNPs
        • Multi-Wrong: inferred haplotype pair over tag SNPs is compatible with multiple pairs of sample haplotypes, and the most frequent is incorrect

  13. Conclusions
      • Preliminary experiments show that using haplotype frequencies improves reconstruction accuracy compared to random selection and ILP1
      • In ongoing work we are extending our methods to the reconstruction of long haplotypes by using integer program formulations based on overlapping blocks, and are comparing them to other reconstruction flows, including tag SNP based genotype reconstruction as in [4] followed by phasing
      References:
      [1] V. Bafna, B.V. Halldórsson, R.S. Schwartz, A.G. Clark, and S. Istrail, Haplotypes and informative SNP selection algorithms: Don't block out information, RECOMB'03, pp. 19-27, 2003.
      [2] T. Barzuza, J.S. Beckmann, R. Shamir, and I. Pe'er, Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs, CPM 2004, LNCS 3109, pp. 14-31, 2004.
      [3] J. Forton, D. Kwiatkowski, K. Rockett, G. Luoni, M. Kimber, and J. Hull, Accuracy of haplotype reconstruction from haplotype-tagging single-nucleotide polymorphisms, American Journal of Human Genetics, 76(3), pp. 438-48, 2005.
      [4] E. Halperin, G. Kimmel, and R. Shamir, Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy, Proc. ISMB 2005.
      [5] P. Sebastiani, R. Lazarus, S.T. Weiss, L.M. Kunkel, I.S. Kohane, and M.F. Ramoni, Minimal haplotype tagging, Proc. National Academy of Sciences, 100(17), pp. 9900-9905, 2003.
      [6] M. Stephens, N. Smith, and P. Donnelly, A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics, 68, pp. 978-989, 2001.
