Multiple-Locus Genome-Wide Association Testing

David Dean, CSE280A

Genome-wide Association Testing
• Genome-wide association tests have used the concept of linkage disequilibrium (LD) to identify individual genes that correlate with disease phenotypes.
• However, many human diseases arise from the interaction of multiple genes rather than from a single gene.

Linkage Disequilibrium
• SNPs that are close to each other on a chromosome tend to be highly correlated, relative to SNPs that are far apart. Recombination works to undo this correlation.
• Without recombination:
  • $P_{11} \neq P_{1*} P_{*1}$
  • $D = |P_{11} - P_{1*} P_{*1}|$
• With recombination, LD decays with the distance between the two loci.
• Linkage equilibrium: $P_{11} = P_{1*} P_{*1}$ (the loci are independent).
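As a small illustration of the D statistic above, the following sketch estimates it from a list of two-locus haplotypes. The function name and the toy data are hypothetical, and alleles are assumed to be coded 0/1.

```python
def ld_coefficient(haplotypes):
    """Estimate D = |P11 - P1* P*1| from (allele_a, allele_b) pairs coded 0/1."""
    n = len(haplotypes)
    # Frequency of haplotypes carrying allele 1 at both loci.
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # Marginal frequencies of allele 1 at each locus.
    p1_any = sum(1 for a, _ in haplotypes if a == 1) / n
    p_any1 = sum(1 for _, b in haplotypes if b == 1) / n
    return abs(p11 - p1_any * p_any1)


# Toy data (hypothetical): perfectly linked loci vs. independent loci.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

d_linked = ld_coefficient(linked)
d_independent = ld_coefficient(independent)
```

In the linked sample the two alleles always travel together, so D is far from zero; in the independent sample the joint frequency equals the product of the marginals and D is zero.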

Disease Gene Mapping
• The disease phenotypes of the individuals being studied can be treated as a column vector, similar to a column vector of SNPs. LD is used to find a locus that is close to the locus of interest.
• If a locus (and a particular allele at that locus) correlates highly with a particular disease phenotype, one can infer that the allele “may play an important role” in the development of that disease.
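One way to make "correlates highly" concrete is a chi-square statistic on the 2 x 2 table of allele versus phenotype. The slides do not say which test statistic is used, so the choice of chi-square, the function name, and the toy vectors below are assumptions for illustration only.

```python
def chi_square_2x2(snp, phenotype):
    """Pearson chi-square statistic for two 0/1 vectors of equal length."""
    if len(snp) != len(phenotype):
        raise ValueError("vectors must have the same length")
    n = len(snp)
    # observed[a][d] counts individuals with allele a and phenotype d.
    observed = [[0, 0], [0, 0]]
    for a, d in zip(snp, phenotype):
        observed[a][d] += 1
    row_totals = [sum(r) for r in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): one SNP column and two phenotype vectors.
snp = [1, 1, 1, 1, 0, 0, 0, 0]
matching_phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
unrelated_phenotype = [1, 0, 1, 0, 1, 0, 1, 0]

stat_matching = chi_square_2x2(snp, matching_phenotype)
stat_unrelated = chi_square_2x2(snp, unrelated_phenotype)
```

A larger statistic indicates a stronger association between the allele and the phenotype; the unrelated vector gives a statistic of zero because every observed count equals its expected count.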

Epistasis
• The interaction between genes, or epistasis, is an important area of genetics research where much is still unknown.
• For example, one gene may suppress the expression of another gene.
• Gene-gene interactions can be synergistic (positive) or antagonistic (negative).

The Problem
• Testing multiple loci across the whole genome that interact and contribute to a particular phenotype can present a computational challenge.
• Example: $10^4$ individuals × $10^6$ SNPs
  • # of SNP pairs = $10^6 \times 10^6 = 10^{12}$
  • # of SNP trios = $10^6 \times 10^6 \times 10^6 = 10^{18}$
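The counts above treat (A, B) and (B, A) as different tests. Counting unordered combinations instead lowers the totals by a constant factor but leaves the growth rate unchanged, as this short check shows (variable names are illustrative):

```python
import math

m = 10**6  # number of SNPs in the example

# Ordered counts, as written on the slide.
ordered_pairs = m * m
ordered_trios = m * m * m

# Unordered counts: each set of loci is tested once.
unordered_pairs = math.comb(m, 2)
unordered_trios = math.comb(m, 3)

print(f"ordered pairs:   {ordered_pairs:.2e}")
print(f"unordered pairs: {unordered_pairs:.2e}")
print(f"ordered trios:   {ordered_trios:.2e}")
print(f"unordered trios: {unordered_trios:.2e}")
```

Either way of counting leaves far too many tests for an exhaustive three-locus scan, which motivates the filtering approach.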

Objective
• The objective is to discover an efficient method for genome-wide association testing that identifies multiple loci that may be interacting and contributing to a disease phenotype.

Evans et al 2006
• 4 strategies tested:
  • Single-locus tests of association
  • Exhaustive two-locus search
    • Fit all possible two-locus models of association to all pairs of SNPs
  • “Both Significant” two-stage strategy
    • Applies a single-locus test to determine which loci to include in the second stage of pairwise association testing
  • “Either Significant” two-stage strategy
    • Applies a single-locus test to determine a set of loci to test in the second stage, but only requires one locus of each pair to pass the initial phase
• The two-stage strategies were less powerful than the exhaustive two-locus search, but they significantly reduced the computational burden
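The difference between the two two-stage strategies comes down to which pairs reach the second stage. The sketch below selects those pairs from a list of first-stage single-locus p-values. It is not the procedure from Evans et al.; the function name, the threshold `alpha`, and the p-values are hypothetical.

```python
from itertools import combinations


def two_stage_pairs(p_values, alpha, mode="both"):
    """Return the SNP index pairs that would be tested in the second stage.

    mode="both":   both loci must pass the single-locus threshold.
    mode="either": at least one locus of the pair must pass.
    """
    if mode not in ("both", "either"):
        raise ValueError("mode must be 'both' or 'either'")
    passed = {i for i, p in enumerate(p_values) if p <= alpha}
    pairs = []
    for i, j in combinations(range(len(p_values)), 2):
        hits = (i in passed) + (j in passed)
        if (mode == "both" and hits == 2) or (mode == "either" and hits >= 1):
            pairs.append((i, j))
    return pairs


# Hypothetical first-stage p-values for four SNPs.
p_values = [0.001, 0.2, 0.03, 0.8]
both_pairs = two_stage_pairs(p_values, alpha=0.05, mode="both")
either_pairs = two_stage_pairs(p_values, alpha=0.05, mode="either")
```

With these values only SNPs 0 and 2 pass the first stage, so the "both" strategy keeps one pair and the "either" strategy keeps five of the six possible pairs.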

Current Project
• Start with an n x m SNP matrix (Rana et al 2007)
  • n = # of haplotypes (~$10^4$)
  • m = # of SNPs (~$10^6$)
• For a pair of SNPs, $s_1$ and $s_2$:
  • Labeled Hamming distance: $H[s_1, s_2] = \min\{p_1 p_2 + q_1 q_2,\ p_1 q_2 + p_2 q_1\}$
  • If H is low, then $s_1$ and $s_2$ are correlated
  • If H is high, then $s_1$ and $s_2$ are uncorrelated
• Formalize and quantify an efficient filtering method
  • Identify a Hamming distance, $d_1$, to act as a threshold that filters out pairs that may be correlated
  • This small subset can then be exhaustively tested for epistatic interactions
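The slide gives H in terms of p and q without defining them. One plausible reading, assumed here and not confirmed by the source, is that the allele labels at a SNP are arbitrary, so the distance between two SNP columns is the smaller of the Hamming distance to the column and to its bitwise complement. A minimal sketch under that assumption (the function name and toy columns are hypothetical):

```python
def labeled_hamming(s1, s2):
    """Normalized labeled Hamming distance between two 0/1 SNP columns.

    Assumption: 'labeled' means the 0/1 labels may be swapped, so the
    distance is min(d(s1, s2), d(s1, complement(s2))) / n.
    """
    if len(s1) != len(s2):
        raise ValueError("columns must have the same length")
    n = len(s1)
    mismatches = sum(a != b for a, b in zip(s1, s2))
    # Complementing s2 turns every mismatch into a match and vice versa.
    return min(mismatches, n - mismatches) / n


# Toy columns (hypothetical).
s1 = [0, 1, 1, 0]
same = [0, 1, 1, 0]
flipped = [1, 0, 0, 1]
unrelated = [0, 0, 1, 1]

h_same = labeled_hamming(s1, same)
h_flipped = labeled_hamming(s1, flipped)
h_unrelated = labeled_hamming(s1, unrelated)
```

Under this reading the distance ranges from 0 (identical or exactly complementary columns) to 0.5 (no more agreement than disagreement).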

Current Project
• PairedSNPs(δ, k)
  • Repeat for $l$ iterations:
    • Select k rows of haplotypes at random
    • For each SNP location, j, hash into the SNP vector $h_j$ and the bitwise complement $\hat{h}_j$
    • Filter pairs of SNPs that have a Hamming distance < $d_1 n$
  • Identify all pairs of SNPs that are filtered out at least $(1 - \delta)\mu_1$ times
    • $\mu_1$ is the expected number of times that a SNP pair is filtered out, if the Hamming distance is low (= $d_1$)
    • $\mu_1 = l e^{-k d_1}$
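The steps above can be sketched as follows. This is one possible reading of the outline, not the project's implementation. It assumes a pair is "filtered out" in an iteration when the two SNP columns agree, up to complement, on all k sampled rows. The parameter names `iterations`, `d1`, and `seed`, and the toy matrix, are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def paired_snps(matrix, delta, k, iterations, d1, seed=0):
    """Return SNP pairs that collide at least (1 - delta) * mu1 times."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    counts = defaultdict(int)
    for _ in range(iterations):
        rows = rng.sample(range(n), k)  # k haplotype rows, without replacement
        buckets = defaultdict(list)
        for j in range(m):
            key = tuple(matrix[r][j] for r in rows)
            comp = tuple(1 - b for b in key)
            # A column and its complement land in the same bucket.
            buckets[min(key, comp)].append(j)
        for cols in buckets.values():
            for pair in combinations(cols, 2):
                counts[pair] += 1
    mu1 = iterations * math.exp(-k * d1)  # expected collisions at distance d1
    cutoff = (1 - delta) * mu1
    return sorted(pair for pair, c in counts.items() if c >= cutoff)


# Toy matrix (hypothetical): columns 0 and 1 are identical, column 2 is the
# complement of column 0, and column 3 agrees with column 0 on half the rows.
c0 = [0, 0, 0, 0, 1, 1, 1, 1]
c2 = [1 - b for b in c0]
c3 = [0, 1, 0, 1, 0, 1, 0, 1]
matrix = [list(row) for row in zip(c0, c0, c2, c3)]

result = paired_snps(matrix, delta=0.5, k=5, iterations=20, d1=0.1)
```

Columns 0, 1, and 2 collide in every iteration, so all three pairs among them are reported. Column 3 agrees with column 0 on exactly four of the eight rows, so no sample of five rows can make it collide with the others. Enumerating pairs inside each bucket is only practical when the buckets are small.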

Haploview
• An open source application designed to analyze and visualize patterns of LD and to perform association testing on genetic data.
• Haploview is developed and maintained by Dr. Mark Daly’s lab at MIT (Barrett et al 2005).

References
• Barrett, J.C., Fry, B., Maller, J., and Daly, M.J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21:263-265, 2005.
• Brinza, D., He, J., and Zelikovsky, A. Combinatorial search methods for multi-SNP disease association. Proc. of IEEE EMBS Annual International Conference, 2006.
• Evans, D.M., Marchini, J., Morris, A.P., and Cardon, L.R. Two-stage two-locus models in genome-wide association. PLoS Genetics, 2:e157, Sep 2006.
• Rana, B.K., Insel, P.A., Payne, S.H., Abel, K., Beutler, E., Ziegler, M.G., Schork, N.J., and O’Connor, D.T. Population-based sample reveals gene-gender interactions in blood pressure in white Americans. Hypertension, 49:96-106, Jan 2007.
