Data Mining in Linkage Disequilibrium Mapping

Jing Hua Zhao • Epidemiology • j.zhao@public-health.ucl.ac.uk • June 2003

Outline of the Talk • The problem • Why data mining? • Haplotype construction • Challenging issues

Current Paradigm • Complex traits (Lander & Schork 1994) • Association mapping (Risch & Merikangas 1996) • The need for both family- and population-based studies (Hodge et al. 2003) • Number of SNPs

Linkage Disequilibrium • The raw data are genetic markers • LD is the non-random association between alleles at different loci • Contains information on the genetics of the population (selection, mutation, recombination, admixture)
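
As a concrete illustration of "non-random association", the sketch below computes the standard pairwise measures D, D' and r² for two biallelic loci from a haplotype frequency and the two allele frequencies. The function name `ld_measures` and its argument names are illustrative assumptions, not part of the talk.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of alleles A and B
    """
    # D is the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest magnitude it can take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    var = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / var if var > 0 else 0.0
    return d, d_prime, r2


# Complete association: A always travels with B.
print(ld_measures(0.5, 0.5, 0.5))
# Independence: the haplotype frequency equals the product of allele frequencies.
print(ld_measures(0.25, 0.5, 0.5))
```

Here D' is returned with the sign of D; many reports quote its absolute value instead.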

A Model with LDs • Log-linear model to allow for higher-order interactions (Weir & Wilson 1986) • Applicable to a variety of null hypotheses (Huttley & Wilson 2000) • Number of terms is exponential in the number of loci
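
To make the exponential growth concrete, the sketch below enumerates the main-effect and interaction terms of a saturated log-linear model. It assumes biallelic loci, so that each non-empty subset of loci contributes exactly one term; the function name `interaction_terms` is illustrative and not taken from the cited papers.

```python
from itertools import combinations


def interaction_terms(loci):
    """List every non-empty subset of loci, ordered by interaction order."""
    terms = []
    for order in range(1, len(loci) + 1):
        # order 1 = main effects, order 2 = pairwise LD terms, and so on
        terms.extend(combinations(loci, order))
    return terms


for n in (2, 3, 5, 10):
    loci = [f"M{i + 1}" for i in range(n)]
    print(n, "loci ->", len(interaction_terms(loci)), "terms")
```

Under this assumption the count is 2^n - 1, so each added marker roughly doubles the number of terms.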

Why Data Mining? • 1.8 million SNPs, 1,240 hits on “haplotype and data mining” in 0.15 seconds • Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and results (Berry & Linoff, 1997, 2000)

A Statistical Perspective • Traditionally EDA, for a particular question • Sheer size of data is problematic • Now DM could be defined as the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owners (Hand 1998)

Haplotype Pattern Mining • Figure 1 (a) Strongly disease-associated haplotype patterns • Enumeration • Depth-first search (DFS), which has good running-time properties
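
A minimal sketch of depth-first pattern enumeration, under stated assumptions: haplotypes are tuples of allele codes, a pattern fixes alleles at some markers and leaves the rest as wildcards (`None`), and a pattern is reported when its 2x2 chi-squared statistic exceeds a threshold. The names `mine_patterns`, `min_support` and `min_chi2` are hypothetical, and this is not claimed to be the algorithm used in the talk.

```python
def matches(pattern, haplotype):
    """A pattern matches when every fixed (non-None) position agrees."""
    return all(p is None or p == h for p, h in zip(pattern, haplotype))


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def mine_patterns(cases, controls, alleles, min_support=2, min_chi2=3.84):
    """Depth-first enumeration of haplotype patterns, pruned by support."""
    n_markers = len(cases[0])
    found = []

    def dfs(pattern, start):
        # Extend the pattern by fixing one more marker to the right of the last one,
        # so every pattern is generated exactly once.
        for pos in range(start, n_markers):
            for allele in alleles:
                child = pattern[:pos] + (allele,) + pattern[pos + 1:]
                a = sum(matches(child, h) for h in cases)
                c = sum(matches(child, h) for h in controls)
                if a + c < min_support:
                    continue  # prune: more specific patterns match no more chromosomes
                score = chi2_2x2(a, len(cases) - a, c, len(controls) - c)
                if score >= min_chi2:
                    found.append((child, score))
                dfs(child, pos + 1)

    dfs((None,) * n_markers, 0)
    return sorted(found, key=lambda item: -item[1])


cases = [(1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
controls = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (0, 0, 1)]
for pattern, score in mine_patterns(cases, controls, alleles=(0, 1)):
    print(pattern, round(score, 2))
```

The pruning relies on support being anti-monotone: fixing another marker can only reduce the number of matching chromosomes. The default threshold 3.84 is the nominal 5% critical value for one degree of freedom; in practice significance would be calibrated by simulation.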

Significance • A simple chi-squared statistic from a 2x2 table of disease-associated and control chromosomes, in accordance with D’, with significance determined via simulation • Simulations over prevalence, evolutionary history and sample size to assess robustness • Applicable to family data (Zhang et al. 2001)
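
One common way to obtain significance "via simulation" is a permutation test, sketched below as an assumption rather than as the procedure actually used in the talk: case/control labels are shuffled across chromosomes and the chi-squared statistic is recomputed each time. The names `table_stat` and `permutation_p_value` are illustrative.

```python
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def table_stat(carrier, is_case):
    """Build the 2x2 table (status by pattern carriage) and return its chi-squared."""
    a = sum(1 for x, y in zip(carrier, is_case) if x and y)          # case, carrier
    b = sum(1 for x, y in zip(carrier, is_case) if not x and y)      # case, non-carrier
    c = sum(1 for x, y in zip(carrier, is_case) if x and not y)      # control, carrier
    d = len(carrier) - a - b - c                                     # control, non-carrier
    return chi2_2x2(a, b, c, d)


def permutation_p_value(carrier, is_case, n_perm=1000, seed=0):
    """Empirical p-value: share of label shuffles at least as extreme as observed."""
    rng = random.Random(seed)
    observed = table_stat(carrier, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if table_stat(carrier, labels) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


carrier = [1] * 10 + [0] * 10   # chromosomes carrying the pattern
is_case = [1] * 10 + [0] * 10   # disease-associated vs control chromosomes
stat, p = permutation_p_value(carrier, is_case)
print(stat, p)
```

Shuffling labels keeps the marginal totals of the table fixed, so the reference distribution reflects the observed sample rather than the asymptotic chi-squared approximation.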

Emerging Rules • LD patterns are highly structured (Daly et al. 2001) • 5-8 markers (Niu et al. 2002; Zaykin et al. 2002; Toivonen et al. 2000) • htSNPs (Johnson et al. 2001)

Problem of Haplotype Uncertainty • EM (Ceppellini et al. 1955) • MCMC (Guo & Thompson 1992; Lazzaroni & Lange 1997; Stephens et al. 2001; Niu et al. 2002) • Heuristic algorithms
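
To show where the uncertainty arises, here is a minimal EM sketch for two biallelic loci, assuming random mating and genotypes coded as counts (0, 1 or 2) of allele "1" at each locus. Only the double heterozygote (1, 1) has ambiguous phase. The function name `em_two_locus` is illustrative; general multilocus implementations are considerably more involved.

```python
def em_two_locus(genotypes, n_iter=100, tol=1e-10):
    """Estimate haplotype frequencies from unphased two-locus genotypes by EM."""
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}          # start from uniform frequencies
    n = len(genotypes)

    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the double heterozygote between its two phasings,
                # 11/00 ("cis") and 10/01 ("trans"), by their current probabilities.
                w_cis = freq[(1, 1)] * freq[(0, 0)]
                w_trans = freq[(1, 0)] * freq[(0, 1)]
                total = w_cis + w_trans
                p_cis = w_cis / total if total > 0 else 0.5
                counts[(1, 1)] += p_cis
                counts[(0, 0)] += p_cis
                counts[(1, 0)] += 1 - p_cis
                counts[(0, 1)] += 1 - p_cis
            else:
                # Phase is determined whenever at most one locus is heterozygous.
                h1 = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                h2 = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                counts[h1] += 1
                counts[h2] += 1

        # M-step: haplotype frequencies are expected counts over 2n chromosomes.
        new_freq = {h: counts[h] / (2 * n) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


genotypes = [(2, 2)] * 3 + [(0, 0)] * 3 + [(1, 1)] * 4
print(em_two_locus(genotypes))
```

In this toy data set the unambiguous individuals carry only 11 and 00 haplotypes, so EM resolves the double heterozygotes towards the 11/00 phasing.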

Haplotype Reconstruction • Table of genotypes (Xie & Ott 1993) • Table of sufficient statistics (Zhao et al. 2000) and linked list • Binary trees (Zhao & Sham 2002) • Mixed-radix number (Zhao & Sham 2003) • QuickSort (Zhao & Qian submitted)
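
The mixed-radix idea can be sketched as follows, assuming alleles at each locus are coded 0, 1, ..., r-1 where r is the number of alleles at that locus. A multilocus haplotype then maps to a single integer that can be sorted and tallied. The names `encode`, `decode` and `tally` are illustrative and not taken from the cited papers, and Python's built-in sort stands in for QuickSort.

```python
from itertools import groupby


def encode(alleles, radices):
    """Map a tuple of allele codes to one integer (first locus most significant)."""
    index = 0
    for allele, radix in zip(alleles, radices):
        if not 0 <= allele < radix:
            raise ValueError("allele code out of range for its locus")
        index = index * radix + allele
    return index


def decode(index, radices):
    """Invert encode(): recover the allele codes from the integer."""
    alleles = []
    for radix in reversed(radices):
        index, allele = divmod(index, radix)
        alleles.append(allele)
    return tuple(reversed(alleles))


def tally(haplotypes, radices):
    """Count distinct haplotypes by sorting their integer keys and grouping runs."""
    keys = sorted(encode(h, radices) for h in haplotypes)
    return {decode(k, radices): len(list(run)) for k, run in groupby(keys)}


radices = (2, 3, 4)                      # alleles per locus
print(encode((1, 2, 3), radices))        # largest index: 2 * 3 * 4 - 1 = 23
print(decode(23, radices))
print(tally([(0, 1, 2), (1, 2, 3), (0, 1, 2)], radices))
```

Working with integer keys avoids storing a full table over all possible haplotypes: only the keys that actually occur need to be kept and sorted.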

Examples • HLA (the evolution of EM algorithms, information content of SNPs and SSRs) • ALDH2 (missing data, effectiveness of heuristic method) • APOC (the disadvantage of QuickSort, heuristics, the inclusion of covariates)

Challenging Issues • Genotype/phenotype relationship in Whitehall II data (10,308 civil servants, with 7,000 APOE genotypings) • Associated with cognitive decline • Need longitudinal data • Will tie up with BioBank project

Statistical Methodology • GLM needs to be extended • The same holds for longitudinal data analysis (LDA) models such as GLMM • Search and Sort paradigm (Knuth)
