FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

FastANOVA: an Efficient Algorithm for Genome-Wide Association Study • Xiang Zhang, Fei Zou, Wei Wang • University of North Carolina at Chapel Hill • Speaker: Xiang Zhang

Genotype-phenotype association study • Goal: find the genetic factors that cause phenotypic differences • Figures: mouse genome, phenotype variation (image sources: http://www.bcgsc.ca, http://www.jax.org/)

Genotype-phenotype association study • Single Nucleotide Polymorphism (SNP) • Mutation of a single nucleotide (A, C, T, G) • The most abundant source of genotypic variation • Serve as genetic markers of locations in the genome • High-throughput genotyping: thousands to millions of SNPs • Figure: aligned sequences from twelve individuals, with SNPs at Chrom1 bp 3,568,717 (…… A A A C G …… vs. …… A A T C G ……) and Chrom6 bp 120,323,342 (…… A A T C C …… vs. …… A A T C G ……)

Genotype-phenotype association study • Genotype • SNPs can be represented as binary {0,1} (e.g. inbred mouse strains) • Quantitative phenotypes • Body weight, blood pressure, tumor size, cancer susceptibility, …… • Question • Which SNPs are the most highly associated with the phenotype?
SNPs | Phenotype value
…… 0 0 0 1 0 1 …… | 8
…… 0 0 0 0 0 0 …… | 7
…… 0 1 1 0 0 1 …… | 12
…… 0 1 0 0 1 0 …… | 11
…… 0 1 0 1 0 1 …… | 9
…… 0 1 0 0 0 0 …… | 13
…… 1 0 1 1 1 1 …… | 6
…… 1 0 0 0 1 0 …… | 4
…… 1 1 1 1 1 1 …… | 2
…… 1 0 0 1 0 0 …… | 5
…… 1 0 0 1 0 1 …… | 0
…… 1 0 1 1 0 0 …… | 3

A simple example: single marker association study • Partition individuals into groups according to the genotype of a SNP • Perform a statistical test (t-test or ANOVA) • Repeat for each SNP • (Genotype/phenotype table as on the previous slide)

Two-locus association mapping • Many phenotypes are complex traits • Due to the joint effect of multiple genes • Single marker approach may not suffice • Consider SNP-SNP interactions • Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11 • Split mice into four groups according to the genotype of each SNP-pair • Perform a statistical test for each SNP-pair
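The four-way split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `split_by_pair` is hypothetical, and the data are the first two SNP columns and the phenotype column of the example table shown on the earlier slides.

```python
from collections import defaultdict

def split_by_pair(xi, xj, y):
    """Group phenotype values by the joint genotype (00, 01, 10, 11) of two SNPs."""
    groups = defaultdict(list)
    for a, b, value in zip(xi, xj, y):
        groups[f"{a}{b}"].append(value)
    return dict(groups)

# First two SNP columns and phenotype values from the example table.
x1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
x2 = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
y = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

groups = split_by_pair(x1, x2, y)
for genotype in sorted(groups):
    print(genotype, groups[genotype])
```

The statistical test is then run on these groups; some groups can be very small (here the 11 group holds a single mouse), which a real analysis would need to handle.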

Statistical issue • Multiple testing problem • When n independent tests are each performed with Type I error α, the family-wise error rate is 1 − (1 − α)^n • Example • Performing 20 tests with Type I error = 0.05, family-wise error rate = 0.64 • 64% probability of getting at least one spurious result • Solution • Permutation test
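The 0.64 figure in the example can be checked with a few lines of arithmetic. The sketch below assumes the tests are independent, which is what the 1 − (1 − α)^n form requires; the function name is illustrative only.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

fwer = family_wise_error_rate(0.05, 20)
print(round(fwer, 2))
```

SNP-pair tests share markers and are therefore not independent, which is one reason a permutation test is preferred over a closed-form correction.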

Permutation test • K permutations of phenotype values • For each permutation, find the maximum test value • Given Type I error α, the critical value Fα is the αK-th largest value among the K maximum values • SNP-pairs whose test values are greater than Fα are significant
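Picking the critical value from the K per-permutation maxima can be sketched as below. The function name is hypothetical, and rounding αK to the nearest integer (with a floor of 1) is an assumption for the case where αK is not a whole number; the slides do not say how that case is handled.

```python
def critical_value(max_values, alpha):
    """Return the (alpha*K)-th largest of the K per-permutation maxima."""
    k = len(max_values)
    rank = max(1, round(alpha * k))  # assumption: round alpha*K, at least 1
    return sorted(max_values, reverse=True)[rank - 1]

# Toy maxima for K = 100 permutations (placeholder values, not real results).
toy_maxima = list(range(1, 101))
f_alpha = critical_value(toy_maxima, alpha=0.05)
print(f_alpha)
```

With α = 0.05 and K = 100 the rank is 5, so the function returns the fifth largest of the maxima.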

Genome-wide association study • What’s GWA? • Simple idea: search for associations across the whole genome • Hard to implement • Enormous search space: with 10,000 SNPs and 1,000 permutations, the number of SNP-pair tests is about 5 × 10^10
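The size of the search space follows from counting unordered SNP-pairs and multiplying by the number of permutations. A quick check using only the figures quoted on the slide:

```python
from math import comb

n_snps = 10_000
n_permutations = 1_000

n_pairs = comb(n_snps, 2)            # unordered SNP-pairs
n_tests = n_pairs * n_permutations   # one test per pair per permutation

print(f"{n_pairs:,} pairs, {n_tests:,} tests")
```

The exact count is slightly under 5 × 10^10, which the slide rounds up.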

Preliminary: ANOVA test and F-statistic • ANOVA test • To determine whether the group means are significantly different • Partition the total sum of squares (SST) into the between-group sum of squares (SSB) and the within-group sum of squares (SSW): SST = SSB + SSW • F-statistic • SNPs {X1, X2, …, XN} • a quantitative phenotype Y • Single SNP test -- F(Xi, Y) • SNP-pair test -- F(XiXj, Y)
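The sum-of-squares partition and the resulting F-statistic can be illustrated with a small sketch. It uses the textbook one-way ANOVA definition, F = (SSB / (g − 1)) / (SSW / (M − g)) for g groups and M individuals; the slide's own formula is not preserved in this transcript, so treat that form as an assumption. The function name `anova_f` is hypothetical, and the data are the first SNP column and the phenotype column of the example table.

```python
from collections import defaultdict

def anova_f(labels, y):
    """Return (F, SST, SSB, SSW) for phenotype y grouped by labels."""
    groups = defaultdict(list)
    for label, value in zip(labels, y):
        groups[label].append(value)
    m, g = len(y), len(groups)
    grand_mean = sum(y) / m
    sst = sum((v - grand_mean) ** 2 for v in y)
    ssb = 0.0
    ssw = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ssb += len(values) * (mean - grand_mean) ** 2   # between-group part
        ssw += sum((v - mean) ** 2 for v in values)     # within-group part
    f = (ssb / (g - 1)) / (ssw / (m - g))
    return f, sst, ssb, ssw

x1 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

f, sst, ssb, ssw = anova_f(x1, y)
print(f"F={f:.3f} SST={sst:.3f} SSB={ssb:.3f} SSW={ssw:.3f}")
```

For a SNP-pair test, the same function can be called with the joint genotype as the label, for example `list(zip(xi, xj))`.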

Problem Formalization • Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative phenotype Y, and its K permutations {Y1, Y2, …, YK} • Maximum ANOVA test (F-statistic) value of permutation Yk: FYk = max {F(XiXj, Yk) | 1≤i<j≤N} • Problem 1: Given Type I error threshold α, find the critical value Fα, which is the αK-th largest value among {FYk | 1≤k≤K} • Problem 2: Given the threshold Fα, find all significant SNP-pairs, i.e. those with F(XiXj, Y) ≥ Fα

Brute force approach • Problem 1: Permutation test to find the critical value • For permutation Yk, test all SNP-pairs to find the maximum test value FYk • Repeat for all permutations • Report the αK-th largest value in {FYk | 1≤k≤K} • Problem 2: Finding significant SNP-pairs • For phenotype Y, test all SNP-pairs and report the SNP-pairs whose test values are above Fα • Note: Problem 1 is more demanding due to the large number of permutations
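The two brute-force steps can be sketched end to end on the toy table from the earlier slides. This is an illustrative baseline only: all function names are hypothetical, the F-statistic is the textbook one-way ANOVA form (an assumption, since the slide formula is not preserved), random shuffles stand in for the K permutations, and αK is rounded to the nearest integer.

```python
import random
from collections import defaultdict
from itertools import combinations

def f_stat(labels, y):
    """One-way ANOVA F-statistic of y grouped by labels."""
    groups = defaultdict(list)
    for label, value in zip(labels, y):
        groups[label].append(value)
    m, g = len(y), len(groups)
    if g < 2 or m <= g:
        return 0.0  # test undefined; treated here as no evidence
    grand = sum(y) / m
    ssb = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in groups.values())
    ssw = sum((v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs)
    return float("inf") if ssw == 0 else (ssb / (g - 1)) / (ssw / (m - g))

def pair_f(snps, i, j, y):
    """F-statistic for the SNP-pair (i, j), grouping by joint genotype."""
    return f_stat(list(zip(snps[i], snps[j])), y)

def brute_force(snps, y, n_perm, alpha, seed=0):
    rng = random.Random(seed)
    pairs = list(combinations(range(len(snps)), 2))
    # Problem 1: maximum test value per permutation, then the critical value.
    maxima = []
    for _ in range(n_perm):
        yk = list(y)
        rng.shuffle(yk)
        maxima.append(max(pair_f(snps, i, j, yk) for i, j in pairs))
    rank = max(1, round(alpha * n_perm))  # assumption: round alpha*K
    f_alpha = sorted(maxima, reverse=True)[rank - 1]
    # Problem 2: pairs whose test value on the real phenotype reaches F_alpha.
    significant = [(i, j) for i, j in pairs if pair_f(snps, i, j, y) >= f_alpha]
    return f_alpha, significant

ROWS = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0], [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
SNPS = [list(col) for col in zip(*ROWS)]  # one list per SNP
Y = [8, 7, 12, 11, 9, 13, 6, 4, 2, 5, 0, 3]

f_alpha, significant = brute_force(SNPS, Y, n_perm=100, alpha=0.05, seed=1)
print(f_alpha, significant)
```

Every permutation repeats the full pass over all SNP-pairs, which is the cost the rest of the talk aims to avoid.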

Overview of FastANOVA • Goal: scale large permutation tests to genome-wide data • Question: do we have to perform ANOVA tests for every SNP-pair and repeat them for all permutations? • Idea: • Develop an upper bound to filter out SNP-pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so there is no sub-tree pruning; how?) • Efficiently compute the upper bound: calculate it for a group of SNP-pairs together (possible?) • Identify redundant computations in the permutation tests (reuse computations; how?)

The upper bound • For any SNP-pair (XiXj): F(XiXj, Y) ≥ Fα is equivalent to SSB(XiXj, Y) ≥ θ, where θ is fixed for a given Fα • Bound on SSB: the upper bound on SSB(XiXj, Y) needs to be greater than θ for (XiXj) to be significant
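The equivalence between a threshold on F and a threshold on SSB can be worked out from the textbook one-way ANOVA form F = (SSB / (g − 1)) / ((SST − SSB) / (M − g)). The slide's own expression for θ is not preserved in this transcript, so the formula below is a derivation under that assumption, with a fixed number of groups g (4 for a SNP-pair with all genotype combinations present). Solving F ≥ Fα for SSB gives θ = Fα (g − 1) SST / ((M − g) + Fα (g − 1)). SST does not change when Y is permuted, so θ depends only on Fα. Function names are hypothetical.

```python
def ssb_threshold(f_alpha, sst, m, g=4):
    """Smallest SSB for which F >= f_alpha, given SST, m individuals, g groups."""
    c = f_alpha * (g - 1)
    return c * sst / ((m - g) + c)

def f_from_ssb(ssb, sst, m, g=4):
    """Textbook one-way ANOVA F-statistic written in terms of SSB and SST."""
    return (ssb / (g - 1)) / ((sst - ssb) / (m - g))

# Toy numbers: 12 individuals, SST = 184, critical value F_alpha = 5.
theta = ssb_threshold(5.0, 184.0, 12)
print(theta, f_from_ssb(theta, 184.0, 12))
```

Because F increases with SSB when SST is fixed, any SNP-pair whose upper bound on SSB falls below θ can be skipped without computing its exact test value.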

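The upper bound • Given Xi, Xj, and Y, the bound (equation not preserved in this transcript) is annotated with three parts: a constant term, a term f(na), and a term f(nb) • na and nb depend only on the genotype of Xj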

Applying the upper bound • For a given Xi, let AP = {(XiXj) | i+1≤j≤N} • Index the SNP-pairs in AP in the 2D space of (na, nb) • Figure (example for X1): (X1X3) and (X1X5) in entry (1,3); (X1X6) in entry (3,3); (X1X2) and (X1X4) in entry (2,1)

Key properties • Maximum possible size: • Many SNP-pairs share the same entry • All SNP-pairs in the same entry have the same upper bound value • The indexing structure does not depend on the phenotype permutations • Figure annotations: f(na), f(nb)
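The indexing idea can be sketched on the example table. The exact definition of na and nb is not preserved in this transcript, so the sketch assumes one: for a fixed Xi, na is the size of the smaller Xj subgroup among individuals with Xi = 0, and nb is the same quantity among individuals with Xi = 1. Under that assumption the example table reproduces the three entries shown on the indexing slide, (1,3), (3,3) and (2,1), but it remains an inferred definition. The function name is hypothetical, and SNP indices in the code are 0-based.

```python
from collections import defaultdict

def build_index(snps, i):
    """Map each (na, nb) entry to the SNPs j > i whose pair (Xi, Xj) falls in it."""
    xi = snps[i]
    index = defaultdict(list)
    for j in range(i + 1, len(snps)):
        xj = snps[j]
        ones_a = sum(1 for a, b in zip(xi, xj) if a == 0 and b == 1)
        ones_b = sum(1 for a, b in zip(xi, xj) if a == 1 and b == 1)
        size_a = xi.count(0)
        size_b = xi.count(1)
        na = min(ones_a, size_a - ones_a)  # assumption: smaller subgroup size
        nb = min(ones_b, size_b - ones_b)
        index[(na, nb)].append(j)
    return dict(index)

ROWS = [
    [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0], [1, 0, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0],
]
SNPS = [list(col) for col in zip(*ROWS)]  # one list per SNP

index = build_index(SNPS, 0)
for entry in sorted(index):
    print(entry, [f"X1X{j + 1}" for j in index[entry]])
```

The index is built from genotypes alone, so it is computed once per Xi and reused across all phenotype permutations; only the bound value attached to each entry changes with the permutation.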

Schema of FastANOVA (for permutation test) • For each Xi , index the SNP-pairs {(XiXj)|i+1≤j≤N} in the 2D space of (na, nb) • For each permutation, find the candidate SNP-pairs by accessing the indexing structure • Candidates are SNP-pairs whose upper bounds are above the threshold. • The dynamic threshold is the maximum test value found so far.

Complexity of FastANOVA • Time complexity • FastANOVA: O(N^2 M + K N M^2 + C M) • Brute force: O(K N^2 M) • Space complexity • O((N+K)M) • Notation: N = # SNPs, M = # individuals, K = # permutations, C = # candidates; M << N

Brute force vs. FastANOVA • FastANOVA is two orders of magnitude faster than the brute force alternative • #SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake) • SNP and phenotype data available at http://www.jax.org

Pruning power of the bound

Runtime of each component One time cost

Future work • Association study involving more than two SNPs • Computationally much more demanding • Three loci vs. two loci: the number of combinations grows by a factor on the order of the number of SNPs • Association study for the heterozygous case • SNPs are encoded as ternary variables {0, 1, 2}
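The growth from two loci to three can be checked by counting combinations: the ratio of SNP-triples to SNP-pairs is (N − 2) / 3, which is on the order of N. The sketch below also counts genotype groups per SNP-pair for binary and ternary encodings. The value of N is the 10,000 used earlier in the talk; the variable names are illustrative.

```python
from math import comb

n_snps = 10_000

n_pairs = comb(n_snps, 2)
n_triples = comb(n_snps, 3)
growth = n_triples / n_pairs  # equals (n_snps - 2) / 3

groups_binary_pair = 2 ** 2   # genotypes 00, 01, 10, 11
groups_ternary_pair = 3 ** 2  # each SNP takes a value in {0, 1, 2}

print(f"{n_triples:,} triples, about {growth:.0f}x the number of pairs")
print(groups_binary_pair, groups_ternary_pair)
```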

Thank You ! Questions?
