
Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis
Dinu et al., J. Biomedical Informatics 40 (2007) 750-760

Pathway/SNP
• A software application that lets its user bring pathway data into the analysis of high-density genomic SNP data derived from disease association studies.
• Its purpose is to analyze the underlying etiology of disease by integrating pathway information with statistical and data mining approaches.

Background:
• Large-scale genome-wide association (GWA) studies are now available to identify genomic mutations associated with a wide range of diseases.
• Complex diseases such as diabetes and hypertension are believed to be caused by the interaction of multiple genes and environmental factors.
• The number of mathematical operations required to assess the association between multiple interacting genomic loci and disease grows exponentially with the number of interacting SNPs.
• Various statistical approaches (e.g., stepwise algorithms, varying parameters) are used to analyze these associations.
• Data mining approaches are used for multi-locus association with traits.

Computational complexity of a brute-force ‘full-scan’ interaction analysis between all possible combinations of n genomic markers and a disease is exponential in n.
For the Affymetrix 100K SNP GeneChip, m = 100,000 genomic markers. A full scan requires:

# of interacting markers (n)    # of tests
2                               5.00 x 10^9
3                               1.66 x 10^14
4                               4.16 x 10^18
5                               8.33 x 10^22

The fastest supercomputer can perform ~3.67 x 10^14 floating-point operations per second.
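The test counts above are consistent with the binomial coefficient C(m, n), the number of ways to choose n markers out of m. A minimal sketch that recomputes them and derives a rough lower bound on run time; the function name and the "one operation per test" figure are assumptions made for illustration, not part of the slides.

```python
from math import comb

M = 100_000        # markers on the Affymetrix 100K GeneChip (from the slide)
FLOPS = 3.67e14    # supercomputer throughput quoted on the slide, per second


def full_scan_tests(m: int, order: int) -> int:
    """Number of distinct marker combinations of the given interaction order."""
    return comb(m, order)


for order in range(2, 6):
    tests = full_scan_tests(M, order)
    # Optimistic bound: assumes a single operation per test (an assumption;
    # a real association test costs far more than one operation).
    seconds = tests / FLOPS
    print(f"{order}-way: {tests:.2e} tests, at least {seconds:.2e} s")
```

Even under this optimistic assumption, the 5-way scan needs on the order of 10^8 seconds, which is why restricting the search to pathway SNPs is attractive.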

Conclusion:
• A “one model fits all” approach is not optimal.
Pathway/SNP
• Designed as an exploratory tool that integrates pathway information, gene annotation, and SNP location to identify the pathways most strongly associated with disease.
• Architecture: 3-tier architecture written in Java
1> Presentation tier: written in JavaServer Pages
2> Logic tier: statistical and data mining algorithms in Java
3> Data tier: genotype, phenotype, and annotation data stored in a heavily indexed relational database.

Biological Data
• Annotations for 561 pathways: 181 KEGG, 314 BioCarta, and 66 GenMAPP human pathways.
• Gene annotation data from NCBI Entrez Gene.
• Affymetrix 100k and 500k GeneChip microarray annotation files are preloaded in the database.
Relevant SNPs: A SNP is considered relevant to a biological pathway if it is located within 10,000 base pairs (bp) of a pathway gene’s location.
Relevant Genes: The gene list is first extracted from a particular database, then augmented from the literature and Entrez Gene.
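The 10,000 bp relevance rule can be sketched as a simple interval check. This is an illustration only: the function names, the tuple layout, and the choice to measure distance from the gene's start and end coordinates are assumptions, and the coordinates in the usage note are made up.

```python
WINDOW_BP = 10_000  # distance threshold from the slide


def is_relevant(snp_pos, gene_start, gene_end, window=WINDOW_BP):
    """True if the SNP lies inside the gene or within `window` bp of either end.

    Assumes all coordinates are on the same chromosome and genome build.
    """
    return gene_start - window <= snp_pos <= gene_end + window


def relevant_snps(snps, genes, window=WINDOW_BP):
    """Return IDs of SNPs near at least one pathway gene.

    snps  : iterable of (snp_id, chromosome, position)
    genes : iterable of (gene_symbol, chromosome, start, end)
    """
    hits = []
    for snp_id, chrom, pos in snps:
        if any(chrom == g_chrom and is_relevant(pos, start, end, window)
               for _, g_chrom, start, end in genes):
            hits.append(snp_id)
    return hits
```

For example, with a hypothetical gene spanning 10,000 to 12,000 on chromosome 1, a SNP at position 5,000 on chromosome 1 is relevant, while a SNP at 25,000 or a SNP on chromosome 2 is not.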

Algorithms:
1> Single SNP association with disease: chi-square test and Armitage’s trend test
2> Pathway association with disease: U-statistics or data mining algorithms
3> Permutation-based statistical significance inference: Bonferroni adjustment or false discovery rate (FDR)
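Permutation-based inference (item 3) can be sketched as follows: shuffle the case/control labels many times, recompute the statistic each time, and report how often the shuffled statistic is at least as large as the observed one. The function names, the toy statistic, and the default of 999 permutations are assumptions for illustration, not the tool's actual settings.

```python
import random


def permutation_p_value(statistic, genotypes, labels, n_perm=999, seed=0):
    """Empirical p-value of `statistic` under random relabelling of subjects."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # breaks any genotype/phenotype link
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)


def mean_dosage_difference(genotypes, labels):
    """Toy statistic: |mean risk-allele count in cases - mean in controls|."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
```

Any pathway-level statistic (a U-statistic z-score, a classifier's percent correct) could be passed in place of the toy statistic, at the cost of recomputing it once per permutation.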

Single SNP association with disease:
1> Chi-square test
   Allele-based
   Genotype-based: 2 degrees of freedom
2> Armitage’s trend test
   1 degree of freedom
   More preferred
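A minimal sketch of the two chi-square variants, assuming genotype counts are given in the order (aa, ab, bb) for cases and for controls. The function names are illustrative. The p-values use the closed-form chi-square tail probabilities for 1 and 2 degrees of freedom, so no statistics library is needed.

```python
from math import erfc, exp, sqrt


def pearson_chi2(table):
    """Pearson chi-square statistic for a table of counts (rows x columns).

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def genotype_test(cases, controls):
    """2 x 3 genotype table (aa, ab, bb): 2 degrees of freedom."""
    chi2 = pearson_chi2([cases, controls])
    return chi2, exp(-chi2 / 2)            # chi-square tail probability, 2 df


def allele_test(cases, controls):
    """2 x 2 allele table (a, b): 1 degree of freedom."""
    def alleles(genotype_counts):
        aa, ab, bb = genotype_counts
        return [2 * aa + ab, ab + 2 * bb]  # each subject contributes 2 alleles
    chi2 = pearson_chi2([alleles(cases), alleles(controls)])
    return chi2, erfc(sqrt(chi2 / 2))      # chi-square tail probability, 1 df
```

The allele-based version doubles the apparent sample size by counting alleles rather than subjects, which is one reason a subject-level 1-degree-of-freedom test such as Armitage's trend test is often preferred.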

Armitage’s Trend Test
This test checks whether case vs. control status shows a ‘trend’ across genotypes, under different models of association between a SNP and disease. Each model assigns a score to every genotype.
Additive model: tests an association that depends additively on the number of risk (or minor) alleles: 0 for homozygous non-risk, 1 for heterozygous, and 2 for homozygous risk.
Dominant model: tests the association of carrying at least one risk allele, i.e. homozygous risk (1) or heterozygous (1), vs. carrying none, i.e. homozygous non-risk (0).
Recessive model: tests the association of carrying two risk alleles, i.e. homozygous risk (1), vs. carrying at least one non-risk allele, i.e. homozygous non-risk (0) or heterozygous (0).
The Armitage’s trend test statistic has 1 degree of freedom.
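The three models differ only in the score given to each genotype, so one function can cover all of them. Below is a sketch of the standard Cochran-Armitage trend statistic under that view, assuming genotype counts in the order (homozygous non-risk, heterozygous, homozygous risk). The names `WEIGHTS` and `armitage_trend` are illustrative, not taken from Pathway/SNP.

```python
from math import erfc, sqrt

# Genotype scores per model, ordered (hom. non-risk, heterozygous, hom. risk).
WEIGHTS = {
    "additive":  (0, 1, 2),
    "dominant":  (0, 1, 1),
    "recessive": (0, 0, 1),
}


def armitage_trend(cases, controls, model="additive"):
    """Cochran-Armitage trend chi-square (1 df) and its p-value.

    Assumes both groups are non-empty and the SNP is not monomorphic
    (otherwise the denominator is zero).
    """
    w = WEIGHTS[model]
    col = [r + s for r, s in zip(cases, controls)]   # genotype totals
    n_cases, n_total = sum(cases), sum(col)
    n_controls = n_total - n_cases
    sum_w_col = sum(wi * ci for wi, ci in zip(w, col))
    # Trend numerator: N * sum(w_i * r_i) - R * sum(w_i * n_i)
    t = n_total * sum(wi * ri for wi, ri in zip(w, cases)) - n_cases * sum_w_col
    spread = n_total * sum(wi * wi * ci for wi, ci in zip(w, col)) - sum_w_col ** 2
    chi2 = n_total * t * t / (n_cases * n_controls * spread)
    return chi2, erfc(sqrt(chi2 / 2))                # chi-square tail, 1 df
```

Calling `armitage_trend(cases, controls, model="dominant")` or `model="recessive"` reuses the same formula with a different scoring of the genotypes.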

U-statistics for pathway association with disease:
• A non-parametric algorithm that can simultaneously test the association of multiple markers with disease, using only a single degree of freedom.
• It first computes a score over all markers (the set of SNPs) for each pair of subjects within the case group and within the control group. The genetic score for a pair of subjects is measured by a “kernel” function, such as recessive, dominant, or linear dosage.
• It then compares the average scores between cases and controls using a global statistic with one degree of freedom, instead of the many implicit degrees of freedom that arise when many markers are analyzed separately.
• The resulting z-scores can be used to rank pathways and to calculate an approximate p-value.

Notation: b is the risk allele and a is the non-risk allele.
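The pairwise scoring step can be sketched as below, with each genotype coded as the number of b alleles (aa = 0, ab = 1, bb = 2). The allele-match kernel shown here is one plausible definition chosen for illustration and may differ from the kernels in Pathway/SNP. The standardisation of the difference into a z-score is omitted.

```python
from itertools import combinations


def allele_match(g1, g2):
    """Illustrative kernel: score 2, 1 or 0 by genotype similarity at one SNP."""
    return 2 - abs(g1 - g2)


def mean_pair_score(group, kernel):
    """Average, over all subject pairs, of the kernel summed across SNPs.

    `group` is a list of genotype vectors; it needs at least two subjects.
    """
    pairs = list(combinations(group, 2))
    total = sum(
        sum(kernel(a, b) for a, b in zip(subj1, subj2))
        for subj1, subj2 in pairs
    )
    return total / len(pairs)


def u_score_difference(cases, controls, kernel=allele_match):
    """Difference in average pair score between cases and controls."""
    return mean_pair_score(cases, kernel) - mean_pair_score(controls, kernel)
```

A positive difference means cases resemble each other at the pathway SNPs more than controls do. Other kernels (dominant, recessive, linear dosage) would be passed through the `kernel` argument.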

Data mining for pathway association with disease:
• Data mining classifiers (e.g., SVM, Random Forests, logistic regression, tree-based methods) can be used to explore the association between pathways and disease.
• The “percent correct” classification of cases and controls, estimated from the genotypes at the pathway SNPs, can be used as a statistic measuring the association between a pathway and disease.
• Classifiers are provided through the Weka data mining program and are run by default with 10-fold cross-validation.
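The “percent correct” statistic can be illustrated without Weka. The sketch below swaps in a deliberately simple nearest-centroid classifier (not one of the classifiers named above) and uses plain interleaved folds. All function names are assumptions, and the data should be shuffled beforehand if subjects are stored in a systematic order.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier: one mean genotype vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to genotype vector x."""
    def sq_dist(centroid):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def percent_correct(X, y, folds=10):
    """k-fold cross-validated classification accuracy, in percent."""
    correct = 0
    for k in range(folds):
        # Subject i is held out in fold i mod `folds`.
        test_idx = [i for i in range(len(X)) if i % folds == k]
        train_idx = [i for i in range(len(X)) if i % folds != k]
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        correct += sum(nearest_centroid_predict(model, X[i]) == y[i]
                       for i in test_idx)
    return 100.0 * correct / len(X)
```

Because every subject is predicted by a model that never saw it, the percentage is a less optimistic measure than accuracy on the training data.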

Multiple testing corrections:
• A good test statistic value may have occurred by chance alone. Multiple testing corrections are designed to help ensure, as far as possible, that this is not the case.
Bonferroni adjustments:
• The Bonferroni adjustment multiplies each individual p-value by the number of times that same test was performed (the number of markers tested).
• The adjusted value, which is quite conservative, estimates the probability that this test would have come out this well by chance at least once across all the times it was performed.
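As a minimal sketch, the Bonferroni adjustment is a single multiplication per p-value. Capping the result at 1 is the usual convention, added here so the output stays a valid probability; the function name is illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With 100,000 markers, a raw p-value must be below 5 x 10^-7 for its adjusted value to stay under 0.05, which shows how conservative the correction is.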

Statistical significance using permutation-based FDR:
• The False Discovery Rate (FDR) option calculates the False Discovery Rate for each statistical test selected. This calculation is itself based on the p-values from the original tests.
• The interpretation of the False Discovery Rate is “What would the rate of false discoveries (false positives) be if I accepted ALL of the tests whose p-value is at or below the p-value of this test?”
• The aim of the FDR procedure is to control, at a desired level α (e.g., 0.05), the proportion of type I errors (false positives) among all significant results.

• Suppose m hypotheses are tested and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V are truly null; that is, V is the number of type I errors, or false positive results. The False Discovery Rate is defined as
FDR = E(V/R | R > 0) · P(R > 0),
that is, the expected proportion of false positive findings among all rejected hypotheses, times the probability of making at least one rejection.
• This procedure may yield higher statistical power than controlling the family-wise error rate. Pathways with low FDR (e.g., below 0.05) are considered significant.
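One common way to turn a list of p-values into per-test FDR values is the Benjamini-Hochberg step-up procedure. It is shown here as a stand-in to make the “accept all tests at or below this p-value” reading concrete. The slides describe a permutation-based FDR, so this is not necessarily the computation Pathway/SNP performs.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by rising p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A pathway whose adjusted value falls below the chosen level (e.g., 0.05) would be reported as significant. Unlike Bonferroni, the multiplier m / rank shrinks for tests further down the ranked list.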

Using Pathway/SNP to analyze an AMD (age-related macular degeneration) data set:
• The data set contains 116,204 genome-wide SNPs genotyped with the Affymetrix 100k GeneChip.
• Case-control study of 146 Caucasian individuals.
• 50 controls and 96 cases with advanced AMD.
• Of the cases, 50 patients have wet AMD (severe) and 46 have dry AMD.
• The initial analysis identified a mutation in complement factor H (CFH) on chromosome 1 as strongly associated with AMD.
• Identified 46 genes (from KEGG and NCBI genome build 35).
• A total of 94 SNPs are relevant (within 10,000 bp).
• Armitage’s trend test with the additive model, U-statistics with 5 kernels (dominant, recessive, linear, quadratic, allele match), and 4 data mining algorithms (J48, Random Forests, SVM, Naïve Bayes) were run.
• Subjects were compared in 4 groupings: control vs. all cases (wet + dry), control vs. wet AMD, control vs. dry AMD, and dry AMD vs. wet AMD.

Identified two additional pathway genes, C7 and MBL2:

Explanation of the difference between progressing to dry AMD (the less severe form) and progressing to wet AMD (the more severe form)

Lessons learned:
• The potential need for high-performance computation to support a tool like Pathway/SNP
• The need for permutation testing to evaluate the results of the analysis
• Dealing with different versions of the biological data and knowledge
• Why different analysis algorithms might work better with different data sets and different diseases
• The complexity of the “clinical phenotype”
