Disease-Associated Multi-SNP Combination Search: A Combinatorial Approach

Our contributions
• A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
• Multi-SNP combinations significantly associated with diseases were found.
• For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
• For an autoimmune disorder dataset (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
• For tick-borne encephalitis virus-induced disease, a multi-SNP combination significantly associated with the severity of the disease was found within a group of genes showing a high degree of linkage disequilibrium.
• Model-fitting disease susceptibility prediction methods based on the developed search methods were proposed.

SNP and Disease
• SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population
  • 0 = wild type/major (frequency) allele
  • 1 = mutation/minor (frequency) allele
  • 2 = heterozygous allele
• Searching for genetic risk factors for diseases
  • Monogenic diseases: a mutated gene is entirely responsible for the disease
  • Complex diseases: affected by the interaction of multiple genes
• Significance of a risk factor is usually measured by Risk Rate or Odds Ratio
• We measure significance by the p-value of the set of genotypes defined by the risk factor

Example (x = any value):
  MSC:  x x 1 x x 2 x x x
  0 1 1 0 1 2 1 0 2  sick
  0 1 1 1 0 2 0 0 1  sick
  0 0 1 0 0 0 0 2 1  sick
  0 1 1 1 1 2 0 0 1  sick
  0 0 1 0 1 2 1 0 2  sick
  0 1 0 0 1 1 0 0 2  healthy
  0 1 1 0 1 2 0 0 2  healthy
The genotypes matching the MSC are 4 sick : 1 healthy; this distribution is then checked for significance.

Statistical significance
• A multi-SNP combination (MSC) defines a set of case and control individuals
• An MSC is considered statistically significant if the case/control frequency distribution in that set has p-value < 0.05
• Many reported findings are not reproducible in different populations. It is believed that this happens because the p-values are not adjusted for multiple testing

Disease-Associated Multi-SNP Combinations Search
• Given: a population of n genotypes (or haplotypes), each containing values of m SNPs from {0,1,2}, and disease status (case or control)
• Find: all multi-SNP combinations whose frequency distribution has a multiple-testing-adjusted p-value below 0.05

Disease association analysis
• Analysis of variation in suspected genes in case and control individuals aims to identify SNPs with considerably higher frequencies among the case individuals than among the control individuals
• Most searches are done on a SNP-by-SNP basis
• Recently, two-SNP analysis has shown promising results (Marchini et al., 2005)
• Multi-SNP analyses are expected to find even stronger disease associations
• Common diseases can be caused by combinations of variations in several unlinked genes (SNPs)
• We address the computational challenge of searching for such multi-gene causal combinations
• The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs).
• How can associated multi-SNP combinations be found without checking all of them?
• Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among cases is considerably higher than among controls.
• If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
• Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6)
• We adjust the resulting p-values via randomization
• Unadjusted p-value: probability of the case/control distribution in the set defined by an MSC, computed from the binomial distribution
• Multiple-testing-adjusted p-value: randomization
• Randomly permute the disease status of the population to generate 10000 instances.
• Apply the search methods to each instance to get MSCs.
• Compute the fraction of instances whose MSCs have an unadjusted p-value at least as small (that is, at least as significant) as the observed p-value.
• In our search we report only MSCs with adjusted p-value < 0.05

Combinatorial Search (CS) for Disease Association
• Checks all one-SNP, two-SNP, ..., m-SNP case-closed MSCs
• The case-closure of an MSC C is an MSC C' with the maximum number of SNPs that covers the same set of cases and the minimum number of controls.
• Case-closure allows statistically significant MSCs to be found at an earlier stage of the search.
• Trivial MSCs and MSCs that coincide after case-closure are avoided, which significantly speeds up the search.
• Faster than exhaustive search
• Finds more significant associations at an early stage of the search
• Still slow for genome-wide studies

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction
• For the given training dataset and tested genotype, consider two cases:
  • the tested genotype is added to the training dataset as sick
  • the tested genotype is added to the training dataset as healthy
• For both cases, obtain a clustering by applying CGS to find:
  • the most disease-associated MSC (defines a set of sick genotypes)
  • the most disease-resistant MSC (defines a set of healthy genotypes)
• Remove from the original dataset whichever of the two sets is larger
• Repeat this procedure until all genotypes are removed
• Predict the susceptibility of the tested genotype according to the case with the lower clustering entropy.

Results for Disease Susceptibility Prediction
• Quality measure

Maximum Case(Control)-Free Cluster Problem
• Find a maximum-size cluster C containing only cases or only controls
• Complimentary Greedy Search (CGS):
  1. Find the SNP and allele value that remove a set of genotypes with the highest ratio of controls to cases.
  2. Add the SNP to the resulting MSC.
  3. Repeat 1-2 until all controls are removed. The resulting MSC defines a subset of sick genotypes.
  4. Adjust the p-value of the resulting MSC for multiple testing.
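Steps 1-3 of CGS can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the tie-breaking rule (prefer the choice removing more controls), treating a removal with zero cases as an infinite ratio, and the requirement that at least one case stays in the cluster are all assumptions made here. Step 4 (the multiple-testing adjustment) is left out. The input is the seven example genotypes from the SNP and Disease example.

```python
import math

def complimentary_greedy_search(genotypes, is_case):
    """Greedily build an MSC (dict: 0-based SNP index -> required value)
    whose matching genotypes contain no controls (hypothetical helper)."""
    remaining = list(range(len(genotypes)))
    msc = {}
    num_snps = len(genotypes[0])
    while any(not is_case[i] for i in remaining):
        best_key, best_choice = None, None
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in (0, 1, 2):
                kept = [i for i in remaining if genotypes[i][snp] == value]
                removed = [i for i in remaining if genotypes[i][snp] != value]
                removed_controls = sum(1 for i in removed if not is_case[i])
                removed_cases = len(removed) - removed_controls
                # Assumption: skip choices that remove no control or keep no case.
                if removed_controls == 0 or not any(is_case[i] for i in kept):
                    continue
                # Assumption: removing controls only counts as an infinite ratio.
                ratio = removed_controls / removed_cases if removed_cases else math.inf
                key = (ratio, removed_controls)
                if best_key is None or key > best_key:
                    best_key, best_choice = key, (snp, value, kept)
        if best_choice is None:
            break  # the remaining controls cannot be separated from the cases
        snp, value, remaining = best_choice
        msc[snp] = value
    return msc, remaining

genotypes = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
is_case = [True, True, True, True, True, False, False]  # sick = True

msc, cluster = complimentary_greedy_search(genotypes, is_case)
print(msc, cluster)
```

On this toy input the sketch first fixes the third SNP to 1 (which removes one healthy genotype and no sick ones) and then the ninth SNP to 1, leaving a cluster of three sick genotypes.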
Data Sets
• [3] Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 non-disease genotypes (Daly et al., 2001).
• [10] Autoimmune disorder: 1024 genotypes with 108 SNPs covering the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 non-disease genotypes (Ueda et al., 2003).
• [4] Tick-borne encephalitis: 75 genotypes with 41 SNPs covering the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 non-disease genotypes (Barkash et al., 2006).

Disease Susceptibility Prediction Problem
• Given a sample population S (a training set) and one more individual t not in S with known SNPs but unknown disease status (the testing individual), find (predict) the unknown disease status.
• Disease Clustering Problem: given a population sample S, find a partition P of S into clusters S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P) for a given bound on the number of individuals who are assigned incorrect status in clusters of the partition P, error(P) < *|P|.
• Leave-one-out cross-validation results
• Comparison of 5 prediction methods on the [4] data on all SNPs: the area under the CSP's ROC curve is 0.87 vs 0.52 under the SVM's curve.

Results/comparison of searching methods
• Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest PPV.
• Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
• Complimentary greedy search can be used in susceptibility prediction
• Optimization approach to prediction
• The new susceptibility prediction is 8% better than the best previously known
• MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility
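The significance test described under "Statistical significance" and the randomization step can be sketched as below. This is only an illustration under stated assumptions: the unadjusted p-value is taken as the upper binomial tail (at least the observed number of cases in the MSC-defined set, with the overall case fraction as the success probability), the search method is replaced by a stand-in that scans single-SNP MSCs only, and the adjusted p-value counts permuted instances whose best p-value is at most the observed one. All function names are hypothetical.

```python
import math
import random

def binomial_tail(k, n, q):
    """P[X >= k] for X ~ Binomial(n, q)."""
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k, n + 1))

def unadjusted_p(msc, genotypes, is_case):
    """Upper-tail binomial p-value of the case count in the set defined by msc
    (msc maps a 0-based SNP index to its required value)."""
    matched = [c for g, c in zip(genotypes, is_case)
               if all(g[snp] == value for snp, value in msc.items())]
    case_fraction = sum(is_case) / len(is_case)
    return binomial_tail(sum(matched), len(matched), case_fraction)

def best_single_snp_p(genotypes, is_case):
    """Stand-in search method (assumption): best p-value over one-SNP MSCs."""
    return min(unadjusted_p({snp: value}, genotypes, is_case)
               for snp in range(len(genotypes[0])) for value in (0, 1, 2))

def adjusted_p(observed_p, genotypes, is_case,
               search=best_single_snp_p, n_perm=1000, seed=0):
    """Fraction of status-permuted instances whose best p-value <= observed_p."""
    rng = random.Random(seed)
    shuffled = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute disease status, keep genotypes fixed
        if search(genotypes, shuffled) <= observed_p:
            hits += 1
    return hits / n_perm

genotypes = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
is_case = [True, True, True, True, True, False, False]

example_msc = {2: 1, 5: 2}  # "x x 1 x x 2 x x x" with 0-based SNP indices
p_obs = unadjusted_p(example_msc, genotypes, is_case)
p_adj = adjusted_p(p_obs, genotypes, is_case, n_perm=200)
print(p_obs, p_adj)
```

The `search` parameter is where a multi-SNP method such as CS or CGS would be plugged in; the permutation count is kept small here only so the toy example runs quickly, whereas the text uses 10000 instances.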
