Uncovering Biomarkers for Chronic Fatigue Syndrome Using Data Mining Techniques

Mining of Microarray, Proteomics, and Clinical Data for Improved Identification of Chronic Fatigue Syndrome • Zoran Obradovic, Hongbo Xie, Slobodan Vucetic • Information Science and Technology Center, Temple University, Philadelphia

Biomarker Identification • Objective: • Select a small number of informative attributes (genes, proteins) • Useful for • disease diagnosis, • disease progress monitoring, • evaluation of treatment effects, etc. • Challenges: • many irrelevant attributes • uncertainty due to • small sample size relative to the number of attributes • lack of replicates

Approaches to Biomarker Identification • Select • significantly differentially expressed genes (in microarray data), or • the most discriminative mass-charge peaks (in proteomics data) • Measure differences among classes of samples using • statistical tests: • t-test, ANOVA, and non-parametric tests • data mining procedures: • SVM, neural networks, etc.

Limitations • Very noisy data are prone to false discoveries • Relationships among selected attributes are often ignored • For many diseases, multiple data resources are available; however, how to use them together is often unclear

Our Approach • Motivation: • For various diseases, the most discriminative genes are likely to correspond to a limited set of biological functions or pathways • Hypothesis: • Focusing on key functional expression patterns could improve accuracy compared to analyzing individual gene expression readings • Approach: • Exclude genes whose biological properties deviate from those of the other selected genes

Challenges of Biomarker Identification for Chronic Fatigue Syndrome (CFS) • CFS diagnosis is less accurate than for some other diseases (e.g. cancer) • Pathophysiology of CFS is insufficiently understood • Diagnosis of CFS depends heavily on clinical practice • Patients’ responses are often subjective • There are no standard criteria or laboratory techniques to reduce the risk of malpractice

CFS Data • CFS Microarray data • 79 arrays representing 39 clinically identified CFS samples and 40 non-CFS samples • 20,160 genes for each sample • Using the SOURCE database (http://source.stanford.edu), 13,213 genes were annotated with 4,110 unique GO terms • CFS Proteomics data • 65 samples representing 33 CFS and 32 non-CFS samples • Each sample was profiled under 48 conditions, with factors such as fractionation, protein-chip surfaces, and binding and elution conditions • CFS Clinical data • 227 samples representing 43 CFS, 60 NF, and 123 others; CFS/NF status is defined by the Empiric attribute • Each sample contained 85 attributes

Task 1: Identifying Biomarker Genes from CFS Microarray Data • Objective: • Identify a robust set of genes discriminating patients (CFS) from normal subjects (NF) • Method: • Identify a Subset of Genes (SG) significantly different between CFS and NF in the training samples (using a non-parametric statistical test) • Select a subset of SG annotated with a specific function (using domain knowledge from GO) • Evaluate the method (using leave-one-out cross-validation)
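The evaluation step can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that attribute selection is repeated inside every leave-one-out fold, so the held-out sample never influences which attributes are chosen; the slide does not say how the selection was scheduled. The mean-gap selector and the nearest-centroid classifier are hypothetical stand-ins for the KW/GO selection and for the SVM or decision tree.

```python
def leave_one_out_accuracy(samples, labels, select_features, fit, predict):
    """Hold out each sample once; select features and train on the rest."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Selection only sees the training fold (assumption, see lead-in).
        features = select_features(train_x, train_y)
        model = fit([[row[j] for j in features] for row in train_x], train_y)
        guess = predict(model, [samples[i][j] for j in features])
        correct += guess == labels[i]
    return correct / len(samples)


def class_means(rows, labels):
    """Per-class mean of every column."""
    means = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        means[label] = [sum(col) / len(members) for col in zip(*members)]
    return means


def select_by_mean_gap(train_x, train_y, top=1):
    """Hypothetical stand-in selector: columns with the largest class-mean gap.

    Assumes exactly two classes are present in the training fold.
    """
    means = class_means(train_x, train_y)
    first, second = sorted(means)
    gaps = [abs(a - b) for a, b in zip(means[first], means[second])]
    return sorted(range(len(gaps)), key=lambda j: -gaps[j])[:top]


def fit_centroids(rows, labels):
    """Hypothetical stand-in classifier: store one centroid per class."""
    return class_means(rows, labels)


def predict_centroid(model, row):
    """Return the class whose centroid is closest in squared distance."""
    return min(
        model,
        key=lambda label: sum((x - m) ** 2 for x, m in zip(row, model[label])),
    )
```

Any selector and classifier with the same call signatures can be passed in, which is why the loop takes them as arguments.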

Identifying a Subset of Significant Genes by Kruskal-Wallis (KW) Test • For each gene, its expression values in CFS samples and NF samples are compared, and a p-value is obtained relative to a random population • A gene with a p-value below a threshold is selected as significant (SG) • Traditional approaches use these SG as markers to discriminate classes of samples. However: • A large proportion of such genes are irrelevant; applying false discovery rate control does not help much in most cases • Functional correlations among these genes are ignored
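The per-gene test can be sketched in plain Python as below. This is an illustration under stated assumptions: two groups only (CFS and NF), the usual chi-square approximation with one degree of freedom for the p-value, and a hypothetical input layout mapping each gene id to its CFS and NF values. In practice a library routine such as scipy.stats.kruskal would normally be used instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two groups using the chi-square(1) approximation."""
    values = list(a) + list(b)
    n = len(values)
    ranks = average_ranks(values)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)
    ) - 3 * (n + 1)
    # Standard tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Survival function of chi-square with 1 degree of freedom.
    return h, math.erfc(math.sqrt(h / 2))


def select_significant_genes(expression, threshold=0.05):
    """expression: gene id -> (CFS values, NF values). Hypothetical layout."""
    selected = []
    for gene, (cfs_values, nf_values) in expression.items():
        _, p = kruskal_wallis_two_groups(cfs_values, nf_values)
        if p < threshold:
            selected.append(gene)
    return selected
```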

Selecting Significant Functions by Hypergeometric Test • Given: Set of k genes selected by KW test • Objective: Determine whether a given term GOi is overrepresented in the selection • The idea: If the gene selection were random, the number Xi of selected genes annotated with GOi would follow a hypergeometric distribution • Approach: The significance of GOi is measured using the p-value of GOi = P(X ≥ Xi), where • X ~ H(K, k, ki) is the number of GOi-annotated genes in a random selection of k genes • ki is the number of genes annotated with GOi
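The upper-tail probability can be computed directly from binomial coefficients. The sketch below assumes that K in H(K, k, ki) is the total number of annotated genes, which the slide does not state explicitly; the function and parameter names are illustrative. A library routine such as scipy.stats.hypergeom.sf would be the usual alternative.

```python
from math import comb


def hypergeom_upper_tail(total_genes, selected, annotated, observed):
    """P(X >= observed) for a hypergeometric count X.

    total_genes: population size (assumed meaning of K)
    selected:    number of genes picked by the KW test (k)
    annotated:   number of genes carrying the GO term (ki)
    observed:    number of selected genes carrying the GO term (Xi)
    """
    upper = min(selected, annotated)
    favourable = sum(
        comb(annotated, x) * comb(total_genes - annotated, selected - x)
        for x in range(observed, upper + 1)
    )
    return favourable / comb(total_genes, selected)
```

A small p-value means the term appears among the selected genes more often than a random selection of the same size would produce.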

KW Statistical Attribute Selection • All genes selected by KW test, {gi : p-value of gi < threshold}, are used as attributes in classification

Knowledge Based Selection • TopGO: • Select the GO term with the smallest p-value, GO*. Use only genes selected by KW test and annotated with GO* • nGO: • Select the GO terms with the n smallest p-values, GOn. Use only genes selected by KW test and annotated with GOn • AllTopGO: • Use all genes annotated with GO* • AllSignificantGO: • Use all significant genes annotated with one of the significant GO terms
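The TopGO and nGO rules differ only in how many terms are kept, so one function can sketch both. The data layout (a dict of GO-term p-values and a dict mapping each gene to its set of GO terms) and all identifiers are assumptions made for illustration.

```python
def go_filtered_genes(significant_genes, go_pvalues, annotations, n=1):
    """Keep significant genes annotated with one of the n best GO terms.

    significant_genes: genes that passed the KW test
    go_pvalues:        GO term -> overrepresentation p-value
    annotations:       gene -> set of GO terms (hypothetical layout)
    n:                 1 gives TopGO, larger values give nGO
    """
    best_terms = sorted(go_pvalues, key=go_pvalues.get)[:n]
    wanted = set(best_terms)
    kept = [g for g in significant_genes if annotations.get(g, set()) & wanted]
    return best_terms, kept
```

Genes without any annotation are dropped, since they cannot match a selected term.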

Comparison of 5 Attribute Selection Methods on CFS Data • Using domain knowledge in the attribute selection procedure improved prediction accuracy • TopGO was the most accurate domain knowledge based attribute selection method • Decision Tree classifiers were less accurate than the corresponding Support Vector Machines (SVM)

Comparison Details of 5 Attribute Selection Methods for p-value = 0.05

Further Comparison of KW Statistical vs. TopGO Selection • Overall, knowledge based TopGO selection was the most accurate (SVM: 58% vs. 72%; Decision Tree: 55% vs. 62%) • For very small thresholds, KW Statistical selection was slightly more accurate • However, knowledge based TopGO selection always used far fewer attributes

Comparison Using Same Number of Attributes (Statistical vs. TopGO by SVM)

Comparison Using Same Number of Attributes (Statistical vs. TopGO by Decision Tree)

Comparison of SVM vs. Decision Trees for TopGO Attribute Selection

Most Overrepresented GO Terms among Significantly Differentially Expressed Genes in CFS (by TopGO) * Top 2 functions are consistent with previously reported results on CFS (Whistler, T. et al., Transl Med. 2003; 1: 10)

mRNA Processing Genes Identified as Potential Biomarkers

Using Only Significant Genes Associated with a Given Function • Several key functions could discriminate well between Chronic Fatigue Syndrome and the non-fatigue population • How to select the best function(s) for out-of-sample prediction is still a challenge • The most overrepresented functions identified by our analysis were the most discriminative

Accuracy by Using Only Significant Genes Associated with a Given Function (p-value < 0.05)

Evaluation on Additional Data • Central Nervous System (CNS) • "Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression", Letters to Nature, Nature, 415:436-442, January 2002. • CNS Data Source: • http://www-genome.wi.mit.edu/mpr/CNS/ • Description: The data set contains 60 patient samples, of which 21 are survivors and 39 are failures. There are 7129 genes in the data set.

Results on CNS Data • Findings were consistent with the CFS analysis

Task 2: Proteomics Based Approach to Diagnostics

Proteomics Based CFS Data Analysis • Overall data preprocessing protocol: • Baseline correction • Peak alignment • Spectra normalization • Spectrum smoothing • Normalization using QC samples: • For each test sample under every condition, its m/z value is divided by the control QC m/z value, and the logarithm of the result is taken (giving a relative ratio for each test sample).
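The QC normalization step might look like the sketch below. It rests on several assumptions not stated on the slide: the values being divided are readings at matching m/z positions of the test and QC spectra, all readings are positive, and the natural logarithm is used. The function name is illustrative.

```python
import math


def log_ratio_to_qc(sample_values, qc_values):
    """Return log(sample / QC) position by position.

    Both sequences are assumed to be aligned to the same m/z positions
    and to contain strictly positive readings.
    """
    if len(sample_values) != len(qc_values):
        raise ValueError("sample and QC spectra must have the same length")
    return [math.log(s / q) for s, q in zip(sample_values, qc_values)]
```

A value of zero means the test sample matches the QC control at that position; positive and negative values indicate higher and lower readings respectively.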

Proteomics Based CFS Classification Procedure • Used leave-one-sample-out cross-validation to train and test on the data • The prediction for replicates of the same sample is obtained by voting, with ties labeled as CFS • Kruskal-Wallis analysis of ranks and the Median test are applied to all mass/charge values. P-values are ranked, and peaks with a p-value below a threshold are selected as attributes. • A p-value threshold of 0.05 resulted in the selection of over 2000 attributes • Trained an SVM classifier with the selected attributes and evaluated it on out-of-sample test data
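The replicate voting rule, including the tie-breaking toward CFS, is small enough to sketch directly. The label strings and function name are illustrative assumptions.

```python
def vote_over_replicates(replicate_predictions, tie_label="CFS", other_label="NF"):
    """Majority vote over one sample's replicate predictions.

    A tie is resolved in favour of tie_label, matching the rule on the slide.
    """
    for_tie_label = sum(1 for p in replicate_predictions if p == tie_label)
    for_other = len(replicate_predictions) - for_tie_label
    return tie_label if for_tie_label >= for_other else other_label
```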

Result of Proteomics Based CFS Classification • The accuracy of our method in separating CFS samples from NF samples was only slightly better than that of a trivial predictor • IMAC chips provided the best overall results. The accuracy of an ensemble of IMAC classifiers under leave-one-sample-out cross-validation was 51%.

Task 3: Combining Microarray Data and Proteomics Data for CFS Diagnoses (SVM) • Used integrated data of 38 subjects (20 CFS and 18 non-CFS samples) with both proteomics and microarray data • Proteomics-based and microarray-based CFS predictions agreed for 50% of the samples (19 subjects) • When the two classification methods agreed, the accuracy of the combined approach improved significantly to 79%
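The agreement-based evaluation can be sketched as follows: count how often the two classifiers agree, then measure accuracy only on the subjects where they do. This is an illustrative reading of the slide, with hypothetical names; it does not reproduce the reported figures.

```python
def agreement_summary(predictions_a, predictions_b, truth):
    """Return (agreement rate, accuracy on the agreed subset).

    Accuracy is None when the two classifiers never agree.
    """
    agreed = [
        i for i in range(len(truth)) if predictions_a[i] == predictions_b[i]
    ]
    agreement_rate = len(agreed) / len(truth)
    if not agreed:
        return agreement_rate, None
    hits = sum(1 for i in agreed if predictions_a[i] == truth[i])
    return agreement_rate, hits / len(agreed)
```

Subjects on which the classifiers disagree receive no combined prediction under this scheme, so the accuracy applies to only part of the cohort.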

Task 4: Analysis of Clinical CFS Data • Motivation: • The reason for low prediction accuracy could lie in the CFS clinical data attributes • Objective: • Detect potential factors that explain the disagreement between the microarray and proteomics CFS classifiers • Approach: • Applied ANOVA to each attribute for two groups of clinical data (subjects where microarray and proteomics predictions agree vs. the remaining subjects, where they disagree)

Result of Clinical Data Analysis by ANOVA • Three of the clinical classifying attributes were found to differ significantly between the two groups • Mental health • Physical fatigue • General fatigue • The low accuracy of CFS diagnosis could be partially attributed to the clinical definition of the disease

Conclusions • Complementing statistical gene selection with domain knowledge to focus on the most significantly overrepresented GO terms was beneficial for • improving accuracy • identifying a much smaller number of attributes • Integrating information from multiple sources (microarray, proteomics and clinical data) could lead to improved understanding and diagnosis of CFS

Thank You ! • More information: • http://www.ist.temple.edu • Contact: • Zoran Obradovic, director • IST Center, Temple University • 215-204-6265 • zoran@ist.temple.edu
