The GeneFilter framework aims to improve gene expression analysis by integrating intelligent, automated, and efficient methods for identifying key genes associated with complex biological targets. The platform addresses limitations of existing tools by offering preprocessing, analysis, ranking, and validation modules. Its applications started with bladder cancer and have been extended to other diseases, using techniques such as normalization, clustering, and correlation-based time series alignment. The goal is to enhance biological knowledge discovery and support research in genomics.
A Framework for Effective Gene Expression Analysis and Biological Knowledge Discovery FLINT-CIBI 2003 Vincent S. M. Tseng Dept. Computer Science and Information Engineering National Cheng Kung University Tainan, Taiwan, R.O.C. Email: tsengsm@mail.ncku.edu.tw
Outline • Goal of the Framework (GeneFilter) • Architecture of GeneFilter • Main Functions of GeneFilter • Preprocessing Module • Analysis Module • Gene Ranking Module • Feedback Validation Module • Future Directions
Goal of GeneFilter Framework • Existing gene expression analysis tools fall short in: • Intelligence; Automation; High Integration; Efficiency • We aim to develop an intelligent, integrated, automatic, and high-performance gene expression analysis platform that uses various soft-computing methods to find interesting genes for complex analysis targets • Applications to disease analysis • Starting from bladder cancer analysis • Extension to other diseases such as lung cancer
Preprocessing • Handling of Missing Gene Information • Query BioDB • Normalization Methods • Integration of various normalization methods • Quality Analysis • Handling of missing expression data • Integration of regression and clustering techniques • Identification of defect data • By statistics and feedback analysis
Normalization Methods • To remove systematic effects (mRNA abundance effect, chip effect, block effect, …): $Y_{mjkg} = \mathrm{mRNA}_m + \mathrm{Chip}_i + \mathrm{Dye}_j + \mathrm{Block}_k + \mathrm{Gene}_g + e_{mjkg}$ • Median normalization • Lowess normalization: Dudoit et al. (2001) • etc.
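As a rough illustration of the median normalization named on this slide, the sketch below shifts each chip so that its median log-expression value becomes zero, which removes a per-chip additive effect. The data layout (a dict from chip name to a list of values) and the function name `median_normalize` are assumptions made for the example, not part of GeneFilter.

```python
from statistics import median


def median_normalize(chips):
    """Shift each chip's log-expression values so that its median is zero.

    `chips` maps a chip name to a list of log-expression values.
    This layout is hypothetical; missing values are assumed to have
    been handled beforehand.
    """
    normalized = {}
    for name, values in chips.items():
        chip_median = median(values)
        # Subtracting the median removes an additive chip-wide offset.
        normalized[name] = [v - chip_median for v in values]
    return normalized


# Two chips that differ only by a constant offset of 0.5.
chips = {
    "chip_A": [1.0, 2.0, 3.0, 4.0, 10.0],
    "chip_B": [0.5, 1.5, 2.5, 3.5, 9.5],
}
normalized = median_normalize(chips)
```

After normalization the two example chips become identical, since they differed only by a chip-wide offset. Intensity-dependent effects need a method such as lowess, which this sketch does not attempt.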
Analysis Module • Expression Patterns Analysis • Gene Chips Correlation Analysis • Clustering Analysis • Classification Analysis
Expression Patterns Analysis • Definition of Expression Patterns • e.g. set t = 0.5 (other criteria provided) • If x > t and y > t, we consider this gene as up-regulated in stages S1->S2 and S2->S3 [Figure: expression profile over stages S1, S2, S3, annotated with x and y]
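The threshold rule on this slide can be sketched as follows. The sketch assumes that x and y are the simple differences in expression between consecutive stages, and it adds symmetric "down" and "flat" labels that the slide does not define; both are assumptions, as is the function name `classify_transitions`.

```python
def classify_transitions(profile, t=0.5):
    """Label each stage-to-stage change as 'up', 'down', or 'flat'.

    `profile` holds one gene's expression level at consecutive stages
    (S1, S2, S3, ...). Taking the change as a plain difference, and the
    'down'/'flat' labels, are assumptions of this sketch.
    """
    labels = []
    for prev, curr in zip(profile, profile[1:]):
        change = curr - prev
        if change > t:
            labels.append("up")
        elif change < -t:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


# A gene whose changes S1->S2 and S2->S3 both exceed t = 0.5.
pattern = classify_transitions([1.0, 2.0, 3.2], t=0.5)
```

A gene is then up-regulated across both transitions when every label in `pattern` is "up".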
Effective Microarray Clustering [Tseng 02] • Iterative and “Divide-and-Conquer” computation for automatic mining • CAST-based algorithm for clustering efficiency • Hubert’s Γ statistic for validating clustering results
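For readers unfamiliar with Hubert's Γ statistic, the sketch below computes its normalized form as the Pearson correlation, over all object pairs, between the pairwise similarity and a 0/1 indicator of whether the pair shares a cluster. Whether GeneFilter uses exactly this variant is an assumption; the function name and the toy similarity matrix are illustrative only.

```python
from itertools import combinations
from math import sqrt


def hubert_gamma(similarity, labels):
    """Normalized Hubert's Gamma for a clustering.

    `similarity[i][j]` is the similarity of objects i and j, and
    `labels[i]` is the cluster of object i. A higher value means the
    clustering agrees better with the similarity structure.
    """
    xs, ys = [], []
    for i, j in combinations(range(len(labels)), 2):
        xs.append(similarity[i][j])
        ys.append(1.0 if labels[i] == labels[j] else 0.0)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; treat as uninformative
    return cov / sqrt(var_x * var_y)


# Toy data: objects 0,1 are similar, and objects 2,3 are similar.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
good = hubert_gamma(similarity, [0, 0, 1, 1])
bad = hubert_gamma(similarity, [0, 1, 0, 1])
```

In an automatic setting, a score of this kind lets clusterings produced at different affinity thresholds be compared without manual inspection.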
Effective Microarray Mining [Tseng 02] (cont.) 1. Narrow down the threshold range 2. Split and Conquer: find the “nearly-best” result [Figure: affinity threshold (t) axis from 0 to 100%, with LM (Left Margin) and RM (Right Margin) marked]
Experimental Evaluation • Original dataset • Data source: Lawrence Berkeley National Lab (LBNL) (http://rana.lbl.gov/EisenData.htm) • Microarray expression data of the yeast Saccharomyces cerevisiae • Contains the expression levels of 6221 genes under 80 experimental conditions • Testing datasets • Dataset I: low similarity dataset (avg similarity: 0.137) • Dataset II: high similarity dataset (avg similarity: 0.696)
Experimental Evaluation: Low Similarity Dataset • Table 1. Experimental results (dataset I) • Table 2. Distribution of clusters (dataset I)
Experimental Evaluation: High Similarity Dataset • Table 3. Experimental results (dataset II) • Table 4. Distribution of clusters (dataset II)
Time Series Clustering: Main Problems Incurred • Absolute offset • Scaling • Shift • Noise
Time Series Clustering (cont.) [Figure: expression level vs. time point; Pearson correlation coefficient: -0.50936; data from [Spellman 98]]
Time Series Clustering (cont.) [Figure: expression level vs. time point; Pearson correlation coefficient: 0.62328]
Proposed Method: Correlation-based Dynamic Alignment with Mismatch (CDAM)
Input: Two gene expression time series S, T and the number of allowed mismatches M.
Output: The time series similarity between S and T.
Method: CDAM(S, T, M).
Procedure CDAM(S, T, M) {
    // Sequence transformation
    transform the sequences S and T into rank value sequences Q and R;
    // Find the best alignment
    for m = 0 to M {
        calculate r(i, j) for all i, j <= N to find the minimum D of (Q, R);
        alignment (Q', R') with mismatch m <- trace the warping path of minimum D;
    }
    best alignment (S', T') <- the alignment (Q', R') whose similarity is highest;
    return the similarity of (S, T);
}
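The following is a simplified sketch of the two stages named in the CDAM procedure, not the authors' implementation. It rank-transforms both series, aligns them with standard dynamic time warping (DTW) using the absolute rank difference as the local cost, and scores the aligned ranks with the Pearson correlation. The local cost, the helper names, and the test series are assumptions; the loop over the number of allowed mismatches m is omitted because the slides do not define how mismatches are handled.

```python
from math import sqrt


def rank_transform(series):
    """Replace each value by its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    ranks = [0.0] * len(series)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and series[order[end + 1]] == series[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def dtw_align(q, r):
    """Standard DTW; returns the warping path as (index in q, index in r) pairs."""
    n, m = len(q), len(r)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - r[j - 1])  # assumed local cost
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    # Trace the warping path back from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
        if step == d[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == d[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def aligned_similarity(s, t):
    """Rank-transform, align with DTW, then correlate the aligned ranks."""
    q, r = rank_transform(s), rank_transform(t)
    path = dtw_align(q, r)
    return pearson([q[i] for i, _ in path], [r[j] for _, j in path])


# The same pulse, delayed by one time point in t.
s = [0, 1, 4, 9, 4, 1, 0, 0]
t = [0, 0, 1, 4, 9, 4, 1, 0]
plain = pearson(s, t)
aligned = aligned_similarity(s, t)
```

On this pair the plain Pearson correlation is penalized by the one-step shift, while the aligned score is not. Note that warping can only raise the score, so a bound on the allowed distortion, which the mismatch parameter M may provide, is needed in practice.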
Empirical Evaluation • Gene expression data • Cho/Spellman’s time series microarray data of 6178 yeast genes at 18 time points • 255 distinct genes were included in the dataset when mapping 343 known activations onto the Spellman data set [Filkov 01] • Similarity of the genes in the 343 activations
Ranking Genes • Gene list: the (ABC) and (GenAsia) and (DiGiGen) lists • R1: sum of differences between two samples • R2: Chi-square value
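One possible reading of the R1 criterion is sketched below: each gene is scored by the sum of absolute differences between its expression values in two samples, and genes are ranked by that score. The use of absolute values, the pairing of measurements, the data layout, and the gene names are all assumptions; the slide gives no formula, and the R2 (chi-square) criterion is not sketched because its inputs are not specified.

```python
def rank_by_difference(sample_a, sample_b, top_k=50):
    """Rank genes by the summed absolute difference between two samples.

    `sample_a` and `sample_b` map a gene name to a list of expression
    values measured on matched chips (hypothetical layout).
    """
    scores = {
        gene: sum(abs(a - b) for a, b in zip(sample_a[gene], sample_b[gene]))
        for gene in sample_a
    }
    # Largest total difference first; keep only the top_k genes.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical gene names and values, for illustration only.
tumor = {"gene_1": [5.0, 6.0], "gene_2": [1.0, 1.0], "gene_3": [2.0, 4.0]}
normal = {"gene_1": [1.0, 1.0], "gene_2": [1.0, 1.5], "gene_3": [2.0, 2.0]}
top_genes = rank_by_difference(tumor, normal, top_k=2)
```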
Feedback Validation • Biological Experiments • Q-PCR (quantitative real-time polymerase chain reaction) • 2D Gel • Validation • Assessing Preprocessing Protocols • Assessing Analysis Protocols
Conclusions • GeneFilter • http://biosys.csie.ncku.edu.tw/genefilter/index.jsp • Has been applied to the analysis of bladder cancer, hepatitis diseases, etc. • Short turnaround time for analysis • Thanks to high integration • Effective analysis results • Narrows down the interesting genes from 10,000+ to 50 genes
Future Directions • Incorporation of more soft-computing methods • Fuzzy logic for • Clustering & similarity measurement • Classification • Quality validation • Gene Ontology Analysis • Application to the analysis of more diseases
Acknowledgement • Collaborators • Prof. H. S. Liu (NCKU) • Prof. N. H. Cho (NCKU) • Prof. C. L. Ho (NCKU) • Prof. J. H. Chiang (NCKU) • Prof. Y. L. Sheh (NSYSU) • Prof. H. L. Wu (NCKU) • Sponsoring • National Science Council, Taiwan
Thanks Email: tsengsm@mail.ncku.edu.tw
Example: Goal of Gene Expression Analysis [Figure: interesting gene set]