Integrated Computational Platform for Analyzing Time-Series Gene Expression Data

193962_at Target 150 188023_at 6 100 Time 5 50 Residual 4 Height 3 0 40 30 20 10 0 2 Expression 1 4 SIMULATIONS A simulation model [5] was first used to generate time series expression profiles. The simulator generates networks with scale-free topology organization and clustering coefficient not dependent on the number of genes in the network. Then it integrates differential equations and Boolean logic to reproduce some important features of regulation in real biological systems, such as the continuous nature of generated data and the interaction among regulatory genes in controlling both production and degradation. The rate of change of gene x expression level at time t is modeled by a differential equation – the selected model being either linear or non-linear. On several simulations with both linear and non-linear data and with different sizes of cluster groups, the series selected as reconstructing functions either lie in the cluster or they appear consistently (~75%) as regulatory factors of members of the corresponding cluster. Subsystem selection from time series microarray data in stimulus-response experiments Jurman G*, Di Camillo B**, Galea A*, Merler S*, Toffolo G**, Cobelli C**, Furlanello C* *ITC-irst, Trento, Italy, **Information Engineering Department, University of Padova, Padova, Italy MGED 9September 7-10, 2006Seattle, WA, U.S.A. INTRODUCTION We present an integrated computational platform for the analysis of time varying microarray data obtained from dynamic stimulus-response experiments. External or internal stimuli such as hormone stimulation, disease progression, or drug assumption are cause of direct regulatory and functional actions involving RNA, regulatory regions on DNA, proteins and metabolites. We consider the available gene expression data as a proxy of such complex protein activities in a case-control setup (e.g. disease vs. normal, or treated vs. non-treated experimental conditions) where high-throughput expression data are evaluated at different timestamps in stimulated biomaterial (e.g. cells). 178717_at 175599_s_at 2 PLATFORM WORKFLOW 171971_x_at 194189_x_at 1 192731_at 173156_s_at 171904_x_at GEO 178316_at 190664_s_at 192934_s_at 187177_at 181715_s_at • Dataset (C. elegans GSE2180) • Gene Expressions (Affy GPL200, N=22625 genes, 123 samples, 4 classes, 10 time steps) • Gene Ontology, … 193833_s_at 193772_s_at 173826_s_at Top 30 genes 185282_s_at 181714_at 189002_s_at 173804_s_at Hierarchical clusteringof the top ranked 1000 genes into 20 groups (complete, DTW distance) 186790_at 188822_at 183457_s_at GUS 173894_s_at full set of N genes 191208_s_at GO basedquery 173203_at expressions values per time step+ class labels + GO labels 188169_at RAD 177204_s_at Annotated genes: TF 178259_at 172156_x_at 191520_s_at DTW - Clustering • A priori regulatory relationships 0 100 200 300 400 • Principal temporal patterns of regulation • Target functions: (e.g. sums of expressions from clusters of regulated genes 1 0 50 100 150 0 50 100 150 Multiplicity Profiled subset of n << Nby predictive classifiers 2 Target functions: cluster means 5 cluster : 1 cluster : 2 cluster : 3 cluster : 4 cluster : Multiplicity of extractions of the best 30 genes in the top-20 positions of all 400 gene ranking lists. A BioDCV classification experiment (34 WildType samples vs 28 mex-3;skn-1 samples) with Support Vector Machines coupled with Entropy-based Recursive Feature Extraction [6]. 4 2 0 -2 DTW - Friedman time series boosting -4 -6 cluster : 6 cluster : 7 cluster 8 9 : 10 : cluster : cluster 4 2 0 • Small size panel of transcription factors best reconstructing the target regulated functions -2 3 -4 Reverse Engineering Modeling System (Bayesian Networks) -6 Average expression 11 12 cluster : cluster : cluster : 13 cluster : 14 cluster : 15 4 2 0 -2 Group Enrichment analysis for GO Biological Process (MappFinder) -4 3 -6 20 cluster : 16 cluster : 17 cluster : 18 cluster : 19 cluster : 4 2 0 • regulatory network model • THE PIPELINE • To elucidate which are the main biological subsystems responding to the stimulus through the identification of major temporal patterns and modules of regulation, we combine predictive machine learning with reverse engineering methods. The resulting computational platform embeds a pipeline for functional genomics based on the BioDCV high performance system (Support Vector Machine profiling with selection bias control), and on a stage-wise boosting procedure for time series target function reconstruction [1]: • Gene selection induced by a classification problem (Box 1) • Selection of regulation networks by DTW clustering of transcription factors and regulated genes (Box 2) • Definition of the target function associated to the cluster groups • Reconstruction of the target function as a linear combination of the transcription factor (Box 3) • Toy examples on simulated data (Box 4) • Exploration of the regulatory networks underlying the reconstructing genes by querying ontology databases (Box 5) -2 -4 -6 4 5 • Identification of reconstructing TF 0 50 100 150 0 50 100 150 0 50 100 150 Time 5 TIME SERIES: C. ELEGANS • The GSE2180 data were analyzed according to the above workflow and • results were compared with a priori biological knowledge. MAPPFinder [3] and NetAFFX were used to link each selected transcript with corresponding GO biological processes and to compute the enrichment of each GO group for biological processes. In particular: • 31 of the selected genes are in the development group • 9 of them are transcription factors • Among those 9 genes, 7 are present as reconstructing factors of many of the target functions appearing in Box 2. Reconstruction of the target function associated to the cluster 18. A linear combination is computed from Friedman boosting, based on a DTW distance coupled with Euclidean correction [1]; simulated annealing is used as residual minimization strategy. The solid black line represents the target function; the dashed black line is its approximation computed as the α multiple of the time series for gene 193962_at. The solid red line displays the residual remaining after Step 1 approximation. The dashed red line is the reconstructing factor at Step 2. The complete reconstruction table is shown below: α is the linear multiplier, β is the area of the residual function after subtracting the corresponding reconstruction factor and γ is the percentage of reconstruction at the Step. THE GUS INSTANCE The system includes a PostgreSQL version of the Genomic Unified Schema (GUS). The C. elegans Affy chip platform information was loaded into GUS, with the goal of connecting the ontology information into the machine learning processes (classification, feature selection, DTW-boosting reconstruction). FEATURES The target functions driving the search of temporal patterns are obtained by a feature ranking process and a connection to the GO Molecular Function and Biological Process. The network information known for analyzed dataset are retrieved so to label regulators and regulated genes. Different definitions of the target functions can be used, ranging from simple linear combination of regulator main patterns to more complex regulatory models such as Boolean models of regulation known from reverse engineering. • DATA • The platform was validated first on simulated data (Box 4). • A real case study is also presented (Box 5) to illustrate the use of the method in practice. The multiple early embryonic time series for C. elegans (GEO GSE2180) [7] with 10 time points, each for 4 different genotypes: wild-type , pie-1(zu154), pie-1(zu154); pal-1(RNAi) and mex-3(zu155); skn-1(RNAi). Platform: Affy GPL200, N=22625 genes, nsamples=123. REFERENCES [1] Furlanello C, Merler S, Jurman G. Combining feature selection and DTW for time varying functional genomics. IEEE Transactions on Signal Processing, 54(6), 2436-43, 2006. [2] Di Camillo B, Sanchez-Cabo F, Toffolo G, Nair SK, Trajanoski Z, Cobelli C. A quantization method based on threshold optimization for microarray short time series. BMC Bioinformatics 6 (4): S11, 2005. [3] Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology 2003, 4:R7. [4] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight S, Eppig J, Harris M, Hill D, Issel-Tarver L, Kasarskis A, Lewis S, Matese J, Richardson J, Ringwald M, Rubin G, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25(1): 25-29. [5] Di Camillo B, Toffolo G, Cobelli C. A gene network simulator integrating Boolean regulation in a continuous dynamic model. 5th European Conference on Computational Biology - ECCB '06 Eilat, Israel, 2006 [6] Furlanello C, Serafini M, Merler S, Jurman G. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005. [7] Baugh LR, Hill AA, Claggett JM, Hill-Harfe K, Wen JC, Slonim DK, Brown EL, Hunter CP. (2005) The homeodomain protein PAL-1 specifies a lineage-specific regulatory network in the C. elegans embryo. Development. 132(8):1843-54. • ACKNOWLEDGMENTS • ITC-irst: R. Flor, S. Paoli, D. Albanese • CBIL: E. Manduchi • CONTACTS • http://mpa.itc.it - - - http://biodcv.itc.it - - - jurman@itc.it http://www.genome.gov/10000570

Integrated Computational Platform for Analyzing Time-Series Gene Expression Data

Integrated Computational Platform for Analyzing Time-Series Gene Expression Data

Presentation Transcript

Introduction to introduction to introduction to … Optimization

INTRODUCTION/ INTRODUCTION

Introduction

INTRODUCTION

Introduction

Introduction