Effective Enrichment of Gene Expression Data Sets

Utku Sirin (a), Utku Erdogdu (a), Faruk Polat (a), Mehmet Tan (b), and Reda Alhajj (c). (a) Department of Computer Engineering, Middle East Technical University, Ankara, Turkey; (b) Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey; (c) Department of Computer Science, University of Calgary, Alberta, Canada. IEEE 11th International Conference on Machine Learning and Applications, December 13th, 2012

Outline • Background and Motivation • Multi-Model Framework & Evaluation Metrics • Generative Models • Probabilistic Boolean Networks • Ordinary Differential Equations • Experimental Evaluation • Conclusion & Future Work

Background & Motivation • Gene expression data is the main source of information for many applications in computational systems biology • However, the datasets suffer from the problem of skewed data matrices • There are thousands of genes and just several tens of samples • So few samples lower the confidence level of any computational method significantly

Background & Motivation • How can available gene expression datasets be enriched with confidence? • Several tools generate synthetic gene expression samples, such as GeneNetWeaver (Schaffter et al., 2011) and SynTReN (Bulcke et al., 2006) • However, each of them uses a single model, such as ordinary differential or stochastic equations or Boolean networks, which restricts how gene regulation can be modeled • Our idea is to integrate different machine learning techniques into a single unified multi-model framework so that we can benefit from different models concurrently • We thereby aim to generate synthetic gene expression samples more confidently and to mitigate the low sample size problem of gene expression datasets by producing high-quality data

Multi-Model Gene Regulation Model • Construct different gene regulation models from the available gene expression samples • Sample equally from each of them and pool the generated samples • Select the best samples from the pool and output them • Each model contributes its own characteristics, and we utilize all of them concurrently • How to select the best samples? • Multi-Objective Selection [Diagram: Original Gene Expression Data → Model 1 … Model N → k samples from each model → Multi-Objective Selection → k samples]
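
The pooling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two generators (`gaussian_model`, `jitter_model`) and the selector (`nearest_first`) are hypothetical stand-ins, not the models or the selection rule used in the study, and the data is random.

```python
import numpy as np

def enrich(data, models, k, select, rng):
    """Draw k samples from every model, pool them, and keep the k that `select` prefers."""
    pool = np.vstack([model(data, k, rng) for model in models])
    keep = select(pool, data, k)
    return pool[keep]

# Hypothetical stand-in generator 1: independent Gaussian per gene.
def gaussian_model(data, k, rng):
    mu, sd = data.mean(axis=0), data.std(axis=0)
    return rng.normal(mu, sd, size=(k, data.shape[1]))

# Hypothetical stand-in generator 2: resample original rows and add small noise.
def jitter_model(data, k, rng):
    rows = rng.integers(0, data.shape[0], size=k)
    return data[rows] + rng.normal(0.0, 0.05, size=(k, data.shape[1]))

# Placeholder selector (an assumption): keep the k pooled samples that are,
# on average, closest to the original samples.
def nearest_first(pool, data, k):
    dist = np.linalg.norm(pool[:, None, :] - data[None, :, :], axis=2).mean(axis=1)
    return np.argsort(dist)[:k]

rng = np.random.default_rng(0)
data = rng.random((31, 7))  # random stand-in: 31 samples, 7 genes
out = enrich(data, [gaussian_model, jitter_model], k=10, select=nearest_first, rng=rng)
```

Passing the selector as a parameter keeps the pooling logic independent of how "best" is defined, so a multi-objective rule can be swapped in without touching `enrich`.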

Evaluation Metrics • After generating samples from each model, it is essential to select the highest-quality samples for output; otherwise our method would be impractical • To assess the quality of the generated samples, we defined three metrics that measure it from different aspects: Compatibility, Diversity and Coverage • Compatibility • How close are the generated samples to the original samples? • Ensures that the generated samples are similar to the original samples • Mean of the Euclidean distances from each generated sample to the original samples • Diversity • How different are the generated samples from the original samples? • Ensures that the generated samples are not duplicates of the original samples but always carry new information • We calculate the entropy value of each sample in the dataset and sum the differences. For each generated sample, we add it to the original dataset and again sum the differences of the entropy values. The diversity value of that sample is the ratio of the latter sum to the former • Coverage • How much of the sample space do the generated samples cover? • Ensures that the sample space is covered as much as possible • Mean of the Euclidean distances from each generated sample to the already generated samples
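
The three metrics could be computed roughly as follows. The slide does not specify how a sample's entropy is estimated or how the entropy differences are summed, so this sketch makes two assumptions: entropy is the Shannon entropy of a histogram of the sample's expression values (scaled to [0, 1]), and "sum the differences" means the sum of absolute pairwise differences between sample entropies. All function names are illustrative.

```python
import numpy as np

def compatibility(generated, original):
    """Mean Euclidean distance from each generated sample to all original samples."""
    diff = generated[:, None, :] - original[None, :, :]
    return np.linalg.norm(diff, axis=2).mean(axis=1)

def coverage(generated):
    """Mean Euclidean distance from each generated sample to the other generated samples."""
    n = generated.shape[0]
    dist = np.linalg.norm(generated[:, None, :] - generated[None, :, :], axis=2)
    return dist.sum(axis=1) / (n - 1)  # the zero self-distance is excluded from the mean

def sample_entropy(x, bins=8, value_range=(0.0, 1.0)):
    """Assumed entropy estimate: Shannon entropy of a histogram of one sample's values."""
    counts, _ = np.histogram(x, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_spread(data, **kw):
    """Assumed reading of 'sum the differences': sum of |H_i - H_j| over all sample pairs."""
    h = np.array([sample_entropy(row, **kw) for row in data])
    return np.abs(h[:, None] - h[None, :]).sum() / 2.0

def diversity(generated, original, **kw):
    """Ratio of the entropy spread with one generated sample added to the original spread."""
    base = entropy_spread(original, **kw)  # assumed non-zero
    return np.array([entropy_spread(np.vstack([original, g]), **kw) / base
                     for g in generated])

rng = np.random.default_rng(0)
original = rng.random((31, 20))   # random stand-in data
generated = rng.random((5, 20))
comp = compatibility(generated, original)
cov = coverage(generated)
div = diversity(generated, original)
```

Under this reading a diversity value can never fall below 1, because adding a sample only adds non-negative terms to the spread; a value of 1.3 would correspond to 30% more entropy spread than the original data alone.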

Evaluation Metrics, Multi-Objective Selection • After calculating the three metric values, we have a vector of metric results for each sample • To select the best of the generated samples, we apply a multi-objective selection mechanism to these vectors using the strict dominance rule • Strict dominance rule: a sample is of higher quality than another sample if all of its metric results are greater than those of the other sample • We sort all of the generated samples multi-objectively and output the best k • Samples that do not dominate one another are grouped together and selected from at random
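
One way to realize this selection is to peel off successive groups of mutually non-dominated samples and fill the output group by group. The sketch below assumes every metric has already been oriented so that larger is better (for example by negating a distance); the function names and the toy score vectors are illustrative, not taken from the study.

```python
import numpy as np

def strictly_dominates(a, b):
    """a strictly dominates b when every metric of a is greater than that of b."""
    return bool(np.all(a > b))

def dominance_fronts(scores):
    """Group sample indices into successive fronts of mutually non-dominated samples."""
    remaining = list(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(strictly_dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def select_best(scores, k, rng):
    """Take whole fronts in order; break ties inside the last needed front at random."""
    chosen = []
    for front in dominance_fronts(scores):
        need = k - len(chosen)
        if need <= 0:
            break
        if len(front) <= need:
            chosen.extend(front)
        else:
            chosen.extend(rng.choice(front, size=need, replace=False).tolist())
    return chosen

# Toy metric vectors (compatibility, diversity, coverage), larger is better.
scores = np.array([[3, 3, 3], [2, 2, 2], [1, 5, 1], [0, 0, 0]])
fronts = dominance_fronts(scores)
best3 = select_best(scores, 3, np.random.default_rng(0))
```

In the toy data, samples 0 and 2 form the first front because neither beats the other on all three metrics, sample 1 forms the second front, and sample 3 the third.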

Generative Models • Our framework can contain any number of gene regulation models • The important point is that the models should be as independent of one another as possible, so that the generated gene expression samples cover different parts of the sample space • In this study we select two representative generative models for our multi-model data generation framework • Probabilistic Boolean Networks (PBNs) (Shmulevich et al., 2002) • Ordinary Differential Equations (ODEs) (Bansal et al., 2006)

Probabilistic Boolean Networks (PBNs) • Probabilistic versions of Boolean networks (Kauffman, 1993) • Each gene is either ON or OFF (binary values) • Each gene is associated with a set of Boolean functions, and each Boolean function is associated with a set of genes (its variables) • Each set of Boolean functions is also associated with a probability distribution, so that at each time step the value of each gene is determined by a Boolean function selected according to its probability

Probabilistic Boolean Networks (PBNs) • We construct the PBNs by adapting the MATLAB PBN Toolbox • Then we run the PBN and generate synthetic gene expression samples to feed into our multi-model framework [Diagram: genes g1, g2, …, gn at time t mapped to genes g1, g2, …, gn at time t+1]
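
To make the update rule concrete, here is a minimal Python sketch of running a PBN. It does not use or mirror the MATLAB PBN Toolbox; the three-gene network, its truth tables and probabilities, and the `burn_in` parameter are all made up for illustration.

```python
import numpy as np

# Hypothetical 3-gene PBN. Each gene has candidate predictors, given as
# (input gene indices, truth table), and one selection probability per predictor.
PBN = [
    {"predictors": [((1, 2), [0, 1, 1, 1]), ((2,), [1, 0])], "probs": [0.7, 0.3]},
    {"predictors": [((0,), [0, 1])], "probs": [1.0]},            # gene 1 copies gene 0
    {"predictors": [((0, 1), [0, 0, 0, 1]), ((0,), [1, 0])], "probs": [0.5, 0.5]},
]

def pbn_step(state, pbn, rng):
    """One synchronous update: pick a Boolean function per gene, then evaluate it."""
    nxt = np.empty_like(state)
    for g, node in enumerate(pbn):
        choice = rng.choice(len(node["predictors"]), p=node["probs"])
        inputs, table = node["predictors"][choice]
        idx = 0
        for i in inputs:                      # input bits form the truth-table index
            idx = (idx << 1) | int(state[i])
        nxt[g] = table[idx]
    return nxt

def generate(pbn, n_samples, rng, burn_in=10):
    """Run the network from a random state and record one binary sample per step."""
    state = rng.integers(0, 2, size=len(pbn))
    for _ in range(burn_in):
        state = pbn_step(state, pbn, rng)
    samples = []
    for _ in range(n_samples):
        state = pbn_step(state, pbn, rng)
        samples.append(state.copy())
    return np.array(samples)

samples = generate(PBN, 20, np.random.default_rng(0))
```

Each row of `samples` is one binary expression profile; consecutive rows are consecutive time steps of the same run.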

Ordinary Differential Equations (ODEs) • One of the oldest methods to model gene regulation • Each gene’s expression value is related to the other genes’ expression values through a regulation matrix, denoted A • The derivative of each gene’s expression value is determined by a linear combination of the expression values of all other genes • There are many algorithms modeling gene regulation with ODEs • “Network Identification by Multiple Regression” (NIR), applying multiple linear regression (Gardner et al., 2003) • “Differential Equation-based Local Dynamic Bayesian Network” (DELDBN), combining differential equations and dynamic Bayesian networks (Li et al., 2011) • In our study, we use the “Time Series Network Identification” (TSNI) algorithm because of its advantages over the other methods (Bansal et al., 2006) • It can handle both time series and steady state gene expression datasets • It can easily be applied to large datasets because it uses principal component analysis • It can determine the external perturbation automatically from the data

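Ordinary Differential Equations (ODEs) • The only unknowns are the A and B matrices • We can write the equation by concatenating the known and unknown matrices [Equation: differentiation term = regulation matrix (A) × expression values + perturbation matrix (B) × perturbation values; the differentiation term, expression values and perturbation values are KNOWN, the regulation and perturbation matrices are UNKNOWN]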
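
The slide's equation did not survive extraction. Based on the term labels and on the matrices H and Y named on the following slide, it plausibly had the form below. The symbols X, U and the dimensions n (genes), p (perturbations) and m (measurements) are assumptions introduced here for illustration.

```latex
\dot{X} = A X + B U
        = \begin{bmatrix} A & B \end{bmatrix}
          \begin{bmatrix} X \\ U \end{bmatrix}
        = H Y,
\qquad
X \in \mathbb{R}^{n \times m},\;
U \in \mathbb{R}^{p \times m},\;
A \in \mathbb{R}^{n \times n},\;
B \in \mathbb{R}^{n \times p},\;
H \in \mathbb{R}^{n \times (n+p)},\;
Y \in \mathbb{R}^{(n+p) \times m}.
```

Here the differentiation term, X and U are known from the data, while H = [A B] collects the unknown regulation and perturbation matrices.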

Ordinary Differential Equations (ODEs) • In this form it is easy to find the unknown matrix H • However, the number of equations must be greater than the number of variables, which may not always hold • In that case, TSNI applies Principal Component Analysis (PCA) to the matrix Y to reduce its dimension and then solves the equation • The unknown matrices A and B can then be obtained easily • By running the resulting ODE model, it is easy to produce synthetic gene expression samples to feed into our multi-model framework
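
The idea of solving for H in a reduced space can be illustrated with a truncated singular value decomposition. This is a simplified stand-in, not the TSNI implementation: it skips mean-centering and TSNI's time discretization, assumes the derivative matrix is already available, and assumes H is laid out as [A B] with Y stacking expression values over perturbation values. The `simulate` helper uses plain forward Euler steps, which is also an assumption.

```python
import numpy as np

def solve_reduced(x_dot, y, rank):
    """Least-squares H for x_dot ≈ H @ y, using only the top `rank` principal directions of y."""
    w, s, vt = np.linalg.svd(y, full_matrices=False)
    w, s, vt = w[:, :rank], s[:rank], vt[:rank, :]
    return x_dot @ vt.T @ np.diag(1.0 / s) @ w.T   # truncated pseudo-inverse of y

def split_h(h, n_genes):
    """Assumed layout H = [A B]: first n_genes columns are A, the rest are B."""
    return h[:, :n_genes], h[:, n_genes:]

def simulate(a, b, x0, u, dt, steps):
    """Forward-Euler run of dx/dt = A x + B u, returning one sample per step."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (a @ xs[-1] + b @ u))
    return np.array(xs[1:])

# Noise-free toy problem with more measurements than unknowns, so H is recoverable.
rng = np.random.default_rng(0)
n, p, m = 3, 1, 50
a_true, b_true = rng.normal(size=(n, n)), rng.normal(size=(n, p))
x, u = rng.normal(size=(n, m)), rng.normal(size=(p, m))
y = np.vstack([x, u])
x_dot = a_true @ x + b_true @ u

h = solve_reduced(x_dot, y, rank=n + p)
a_hat, b_hat = split_h(h, n)
synthetic = simulate(a_hat, b_hat, np.zeros(n), np.ones(p), dt=0.01, steps=5)
```

When there are fewer measurements than unknowns, choosing a smaller `rank` restricts the solve to the directions that the data actually spans, which is the role the slide assigns to PCA.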

Experimental Evaluation, Datasets • We evaluated our framework using three different real-life biological datasets • The first dataset is the gene expression profile of metastatic melanoma cells (Bittner et al., 2000) • It originally includes 8067 genes and 31 samples. We used its reduced form, composed of the 7 most important genes and 31 samples (Datta et al., 2003) • The second dataset is the yeast cell cycle gene expression dataset (Spellman et al., 1998) • It includes 25 genes and 77 samples • The third dataset is the siRNA disruptant dataset in human umbilical vein endothelial cells (HUVECs) (Hurley et al., 2011) • It includes 379 genes and 400 samples • A recently published and very useful source for our model

Experimental Evaluation, Results • We evaluated our framework on the three metrics we defined, in two different settings • In the first setting, we used the melanoma and yeast datasets without partitioning them into training and testing sets • This is because the melanoma and yeast datasets have relatively few samples, so dividing them into training and testing sets would be meaningless • In the second setting, we used the yeast and HUVECs datasets and partitioned them into training and testing sets. Here there are enough samples to divide; the HUVECs dataset has 400 samples • In the first set of experiments the results are always suspect, since the training and testing sets are the same. The second set of experiments gives a clearer picture and increases the confidence level of our framework • Note that because the yeast dataset is medium-sized, we used it in both sets of experiments so that the results can be compared

Experimental Evaluation, Results, Setting #1 • In this set of experiments, we increased the number of samples generated by our framework from 10 to 500 (10, 20, …, 500) and checked the mean of the metric results w.r.t. the training samples • Figures 1 and 2 show the compatibility and diversity results. The compatibility results show that the data generated by our framework converges to the original dataset, since it gets closer and closer to it • The diversity results, on the other hand, show that although the generated samples are getting closer to the original dataset, they always carry new information with respect to it • That is, our multi-model gene expression data generation framework always produces high-quality samples that are both very close to the original dataset and bring new information • For the melanoma dataset, newly generated samples bring almost 30% new information, which is a very good result [Figure 1: Compatibility] [Figure 2: Diversity]

Experimental Evaluation, Results, Setting #1 • The coverage results conclude our first set of experiments • As seen in Figure 3, the coverage results decrease for both datasets. This is consistent with the compatibility results: because the system converges to generating samples similar to the original dataset, the samples also become similar to each other [Figure 3: Coverage]

Experimental Evaluation, Results, Setting #2 • In the previous experiments the testing and training datasets were the same because of the low sample sizes, which lowers the confidence level of the experimental results • In this set of experiments, we divided the yeast and HUVECs datasets into training and testing sets. We constructed our generative models from the training samples and checked the metric results against the testing samples • Note that we also computed the metric results against the training samples for comparison • For the yeast dataset, we used the first 50 samples for training and the last 27 samples for testing • For the HUVECs dataset, we used the first 300 samples for training and the last 100 samples for testing

Experimental Evaluation, Results, Setting #2 • First we generated 50 samples and checked the metric results for each generated sample separately • Figures 4 and 5 show the results for the yeast dataset in terms of compatibility and diversity • They confirm our concern about the low confidence level of the first set of experiments: the generated data is less close to the original samples, and more diverse relative to them, when it is evaluated w.r.t. the testing data [Figure 4: Compatibility for Yeast] [Figure 5: Diversity for Yeast]

Experimental Evaluation, Results, Setting #2 • Figures 6 and 7 show the results for the HUVECs dataset in terms of compatibility and diversity • They again confirm our concern about the low confidence level of the first set of experiments: the generated data is less close to the original samples, and more diverse relative to them, when it is evaluated w.r.t. the testing data • So we can say that our generated samples are actually of higher quality than the first set of experiments suggests, because we still have very good compatibility values of around 93%, and the diversity values are greater than their previous values [Figure 6: Compatibility for HUVECs] [Figure 7: Diversity for HUVECs]

Experimental Evaluation, Results, Setting #2 • We now know that our generated data is less close and more diverse w.r.t. the original dataset • So what happens when we generate a large number of samples? To understand this, we generated 10, 20, …, 500 samples and checked the difference between the means of the metric results w.r.t. the testing and training data • That is, we evaluate each generated sample set w.r.t. the testing dataset and w.r.t. the training dataset and plot the difference • The results w.r.t. training serve as our baseline, and we try to understand how the metric results w.r.t. the testing samples change relative to it
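
The evaluation loop described here can be sketched as follows for the compatibility metric, using the yeast split sizes (first 50 samples for training, last 27 for testing). The data is random and `toy_generator` is a hypothetical placeholder for the multi-model framework, so the resulting numbers carry no meaning; the slide also does not say whether the plotted difference is raw or relative, and the sketch uses the raw difference.

```python
import numpy as np

def mean_compatibility(generated, reference):
    """Mean Euclidean distance between every generated sample and every reference sample."""
    dist = np.linalg.norm(generated[:, None, :] - reference[None, :, :], axis=2)
    return float(dist.mean())

# Hypothetical placeholder for the framework: resample training rows and add noise.
def toy_generator(train, n, rng):
    rows = rng.integers(0, train.shape[0], size=n)
    return train[rows] + rng.normal(0.0, 0.05, size=(n, train.shape[1]))

rng = np.random.default_rng(0)
data = rng.random((77, 25))            # random stand-in with the yeast dimensions
train, test = data[:50], data[50:]     # first 50 for training, last 27 for testing

sizes = list(range(10, 501, 10))       # 10, 20, ..., 500 generated samples
gap = []
for n in sizes:
    gen = toy_generator(train, n, rng)
    # metric w.r.t. testing minus metric w.r.t. training (the baseline)
    gap.append(mean_compatibility(gen, test) - mean_compatibility(gen, train))
```

Plotting `gap` against `sizes` gives one curve of the kind the slide describes; the same loop applies to any other per-set metric.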

Experimental Evaluation, Results, Setting #2 • Figures 8 and 9 show the results for the yeast dataset • These results show that our generated samples are very close to the original dataset, because there is only a 5% difference between the compatibility values • Moreover, they always carry new information, because the diversity values are always greater than zero • Nonetheless, the results for the yeast dataset are not promising, because they do not show a regular pattern as we generate more and more samples [Figure 8: Compatibility for Yeast] [Figure 9: Diversity for Yeast]

Experimental Evaluation, Results, Setting #2 • Figures 10 and 11 show the results for the HUVECs dataset • Here we see much better results. First, the compatibility difference is smaller than that of yeast: it is only 2%, which is a very good result • Second, and more importantly, the diversity values are always increasing. That is, as we generate more and more samples, our generated samples are not only very close to the original dataset but also keep bringing more new information to it • This is a very important result, because it shows that we can computationally generate gene expression samples much like the original ones. Hence, the complex internal dynamics of gene regulation can be simulated successfully by superposing different methods, generating data as if it had been produced by those dynamics themselves • We think the reason for this result is the number of training samples available in the HUVECs dataset • The result not only shows the power of computational methods but also has practical value, since it yields high-quality gene expression data [Figure 10: Compatibility for HUVECs] [Figure 11: Diversity for HUVECs]

Conclusion & Future Work • By integrating different machine learning methods, we can successfully simulate the complex gene regulation system • The system always produces samples that are both similar to the original gene expression dataset and carry new information • Our system can be used as a preprocessor for any computational approach requiring gene expression data • As future work: • The framework can be extended by integrating more models • The produced samples may be studied under a pre-determined analysis task to verify the effectiveness of our system • A bound can be determined for the number of samples required to train our multi-model framework

References • T. Schaffter, D. Marbach, and D. Floreano, “GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods,” Bioinformatics, vol. 27, no. 16, pp. 2263–2270, 2011. • T. Van den Bulcke, K. Van Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. De Moor, and K. Marchal, “SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms,” BMC Bioinformatics, vol. 7, p. 43, 2006. • I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002. • M. Bansal, G. D. Gatta, and D. Di Bernardo, “Inference of gene regulatory networks and compound mode of action from time course gene expression profiles,” Bioinformatics, vol. 22, no. 7, pp. 815–822, Apr. 2006. • S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, 1st ed. Oxford University Press, USA, June 1993. • T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, “Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling,” Science, vol. 301, no. 5629, pp. 102–105, 2003. • Z. Li, P. Li, A. Krishnan, and J. Liu, “Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis,” Bioinformatics, vol. 27, no. 19, pp. 2686–2691, Oct. 2011. • A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty, “External control in Markovian genetic regulatory networks,” Mach. Learn., vol. 52, no. 1-2, pp. 169–191, Jul. 2003. • P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle regulated genes of yeast Saccharomyces cerevisiae by microarray hybridization,” 1998. • D. Hurley, H. Araki, Y. Tamada, B. Dunmore, D. Sanders, S. Humphreys, M. Affara, S. Imoto, K. Yasuda, Y. Tomiyasu, K. Tashiro, C. Savoie, V. Cho, S. Smith, S. Kuhara, S. Miyano, D. S. Charnock-Jones, E. J. Crampin, and C. G. Print, “Gene network inference and visualization tools for biologists: application to new human transcriptome datasets,” Nucleic Acids Research, 2011.

Any Questions or Comments? This research is partially supported by The Scientific and Technological Research Council of Turkey (TUBITAK), under project #110E179.
