Clustering of Gene Expression Time Series with Conditional Random Fields

Yinyin Yuan and Chang-Tsun Li, Computer Science Department

Microarray and Gene Expression
• A microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue.
• The gene expression level is the relative amount of mRNA produced at a specific time point and under certain experimental conditions.
• Microarrays thus provide a means to decipher the logic of gene regulation by monitoring the expression of all genes in a tissue.

Gene Expression
• Gene expression data are obtained from microarrays and organized into a gene expression matrix, which is then analysed with various methodologies for medical and biological purposes.

Gene Expression Time Series
Time Series
• A sequence of gene expression levels measured at successive time points, at either uniform or uneven time intervals.
• Reveals more information than static data, because time series data have strong correlations between successive points.
Time Series Clustering
• Assumption: co-expression indicates co-regulation, so clustering identifies genes that share similar functions.

Probabilistic models
A key challenge of gene expression time series research is the development of efficient and reliable probabilistic models that:
• Allow measurement of uncertainty
• Give an analytical measure of confidence in the clustering result
• Indicate the significance of a data point
• Reflect temporal dependencies in the data points

Goal
• Identify highly informative genes
• Cluster genes in the dataset
• GO (Gene Ontology) analysis of biological function for each cluster

HMMs and CRFs
• [Figure: HMMs and CRFs]
• HMMs are trained to maximize the joint probability of a set of observed data and their corresponding labels.
• Independence assumptions are needed to keep HMMs computationally tractable.
• Representing long-range dependencies between genes and gene interactions is therefore computationally infeasible.

Conditional Random Fields
• CRFs are undirected graphical models that define a probability distribution over label sequences, globally conditioned on a set of observed features.
• X = {x1, x2, …, xn}: variable over the observations.
• Y = {y1, y2, …, yn}: variable over the corresponding labels.
• For sample xi, the model uses the observed data xj and class labels yj of all j in its voting pool Ni.

CRF Model
• The CRF model can be formulated as follows
• The CRF model can be expressed in a Gibbs form in terms of cost functions

Cost function
• The conditional random field model can also be expressed in a Gibbs form in terms of cost functions
• Cost function

Potential function
• Real-valued potential functions are obtained and used to form the cost function.
• D: the estimated threshold dividing the set of Euclidean distances into intra-class and inter-class distances.

Finding the optimal labels
• We adopt deterministic label selection; the optimal label is determined by

Pre-processing
• Linear warping for data alignment
• Data with τ time points are transformed into a (τ-1)-dimensional feature space. Differences between consecutive time points, inversely proportional to the time intervals, are used as features because they reflect the temporal structure of the time series.
• Voting pool: keeps the most similar sample, the most different sample, and k-2 randomly selected samples.
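The feature transformation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that "inversely proportional to time intervals" means each difference is divided by the elapsed time; the function name `to_features` and the example values are hypothetical and do not come from the slides.

```python
def to_features(values, times):
    """Turn tau measurements into tau-1 interval-scaled differences.

    Assumption: each feature is (x[k] - x[k-1]) / (t[k] - t[k-1]).
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    features = []
    for k in range(1, len(values)):
        dt = times[k] - times[k - 1]
        if dt <= 0:
            raise ValueError("time points must be strictly increasing")
        # Dividing by the interval makes the feature inversely
        # proportional to the time between the two measurements.
        features.append((values[k] - values[k - 1]) / dt)
    return features


# Hypothetical profile sampled at uneven intervals (4 points, 3 features).
profile = [1.0, 2.0, 4.0, 4.0]
times = [0.0, 1.0, 3.0, 4.0]
print(to_features(profile, times))  # [1.0, 1.0, 0.0]
```

Scaling by the interval keeps a slow change over a long gap from looking like a sharp jump, which matters when the sampling is uneven.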

Process
• Initialization
• Each sample is assigned a random label
• Voting pools are formed randomly
• Samples interact with each other progressively via their voting pools
• Update labels
• Update voting pools
• Repeat until steady
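The loop above can be sketched as follows. The slides' actual cost function is not reproduced in this transcript, so the cost used here is an assumption for illustration only: a pool member closer than the threshold D lowers the cost of its label, and one farther away raises it. The names `build_pool` and `cluster`, the synchronous update, and the use of Euclidean distance over the whole dataset to find the most similar and most different samples are likewise hypothetical.

```python
import math
import random


def build_pool(i, data, k, rng):
    """Voting pool: most similar, most different, and k-2 random samples."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: math.dist(data[i], data[j]))
    nearest, farthest = others[0], others[-1]
    middle = others[1:-1]
    return [nearest, farthest] + rng.sample(middle, min(k - 2, len(middle)))


def cluster(data, n_labels, k, threshold, max_iter=50, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # Initialization: random labels and randomly formed voting pools.
    labels = [rng.randrange(n_labels) for _ in range(n)]
    pools = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    for _ in range(max_iter):
        new_labels = []
        for i in range(n):
            # Assumed cost (not the slides' formula): distance minus threshold,
            # accumulated per label over the members of the voting pool.
            cost = [0.0] * n_labels
            for j in pools[i]:
                cost[labels[j]] += math.dist(data[i], data[j]) - threshold
            # Deterministic label selection: take the label of minimum cost.
            new_labels.append(min(range(n_labels), key=cost.__getitem__))
        # Update the voting pools for the next pass.
        pools = [build_pool(i, data, k, rng) for i in range(n)]
        # Stop when a full pass leaves every label unchanged.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical toy data: two well-separated groups of four points each.
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
print(cluster(toy, n_labels=2, k=4, threshold=1.0, seed=1))
```

The random members of each pool are what let distant samples influence one another; the sketch makes no claim about convergence speed or about matching the reported results.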

Experimental Validation
• Both a biological dataset and a simulated dataset
• Adjusted Rand index: similarity measure of two partitions
• Yeast galactose dataset
• Gene expression measurements in galactose utilization in Saccharomyces cerevisiae
• Subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings
• 4 repeated measurements across 20 time points
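For reference, the adjusted Rand index named above can be computed from the contingency counts of two partitions. The sketch below uses the standard pair-counting formula with only the Python standard library; the function name `adjusted_rand_index` is an illustrative choice, and it is not taken from the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Pairs of samples placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Both partitions are trivial (one cluster, or all singletons).
        return 1.0
    return (together - expected) / (maximum - expected)


# The index ignores label names: a relabelled copy still scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

A value of 1 means the two partitions agree exactly, while values near 0 indicate agreement no better than chance.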

Results for Yeast galactose dataset
• [Figure: experimental results on the Yeast galactose dataset, showing its four functional categories]
• We obtained an average Rand index value of 0.943 in 10 experiments, greater than the result of 0.7 in Tjaden et al. 2006.

Simulated Dataset
• Data are generated for 400 genes across 20 time points from six artificial patterns that model periodic, up-regulated and down-regulated gene expression profiles.
• High Gaussian noise is added.
• Perfect partitions are obtained with 10 iterations.
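A dataset of this shape could be generated along the following lines. The slides do not specify the six patterns, the noise level, or how genes are split among patterns, so everything below is an assumption made for illustration: two periodic, two rising and two falling curves, a noise standard deviation of 0.5, and round-robin assignment of genes to patterns. The names `make_patterns` and `simulate` are hypothetical.

```python
import math
import random


def make_patterns(n_times):
    """Six assumed template profiles on a time axis scaled to [0, 1]."""
    t = [k / (n_times - 1) for k in range(n_times)]
    return [
        [math.sin(2 * math.pi * u) for u in t],   # periodic
        [math.cos(2 * math.pi * u) for u in t],   # periodic, phase-shifted
        [2 * u - 1 for u in t],                   # steady up-regulation
        [1 - 2 * math.exp(-4 * u) for u in t],    # fast rise, then plateau
        [1 - 2 * u for u in t],                   # steady down-regulation
        [2 * math.exp(-4 * u) - 1 for u in t],    # fast drop, then plateau
    ]


def simulate(n_genes=400, n_times=20, noise_sd=0.5, seed=0):
    """Return noisy profiles and the index of the pattern behind each one."""
    rng = random.Random(seed)
    patterns = make_patterns(n_times)
    data, truth = [], []
    for g in range(n_genes):
        label = g % len(patterns)  # round-robin assignment (assumption)
        truth.append(label)
        # Add independent Gaussian noise to every time point.
        data.append([v + rng.gauss(0.0, noise_sd) for v in patterns[label]])
    return data, truth


data, truth = simulate()
print(len(data), len(data[0]), sorted(set(truth)))
```

Keeping the ground-truth labels alongside the profiles is what allows a clustering result to be scored against the known partition.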

Conclusions
• A novel unsupervised Conditional Random Fields model for efficient and accurate gene expression time series clustering
• All data points are randomly initialized
• The randomness of the voting pool facilitates global interactions

Future work
• Various similarity measures
• Taking advantage of information from repeated measurements
• Training and testing procedures
