A comparative approach for gene network inference using time-series gene expression data

Guillaume Bourque* and David Sankoff • *Centre de Recherches Mathématiques, Université de Montréal • October 2003

http://www.sri.com/pharmdisc/cancer_biology/laderoute.html DNA Microarrays • Experiment design • Noise reduction • Normalization • … • Data analysis

Gene Expression Data

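Beyond Clustering… • Time series • Gene network • [Figure: network of four genes x1, x2, x3, x4 with activating (+) and inhibiting (-) interactions, one interaction marked "?"]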

Comparative Framework • Species A • Species B • Species C

Harder Problem? • This new problem seems more ambitious and harder to solve. • BUT, we will show that, for closely related species (samples), the comparative framework can actually improve the quality of the solutions recovered. • The repetitive nature of the data can be used to sort through some of the noise and some of the ambiguity.

Outline • Gene network model • Single network inference • Algorithm • Simulations • Multiple networks inference • Algorithm • Simulations • Conclusions

Gene Network Model • We use linear differential equations to model the gene trajectories (Chen et al. ‘99, D’haeseleer et al. ‘99): dxi(t) / dt = ai,0 + ai,1 x1(t) + ai,2 x2(t) + … + ai,m xm(t), where ai,j is the effect of gene j on gene i and ai,0 is a constant term. • Several reasons for that choice: • Takes advantage of the continuous aspect of the data. • Allows for feedback loops. • The low number of parameters makes overfitting the data less likely. • Sufficient to model complex interactions between genes.

Small Network Example • [Figure: network of four genes x1, x2, x3, x4 with activating (+) and inhibiting (-) interactions] • dx1(t) / dt = 0.491 - 0.248 x1(t) • dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t) • dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t) • dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)

Small Network Example (annotated) • Interaction coefficient: a coefficient multiplying a gene xj(t), e.g. 0.376 in dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t). • Constant coefficient: the term with no gene attached, e.g. 0.491 in dx1(t) / dt = 0.491 - 0.248 x1(t).
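To make the small network example concrete, the sketch below integrates its four equations numerically. The coefficients are copied from the slide; the zero initial state, the step size, and the use of a fourth-order Runge-Kutta integrator are assumptions made only for illustration, and the function names are hypothetical.

```python
# Row i holds the constant term followed by the weights on x1..x4 in dx_i/dt,
# copied from the small network example.
COEFFS = [
    [0.491, -0.248, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.473, 0.374],
    [-0.427, 0.376, 0.0, -0.241, 0.0],
    [0.0, 0.435, 0.0, -0.315, -0.437],
]


def derivative(x):
    """Evaluate dx_i/dt = a_i0 + sum_j a_ij * x_j for every gene i."""
    return [row[0] + sum(a * xj for a, xj in zip(row[1:], x)) for row in COEFFS]


def rk4_step(x, dt):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    k1 = derivative(x)
    k2 = derivative([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = derivative([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = derivative([xi + dt * k for xi, k in zip(x, k3)])
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def simulate(x0, dt, steps):
    """Return the list of states visited, starting from x0 (an assumed state)."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


# Assumed initial state and step size; neither is given on the slides.
trajectory = simulate([0.0, 0.0, 0.0, 0.0], dt=0.1, steps=1000)
```

Because x1 depends only on itself, its trajectory settles at 0.491 / 0.248, which gives a quick sanity check on the integrator.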

Problem Revisited • Given the time-series data, can we find the interaction coefficients?

Linear Differential Equations • Even under the simplest linear model, there are m(m+1) unknown parameters to estimate: • m(m-1) directional effects • m self effects • m constant effects • The number of data points is mn (m genes, n time-points), and we typically have n << m (few time-points). • To avoid overfitting, extra constraints must be incorporated into the model, such as: • Smoothness of the equations (D’haeseleer et al. ‘99) • Sparseness of the network, i.e. few non-null interaction coefficients (Yeung et al. ‘02, De Hoon et al. ‘02)
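The counting argument above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the example sizes (100 genes, 20 time-points) are assumed for illustration rather than taken from the slides.

```python
def unknown_parameters(m):
    """Count the parameters of the linear model for m genes."""
    directional = m * (m - 1)  # effect of each gene on every other gene
    self_effects = m           # effect of each gene on itself
    constants = m              # one constant term per gene
    return directional + self_effects + constants  # equals m * (m + 1)


def data_points(m, n):
    """Count the measurements for m genes observed at n time-points."""
    return m * n


# Assumed sizes for illustration only.
m, n = 100, 20
shortfall = unknown_parameters(m) - data_points(m, n)
```

With these assumed sizes there are 10,100 unknowns but only 2,000 measurements, which is why constraints such as sparseness are needed.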

Algorithm for Network Inference • To recover the interaction coefficients, we use stepwise multiple linear regression. • Why? • This procedure finds coefficients that significantly improve the fit in the regression. • It limits the number of non-zero coefficients (i.e. it finds sparse networks), a feature we were seeking. • It is highly flexible and provides p-value scores that can be interpreted easily.

Partial F Test • The procedure finds the interaction coefficients iteratively for each gene xi. • A partial F test compares the total squared error of the predicted gene trajectory with and without a specific subset of coefficients. • If the p-value obtained from the test falls below a certain cutoff, the subset of coefficients is significant and is added; a subset already in the model whose p-value rises above the cutoff is removed. • The procedure iterates until no more subsets of coefficients are added or removed.
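The sketch below illustrates the idea of stepwise selection driven by a partial F statistic. It is a simplification, not the authors' procedure: it only adds single regressors (no removal step, no subsets), and it applies a fixed cutoff to the F statistic as a stand-in for the p-value cutoff, since turning F into a p-value needs the F distribution from a statistics library. All names, the cutoff value, and the toy data are assumptions.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on the columns."""
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    beta = solve(gram, rhs)
    fitted = [sum(b * c[t] for b, c in zip(beta, columns)) for t in range(len(y))]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def partial_f(rss_reduced, rss_full, extra_params, n_obs, full_params):
    """Partial F statistic comparing a reduced model with a larger one."""
    return ((rss_reduced - rss_full) / extra_params) / (rss_full / (n_obs - full_params))


def forward_select(candidates, y, f_cutoff=4.0):
    """Greedily add the regressor with the largest partial F above f_cutoff."""
    intercept = [1.0] * len(y)
    selected = []
    while True:
        current = [intercept] + [candidates[name] for name in selected]
        rss_reduced = rss(current, y)
        best = None
        for name, column in candidates.items():
            if name in selected or len(y) - len(current) - 1 <= 0:
                continue
            f = partial_f(rss_reduced, rss(current + [column], y), 1, len(y), len(current) + 1)
            if best is None or f > best[0]:
                best = (f, name)
        if best is None or best[0] < f_cutoff:
            return selected
        selected.append(best[1])


# Toy data (assumed): y plays the role of an estimated dx_i/dt, and the
# candidates play the role of expression levels of possible regulators.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
residual = [a * b * c for a, b, c in zip(x1, x2, x3)]
y = [0.5 + 2 * a - c + 0.1 * e for a, c, e in zip(x1, x3, residual)]
selected = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
```

In this toy case the target depends on x1 and x3 only, so the selection keeps those two regressors and leaves x2 out, giving a sparse set of predicted regulators.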

Simulations • It is difficult to find coefficients that will produce realistic gene trajectories. • We select coefficients such that the resulting trajectories satisfy 3 conditions: • They are bounded • The correlation of any pair is not too high • They are not too stable • We added Gaussian noise to model errors.
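Two of the three conditions, and the noise step, can be sketched as simple checks on simulated trajectories. The thresholds (`bound`, `max_corr`), the noise level, and all function names are assumptions; the slides give no numeric values, and the "not too stable" condition is left out because its definition is not stated.

```python
import math
import random


def add_gaussian_noise(trajectory, sigma, seed=0):
    """Return a copy of the trajectory with N(0, sigma^2) noise on each point."""
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, sigma) for value in trajectory]


def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def acceptable(trajectories, bound=10.0, max_corr=0.9):
    """Check boundedness and pairwise correlation with assumed thresholds."""
    bounded = all(abs(v) <= bound for traj in trajectories for v in traj)
    count = len(trajectories)
    decorrelated = all(
        abs(pearson(trajectories[i], trajectories[j])) <= max_corr
        for i in range(count)
        for j in range(i + 1, count)
    )
    return bounded and decorrelated


# Assumed noise level, applied to a short made-up trajectory.
noisy = add_gaussian_noise([0.0, 1.0, 0.0, -1.0], sigma=0.05)
```

A candidate set of coefficients would be kept only if `acceptable` returns True for the trajectories it generates, and noise would then be added to the retained trajectories.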

Gaussian Noise

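Network Inference • [Figure: the network of four genes x1, x2, x3, x4 with activating (+) and inhibiting (-) interactions, as recovered by the regression procedure] • The procedure perfectly recovers this network of 4 genes and 10 interaction coefficients.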

10 Genes • The procedure also perfectly recovers this network of 10 genes and 22 interaction coefficients.

Multiple Networks • Species A • Species B • Species C

Types of Problems • Multiple networks related by a graph or a tree can arise from various situations: • Different species • Different developmental stages • Different tissues • The goal is now not only to maximize the fit (with as few interactions as possible) but also to minimize an evolutionary cost on the graph of the networks.

Evolutionary Cost • [Figure: graph of networks whose nodes carry sets of predicted regulators ({1, 2}, {1, 2, 3}, {1}, {1, 3}, {1, 2, 3}), with evolutionary events marked on its edges] • Evolutionary cost = 3
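One plausible reading of the evolutionary cost is the number of regulators gained or lost along the edges of the graph. The sketch below computes that quantity under this assumption. The topology and node names are hypothetical, since the slide transcript keeps only the regulator sets and the total; the arrangement here was chosen because it is consistent with a cost of 3.

```python
def evolutionary_cost(regulators, edges):
    """Sum, over all edges, the regulators present at one end but not the other.

    Assumes each gain or loss of a regulator along an edge counts as one
    evolutionary event.
    """
    return sum(len(regulators[u] ^ regulators[v]) for u, v in edges)


# Hypothetical node names and topology; only the sets come from the slide.
regulators = {
    "ancestor": {1, 2, 3},
    "leaf_a": {1, 2},
    "internal": {1, 3},
    "leaf_b": {1},
    "leaf_c": {1, 2, 3},
}
edges = [
    ("ancestor", "leaf_a"),    # loses regulator 3
    ("ancestor", "internal"),  # loses regulator 2
    ("internal", "leaf_b"),    # loses regulator 3
    ("ancestor", "leaf_c"),    # no change
]

cost = evolutionary_cost(regulators, edges)
```

Under this assumed cost, a move that adds or removes a regulator on one vertex changes the total only through the edges touching that vertex, which keeps the update cheap during the search.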

Multiple Network Inference • The stepwise regression algorithm is modified to add/remove subsets of regulators directly on the edges of the graph. • Partial F tests are computed on the vertices affected by this change to evaluate the change in fit. • The p-values obtained are then modified based on the change in evolutionary cost. • The p-values are finally combined into a scoring function using a Kolmogorov-Smirnov test. • The algorithm iteratively adds/removes the best-scoring move when it is above/below a certain threshold.

Simulation Example

Simulation Results

Conclusions • The comparative framework actually simplifies the inference process especially for instances of the problem with more genes, more noise or fewer time-points. • The procedure could also be used for the revision of gene networks. • Possibility of exploring different evolutionary models. • We need to try the procedure on real data.
