Making Time: Pseudo Time-Series for the Temporal Analysis of Cross-Section Data

Emma Peeling, Allan Tucker • Centre for Intelligent Data Analysis, Brunel University, West London

Cross-Section Data • Studies often involve data sampled from a cross-section of a population • Especially in biological and medical studies • Collecting medical information on patients suffering from a particular disease and controls (healthy) • Essentially these studies show a “snapshot” of the disease process

Cross-Section Data • Many processes are inherently temporal in nature • Previously healthy people can develop a disease over time, passing through different stages of severity • If we want to model the development of such processes, we usually require longitudinal data

Cross-Section vs Longitudinal • [Figure: timeline of disease progression from onset, contrasting a longitudinal study with a cross-section study]

Pseudo Time-Series Models • In this presentation we explore: • Ordering data based upon Minimum Spanning Trees & PQ-Trees (Rifkin et al. 2000) • Treating this ordered data as “Pseudo Time-Series” • Using Pseudo Time-Series to build temporal models • Testing the approach with a dynamic Bayesian network model for classifying: • Medical Data • Gene Expression Data

Multi-Dimensional Scaling • Can be used to visualise distance between data points and pathways • Here we use classic MDS • Metric-based – Euclidean Distance
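
As a rough illustration of classic MDS with Euclidean distance, here is a minimal sketch using NumPy. The function name `classical_mds` and its parameters are assumptions made for this example and do not come from the presentation; the method shown is the standard double-centring and eigendecomposition recipe.

```python
import numpy as np

def classical_mds(points, n_components=2):
    """Classical (Torgerson) MDS on Euclidean distances between rows of `points`."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # Squared Euclidean distance between every pair of samples.
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Double-centre the squared distances to obtain a Gram matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ d2 @ centring
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

Calling `classical_mds(data, n_components=3)` would give coordinates suitable for a 3D plot of the samples and any path drawn through them.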

Minimum Spanning Tree • Connects all nodes in the graph • Links are chosen so that their total weight is minimal • [Figure: a weighted graph and its MST]
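
To make the MST step concrete, the sketch below builds a minimum spanning tree over the complete Euclidean graph of a set of samples using Prim's algorithm. The slides do not say which MST algorithm was used, so Prim's algorithm and the function name `prim_mst` are assumptions for illustration only.

```python
import math

def prim_mst(points):
    """Return MST edges (i, j, weight) of the complete Euclidean graph over `points`."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from the tree to each node
    best_from = [-1] * n        # tree node that provides that cheapest link
    best_dist[0] = 0.0          # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Add the node outside the tree that is cheapest to connect.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Update the cheapest link for every node still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges
```

For n samples this returns n - 1 edges; the quadratic cost is acceptable for the small sample counts typical of cross-section studies.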

PQ-Tree • PQ-Trees are used to encode partial orderings on variables • P nodes: children can be in any order • Q nodes: the order of the children is fixed and can only be reversed
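
The P-node and Q-node rules can be illustrated by enumerating every leaf ordering a small PQ-tree permits. The tuple encoding `(kind, children)` and the function name `frontiers` are hypothetical choices for this sketch, not a representation taken from the presentation.

```python
from itertools import permutations, product

# A tree is either a leaf (an int or str) or a tuple (kind, children),
# where kind is "P" (any child order) or "Q" (given order or its reverse).

def frontiers(node):
    """Yield every left-to-right leaf ordering that the PQ-tree allows."""
    if not isinstance(node, tuple):
        yield [node]
        return
    kind, children = node
    if kind == "P":
        arrangements = permutations(children)
    else:
        arrangements = [tuple(children), tuple(reversed(children))]
    seen = set()
    for arrangement in arrangements:
        # Combine every allowed ordering of each child subtree.
        for parts in product(*(list(frontiers(child)) for child in arrangement)):
            order = tuple(leaf for part in parts for leaf in part)
            if order not in seen:
                seen.add(order)
                yield list(order)
```

For example, `list(frontiers(("Q", [1, ("P", [2, 3]), 4])))` lists the orderings in which 1 and 4 stay at opposite ends while 2 and 3 may swap. Exhaustive enumeration grows factorially with P-node size, which is one reason a search heuristic is needed in practice.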

Dynamic Bayesian Network Classifiers • DBNCs are used to calculate: P(C | X_t, X_{t-1}) • Here, we use the DBNC to model the Pseudo Time-Series for classifying data
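
To show what a quantity of the form P(C | X_t, X_{t-1}) looks like on ordered data, the sketch below estimates it from frequency counts over consecutive samples of a single discretised variable. This is a deliberately simplified stand-in, not a full dynamic Bayesian network and not the authors' model; the function name `fit_transition_classifier`, the Laplace smoothing parameter `alpha`, and the single-variable setting are all assumptions.

```python
from collections import Counter, defaultdict

def fit_transition_classifier(sequence, labels, alpha=1.0):
    """Estimate P(C | x_t, x_prev) by counting over an ordered sequence.

    sequence: discrete states in (pseudo) temporal order
    labels:   class label of each sample, same length as sequence
    alpha:    Laplace smoothing constant (an assumption of this sketch)
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (x_prev, x_t) -> class counts
    for t in range(1, len(sequence)):
        counts[(sequence[t - 1], sequence[t])][labels[t]] += 1

    def posterior(x_t, x_prev):
        seen = counts.get((x_prev, x_t), Counter())
        total = sum(seen.values()) + alpha * len(classes)
        return {c: (seen[c] + alpha) / total for c in classes}

    return posterior
```

A sample would then be classified by taking the class with the highest value in `posterior(x_t, x_prev)`, where `x_prev` is the state of the preceding sample in the pseudo time-series.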

Pseudo Time-Series Models • In Summary: 1: Input: Cross-section data 2: Construct weighted graph and MST 3: Construct PQ tree from MST 4: Derive Pseudo Time-Series from PQ-tree using hill-climb search on P-nodes to minimise sequence length 5: Build DBNC model using pseudo temporal ordering of samples 6: Output: Temporal model of cross-section data
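
Step 4 can be sketched as a simple hill-climb, shown below under several assumptions: sequence length is taken to be the sum of Euclidean distances between consecutive samples, the only move is swapping two children of a randomly chosen P-node, and the tree uses a hypothetical list encoding `[kind, children]` with integer leaves indexing into `points`. The slides do not specify the move set or stopping rule, so this is an illustration rather than the authors' procedure.

```python
import math
import random

def sequence_length(order, points):
    """Sum of Euclidean distances between consecutive samples in `order`."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def frontier(node):
    """Leaves of the PQ-tree read left to right."""
    if isinstance(node, int):
        return [node]
    return [leaf for child in node[1] for leaf in frontier(child)]

def p_nodes(node):
    """Collect every P-node that has at least two children."""
    if isinstance(node, int):
        return []
    found = [node] if node[0] == "P" and len(node[1]) > 1 else []
    for child in node[1]:
        found.extend(p_nodes(child))
    return found

def hill_climb(tree, points, iterations=1000, seed=0):
    """Reorder P-node children in place, keeping only swaps that shorten the sequence."""
    rng = random.Random(seed)
    candidates = p_nodes(tree)
    best = sequence_length(frontier(tree), points)
    if not candidates:
        return frontier(tree), best
    for _ in range(iterations):
        children = rng.choice(candidates)[1]
        i, j = rng.sample(range(len(children)), 2)
        children[i], children[j] = children[j], children[i]
        length = sequence_length(frontier(tree), points)
        if length < best:
            best = length
        else:
            children[i], children[j] = children[j], children[i]  # undo the swap
    return frontier(tree), best
```

The returned ordering is the pseudo time-series; Q-node children are never touched, so the constraints encoded by the tree are preserved. Because only improving swaps are accepted, the search can stop at a local minimum.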

The Datasets • B-Cell Microarray Data • 3 classes of B-Cell data • A number of patients • Pre-ordered into expert pseudo time-series • Visual Field Test Data • One large cross-section study • Healthy and Glaucomatous eyes • One longitudinal study for testing the models

B-Cell: MDS & Pseudo Time-Series • Plots show • Path discovered for B-Cell data in 3D • Classification of B-Cell data in 2D

B-Cell Accuracy • Plot shows mean accuracy and variance over Cross-Validation with repeats

Expert Knowledge • Ordering (sequence length): • Biologist (512.0506): 1-26 • PQ-tree (528.9907): 1-6,7,9,8,11,10,12-18,26,19,21,20,22-25 • PQ-tree and hill-climb (521.1865): 1-18,26,19-25

Visual Field: MDS & Pseudo Time-Series • Plots show • Path found for VF data in 3D • Classification of VF data in 2D

VF Accuracy • Plot shows mean accuracy and variance over Train / Test data with repeats

Related Work • Semi-Supervised Methods • Some datapoints are labelled with classes • These are used to assist classification of others in an incremental manner • Pseudo MTS imposes an order on the data as well as a distance between data • Allows for the prediction of future states

Conclusions • Cross-section data usually captures a snapshot of a process • Longitudinal data is usually needed to model its temporal nature • Here we use ordering methods to create Pseudo Time-Series models • Early results on medical and biological data are promising

Future Work • Dealing with outliers in dataspace • Multiple trajectories (e.g. in VF data) • Normalisation (rather than discretisation) • Combining a number of longitudinal and cross-section studies

Multiple Trajectories

Acknowledgements • Thanks to: • David Garway-Heath, Moorfields Eye Hospital, London • Paul Kellam, University College London
