Multivariate analyses and decoding

Kay H. Brodersen
Computational Neuroeconomics Group, Institute of Empirical Research in Economics, University of Zurich
Machine Learning and Pattern Recognition Group, Department of Computer Science, ETH Zurich

1 Introduction

Why multivariate? Haxby et al. 2001 Science

Why multivariate?
Multivariate approaches can reveal information jointly encoded by several voxels (Kriegeskorte et al. 2007 NeuroImage).

Why multivariate?
Multivariate approaches can exploit a sampling bias in voxelized images (Boynton 2005 Nature Neuroscience).

Mass-univariate vs. multivariate analyses
Mass-univariate approaches treat each voxel independently of all other voxels, such that the implicit likelihood factorises over voxels. Spatial dependencies between voxels are introduced after estimation, during inference, through random field theory. This allows us to make multivariate inferences over voxels (i.e., cluster-level or set-level inference).
Multivariate approaches, by contrast, relax the assumption of independence and enable inference about distributed responses without requiring focal activations or certain topological response features. They can therefore be more powerful than mass-univariate analyses.
The key challenge for all multivariate approaches is the high dimensionality of multivariate brain data.
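The factorisation assumed by mass-univariate approaches can be written out as follows. This is a sketch in assumed notation that does not come from the slides: $Y$ is the data from all $V$ voxels, $y_v$ the time series of voxel $v$, $X$ the design matrix, and $\theta_v$ the parameters of voxel $v$.

```latex
p(Y \mid X, \theta) \;=\; \prod_{v=1}^{V} p(y_v \mid X, \theta_v)
```

A multivariate model drops this product form and models the joint distribution over voxels directly, which is where the high dimensionality becomes a problem.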

Models & terminology
[Figure: context (stimulus and behaviour) and brain activity, linked by encoding (of properties of stimulus and behaviour) in one direction and decoding in the other.]
Four distinctions:
0. Prediction or inference?
1. Encoding or decoding?
2. Univoxel or multivoxel?
3. Classification or regression?
• The goal of prediction is to maximize the accuracy with which brain states can be decoded from fMRI data.
• The goal of inference is to decide between competing hypotheses about structure-function mappings in the brain. For example: compare a model that links distributed neuronal activity to a cognitive state with a model that does not.

Models & terminology
1. Encoding or decoding?
   • An encoding model (or generative model) relates context (independent variable) to brain activity (dependent variable).
   • A decoding model (or recognition model) relates brain activity (independent variable) to context (dependent variable).
2. Univoxel or multivoxel?
   • In a univoxel model, brain activity is the signal measured in one voxel. (Special case: mass-univariate.)
   • In a multivoxel model, brain activity is the signal measured in many voxels.
3. Regression or classification?
   • In a regression model, the dependent variable is continuous.
   • In a classification model, the dependent variable is categorical (typically binary).

2 Classification

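Classification
[Figure: classification pipeline. (1) Feature extraction turns the fMRI timeseries into a trials × voxels matrix. (2) Feature selection. (3) Classification: a classifier is trained on labelled training examples (classes A and B) and predicts the unknown labels of the test examples, which yields an accuracy estimate (% correct).]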

Linear vs. nonlinear classifiers
Most classification algorithms are based on a linear model that discriminates the two classes. If the data are not linearly separable, a nonlinear classifier may still be able to tell different classes apart. (Here: discriminative point classifiers.)

Training and testing
[Figure: 100 examples split across folds; in each fold, one subset serves as test examples and the remainder as training examples.]
• We need to train and test our classifier on separate datasets. Why?
  • Using the same examples for training and testing means overfitting may remain unnoticed, implying an optimistic accuracy estimate.
  • Instead, what we are interested in is generalizability: the ability of our algorithm to correctly classify previously unseen examples.
• An efficient splitting procedure is cross-validation.
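The splitting procedure described above can be sketched as follows. This is a minimal illustration, not the procedure used in any of the cited studies: the nearest-centroid classifier, the simulated data, and all function names are assumptions made for the example.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Store one mean pattern (centroid) per class.
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    # Assign each example to the class with the closest centroid.
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def cross_validate(X, y, n_folds=10, seed=0):
    # Shuffle the examples, split them into folds, and use each fold
    # once as the test set while training on all remaining examples.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    n_correct = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = nearest_centroid_fit(X[train_mask], y[train_mask])
        pred = nearest_centroid_predict(model, X[test_idx])
        n_correct.append(int((pred == y[test_idx]).sum()))
    return n_correct

# Simulated data: 100 examples, 20 features, 3 of which carry class signal.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[y == 1, :3] += 1.5

n_correct = cross_validate(X, y)          # correct predictions per fold
cv_accuracy = sum(n_correct) / len(y)     # pooled accuracy estimate
```

Because every example is tested exactly once by a model that never saw it during training, the pooled count estimates generalizability rather than goodness of fit.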

Target questions for decoding studies
(a) Pattern discrimination (overall classification)
[Figure: accuracy (%) between 50% and 100% for different classification tasks, e.g., "Left or right button?", "Truth or lie?", "Healthy or diseased?"; example values 80% and 55%.]
(b) Spatial pattern localization
(c) Temporal pattern localization
[Figure: accuracy (%) between 50% and 100% as a function of intra-trial time; annotations: "Accuracy rises above chance", "Participant indicates decision".]
(d) Pattern characterization: inferring a representational space and extrapolation to novel classes (Mitchell et al. 2008 Science)
Brodersen et al. (2009) The New Collection

(a) Overall classification
Performance evaluation – example (Brodersen et al. (2010) ICPR)
Given 100 trials and leave-10-out cross-validation, we measure performance by counting the number of correct predictions on each fold:
Correct predictions out of 10 test examples, per fold: 6, 5, 7, 8, 4, 9, 6, 7, 7, 5 (64 out of 100 in total)
How probable is it to get 64 out of 100 correct if we had been guessing? Making a Binomial assumption about the null model, we can show that our result is statistically significant at the 0.01 level.
Problem: accuracy is not a good performance measure.
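The significance claim above can be checked with a short calculation. This sketch assumes a chance level of 0.5 (two balanced classes) and treats all 100 predictions as independent; neither assumption is stated explicitly on the slide, and the function name is made up for the example.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p): the probability of at least
    # k correct predictions out of n under pure guessing.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

fold_counts = [6, 5, 7, 8, 4, 9, 6, 7, 7, 5]   # correct predictions per fold
total_correct = sum(fold_counts)                # pooled over all 10 folds
p_value = binomial_tail(total_correct, n=100)
print(total_correct, p_value)
```

The one-sided tail probability falls below 0.01, in line with the statement on the slide.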

The support vector machine
Intuitively, the support vector machine finds a hyperplane that maximizes the margin between the plane and the nearest examples on either side.
For nonlinear mappings, the kernel converts a low-dimensional nonlinear problem into a high-dimensional linear problem.
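A toy illustration of the second point, under stated assumptions: the XOR-like dataset and the explicit feature map below are chosen for the example, and a real kernel machine would compute inner products in the feature space implicitly rather than building the mapped features.

```python
import numpy as np

# XOR-like problem: the class is the sign of x1 * x2, so no straight
# line in the original 2-D space separates the two classes.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# A least-squares linear fit (with intercept) in the original space
# finds no usable direction: all weights come out as zero.
X_aug = np.column_stack([X, np.ones(len(X))])
w_linear, *_ = np.linalg.lstsq(X_aug, y.astype(float), rcond=None)

def feature_map(X):
    # Explicit nonlinear projection into 3-D: (x1, x2) -> (x1, x2, x1 * x2).
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# In the mapped space, the hyperplane with normal vector (0, 0, 1)
# separates the classes perfectly.
w_mapped = np.array([0.0, 0.0, 1.0])
pred = np.sign(feature_map(X) @ w_mapped).astype(int)
```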

Temporal feature extraction
Deconvolved BOLD signal
[Figure: trial-by-trial design matrix over time (volumes), with one column per trial and trial phase (trial 1 decide phase, trial 1 result phase, trial 2 result phase, ...) plus confounds.]
→ result: one beta value per trial, phase, and voxel
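A minimal sketch of trial-wise beta estimation, assuming one phase per trial for brevity. The gamma-shaped response function is a simple stand-in and not SPM's canonical HRF; the onsets, TR, noise level, and all names are made up for the example.

```python
import numpy as np

def hrf(t):
    # Gamma-shaped response peaking at 5 s (illustrative stand-in only).
    h = t ** 5 * np.exp(-t)
    return h / h.sum()

def trialwise_design(onsets, n_scans, tr):
    # One regressor per trial: a stick at the trial onset, convolved
    # with the response function and truncated to the scan length.
    kernel = hrf(np.arange(0, 30, tr))
    X = np.zeros((n_scans, len(onsets)))
    for j, onset in enumerate(onsets):
        stick = np.zeros(n_scans)
        stick[int(round(onset / tr))] = 1.0
        X[:, j] = np.convolve(stick, kernel)[:n_scans]
    return X

rng = np.random.default_rng(0)
tr, n_scans, n_voxels = 2.0, 120, 5
onsets = np.arange(10, 220, 12.0)            # 18 trials, 12 s apart

X = trialwise_design(onsets, n_scans, tr)
X = np.column_stack([X, np.ones(n_scans)])   # constant term as a confound

# Simulate data with known trial-wise amplitudes, then recover them.
true_betas = rng.normal(size=(len(onsets), n_voxels))
Y = X[:, :-1] @ true_betas + 100.0 + 0.01 * rng.normal(size=(n_scans, n_voxels))

betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
trial_betas = betas[:-1]                     # trials x voxels feature matrix
```

Each row of `trial_betas` is then one example for the classifier, with one feature per voxel.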

(b) Spatial information mapping
METHOD 1: Consider the entire brain, and find out which voxels are jointly discriminative
• e.g., based on a classifier with a constraint on sparseness in features (Hampton & O’Doherty (2007); Grosenick et al. (2008, 2009))
METHOD 2: At each voxel, consider a small local environment, and compute a distance score
• e.g., based on a CCA (Nandy & Cordes (2003) Magn. Reson. Med.)
• e.g., based on a classifier
• e.g., based on Euclidean distances
• e.g., based on Mahalanobis distances (Kriegeskorte et al. (2006, 2007a, 2007b); Serences & Boynton (2007) J Neuroscience)
• e.g., based on the mutual information
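Method 2 with a Euclidean distance score can be sketched as follows. This is an illustration under assumptions: it uses a cubic rather than spherical neighbourhood, two classes, simulated data, and made-up function and variable names. Neighbourhoods at the volume edge contain fewer voxels and therefore tend to score lower.

```python
import numpy as np

def searchlight_distance(data, labels, radius=1):
    # data: (n_trials, nx, ny, nz); labels: (n_trials,) with values 0 or 1.
    # For each voxel, compute the Euclidean distance between the two class
    # mean patterns within a cubic neighbourhood centred on that voxel.
    mean_diff = data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0)
    nx, ny, nz = mean_diff.shape
    score = np.zeros((nx, ny, nz))
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                block = mean_diff[max(x - radius, 0):x + radius + 1,
                                  max(y - radius, 0):y + radius + 1,
                                  max(z - radius, 0):z + radius + 1]
                score[x, y, z] = np.sqrt(np.sum(block ** 2))
    return score

# Simulated volume: noise everywhere, class signal in a 3 x 3 x 3 block.
rng = np.random.default_rng(0)
n_trials, shape = 40, (8, 8, 8)
labels = np.repeat([0, 1], n_trials // 2)
data = rng.normal(size=(n_trials,) + shape)
data[labels == 1, 3:6, 3:6, 3:6] += 2.0

score_map = searchlight_distance(data, labels, radius=1)
peak = np.unravel_index(score_map.argmax(), score_map.shape)
```

The resulting map assigns each voxel the discriminability of its local neighbourhood, so the peak should fall inside the block that carries the signal.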

(b) Spatial information mapping
Example 1 – decoding whether a subject will switch or stay
Example 2 – decoding which option was chosen
[Figure: statistical maps for the two examples, labelled "decision" and "outcome", with a colour scale from t = 3 to t ≥ 5; slice at x = 12 mm.]
Brodersen et al. (2009) HBM; Hampton & O’Doherty (2007) PNAS

(c) Temporal information mapping
Example – decoding which button was pressed
[Figure: classification accuracy over time in motor cortex and frontopolar cortex, relative to decision and response.]
Soon et al. (2008) Nature Neuroscience

(d) Pattern characterization
Example – decoding which vowel a subject heard, and which speaker had uttered it
[Figure: fingerprint plot over voxel 1, voxel 2, ... (one plot per class).]
Formisano et al. (2008) Science

Limitations
• Constraints on experimental design
  • When estimating trial-wise beta values, we need longer ITIs (typically 8–15 s).
  • At the same time, we need many trials (typically 100+).
  • Classes should be balanced.
• Computationally expensive
  • e.g., fold-wise feature selection
  • e.g., permutation testing
• Classification accuracy is a surrogate statistic
• Classification algorithms involve many heuristics

3 Multivariate Bayesian decoding

Multivariate Bayesian decoding (MVB)
• Multivariate analyses in SPM are not implemented in terms of the classification schemes outlined in the previous section.
• Instead, SPM brings classification into the conventional inference framework of hierarchical models and their inversion.
• MVB can be used to address two questions:
  • Overall classification – using a cross-validation scheme (as seen earlier)
  • Inference on different forms of structure-function mappings – e.g., smooth or sparse coding (new)

Model
• Encoding models: X as a cause
• Decoding models: X as a consequence

Empirical priors on voxel weights
Decoding models are typically ill-posed: there is an infinite number of equally likely solutions. We therefore require constraints or priors to estimate the voxel weights. SPM specifies several alternative coding hypotheses in terms of empirical spatial priors on voxel weights (Friston et al. (2008) NeuroImage):
• Null
• Spatial vectors
• Smooth vectors
• Singular vectors
• Support vectors

MVB – example design matrix
MVB can be illustrated using SPM’s attention-to-motion example dataset (Buechel & Friston 1999 Cerebral Cortex; Friston et al. 2008 NeuroImage).
• This dataset is based on a simple block design. Each block is a combination of some of the following three factors:
  • photic – there is some visual stimulus
  • motion – there is motion
  • attention – subjects are paying attention
• We form a design matrix by convolving box-car functions with a canonical haemodynamic response function.
[Figure: design matrix with columns photic, motion, attention, and constant; blocks of 10 scans.]

MVB – example

MVB – example
MVB-based predictions closely match the observed responses. Crucially, however, they do not match them perfectly: a perfect match would indicate overfitting.

MVB – example
[Figure annotation: log BF = 3]
The highest model evidence is achieved by a model that recruits 4 partitions. The weights attributed to each voxel in the sphere are sparse and multimodal. This suggests sparse coding.

4 Further model-based approaches

Challenges for all decoding approaches
Challenge 1 – feature selection and weighting, to make the ill-posed many-to-one mapping tractable
Challenge 2 – neurobiological interpretability of models, to improve the usefulness of insights that can be gained from multivariate analysis results

Further model-based approaches (1)
Approach 1 – identification (inferring a representational space)
• estimation of an encoding model
• nearest-neighbour classification or voting
Mitchell et al. (2008) Science

Further model-based approaches (2)
Approach 2 – reconstruction / optimal decoding
• estimation of an encoding model
• model inversion
Paninski et al. (2007) Progr Brain Res; Pillow et al. (2008) Nature; Miyawaki et al. (2009) Neuron

Further model-based approaches (3)
Approach 3 – decoding with model-based feature construction
Brodersen et al. (2010) NeuroImage; Brodersen et al. (2010) HBM

Summary
• Multivariate analyses can make use of information jointly encoded by several voxels and may therefore offer higher sensitivity than mass-univariate analyses.
• There is some confusion about terminology in current publications. Remember the distinction between prediction vs. inference, encoding vs. decoding, univoxel vs. multivoxel, and classification vs. regression.
• The main target questions in classification studies are (i) pattern discrimination, (ii) spatial information mapping, (iii) temporal information mapping, and (iv) pattern characterization.
• Multivariate Bayes offers an alternative scheme that maps multivariate patterns of activity onto brain states within the conventional statistical framework.
• The future is likely to see more model-based approaches.

5 Supplementary slides

The most common multivariate analysis is classification
Classification is the most common type of multivariate fMRI analysis to date. By classification we mean decoding a categorical label from multivoxel activity.
Lautrup et al. (1994) reported the first classification scheme for functional neuroimaging data. Classification was then reintroduced by Haxby et al. (2001, Science). In their study, the overall spatial pattern of activity was found to be more informative in distinguishing object categories than any brain region on its own.

Temporal unit of classification
• The temporal unit of classification specifies the amount of data that forms an individual example. Typical units are:
  • one trial → trial-by-trial classification
  • one block → block-by-block classification
  • one subject → across-subjects classification
• Choosing a temporal unit of classification reveals a trade-off:
  • smaller units mean noisier examples but a larger training set
  • larger units mean cleaner examples but a smaller training set
• The most common temporal unit of classification is an individual trial.

Temporal unit of classification
[Figure: normalized BOLD response from ventral striatum over time (s) for rewarded and unrewarded trials; two panels: subject 1, and n = 24 subjects with 180 trials each.]
Brodersen, Hunt, Walton, Rushworth, Behrens 2009 HBM

Alternative temporal feature extraction
Interpolated raw BOLD signal
[Figure: signal (a.u.) over microtime across the decision, delay, and result phases for subjects 1, 2, and 3, together with the averaged signal across all trials.]
→ result: any desired number of sampling points per trial and voxel

Alternative temporal feature extraction
Deconvolved BOLD signal, expressed in terms of 3 basis functions
Step 1: sample many HRFs from given parameter intervals
Step 2: find a set of 3 orthogonal basis functions that can be used to approximate the sampled functions
[Figure: sampled HRFs (step 1) and basis functions 1, 2, and 3 (step 2).]
→ result: three values per trial, phase, and voxel
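The two steps can be sketched as follows. The slide does not say how the basis is found; a singular value decomposition is used here as one plausible choice. The gamma-shaped response family, the parameter intervals, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 30, 0.5)                    # peristimulus time in seconds

# Step 1: sample many response functions from given parameter intervals.
# Each function is gamma-shaped, reaches its maximum of 1 at t = peak,
# and becomes narrower as the shape parameter grows.
peaks = rng.uniform(4.0, 7.0, size=500)
shapes = rng.uniform(3.0, 8.0, size=500)
H = np.array([(t / p) ** a * np.exp(a * (1 - t / p))
              for p, a in zip(peaks, shapes)])   # 500 functions x 60 time points

# Step 2: find 3 orthogonal basis functions that approximate the samples.
# The leading right singular vectors are orthonormal by construction.
U, s, Vt = np.linalg.svd(H, full_matrices=False)
basis = Vt[:3]                               # 3 basis functions x 60 time points
explained = (s[:3] ** 2).sum() / (s ** 2).sum()

# Any sampled function is now summarized by 3 coefficients.
coeffs = H @ basis.T
approx = coeffs @ basis
```

Using the three basis functions as regressors for each trial and phase then gives three values per trial, phase, and voxel, as stated on the slide.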

Classification of methods for feature selection
• A priori structural feature selection
• A priori functional feature selection
• Fold-wise univariate feature selection
• Scoring
• Choosing a number of features
• Fold-wise multivariate feature selection
• Filtering methods
• Wrapper methods
• Embedded methods
• Fold-wise hybrid feature selection
• Searchlight feature selection
• Recursive feature elimination
• Sparse logistic regression
• Unsupervised feature-space compression
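As an illustration of fold-wise univariate feature selection (scoring, then choosing a number of features), here is a minimal sketch. The two-sample t-score, the fixed number of features, the single train/test split, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def t_scores(X, y):
    # Scoring: absolute two-sample t-statistic (Welch form) per feature.
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(diff / se)

def select_top_k(X_train, y_train, k):
    # Choosing a number of features: keep the k highest-scoring ones.
    return np.argsort(t_scores(X_train, y_train))[::-1][:k]

# Simulated data: 200 features, of which only the first five are informative.
rng = np.random.default_rng(1)
n, d = 80, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[y == 1, :5] += 2.0

# "Fold-wise" means the scores are computed on the training examples of the
# current fold only, so the held-out test examples never influence selection.
train = np.ones(n, dtype=bool)
train[::4] = False                           # hold out every fourth example
selected = select_top_k(X[train], y[train], k=5)
X_train_sel, X_test_sel = X[train][:, selected], X[~train][:, selected]
```

Selecting features on the full dataset before splitting would leak test information into training and inflate the accuracy estimate.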

Training and testing a classifier
• Training phase
  • The classifier is given a set of $n$ labelled training samples $(x_i, y_i)$, $i = 1, \dots, n$, from some data space, where
    • $x_i$ is a $d$-dimensional attribute vector
    • $y_i$ is its corresponding class.
  • The goal of the learning algorithm is to find a function that adequately describes the underlying attributes/class relation.
  • For example, a linear learning machine finds a function $f(x) = w^\top x + b$ which assigns a given point $x$ to the class $\hat{y} = \operatorname{sign}(f(x))$ such that some performance measure is maximized.

Training and testing a classifier
• Test phase
  • The classifier is now confronted with a test set of unlabelled examples and assigns each example $x$ to an estimated class $\hat{y}$.
  • We could then measure generalization performance in terms of the relative number of correctly classified test examples, i.e., the number of correct predictions divided by the number of test examples.

The support vector machine
• Nonlinear prediction problems can be turned into linear problems by using a nonlinear projection of the data onto a high-dimensional feature space.
• This technique is used by a class of prediction algorithms called kernel machines.
• The most popular kernel method is the support vector machine (SVM).
• SVMs make training and testing computationally efficient.
• We can easily reconstruct feature weights.
• However, SVM predictions do not have a probabilistic interpretation.
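For the feature weights mentioned above, the standard expression for an SVM with a linear kernel is sketched below. The notation is assumed here and does not come from the slides: $\alpha_i \ge 0$ are the learned dual coefficients, $y_i \in \{-1, +1\}$ the training labels, $x_i$ the training examples, and $b$ the offset.

```latex
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i ,
\qquad
f(x) = w^\top x + b
```

Only the support vectors have $\alpha_i > 0$, so the sum effectively runs over those examples alone. With a nonlinear kernel, $w$ lives in the feature space and generally cannot be mapped back onto individual voxels this directly.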

Performance evaluation (Brodersen, Ong, Buhmann, Stephan (2010) ICPR)
Evaluating the performance of a classification algorithm critically requires a measure of the degree to which unseen examples have been identified with their correct class labels.
The procedure of averaging across accuracies obtained on individual cross-validation folds is flawed in two ways. First, it does not allow for the derivation of a meaningful confidence interval. Second, it leads to an optimistic estimate when a biased classifier is tested on an imbalanced dataset.
Both problems can be overcome by replacing the conventional point estimate of accuracy by an estimate of the posterior distribution of the balanced accuracy.

Performance evaluation (Brodersen, Ong, Buhmann, Stephan (2010) ICPR)
The simulations show how a biased classifier applied to an imbalanced test set leads to a hugely optimistic estimate of generalizability when measured in terms of the accuracy rather than the balanced accuracy.
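The effect can be reproduced with a small numerical sketch. It is not the simulation from the cited paper. It assumes flat Beta(1, 1) priors on the two class-wise accuracies and approximates the posterior of the balanced accuracy by sampling; the confusion-matrix counts and function names are made up for the example.

```python
import numpy as np

def accuracy(tp, fn, tn, fp):
    # Fraction of all test examples that were classified correctly.
    return (tp + tn) / (tp + fn + tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    # Average of the accuracies obtained on each class separately.
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def balanced_accuracy_posterior(tp, fn, tn, fp, n_samples=100_000, seed=0):
    # With flat priors, each class-wise accuracy has a Beta posterior.
    # Averaging paired samples yields samples of the balanced accuracy.
    rng = np.random.default_rng(seed)
    acc_pos = rng.beta(tp + 1, fn + 1, size=n_samples)
    acc_neg = rng.beta(tn + 1, fp + 1, size=n_samples)
    return 0.5 * (acc_pos + acc_neg)

# A maximally biased classifier that always predicts "positive",
# tested on an imbalanced set of 90 positive and 10 negative examples.
tp, fn, tn, fp = 90, 0, 0, 10

acc = accuracy(tp, fn, tn, fp)               # looks impressive
bacc = balanced_accuracy(tp, fn, tn, fp)     # reveals chance-level performance
samples = balanced_accuracy_posterior(tp, fn, tn, fp)
interval = np.percentile(samples, [2.5, 97.5])
```

The plain accuracy rewards the classifier for the class imbalance alone, whereas the balanced accuracy stays at chance, and the posterior samples provide an interval that a single point estimate cannot.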

Multivariate Bayes – maximization of the model evidence

Multivariate Bayesian decoding – example
• MVB can be illustrated using SPM’s attention-to-motion example dataset (Buechel & Friston 1999 Cerebral Cortex; Friston et al. 2008 NeuroImage).
• This dataset is based on a simple block design. Each block belongs to one of the following conditions:
  • fixation – subjects see a fixation cross
  • static – subjects see stationary dots
  • no attention – subjects see moving dots
  • attention – subjects monitor moving dots for changes in velocity
• We wish to decode whether or not subjects were exposed to motion. We begin by recombining the conditions into three orthogonal conditions:
  • photic – there is some form of visual stimulus
  • motion – there is motion
  • attention – subjects are required to pay attention

Further model-based approaches
Approach 1 – identification (inferring a representational space)
Kay et al. (2008) Nature
