Introduction to resampling in MATLAB

Feb 5 2009 Austin Hilliard Introduction to resampling in MATLAB

So you've done an experiment... • Two independent datasets: • control • experimental • Have n numbers in each dataset, representing independent measurements • Question: Did the experimental manipulation have an effect, i.e. did it make a difference?

The question restated • Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets? • What is the probability that the two datasets are subsets of two different/independent distributions? • What is the probability that the two datasets are NOT subsets of the same underlying distribution?

Answering the question • What do we need? • A number(s) to represent/summarize each dataset → datadescriptors • A test for comparing these numbers → test statistic • A way to assess the significance of the test statistic → distribution of test statistics generated from datasets that ARE from the same underlying distribution

Descriptors and tests • Know your data! • Visualize it • Does it have a central tendency? (i.e. is histogram vaguely mound shaped?) • - If yes, use the mean as descriptor • - If no, median may be better • Typical test: compute difference (subtraction) between descriptors

We'll discuss how to assess the significance of your test statistic soon, that's the resampling part... • But first let's visualize some data

Simulate and inspect data ctrl = 10.*(5.75+randn(25,1)); exp = 10.*(5+randn(25,1)); boxplot([ctrl exp]) groups = {'control''experimental'} boxplot([ctrl exp],'labels',groups)

Central tendency? hist(ctrl) hist(exp) title('Control group') title('Experimental group')

Compute descriptors and test stat • Our data looks pretty mound shaped, so let's use the mean as data descriptor • Take the absolute value of the difference as our test stat ctrl_d = mean(ctrl); exp_d = mean(exp); test_stat = abs(ctrl_d - exp_d);

Assessing the test statistic • What we would really like to test: • The probability that the two datasets are subsets of two different/independent distributions • Problem: • We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!

Assessing the test statistic • Solution: • Compute the probability that the two datasets are NOT subsets of the same underlying distribution • How to do this? • Start by assuming datasets are subsets of same distribution → null hypothesis • See what test statistic looks like when null hypothesis is true

Assessing the test statistic • Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution → distribution of test stat under null hypothesis • Then we can quantify how (a)typical the value of our test stat is under null hypothesis • if typical, our datasets are likely from same distribution → no effect of experiment • if atypical, there is a good chance datasets are from different distributions → experiment had an effect

Generate the null distribution using bootstrap • Basic procedure: • Combine datasets • Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets • Compute descriptors and test statistic for pseudo datasets • Repeat 10,000 times, keeping track of pseudo test stat

Combine datasets and resample box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end

Now what? • Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution • This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution → p-value

Compute p-val and visualize null distribution bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

How to interpret the p-value? • Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic Restated → • Assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do

“Paired” data • What if your two datasets are not actually independent? • Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions • Important: the measurements within each condition must still be independent

Bootstrap to test paired data • Basic strategy is the same, but... • Instead of one descriptor/group, have as many descriptors as data pairs • compute difference between measurement under condition1 and measurement under condition2 • Test statistic is mean/median/etc. of the differences • Slightly different resampling procedure

Visualize data • Investigate distributions for raw data, also should plot distribution of the differences between paired data-points • Why? → Test statistic is computed from these differences, not the raw datasets

Resampling procedure • Randomly sample 1 or -1 and store in vector • Multiply by vector of differences • randomizes direction of change • Compute test stat (mean, median, etc.) for randomized differences • Repeat • Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions

Compute descriptors and test stat diffs = ctrl - exp; % compute change from ctrl condition % to exp condition test_stat = abs(mean(diffs)); % compute test statistic pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test % stats generated from resampling

Resampling for trials = 1:10000 % resample 10,000 times signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's % and -1's same size as diffs pseudo_diffs = signs .* diffs; % multiply by actual diffs to % randomize direction of diff pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from % pseudo diffs pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to % grow null distribution end

Compute p-val and visualize null distribution bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp) % --- prepare for resampling --- diffs = ctrl - exp; % compute change from ctrl condition % to exp condition test_stat = abs(mean(diffs)) % compute test statistic pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test % stats generated from resampling % --- resampling --- for trials = 1:10000 % resample 10,000 times signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's % and -1's same size as diffs pseudo_diffs = signs .* diffs; % multiply by actual diffs to % randomize direction of diff pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from % pseudo diffs pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to % grow null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

function [] = bootstrap2groupsMEANS (ctrl, exp) % % function [] = bootstrap2groupsMEANS (ctrl, exp) % % 'ctrl' and 'exp' are assumed to be column vectors, this is % necessary for creation of 'box' % % --- prepare for resampling --- ctrl_d = mean(ctrl); % compute descriptors exp_d = mean(exp); % 'ctrl_d' and 'exp_d' test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat' box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10,000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

Don't forget to look online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m' • It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'

Possible topics for next week • confidence intervals • normality testing • ANOVA (1 or 2-way) • power

Introduction to resampling in MATLAB