Sample Size calculations for multilevel models (Part II)

Sample Size calculations for multilevel models (Part II) William Browne and Mousa Golalizadeh Department of Clinical Veterinary Sciences University of Bristol

Summary • Introduction to sample size calculations. • Simulation-based approaches. • The MLPOWSIM software. • Two level normal examples. • Binary responses. • Cross classified models. • Future plans.

Background • Many quantitative social science research questions are of the form of a hypothesis – A has a significant effect on B. • To answer such a question data is collected that allows the researcher to (hopefully) test whether statistically A has a significant effect on B. (In fact we aim to reject the hypothesis that A doesn’t significantly affect B). • A test is performed and either the researcher is happy and A indeed has a significant effect on B or is left wondering why the data collected do not back up their hypothesis. Is the hypothesis false or was the data not sufficient? • The sufficiency of the data is the motivation for sample size calculations.

Example • Suppose I have the research question ‘Are Welshmen on average taller than 175 cms?’ • I now need to get hold of a random sample of n Welshmen and measure each of their heights. • I make some statistical assumption about the distribution of the heights of Welshmen e.g. that they come from a Normal distribution. • I might like to check this assumption by plotting a histogram of the data. • I can then form a statistical hypothesis test and test whether indeed Welshmen are taller than 175cms. • I need to decide how big to make n, my sample of Welshmen.

Hypothesis Testing • Let us assume our null hypothesis is that the average height of Welshmen (μ) is 175cm. • So we test H0:μ=175 vs HA:μ>175 (or alternatively H0:θ=0 vs HA:θ>0 where θ=μ-175) • In practice we calculate from our sample its mean ( ) and standard deviation (s2) and use these along with n to form a test statistic which we can compare with the distribution assumed under H0

Type I and Type II errors • No hypothesis test is perfect and there is always the possibility of errors • P(Type I error) = α = significance level or size • P(Type II error) = β, 1-β is the power of the test. • In general we fix α to some value e.g. 0.05, 0.01 then 1-β depends on our sample size.

Example hypothesis test • Let us assume that in reality our sample mean is 180cms and the population standard deviation (sd) is 5cms (known). • We can then form a test statistic as follows: • Note here that for small n and unknown sd we should use a student-t distribution rather than Normal. • For a 1-sided Z test we wish Z= > 1.645 and so we need our sample to be of size 3 to reject H0, using a student-t distribution increases this to 5. (Here α=0.05) • However if the sample mean had been only 176cms then we would need n > (1.645*5)2 = 68 Welshmen to reject H0

Power calculations • Our last slide in some sense is backwards as we cannot get from a given sample mean to choosing a sample size! • What we do instead is use different terminology and play God! • We will choose an ‘effect size’, γ which will represent a guess at the increase in the sample mean for Welshmen. • There then exists an (approximate) formula that links four quantities, size (α), power (1-β), effect size (γ) and sample size (n) • Note that the standard error (SE) of γ is a function of n and σ the population sd which is assumed known. • We can now evaluate one of these quantities conditional on the others e.g. what sample size is required given α,1-β and γ? Here RHS is sum of cases H0 true and H0 false.

Welsh height example Here we have looked at two examples with effect sizes 5 and 1 respectively. Assume σ takes the value 5 and so let us suppose we take a sample of size 25 Welshmen. Then Case 1: 5/(5/√25)=1.645+z1-β,z1-β=3.355 β=0.9996 Case 2: 1/(5/ √25)=1.645+z1-β,z1-β=-0.645 β=0.25946 So here a sample of 25 Welshmen from a population with mean 180cms would almost always result in rejecting H0, but if the population mean is 176cms then only 26% of such samples would be rejected. We can plot curves of how power increases with sample size as shown in the next slide.

Power curve for Welshmen example Here we see the two power curves for the two scenarios:

Extending the idea • The simple formula can be used in many situations and hypothesis tests. • To generalise the idea we assume that γ is an effect size associated with a statistic that we wish to compare with a (null) hypothesized value of 0. • The complication occurs in finding a formula for the standard error for the statistic and relating this formula to the sample size, n. • We will next consider an alternative approach before returning to look at how the above approach extends to multilevel models.

The use of simulation • In reality our (hoped for) research path will be as follows: Construct research question -> Form null hypothesis that we believe false -> Collect appropriate data -> Reject hypothesis therefore proving our research question. • Assuming what we believe in our research question is correct and hence null hypothesis is false we can still be let down by not collecting enough data. • The idea behind using simulation is to simulate the data gathering process (assuming we know the right answer) many times and see how often we can reject the null hypothesis. The percentage of rejected null hypotheses (via simulation) will then estimate power.

Simulation in our example • Consider our Welsh height example case 2 where we believe Welshmen have a mean height of 176cms (and sd = 5cms) and we are testing the hypothesis H0:μ=175cms, and we consider a sample size 25. • Then we generate N samples (e.g. 5000) of size 25, • and for each sample form a lower bound for the confidence interval of the form • . This we compare with the value 175 and the proportion greater than 175 is an estimate of the power of the test. • We can repeat this exercise for different sample sizes and form a power curve.

Power curve comparison Note simulation curve is a good approximation of the theoretical curve although there are some minor (Monte Carlo) errors even with 5000 simulations per sample size.

Advantages/Disadvantages • Theoretical approach is quick when the formula can be derived. • Approximations for more complex situations exist which are equally quick. • Simulation approach generalizes to more situations but is much slower and we may need large numbers of simulations per scenario to get accurate power estimates. • Note that alternative method described next typically needs less simulations per scenario for the same accuracy.

Methods to calculate Power calculations from simulations • The method illustrated previously we will describe as the 0/1 method where each simulation results in a sample that would either reject or fail to reject the null hypothesis. • An alternative method (the standard error method) is to instead take from each simulation the estimated standard error of the parameter to be tested. Then the average of these standard errors can be used with the effect size to construct the power. • A 2nd alternative (when MCMC is used) is as in the 0/1 approach to establish whether the simulation sample rejects the null hypothesis but rather than construct the interval assuming normality to use the quantiles of the MCMC chain.

MLPOWSIM software package • Software package currently being written and tested (90% complete). • A ‘rather old fashioned’ text-based interface allows user to specify sample size scenarios. • Software then generates either MLwiN macro code or an R command file to run the simulations to calculate power for scenarios. • Normal, Binomial and Poisson response offered. • Software will cope with 1-level, 2-level (balanced and unbalanced), 3-level nested (balanced and unbalanced) and cross-classified (balanced and unbalanced) with 2 higher classifications models. • Many options for unbalanced designs. • Extensive user manual in preparation (~100 pages with examples)

MLPOWSIM state of play • The software still requires code to generate macros for unbalanced cross-classified models in MLwiN to be written. • Documentation for normal responses is almost complete with short chapters on other response types to write. • Currently MLwiN code (using IGLS) runs considerably faster than R code (using lmer) for most models though we have had some success speeding up R code. • Exception is cross-classified models where we have only programmed an MCMC-based macro in MLwiN that is typically slower than the lmer code in R. See more details later.

What happens with multilevel data? We will here mainly consider 2-level models and take as our application area education, so we have students nested within schools. When deciding on a sampling scheme we have many choices: • How many schools, N ? • How many pupils per school, nj ? • Should we collect the same size sample from each school ? Our decision will depend on which parameter we wish to estimate in the model.

Education Example • For motivation we considered a two level dataset with exam marks measured for each student in a collection of schools. In fact this dataset exists and has 4915 students in 96 schools. • Our hypothesis of interest is that the exam mark for an average student is > 20 (null hypothesis = 20) which with such a large sample results in the null hypothesis being rejected for our particular data. • If we fit the following multilevel model to the data we get the estimates given: • If we treat these estimates as population values, we are interested in what power for testing our hypothesis results from various combinations of N and nj

Design effect formula • If we assume balance then with n pupils in each of N schools for our simple model (and only this simple model) the following formula holds: • Design effect = 1 + (n-1)ρ where ρ is the intra-class correlation. • So if we know the simple random sample size required for a given power we need to multiply this by the design effect. • For example our data has ρ=16.205/(16.205+139.367)=0.104 • So for schools of size 10 pupils we would need 1+9*0.104=1.94 times as many students (in total) to get the same power. • For this model (and this model only) we could therefore perform our power calculations assuming simple random sampling from a population with variance 155.572 and scale up the sample required based on the design. • So • And for schools of size 10 we require 1.94*338.4=657 pupils which we can round up to 66 schools.

Comparison of formula/simulations • The following graph compares the design effect formula to the simulation approach:

Other sample size issues • There are other reasons why we may be interested in sample size questions in multilevel modelling. • It is often problematic to fit multilevel models when the number of higher level units is small as demonstrated in the last graph • Also some methods can be biased for small sample sizes. • Note although method comparison is done using a similar approach of generating simulated datasets, here power calculations are not the main aim; that said when performing power calculations parameter bias of methods should be noted as this will result in bias of predicted power (see later). • Browne (1998), Browne and Draper(2006) compare MCMC, RIGLS and IGLS for small sample sizes and continuous responses, and MCMC, MQL and PQL for binary response models. • Maas and Hox (2004) also look at small sample sizes and how robust estimation is to the Normal distributional assumption of the level 2 residuals.

Sampling policy The design effect formula: Design effect = 1 + (n-1)ρ suggests that if we are to sample a fixed (balanced) number of pupils n*N then our best power results when n is smallest i.e. sampling one pupil each from 100 schools is better than sampling 100 pupils from the same school. The effect of sampling policy is most important in scenarios where ρ is large e.g. repeated measures designs. The simulation procedure gives approximately the same power curve and so in this simple example we have an easy to use formula. The reason in practice for sampling several pupils from each school is purely the additional cost incurred in visiting additional schools.

More complex examples – random intercepts and random slopes • We will now look at more complex random effect models with predictor variables. • We will consider the random intercept model • and the random slopes model • We will consider how to extend (approximately) the theoretical approach and also the simulation approach.

Example problem • We will continue with our educational example but also consider the effect of gender (β1). For a random intercepts model let us assume the true parameter values are β0=20.9, β1= 1.6, σ2u=15, σ2e=135. • We have two hypotheses to test: • Hypothesis 1: boys get on average more than 20 marks (H0: β0=20 vs HA: β0>20) • Hypothesis 2: girls do better on average than boys (H0: β1=0 vs HA: β1>0) • We will also consider the effect of random slopes on the dataset so will have a second model with additionally σ2u0=10, σu01=2, σ2u1=5

What happens when we include random slopes? • The following table gives power values for β1 for the random intercept model. • Note that pairs of values with the same total N*n have similar powers.

What happens when we include random slopes? • The following table gives power values for β1 for the random slopes model. • Note that pairs of values with the same total N*n now do not have similar powers and larger N is better.

Binary response models • The simplest model for binary data is to compare an observed proportion with some fixed proportion, e.g. to detect whether more than 9 out of 10 cats prefer…! • We will be using as our example the use of contraceptives by women in Bangladesh taken from the MLwiN user guide. Here the full dataset (nearly 2000 women) has roughly 40% contraceptive use and we are interested in what sample size we need to say that less than 50% use contraceptives. • The standard approach uses a normal approximation to the binomial for the underlying probability to give a standard formula that can be easily calculated and gives a sample size of 153 for a power of 0.80.

Model based approach • We will use a logistic regression model • and now the probability 0.5 will correspond to the value 0 for β0. • The MLPOWSIM software will perform approximate Wald tests i.e. assume that β0 is normally distributed. • As can be seen by the curves in the following slide the power is fairly similar (sample size of ~ 160 for power of 0.8) but not identical to that obtained by the standard formulae as a different quantity is being assumed normally distributed.

Power graphs Here we see the results of the 0/1 approach in blue and the much smoother red line corresponds to the standard error approach for calculating power.

Multilevel binary response models • We can extend the earlier modelling of the Bangladeshi dataset to account for the structure of the data – women nested within districts. • We will again test whether usage of contraceptive is less than 50% but now we account for the clustering in the model with a random effects logistic regression model. • The estimate of the district level variability is 0.246 • We also now have options of different estimation methods and will consider MQL1 and PQL2.

Differences between methods – MQL 1 Here we see that the SE method (red) gives higher power than the 0/1 method (blue) but this is due to the bias in the estimates and hence the SEs. Here the 0/1 method is preferable.

Differences between methods – PQL 2 Here we see less difference between the SE method (red) and the 0/1 method (blue) as PQL is far less biased than MQL.

Cross-classified models • We will here consider a dataset of education results from Fife in Scotland. Here for each student their attainment at 16 can be affected by both the secondary school and the primary school they attended. We will look at perhaps the simplest model that accounts for the structure. Here our response is a mark out of 10 from which we subtract 5 which we treat as a pass mark. This response is then assumed to be affected by both the primary and secondary school attended Approximate estimates from the actual data that we will use for our power calculations are β0 = 0.5, variances for primary, secondary and residual are 1.2, 0.4 and 5 respectively. We are interested in sample sizes to detect an average greater than 5.

Cross classified methods • Currently in MLPOWSIM we have only implemented cross classified models using MCMC. • The IGLS method uses a clever trick of converting cross-classified models to constrained nested models but this is harder to do within simulation macros. • The R function lmer allows cross-classified models which although slower than nested models can be computed with no problems. • We looked at balanced structures with 5 pupils per combination of secondary and primary school, 20-50 primary schools in steps of 10 and 10-30 secondary schools in steps of 10. • The MCMC method was run for a burnin of 500 and main run of 5001 for each simulation. We used the default inverse-gamma (ε,ε) priors for the variance parameters. • The power values for various combinations can be seen on the next page based on the standard error method.

Comparison of Power estimates

Cross-classified models • From the table we see reasonable agreement between MLwiN and R. • When the number of higher level units is 10 we have bigger differences as the effect of prior and running for only 5000 will affect the MCMC estimates. • This was more marked when we initially also considered only 10 primary schools. • It is reassuring that in the range that counts i.e. where power is ~0.8 the methods agree well. • R took of the order of 1 hour 20 minutes whilst the MCMC code ran overnight.

Conclusions • In this talk we have shown the flexibility of using simulation to perform power calculations for multilevel models. • Although computationally the approach can be slow the flexibility of the approach means it can be used for virtually all models given enough assumptions. • Low powered studies often involve small amounts of data thus making the power calculations quicker. • Other software – PINT (Bosker,Snijders and Guldemond, 1996) is quick for specific models – balanced 2 level models and we compare answers in our documentation.

Further work • Find (faster) approximations to simulation results – potentially create look up tables, power curves etc. • Investigate efficient MCMC methods. • Consider the IGLS algorithm for cross-classified models • Expand the model classes considered. • Link MLPOWSIM directly within MLwiN.

References • Bosker, R.J., Snijders, T.A.B. and Guldemond, H. (1996) PINT (Power IN Two-level designs) User Manual. • Browne, W.J. (1998) Applying MCMC methods to Multi-level models. Unpublished PhD. thesis. University of Bath. • Browne, W.J. and Draper, D. (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models (with discussion). Bayesian Analysis1, 473-550. • Mass, C.J.M. and Hox, J.J. (2004) Robustness issues in multilevel regression analysis. Statistica Neerlandica58, 127-137.

Sample Size calculations for multilevel models (Part II)