From t-test to … multilevel analyses Stein Atle Lie
Outline • Paired t-test (Mean and standard deviation) • Two-group t-test (Mean and standard deviations) • Linear regression • GLM (general linear models) • GEE (generalized estimating equations) • GLMM (generalized linear mixed models) • … • SPSS, Stata, R, MLwiN, gllamm (Stata)
Multilevel models • “Same thing – many names”: • Generalized estimating equations • Random effects models • Random intercept and random slope models • Mixed effects models • Variance component models • Frailty models (in survival analyses) • Latent variables
Objective • Take the general thinking from simple statistical methods into more sophisticated data-structures and statistical analyses • Focus on the interpretation of the results with respect to those found in basic statistical methods
Multilevel data Types of data: • Repeated measures for the same individual • The same measure is repeated several times on the same individual • Several observers have measured the same individual • Several different measures for the same individual • Related observations (siblings, families, …) • A categorical variable with ”many” levels (multicenter data, hospitals, clinics, …) • Panel data
Null hypotheses • In ordinary statistics (using both paired and two-sample t-tests) we define a null hypothesis. H0: μ1 = μ2 • We assume that the mean from group (or measure) 1 is equal to the mean from group (or measure) 2. • Alternatively H0: Δ = μ1 − μ2 = 0
p-value • Definition: • “If our null hypothesis is true, what is the probability of observing the data* that we did (or something more extreme)?” * And hence the mean, t-statistic, etc.
p-value • We assume that our null hypothesis is true (μ0 = 0 or μ1 − μ2 = 0) • We observe our data • Mean value etc. • Under the assumption of normally distributed data • The p-value is the probability of observing our data (or something more extreme) under the given assumptions
Paired t-test • The straightforward way to analyze two repeated measures is a paired t-test. • The measure at time1 or location1 (e.g. Data1) is directly compared to the measure at time2 or location2 (e.g. Data2) • Is the difference between Data1 and Data2 (Diff = Data1 − Data2) different from 0?
Paired t-test (n=10) PASW: T-TEST PAIRS=Data1 WITH Data2 (PAIRED).
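As a rough illustration of what the paired t-test computes, the sketch below works directly on the within-pair differences. The numbers are made up for illustration (they are not the data behind the slides), and the helper name `paired_t` is hypothetical.

```python
import math
import statistics

# Made-up illustrative measurements (NOT the data used on the slides).
data1 = [7.1, 5.4, 9.8, 6.3, 8.0, 4.9, 10.2, 7.7, 6.6, 8.5]
data2 = [8.9, 6.1, 11.5, 9.0, 8.4, 7.2, 13.1, 9.3, 9.9, 10.4]


def paired_t(x, y):
    """Return (mean difference, standard error, t statistic, degrees of freedom)."""
    if len(x) != len(y):
        raise ValueError("a paired test needs complete pairs")
    diffs = [a - b for a, b in zip(x, y)]        # Diff = Data1 - Data2
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # sample SD of the differences
    return mean_diff, se, mean_diff / se, n - 1


mean_diff, se, t_stat, df = paired_t(data1, data2)
print(f"mean diff = {mean_diff:.3f}, SE = {se:.3f}, t = {t_stat:.2f}, df = {df}")
```

The t statistic is then compared with a t distribution on n − 1 degrees of freedom to obtain the p-value; that step is left out here because the standard library has no t distribution.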
Paired t-test • The paired t-test will only be performed for complete (balanced) data. • What happens if we delete two observations from Data2? • (Only 8 complete pairs remain)
Paired t-test (n=8) PASW: T-TEST PAIRS=Data1 WITH Data2 (PAIRED). Excel
Two-group t-test • Suppose we now treat the data from time1 and time2 (or location1 and location2) as independent (even though they are not) and apply a two-group t-test to the full dataset of 2 × 10 observations
Two group t-test (n=20 [10+10]) PASW: T-TEST GROUPS=Grp(1 2) /VARIABLES=Data.
Two-group t-test • Observe that the means for Grp1 and Grp2 are equal to the means for Data1 and Data2 • And that the mean difference is also equal • The difference between the paired t-test and the two-group t-test lies in the • variance and the number of observations • and therefore in the standard deviation and standard error • and hence in the p-value and confidence intervals
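The point can be checked numerically. The sketch below uses made-up paired data (not the data from the slides) and compares the standard error from the paired analysis with the pooled, equal-variance standard error from the two-group analysis; the mean difference is the same in both.

```python
import math
import statistics

# Made-up illustrative paired measurements (NOT the data used on the slides).
data1 = [7.1, 5.4, 9.8, 6.3, 8.0, 4.9, 10.2, 7.7, 6.6, 8.5]
data2 = [8.9, 6.1, 11.5, 9.0, 8.4, 7.2, 13.1, 9.3, 9.9, 10.4]
n1, n2 = len(data1), len(data2)

# The estimated mean difference is the same in both analyses.
mean_diff = statistics.mean(data1) - statistics.mean(data2)

# Paired analysis: the standard error comes from the within-pair differences.
diffs = [a - b for a, b in zip(data1, data2)]
se_paired = statistics.stdev(diffs) / math.sqrt(len(diffs))

# Two-group analysis (equal variances assumed): pooled variance, pairing ignored.
pooled_var = ((n1 - 1) * statistics.variance(data1)
              + (n2 - 1) * statistics.variance(data2)) / (n1 + n2 - 2)
se_two_group = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"mean difference   = {mean_diff:.3f}")
print(f"SE, paired        = {se_paired:.3f}  (df = {len(diffs) - 1})")
print(f"SE, two-group     = {se_two_group:.3f}  (df = {n1 + n2 - 2})")
```

With positively correlated pairs, as in this toy example, the paired standard error is the smaller of the two; with uncorrelated data the two would be similar.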
Two-group t-test • The two-group t-test is performed on all available data. • What happens if we delete two observations from Grp2? • (Only 8 complete pairs remain, but 18 observations remain!)
Two group t-test (n=18 [10+8]) PASW: T-TEST GROUPS=Grp(1 2) /VARIABLES=Data.
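Two-group t-test (σ1 = σ2) [Figure: the two groups, with standard deviations σ1 and σ2, means μ1 and μ2, and the difference Δ marked]
Two-group t-test (σ1 = σ2) [Figure: the two groups, with standard deviations σ1 and σ2 marked]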
Linear regression • If we now perform an ordinary linear regression with the data as outcome (dependent variable) and the group variable (Grp=1 and 2) as independent variable • the coefficient for group is identical to the mean difference • and the standard error, t-statistic, and p‑value are identical to those found in a two‑group t‑test
Linear regression (n=20)
Stata:
. regress data grp

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =    1.38
       Model |  21.0124998     1  21.0124998           Prob > F      =  0.2554
    Residual |   274.01701    18  15.2231672           R-squared     =  0.0712
-------------+------------------------------           Adj R-squared =  0.0196
       Total |   295.02951    19  15.5278689           Root MSE      =  3.9017

------------------------------------------------------------------------------
        data |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         grp |       2.05   1.744888     1.17   0.255    -1.615873    5.715873
       _cons |       5.33    2.75891     1.93   0.069    -.4662545    11.12625
------------------------------------------------------------------------------
Linear regression • Now exchange the independent variable for group (Grp=1 and 2) with a dummy variable (dummy=0 for grp=1 and dummy=1 for grp=2) • the coefficient for the dummy is equal to the coefficient for grp (the mean difference) • and the coefficient for the constant term is equal to the mean for grp1 (the standard error is not!)
Linear regression (n=20)
Stata:
. regress data dummy

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =    1.38
       Model |  21.0124998     1  21.0124998           Prob > F      =  0.2554
    Residual |   274.01701    18  15.2231672           R-squared     =  0.0712
-------------+------------------------------           Adj R-squared =  0.0196
       Total |   295.02951    19  15.5278689           Root MSE      =  3.9017

------------------------------------------------------------------------------
        data |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dummy |       2.05   1.744888     1.17   0.255    -1.615873    5.715873
       _cons |       7.38   1.233822     5.98   0.000     4.787836    9.972164
------------------------------------------------------------------------------
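The equivalence between the two-group t-test and a regression on a group variable can be sketched in a few lines. The data below are made up (not the data from the slides), and `ols` is a hypothetical helper implementing the textbook closed-form fit for one predictor.

```python
import math
import statistics


def ols(x, y):
    """Simple least-squares fit y = a + b*x; returns (a, b, standard error of b)."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)  # residual variance divided by Sxx
    return a, b, se_b


# Made-up illustrative data (NOT the data used on the slides).
group1 = [7.1, 5.4, 9.8, 6.3, 8.0, 4.9, 10.2, 7.7, 6.6, 8.5]
group2 = [8.9, 6.1, 11.5, 9.0, 8.4, 7.2, 13.1, 9.3, 9.9, 10.4]
n1, n2 = len(group1), len(group2)
y = group1 + group2
grp = [1] * n1 + [2] * n2        # group coded 1 and 2
dummy = [g - 1 for g in grp]     # group coded 0 and 1

a_grp, b_grp, se_grp = ols(grp, y)
a_dum, b_dum, se_dum = ols(dummy, y)

# Two-group t-test quantities for comparison (equal variances assumed).
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
pooled_var = ((n1 - 1) * statistics.variance(group1)
              + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"mean difference (grp2 - grp1) = {mean2 - mean1:.3f}, pooled SE = {se_pooled:.3f}")
print(f"grp coding:   slope = {b_grp:.3f}, SE = {se_grp:.3f}, intercept = {a_grp:.3f}")
print(f"dummy coding: slope = {b_dum:.3f}, SE = {se_dum:.3f}, intercept = {a_dum:.3f}")
```

Both codings give the same slope and standard error; only the intercept changes, from "mean of group 1 minus the slope" with the 1/2 coding to "mean of group 1" with the 0/1 dummy.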
Linear models in Stata • In ordinary linear models (regress and glm) in Stata one may add an option for clustered data, to obtain standard errors adjusted for intragroup correlation • This is ideal when you want to adjust for clustered data, but are not interested in the correlation within or between groups • And you still obtain the population effects!
Linear regression (n=20)
Stata:
. regress data dummy, cluster(id)

Linear regression                                      Number of obs =      20
                                                       F(  1,     9) =    2.64
                                                       Prob > F      =  0.1388
                                                       R-squared     =  0.0712
                                                       Root MSE      =  3.9017

                                    (Std. Err. adjusted for 10 clusters in id)
------------------------------------------------------------------------------
             |               Robust
        data |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dummy |       2.05   1.262145     1.62   0.139    -.8051699     4.90517
       _cons |       7.38   1.224847     6.03   0.000     4.609204     10.1508
------------------------------------------------------------------------------
Linear models in Stata • Thus, we now have an alternative to the paired t-test. The mean difference is identical to that obtained from the paired t-test, and the standard errors (and p-values) are adjusted for intragroup correlation • As an alternative we may use the program gllamm (Generalized Linear Latent And Mixed Models) in Stata • http://www.gllamm.org/
gllamm (n=20)
gllamm (Stata):
. gllamm data dummy, i(id)

number of level 1 units = 20
number of level 2 units = 10

------------------------------------------------------------------------------
        data |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dummy |       2.05   1.167852     1.76   0.079    -.2389486    4.338949
       _cons |   7.379808   1.172819     6.29   0.000     5.081124    9.678492
------------------------------------------------------------------------------

Variance at level 1
  6.8193955 (3.0174853)

Variances and covariances of random effects
------------------------------------------------------------------------------
level 2 (id)
    var(1): 6.8114516 (4.5613185)
Linear models in Stata • Suppose we now delete two of the observations in Grp2 • We then have coefficients (“mean differences”) calculated from all the remaining data (n=18) • and standard errors corrected for intragroup correlation, using the commands <regress>, <glm> or <gllamm>
Linear regression (n=18)
Stata:
. regress data dummy, cluster(id)

Linear regression                                      Number of obs =      18
                                                       F(  1,     9) =    1.63
                                                       Prob > F      =  0.2332
                                                       R-squared     =  0.0587
                                                       Root MSE      =  4.1303

                                    (Std. Err. adjusted for 10 clusters in id)
------------------------------------------------------------------------------
             |               Robust
        data |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dummy |     1.9575   1.531486     1.28   0.233    -1.506963    5.421963
       _cons |       7.38   1.228869     6.01   0.000     4.600105     10.1599
------------------------------------------------------------------------------
gllamm (n=18)
gllamm (Stata):
. gllamm data dummy, i(id)

number of level 1 units = 18
number of level 2 units = 10

log likelihood = -48.538837

------------------------------------------------------------------------------
        data |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dummy |   2.458305   1.253552     1.96   0.050     .0013882    4.915223
       _cons |   7.357426   1.232548     5.97   0.000     4.941677    9.773176
------------------------------------------------------------------------------

Variance at level 1
  6.4041537 (3.3485133)

level 2 (id)
    var(1): 8.7561818 (5.1671805)
Intraclass correlation (ICC) • From the gllamm output (n=18): variance at level 1 = 6.4041537 (3.3485133); level 2 (id) var(1) = 8.7561818 (5.1671805) • The total variance is hence 6.4041 + 8.7561 = 15.1603 (and the standard deviation is hence 3.8936) • The proportion of variance attributed to level 2 is therefore ICC = 8.7561/15.1603 = 0.578
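The arithmetic on this slide can be reproduced with a few lines. The two variance components are copied from the gllamm output (n=18) above; the variable names are only illustrative.

```python
import math

# Variance components copied from the gllamm output (n=18) shown above.
var_level1 = 6.4041537  # level 1: residual variance within individuals
var_level2 = 8.7561818  # level 2 (id): random-intercept variance between individuals

total_var = var_level1 + var_level2
total_sd = math.sqrt(total_var)
icc = var_level2 / total_var  # share of the total variance attributed to level 2

print(f"total variance = {total_var:.4f}")
print(f"total SD       = {total_sd:.4f}")
print(f"ICC            = {icc:.3f}")
```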
Linear regression • Ordinary linear regression • Assumes the data are normally distributed and i.i.d. (independent and identically distributed)
Linear regression • Regression line: y = b0 + b1·x [Figure: points (x1,y1), …, (xi,yi), …, (xn,yn) around the regression line, with intercept b0, slope b1 and a residual marked; example variable pairs: Height * Weight, Cortisol * Months, Cortisol * Time]
Linear regression • Assumptions: 1) y1, y2, …, yn are independent and normally distributed 2) The expectation of Yi is: E(Yi) = b0 + b1·xi (linear relation between X and Y) 3) The variance of Yi is: var(Yi) = σ² (equal variance for ALL values of X)
Linear regression • Assumptions - Residuals (ei): yi = a + b·xi + ei 1) e1, e2, …, en are independent and normally distributed 2) The expectation of ei is: E(ei) = 0 3) The variance of ei is: var(ei) = σ²
Regression • What are the ”best” a and b? • Least squares method [Figure: fitted line ŷi = a + b·xi through the points (xi,yi), with x̄, ȳ, the residuals (e) and the squared distance (yi − ŷi)² marked]
Regression • Least squares method: • We want the sum of squares (the squared distances from all points to the line, i.e. the squared residuals) to be as small as possible; we wish to find the minimum
Regression • The least squares method: • The solution is:
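For the line ŷi = a + b·xi used on these slides, the sum of squares being minimized and its standard closed-form minimizer can be written as follows. This is the textbook result, stated with the slides' symbols a and b; the hats on the estimates are added notation.

```latex
\[
  S(a, b) \;=\; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
          \;=\; \sum_{i=1}^{n} \bigl(y_i - a - b\,x_i\bigr)^2
\]
\[
  \hat{b} \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2},
  \qquad
  \hat{a} \;=\; \bar{y} - \hat{b}\,\bar{x}
\]
```

Setting the partial derivatives of S with respect to a and b to zero gives these two expressions; the fitted line therefore always passes through the point (x̄, ȳ).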
Regression • The maximum likelihood method: • Assumptions: 1) y1, y2, …, yn are independent, normally distributed observations 2) Expectation for Yi is: E(Yi) = a + b·xi 3) Variance for Yi is: var(Yi) = σ² • The likelihood function f(y) is maximized w.r.t. a and b • This is the same as finding the minimum of Σ (yi − a − b·xi)² • For simple linear regression the least squares method and the maximum likelihood method give the same estimates of a and b!
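Under the three assumptions listed, the likelihood and its logarithm take the standard normal-theory form below, which makes the link to least squares explicit (writing σ² for the common variance):

```latex
\[
  L(a, b, \sigma^2) \;=\; \prod_{i=1}^{n}
    \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(y_i - a - b\,x_i)^2}{2\sigma^2} \right)
\]
\[
  \log L(a, b, \sigma^2) \;=\; -\frac{n}{2}\,\log\bigl(2\pi\sigma^2\bigr)
    \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a - b\,x_i)^2
\]
```

For any fixed σ², only the last term depends on a and b, and it enters with a negative sign. Maximizing the likelihood over a and b is therefore the same as minimizing the sum of squared residuals.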
Regression • The maximum likelihood method • ”The probability that the line fits the observed points” [Figure: regression line through the points (xi,yi), with x̄, ȳ and the residuals (e) marked]
Ordinary linear regression • The formula for an ordinary regression can be expressed as: yi = b0 + b1·xi + ei, with ei ~ N(0, σe²)
Interpretation of coefficients • Y = −97.6 + 0.96·X (Y = a + b·X), that is: a = −97.6 and b = 0.96 [Figure: weight in kg (Y) plotted against height in cm (X), for women and men]
Interpretation of coefficients • Y = −85.0 + 0.91·X1 − 1.86·X2 (the bracket in the figure marks a difference of 1.86 kg, the size of the coefficient for X2)
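A small sketch of how the two fitted equations are read. The coefficients are the ones on the slides; treating X2 as a 0/1 indicator variable is an assumption made here for illustration (the slides do not state its coding), and the function names are hypothetical.

```python
def predict_simple(height_cm):
    """Weight (kg) predicted from height (cm): Y = -97.6 + 0.96*X."""
    return -97.6 + 0.96 * height_cm


def predict_two(height_cm, x2):
    """Y = -85.0 + 0.91*X1 - 1.86*X2, with X2 ASSUMED to be a 0/1 indicator."""
    return -85.0 + 0.91 * height_cm - 1.86 * x2


# The slope b: each extra cm of height adds 0.96 kg to the predicted weight.
per_cm = predict_simple(181) - predict_simple(180)

# The coefficient for X2: at the same height, the two X2 groups differ by 1.86 kg.
group_gap = predict_two(175, 1) - predict_two(175, 0)

print(f"predicted weight at 180 cm: {predict_simple(180):.1f} kg")
print(f"change per extra cm:        {per_cm:.2f} kg")
print(f"X2 = 1 versus X2 = 0:       {group_gap:.2f} kg")
```

The intercept a = −97.6 is the predicted weight at a height of 0 cm, which has no direct interpretation; it only positions the line within the observed range of heights.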
Random intercept model • Regression lines: yij = b0 + b1·xij + vij [Figure: parallel regression lines with common slope b1 and group-specific intercepts b0 + uj through the points (x11,y11), …, (xij,yij), …, (xnp,ynp), with σu and σe marked]
Random intercept model • For a random intercept model, we can express the regression line(s) and the variance components as • yij = b0 + b1·xij + vij • vij = uj + eij • eij ~ N(0, σe²) (individual) • uj ~ N(0, σu²) (group)
Random intercept model • Alternatively we may express the formulas, for the simple variance component model, in terms of random intercepts: • yij = b0j + b1·xij + eij • b0j = b0 + uj • eij ~ N(0, σe²) (individual) • uj ~ N(0, σu²) (group)
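Writing σe² and σu² for the variances of eij and uj, and assuming (as is standard for this model) that uj and eij are independent of each other, the variance components connect to the intraclass correlation as follows:

```latex
\[
  \operatorname{var}(y_{ij}) \;=\; \operatorname{var}(u_j + e_{ij})
    \;=\; \sigma_u^2 + \sigma_e^2
\]
\[
  \operatorname{cov}(y_{ij},\, y_{i'j}) \;=\; \sigma_u^2
  \quad (i \neq i'),
  \qquad
  \mathrm{ICC} \;=\; \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
\]
```

Two observations from the same group share the same uj, which is what makes them correlated; observations from different groups share nothing and are uncorrelated.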
Random slope model • For a random slope model (the intercepts are equal), we can express the regression line(s) and the variance components as • yij = b0 + b1j·xij + eij • b1j = b1 + wj • eij ~ N(0, σe²) (individual) • wj ~ N(0, σw²) (group)
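One consequence of the random slope formulation, sketched here under the usual assumption that wj and eij are independent and writing σw² and σe² for their variances, is that the variance of the outcome is no longer constant but depends on x:

```latex
\[
  \operatorname{var}(y_{ij} \mid x_{ij})
    \;=\; \operatorname{var}(w_j\,x_{ij} + e_{ij})
    \;=\; x_{ij}^2\,\sigma_w^2 + \sigma_e^2
\]
\[
  \operatorname{cov}(y_{ij},\, y_{i'j} \mid x_{ij}, x_{i'j})
    \;=\; x_{ij}\,x_{i'j}\,\sigma_w^2
  \quad (i \neq i')
\]
```

The group-specific lines all start from the same intercept b0 and fan out as x moves away from 0, so the spread between groups grows with the size of x.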