Introduction to Generalized Estimating Equations (GEEs) for Correlated Data Modeling

Generalized Estimating Equations (GEEs)
Purpose: to introduce GEEs
These are used to model correlated data from:
• Longitudinal/repeated measures studies
• Clustered/multilevel studies

Outline
• Examples of correlated data
• Successive generalizations
  – Normal linear model
  – Generalized linear model
  – GEE
• Estimation
• Example: stroke data
  – exploratory analysis
  – modelling

Correlated data
• Repeated measures: same subjects, same measure, successive times
  – expect successive measurements to be correlated
[Figure: subjects i = 1,…,n are randomized to treatment groups A, B and C and measured at successive times]

Correlated data
• Clustered/multilevel studies
E.g., Level 3: populations; Level 2: age-sex groups; Level 1: blood pressure measurements in a sample of people in each age-sex group
We expect correlations within populations and within age-sex groups due to genetic, environmental and measurement effects
[Figure: three-level hierarchy, Level 3 down to Level 1]

Notation
• Repeated measurements: $y_{ij}$, $i = 1, \dots, N$ subjects; $j = 1, \dots, n_i$ times for subject $i$
• Clustered data: $y_{ij}$, $i = 1, \dots, N$ clusters; $j = 1, \dots, n_i$ measurements within cluster $i$
• Use "unit" for subject or cluster

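Normal Linear Model
For unit $i$: $E(y_i) = \mu_i = X_i\beta$; $y_i \sim N(\mu_i, V_i)$
$X_i$: $n_i \times p$ design matrix
$\beta$: $p \times 1$ parameter vector
$V_i$: $n_i \times n_i$ variance-covariance matrix, e.g., $V_i = \sigma^2 I$ if measurements are independent
For all units: $E(y) = \mu = X\beta$, $y \sim N(\mu, V)$, where $y$ and $X$ stack the $y_i$ and $X_i$, and $V$ is block diagonal with blocks $V_1, \dots, V_N$
This $V$ is suitable if the units are independent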

Normal linear model: estimation
We want to estimate $\beta$ and $V$
Use $U(\beta) = \sum_{i=1}^{N} X_i^{T} V_i^{-1} (y_i - X_i\beta) = 0$
Solve this set of score equations to estimate $\beta$
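When each V_i is known, the score equations for the normal linear model have a closed-form solution (generalized least squares). The sketch below is a minimal illustration using NumPy; the function name `gls_estimate` and the toy data are hypothetical and not part of the original slides.

```python
import numpy as np

def gls_estimate(X_blocks, y_blocks, V_blocks):
    """Solve sum_i X_i^T V_i^{-1} (y_i - X_i b) = 0 for b, with every V_i known."""
    p = X_blocks[0].shape[1]
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    for X_i, y_i, V_i in zip(X_blocks, y_blocks, V_blocks):
        V_inv_X = np.linalg.solve(V_i, X_i)   # V_i^{-1} X_i
        lhs += V_inv_X.T @ X_i                # X_i^T V_i^{-1} X_i (V_i is symmetric)
        rhs += V_inv_X.T @ y_i                # X_i^T V_i^{-1} y_i
    return np.linalg.solve(lhs, rhs)

# Hypothetical toy data: 3 units, 4 times each, equal correlation 0.5 within a unit
times = np.arange(4.0)
X_i = np.column_stack([np.ones(4), times])        # intercept and time
V_i = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
beta_true = np.array([2.0, 0.3])
X_blocks = [X_i] * 3
V_blocks = [V_i] * 3
y_blocks = [X_i @ beta_true for _ in range(3)]    # noise-free, so b recovers beta_true
b = gls_estimate(X_blocks, y_blocks, V_blocks)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical-stability choice, not something the slides prescribe.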

Generalized linear model (GLM)
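For reference, the standard GLM formulation for independent observations can be written as below. The link function $g$ and the single index $i$ over observations are assumed notation, chosen to match the rest of the deck.

```latex
\begin{align*}
E(y_i) &= \mu_i, \qquad g(\mu_i) = x_i^{T}\beta, \qquad y_i \text{ independent, exponential-family distributed} \\
U_j &= \sum_{i} \frac{(y_i - \mu_i)}{\operatorname{var}(y_i)} \, \frac{\partial \mu_i}{\partial \beta_j} = 0,
\qquad j = 1, \dots, p
\end{align*}
```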

Generalized estimating equations (GEE)

Generalized estimating equations
$D_i$ is the matrix of derivatives $\partial\mu_i / \partial\beta_j$
$V_i$ is the 'working' covariance matrix of $Y_i$
$A_i = \mathrm{diag}\{\mathrm{var}(Y_{ik})\}$
$R_i$ is the correlation matrix for $Y_i$
$\phi$ is an overdispersion parameter
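The quantities defined above fit together in the standard form of the generalized estimating equations. The placement of $\phi$ as a multiplier of the working covariance follows the common convention and is an assumption about what the original slide showed.

```latex
\begin{align*}
U(\beta) &= \sum_{i=1}^{N} D_i^{T} V_i^{-1} (y_i - \mu_i) = 0 \\
V_i &= \phi \, A_i^{1/2} R_i A_i^{1/2}
\end{align*}
```

With the identity link ($D_i = X_i$) this reduces to the score equations of the normal linear model.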

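Overdispersion parameter
Estimated using the formula: $\hat\phi = \frac{1}{N - p} \sum_i \sum_j \frac{(y_{ij} - \hat\mu_{ij})^2}{v(\hat\mu_{ij})}$, i.e. the sum of squared Pearson residuals divided by $N - p$, with $v(\hat\mu_{ij})$ the model variance on the diagonal of $A_i$
Where $N$ is the total number of measurements (the sum of the $n_i$, not the number of units) and $p$ is the number of regression parameters
The square root of the overdispersion parameter is called the scale parameter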
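As a small worked example of the overdispersion estimate, the sketch below computes the sum of squared Pearson residuals divided by N - p, and its square root (the scale parameter). The function name, the counts, and the Poisson-type variance function var(mu) = mu are illustrative assumptions.

```python
import numpy as np

def overdispersion(y, mu, var_mu, n_params):
    """Return (phi_hat, scale): squared Pearson residuals summed and divided by N - p."""
    y, mu, var_mu = (np.asarray(a, dtype=float) for a in (y, mu, var_mu))
    pearson = (y - mu) / np.sqrt(var_mu)
    phi = np.sum(pearson ** 2) / (y.size - n_params)
    return phi, np.sqrt(phi)

# Hypothetical counts and fitted means, with variance function var(mu) = mu
y = np.array([0, 3, 1, 6, 2, 8])
mu = np.array([1.0, 2.0, 2.0, 4.0, 4.0, 6.0])
phi, scale = overdispersion(y, mu, var_mu=mu, n_params=2)
```

Here the squared Pearson residuals sum to 14/3, so with N = 6 and p = 2 the estimate is 7/6.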

Estimation (1)
• More generally, unless $V_i$ is known, iteration is needed to solve the score equations
1. Guess $V_i$ and estimate $\beta$ by $b$ and hence $\mu$
2. Calculate residuals, $r_{ij} = y_{ij} - \hat\mu_{ij}$
3. Estimate $V_i$ from the residuals
4. Re-estimate $b$ using the new estimate of $V_i$
• Repeat steps 2-4 until convergence

Estimation (2) – For GEEs

Iterative process for GEEs
• Start with $R_i$ = identity (i.e. independence) and $\phi = 1$: estimate $\beta$
• Use the estimates to calculate fitted values $\hat\mu_i$ (the inverse link applied to $X_i\hat\beta$) and residuals $y_i - \hat\mu_i$
• These are used to estimate $A_i$, $R_i$ and $\phi$
• Then the GEEs are solved again to obtain improved estimates of $\beta$
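The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure any particular package uses: count data with a log link and Poisson-type variance, an exchangeable working correlation with a single parameter `alpha`, simple moment estimators for `alpha` and `phi`, and simulated data. All function and variable names are hypothetical.

```python
import numpy as np

def exchangeable(n, alpha):
    return (1 - alpha) * np.eye(n) + alpha * np.ones((n, n))

def score_and_info(beta, X_blocks, y_blocks, alpha, phi):
    """U = sum_i D_i^T V_i^{-1} (y_i - mu_i) and H = sum_i D_i^T V_i^{-1} D_i."""
    p = beta.size
    U, H = np.zeros(p), np.zeros((p, p))
    for X_i, y_i in zip(X_blocks, y_blocks):
        mu = np.exp(X_i @ beta)                  # inverse log link
        D = mu[:, None] * X_i                    # d mu / d beta for the log link
        sd = np.sqrt(mu)                         # A_i^{1/2}, Poisson-type variance
        V = phi * np.outer(sd, sd) * exchangeable(len(y_i), alpha)
        V_inv_D = np.linalg.solve(V, D)
        U += V_inv_D.T @ (y_i - mu)
        H += V_inv_D.T @ D
    return U, H

def fit_gee(X_blocks, y_blocks, max_outer=100, tol=1e-8):
    p = X_blocks[0].shape[1]
    N = sum(len(y_i) for y_i in y_blocks)
    n_pairs = sum(len(y_i) * (len(y_i) - 1) // 2 for y_i in y_blocks)
    beta, alpha, phi = np.zeros(p), 0.0, 1.0     # start from independence, phi = 1
    for _ in range(max_outer):
        beta_old = beta.copy()
        for _ in range(100):                     # solve the GEEs for fixed alpha, phi
            U, H = score_and_info(beta, X_blocks, y_blocks, alpha, phi)
            step = np.linalg.solve(H, U)
            beta = beta + step
            if np.max(np.abs(step)) < tol * 1e-2:
                break
        if np.max(np.abs(beta - beta_old)) < tol:
            break
        sq, cross = 0.0, 0.0                     # re-estimate phi and alpha from residuals
        for X_i, y_i in zip(X_blocks, y_blocks):
            mu = np.exp(X_i @ beta)
            e = (y_i - mu) / np.sqrt(mu)         # Pearson residuals
            sq += e @ e
            cross += (e.sum() ** 2 - e @ e) / 2  # sum of e_j * e_k over pairs j < k
        phi = sq / (N - p)
        alpha = cross / ((n_pairs - p) * phi)
    return beta, alpha, phi

# Simulated data (hypothetical): 40 units, 5 times, shared gamma multiplier per unit
rng = np.random.default_rng(0)
X_i = np.column_stack([np.ones(5), np.arange(5) / 5])
beta_true = np.array([1.0, 0.5])
X_blocks, y_blocks = [], []
for _ in range(40):
    frailty = rng.gamma(shape=4.0, scale=0.25)   # mean 1, induces within-unit correlation
    X_blocks.append(X_i)
    y_blocks.append(rng.poisson(frailty * np.exp(X_i @ beta_true)).astype(float))
beta_hat, alpha_hat, phi_hat = fit_gee(X_blocks, y_blocks)
```

Note that `phi` cancels out of the estimating equations for `beta`; it matters for the standard errors, which this sketch does not compute.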

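Correlation
For unit $i$: $V_i = \sigma^2 R_i$, where $R_i$ has 1 on the diagonal and $\rho_{lm}$ in position $(l, m)$
For repeated measures, $\rho_{lm}$ = correlation between times $l$ and $m$
For clustered data, $\rho_{lm}$ = correlation between measures $l$ and $m$
For all models considered here $V_i$ is assumed to be the same for all units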

Types of correlation
1. Independent: $V_i$ is diagonal
2. Exchangeable: all measurements on the same unit are equally correlated (a single common correlation $\rho$)
• Plausible for clustered data
• Other terms: spherical and compound symmetry

Types of correlation
3. Correlation depends on time or distance between measurements $l$ and $m$, e.g. the first-order auto-regressive model has terms $\rho$, $\rho^2$, $\rho^3$ and so on
• Plausible for repeated measures where correlation is known to decline over time
4. Unstructured correlation: no assumptions about the correlations
• Lots of parameters to estimate; may not converge
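The correlation structures listed above are easy to write down explicitly. The following sketch builds each working correlation matrix with NumPy and counts the free parameters of the unstructured case; the function names are hypothetical.

```python
import numpy as np

def independence(n):
    """No within-unit correlation."""
    return np.eye(n)

def exchangeable(n, rho):
    """Every pair of measurements on a unit has the same correlation rho."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

def ar1(n, rho):
    """First-order auto-regressive: correlation rho^|l - m| between times l and m."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def n_unstructured_params(n):
    """Number of distinct correlations when no structure is assumed."""
    return n * (n - 1) // 2

R_ar1 = ar1(4, 0.5)          # first row: 1, 0.5, 0.25, 0.125
R_exch = exchangeable(3, 0.2)
n_free = n_unstructured_params(8)
```

For 8 measurement times the unstructured matrix has 28 correlations to estimate, against one for the exchangeable and AR(1) structures, which is why it may fail to converge on small data sets.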

Missing Data For missing data, can estimate the working correlation using the all available pairs method, in which all non-missing pairs of data are used in the estimators of the working correlation parameters.
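To make the all available pairs idea concrete, here is a simplified sketch for an exchangeable working correlation. It assumes the residuals are already standardized, marks missing measurements with NaN, and omits the degrees-of-freedom and overdispersion adjustments that software implementations typically apply. The function name and the residual values are hypothetical.

```python
import numpy as np

def exchangeable_alpha_available_pairs(resid):
    """resid: units x times array of standardized residuals, NaN where missing.

    Averages the product of residuals over every within-unit pair in which
    both measurements are observed.
    """
    total, n_pairs = 0.0, 0
    for row in resid:
        e = row[~np.isnan(row)]                  # keep the observed measurements only
        total += (e.sum() ** 2 - e @ e) / 2      # sum of e_j * e_k over pairs j < k
        n_pairs += e.size * (e.size - 1) // 2
    return total / n_pairs

resid = np.array([[1.0, 1.0, np.nan],
                  [-1.0, np.nan, -1.0],
                  [0.5, -0.5, 1.0]])
alpha = exchangeable_alpha_available_pairs(resid)
```

The three units contribute 1, 1 and 3 observed pairs respectively, so a unit with a missing value still contributes the pairs it does have rather than being dropped.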

Choosing the Best Model: Standard Regression (GLM)
AIC = -2*(log likelihood) + 2*(#parameters)
• Smaller values indicate a better balance of fit and parsimony.
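A short worked example of the AIC comparison follows. The candidate models and their log likelihoods are made-up numbers for illustration only.

```python
def aic(log_likelihood, n_params):
    """AIC = -2 * log likelihood + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2 * n_params

# Hypothetical candidates: (log likelihood, number of parameters)
candidates = {
    "time only": (-412.3, 2),
    "time + group": (-409.8, 4),
    "time * group": (-409.1, 6),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)   # model with the smallest AIC
```

In this made-up comparison the interaction model has the highest log likelihood, but its two extra parameters cost more than the fit they buy, so the additive model is preferred.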

Choosing the Best Model: GEE
• QIC(V): a function of V, so it can be used to choose the best correlation structure.
• QICu: a measure that can be used to determine the best subset of covariates for a particular model.
• The best model is the one with the smallest value.

Other approaches – alternatives to GEEs
• Multivariate modelling: treat all measurements on the same unit as dependent variables (even though they are measurements of the same variable) and model them simultaneously (Hand and Crowder, 1996)
• e.g., SPSS uses this approach (with exchangeable correlation) for repeated measures ANOVA

Other approaches – alternatives to GEEs
• Mixed models: fixed and random effects
• e.g., $y = X\beta + Zu + e$
• $\beta$: fixed effects; $u$: random effects $\sim N(0, G)$
• $e$: error terms $\sim N(0, R)$
• $\mathrm{var}(y) = ZGZ^{T} + R$
• so correlation between the elements of $y$ is due to random effects
Verbeke and Molenberghs (1997)

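Example of correlation from random effects
Cluster sampling: randomly select areas (PSUs), then households within areas
$Y_{ij} = \mu + u_i + e_{ij}$
$Y_{ij}$: income of household $j$ in area $i$
$\mu$: average income for population
$u_i$: random effect of area $i$, $\sim N(0, \sigma_u^2)$; $e_{ij}$: error $\sim N(0, \sigma_e^2)$
$E(Y_{ij}) = \mu$; $\mathrm{var}(Y_{ij}) = \sigma_u^2 + \sigma_e^2$; $\mathrm{cov}(Y_{ij}, Y_{km}) = \sigma_u^2$ provided $i = k$ (and $j \neq m$); $\mathrm{cov}(Y_{ij}, Y_{km}) = 0$ otherwise
So $V_i$ is exchangeable with correlation $\rho = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$ = ICC (ICC: intraclass correlation coefficient)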
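The exchangeable structure can be checked numerically from the mixed-model covariance formula, var(y) = Z G Z^T + R, applied to one area with a random intercept. The variance values below are hypothetical, and `marginal_cov` is an illustrative helper name.

```python
import numpy as np

def marginal_cov(Z, G, R):
    """Covariance of y implied by the mixed model: Z G Z^T + R."""
    return Z @ G @ Z.T + R

n_i = 4                                  # households sampled in one area
var_area, var_error = 2.0, 6.0           # hypothetical variances of u_i and e_ij
Z = np.ones((n_i, 1))                    # every household shares the area effect u_i
G = np.array([[var_area]])
R = var_error * np.eye(n_i)              # independent errors
V_i = marginal_cov(Z, G, R)

sd = np.sqrt(np.diag(V_i))
corr = V_i / np.outer(sd, sd)            # all off-diagonal entries are equal
icc = var_area / (var_area + var_error)  # intraclass correlation coefficient
```

With these values every diagonal element of `V_i` is 8, every off-diagonal element is 2, and the common correlation equals the ICC of 0.25.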

Numerical example: Recovery from stroke
Treatment groups:
A = new OT intervention
B = special stroke unit, same hospital
C = usual care in different hospital
8 patients per group
Measurements of functional ability: Barthel index measured weekly for 8 weeks
$Y_{ijk}$: patients $i$, groups $j$, times $k$
• Exploratory analyses: plots
• Naïve analyses
• Modelling

Numerical example: time plots Individual patients and overall regression line

Numerical example: time plots for groups

Numerical example: research questions
• Primary question: do slopes differ (i.e. do treatments have different effects)?
• Secondary question: do intercepts differ (i.e. are groups the same initially)?

Numerical example: Scatter plot matrix

Numerical example Correlation matrix

Numerical example 1. Pooled analysis ignoring correlation within patients

Numerical example 2. Data reduction

Numerical example 3. Repeated measures analyses using various variance-covariance structures
For the stroke data, the scatter plot matrix and correlations suggest that an auto-regressive structure (e.g. AR(1)) is most appropriate
Use GEEs to fit models

Numerical example 4. Mixed/Random effects model
• Use model $Y_{ijk} = (\alpha_j + a_{ij}) + (\beta_j + b_{ij})k + e_{ijk}$
• $\alpha_j$ and $\beta_j$ are fixed effects for groups
• other effects are random and all are independent
• Fit the model and use estimates of fixed effects to compare the $\alpha_j$'s and $\beta_j$'s
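Because the random effects and the errors are independent, the model above implies a specific within-patient covariance between weeks $k$ and $k'$ for the same patient. The variance symbols $\sigma_a^2$, $\sigma_b^2$ and $\sigma_e^2$ (for the random intercept, random slope and error) are assumed names introduced here for illustration.

```latex
\begin{align*}
\operatorname{var}(Y_{ijk}) &= \sigma_a^2 + k^2 \sigma_b^2 + \sigma_e^2 \\
\operatorname{cov}(Y_{ijk}, Y_{ijk'}) &= \sigma_a^2 + k k' \sigma_b^2, \qquad k \neq k'
\end{align*}
```

So the correlation structure is induced by the random effects rather than specified directly, and measurements on different patients remain uncorrelated.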

Numerical example: Results for intercepts Results from Stata 8

Numerical example: Results for slopes Results from Stata 8

Numerical example: Summary of results
• All models produced similar results leading to the same conclusion: no treatment differences
• Pooled analysis and data reduction are useful for exploratory analysis: easy to follow, give good approximations for estimates, but variances may be inaccurate
• Random effects models give very similar results to GEEs
  – don't need to specify variance-covariance matrix
  – model specification may/may not be more natural
