220 likes | 413 Vues
Longitudinal Data Spring 2006. Chapter 2 Graphical Displays of Longitudinal Data Part II. Instructors Alan Hubbard Nick Jewell. Fitting model by individual. In addition to examining raw data one can also fit statistical models by subject.
E N D
Longitudinal DataSpring 2006 Chapter 2 Graphical Displays of Longitudinal Data Part II Instructors Alan Hubbard Nick Jewell
Fitting model by individual • In addition to examining raw data one can also fit statistical models by subject. • The type of fit will depend on many things including: • How much data per subject • Hypotheses about trends. • In this case, we going to assume a simple linear model per subject.
Fitting linear models by individual • In notation, for each subject i from i=1,..,m we use STATA to fit the linear models: • Where Yij is the CD4 count on the jth measurement of subject i made at time Tij. • After the fit, we get estimates of the slopes and intercepts, , for each subject, i.
Code to Fit Linear Models by ID • Need to write a little STATA program: * Blank variables to hold results gen predcd4 = . gen slope = . gen intercept = . * Define Program that does regression by ID and saves * predicted values, slopes and intercepts program define regbyid, byable(recall) syntax [varlist] [if] [in] marksample touse capture matrix drop beta regress `varlist' if `touse' matrix beta = get(_b) replace slope = beta[1,1] if `touse' replace intercept = beta[1,2] if `touse' capture drop predt predict predt replace predcd4 = predt if `touse' end
Code to Fit Linear Models by ID * Runs regression program by id sort id quietly by id: regbyid cd4 etime *Plots example of fits and raw data for just 2 *ids #delimit ; scatter cd4 etime if id==1, ms(O) c(.) || scatter predcd4 etime if id==1,ms(i) c(l) || scatter cd4 etime if id==3, ms(D) c(.) || scatter predcd4 etime if id==3,ms(i) c(l) legend(off) ytitle("CD4") xtitle("time(days)");
Example of fitting linear models by individual for two of the id’s
What is the distribution of Slopes and Intercepts? • Want to see an estimate of the distribution of slopes and intercepts of CD4 count versus time. • Can do so by using histograms. ** Create a variable that counts observations within an id sorted by time capture drop idcnt sort id etime quietly by id: gen idcnt = _n twoway histogram slope if idcnt==1, freq saving(graph1, replace) yscale(range(0 200)) ylabel(0(20)200, grid) twoway histogram intercept if idcnt==1, freq saving(graph2, replace) yscale(range(0 200)) ylabel(0(20)200, grid) gr combine graph1.gph graph2.gph
Display a subset of subjects based on slope (using both raw data and fit) • Same principal as already examined for mean(CD4) and AUC. • Sort subjects by slope (from most negative to most positive). • Plot either the raw data or fitted lines for a subset of subjects evenly distributed with respect to the estimated slope.
First Raw Data ** Plot raw data evenly distributed w.r.t. slopes capture drop rslope egen rslope = rank(slope) if idcnt==1, unique capture drop mr ** Trick just to get the rank at id level on every ** observation for that id egen mr = mean(rslope), by(id) capture drop maxrmr ** Maximum rank (or total number of subjects) egen maxrmr = max(mr) capture drop ii * m/k gen ii = int(maxrmr/20) capture drop bb gen bb = mod(mr,ii)==0 sort id etime #delimit ; overlay cd4 etime if bb==1 & etime > 0 & etime < 2000, by(id) connect(l) symbol(o) xlab(0,500,1000,1500,2000) ylab(0,500,1000,1500,2000) saving(graph1, replace) ti("20 Evenly Spaced Subjects Ranked by Slope (CD4 vs. T)");
20 subjects evenly distributed with respect to slope of CD4 vs. time
Fitted values for same subjects #delimit ; overlay predcd4 etime if bb==1 & etime > 0 & etime < 2000, by(id) connect(l) symbol(i) xlab(0,500,1000,1500,2000) ylab(0,500,1000,1500,2000) saving(graph1, replace) ti(“30 Evenly Spaced Subjects Ranked by Slope (CD4 vs. T)");
Smoothing • One way to examine mean trends over time is to assume no particular form (linear, quadratic, etc.), but “smooth” the data (perform smooth regression). • Many types of smoothing (moving average, kernel density, smoothing splines, local linear, etc.). • Let the value of the smooth at time t be m(t) - almost all of the techniques to estimate m(t) can be understood as local (weighted) averages of the Y’s in a neighbor around an t.
How smooth regression works • Ignore the repeated measures structure of the data for now. • Assume we have Y’s and T’s (CD4 and time). • The way smooth regression works is to: • make a small grid of points t along the time axis (in our case, say t=0,2,4,...,2000). • for each t, estimate E(Y|T=t)=m(t) as: where w[Ti-t]/h] is a weight which gets smaller as the distance between Ti and t gets larger (and how quickly it gets small depends on h called the bandwidth)
How smooth regression works, cont. • Making a scatter plot (connecting points) of vs t. • The bigger h is, the smoother the regression: • if h is really big, then at every point t you just get the average Y (a straight line). • if h is really small, just connect the raw data (no smoothing). • There are ways to select the “best” h using model selection techniques (such as x-validation). • Smoothing is almost always a useful way to look at plots of Y versus time (or explanatory variables) particularly with noisy data.
Smoothing all data (CD4 vs. Time) using STATA ** At bandwidth that is 80% of data (gen generates a new variable which is m(t) at each t, or etime) lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth1) nograph bwidth(0.8) ** At bandwidth that is 50% of data lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth2) nograph bwidth(0.5) ** At bandwidth that is 10% of data lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth3) nograph bwidth(0.1) sort etime ** Plot all together #delimit; scatter smooth1 etime if etime > 0 & etime < 2000, ms(i) c(l) || scatter smooth2 etime if etime > 0 & etime < 2000, ms(i) c(l) || scatter smooth3 etime if etime > 0 & etime < 2000, ms(i) c(l) legend(off) ytitle("Smooth CD4") xtitle("time(days)") ysc(r(250 400));
Smoothing one subject at a time • Can make a similar plot to one showing the fitted linear models by id using the “smooths” instead. • Like the linear model example, need to right a little STATA program first. ***** Program to estimate smooth by id capture drop predcd4 gen predcd4 = . capture program drop smthbyid program define smthbyid, byable(recall) syntax [varlist] [if] [in] marksample touse capture drop predt lowess `varlist' if `touse', gen(predt) bandwidth(0.5) nograph replace predcd4 = predt if `touse' end
Clustering subjects based on similarity of CD4 versus time • Relatively new stuff and we won’t go in great detail. • Idea is to treat the CD4 observations at different times as simply different variables one measures on a subject. • Technique requires all subjects to have the measurements at same times, so we need to interpolate at a fixed number of times for each subject so that everyone has identical times. • Then, after putting the data in wide format, one simply clusters subjects based on these interpolated times. • Then, after going back to long format, plot smooths by cluster assignment