1 / 22

Chapter 2 Graphical Displays of Longitudinal Data Part II

Longitudinal Data Spring 2006. Chapter 2 Graphical Displays of Longitudinal Data Part II. Instructors Alan Hubbard Nick Jewell. Fitting model by individual. In addition to examining raw data one can also fit statistical models by subject.

ally
Télécharger la présentation

Chapter 2 Graphical Displays of Longitudinal Data Part II

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Longitudinal DataSpring 2006 Chapter 2 Graphical Displays of Longitudinal Data Part II Instructors Alan Hubbard Nick Jewell

  2. Fitting model by individual • In addition to examining raw data one can also fit statistical models by subject. • The type of fit will depend on many things including: • How much data per subject • Hypotheses about trends. • In this case, we going to assume a simple linear model per subject.

  3. Fitting linear models by individual • In notation, for each subject i from i=1,..,m we use STATA to fit the linear models: • Where Yij is the CD4 count on the jth measurement of subject i made at time Tij. • After the fit, we get estimates of the slopes and intercepts, , for each subject, i.

  4. Code to Fit Linear Models by ID • Need to write a little STATA program: * Blank variables to hold results gen predcd4 = . gen slope = . gen intercept = . * Define Program that does regression by ID and saves * predicted values, slopes and intercepts program define regbyid, byable(recall) syntax [varlist] [if] [in] marksample touse capture matrix drop beta regress `varlist' if `touse' matrix beta = get(_b) replace slope = beta[1,1] if `touse' replace intercept = beta[1,2] if `touse' capture drop predt predict predt replace predcd4 = predt if `touse' end

  5. Code to Fit Linear Models by ID * Runs regression program by id sort id quietly by id: regbyid cd4 etime *Plots example of fits and raw data for just 2 *ids #delimit ; scatter cd4 etime if id==1, ms(O) c(.) || scatter predcd4 etime if id==1,ms(i) c(l) || scatter cd4 etime if id==3, ms(D) c(.) || scatter predcd4 etime if id==3,ms(i) c(l) legend(off) ytitle("CD4") xtitle("time(days)");

  6. Example of fitting linear models by individual for two of the id’s

  7. What is the distribution of Slopes and Intercepts? • Want to see an estimate of the distribution of slopes and intercepts of CD4 count versus time. • Can do so by using histograms. ** Create a variable that counts observations within an id sorted by time capture drop idcnt sort id etime quietly by id: gen idcnt = _n twoway histogram slope if idcnt==1, freq saving(graph1, replace) yscale(range(0 200)) ylabel(0(20)200, grid) twoway histogram intercept if idcnt==1, freq saving(graph2, replace) yscale(range(0 200)) ylabel(0(20)200, grid) gr combine graph1.gph graph2.gph

  8. Histograms of Slope and Intercepts

  9. Display a subset of subjects based on slope (using both raw data and fit) • Same principal as already examined for mean(CD4) and AUC. • Sort subjects by slope (from most negative to most positive). • Plot either the raw data or fitted lines for a subset of subjects evenly distributed with respect to the estimated slope.

  10. First Raw Data ** Plot raw data evenly distributed w.r.t. slopes capture drop rslope egen rslope = rank(slope) if idcnt==1, unique capture drop mr ** Trick just to get the rank at id level on every ** observation for that id egen mr = mean(rslope), by(id) capture drop maxrmr ** Maximum rank (or total number of subjects) egen maxrmr = max(mr) capture drop ii * m/k gen ii = int(maxrmr/20) capture drop bb gen bb = mod(mr,ii)==0 sort id etime #delimit ; overlay cd4 etime if bb==1 & etime > 0 & etime < 2000, by(id) connect(l) symbol(o) xlab(0,500,1000,1500,2000) ylab(0,500,1000,1500,2000) saving(graph1, replace) ti("20 Evenly Spaced Subjects Ranked by Slope (CD4 vs. T)");

  11. 20 subjects evenly distributed with respect to slope of CD4 vs. time

  12. Fitted values for same subjects #delimit ; overlay predcd4 etime if bb==1 & etime > 0 & etime < 2000, by(id) connect(l) symbol(i) xlab(0,500,1000,1500,2000) ylab(0,500,1000,1500,2000) saving(graph1, replace) ti(“30 Evenly Spaced Subjects Ranked by Slope (CD4 vs. T)");

  13. Smoothing • One way to examine mean trends over time is to assume no particular form (linear, quadratic, etc.), but “smooth” the data (perform smooth regression). • Many types of smoothing (moving average, kernel density, smoothing splines, local linear, etc.). • Let the value of the smooth at time t be m(t) - almost all of the techniques to estimate m(t) can be understood as local (weighted) averages of the Y’s in a neighbor around an t.

  14. How smooth regression works • Ignore the repeated measures structure of the data for now. • Assume we have Y’s and T’s (CD4 and time). • The way smooth regression works is to: • make a small grid of points t along the time axis (in our case, say t=0,2,4,...,2000). • for each t, estimate E(Y|T=t)=m(t) as: where w[Ti-t]/h] is a weight which gets smaller as the distance between Ti and t gets larger (and how quickly it gets small depends on h called the bandwidth)

  15. How smooth regression works, cont. • Making a scatter plot (connecting points) of vs t. • The bigger h is, the smoother the regression: • if h is really big, then at every point t you just get the average Y (a straight line). • if h is really small, just connect the raw data (no smoothing). • There are ways to select the “best” h using model selection techniques (such as x-validation). • Smoothing is almost always a useful way to look at plots of Y versus time (or explanatory variables) particularly with noisy data.

  16. Smoothing all data (CD4 vs. Time) using STATA ** At bandwidth that is 80% of data (gen generates a new variable which is m(t) at each t, or etime) lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth1) nograph bwidth(0.8) ** At bandwidth that is 50% of data lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth2) nograph bwidth(0.5) ** At bandwidth that is 10% of data lowess cd4 etime if etime > 0 & etime < 2000, gen(smooth3) nograph bwidth(0.1) sort etime ** Plot all together #delimit; scatter smooth1 etime if etime > 0 & etime < 2000, ms(i) c(l) || scatter smooth2 etime if etime > 0 & etime < 2000, ms(i) c(l) || scatter smooth3 etime if etime > 0 & etime < 2000, ms(i) c(l) legend(off) ytitle("Smooth CD4") xtitle("time(days)") ysc(r(250 400));

  17. Smoothed CD4 vs. Time for different bandwidths using STATA

  18. Smoothing one subject at a time • Can make a similar plot to one showing the fitted linear models by id using the “smooths” instead. • Like the linear model example, need to right a little STATA program first. ***** Program to estimate smooth by id capture drop predcd4 gen predcd4 = . capture program drop smthbyid program define smthbyid, byable(recall) syntax [varlist] [if] [in] marksample touse capture drop predt lowess `varlist' if `touse', gen(predt) bandwidth(0.5) nograph replace predcd4 = predt if `touse' end

  19. Plot an example of two subjects of the smooth CD4 vs. time

  20. Random Sample of 30 smooths time

  21. Clustering subjects based on similarity of CD4 versus time • Relatively new stuff and we won’t go in great detail. • Idea is to treat the CD4 observations at different times as simply different variables one measures on a subject. • Technique requires all subjects to have the measurements at same times, so we need to interpolate at a fixed number of times for each subject so that everyone has identical times. • Then, after putting the data in wide format, one simply clusters subjects based on these interpolated times. • Then, after going back to long format, plot smooths by cluster assignment

  22. Clustering subjects based on similarity of CD4 versus time

More Related