
Advanced Stata Workshop

Advanced Stata Workshop. Nealia Khan, Tom Tomberlin. Learning Technologies Center, Harvard University, Graduate School of Education. Contact Information: located in Gutman Library, 3rd floor. Contact us at: stathelp@gse.harvard.edu


Presentation Transcript


  1. Advanced Stata Workshop • Nealia Khan, Tom Tomberlin • Learning Technologies Center, Harvard University, Graduate School of Education

  2. Contact Information • Located in Gutman Library, 3rd floor • Contact us at: stathelp@gse.harvard.edu • You can make an appointment or have us respond to your request by email

  3. Generating Variables • generate (gen): creates a new variable from an expression. The expression can use mathematical functions and conditional (if) statements, and Stata provides many built-in functions that can appear in it. To change the contents of an existing variable, use replace. • Extensions to generate (egen): creates new variables using a set of egen-specific functions. egen cannot be used interchangeably with gen: you must use one of the egen functions with this command.
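
As a minimal sketch of the difference between the two commands, the following uses a small made-up dataset (the variables score and group are hypothetical and are not part of the workshop data). gen evaluates an expression row by row, while egen applies one of its own functions, here mean(), within groups.

```stata
* Toy data; variable names are illustrative only
clear
input score group
10 1
20 1
30 2
.  2
end

* generate: an expression, a built-in function, and an if condition
gen score_sq = score^2
gen log_score = ln(score) if score > 0 & !missing(score)

* egen: an egen-specific function, here the mean of score within group
bysort group: egen mean_score = mean(score)

list
```

Note that the if condition excludes missing values explicitly, because Stata treats missing as larger than any number.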

  4. Gen and Egen Functions • group • concat • cond

  5. Group • Assign a unique, three-digit numeric value to each district and school name. • Code:
         sort district
         egen group = group(district)
         sort group
         gen districtid = group+100
         drop group
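
To see what group() produces, here is a small sketch on made-up district names (the data and the temporary variable name grp are assumptions for illustration). group() numbers the distinct values 1, 2, 3, ... in sorted order, so adding 100 yields three-digit ids.

```stata
* Toy data; district names are invented for illustration
clear
input str10 district
"Adams"
"Baker"
"Adams"
"Clark"
"Baker"
end

* group() assigns 1, 2, 3, ... to the distinct values of district
egen grp = group(district)
gen districtid = grp + 100
drop grp
list
```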

  6. Apply Your Knowledge • Generate the variable schoolid, a three-digit number beginning with 301 that uniquely identifies each school in the schname variable. • Hint: Sorting by both district and schname will keep schools within a district together when the ids are created.

  7. Concat • Join the two newly generated ids (district and school) into one six-digit identifier that uniquely identifies each school. Note that concat stores the result as a string variable. • Code:
         egen id = concat(districtid schoolid)
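
The sketch below, on made-up id values, shows that concat() returns a string and how destring can produce a numeric copy if one is needed. The variable name id_num is an assumption introduced for illustration.

```stata
* Toy data; id values are invented for illustration
clear
input districtid schoolid
101 301
101 302
102 303
end

* concat() joins the values as text, so id is a string variable
egen id = concat(districtid schoolid)
describe id

* Optional: a numeric copy, for numeric sorting or arithmetic
destring id, generate(id_num)
format id_num %9.0f
list
```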

  8. Condition (cond) • We can use the cond() function in a generate statement to flag duplicate observations of an id number: dup is 0 when an id occurs only once; otherwise it numbers the copies 1, 2, 3, ... within each id. • Code:
         sort id race female
         quietly by id: gen dup = cond(_N==1,0,_n)
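
A small sketch on invented ids may help in reading the dup variable. Within a by-group, _N is the number of observations and _n is the position of the current observation; the extra variable n_copies is hypothetical and simply records _N.

```stata
* Toy data; ids are invented for illustration
clear
input str6 id
"101301"
"101301"
"101302"
"101303"
"101303"
"101303"
end

sort id
* 0 for ids that occur once, otherwise 1, 2, 3, ... within each id
by id: gen dup = cond(_N == 1, 0, _n)
* Total number of observations sharing each id
by id: gen n_copies = _N
list, sepby(id)
```

Running tabulate dup afterwards gives a quick count of how many ids have second, third, or later copies.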

  9. Apply Your Knowledge • Check to see how many duplicates there are for each value of id. • Create the variable student and assign it a value of 0 if there are 3 or fewer duplicates of id and a value of 1 if there are 4 to 6 duplicates. • Generate a new variable, studentid, which joins id and student together. • Drop the id, dup, and student variables from the dataset.

  10. Forming Composites • We can create a new variable, risk, that is the sum of the four dummy variables. • Code:
          gen risk = lep + sped + lo_read + frlunch
      • A sum computed with gen is missing whenever any of its components is missing. We can avoid this missing-values problem with the egen function rowtotal, which treats missing values as 0 (drop risk first if it already exists). • Code:
          egen risk = rowtotal(lep sped lo_read frlunch)
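
The following sketch contrasts the two approaches on three made-up observations. The names risk_sum, risk_total, and n_items are assumptions chosen so that both versions can coexist in one dataset.

```stata
* Toy data; values are invented for illustration
clear
input lep sped lo_read frlunch
1 0 1 1
0 . 1 0
. . . .
end

* Plain sum: missing if any component is missing
gen risk_sum = lep + sped + lo_read + frlunch

* rowtotal(): missing components are treated as 0
egen risk_total = rowtotal(lep sped lo_read frlunch)

* rownonmiss(): how many components were actually observed
egen n_items = rownonmiss(lep sped lo_read frlunch)
list
```

Because rowtotal() returns 0 even when every component is missing, a count such as n_items is one way to tell a true zero from an observation with no data.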

  11. Forming Composites • We can use principal components analysis (PCA) to generate weights for each of the items in risk. After pca, the predict command generates a score on the first principal component for each observation in the dataset (drop any existing risk variable first). • Code:
          pca lep sped lo_read frlunch
          predict risk
          browse lep sped lo_read frlunch risk
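
As a runnable sketch of the same steps, the example below uses the auto dataset that ships with Stata in place of the workshop data; the choice of variables and the name pc1 are assumptions for illustration only.

```stata
* auto.dta stands in for the workshop data
sysuse auto, clear
pca price mpg weight length

* Score on the first principal component for each observation
predict pc1, score
summarize pc1
```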

  12. Categorical Variables • Our dataset contains the categorical variable race. • We can deal with this variable in one of two ways: • Form dummy variables for each race subgroup. Code:
          gen race1=race if race==1
          replace race1=0 if race~=1
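
Two shorter routes to the same dummy variables are sketched below on made-up data. The logical expression keeps missing values of race as missing, and tabulate with the generate() option creates one indicator per category; the stub r_ is an arbitrary name chosen for illustration.

```stata
* Toy data; category codes are invented for illustration
clear
input race
1
2
3
2
.
end

* 1 if race is 1, 0 for other observed categories, missing if race is missing
gen race1 = (race == 1) if !missing(race)

* One indicator per category: r_1, r_2, r_3
quietly tabulate race, generate(r_)
list
```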

  13. Categorical Variables • Our dataset contains the categorical variable race. • We can deal with this variable in one of two ways: • Indicate the categorical nature of the variable in the regression model. Code:
          regress mathraw i.race

  14. Interactions with Categorical Variables • Does the effect of risk vary by racial group? • xi3 allows us to form interactions from within the regression model. Code:
          xi3: regress mathraw i.race*risk
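
In Stata 11 or later, factor-variable notation offers another way to specify a categorical-by-continuous interaction without xi3. The sketch below is not the workshop's method; it uses the auto dataset shipped with Stata as a stand-in, with foreign playing the role of the categorical variable and weight the continuous one.

```stata
* auto.dta stands in for the workshop data
sysuse auto, clear

* i. marks a categorical variable, c. a continuous one;
* ## includes both main effects and their interaction
regress price i.foreign##c.weight

* Joint test of the interaction term(s)
testparm i.foreign#c.weight
```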

  15. Apply Your Knowledge • Use the xi3 command to fit a regression model that includes interaction effects between race and class size. Do we see a differential effect of class size for any race groups? • Use the xi3 command to fit a regression model that includes interaction effects between race and both class size and time on the bus.

  16. Creating Regression Tables • Stata has the ability to create formatted tables for regression models. • These tables can be created within the Stata program or exported to a text format. • Both methods of table creation rely on the estimate store (eststo) command in Stata.

  17. Creating Regression Tables • There are two methods of storing the estimates of a regression model with eststo: • Invoking eststo immediately after a regression procedure and assigning a name for the stored values. Code:
          regress mathraw i.race risk
          eststo m1, title(Model 1)

  18. Creating Regression Tables • There are two methods of storing the estimates of a regression model with eststo: • Using eststo: before a regression model. Stata will automatically assign consecutive model numbers to the stored values. Code:
          eststo: regress mathraw risk
          (est1 stored)
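
The store-then-tabulate idea can also be tried with Stata's built-in estimates store and estimates table commands, which need no extra package. This is a swapped-in alternative to eststo, sketched on the auto dataset that ships with Stata; the model names m1 and m2 are arbitrary.

```stata
* auto.dta stands in for the workshop data
sysuse auto, clear

regress price weight
estimates store m1

regress price weight mpg
estimates store m2

* Side-by-side coefficients and standard errors, plus N and R-squared
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N r2)
```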

  19. Creating Regression Tables • There are two commands that allow you to access the information in the estimate store memory: estout and esttab. • estout is the more flexible of the two in its ability to modify the appearance of the formatted regression table, but it also requires more programming code to achieve APA style tables. • esttab is a “wrapper” for estout and simplifies the coding process.

  20. ESTOUT • Code:
          eststo: xi3: regress mathraw i.race
          (est1 stored)
          eststo: xi3: regress mathraw i.race risk class_sz bus_time
          (est2 stored)
          eststo: xi3: regress mathraw i.race*risk class_sz bus_time
          (est3 stored)

  21. ESTOUT • Code:
          estout est1 est2 est3
      • Now let’s try to format this table into something suitable for a research paper: • Code:
          estout using models_out.rtf, ///
              cells(b(star fmt(3)) se(par fmt(2))) legend label ///
              title(Regression Models) mlabels("Model A" "Model B" "Model C") ///
              varlabels(_cons INTERCEPT) ///
              stats(N r2 df_r, fmt(0 3 0) labels(N R2 DF)) style(fixed)
      • Here is the result of this code:

  22. ESTOUT

  23. ESTTAB • Code:
          esttab
      • We can make modifications to the standard esttab table: • Code:
          esttab using models.rtf, se r2 ar2 label ///
              title({\b Table 1.} {\i Hierarchy of Fitted Models}) ///
              nonumbers mtitles("Model A" "Model B" "Model C") ///
              varlabels(_cons INTERCEPT) ///
              order( _Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) ///
              style(fixed)
      • Here is the result of this code:

  24. ESTTAB

  25. ESTTAB • Here is the estout code that produces the same table we just created with esttab:
          estout using `"models1.rtf"' , cells(b(fmt(a3) star) se(fmt(a3) par)) stats(N r2 r2_a, fmt(%18.0g 3 3) labels(`"Observations"' `"{\i R}{\super 2}"' `"Adjusted {\i R}{\super 2}"')) starlevels("{\super *}" 0.05 "{\super **}" 0.01 "{\super ***}" 0.001, label(" {\i p} < ")) varwidth(20) modelwidth(12) begin({\trowd\trgaph108\trleft-108@rtfrowdefbrdr\pard\intbl\ql {) delimiter(}\cell \pard\intbl\qc {) end(}\cell\row}) title({\b Table 1.} {\i Hierarchy of Fitted Models}) prehead(`"{\rtf1\ansi\deff0 {\fonttbl{\f0\fnil Times New Roman;}}"' `"{\info {\author .}{\company .}{\title .}{\creatim\yr2010\mo3\dy31\hr14\min14}}"' `"\deflang1033\plain\fs24"' `"{\footer\pard\qc\plain\f0\fs24\chpgn\par}"' `"{\pard\keepn\ql @title\par}"' {) posthead() prefoot() postfoot(`"{\pard\ql\fs20 Standard errors in parentheses\par}"' `"{\pard\ql\fs20 @starlegend\par}"' } `"{\pard \par}"' `"}"') label varlabels(_cons INTERCEPT) mlabels("Model A" "Model B" "Model C",) nonumbers collabels(, none) eqlabels(, begin("{\trowd\trgaph108\trleft-108@rtfrowdefbrdrt\pard\intbl\ql {") replace nofirst) notype level(95) replace order( _Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) style(fixed)

  26. Graphing - Scatterplots • Bivariate scatterplot: mathraw on risk. • Code:
          scatter mathraw risk
      • Because risk takes only a few distinct values, the points pile up and the graph has a “stacked” appearance. We can lessen this effect with the jitter option. • Code:
          scatter mathraw risk, jitter(4)
      • Here is the resulting graph:

  27. Graphing - Scatterplots • Now let’s add a fitted trend line to the scatterplot of mathraw on risk. • Code:
          twoway scatter mathraw risk, jitter(4) || lfit mathraw risk
      • The lfit plot type draws a linear fitted trend line. Two other fit types are available for the trend line: qfit (quadratic fit) and fpfit (fractional polynomial fit). • Here are three graphs that illustrate these three fitted line options:
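
A sketch of the overlay syntax on the auto dataset shipped with Stata (a stand-in for the workshop data) is given here. It uses the parenthesized form of twoway, which is equivalent to separating plots with ||, and the graph names linear and quadratic are arbitrary.

```stata
* auto.dta stands in for the workshop data
sysuse auto, clear

* Scatterplot with a linear fit, then with a quadratic fit
twoway (scatter mpg weight, jitter(4)) (lfit mpg weight), name(linear, replace)
twoway (scatter mpg weight, jitter(4)) (qfit mpg weight), name(quadratic, replace)

* Place the two graphs side by side on a common y scale
graph combine linear quadratic, ycommon
```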

  28. Graphing - Scatterplots • We can easily add 95% confidence intervals to the fitted trend line for any of the fitted trend line options. • Code:
          twoway scatter mathraw risk, jitter(4) || lfitci mathraw risk, ciplot(rline)
          twoway scatter mathraw gpa || qfitci mathraw gpa, ciplot(rline)
          twoway scatter mathraw risk, jitter(4) || fpfitci mathraw risk, ciplot(rline)
      • Here are the three graphs from the code above:

  29. Graphing – Residual Scatterplots • Let’s begin by fitting our regression model:
          xi3: regress mathraw female class_sz bus_time i.race*risk
      • Stata has two postestimation commands that allow us to check (raw) residuals against predictors and fitted values. Code:
          rvpplot class_sz, yline(0)
          rvfplot, yline(0)
      • Here are the graphs for these two commands:

  30. Graphing – Residual Scatterplots • Suppose we want to plot the studentized residuals against the predictors and fitted values. • We must generate studentized residuals for each observation and also predict fitted values of mathraw for each observation. • Code:
          predict student if e(sample), rstudent
          predict fitted
          scatter student fitted, yline(2) yline(-2)
          scatter student class_sz, yline(2) yline(-2)
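
The same steps can be sketched on the auto dataset shipped with Stata (a stand-in for the workshop data). The variable names rstu, yhat, and flag are assumptions for illustration; flag marks observations whose studentized residual exceeds 2 in absolute value, matching the reference lines on the plot.

```stata
* auto.dta stands in for the workshop data
sysuse auto, clear
regress mpg weight foreign

* Studentized residuals and fitted values for the estimation sample
predict rstu if e(sample), rstudent
predict yhat if e(sample), xb

* Flag observations outside the +/-2 reference lines
gen byte flag = abs(rstu) > 2 if !missing(rstu)
count if flag == 1

scatter rstu yhat, yline(-2 2)
```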

  31. Graphing – Regression Lines • We can create fitted regression lines in Stata by using the xi3 prefix with the regress command. • The graph is generated by the postgr3 command followed by the variable of interest. • Code:
          xi3: regress mathraw i.race*risk class_sz bus_time female
          postgr3 risk
      • Here is the graph which results from this code:

  32. Graphing – Regression Lines • We can enhance our graph to show the effect of risk on mathraw by including prototypical values of class_sz in the graph. • Code:
          gen class_cat=1 if class_sz<=17
          replace class_cat=2 if class_sz>17 & class_sz<=30
          replace class_cat=3 if class_sz>30 & !missing(class_sz)
          xi3: regress mathraw i.race*risk class_cat bus_time female
          postgr3 risk, by(class_cat)
      • Here is the graph from the code:
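
One caution when cutting a continuous variable into categories: Stata treats missing values as larger than any number, so a condition such as class_sz>30 is also true for missing class sizes unless they are excluded. The sketch below shows this on made-up data, along with a one-line alternative using nested cond() calls; the name class_cat2 is an assumption for illustration.

```stata
* Toy data; class sizes are invented for illustration
clear
input class_sz
15
17
25
31
.
end

gen class_cat = 1 if class_sz <= 17
replace class_cat = 2 if class_sz > 17 & class_sz <= 30
* Exclude missing explicitly, otherwise it would land in category 3
replace class_cat = 3 if class_sz > 30 & !missing(class_sz)

* Equivalent one-step version with nested cond()
gen class_cat2 = cond(class_sz <= 17, 1, cond(class_sz <= 30, 2, 3)) if !missing(class_sz)
list
```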

  33. Graphing – Regression Lines • We can also generate a graph of the regression lines which show the interactions between risk and race. • Code:
          postgr3 risk, by(race)
      • Here is the graph produced by the code:

  34. Graphing – Regression Lines • We can use the graph combine command in Stata to join two graphs together. • Code:
          xi3: regress mathraw class_sz bus_time female i.race*risk
          postgr3 risk, by(race) x(female=1) name(female)
          postgr3 risk, by(race) x(female=0) name(male)
          graph combine female male, ycommon
      • Here is the graph for the preceding code:

  35. Questions: • Please complete the evaluation of this workshop • Thank you!
