Post-collection processing of data

Survey Research and Design, Spring 2006, Class #12 (Week 14)

Today’s objectives
• To review some things from last week
• To answer questions you have
• To discuss application exercise #4
• To examine issues related to post-collection processing
• To explore handling of missing data
Survey Research and Design (Umbach)

Survey Process
• Define Research Objectives
• Choose Mode of Collection
• Choose Sampling Frame
• Construct and Pretest Questionnaire
• Design and Select Sample
• Recruit and Measure Sample
• Code and Edit Data
• Make Postsurvey Adjustments
• Perform Analysis

Quality perspective (Groves, et al., p. 48)
[Figure: the survey life cycle from a quality perspective, drawn as two parallel tracks that both lead to the Survey Statistic]
• Measurement track: Construct → Measurement → Response → Edited Response; error sources: Validity, Measurement Error, Processing Error
• Representation track: Target Population → Sampling Frame → Sample → Respondents → Postsurvey Adjustments; error sources: Coverage Error, Sampling Error, Nonresponse Error, Adjustment Error

Post-collection processing of survey data
• Several different steps
  • Coding
  • Data entry
  • Editing
  • Handling item-missing data
  • Weighting
  • Sampling variance estimation
• Not a whole lot of agreement on how you should proceed

Coding
• Involves assigning a unique number to each response, to be used later for statistical computing
• Generally, each response category should have a number
  • Although you can enter letters for a survey response in SPSS, numbers give you more flexibility (e.g., use ‘1’ for female, not ‘F’)
• Codes should be exhaustive and mutually exclusive; you can combine categories later in SPSS
• You should code all possible responses
  • Have separate codes for did not answer the question, marked more than one response (when they shouldn’t have), etc.
  • This allows you to analyze these phenomena later; you do not want to go back and reenter your data to get more detail
• You probably want to recode some responses to inapplicable if they were truly inapplicable
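
A minimal sketch of a coding sheet, written in Python rather than SPSS purely for illustration. The question, the category labels, and the reserved codes 97/98/99 are hypothetical choices, not values from the course materials.

```python
# Hypothetical coding sheet for one closed-ended item ("What is your gender?").
# All labels and code values below are illustrative assumptions.
GENDER_CODES = {"female": 1, "male": 2}

# Reserved codes, kept separate so these cases can be analyzed later.
INAPPLICABLE = 97      # question did not apply to this respondent
MULTIPLE_MARKS = 98    # marked more than one response when only one was allowed
NO_ANSWER = 99         # left the question blank


def code_response(marks, applicable=True):
    """Turn the list of boxes a respondent marked into a single numeric code."""
    if not applicable:
        return INAPPLICABLE
    if len(marks) == 0:
        return NO_ANSWER
    if len(marks) > 1:
        return MULTIPLE_MARKS
    return GENDER_CODES[marks[0].strip().lower()]


raw = [["Female"], [], ["male", "female"], ["Male"]]
coded = [code_response(m) for m in raw]
print(coded)  # [1, 99, 98, 2]
```

Because blank, multiply marked, and inapplicable answers each keep their own code, they can be collapsed into a single missing category later without reentering anything.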

Coding
• Groves et al. also discuss coding of open-ended answers
  • Here the coding is not as clear as the one-to-one coding for closed-ended questions; it takes much more thought
  • Generally you should have two people code all the responses and then check for consistency (intercoder reliability)
  • If you have open-ended questions, do some research on existing coding structures that you can use in your survey
    • Examples: occupation; field of major
• Coding takes place
  • When you initially enter the data; you need to develop a coding sheet for your response categories
  • After the data are entered, using SPSS
    • E.g., globally change all responses to “Are you happy in your marriage?” for single respondents to the inapplicable code
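
One way to check consistency between two coders is sketched below. It computes simple percent agreement and Cohen's kappa; kappa is a common choice for intercoder reliability but is an assumption here, since the slides do not name a specific statistic. The codes are made up.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of responses to which both coders assigned the same code."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: product of each coder's marginal share, summed over codes.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codes two people gave to the same 10 open-ended answers.
coder_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
coder_b = [1, 1, 2, 3, 3, 3, 1, 2, 2, 1]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(agreement)        # 0.8
print(round(kappa, 2))  # 0.7
```

The sketch does not guard against the degenerate case where both coders use a single code for every answer (expected agreement of 1).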

Data entry
• With web surveys, the respondent does your data entry for you
  • But you still want to test your survey so that each response category has the right value associated with it
• For paper surveys, someone has to enter the data
  • The better approach is to have two people enter the same set of surveys, and then compare the responses
  • Chances are you won’t do this, so
    • Have a clear and straightforward sheet for coding your survey
    • Take your time doing data entry
    • Check the frequencies for each variable for bad codes
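
The two checks above can be sketched as follows: a comparison of two independent entries of the same surveys, and a frequency check for codes that are not on the coding sheet. The case IDs, variable names, and valid codes are hypothetical, and the comparison assumes both files contain the same cases and variables.

```python
from collections import Counter

# Hypothetical double entry: the same three paper surveys keyed by two people.
entry_1 = {101: {"q1": 1, "q2": 4}, 102: {"q1": 2, "q2": 3}, 103: {"q1": 1, "q2": 5}}
entry_2 = {101: {"q1": 1, "q2": 4}, 102: {"q1": 2, "q2": 8}, 103: {"q1": 1, "q2": 5}}


def compare_entries(first, second):
    """List (case id, variable, value 1, value 2) wherever the two entries differ."""
    diffs = []
    for case_id in sorted(first):
        for var in sorted(first[case_id]):
            v1, v2 = first[case_id][var], second[case_id][var]
            if v1 != v2:
                diffs.append((case_id, var, v1, v2))
    return diffs


def bad_codes(records, var, valid):
    """Frequency count of values for `var` that are not on the coding sheet."""
    freq = Counter(rec[var] for rec in records.values())
    return {value: count for value, count in freq.items() if value not in valid}


diffs = compare_entries(entry_1, entry_2)
bad = bad_codes(entry_2, "q2", valid={1, 2, 3, 4, 5, 99})
print(diffs)  # [(102, 'q2', 3, 8)]
print(bad)    # {8: 1}
```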

Editing
• Altering data recorded by the respondent to improve data quality
• You will mostly do
  • Range edits
  • Checks of highest and lowest values
  • Consistency edits
• And probably not
  • Ratio edits
  • Comparisons to historical data
  • Balance edits
• You must develop and record the set of rules you use to edit data
  • Aim to change the most implausible values for the smallest number of cases
  • If you’re uncertain, leave the data alone
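
A small sketch of recorded edit rules for range and consistency edits. The variables, limits, and the employment rule are hypothetical. The functions only flag cases and never change a value, in keeping with the advice to leave uncertain data alone.

```python
# Hypothetical edit rules; variable names and limits are illustrative only.
RANGE_RULES = {"age": (18, 99), "hours_worked": (0, 112)}


def range_edit_flags(record):
    """Return the variables whose values fall outside the allowed range."""
    flags = []
    for var, (low, high) in RANGE_RULES.items():
        value = record.get(var)
        if value is not None and not (low <= value <= high):
            flags.append(var)
    return flags


def consistency_edit_flags(record):
    """Flag answers that contradict each other within one case."""
    flags = []
    # A respondent coded as not employed should not report hours worked.
    if record.get("employed") == 0 and (record.get("hours_worked") or 0) > 0:
        flags.append("employed/hours_worked")
    return flags


cases = [
    {"id": 1, "age": 34, "employed": 1, "hours_worked": 40},
    {"id": 2, "age": 340, "employed": 1, "hours_worked": 35},
    {"id": 3, "age": 51, "employed": 0, "hours_worked": 20},
]
edit_log = {c["id"]: range_edit_flags(c) + consistency_edit_flags(c) for c in cases}
print(edit_log)  # {1: [], 2: ['age'], 3: ['employed/hours_worked']}
```

Keeping the rules in one place (here `RANGE_RULES` and the body of `consistency_edit_flags`) doubles as the written record of how the data were edited.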

Missing data
• Missing data is a serious problem in surveys
• Term generally refers to missing data for questions; i.e., respondents do not answer all questions
• Problems:
  • Reduces the sample size for analysis; less power for statistical tests
  • Can bias estimates if the cause is not random
  • If you’re not careful, you can end up doing analyses for different subsets of the data
• But the biggest problem is that we generally do not know why the data are missing, which complicates efforts to deal with it

Types of missing data
• Missing completely at random (MCAR)
  • Missingness is unrelated to the survey question as well as other variables in the survey
  • E.g., someone accidentally skips a question
• Missing at random (MAR)
  • Missingness is unrelated to the survey question, but is related to some other variable
  • E.g., older people have problems with recall
• Not missing at random (NMAR)
  • Missingness is related to the survey question
  • E.g., response to sensitive question like illegal drug use; users less likely to respond
  • This is the worst kind; can seriously bias estimates
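
The three mechanisms can be illustrated with a small simulation. Everything here is made up for illustration: the population, the link between age and the outcome, and the drop probabilities. The point is only the direction and rough size of the bias in the mean of the cases that remain.

```python
import random
from statistics import mean

random.seed(12)
N = 20000

# Hypothetical population: older respondents score higher on y on average.
people = []
for _ in range(N):
    older = random.random() < 0.5
    y = random.gauss(60 if older else 50, 5)
    people.append((older, y))

true_mean = mean(y for _, y in people)


def observed(drop_prob):
    """Keep each person unless a random draw falls below their drop probability."""
    return [(older, y) for older, y in people if random.random() >= drop_prob(older, y)]


mcar = observed(lambda older, y: 0.3)                     # same for everyone
mar = observed(lambda older, y: 0.5 if older else 0.1)    # depends on age only
nmar = observed(lambda older, y: 0.6 if y > 58 else 0.1)  # depends on y itself

bias = {}
for label, sample in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    bias[label] = mean(y for _, y in sample) - true_mean
    print(label, round(bias[label], 2))

# Under MAR the bias largely disappears within levels of the related variable.
mar_older_gap = mean(y for older, y in mar if older) - mean(y for older, y in people if older)
print("MAR, older only", round(mar_older_gap, 2))
```

In this setup the MCAR mean stays close to the full-sample mean, while the MAR and NMAR means are pulled downward. The MAR bias can be addressed by conditioning on age; the NMAR bias cannot be repaired from the observed data alone.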

Patterns of missing data
• Missing by design
  • Here some answers are missing due to the research design; e.g., not all respondents get all questions
• Partial nonresponse
  • After a certain point in survey, all data are missing
  • Two types:
    • Interview break off (very common, especially in multi-page web surveys)
    • Panel attrition
• Item nonresponse
  • Respondent completes surveys but does not answer all items
  • Respondent gives unusable answer (more common with paper surveys)
  • Data coding/editing error

Handling missing data
• First strategy is to “ignore” the missingness
• Listwise deletion – a case that has missing data for any one variable used in the analysis is deleted from all analyses
  • Commonly used
  • Assumes MCAR (strong assumption)
  • You can lose a lot of cases, especially in multivariate analyses
• Pairwise deletion – a case is dropped only from the statistics that need a variable it is missing; e.g., a correlation matrix would have different n’s for each correlation
  • Less commonly used
  • Also assumes MCAR
  • The number of cases varies from analysis to analysis; generally people are more comfortable with a single N
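
A short sketch of how the two deletion rules lead to different numbers of cases. The data and variable names are hypothetical, with `None` standing in for a missing answer.

```python
from itertools import combinations

# Hypothetical data; None marks a missing answer.
rows = [
    {"age": 25, "income": 40, "satisfaction": 4},
    {"age": 31, "income": None, "satisfaction": 5},
    {"age": 47, "income": 65, "satisfaction": None},
    {"age": 52, "income": 70, "satisfaction": 2},
    {"age": None, "income": None, "satisfaction": 3},
]
variables = ["age", "income", "satisfaction"]

# Listwise: drop a case if it is missing on any variable used in the analysis.
listwise = [r for r in rows if all(r[v] is not None for v in variables)]

# Pairwise: for each pair of variables, count the cases with data on both.
pairwise_n = {
    (a, b): sum(r[a] is not None and r[b] is not None for r in rows)
    for a, b in combinations(variables, 2)
}

print(len(listwise))  # 2
print(pairwise_n)
# {('age', 'income'): 3, ('age', 'satisfaction'): 3, ('income', 'satisfaction'): 2}
```

Listwise deletion leaves a single N of 2 for every statistic, while each correlation under pairwise deletion would rest on its own, larger N.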

Handling missing data
• The second strategy is to estimate or impute values for the missing data for each case
• Mean or median value – the mean or median calculated from cases with data is used to replace the missing value
  • Commonly used
  • Can lead to “spikes” in the distribution; standard errors are underestimated
  • Can also be calculated for subgroups
  • Need to include dummy variables indicating missing data in multivariate models
• Regression – use of a regression model to create a value
• EM algorithm – use of maximum likelihood to create a value
• Hot-deck – use of “adjacent” case’s data
• Multiple imputation – creates multiple datasets, using idea that an imputed value really has a distribution rather than a single value
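
A sketch of two of the simpler approaches, using made-up satisfaction scores. The hot-deck function is one basic variant (fill from the most recent observed case in file order) and is an assumption about what "adjacent" means; real hot-deck procedures usually sort or match cases first.

```python
from statistics import mean, stdev

# Hypothetical satisfaction scores; None marks item-missing data.
scores = [2, 4, None, 5, 3, None, 4, 1, None, 5]

# Mean imputation: every gap receives the same value, which creates a spike.
observed = [s for s in scores if s is not None]
fill = mean(observed)
mean_imputed = [fill if s is None else s for s in scores]

# The spread shrinks after imputation, which is why standard errors come out too small.
sd_before = stdev(observed)
sd_after = stdev(mean_imputed)
print(round(sd_before, 2), round(sd_after, 2))


def sequential_hot_deck(values):
    """Fill each gap with the most recent observed value in file order."""
    filled, donor = [], None
    for v in values:
        if v is not None:
            donor = v
        filled.append(donor)  # stays None if no donor has been seen yet
    return filled


hot_deck = sequential_hot_deck(scores)
print(hot_deck)  # [2, 4, 4, 5, 3, 3, 4, 1, 1, 5]
```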

Handling missing data
• The first step is to try to minimize missing data during the questionnaire and administration design phase
  • Pretests/pilot tests are very useful in identifying problematic items
  • Remember that “do not know” may or may not be considered missing data, depending on the item
• The second step is to understand your missing data and consider your research goal
  • Take a close look at the missing data by running frequencies and examining individual cases
  • Are there patterns, e.g., interview breakoffs, certain questions like income?
  • What will you be doing with the data? Simple frequencies, crosstabulations, or multivariate models?

Handling missing data
• The third step is deciding how to handle it
• How much missingness is there?
  • If only 5%-10% is missing, probably not a big issue – remember, bias depends on difference between the two groups and size of the two groups
• What kind of analysis are you doing?
  • If you’re estimating frequencies and percentages, imputing can really change your results
    • E.g., you’re missing 10% for a satisfaction variable, and you replace missing with the median response; % choosing that category just increased by quite a bit
    • Same applies for crosstabulations
  • For regression, having large category changes is not necessarily a problem, because you’re interested in the regression coefficients, not the distribution
    • Imputation can be important, to preserve sample size
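
The satisfaction example can be worked through with made-up numbers: 100 respondents, 10 of them missing, on a hypothetical 5-point item. The second function spells out the remark that bias depends on both the size of the missing group and how different it is; note that the mean of the missing group is unknown in practice, so the values passed in are purely illustrative.

```python
from collections import Counter
from statistics import median_low

# Hypothetical 5-point satisfaction item: 90 answers plus 10 missing (None).
answers = [1] * 9 + [2] * 18 + [3] * 27 + [4] * 27 + [5] * 9 + [None] * 10

observed = [a for a in answers if a is not None]
med = median_low(observed)  # the median response category
imputed = [med if a is None else a for a in answers]

pct_before = 100 * Counter(observed)[med] / len(observed)  # 27 of 90
pct_after = 100 * Counter(imputed)[med] / len(imputed)     # 37 of 100
print(med, pct_before, pct_after)  # 3 30.0 37.0


def respondent_mean_bias(n_missing, n_total, mean_answered, mean_missing):
    """Bias of the mean of those who answered: (share missing) x (group difference)."""
    return (n_missing / n_total) * (mean_answered - mean_missing)


# With 10% missing, even a half-point gap between the groups moves the mean very little.
print(respondent_mean_bias(10, 100, 3.5, 3.0))
```

Median replacement raises the share in the middle category from 30% to 37%, a large shift for a frequency table even though only 10% of cases were imputed.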

Handling missing data
• Dependent versus independent variables
  • People seem to be more comfortable altering independent variables than dependent variables
• For multivariate models:
  • Nominal variables must be used as a series of dummy variables
    • So instead of using median replacement, create another category of missing, e.g., for race/ethnicity
  • Ordinal can be entered as a single variable or multiple dummy variables
    • Try using a series of dummy variables with a dummy variable for missing rather than median replacement
  • For ratio/interval you can use mean replacement, but remember to also include a dummy variable that flags the imputed cases
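
A sketch of both ideas: a separate "missing" dummy for a nominal variable, and mean replacement with a flag for an interval variable. The category labels, the choice of reference group, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical category labels; "white" is an arbitrary choice of reference group.
CATEGORIES = ["white", "black", "hispanic", "asian"]


def dummy_code(value, categories=CATEGORIES, reference="white"):
    """One 0/1 indicator per non-reference category, plus one for missing (None)."""
    row = {f"race_{c}": int(value == c) for c in categories if c != reference}
    row["race_missing"] = int(value is None)
    return row


def mean_replace_with_flag(values):
    """Return (value, imputed_flag) pairs; gaps get the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [(fill if v is None else v, int(v is None)) for v in values]


print(dummy_code("hispanic"))
# {'race_black': 0, 'race_hispanic': 1, 'race_asian': 0, 'race_missing': 0}
print(dummy_code(None))
# {'race_black': 0, 'race_hispanic': 0, 'race_asian': 0, 'race_missing': 1}
print(mean_replace_with_flag([10.0, None, 14.0]))
# [(10.0, 0), (12.0, 1), (14.0, 0)]
```

Both the recoded variable and its flag would then enter the model as predictors.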

Handling missing data
• Croninger argues that with listwise deletion and MAR, including variables associated with the missingness in the model reduces bias
• For scales it gets tricky
  • With listwise deletion, you toss out a lot of data
    • E.g., suppose with a 10-item scale, most people only filled out 9 of the items
  • So some sort of mean imputation may be necessary
• I’ve made some judgment calls in this situation
  • Try dropping the item with the most missing data, if missingness is concentrated in one or two variables
  • Or, create an algorithm, such as:
    • If respondent answered at least 3 out of 5 items (some say 7 out of 10), use mean imputation for the items with missing data
    • If respondent answered fewer than 3 items, delete from analysis dataset
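
The 3-out-of-5 rule can be sketched as below. One assumption is labeled loudly: the slide does not say which mean to impute, and this sketch uses the respondent's own mean across the items they answered (the item mean across respondents would be the other reading).

```python
def scale_score(items, min_answered=3):
    """Sum a scale, filling gaps with the respondent's own mean (an assumption).

    Returns None when fewer than `min_answered` items were answered, which
    signals that the case should be dropped from the analysis dataset.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered:
        return None
    person_mean = sum(answered) / len(answered)
    filled = [person_mean if x is None else x for x in items]
    return sum(filled)


# Hypothetical 5-item scale responses; None marks a skipped item.
print(scale_score([4, 5, None, 3, 4]))           # 20.0
print(scale_score([None, None, 2, None, 5]))     # None
print(scale_score([1, 2, 3, 4, 5, 1, 2, 3, None, None], min_answered=7))  # 26.25
```

The `min_answered` parameter covers both thresholds mentioned above (3 of 5, or 7 of 10).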

A big caveat
• Allison (2002) argues that
  • The methods just described using dummy variables produce biased estimates
  • So you should use listwise deletion, unless the data loss is too severe
  • In which case multiple imputation is a good approach
• He shows how to use PROC MI and PROC MIANALYZE in SAS to do this
  • PROC MI produces multiple datasets, default is 5
  • Five regressions are run, one for each dataset, and the parameter estimates are saved into a new dataset
  • PROC MIANALYZE is run on this new dataset, and produces a single set of estimates
• Allison, P.D. (2002). Missing Data. Thousand Oaks, CA: Sage Publications.
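
The final combining step can be sketched in Python using the standard rules for pooling multiply imputed results (Rubin's rules). This is an illustration of the idea, not SAS code, and it is an assumption that it mirrors what the combining procedure does internally. The five coefficient estimates and standard errors are made up.

```python
from math import sqrt
from statistics import mean, variance


def pool(estimates, std_errors):
    """Combine one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical results for one regression coefficient from 5 imputed datasets.
estimates = [0.50, 0.54, 0.47, 0.52, 0.49]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_b, pooled_se = pool(estimates, std_errors)
print(round(pooled_b, 3), round(pooled_se, 3))
```

The pooled standard error is larger than any single dataset's standard error because it adds the between-imputation variance, which is how multiple imputation reflects uncertainty about the imputed values.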

For next class…
• Readings:
  • +Kalton, G. & Flores-Cervantes, I. (2003). (http://www.jos.nu/Contents/jos_online.asp)
  • *Thomas, S. L., & Heck, R. H. (2001)
• 3 group presentations
