Data Cleaning Process. Patrick Bartels, MEA Frankfurt, December 6th.
A short reminder
• "Respondents don't lie!"
• only change values if you're really sure
• gather information about your country-specific database
  - from the references of survey agencies
  - from the information in remarks
  - by your own investigation
• write a syntax file or do-file, don't change the data directly
• save the original variable when recoding values, e.g. varname_original
• indicate changes with a flag variable, e.g. varname_flag
• save corrected data files under a new name, e.g. filename_corrected
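The last three points of the reminder can be sketched in a few lines of Stata. This is only an illustration under assumed names: the data, the variable yrbirth, the typo value 1492 and the file name example_corrected.dta are all made up and do not come from the SHARE database.

```stata
* Minimal sketch with made-up data (all names and values are hypothetical)
clear
input str6 sampid byte respid int yrbirth
"100123" 1 1945
"100123" 2 1492
end

* Keep the original value and prepare a flag before recoding
gen yrbirth_original = yrbirth
gen byte yrbirth_flag = 0

* Correct only a value you are really sure about (here: an assumed typo)
replace yrbirth = 1942 if sampid == "100123" & respid == 2 & yrbirth == 1492

* Flag every record whose value now differs from the original
replace yrbirth_flag = 1 if yrbirth != yrbirth_original

* Save under a new name so the original file stays untouched
save "example_corrected.dta", replace
```

Because the change lives in a do-file and the original value is kept next to the corrected one, every correction can be reviewed or undone later.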
Division of work

What we do
• consistency checks
  - between cv_r & modules
  - between wave_1 & wave_2
  - for demography
  - for children
• fixing of interchanged IDs
  - by automatic exchanges
Automatic corrections (respid)

sampid   respid   gender_w1   month / year of birth_w1   gender_w2   month / year of birth_w2
100123   01       female      Oct. 1945                  male        Apr. 1942
100123   02       male        Apr. 1942                  female      Oct. 1945
Automatic corrections (respid)

sampid   respid wave1   respid wave2   gender_w1   month / year of birth_w1   gender_w2   month / year of birth_w2
100123   01             01 -> 02       female      Oct. 1945                  male        Apr. 1942
100123   02             02 -> 01       male        Apr. 1942                  female      Oct. 1945

compute respid_original = respid.
compute respid_flag = 1.
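How such an automatic exchange might be detected can be sketched in Stata. This is not the program actually used for the corrections; it is a simplified illustration that assumes a merged file with one row per sampid/respid, exactly two respondents with respid 1 and 2 in an affected household, and year of birth only (variable names such as yrbirth_w1 and interchanged are hypothetical).

```stata
* Made-up merged data: wave-1 and wave-2 demographics side by side
clear
input str6 sampid byte respid str6 gender_w1 int yrbirth_w1 str6 gender_w2 int yrbirth_w2
"100123" 1 "female" 1945 "male"   1942
"100123" 2 "male"   1942 "female" 1945
"100456" 1 "female" 1950 "female" 1950
end

sort sampid respid

* A record is a candidate if its wave-2 data match the OTHER household
* member's wave-1 data (row 3 - _n within the household) but not its own
by sampid: gen byte candidate = (_N == 2                 ///
    & gender_w2  == gender_w1[3 - _n]                    ///
    & yrbirth_w2 == yrbirth_w1[3 - _n]                   ///
    & (gender_w2 != gender_w1 | yrbirth_w2 != yrbirth_w1))

* Exchange only if both members of the household are candidates
by sampid: egen byte interchanged = min(candidate)

* Rebuild the wave-2 records with exchanged respid, keeping original and flag
keep sampid respid gender_w2 yrbirth_w2 interchanged
gen respid_original = respid
gen byte respid_flag = interchanged
replace respid = 3 - respid if interchanged
sort sampid respid
list, sepby(sampid)
```

Requiring a match on both gender and year of birth, for both household members, keeps the rule conservative: anything less clear-cut is left for manual checking.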
Overview of merge between wave_1 and wave_2
[Figure: wave_1 gender against wave_2 gender, after auto-corrections]
Division of work

What we do ("we can fix a lot of cases")
• consistency checks
  - between cv_r & modules
  - between wave_1 & wave_2
  - for demography
  - for children
• fixing of interchanged IDs
  - by automatic exchanges
• correction of wave_1
  - by further information in wave_2
• feedback to the country teams on cases that cannot be fixed

What we want you to do ("you're much better at doing this")
• ID corrections initiated by survey agencies
• check booklets, tests, HH composition (> Omar)
• check financial modules (> Mario)
• check remarks (> Laura)
• check country-specific deviations (> Stephanie)
• encoding open questions
  - priority: education, ep005
• check data again, ask the survey agencies if necessary
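A consistency check between wave_1 and wave_2 for demography could look roughly like the following Stata sketch. The data and the variable names (gender_w1, yrbirth_w2, gender_mismatch and so on) are assumptions for illustration only; the real checks may use different variables and rules.

```stata
* Made-up merged data: one row per respondent, both waves side by side
clear
input str6 sampid byte respid str6 gender_w1 int yrbirth_w1 str6 gender_w2 int yrbirth_w2
"100123" 1 "female" 1945 "female" 1945
"100123" 2 "male"   1942 "male"   1924
"100456" 1 "female" 1950 "male"   1950
end

* Mark records whose demography differs between the waves
gen byte gender_mismatch  = gender_w1 != gender_w2
gen byte yrbirth_mismatch = yrbirth_w1 != yrbirth_w2

* List the cases that need a closer look
list sampid respid gender_w1 gender_w2 yrbirth_w1 yrbirth_w2 ///
    if gender_mismatch | yrbirth_mismatch, sepby(sampid)

* Number of inconsistent records
count if gender_mismatch | yrbirth_mismatch
```

The check only lists inconsistent records; it does not change any value, in line with the rule to correct data only when you are really sure.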
Do-File or Syntax
• name of author, date of program
• short description of what is done
• which database
• and which modules
• version of data, date of publishing
• conditions / order of do-files
• for STATA users: define global path
Example of STATA do-file (1)
Slide annotations: short description, which dataset, data version, author's name & date of program, order of do-files

/******************************************************************************
This program provides changes in cvid and respid variables in wave2 datasets
of the longitudinal sample, in order to get exact matching between wave1 and
wave2 respondents.
A variable called "mix_hh_flag" is added to the final dataset: it is equal to 1
in each household when the value of the respid variable was changed in one or
two interviews of that household.

data-version: 2007/Oct/26
Omar Paccagnella, 30 October 2007

VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND
BETWEEN WAVES, THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS:
"IT_DN_changes_w2.do", "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
******************************************************************************/
Example of STATA do-file (2)
Slide annotations: global drive, for which modules?, save original variables, flag variable, new version of data

* Global path to the data
global drive "S:/Share/wave2"

/*************************************************************
THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV
**************************************************************/

foreach module in ac as br cf ch co cs dn ep ex hc hh ho iv mh pf ph sp ws {
    * "clear" is needed so that the next module can be loaded inside the loop
    use "$drive/sharew2_`module'", clear

    * Flag variable and copies of the original variables
    gen mix_hh_flag = 0
    gen sampid_original = sampid
    gen respid_original = respid

    * Corrections
    replace respid = 1 if sampid == "1604200015300" & cvid == 2 & respid == 2
    replace mix_hh_flag = 1 if sampid == "1604200015300"
    [...]

    * New version of the data; "replace" allows the do-file to be re-run
    save "$drive/sharew2_`module'_corrected", replace
}
Example of SPSS syntax (1)
Slide annotations: short description, which dataset, data version, author's name, order of syntax, for which modules?

COMMENT This program provides changes in cvid and respid variables in wave2 datasets
  of the longitudinal sample, in order to get exact matching between wave1 and wave2
  respondents.
* A variable called "mix_hh_w2" is added to the final dataset (called
  sharew2_`var'_checked): it is equal to 1 in each household when the value of the
  respid variable was changed in one or two interviews of that household.
* date of data: 2007/Oct/26.
* Omar Paccagnella, October 2007.
* VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND BETWEEN WAVES,
  THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS: "IT_DN_changes_w2.do",
  "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !.
* THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV.
Example of SPSS syntax (2)
Slide annotations: save original variables, flag variable

GET FILE='S:\SHARE\wave2\dn_module.sav'.
EXE.

* Flag variable and copies of the original variables.
compute mix_hh_flag = 0.
compute cvid_original = cvid.
compute respid_original = respid.
compute sampid_original = sampid.

* Conditions use cvid_original so that the second correction still applies
  after cvid itself has been changed.
if (sampid = 1604200015300 & cvid_original = 2) cvid = 1.
if (sampid = 1604200015300 & cvid_original = 2) respid = 2.
if (sampid = 1604200015300) mix_hh_flag = 1.
EXE.
[...]
SAVE OUTFILE='S:\SHARE\wave2\dn_module_corrected.sav'.
EXE.
Any problems with programming do-files or syntax? Please give us a call.