1 / 27

Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen

Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation. Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen. Research Data Centre of the BA in the IAB in Nuremberg (FDZ). Provides access to micro data of the

casper
Télécharger la présentation

Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen

  2. Research Data Centre of the BA in the IAB in Nuremberg (FDZ) • Provides access to micro data of the • Institute for Employment Research (IAB) • Federal Employment Agency (BA) • Data • surveys • administrational (databases > 200 GB, samples ~ 2 GB) • Data protection and data access • anonymization (scientific use files) • on-site use • remote data access

  3. Remote data access • Researchers send do-files – we return log-files • 90% use Stata Disclosure limitation review • Data have detailed information on persons and firms • Neither firms nor persons may be identifiable by the contents of the log-files • One heuristic: delete all information that is based on few observations

  4. Example . table sex edu ---------------------------------- | Education Sex | low medium high ----------+----------------------- male | 1 6 4 female | 8 53 26 ----------------------------------

  5. Example . table sex edu ---------------------------------- | Education Sex | low medium high ----------+----------------------- male | / 6 4 female | 8 53 26 ----------------------------------

  6. Example . table sex edu, c(freq mean wage) ------------------------------------- | Education Sex | low medium high ----------+-------------------------- male | 1 6 4 | 137.973 137.204 117.908 | female | 8 53 26 | 125.450 140.083 134.114 -------------------------------------

  7. Example . table sex edu, c(freq mean wage) ------------------------------------- | Education Sex | low medium high ----------+-------------------------- male | / 6 4 | / 137.204 117.908 | female | 8 53 26 | 125.450 140.083 134.114 -------------------------------------

  8. Disclosure limitation review • Problems • a lot of work if done by hand! • Pearl scripts don‘t get all • number of firms not visible • weights • Solutions • use other software ;-) • „fdz-ados“ • -tablefdz-

  9. tablefdz • sxntax very similar to -table- program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] ,FIRMID(string) [Contents(string) Format(string)] • error messages • preserve • drop all needless variables and observations

  10. by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `cell'

  11. tablefdz • calculation done for inner table, row, col and total • results for inner table, row, col and total are appended • tabdisp • restore

  12. Results of tablefdz . tablefdz sex edu, firmid(firmid) ------------------------------------------ | Education Sex | low medium high Total ----------+------------------------------- male | -1 -1 4 13 female | 8 53 26 87 | Total | 11 59 30 100 ------------------------------------------

  13. . tablefdz sex edu, firmid(firmid) c(mean wage sum wage) -------------------------------------------------- | Education Sex | low medium high Total ----------+--------------------------------------- male | -1 -1 4 13 | -1 -1 117.9078 131.4443 | -1 -1 471.6311 1708.776 | female | 8 53 26 87 | 125.4497 140.0829 134.1142 136.9536 | 1003.598 7424.396 3486.968 11914.96 | Total | 11 59 30 100 | 128.8651 139.7902 131.9533 136.2374 | 1417.516 8247.622 3958.599 13623.74 --------------------------------------------------

  14. . tablefdz sex edu [pw= weight], c(mean wage sum wage) firmid(firmid) -------------------------------------------------- | Education Sex | low medium high Total ----------+--------------------------------------- male | -1 -1 115.1209 431.6714 | -1 -1 115.558 131.4435 | -1 -1 13303.14 56740.38 | female | 349.4451 2696.87 1252.799 4299.114 | 133.8411 133.9306 134.1103 133.9757 | 46770.11 361193.5 168013.3 575976.9 | Total | 497.2254 2865.64 1367.92 4730.786 | 133.3649 134.3813 132.549 133.7446 | 66312.39 385088.5 181316.4 632717.3 --------------------------------------------------

  15. tablefdz • deleted values can somtimes be recalculated very easily • more deletion and • rounding by Pearl still necessary but now it is feasible • tablefdz takes about twice the time as table Questions • ideas to improve? (computation time) • similar programs around?

  16. Topic 2: Memory usage • 16 gigabyte memory in Unix server, 15 gigabyte are free • How much memory is the optimum we should allocate? • 13? 15? • dependant on matsize? • rule-of-thumb? • Automation for running Stata-jobs?

  17. * Program to produce tables with automatically deleted values that are based on few observations * by Stefan Seth and Peter Jacobebbinghaus *! version 1.0.1 30mar2007 program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] , FIRMID(string) [Contents(string) Format(string)] tempvar anzbetr sv1 sv2 sv3 sv4 sv5 dum svar1 svar2 svar3 svar4 svar5 eins summe tempvar sum_gew1 sum_gew2 sum_gew3 sum_gew4 sum_gew5 // used when calculating weighted averages tempfile dat cross row col scol local minoffirms = 3 // min. number of firms per statistic local anzvar = wordcount("`varlist'") local fmt "format(`format')" preserve if _by() { marksample touse // take care of by-group } quietly {

  18. if "`contents'" == "" { local anz_stats = 0 } if "`contents'" != "" { if mod( wordcount( "`contents'" ), 2 ) != 0 { di in red "Invalid statistics specification; please use the Contents option as follows:" di in red "contents( statistic variable )" di in red "You may specify up to 5 statistic/variable-pairs, e.g.: " di in red "tabfdz var1 var2, contents( mean var3 sum var4 ) exit } tokenize "`contents'" local stats "`1' `3' `5' `7' `9'" local svar "`2' `4' `6' `8' `10'" tokenize "`stats'" local anz_stats = wordcount( "`stats'" ) // list of statistics forvalues i = 1/`anz_stats' { local stat`i' "``i''" } tokenize "`svar'" // list of variables forvalues i = 1/`anz_stats' { local svar`i' "``i''" } }

  19. *------------------------------------------------------------ error messages if "`weight'" != "" { forv i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean" ) == 0 { di in red "weights may only be used with 'sum' and 'mean'" exit } } if "`weight'" != "pweight" { di in red "only pweights allowed" exit } } forvalues i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean", "sd", "median", "min", "max" ) == 0 { di in red "Statistics sum mean sd median min max allowed only" exit } }

  20. *------------------------------------------------------------ reduce size of data if _by() { keep if `touse' // restrict to by-group } capture keep `if' `in' // reduction of observations keep `firmid' `varlist' `svar' `exp' // reduction of variables gen byte `eins' = 1 tokenize `varlist' // dropping missings drop if missing( `1' ) if `anzvar' == 2 { drop if missing( `2' ) }

  21. if "`weight'" != "" { replace `eins' = `exp' // multiply statistics variables with weights... forv i = 1 / `anz_stats' { local jmax = `i' - 1 local seen = 0 // ... but only once forv j = 1 / `jmax' { if "`svar`i''" == "`svar`j''" { local seen = 1 } } if `seen' == 0 { replace `svar`i'' = `svar`i'' * `exp' } } } save `dat'

  22. *------------------------------------------------------------ computation of statistics local vars "`varlist'" // inner part of the table sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) // mark obs with new firm by `vars': egen `anzbetr' = sum(`dum') // number of firms per cell by `vars': egen `summe' = sum( `eins' ) // (weighted) number of persons per cell forv i = 1 / `anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { // mean with weights per hand by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i''/`summe' // mean = sum(y_i * h_i)/sum(h_i) } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) // other statistics per person } } by `vars': keep if _n == 1 // keep one line per cell forv i = 1 / `anz_stats' { replace `sv`i'' = -1 if `anzbetr' <= `minoffirms' // erase statistics based on few firms } replace `summe' = -1 if `anzbetr' <= `minoffirms' // erase number of cases based on few firms save `cross' // save results for the inner part

  23. if `anzvar' == 2 { use `dat', clear // the same for rows tokenize `varlist' local vars "`1'" replace `2' = . sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `row'

  24. use `dat', clear // the same for columns tokenize `varlist' replace `1' = . local vars "`2'" sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum( `dum' ) by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -3 if `anzbetr' <= `minoffirms' } replace `summe' = -3 if `anzbetr' <= `minoffirms' save `col' }

  25. use `dat', clear // the same for the total tokenize `varlist' replace `1' = . if `anzvar' == 2 { replace `2' = . } sort `firmid' by `firmid': gen `dum' = ( _n == 1 ) egen `anzbetr' = sum(`dum') egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { egen `sum_gew`i'' = sum( `svar`i'' ) gen `sv`i'' = `sum_gew`i'' / `summe' } else { egen `sv`i'' = `stat`i''( `svar`i'' ) } } keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -4 if `anzbetr' <= `minoffirms' } replace `summe' = -4 if `anzbetr' <= `minoffirms' save `scol'

  26. use `cross', clear // appending the results append using `scol' if `anzvar' == 2 { append using `col' append using `row' } *------------------------------------------------------------ format output forvalues i = 1/`anz_stats' { // headers local svlist = "`svlist'" + " `" + "sv`i'" + "'" } if `anzvar' == 1 & "`contents'" != "" { label variable `sv1' "`stat1'(`svar1')" cap label variable `sv2' "`stat2'(`svar2')" cap label variable `sv3' "`stat3'(`svar3')" cap label variable `sv4' "`stat4'(`svar4')" cap label variable `sv5' "`stat5'(`svar5')" } label variable `summe' "Freq." local linesize = c(linesize) set linesize 255 format `summe' %9.0g } // end of quietly

  27. *------------------------------------------------------------ output di "/*§BEGIN_tabfdz*/" tabdisp `varlist', cellvar(`summe' `svlist' ) totals `fmt' csepwidth(2) missing di "/*§END_tabfdz*/" restore set linesize `linesize' end

More Related