270 likes | 393 Vues
Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation. Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen. Research Data Centre of the BA in the IAB in Nuremberg (FDZ). Provides access to micro data of the
E N D
Two Issues on Remote Data Access - Automation of Disclosure Limitation Review - Memory Allocation Peter Jacobebbinghaus 5th German Stata User's Group Meeting April 2nd, 2007 at RWI-Essen
Research Data Centre of the BA in the IAB in Nuremberg (FDZ) • Provides access to micro data of the • Institute for Employment Research (IAB) • Federal Employment Agency (BA) • Data • surveys • administrational (databases > 200 GB, samples ~ 2 GB) • Data protection and data access • anonymization (scientific use files) • on-site use • remote data access
Remote data access • Researchers send do-files – we return log-files • 90% use Stata Disclosure limitation review • Data have detailed information on persons and firms • Neither firms nor persons may be identifiable by the contents of the log-files • One heuristic: delete all information that is based on few observations
Example . table sex edu ---------------------------------- | Education Sex | low medium high ----------+----------------------- male | 1 6 4 female | 8 53 26 ----------------------------------
Example . table sex edu ---------------------------------- | Education Sex | low medium high ----------+----------------------- male | / 6 4 female | 8 53 26 ----------------------------------
Example . table sex edu, c(freq mean wage) ------------------------------------- | Education Sex | low medium high ----------+-------------------------- male | 1 6 4 | 137.973 137.204 117.908 | female | 8 53 26 | 125.450 140.083 134.114 -------------------------------------
Example . table sex edu, c(freq mean wage) ------------------------------------- | Education Sex | low medium high ----------+-------------------------- male | / 6 4 | / 137.204 117.908 | female | 8 53 26 | 125.450 140.083 134.114 -------------------------------------
Disclosure limitation review • Problems • a lot of work if done by hand! • Pearl scripts don‘t get all • number of firms not visible • weights • Solutions • use other software ;-) • „fdz-ados“ • -tablefdz-
tablefdz • sxntax very similar to -table- program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] ,FIRMID(string) [Contents(string) Format(string)] • error messages • preserve • drop all needless variables and observations
by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `cell'
tablefdz • calculation done for inner table, row, col and total • results for inner table, row, col and total are appended • tabdisp • restore
Results of tablefdz . tablefdz sex edu, firmid(firmid) ------------------------------------------ | Education Sex | low medium high Total ----------+------------------------------- male | -1 -1 4 13 female | 8 53 26 87 | Total | 11 59 30 100 ------------------------------------------
. tablefdz sex edu, firmid(firmid) c(mean wage sum wage) -------------------------------------------------- | Education Sex | low medium high Total ----------+--------------------------------------- male | -1 -1 4 13 | -1 -1 117.9078 131.4443 | -1 -1 471.6311 1708.776 | female | 8 53 26 87 | 125.4497 140.0829 134.1142 136.9536 | 1003.598 7424.396 3486.968 11914.96 | Total | 11 59 30 100 | 128.8651 139.7902 131.9533 136.2374 | 1417.516 8247.622 3958.599 13623.74 --------------------------------------------------
. tablefdz sex edu [pw= weight], c(mean wage sum wage) firmid(firmid) -------------------------------------------------- | Education Sex | low medium high Total ----------+--------------------------------------- male | -1 -1 115.1209 431.6714 | -1 -1 115.558 131.4435 | -1 -1 13303.14 56740.38 | female | 349.4451 2696.87 1252.799 4299.114 | 133.8411 133.9306 134.1103 133.9757 | 46770.11 361193.5 168013.3 575976.9 | Total | 497.2254 2865.64 1367.92 4730.786 | 133.3649 134.3813 132.549 133.7446 | 66312.39 385088.5 181316.4 632717.3 --------------------------------------------------
tablefdz • deleted values can somtimes be recalculated very easily • more deletion and • rounding by Pearl still necessary but now it is feasible • tablefdz takes about twice the time as table Questions • ideas to improve? (computation time) • similar programs around?
Topic 2: Memory usage • 16 gigabyte memory in Unix server, 15 gigabyte are free • How much memory is the optimum we should allocate? • 13? 15? • dependant on matsize? • rule-of-thumb? • Automation for running Stata-jobs?
* Program to produce tables with automatically deleted values that are based on few observations * by Stefan Seth and Peter Jacobebbinghaus *! version 1.0.1 30mar2007 program define tablefdz, byable(recall) version 9 syntax varlist(min=1 max=2) [if] [in] [pweight /] , FIRMID(string) [Contents(string) Format(string)] tempvar anzbetr sv1 sv2 sv3 sv4 sv5 dum svar1 svar2 svar3 svar4 svar5 eins summe tempvar sum_gew1 sum_gew2 sum_gew3 sum_gew4 sum_gew5 // used when calculating weighted averages tempfile dat cross row col scol local minoffirms = 3 // min. number of firms per statistic local anzvar = wordcount("`varlist'") local fmt "format(`format')" preserve if _by() { marksample touse // take care of by-group } quietly {
if "`contents'" == "" { local anz_stats = 0 } if "`contents'" != "" { if mod( wordcount( "`contents'" ), 2 ) != 0 { di in red "Invalid statistics specification; please use the Contents option as follows:" di in red "contents( statistic variable )" di in red "You may specify up to 5 statistic/variable-pairs, e.g.: " di in red "tabfdz var1 var2, contents( mean var3 sum var4 ) exit } tokenize "`contents'" local stats "`1' `3' `5' `7' `9'" local svar "`2' `4' `6' `8' `10'" tokenize "`stats'" local anz_stats = wordcount( "`stats'" ) // list of statistics forvalues i = 1/`anz_stats' { local stat`i' "``i''" } tokenize "`svar'" // list of variables forvalues i = 1/`anz_stats' { local svar`i' "``i''" } }
*------------------------------------------------------------ error messages if "`weight'" != "" { forv i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean" ) == 0 { di in red "weights may only be used with 'sum' and 'mean'" exit } } if "`weight'" != "pweight" { di in red "only pweights allowed" exit } } forvalues i = 1/`anz_stats' { if inlist( "`stat`i''", "sum", "mean", "sd", "median", "min", "max" ) == 0 { di in red "Statistics sum mean sd median min max allowed only" exit } }
*------------------------------------------------------------ reduce size of data if _by() { keep if `touse' // restrict to by-group } capture keep `if' `in' // reduction of observations keep `firmid' `varlist' `svar' `exp' // reduction of variables gen byte `eins' = 1 tokenize `varlist' // dropping missings drop if missing( `1' ) if `anzvar' == 2 { drop if missing( `2' ) }
if "`weight'" != "" { replace `eins' = `exp' // multiply statistics variables with weights... forv i = 1 / `anz_stats' { local jmax = `i' - 1 local seen = 0 // ... but only once forv j = 1 / `jmax' { if "`svar`i''" == "`svar`j''" { local seen = 1 } } if `seen' == 0 { replace `svar`i'' = `svar`i'' * `exp' } } } save `dat'
*------------------------------------------------------------ computation of statistics local vars "`varlist'" // inner part of the table sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) // mark obs with new firm by `vars': egen `anzbetr' = sum(`dum') // number of firms per cell by `vars': egen `summe' = sum( `eins' ) // (weighted) number of persons per cell forv i = 1 / `anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { // mean with weights per hand by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i''/`summe' // mean = sum(y_i * h_i)/sum(h_i) } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) // other statistics per person } } by `vars': keep if _n == 1 // keep one line per cell forv i = 1 / `anz_stats' { replace `sv`i'' = -1 if `anzbetr' <= `minoffirms' // erase statistics based on few firms } replace `summe' = -1 if `anzbetr' <= `minoffirms' // erase number of cases based on few firms save `cross' // save results for the inner part
if `anzvar' == 2 { use `dat', clear // the same for rows tokenize `varlist' local vars "`1'" replace `2' = . sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum(`dum') by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -2 if `anzbetr' <= `minoffirms' } replace `summe' = -2 if `anzbetr' <= `minoffirms' save `row'
use `dat', clear // the same for columns tokenize `varlist' replace `1' = . local vars "`2'" sort `vars' `firmid' by `vars' `firmid': gen `dum' = ( _n == 1 ) by `vars': egen `anzbetr' = sum( `dum' ) by `vars': egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { by `vars': egen `sum_gew`i'' = sum( `svar`i'' ) by `vars': gen `sv`i'' = `sum_gew`i'' / `summe' } else { by `vars': egen `sv`i'' = `stat`i''( `svar`i'' ) } } by `vars': keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -3 if `anzbetr' <= `minoffirms' } replace `summe' = -3 if `anzbetr' <= `minoffirms' save `col' }
use `dat', clear // the same for the total tokenize `varlist' replace `1' = . if `anzvar' == 2 { replace `2' = . } sort `firmid' by `firmid': gen `dum' = ( _n == 1 ) egen `anzbetr' = sum(`dum') egen `summe' = sum( `eins' ) forv i = 1/`anz_stats' { if "`weight'" != "" & "`stat`i''" == "mean" { egen `sum_gew`i'' = sum( `svar`i'' ) gen `sv`i'' = `sum_gew`i'' / `summe' } else { egen `sv`i'' = `stat`i''( `svar`i'' ) } } keep if _n == 1 forvalues i = 1/`anz_stats' { replace `sv`i'' = -4 if `anzbetr' <= `minoffirms' } replace `summe' = -4 if `anzbetr' <= `minoffirms' save `scol'
use `cross', clear // appending the results append using `scol' if `anzvar' == 2 { append using `col' append using `row' } *------------------------------------------------------------ format output forvalues i = 1/`anz_stats' { // headers local svlist = "`svlist'" + " `" + "sv`i'" + "'" } if `anzvar' == 1 & "`contents'" != "" { label variable `sv1' "`stat1'(`svar1')" cap label variable `sv2' "`stat2'(`svar2')" cap label variable `sv3' "`stat3'(`svar3')" cap label variable `sv4' "`stat4'(`svar4')" cap label variable `sv5' "`stat5'(`svar5')" } label variable `summe' "Freq." local linesize = c(linesize) set linesize 255 format `summe' %9.0g } // end of quietly
*------------------------------------------------------------ output di "/*§BEGIN_tabfdz*/" tabdisp `varlist', cellvar(`summe' `svlist' ) totals `fmt' csepwidth(2) missing di "/*§END_tabfdz*/" restore set linesize `linesize' end