Disclosure problems with design information for surveys Gillian Raab Kathy Buckner/Iona Waterston Napier University Susan Purdon National Centre for Social Research Res Meth Workshop Dec 04
Background • PEAS project • Uses real survey data – not textbook examples • Illustrates how they can be analysed using • Different methodologies • Their implementations in software packages • Links the analyses with sections on the theory relevant to the design and analysis of surveys
Data availability • ESRC stipulation • Data used in the exemplars must be available via the ESRC data archive • But if this were the ONLY way to get the data, the site would be hard to use • So the exemplars use extracts, of just a few variables, available on the web
Need to make survey design variables available • Cluster (primary sampling unit) identifiers • If the sample is clustered – here it was • Indicators of the strata used • Here stratification was by local authority • Weights • Cluster and stratum identifiers may not be made available via the data archive or may be in restricted files
Clusters are about 10 respondents each. Strata are local authorities. In other cases, strata might be (e.g.) large firms in a business survey.
Disclosure can happen if • We know the location of individual clusters • We can identify an individual within a cluster • A stratum is small and a large proportion of it is sampled • We have some means of linking the data on the web back to the full data source
Steps to prevent disclosure • Change cluster identifiers so they no longer reveal location • Change IDs so they cannot link back • Add noise to the weights so they do not identify individuals • Make the details of how the strata are defined unavailable (not in this exemplar) • Maybe more things??
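The first three steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the PEAS exemplars: the field names (`id`, `cluster`, `stratum`, `weight`), the function name `mask_design` and the 2% noise level are all assumptions made for the example.

```python
import random

def mask_design(records, noise_sd=0.02, seed=0):
    """Return masked copies of the records (hypothetical field names)."""
    rng = random.Random(seed)

    # Step 1: replace cluster codes with arbitrary labels that carry no location.
    clusters = sorted({r["cluster"] for r in records})
    labels = list(range(1, len(clusters) + 1))
    rng.shuffle(labels)
    cluster_map = dict(zip(clusters, labels))

    # Step 2: shuffle the row order, then issue new sequential IDs, so that
    # neither the ID nor the row position links back to the source file.
    shuffled = list(records)
    rng.shuffle(shuffled)

    masked = []
    for new_id, r in enumerate(shuffled, start=1):
        # Step 3: multiplicative noise on the weight, clamped to +/- 50%
        # so that weights stay positive.
        factor = min(max(1.0 + rng.gauss(0.0, noise_sd), 0.5), 1.5)
        new = dict(r)  # copy, so the input records are left untouched
        new["id"] = new_id
        new["cluster"] = cluster_map[r["cluster"]]
        new["weight"] = r["weight"] * factor
        masked.append(new)
    return masked
```

Records from the same original cluster still share a label after masking, so design-based standard errors can still be computed. The mapping from old to new labels is discarded, which is what breaks the link to location.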
What are the principles? • Do we need to worry about • Population-unique individuals • Sample-unique individuals • Logically we would expect the former • But the latter may also be important • If you know you are in the survey? • If you know that someone else was in the survey? • Principles for individuals and organisations may have to be different
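Sample uniqueness, at least, can be checked directly on the released extract. The sketch below flags respondents whose combination of key variables occurs only once in the sample. The function name and the key variables are hypothetical. Population uniqueness cannot be checked this way, because it needs information about people who were not sampled.

```python
from collections import Counter

def sample_uniques(records, key_vars):
    """Return the records whose combination of key variables is unique in the sample."""
    # Count how often each combination of key values occurs.
    combos = Counter(tuple(r[k] for k in key_vars) for r in records)
    # Keep the records whose combination occurs exactly once.
    return [r for r in records if combos[tuple(r[k] for k in key_vars)] == 1]
```

For example, `sample_uniques(extract, ["stratum", "age_group", "sex"])` would list the respondents who are the only ones in the extract with their particular stratum, age group and sex.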
Another way round this • Surveys come with sets of replicate weights • Standard errors for surveys are provided using jackknife or bootstrap methods • The user does not need to have access to the individual design variables • This approach has been pioneered by Statistics Canada • But a sharp investigator could still work out clusters
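To make the replicate-weight idea concrete, here is a minimal sketch using the stratified delete-one-cluster jackknife (often called JKn). That scheme is chosen purely for illustration; it is not a description of what Statistics Canada or any particular survey releases. All function names are hypothetical. The data producer runs `jackknife_replicates` with the confidential design variables; the user receives only the replicate weights and multipliers and calls `replicate_se`. The last function illustrates the final bullet: with plain jackknife weights, rows that drop to zero in the same replicate must share a cluster.

```python
from collections import defaultdict

def jackknife_replicates(strata, clusters, weights):
    """Producer side: build JKn replicate weights and their variance multipliers."""
    psus = defaultdict(set)
    for h, c in zip(strata, clusters):
        psus[h].add(c)
    replicates, multipliers = [], []
    for h in sorted(psus):
        n_h = len(psus[h])
        if n_h < 2:
            raise ValueError("each stratum needs at least two clusters")
        for dropped in sorted(psus[h]):
            rep = []
            for s, c, w in zip(strata, clusters, weights):
                if s != h:
                    rep.append(w)                      # other strata: unchanged
                elif c == dropped:
                    rep.append(0.0)                    # dropped cluster: zeroed
                else:
                    rep.append(w * n_h / (n_h - 1))    # rest of stratum: scaled up
            replicates.append(rep)
            multipliers.append((n_h - 1) / n_h)
    return replicates, multipliers

def weighted_mean(y, w):
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)

def replicate_se(y, weights, replicates, multipliers):
    """User side: standard error of a weighted mean, with no design variables."""
    full = weighted_mean(y, weights)
    var = sum(m * (weighted_mean(y, rep) - full) ** 2
              for rep, m in zip(replicates, multipliers))
    return var ** 0.5

def zero_pattern_groups(replicates):
    """Group row indices zeroed in the same replicate (assumes positive base weights)."""
    groups = defaultdict(list)
    for k, rep in enumerate(replicates):
        for i, w in enumerate(rep):
            if w == 0.0:
                groups[k].append(i)
    return list(groups.values())
```

In the simplest case of one stratum, equal weights and one respondent per cluster, `replicate_se` reduces to the familiar s/√n for the sample mean.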
Relevance to researchers • We have been able to get the data we wanted for our exemplars so far • But there are some surveys at the ESRC data archive where the cluster identifiers are • Not available at all • Available, but obscure • A consistent policy (perhaps with restrictions) would be helpful