The importance of data management. Paul Lambert, 31st January 2012. Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node, www.dames.org.uk.
The importance of data management
Paul Lambert, 31st January 2012
Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node, www.dames.org.uk
DAMES, 31/JAN/2012, T1
Today’s session (2V1/2V3)
‘Data Management through e-Social Science’
• DAMES – www.dames.org.uk
• ESRC funded research Node
  • Funded 2008-11, with ongoing work into 2012 with the NeISS (www.neiss.org.uk) and ‘eStat’ (www.bristol.ac.uk/cmm/research/estat/) projects
• Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Computer science research on secure data models; metadata and linking data; workflows
• Programme of case studies and provisions
‘Data management’ means…
• ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
• Usually performed by social scientists themselves
• Most overt in quantitative survey data analysis
  • ‘variable constructions’, ‘data manipulations’
  • navigating abundance of data – thousands of variables
• Usually a substantial component of the work process
• Here we differentiate from archiving / controlling data itself
Some components…
• Manipulating data
  • Recoding categories / ‘operationalising’ variables
• Linking data
  • Linking related data (e.g. longitudinal studies)
  • Combining / enhancing data (e.g. linking micro- and macro-data)
• Secure access to data
  • Linking data with different levels of access permission
  • Detailed access to micro-data cf. access restrictions
• Harmonisation standards
  • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
  • Recommendations on particular ‘variable constructions’
• Cleaning data
  • ‘missing values’; implausible responses; extreme values
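Three of the components above (cleaning, recoding and micro-macro linking) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the SPSS/Stata syntax the DAMES workshops use; the variable names, the missing-value codes (-9, -8), the plausible age range and the three-level education recode are all invented for the example.

```python
# Hypothetical survey micro-data: one dict per respondent (invented values).
respondents = [
    {"id": 1, "region": "N", "educ": 3, "age": 34},
    {"id": 2, "region": "S", "educ": 1, "age": -9},   # -9 assumed to be a missing-value code
    {"id": 3, "region": "N", "educ": 5, "age": 212},  # implausible response
    {"id": 4, "region": "S", "educ": 2, "age": 58},
]

# Hypothetical macro-data keyed by region, to be linked onto each respondent.
region_unemployment = {"N": 7.5, "S": 4.0}

# Assumed recoding scheme: five detailed education codes collapsed into three levels.
EDUC_RECODE = {1: "low", 2: "low", 3: "medium", 4: "high", 5: "high"}

# Assumed missing-value codes for this illustration.
MISSING_CODES = {-9, -8}


def clean_age(age):
    """Return None for missing-value codes and for ages outside 0-120."""
    if age in MISSING_CODES or not 0 <= age <= 120:
        return None
    return age


def manage(record):
    """Apply cleaning, recoding and micro-macro linking to one record."""
    out = dict(record)  # copy, so the source record is left untouched
    out["age"] = clean_age(record["age"])                         # cleaning
    out["educ3"] = EDUC_RECODE.get(record["educ"])                # recoding
    out["unemp_rate"] = region_unemployment.get(record["region"]) # linking
    return out


analysis_data = [manage(r) for r in respondents]
```

Keeping the recode table and the missing-value codes as named objects, rather than burying them in the logic, is what lets the same decisions be cited, reused and checked later.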
‘The significance of data management for social survey research’
• The data manipulations described above are a major component of the social survey research workload
• Pre-release manipulations performed by distributors / archivists
  • Coding measures into standard categories; dealing with missing records
• Post-release manipulations performed by researchers
  • Re-coding measures into simple categories
• All serious researchers perform extended post-release management (and have the scars to show for it)
• We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
• So the ‘significance’ of DM is about how much better research might be if we did things more effectively…
..being more effective probably involves..
• Knowing about, using and citing previous standard measures/strategies
• Effective documentation/dissemination of information on the approach used
• Being proactive (not just relying on the most convenient measure to hand)
• Trying a few alternatives – sensitivity analysis
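The last point, trying a few alternatives, can be illustrated with a toy sensitivity analysis: the same outcome is summarised under two alternative codings of occupation, and the results are compared. Everything here is hypothetical (the records, the occupation labels and both classification schemes), and the sketch is in plain Python rather than any package used by the project.

```python
# Invented records: (occupation label, poor_health indicator 0/1).
records = [
    ("manager", 0), ("clerk", 0), ("clerk", 1), ("labourer", 1),
    ("manager", 0), ("labourer", 1), ("technician", 0), ("technician", 1),
]

# Two hypothetical, equally plausible ways of grouping the same occupations.
MEASURES = {
    "broad": {"manager": "advantaged", "technician": "advantaged",
              "clerk": "other", "labourer": "other"},
    "narrow": {"manager": "advantaged", "technician": "other",
               "clerk": "other", "labourer": "other"},
}


def poor_health_gap(scheme):
    """Poor-health rate in the 'other' group minus the rate in 'advantaged'."""
    rates = {}
    for group in ("advantaged", "other"):
        values = [poor for occ, poor in records if scheme[occ] == group]
        rates[group] = sum(values) / len(values)
    return rates["other"] - rates["advantaged"]


# One result per alternative measure, ready to be reported side by side.
gaps = {name: poor_health_gap(scheme) for name, scheme in MEASURES.items()}
```

If the gap differs noticeably between schemes, the substantive conclusion depends on the operationalisation, which is worth reporting alongside the headline result.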
‘Documentation’ (and its dissemination) is probably the key…
• By documentation we mean the ‘paper trail’
  • (such as data & syntax files during secondary survey research)
• For scientists, this is the log book / journal / laboratory notebook
• For social sciences, there are few agreed standards
Effective documentation is possible, but requires some effort (e.g. Long, 2009)
[Image of Alexander Graham Bell’s 1876 notebook, taken from: http://sandacom.wordpress.com/2010/03/11/the-face-rings-a-bell/]
..good levels of documentation are not engrained in the social sciences!
• “…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” (p142 of Scott, J. (2005) ‘Some principal concerns in the shaping of sociology’, in Halsey, A.H. and Runciman, W. (eds) British Sociology: Seen from without and within. London: British Academy)
...Yet, ‘documentation for replication’ is a reasonable expectation for a scientific model of research (e.g. Steuer, Dale, Freese)…
Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods & Research, 36(2), 153-71.
A bit of focus…
• Most of the DAMES applications aim to facilitate one of two data management activities, their documentation, and the dissemination of that documentation:
• Variable constructions
  • Coding and re-coding values
• Linking datasets
  • Internal and external linkages
‘Documentation for replication’ supports replication of..
• Your own analysis
  • (in response to comments, revisions, requests for access)
• Others’ analysis
  • To build upon – cumulative science
  • To critique / cross-examine
• In secondary survey research
  • Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage)
  • New analysis feasible – variable operationalisations; new statistical methods
• Most documentation requirements are achieved by effective use of software (‘syntax’ programming)
  • See our training workshops, www.dames.org.uk/workshops
Keep clear records of your DM activities!
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax examples: www.dames.org.uk/workshops
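The idea of an annotated, replayable paper trail can be sketched as follows. The talk has SPSS/Stata syntax files in mind; this is only an analogous illustration in plain Python, and the step names, the -9 missing code, the income threshold and the `documented_step` helper are all hypothetical.

```python
import json

# The paper trail: one entry per data management step, in execution order.
log = []


def documented_step(name, note):
    """Decorator that records each step's name, rationale and case counts."""
    def wrap(func):
        def run(rows):
            result = func(rows)
            log.append({"step": name, "note": note,
                        "rows_in": len(rows), "rows_out": len(result)})
            return result
        return run
    return wrap


@documented_step("drop_missing_income", "Drop cases where income is coded -9 (missing)")
def drop_missing_income(rows):
    return [r for r in rows if r["income"] != -9]


@documented_step("band_income", "Band income: below 20000 = 'low', otherwise 'high'")
def band_income(rows):
    return [{**r, "income_band": "low" if r["income"] < 20000 else "high"}
            for r in rows]


# Invented raw data; re-running the script reproduces both the data and the trail.
raw = [{"id": 1, "income": 15000}, {"id": 2, "income": -9}, {"id": 3, "income": 42000}]
data = band_income(drop_missing_income(raw))

# Serialised record of what was done, suitable for archiving with the output.
paper_trail = json.dumps(log, indent=2)
```

Because every transformation lives in a script and writes its own log entry, the analysis is reproducible for the author and replicable by anyone given the same raw file.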
We’ve written a guide for researchers...
• ‘Software Session 1: Documentation and workflows with popular software packages’ (www.dames.org.uk/workshops/stir10/docs_workflows_2010.html)
• Dozens of sample command files in SPSS, Stata and R from DAMES Node workshops at www.dames.org.uk
For data distributors, the provision of systematic metadata is also beneficial
Example of DDI format metadata (see also talk 5)
NESSTAR
What more is needed for good data management?
• Good standards in the operationalisation of variables
  • See yesterday’s workshop sessions (www.dames.org.uk)
  • Most options have already been studied!
  • Using GEODE/GEMDE/GEEDE to facilitate sensitivity analysis and comparisons of alternative plausible measures
• Collect documentation/metadata on specialist records
• Promote more effective measurement options, e.g. effect proportional scaling; replication of measures used before; derivation of recommended standards
DAMES ‘GESDE’ tools: online services for data coordination/organisation
Tools for handling variables in social science data
Recoding measures; standardisation / harmonisation; Linking; Curating
Predictors of ‘poor health’ in Sweden (comparison of different occupation-based measures, from DAMES, TP 2011-1)
What more is needed for good data management?
• Incentives/disincentives
  • Arguably, good data management is penalised at present (‘Don’t get it right, get it published’)
  • Few formalised requirements of documentation or data management activity (cf. metadata publishing standards such as DDI)
  • Citation rankings might incentivise here (citation of your do files..)
  • Prospects are probably rather bleak for good science..!!
Summary
• The ‘significance’ of DM is about how much better research might be if we did things more effectively…
• Can (try to) provide data oriented facilities supporting improved data management
• May also need a cultural change in expectations…