This informative guide by Jon Johnson on migrating from SPSS to SIR delves into the challenges, history, current data collection methods, and implementation strategies. It explores taming complex data, data migration with minimum information loss, and practical steps for successful implementation using Python, Perl, and more.
Migrating from SPSS to SIR Return from Anarchy Jon Johnson 11 May 2005
Introduction • CLS runs 3 of the 4 British Birth Cohort Studies • Multi-disciplinary study of the life-course of three generations born in 1958, 1970 and 2000 • Data collected in various ways: paper, CAPI, administrative data • Complex data: 100,000 variables, 18,000 participants per study
History • Punch cards, different data centres, SIR, SPSS • The data has been through the range of data storage fashions • Social science versus Medical data access models • Goal of increased accessibility and understanding of relationships within data • Development of social science meta-data standards
Current Data Collection • Data collection methods such as CAPI have a negative and a positive side • Data is pre-punched • Data is pre-checked • Data is less understandable • Data is more complicated • Recent data supplied for one sweep was > 100,000 variables
Taming data • Datasets are routinely supplied in SPSS format • SPSS is not an ideal environment to manage such data • SIR is an ideal environment to manage this data
Data Migration with minimum information loss • SPSS Data List • Rarely used, high level of manual intervention • Visual Basic (a.k.a. SaxBasic) • Platform dependent • Limited functionality, multi-step process • ODBC • Flaky at best • Reverse engineer SPSS file • SPSS Portable format - stable, if poorly documented
Implementation • PQL, Perl, Python ? • Stable across OS’s • Good text manipulation • Good XML support • Case based databases
How it works • Parses the SPSS file • Grabs variable names, value labels, data values etc. • Looks up a configuration file for BDI settings • Checks whether it is also setting up the database or just adding a new record • Does some conversions: time, date, scaled vars • Does some analysis of the data to grab the range of values • Writes out a warning if > 3 missing values or a range of missing values • Writes out the schema • python spss_parser.py -f <input filename> -s <sir config file> -d <ddi config file>
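A few of the steps above can be sketched in Python. This is an illustrative sketch only, not the original spss_parser.py: the function names (spss_to_datetime, analyse_variable, build_arg_parser) and the assumption that all three command-line options are required are hypothetical. The one fixed fact it relies on is that SPSS stores dates as seconds since midnight on 14 October 1582.

```python
import argparse
from datetime import datetime, timedelta

# SPSS stores dates and datetimes as seconds since midnight, 14 October 1582.
SPSS_EPOCH = datetime(1582, 10, 14)


def spss_to_datetime(seconds):
    """Convert an SPSS date value (seconds since the SPSS epoch) to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)


def analyse_variable(name, values, missing_values=(), missing_range=None):
    """Return (lowest, highest, warnings) for one variable (hypothetical helper).

    missing_values: discrete values declared as missing.
    missing_range:  optional (low, high) pair declared as a missing range.
    """
    warnings = []
    if len(missing_values) > 3:
        warnings.append(f"{name}: more than 3 missing values declared")
    if missing_range is not None:
        warnings.append(f"{name}: missing values declared as a range {missing_range}")

    def is_missing(value):
        if value is None or value in missing_values:
            return True
        if missing_range is not None:
            low, high = missing_range
            return low <= value <= high
        return False

    # The range of values is taken over non-missing data only.
    valid = [v for v in values if not is_missing(v)]
    lowest = min(valid) if valid else None
    highest = max(valid) if valid else None
    return lowest, highest, warnings


def build_arg_parser():
    """Mirror the -f / -s / -d options from the slide (all assumed required)."""
    parser = argparse.ArgumentParser(prog="spss_parser.py")
    parser.add_argument("-f", dest="input_file", required=True, help="input filename")
    parser.add_argument("-s", dest="sir_config", required=True, help="SIR config file")
    parser.add_argument("-d", dest="ddi_config", required=True, help="DDI config file")
    return parser
```

For example, analyse_variable("age", [1, 2, 3, -9], missing_values=(-9,)) ignores the -9 code when computing the range, and any warnings returned would be written out alongside the schema.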
Use • Once in SIR the data can be restructured • Can be extended to datasets held in other statistical packages such as Stata or SAS, going via StatTransfer -> SPSS portable format and on from there • Also creates XML to add to a data store - superseded !!!