Data Processing at ICPSR

Presentation Transcript

  1. Data Processing at ICPSR • Peggy Overcashier, Senior Systems Analyst, ICPSR • CESSDA Expert Seminar, Neuchâtel, Switzerland • September 9, 2004

  2. What is ICPSR? • Membership organization founded in 1962 • Over 500 colleges and universities • 2004-2005 budget approximately $10 million (USD) (8.2 million EUR; 12.5 million CHF) • 30% from membership fees • 70% from grants and contracts • Around 100 employees; 40 data processing staff • World’s largest archive of computer-readable social science data • About 7,000 titles and 140,000 data files • Close to 300 data files available for online analysis

  3. Two Kinds of Archival Holdings: • General Archive Holdings are funded with member dues and are available only to members • Special Topic Archives are supported by foundations or federal agencies and holdings are available to all • Aging • Child Care and Early Education • Criminal Justice • Demographic Research • Education • Health and Medical Care • Substance Abuse and Mental Health

  4. Topics • What we do today and how • Current ICPSR processing pipeline • Development of aids to efficient and accurate processing • Automated scripts and tools • Semi-automated techniques • Where we’re headed • ICPSR process improvement initiative

  5. Current ICPSR Processing Pipeline

  6. Scan deposited electronic files for viruses • Inventory files and documentation received • Verify that electronic files open, are readable • Prepare acquisition form (text) • Transmit original data and documentation for preservation
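The inventory and readability checks listed on this slide could be sketched roughly as follows. This is an illustration only, not ICPSR's actual tooling: the function name `inventory_deposit` and the use of SHA-256 checksums are assumptions, and virus scanning is left to a dedicated scanner.

```python
import hashlib
from pathlib import Path


def inventory_deposit(deposit_dir):
    """Return one record per deposited file: path, size, checksum, readable flag.

    Hypothetical helper; the field names are chosen for illustration.
    """
    records = []
    root = Path(deposit_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        try:
            data = path.read_bytes()  # confirms the file opens and can be read
            digest = hashlib.sha256(data).hexdigest()
            readable = True
        except OSError:
            data, digest, readable = b"", None, False
        records.append({
            "file": path.relative_to(root).as_posix(),
            "bytes": len(data),
            "sha256": digest,
            "readable": readable,
        })
    return records
```

The resulting list could feed the acquisition form, and the checksums give a way to confirm later that the preserved originals are unchanged.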

  7. In consultation with processing supervisor • Determine processing level (routine, intensive) • Initial review of files • Potential disclosure risks • Completeness of variable-level metadata • Wild/undocumented codes • Discuss identified problems/solutions
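A check for wild or undocumented codes, as mentioned in the initial review, might look like the sketch below. The variable, its codebook, and the function name `find_wild_codes` are hypothetical; in practice this kind of check is usually run as frequencies in SPSS or SAS.

```python
from collections import Counter


def find_wild_codes(values, documented_codes):
    """Return {code: count} for values that have no entry in the codebook."""
    counts = Counter(values)
    return {code: n for code, n in counts.items() if code not in documented_codes}


# Hypothetical variable: the codebook documents 1 = Yes, 2 = No, 9 = Missing.
observed = [1, 2, 2, 9, 7, 1, 7, 3]
wild = find_wild_codes(observed, documented_codes={1, 2, 9})
print(wild)  # codes 7 and 3 are not documented
```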

  8. Resolve problems • Eliminate identified disclosure risks • Routine handling vs. full disclosure analysis • Build dataset, typically in SPSS or SAS • Recode • Add and/or delete variables • Fill in missing metadata • Identify missing values • Check full frequencies and/or descriptives • Convert data to ASCII with Data Definition Statements (archival format) • Tools used historically have been buggy • New in-house conversion tool ready for release
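To illustrate the conversion to ASCII with data definition statements, the sketch below writes fixed-width records and a matching SPSS `DATA LIST` command from the same variable list. It is a simplified example and not the in-house conversion tool: the function name, the file name `data.txt`, and the variables are assumptions, and only right-justified numeric fields are handled.

```python
def build_fixed_width(variables, rows):
    """variables: list of (name, width); rows: list of dicts keyed by name.

    Returns (data_lines, spss_setup). Hypothetical helper for illustration.
    """
    lines = []
    for row in rows:
        fields = []
        for name, width in variables:
            text = str(row[name])
            if len(text) > width:
                raise ValueError(f"{name}={text!r} does not fit in {width} columns")
            fields.append(text.rjust(width))
        lines.append("".join(fields))

    # Column locations are derived from the same widths used to write the data,
    # so the setup cannot drift out of step with the file layout.
    specs, start = [], 1
    for name, width in variables:
        end = start + width - 1
        specs.append(f"{name} {start}" if width == 1 else f"{name} {start}-{end}")
        start = end + 1
    setup = "DATA LIST FILE='data.txt' FIXED /\n  " + "\n  ".join(specs) + " ."
    return lines, setup


layout = [("CASEID", 4), ("AGE", 3), ("SEX", 1)]
data_lines, setup = build_fixed_width(layout, [{"CASEID": 1, "AGE": 34, "SEX": 2}])
print(data_lines[0])
print(setup)
```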

  9. Gather existing pieces of documentation • Methodology • Other information received from depositor • Assess what other documentation needs to be included in final products • Hand off to Electronic Document Conversion unit for conversion to PDF or hold until documentation set is completely assembled

  10. Gather and document study-level metadata • Write study summary • Enter into study description form (text) • Submit to editing staff

  11. Optional, at discretion of archive • Assess for potential problems in online analysis • Multiple weights • Outliers • Multiple linkable files • Prepare question text file in SDA native format (DDL) • Configure for online analysis system • Automated test setup; administrators • Example DDL entry: name = PREGNANT; text = The next questions are about your health and health care. Are you currently pregnant?
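Preparing the question text file could be partly automated along these lines. The output layout (`name =` and `text =` lines, with `*` lines between entries) is inferred from the fragment shown on this slide, so treat it as an assumption and check it against the SDA DDL documentation; the function name is hypothetical.

```python
def ddl_question_text(entries):
    """entries: list of (variable_name, question_text) pairs.

    Returns DDL-style text with one name/text pair per variable,
    separated by '*' lines (layout assumed from the slide fragment).
    """
    out = ["*"]
    for name, text in entries:
        out.append(f"name = {name}")
        out.append(f"text = {text}")
        out.append("*")
    return "\n".join(out)


print(ddl_question_text([("PREGNANT", "Are you currently pregnant?")]))
```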

  12. Generate frequencies, descriptive statistics for codebook • Document variable-level metadata • Add processor notes • Source documents typically in Word, sometimes WordPerfect, ASCII, PDF, other • Create additional documents as needed • Hand off to Electronic Document Conversion unit for conversion to PDF • The two document steps are frequently combined into one

  13. Make sure all files handed off have been returned and reviewed • study description • PDF documentation • Test all data files and data definition statements • SAS, SPSS, (Stata) • UNIX, Windows • Prepare turnover form (text) • Create turnover directory, move all study files • Quality control check by another processor • Turn over study files for preservation and dissemination

  14. Tool Development • Skill/knowledge set: Programmer vs. Data Processor • Programming skills required for some tools • Fully-automated scripts • Web-based forms • Creativity and software knowledge required for others • Semi-automated techniques • Use of existing software in non-conventional ways

  15. Tools: Semi-automated Methods: Regular Expressions • Search and replace using patterns rather than literal strings • Multi-Edit, TextPad: Windows-based editors • Capable of regular expressions • Can save files with UNIX formatting • Extract syntax from existing documentation • Value labels • Question text
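As an example of extracting value labels from existing documentation with a pattern rather than a literal string, the sketch below pulls "code, label" lines out of codebook text and emits SPSS `VALUE LABELS` syntax. The codebook layout, the pattern, and the function names are assumptions; a real document would need its own pattern and a manual review of the matches.

```python
import re

# Matches lines such as "1. Yes", "2 = No", or "  9  Don't know".
LABEL_LINE = re.compile(r"^\s*(\d+)\s*[.=:)]?\s+(.+?)\s*$")


def extract_value_labels(doc_text):
    """Return (code, label) pairs for every line that looks like a value label."""
    pairs = []
    for line in doc_text.splitlines():
        match = LABEL_LINE.match(line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def spss_value_labels(variable, pairs):
    """Format the pairs as an SPSS VALUE LABELS command."""
    lines = []
    for code, label in pairs:
        escaped = label.replace("'", "''")  # double embedded apostrophes
        lines.append(f"  {code} '{escaped}'")
    return f"VALUE LABELS {variable}\n" + "\n".join(lines) + "."


codebook = "Q12. Are you currently pregnant?\n1. Yes\n2 = No\n  9  Don't know\n"
print(spss_value_labels("PREGNANT", extract_value_labels(codebook)))
```

The question line is skipped because it does not start with a numeric code; lines that happen to start with a number (a year, for instance) would be picked up by mistake, which is why this stays a semi-automated technique.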

  16. Tools: Semi-automated Methods: Excel for Text Editing • VLOOKUP • List management • Variable disposition • Merging related information from multiple sources • Running counts • Remapping metadata to new variable names
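The VLOOKUP-style remapping of metadata to new variable names can be expressed in a few lines outside Excel as well. This is a sketch of the same lookup idea, not a description of ICPSR's spreadsheets; the variable names and the function `remap_labels` are made up for illustration.

```python
def remap_labels(labels_by_old_name, crosswalk):
    """Carry labels over to new variable names; report old names with no match."""
    remapped, unmatched = {}, []
    for old_name, label in labels_by_old_name.items():
        new_name = crosswalk.get(old_name)  # the lookup VLOOKUP would perform
        if new_name is None:
            unmatched.append(old_name)  # like an #N/A result in Excel
        else:
            remapped[new_name] = label
    return remapped, unmatched


labels = {"Q1": "Respondent age", "Q2": "Respondent sex", "Q3": "Region"}
crosswalk = {"Q1": "AGE", "Q2": "SEX"}
print(remap_labels(labels, crosswalk))
```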

  17. Tools: Semi-automated Methods: SDA Conversion Utilities • Documentation • Frequency, descriptive statistics with variable-level metadata, question text embedded • Can include introductory materials, links to external documents • ASCII→PDF, XML, HTML • DDS conversion • SPSS, SAS, Stata, XML • Prepare metadata for variable-level search

  18. A Little More Technical • Macros • Automate repetitious sequences of commands, keystrokes • Recordable in many applications • Variable Arrays • Pre-define groups of variables on which the same data transformations will be performed • Loops • Repeatedly run a single set of commands as long as a condition is true
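The two ideas of variable arrays and loops can be illustrated as follows. The slide refers to features of statistical packages and editors; this Python version is only an analogy, and the variable names and helper functions are hypothetical.

```python
def recode_group(record, variable_group, mapping):
    """Apply one recode mapping to every variable in a pre-defined group."""
    recoded = dict(record)
    for name in variable_group:  # the "array" of variables
        recoded[name] = mapping.get(record[name], record[name])
    return recoded


def strip_trailing_blank_records(records):
    """Drop blank records from the end, repeating while the condition holds."""
    records = list(records)
    while records and not records[-1].strip():
        records.pop()
    return records


# Recode 8 ("refused") to 9 ("missing") in Q1 and Q2 only; AGE is untouched.
print(recode_group({"Q1": 8, "Q2": 1, "AGE": 8}, ["Q1", "Q2"], {8: 9}))
print(strip_trailing_blank_records(["a", "b", "", "  "]))
```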

  19. Tools: A Few UNIX Script Examples • Automated QC script • Batch-test Data Definition Statements in UNIX • Disclosure analysis and processing system for the Treatment Episode Data Set • Web-based XML generator for Quick Tables configuration files • Hermes: automated batch production system • Early implementation of process improvement recommendations
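One small check that an automated QC script might include is confirming that every record in a fixed-width ASCII file has the expected length. The sketch below is a stand-in for that single check under assumed inputs, not the ICPSR script itself; batch-testing data definition statements would additionally require running them through SAS or SPSS, which is not shown.

```python
def check_record_lengths(lines, expected_length):
    """Return (line_number, actual_length) for every record of the wrong length."""
    problems = []
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\r\n")  # ignore UNIX or Windows line endings
        if len(record) != expected_length:
            problems.append((number, len(record)))
    return problems


sample = ["   1 342\n", "   2 41\n", "   3 291\r\n"]
print(check_record_lengths(sample, expected_length=8))
```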

  20. Process Improvement at ICPSR • Begun in spring 2003 • 4 distinct phases • Mapping the current pipeline • Designing the future • External review • Planning and implementation

  21. Phase 1: Map Current Processing Pipeline • Consultant interviewed groups and individuals • Drew and refined process maps • General agreement that the story and pictures were correct before proceeding

  22. Process Mapping: Insider’s View • Overview • More detailed, with processing milestones • Very detailed, covers a corridor wall

  23. Phase 2: Designing the Future • Internal Process Improvement Committee formed • Brainstorming • “Evolutionary” vs. “Revolutionary” ideas • Formal reports and recommendations

  24. Process Improvement: Guiding Principles • Automation • Standardization • Centralization • Quality Control • Version Control • Focus on the User • Electronic Collection Management • Staff Development and Career Path Expansion

  25. Future Processing Framework

  26. Characteristics • More linear • Integrated; steps connected • Automated milestone tracking • Metadata migrates to database • Eliminate rekeying • Single authoritative source

  27. Phase 3: External Review Committee • Outside experts reviewed reports • Met with individuals and small groups of staff • Endorsed the PIC’s recommendations • Additional recommendations provided • Formal report written

  28. Phase 4: Planning and Implementation • Communication with staff • PIC Web site • PIC/staff information sessions • Implementation manager hired • Implementation plans developed for several recommendations • PIC reconstituted as a standing committee • Review new process improvement suggestions • Provide input for implementation plan

  29. Some Improvements in Development • Automated batch production of enhanced suite of products • Hermes for current and future releases • Retrofit project for previous releases • Web-based forms (acquisition, study description, turnover) • Replace text forms • Eliminate rekeying • Automated processing milestone tracking

  30. Issues Under Consideration • “Ready-to-go” files • How to handle missing data by default (SAS, Stata) • How to best provide SAS formats • Development of standardized bibliographic citation for online analyses • Archival vs. distribution formats • How to handle qualitative data • New formats (e.g., video, audio files) • Development of best practices, automated tools for disclosure analysis

  31. For more information: Peggy Overcashier • overcash@icpsr.umich.edu • +1 734 615 9529