170 likes | 314 Vues
Preparing Data for Analysis. Nishan Ahmed. Regional Training Workshop on Influenza Data Management Phnom Penh, Cambodia July 27 – August 2, 2013. National Center for Immunization & Respiratory Diseases. Influenza Division. Data Cleaning: What is it?.
E N D
Preparing Data for Analysis Nishan Ahmed Regional Training Workshop on Influenza Data Management Phnom Penh, Cambodia July 27 – August 2, 2013 National Center for Immunization & Respiratory Diseases Influenza Division
Data Cleaning: What is it? • Check for accuracy of observations and correct or eliminate inaccuracies • Important for both simple and complex data • Questions to ask: • Are values outside of what you would normally observe? • If yes, are values due to inaccuracies in the data or to real changes in activity (i.e. an outbreak, start of influenza season) • Values can be inaccurate due to many factors • Data Entry mistake • Incorrect measurement at site • Incorrect analysis
Data Cleaning: Why do it? • To prepare your data for regular analysis • Steps: • Prepare a copy for temporary cleaning, but also clean the original data source as corrections are validated • If data is not cleaned at source, cleaning will need to be done each time analysis is attempted (i.e. records can be temporarily deleted until verified or corrected) • To finalize a dataset for future analysis/create a clean copy to be used for research • Typically a more thorough process than cleaning during a flu season
Data Cleaning: Why do it? • To check for validity and consistency of reported variables • Ensures that the data collected makes sense • Examples: • # of ILI cases is not greater than the # of patient visits • The date of onset is before data of death • Only enrolled sites should be reporting & included in analysis of sentinel data • To check for data outliers • A facility that normally sees ~100 patient visits will probably not see 1,000 patients during a week • To identify and remove duplicate records
Methods to identify problems • How do you find data that has problems? • Eyeball method • Through quick, simple data queries • Access or Excel queries as you go • Statistical methods • Through pre-programmed automated processes • Used for elements that are routinely cleaned • Example: Automated process for deleting duplicate records
Quick and Simple Queries • To find duplicate records, using Access
Quick and Simple Queries • To check validity of variables
Basic Statistic Measures • Measures of Center • Mean: Sum of the observations divided by the number of observations. • Median: The middle value in an ordered list • Mode: The most frequently occurring value • Measures of Variation or Spread • Standard Deviation: measures variation by indication how far, on average, the observations are from the mean
Equations in Excel Mean Median Standard Deviation
Data Cleaning Processes • Example: Checking for outliers • The US ILI system uses a statistical process to check for outliers: • Look at # of patient visits over time from a given provider • That # should be consistent within a certain degree of change (i.e. 4 standard deviations from the mean) • All values above or below this value are selected and checked manually to verify whether or not the values are reasonable and make sense.
Data Cleaning 01002: Data could not be disproved, left in. 04099: Fixed data based on returned workfolder 04108: Data looked OK to surveillance staff, this was the peak of pandemic, and we would have expected numbers to be high
Error Logs • List of errors found during the cleaning process • Helps to keep track of changes made to records during the cleaning process. • Keep track of how the data has changed over time • Used for follow-up on questions to sites • May be manual or automated • Based on needs of the data
Conclusions • Preparing data for analysis includes finding and cleaning as many data errors as possible • Statistical methods, the eyeball method, and simple queries can all be used to find potential data errors • Data cleaning is important because data errors could alter the interpretation of data (i.e. could cause a perceived increase without a true increase in disease activity) • Error logs are useful in accounting for errors and how they were dealt with