Data Editing, Coding, and Just a Little Imputation

Data Editing, Coding, and Just a Little Imputation Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs Katherine.J.Thompson@census.gov (301) 763-4941

The Basics: What is Editing? • Editing (procedures) review reported/keyed data for errors and pinpoints “inconsistent” values • For “industry” • For respondent Editing does not change the data. Items that fail edits are • referred to an analyst; or • automatically imputed (replaced with consistent values)

The Basics: What Is Imputation? • Imputation is the replacement of a missing or incorrectly reported item using logical edits or statistical procedures. • In other words, • Imputation replaces a missing or incorrect data item with an “educated guess.”

The Basics: What is Coding? • Coding is the assignment of recognizable values to flags that describe key characteristics of the unit or item, such as • Industry (unit level) • Response status (unit or item level) • Source of data correction (item level) • Imputation model (item level)

We Begin With Coding • Before we can evaluate whether a response is reasonable, we have to know where it comes from: • Classification variable(s) value, e.g., industry, state • Frame information may be erroneous or • unit may have changed classification value  Each unit must be assigned classification code(s) before editing/imputation

We End With Coding • At the end of the processing cycle, we want to know • How the data were changed, • Where the data were changed, • Why (if possible) data were changed, and • The final status of the reporting unit (respondent, non-respondent).

Some Edit Definitions • Editing: Procedures for detecting • “incorrect” keyed or respondent • data. • Micro-Editing: Editing at the individual record • (questionnaire) level • Macro-Editing: Editing at the tabulated value level

“Typical” Editing Processing Flow • Micro-editing (static) • Performed on a flow basis • Predetermined edit tests and edit parameters (historic data) • Administered by machine • Resolved by machine and human • Outlier detection (dynamic) • Performed after “close-out” • Administered by machine • Often resolved by human • Macro-editing (dynamic) • See above

Micro-Edits are Either: • Fatal Must be resolved before subsequent • editing • Unit is Out-of-Scope for Survey • Unit is missing classification variable value • Required data item not reported • Query Can be corrected “automatically” • Detail items do not add to reported total • Ratio of two items is outside (user-determined) limits

Where Do Micro-Edits Come From? • Questionnaire • Reality • Subject-Matter Expert Rules • (Enforced) Statistical Relationships

Fictional Sample Questionnaire

Edit Sources: Questionnaire Balance Edit Item 3.a. Value + Item 3.b. Value = Item 3.c. Value Things have to add up!

Edits Sources: Questionnaire/Reality Ratio Edit ANNUAL PAYROLL/1ST QUARTER PAYROLL 1 Can’t spend more on payroll in one quarter than for the entire year!

Edit Sources: Questionnaire/Reality Ratio Edit 0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76

Edits Sources: Questionnaire/Reality Range Edit 0  EMPLOYMENT  5,615,727 A unit can’t have more employees than the population of the resident state (or negatively-value employees!)

Edit Sources: Subject-Matter Rules Ratio Edit TOTAL SALES/ANNUAL PAYROLL > 1 “Full-year reporters should operate at a profit!”

Edit Sources: Statistical Relationships Ratio Edit A  ANNUAL PAYROLL/EMPLOYMENT  B Wage per employee should be within the (industry) range.

Examples of Fatal Micro-edits • Classification Edits • Required Data Item Tests

Examples of Query Micro-edits • List Directed (Verification) Edits • Skip Pattern Validation Edits • Range Edits (Including negative tests) • Ratio Edits • Within same questionnaire • Current to prior period • Balance Edits • Subject-matter rules

List Directed/Verification Edits • Purpose: To compare the reported value of a data field to a pre-determined list of legal values. • Machine edits, but highly dependent on data-quality of list • Human (manual) correction of edit failures

Skip Pattern Validation Edits • Purpose: To verify that values of skip items are • consistent with the skip instructions • provided on the questionnaire. • Machine edits that CAN be resolved by machine-imputation • Subject-matter rules (if..then..logic) • Operations Research approach

Range Edits • Purpose: To check the reported value of a data item to see if it is within specified minimum and maximum values. • Form of edit: lower bound  data item  upper bound • Upper and lower bounds are tolerances. • If data item is not contained within the bounds, then it fails the range edit (“out of tolerance”). • Negative tests are a special case of range edits. • Can be used to define an imputation region.

Range Edits • Examples: • 0  Employment  301,064,982 (2006 U.S. Population) • 0  Sales  12,455.8 billion (2005 Gross Domestic Product) • 0  Percent of work done in category  100%

Ratio Edits • Purpose: To compare two “related” items in a questionnaire to see if reported values are consistent. • Form of Ratio Edit: • Upper and lower bounds are known as tolerances. • Tolerances generally developed from prior period data. • If ratio is not contained within the bounds, then it fails the ratio edit (“out of tolerance”).

Some Reasons for Ratio Editing • One data item is a function of another. • Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters • Ratio Edit:

Some Reasons for Ratio Editing • One data item can only be evaluated in comparison with another item • (reasonable lower bound) • (reasonable upper bound)

Some Reasons for Ratio Editing • One data item is a good predictor of another. • Annual Payroll = factor  Total Employment

Plot of a Typical Ratio Edit

Advantages of Ratio Edits • Useful for detecting systematic and random errors • Reasonable comparisons for quantitative data • Verifiable assumptions • Often insensitive to changes in economy when both items are in the same units • Imply certain imputation models • Can be solved simultaneously • imputation region implications

Disadvantages of Ratio Edits • Edit failure identifies a pair of potentially incorrect data fields • Need to have a “tie-breaker” • Often work best when combined with other edits (can be ratio edits) • Very dependent on the distribution of ratios • Highly correlated • Goes through origin

“Best” Practices for Ratio Edits • Incorporate unit size categories as well as classification variables in editing cells • Perform preliminary data analysis to determine validity of edit model • Incorporate tests to prior data from same unit and item when reasonable • Use non-parametric outlier-resistant methods for setting ratio edit tolerances • Audit edits • An edit test that has a high rate of failure could indicate problems with the tolerances or the test itself

Periodic Data and Ratio Edits (Caution)

Brief Digression on Imputation • Situation: Missing item or item marked for • imputation (replacement) due to • edit failure(s) • We would like the machine to automatically replace the “inconsistent” item with a consistent value.

The ideal “imputations” find replacement values that are still considered reported (from the same respondent) • Examples • divide reported data by correct reporting unit • replace reported total with sum of details

Link Between Imputation and Program • Published tabulations(macro-data) • Ratio imputation models • Regression imputation models • Published micro-data • Hot deck imputation

Commonly-Used Imputation Methods(Economic Data) • Rounding/Data Slides (systematic error) • Respondent data divided by unit conversion factor (e.g., imputed value = reported value/1,000) • Direct Substitution • Another data item (same questionnaire) • Absolute value of reported/keyed item • Sum of Reported Details (logical edit) • Derived value from other reported/keyed item • Previously reported value (historic) from same respondent • Administrative data value (same respondent)

Ratio Imputation (Model Imputation) • imputed item = (factor)  (another data field) • Same reporting unit/questionnaire • Edit-passing item • Industry (Category) Average Ratio • (use average ratio of two items in industry/category) • e.g., factor = industry wage/employee ratio • Historic Imputation (Auxiliary Trend) • (use ratio of prior data to current data for same respondent) • e.g. factor = previous tabulated value of edit-failing item • previous tabulated value of auxiliary data field

Balance Edits • Purpose: To determine if detail items add to associated reported total. • Form of Edit: TOTAL = DETAIL1 + DETAIL2 + ... + DETAILn • Developed from questionnaire • A set of details along with their associated total is called a balance complex. • More complicated balance complexes • Nested 1-Dimensional • 2-Dimensional

Sample Questionnaire Example

“Fixing” a Failed Balance Edit • Editing generally integrated with imputation: • Editor decides which is more believable: TOTAL or SUM OF DETAILS • Only change one side of balance complex (TOTAL or SUM OF DETAILS)

Balance Edit Definitions • Residual: TOTAL - SUM OF DETAILS • Failed edit solution can depend on • SIZE of residual (absolute tolerance) • RATIO of residual to total (relative tolerance)

A Few Balance Edit Fixes • RAKE* Rake all detail items to TOTAL • YSUMX* Replace TOTAL with the SUM OF DETAILS • ROUND Divide all details by 1000 or • divide TOTAL by 1000 • RESIDUAL Set one missing DETAIL to the RESIDUAL • IMPUTE* Replace all DETAILS with imputed values • *Briefly discussed…

Raking • Adjust each detail item as • Conditions: • Reported TOTAL must be “acceptable.” • Relative tolerance is “small” (e.g., within 5%).

Raking -- Considerations • Is not considered imputation • Preserves reported distribution of the detail items

YSUMX • Set TOTAL equal to SUM OF DETAILS • Conditions: • TOTAL can be changed by edit (not “fixed”); • (Optional, but preferable) SUM OF DETAILS is “reasonable” (e.g., verify with ratio test or range test)

YSUMX -- Considerations • Not (considered: imputation; • logical edit or deductive imputation • Useful when TOTAL is missing (and details are not); • Can be imputation solution to ratio edit

Impute • Replace ALL reported DETAILS with imputed values • Imputed DETAILi for reporting unit c is given by • factori  TOTAL • Conditions • TOTAL > 0 (and value of TOTAL “acceptable”) • No restriction on SUM OF DETAILS (all DETAILS are replaced...] • Difference between TOTAL and SUM OF DETAILS too large for raking

Macro-Editing (Brief Comments) • Systematic review of tabulations (estimates) • Tendency to rely on ratio comparisons to identify outlying estimates • Hidiroglou-Berthelot edit • Ratio Edits • Need to analyze micro-data in outlying cells

Back to Coding… • Throughout the editing and imputation process, what do we need to keep track of?

Back to Coding… • Original source of data item • Reported from respondent • Elicited by analyst/subject-matter expert • Missing/not reported

Data Editing, Coding, and Just a Little Imputation

Data Editing, Coding, and Just a Little Imputation

Presentation Transcript

Editing And Imputation For Manufacturing Statistics At Statistics Canada

Data Imputation

JUST A LITTLE KINDNESS…

SAS Enterprise Guide project for editing and imputation

Statistical Data Editing and Imputation

UNECE workshop on data editing and imputation, Vienna 22. April 2008

Data Editing

Data entry and editing

Data editing and validation

Integrated Data Editing and Imputation

Editing and Coding: Transforming Raw Data into Information

Quality Metrics for Assessing the Impact of Editing and Imputation on Economic Data

Direction and system changes impacting on data editing and imputation at Statistics New Zealand

JUST A LITTLE TALK WITH JESUS

Just a Little Respect Please

Study of Editing and Imputation Practices at Statistics Finland

Data Editing and Logic

Data Imputation Methods and Technologies

Data Creation and Editing

JUST A LITTLE TALK WITH JESUS

Data editing and scanner data

DATA VALIDATION-I Evaluation of editing and imputation