SIPP IMPUTATION SCHEME AND DISCUSSION ITEMS • Presenters: • Nat McKee - Branch Chief • Census Bureau • Demographic Surveys Division (DSD) • Income Surveys Programming Branch (SIPP) • 301-763-5244 • Zelda McBride - Supervisor • Census Bureau • Demographic Surveys Division (DSD) • Income Surveys Programming Branch (SIPP) • 301-763-2942 • ASA/SRM SIPP WORKING GROUP MEETING • September 16, 2008
OVERVIEW OF IMPUTATION • TYPES OF MISSING DATA • Item Non-Response • such as refusals, blanks, don’t know, incompatible answers • Handled via hot deck imputation • Unit Non-Response • such as person-level noninterviews or insufficient partial interviews • Handled via Type Z and/or hot deck imputation
HOT DECK OVERVIEW • File is sorted geographically – allocated data likely to come from geographically proximate case • Replace missing data items with reported data from another similar person/household
EDITING STEPS • Before Pass 1 – cold (initial) values are in the decks; missing data is not imputed yet • Pass 1 – cold values are replaced by the live hot data, but the editing is not saved • Pass 2 – the last values updated in Pass 1 are the starting values for the edit pass
GENDER X AGE CATEGORIES: INITIAL VALUES • Question: What did you have for lunch today? • Answer codes: 1-Hamburger, 2-Yogurt, 3-Salad, 4-Chicken, 5-Roast Beef, 6-Other • Matrix columns: Male, Female • Matrix rows: 1. Under 30, 2. 30 - 64, 3. 65+
VALUES AFTER PASS 1, BEFORE EDITING • Respondents: Nat, Tracy, Zelda, Jeff, Martha • Values: 5, 2, 4, R, R • Matrix columns: M, F • Matrix rows: 1, 2, 3
COUNTERS FOR DONOR USAGE • Matrix columns: M, F • Matrix rows: 1, 2, 3
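The Pass 1 step in the lunch example can be sketched roughly as follows. This is a minimal illustration, not the production edit: the sex and age category assigned to each respondent, the cold value of 6 (Other), and the function name `warm_decks` are all assumptions made for the example, and "R" is read here as a missing (refused) answer.

```python
def warm_decks(records, cold_value=6):
    """Pass 1 sketch: start every cell at a cold value, then let each
    reported answer overwrite the value stored in its cell."""
    # One cell per sex x age category (1 = under 30, 2 = 30-64, 3 = 65+).
    deck = {(sex, age): cold_value for sex in "MF" for age in (1, 2, 3)}
    for rec in records:
        if rec["lunch"] is not None:          # only reported data updates the deck
            deck[(rec["sex"], rec["age_cat"])] = rec["lunch"]
    return deck


# Hypothetical respondents: sex and age_cat are invented for illustration.
records = [
    {"name": "Nat",    "sex": "M", "age_cat": 2, "lunch": 5},
    {"name": "Tracy",  "sex": "F", "age_cat": 1, "lunch": 2},
    {"name": "Zelda",  "sex": "F", "age_cat": 2, "lunch": 4},
    {"name": "Jeff",   "sex": "M", "age_cat": 2, "lunch": None},  # missing
    {"name": "Martha", "sex": "F", "age_cat": 3, "lunch": None},  # missing
]

deck_after_pass_1 = warm_decks(records)
for cell in sorted(deck_after_pass_1):
    print(cell, deck_after_pass_1[cell])
```

Cells that no reporter falls into keep their cold value, which is what a later pass would hand out if a missing case landed there.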
IMPUTING FOR MISSING DATA • Process sequentially by unit for each section: demographics, household characteristics, labor force, assets, general income, health insurance, and program participation • If the data are not missing, the reported value replaces the hot deck value • If the data are missing, the item takes the last hot deck value and the counter is incremented • Repeating the same edit program/imputation gives the same results each time (i.e., rerun – no changes – same donors, same results)
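The per-record rule above can be sketched as a single sequential loop. This is a simplified illustration under stated assumptions: the function name `impute_item`, the record layout, and the choice to keep one usage counter per cell are hypothetical, and the starting deck stands in for whatever values the decks hold when the edit pass begins.

```python
from collections import Counter


def impute_item(records, item, cell_of, start_deck):
    """Sequential hot deck sketch for one item.

    Reported values overwrite the deck; missing values take the current
    deck value for their cell and bump that cell's usage counter.
    """
    deck = dict(start_deck)        # copy, so a rerun starts from the same state
    counters = Counter()           # cell -> number of values donated
    result = []
    for rec in records:            # file order drives which donor is "last"
        cell = cell_of(rec)
        out = dict(rec)
        if rec[item] is not None:
            deck[cell] = rec[item]
            out["flag"] = 0        # reported, no imputation
        else:
            out[item] = deck[cell]
            counters[cell] += 1
            out["flag"] = 1        # hot deck imputation
        result.append(out)
    return result, counters


# Hypothetical input: lunch codes by sex, with None marking missing data.
people = [
    {"id": 1, "sex": "F", "lunch": 2},
    {"id": 2, "sex": "F", "lunch": None},
    {"id": 3, "sex": "M", "lunch": None},
    {"id": 4, "sex": "F", "lunch": 4},
    {"id": 5, "sex": "F", "lunch": None},
]
start = {"M": 6, "F": 6}

first, counters = impute_item(people, "lunch", lambda r: r["sex"], start)
second, _ = impute_item(people, "lunch", lambda r: r["sex"], start)
print([r["lunch"] for r in first], dict(counters))
print("rerun identical:", first == second)
```

Because nothing in the loop is random, the outcome depends only on the sort order of the file and the starting deck, which is why a rerun reproduces the same donors.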
IMPUTATION MATRICES • Matrix defined with stratifying parameters relevant to the item • Sex, race, and age (with categories) are used frequently in matrices • Other specialized, relevant variables are used too; for example, when imputing class of worker, a recode of industries is used in the matrix
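One way to picture such a matrix is as a dictionary keyed by the cross of the stratifying parameters. The sketch below is illustrative only: the age breaks reuse the three categories from the lunch example, while the race codes, the cold value, and the names `age_category` and `cell_of` are assumptions.

```python
from bisect import bisect_right
from itertools import product

AGE_BREAKS = (30, 65)      # categories: 1 = under 30, 2 = 30-64, 3 = 65+
SEXES = ("M", "F")
RACES = (1, 2)             # hypothetical race codes


def age_category(age):
    """Map an age in years to its category number (1, 2, or 3)."""
    return bisect_right(AGE_BREAKS, age) + 1


def cell_of(person):
    """Return the matrix cell a person falls into."""
    return (person["sex"], person["race"], age_category(person["age"]))


# Every cell starts with a cold value (0 here is a placeholder).
matrix = {cell: 0 for cell in product(SEXES, RACES, (1, 2, 3))}

print(len(matrix), "cells")
print(cell_of({"sex": "F", "race": 1, "age": 64}))
```

Adding a stratifier multiplies the number of cells, so each extra variable thins out the donors available per cell.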
USING PREVIOUS WAVE DATA • Waves 2+ sometimes use previous wave data as a parameter in the hot deck • Advantage – more consistency from wave to wave • Disadvantage – a particular donor has the potential to influence every wave
ALLOCATION FLAGS • 0 – no imputation initialized • 1 – hot deck imputation • 2 – set to cold value • 3 – logical (derived) • 4 – used previous wave data
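For analysis code, the flag values can be carried as a small enumeration. This is a sketch only: the member names and the `flag_counts` helper are hypothetical labels for the five codes listed above, not names taken from SIPP files.

```python
from collections import Counter
from enum import IntEnum


class AllocationFlag(IntEnum):
    """Hypothetical names for the allocation flag codes on the slide."""
    NOT_IMPUTED = 0
    HOT_DECK = 1
    COLD_VALUE = 2
    LOGICAL = 3
    PREVIOUS_WAVE = 4


def flag_counts(flags):
    """Tally raw flag codes by name, e.g. to see how often cold values were used."""
    return Counter(AllocationFlag(code).name for code in flags)


sample_flags = [0, 0, 1, 1, 1, 2, 4]    # invented values for illustration
counts = flag_counts(sample_flags)
print(counts)
```

A tally like this gives a quick view of how much of an item was reported versus allocated by each method.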
TYPE Z NONINTERVIEW • Type Z Noninterview = Noninterviewed Person Within Interviewed Household:
EPPINTVW (Wave 3)                            Frequency   Percent
-------------------------------------------------------------
-1 = Noninterview in all 4 months                14254     12.34
 1 = Interview (Self)                            44912     38.89
 2 = Interview (Proxy)                           29844     25.84
 3 = Non-Interview - Type Z                       3042      2.63
 4 = Non-Interview - Pseudo Type Z                1039      0.90
 5 = Children under 15 during ref period         22404     19.40
TYPE Z IMPUTATION • Type Z Imputation = Hierarchical sorting and merging operation that matches Type Z noninterviews with respondents based on demographic characteristics available for both • Imputes the entire record from a single donor
ELIGIBILITY FOR TYPE Z IMPUTATION • Person is a Type Z noninterview • Wave 1, or Wave 2+ with no previous wave information available • Type Z Eligibility:
TYPZIMP (Wave 3)     Frequency   Percent
-------------------------------------------------
Not Eligible              2964     72.63
Eligible                  1117     27.37
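The percentages in the eligibility table can be recomputed from the frequencies with a few lines of arithmetic. The sketch below also compares the TYPZIMP total with the sum of the Type Z (3042) and Pseudo Type Z (1039) counts from the EPPINTVW table; reading those two groups as the universe for TYPZIMP is an inference from the matching totals, not something the slides state. The helper name `percents` is hypothetical.

```python
def percents(freqs):
    """Convert a {label: frequency} mapping to percentages rounded to 2 places."""
    total = sum(freqs.values())
    return {label: round(100 * n / total, 2) for label, n in freqs.items()}


typzimp = {"Not Eligible": 2964, "Eligible": 1117}
type_z_groups = {"Type Z": 3042, "Pseudo Type Z": 1039}

typzimp_pct = percents(typzimp)
totals_agree = sum(typzimp.values()) == sum(type_z_groups.values())

print(typzimp_pct)
print("totals agree:", totals_agree)
```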
ELIGIBILITY FOR TYPE Z DONORS • Interview or sufficient partial interview • sufficient partial = reached first asset question (completed Demographics, Labor Force Recipiency, General Income Recipiency, and Asset Intro.)
TYPE Z PROCESS • determine if person is type Z or donor, create separate files for type Z and donors
TYPE Z PROCESS - CONTINUED • create 4 levels of match keys for each person on both files • match keys are based on rotation group plus various demographic variables: age, race, sex, veteran status, marital status, relationship to reference person, educational attainment, parental status, spouse’s interview status • Level 1 keys are the most restrictive, level 4 are the least (designed to always find a match)
TYPE Z PROCESS - CONTINUED • sort both files by match keys • match files • select best match for each type Z case: • level 1 match = best, level 4 = worst • transfer data from donor record to type Z record for matched cases
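The matching steps can be sketched as below. Several things here are assumptions: the slides list the demographic variables but not which ones enter each level, so the `LEVEL_VARS` assignment is invented; a dictionary lookup stands in for the sort-and-merge operation; and when several donors share a key, the first one in file order is taken. All field names are hypothetical.

```python
# Hypothetical key definitions, most restrictive (1) to least restrictive (4).
LEVEL_VARS = {
    1: ("rot", "sex", "race", "age_group", "marital", "educ"),
    2: ("rot", "sex", "race", "age_group"),
    3: ("rot", "sex"),
    4: ("rot",),
}


def match_key(person, level):
    """Build the match key for one person at one level."""
    return tuple(person[v] for v in LEVEL_VARS[level])


def match_type_z(type_z_file, donor_file):
    """Give each type Z case the whole record of its best-level donor."""
    # Index donors by key at every level; the first donor per key is kept.
    index = {level: {} for level in LEVEL_VARS}
    for donor in donor_file:
        for level in LEVEL_VARS:
            index[level].setdefault(match_key(donor, level), donor)

    matched = []
    for z in type_z_file:
        for level in sorted(LEVEL_VARS):          # try level 1 first
            donor = index[level].get(match_key(z, level))
            if donor is not None:
                # Donor supplies the data items; the type Z person's own
                # id and demographics are kept.
                matched.append({**donor, **z,
                                "donor_id": donor["id"],
                                "match_level": level})
                break
    return matched


donors = [
    {"id": "D1", "rot": 1, "sex": "F", "race": 1, "age_group": 2,
     "marital": 1, "educ": 3, "earnings": 2500},
    {"id": "D2", "rot": 1, "sex": "M", "race": 2, "age_group": 3,
     "marital": 2, "educ": 1, "earnings": 1800},
]
type_z = [
    {"id": "Z1", "rot": 1, "sex": "F", "race": 1, "age_group": 2,
     "marital": 1, "educ": 3},
    {"id": "Z2", "rot": 1, "sex": "M", "race": 1, "age_group": 1,
     "marital": 1, "educ": 2},
]

matched = match_type_z(type_z, donors)
for rec in matched:
    print(rec["id"], rec["donor_id"], rec["match_level"], rec["earnings"])
```

In this toy data, Z1 agrees with D1 on every level 1 variable, while Z2 only finds a donor once the key is relaxed to rotation group and sex.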
LITTLE TYPE Z • Used in labor force edit to get job and labor force data from a donor
DISCUSSION ISSUES ON HOW TO IMPROVE CURRENT IMPUTATIONS • What do we gain by doing type Z imputations vs. hot deck imputations? What are the trade-offs? • What is the threshold (or how should a threshold be determined) for identifying hot-deck overuse for a particular donor/cell? Does this need to be adjusted as the sample size changes (as in the case of a sample cut)?
DISCUSSION ISSUES ON HOW TO IMPROVE CURRENT IMPUTATIONS (CONTINUED) • What is the threshold (or how should a threshold be determined) for determining cold-deck overuse? • How do we determine optimum size for a particular hot deck? Is there a relationship between the number of cells in a hot deck matrix and the number of cases in the universe?
DISCUSSION ISSUES ON HOW TO IMPROVE CURRENT IMPUTATIONS (CONTINUED) • Currently, we do not distinguish between reported data and imputed data in the stratifying variables for particular hot decks. Do we need to be concerned about this? • Is there an objective, simple way to choose stratifying variables in a hot deck? • What methods/criteria should be used to determine the quality of imputations?