Investigation of Treatment of Influential Values Mary H. Mulry Roxanne M. Feldpausch
Outline • Current practices • Methods investigated • Results • Next steps
Influential Observation An observation is considered influential if its weighted contribution has an excessive effect on the estimate of the total (Chambers et al. 2000).
The Data - U.S. Monthly Retail Trade Survey • Collects sales and inventories • Monthly survey of about 12,500 retail businesses with paid employees • Sample selected every 5 years • Sample is stratified based on industry and sales • Quarterly sample of births • Deaths are removed
The Data • Analysis done at published NAICS level • Hidiroglou-Berthelot algorithm was run on the data before looking for influential values • Horvitz-Thompson estimator
Causes of Influential Units • One time or rare event • Erroneous measure of size • Change in the make-up of the unit • Seasonal Businesses
Current Practices • Analysts review an effect listing of micro-level data and investigate units that may be influential • When the analyst determines that a correctly reporting unit may be influential, the case is referred to a statistician
Current Practices • One-time influential value • Imputation • Recurring influential value • Weight adjustment based on the principles of representativeness • Moving the unit to a different industry when the nature of the business changes
Goals • To improve upon current methodology by making it more objective and rigorous • To find methodology that uses the observation but in a manner that assures its contribution does not have an excessive effect on the total
Assumptions • Influential observations occur infrequently, but are problematic when they appear. • The influential observation is true, although unusual. It is not the result of a reporting or coding error.
Strategy • Identify candidate methodologies and test with real data from one industry (about 700 businesses) for a month that contains an influential value
Evaluation Criteria • Number of influential observations detected, including the number of true and false detections made • Estimate of bias • Impact on month-to-month change
Notation • Yi is the sales for the i-th business in a survey sample of size n • wi is the sample weight for the i-th unit • Xi is the previous month’s sales for the i-th business
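The Horvitz-Thompson estimator named on the data slide forms a total as the weighted sum of reported values, so the weighted contribution of unit i is wi times Yi. The following is a minimal Python sketch of that calculation and of each unit's share of the total. The function names and the sample figures are hypothetical and do not come from the survey.

```python
def ht_total(values, weights):
    """Horvitz-Thompson estimate of a total: sum of weight * value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for w, y in zip(weights, values))


def contribution_shares(values, weights):
    """Share of the estimated total contributed by each weighted unit."""
    total = ht_total(values, weights)
    return [w * y / total for w, y in zip(weights, values)]


# Hypothetical data: three ordinary units and one unusually large one.
sales = [10.0, 12.0, 8.0, 100.0]
weights = [5.0, 5.0, 5.0, 30.0]

total = ht_total(sales, weights)
shares = contribution_shares(sales, weights)
```

In this made-up sample the last unit supplies most of the estimated total, which is the situation the slides describe as influential.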
Methods Examined • Weight trimming • Reverse calibration • Winsorization • Generalized M-estimation
Weight Trimming • Does not identify influential units • Adjusts the weight of the observation
Weight Trimming • Truncate the weight of the influential observation • Adjust the weights of the non-influential observations to account for the remainder of the truncated weight • Sum of the new weights is the same as the sum of the original weights • (Potter 1990)
Weight Trimming Notes • Calculations were done within sample stratum. • Choice of correction factor could be investigated. We arbitrarily chose ci=wi/3.
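The trimming steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the influential unit is supplied by the caller (the method does not detect it), its weight is cut to wi/3 as on the slide, and the excess is spread over the other units in the stratum in proportion to their original weights. The proportional split, the function name, and the weight of 30 are all assumptions.

```python
def trim_weight(weights, influential, divisor=3.0):
    """Trim one unit's weight to w/divisor and give the excess to the rest.

    weights     : original sample weights for one stratum
    influential : index of the unit flagged as influential (chosen by caller)
    divisor     : 3.0 reproduces the slide's choice c_i = w_i / 3
    """
    if len(weights) < 2:
        raise ValueError("need at least one other unit to absorb the excess")
    trimmed = weights[influential] / divisor
    excess = weights[influential] - trimmed
    other_sum = sum(w for i, w in enumerate(weights) if i != influential)
    # Assumption: the excess is shared in proportion to the original weights.
    return [
        trimmed if i == influential else w + excess * w / other_sum
        for i, w in enumerate(weights)
    ]


# Hypothetical stratum of 19 units with equal weights; unit 0 is influential.
original = [30.0] * 19
adjusted = trim_weight(original, influential=0)
```

The sum of the adjusted weights equals the sum of the original weights, which is the property the slides require.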
Reverse Calibration • Does not identify influential units • Adjusts the value of the observation
Reverse Calibration • Use a robust estimation method to estimate the total • Modify the influential observations to achieve that total • (Chambers and Ren 2004)
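For a single influential unit, the second step reduces to solving for the one value that makes the weighted total equal the robust estimate. The sketch below shows only that simplified case. The robust total is treated as a given input, the figures are hypothetical, and the procedure in Chambers and Ren (2004) is more general than this.

```python
def reverse_calibrate(values, weights, influential, robust_total):
    """Replace one unit's value so the weighted total equals robust_total.

    Simplifying assumption: exactly one influential unit, picked by the caller.
    """
    rest = sum(
        w * y
        for i, (w, y) in enumerate(zip(weights, values))
        if i != influential
    )
    adjusted = list(values)
    # Solve w_k * y_k_new + rest = robust_total for y_k_new.
    adjusted[influential] = (robust_total - rest) / weights[influential]
    return adjusted


# Hypothetical data; 900.0 stands in for a robust estimate of the total.
sales = [10.0, 12.0, 8.0, 100.0]
weights = [5.0, 5.0, 5.0, 30.0]
modified = reverse_calibrate(sales, weights, influential=3, robust_total=900.0)
new_total = sum(w * y for w, y in zip(weights, modified))
```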
Winsorization • Identifies influential units • Adjusts the value of the observation
Winsorization • Type I • Type II
Winsorization – Defining K • Define a separate Kh for each stratum in a manner that minimizes the MSE (Kokic and Bell 1994) • Define a separate Ki for each observation in a manner that minimizes the MSE (Clarke 1995)
Winsorization – Defining K • Use unweighted data to define Kh for each stratum, where Kh = mh + 2sh • Use weighted data to define Kh for each stratum, where Kh = mh + 2sh and mh and sh are based on the weighted data
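A small Python sketch of the unweighted cutoff rule and the two winsorization types follows. It uses the commonly cited definitions, which are an assumption here because the slide formulas are not in the text: Type I replaces a value above K with K, and Type II replaces it with Y/w + (1 - 1/w)K. It also assumes mh and sh are the stratum mean and sample standard deviation. All names and data are hypothetical.

```python
import statistics


def cutoff(values, multiplier=2.0):
    """Cutoff K = m + multiplier * s (assumed: mean and sample std. dev.)."""
    return statistics.mean(values) + multiplier * statistics.stdev(values)


def winsorize_type1(y, k):
    """Type I: a value above the cutoff is replaced by the cutoff."""
    return min(y, k)


def winsorize_type2(y, w, k):
    """Type II: a value above the cutoff keeps 1/w of its excess over k."""
    if y <= k:
        return y
    return y / w + (1.0 - 1.0 / w) * k


# Hypothetical stratum: nine ordinary units and one very large one.
stratum_sales = [10.0] * 9 + [100.0]
weight = 5.0

k = cutoff(stratum_sales)
type1 = [winsorize_type1(y, k) for y in stratum_sales]
type2 = [winsorize_type2(y, weight, k) for y in stratum_sales]
```

Only the last value exceeds K in this example. Type II pulls it back less than Type I because the unit still represents itself with its full reported value.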
Winsorization – Our Implementation • Used a robust regression in SAS to estimate the parameters needed in the calculations
M-estimation • M-estimators are robust estimators that come from a generalization of maximum likelihood estimation
M-estimation • Identifies influential units • Adjusts either the weight or the value of the influential observation
M-estimation • Used a weighted M-estimation technique that is able to modify the weights or the values of the influential observations (Beaumont and Alavi 2004)
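To illustrate the general idea of M-estimation, here is a minimal Python sketch of a Huber-type M-estimator of location fitted by iteratively reweighted averaging. It is not the weighted generalized M-estimation of Beaumont and Alavi (2004). The tuning constant 1.345, the MAD-based scale, the function names, and the data are all assumptions chosen for illustration.

```python
import statistics


def huber_weights(values, center, scale, c):
    """Robustness weight: 1 near the center, c*scale/|residual| beyond it."""
    bound = c * scale
    return [
        1.0 if abs(v - center) <= bound else bound / abs(v - center)
        for v in values
    ]


def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location with a fixed MAD-based scale."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # consistency factor for normally distributed data
    if scale == 0.0:
        return center, [1.0] * len(values)
    for _ in range(max_iter):
        r = huber_weights(values, center, scale, c)
        updated = sum(ri * v for ri, v in zip(r, values)) / sum(r)
        converged = abs(updated - center) < tol
        center = updated
        if converged:
            break
    return center, huber_weights(values, center, scale, c)


# Hypothetical data with one extreme value.
data = [10.0, 12.0, 8.0, 11.0, 9.0, 100.0]
estimate, robustness = huber_location(data)
```

A robustness weight below 1 marks the unit as influential, and that same factor indicates how strongly the unit's contribution would be reduced.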
Number of Outliers Detected *Method does not detect outliers; one outlier was specified
Replacement Values (in Millions) *Weight trimming adjusts the other 18 weights in the stratum **Winsor wgt +2s identified 3 other values
Chosen for Further Study • Winsorization by each observation • M-estimation by observation • M-estimation by weight
Contact Information Mary.H.Mulry@census.gov Roxanne.Feldpausch@census.gov