Using Stata for Subpopulation Analysis of Complex Sample Survey Data
Brady T. West, PhD Student, Michigan Program in Survey Methodology
July 30, 2009
2009 Stata Conference
Presentation Outline
• Introduction: Subclass Analysis Issues
• Kish’s Taxonomy of Subclasses
• Two Alternative Approaches to Inference
• Variance Estimation and Methods for ‘Singletons’
• Examples using NHANES and NHAMCS Data
• Suggestions for Practice
• Directions for Future Research
2009 Stata Conference: Subpop Analysis of Survey Data
Subclass Analysis Issues
• Analysts of large, complex sample survey data sets are often interested in making inferences about subpopulations of the original population from which the sample was selected (e.g., Caucasian Females)
• These subpopulations are referred to interchangeably in various literatures as subgroups, subclasses, subpopulations, domains, and subdomains, leading to confusion among analysts of survey data
Subclass Analysis Issues, cont’d
• Software procedures for analysis of complex sample survey data are becoming more powerful, flexible, and widely available, offering analysts several options
• Analysts need to be careful when analyzing subclasses, and be aware of the alternative approaches to subclass analysis that are possible and their implications for inference
Kish’s Taxonomy of Subclasses
• Design Domains: Restricted to specific strata according to the complex sample design (usually geographically, e.g., Texas)
• Cross-Classes: Broadly distributed (in theory) across the strata and primary sampling units defining a complex sample (e.g., African-Americans over age 50)
• Mixed Classes: Disproportionately distributed across the complex sample design (e.g., Hispanics in a sample including Los Angeles as a stratum)
• See Kish (1987), Statistical Design for Research
Design Domains
[Figure: X = sample element in subclass]
Cross-Classes
[Figure]
Mixed Classes
[Figure]
Applying Kish’s Taxonomy
• The type of subclass is critical for determining an appropriate analysis approach
• Two possible approaches to inference motivated by the taxonomy:
  1. Unconditional approach (cross-classes, mixed classes)
  2. Conditional approach (design domains)
The Unconditional Approach
• Appropriate for Cross-Classes, and in some cases Mixed Classes; the subclass of interest theoretically can appear in all design strata and primary sampling units (PSUs)
• KEY POINT: Allow the software to process the entire survey data set, and recognize all possible design strata and PSUs; DO NOT delete sample cases not in the subclass!
The Unconditional Approach
• Rationale: estimated variances for sample estimates of subclass parameters (based on within-stratum variance between PSUs) need to reflect sample-to-sample variability based on the full complex design
• In other words, if a particular subclass does not appear in a PSU in a given sample (although in theory it could have), that PSU should contribute 0 to variance estimates, rather than be ignored completely!
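To make the role of "empty" PSUs concrete, here is a sketch for the simpler case of an estimated subclass total, using the standard with-replacement (ultimate cluster) linearization estimator and ignoring finite population corrections. The notation is introduced here for illustration and does not come from the slides: h indexes strata, i indexes the n_h PSUs in stratum h, j indexes sample elements, w is the sampling weight, and δ equals 1 for subclass members and 0 otherwise.

```latex
\hat{Y}_{\text{sub}} = \sum_{h}\sum_{i=1}^{n_h} \hat{y}_{hi},
\qquad
\hat{y}_{hi} = \sum_{j} w_{hij}\,\delta_{hij}\,y_{hij}

\widehat{\operatorname{Var}}\bigl(\hat{Y}_{\text{sub}}\bigr)
  = \sum_{h} \frac{n_h}{n_h - 1}
    \sum_{i=1}^{n_h} \bigl(\hat{y}_{hi} - \bar{y}_{h}\bigr)^{2},
\qquad
\bar{y}_{h} = \frac{1}{n_h}\sum_{i=1}^{n_h} \hat{y}_{hi}
```

A PSU with no subclass members has a weighted total of 0, but it still counts toward n_h and toward the stratum mean. Deleting such a PSU changes both quantities, which is why the full data set should be passed to the software.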
The Unconditional Approach
• Further, the subclass sample size in each stratum is going to be a random variable, and theoretical sample-to-sample variance in realizations of this random variable should be incorporated into any variance estimation procedures
The Unconditional Approach
• If cross-classes (or in some cases mixed classes) are being analyzed, and PSUs where the subclass does not appear (by random chance) are deleted, problems arise
• Some strata may appear to have only one PSU by design (preventing variance estimation unless an ad hoc approach is used)
• Entire design strata may be dropped, impacting variance estimates and calculations of degrees of freedom
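One way to see these problems before estimating anything is to describe the design as it would look after restricting the data to the subclass. The sketch below assumes the data have already been svyset and that insub is a 0/1 subclass indicator; insub is a hypothetical name, not one used in the slides.

```stata
* Full design: every stratum and PSU the software should recognize
svydescribe

* Design as it appears after restricting to the subclass; the single
* option lists only the strata left with one sampling unit
svydescribe if insub == 1, single
```

Any stratum listed by the second command would be a "singleton" stratum in a conditional analysis, even though the full design has more than one PSU there.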
The Unconditional Approach: General Stata Code
• svy, subpop(indicator): command varlist, options
• indicator = an indicator variable for the subpopulation, or an if condition, e.g., if male == 1
• svy: mean varlist, over(groupvar)
• svy: proportion varlist, over(groupvar)
• Stata drops strata* with no subpopulation observations from degrees of freedom calculations
* Exercise: repeat 10 times really fast
The Conditional Approach
• Appropriate for Design Domains, where a subclass cannot appear outside of specific design strata
• The rationale behind the unconditional approach no longer applies
• Certain design strata should not contribute to variance estimation or calculation of degrees of freedom
The Conditional Approach
• Restrict the analysis to only those design strata where the subclass of interest exists
• Variance estimates reflecting sample-to-sample variability should only be based on those design strata where the subclass can appear (unlike the unconditional approach)
• Subclass sample sizes in design domains are assumed to be fixed, by design
The Conditional Approach: General Stata Code
• svy: command varlist if (condition), options
• (condition) might be male == 1, or a more complex combination of conditions (e.g., male == 1 & age >= 50 & age <= 90)
Variance Estimation Methods
• All of these issues are only relevant when using Taylor Series Linearization, which is the default variance estimation method in Stata
• Conditional analyses are OK to perform when using replication methods, such as Balanced Repeated Replication or Jackknife Repeated Replication (Rust and Rao, 1996)
Ad-hoc Fixes for ‘Singleton’ Clusters in Stata 10.1
• Stata 10.1 provides users with four ad-hoc fixes, set through the singleunit() option of svyset, for the problem where strata are identified with only a single ultimate cluster for variance estimation in a subpopulation analysis:
• Report Missing Standard Errors, singleunit(missing) (not really a fix)
• Treat Units as Certainty Units, singleunit(certainty), which contribute nothing to the standard error
• Scale Variance using Certainty Units, singleunit(scaled), which uses the average variance from the strata with multiple PSUs for each stratum with only a single PSU
• Center at the Grand Mean, singleunit(centered), where the variance contribution comes from a deviation from the grand mean instead of the stratum mean
Example: The NHANES Data
• We first consider examples based on the NHANES II data set, collected from a nationally representative multistage probability sample of the U.S. population between 1976 and 1980 (an oldie but a goodie)
• Briefly, a sample of the U.S. population was given medical examinations in an effort to assess the health of the U.S. population
Example NHANES Analysis
• Analysis Subclass: African-Americans ages 50 and above (this is a cross-class of the U.S. population, which can theoretically appear in all design strata and PSUs)
• Analysis Objective: Estimate the mean systolic blood pressure of this subclass and an appropriate standard error
• See West et al. (2007) for more details
Conditional Approach: Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(centered)
• Also singleunit(certainty), singleunit(scaled)
• gen b50subp = (race == 2 & ager >= 50)
• svy: mean bpsyst if b50subp == 1
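The four ad-hoc fixes can also be compared in a single pass. The loop below is a minimal sketch that reuses the variable names from the code above and assumes the NHANES II data are in memory and b50subp has already been generated; it only automates re-running svyset with each singleunit() choice.

```stata
* Assumes ppsu, fwgtexam, stratum, bpsyst, and b50subp exist as above
foreach fix in missing certainty scaled centered {
    * Re-declare the design with the current single-unit fix
    quietly svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(`fix')
    display as text _newline "singleunit(`fix'):"
    * Conditional analysis: restrict the estimation sample with if
    svy: mean bpsyst if b50subp == 1
}
```

The point estimate should not change across iterations; only the standard error (and whether it is reported at all) depends on the fix.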
Conditional Approach: Results
[Results table not reproduced]
Conditional Approach?
• This approach would not be appropriate for this particular subclass
• Computed standard errors would generally be biased downward, because additional sources of sample-to-sample variability are ignored when following this approach
• Same issues apply for analytic models
• Evidence that the “scaled” ad-hoc fix may be overly conservative!
Unconditional Approach: Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
• Note: choice of single unit option does not matter when following this approach!
• gen b50subp = (race == 2 & ager >= 50)
• svy, subpop(b50subp): mean bpsyst
Unconditional Approach: Results
[Results table not reproduced]
* Note: Stata dropped three strata with no sample units in the subpopulation.
Unconditional Approach?
• This approach would be the appropriate choice for a cross-class such as African-Americans over the age of 50
• Inferences are theoretically appropriate
• Same idea for analytic models
• Results suggest that the “centered” and “certainty” ad-hoc fixes for conditional analyses are reasonable
Example: The NHAMCS Data
• Analysis Subclass: Visits to Emergency Departments (ED) by African-American men ages 60 and above (this is another cross-class of the U.S. population, which can theoretically appear in all NHAMCS design strata and PSUs)
• Analysis Objective: Estimate the percentage of all ED visits by members of this subclass for dizziness and/or vertigo in 2004
• See West et al. (2008) for more details
Stata Code for NHAMCS Analyses
• svyset cpsum [pweight = patwt], strata(cstratm) singleunit(…)
• generate subc = (settype == 3 & sex == 2 & agecat == 5 & race == 2)
• svy: tabulate dizzyrfv if subc == 1, se ci percent // conditional
• svy, subpop(subc): tabulate dizzyrfv, se ci percent // unconditional
NHAMCS Analysis Results
[Results table not reproduced]
NHAMCS Analysis Implications
• No problems with strata having only a single ultimate cluster: ad-hoc fixes all give the same results
• Weighted point estimates are identical
• Substantially fewer design-based degrees of freedom when following the conditional approach; the full complex design will not be reflected in estimation of sample-to-sample variance (many ultimate clusters are lost)
• Conditional analysis assumes that each sample will be of fixed size n = 397 for variance estimation purposes; no random variation in the subclass sample size!
• Conditional analysis results in overly liberal inferences
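The loss of degrees of freedom can be checked directly from the stored estimation results. Design-based degrees of freedom are conventionally the number of PSUs minus the number of strata used in estimation. The sketch below assumes the design has been svyset and subc created as in the NHAMCS code above, and that e(N_strata), e(N_psu), and e(df_r) are the stored-result names; running ereturn list after estimation confirms what is actually available.

```stata
* Conditional analysis: design information is based on the subset only
quietly svy: tabulate dizzyrfv if subc == 1, se ci percent
display "conditional:   strata = " e(N_strata) ", PSUs = " e(N_psu) ", df = " e(df_r)

* Unconditional analysis: the full design is recognized
quietly svy, subpop(subc): tabulate dizzyrfv, se ci percent
display "unconditional: strata = " e(N_strata) ", PSUs = " e(N_psu) ", df = " e(df_r)
```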
Suggestions for Practice
• Consider Kish’s Taxonomy when determining an appropriate subclass analysis approach
• Utilize the appropriate software options for unconditional analyses when analyzing cross-classes
• Be careful with missing values when creating the subpopulation indicator
• The unconditional analysis approach generally works fine for both cases (when in doubt, use this approach)
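The missing-value caution matters because Stata treats a missing numeric value as larger than any number, so a condition such as ager >= 50 is true when ager is missing. The sketch below uses the NHANES variable names from the earlier slides; b50naive and b50safe are hypothetical names chosen for illustration, and whether cases with missing inputs should be set to missing or coded 0 is a decision for the analyst.

```stata
* Naive version: a missing ager counts as >= 50, so cases with
* race == 2 and missing age are flagged as subclass members
generate byte b50naive = (race == 2 & ager >= 50)

* Safer version: leave the indicator missing when either input is missing
generate byte b50safe = (race == 2 & ager >= 50) if !missing(race, ager)

* Cross-check the two indicators, showing missing values
tabulate b50naive b50safe, missing
```

Any cell where b50naive is 1 and b50safe is missing marks a case that the naive indicator would have placed in the subpopulation by accident.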
Directions for Future Research
• More appropriate calculation / estimation of design-based and effective degrees of freedom for sparse subclasses or mixed classes
• Development of analytic theory for interval estimation when working with small subclasses, which does not rely on asymptotic results
References
• Kish, L. 1987. Statistical Design for Research. New York: Wiley.
• Rust, K. F., and J. N. K. Rao. 1996. Variance estimation for complex surveys using replication. Statistical Methods in Medical Research 5: 283–310.
• West, B. T., P. Berglund, and S. G. Heeringa. 2007. Alternative Approaches to Subclass Analysis of Complex Sample Survey Data. Proceedings of the 2007 Joint Statistical Meetings.
• West, B. T., P. Berglund, and S. G. Heeringa. 2008. A Closer Examination of Subpopulation Analysis of Complex Sample Survey Data. The Stata Journal 8(3): 1–12.
Questions / Thank You!
• For additional questions, comments, or electronic copies of these slides or the papers, please send an email to bwest@umich.edu