Using Bootstrap in NLSCY
Today’s Presentation
• We’ll discuss the guiding principles
• We’ll demonstrate the CV lookup spreadsheet (which is based on the bootstrap weights)
• Bootstrap macros by example
• Summarize some technical aspects
Background
The National Longitudinal Survey of Children and Youth (NLSCY) measures a wide array of characteristics related to child and youth development. There are many opportunities for statistical inference. The number of possibilities is further compounded by the longitudinal character of the survey. A basic problem of inference is finding the variability of the estimators.
The bootstrap approach
• Does not need exact formulas.
• Takes into account design information.
• Can be adapted to the desired level of precision.
• Is computer intensive.
Basic Idea of Bootstrap
A) Take a subsample of the original sample, trying to mimic the initial selection process.
B) For this subsample, compute weights as if it were the actual sample. The result is a bootstrap weight.
Repeat A) and B) many times to obtain a set of bootstrap weights.
Note that both A) and B) make essential use of the design information.
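Steps A) and B) can be sketched in Python. This is only an illustration under stated assumptions, not the procedure used to produce the NLSCY bootstrap weights: it assumes a stratified design in which primary sampling units (PSUs) are redrawn with replacement within each stratum, n_h - 1 draws per stratum, with the weights rescaled by n_h / (n_h - 1) (the commonly used Rao-Wu rescaled bootstrap). The record layout and the function name are hypothetical, and any nonresponse or post-stratification adjustment is left out.

```python
import random
from collections import Counter, defaultdict


def bootstrap_weights(records, n_replicates, seed=0):
    """Return {id: [w_1, ..., w_B]} for records with keys
    'id', 'stratum', 'psu' and 'weight' (hypothetical layout)."""
    rng = random.Random(seed)

    # Design information: which PSUs belong to which stratum.
    psus_by_stratum = defaultdict(set)
    for r in records:
        psus_by_stratum[r["stratum"]].add(r["psu"])
    psus_by_stratum = {h: sorted(p) for h, p in psus_by_stratum.items()}

    result = {r["id"]: [] for r in records}
    for _ in range(n_replicates):
        # Step A: redraw PSUs within each stratum, with replacement.
        multiplier = {}
        for h, psus in psus_by_stratum.items():
            n_h = len(psus)
            if n_h < 2:
                raise ValueError("each stratum needs at least two PSUs")
            draws = Counter(rng.choices(psus, k=n_h - 1))
            factor = n_h / (n_h - 1)
            for p in psus:
                multiplier[(h, p)] = factor * draws[p]

        # Step B: recompute the weights as if the subsample were the sample.
        for r in records:
            key = (r["stratum"], r["psu"])
            result[r["id"]].append(r["weight"] * multiplier[key])
    return result
```

Units whose PSU is not drawn in a given replicate receive a bootstrap weight of zero for that replicate; units whose PSU is drawn several times receive a proportionally larger weight.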
Basic Idea of Bootstrap - continued
Now suppose we are interested in an estimate:
- Compute the estimate using each of the bootstrap weights.
- Compute the variance of the obtained points.
Note: These two steps are implemented in any program or software that uses bootstrap weights to assess sampling variability.
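The two steps can be illustrated with a short Python sketch. It assumes the variance is taken as the mean squared deviation of the bootstrap estimates from the full-sample estimate (another common convention centres on the mean of the bootstrap estimates); the function names and the toy data are hypothetical.

```python
import math


def weighted_total(values, weights):
    """Weighted total of a variable."""
    return sum(v * w for v, w in zip(values, weights))


def bootstrap_variance(estimator, values, final_weights, replicate_weights):
    """Return (point estimate, bootstrap variance).

    replicate_weights is a list of B weight vectors, one per bootstrap weight.
    """
    point = estimator(values, final_weights)
    # Step 1: recompute the estimate with each bootstrap weight.
    replicates = [estimator(values, w) for w in replicate_weights]
    # Step 2: variance of the obtained points (assumed convention, see above).
    variance = sum((t - point) ** 2 for t in replicates) / len(replicates)
    return point, variance


# Toy data: three respondents and three bootstrap weight vectors.
values = [2, 3, 4]
final_weights = [10, 10, 10]
replicate_weights = [[15, 15, 0], [0, 15, 15], [15, 0, 15]]

point, variance = bootstrap_variance(
    weighted_total, values, final_weights, replicate_weights
)
std_error = math.sqrt(variance)
cv_percent = 100 * std_error / point
```

Any estimator that accepts a weight vector (a ratio, a regression coefficient) can be passed in place of `weighted_total`.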
The Need to Use the Design Information
Using the release weights gives the correct estimates. However, the variance of the estimator reported by SAS or SPSS is not the correct one; most of the time it is too small. Here are the comparisons for two examples: an average and regression coefficients.
Using Bootstrap
We have two tools at hand:
• A database of variances for proportions - already computed by bootstrap
• The bootstrap macros
Results for Proportions
For the variability of proportion estimates we can use an Excel table of already computed results. The work has been done using bootstrap. This table replaces the usual look-up tables for variance. One can choose the domain based on age and province.
Results for Proportions - continued
This general framework allows for estimating the variability of proportions in future cycles of the survey. In most situations where proportions are involved, consulting this database may be enough. Here are examples of how to use it.
Understanding the table: Example 1
Question: What is the quality (c.v.) of the estimates for the proportion of girls aged 3 in Newfoundland at cycle 3? How many will there be in cycle 5? Will the quality suffer from the smaller sample size?
Click on the right arrow in Province to select a province.
Since the proportion of girls should be around 50%, click on Prop. Cible (target proportion) and select 50%.
You can now see that the c.v. for that particular domain in cycle 3 was 17.5%, with 44 children in the sample. In cycle 5, we predict that 35 children will be left in the sample (assuming a 90% response rate in each of cycles 4 and 5), and the c.v. will grow to 19.6%.
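The quoted figures are consistent with simple arithmetic, sketched below in Python. The spreadsheet's actual projection method is not described here, so this rests on two assumptions: the projected sample size is the current size times the response rate per cycle, rounded down, and the c.v. grows in proportion to 1/sqrt(n). The function names are hypothetical.

```python
import math


def project_sample_size(n_now, response_rate, n_cycles):
    """Expected respondents after n_cycles more cycles (rounded down)."""
    return math.floor(n_now * response_rate ** n_cycles)


def project_cv(cv_now, n_now, n_future):
    """Scale a c.v. to a new sample size, assuming c.v. ~ 1 / sqrt(n)."""
    return cv_now * math.sqrt(n_now / n_future)


n_cycle5 = project_sample_size(44, 0.90, 2)   # cycle 3 -> cycle 5
cv_cycle5 = project_cv(17.5, 44, n_cycle5)
print(n_cycle5, round(cv_cycle5, 1))          # prints: 35 19.6
```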
Understanding the table: Example 2
Question: What domains based on a 15% proportion are not publishable? We are looking for domains with a c.v. higher than 33.33%.
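The same filter can be expressed outside the spreadsheet. The Python sketch below applies the 33.33% threshold from the question to a list of domains; the domain labels and c.v. values are invented purely for illustration and are not taken from the NLSCY table.

```python
CV_LIMIT = 33.33  # threshold quoted in the question above


def not_publishable(domains, limit=CV_LIMIT):
    """Return the domains whose c.v. (in percent) exceeds the limit."""
    return [d for d in domains if d["cv"] > limit]


# Hypothetical rows for a 15% target proportion (illustrative values only).
domains = [
    {"domain": "Province A, age 3", "cv": 41.2},
    {"domain": "Province B, age 3", "cv": 28.7},
    {"domain": "Province C, age 3", "cv": 35.0},
]

flagged = not_publishable(domains)
```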
Finally, type 33.33 in the second field and click OK.
You can now see the first few rows of estimates that we can’t release according to customary quality level guidelines.
Results for Proportions - Summary
The table contains variance estimates obtained by bootstrap under general conditions. It is best suited for a quick general assessment of variance and for projections for future cycles. When we need the most accurate variance estimate, we have to do the bootstrap for the specific variable of interest.
Macros - Outline
• Bootstrap weights are computed and made available by methodology.
• The user runs the macros.
Macros - Details
• Preparing the input
• Specifying the options and running the macros
• Saving and interpreting the results
Preparing the input
Two input files are required:
• The bootstrap weights file
• The file with variables of interest
These files must be merged - usually with the CHILDID identifier.
Specifying options
The options to be specified are as follows:
(i) The kind of estimator.
(ii) Whether the analysis is done globally or by domains.
(iii) SAS libraries.
(iv) The names of the variables for analysis.
(v) The number of bootstrap weights to be used.
Specifying options - continued
(i) The built-in choices - in the current version - are:
• Totals
• Ratios
• Difference of Ratios
• Linear Regression
• Logistic Regression
Other estimators may require customizing the code.
(ii) If analysis by subgroup is desired, the user needs to specify the subgroup variable.
Examples with SAS Macros
a) Estimate the variance of a total by region
b) Estimate the variance of an average
c) Estimate the variance of regression coefficients
Estimate variance of a total by region
Problem: Find the variance of the total number of bedrooms in households with teenagers within each province - as estimated from the sample.
Estimate variance of a total by region - continued
/*
%partition(domains=); *no partition if no variable name provided;

%total(dataset=,variable=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: totals

%ratio(dataset=,numerator=,denominator=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: ratios - in PERCENTS -

%ratio_difference(dataset=,numerator1=,denominator1=,
                  numerator2=,denominator2=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: diffrat - in PERCENTS -

%regression(dataset=,dependent=,independent=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: bs_reg

%logistic_reg(dataset=,dependent=,independent=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: bs_reglg

NOTE: unless explicitly deleted, the datasets mentioned above
      will keep accumulating the results of successive macro calls
*/
Estimate variance of a total by region - continued
%total(dataset=,variable=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: totals
Estimate variance of a total by region - continued
%include "C:\users\dochcat\bootstrap\NLSCY_VES.sas";

%let weight_path = C:\users\dochcat\bootstrap\Bs_Weights;
%let weights = bvar;
libname wt_lib "&weight_path";

%let data_path = C:\users\dochcat\Data;
%let data = basic_set;
libname dt_lib "&data_path";

%let save_path = C:\users\dochcat\bootstrap\Results;
%let output = table01;
libname sv_lib "&save_path";
Estimate variance of a total by region - continued
proc sort data=wt_lib.&weights out=weights;
   by childid;
run;

proc sort data=dt_lib.&data (where=(cmmcq01>12)) /* keep only teenagers */
          out=dataset;
   by childid;
run;

data data_and_weights;
   merge dataset(in=a) weights(in=b);
   by childid;
   if a; * keep only the necessary records;
run;
Estimate variance of a total by region - continued
/* initialise totals */
proc datasets library=work;
   delete totals;
run;
quit;

%partition(domains=cgehd03);
%total(dataset=data_and_weights, variable=nb_bedrooms, nb_weights=1000);

/* save results */
data sv_lib.&output;
   set totals;
run;

proc print data=sv_lib.table01;
run;
Estimate the variance of an average
Problem: For children of age 6, find the average number of years of education of the Person Most Knowledgeable (PMK) about the child.
Note: Even though the average was not mentioned as an available option, it is easily computed as a ratio.
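Why an average is a ratio can be seen in a few lines of Python: a weighted mean is the weighted total of the variable divided by the weighted total of a variable that equals 1 for every record. The sketch below uses hypothetical names and toy numbers and does not reproduce the %ratio macro, which according to the macro reference reports its output in percent.

```python
def weighted_ratio(numerator, denominator, weights):
    """Ratio of two weighted totals."""
    num_total = sum(w * y for y, w in zip(numerator, weights))
    den_total = sum(w * x for x, w in zip(denominator, weights))
    return num_total / den_total


# Toy data: years of education for three records and their survey weights.
years = [12, 16, 10]
weights = [100, 50, 150]
count = [1] * len(years)   # constant 1, so the denominator is the sum of weights

mean_years = weighted_ratio(years, count, weights)
```

With a denominator of 1 for every record, the ratio reduces to sum(w * y) / sum(w), which is the usual weighted mean.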
Estimate variance of an average - continued
%ratio(dataset=,numerator=,denominator=,nb_weights=);
   COLLECT OUTPUT FROM DATASET: ratios - in PERCENTS -
Estimate variance of an average - continued
%include "C:\users\dochcat\bootstrap\NLSCY_VES.sas";

%let weight_path = C:\users\dochcat\bootstrap\Bs_Weights;
%let weights = bvar;
libname wt_lib "&weight_path";

%let data_path = C:\users\dochcat\Data;
%let data = basic_set;
libname dt_lib "&data_path";

%let save_path = C:\users\dochcat\bootstrap\Results;
%let output = table02;
libname sv_lib "&save_path";
Estimate variance of an average - continued
proc sort data=wt_lib.&weights out=weights;
   by childid;
run;

proc sort data=dt_lib.&data (where=(cmmcq01=6 and cedpd04<96))
          /* keep only age 6 kids with valid values of the variable */
          out=dataset;
   by childid;
run;

data data_and_weights;
   merge dataset(in=a) weights(in=b);
   by childid;
   if a;      * keep only the necessary records;
   count=1;   * necessary for average calculation;
run;
Estimate the variance of an average - continued
/* initialise ratios */
proc datasets library=work;
   delete ratios;
run;
quit;

%partition(domains=); *ensure no partition;
%ratio(dataset=data_and_weights, numerator=cedpd04, denominator=count, nb_weights=1000);

/* save the results */
data sv_lib.&output;
   set ratios;
run;

proc print data=sv_lib.table02;
run;
Estimate the variance of an average - results
Note that the result is expressed as a percentage. For our purposes we need to divide the standard errors, confidence limits, and estimates by 100, and the variance by 100*100. The coefficient of variation stays the same. Hence the results are: Mean = 12.8446, with a 95% confidence interval [12.6544, 13.0348].
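The rescaling can be written out in Python as a check. The percent-scale inputs below are simply the slide's results multiplied by 100; the standard error is not given on the slide, so it is backed out from the interval under the assumption of a normal-based 95% interval (half-width = 1.96 standard errors). The function and key names are hypothetical.

```python
def rescale_percent_output(estimate_pct, std_error_pct, lower_pct, upper_pct):
    """Convert percent-scale ratio output back to the original scale."""
    return {
        "estimate": estimate_pct / 100,
        "std_error": std_error_pct / 100,
        "variance": std_error_pct ** 2 / (100 * 100),
        "lower": lower_pct / 100,
        "upper": upper_pct / 100,
        # The c.v. is a ratio of two rescaled quantities, so it is unchanged.
        "cv_percent": 100 * std_error_pct / estimate_pct,
    }


# Percent-scale values implied by the results quoted above.
estimate_pct, lower_pct, upper_pct = 1284.46, 1265.44, 1303.48
# Assumption: normal-based 95% interval, so half-width = 1.96 * standard error.
std_error_pct = (upper_pct - lower_pct) / (2 * 1.96)

result = rescale_percent_output(estimate_pct, std_error_pct, lower_pct, upper_pct)
```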
Estimate the variance of an average - comments
Let us compare the confidence interval we just computed with the confidence interval produced by using only the release weights. The following SAS code will produce these confidence limits:
proc means mean lclm uclm data=data_and_weights;
   var cedpd04;
   weight w_final;
run;
And they are …
Estimate the variance of an average - comments
… while the bootstrap estimate of the same confidence interval is: [12.6544, 13.0348].
Comparing the bootstrap and classical results, we can see an increase by a factor of about 1.7 for this variable.
Estimate the variance of regression coefficients
Problem: Estimate the variance of regression coefficients of an outcome variable - the PPVT score. The independent variables are the number of years of education of the PMK and positive interaction in parenting.