Learn how DEF2 tackles missing data and non-independence issues in cross-cultural research by utilizing Multiple Imputation and network regression models. Discover steps for imputation, data analysis, and assessing non-independence.
Introduction to and Overview of DEF2: An R software package for cross-cultural research. E. Anthon Eff, Malcolm M. Dow, Wes Routon. Anthropological Sciences Conference, Albuquerque, March 18, 2014
The two major problems with cross-cultural data analysis addressed by DEF2 are: Missing Data. All of the major cross-cultural data sets have substantial missing data. Single imputation methods (mean substitution, regression predicted scores, hot deck, etc.) result in coefficient variance estimates that are biased downward. Data editing procedures, e.g. listwise deletion, generally result in small samples (loss of power) and also require very strong assumptions about why data are missing. These assumptions are very unlikely to hold. Single imputation methods are no longer recommended. DEF2 employs the Multiple Imputation by Chained Equations (mice) approach to handling missing data. Non-Independence of Sample Units. Sample cases in cross-cultural and cross-national data are frequently not independent of one another due to various inter-societal network processes: cultural trait borrowing, conquest, emulation, inheritance from ancestral populations, etc. This is the classic Galton’s Problem in anthropology, understood more generally as the problem of cultural trait transmission. DEF2 addresses this issue by incorporating networks of relations into regression models, and employing instrumental variables procedures to generate consistent and relatively efficient estimates.
Step 1 of the DEF2 Approach to Multiple Imputation of Missing Data: finding auxiliary variables. The mice procedure imputes values for missing observations on the variables specified in the structural regression model of interest, using both these variables themselves plus a set of auxiliary variables. Ideal auxiliary variables are usually a subset of those with no missing values in the full data set. Auxiliary variables must be correlated with the variables in the structural regression model that have missing values, since the imputation procedure is designed to “borrow” information from them to help impute the missing values. DEF2 will employ auxiliary variables provided by the user. Alternatively, DEF2 will identify suitable auxiliary variables as follows: 1. identify all categorical, ordinal, interval variables with no missing values in the complete data set. 2. identify variables that one wants to impute, and, one at a time, treating each as a dependent variable: i) regress (using binary/ordinal logit, multinomial, OLS) the dependent variable on the covariate that provides the highest correlation, and save the residual ii) add to the regression model the covariate that correlates highest with the residual, and calculate the new residual iii) repeat the above steps 8 times (or more) iv) calculate the relative importance of predictors, drop variables that fall below a given threshold, and recalculate the residual v) repeat steps ii – iv.
Step 2: Create m complete data sets • The mice procedure is repeated m times to create m copies of the data set, each containing different sets of imputed values. • Since each data set is now complete, each can be analyzed using any of the usual statistical models that require complete data. • m = 10 - 100 is currently suggested, depending on sample size and amounts of missing data.
Step 3: Analyzing the data and pooling the results: Rubin’s Rules
Analyzing the data and pooling the results, cont…. Rubin’s pooling procedures can be done with any statistic generated by the statistical method employed to analyze the m imputed data sets.
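A minimal sketch of Rubin's rules in their standard form, written in Python for illustration only (DEF2 itself is an R package, and this is not its code). For one coefficient estimated on each of the m imputed data sets, the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b        # total variance of the pooled estimate
    return q_bar, t

# Hypothetical estimates and squared standard errors from m = 5 model fits
estimates = [0.42, 0.38, 0.45, 0.40, 0.35]
variances = [0.010, 0.012, 0.011, 0.009, 0.013]

q_bar, t = pool_rubin(estimates, variances)
print("pooled estimate:", q_bar)
print("pooled standard error:", t ** 0.5)
```

The between-imputation term is what single imputation leaves out, which is why its variance estimates come out too small.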
Galton’s Problem Incorporating inter-societal networks into network autocorrelation effects regression models
What processes might be inducing non-independence? • Spatial diffusion: societies in close proximity have more opportunity to emulate, conform to, adopt, or borrow their neighbors' behaviors, beliefs, customs, rituals, etc. (horizontal diffusion). • Language similarity: similarity due to populations splitting off from the same ancestral population (vertical diffusion). • Religion: marriage practices spread world-wide by the colonization of large swaths of the world by European Christian nations. • Equivalence: units “similarly situated” in a network and not necessarily proximate, e.g. economic similarity, core/periphery in world system, colonial status, ecological setting, …
Assessing non-independence: Tobler’s First Law of Geography. “Everything is related to everything else, but near things are more closely related than distant things.” This “law” suggests that the score on variable y for the ith society should be similar to the scores of those societies with which it has the closest relationships. Call these societies i’s “neighborhood set.” If so, yi should be similar to the weighted average of the set of y scores for i’s neighborhood set, where the weights indicate relative closeness. If the N scores on y are significantly correlated with the N weighted average scores, conclude that the y variable is auto-(self)-correlated.
Weighting sample units. First, construct an NxN connectivity matrix C of pair-wise relatedness scores among sample units, and then row-normalize C to unity to get the required weights matrix W. That is, wij = cij / Σj cij. (If a variable y is premultiplied by W, i.e. Wy, the product will be an Nx1 vector of weighted averages that are on the same scale as y.)
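The construction of W and Wy can be sketched in a few lines of Python. This is an illustration, not DEF2 code. The connectivity scores in `c` and the values in `y` are made up, and the helper names are hypothetical. The final correlation is only the descriptive quantity mentioned above; a significance test is not shown.

```python
def row_normalize(c):
    """Turn a connectivity matrix C into a weights matrix W whose rows sum to 1."""
    w = []
    for row in c:
        total = sum(row)
        # Assumption of this sketch: a unit with no ties keeps an all-zero row
        w.append([x / total if total else 0.0 for x in row])
    return w

def lag(w, y):
    """Compute Wy: for each unit, the weighted average of y over its neighborhood set."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in w]

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

# Hypothetical relatedness scores among four societies (zero diagonal)
c = [[0, 2, 1, 0],
     [2, 0, 1, 1],
     [1, 1, 0, 2],
     [0, 1, 2, 0]]
y = [1.0, 2.0, 4.0, 5.0]

w = row_normalize(c)
wy = lag(w, y)
r = pearson(y, wy)
print(wy, r)
```

Because each row of W sums to one, every element of Wy lies between the smallest and largest y in that unit's neighborhood set, which keeps Wy on the scale of y.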
Incorporating autocorrelated variables into multiple regression • Cross-cultural researchers are usually interested in testing whether hypothesized predictor variables are acting on a dependent variable, as well as what processes are inducing autocorrelation in it. • The Network Autocorrelation Regression Effects Models in DEF2 do just that.
The most commonly used network autocorrelation regression model is the Network Autocorrelation Effects model: y = α + ρWy + Xβ + ε, where: W is a row-normalized NxN weighting matrix with wij > 0 if i and j are related, 0 otherwise, and wii = 0 for all i; ρ is the network autocorrelation coefficient; y is an Nx1 vector; Wy is an Nx1 vector where each element i is a weighted average of y values for i’s neighborhood set; X is an Nxk matrix of exogenous variables; β is a kx1 vector of coefficients; ε is an Nx1 vector of error terms. Also called the Network “Lag” model, by analogy to time series, since W acts similarly to the lag operator in time series models, except that W lags the y variable in other kinds of social and physical “spaces.” This is the model currently implemented in DEF2.
Estimating the network autocorrelation effects regression model y = α + ρWy + Xβ + ε • MLE: Maximum Likelihood Estimation. This is usually the method of choice. But the log-likelihood function contains the term ln|A|, where A = (I – ρW). Since A is asymmetric and usually not sparse, finding the eigenvalues is computationally burdensome for large N. And, for more than two endogenous Wy variables, the likelihood function is intractable. • OLS: Ordinary Least Squares. The basic assumption of OLS is that all r.h.s. variables are independent of (uncorrelated with) the error term ε. If not, all coefficient estimates (ρ and β) are biased and inconsistent. Here, y is by definition a function of ε, so Wy is also a function of ε. That is, Cov(Wy, ε) ≠ 0. Wy is thus an endogenous regressor. • IV: Instrumental Variables. Provides a way to obtain consistent parameter estimates for models with endogenous variables. 2SLS is an IV estimation procedure. Can deal with large samples and multiple endogenous variables. DEF2 uses IV estimation procedures.
An “intuitive” view of the IV regression approach. OLS model: y = α + ρWy + ε. [Path diagram with nodes ε, Z, Wy, y] Z is an instrument for Wy if Cov(Z, ε) = 0 (Z is valid) and Cov(Z, Wy) ≠ 0 (Z is relevant). So, need to find an additional variable(s) Z that is correlated with Wy but uncorrelated with ε to serve as an instrument for Wy.
An “intuitive” view of the 2SLS IV estimation procedure. Consider again the network effects model y = α + ρWy + Xβ + ε. Suppose we use WX, the lagged values of X, as an instrument for Wy. Step 1. Using OLS, estimate Wy = a + WXc + υ. Save the predicted scores ŷw = â + WXĉ. Step 2. Again using OLS, estimate y = α + ρŷw + Xβ + ε. (Note: the reported standard errors from step 2 are incorrect. Not an issue for the 1-step procedures used in all the usual software packages.)
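The two steps can be sketched on simulated data. This is a minimal Python illustration, not DEF2 code. Everything about the data is an assumption made for the example: the ring-shaped W, the parameter values, and the small hand-written OLS solver. One deliberate difference from the intuitive description is that the first stage here also includes the exogenous X alongside WX, which is the usual 2SLS practice.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """OLS with an intercept; returns [intercept, slope_1, slope_2, ...]."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)

def fitted(coef, columns):
    """Predicted values from an intercept-first coefficient list."""
    n = len(columns[0])
    return [coef[0] + sum(c * col[i] for c, col in zip(coef[1:], columns))
            for i in range(n)]

def lag(w, v):
    """Premultiply a vector by W."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

# Simulated data (all values hypothetical): N units on a ring, each tied to two neighbors
random.seed(1)
n, alpha, rho, beta = 100, 1.0, 0.5, 2.0
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    w[i][(i - 1) % n] = 0.5
    w[i][(i + 1) % n] = 0.5
x = [random.gauss(0, 1) for _ in range(n)]
eps = [random.gauss(0, 0.5) for _ in range(n)]

# Reduced form of the model: y = (I - rho W)^(-1) (alpha + x beta + eps)
a = [[(1.0 if i == j else 0.0) - rho * w[i][j] for j in range(n)] for i in range(n)]
y = solve(a, [alpha + beta * xi + ei for xi, ei in zip(x, eps)])

wy, wx = lag(w, y), lag(w, x)

# Step 1: regress Wy on the instrument WX (and x); keep the predicted scores
wy_hat = fitted(ols([wx, x], wy), [wx, x])

# Step 2: regress y on the predicted Wy and x
_, rho_hat, beta_hat = ols([wy_hat, x], y)
print("rho_hat:", rho_hat, "beta_hat:", beta_hat)
```

The sketch returns point estimates only. As the note above says, standard errors taken from the second regression would be incorrect, so a real analysis should rely on a one-step 2SLS routine.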
2SLS Estimation of the network autocorrelation effects regression model with IVs: general case y = α + Xβ + ε
Where to get appropriate instruments? • Usually, it’s hard to find additional variables that meet the conditions required. Variables that affect the endogenous variable(s) are often also likely to affect the dependent variable. • Kelejian and Prucha (1998) show that the set of {WX, W²X, W³X, …} variables are optimal as instruments for Wy, where W², W³, … are the 2-step and 3-step connections between sample units. In practice, the WX variables or some subset of them will usually be sufficient.
Evaluating the quality of the instrumental variables. The quality of 2SLS estimators depends on the quality of the IVs. Require that: • Cov(Z, ε) = 0. IVs must be valid. IV estimation is vulnerable on this point. Tests are available only if there are more instruments than endogenous variables (overidentification). • IVs also need to be relevant, i.e., they should predict endogenous variables independently of other exogenous variables. Shea (1997) proposed a partial R² measure of instrument relevance for multiple endogenous variable models. • Only marginal association between the endogenous variable(s) and Z is known as the “weak” instruments problem. Some diagnostics are available. • No perfect collinearity between all exogenous variables.
Overidentification tests • If there is more than 1 instrumental variable available for Wy, can test the null hypothesis that the instruments are uncorrelated with the errors; rejection indicates that at least one of them is correlated with the errors. • Sargan (1958) is the best known test: Ts = N·R²u ~ χ² (with df = #IVs - #endogenous variables), where R²u is the R² of the OLS regression of the 2SLS residuals on the IVs. • Basmann (1960) provides an alternate, though similar, test. • Kirby and Bollen (2009) discuss additional variants of Sargan and Basmann in the context of SEM.
“Weak” Instruments • Bound et al (1995) show that when the instruments are only weakly correlated with the endogenous variables IV estimates are biased in the same direction as OLS estimates, and may be more biased than OLS. In addition, weak IV regression estimates may not be consistent. • Staiger and Stock (1997) suggest that the partial F-statistic from the increase in the regression R2 after adding the auxiliary instruments to the exogenous variables in the first stage regression should be greater than 10. • Stock and Yogo (2005) provide tables that give some guidance as to how much greater than 10 the F-statistic may have to be.
Example: Monogamy in the Pre-industrial World Multiple proposed determinants of the long-term historical shift in marriage preference from polygynous to monogamous unions are tested using data from the Standard Cross-Cultural Sample.
W matrices employed • Geographical distance: the WD matrix is described in Dow and Eff (2009), where cij = (1/dij)². Use only the nearest 20 societies. • Language similarity: the WL matrix is described in Eff (2008), where cij = e-score(ij). If the Ws are collinear, can combine them into a single matrix: WDL = πD·WD + πL·WL, where 0 ≤ πD, πL ≤ 1 and πD + πL = 1. Then, run all combinations of WDL and select as “best” the matrix that maximizes R² (IV). Also obtain information on the weights that yield the “best” combined W.
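A sketch of the combination step in Python, for illustration only. The two small matrices are invented stand-ins for WD and WL, and the grid step of 0.1 is an assumption. The selection step is left out, since it would require refitting the 2SLS model once per candidate.

```python
def combine(w_d, w_l, pi_d):
    """Convex combination pi_d * WD + (1 - pi_d) * WL of two row-normalized matrices."""
    pi_l = 1.0 - pi_d
    return [[pi_d * d + pi_l * l for d, l in zip(row_d, row_l)]
            for row_d, row_l in zip(w_d, w_l)]

# Hypothetical row-normalized distance and language matrices for three societies
w_d = [[0.0, 0.5, 0.5],
       [0.5, 0.0, 0.5],
       [0.5, 0.5, 0.0]]
w_l = [[0.0, 1.0, 0.0],
       [0.8, 0.0, 0.2],
       [0.0, 1.0, 0.0]]

# Candidate weights pi_D = 0.0, 0.1, ..., 1.0 (grid step is an assumption)
grid = [k / 10 for k in range(11)]
candidates = {pi_d: combine(w_d, w_l, pi_d) for pi_d in grid}

for pi_d, w_dl in candidates.items():
    row_sums = [round(sum(row), 10) for row in w_dl]
    print(pi_d, row_sums)
```

Because the two weights sum to one, every combined matrix is itself row-normalized, so Wy stays a weighted average on the scale of y whichever candidate is chosen.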
2SLS estimation of network autocorrelation regression model using composite distance/language W matrix. Dependent variable is a Box-Cox transform of the percentage of married females in monogamous marriages [(monofem^λ – 1)/λ]
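For reference, the Box-Cox transform can be written as a short Python function. This is an illustrative sketch. The value of λ used in the example is not given in the source, so the numbers below are arbitrary, and requiring strictly positive input is the standard convention rather than something stated in the slides.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform: (y**lam - 1) / lam, with log(y) as the limit at lam = 0."""
    if y <= 0:
        # Standard Box-Cox is defined for positive values only
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

# Arbitrary example values (lambda here is hypothetical)
print(box_cox(4.0, 0.5))
print(box_cox(50.0, 1.0))
print(box_cox(math.e, 0))
```

With λ = 1 the transform only shifts the variable by one unit, while smaller values of λ compress the upper tail more strongly.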
Summary: • DEF2 is a new statistical package designed for cross-cultural and cross-national data sets. • Given the ubiquity of missing data in such data sets, DEF2 includes a suite of programs for multiple imputation of missing data. • Given that sample units in comparative data sets are non-independent due to various processes of cultural trait diffusion, DEF2 includes a suite of programs to implement network autocorrelation effects models. • Available ??? Where and How, Anthon and Doug.