Kathryn Sharpe & Wei Zhu

Structural Equation Modeling (SEM) – Some Basic Concepts

SEM Basics • SEM is a statistical technique for testing and estimating causal relationships, first proposed in 1921 by the American geneticist Dr. Sewall Green Wright (1889-1988).

SEM Basics
• SEM is a set of usually inter-related linear regression equations.
• SEM without latent variables is called Path Analysis.
• SEM is a confirmatory analysis procedure, although it can sometimes also be used as an exploratory analysis tool.

A simple example: Path Diagrams & Equations for Eating Disorder
Directional arrows indicate cause and effect. Every variable with an incoming arrow leads to a regression equation, and together these equations form our regression equation system. [Path diagram and equations not reproduced.]
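The equation system itself did not survive on this slide. Reading it off the LINEQS statement in the PROC CALIS code used later in this deck, it would presumably be the following. The coefficient names b1 to b7 and the error terms E1 to E4 are taken from that code, and the variable abbreviations follow the results slide (AM, BI, SW, DT, RD):

```latex
\begin{align*}
\mathrm{BI} &= b_1\,\mathrm{AM} + b_2\,\mathrm{SW} + E_1 \\
\mathrm{SW} &= b_3\,\mathrm{AM} + b_4\,\mathrm{BI} + E_2 \\
\mathrm{DT} &= b_5\,\mathrm{BI} + b_6\,\mathrm{SW} + E_3 \\
\mathrm{RD} &= b_7\,\mathrm{DT} + E_4
\end{align*}
```

Each left-hand variable is one with an incoming arrow in the path diagram. AM appears only on the right-hand side, so it is the only exogenous observed variable.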

SEM Programs
The most popular software packages for SEM are:
• LISREL (Karl Gustav Jöreskog & Dag Sörbom)
• EQS (Peter Bentler, UCLA)
• AMOS
• PROC CALIS (and PROC TCALIS) in SAS
For this example, we will use PROC CALIS. It takes our linear equations (previous slide) and estimates the parameters for the model. Then it evaluates the goodness of fit of the model.

Dr. Karl Gustav Jöreskog (Sweden) & Dr. Peter M. Bentler (USA): widely viewed as leaders in SEM development in our times.

SAS Code
    /* SAS correlation procedure. NOCORR suppresses the Pearson correlations;
       OUTP= specifies the output dataset, here with type covariance matrix. */
    proc corr cov nocorr data=eddata outp=edcova(type=cov);
    run;
    /* PROC CALIS: use cov rather than corr. */
    proc calis cov mod data=edcova;
       /* Give the linear equations describing the system. */
       Lineqs
          bi = b1 am + b2 sw + E1,
          sw = b3 am + b4 bi + E2,
          dt = b5 bi + b6 sw + E3,
          rd = b7 dt + E4;
       /* Give the variances of the exogenous variables, here the error variances.
          Each error variance must be set equal to a named parameter;
          otherwise it is assumed to be 0. */
       Std E1-E4 = The1-The4;
       /* BI and SW are correlated, so we must estimate the covariance
          of their error terms. */
       Cov E1 E2 = Ps1;
    run;

Results of SAS Analysis
SAS gives us parameter estimates, error estimates, and t-values for each path included in the model. We use a t-test to determine which paths are significant. In addition, we can calculate the confidence intervals.
[Path diagram with nodes Age of Menstruation (AM), Body Image (BI), Adolescent Self Worth (SW), Drive for Thinness (DT), and Risk for Disorder (RD). Path estimates with confidence intervals, in the order extracted from the diagram: -.1232 (-.77, .52); -1.9912 (-4.02, .04); -.0525 (-2.02, 1.92); -.1040 (-.27, .06); -.2699 (-1.33, .79); .3341 (.04, .63); .8292 (.72, .94).]
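As a small illustration of reading significance off the confidence intervals, the sketch below checks which of the intervals reported on this slide exclude zero. The slide text does not say which path each estimate belongs to, so the estimates are listed only in the order they appear; the function name `excludes_zero` is introduced here purely for illustration.

```python
# Path estimates and confidence intervals as printed on the slide.
# Which path each estimate belongs to is not recoverable from the text,
# so they are kept in slide order only (an assumption of this sketch).
estimates = [
    (-0.1232, (-0.77, 0.52)),
    (-1.9912, (-4.02, 0.04)),
    (-0.0525, (-2.02, 1.92)),
    (-0.1040, (-0.27, 0.06)),
    (-0.2699, (-1.33, 0.79)),
    (0.3341, (0.04, 0.63)),
    (0.8292, (0.72, 0.94)),
]


def excludes_zero(interval):
    """Return True when the interval lies entirely on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


# Estimates whose confidence interval does not contain zero.
significant = [est for est, ci in estimates if excludes_zero(ci)]
print(significant)
```

Only the estimates whose interval stays on one side of zero are kept, which mirrors the conclusion a two-sided t-test at the matching level would reach.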

Goodness of Fit
After reporting the parameter estimates, SAS reports many different measures of fit, so we can evaluate the model in any way we choose. The more measures we use to evaluate our model, the better.
SAS Output: [not reproduced]
By convention, a model is “good” if:
• GFI > .90/.95,
• the Chi-Square value is small,
• the p-value is large,
• the RMSEA estimate is close to zero.
A good fit does not necessarily mean a perfect model. We can still have unnecessary variables or be missing important ones.

Useful Websites
Google and Wikipedia do a good job of finding and summarizing many topics, including SEM. Type “structural equation modeling” into Google and you will see the SEM wiki site listed as the first item: http://en.wikipedia.org/wiki/Structural_equation_modeling
Among the recommended sites towards the end of the SEM wiki page, you will find further useful links such as:
• A good website for SEM lecture notes: http://faculty.chass.ncsu.edu/garson/PA765/structur.htm
• LISREL: http://www.ssicentral.com/lisrel/
• EQS: http://www.mvsoft.com/
• MPLUS: http://www.statmodel.com/
• GLLAMM: http://www.gllamm.org/
• SEM AFNI (brain functional pathway analysis): http://afni.nimh.nih.gov/sscc/gangc/PathAna.html
• SAS Proc TCALIS: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_tcalis_sect087.htm
• The UCLA SAS Web: http://www.ats.ucla.edu/stat/sas/faq/path_analysis.htm

Part II: PCNA and Bootstrap Resampling 1. Partial Correlation Network Analysis

PCNA: Generating a Path Diagram When there is no hypothesized diagram for an SEM analysis, we can generate a path diagram using partial correlation network analysis. In 2006, Marrelec discussed the concept of detecting an underlying connectivity network in data, and the methods for analysis. He noted the importance of detecting such a network without the hypothesized relationships that SEM requires. In 2007, Marrelec et al. published a work praising the use of Partial Correlation Network Analysis (PCNA) in conjunction with SEM.
Partial correlation analysis is a technique that allows us to investigate the relationship between two variables free of influence from other variables. Consider two variables, X and Y. We want to know the correlation of X and Y while controlling for Z. The most intuitive way to understand partial correlation is to consider two regressions: regress X on Z and Y on Z, then correlate the two sets of residuals.
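The two-regression view of partial correlation can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the small data set are made up for the example, and each regression is a simple linear regression with an intercept.

```python
import math


def mean(values):
    return sum(values) / len(values)


def residuals(y, z):
    """Residuals from the simple linear regression of y on z (with intercept)."""
    my, mz = mean(y), mean(z)
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = szy / szz
    intercept = my - slope * mz
    return [yi - (intercept + slope * zi) for yi, zi in zip(y, z)]


def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of X and Y after removing the linear influence of Z from both."""
    return pearson(residuals(x, z), residuals(y, z))


# Made-up illustrative data (not from the slides).
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 7, 5, 8, 6]
y = [1, 3, 2, 5, 4, 6, 8, 9]

print("plain correlation:  ", round(pearson(x, y), 3))
print("partial correlation:", round(partial_corr(x, y, z), 3))
```

Comparing the two printed values shows how much of the raw X-Y correlation is carried by their shared dependence on Z.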

Our PCNA Bootstrap Methodology We have N variables, and we want to know which pairs have significant relationships when controlling for all other variables in the system. Additionally, we want to know which pairs’ relationships are changed by, for example, the disease state of the measured tissue. For each pair of variables, i and j, we regress the two variables individually on all other variables in the system and calculate the corresponding residuals. This creates two residual variables, representing the original variables free of the influence of all other variables in the system. The correlation of these two residual variables is the partial correlation of variables i and j. However, this is just one number, so we cannot incorporate the influence of covariates into the significance test of this value. This is why we use a bootstrapping procedure.
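The pairwise procedure described above can be sketched for a handful of variables as follows. This is a minimal illustration under stated assumptions: it fits each regression by ordinary least squares through the normal equations, solved with a small Gaussian elimination so that only the standard library is needed. All function names and the data are hypothetical, and a numerical library would normally replace the hand-written solver.

```python
import math


def solve(a, b):
    """Solve the square linear system a * coef = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def ols_residuals(y, predictors):
    """Residuals of y regressed on the predictor columns plus an intercept."""
    rows = [[1.0] + [col[k] for col in predictors] for k in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yk for r, yk in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    return [yk - sum(c * v for c, v in zip(coef, r)) for r, yk in zip(rows, y)]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr_matrix(columns):
    """Partial correlation of every pair (i, j), controlling for all other columns."""
    n = len(columns)
    result = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            others = [columns[k] for k in range(n) if k not in (i, j)]
            r = pearson(ols_residuals(columns[i], others),
                        ols_residuals(columns[j], others))
            result[i][j] = result[j][i] = r
    return result


# Made-up data: four variables measured on eight subjects.
data = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 4, 3, 7, 5, 8, 6],
    [1, 3, 2, 5, 4, 6, 8, 9],
    [5, 3, 6, 2, 7, 1, 8, 4],
]
for row in partial_corr_matrix(data):
    print([round(v, 3) for v in row])
```

The result is a symmetric matrix with one partial correlation per pair, which is the single number per pair that the bootstrap then turns into a distribution.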

Part II: PCNA and Bootstrap Resampling 2. Bootstrap Resampling

Bootstrap Resampling The idea behind bootstrapping is resampling with replacement. We have our original sample of m subjects (1, 2, 3, …, m). Select one of them at random, and then replace it before randomly selecting the next. Repeat this m times. Now we have a resample of m subjects, all drawn from the original sample. However, some subjects may be repeated, and some subjects from the original sample may not be present in our resample. Use each resample to calculate the partial correlation. After n resamples, we have a population of n partial correlation estimates for each pair of variables. If we perform this analysis with n = 1000 on each of our two datasets individually, we will have 1000 estimates of partial correlation for the normal tissue and 1000 estimates for the diseased tissue.
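The resampling loop can be sketched as below. To keep the example short it uses a single control variable and the standard first-order partial correlation formula rather than the residual regressions; for one control variable the two give the same value. The function names, the seed, the choice to redraw degenerate resamples, and the data are all assumptions of this sketch.

```python
import math
import random


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def bootstrap_partial_corr(x, y, z, n_resamples=1000, seed=0):
    """Return n_resamples partial correlation estimates from resamples of the subjects."""
    rng = random.Random(seed)
    m = len(x)
    estimates = []
    while len(estimates) < n_resamples:
        # Draw m subjects with replacement; whole subjects are resampled,
        # so x, y and z stay aligned.
        idx = [rng.randrange(m) for _ in range(m)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        zs = [z[i] for i in idx]
        try:
            estimates.append(partial_corr(xs, ys, zs))
        except (ZeroDivisionError, ValueError):
            # Degenerate resample (for example one subject drawn m times):
            # this sketch simply draws again.
            continue
    return estimates


# Made-up data for twelve subjects.
z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x = [2, 1, 4, 3, 7, 5, 8, 6, 9, 12, 10, 11]
y = [1, 3, 2, 5, 4, 6, 8, 9, 7, 10, 12, 11]

estimates = bootstrap_partial_corr(x, y, z, n_resamples=1000, seed=42)
print(len(estimates), round(min(estimates), 3), round(max(estimates), 3))
```

Running the same function once on the normal-tissue data and once on the diseased-tissue data would give the two lists of 1000 estimates described on the slide.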

Bootstrap Resampling The results we must evaluate are two lists of partial correlations (those for the normal tissue, and those for the diseased tissue). We will let the significance of the relationships in the normal dataset represent the general significance of partial correlation among variables in the system. We can create a difference variable to estimate the difference in partial correlation between the normal tissue and the diseased tissue. The significance of the differences represents the influence of disease on the partial correlation between variables. Sort the normal and difference variables. If 0 is contained in the middle 95% of the observations, then we would say that the relationship (for the normal variable) or the influence of disease (for the difference variable) is not significant for this pair of variables. This is called the percentile method.
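A minimal sketch of the percentile method for one pair of variables follows. Several conventions exist for picking the cut-off observations; this one simply trims the same number of sorted observations from each tail. Pairing the k-th normal estimate with the k-th diseased estimate and taking normal minus diseased as the difference are assumptions of this sketch, as are all function names.

```python
def percentile_interval(values, level=0.95):
    """Bounds of the middle `level` share of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of observations trimmed from each tail; the small epsilon
    # guards against floating-point results just below a whole number.
    k = int(n * (1 - level) / 2 + 1e-9)
    return ordered[k], ordered[n - 1 - k]


def is_significant(values, level=0.95):
    """True when 0 falls outside the middle `level` share of the values."""
    low, high = percentile_interval(values, level)
    return not (low <= 0 <= high)


def compare_groups(normal, diseased, level=0.95):
    """Evaluate the relationship (normal list) and the disease effect (differences)."""
    differences = [a - b for a, b in zip(normal, diseased)]
    return {
        "relationship_significant": is_significant(normal, level),
        "disease_effect_significant": is_significant(differences, level),
    }


# Made-up bootstrap estimates standing in for the two lists of partial correlations.
normal = [0.30 + 0.001 * i for i in range(100)]
diseased = [0.35 - 0.001 * i for i in range(100)]
print(percentile_interval(normal))
print(compare_groups(normal, diseased))
```

With 1000 bootstrap estimates and the default level, this keeps the 26th through 975th sorted observations as the middle 95%.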

Results The results of the PCNA bootstrap for the brain data example (four datasets; covariates: drug, group) are shown at the left [figure not reproduced]. No arrows! At this point, we would ask the collaborating researcher for input on the directionality of each path. For paths whose direction is not easily determined, we can implement one path in each direction. The result would be a set of hypothesized relationships that can be verified using structural equation modeling with an independent data set.
