This course provides a comprehensive introduction to hierarchical models (HLM) with a focus on their application in political science, particularly in American politics. Participants will engage in theoretical discussions guided by key texts such as Gelman and Hill, as well as practical exercises using R for statistical computing. Emphasis will be placed on understanding multilevel data, the ecological fallacy, and how relationships differ across aggregation levels. Assignments will include practical homework and original research papers, fostering a deeper grasp of HLM's utility in analyzing complex social phenomena.
POLS 606 Hierarchical Models
Intro • Who are you? • Fields • Substantive interests? • I am Dave • American politics • Campaigns and elections
Logistics • Book—Gelman and Hill • Snijders and Bosker is recommended. • Chapter 2 of G&H is up to you to master • G&H don’t rely on matrix algebra to teach • Probability and simulation • Bayesian • R • Not easier or harder, just different.
Lectures • There will be a mix • Math (no PowerPoint) • There will be lectures using R • Will tend to alternate
R • Very powerful/flexible • Wave of the future • Need to understand stuff more • I have never used it before so we will learn it together
Grades • Homework • There will be a bunch and they will be a mix of practical and theory • Final • Questions I would write for the methods exam • Paper • Original research using HLM. Won’t be due until start of Fall Semester
What is a multilevel model? • Theory tells you that concepts at more than one level of aggregation are related • Usually thought of as geographic • Countries • States • Schools
What is a multilevel model? • Doesn’t have to be • Time • Experimental Condition • Institutions • Regime • Bureaucracies • Individuals (panel data)
Theory is key! • Two types of relationships • Random intercepts • Mean value of DV depends on aggregate unit • Random slopes • Effect of IV depends on aggregate unit • Can have both
So you have multilevel data • Choice 1: Aggregate • Combine data to the highest level of aggregation • Create “average” value of variables for each higher unit • Advantage • Easy! • Can easily weight based on N • Straightforward
Aggregate • Disadvantages • Shifts meaning • Variables are macro level. Theory is (presumably) micro level. • Ecological fallacy • Classic example: Race and literacy (Robinson 1950)
Why different? • The key is the within region correlations
Both individual and ecological correlation depend on this, but in different ways • Individual depends on the internal cells of the region table • Weighted average of the corrs within the regions • Ecological depends only on the marginals • Only the marginals; no use of info in the cells.
So? • The things that go into the calculation of the ecological correlation do not tell us anything about what we are interested in.
Some Math • Assume • Total group of N persons • Two variables x & y • N people divided into m groups • X & Y are % of x & y in each of the m groups
Three correlations • 1) Total individual correlation (r) • Correlation ignoring the grouping • 2) Ecological correlation (re) • Correlation between m pairs (weighted by n of m) • 3) Within-area correlation (rw) • This is the weighted average of the correlations within the m groups
Two correlation ratios • ηXA & ηYA • Measure the degree of clustering of X&Y by area • High ηXA means wide variation in X across regions
Math • Can write the relationship between the correlations as:
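In the notation of the preceding slides, the relationship can be written as below. This is a reconstruction from those definitions, assuming the slide follows the decomposition in Robinson (1950): the total correlation splits into a within-area part and a between-area part, and solving for re gives the "weighted difference" form.

```latex
r = r_w \sqrt{1-\eta_{XA}^{2}}\,\sqrt{1-\eta_{YA}^{2}} \;+\; r_e\,\eta_{XA}\,\eta_{YA}
\qquad\Longrightarrow\qquad
r_e = \frac{1}{\eta_{XA}\,\eta_{YA}}\; r \;-\; \frac{\sqrt{\left(1-\eta_{XA}^{2}\right)\left(1-\eta_{YA}^{2}\right)}}{\eta_{XA}\,\eta_{YA}}\; r_w
```

Because the correlation ratios are at most 1, the weight on r is at least 1, which is where the inflation of re comes from.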
So? • re, then, is the weighted difference between individual correlation (thing we care about) and the average of m within area correlations where weights depend on clustering • Bias is not innocuous. Correlations are inflated. re will usually be larger in magnitude than r. • Cannot infer across levels. Don’t do it. Won’t get away with it.
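A small simulation can make the gap between the two correlations concrete. The sketch below uses made-up continuous data (not Robinson's percentages), with hypothetical settings for the number of areas and people per area, and checks numerically that r = rw·sqrt(1 − ηX²)·sqrt(1 − ηY²) + re·ηX·ηY.

```r
# Simulated data only; nothing here comes from Robinson (1950).
set.seed(606)
m     <- 10                       # number of areas (hypothetical)
n_per <- 200                      # people per area (hypothetical)
area  <- rep(seq_len(m), each = n_per)

# Area-level components are strongly related; person-level ones only weakly.
area_x <- rnorm(m)
area_y <- 0.9 * area_x + rnorm(m, sd = 0.3)
dev_x  <- rnorm(m * n_per)
x <- area_x[area] + dev_x
y <- area_y[area] + 0.1 * dev_x + rnorm(m * n_per)

# Area means repeated for every person, so each area is weighted by its size.
xbar <- ave(x, area)
ybar <- ave(y, area)

r   <- cor(x, y)                  # total individual correlation
r_e <- cor(xbar, ybar)            # ecological correlation (weighted by n)
r_w <- cor(x - xbar, y - ybar)    # pooled within-area correlation

# Correlation ratios: share of total variation that lies between areas.
eta_x <- sqrt(sum((xbar - mean(x))^2) / sum((x - mean(x))^2))
eta_y <- sqrt(sum((ybar - mean(y))^2) / sum((y - mean(y))^2))

rebuilt <- r_w * sqrt(1 - eta_x^2) * sqrt(1 - eta_y^2) + r_e * eta_x * eta_y
print(c(r = r, r_e = r_e, r_w = r_w, rebuilt = rebuilt))
```

The `rebuilt` value should match `r` up to rounding, while `r_e` is driven by the area-level components alone.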
Disaggregate • You could ignore the higher level of aggregation and pretend everything is observed at the individual level • Advantage: Easy and generous • Disadvantage • You are lying. • Overstate power • Ignores correlations in the errors (not iid)
Dependence of errors • The problem is a function of the intraclass correlation • Simple model: Yij = μ + Uj + Rij • Y is the DV • μ = Grand Mean • Uj = Macro effects (errors) • Rij = Unit specific errors
Intraclass correlation • Errors all have mean 0 • Expected value of Y in macro unit j is μ + Uj
Intraclass correlation • It is the proportion of the variance in Y explained by the macro effects • The key concept in HLM • It is the degree of similarity of observations within the groups
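Written out, the simple model and the intraclass correlation look like this. The variance symbols τ² and σ² and the name ρ_I are assumptions borrowed from common multilevel notation; they do not appear on the slides.

```latex
Y_{ij} = \mu + U_j + R_{ij}, \qquad
\operatorname{var}(U_j) = \tau^{2}, \quad \operatorname{var}(R_{ij}) = \sigma^{2}
\qquad\Longrightarrow\qquad
\rho_I = \frac{\tau^{2}}{\tau^{2} + \sigma^{2}}
```

With U and R independent, ρ_I is also the correlation between two different observations drawn from the same macro unit.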
Intraclass Correlation • Note that it changes the error variance • OLS assumes that errors are uncorrelated across observations. • This says they aren’t. • Inflates power • Shrinks standard errors • Macro variables will try to account for this
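To see how much the standard errors can shrink, here is a sketch on simulated balanced data. The group count, group size, and variances are hypothetical choices; the ICC estimate uses the one-way ANOVA method of moments rather than a fitted multilevel model.

```r
# Simulated clustered data: Y_ij = mu + U_j + R_ij
set.seed(606)
J <- 50; n <- 20                  # groups and observations per group (hypothetical)
tau <- 1; sigma <- 2              # true ICC = 1 / (1 + 4) = 0.2
group <- rep(seq_len(J), each = n)
U <- rnorm(J, sd = tau)
y <- 5 + U[group] + rnorm(J * n, sd = sigma)

# Method-of-moments ICC for a balanced design.
group_means <- tapply(y, group, mean)
ms_between  <- n * var(group_means)          # between-group mean square
ms_within   <- mean(tapply(y, group, var))   # within-group mean square
tau2_hat    <- (ms_between - ms_within) / n
icc_hat     <- tau2_hat / (tau2_hat + ms_within)

# Standard error of the grand mean: pretending iid vs. respecting the groups.
se_naive   <- sd(y) / sqrt(J * n)
se_cluster <- sd(group_means) / sqrt(J)

print(c(icc_hat = icc_hat, se_naive = se_naive, se_cluster = se_cluster))
```

The naive standard error treats all J × n observations as independent, which is the "overstated power" problem from the disaggregation slide.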
Other solutions to multilevel data • Dummy variables • Doesn’t fix standard errors • Can’t specify interesting effects • Clustering • Fixes errors but not all other problems. • Ignores any systematic problems and the theories associated
Real Solution? HLM • Effects may vary (random slopes) • Use all of the info available and use it accurately • Better predictions • Account for structure in data • Efficiency • Accurate standard errors
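As a preview of what random intercepts and random slopes look like in practice, here is a minimal sketch using lme4. It assumes lme4 is installed and uses the sleepstudy data that ships with that package, which is not a dataset from this course.

```r
library(lme4)                         # assumed to be installed
data(sleepstudy, package = "lme4")    # Reaction, Days, Subject

# Random intercepts: each Subject gets its own baseline Reaction.
fit_int <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random intercepts and slopes: the effect of Days also varies by Subject.
fit_slope <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

fixef(fit_slope)                  # average intercept and slope
VarCorr(fit_slope)                # how much intercepts and slopes vary
head(coef(fit_slope)$Subject)     # subject-specific intercepts and slopes
```

The part of the formula in parentheses names what varies (left of the bar) and the grouping unit (right of the bar).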
How HLM? R! • R is a different kind of stats package • It is a language, not a program • Open source • http://cran.r-project.org/ • Problem is that it is not obviously user-friendly • No point and click front end embedded. • This can be addressed—R is adaptable
R • The computer staff tells me it is installed and they will install it on your office machines • Update by adding packages • Rcmdr – gui interface • arm • BRugs • R2WinBUGS • car • foreign • DAAG • Matrix and lme4 if not automatically
Packages • Packages are commands or sets of programs to do things. • sessionInfo() tells you which are currently attached • library("name")
R • Need to load packages each time • The basic starting place for R is the command Prompt (>) • R will take anything you type at this line as a command and will respond • Load packages as library(arm) • Can (and probably should) write a script to do it all
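A start-up script along these lines is one way to load everything in one go. The sketch uses foreign and Matrix only because they usually ship with R; swap in the course packages (arm, lme4, and so on) once they are installed.

```r
# Packages to attach at the start of a session (edit this list as needed).
pkgs <- c("foreign", "Matrix")

for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    library(p, character.only = TRUE)   # character.only: p holds the name as a string
  } else {
    message("Package not installed: ", p)
  }
}

search()        # what is attached, in the order R looks things up
sessionInfo()   # R version plus attached and loaded packages
```

Save it as a script and run it at the top of each session.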
R • If you just start typing stuff, R assumes you are telling it to evaluate a statement • 2+2 • pi • Any math equation. • R wants you to define “objects” • Everything needs to be an object
Commands • Basic format • "object" <- "command"("definition", option, option) • Example: open data • kidiq <- read.dta(file="c:/R/kidiq.dta") • reads the children's IQ score data used in Chapter 2 • "kidiq" names the object kidiq • "<-" tells R that you are going to give it a definition • "read.dta" is the command to read Stata data (it lives in the foreign package) • (file="c:/R/kidiq.dta") tells it which data. Note / not \
R • Random things about objects • Case sensitive • Can (and often do) have . in the name • Will remember that they are there • Can see objects by ls() command • <- defines (equivalent to =) • Look at example 1
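A few lines typed at the prompt illustrate these points. The object names are made up for the example.

```r
2 + 2                  # evaluated and printed, but not kept
pi                     # a built-in constant

x <- 2 + 2             # now the result is stored in an object
my.value = x * pi      # "=" also assigns; dots are fine in names
My.Value <- "text"     # different object: names are case sensitive

ls()                   # lists the objects defined so far
```

Nothing is printed when a result is assigned; type the object's name to see it.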
R working directory and workspace • Each session has a working directory • Where R looks for files • If launched from windows icon can define under properties (right click) • getwd() • ls() • q() • Save workspace image? • Saves all objects in a .RData file
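The sketch below shows the same idea on a small scale: save an object to a file, remove it, and load it back. It writes to a temporary directory (an assumption made so that nothing in your working directory is touched); save.image() does the same for every object at once and writes .RData in the working directory by default.

```r
getwd()                                   # where R currently looks for files

scores <- c(90, 100, 110)                 # a hypothetical object
rdata_file <- file.path(tempdir(), "example.RData")

save(scores, file = rdata_file)           # write just this object
rm(scores)                                # remove it from the workspace
load(rdata_file)                          # bring it back under the same name
scores
```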
Help! • help.start() • help("name")
Script • Can do line by line commands, but those are slow, temporary and error prone • better to use script editor: • File->new script • control+N • Can save and re-load
Missing data • R handles missing data • uses “NA” • Will read in data and convert just fine
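A short sketch of how NA behaves, using a made-up vector and a tiny inline CSV with blank fields:

```r
v <- c(1, NA, 3)
is.na(v)                 # FALSE TRUE FALSE
mean(v)                  # NA: missing values propagate by default
mean(v, na.rm = TRUE)    # 2: drop the NA before averaging

# Blank numeric fields are converted to NA when data are read in.
d <- read.csv(text = "kid,mom\n90,100\n,110\n85,")
d
colSums(is.na(d))        # number of missing values per column
```

Many summary functions take na.rm = TRUE, so "handles" here means R tracks missingness, not that it silently ignores it.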
Reading in data • kidiq <- read.dta(file="c:/R/kidiq.dta") • We have seen this before • Attaching: • in commands you need to tell R which data you are using (in fact, you can have lots of data sets loaded at once). • fit<-lm(kid.score~mom.hs, data=kidiq) • The command is attach • attach(kidiq) • fit<-lm(kid.score~mom.hs) • detach(kidiq)
Attach • R looks for things in a particular order • search() • Attach moves stuff around in the order • Order matters a lot—names of objects versus names of variables
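Here is a sketch of why the order matters, using a two-row made-up data frame. An object in the workspace with the same name as a variable in an attached data frame wins, because the workspace comes first in the search path.

```r
d <- data.frame(kid.score = c(90, 100), mom.hs = c(0, 1))   # hypothetical data
kid.score <- "I am a workspace object"                      # same name as a column

attach(d)        # R reports that kid.score is masked by the workspace
search()         # .GlobalEnv comes before d
kid.score        # the workspace object, not the column
detach(d)

# Safer alternatives that name the data explicitly:
with(d, mean(kid.score))
lm(kid.score ~ mom.hs, data = d)
```

Passing data = to the modeling function avoids the ambiguity altogether.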
Rcmdr • Handy front end, point and click • library(Rcmdr) • Has a script window • Nice, but don’t lean on it too hard • Thinks it is smarter than you
JGR (“Jaguar”) • Need to download and install it • Probably need computer staff for machines • Launches separate from R • Package manager is very nice • Runs separate version of R
Graphics • R has wonderful graphics if you can get them to do what you want. • demo(graphics) • Starting point is plot() • plot(y~x) • plot(x, y) • graphs.R
Regression • You should know the basics and this should be review • Data being used: kidiq (same as before)
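The two plot() forms draw the same scatterplot. The sketch below uses simulated x and y (not course data) and adds labels and a fitted line.

```r
set.seed(606)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

plot(y ~ x)                          # formula interface: response ~ predictor
plot(x, y,                           # same picture, x first and then y
     main = "Simulated example",
     xlab = "x", ylab = "y")
abline(lm(y ~ x))                    # add the least-squares line to the current plot
```

When run as a script rather than at the prompt, R sends the plots to a file instead of a window.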
Regression • kid.score = a + b(mom.hs) + error
lm(formula = kid.score ~ mom.hs)
            coef.est coef.se
(Intercept) 77.55    2.06
mom.hs      11.77    2.32
n = 434, k = 2
residual sd = 19.85, R-Squared = 0.06
• Interpret • 78 = E(kid.score) if mom.hs = 0 • 12 = expected difference in kid.score between mom.hs = 1 and mom.hs = 0
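With a single 0/1 predictor, the intercept is the mean of the outcome in the 0 group and the slope is the difference between the two group means. The sketch below checks this on simulated data; the sample size and the values 78, 12, and 20 are only loosely modeled on the output above and are not the kidiq data.

```r
set.seed(606)
sim <- data.frame(mom.hs = rbinom(200, 1, 0.8))            # simulated indicator
sim$kid.score <- 78 + 12 * sim$mom.hs + rnorm(200, sd = 20)

fit <- lm(kid.score ~ mom.hs, data = sim)
group_means <- tapply(sim$kid.score, sim$mom.hs, mean)     # means for mom.hs = 0 and 1

coef(fit)                                # intercept and slope
group_means                              # compare: intercept = mean when mom.hs = 0
group_means["1"] - group_means["0"]      # compare: slope = difference in means
```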
Regression • kid.score = a + b(mom.iq) + error
lm(formula = kid.score ~ mom.iq)
            coef.est coef.se
(Intercept) 25.80    5.92
mom.iq       0.61    0.06
n = 434, k = 2
residual sd = 18.27, R-Squared = 0.20
• Interpret • 26 = E(kid.score) when mom.iq = 0 • 0.61 = expected change in kid.score for each additional point of mom.iq
Regression • Both predictors
lm(formula = kid.score ~ mom.hs + mom.iq)
            coef.est coef.se
(Intercept) 25.73    5.88
mom.hs       5.95    2.21
mom.iq       0.56    0.06
n = 434, k = 3
residual sd = 18.14, R-Squared = 0.21
• Interpret?
Interactions • Remember, sometimes the effect of a variable is conditional on another variable • In Stata you need to create the interaction; in R you can do it on the fly
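"On the fly" means the interaction is written directly in the formula. The sketch below uses simulated data with the same variable names as the kidiq examples; the coefficients in the simulation are arbitrary.

```r
set.seed(606)
sim <- data.frame(mom.hs = rbinom(300, 1, 0.8),
                  mom.iq = rnorm(300, mean = 100, sd = 15))
sim$kid.score <- 20 + 30 * sim$mom.hs + 0.8 * sim$mom.iq -
  0.3 * sim$mom.hs * sim$mom.iq + rnorm(300, sd = 18)

# a * b expands to both main effects plus their interaction ...
fit_star <- lm(kid.score ~ mom.hs * mom.iq, data = sim)

# ... which is the same model as spelling the terms out with ":".
fit_long <- lm(kid.score ~ mom.hs + mom.iq + mom.hs:mom.iq, data = sim)

coef(fit_star)
```

The mom.hs:mom.iq coefficient is the difference in the mom.iq slope between the mom.hs = 1 and mom.hs = 0 groups.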