
Analysis of Categorical Data


Presentation Transcript


  1. Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013

  2. Overview • Data Types • Contingency Tables • Logit Models • Binomial • Ordinal • Nominal

  3. Things not covered (but still fit into the topic) • Matched pairs/repeated measures • McNemar's Chi-Square • Reliability • Cohen's Kappa • ROC • Poisson (Count) models • Categorical SEM • Tetrachoric Correlation • Bernoulli Trials

  4. Data Types (Levels of Measurement): Discrete/Categorical/Qualitative vs. Continuous/Quantitative
  • Nominal/Multinomial. Properties: values arbitrary (no magnitude), no direction (no ordering). Example: Race: 1=AA, 2=Ca, 3=As. Measures: mode, relative frequency.
  • Rank Order/Ordinal. Properties: values semi-arbitrary (no magnitude?), have direction (ordering). Example: Likert scales (LICK-URT): 1-5, Strongly Disagree to Strongly Agree. Measures: mode, relative frequency, median. Mean?
  • Binary/Dichotomous/Binomial. Properties: 2 levels; a special case of Ordinal or Multinomial. Examples: Gender (multinomial), Disease (Y/N). Measures: mode, relative frequency. Mean?

  5. Code 1.1 Contingency Tables • Often called Two-way tables or Cross-Tab • Have dimensions I x J • Can be used to test hypotheses of association between categorical variables
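A minimal sketch of building an I x J cross-tab in Python with pandas. The data frame and the alcohol/depressed columns below are hypothetical stand-ins, not the slide's actual Code 1.1.

```python
# Minimal sketch: build a two-way (I x J) contingency table with pandas.
# The data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "alcohol":   ["yes", "yes", "no", "no", "yes", "no"],
    "depressed": ["yes", "no",  "no", "yes", "yes", "no"],
})

# Observed frequencies, with row/column totals in the margins
table = pd.crosstab(df["alcohol"], df["depressed"], margins=True)
print(table)
```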

  6. Contingency Tables: Test of Independence
  • Chi-Square Test of Independence (χ2)
  • Calculate χ2 = Σ (Oi − Ei)² / Ei, where Oi = observed frequency, Ei = expected frequency, and the sum runs over the n cells of the table (Ei = row total × column total / N)
  • Determine DF: (I−1) × (J−1)
  • Compare to the χ2 critical value for the given DF
  • (Marginal totals shown on the slide: R1=156, R2=664, N=820; C1=265, C2=331, C3=264)
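A sketch of the Pearson χ2 test of independence with scipy; the 2 x 3 cell counts are hypothetical, since the slide only shows marginal totals.

```python
# Sketch: Pearson chi-square test of independence on a hypothetical 2 x 3 table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [40,  60,  56],    # hypothetical counts, row 1
    [210, 260, 194],   # hypothetical counts, row 2
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("expected frequencies:\n", expected.round(1))
```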

  7. Code 1.2 Contingency Tables: Test of Independence
  • Pearson Chi-Square Test of Independence (χ2)
  • H0: No association
  • HA: Association… but where, and how?
  • Not appropriate when an expected cell frequency (Ei) is < 5
  • In that case use Fisher's Exact Test
  • (Marginal totals shown on the slide: R1=156, R2=664, N=820; C1=265, C2=331, C3=264)
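For small expected counts, scipy's fisher_exact covers the 2 x 2 case (larger tables need a different exact-test implementation); the counts below are hypothetical.

```python
# Sketch: Fisher's exact test on a small, hypothetical 2 x 2 table.
from scipy.stats import fisher_exact

table = [[3, 7],
         [12, 2]]
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, exact p = {p:.4f}")
```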

  8. Contingency Tables: The 2x2 Table

                              Disorder (Outcome)
                              Yes        No
  Risk Factor/     Yes        a          b          a+b
  Exposure         No         c          d          c+d
                              a+c        b+d        a+b+c+d

  9. Contingency Tables: Measures of Association
  • Example 2x2 table, Alcohol Use (rows) by Depression (columns): a=25, b=10 (row total 35); c=20, d=45 (row total 65); column totals 45 and 55; N=100
  • Probability: P(depression | alcohol) = a/(a+b) = 25/35; P(depression | no alcohol) = c/(c+d) = 20/65
  • Contrasting probabilities (risk ratio): (25/35) / (20/65) ≈ 2.3, so individuals who used alcohol were about 2.3 times more likely to have depression than those who did not use alcohol
  • Odds: a/b = 25/10 among users; c/d = 20/45 among nonusers
  • Contrasting odds (odds ratio): (25/10) / (20/45) = (a·d)/(b·c) = 5.62, so the odds of depression were 5.62 times greater in alcohol users compared to nonusers
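A short sketch that reproduces both measures of association from the slide's 2x2 cell counts.

```python
# Sketch: risk ratio and odds ratio from the slide's 2 x 2 table
# (alcohol use by depression: a=25, b=10, c=20, d=45).
a, b, c, d = 25, 10, 20, 45

p_exposed   = a / (a + b)                 # P(depression | alcohol use)
p_unexposed = c / (c + d)                 # P(depression | no alcohol use)
risk_ratio  = p_exposed / p_unexposed

odds_ratio = (a / b) / (c / d)            # equivalently (a*d)/(b*c)

print(f"risk ratio = {risk_ratio:.2f}")   # ~2.32
print(f"odds ratio = {odds_ratio:.2f}")   # 5.62
```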

  10. Why Odds Ratios?
  • Take the same Alcohol Use by Depression table and multiply the 'No depression' column by a factor i (i = 1 to 45): a=25, b=10·i, c=20, d=45·i
  • The row totals become (25 + 10·i) and (20 + 45·i), and the 'No' column total becomes 55·i
  • The risk ratio changes as i changes, but the odds ratio stays (25 × 45·i) / (10·i × 20) = 5.62 for every i
  • The odds ratio does not depend on how that margin is sampled, which is why it is the measure used for case-control designs and logistic regression
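A quick numerical check of that invariance, using the same scaling factor i.

```python
# Sketch: rescaling one column changes the risk ratio but not the odds ratio.
a, b, c, d = 25, 10, 20, 45
for i in (1, 5, 45):
    bi, di = b * i, d * i
    rr = (a / (a + bi)) / (c / (c + di))
    oratio = (a * di) / (bi * c)
    print(f"i={i:2d}  risk ratio={rr:.2f}  odds ratio={oratio:.2f}")
```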

  11. The Generalized Linear Model
  • General Linear Model (LM): continuous outcomes (DV); linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
  • Generalized Linear Model (GLM): John Nelder and Robert Wedderburn; maximum likelihood estimation
  • Continuous, categorical, and count outcomes
  • Distribution family and link functions
  • Allows error distributions that are not normal

  12. Logistic Regression • “This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.) • Binary Response • Predicting Probability (related to the Probit model) • Assume (the usual): • Independence • NOT Homoscedasticity or Normal Errors • Linearity (in the Log Odds) • Also….adequate cell sizes.

  13. Logistic Regression
  • The Model
  • In terms of the probability of success π(x): π(x) = e^(α + βx) / (1 + e^(α + βx))
  • In terms of logits (log odds): logit[π(x)] = ln[ π(x) / (1 − π(x)) ] = α + βx
  • The logit transform gives us a linear equation in the predictors

  14. Code 2.1 Logistic Regression: Example
  • Intercept-only model; the output as logits
  • Sample: Not Depressed 672 (81.95%), Depressed 148 (18.05%)
  • Logit (intercept) = ln(0.1805 / 0.8195) = −1.51; tested against H0: β=0
  • What does H0: β=0 mean? A logit of 0 is an odds of 1, i.e. a probability of 0.5
  • Conversion to probability: e^(−1.51) / (1 + e^(−1.51)) = 0.1805
  • Conversion to odds: e^(−1.51) = 0.22; also 0.1805/0.8195 = 0.22
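A sketch of the intercept-only fit in statsmodels; the 0/1 outcome is constructed to match the slide's frequencies (148 depressed, 672 not), everything else is assumed.

```python
# Sketch: intercept-only logistic regression reproducing the slide's logit.
import numpy as np
import statsmodels.api as sm

y = np.r_[np.ones(148), np.zeros(672)]    # 148 depressed, 672 not depressed
X = np.ones((len(y), 1))                  # intercept only

fit0 = sm.Logit(y, X).fit(disp=0)
logit = fit0.params[0]
print(f"logit (intercept) = {logit:.3f}")                     # ~ -1.513
print(f"odds              = {np.exp(logit):.3f}")             # ~ 0.220
print(f"probability       = {1 / (1 + np.exp(-logit)):.4f}")  # ~ 0.1805
```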

  15. Code 2.2 Logistic Regression: Example
  • The output as ORs
  • Odds ratios are tested against H0: OR=1 (equivalent to β=0 on the logit scale)
  • Sample: Not Depressed 672 (81.95%), Depressed 148 (18.05%)
  • The exponentiated intercept gives the odds of depression: 0.220
  • Conversion to probability: 0.220 / (1 + 0.220) = 0.1805
  • Conversion to logit (log odds): ln(OR) = logit; ln(0.220) = −1.51

  16. Code 2.3 Logistic Regression: Example
  • Logistic regression with a single continuous predictor (age), output as logits
  • Interpretation: a 1-unit increase in age results in a 0.013 increase in the log-odds of depression
  • Hmmmm… I have no concept of what a log-odds is. Interpret it as something else.
  • Logit > 0, so as age increases the risk of depression increases
  • OR = e^0.013 = 1.013
  • For a 1-unit increase in age, the odds of depression are multiplied by 1.013
  • We could also say: for a 1-unit increase in age there is a 1.3% increase in the odds of depression [(OR − 1) × 100 = % change]
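A sketch of the same kind of model in statsmodels; the data frame df, its depressed (0/1) and age columns, and the generated values are all hypothetical placeholders, so the fitted numbers will not match the slide.

```python
# Sketch: logistic regression with one continuous predictor (age).
# `df`, `depressed`, and `age` are hypothetical, randomly generated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 80, 500)})
df["depressed"] = rng.binomial(1, 0.18, size=500)

fit = smf.logit("depressed ~ age", data=df).fit(disp=0)
beta = fit.params["age"]
print(f"log-odds change per year of age: {beta:.3f}")
print(f"odds ratio per year:             {np.exp(beta):.3f}")
print(f"% change in odds per year:       {(np.exp(beta) - 1) * 100:.1f}%")
```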

  17. Logistic Regression: GOF
  • Overall model likelihood-ratio Chi-Square
  • Omnibus test for the model: is the overall fit better than the null model (no predictors)?
  • Also used to compare nested models
  • χ2 = −2 × (LL0 − LL1), with DF = the difference in the number of parameters estimated (here, the number of predictors)
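Continuing the hypothetical age model above: statsmodels stores both log-likelihoods, so the omnibus LR χ2 can be computed directly (it is also available as fit.llr).

```python
# Sketch: omnibus likelihood-ratio chi-square, continuing the `fit` object
# from the previous hypothetical example.
from scipy import stats

lr_chi2 = -2 * (fit.llnull - fit.llf)      # same value as fit.llr
df_model = int(fit.df_model)               # number of predictors
p_value = stats.chi2.sf(lr_chi2, df_model)
print(f"LR chi2({df_model}) = {lr_chi2:.2f}, p = {p_value:.4f}")
```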

  18. Code 2.4 Logistic Regression: GOF (Summary Measures)
  • Pseudo-R2
  • Not the same meaning as in linear regression
  • There are many of them (Cox and Snell, McFadden)
  • Only comparable across nested models of the same outcome
  • Hosmer-Lemeshow χ2
  • For models with continuous predictors; asks whether observed and model-predicted frequencies agree
  • H0: good fit to the data, so we want p > 0.05
  • Order the predicted probabilities, group them by quantiles (g = 10), then run a χ2 test of group × outcome; df = g − 2
  • Conservative (rarely rejects the null)
  • Pearson Chi-Square
  • For models with categorical predictors; similar to Hosmer-Lemeshow
  • ROC Area Under the Curve
  • Predictive accuracy / classification
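statsmodels does not ship a Hosmer-Lemeshow test, so here is a hand-rolled sketch of the recipe on the slide (quantile groups of predicted probability, χ2 with g − 2 df), continuing the hypothetical fit and df from above.

```python
# Sketch: hand-rolled Hosmer-Lemeshow test (g = 10 quantile groups), following
# the slide's recipe; `fit` and `df` are the hypothetical objects from above.
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    frame = pd.DataFrame({"y": y, "p": p_hat})
    frame["grp"] = pd.qcut(frame["p"], g, duplicates="drop")
    grouped = frame.groupby("grp", observed=True)
    obs1, exp1 = grouped["y"].sum(), grouped["p"].sum()
    n = grouped["y"].count()
    # chi-square contributions from events and non-events in each group
    hl = (((obs1 - exp1) ** 2 / exp1) + ((obs1 - exp1) ** 2 / (n - exp1))).sum()
    dof = len(n) - 2
    return hl, dof, stats.chi2.sf(hl, dof)

hl, dof, p = hosmer_lemeshow(df["depressed"], fit.predict())
print(f"H-L chi2({dof}) = {hl:.2f}, p = {p:.4f}")
```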

  19. Code 2.5 Logistic Regression: GOF (Diagnostic Measures)
  • Outliers in Y (outcome)
  • Pearson residuals: the square root of each observation's contribution to the Pearson χ2
  • Deviance residuals: the square root of each observation's contribution to the likelihood-ratio statistic comparing the saturated model with the fitted model
  • Outliers in X (predictors)
  • Leverage (hat matrix / projection matrix): maps the observed values onto the fitted values
  • Influential observations
  • Pregibon's Delta-Beta influence statistic: similar to Cook's D in linear regression
  • Detecting problems
  • Residuals vs. predictors
  • Leverage vs. residuals
  • Boxplot of Delta-Beta
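A sketch of pulling these quantities in Python. Refitting the hypothetical model through GLM with a binomial family is an assumption on my part, because its get_influence() conveniently exposes leverage and dfbetas (which stand in for the delta-beta statistic).

```python
# Sketch: residual and influence diagnostics via a binomial GLM refit of the
# hypothetical age model; dfbetas play the role of Pregibon's delta-beta.
import statsmodels.api as sm
import statsmodels.formula.api as smf

glm_fit = smf.glm("depressed ~ age", data=df,
                  family=sm.families.Binomial()).fit()

pearson_resid  = glm_fit.resid_pearson     # outliers in Y
deviance_resid = glm_fit.resid_deviance    # outliers in Y

infl = glm_fit.get_influence()
leverage = infl.hat_matrix_diag            # outliers in X (hat matrix diagonal)
dfbetas  = infl.dfbetas                    # influence on each coefficient

print(pearson_resid[:5], deviance_resid[:5], leverage[:5], sep="\n")
```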

  20. Logistic Regression: GOF
  • LR χ2 (df=1): 2.47, p=0.1162
  • H-L GOF: number of groups = 10, H-L χ2 = 7.12, df = 8, p = 0.5233
  • McFadden's R2: 0.0030

  21. Code 2.6 Logistic Regression: Diagnostics • Linearity in the Log-Odds • Use a lowess (loess) plot • Depressed vs Age
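A sketch of the lowess check: smooth the 0/1 outcome against age, convert the smoothed probabilities to log-odds, and look for a roughly straight line. Continues the hypothetical df from above.

```python
# Sketch: lowess plot to eyeball linearity in the log-odds (depressed vs. age).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

smooth = sm.nonparametric.lowess(df["depressed"], df["age"], frac=0.6)
x_s = smooth[:, 0]
p_s = smooth[:, 1].clip(0.01, 0.99)        # keep the logit finite
logit_s = np.log(p_s / (1 - p_s))          # smoothed probability -> log-odds

plt.plot(x_s, logit_s)
plt.xlabel("Age")
plt.ylabel("Smoothed log-odds of depression")
plt.show()
```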

  22. Code 2.7 Logistic Regression: Example
  • Logistic regression with a single categorical predictor (sex), output as an OR
  • Interpretation: the odds of depression for males are 0.299 times the odds for females
  • We could also say: the odds of depression are (1 − 0.299 = 0.701) 70.1% lower in males compared to females
  • Or why not just make males the reference group, so the OR comes out greater than 1
  • Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34
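A sketch with a categorical predictor, continuing the running hypothetical example; the sex column is treated as a factor with females as the reference, so the reported OR compares males with females and its inverse flips the reference.

```python
# Sketch: logistic regression with a single categorical predictor (sex).
# `sex` is a hypothetical column added to the running `df`; C(sex) treats it as
# a factor with the first level ('female') as the reference category.
import numpy as np

df["sex"] = np.where(rng.random(len(df)) < 0.5, "female", "male")

fit_sex = smf.logit("depressed ~ C(sex)", data=df).fit(disp=0)
or_male = np.exp(fit_sex.params["C(sex)[T.male]"])
print(f"OR, males vs females: {or_male:.3f}")
print(f"OR, females vs males: {1 / or_male:.3f}   # just the inverse")
print(f"% change in odds:     {(or_male - 1) * 100:.1f}%")
```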

  23. Ordinal Logistic Regression
  • Also called Ordered Logistic or the Proportional Odds Model
  • An extension of the binary logistic model to >2 ordered responses
  • New assumption: Proportional Odds
  • Example outcome: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
  • The predictor's effect on the outcome is the same across levels of the outcome:
  • Bmi3grp (1 vs 2,3): β(age)
  • Bmi3grp (1,2 vs 3): the same β(age)

  24. Ordinal Logistic Regression
  • The Model: a latent variable model (Y*), where Y = j when Y* falls between the cut points α(j−1) and α(j)
  • Cumulative logits: logit[P(Y ≤ j)] = ln[ P(Y ≤ j) / P(Y > j) ] = α(j) − βx, for j = 1, …, (number of levels − 1)
  • From the equation we can see that the odds ratio e^β is assumed to be independent of the category j (one β across all cut points)

  25. Code 3.1 Ordinal Logistic Regression Example
  • AS LOGITS: for a 1-unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category
  • AS OR: for a 1-unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater
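A sketch of a proportional-odds fit using statsmodels' OrderedModel (available in recent releases); the bmi3grp outcome and sbp blood-pressure predictor are hypothetical, randomly generated columns, so the estimates will not match the slide.

```python
# Sketch: ordinal (proportional odds) logistic regression in statsmodels.
# `bmi3grp` (ordered 1 < 2 < 3) and `sbp` are hypothetical columns.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
dat = pd.DataFrame({"sbp": rng.normal(125, 15, 600)})
dat["bmi3grp"] = pd.Categorical(rng.integers(1, 4, 600),
                                categories=[1, 2, 3], ordered=True)

ord_mod = OrderedModel(dat["bmi3grp"], dat[["sbp"]], distr="logit")
ord_res = ord_mod.fit(method="bfgs", disp=0)
beta = ord_res.params["sbp"]
print(f"log-odds of a higher BMI group per unit of SBP: {beta:.3f}")
print(f"odds ratio: {np.exp(beta):.3f}")
```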

  26. Code 3.2 Ordinal Logistic Regression: GOF • Assessing Proportional Odds Assumptions • Brant Test of Parallel Regression • H0: Proportional Odds, thus want p >0.05 • Tests each predictor separately and overall • Score Test of Parallel Regression • H0: Proportional Odds, thus want p >0.05 • Approx Likelihood-ratio test • H0: Proportional Odds, thus want p >0.05
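There is no built-in Brant test in statsmodels, so the sketch below mimics the approximate likelihood-ratio check instead: compare the proportional-odds model above with an unconstrained multinomial logit, with df equal to the extra slope parameters. This is my own approximation of that idea, continuing the hypothetical dat and ord_res objects.

```python
# Sketch: approximate LR test of proportional odds, comparing the ordinal fit
# above (`ord_res`) with an unconstrained multinomial logit on the same data.
import statsmodels.api as sm
from scipy import stats

mn = sm.MNLogit(dat["bmi3grp"].cat.codes,
                sm.add_constant(dat[["sbp"]])).fit(disp=0)

lr = 2 * (mn.llf - ord_res.llf)                  # gain from relaxing the assumption
extra_df = mn.params.size - ord_res.params.size  # extra slope parameters
p = stats.chi2.sf(lr, extra_df)
print(f"approx LR chi2({extra_df}) = {lr:.2f}, p = {p:.4f}  "
      "(p > 0.05 supports proportional odds)")
```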

  27. Code 3.3 Ordinal Logistic Regression: GOF
  • Pseudo R2
  • Diagnostic measures
  • Performed on the j−1 binomial logistic regressions (one per cut point)

  28. Multinomial Logistic Regression
  • Also called multinomial logit or polytomous logistic regression
  • Same assumptions as the binary logistic model
  • For >2 non-ordered responses
  • Or you've failed to meet the proportional (parallel) odds assumption of the ordinal logistic model

  29. Multinomial Logistic Regression
  • The Model: ln[ πj(x) / πJ(x) ] = αj + βj·x, for j = 1, …, J−1
  • j = a level of the outcome; J = the reference level
  • x is a fixed setting of an explanatory variable
  • Notice how it appears we are estimating a relative risk and not an odds ratio; it's actually an OR (each equation is a binary logit of category j vs. the reference)
  • Similar to conducting separate binary logistic models, but with better type 1 error control

  30. Code 4.1 Multinomial Logistic Regression Example
  • Does degree of supernatural belief indicate a religious preference?
  • AS OR: for a 1-unit increase in supernatural belief, the odds of being Evangelical rather than Catholic (the reference category) are 1.218 times greater
  • We could also say: a [(OR − 1) × 100 =] 21.8% increase in the odds of being Evangelical rather than Catholic
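A sketch of a multinomial logit in statsmodels; the relig outcome (Catholic / Evangelical / None) and belief predictor are hypothetical, randomly generated columns, with Catholic coded first so it is the baseline category.

```python
# Sketch: multinomial logistic regression; `relig` and `belief` are hypothetical.
# Coding Catholic as category 0 makes it the baseline, matching the slide's contrast.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
dat = pd.DataFrame({
    "belief": rng.normal(0, 1, 800),
    "relig":  rng.choice(["Catholic", "Evangelical", "None"], 800),
})

y = pd.Categorical(dat["relig"],
                   categories=["Catholic", "Evangelical", "None"]).codes
X = sm.add_constant(dat[["belief"]])

mn = sm.MNLogit(y, X).fit(disp=0)
print(np.exp(mn.params))   # each column: ORs for one category vs. the Catholic baseline
```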

  31. Multinomial Logistic Regression GOF • Limited GOF tests. • Look at LR Chi-square and compare nested models. • “Essentially, all models are wrong, but some are useful” –George E.P. Box • Pseudo R2 • Similar to Ordinal • Perform tests on the j-1 binomial logistic regressions

  32. Resources
  • "Categorical Data Analysis" by Alan Agresti
  • UCLA Stat Computing: http://www.ats.ucla.edu/stat/
