Understanding Association and Correlation in Statistical Analysis
This handout outlines the fundamental concepts of association and correlation in statistics, targeting tests appropriate for categorical and continuous data. It covers methods like the chi-square test for categorical data, Pearson’s r for interval/ratio data, and Spearman’s rho for ordinal data. Additionally, it discusses research questions, data entry for chi-square testing, significance testing, degrees of freedom, and odds ratios. Essential assumptions and calculations are also presented, equipping students with necessary skills for effective data analysis and interpretation.
Week 3: Association and correlation • Handout & additional course notes available at http://homepages.gold.ac.uk/aphome • Trevor Thompson, 15-10-2007
Overview 1) What are tests of association and which test do I use? 2) Associations within categorical data • descriptives (frequency tables) • the chi-square test 3) Associations within continuous data • descriptives (scatterplots) • Spearman’s rho and Pearson’s r • Reading: Howell (2002), ‘Statistical Methods for Psychology’, Chap. 6 & 9
What is association/correlation? • Tests of association examine whether there is a relationship between variables • Variables are either associated or independent (which is the null hypothesis?) • Causation vs. association: whether causation can be inferred depends on the experimental design, not the test used
Which test to use? • Test selection depends on data: Categorical data - Chi-square; Ordinal (ranked) data - Spearman’s rho; Interval/ratio data - Pearson’s r • Other less commonly used tests exist (tetrachoric, Kendall’s tau, phi etc.) - see Howell • Logistic regression covered in later lecture
Which test to use - examples • Is there an association between height and weight? Pearson’s r • Is there an association between 50 cities ranked for ‘livability’ 10 years ago and these cities ranked for ‘livability’ today? Spearman’s rho • Is there an association between gender (male / female) and yogurt preference (light / dark)? Chi-square test
Chi-square test • Pearson’s chi-square test for categorical data -descriptives -assumptions -chi-square significance test • Research question: Is gender associated with preference for a specifically coloured yogurt?
Chi-square test • Data entry • each row should represent the responses of one participant • Compute contingency (frequency) table • An n-way table denotes the number of variables: gender & yogurt is a 2-way table • Tables are also described in terms of how many levels each variable has. So a 3*2 table would represent one variable with 3 levels & one variable with 2 levels: gender & yogurt preference is a 2*2 table
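The data-entry and tabulation steps can be sketched in Python. This is only an illustration: the raw rows below are hypothetical, built to reproduce the cell counts (20, 10, 10, 20) used in the handout's worked chi-square example, and the variable names are assumptions rather than anything from the course materials.

```python
from collections import Counter

# Hypothetical raw data: one (gender, preference) row per participant,
# constructed to match the cell counts in the handout's worked example.
rows = (
    [("female", "light")] * 20 + [("female", "dark")] * 10
    + [("male", "light")] * 10 + [("male", "dark")] * 20
)

genders = ["female", "male"]   # variable 1: 2 levels
prefs = ["light", "dark"]      # variable 2: 2 levels

# Count how many participants fall into each (gender, preference) cell.
counts = Counter(rows)

# 2*2 contingency table: rows = gender, columns = yogurt preference.
table = [[counts[(g, p)] for p in prefs] for g in genders]

print(f"{'':8}" + "".join(f"{p:>8}" for p in prefs))
for g, row in zip(genders, table):
    print(f"{g:8}" + "".join(f"{n:>8}" for n in row))
```

Each participant contributes to exactly one cell, which is what the mutual-exclusivity assumption of the test requires.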
Chi-square test • Descriptives • Contingency tables (three example tables shown on slide): probable association; probable independence (no association); possible association?
Chi-square test • Assumptions 1. Observations must be independent 2. Observations must be mutually exclusive • responses should only fall into one cell. E.g. prefer either dark or light yogurt - not both 3. Inclusion of non-occurrences • include all responses (e.g. both ‘yes’ and ‘no’) - otherwise can be misleading 4. Cell size • Expected cell size > 5
Chi-square test • Significance testing • Are two variables significantly associated? Run Pearson’s chi-square
Chi-square test • Pearson’s $\chi^2$ statistic • Gender & yogurt preference significantly associated ($\chi^2 = 6.67$, p < .05) • Is this in the expected direction? • Our hypothesis was 2-tailed. If 1-tailed (e.g. females will prefer light yogurts) then check contingency table for direction • Can halve p-value if 1-tailed - but only if variables have 2 levels
Chi-square test • Degrees of freedom • df = (R-1) * (C-1), where R = number of rows, C = number of columns • Yates’ continuity correction • Only applicable to 2*2 tables • Changes $(O-E)^2$ in the formula to $(|O-E| - 0.5)^2$ • Not really needed
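As a rough illustration of the two ideas on this slide, the sketch below computes df for a table and applies the Yates adjustment to the handout's 2*2 example (observed 20, 10, 10, 20; expected 15 per cell). The function names are hypothetical, chosen for this example only.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df = (R-1) * (C-1) for an R*C contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square(observed, expected, yates=False):
    """Sum of (O-E)^2 / E over cells; with yates=True each |O-E| is reduced by 0.5."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff -= 0.5  # Yates' continuity correction (2*2 tables only)
        total += diff ** 2 / e
    return total


observed = [20, 10, 10, 20]   # cell counts from the handout's example
expected = [15, 15, 15, 15]   # chance-expected counts for N = 60

df = degrees_of_freedom(2, 2)
uncorrected = chi_square(observed, expected)
corrected = chi_square(observed, expected, yates=True)

print(f"df = {df}")
print(f"chi-square without correction = {uncorrected:.2f}")
print(f"chi-square with Yates' correction = {corrected:.2f}")
```

The corrected statistic is always somewhat smaller than the uncorrected one, so the correction makes the test more conservative.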
Chi-square test • Likelihood ratio • An alternative test for associations of categorical data • For large samples, likelihood ratio ≈ Pearson chi-square • For small samples, chi-square test may be more accurate • Likelihood ratio is useful for multi-dimensional associations - covered in Logistic regression lecture
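The slide does not give a formula for the likelihood ratio statistic. A minimal sketch, assuming the standard form G² = 2 ∑ O ln(O/E) and the handout's example counts, shows how close it comes to the Pearson statistic for these data:

```python
import math

observed = [20, 10, 10, 20]   # cell counts from the handout's example
expected = [15, 15, 15, 15]   # chance-expected counts for N = 60

# Likelihood ratio statistic (assumed standard form, not stated on the slide):
# G^2 = 2 * sum(O * ln(O / E)); cells with O = 0 contribute nothing.
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Pearson chi-square for comparison: sum((O - E)^2 / E).
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"likelihood ratio G^2 = {g2:.2f}")
print(f"Pearson chi-square   = {pearson:.2f}")
```

Both statistics are referred to the same chi-square distribution with the same df.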
Chi-square test • Odds-ratio (OR) estimate • How large is our significant association? • Odds of females choosing light relative to dark? 2/1 • Odds of males choosing light relative to dark? 1/2 • Odds ratio = $\frac{a/b}{c/d}$, or equivalently OR = $\frac{ad}{bc}$, where a, b = number of females choosing light, dark and c, d = number of males choosing light, dark • Odds ratio: what are the odds of choosing a light yogurt for females relative to males? (2/1)/(1/2) = 4/1
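The odds-ratio arithmetic can be checked with a few lines of Python. The cell labels a, b, c, d follow the slide's formula, and the counts are those of the handout's gender by yogurt example; the variable names are illustrative only.

```python
# 2*2 table cells: rows = gender, columns = preference (light, dark).
a, b = 20, 10   # females: light, dark
c, d = 10, 20   # males:   light, dark

odds_female = a / b   # odds of a female choosing light rather than dark
odds_male = c / d     # odds of a male choosing light rather than dark

# Two equivalent ways of writing the odds ratio.
odds_ratio = odds_female / odds_male
odds_ratio_cross = (a * d) / (b * c)

print(f"odds (female) = {odds_female}")
print(f"odds (male)   = {odds_male}")
print(f"odds ratio    = {odds_ratio}")
```

An odds ratio of 1 would indicate no association; values further from 1 indicate a stronger association.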
Chi-square test - underlying logic • Pearson $\chi^2 = \sum \frac{(O-E)^2}{E}$, where O = observed frequency, E = expected frequency • The $\chi^2$ statistic represents how far the observed data deviate from those expected by chance • Calculating $\chi^2$ • Step 1. Calculate expected frequencies • Prob of choosing light yogurt? ½ (30/60) • Prob of being female? ½ • Prob of being female & preferring light yogurt? ¼ [Joint prob = p1 x p2] • So if N=60, expected freq for each cell = 15 (60 x ¼)
Chi-square test - underlying logic • Step 2. Observed frequencies • The bigger the deviations between observed and chance-expected cell sizes, the greater the likelihood of a significant association • $\chi^2 = \sum \frac{(O-E)^2}{E} = \frac{(20-15)^2}{15} + \frac{(10-15)^2}{15} + \frac{(10-15)^2}{15} + \frac{(20-15)^2}{15} = 6.67$, same as in SPSS output
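The two calculation steps can be reproduced in Python as a check on the hand calculation. This is a sketch rather than a replacement for the SPSS output: expected frequencies are derived from the row and column totals (row total x column total / N), which is equivalent to the joint-probability reasoning in Step 1.

```python
# Observed 2*2 table: rows = gender (female, male), columns = (light, dark).
observed = [[20, 10],
            [10, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 1: expected frequency per cell = row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumption check: every expected cell size should exceed 5.
min_expected = min(e for row in expected for e in row)

# Step 2: sum (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"expected frequencies: {expected}")
print(f"smallest expected cell size: {min_expected}")
print(f"chi-square = {chi2:.2f}")
```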
Chi-square test - underlying logic • The probability value corresponding to $\chi^2 = 6.67$ is p = .01 (meaning a value of 6.67 or larger occurs about 1 time in 100 by chance when there is no association) • The chi-square distribution (plotted on the slide) shows values of the chi-square statistic that would be obtained by chance in repeated sampling • The distribution of $\chi^2$ changes according to df
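For a 2*2 table (df = 1) the p-value can be approximated without statistical software. The sketch below relies on the fact that a chi-square variable with 1 df is the square of a standard normal variable, so the upper-tail probability equals erfc(sqrt(x/2)). It applies only to df = 1; the function name is hypothetical.

```python
import math


def chi2_p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    Valid for df = 1 only: P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))


p = chi2_p_value_df1(6.67)
print(f"p = {p:.3f}")

# The familiar .05 critical value for df = 1 is about 3.84.
print(f"p at chi-square 3.84 = {chi2_p_value_df1(3.84):.3f}")
```

For other df the distribution has a different shape, so a general chi-square routine or table is needed.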
Correlation and regression • Detailed coverage of correlation/regression in lectures 8 & 9 • When X & Y are continuous variables, we use Pearson’s correlation coefficient ‘r’ (or the equivalent Spearman’s rho for ranked data) • Correlation vs. regression i. correlation is used to index strength of association; regression is used in prediction ii. (historically) if X is fixed then regression, if X is random then correlation
Correlation and regression • Descriptives • Scatterplot • Correlation (r) reflects the degree to which the points cluster around the line (magnitude from 0 to 1; sign gives direction, so r runs from -1 to +1) • Regression line is the “line of best fit”
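To make the idea of r concrete, here is a small sketch that computes Pearson's r from its definition (the co-variation of X and Y divided by the square root of the product of their separate variations). The height and weight values are invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, not taken from the handout.
heights_cm = [150, 160, 165, 170, 180, 185]
weights_kg = [52, 58, 63, 66, 77, 80]

r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Points lying exactly on a rising straight line give r = +1, and points on a falling line give r = -1.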
Correlation and regression • Significance testing • Pearson’s product-moment correlation • r = 0: no correlation; r = +1 or -1: max correlation • Null hyp is population r = 0, with r normally distributed • To evaluate significance of ‘r’, convert to ‘t’ • $t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$ (df = N - 2) • Assumptions of normality and homogeneity of variance apply - covered in detail in lecture 6
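The conversion from r to t is simple arithmetic. The sketch below uses made-up values (r = .50 from N = 27 participants) purely for illustration; the resulting t would then be compared against the t distribution with N - 2 degrees of freedom.

```python
import math


def r_to_t(r, n):
    """Convert a Pearson correlation to a t statistic with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical example values, not taken from the handout.
r = 0.5
n = 27

t = r_to_t(r, n)
df = n - 2
print(f"t({df}) = {t:.3f}")
```

Because N appears in the numerator, the same r gives a larger t, and so is more likely to be significant, in a larger sample.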
Summary • Selection of appropriate test depends on data • Chi-square test - explanation of output • Chi-square test - underlying logic • Correlation and regression