
# gapminder

Discovering patterns in data: correlation and visualization. www.gapminder.org. Will Hamilton, April 2017.


### Presentation Transcript

1. Discovering patterns in data: correlation and visualization www.gapminder.org Will Hamilton April 2017

2. Contents PART 1: Correlation • What is correlation? • Types of correlation • Linear regression • Multiple linear regression • Linear regression diagnostics • Non-linear correlation • Logistic regression • Correlation and causation PART 2: Data visualisation • Rules for good graphing • Bad graphs Conclusions • The age of data!

3. Dependence • Independence: the occurrence of one variable does not affect the probability of occurrence of the other • Dependence: any statistical relationship between two random variables

4. Describing linear relationships: • Pearson Product-Moment Correlation Coefficient • Linear regression
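To make the Pearson product-moment correlation coefficient concrete, here is a minimal sketch in plain Python. The function name `pearson_r` and the toy numbers are assumptions made up for illustration; they are not the mtcars data used later in the talk.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data, invented for illustration: y falls by 2 for every unit of x.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
mileage = [10.0, 8.0, 6.0, 4.0, 2.0]
r = pearson_r(weights, mileage)  # a perfect negative linear relationship
```

A value near +1 or −1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.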

5. mtcars

6. Linear regression • “Least squares method” • Minimise the sum of the squared differences between each data point and the line
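The least squares idea can be sketched with the standard closed-form solution for a single predictor. The helper names (`least_squares_fit`, `sum_squared_error`) and the data are illustrative assumptions, not taken from the slides.

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept) of the line minimising the squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between each data point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = least_squares_fit(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)
# Any other line, for example one with a slightly different slope, fits worse.
worse_error = sum_squared_error(xs, ys, slope + 0.1, intercept)
```

Comparing `best_error` with `worse_error` shows what "least squares" means in practice: nudging the fitted line in any direction increases the total squared error.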

7. y = −5.34x + 37 • r² = 0.74 • P = 1.29 × 10⁻¹⁰

8. Explaining linear regression in words: y = −5.34x + 37. For every 1000 lb increase in car weight (x), miles per gallon (y) decreases by 5.34 mpg.

9. Explaining linear regression in words: r² = 0.74. Our model (y = −5.34x + 37) explains 74% of the variance in car miles per gallon. The remaining 26% of the variance is not accounted for by our model.
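The "variance explained" reading of r² follows from its definition as one minus the ratio of residual to total sum of squares. The sketch below uses invented toy data and hypothetical helper names (`fit_line`, `r_squared`); it is not a reproduction of the mtcars fit.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    # Variation left over after fitting the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data, invented for illustration: an upward trend with some scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
unexplained = 1.0 - r2  # share of variance the line does not account for
```

Here `r2` plays the role of the 74% on the slide and `unexplained` the role of the 26%.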

10. Explaining linear regression in words: P = 1.29 × 10⁻¹⁰. Statistical significance test, i.e. the probability of obtaining data as extreme or more extreme than this if there were no association between mpg and weight.

11. Don’t confuse r² and m (the slope) • y = x • r² = 1.0 • P < 2.2 × 10⁻¹⁶ • y = 2x • r² = 1.0 • P < 2.2 × 10⁻¹⁶
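The distinction between r² and the slope m can be checked numerically. In this sketch (function name `slope_and_r2` and data are illustrative assumptions), two perfectly linear datasets have different slopes but the same r².

```python
def slope_and_r2(xs, ys):
    """Return (slope, r_squared) for a simple least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # For a single predictor, r squared is the squared Pearson correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
slope_a, r2_a = slope_and_r2(xs, [x for x in xs])      # y = x
slope_b, r2_b = slope_and_r2(xs, [2 * x for x in xs])  # y = 2x
```

The slope says how steep the relationship is; r² says how tightly the points follow the line. Doubling the slope leaves r² unchanged.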

12. y = 112x − 131 • r² = 0.78 • P = 1.22 × 10⁻¹¹

13. Explaining linear regression in words • y = 112x − 131 • For every 1000 lb increase in car weight (x), engine displacement (y) increases by 112 cubic inches. • r² = 0.78 • Our model explains 78% of the variance in the data • P = 1.22 × 10⁻¹¹ • The probability of getting data as extreme or more extreme than those obtained, if there were no association between car weight and engine displacement, is very small

14. Multiple linear regression • What if multiple independent variables influence your dependent variable? • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ
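One way to estimate c and the m coefficients is to solve the least-squares normal equations. The sketch below does this in plain Python with Gaussian elimination. The function names and the toy data are assumptions for illustration, and statistical packages typically use more numerically robust methods.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copying rows keeps the inputs unmodified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple_regression(rows, ys):
    """Return [c, m1, ..., mn] for Y = c + m1*x1 + ... + mn*xn (least squares)."""
    # Prepend a constant 1 to each row so the first coefficient is the intercept c.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data generated exactly from Y = 1 + 2*x1 + 3*x2 (invented for illustration).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, ys)  # approximately [1, 2, 3]
```

Because the toy data contain no noise, the recovered coefficients match the generating equation up to rounding error; with real data each coefficient is an estimate of the effect of one variable while holding the others fixed.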

15. Size of bubble = total population living with HIV (PLHIV). Colour of bubble = WHO global region. Blue = Sub-Saharan Africa. Countries with HIV prevalence >3% are named. There is no obvious linear correlation between HIV prevalence and a nation’s wealth

16. Just Sub-Saharan African countries

17. There is no obvious linear correlation between HIV prevalence and a nation’s human development index

18. Just Sub-Saharan African countries

19. There is no obvious linear correlation between HIV prevalence and a nation’s adult female literacy…

20. … In fact, among Sub-Saharan African countries there is a positive correlation; i.e. more literate countries tend to have higher HIV prevalence!

21. Among Sub-Saharan African countries, high-risk sex (as defined in DHS data, e.g. multiple concurrent sexual partners) is positively correlated with HIV prevalence

22. Multiple linear regression • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ • Y = HIV prevalence • x₁ = log₁₀ GDP per capita • x₂ = Human Development Index • x₃ = Female literacy rate • x₄ = High-risk sex behaviours

23. Reality-checking your linear regressions Diagnostic plots: • Residuals vs Fitted • Normal Q-Q • Scale-Location • Residuals vs Leverage

24. Residuals vs Fitted values

25. Testing for homoscedasticity

26. Lesotho Swaziland Burundi

27. Non-linear rank correlations • Things that are correlated but not in straight line (linear) relationships • Ranking correlations • Spearman’s rho (ρ) • Kendall’s tau (τ)
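Spearman’s rho can be understood as the Pearson correlation computed on ranks rather than raw values. The following sketch (helper names and toy data are assumptions for illustration) shows a curved but perfectly monotonic relationship scoring ρ = 1 while the ordinary linear correlation is lower.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Toy data, invented for illustration: y = x cubed is monotonic but not linear.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]
rho = spearman_rho(xs, ys)
linear_r = pearson_r(xs, ys)
```

Rank-based measures such as ρ and Kendall’s τ depend only on the ordering of the data, which is why they capture relationships that are consistent in direction but not straight lines.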

28. Levenson RM, et al. 2015. Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357.

29. Spearman’s rank correlation coefficient ρ = 0.97 • P < 2.2 × 10⁻¹⁶

30. Logistic regression • Analyses of binary categorical dependent variables (0 or 1) • Odds of dependent variable being 0 or 1 computed for every position along the independent variable • Odds Ratio (OR) imputed
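The link between odds, the logistic curve, and the odds ratio can be illustrated with a small sketch. The intercept and slope below are hypothetical values chosen only for illustration, not coefficients from any model in the talk, and the code does not fit a model; it only shows how a fitted coefficient would be interpreted.

```python
from math import exp

def logistic(log_odds):
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-log_odds))

def odds(p):
    """Odds of an outcome with probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients (assumptions for illustration only).
intercept = -2.0
slope = 0.5

# Probability that the dependent variable is 1 at two positions along x.
p_at_1 = logistic(intercept + slope * 1)
p_at_2 = logistic(intercept + slope * 2)

# Odds ratio for a one-unit increase in x.
odds_ratio = odds(p_at_2) / odds(p_at_1)
```

Under this model the odds ratio for a one-unit increase in the independent variable equals `exp(slope)`, whichever two adjacent positions along x are compared.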

31. Multiple logistic regression: worked example • 321 children with P. falciparum malaria at a hospital in rural Ghana. • 27 (8.41%) of the children died • Were any clinical variables associated with mortality? With thanks to Andrew McGovern

32. Correlation ≠ Causation!

33. From Tyler Vigen - http://www.tylervigen.com/spurious-correlations

34. “Establishing the why” • Strength • Consistency • Specificity • Temporality • Biological gradient • Plausibility • Coherence • Experiment • Analogy. Source: A. Bradford Hill, “The Environment and Disease: Association or Causation?”, Proceedings of the Royal Society of Medicine, 1965