
gapminder

Discovering patterns in data: correlation and visualization. www.gapminder.org. Will Hamilton April 2017. Contents. PART 2: Data visualisation Rules for good graphing Bad graphs. PART 1: Correlation What is correlation? Types of correlation Linear regression Multiple linear regression


Presentation Transcript


  1. Discovering patterns in data: correlation and visualization www.gapminder.org Will Hamilton April 2017

  2. Contents PART 1: Correlation • What is correlation? • Types of correlation • Linear regression • Multiple linear regression • Linear regression diagnostics • Non-linear correlation • Logistic regression • Correlation and causation PART 2: Data visualisation • Rules for good graphing • Bad graphs Conclusions • The age of data!

  3. Dependence • Independence • Occurrence of one variable does not affect the probability of occurrence of the other • Dependence • Any statistical relationship between two random variables
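
The definition of independence can be checked exactly on a small example. The sketch below is an illustration only: the two-dice setup and the event names are assumptions, not part of the slides. It tests whether P(A and B) equals P(A) × P(B).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (illustrative setup).
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event given as a predicate on (first, second)."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))


def first_even(o):
    return o[0] % 2 == 0


def sum_is_7(o):
    return o[0] + o[1] == 7


def sum_at_least_10(o):
    return o[0] + o[1] >= 10


def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    both = prob(lambda o: a(o) and b(o))
    return both == prob(a) * prob(b)


# Knowing the first die is even tells us nothing about whether the sum is 7 ...
print(independent(first_even, sum_is_7))
# ... but it does change the chance that the sum is at least 10.
print(independent(first_even, sum_at_least_10))
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.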

  4. Describing linear relationships: • Pearson Product-Moment Correlation Coefficient • Linear regression
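
As a minimal sketch, the Pearson product-moment correlation coefficient can be computed from its definition: the covariance of the two variables divided by the product of their standard deviations. The data below are made-up toy values (not the mtcars data), and the function name `pearson_r` is an assumption for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy data: heavier cars, lower fuel economy.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
mpg = [30.0, 26.0, 21.0, 18.0, 12.0]

r = pearson_r(weights, mpg)
print(round(r, 3))  # close to -1: a strong negative linear relationship
```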

  5. mtcars

  6. Linear regression • “Least squares method” • Minimise the sum of squared difference between each datapoint and the line
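
The least squares line for one predictor has a closed-form solution, sketched below on made-up toy data (not the mtcars data). The helper names `fit_line` and `sum_squared_error` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the data points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
mpg = [30.0, 26.0, 21.0, 18.0, 12.0]

slope, intercept = fit_line(weights, mpg)
best = sum_squared_error(weights, mpg, slope, intercept)
# Any other line, for example one with a perturbed slope, has a larger error.
worse = sum_squared_error(weights, mpg, slope + 0.5, intercept)
print(slope, intercept, best, worse)
```

For these toy values the slope is about -4.4 and the intercept about 34.6.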

  7. y = -5.34x + 37 • r² = 0.74 • P = 1.29 × 10⁻¹⁰

  8. Explaining linear regression in words y = -5.34x + 37 For every 1,000 lb increase in car weight (x), fuel economy (y) decreases by 5.34 miles per gallon (mpg).

  9. Explaining linear regression in words r² = 0.74 Our model (y = -5.34x + 37) explains 74% of the variance in car miles per gallon. The remaining 26% is not accounted for by the model.
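
The "variance explained" reading of r² follows from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up toy data (not the mtcars data), with assumed helper names:

```python
def fit_line(xs, ys):
    """Least squares (slope, intercept) for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys that the fitted line accounts for."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left over after fitting the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up toy data.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
mpg = [30.0, 26.0, 21.0, 18.0, 12.0]

r2 = r_squared(weights, mpg)
print(round(r2, 3))
```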

  10. Explaining linear regression in words P = 1.29 × 10⁻¹⁰ Statistical significance test, i.e. the probability of obtaining data as extreme as, or more extreme than, these if there were no association between mpg and weight.
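
One way to make this definition concrete is an exact permutation test: shuffle the y values in every possible order, which breaks any association with x, and count how often the correlation is at least as strong as the one observed. This is a swapped-in illustration on made-up toy data. It is not the method that produced the P value on the slide, which presumably comes from the regression's own significance test.

```python
import math
from itertools import permutations


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exact_permutation_p(xs, ys):
    """Two-sided P value: share of all reorderings of ys whose |r| is at
    least as large as the observed |r|. Only feasible for very small samples."""
    observed = abs(pearson_r(xs, ys))
    tolerance = 1e-12  # guards against floating-point rounding in the comparison
    total = 0
    extreme = 0
    for shuffled in permutations(ys):
        total += 1
        if abs(pearson_r(xs, shuffled)) >= observed - tolerance:
            extreme += 1
    return extreme / total


# Made-up toy data.
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
mpg = [30.0, 26.0, 21.0, 18.0, 12.0]

p_value = exact_permutation_p(weights, mpg)
print(p_value)
```

With five points there are only 120 orderings, so the smallest two-sided P value this toy example can reach is 2/120. Larger samples allow far smaller values.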

  11. Don’t confuse r² and m (the slope) • y = x • r² = 1.0 • P < 2.2 × 10⁻¹⁶ • y = 2x • r² = 1.0 • P < 2.2 × 10⁻¹⁶
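
The point can be reproduced in a few lines: two perfectly linear data sets with different slopes both give r² = 1. The sketch uses made-up x values and an assumed helper name, `slope_and_r_squared`.

```python
def slope_and_r_squared(xs, ys):
    """Fit y = m*x + c by least squares and return (m, r squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, 1.0 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

m1, r2_1 = slope_and_r_squared(xs, [x for x in xs])      # y = x
m2, r2_2 = slope_and_r_squared(xs, [2 * x for x in xs])  # y = 2x

# The slope doubles, but the goodness of fit is identical.
print(m1, r2_1)
print(m2, r2_2)
```

The slope m says how steep the relationship is. r² says how tightly the points follow the line.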

  12. y = 112x - 131 • r² = 0.78 • P = 1.22 × 10⁻¹¹

  13. Explaining linear regression in words • y = 112x - 131 • For every 1,000 lb increase in car weight (x), engine displacement (y) increases by 112 cubic inches. • r² = 0.78 • Our model explains 78% of the variance in engine displacement • P = 1.22 × 10⁻¹¹ • The probability of getting data as extreme as, or more extreme than, those obtained if there were no association between car weight and engine displacement is very small

  14. Multiple linear regression • What if multiple independent variables influence your dependent variable? • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + ... + mₙxₙ
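
A minimal sketch of fitting Y = c + m₁x₁ + m₂x₂ by least squares, using only the standard library. It solves the normal equations with Gaussian elimination. The data are made up and generated without noise from Y = 1 + 2x₁ - 3x₂, so the fit should recover those coefficients. The helper names are assumptions, and statistical software typically uses more numerically robust methods.

```python
def solve_linear_system(a, b):
    """Solve a * z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_multiple(rows, ys):
    """Return [c, m1, m2, ...] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: each row is (x1, x2); ys follow Y = 1 + 2*x1 - 3*x2 exactly.
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (3.0, 1.0), (4.0, 3.0)]
ys = [-2.0, 3.0, -1.0, 4.0, 0.0]

coefficients = fit_multiple(rows, ys)
print([round(value, 6) for value in coefficients])
```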

  15. Size of bubble = total population living with HIV (PLHIV). Colour of bubble = WHO global region (blue = Sub-Saharan Africa). Countries with HIV prevalence >3% are named. There is no obvious linear correlation between HIV prevalence and a nation’s wealth.

  16. Just Sub-Saharan African countries

  17. There is no obvious linear correlation between HIV prevalence and a nation’s human development index

  18. Just Sub-Saharan African countries

  19. There is no obvious linear correlation between HIV prevalence and a nation’s adult female literacy…

  20. … In fact, among Sub-Saharan African countries there is a positive correlation, i.e. more literate countries tend to have higher HIV prevalence!

  21. Among Sub-Saharan African countries, high risk sex (defined by DHS data, e.g. multiple concurrent sexual partners) is positively correlated with HIV prevalence

  22. Multiple linear regression • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + ... + mₙxₙ • Y = HIV prevalence • x₁ = log₁₀ GDP per capita • x₂ = Human Development Index • x₃ = Female literacy rate • x₄ = High risk sex behaviours

  23. Reality-checking your linear regressions Diagnostic plots: • Residuals vs Fitted • Normal Q-Q • Scale-Location • Residuals vs Leverage

  24. Residuals vs Fitted values
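
The two quantities on this plot are simple to compute: the fitted value is what the line predicts for each observation, and the residual is the observed value minus that prediction. A sketch on made-up toy data (not the data behind the slide):

```python
def fit_line(xs, ys):
    """Least squares (slope, intercept) for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [30.0, 26.0, 21.0, 18.0, 12.0]

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# These pairs are the points of a residuals-vs-fitted plot.
for f, r in zip(fitted, residuals):
    print(round(f, 2), round(r, 2))
```

For a least squares line with an intercept, the residuals sum to zero. The plot is read for patterns such as curvature or a funnel shape, not for the overall level.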

  25. Testing for homoscedasticity
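
As a crude numerical companion to the diagnostic plot, one can compare the spread of the residuals at low and high fitted values. The sketch below is not a formal test and is not taken from the slides. The function name `variance_ratio` and the example numbers are made up for illustration.

```python
def variance_ratio(fitted, residuals):
    """Mean squared residual in the upper half of fitted values divided by
    that in the lower half. Values near 1 suggest a roughly constant spread."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    return mean_square(high) / mean_square(low)


fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Made-up residuals with constant spread (homoscedastic).
even_spread = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
# Made-up residuals whose spread grows with the fitted value (heteroscedastic).
growing_spread = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]

ratio_even = variance_ratio(fitted, even_spread)
ratio_growing = variance_ratio(fitted, growing_spread)
print(ratio_even, ratio_growing)
```

A ratio far from 1, as in the second example, is a hint that the constant-variance assumption of linear regression may not hold.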

  26. Lesotho Swaziland Burundi

  27. Non-linear rank correlations • Things that are correlated but not in straight line (linear) relationships • Ranking correlations • Spearman’s rho (ρ) • Kendall’s tau (τ)
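
Both rank coefficients can be sketched in a few lines. Spearman's rho is the Pearson correlation of the ranks, and Kendall's tau counts concordant and discordant pairs. The version below assumes there are no tied values (tau-a), uses made-up data, and the function names are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if direction > 0:
                score += 1
            elif direction < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


# Made-up data: y rises with x, but along a curve (y = x squared).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]

print(pearson_r(xs, ys))     # below 1: the relationship is not a straight line
print(spearman_rho(xs, ys))  # the ordering is perfectly preserved
print(kendall_tau(xs, ys))   # every pair is concordant
```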

  28. Levenson RM, et al. 2015. Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357.

  29. Spearman’s rank correlation coefficient ρ = 0.97 P < 2.2 × 10⁻¹⁶

  30. Logistic regression • Analyses of binary categorical dependent variables (0 or 1) • Odds of dependent variable being 0 or 1 computed for every position along the independent variable • Odds Ratio (OR) imputed
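
The link between the logistic curve, the odds and the odds ratio can be sketched as follows. The coefficients are hypothetical values chosen for illustration, not fitted to any data from the slides, and the sketch shows the model itself, not how its coefficients are estimated.

```python
import math


def probability(x, intercept, slope):
    """Logistic model: P(outcome = 1) at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def odds(p):
    """Odds of the outcome being 1 rather than 0."""
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -2.0, 0.5

odds_at_3 = odds(probability(3.0, intercept, slope))
odds_at_4 = odds(probability(4.0, intercept, slope))

# The odds ratio for a one-unit increase in x is the same everywhere
# along the curve and equals exp(slope).
odds_ratio = odds_at_4 / odds_at_3
print(odds_ratio, math.exp(slope))
```

This is why logistic regression results are usually reported as odds ratios: a single number summarises the effect of a predictor at any position along the independent variable.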

  31. Multiple logistic regression: worked example • 321 children with P. falciparum malaria at a hospital in rural Ghana. • 27 (8.41%) of the children died • Were any clinical variables associated with mortality? With thanks to Andrew McGovern

  32. Correlation ≠ Causation!

  33. From Tyler Vigen - http://www.tylervigen.com/spurious-correlations

  34. “Establishing the why” • Strength • Consistency • Specificity • Temporality • Biological gradient • Plausibility • Coherence • Experiment • Analogy. Source: A. Bradford Hill, “The Environment and Disease: Association or Causation?”, Proceedings of the Royal Society of Medicine, 1965

  35. Data visualisation :-D https://www.youtube.com/watch?v=9mnVWJpMhuE

  36. The visual display of quantitative information • Show the data • Avoid distorting the data • High “data-ink ratio” • Avoid “chartjunk” • Data richness • Integrated Micro/ Macro-trend data displays • Neuropsychology of human perception Edward Tufte

  37. The visual display of quantitative information Avoid “Chartjunk” Edward Tufte
