Discovering patterns in data: correlation and visualization www.gapminder.org Will Hamilton April 2017
Contents PART 1: Correlation • What is correlation? • Types of correlation • Linear regression • Multiple linear regression • Linear regression diagnostics • Non-linear correlation • Logistic regression • Correlation and causation PART 2: Data visualisation • Rules for good graphing • Bad graphs Conclusions • The age of data!
Dependence • Independence • The occurrence of one event does not affect the probability of the other occurring • Dependence • Any statistical relationship between two random variables
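A minimal sketch of the independence definition, assuming a two-dice example that is not from the slides: two events are independent when P(A and B) = P(A) × P(B). The event names below are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice (illustrative example, not from the slides)
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate over (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_even(o):
    return o[0] % 2 == 0

def second_high(o):
    return o[1] > 4

def total_high(o):
    return o[0] + o[1] >= 10

# Independent: the first die tells us nothing about the second
p_joint_independent = prob(lambda o: first_even(o) and second_high(o))
independent = p_joint_independent == prob(first_even) * prob(second_high)

# Dependent: a high second die makes a high total more likely
p_joint_dependent = prob(lambda o: second_high(o) and total_high(o))
dependent = p_joint_dependent != prob(second_high) * prob(total_high)

print(independent, dependent)
```

Exact fractions are used so that the equality check is not affected by floating-point rounding.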
Describing linear relationships: • Pearson Product-Moment Correlation Coefficient • Linear regression
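As a sketch of how the Pearson coefficient is computed: it is the covariance of the two variables divided by the product of their standard deviations. The numbers below are invented toy values, not the dataset used in the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy values invented for illustration (weight in 1,000 lb, fuel economy in mpg)
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [30.0, 26.0, 24.0, 20.0, 17.0]

r = pearson_r(weights, mpg)
print(round(r, 3))
```

A value near -1 indicates a strong negative linear relationship, near +1 a strong positive one, and near 0 no linear relationship.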
Linear regression • “Least squares method” • Minimise the sum of the squared vertical distances (residuals) between each data point and the line
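The least squares idea can be sketched as follows for a single predictor, using the standard closed-form solution. The data are invented toy values and the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Toy values invented for illustration
xs = [2.0, 2.5, 3.0, 3.5, 4.0]
ys = [30.0, 26.0, 24.0, 20.0, 17.0]

slope, intercept = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
# Any other line, here with a slightly nudged slope, gives a larger sum
nudged = sum_squared_residuals(xs, ys, slope + 0.5, intercept)
print(slope, intercept, best, nudged)
```

The comparison with the nudged line illustrates what "least squares" means: no other straight line gives a smaller sum of squared residuals for these points.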
y = -5.34x + 37 • r² = 0.74 • P = 1.29 × 10⁻¹⁰
Explaining linear regression in words y = -5.34x + 37 For every 1,000 lb increase in car weight (x), fuel economy (y) decreases by 5.34 miles per gallon.
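A small sketch of reading the fitted equation as a prediction rule. The slope and intercept are the ones on the slide; the function name and the 3,000 lb and 4,000 lb example weights are assumptions for illustration.

```python
SLOPE = -5.34      # mpg change per 1,000 lb of car weight (from the slide)
INTERCEPT = 37.0   # predicted mpg at a weight of zero (from the slide)

def predict_mpg(weight_klb):
    """Predicted miles per gallon for a car weighing `weight_klb` thousand pounds."""
    return SLOPE * weight_klb + INTERCEPT

mpg_3000 = predict_mpg(3.0)
mpg_4000 = predict_mpg(4.0)
# Adding 1,000 lb changes the prediction by exactly the slope
difference = mpg_4000 - mpg_3000
print(round(mpg_3000, 2), round(mpg_4000, 2), round(difference, 2))
```

The intercept is an extrapolation to a car of zero weight, so only the slope has a direct practical interpretation here.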
Explaining linear regression in words r² = 0.74 Our model (y = -5.34x + 37) explains 74% of the variance in miles per gallon. The remaining 26% of the variance in car miles per gallon is not accounted for by our model.
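The "proportion of variance explained" reading of r² can be sketched as one minus the ratio of residual variation to total variation. The data are invented toy values, and the line used is the least-squares line for those toy points, not the model on the slide.

```python
def r_squared(xs, ys, slope, intercept):
    """Proportion of the variance in ys explained by the line y = slope*x + intercept."""
    mean_y = sum(ys) / len(ys)
    # Variation left over after fitting the line
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Toy values invented for illustration; (-6.4, 42.6) is their least-squares line
xs = [2.0, 2.5, 3.0, 3.5, 4.0]
ys = [30.0, 26.0, 24.0, 20.0, 17.0]

r2 = r_squared(xs, ys, slope=-6.4, intercept=42.6)
unexplained = 1 - r2
print(round(r2, 4), round(unexplained, 4))
```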
Explaining linear regression in words P = 1.29 × 10⁻¹⁰ Statistical significance test, i.e. the probability of obtaining data as extreme as, or more extreme than, these if there were no association between mpg and weight
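The slides do not say which test produced this P value. As an illustration of the definition only, the sketch below uses a different technique, an exact permutation test: it shuffles y in every possible order and counts how often the correlation is at least as strong as the observed one. The data are invented toy values.

```python
import math
from itertools import permutations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def exact_permutation_p(xs, ys):
    """Two-sided P value: share of all orderings of ys with |r| >= the observed |r|."""
    observed = abs(pearson_r(xs, ys))
    as_extreme = 0
    total = 0
    for shuffled in permutations(ys):
        total += 1
        # Small tolerance so rounding error does not exclude exact ties
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Toy values invented for illustration
xs = [2.0, 2.5, 3.0, 3.5, 4.0]
ys = [30.0, 26.0, 24.0, 20.0, 17.0]

p_value = exact_permutation_p(xs, ys)
print(p_value)
```

Enumerating every ordering is only feasible for very small samples; with five points there are 120 orderings, so the smallest possible two-sided P value is limited by the sample size.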
Don’t confuse r² and m (the slope) • y = x • r² = 1.0 • P < 2.2 × 10⁻¹⁶ • y = 2x • r² = 1.0 • P < 2.2 × 10⁻¹⁶
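A sketch of why slope and r² are different things: the slope says how steep the line is, r² says how tightly the points follow it. The two exact lines mirror the slide; the noisy series is an invented example with the same slope as y = x but a lower r².

```python
def slope_and_r2(xs, ys):
    """Return (slope, r squared) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # For a single predictor, r squared equals the squared Pearson correlation
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]

slope_a, r2_a = slope_and_r2(xs, [x for x in xs])        # y = x
slope_b, r2_b = slope_and_r2(xs, [2 * x for x in xs])    # y = 2x
# Invented noisy values scattered around y = x
slope_c, r2_c = slope_and_r2(xs, [1.5, 1.0, 3.5, 3.0, 5.5])

print(slope_a, r2_a)
print(slope_b, r2_b)
print(slope_c, round(r2_c, 3))
```

Doubling the slope leaves r² at 1.0, while adding scatter lowers r² without changing the slope.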
y = 112x - 131 • r² = 0.78 • P = 1.22 × 10⁻¹¹
Explaining linear regression in words • y = 112x - 131 • For every 1,000 lb increase in car weight (x), engine displacement (y) increases by 112 cubic inches. • r² = 0.78 • Our model explains 78% of the variance in engine displacement • P = 1.22 × 10⁻¹¹ • The probability of getting data as extreme as, or more extreme than, those obtained if there were no association between car weight and engine displacement is very small
Multiple linear regression • What if multiple independent variables influence your dependent variable? • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ
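A minimal sketch of fitting Y = c + m₁x₁ + … + mₙxₙ by least squares, assuming the textbook normal-equations approach with a small hand-written solver. The helper names are illustrative, and the data are synthetic values generated from known coefficients so the result can be checked.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Least-squares coefficients [c, m1, ..., mn] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept c
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data generated from Y = 1 + 2*x1 + 3*x2 (not real measurements)
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple(rows, ys)
print([round(c, 6) for c in coefficients])
```

Statistical packages typically use more numerically stable decompositions than the normal equations; this version is only meant to show what is being solved.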
Size of bubble = total population living with HIV (PLHIV). Colour of bubble = WHO global region (blue = Sub-Saharan Africa). Countries with HIV prevalence >3% are named. There is no obvious linear correlation between HIV prevalence and a nation’s wealth
There is no obvious linear correlation between HIV prevalence and a nation’s human development index
There is no obvious linear correlation between HIV prevalence and a nation’s adult female literacy…
… In fact, among Sub-Saharan African countries there is a positive correlation, i.e. more literate countries tend to have higher HIV prevalence!
Among Sub-Saharan African countries, high-risk sex (as defined in DHS data, e.g. multiple concurrent sexual partners) is positively correlated with HIV prevalence
Multiple linear regression • Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ • Y = HIV prevalence • x₁ = log₁₀ GDP per capita • x₂ = Human Development Index • x₃ = Female literacy rate • x₄ = High-risk sex behaviours
Reality-checking your linear regressions Diagnostic plots: • Residuals vs Fitted • Normal Q-Q • Scale-Location • Residuals vs Leverage
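The quantities behind these diagnostic plots can be sketched for a single predictor as follows: fitted values, residuals, leverage and standardised residuals. The slides use a multiple regression; this one-predictor version, the function name and the toy data (with one deliberately extreme x value) are simplifying assumptions.

```python
import math

def regression_diagnostics(xs, ys):
    """Return (fitted, residuals, leverage, standardised residuals) for a one-predictor fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    # Leverage: how far each x lies from the mean of x, on a 0 to 1 scale
    leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    # Residual standard error with n - 2 degrees of freedom
    sigma = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    standardised = [e / (sigma * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
    return fitted, residuals, leverage, standardised

# Toy values invented for illustration; the last point has an extreme x value
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

fitted, residuals, leverage, standardised = regression_diagnostics(xs, ys)
most_influential = leverage.index(max(leverage))
print(most_influential, [round(h, 3) for h in leverage])
```

Plotting residuals against fitted values, standardised residuals against normal quantiles, and standardised residuals against leverage gives rough equivalents of the plots listed above.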
Points labelled in the plots: Lesotho, Swaziland, Burundi
Non-linear rank correlations • Things that are correlated but not in a straight-line (linear) relationship • Ranking correlations • Spearman’s rho (ρ) • Kendall’s tau (τ)
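A sketch of both rank correlations on an invented curved relationship (y = x³). Spearman's rho is computed as the Pearson correlation of the ranks; the Kendall function is the simple tau-a variant, which makes no correction for ties.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranked data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs; no tie correction."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# Invented example: perfectly monotonic but clearly not a straight line
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
tau = kendall_tau_a(xs, ys)
print(round(r, 3), rho, tau)
```

Because y always increases with x, both rank coefficients reach 1 even though the Pearson coefficient stays below 1.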
Levenson RM, et al. 2015. Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357.
Spearman’s rank correlation coefficient ρ = 0.97 P < 2.2 × 10⁻¹⁶
Logistic regression • Analyses binary categorical dependent variables (coded 0 or 1) • Odds of the dependent variable being 1 rather than 0 computed for every position along the independent variable • Odds Ratio (OR) calculated
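A minimal sketch of a one-predictor logistic regression, assuming the model log(odds of y = 1) = b0 + b1·x fitted by Newton-Raphson. The odds ratio for a one-unit increase in x is then exp(b1). The data are invented and the function names are illustrative.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit log(odds of y = 1) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of the log-likelihood
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the 2x2 information matrix
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Invented binary outcomes that become more common as x increases
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)

def odds(x):
    """Fitted odds that y = 1 (rather than 0) at position x."""
    return math.exp(b0 + b1 * x)

odds_ratio = math.exp(b1)  # multiplicative change in odds per one-unit increase in x
print(round(odds(4), 3), round(odds(5), 3), round(odds_ratio, 3))
```

An odds ratio above 1 means the odds of the outcome rise with x, below 1 that they fall, and exactly 1 that x makes no difference to the odds.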
Multiple logistic regression: worked example • 321 children with P. falciparum malaria at a hospital in rural Ghana. • 27 (8.41%) of the children died • Were any clinical variables associated with mortality? With thanks to Andrew McGovern
From Tyler Vigen - http://www.tylervigen.com/spurious-correlations
“Establishing the why” • Strength • Consistency • Specificity • Temporality • Biological gradient • Plausibility • Coherence • Experiment • Analogy The Environment and Disease: Association or Causation? A. Bradford Hill, Proceedings of the Royal Society of Medicine, 1965
Data visualisation :-D https://www.youtube.com/watch?v=9mnVWJpMhuE
The visual display of quantitative information • Show the data • Avoid distorting the data • High “data-ink ratio” • Avoid “chartjunk” • Data richness • Integrated micro/macro-trend data displays • Neuropsychology of human perception Edward Tufte
The visual display of quantitative information Avoid “Chartjunk” Edward Tufte