
Discovering patterns in data: correlation and visualization. www.gapminder.org. Will Hamilton April 2017. Contents. PART 2: Data visualisation Rules for good graphing Bad graphs. PART 1: Correlation What is correlation? Types of correlation Linear regression Multiple linear regression

## gapminder



**Discovering patterns in data: correlation and visualization**
www.gapminder.org
Will Hamilton, April 2017

**Contents**
PART 1: Correlation
• What is correlation?
• Types of correlation
• Linear regression
• Multiple linear regression
• Linear regression diagnostics
• Non-linear correlation
• Logistic regression
• Correlation and causation
PART 2: Data visualisation
• Rules for good graphing
• Bad graphs
Conclusions
• The age of data!

**Dependence**
• Independence: occurrence of one variable does not affect the probability of occurrence of the other
• Dependence: any statistical relationship between two random variables

**Describing linear relationships:**
• Pearson Product-Moment Correlation Coefficient
• Linear regression

**Linear regression**
• “Least squares method”
• Minimise the sum of squared differences between each datapoint and the line

**y = −5.34x + 37**
• r² = 0.74
• P = 1.29 × 10⁻¹⁰

**Explaining linear regression in words**
y = −5.34x + 37
For every 1000 lb increase in car weight (x), fuel economy (y) decreases by 5.34 miles per gallon.

**Explaining linear regression in words**
r² = 0.74
Our model (y = −5.34x + 37) explains 74% of the variance in our data. 26% of the variance in car miles per gallon is not accounted for in our model.

**Explaining linear regression in words**
P = 1.29 × 10⁻¹⁰
Statistical significance test, i.e. the probability of obtaining data as extreme or more extreme than this if there were no association between mpg and weight.

**Don’t confuse r² and m**
• y = x: r² = 1.0, P < 2.2 × 10⁻¹⁶
• y = 2x: r² = 1.0, P < 2.2 × 10⁻¹⁶

**y = 112x − 131**
• r² = 0.78
• P = 1.22 × 10⁻¹¹

**Explaining linear regression in words**
• y = 112x − 131: for every 1000 lb increase in car weight (x), engine displacement (y) increases by 112 cubic inches.
• r² = 0.78: our model explains 78% of the variance in the data.
• P = 1.22 × 10⁻¹¹: the probability of getting data as extreme or more extreme than those obtained, if there were no association between car weight and engine displacement, is very small.

**Multiple linear regression**
• What if multiple independent variables influence your dependent variable?
• Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ

**Size of bubble = total population living with HIV (PLHIV)**
Colour of bubble = WHO global region. Blue = Sub-Saharan Africa. Countries with HIV prevalence >3% are named.
There is no obvious/linear correlation between HIV prevalence and a nation’s wealth.

**There is no obvious/linear correlation between HIV prevalence and a nation’s human development index**

**There is no obvious/linear correlation between HIV prevalence and a nation’s adult female literacy…**

**… In fact, among Sub-Saharan African countries there is a positive correlation; i.e. more literate countries tend to have higher HIV prevalence!**

**Among Sub-Saharan African countries, high risk sex (defined by DHS data, e.g. multiple concurrent sexual partners) is positively correlated with HIV prevalence**

**Multiple linear regression**
• Y = c + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ
• Y = HIV prevalence
• x₁ = log₁₀ GDP per capita
• x₂ = Human Development Index
• x₃ = Female literacy rate
• x₄ = High risk sex behaviours

**Reality-checking your linear regressions**
Diagnostic plots:
• Residuals vs Fitted
• Normal Q-Q
• Scale-Location
• Residuals vs Leverage

**Lesotho, Swaziland, Burundi**

**Non-linear rank correlations**
• Things that are correlated but not in straight line (linear) relationships
• Ranking correlations
• Spearman’s rho (ρ)
• Kendall’s tau (τ)

**Levenson RM, et al. 2015.** Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357.

**Spearman’s rank correlation coefficient**
ρ = 0.97
P < 2.2 × 10⁻¹⁶

**Logistic regression**
Analyses of binary categorical dependent variables
• Odds of dependent variable being 0 or 1 computed for every position along the independent variable
• Odds Ratio (OR) imputed

**Multiple logistic regression: worked example**
• 321 children with P. falciparum malaria at a hospital in rural Ghana.
• 27 (8.41%) of the children died
• Were any clinical variables associated with mortality?
With thanks to Andrew McGovern

**From Tyler Vigen**
http://www.tylervigen.com/spurious-correlations

**“Establishing the why”**
• Strength
• Consistency
• Specificity
• Temporality
• Biological gradient
• Plausibility
• Coherence
• Experiment
• Analogy
The Environment and Disease: Association or Causation? A. Bradford Hill, Proceedings of the Royal Society of Medicine, 1965

**Data visualisation :-D**
https://www.youtube.com/watch?v=9mnVWJpMhuE

**The visual display of quantitative information**
• Show the data
• Avoid distorting the data
• High “data-ink ratio”
• Avoid “chartjunk”
• Data richness
• Integrated micro/macro-trend data displays
• Neuropsychology of human perception
Edward Tufte

**The visual display of quantitative information**
Avoid “chartjunk”
Edward Tufte
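
Looking back at the correlation slides, the least squares method, the "Don't confuse r² and m" slide and Spearman's rho can be illustrated with a small sketch. This is not the presenter's code: the data are synthetic, and the function names (`pearson_r`, `least_squares`, `ranks`, `spearman_rho`) are hypothetical helpers written only for illustration, using nothing beyond the Python standard library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope m and intercept c of the line y = m*x + c that minimises
    the sum of squared differences between each datapoint and the line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            out[order[k]] = shared
        start = end + 1
    return out


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson_r(ranks(xs), ranks(ys))


# Synthetic data (an assumption for illustration, not the slide datasets).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Two perfect lines with different slopes: m differs, r squared does not.
for label, ys in [("y = x", list(xs)), ("y = 2x", [2 * x for x in xs])]:
    m, c = least_squares(xs, ys)
    print(label, "m =", m, "c =", c, "r2 =", pearson_r(xs, ys) ** 2)

# A monotonic but non-linear relationship: the ranks agree perfectly,
# so rho reaches 1 while Pearson's r stays below 1.
cubic = [x ** 3 for x in xs]
print("Pearson r:", pearson_r(xs, cubic), "Spearman rho:", spearman_rho(xs, cubic))
```

In this sketch both straight lines give r² = 1 even though their slopes differ, which is the distinction the "Don't confuse r² and m" slide draws: m describes how steep the relationship is, while r² describes how much of the variance the line explains. The cubic example shows why a rank correlation can be the better summary when the relationship is monotonic but not linear.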
