
Data Visualization with ggplot: A Comprehensive Guide

Learn the grammar of graphics, ggplot syntax, data-aes-geoms, facets & themes, scales, colors, labels & legends, stats, and ggplot extension packages for effective data visualization using ggplot in R.



Presentation Transcript


  1. R for Epi Workshop, Module 3: Data visualization with ggplot. Mike Dolan Fliss, MSW, MPS, PhD Candidate in Epidemiology, UNC Gillings School of Global Public Health

  2. Module Outline • The grammar of graphics • ggplot syntax • Data-aes-geoms • Facets & Themes • Preparing data to Plot • Scales • Colors • Labels & legends • Stats • ggplot extension packages

  3. Same MO as before: Code along! • Recommend typing it out • Tab-complete is your speed and typo friend • Copy and paste always available • Mostly follow-along, with a little bit of free-work at the end of the module. A little bit of live coding. • Minimum to start:
      library(tidyverse)
      setwd("D:/<YOUR DIR HERE>/R Workshop/")
      births_sm = readRDS("births_final.rds")

  4. To start: New script, load data, headers
      #........................................
      # IPH Workshop: R for Epi
      # Examples for modules 3 and 4
      # Mike Dolan Fliss
      #........................................
      library(tidyverse)
      setwd("D:/<YOUR FILE PATH HERE!>/R Workshop/")
      # ^ Above is Mike’s – make your own!
      births_sm = readRDS("births_final.rds")
      #........................................
      # Module 3 ####
      #........................................
      # _ Slide: ggplot components ####

  5. To start: New script, load data, headers

  6. 1. Grammar of Graphics Underlying theory

  7. Grammar of Graphics • A consistent, theoretical framework for describing data visualizations. • Based on a book (1999) by Leland Wilkinson • Helps you think about and plan graphics outside of R… but implemented deeply in R’s ggplot2 package. • May also be familiar if you’ve worked in Tableau – Wilkinson now works for Tableau

  8. Grammar of graphics components https://medium.com/tdebeus/think-about-the-grammar-of-graphics-when-improving-your-graphs-18e3744d8d18

  9. Grammar of graphics components • data • aesthetic mapping • geometric object • statistical transformations • scales • coordinate system • position adjustments • faceting

  10. Anatomy of a ggplot • data • aesthetic mapping • geometric object • statistical transformations • scales • coordinate system • position adjustments • faceting

  11. 2. ggplot2 syntax An implementation of gg!

  12. ggplot components (are the same!) • data • aesthetic mapping • geometric object • statistical transformations • scales • coordinate system • position adjustments • faceting

  13. ggplot components

  14. ggplot components Or minimally,
      ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
      e.g., using the mpg dataset:
      ggplot(mpg) +
        geom_point(aes(displ, hwy, color = class))
      * Read the “+” here as “and” … And a note on overloaded operators!
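
One way to see the "+ means and" reading in practice is to build a plot in two steps and count its layers. This is a minimal sketch using the mpg dataset that ships with ggplot2; the object names p_base and p_points are illustrative and not part of the workshop script.

```r
library(ggplot2)

# Start with only the data: a plot object with no layers yet.
p_base <- ggplot(data = mpg)

# "+" is overloaded for ggplot objects: it returns a new plot
# with one more layer, and leaves p_base unchanged.
p_points <- p_base +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

length(p_base$layers)    # layers before adding the geom
length(p_points$layers)  # layers after adding the geom
```

Printing p_points (or typing its name at the console) draws the scatterplot; nothing is drawn until the object is printed.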

  15. Versus base / qplot shorthand
      hist(mpg$cyl)
      ggplot(mpg) + geom_histogram(aes(x = cyl))
      ggplot(mpg, aes(cyl)) + geom_histogram()
      qplot(mpg$cyl, geom = "histogram")

  16. 3. Data, aes, geoms The flexible power of aesthetic mappings

  17. data • ggplot likes “long”, well-structured data.frames • ggplot “stats” can make quick transformations • dplyr will help with complicated transformations • tidyr will help go from wide to long • Extensions allow ggplot to understand other kinds of data (e.g. maps, network data)
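
To illustrate the wide-to-long step, here is a small sketch with a made-up table (the county names and percentages are invented for illustration only). It uses tidyr's pivot_longer(), the newer equivalent of the gather() call that appears later in this module.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide table: one row per county, one column per measure.
wide_df <- data.frame(
  county      = c("A", "B", "C"),
  pct_preterm = c(10.2, 12.5, 9.8),
  pct_pnc     = c(85.1, 78.4, 90.3)
)

# Reshape to long: one row per county-measure pair.
long_df <- pivot_longer(
  wide_df,
  cols      = c(pct_preterm, pct_pnc),
  names_to  = "measure",
  values_to = "percent"
)

# With long data, a single aesthetic (fill) can distinguish the measures.
p_long <- ggplot(long_df, aes(county, percent, fill = measure)) +
  geom_col(position = "dodge")
```

In the wide layout there is no single column to map to fill, which is why ggplot tends to be easier to use after reshaping.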

  18. geoms …and many more in sister packages. Like other key packages, a cheat sheet is built into R.

  19. Aesthetic Mappings • Data: numbers & factors (characters coerced), e.g. meduc, mage, cigdur, wksgest, preterm_f, pnc5_f, county_name, raceeth_f …and more • Aesthetics: x, y, color (name, rgb), fill, size, linetype (int or name), alpha, height, width, shape, angle …and more • Geometry: x and y are the minimum for geom_point(); the rest are defaulted • https://cran.r-project.org/web/packages/ggplot2/vignettes/ggplot2-specs.html

  20. Cheatsheets Cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf ... Or right in RStudio! (One more benefit of using current RStudio IDE version)

  21. Sampling for prototyping speed Graphs can take a few moments, and for prototyping we want speed. Let’s use some dplyr to generate a sample to play with, and convert it to a dplyr tibble for the dataset viewing benefits.
      set.seed(1) # for replicability
      table(complete.cases(births_sm)) # test how it works
      births_10k = births_sm %>%
        filter(complete.cases(.)) %>% # dot syntax
        as_tibble() %>%
        sample_n(10000)
      births_10k %>% head(2) # helpful when doing EDA

  22. Basic geometries and mappings • Let’s create a scatterplot of wksgest and mage using ggplot and geom_point.
      ggplot(births_10k, aes(mage, wksgest))+
        geom_point()

  23. Basic geometries and mappings D’oh! Overplotting! Use the geom_jitter() geometry instead.
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter()

  24. Basic geometries and mappings Let’s try colors. Map cigdur to color. That’s it! ggplot(births_10k, aes(mage, wksgest, color=cigdur))+ geom_jitter()

  25. Basic geometries and mappings Fancier: change color to preterm_f, change the point character to a “.” and alpha to 0.5. Note the global aes!
      ggplot(births_10k, aes(mage, wksgest, color=preterm_f))+
        geom_jitter(pch=".", alpha=0.5)
      # ^ Typical chained spacing in last two examples, like dplyr

  26. Aesthetic inheritance Subsequent geometric layers will inherit the aesthetic mappings from the original ggplot() call unless they’re overwritten. Meaning these are equivalent:
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter()
      ggplot(births_10k)+
        geom_jitter(aes(mage, wksgest)) #^ equivalent

  27. Aesthetic inheritance 2 …but there’s good reason to be intentional about this – like when we want multiple geometries to use the same mappings
      ggplot(births_10k, aes(mage, wksgest, color=cigdur))+
        geom_jitter(alpha=0.1) +
        geom_smooth(method="lm")
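
The difference between a global and a layer-level mapping can be sketched with the built-in mpg dataset (a stand-in for births_10k; the names p_global and p_local are illustrative). When color is mapped in ggplot(), every layer inherits it, so the smoother is fit once per group. When color is mapped only inside geom_point(), the smoother sees no grouping and fits a single line.

```r
library(ggplot2)

# Color mapped globally: points AND smoother are split by drv.
p_global <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Color mapped only in the point layer: one smoother for all the data.
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Number of groups the smoother layer (layer 2) was computed for.
n_groups_global <- length(unique(layer_data(p_global, 2)$group))
n_groups_local  <- length(unique(layer_data(p_local, 2)$group))
```

Which version is right depends on the question: per-group trends or one overall trend.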

  28. Connecting dplyr to ggplot While we’re at it, we may want to send dplyr output (without saving) directly into ggplot. Example:
      ggplot(births_10k, aes(mage, wksgest, color=cigdur))+
        geom_jitter(alpha=0.1) +
        geom_smooth() # smooth is unhappy. Why?
      births_10k %>%
        filter(cigdur %in% c("Y", "N")) %>% # Note new logical operator, %in%!
        ggplot(aes(mage, wksgest, color=cigdur))+
        geom_jitter(alpha=0.1) +
        geom_smooth()
      # Sends the dplyr'ed tibble / data.frame into
      # the first param of ggplot... the data param!
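
For readers new to %in%, this small sketch shows how it behaves on a toy vector. The values "U" and NA are hypothetical stand-ins for whatever other codes cigdur may contain; they are not taken from the births data.

```r
# Toy vector standing in for the cigdur column.
cigdur <- c("Y", "N", "U", NA, "N")

# %in% tests membership element by element and never returns NA.
keep_in <- cigdur %in% c("Y", "N")

# The equivalent chain of == comparisons propagates NA instead.
keep_eq <- cigdur == "Y" | cigdur == "N"

keep_in
keep_eq
```

Besides being shorter than a chain of == comparisons, %in% scales naturally when the list of values to keep grows.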

  29. Common Visual EDA* workflow: Exploratory data analysis • Pick some variables: mage, wksgest • Consider a geometry: e.g. geom_bin2d • Check the help file (F1, or type it in) • Peruse, maybe run the example code (can run right within help window) • Write our own!

  30. Common Visual EDA workflow: • Pick some variables: mage, wksgest • Consider a geometry: e.g. geom_bin2d • Check the help file (F1, or type it in) • Peruse, maybe run the example code (can run right within help window). • Write our own!

  31. Common Visual EDA workflow: • Pick some variables: e.g. mage, wksgest • Consider a geometry: e.g. geom_bin2d • Check the help file (F1, or type it in) • Peruse, maybe run the example code (can run right within help window). • Write our own!
      ggplot(births_10k, aes(mage, wksgest))+
        geom_bin2d()
      Note: Some blanks seem to be a function of our unassigned (so, default) bin width. We could do something like bins = max(births_10k$wksgest)-min(births_10k$wksgest) to fix this!
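
An alternative to computing a bin count is to set the bin width directly. The sketch below assumes both variables are whole numbers and uses simulated data (sim_births is made up, not the workshop dataset), so that each integer value gets its own bin on each axis.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for births_10k: integer-valued age and gestation weeks.
sim_births <- data.frame(
  mage    = sample(15:45, 2000, replace = TRUE),
  wksgest = sample(25:42, 2000, replace = TRUE)
)

# binwidth = c(x width, y width): one bin per integer value on each axis.
p_bins <- ggplot(sim_births, aes(mage, wksgest)) +
  geom_bin2d(binwidth = c(1, 1))
```

With unit-width bins, empty tiles reflect combinations that are truly absent from the data rather than an artifact of the default binning.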

  32. 4. Facets & Themes Small multiples and overall look and feel

  33. Facets (wrap/grid): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
        geom_smooth()+
        facet_wrap( ~ raceeth_f)

  34. Facets (grid/wrap): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
        geom_smooth()+
        facet_grid( ~ raceeth_f)

  35. Facets (grid/wrap): small multiples • Facets take an R formula object (e.g. . ~ x, y ~ x) and split your graph into small multiples based on that. • Can also “free the scales” so they aren’t shared across plots with scales = "free_x" and/or "free_y"
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
        geom_smooth()+
        facet_grid(methnic ~ mrace)
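
The facet slides mention freeing the scales without showing it. Here is a minimal sketch on the built-in mpg dataset (chosen so it runs without the births data; p_free and p_grid are illustrative names).

```r
library(ggplot2)

# scales = "free_y": each panel gets its own y-axis range,
# while the x-axis stays shared across panels.
p_free <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ drv, scales = "free_y")

# facet_grid() with a two-sided formula: rows ~ columns.
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(year ~ drv)
```

Free scales make within-panel patterns easier to see, at the cost of making panels harder to compare with each other.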

  36. Themes • Change the theme with theme_NAME(), e.g. theme_minimal(). • Can define your own themes, or tweak existing ones. • See https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html for more themes. More on ggplot extensions later!
      ggplot(births_10k, aes(mage, wksgest))+
        geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+
        geom_smooth()+
        facet_grid( ~ raceeth_f)+
        theme_minimal()
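
As one possible way to "tweak existing ones", a complete theme can be combined with theme() overrides and stored for reuse. The name theme_workshop and the specific tweaks below are illustrative choices, not part of the workshop materials.

```r
library(ggplot2)

# A reusable theme object: theme_minimal() plus a few overrides.
theme_workshop <- theme_minimal(base_size = 12) +
  theme(
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

# Add it to any plot like a built-in theme.
p_themed <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  theme_workshop
```

Storing the theme in an object keeps the look consistent across every plot in a script.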

  37. 5. Preparing data to plot Thinking ahead

  38. Back to data Sometimes we need to transform our data to get at the question we have in mind. Saw with cigdur earlier. e.g. What is the association of preterm birth and maternal age?* * Again, we’re not approaching this from a stricter causal inference epidemiology frame in this workshop…

  39. Preparing to plot: preterm/mage Might first try this as a plot of the relationship of preterm birth and maternal age using:
      ggplot(births_sm)+
        geom_histogram(aes(x=mage, fill=preterm_f), position="fill")
      Maybe suggestive of some meaning… but YUCK!

  40. Preparing to plot: preterm/mage Let’s try again, but first create a new dataset to plot with. Use dplyr, group the original dataset by (mage), and summarize it into two variables: pct_preterm and n. Then create a plot of the relationship of preterm birth and maternal age.
      magepreterm_df = births_sm %>%
        group_by(mage) %>%
        summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100,
                  n=n())
      ggplot(magepreterm_df, aes(mage, pct_preterm))+
        geom_point(aes(size = n))+
        geom_smooth(aes(weight=n))
      # ^ Note the weight & size aesthetics!
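
The summarise() call relies on the fact that the mean of a logical vector is the proportion of TRUE values. The toy data below is invented to make that visible by hand. It also shows that n() counts every row in the group, including rows where preterm_f is missing.

```r
library(dplyr)

# Made-up data: three births at age 20 (one with missing status), two at age 30.
toy <- data.frame(
  mage      = c(20, 20, 20, 30, 30),
  preterm_f = factor(c("Preterm", "Term", NA, "Term", "Term"))
)

toy_summary <- toy %>%
  group_by(mage) %>%
  summarise(
    # TRUE/FALSE averaged as 1/0; na.rm drops the missing value first.
    pct_preterm = mean(preterm_f == "Preterm", na.rm = TRUE) * 100,
    # n() counts all rows in the group, missing or not.
    n = n()
  )
```

For age 20, the percentage is computed over the two non-missing rows while n is 3, which is worth keeping in mind when n is later used as a weight.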

  41. Preparing to plot: preterm/mage Could do this all in one pass, if we don’t need to keep working with the temporary dataset. Readable!
      births_sm %>%
        group_by(mage) %>%
        summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100,
                  n=n()) %>%
        ggplot(aes(mage, pct_preterm))+
        geom_point(aes(size = n))+
        geom_smooth(aes(weight=n))

  42. Other plots to explore Starting to speak to some of the reality of the data. How might we interpret these? What do you see?

  43. Plots & Interpretation
      ggplot(births_10k, aes(wksgest))+
        geom_histogram(bins=25)
      * Note we don’t (yet) know how to clean up axes, labels, legends, etc. Just aiming for the functional look of the plot this time (see upper right). But this is typical of iterative development – aim for visual communication, then hone details.

  44. Plots & Interpretation
      births_10k %>%
        filter(cigdur %in% c("N", "Y")) %>%
        ggplot(aes(wksgest, fill=cigdur))+
        geom_density(adjust=2, alpha=.5)+
        theme_minimal()

  45. Plots & Interpretation
      births_10k %>%
        filter(cigdur %in% c("N", "Y")) %>%
        ggplot(aes(cigdur, wksgest, color=cigdur, fill=cigdur))+
        geom_boxplot(alpha=.5)+
        facet_wrap(~pnc5_f)+
        theme_minimal()

  46. Plots & Interpretation
      county_stats = births_sm %>%
        group_by(county_name, cores) %>%
        summarise(pct_preterm = mean(preterm, na.rm=T)*100,
                  pct_earlyPNC = mean(pnc5, na.rm=T)*100,
                  n=n())
      ggplot(county_stats, aes(pct_earlyPNC, pct_preterm, size=n))+
        geom_point()+
        geom_smooth()

  47. Sidenote (other tools later)

  48. Plots & Interpretation
      births_sm %>%
        group_by(raceeth_f) %>%
        summarise(preterm = mean(preterm, na.rm=T),
                  pnc5 = mean(pnc5, na.rm=T)) %>%
        gather(measure, percent, -raceeth_f) %>%
        mutate(percent = percent * 100) %>%
        ggplot(aes(measure, percent, fill=raceeth_f, group=raceeth_f,
                   label=round(percent, 1))) +
        geom_bar(stat="identity", position="dodge")+
        geom_text(position=position_dodge(.9), aes(y=percent+3), hjust=0.5)

  49. Plots & Interpretation
      births_10k %>%
        filter(cigdur %in% c("N", "Y")) %>%
        ggplot(aes(wksgest, mage, color=cigdur))+
        geom_density_2d(alpha=.5)+
        theme_minimal()

  50. Worth a skim: • http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html • http://chartmaker.visualisingdata.com/
