630 likes | 749 Vues
This presentation focuses on the rich geospatial capabilities of R, covering how to import various geographical data formats and utilize free geographic resources. Learn to create and style maps, perform spatial queries, and analyze geographic data, including modeling techniques. Special attention is given to importing census and electoral data, as well as handling SpatialPolygonsDataFrames for visualizing geographical boundaries. This is an essential guide for anyone looking to learn the fundamentals of geospatial analysis using R.
E N D
Introduction toGeospatial Analysis in R SURF – 24 April 2012 Daniel Marlay
Synopsis • This month's talk is going to look at the geo-spatial capabilities of R. We'll look at how to import common geographical data formats into R and some of the free geographic data sources and map layers available. We'll then look at how to create maps in R using this data, and some of the ways to style it to display our data. We'll look at how R stores geographic data and how we can perform queries against that - for example identifying which points fall into a particular region. Finally, we'll take a brief look at modeling geospatial data and some of the issues to be aware of.
Introduction • There are extensive geospatial capabilities in R • I’ve just started to scratch the surface • This presentation will give a little bit of theory • Most of the content is a walk through of doing geospatial analysis in R • I’ve picked data sets that are freely available • Trying this yourself is the best way to learn • And maybe we’ll learn something about the way Australians vote…
R Geospatial Packages • sp – provides a generic set of functions, classes and methods for handling spatial data • rgdal – provides an R interface into the Geospatial Data Abstraction Library (GDAL) which is used to read and write geospatial data from R
Types of Geospatial Data • Vector data • Points • Lines • Areas • Bitmap • Often used for image data (e.g. aerial photos) • Needs to be registered to a coordinate system • “Labelled” data • Has geographic information, but needs to be matched before it can be used
Setting up the R Environment ## Set working directory to where the data is. Update as required if running this yourself setwd("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation"); ## Load the relevant libraries library(sp); # Basic R classes for handling geographic data library(rgdal); # Library for using the Geographic Data Abstraction Layer library(nlme); # Library that gives us generalised least squares
Read In Census Data (1/3) ## Read in and clean the census data (Note: a lot of this cleaning could be done more easily in Excel) EducationLevel <- read.csv("EducationData.csv",skip=6,na.strings=""); EducationLevel <- EducationLevel[c(-1,-2),c(-1,-27)]; # Remove leading and trailing blank columns and blank second row EducationLevel <- EducationLevel[-(97:100),]; # Remove trailing blank lines #### Create some useable column names EduDataCols <- paste(c(rep("Male",8),rep("Female",8),rep("Total",8)), rep(c("NotStated","InadDescr","Postgrad","GradDipCert","Bachelor","Diploma","Certificate","NA"),3), sep="."); colnames(EducationLevel) <- c("SED",EduDataCols);
Read In Census Data (2/3) #### Recode the data into character and numeric data to avoid weird errors from factors EducationLevel[,1] <- as.character(EducationLevel[,1]); for (col in EduDataCols) { EducationLevel[,col] <- as.numeric(as.character(EducationLevel[,col])); } #### Eyeball the data to make sure it is ok. summary(EducationLevel); head(EducationLevel,10); tail(EducationLevel,10);
Read In Electoral Data (1/2) ## Read in the electoral data ElectionResults <- read.csv("2011NSWElectionResults.csv"); #### Eyeball data to make sure it is ok summary(ElectionResults); head(ElectionResults); tail(ElectionResults);
Read In SED Geography (1/3) ## Read in the state electoral division boundaries (geography) and explore the SpatialPolygonsDataFrame class SED <- readOGR("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation\\Geographies","SED06aAUST_region"); #### Have an initial look at the SED data set that we've just read in summary(SED); plot(SED);
Examining the SpatialPloygonsDataFrame (1/2) #### SED is a SpatialPolygonsDataFrame, an S4 object. We can have a look at how it is constructed mode(SED); slotNames(SED); summary(SED@data); summary(SED@polygons); SED@plotOrder; SED@bbox; SED@proj4string;
Simple Mapping of SpatialPolygonsDataFrames (1/2) #### Let's now look at some more mapping, we've seen that we can plot all of Australia plot(SED[SED$STATE_2006 == "1",]); # Plot NSW plot(SED[SED$STATE_2006 == "1",],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); # Plot Sydney - xlim and ylim from google maps ;-) plot(SED[SED$STATE_2006 == "1",],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); # Plot Sydney and put on some electoral district names text(coordinates(SED[SED$STATE_2006 == "1",]),labels=(SED[SED$STATE_2006 == "1",])$NAME_2006,cex=0.5);
Thematic Mapping (1/8) ## Thematic mapping SED.NSW <- SED[SED$STATE_2006 == "1",]; # subset of SED for convenience #### Create a ThemeData data set with a summary of the data we are interested in - proportion of people with a tertiary education ThemeData <- data.frame(SED = as.character(EducationLevel$SED), PropTertiaryEd = (EducationLevel$Total.Postgrad + EducationLevel$Total.GradDipCert + EducationLevel$Total.Bachelor + EducationLevel$Total.Diploma + EducationLevel$Total.Certificate) / (EducationLevel$Total.Postgrad + EducationLevel$Total.GradDipCert + EducationLevel$Total.Bachelor + EducationLevel$Total.Diploma + EducationLevel$Total.Certificate + EducationLevel$Total.NA), stringsAsFactors=FALSE); hist(ThemeData$PropTertiaryEd); # Histogram of the proportions to work out the appropriate cut points ThemeData$PropTertiaryEdFact <- cut(ThemeData$PropTertiaryEd,c(0,0.25,0.3,0.35,0.4,0.5,1.0)); # Create a factor for the proportion variable levels(ThemeData$PropTertiaryEdFact) <- c("25% or Less","25% to 30%","30% to 35%","35% to 40%","40% to 50%","More than 50%");
Thematic Mapping (3/8) #### Display a thematic map for all of NSW bands <- length(levels(ThemeData$PropTertiaryEdFact)); pal <- heat.colors(bands); plot(SED.NSW,col=pal[ThemeData$PropTertiaryEdFact[match(SED.NSW$NAME_2006,ThemeData$SED)]]); # Note the use of match() to get the right rows legend("bottomright", legend=levels(ThemeData$PropTertiaryEdFact), fill=pal, title="Prop. with Tertiary Ed.",inset=0.01); #### Display a thematic map for Sydney plot(SED.NSW,col=pal[ThemeData$PropTertiaryEdFact[match(SED.NSW$NAME_2006,ThemeData$SED)]],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)); legend("bottomright", legend=levels(ThemeData$PropTertiaryEdFact), fill=pal, title="Prop. with Tertiary Ed.",inset=0.01);
Thematic Mapping (5/8) #### Now we'll add the election results to our ThemeData data set rownames(ElectionResults) <- as.character(ElectionResults$District); # Adding rownames allows us to index by them when matching ThemeData$PropGreenVote <- ElectionResults[ThemeData$SED,"GRN"] / ElectionResults[ThemeData$SED,"Total"]; # Create a green vote proportion variable hist(ThemeData$PropGreenVote,breaks=20); # Have a look at the distribution ThemeData$PropGreenVoteFact <- cut(ThemeData$PropGreenVote,c(0,0.05,0.06,0.08,0.1,0.15,1.0)); # Create a factor levels(ThemeData$PropGreenVoteFact) <- c("Less than 5%","5% to 6%","6% to 8%","8% to 10%","10% to 15%","More than 15%");
Thematic Mapping (7/8) #### And do some thematic maps of the election results bands <- length(levels(ThemeData$PropGreenVoteFact)); pal <- heat.colors(bands); plot(SED.NSW,col=pal[ThemeData$PropGreenVoteFact[match(SED.NSW$NAME_2006,ThemeData$SED)]]) legend("bottomright", legend=levels(ThemeData$PropPropGreenVoteFactFact), fill=pal, title="Prop. Voted Green",inset=0.01) plot(SED.NSW,col=pal[ThemeData$PropGreenVoteFact[match(SED.NSW$NAME_2006,ThemeData$SED)]],xlim=c(150.6,151.4),ylim=c(-34.3,-33.4)) legend("bottomright", legend=levels(ThemeData$PropGreenVoteFact), fill=pal, title="Prop. Voted Green",inset=0.01)
Geographic Querying (1/4) ## Demonstration of geographic querying #### Read in the Localities layer from the TOPO 2.5M data set Locs <- readOGR("C:\\Documents and Settings\\marlada\\My Documents\\AQUA Internal\\Thought Leadership\\201204 - SURF Geospatial Analysis Presentation\\Geographies\\localities","aus25lgd_p"); Mtns <- Locs[Locs$LOCALITY == "6",]; # Select only mountains plot(Mtns) #### Use the over function to find a list of mountains in SEDs with more than 10% green votes over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns); # Only gets one mountain per SED over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns,returnList=TRUE); # Gets all mountains, but in a less useful format do.call("rbind",over(SED.NSW[!is.na(ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)]) & ThemeData$PropGreenVote[match(SED.NSW$NAME_2006,ThemeData$SED)] > 0.10,], Mtns,returnList=TRUE)); # Gives us something a bit more useable