Predicting Water Quality Impaired Stream Segments using Landscape-scale Data and a Regional Geostatistical Model

Erin Peterson, Environmental Risk Technologies, CSIRO Mathematical & Information Sciences, St Lucia, Queensland

This research is funded by the U.S. EPA Science To Achieve Results (STAR) Program, Cooperative Agreement # CR-829095. Space-Time Aquatic Resources Modeling and Analysis Program (STARMAP). The work reported here was developed under STAR Research Assistance Agreement CR-829095 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. EPA does not endorse any products or commercial services mentioned in this presentation.

Collaborators • Dr. David M. Theobald, Natural Resource Ecology Lab, Department of Recreation & Tourism, Colorado State University, USA • Dr. N. Scott Urquhart, Department of Statistics, Colorado State University, USA • Dr. Jay M. Ver Hoef, National Marine Mammal Laboratory, Seattle, USA • Andrew A. Merton, Department of Statistics, Colorado State University, USA

Overview • Introduction • Background • Patterns of spatial autocorrelation in stream water chemistry • Predicting water quality impaired stream segments using landscape-scale data and a regional geostatistical model: A case study in Maryland, USA

Water Quality Monitoring Goals • Create a regional water quality assessment • Ecosystem Health Monitoring Program • Identify water quality impaired stream segments

Probability-based Random Survey Designs • Advantages • Statistical inference about population of streams over large area • Reported in stream kilometers • Disadvantages • Does not take watershed influence into account • Does not identify spatial location of impaired stream segments

Purpose Develop a geostatistical methodology based on coarse-scale GIS data and field surveys that can be used to predict water quality characteristics about stream segments found throughout a large geographic area (e.g., state)

[Figure: Hierarchy of spatial scale (grain), from coarse to fine, for aquatic and terrestrial systems. Landscape / River Network (coarse): climate, atmospheric deposition, geology, topography, soil type. Network Connectivity / Stream Network / Nested Watersheds: drainage density, confluence density, connectivity, flow direction, network configuration, vegetation type, basin shape/size, land use, topography. Segment / Segment Contributing Area: tributary size differences, network geometry, localized disturbances, land use/land cover. Reach / Riparian Zone: riparian vegetation type & condition, floodplain/valley floor width, cross-sectional area, channel slope, bed materials, large woody debris, overhanging vegetation, substrate. Microhabitat (fine): biotic condition, substrate type, overlapping vegetation, detritus, macrophytes, shading, detritus inputs.]

Geostatistical Modeling • Fit an autocovariance function to data • Describes relationship between observations based on separation distance • Distances and relationships are represented differently depending on the distance measure [Figure: semivariogram of semivariance versus separation distance, showing the nugget, sill, and range]
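The nugget, sill, and range named on this slide can be illustrated with a small sketch. It assumes an exponential model and one common parameterization in which the sill equals the nugget plus a partial sill; the function names and parameter values are hypothetical and are not taken from the presentation.

```python
import numpy as np

def exponential_semivariogram(h, nugget, partial_sill, range_):
    """Semivariance at separation distance h under an exponential model.

    Assumed parameterization: sill = nugget + partial_sill, and range_ is
    the exponential decay parameter (other conventions rescale it).
    """
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-h / range_))
    # By definition the semivariance is zero at zero separation.
    return np.where(h > 0, gamma, 0.0)

def exponential_covariance(h, nugget, partial_sill, range_):
    """Matching autocovariance: sill at h = 0, decaying toward 0 with distance."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, partial_sill * np.exp(-h / range_), nugget + partial_sill)

# Hypothetical values chosen only for illustration.
distances = np.array([0.0, 100.0, 300.0, 1000.0])
print(exponential_semivariogram(distances, nugget=1.0, partial_sill=9.0, range_=300.0))
```

For any positive distance the two functions sum to the sill, which is why fitting either one describes the same spatial dependence.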

Distance Measures & Spatial Relationships • Straight-line Distance (SLD) • Geostatistical models typically based on SLD [Figure: survey sites A, B, and C on a stream network]

Distance Measures & Spatial Relationships • Symmetric Hydrologic Distance (SHD) • Hydrologic connectivity: Fish movement [Figure: survey sites A, B, and C on a stream network]

Distance Measures & Spatial Relationships • Asymmetric Hydrologic Distance (AHD) • Longitudinal transport of material [Figure: survey sites A, B, and C on a stream network]

Distance Measures & Spatial Relationships • Challenge: • Spatial autocovariance models developed for SLD may not be valid for hydrologic distances • The resulting covariance matrix is not guaranteed to be positive definite
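Whether a candidate covariance matrix is usable can be checked numerically. The sketch below is one generic way to do this with NumPy (a Cholesky factorization succeeds only for symmetric positive definite matrices); the function name and the example matrices are hypothetical and not part of the presentation.

```python
import numpy as np

def is_positive_definite(C):
    """Return True if C is symmetric and positive definite."""
    C = np.asarray(C, dtype=float)
    if not np.allclose(C, C.T):
        return False
    try:
        # Cholesky raises LinAlgError when C is not positive definite.
        np.linalg.cholesky(C)
        return True
    except np.linalg.LinAlgError:
        return False

# A valid covariance matrix and an invalid one (eigenvalues 3 and -1).
print(is_positive_definite([[1.0, 0.5], [0.5, 1.0]]))
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))
```

The same check can be applied to a covariance matrix built from a hydrologic distance matrix before it is used for kriging.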

Asymmetric Autocovariance Models for Stream Networks • Weighted asymmetric hydrologic distance (WAHD) • Developed by Jay Ver Hoef • Moving average models • Incorporate flow volume, flow direction, and use hydrologic distance • Positive definite covariance matrices • Reference: Ver Hoef, J.M., Peterson, E.E., and Theobald, D.M., Spatial Statistical Models that Use Flow and Stream Distance, Environmental and Ecological Statistics. In Press.

Patterns of Spatial Autocorrelation in Stream Water Chemistry

Objectives • Evaluate 8 chemical response variables • pH measured in the lab (PHLAB) • Conductivity (COND) measured in the lab, μmho/cm • Dissolved oxygen (DO), mg/l • Dissolved organic carbon (DOC), mg/l • Nitrate-nitrogen (NO3), mg/l • Sulfate (SO4), mg/l • Acid neutralizing capacity (ANC), μeq/l • Temperature (TEMP), °C • Determine which distance measure is most appropriate • SLD • SHD • WAHD • More than one? • Find the range of spatial autocorrelation

Dataset • Maryland Biological Stream Survey (MBSS) Data • Maryland Department of Natural Resources • Maryland, USA • 1995, 1996, 1997 • Stratified probability-based random survey design • 881 sites in 17 interbasins

Study Area • Maryland, USA [Figure: map of the study area showing Baltimore, Annapolis, Washington D.C., and Chesapeake Bay]

Spatial Distribution of MBSS Data [Figure: map of MBSS survey sites]

GIS Tools • Automated tools needed to extract data about hydrologic relationships between survey sites did not exist! • Wrote Visual Basic for Applications (VBA) programs to: • Calculate watershed covariates for each stream segment • Functional Linkage of Watersheds and Streams (FLoWS) • Calculate separation distances between sites • SLD, SHD, asymmetric hydrologic distance (AHD) • Calculate the spatial weights for the WAHD • Convert GIS data to a format compatible with statistics software • FLoWS tools will be available on the STARMAP website: http://nrel.colostate.edu/projects/starmap [Figure: sites 1, 2, and 3 shown under SHD, AHD, and SLD]
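The three separation distances listed on this slide can be sketched on a toy network. This is not the FLoWS implementation (which was written in VBA for GIS data); it is a minimal Python illustration that assumes survey sites sit at network nodes and that each node stores its downstream neighbor and the stream length to it. The network, node names, and function names are hypothetical.

```python
import math

# Hypothetical network: node -> (downstream node, stream length in km).
NETWORK = {
    "A": ("C", 2.0),
    "B": ("C", 3.0),
    "C": ("outlet", 4.0),
    "outlet": (None, 0.0),
}

def path_to_outlet(node, network):
    """Map each node downstream of `node` to its distance along the stream."""
    dist, path = 0.0, {}
    while node is not None:
        path[node] = dist
        downstream, length = network[node]
        dist += length
        node = downstream
    return path

def straight_line_distance(p, q):
    """SLD: Euclidean distance between two coordinate pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def symmetric_hydrologic_distance(a, b, network):
    """SHD: distance along the network via the nearest common downstream node."""
    pa, pb = path_to_outlet(a, network), path_to_outlet(b, network)
    for node, dist_a in pa.items():  # ordered from `a` toward the outlet
        if node in pb:
            return dist_a + pb[node]
    return None  # the sites are on separate networks

def asymmetric_hydrologic_distance(upstream, downstream, network):
    """AHD: downstream distance, or None if the sites are not flow-connected."""
    return path_to_outlet(upstream, network).get(downstream)

print(symmetric_hydrologic_distance("A", "B", NETWORK))
print(asymmetric_hydrologic_distance("A", "C", NETWORK))
print(asymmetric_hydrologic_distance("A", "B", NETWORK))
```

Sites A and B lie on different tributaries, so they have a symmetric hydrologic distance but no asymmetric one, which is the difference in spatial neighborhood the later results depend on.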

Spatial Weights for WAHD • Proportional influence (PI): influence of each neighboring survey site on a downstream survey site • Weighted by catchment area: surrogate for flow volume • Calculate the PI of each upstream segment on the segment directly downstream • Segment PI of A = Watershed Area A / Watershed Area B • Calculate the PI of one survey site on another site • Flow-connected sites • Multiply the segment PIs [Figure: stream segments A, B, and C with the watersheds of segments A and B]

Spatial Weights for WAHD • Proportional influence (PI): influence of each neighboring survey site on a downstream survey site • Weighted by catchment area: surrogate for flow volume • Calculate the PI of each upstream segment on the segment directly downstream • Calculate the PI of one survey site on another site • Flow-connected sites • Multiply the segment PIs • Site PI = B * D * F * G [Figure: stream network with segments A to H, showing survey sites and stream segments]
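The two-step PI calculation can be sketched as follows. The sketch assumes that a segment's PI is its watershed area as a share of the total watershed area of all segments meeting at the same confluence; that reading, the toy network, the areas, and the function names are all assumptions for illustration and do not reproduce the network in the slide figure.

```python
# Hypothetical network: segment -> segment directly downstream (None = outlet).
DOWN = {"A": "C", "B": "C", "C": "E", "D": "E", "E": None}
# Hypothetical watershed areas (km^2) used as a surrogate for flow volume.
AREA = {"A": 30.0, "B": 10.0, "C": 40.0, "D": 60.0, "E": 100.0}

def segment_pi(seg, down, area):
    """PI of a segment on the segment directly downstream (assumed area share)."""
    siblings = [s for s, d in down.items() if d == down[seg]]
    return area[seg] / sum(area[s] for s in siblings)

def site_pi(upstream, downstream, down, area):
    """PI of a site on `upstream` on a site on `downstream`.

    Multiplies the segment PIs along the flow path; returns 0.0 when the
    two segments are not flow-connected.
    """
    pi, seg = 1.0, upstream
    while seg != downstream:
        if seg is None:
            return 0.0
        pi *= segment_pi(seg, down, area)
        seg = down[seg]
    return pi

print(site_pi("A", "E", DOWN, AREA))  # product of the PIs of segments A and C
print(site_pi("A", "B", DOWN, AREA))  # not flow-connected
```

Under this assumption the PIs of the segments entering a confluence sum to one, so the site PI can be read as the fraction of downstream flow attributable to the upstream site.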

Data for Geostatistical Modeling • Distance matrices • SLD, SHD, AHD • Spatial weights matrix • Contains flow dependent weights for WAHD • Watershed covariates • Lumped watershed covariates • Mean elevation, % Urban • Observations • MBSS survey sites

Geostatistical Modeling Methods • Validation Set • Unique for each chemical response variable • Initial Covariate Selection • 5 covariates • Model Development • Restricted model space to all possible linear models • 4 model sets: GLM, SLD, SHD, and WAHD

Geostatistical Modeling Methods • Geostatistical model parameter estimation • Maximize the profile log-likelihood function • Start from the log-likelihood function of the parameters (θ, β, σ²) given the observed data Z • Maximizing the log-likelihood with respect to β and σ² yields their maximum likelihood estimators (MLEs) • Both maximum likelihood estimators can be written as functions of θ alone • Derive the profile log-likelihood function by substituting the MLEs of β and σ² back into the log-likelihood function
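The profiling step can be sketched for a Gaussian model. The sketch assumes the covariance is σ²V(θ), where V is an exponential correlation matrix and θ holds a nugget fraction and a range; this parameterization, the function name, and the argument names are assumptions, not the presentation's own notation. Under that assumption the MLE of β is the generalized least squares estimate and the MLE of σ² is the scaled residual quadratic form, both functions of θ alone.

```python
import numpy as np

def profile_loglik(theta, Z, X, D):
    """Profile log-likelihood of theta = (nugget_fraction, range).

    Assumes Z ~ N(X beta, sigma2 * V(theta)) with an exponential
    correlation model (an illustrative choice). Returns the profile
    log-likelihood and the MLEs of beta and sigma2 implied by theta.
    """
    nugget_frac, range_ = theta
    n = len(Z)
    V = (1.0 - nugget_frac) * np.exp(-D / range_) + nugget_frac * np.eye(n)

    # GLS estimate of beta for this theta.
    Vi_X = np.linalg.solve(V, X)
    Vi_Z = np.linalg.solve(V, Z)
    beta = np.linalg.solve(X.T @ Vi_X, X.T @ Vi_Z)

    # MLE of sigma2 for this theta.
    resid = Z - X @ beta
    sigma2 = resid @ np.linalg.solve(V, resid) / n

    # Substitute both MLEs back into the Gaussian log-likelihood.
    _, logdet = np.linalg.slogdet(V)
    loglik = -0.5 * (n * np.log(2.0 * np.pi) + n * np.log(sigma2) + logdet + n)
    return loglik, beta, sigma2
```

Because only θ remains free, the maximization can be done with a coarse grid or a general-purpose optimizer over two parameters rather than over all regression coefficients at once.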

Geostatistical Modeling Methods • Covariance matrix for SLD and SHD models • Fit exponential autocorrelation function (C1) • C1 is the covariance based on the distance between two sites, h, given the autocorrelation parameter estimates: nugget, sill, and range • Covariance matrix for WAHD model • Fit exponential autocorrelation function (C1) • Hadamard (element-wise) product of C1 & square root of spatial weights matrix forced into symmetry
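The Hadamard-product step for the WAHD covariance can be sketched as below. It assumes the spatial weights matrix holds the site PI from an upstream site (row) to a downstream site (column), that symmetry is forced by taking the element-wise maximum of the matrix and its transpose, and that the nugget is added on the diagonal; those details and all names and values are assumptions for illustration.

```python
import numpy as np

def wahd_covariance(D, W, nugget, partial_sill, range_):
    """Covariance matrix from hydrologic distances D and PI weights W."""
    D = np.asarray(D, dtype=float)
    W = np.asarray(W, dtype=float)
    C1 = partial_sill * np.exp(-D / range_)   # exponential autocovariance
    W_sym = np.maximum(W, W.T)                # force the weights into symmetry
    np.fill_diagonal(W_sym, 1.0)              # a site has full weight on itself
    C = C1 * np.sqrt(W_sym)                   # Hadamard (element-wise) product
    return C + nugget * np.eye(len(D))

# Hypothetical example: sites A and B both flow to site C.
D = [[0.0, 5.0, 2.0], [5.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
W = [[0.0, 0.0, 0.75], [0.0, 0.0, 0.25], [0.0, 0.0, 0.0]]
print(wahd_covariance(D, W, nugget=0.1, partial_sill=1.0, range_=10.0))
```

Sites that are not flow-connected have a weight of zero, so their covariance is zero regardless of how close they are along the network.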

Geostatistical Modeling Methods • Model selection within model set • GLM: Akaike Information Corrected Criterion (AICC) • Geostatistical models: Spatial AICC (Hoeting et al., in press; http://www.stat.colostate.edu/~jah/papers/spavarsel.pdf), where n is the number of observations, p-1 is the number of covariates, and k is the number of autocorrelation parameters • Model selection between model types • 100 predictions: Universal kriging algorithm • Mean square prediction error (MSPE) • Cannot use AICC to compare models based on different distance measures • Model comparison: r2 for observed vs. predicted values
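The selection statistics named on this slide can be sketched as small helper functions. The spatial AICC below is written as the usual corrected AIC with p + k + 1 estimated parameters, which is my understanding of the Hoeting et al. formulation and should be checked against the cited paper; r2 is assumed to be the squared Pearson correlation between observed and predicted values. All function names are hypothetical.

```python
import numpy as np

def spatial_aicc(loglik, n, p, k):
    """Corrected AIC for a geostatistical model (assumed form).

    n: observations; p: regression coefficients including the intercept;
    k: autocorrelation parameters.
    """
    return -2.0 * loglik + 2.0 * n * (p + k + 1) / (n - p - k - 2)

def mspe(observed, predicted):
    """Mean square prediction error over the validation sites."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(r ** 2)
```

AICC values are only comparable among models fitted to the same data with the same distance measure, which is why MSPE and r2 on held-out sites are used to compare model types.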

Results • Summary statistics for distance measures • Spatial neighborhood differs • Affects number of neighboring sites • Affects median, mean, and maximum separation distance [Table: Summary statistics for distance measures in kilometers using DO (n=826). Asymmetric hydrologic distance is not weighted here.]

Results • Mean range values: SLD = 28.2 km, SHD = 88.03 km, WAHD = 57.8 km • Range of spatial autocorrelation differs: • Shortest for SLD • TEMP = shortest range values • DO = largest range values [Figure: range values by distance measure (SLD, SHD, WAHD); labeled values 180.79 and 301.76]

Results • Distance measures: • GLM always has less predictive ability • More than one distance measure usually performed well • SLD, SHD, WAHD: PHLAB & DOC • SLD and SHD: ANC, DO, NO3 • WAHD & SHD: COND, TEMP • SLD: SO4 [Figure: MSPE for GLM, SLD, SHD, and WAHD models]

Results • Predictive ability of models: • Strong: ANC, COND, DOC, NO3, PHLAB • Weak: DO, TEMP, SO4 [Figure: r2 for GLM, SLD, SHD, and WAHD models]

Discussion • Distance measure influences how spatial relationships are represented in a stream network • Site’s relative influence on other sites • Dictates form and size of spatial neighborhood • Important because… • Impacts accuracy of the geostatistical model predictions [Figure: spatial neighborhoods under SHD, WAHD, and SLD]

Patterns of spatial autocorrelation found at relatively coarse scale • Geostatistical models describe more variability than GLM • SLD, SHD, and WAHD represent spatial autocorrelation in continuous coarse-scale variables • More than one distance measure performed well • SLD never substantially inferior • Do not represent movement through network • Different range of spatial autocorrelation? • Larger SHD and WAHD range values • Separation distance larger when restricted to network

Discussion • Probability-based random survey design negatively affected WAHD • Design maximizes spatial independence of sites • Does not represent spatial relationships in networks • Validation sites randomly selected • Sample size = 881 • 244 sites did not have neighbors • Number of sites with ≤1 neighbor: 393 • Mean number of neighbors per site: 2.81 [Figure: frequency of sites by number of neighboring sites]

Discussion • WAHD models explained more variability as the number of neighboring sites increased • Not when neighbors had: • Similar watershed conditions • Significantly different chemical response values [Figure: Difference (O – E) versus number of neighboring sites (0 to 17) for WAHD and GLM]

Discussion • GLM predictions improved as number of neighbors increased • Clusters of sites in space have similar watershed conditions • Statistical regression pulled towards the cluster • GLM contained hidden spatial information • Explained additional variability in data with more neighbors [Figure: Difference (O – E) versus number of neighboring sites (0 to 17) for WAHD and GLM]

[Figure: Predictive ability of geostatistical models (r2, 0 to 1.0) versus scale of influential ecological processes (fine to coarse) for COND, SO4, ANC, PH, NO3, DOC, TEMP, and DO]

Conclusions • Spatial autocorrelation exists in stream chemistry data at a relatively coarse scale • Geostatistical models improve the accuracy of water chemistry predictions • Patterns of spatial autocorrelation differ between chemical response variables • Ecological processes acting at different spatial scales • SLD is the most suitable distance measure at regional scale at this time • Unsuitable survey designs • SHD: GIS processing time is prohibitive

Conclusions • Results are scale specific • Spatial patterns change with survey scale • Other patterns may emerge at shorter separation distances • Further research is needed at finer scales • Watershed or small stream network • New survey designs for stream networks • Capture both coarse and fine scale variation • Ensure that hydrologic neighborhoods are represented

Predicting Water Quality Impaired Stream Segments using Landscape-scale Data and a Regional Geostatistical Model: A Case Study In Maryland

Objective • Demonstrate how a geostatistical methodology can be used to complement regional water quality monitoring efforts • Predict regional water quality conditions • Identify the spatial location of potentially impaired stream segments

1996 MBSS DOC Data [Figure: map of 1996 MBSS DOC observations with a 20 km scale bar]

Methods Potential covariates

Methods Potential covariates after initial model selection (10)

Methods • Fit geostatistical models • Two distance measures: SLD and WAHD • Restricted model space to all possible linear models • 1024 models per set • 9 model sets • Parameter Estimation • Maximized profile log-likelihood function
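The count of 1024 models per set follows from taking every subset of the 10 candidate covariates, since 2^10 = 1024 (including the intercept-only model). A minimal sketch of that enumeration is shown below; the covariate names are placeholders, not the presentation's variables.

```python
from itertools import combinations

def all_linear_models(covariates):
    """Return every subset of covariates; each subset defines one linear model."""
    return [
        subset
        for size in range(len(covariates) + 1)
        for subset in combinations(covariates, size)
    ]

# Placeholder names standing in for the 10 candidate covariates.
candidates = [f"X{i}" for i in range(1, 11)]
models = all_linear_models(candidates)
print(len(models))
```

Each subset would then be fitted under a given distance measure and autocorrelation function and ranked within that model set.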

Methods • Model selection within distance measure & autocorrelation function • Spatial AICC (Hoeting et al., in press) • Model selection between distance measure & autocorrelation function • Cross-validation method using universal kriging algorithm • 312 predictions • MSPE • Model comparison: r2 for the observed vs. predicted values
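The cross-validation step can be sketched as leave-one-out prediction with a universal kriging predictor. The sketch assumes an exponential covariance with fixed parameters (they are not re-estimated for each held-out site) and writes the predictor in its generalized least squares form; the function names and arguments are hypothetical, and the presentation's own procedure may differ in these details.

```python
import numpy as np

def universal_kriging_predict(Z, X, C, x0, c0):
    """Predict at one new site.

    Z, X: observations and design matrix at the sampled sites
    C: covariance matrix among sampled sites
    x0: covariate vector at the new site
    c0: covariances between the new site and the sampled sites
    """
    Ci_X = np.linalg.solve(C, X)
    beta = np.linalg.solve(X.T @ Ci_X, Ci_X.T @ Z)   # GLS trend estimate
    resid = Z - X @ beta
    return x0 @ beta + c0 @ np.linalg.solve(C, resid)

def loo_cross_validation(Z, X, D, nugget, partial_sill, range_):
    """Leave-one-out predictions and their MSPE for fixed covariance parameters."""
    n = len(Z)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        C = partial_sill * np.exp(-D[np.ix_(keep, keep)] / range_)
        C += nugget * np.eye(n - 1)
        c0 = partial_sill * np.exp(-D[i, keep] / range_)
        preds[i] = universal_kriging_predict(Z[keep], X[keep], C, X[i], c0)
    return float(np.mean((Z - preds) ** 2)), preds
```

The resulting MSPE, together with r2 between the observed and predicted values, gives a comparison that does not depend on the likelihood and so can be made across distance measures.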

Results • SLD models performed better than WAHD • Exception: Spherical model • Best models: • SLD Exponential, Mariah, and Rational Quadratic models • r2 for SLD model predictions • Almost identical • Further analysis restricted to SLD Mariah model [Figure: MSPE by autocorrelation function (Mariah, Linear with Sill, Rational Quadratic, Spherical, Exponential, Hole Effect)]

Results • Covariates for SLD Mariah model: • WATER, EMERGWET, WOODYWET, FELPERC, & MINTEMP • Positive relationship with DOC: • WATER, EMERGWET, WOODYWET, MINTEMP • Negative relationship with DOC: • FELPERC

Cross-validation intervals for Mariah model regression coefficients • Model coefficients represent change in log10 DOC per unit of X • Cross-validation interval: 95% of regression coefficients produced by leave-one-out cross-validation procedure • Narrow intervals • Few extreme regression coefficient values • Not produced by common sites • Covariate values for the site are represented in observed data • Not clustered in space
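A cross-validation interval of this kind can be sketched by refitting the regression with each site left out and taking the central 95% of the resulting coefficients. The sketch uses ordinary least squares as a stand-in for the geostatistical fit, which is a simplification of the approach described here; the function name and arguments are hypothetical.

```python
import numpy as np

def loo_coefficient_intervals(X, y, level=0.95):
    """Leave-one-out regression coefficients and their central interval.

    Returns the (n, p) array of coefficients plus the lower and upper
    percentile bounds for each coefficient.
    """
    n = len(y)
    coefs = np.array([
        np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)[0]
        for i in range(n)
    ])
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(coefs, [tail, 100.0 - tail], axis=0)
    return coefs, lower, upper
```

Rows of the coefficient array that fall outside the interval point to the individual sites whose removal changes the fit most, which is how extreme coefficient values can be traced back to particular sites.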
