Advanced Techniques for Analyzing Massive Terrain Data Sets in Hydrology and Ecology

Modeling & Analyzing Massive Terrain Data Sets Pankaj K. Agarwal Duke University

STREAM Project Scalable Techniques for hi-Resolution Elevation data Analysis & Modeling In collaboration with: Lars Arge, Helena Mitasova Students: Andrew Danner,Thomas Molhave,Ke Yi

Photogrammetry0.76m v. accuracy (5ft contours) 1974 1995 1998 Lidar0.15m v. accuracy; altitude 700m and 2300m RTK-GPS0.10m v. accuracy 2001 2004 1999 Diverse elevation point data: density, distribution, accuracy

Constructing Digital Elevation Models Grid, TIN, Contour DEMs

Natural feature extraction Maps of topo parameters: computed simultaneously with interpolation using partial derivatives of the RST function, terrain features: combined thresholds of topo parameters Ma Extracted foredunes 1999-2004 using profile curvature and elevation threshold

slip faces: slope > 30deg dune crests: profile curvature threshold 1995 new slip face in 1999 2001 Tracking evolving features how to automatically track features that change some of their properties over time?

Hydrology Analysis • Flow Analysis • Watershed hierarchy

Challenge: Massive Data Sets • LIDAR • NC Coastline: 200 million points – over 7 GB • Neuse River basin (NC): 500 million points – over 17 GB • Output is also large • 10ft grid: 10GB • 5ft grid: 40GB

Approximation Algorithms • Exact computation expensive Many practical problems are intractable Multiple & often conflicting optimization criteria Suffices to find a near-optimal solution • Tunable Approximation algorithms • Tradeoff between accuracy and efficiency • User specifies tolerance

read/write head track read/write arm magnetic surface I/O-Bottleneck • Data resides in secondary memory • Disk access is 106 times slower than main memory access • Maximize useful data transferred with each access • Amortize large access time by transferring large contiguous blocks of data • I/O – not CPU – is often the bottleneck for processing massive data sets

Disk RAM CPU I/O-efficient Algorithms [AV88] • Traditional algorithms optimize CPU computation • Not sensitive to penalty of disk access • I/O model • Memory is finite • Data is transferred in blocks • Complexity measured in disk blocks transferred B M B ~ 2K-8K

Elevation Data to Watershed Hierarchies: A Pipeline Approach

Point Cloud to Grid DEM • Given a point cloud sampled from a bivariate function z=F(x,y) (elevation data) • Goal: construct a grid DEM for z • DEM Applications • Hydrology • Ecology • Viewsheds … • Many algorithms only work on grid data

Sub-meter resolution: improved structures representation 2004 lidar-based 30cm resolution DSM 1999 lidar 2m resolution DSM 2004 lidar: 0.5m resolution binning interpolated DSM Binning (within cell average) sufficient for ~1-3m DEMs but interpolation produces improved geometry representation and allows us to create higher resolution DEMs/DSMs

Interpolate O(N3) General Approach: Interpolation • Kriging, IDW, Splines, … z=F(x,y) N Points • Modern data sets can have millions of points • Direct interpolation does not scale

Improving Interpolation with Quad Trees • Segment initial point set using a quad tree • Each quad tree leaf has a <K points (10-100) • Interpolate each segment separately • N/K segments • O(K3) interpolation per segment • O(K2N) total running time • Linear in N • Boundary smoothness? • Use points from nearby segments

Overview of the Algorithm • Construct quad-tree on disk using bulk loading [AAPV 01] • For each quad-tree node, find points in neighboring segments • Arrange points on disk for sequential interpolation • Use the algorithm by Mitasova, Mitas, Harmon • Can use any other interpolation method

Block I/O Building an External Quad Tree K=2 • Build top few levels in memory • Buffer points in lower levels • Write buffers to disk when full

Building a TIN TIN: Triangulated Irregular Network Constrained DelaunayTriangulation Developed an I/O-efficient algorithm Builds TIN for NeuseRiver Basin (14GB) in7.5 hours

Future Work: Hierarchical DEMs • Considerable work in graphics and GIS communities on multiresolution hierarchies for terrains • Focuses on visualization • Most algorithms perform random memory access • Out-of-core simplification • I/O-efficient algorithms for terrain simplification (edge contraction method) • Feature preserving simplification • Preserves main river networks • Preserves visibility for most of the points

Coping with Noisy Data • Nearly all flow models assume [O’Callaghan and Mark 84, de Berg et al. 96] • water flows downwards • stops at a minimum

Denoising via Flooding[Jenson and Domingue 88, Arge et al. 03] • Pour water uniformly on terrain • Water level rises until steady-state

River Network After Flooding After flooding Sort(N) I/Os

Denoising via Flooding[Jenson and Domingue 88, Arge et al. 03] • Pour water uniformly on terrain • Water level rises until steady-state • Use topologicalpersistence (Edelsbrunner et al.) as importance measure to measure the importance of a minimum • Flood low-persistence minima, preserve high-persistence minima

Topological Persistence [Edelsbrunner et al. 02]

The Union-Find Problem • A universe of N elements: x1, x2, …, xN • Initially N singleton sets: {x1}, {x2 }, …, {xN} • Each set has a representative • Maintain the partition under • Union(xi, xj) : Joins the sets containing xi and xj • Find(xi) : Returns the representative of the set containing xi

Formulated as Batched Union-Find • On a terrain represented as a TIN • When reach • A minimum or maximum: do nothing • A regular point u: Issue union(u,v) for a lower neighbor v • A saddle u: let v and w be nodes from u’s two connected pieces in its lower link Issue: find(v), find(w), union(u,v), union(u,w) • Sequence of union/find operations can be computed in advance lower link

Classical Union-Find Algorithm representatives d h i p b j a f l z s r c k e g m n Union(d, h) : Find(n) : h h d f l d f l m n b j a b j a m path compression link-by-rank e g e g n

Complexity • O(N α(N)) for a sequence of N union and find operations [Tarjan 75] • α(•) : Inverse Ackermann function (very slow!) • Optimal in the worst case [Tarjan79, Fredman and Saks 89] • Batched (Off-line) version • Entire sequence known in advance • Can be improved to linear on RAM [Gabow and Tarjan 85] • Not possible on a pointer machine [Tarjan79]

Our Results • An I/O-efficient algorithm for the batched union-find problem using O(sort(N)) = O(N/B logM/B(N/B)) I/Os • Same as sorting • optimal in the worst case • A simpler algorithm using O(sort(N) log(N/M)) I/Os • Implemented • Apply to persistence computation, flooding • O(sort(N)) time • Other applications

I/O-Efficient Batched Union-Find • Two-stage algorithm • Convert to interval union-find • Compute an order on the elements s.t. each union joins two adjacent sets • Solve batched interval union-find • Each stage formulated as a geometric problem and solved using sort(N) I/Os

Experimental Results Traditional algorithm good as long as data fits in memory

Experimental Results:Neuse River Basin 0.5 billion points (14GB raw data), sampled to obtain smaller data sets On the entire data set: EM takes 5 hours, IM fails

Contour Trees

Previous Results • Directly maintain contours • O(N log N) time [van Kreveld et al. 97] • Needs union-split-find for circular lists • Do not extend to higher dimensions • Two sweeps by maintaining components, then merge • O(N log N) time [Carr et al. 03] • Extend to arbitrary dimensions

Join Tree and Split Tree[Carr et al. 03] Qualified nodes 9 9 9 9 8 8 8 8 7 7 7 7 6 6 6 6 5 5 5 5 4 4 4 4 3 3 3 3 2 2 1 1 1 1 Join tree Split tree Join tree Split tree

Final Contour Tree Hard to BATCH! 9 9 9 8 8 8 7 7 7 6 6 6 5 5 5 4 4 4 3 3 3 2 2 2 1 1 1 Join tree Split tree Contour tree

9 9 9 Now can BATCH! 8 8 8 u 7 7 u 7 u 6 6 6 v v v 5 5 5 w w w 4 4 4 3 3 3 2 2 2 1 1 1 Join tree Split tree Contour tree Another Characterization Let w be the highest node that is a descendant of v in join tree and ancestor of u in split tree, (u, w) is a contour tree edge

Experimental Results On the Neuse River Basin data set

Coping with Noise: Future Work • Constructing hydrologically conditioned DEMs • Formulate as an optimization problem • Using persistent homology • Integrating topology and geometry • Handling flat areas • Scalability remains a big issue • Handling anthropogenic features

Back to Pipeline: Flow Modeling

110 90 85 95 100 80 Sea From Elevation to River Networks • Where does water go? • From higher elevation to lower elevation • Single flow directions forms a tree

1 2 2 1 7 1 9 1 1 2 11 14 17 1 2 25 1 3 7 1 1 1 Drainage Area • How much area is upstream of each node? • Each node has initial drainage area (1) • Drainage area of internal nodes depends on drainage area of children 3 3 5

Drainage Area example (Panama)

IFSARE and SRTM based stream analysis Stream and watershed boundary extraction for Panama for on-going water quality research to provide more detailed and up-to-date data. SRTM - 7400x3600 DSM at 90m resolution for entire Panama, IFSARE - 10800x11300 DEM at 10m resolution for the Panama canal section Using r.terraflow - officially released in GRASS6.2 in 2006 - streams can be extracted in 3-4 hours, tile-based approach takes days SRTM DEM for entire Panama with main rivers extracted from DEM in 3 hr

Hierarchical Watershed Decomposition

Watershed Hierarchies • Decompose a river network into a hierarchy of hydrological units • All water in HU flows to a common outlet • Hierarchy provides tunable level of detail • Method used: Pfafstetter [VV99] • Want a solution scalable to large modern hi-res terrains

1 2 1 8 3 9 1 7 2 7 3 1 4 5 11 2 9 14 1 3 17 2 1 1 3 25 7 2 6 1 1 1 1 5 Pfafstetter • Find main river • Find four largest tributaries • Label basins/interbasins • Recurse until single path

72 75 7 71 73 74 25 24 21 2 27 26 22 23 Recurse

96 9 97 999 98 998 94 6 994 992 993 99 4 95 93 997 8 92 2 91 996 7 991 5 3 995 1 Example Watershed Boundaries

Advanced Techniques for Analyzing Massive Terrain Data Sets in Hydrology and Ecology