High-Performance Federated and Service-Oriented Geographic Information Systems

Ahmet Sayar (asayar@cs.indiana.edu) • Advisor: Prof. Geoffrey C. Fox

Outline • Motivations • Research Issues • Architecture: Federated Service-Oriented Geographic Information System • Performance enhancing designs - measurements and analysis • Conclusions

Geographic Information Systems (GIS) • GIS is a system for creating, storing, sharing, analyzing, manipulating and displaying geo-data and associated attributes. • Inherently requires federation (see the figure) • Autonomy for scalability, flexibility and extensibility • Distributed data access for geo-data resources (databases, digital libraries etc.) • Utilizing remote analysis, simulation or visualization tools. • Open Standards • OGC and ISO/TC-211

Motivations • Requirements for: • Interoperable service-oriented Geographic Information Systems • Necessity of sharing and integrating heterogeneous data and computation resources to produce knowledge • Uniform data access/query, display and analysis from a single access point • Responsive and interactive information systems • GIS applications require quick responses • Emergency early warning systems • Homeland security and natural disasters

Research Issues • Interoperability • Defining a component-based, service-oriented GIS data Grid framework • Adoption of Open Geographic Standards: data model and services • Applying Web Service principles to GIS data services • Integrating Web Service and Open Geographic Standards • Federation • Capability-based federation of GIS Web Service components • Unified data access/query and display from a single access point through integrated data-views • Addressing high-performance support for responsiveness • Streaming GIS Web Services and pre-fetching framework • Client-based caching • Parallel processing through attribute-based query decomposition

Service-oriented GIS: Web Service components and data-flow • WMS are data rendering services providing human-comprehensible data (binary map images) • WFS are data services providing data in the common data model GML (Geography Markup Language) • behaving as mediator and annotation services • WMS and WFS have their own types of capability metadata defined by Open Geographic specs • Inter-service communication is done through the “getCapability” service interface • UDDI-based registry services • Components are Web Services and all control goes through SOAP messages • XML-based query language (standard schema) • Built over: • Web Services standards (WS-I+) and • Open Geographic Standards (OGC and ISO/TC-211) • Consists of two types of online services: • Web Map Services (WMS) and Web Feature Services (WFS) • And two types of data: • Binary data: map images (provided by WMS) • Structured data: GML, with content (core data) and presentation (attribute and geometry elements) (provided by WFS) [Figure: relation of the components and data flow. The WMS (GML rendering) offers getCapability, getMap and getFeatureInfo and returns binary data; the WFS (mediator) offers getCapability, getFeature and DescribeFeatureType and returns GML; both are described through WSDL.]

Capability-based Federation of Standard GIS Web Service Components • Built over the proposed standard Web Service components and common data models • WMS, WFS, and GML • Federation is done by aggregating GIS Web Services’ capabilities metadata • Inspired by OGC’s cascading WMS • Unified data access/query/display from a single access point • Providing application-based hierarchical data definitions • Layer-based data and service (WMS and WFS) compositions • Capability is basically metadata about data+service: • Server’s information content and acceptable request parameter values [Figure: a Web Map Client (interactive map tools) connects through WSDL stubs to the Aggregating WMS (Federator), which collects Capability.xml files over HTTP, SOAP/WSDL and “REST” from WFS + Seismic Rec., WFS + State Bounds, …, WMS + OnEarth and Google Maps.]

Unified Data Query and Display over Integrated Data-views • Step-1: (Before run time, blue lines in the figure) The federator searches for standard WS components (WMS or WFS) providing the required data layers and organizes them in one aggregated capability file. • According to the standard WMS capability schema definition • Capability metadata are collected through the standard getCapability service interface • Interactive Map Tools get the aggregated capability metadata from the federator through service interface (1) • Step-2: (Run time, green lines) Users access/query and display data sources from a single access point (federator) over integrated data-views (multi-layered map images). • Some layers are provided as binary map images (layers from WMS), and some layers are rendered from GML provided by WFS. • Users interact with the system through generic Interactive Map Tools. • Enables users to query the map images based on their attributes and features • On-demand data access: there is no copying of the data at any intermediary place. Data are kept at their originating sources, which preserves consistency and autonomy. [Figure: browsers running the event-based Interactive Map Tools talk to the federator, which holds the aggregated capability and builds the integrated data-view (b over a) from (a) the NASA satellite layer at JPL, California, and (b) earthquake-seismic data at CGL, Indiana. Service calls: 1. GetCapability (metadata about data+service), 2. GetMap (get map data in a set of layers), 3. GetFeatureInfo (query the attributes of data). Events: move, zooming in/out, panning (drag-drop), rectangular region, distance calculation, attribute querying.]

Why Capability metadata • Web Services provide key low level capability but do not define an information or data architecture • These are left to domain specific capabilities metadata and data description language (GML). • Machine and human readable information • Enables easy integration and federation • Enables developing application based standard interactive re-usable tools • for data query display and analysis • Seamless data/access/query

High-performance Support for Responsive GIS Designs, measurements and analysis

Performance Investigation • Interoperability requirements bring up some compliance costs: • Common data model (GML) • Web Services (SOAP protocol for communication) • Approaches: Enhancing the GIS systems’ responsiveness • Data transfer and rendering • Streaming GIS Web Services (1) • Structured/annotated GML data rendering (2) • Federator-oriented approaches • Pre-fetching (3) • Client-based caching (4) • Query decomposition and parallel processing (5) • Testing with large scale Geo-science applications • Earthquake forecasting (PI), • Virtual California (VC) • Aim: Turning compliance requirements into competitiveness

(1) Streaming Data Flow Extension to GIS Web Services • The concern is the transfer of large XML-structured data • Responses are chunked into parts and streamed to the client as the answer is produced • Enables the client to render map images with partially returned data; no need to wait for the whole data set to be returned • Provides better performance results • Uses topic-based publish-subscribe messaging systems for exchanging SOAP messages and data payloads • SOAP is used for negotiation (line-3) with the standard “getFeature” request • Publisher information is returned as a triple (topic, IP, port) • Data transfer is done between publisher and subscriber [Figure: Web Services’ publish-find-bind triangle with a UDDI registry and WSDL descriptions (lines 1 and 2); the WMS (client, subscriber) sends getFeature to the WFS (server, publisher, backed by a DB) and receives (topic, IP, port) (line 3); GML is then streamed through the Narada Brokering server (line 4). Plot: measured average response time.]

(2) GML Data Processing • Processing XML data: parsing and rendering to create map images. • Two well-known approaches are document models (DOM) and push models (SAX). • We use the pull approach for XML processing: • Parses only what is asked for • No support for document validation (a major performance gain) • Checks only the structural correctness of the XML document • Doesn’t build a complete object model in memory (unlike DOM) • Contents are returned directly to the application from calls to the parser (unlike SAX) • Processing pipeline: [GML] -> [Parsing / Validation] -> [Geo-data extraction] -> [Plotting] (1GB allocated VM)
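The pull approach described on this slide can be sketched with Java's standard StAX API (javax.xml.stream). This is an illustrative sketch only: the slides do not say which pull parser was used, and the class name GmlPullSketch, the method extractCoordinates and the choice of the coordinates element are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the original system.
final class GmlPullSketch {

    /** Returns the text of every coordinates element, in document order. */
    static List<String> extractCoordinates(String gml) {
        List<String> coords = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(gml));
            // The application pulls events one by one; no tree is built (unlike DOM)
            // and no callbacks are registered (unlike SAX).
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "coordinates".equals(reader.getLocalName())) {
                    // Only the elements we ask for are read as text.
                    coords.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // The parser checks well-formedness only; there is no schema validation.
            throw new IllegalStateException("GML is not well-formed", e);
        }
        return coords;
    }
}
```

A renderer would convert each returned string into points and plot them as soon as they are available, which is what makes this style a natural fit for partially returned data.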

Analyzing Conventional OGC-GIS Systems and Baseline Performance Test • Common/straightforward approaches are characterized as • Stateless services • On-demand data access/rendering • Single-threaded and no caching • Systems developed with Open Geographic Standards have: • A high degree of interoperability but poor performance results • Sample data: earthquake seismic data served in GML • Test setup: (a) average response time, (b) map rendering time, (c) GML data capture time, (d) map image transfer time (average value is 48.53 msec), with a = b + c + d • The performance bottleneck is (c): query and data conversions and large XML data transfer.

(3) Pre-fetching • Performance bottleneck: • On-demand access to originating databases through WFS • Transmitting the XML-encoded GML representation of data • Solution: • Periodically fetching the whole data set before it is needed (so-called pre-fetching) • Databases are mapped to GML files and stored locally in the federator • Successive on-demand queries are served from the pre-fetched data (red curve) • Pros/Cons: • Removes the repeated resource-consuming query/data conversions at the WFS and associated databases • Gives the best performance for infrequently changing archived data • But might cause consistency problems, depending on the fetching period and the data update period [Figure: users’ on-demand access/query goes from the event-based Interactive Map Tools to the federator or WMS; a FETCH task runs periodically, gets the data in GML from the WFS (any data) through NB, maps the databases into GML files and stores them in temporary storage on the federator’s local file system. The stored GML represents all the data in the database and their associated attributes.]

Performance Comparison: Pre-fetching vs. Straightforward On-demand Fetching • Performance tests are done with earthquake seismic data records • Pre-fetching (independent of run time): • Earthquake data in the database is routinely mapped to GML and kept at the federator • Pre-fetched GML size is 127MB • Map rendering over pre-fetched GML data at run time takes 48.53 msec (included in the table values) • The response times are very close to each other in the pre-fetching case • Whatever the requested data size, every time a request comes the map is rendered from the same pre-fetched GML data stored at the federator • The dominating performance bottleneck is removed: no need to go through the WFS to get the data from the database • Threshold value: 500KB of data • For 100MB, pre-fetching is about 50 times faster • The larger the data size, the higher the performance gain [Figure: event-based Interactive Map Tools query the federator or WMS, which renders from the pre-fetched GML on the federator’s disk space. Plot: average response time.] Details of the on-demand performance analysis are given in Slide-14

Enhancements over On-demand Fetching and Rendering • Pre-fetching is very fast and a straightforward approach, BUT • Might cause inconsistency • Intermediary storage of data’s copies at federator, • On-demand (just in time) fetching enables • Keeping the data at their originating sources all the time • Scalability, autonomy and easy data maintenance. • It has performance bottleneck to access/query the federated heterogeneous data sources through WFS-based mediation. • Time consuming request/response conversion • (Request): GetFeature request to SQL, • (Response): Relational tables to GML • Transferring XML-encoded GML data • Enhancement Approaches: • Client-based caching • Parallel processing through attribute-based query decomposition.

(4) Client-based Caching • Makes stateless GIS Web Services stateful • Removes repeated time- and resource-consuming processes • Helps share the workload as equally as possible for the most efficient parallel processing • Each client is interested in different regions of the data sets (formulated and queried as bbox) and has a separate caching area allocated • Applies working-window and locality principles to map image rendering • Clients are differentiated by the client-assigned session-id parameter in the header of queries • The federator always keeps the most recently used data set for each client separately • Brings some overhead to maintain the working-window for each client

Brief Architecture
Server-side: create an identity card for each client, register it in the client table, and update it at every request from that client.
    class FormerRequest {
        String uuid;            /* unique user id */
        String bbox;            /* bounding box of the user's last request */
        double density;         /* data size falling into per unit square */
        Vector[] feature_data;  /* geometry elements of the last request */
    }
Client-side: set the identity in the message header.
    ClientWSStub binding = (ClientWSStub) new ServiceLocator().WMSServices(service_address);
    String sessionID = session.getId();  // uuid-1
    String channel_name = "getMapChannel";
    /* Add sessionID to the SOAP message's header */
    binding.setHeader(service_address, channel_name, sessionID);
    Map mymap = binding.getMap(request);
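The server-side bookkeeping described on this slide can be sketched as a table keyed by session id. This is a minimal sketch under stated assumptions: the class ClientCacheSketch, its methods, and the use of ConcurrentHashMap are hypothetical, and the feature data field is left out to keep the example short. Only the FormerRequest fields uuid, bbox and density come from the slide.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical federator-side client table; names other than FormerRequest are assumptions.
final class ClientCacheSketch {

    /** Identity card of one client: what it asked for last time. */
    static final class FormerRequest {
        final String uuid;     // unique user id taken from the SOAP header
        final String bbox;     // bounding box of the last request
        final double density;  // data size per unit square in that request

        FormerRequest(String uuid, String bbox, double density) {
            this.uuid = uuid;
            this.bbox = bbox;
            this.density = density;
        }
    }

    private final Map<String, FormerRequest> clientTable = new ConcurrentHashMap<>();

    /** Registers the client on first contact and overwrites its entry on every later request. */
    void update(String uuid, String bbox, double dataSize, double area) {
        clientTable.put(uuid, new FormerRequest(uuid, bbox, dataSize / area));
    }

    /** Returns the last request of this client, or null for a first-time client. */
    FormerRequest lookup(String uuid) {
        return clientTable.get(uuid);
    }
}
```

Keeping one entry per session id is what makes the otherwise stateless service stateful, at the cost of updating the table on every request.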

Comparing with Google’s Caching and Map Rendering Approach • Google-like map servers are fast because • They replace computation with storage • All images are pre-made and cut up into tiles • They formalize the accepted requests in terms of parameters, and the responses in terms of tile compositions • Google’s approach is good only for client-server based applications • The approach is static and central • In large-scale applications it is impossible to cache the whole data • There is always a limit on storage and computation capabilities • It can’t be applied to distributed dynamic data rendering and extensible applications • We do fine-grained dynamic information presentation, enabling attribute-based data querying and interaction from a single access point over integrated data-views in multi-layered map images • Client-based caching enables • Dynamic and flexible map rendering based on layer-specific attribute-based querying/rendering of data (such as magnitude values of earthquake seismic data) • Autonomy of data sources and easy data maintenance

(5) Parallel Processing through Attribute-based Query Decomposition • The attribute is the BBOX, which defines the ranges for the requested data in the main getMap query • Step-1: Cached data extraction and partitioning of the main query bbox • Step-2: Bbox is an attribute defining range queries: main query BBOX = bbox1 + bbox2 + … + bboxn • Step-3: The main thread distributes the GetFeature requests (GetFeature-1 … GetFeature-n) to the worker threads • Step-4: Worker threads capture the GML data in parallel • Questions: • How to find the most efficient partition number • How to assign the partitions to worker nodes [Figure: four possible positions (1 to 4) of the main query relative to the cached data, with regions labeled R1 to R4; worker threads send GetFeature requests to WFS instances backed by databases.]
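Step-1 (cached data extraction) can be illustrated as a rectangle difference: whatever part of the main query bbox is not covered by the client's cached bbox still has to be fetched. The sketch below is one possible way to compute that remainder and is an assumption, not the algorithm from the slides; the class CacheRemainderSketch and the method uncovered are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of cached data extraction.
final class CacheRemainderSketch {

    /** Axis-aligned box with lower-left (minX, minY) and upper-right (maxX, maxY). */
    static final class BBox {
        final double minX, minY, maxX, maxY;

        BBox(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /** Returns the parts of query not covered by cached, as up to four non-overlapping boxes. */
    static List<BBox> uncovered(BBox query, BBox cached) {
        List<BBox> rest = new ArrayList<>();
        // Intersection of the query with the cached region.
        double ix1 = Math.max(query.minX, cached.minX);
        double iy1 = Math.max(query.minY, cached.minY);
        double ix2 = Math.min(query.maxX, cached.maxX);
        double iy2 = Math.min(query.maxY, cached.maxY);
        if (ix1 >= ix2 || iy1 >= iy2) {
            rest.add(query);  // no overlap: the whole query must be fetched
            return rest;
        }
        if (query.minX < ix1) rest.add(new BBox(query.minX, query.minY, ix1, query.maxY)); // left strip
        if (ix2 < query.maxX) rest.add(new BBox(ix2, query.minY, query.maxX, query.maxY)); // right strip
        if (query.minY < iy1) rest.add(new BBox(ix1, query.minY, ix2, iy1));               // below the overlap
        if (iy2 < query.maxY) rest.add(new BBox(ix1, iy2, ix2, query.maxY));               // above the overlap
        return rest;
    }
}
```

An empty result corresponds to the best case, where the whole main query range falls inside the cached data and no GetFeature request is needed.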

Sample GetFeature requests to get feature data (GML) from WFS • Sample case: main query getMap bbox = -110,35,-100,40; partition number Pn = 5 • Partition list as bbox values: • GetFeature-1: -110,35,-100,36 • GetFeature-2: -110,36,-100,37 • GetFeature-3: -110,37,-100,38 • GetFeature-4: -110,38,-100,39 • GetFeature-5: -110,39,-100,40
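The partition list in this sample case can be reproduced with a few lines of code. The sketch below assumes equal-height strips along the second coordinate axis, which is what the sample values show; the class BboxPartitionSketch and the method partition are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that cuts a main query bbox into pn equal strips.
final class BboxPartitionSketch {

    /** Splits (minX, minY, maxX, maxY) into pn strips, each returned as {minX, lo, maxX, hi}. */
    static List<double[]> partition(double minX, double minY, double maxX, double maxY, int pn) {
        if (pn < 1) {
            throw new IllegalArgumentException("pn must be >= 1");
        }
        List<double[]> parts = new ArrayList<>();
        double step = (maxY - minY) / pn;
        for (int i = 0; i < pn; i++) {
            double lo = minY + i * step;
            // Pin the last strip to maxY so rounding never leaves a gap at the top.
            double hi = (i == pn - 1) ? maxY : minY + (i + 1) * step;
            parts.add(new double[] {minX, lo, maxX, hi});
        }
        return parts;
    }
}
```

Calling partition(-110, 35, -100, 40, 5) yields the five boxes listed on the slide, one per GetFeature sub-query.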

Challenge: Geo-Data Characteristics • Geo-data is unevenly distributed and variable-sized according to its location attributes • Ex. human population • Point data is described with a location attribute • (x, y) coordinates • Linestrings, polylines, polygons etc. are defined as sets of points • Data sets falling into a queried region are formulated as a bounding box (bbox) • Coordinates of a rectangle (a, b, c, d) • Need for advanced techniques for parallel processing and workload sharing! [Figure: a bbox with corners (a, b) and (c, d), divided into regions R1 to R4; panels (1) and (2) mark the split points ((a+c)/2, b) and (c, (b+d)/2).]

Partitioning Techniques for Query Decomposition • 1. Blind partitioning • For first-time queries • Uses a static/default partition number • Costs unnecessary partitioning overhead • Not efficient • 2. Smart partitioning (next slide) • A solution to sharing an unpredictable workload • Utilizes client-based caching and • the FormerRequest object giving session information • Utilizes locality principles and the working-window to find the most efficient partition number • Partition assignment: threads are assigned in round-robin fashion • Initially every worker node (WN) is assigned an equal number of partitions (share) • If the partition number (PN) cannot be divided evenly, the remaining partitions (rmg) are redistributed to the worker nodes
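The assignment rule at the end of this slide (equal share first, then the remainder handed out to the worker nodes) can be sketched as follows. The names pn, wn, share and rmg follow the slide's abbreviations; the class PartitionAssignmentSketch and the choice to hand the remainder to the first rmg nodes in order are assumptions.

```java
import java.util.Arrays;

// Hypothetical illustration of round-robin partition assignment.
final class PartitionAssignmentSketch {

    /** Returns counts[i] = number of partitions assigned to worker node i. */
    static int[] assign(int pn, int wn) {
        if (pn < 0 || wn < 1) {
            throw new IllegalArgumentException("need pn >= 0 and wn >= 1");
        }
        int share = pn / wn;  // equal share given to every worker node
        int rmg = pn % wn;    // partitions left over when pn is not divisible by wn
        int[] counts = new int[wn];
        Arrays.fill(counts, share);
        for (int i = 0; i < rmg; i++) {
            counts[i]++;      // hand out the remainder one partition at a time
        }
        return counts;
    }
}
```

With 6 worker WFS instances and 20 partitions, for example, two nodes receive 4 partitions and the other four receive 3.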

Smart Partitioning through Client-based Caching • Aim: determining the most efficient partition number to get the best performance from parallel processing • Based on the locality principles • Assumption: successive requests have similar data density • Data density is a measure of the data size falling into a unit square • Example: human population data: there is no population on the ocean, and urban areas have a higher population than rural areas • Brief algorithm: • Each layer on which partitioning will be done has a pre-defined threshold value • The threshold value helps find the largest area in the bbox to be assigned • The largest area changes depending on the density of the data requested last time • Density is obtained from the FormerRequest object for that client [Figure: partition-number calculation. The threshold is static and pre-defined; the density comes from the FormerRequest object kept by client-based caching; if the result is >= 2, then partition.]
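The slide gives the ingredients of the calculation (a per-layer threshold, the density from the client's last request, and the rule "if >= 2, then partition") but the exact formula did not survive in the text. The sketch below is one plausible reading and is an assumption: the largest area per partition is taken as threshold / density, and the partition number as the query area divided by that largest area, rounded up. The class and method names and the numbers in the usage note are made up for illustration.

```java
// Hypothetical reading of the smart partitioning rule; the formula is an assumption.
final class SmartPartitionSketch {

    /**
     * @param bboxArea  area of the main query bbox, in unit squares
     * @param density   data size per unit square, from the client's FormerRequest
     * @param threshold pre-defined data size one partition should carry for this layer
     * @return the partition number, or 1 when the query should not be partitioned
     */
    static int partitionNumber(double bboxArea, double density, double threshold) {
        if (bboxArea <= 0 || threshold <= 0) {
            throw new IllegalArgumentException("area and threshold must be positive");
        }
        if (density <= 0) {
            return 1;  // no data seen in the last request: nothing worth splitting
        }
        double largestArea = threshold / density;          // biggest area one partition may cover
        int pn = (int) Math.ceil(bboxArea / largestArea);  // pieces needed to stay under the threshold
        return pn >= 2 ? pn : 1;                           // partition only if the result is >= 2
    }
}
```

For instance, a 50 unit-square query over data that had a density of 100 per unit square, with a threshold of 500, gives a largest area of 5 and therefore 10 partitions, while a sparse region with density 5 is not partitioned at all.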

Performance Test Setup • NASA satellite maps are provided by a WMS from NASA’s OnEarth project • WFS servers, the federator server and the event-based interactive map tools are at the Indiana University Community Grids Lab (CGL) • Tests are done in a Local Area Network (LAN) using grid-farm machines: gf12,15,16,17,18,19.ucs.indiana.edu • Grid-farm machines have 2 quad-core Intel Xeon processors running at 2.33 GHz with 8 GB of memory and run a Red Hat Enterprise Linux ES release [Figure: a browser running the event-based dynamic map tools talks to the federator at CGL, Indiana. (1) GetMap returns NASA satellite map images as binary map images from the WMS at JPL, California; (2) GetFeature returns earthquake-seismic records as GML from WFS-1 … WFS-6, each backed by its own database (DB1 … DB6).]

Parallel & On-demand with Blind Partitioning • The larger the data size, the higher the performance gains • As the size of the data falling in a specific range query increases, the possibility of equal sharing increases • From the figure it seems that partitioning into 10 or 20 gives the best results, but • What about rendering relatively small data? • What partition number gives the best result for a specific range and data size? • See the next slide as an illustration of the need for smart partitioning • Number of worker WFS: 6 • Partition numbers: 2, 10, 20

Parallel & On-demand with Smart Partitioning • Actual performance results are much better because of client-based caching • Depending on the size of the overlap between the cache and the main query, response times vary between the orange line and the brown line in the second figure • The brown line shows the best case, in which the whole main query range falls within the cached data ranges • Number of worker WFS: 6 • Partition numbers: 2, 10, 20 [Figure: response-time plots annotated with the best partition/thread numbers (10, 10, 2, 2).]

Overhead Timings Resulting from Parallelization • Overheads: query partitioning, sub-query creation, and merging the results of sub-queries • Partitioning: defining the partition number and cutting the main query range into that number of pieces in the form of bounding box (bbox) values (range query attribute) • Sub-query creation: creating the corresponding XML-based query (getFeature) for each partition in the partition list to fetch the remote GML data from WFS • Merging: aggregating the results of the sub-queries and creating one complete map image as the answer to the main query • Illustration of overheads for a sample range query, range 0 to 10: 1. Partitioning into 5: 0-2, 2-4, 4-6, 6-8, 8-10 2. Query creation for the partitions: Q1, Q2, Q3, Q4, Q5 3. Merging [Figure: the query for range 0-10 is split into queries/responses for the partitioned ranges, each sent to a WFS backed by DB1.]

Conclusions – Performance • Streaming data transfer techniques allow data rendering even on partially returned data • Pull parsing gives the best results for rendering XML-encoded GML data by eliminating the requirement of data validation • The federator’s natural characteristics allowed us to develop advanced caching and parallel processing designs • Pre-fetching and parallel-processing techniques are mutually exclusive • The best performance is achieved through pre-fetching, but it can cause data inconsistency • The triggering period must be defined carefully • The success of parallel-processing techniques depends on how well the workload is shared among worker nodes • Geo-data is unevenly distributed and variable-sized • Client-based caching enables efficient workload sharing for the most efficient parallel processing • It also removes time- and resource-consuming repeated jobs • We saw that • Application of working-window and locality principles by means of client-based caching, and • Parallel processing through attribute-based query decomposition • helped us increase the system responsiveness to a great extent

Conclusions – Framework • Fine-grained dynamic information presentation through a federation framework enabled heterogeneous data sources to be queried as a single resource over an integrated data-view in multi-layered map images • Autonomous local resources control the definition of data • Removes the burden of individually accessing each data source with ad-hoc query languages • We showed that Open Geographic Standards (OGC) can be applied together with Web Service standards • We converted HTTP GET/POST based queries into XML-based queries by developing standard schemas compatible with the standards • We also extended the standard service definitions with streaming data transfer capabilities by using publish-subscribe based messaging middleware • Easy extension with new data and service resources • Open Geographic and Web Service standards • No physical data integration • Just-in-time or late-binding federation • Data is always kept at its originating resource • This enables easy data maintenance and a high degree of autonomy • Seamless interaction with the system through integrated data views in multi-layered map images • Enables interactive feature-based querying besides displaying the data

Contributions • A federated Service-oriented Geographic Information Systems framework • Integrating Web Services with Open Geographic Standards to support interoperability at both data and service levels • Production of knowledge from distributed data sources in multi-layered map images. • Hierarchical data definitions through capability metadata federations • Fine-grained dynamic information presentation • Enabling unified interactive data access/query and display from a single access point through federator. • Investigated performance efficient designs and did detailed benchmarking • Streaming GIS Web Services • Federator-oriented high-performance design techniques • Pre-fetching • Client-based caching : Working-window and locality principles • Parallel processing through attribute-based query decomposition over un-predictable workload sharing

Acknowledgement • The work described in this presentation is part of the QuakeSim project which is supported by the Advanced Information Systems Technology Program of NASA's Earth-Sun System Technology Office. • Galip Aydin: Web Feature Server (WFS)

Thanks!....

BACK-UP SLIDES

Capability-based Federation of the Standard Web Service Components • Built over the proposed standard Web Service components and common data models • Unified data access/query/display from a single access point • Providing application-based hierarchical data definitions • Layer-based data and service (WMS and WFS) compositions • Federation is done by aggregating GIS Web Services’ capabilities metadata • Capability is basically metadata about data+service: • Server’s information content and acceptable request parameter values • Application-based hierarchical data: • [Application] Pattern Informatics • [Layer-1] State-boundary over Satellite • [Data-1] State-boundary (WFS-1) • [Data-2] Satellite-Image (WMS-2) • [Layer-2] Google map (WMS-1) • [Layer-3] Earthquake-Seismic • [Data-1] Earthquake-Seismic (WFS-3) • Sample layers for PI (a, b, c and d): • NASA satellite layer • Earthquake-seismic layer • Google Map layer • State-boundaries layer [Figure: a GIS browser in the user portal runs the Interactive Map-Tools against the federator, which performs capability federation and map rendering over several WMS and WFS. Service calls: 1. GetCapability (metadata about data+service), 2. GetMap (get map data in a set of layers), 3. GetFeatureInfo (query the attributes of data). Events: move, zooming in/out, panning (drag-drop), rectangular region, distance calculation, attribute querying.]

Hierarchical Data: Integrated Data-view [Figure: integrated data-view composed of three layers: 1. Google map layer, 2. state boundary lines layer, 3. seismic data layer.] Event-based Interactive Tools: query and data analysis over integrated data views

Hierarchical Data / Integrated Data-view for IEISS Geo-science Application • Application-based hierarchical data: • [Application] IEISS • [Layer-1] Gas-pipeline over Satellite • [Data-1] Gas-pipeline (WFS-1) • [Data-2] Satellite-Image (WMS-2) • [Layer-2] Google map (WMS-1) • [Layer-3] Electric-power • [Data-1] Electric-power (WFS-3)

GetCapabilities Schema and Sample Request Instance

GetMap Schema and Sample Request Instance

Event-based Interactive Map Tools
<event_controller>
  <event name="init" class="Path.InitListener" next="map.jsp"/>
  <event name="REFRESH" class="Path.InitListener" next="map.jsp"/>
  <event name="ZOOMIN" class="Path.InitListener" next="map.jsp"/>
  <event name="ZOOMOUT" class="Path.InitListener" next="map.jsp"/>
  <event name="RECENTER" class="Path.InitListener" next="map.jsp"/>
  <event name="RESET" class="Path.InitListener" next="map.jsp"/>
  <event name="PAN" class="Path.InitListener" next="map.jsp"/>
  <event name="INFO" class="Path.InitListener" next="map.jsp"/>
</event_controller>

Sample GML document

Sample GetFeature Request Instance

A Template simple capabilities file for a WMS

Generalizing the Problem Domain • Query heterogeneous data sources as a single resource • Heterogeneous: the local resource controls the definition of the data • Single resource: removes the burden of individually accessing each data source • Easy extension with new data and service resources • No real integration of data • Data always at the local source • Easy maintenance of data • Seamless interaction with the system • Collaborative decision making [Figure: a client/user query arrives over the WWW at federation services offering an integrated view; mediators connect the federation services to the sources (DB, files): data in files, HTML, XML/relational databases, spatial sources/sensors.]

Generalization of the Proposed Architecture • The GIS-style information model can be redefined in any application area, such as Chemistry and Astronomy • Application Specific Information Systems (ASIS) • We need to define an Application Specific: • Language (ASL) -> GML: expressing domain-specific features and the semantics of data • Feature Service (ASFS) -> WFS: serving data in the common language (ASL) • Visualization Service (ASVS) -> WMS: visualizes information and provides a way of navigating ASFS-compatible/mediated data resources • Capabilities metadata for ASVS and ASFS • Federator: federating the capabilities of distributed ASVS and ASFS to create an application-based hierarchy of distributed data and service resources • Mediators: query and data format conversions • Data sources maintain their internal structure • Large degree of autonomy • No actual physical data integration [Figure: unified data query/access/display goes through the federator (capability federation) to an ASVS (ASL rendering) and on to mediators, each behind a standard service API, with numbered calls 1 to 4. Components exchange messages using ASL: AS Repository, AS Tool (ASVS), AS Tool (ASFS), AS Sensors, and user-defined AS Services such as filter, transformation, reasoning, data-mining, analysis.]

Contributions (Systems Software) • Developing a Web Map Server (WMS) in Open Geographic Standards • Extended with Web Service standards and • Streaming map creation capabilities • Developing a GIS Federator • Extended from WMS • Provides application-specific, layer-structured hierarchical data as a composition of distributed standard GIS Web Service components • Enables uniform data access and query from a single access point • Interactive map tools for data display, query and analysis • Browser- and event-based • Extended with AJAX (Asynchronous JavaScript and XML)
