The SEQUOIA 2000 Storage Benchmark

The SEQUOIA 2000 Storage Benchmark Michael Stonebraker Presented by Handy Patriawan

Introduction • Different types of benchmark oriented at different application areas • Business data processing (e.g. TPC-x) Represent typical needs of DBMS users (update oriented transactions, stress transaction system and overhead of simple command processing) • ECAD (Electronic Computer Aided Design) application Contains complex commands that have high locality of reference on small data set and efficiency of client-server DBMS connection • Many more See http://www.benchmarkresources.com/handbook/)

Reasons for New Benchmark • Unfortunately those benchmark does not represent all applications. • One example application is Engineering and Scientific DBMS. • Why? • Massive size • Complex data type • Sophisticated searching

Reasons for New Benchmark (cont’d) • Massive size • Usually stores substantial number of images and simulations output • Four main SEQUOIA 2000 research group has ~ 10^14 bytes (100 TB!) • Complex data type • Often include multi-dimensional arrays, spatial information and other data types (e.g. temporal data type) • Usually not present in most benchmarks • Sophisticated search • Current DBMS implementation search method may not be enough for scientific & engineering community • Again… usually not present. B-Tree is not good enough

Earth Science • New DBMS benchmark is needed • Reasons as mentioned before • Categories of Earth Science (ES) research • Field studies • Remote sensing • Simulation

Earth Science (cont’d) • Field studies • { (longitude, latitude, elevation, array-of-value) } • Santa Barbara SEQUOIA 200 group has collected field data from Antarctic Ocean about the effect of ozone depletion on organisms • Remote sensing • Value (longitude, latitude, wavelength-band, time) • Thematic Mapper (TM) sensors on the Landsat satellites sample the Earth’s surface on a 30 m x 30 m grid, 7 wavelength-band every 15 days

Earth Science (cont’d) • Simulations • Array-of-values(longitude, latitude, elevation, time) • General Circulation Models (GCMs) for modeling climate.

Benchmark Data (Intro) • ES researchers usually focus on problems at the following four scales: • Local (e.g. a river drainage basin) • Regional (e.g. a study area of one or two states) • National (e.g. a study area of one country) • Earth (e.g. the study area is the whole world) • Most studies use tile based study area • Coarse: several Km2/tile, simulation output • Medium: 1 Km2/tile, Advance Very High Resolution Radiometer(AVHRR) • Fine: <100 m/tile, typical Thematic Mapper sensor data

Proposed Benchmark • Regional Benchmark • Focus on Regional size problem • Region size is 1280 Km x 800 Km encompassing the states of California and Nevada • National Benchmark • Focus on National size problem • Region size is 5500 Km x 3000 Km covers United States • Earth Benchmark • The whole globe of Earth

Types of Data • Raster • Capture the values of energy absorption in five wavelength bands • Point • Contains the names and locations of specific geographic features • Polygon • Homogeneous landuse/landcover • Directed graph • Drainage networks

Type of Data (cont’d)

Benchmark Setup • Data set should be stored in tertiary memory set (e.g. optical disk robot or tape robot) • Maintain the “durability” of benchmark • Example, Wisconsin Benchmark that measures disk benchmark can be fitted into memory

Benchmark Queries • Eleven queries • Each is written in POSTQUEL • POSTQUEL queries use the following POSTGRES schema • create RASTER (time = int4, location = box, band = int4, data = int2[][]) • create POINT (name = char[], location = point) • create POLYGON (landuse = int4, location = polygon) • create GRAPH (identifier = int4, segment = open-polygon)

Benchmark Queries (cont’d) • Two dimensional array is broken into sub array for storage convenience. • National benchmark one array is 129 Mbytes • Five collections of benchmarks • Data load • Raster queries • Polygon and point queries • Spatial joins involving one or more types of objects • recursion

Benchmark Queries (cont’d) • Constants • RECTANGLE: a geographic rectangle of size 100 Km x 100 Km randomly placed in study region • BAND: a random wavelength band • TIME: a random time • LANDUSE: a random landuse/landcover type • LOCATION: a random geographic point in the study area • POINT-NAME: the name of a random point in the POINT class • INT-1: an integer, set for this benchmark at 64 • FLOAT-1: a floating point number, set for this benchmark at 1.0 • FLOAT-2: a floating point number, set for this benchmark at 10.0

Data Load • Query 1: Create and load the data base and build any necessary secondary indexes • Expose between high performance (with streaming “bulk” copy) and low performance (just running one insert query per record) • They can differ by the factor of 10! • The whole benchmark must be copied from tape to disk, otherwise the query will run at the speed of tape • R-tree indexes on location in POINT and POLYGON classes and on segment in the GRAPH class. • B-tree indexes on time and band on RASTER, name for POINT and name for POLYGON. • Index on the function, size, operating on polygon locations in one query ??????? • Time: run on 4 data sets and time to build 8 indexes ???????

Raster Queries • Query 2 (time travel): Select AVHRR data for a given wavelength band and rectangular region ordered by ascending time • retrieve (clip (RASTER.data, RECTANGLE), RASTER.time)where RASTER.band = BANDorder by ascending time • Return 26 images for the selected band, clipped to correct rectangle and ordered by ascending time • ES scientists run many queries on raster data received from satellites • Time: running the query and time for returning the data to application program (excluding putting data on the screen)

Raster Queries (cont’d) • Query 3: Select AVHRR data for a given time and geographic rectangle and the calculate an arithmetic function of the five wavelength band values for each cell in the study rectangle • retrieve (raster-avg {clip (RASTER.data), RECTANGLE})where RASTER.time = TIME • raster-avg is a user-defined function that computes a weighted average of the individual cell values in RASTER.data • ES scientists run different weighted averages for a given study area and look for the one that produces the best output

Raster Queries (cont’d) • Query 4: Select AVHRR data for a given time, wavelength band and geographic rectangle. Lower the resolution of the image by a factor of 64 to a cell size of 4 Km x 4 Km and store it as a new DBMS object • retrieve into FOO-1 (time = RASTER.time,location = RASTER.location, band = RASTER.band,data = lower-res (clip (RASTER.data, RECTANGLE), INT-1))where RASTER.time = TIME and RASTER.band = BAND • INT-1 = 64 • Creating abstracts of raster data. It is useful ES scientists to see a lower resolution of data and zoom into areas of interest.

Point and Polygon Queries • Query 5: Find the POINT record that has a specific name • retrieve (POINT.all)where POINT.name = POINT-NAME • “non-spatial subsetting of POINT data on a non-spatial attribute” • A system needs to have a non-spatial indexing and able to produce spatial and non-spatial attributes

Point and Polygon Queries • Query 6: Find all polygons that intersect a specific rectangle and store them in the DBMS • retrieve into FOO-2 (POLYGON.all)where POLYGON.location || RECTANGLE • “||”: user-defined “polygon intersects rectangle” operator that return true if the location of polygon intersects with the RECTANGLE • This query needs (some kind of ) a spatial indexing to be efficient

Point and Polygon Queries • Query 7: Find all polygons that are more than a specific size and within a specific circle • retrieve (POLYGON.all)where size(POLYGON.location) > FLOAT-1 andPOLYGON.location <|> circle (LOCATION, FLOAT-2) • FLOAT-1: threshold polygon size, 1 KmLOCATION: center of circleFLOAT-2: radius, 10 Km • <|>: return true if polygon (left operand) is inside circle (right operand) • Efficient execution requires a query optimizer capable of selecting the more restrictive clause and to execute it first. • For this benchmark, most polygons are larger than 1 Km2, so the “POLYGON.location<|>…” should be executed first

Spatial Joins • Query 8: Show the landuse/landcover in a 50 Km quadrangle surrounding a given point • retrieve (POLYGON.landuse, POLYGON.location)where POLYGON.location || make-box (POINT.location, 50)and POINT.name = “POINT-NAME” • make-box: create a rectangle with radius of 50 from location • ||: return true if a polygon intersects with the rectangle of interest • Joins polygon and rectangle spatial data type

Spatial Joins • Query 9: Find the raster data for a given landuse type in a study rectangle for a given wavelength band and time • retrieve (POLYGON.location, clip(RASTER.data, POLYGON.location))where POLYGON.landuse = LANDUSEand RASTER.band = BAND and RASTER.time = TIME • LANDUSE: landuse classifications desired • BAND: wavelength band of desired raster data • TIME: time of desired raster data • Joins landuse and polygon data type and non-spatial data type

Spatial Joins • Query 10: Find the names of all points within polygons of a specific vegetation type and create this a new DBMS object • retrieve into FOO-3 (POINT.name)where POINT.location || POLYGON.locationand POLYGON.landuse = LANDUSE • “||”: “point inside polygon” operator, allowed to be overloaded by POSTGRES • Joins point, polygon and landuse data type

Recursion • Query 11: Find all segments of any waterway that are within 20 Km downstream of a specific geographic point • retrieve into temp(GRAPH.identifier, GRAPH.segment, partial-length(GRAPH.segment, LOCATION))where LOCATION *&* GRAPH.segmentappend * to temp(GRAPH.identifier, GRAPH.segment, length = temp.length + length (GRAPH.segment)where end (temp.segment) = begin (GRAPH.segment)and GRAPH.identifier notin (temp.identifier)and temp.length < 20

Recursion (cont’d) • *&*: return true if a point is on a specific segment • “temp” will have the final result of the query • More like Breadth-First-Search • Example application of this query Dunsmuir spill query after a toxic chemical spill into Sacramento River near town Dunsmuir.

Benchmark Constraints and Reporting Conventions • It is permissible to run this benchmark on any combination of hardware and software, but the retail price of the hardware must be reported. The result of the benchmark should be reported as a collection of all 11 benchmarks and performance is calculated with the following formula:performance = (total elapsed time for the benchmark)/ (retail price of hardware)In case of incomplete test rest, the report should contain the results of the queries that could be run. • The new object result and data returned to application program can be discarded, no display requirement is needed. This is a storage benchmark, not a visualization benchmark.

Benchmark Constraints and Reporting Conventions (cont.) • Benchmark can be coded in any language (e.g. use SQL for RDBMS).The number for high-level declarative interface must be reported as well if the query is run using low level algorithmic interface to DBMS. • Security is still needed! Any system that does not run the application in a separate protection domain from DBMS must clearly note this fact.

Benchmark Results • Not all queries can be run efficiently. The use of Z transformation won’t help. • Array simulation using blobs (binary large objects) needs special implementation (no indexing) • Conclusion? No relation DBMS and OODBMS are included in the benchmark • Contestants: • GRASS: public domain GIS written by US Army Corps of Engineers Construction Engineering Research Laboratory (CERL) • IPW: A raster-oriented image processing written at UC, Santa Barbara • POSTGRES: next-generation DBMS written at UC, Berkeley

Conclusions

Conclusions (cont’d) • A system should not limit the object size (e.g. one polygon consists of 5184 nodes) • The fastest system on raster data is IPW, optimize for efficiency. • Better than GRASS because of selective read • Better than POSTGRE because of careful pipelining without temporary data writing to disk. • GRASS is good with raster data, but the technique it uses does not work with overlapping polygons • Query Optimizer in POSTGRES is incorrectly execute the first condition in where clause • No “Swiss Cheese” implementation on POSTGRES.

The SEQUOIA 2000 Storage Benchmark

The SEQUOIA 2000 Storage Benchmark

Presentation Transcript

Sequoia Trees

Benchmark Scotland Glasgow 25 May 2000

Giant Sequoia Tree

The Giant Sequoia

Stewards of the Sequoia

Giant Sequoia ( Sequoidendron giganteum )

Sequoia National Park

Sequoia Middle School

Diebold and Sequoia

‘ The SEQUOIA 2000 Storage Benchmark ’ M.Stonebraker, J.Frew, K. Gardels, J. Meredith

Solution of Benchmark Problems for CO 2 Storage

Redwood ( Sequoia sempervirens )

Sequoia

Sequoia Voting Systems

Sequoia National Park

Sequoia National Park

Sequoia National Park Lodging

Sequoia National Park Lodging

Shailendra Signh Sequoia

Sequoia National Park Lodging

SEQUOIA CONSULTING GROUP