Complex Scientific Analytics in Earth Science at Extreme Scale

Presentation Transcript


  1. Complex Scientific Analytics in Earth Science at Extreme Scale
  John Caron, University Corporation for Atmospheric Research, Boulder, CO
  Oct 6, 2010

  2. Who Are You?
  • Unidata is an NSF-funded engineering group
  • NSF Division of Atmospheric and Geospace Sciences (AGS)
  • Core constituency is synoptic meteorology
  • Build tools to obtain and analyze earth science data for research and education
  • Atmosphere, oceans, hydrology, climate, biosphere
  • Unidata is the developer of the netCDF format
  • I am a Software Engineer, not a Scientist
  • Java-NetCDF library
  • THREDDS Data Server
  • Common Data Model

  3. Previous results for Geoscience
  • Geoscience has no “very large” projects. Rather, it is diverse, heterogeneous, and highly distributed, with a diversity of data formats and access methods (hindering wide data use).
  • Data is stored non-hierarchically and fully distributed across a large number of independent sites. Virtually all scientists want to “own” and “control” their data.
  • Append-only: written once and never updated.
  • Geoscience data can be represented (at a low level) as n-dimensional arrays. These are not stored in databases but in “scientific file formats” such as HDF and netCDF (see the sketch below).
  • Data formats are usually chosen by data producers, so data is archived optimally for writing and storage, not for retrieval and analysis. Data is often stored with insufficient metadata for non-expert users.
  • Geoscientific data can come from a large variety of sources: ground observatories, mobile stations, sensor networks, aerial observers, simulation models, etc. The data for a given location may have different resolutions, different sample rates, different perspectives, or different coordinate systems, and therefore must be transformed, regridded, aligned, and otherwise unified before it can be analyzed.
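A minimal sketch of that low-level picture, using the netCDF4-python bindings (not part of the original slides); the file name, dimension sizes, and variable names are illustrative only:

```python
# Store an n-dimensional array in a "scientific file format" (netCDF).
import numpy as np
from netCDF4 import Dataset

ds = Dataset("example_temperature.nc", "w")   # create a new netCDF file

# Regular / rectangular dimensions (think Fortran arrays)
ds.createDimension("time", None)              # unlimited: the file can grow
ds.createDimension("lat", 180)
ds.createDimension("lon", 360)

# A 3-D variable plus arbitrary key/value attributes
temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"))
temp.units = "K"                              # metadata lives in attributes
temp.long_name = "surface air temperature"

temp[0, :, :] = 280.0 + 10.0 * np.random.rand(180, 360)  # write one time step
ds.close()
```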

  4. National Center for Atmospheric Research (NCAR) Data Archives
  • NCAR Mass Storage: 8 PB
  • Growing at 2.5 PB/year; by 2012: 5 PB/year
  • Tape silo with 100 TB disk cache
  • NCAR Research Data Archive: 55 TB
  • High-quality observations and model output
  • 55 TB / 600 datasets = ~100 GB / dataset

  5. NASA ESDIS: Earth Science Data and Information System
  • Satellite observational data
  • 4.2 PB / 4000 datasets = ~1 TB / dataset
  • 4.2 PB / 175 million files = ~24 MB / file

  6. CLASS (NOAA) data volumes
  • CLASS (Comprehensive Large Array-data Stewardship System) currently holds about 30 PB of data
  • Projected to grow to 100 PB by 2015 and 160 PB by 2020

  7. Dataset Size / Heterogeneity
  • Climate Model Intercomparison (PCMDI) project for IPCC AR4 (2006/7): 35 TB, ~20 climate models
  • 35 TB / 78,000 files = ~450 MB / file
  • Stored in netCDF with CF Conventions
  • NASA's Global Change Master Directory (GCMD) holds more than 25,000 Earth science data set and service descriptions

  8. Earth Science Data Archive: Current Practices
  • Raw data is processed into an archive or exchange format
  • General purpose: netCDF / HDF
  • Special purpose: e.g., in meteorology, WMO's GRIB and BUFR
  • Interest is in this archive (not raw) data
  • Dataset is a collection of homogeneous files
  • Common metadata, “single schema” (approximately)
  • Granule = single file, partitioned by time
  • Effectively “append only” (sketched below)
  • “Near real-time archive” allows file appending
  • “Rolling archive” keeps, e.g., the most recent 30 days
  • Very diverse
  • Big mandated archives: NOAA, NASA
  • Many others: DOE, USGS, EPA, NCAR, universities, etc.
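A hedged sketch of the “append only” pattern: a near-real-time archive adds new records along the unlimited time dimension but never rewrites old ones (this reuses the illustrative file from the earlier sketch):

```python
import numpy as np
from netCDF4 import Dataset

ds = Dataset("example_temperature.nc", "a")   # open for appending
temp = ds.variables["temperature"]
n = temp.shape[0]                             # current number of time steps
temp[n, :, :] = 280.0 + 10.0 * np.random.rand(180, 360)  # append; never update
ds.close()
```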

  9. Earth Science Data Archive: Current Practices
  • “Search metadata” may be put into an RDBMS; the data itself is not (some exceptions)
  • Data may be online or “near online” in tertiary storage
  • Most data is transferred as files in a batch service
  • Place order, get data later
  • May have a subsetting / aggregation service
  • May have a file format translation service (hard)
  • May have a regridding service (very hard)
  • Starting to develop “online” web services (see the sketch below)
  • Open Geospatial Consortium (OGC) protocols / ISO 191xx data models
  • Community standard protocols, e.g., OPeNDAP in ocean and atmospheric science
  • Synchronous, assumes online
  • Processing
  • Some standard operators: statistics, regridding
  • Algebra / calculus to create “derived fields”
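For the “online” synchronous services, a sketch of remote access over OPeNDAP via netCDF4-python (the URL and variable name are hypothetical, and the library must be built with DAP support):

```python
from netCDF4 import Dataset

# Only the requested subset crosses the network -- the server does the
# subsetting, which is what makes synchronous online access practical.
url = "http://example.org/thredds/dodsC/model/run.nc"  # hypothetical endpoint
ds = Dataset(url)
sst = ds.variables["sst"][0, 100:200, 200:300]         # server-side subset
ds.close()
```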

  10. General Purpose Scientific Data File Formats in Earth Science
  • NetCDF (Unidata) / HDF (NCSA)
  • Persistent Fortran 77 / 90 arrays
  • Arbitrary key/value attributes
  • Multidimensional arrays
  • Regular / rectangular (think Fortran)
  • Ragged (a bit of a poor cousin)
  • Tiled / compressed (performance)
  • Language API bindings
  • Efficient strided array subsetting (sketched below)
  • Procedural, “file-at-a-time”
  • Some higher-level tools for processing sets of files
  • Machine / OS / language independent
  • Solved the “syntactical problem” of data access
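A short illustration of strided array subsetting, reusing the illustrative file from above: every other grid point of one time step is read, without loading the whole variable.

```python
from netCDF4 import Dataset

ds = Dataset("example_temperature.nc")
coarse = ds.variables["temperature"][0, ::2, ::2]  # start:stop:stride per dimension
print(coarse.shape)                                # (90, 180)
ds.close()
```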

  11. Data Semantics are hard
  • Semantics are typically stored in key/value attributes in the files
  • Datasets define “attribute conventions”
  • Human-readable documents (e.g., CF Conventions)
  • Sometimes with a software API (e.g., HDF-EOS)
  • Sometimes you “just have to know” what it means
  • Sometimes there are no semantics in the file (see the sketch below)
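A sketch of what “semantics in attributes” means for a reader: following a convention such as CF, code looks for agreed attribute names and must cope when they are absent (file and variable names come from the earlier illustrative sketch).

```python
from netCDF4 import Dataset

ds = Dataset("example_temperature.nc")
var = ds.variables["temperature"]
units = getattr(var, "units", None)             # None if the writer omitted it
std_name = getattr(var, "standard_name", None)  # CF attribute; may be missing
if std_name is None:
    pass  # no convention followed: you "just have to know" what it means
ds.close()
```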

  12. What to do about heterogeneity?
  • Rewrite into a common form / database
  • Caveat: must save the original data
  • Leave the data in the original file formats
  • Develop decoders to a “common data model” (sketched below)
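A hypothetical sketch of the decoder approach (all class and method names here are invented for illustration): each storage format gets an adapter to one common in-memory model, so analysis code never sees the format.

```python
from abc import ABC, abstractmethod
import numpy as np

class CommonDataModelReader(ABC):
    """One interface over many storage formats."""

    @abstractmethod
    def read(self, variable: str) -> np.ndarray: ...

    @abstractmethod
    def attributes(self, variable: str) -> dict: ...

class NetcdfReader(CommonDataModelReader):
    """Decoder for netCDF; a GribReader, BufrReader, etc. would match it."""

    def __init__(self, path: str):
        from netCDF4 import Dataset
        self._ds = Dataset(path)

    def read(self, variable):
        return self._ds.variables[variable][:]

    def attributes(self, variable):
        v = self._ds.variables[variable]
        return {name: v.getncattr(name) for name in v.ncattrs()}
```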

  13. Unidata's Approach
  • Virtual dataset (illustrated below)
  • Collection of many files
  • Hide the details of the file partitioning
  • Provide remote access
  • Efficient subsetting in space/time
  • Let user programs work in coordinate space
  • Handle the mapping to array indices
  • Define a small set of “scientific feature types”
  • Dataset is a collection of objects, not arrays
  • Necessary to abstract the details of array storage
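The “virtual dataset” idea can be illustrated with xarray, an analogous open-source tool rather than Unidata's own implementation (the file pattern and variable names are hypothetical, and open_mfdataset needs dask installed): many time-partitioned files appear as one dataset, and the user subsets in coordinate space rather than array indices.

```python
import xarray as xr

# One logical dataset over a collection of time-partitioned files;
# the file boundaries are hidden from the user.
ds = xr.open_mfdataset("archive/temperature_*.nc", combine="by_coords")

# Subsetting in coordinate space (dates, degrees), not array indices;
# the library handles the mapping to indices. Assumes ascending coordinates
# with longitude in the -180..180 convention.
subset = ds["temperature"].sel(
    time=slice("2010-01-01", "2010-01-31"),
    lat=slice(30, 60),
    lon=slice(-110, -70),
)
```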

  14. Unidata's Common Data Model
  • Scientific Feature Types: Point, Trajectory, Station, Profile, Radial, Grid, Swath
  • Objects: how the user sees the data
  • Coordinate Systems: georeferencing, topology
  • Data Access: storage format, multidimensional arrays
  • netCDF-3, HDF5, OPeNDAP, BUFR, GRIB1, GRIB2, NEXRAD, NIDS, McIDAS, GEMPAK, GINI, DMSP, HDF4, HDF-EOS, DORADE, GTOPO, ASCII

  15. Geoscience Data Summary
  • Many important geoscience datasets
  • 10,000 (?)
  • Unique metadata / semantics
  • Stored in append-only file collections
  • Time partitioned
  • Optional metadata indexing
  • Three levels
  • Storage format: multidimensional arrays
  • Coordinate systems: space / time georeferencing, topology
  • Objects: forecast model run, radar sweep, satellite swath image, vertical profile of atmosphere, time series of surface observations, collection of lightning strikes, autonomous underwater vehicle (AUV) trajectories, etc.
