
Supporting Ad-hoc Data Exploration for Large Scientific Databases


Presentation Transcript


  1. SDM center: Supporting Ad-hoc Data Exploration for Large Scientific Databases. LBNL: Arie Shoshani, Ekow Otoo, Alex Sim, Kesheng (John) Wu. ORNL: Randy Burris, Dan Million. All Hands Meeting, March 26-27, 2002

  2. P3: Efficient Access from Large Datasets. [Architecture diagram: a data-mining application issues requests against a large, distributed dataset; a Request Interpreter and a Storage Resource Management layer with adaptive file caching mediate access to files in HPSS/disk storage on the grid.]

  3. Typical Scientific Exploration Process
  • Generate large amounts of raw data
    • large simulations
    • collect from experiments
  • Post-processing of data
    • analyze data (find particles produced, tracks)
    • generate summary data, e.g. momentum, no. of pions, transverse energy
    • number of properties is large (50-100)
  • Analyze data
    • use summary data as a guide
    • extract subsets from the large dataset
    • need to access events based on a partial-properties specification (range queries), e.g. ((0.1 < AVpT < 0.2) ^ (10 < Np < 20)) v (N > 6000) (see the sketch below)
    • apply analysis code
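To make the range-query step concrete, here is a minimal sketch (not from the slides) of evaluating such a partial-property predicate over in-memory summary data with NumPy. The column names AVpT, Np, and N come from the example predicate above; the random data and dictionary layout are illustrative assumptions.

```python
# Minimal sketch of a partial-property range query over summary ("tag") data.
# Column names (AVpT, Np, N) come from the slide's example predicate; the data
# layout and values are hypothetical.
import numpy as np

def select_events(tags):
    """Return indices of events satisfying
    ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)."""
    mask = (((tags["AVpT"] > 0.1) & (tags["AVpT"] < 0.2)
             & (tags["Np"] > 10) & (tags["Np"] < 20))
            | (tags["N"] > 6000))
    return np.nonzero(mask)[0]

# Example: one million events with random property values.
rng = np.random.default_rng(0)
tags = {"AVpT": rng.uniform(0.0, 1.0, 1_000_000),
        "Np": rng.integers(0, 50, 1_000_000),
        "N": rng.integers(0, 10_000, 1_000_000)}
qualified = select_events(tags)   # event IDs to feed to the analysis code
```

In the project itself this linear scan is replaced by the bitmap index described on slide 7, which answers the same kind of predicate without touching every object.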

  4. Problem Statement
  • Large numbers of objects reside in files on a distributed Data Grid
    • 10^8 – 10^9 objects
    • 0.5 – 5 million files
    • 15,000 – 150,000 tapes
  • The distributed system can span continents
    • 100s of sites
  • Some of the data is replicated, based on demand or pre-assigned replication
  • Requests are expressed as logical requests by the user
  • Systems and the network may fail
  • Problem: given a logical request, get the relevant data to the local system without human intervention

  5. The big picture
  • Logical request: (73.39 < zdc2Energy < 94.94 AND -24.99 < qxb < -7.25)
  • Request Interpreter → logical objects: {set192_01.STAR, …, set287_07.STAR}
  • Request Manager → "physical" objects:
    • gsiftp://dg0n1.mcs.anl.gov/homes/asim/gsiftp/set192_01.STAR
    • hrm://DRMServerAlone@srm.lbl.gov:4000/home/dm/srm/data1/gsiftp/set287_07.STAR
  • Storage Resource Managers at the sites dg0n1.mcs.anl.gov and srm.lbl.gov:4000 handle file access management* against HPSS/shared disk (a toy sketch of this resolution step follows below)
  * Grid Enabled Access
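As a toy illustration of the request-interpretation step shown above (not the actual Request Manager code), the following sketch resolves logical file names to physical replica URLs through an in-memory stand-in for the replica catalog. The catalog contents are just the two example files from the slide; the selection rule is a placeholder.

```python
# Toy sketch of request interpretation: logical file names returned by the
# index are resolved to physical replica URLs through a (hypothetical,
# in-memory) replica catalog, then handed to the SRM at each site.
replica_catalog = {
    "set192_01.STAR": ["gsiftp://dg0n1.mcs.anl.gov/homes/asim/gsiftp/set192_01.STAR"],
    "set287_07.STAR": ["hrm://DRMServerAlone@srm.lbl.gov:4000/home/dm/srm/data1/gsiftp/set287_07.STAR"],
}

def resolve(logical_files):
    """Pick one physical replica per logical file (first listed here; a real
    request manager would choose by locality, load, availability, etc.)."""
    plan = {}
    for lf in logical_files:
        replicas = replica_catalog.get(lf)
        if not replicas:
            raise LookupError(f"no replica registered for {lf}")
        plan[lf] = replicas[0]
    return plan

print(resolve(["set192_01.STAR", "set287_07.STAR"]))
```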

  6. SC 2001 Demo. [Demo diagram: a client in Denver sends a logical request to a bit-map index and a Request Executor; files move over GridFTP and FTP among DRM- and HRM-managed disk caches at servers in Chicago, Berkeley, and Livermore; a File Transfer Monitoring tool tracks progress. The legend distinguishes the control path from the data path.]

  7. Middleware Components
  • 1) BitMap index (see the sketch below)
    • size of the data to be indexed: 10^8 objects x 500 attributes x 4 bytes = 200 GB
  • 2) Request Executor
    • uses the Replica Catalog
    • monitors transfer progress
  • 3) Storage Resource Managers (SRMs)
    • Disk Resource Manager (DRM)
    • Hierarchical Resource Manager (HRM)
  • 4) File Transfer Visualization tool (FTV)
    • view by file size and fraction of the file transferred
    • view by % of files transferred
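The following is a minimal, uncompressed sketch of why a bitmap index answers range queries cheaply: one bit-vector per attribute bin, so a range predicate becomes bitwise ORs over bins (and ANDs across attributes). The compressed index (COMBIX, slide 19) compresses these vectors; the attribute, binning, and sizes below are illustrative assumptions, not the project's implementation.

```python
# Uncompressed equality-encoded bitmap index sketch: one boolean vector per
# attribute bin; a range query ORs the bins it covers.  The real index
# compresses these vectors so hundreds of attributes remain affordable.
import numpy as np

def build_bitmaps(values, bin_edges):
    """Return one boolean vector per bin of `values` (binned by bin_edges)."""
    bins = np.digitize(values, bin_edges)          # bin id per object
    return [bins == b for b in range(len(bin_edges) + 1)]

def range_query(bitmaps, lo_bin, hi_bin):
    """OR together the bitmaps of bins lo_bin..hi_bin (inclusive)."""
    result = np.zeros_like(bitmaps[0])
    for b in range(lo_bin, hi_bin + 1):
        result |= bitmaps[b]
    return result

rng = np.random.default_rng(1)
energy = rng.uniform(0.0, 100.0, 1_000_000)        # one attribute, 10^6 objects
bitmaps = build_bitmaps(energy, np.arange(10.0, 100.0, 10.0))
hits = range_query(bitmaps, 7, 9)                  # roughly energy >= 70
print(hits.sum())
```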

  8. Monitoring File Transfer

  9. Earth Science Data Grid (ESG II) Architecture. [Layered architecture diagram: applications (Discovery Apps, Analysis Apps, Publication Portals) sit on Security (Authentication + Authorization) and Request Management services; the middleware layer provides Dataset Metadata, Discovery Metadata, Replica, Vis, Analysis, and Data services; servers hold archival data, on-line data, an ancillary catalog, and a general and use metadata catalog.]

  10. Storage Resource Management: A Collaboratory Middleware Project. Arie Shoshani, Alex Sim, Junmin Gu. Computing Sciences Directorate, Lawrence Berkeley National Laboratory. http://sdm.lbl.gov/srm

  11. Motivation
  • What the Grid architecture has emphasized in the past:
    • security
    • compute resource coordination & scheduling
    • network resource coordination & scheduling (QoS)
  • SRMs' role in the data grid architecture:
    • storage resource coordination & scheduling
  • Types of storage resource managers
    • Disk Resource Manager (DRM)
    • Tape Resource Manager (TRM)
    • Hierarchical Resource Manager (HRM = TRM + DRM)

  12. Where Do SRMs Fit in the Grid Architecture? [Architecture diagram: at the client's site, a logical query goes to the Request Interpreter, which uses a property-file index to produce logical files; the Request Executor, with request planning, the Replica Catalog, and the Network Weather Service, turns these into site-specific file requests; pinning and file-transfer requests travel over the network to DRMs (disk caches) and an HRM (tape system plus disk cache) at the storage sites.]

  13. Challenges (1)
  • Managing storage resources in an unreliable, distributed, large, heterogeneous system
  • Long-lasting, data-intensive transactions
    • can't afford to restart jobs
    • can't afford to lose data, especially from experiments
  • Types of failures
    • storage system failures: Mass Storage System (MSS), disk system
    • server failures
    • network failures

  14. Challenges (2)
  • Heterogeneity
    • operating systems (well understood)
    • MSS: HPSS, Castor, Enstore, …
    • disk systems: system-attached, network-attached, parallel
  • Optimization issues
    • avoid extra file transfers: what to keep in each disk cache over time
    • how to maximize sharing among multiple users
    • global optimization
    • multi-tier storage system optimization

  15. Specific Problems
  • Managing resource space allocation: what if there is no space?
  • Managing pinning of files: what if files can be removed in the middle of a transfer? (see the sketch below)
  • Space reservations: what if multiple files are needed concurrently?
  • File streaming: for processing a large set of files
  • Pin-lock: what if you pinned files and the system deadlocks?
  • User priorities
  • Access control: who can read/write a file
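As a toy illustration of the space-allocation and pinning bookkeeping listed above, a DRM-style cache might track reserved space and pin counts roughly as below. The class and method names are hypothetical, not the SRM interface; real SRMs also handle timeouts, quotas, and priorities that this sketch omits.

```python
# Toy sketch of DRM-style space accounting: reserve space before a transfer,
# pin a file so it cannot be evicted mid-use, release the pin when done.
# Names and semantics are illustrative only, not the actual SRM interface.
class DiskCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.pins = {}                      # file -> pin count

    def reserve(self, nbytes):
        """Claim space for an incoming file; fail rather than over-commit."""
        if self.used + nbytes > self.capacity:
            raise MemoryError("no space: caller must wait or evict unpinned files")
        self.used += nbytes

    def pin(self, filename):
        self.pins[filename] = self.pins.get(filename, 0) + 1

    def release(self, filename):
        self.pins[filename] -= 1

    def evictable(self, filename):
        """A file may be removed only if nobody holds a pin on it."""
        return self.pins.get(filename, 0) == 0

cache = DiskCache(capacity_bytes=10 * 2**30)   # a 10 GB cache
cache.reserve(2 * 2**30)                       # claim 2 GB before the transfer
cache.pin("set192_01.STAR")                    # protect it while analysis reads it
cache.release("set192_01.STAR")                # now eligible for eviction again
```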

  16. HRMs in PPDG (high-level view)
  • Monitors files written into BNL's HPSS
  • Selects files to replicate
  • Issues a request_to_put for a file (or many files)
  [Diagram: a Replica Coordinator drives HRM-COPY / HRM-GET between an HRM that performs writes (LBNL) and an HRM that performs reads (BNL), each with a tape system and disk cache; data moves via GridFTP GET in pull mode. A rough sketch of this flow follows below.]
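A rough sketch of the replication flow on this slide, under the assumption that it amounts to "watch for new files at the source, reserve space at the destination, then pull each file". Every class, method, and file name below is a placeholder rather than a PPDG or SRM API.

```python
# Rough sketch of the slide's replication flow (all names are placeholders):
# a coordinator watches for new files at the source HRM and asks the
# destination HRM to pull each one (request_to_put, then a GridFTP GET pull).
class SourceHRM:
    """Stand-in for the HRM at the source site (performs reads)."""
    def __init__(self, files):
        self.files = list(files)
    def list_new_files(self):
        return self.files

class DestHRM:
    """Stand-in for the HRM at the destination site (performs writes)."""
    def request_to_put(self, path):
        print(f"reserving space for {path}")
    def hrm_copy(self, source, path):
        print(f"pulling {path} from the source via GridFTP GET")

def replicate_once(source, dest, already_copied):
    for path in source.list_new_files():
        if path in already_copied:
            continue
        dest.request_to_put(path)          # reserve space at the destination
        dest.hrm_copy(source, path)        # pull the file (GridFTP GET, pull mode)
        already_copied.add(path)

replicate_once(SourceHRM(["run01.evt", "run02.evt"]), DestHRM(), set())
```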

  17. Measurements. [Chart: the stages of a file replication request over time, from the start of the request through Staging_requested_at_BNL, Staging_started_at_BNL, Staging_finished_at_BNL, Transfered_to_PDSF_from_BNL, Migration_Requested, Migration_Finished, and Notified_Client, with FILE_REQUEST_FAILED as a possible outcome.]

  18. The Other Talks. [Diagram mapping the pipeline to the other talks: a logical request goes to the Request Interpreter, where Bitmap Indexing (John Wu) selects logical objects; the Request Manager produces qualified objects; Shared Disk File Caching (Ekow Otoo) and the Storage Resource Managers, with Optimizing Shared Access to Tertiary Storage (Randy Burris), handle file access management* against HPSS/shared disk. * Grid Enabled Access]

  19. P3 Tasks
  • Deployment of the compressed BitMap index (COMBIX) for HEP and Combustion applications (millions to billions of objects)
    • logical range queries to find qualified files (HENP)
    • logical range queries to find "flame fronts" (Combustion)
  • Developing optimal disk caching policies
    • using simulation and real tests with DRM
    • testing a new caching policy method based on "hazard rates" (see the sketch below)
  • Deployment of HRM at ORNL and BNL for use with Climate and HENP applications
    • to support movement of climate simulation production data files to NERSC
    • to support event subset access for HENP simulations
  • Developing efficient access to HPSS (ORNL)
    • parallel streams
    • partial file reads
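The "hazard rates" caching policy is only named on this slide, so the sketch below is one plausible reading rather than the project's actual method: estimate each cached file's re-reference rate from its access history and evict the file with the lowest estimate. All names and the rate estimator are assumptions.

```python
# One plausible reading of a hazard-rate caching policy (an assumption, not the
# project's algorithm): evict the cached file least likely to be re-referenced,
# judged by its observed access rate since it entered the cache.
import time

class HazardRateCache:
    def __init__(self):
        self.history = {}                    # file -> list of access times

    def record_access(self, filename, now=None):
        self.history.setdefault(filename, []).append(
            now if now is not None else time.time())

    def hazard_rate(self, filename, now=None):
        """Estimated accesses per second since the file entered the cache."""
        times = self.history[filename]
        now = now if now is not None else time.time()
        age = max(now - times[0], 1e-6)
        return len(times) / age

    def victim(self, now=None):
        """File to evict next: the one with the lowest estimated rate."""
        return min(self.history, key=lambda f: self.hazard_rate(f, now))

cache = HazardRateCache()
cache.record_access("set192_01.STAR", now=0.0)
cache.record_access("set287_07.STAR", now=0.0)
cache.record_access("set287_07.STAR", now=50.0)
print(cache.victim(now=100.0))               # -> set192_01.STAR (lower rate)
```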
