
Data Intensive Computing / Information Based Computing / Digital Libraries / Metacomputing Services


Presentation Transcript


  1. Data Intensive Computing / Information Based Computing / Digital Libraries / Metacomputing Services Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu, http://www.npaci.edu/DICE

  2. Information Based Computing (diagram): Application, Data Mining, Distributed Archives, Collection Building, Information Discovery, Digital Library

  3. Co-evolution of Technology • Supercomputer Centers and Digital Libraries • Both support large scale processing & storage of data • Will the supercomputer centers of the future be digital libraries?

  4. Researchers: Chaitanya Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Yannis Papakonstantinou, Arcot Rajasekar, Wayne Schroeder, Michael Wan

  5. Outline • Two views of computing • Execution environment - metacomputing systems • Data management environment - digital library • Analysis for moving data to the process or the process to the data • Data Management Environment • Information Based Computing

  6. Object Based Information Model (diagram) • Constructors: turning data sets into objects • Execution Environment: metacomputing environment, data management for execution, parallel I/O - MPI, data resources • Publication / Services Environment: digital libraries, data management for publication, presentation interface (Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50), data resources

  7. Choice between Environments • Should we provide services for manipulating information? • Move the process to the data • Should we provide execution environments? • Move data to the process

  8. Data Distribution Comparison • Reduce the size of the data from S bytes to s bytes, then analyze • Data is read at the data handling platform and may be shipped to the supercomputer • Bandwidths linking the systems are B (reading data at the data handling platform) and b (network between the platforms) • Execution rates are r (data handling platform) and R (supercomputer), with r < R • Operations per bit for analysis: O • Operations per bit for data transfer: o • Should the data reduction be done before transmission?

  9. Distributing Services • Compare the times for analyzing data with size reduction from S to s • Reduce at the data handling platform and ship the reduced data to the supercomputer: read data (S/B), reduce data (OS/r), transmit data (os/r), network transfer (s/b), receive data (os/R) • Ship all of the data to the supercomputer and reduce there: read data (S/B), transmit data (oS/r), network transfer (S/b), receive data (oS/R), reduce data (OS/R)

  10. Comparison of Times • Processing at the supercomputer (move all of the data): T(Super) = S/B + oS/r + S/b + oS/R + OS/R • Processing at the archive (reduce the data before transmission): T(Archive) = S/B + OS/r + os/r + s/b + os/R
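A minimal sketch of this timing model in Python; the two functions simply transcribe the sums above, and the numeric values are illustrative assumptions (not measurements), with units taken to be consistent.

```python
def t_super(S, s, B, b, r, R, O, o):
    """Move all S units of data to the supercomputer and reduce there (s is unused)."""
    return S/B + o*S/r + S/b + o*S/R + O*S/R

def t_archive(S, s, B, b, r, R, O, o):
    """Reduce S units to s units at the data handling platform, then ship the result."""
    return S/B + O*S/r + o*s/r + s/b + o*s/R

# Illustrative values only; units must be consistent, e.g. S, s in bits,
# B, b in bits/s, r, R in operations/s, O, o in operations per bit.
params = dict(S=8e11, s=8e9, B=6.4e8, b=8e7, r=1e9, R=1e10, O=100.0, o=1.0)
print("T(Super)   =", t_super(**params), "s")
print("T(Archive) =", t_archive(**params), "s")
```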

  11. Optimization Parameter Selection • T(Super) < T(Archive) when S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R • This is an algebraic inequality in eight independent variables • Which variable provides the simplest optimization criterion?

  12. Scaling Parameters • Data size reduction ratio: s/S • Execution slow down ratio: r/R • Problem complexity: o/O • Communication / execution balance: r/(ob) • Note that r/o is the number of bits per second that can be processed; when r/(ob) = 1, the data processing rate is the same as the data transmission rate • Optimal designs have r/(ob) = 1
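A small numeric illustration of the balance parameter, with assumed values:

```python
# r/o is the rate (bits/s) at which the platform can push data through the
# analysis; b is the rate at which data arrives over the network.
r, o, b = 8e8, 1.0, 8e8        # assumed: ops/s, ops per bit, bits/s
balance = r / (o * b)
print(balance)                  # 1.0 -> processing keeps pace with transmission
```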

  13. Complexity Analysis • Moving all of the data is faster, T(Super) < T(Archive), when the analysis is sufficiently complex: O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R) • Note that as the execution ratio r/R approaches 1, the required complexity becomes infinite • Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero

  14. Bandwidth Optimization • Moving all of the data is faster, T(Super) < T(Archive), when the network is sufficiently fast: b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)] • Note that the denominator changes sign when O < o (1 + r/R) (1 - s/S) / (1 - r/R) • Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small

  15. Execution Rate Optimization • Moving all of the data is faster, T(Super) < T(Archive), when the supercomputer is sufficiently fast: R > r [1 + (o/O) (1 - s/S)] / [1 - (o/O) (1 - s/S) (1 + r/(ob))] • Note that the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)] • Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small

  16. Data Reduction Optimization • Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough: s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]} • Note that the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R) • When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit
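The break-even conditions on slides 13-16 can be collected into small helper functions; a sketch under the same symbol conventions as the timing sketch above, assuming consistent units and r < R.

```python
def min_complexity(s, S, r, R, o, b):
    """Slide 13: analysis complexity O above which moving all of the data wins."""
    return o * (1 - s/S) * (1 + r/R + r/(o*b)) / (1 - r/R)

def min_bandwidth(s, S, r, R, o, O):
    """Slide 14: network bandwidth b above which moving all of the data wins.
    The denominator goes negative when O is too small: no finite b suffices."""
    return (r/O) * (1 - s/S) / (1 - r/R - (o/O) * (1 + r/R) * (1 - s/S))

def min_exec_rate(s, S, r, o, O, b):
    """Slide 15: supercomputer rate R above which moving all of the data wins."""
    return r * (1 + (o/O) * (1 - s/S)) / (1 - (o/O) * (1 - s/S) * (1 + r/(o*b)))

def min_reduced_size(S, r, R, o, O, b):
    """Slide 16: reduced size s above which moving all of the data wins."""
    return S * (1 - (O/o) * (1 - r/R) / (1 + r/R + r/(o*b)))
```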

  17. Is the Future Environment a Metacomputer or a Digital Library? • Sufficiently high complexity • Move data to processing engine • Digital Library execution of remote services • Traditional supercomputer processing of applications • Sufficiently low complexity • Move process to the data source • Metacomputing execution of remote applications • Traditional digital library service

  18. The IBM Digital Library Architecture (diagram) • Application (DL client), with SRB and MCAT interfaces • Library Server: "federated" search, metadata in DB2 or Oracle, text and image indices • Object Server: distributed storage resources - Videocharger, DB2, ADSM, Oracle

  19. Generalization of Digital Library • Scaling transparency • Support for arbitrary size data sets • Support for arbitrary data type • Location transparency • Access to remote data • Access to heterogeneous (non-uniform) storage systems • Remove restriction of local disk space size • Name service transparency • Support for multiple views (naming conventions) for data • Presentation transparency • Support for alternate representations of data

  20. Describing Information Content

  21. State-of-the-art Information Management: Digital Library

  22. High Performance Storage • Provide access to tertiary storage - scale size of repository • Disk caches • Tape robots • Manage migration of data between disk and tape • High Performance Storage System - IBM • Provides service classes • Support for parallel I/O • Support for terabyte sized data sets • Provide recoverable name space

  23. State-of-the-art Storage: HPSS • Store Teraflops computer output • Growth - 200 TB data per year • Data access rate - 7 TB/day = 80 MB/sec • 2-week data cache - 10 TB • Scalable control platform • 8-node SP (32 processors) • Support digital libraries • Support for millions of data sets • Integration with database meta-data catalogs
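As a quick check of the arithmetic behind the quoted access rate, 7 TB/day does come out near 80 MB/sec:

```python
# 7 TB/day expressed in MB/s, using decimal units (1 TB = 10**12 bytes).
tb_per_day = 7
mb_per_sec = tb_per_day * 1e12 / (24 * 3600) / 1e6
print(round(mb_per_sec, 1))   # ~81.0 MB/s, consistent with the quoted 80 MB/sec
```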

  24. HPSS Archival Storage System (hardware configuration diagram) • Multiple Silver nodes acting as tape / disk movers (DCE / FTP / HIS, log client), each with 54-160 GB of SSA RAID • Tape resources: 9490 robots with four to eight tape drives, 3490 tape, and Magstar 3590 tape drives • High Performance Gateway Node, plus High Node and Wide Node disk movers with HiPPI drivers and MaxStrat RAID (830 GB) • TrailBlazer3 switch and HiPPI switch interconnect • A Silver node runs the storage / purge, bitfile / migration, name service / PVL, and log daemon services

  25. HPSS Bandwidths • Bandwidths achieved at SDSC • Striping required to achieve the desired I/O rates

  26. Turning Archives into Digital Libraries • Meta-data based access to data sets • Support for application of methods (procedures) to data sets • Support for information discovery • Support for publication of data sets • Research issue - optimization of data distribution between database and archive

  27. DB2/HPSS Integration • Collaboration with IBM TJ Watson Research Center: Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala • Features: • Prototype, works with DB2 UDB (Version 5) • DB2 is able to use an HPSS file as a tablespace container • DB2 handles DCE authentication to HPSS • Regular as well as long (LOB) data can be stored in HPSS • Optional disk buffer between DB2 and the HPSS disk cache

  28. Generalizing Digital Libraries • SRB - Location transparency • Access to heterogeneous systems • Access to remote systems • MCAT - Name service transparency • Extensible Schema support • MIX - Presentation transparency • Mediation of information with XML • Support for semi-structured data • Access scaling • MPI-I/O access to data sets using parallel I/O

  29. SRB Software Architecture (diagram) • Application (SRB client) calls the SRB APIs • The SRB server consults the MCAT metadata catalog for user authentication, dataset location, access control, type, replication, and logging • Storage drivers provide access to UniTree, HPSS, DB2, Illustra, and Unix file systems

  30. 14 Installed SRB Sites (map) • Sites include Montana State University, NCSA, Rutgers, and large archives

  31. SRB / MCAT Features • Collections: the collection hierarchy allows grouping of heterogeneous data sets into a single logical collection; hierarchical access control, with ticket mechanism • Replication: optional replication at the time of creation; can choose replica on read • Proxy operations: supports proxy (remote) move and copy operations • Monitoring capability • Metadata: supports storing / querying of system- and user-defined metadata for data sets and resources; API for ad hoc querying of metadata • Schemas: ability to extend schemas and define new schemas, to associate data sets with multiple metadata schemas, and to relate attributes across schemas • Implemented in Oracle and DB2
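To make the collection and replication ideas concrete, here is a toy Python model of a logical data set with multiple physical replicas. It is not the SRB API; every resource name, path, and attribute below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    resource: str    # e.g. "hpss-sdsc", "unix-ncsa" (hypothetical resource names)
    path: str

@dataclass
class LogicalDataset:
    name: str
    metadata: dict
    replicas: list = field(default_factory=list)

    def choose_replica(self, preferred_resource=None):
        """Pick a replica on read, preferring a named resource if one matches."""
        for rep in self.replicas:
            if rep.resource == preferred_resource:
                return rep
        return self.replicas[0]

# A logical collection groups heterogeneous data sets under a single name space.
collection = {
    "/npaci/neuro/image001": LogicalDataset(
        name="image001",
        metadata={"type": "image", "owner": "moore"},
        replicas=[Replica("hpss-sdsc", "/hpss/img/001"),
                  Replica("unix-ncsa", "/data/img/001")]),
}
print(collection["/npaci/neuro/image001"].choose_replica("unix-ncsa").path)
```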

  32. MCAT Schema Integration • Publish schema for each collection • Clusters of attributes form a table • Tables implement the schema • Use Tokens to define semantic meaning • Associate Token with each attribute • Use DAG to automate queries • Specify directed linkage between clusters of attributes • Tokens - Clusters - Attributes
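A schematic sketch of the token / cluster / DAG idea in plain Python structures (not MCAT code); the schema, cluster, and attribute names are invented for illustration.

```python
# Tokens give attributes in different schemas a shared semantic meaning;
# a DAG of directed links between attribute clusters drives automated joins.
tokens = {
    "subject_id": [("neuro_schema", "subjects", "subj_id"),
                   ("imaging_schema", "scans", "patient")],
}

cluster_dag = {
    ("neuro_schema", "subjects"): [("imaging_schema", "scans")],
    ("imaging_schema", "scans"): [],
}

def join_path(start, goal, dag):
    """Depth-first search for a directed path between two attribute clusters."""
    if start == goal:
        return [start]
    for nxt in dag.get(start, []):
        rest = join_path(nxt, goal, dag)
        if rest:
            return [start] + rest
    return None

print(join_path(("neuro_schema", "subjects"), ("imaging_schema", "scans"), cluster_dag))
```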

  33. Publishing A New Schema

  34. Adding Attributes to the New Schema

  35. Displaying Attributes From Selected Schemas

  36. Security • Integration of SDSC Encryption Authentication system (SEA) with Globus GSI • Kerberos within security domain • Globus for inter-realm authentication • Access control lists per data set • Audit trails of usage • Need support for third-party authentication • User A accesses data under the control of digital library B when the data is stored at site C

  37. MIX: Mediation of Information using XML (diagram) • BBQ interfaces present active views (Active View 1, Active View 2), exchanging XML data and XMAS queries with the mediator • The mediator supports "active" views and decomposes each XMAS query into query fragments for the wrappers • Wrappers over an SQL database, a spreadsheet, HTML files, and a local data repository convert XMAS query fragments to the local query language and convert data in native format to XML
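To make the wrapper role concrete, a minimal sketch of the "native data to XML" step for an SQL source, using only the Python standard library; it does not implement XMAS query translation, and the database file, table, and column names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(db_path, sql, root_tag="result", row_tag="row"):
    """Run a query against a local SQL source and emit the rows as XML."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        el = ET.SubElement(root, row_tag)
        for col, val in zip(columns, row):
            ET.SubElement(el, col).text = "" if val is None else str(val)
    conn.close()
    return ET.tostring(root, encoding="unicode")

# Example (assumes a local "catalog.db" with a datasets table exists):
# print(rows_to_xml("catalog.db", "SELECT name, size FROM datasets"))
```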

  38. Integration of Digital Librarywith Metacomputing Systems • NTON OC-192 network (LLNL - Caltech - SDSC) • HPSS archive • Globus metacomputing system • SRB data handling system • MCAT extensible metadata • MIX semi-structured data mediation using XML • ICE collaboration environment • Feature extraction

  39. Data Intensive and High-Performance Distributed Computing (layered architecture) • Application Toolkits: communication libraries, visualization, grid-enabled libraries • Domain Specific Services Layer: resource discovery, resource brokering, scheduling • Generic Services Layer (information services): interdomain security, fault detection, end-to-end QoS, resource management, remote data access • Resources Layer: data repositories, network caching, metadata, local resource management

  40. Research Activities • Support for remote execution of data manipulation procedures • Globus - SRB integration • Automated feature extraction • XML based tagging of features • XML query language for storing attributes into the Intelligent Archive • Integration with RIO - parallel I/O transport

  41. Views of Software Infrastructure • Software infrastructure supports user applications • Reason for existence of software is to provide explicit capabilities required by applications • What is the user perspective for building new software systems? • Is the integration of digital library and metacomputing systems the final version?

  42. Software Integration Projects • NSF • Computational Grid - Middleware using distributed state information to support metacomputing services • DOE • Data Visualization Corridor - collaboratively visualize multi-terabyte sized data sets • NASA • Information Power Grid - integrate data repositories with applications and visualization systems • DARPA • Quorum - provide quality of service guarantees

  43. User Requirements - Five Software Environments • Code Development • Resources support • Run-time • Parallel Tools and Libraries • Distributed Run-Time • Metacomputing environment • Interaction Environments • Collaboration, presentation • Publication / Discovery / Retrieval • Data intensive computing environment

  44. Metacomputing Environment - Data Flow Perspective • Application → Object Oriented Interface → Distributed Execution Environment → Data Caching System → Data Staging System → Data Handling System → Remote Data Manipulation → Archival Storage System

  45. Publication Environment - Data Flow Perspective • Application → Run-time Access → Data Set Constructor → Digital Library Services → Collection Management Software → Data Handling System → Remote Data Manipulation → Archival Storage System

  46. Run-time Environment - Data Flow Perspective • Application → Parallel I/O Library → Memory Tiling → Data Structures Library → Library Interoperation → Data Caching System → Data Handling System → Archival Storage System

  47. Interaction Environment - Data Flow Perspective • Application → Collaboration Environment → Visualization Environment → Rendering System → Data Formatting System → Data Caching System → Data Manipulation System → Archival Storage System

  48. Taxonomy of User Requirements

  49. Comparison of Environments

  50. Comparison of Environments
