Tera/Petabyte data distribution architectures
Chris A. Mattmann
USC-CSE Annual Research Review
Sunday, October 26, 2014
Outline
• Research Problem and Importance
• Background and Related Work
• Problem Statement
• Approach
• Evaluation Strategy
• Conclusions
MATTMANN-ARR
Research Problem and Importance
• Volume of data returned from scientific experiments and media content providers is growing rapidly
• Planetary Data System
  • Current: 20 terabytes for all NASA missions
  • Growing to: over 200 terabytes from a single mission!
• Orbiting Carbon Observatory
  • Current: hundreds of gigabytes to a single terabyte
  • Growing to: over 150 terabytes!
* Projected as of 1/11/04
Research Problem and Importance
• National Cancer Institute’s Early Detection Research Network (EDRN)
  • Current: tens of gigabytes to hundreds of gigabytes
  • Growing to: hundreds of gigabytes to terabytes
• Question: how to distribute these voluminous data sets?
Distributing Large Volumes of Data
• Use existing infrastructure?
  • HTTP/REST?
• Issues:
  • Scalability?
  • Single entry point?
  • Limited bandwidth?
• What about other distribution mechanisms?
  • RMI
  • SOAP
  • GridFTP
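To make the bandwidth concern concrete, here is a back-of-the-envelope sketch in Python. The 200-terabyte volume is the single-mission figure quoted earlier in the talk. The link speeds, the decimal units (1 TB = 10^12 bytes), and the function name `transfer_days` are assumptions made for illustration. Protocol overhead and contention are ignored, so these are best-case numbers.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(volume_terabytes: float, link_megabits_per_s: float) -> float:
    """Ideal time, in days, to move a volume over a fully utilized link."""
    bits = volume_terabytes * 1e12 * 8           # decimal terabytes -> bits
    seconds = bits / (link_megabits_per_s * 1e6)  # megabits/s -> bits/s
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Link speeds below are illustrative assumptions, not measured values.
    for mbps in (100, 1_000, 10_000):
        print(f"{mbps:>6} Mbit/s: {transfer_days(200, mbps):6.1f} days")
```

Under these assumptions, 200 terabytes over a saturated 100 Mbit/s link takes roughly 185 days, and about 18.5 days at 1 Gbit/s, which is why a single entry point with limited bandwidth is a real concern.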
Distributing Large Volumes of Data
• Few data movement mechanisms are in place for scientists, students, educators, etc. to get their data
  • EDRN: HTTP/REST
  • National Space Science Data Archive: FTP
  • Physical Oceanography Distributed Active Archive Center: FTP and Aspera commercial UDP technology
  • Even Google: HTTP/REST, SOAP
• Even when many mechanisms are in place, how do we select the correct one?
• Sometimes we may even need to use them in concert
  • Certain users may only be able to get data via GridFTP, while others may require HTTP/REST
  • HTTP combined with a UDP-based mechanism may speed up the transfer
Distributing Large Volumes of Data
• Understanding the tradeoffs
  • HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard
  • It’s good in many situations, but not all situations
  • The same goes for many of the other distribution mechanisms
  • RMI is scalable, but ties you to Java
  • Peer-to-peer is highly scalable and efficient, but may neglect dependability and consistency
• Understanding how many different data movement technologies there are:
  • GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, BitTorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW
  • …and that’s just off the top of my head!
• Understanding the classes of data movement technologies
Software Architecture
• The definition of a system in the form of its canonical building blocks
  • Software Components: the computational units in the system
  • Software Connectors: the communications and interactions between software components
  • Software Configurations: arrangements of components and connectors and the rules that guide their composition
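The three building blocks can be sketched as plain data types. This is a minimal illustration in Python, not a model taken from the talk: the class names mirror the slide's terms, while the `attach` method, its composition rule (both endpoints must already belong to the configuration), and the example names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """An interaction (here: one-way data delivery) between two components."""
    name: str
    source: Component
    target: Component


@dataclass
class Configuration:
    """An arrangement of components and connectors plus a composition rule."""
    components: set[Component] = field(default_factory=set)
    connectors: list[Connector] = field(default_factory=list)

    def attach(self, connector: Connector) -> None:
        # Assumed composition rule: a connector may only link components
        # that are already part of this configuration.
        for endpoint in (connector.source, connector.target):
            if endpoint not in self.components:
                raise ValueError(f"{endpoint.name!r} is not in the configuration")
        self.connectors.append(connector)


if __name__ == "__main__":
    archive, client = Component("archive"), Component("client")  # hypothetical names
    config = Configuration(components={archive, client})
    config.attach(Connector("HTTP/REST", archive, client))
    print([c.name for c in config.connectors])
```

Treating the connector as its own object, rather than burying it inside a component, is what lets a distribution mechanism be swapped without touching the components it links.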
A Software Architectural View of the Data Distribution Problem
• …Understanding the architectures of existing data systems
• …Deciding the appropriate software connectors for data distribution (and their combinations) to use
• …Satisfying specified user scenarios for data distribution
• …Making these people happy!
Research Question
• What types of software connectors are best suited to delivering these huge amounts of data to users in hugely distributed data systems, in a way that satisfies the users’ particular scenarios and is both performant and scalable?
Problem Statement
• Identifying and selecting suitable software connectors for data distribution* that satisfy user-specified constraints
• Use eight key dimensions of data distribution, drawn from
  • Literature review
  • Our own experience in the context of planetary science and cancer research at JPL
• User-specified constraints on the eight dimensions define data distribution scenarios
• Identification of four basic distribution connector classes
  • RPC, P2P, Grid, Event-based
• What classes are appropriate for which distribution scenarios?
* Referred to as “distribution connectors” or “data distribution connectors”
Eight Dimensions of Data Distribution
• Total Volume: the total amount of data that needs to be transferred from providers of data to consumers of data.
• Number of Delivery Intervals: the number, size and frequency (timing) of the intervals within which the volume of data should be delivered.
• Performance Requirements: any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario.
• Number of Users: the number of unique users that the data volume needs to be delivered to.
• Number of User Types: the number of unique user types, such as scientists or students, that the data volume needs to be delivered to.
• Data Types: the number of different data types that are part of the total volume to be delivered.
• Geographic Distribution: the geographic distribution of the data providers and consumers.
• Access Policies: the number and types of access policies in place at each producer and consumer of data.
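A distribution scenario, understood as constraints on these eight dimensions, could be captured in a small record type. The sketch below is one possible encoding in Python. The field names, units, and value types are assumptions (the slides do not say how each dimension is represented), and the example values are made up purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# The four performance qualities named under "Performance Requirements".
PERFORMANCE_QUALITIES = frozenset(
    {"scalability", "efficiency", "consistency", "dependability"}
)


@dataclass(frozen=True)
class DistributionScenario:
    """One scenario: a value (or constraint) for each of the eight dimensions."""
    total_volume_gb: float                    # Total Volume (unit is an assumption)
    delivery_intervals: int                   # Number of Delivery Intervals
    performance_requirements: frozenset[str]  # subset of PERFORMANCE_QUALITIES
    num_users: int                            # Number of Users
    user_types: frozenset[str]                # Number of User Types
    data_types: frozenset[str]                # Data Types
    geographic_distribution: str              # e.g. "LAN" or "WAN" (assumed labels)
    access_policies: frozenset[str]           # Access Policies

    def __post_init__(self) -> None:
        unknown = self.performance_requirements - PERFORMANCE_QUALITIES
        if unknown:
            raise ValueError(f"unknown performance qualities: {sorted(unknown)}")
        if self.total_volume_gb <= 0 or self.delivery_intervals < 1 or self.num_users < 1:
            raise ValueError("volume, intervals and users must be positive")


if __name__ == "__main__":
    # Made-up example values, not taken from any of the systems in the talk.
    scenario = DistributionScenario(
        total_volume_gb=500.0,
        delivery_intervals=10,
        performance_requirements=frozenset({"consistency", "efficiency"}),
        num_users=25,
        user_types=frozenset({"scientist", "student"}),
        data_types=frozenset({"image", "metadata"}),
        geographic_distribution="WAN",
        access_policies=frozenset({"authenticated"}),
    )
    print(scenario)
```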
Approach
• Classification
• Categorization
• Integration
• Testing/Evaluation
Evaluation Strategy
• Empirical evaluation using real-world systems
  • NASA Planetary Data System
  • NASA Orbiting Carbon Observatory Mission
  • National Cancer Institute’s Early Detection Research Network
• Quantitatively measure
  • consistency (data delivered is data sent)
  • efficiency (memory footprint and data throughput)
  • scalability (data volume and number of hosts)
  • dependability (uptime, number of faults)
• Compare to off-the-shelf connector solutions
  • OODT, GridFTP, Aspera, UFTP, BitTorrent, possibly more
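Two of these measurements, consistency and data throughput, are easy to sketch for a single file. The Python below is an illustrative harness under stated assumptions, not the evaluation code behind the talk. Consistency is approximated by comparing SHA-256 digests of the sent and delivered files. A local `shutil.copyfile` stands in for a real connector. The function names and the shape of the result dictionary are hypothetical.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Digest a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def measure_transfer(source: Path, destination: Path, transfer=shutil.copyfile) -> dict:
    """Time one transfer and check that the data delivered is the data sent.

    `transfer` is any callable taking (source, destination); the default local
    copy is only a placeholder for a real distribution connector.
    """
    start = time.perf_counter()
    transfer(source, destination)
    elapsed = time.perf_counter() - start
    size = source.stat().st_size
    return {
        "consistent": sha256_of(source) == sha256_of(destination),
        "bytes": size,
        "seconds": elapsed,
        "throughput_mb_per_s": (size / 1e6) / elapsed if elapsed > 0 else float("inf"),
    }
```

Passing the connector in as a callable keeps the measurement code identical across the technologies being compared. Memory footprint, scalability, and dependability would need separate instrumentation that this sketch does not attempt.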
Current Progress
• Preliminary study with NASA’s Planetary Data System
• Classified and compared data movement technologies
  • Parallel TCP/IP technologies: GridFTP, bbFTP
  • UDP bursting technologies: Aspera, UFTP
  • Baseline technologies: SCP, FTP, HTTP
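The classification on this slide amounts to a small lookup table. A minimal Python sketch follows. The groupings come from the slide, while the dictionary name, the helper `class_of`, and the case-insensitive matching are assumptions for illustration.

```python
# Groupings as listed on the slide; the structure itself is illustrative.
TECHNOLOGY_CLASSES = {
    "parallel TCP/IP": ("GridFTP", "bbFTP"),
    "UDP bursting": ("Aspera", "UFTP"),
    "baseline": ("SCP", "FTP", "HTTP"),
}


def class_of(technology: str) -> str:
    """Return the class a technology was placed in (case-insensitive match)."""
    wanted = technology.casefold()
    for class_name, members in TECHNOLOGY_CLASSES.items():
        if any(wanted == member.casefold() for member in members):
            return class_name
    raise KeyError(technology)


if __name__ == "__main__":
    for name in ("GridFTP", "UFTP", "SCP"):
        print(f"{name}: {class_of(name)}")
```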
Experimental Results
• Classified and evaluated each technology against the data distribution dimensions
• Measured transfer rate
  • LAN-based
  • WAN-based
• Varied dataset sizes from tens of MBs to tens of GBs
• Assessed ease of operation and ease of installation
• UDP technologies were not testable on the WAN (firewall, security, ease of configuration)
* Figure legend: GridFTP (blue), bbFTP (red), FTP (green)
Conclusions
• Proposed an approach for classifying, selecting and evaluating different software connectors for data distribution
• Preliminary results suggest parallel TCP/IP technologies are beneficial in a real-world system (PDS)
• Currently formalizing connector metadata and developing connector XML profiles
Questions?
• Thanks for your attention!
Refereed Papers
• C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May 2006.
• C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th, 2006.
• N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September, 2005.
• C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp. 322-336. LNCS 3489, St. Louis, Missouri, May 14th-15th, 2005.
• C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp. 68-77. LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.
Refereed Papers
• J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7, 2004.
• C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp. 255-264. Oslo, Norway, June 12th-15th, 2004.
• C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.