
VIFI: Virtual Information Fabric for Data-Driven Discovery

VIFI is a novel cyberinfrastructure that enables data-driven discovery from distributed, fragmented datasets without moving massive amounts of data or exposing sensitive information. It aims to provide open-source middleware tools and demonstrate its usefulness in various domains.


Presentation Transcript


  1. VIFI: Virtual Information Fabric for Data-Driven Discovery from Distributed Fragmented Repositories. PI: Dr. Ashit Talukder, Bank of America Endowed Chair in IT. Email: atalukde@uncc.edu. Web: http://cs.uncc.edu/directory/talukder-ashit

  2. VIFI Concept • Novel VIFI cyberinfrastructure that facilitates data-driven discovery from distributed, fragmented datasets • without requiring movement of massive amounts of data • without exposing sensitive raw datasets to end users. • Overarching Goals: • Open source middleware tools • Evaluate and demonstrate on multiple domains: Earth Science, Astronomy, Health Informatics, Resilient Human-Building Ecosystems • Useful in domains involving massive or heterogeneous data streams, with novel edge analytics and fog computing.

  3. Traditional Data Fabric: Limitations • Complex and time-consuming processes, standards, APIs, and MOUs, which may include format conversion, DB import, selective field encryption, data redaction or de-identification, etc. • Given appropriate authorizations and consideration for data privacy, bulk datasets are transported across bandwidth-limited connections. • After the bulk data has been staged and ingested, analytics separates valuable information from irrelevant data. The volume of irrelevant data often eclipses that of the usable information.

  4. VIFI Proof of Concept: Early Stage Demonstrations • Demonstrate initial core components in the VIFI proof-of-concept use case: • User interface and visualization of distributed data and VIFI features • Portable analytics container (PAC): prepare self-contained analytics scripts and algorithms • Docker Swarm: deploy, monitor, and execute portable analytics containers (PACs) on remote repositories • Orchestration of distributed infrastructure • Distributed computation and analytics without moving distributed repositories • User visualization of analytics and data-driven insights • Demonstrate on a pilot Earth science use case: climate and weather precipitation model prediction from distributed Earth science repositories • Demonstrate on a pilot Astronomy use case: detecting specific statistical patterns in distributed astronomy data

  5. VIFI POC Use Case: Hourly Precipitation Datasets over the Great Plains • When it rains somewhere in the Great Plains, is there a probability density function that can forecast how strongly, how long, and how much it rains? • The example uses one day of rainfall data from NASA's observational (GPM) and model datasets at three different spatial resolutions. [Figure: region map from [Bukovsky 2011]; region 10: Northern Plains]

  6. VIFI Motivation: Traditional Data Fabric Architecture [Diagram: Model Server (Model Data), Observation Server (Observations Data), User Node (or Server)] 1. Download Model data to the User Node 2. Download Obs data to the User Node 3. Observational data is re-gridded to the same resolution as the model data (if necessary) 4. JPDF is computed for the observational data 5. JPDF is computed for the model data 6. Observed and simulated JPDFs are used to compute an Evaluation Metric
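Steps 4 to 6 of the traditional workflow can be illustrated with a minimal Python sketch. It is not the project's code: the slides do not say which two variables the JPDF covers or which evaluation metric is used, so the function names `joint_pdf` and `overlap_score`, the bin edges, and the choice of a histogram-overlap score are all assumptions made for illustration.

```python
from collections import Counter


def joint_pdf(xs, ys, x_edges, y_edges):
    """Return a normalized 2-D histogram as {(i, j): probability}.

    Assumption: the JPDF is estimated by simple binning of two
    per-event variables (for example, intensity and duration).
    """
    def bin_index(value, edges):
        # Half-open bins [edges[i], edges[i + 1]); values outside are dropped.
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None

    counts = Counter()
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell: n / total for cell, n in counts.items()}


def overlap_score(p, q):
    """Hypothetical evaluation metric: shared probability mass of two JPDFs.

    1.0 means identical distributions, 0.0 means no common mass.
    """
    return sum(min(p.get(c, 0.0), q.get(c, 0.0)) for c in set(p) | set(q))


# Toy data standing in for observed and simulated rainfall events.
edges = [0, 1, 2]
obs_jpdf = joint_pdf([0.5, 1.5], [0.5, 1.5], edges, edges)
model_jpdf = joint_pdf([0.5, 0.5], [0.5, 1.5], edges, edges)
score = overlap_score(obs_jpdf, model_jpdf)
```

In the traditional architecture all of this runs on the User Node, after both full datasets have been downloaded.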

  7. VIFI Motivation: Traditional Data Fabric Architecture • Disadvantages: • Long time for transferring massive datasets to the User Node • High requirements for storing massive datasets on the User Node • All computations are executed on the same server • All data are transferred to the User Node – including data that might not be relevant for the analysis in question • Scientist must manually install the algorithms (including all dependencies) on the User Node

  8. VIFI Motivation: ViFi Enabled Data Fabric Architecture [Diagram: User Node (or Server), Model Server (Model Data), Observation Server (Observations Data), Docker Hub] 1. Send Model PAC Script 2. Send Obs PAC Script 3. Request Model PAC Docker Image 4. Request Obs PAC Docker Image 5. Execute Model PAC 6. Execute Obs PAC 7. Send Model Results 8. Send Obs Results 9. Request Evaluation PAC Docker Image 10. Execute Evaluation PAC

  9. VIFI Motivation: ViFi Enabled Data Fabric Architecture • Advantages: • All phases of the scientific analysis lifecycle (compute and data transfer) are executed by a single agent (NIFI), without any manual intervention or a priori knowledge on the scientist's part. • Science algorithms are encapsulated in reusable PACs, which can be seamlessly deployed and run on any ViFi-enabled Node. • Computations are distributed across multiple servers, which have direct access to the data (NO NEED TO MOVE DATA). • Only a subset of the data (i.e., the results of the Model and Observation PACs) is transferred over the network, drastically reducing data transfer times. • The overall infrastructure scales to any new data source simply by installing the ViFi software.
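The "move the analytics, not the data" idea can be mimicked in a few lines of plain Python. This is only a toy simulation under stated assumptions: real PACs are Docker images orchestrated by NIFI, whereas here a PAC is a plain Python function, and the `Node` class, `summary_pac`, `evaluation_pac`, and the sample values are hypothetical names and data invented for the example.

```python
import json


class Node:
    """Stand-in for a ViFi-enabled server that holds its own dataset."""

    def __init__(self, name, data):
        self.name = name
        self._data = data  # raw data stays on the node

    def execute(self, pac):
        # Run the analytics where the data lives and return only the
        # serialized result, which is what would cross the network.
        return json.dumps(pac(self._data))


def summary_pac(values):
    """Hypothetical PAC: reduce a large series to a few statistics."""
    return {"count": len(values), "mean": sum(values) / len(values)}


def evaluation_pac(model_summary, obs_summary):
    """Hypothetical evaluation step run on the User Node."""
    return {"mean_bias": model_summary["mean"] - obs_summary["mean"]}


model_node = Node("model-server", [0.0, 1.0, 2.0, 3.0] * 1000)
obs_node = Node("observation-server", [0.0, 2.0, 2.0, 4.0] * 1000)

# Only these small payloads are transferred to the User Node.
model_payload = model_node.execute(summary_pac)
obs_payload = obs_node.execute(summary_pac)
result = evaluation_pac(json.loads(model_payload), json.loads(obs_payload))

# Size comparison: result payloads versus shipping the raw data instead.
transferred_bytes = len(model_payload) + len(obs_payload)
raw_bytes = len(json.dumps(model_node._data)) + len(json.dumps(obs_node._data))
```

Comparing `transferred_bytes` with `raw_bytes` shows the effect the slide describes: the result payloads are a small fraction of what a bulk download would move.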

  10. VIFI User Interface [Screenshot panels: PAC script upload, PAC script write, visualization types, results]

  11. NIFI at User Site

  12. NIFI at Server Site(s) • NIFI at Model Server • NIFI at Observation Server

  13. PoC Current Status • Open source (extensible and portable across infrastructures) • Initial deployment on AWS (for speed of demonstration; portable and easy to deploy on locally managed infrastructure if needed) • AWS virtual machines • AWS S3 bucket to keep results • First datacenter hosts Model data + NIFI + Docker Swarm • Second datacenter hosts Observation data + NIFI + Docker Swarm • User node with NIFI + Docker Swarm • Docker image of Apache OCW at Docker Hub • User interface and visualization base functionalities

  14. PoC Future Work • Expand pilot, commence HLA design • Integration between UI and NIFI • Common NIFI workflow design for most datacenters (i.e., not only for JPL): • Identification of common attributes of users as well as datacenters • Data search and virtualization separate from PAC scripts • Data governance, data management, search and query • Workflow scheduling and optimization (e.g., DAWN, IReS) • Security integration (authentication, authorization, audit, provenance) • Encryption integration (encrypt relevant data and run computations on encrypted data) • Demonstrate, evaluate, and benchmark on multiple application domains
