Common solution for the (very-)large data challenge.







  1. Common solution for the (very-)large data challenge. VLDATA Call: EINFRA-1 (Focus on Topics 4-5) Deadline: Sep. 2nd 2014

  2. 1.1 Objectives The mission of VLDATA is to provide common solutions for handling large and extremely large scientific data in a cost-effective way. This solution builds on existing pan-European e-Infrastructures and tools to provide an interoperable, efficient and sustainable platform for scientific user communities, in particular to support a new generation of data scientists. The success of this project will secure European leadership in the development and support of big data and global data science and will therefore contribute to the leadership of European scientists and enterprises in many research and innovation fields.

  3. Objectives (I) • O1: a "flexible and extendable" platform supporting common solutions for large-scale distributed data processing and analysis, ensuring interoperability among existing e-Infrastructure providers. • O1.1: (WP2,3,4,5) provide a common solution using generic e-Infrastructure for processing large or extremely large volumes of scientific data in a robust, efficient and cost-effective way. • O1.2: (WP6) provide a flexible and customizable platform that can be extended to cover the specific requirements of each community.

  4. Objectives (II) • O2: standardized solutions aiming at global interoperability and open access for large-scale data processing, minimizing unnecessary large transfers • O2.1: (WP2) provide a common language and standards for handling large volumes of data • O2.2: (WP2,3,4,5) improve the efficiency of distributed data processing by providing a smart data and computing management platform • O2.3: (WP2,3,4,5) enable effective handling of big data samples by integrating new technologies • O2.4: (WP8) assess the value of this generic solution for the relevant stakeholders: end scientists, their management, funding agencies, policy makers, companies and society at large

  5. Objectives (III) • O3: increase the number of users and Research Infrastructure projects making efficient use of existing e-Infrastructure resources, designing appropriate exploitation strategies and a long-term sustainability plan. • O3.1: (WP5,7) deliver ready-to-use, high-quality standard products for internal and external usage, enhancing interdisciplinary data science at a global scale • O3.2: (WP6,9) increase the degree of open access to large-scale distributed data • O3.3: (WP9) educate a new generation of data scientists and society in general

  6. 1.2 Relation to the work programme

  7. 1.3 Concept and approach (ideas) Make IT simple • Simplicity: VLDATA provides an abstraction of the different Resources, which are all made accessible to the end user via the same interfaces. • Transparency: users can specify their Workflows/Pipelines at different levels of abstraction. The platform takes care of the Resource Allocation necessary to fulfill the required specifications. • Extensibility and flexibility: VLDATA provides an API that allows users to extend the provided functionality by developing new or customized components. • Reliability: quality standards and extensive validation in several scientific domains ensure the readiness for use and robustness of VLDATA-based solutions. • Scalability: a modular implementation allows horizontal (number of connected Resources or Users) and vertical (number of processed Units) scaling to adapt VLDATA to the needs of each particular community or Research Infrastructure project. • Smart and intelligent: building on collected experience and monitoring data, algorithms can search for optimized scheduling/searching strategies, including automated decision making based on usage traces and expectations. • Cost-effective: building on existing, well-established solutions and incrementally extending and developing them to address new challenges with an evolving, validated common solution, avoiding unnecessary duplicated effort.
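The simplicity and extensibility ideas above can be sketched as a minimal resource-abstraction interface. This is a hypothetical illustration in Python (the language of the DIRAC toolkit the platform builds on); the class and method names are invented for this sketch and are not the real DIRAC API:

```python
from abc import ABC, abstractmethod


class ComputeResource(ABC):
    """Uniform interface hiding grid/cloud/cluster/HPC specifics (hypothetical)."""

    @abstractmethod
    def submit(self, job_description: dict) -> str:
        """Submit a job and return a backend-independent job ID."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a normalized status: 'waiting', 'running', 'done', 'failed'."""


class ClusterResource(ComputeResource):
    """Example backend; a cloud or grid backend would plug in the same way."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_description):
        job_id = f"cluster-{len(self._jobs) + 1}"
        self._jobs[job_id] = "waiting"
        return job_id

    def status(self, job_id):
        return self._jobs.get(job_id, "unknown")


# Users target the abstract interface; the platform selects the backend,
# so adding a new resource type does not change user-facing code.
backend = ClusterResource()
jid = backend.submit({"executable": "analysis.sh"})
print(jid, backend.status(jid))  # cluster-1 waiting
```

A community wanting to cover a new resource type would implement `ComputeResource` once, and all existing workflows would gain access to it unchanged, which is the "extensibility via API" point above.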

  8. 1.3 Concept and approach (model) • Model (building blocks): • Collaborative modular architecture, with multiple layers sharing the same Framework and Basic modules, allowing horizontal & vertical scaling to ensure scalability. • Open, iterative, incremental and parallel, requirement-driven development process. Agile(?) methodology. • Standard procedures for quality assurance, including security, platform integration and validation, including reference benchmarks, and release procedures in accordance with requirements for production-level services. • Layers: (the result of 10 years of evolution of the DIRAC development effort) • Framework: communication, security, access control, user/group management, DBs • Basic modules: SystemLogging, Configuration, Accounting, Monitoring • Low-level modules: File Catalog, Resource Status, Request Management, Workload Management • High-level modules: Data Management, Workflow Management • Interfaces: User - Resource
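The layering above can be illustrated with a minimal composition sketch: high-level modules depend only on low-level modules, which in turn use the shared Framework for communication. This is hypothetical Python; the names mirror the layer names in the slide but none of the interfaces are the real DIRAC ones:

```python
class Framework:
    """Shared base layer: communication, security, configuration access."""

    def call_service(self, service: str, action: str, **kwargs):
        # In a real system this would be a secure RPC call; here it only
        # echoes the request so the layering itself can be demonstrated.
        return {"service": service, "action": action, "args": kwargs}


class FileCatalog:
    """Low-level module: built directly on the shared Framework."""

    def __init__(self, framework: Framework):
        self.fw = framework

    def register(self, lfn: str, replicas: list):
        return self.fw.call_service("FileCatalog", "register",
                                    lfn=lfn, replicas=replicas)


class DataManagement:
    """High-level module: composes low-level modules, never the Framework."""

    def __init__(self, catalog: FileCatalog):
        self.catalog = catalog

    def upload(self, lfn: str, storage_element: str):
        # The actual transfer is omitted; only catalog registration is shown.
        return self.catalog.register(lfn, [storage_element])


fw = Framework()
dm = DataManagement(FileCatalog(fw))
result = dm.upload("/vldata/user/file.dat", "SE-1")
print(result["action"])  # register
```

The point of this structure is the one the slide makes: because every layer shares the same Framework and Basic modules, a new high-level module reuses security, logging and communication for free instead of reimplementing them.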

  9. 1.3 Concept and approach (assumptions) • The current solution can be evolved into a new general platform to be widely applied. • Evolution from grids to clouds, but heterogeneity will increase. • Large degree of commonality in low-level requirements and tools between different scientific domains. • Fast growth of data and computing requirements, almost doubling every year. Aggregated estimates are close to the exabyte level 5 years from now (EGI expects 10.000.000 cores and 1?? exabyte of scientific data by 2020). (Ref: http://delaat.net/talks/cdl-2014-05-13.pdf) • Similar growth in the number of data objects, computing units and end users (60% of ESFRI projects completed or launched by 2015). • New scientific domains are entering the digital era: the 4th paradigm of science; a new data science is emerging (http://research.microsoft.com/en-us/collaboration/fourthparadigm/) • Data is to be made openly available beyond the community that produced it, down to citizens who might also contribute to its further processing. • Common development and validation provide robustness as well as cost savings and thus enable sustainability.
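As a back-of-the-envelope check of the "doubling every year" assumption above, a few lines of arithmetic show how a modest starting volume reaches the exabyte scale within the stated horizon. The starting figure here is an assumed, purely illustrative number, not one taken from the proposal:

```python
# Illustrative arithmetic for the "doubling every year" assumption.
# start_pb is a hypothetical aggregated volume today, in petabytes.
start_pb = 30.0
years = 5

volume_pb = start_pb * 2 ** years        # doubling compounds multiplicatively
volume_eb = volume_pb / 1000.0           # 1 EB = 1000 PB (decimal units)

print(f"{volume_pb:.0f} PB = {volume_eb:.2f} EB after {years} years")
```

With these assumed numbers, 30 PB doubling annually yields 960 PB, i.e. just under one exabyte after five years, consistent with the slide's "close to the exabyte level" estimate.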

  10. 1.4 Ambition

  11. 2.1 Expected impacts • DIRECT impact • scalability, robustness (for the Research Infrastructure) • The expected impact is that participating RI projects will be able to operate their Distributed Computing Systems to efficiently process their large volumes of research data, making them available to their end users in a reliable and cost-effective way that could not be achieved before. This may lead to new ways of organizing science activities and to significant scientific breakthroughs. By providing important functional components (e.g., ) that were missing from existing practices, the VLDATA platform will make possible the transparent integration of resources, hiding the complexity from users and extending the scale of the resources Research Infrastructure projects can utilize. This will increase the number of RIs using the project tools and the number of different types of resources reachable through the tools. • simplicity (for the user: scientist/operator) • cost-efficiency (for funding agencies) • reduce duplication of effort, maximizing the use of EU-invested e-Infrastructures, enlarging the user communities, providing efficient data processing services, and providing advanced technology by integrating the state of the art, which reduces development cost significantly (also the processing algorithms)

  12. 2.1 Expected impacts • Indirect impact: large user community • Science • innovation • Society • industry • Citizens • policy makers • a new generation of data scientists • On the other hand, the scale of the data challenge requires simple but intelligent solutions to integrate resources from different e-Infrastructure providers.

  13. 2.2 Measures to maximize impact

  14. Research Infrastructures (I) • Belle II: • Usage of DIRAC for the Experiment, use case presented: • Common access to various platforms: Grid + cloud + cluster + HPC • Monitoring support for Workflow management tools • Integration for the needs of other participants • User interface • EU-T0: Virtual data centers / New Virtualization techniques?

  15. Research Infrastructures (II) • PAO: • Usage of DIRAC for the Experiment, data taking -> 2022 • Using a standard solution will help sustainability. • Extend functionality for their use case. • Common access to various platforms: Grid + cloud + cluster + HPC (follow the evolution of providers), in particular OSG • Open Access to data • EU-T0: Data locality

  16. Research Infrastructures (III) • LHCb: • should cover Run 2 needs and target the needs of Run 3 (DAQ Upgrade) • Data rate will increase by a factor of ~5, to 10 PB/year. • Integration of Cloud resources. • Massive data-driven Workflows for users. • Data preservation (?) • Resource (cpu/storage/network/...) description/monitoring/availability/management, smart allocation • Smart/intelligent/dynamic data placement strategies (network) • EU-T0: New Virtualization techniques, Resource description/monitoring/availability, Virtual data centers, Data locality

  17. Research Infrastructures (IV) • EISCAT_3D: • searching data (metadata catalog), intelligent searching (pattern recognition) • visualization • Workflows to go from one data level to another with appropriate access rights • Training • flexible interconnection of different resources, central (HPC) + distributed (Grid/Cloud) • time-constrained massive data reduction (10 PB -> 1 PB / month ??), including the possibility for user-defined algorithms. • EU-T0:
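The time-constrained reduction figure above (10 PB in, 1 PB out per month, itself still marked "??" in the slide) implies a sustained ingest rate that is easy to estimate; since the input number is tentative, treat this as an order-of-magnitude illustration only:

```python
# Sustained input rate implied by reducing 10 PB to 1 PB each month.
# The 10 PB/month figure is the tentative estimate from the slide ("??"),
# so the result is an order-of-magnitude illustration, not a requirement.
input_pb_per_month = 10.0
seconds_per_month = 30 * 24 * 3600            # ~2.59 million seconds

input_bytes = input_pb_per_month * 1e15       # decimal petabytes
rate_gb_s = input_bytes / seconds_per_month / 1e9

print(f"required sustained input rate: {rate_gb_s:.1f} GB/s")
```

Roughly 4 GB/s of continuous input, which is why the slide pairs this use case with flexible interconnection of central (HPC) and distributed (Grid/Cloud) resources rather than a single site.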

  18. Research Infrastructures (V) • BES III:

  19. 3.1 Work Plan (To be confirmed) • WP1 Coordination (UB, Spain) • External Advisory Board (EUDAT, OGF, RDA, OSG, PRACE, XSEDE, CERN/HelixNebula) • WP2 Requirement analysis & Design (CU, UK) • WP3 Data-driven development (UB, Spain) • WP4 User-driven development (CYFRONET, Poland) • WP5 Quality (UAB, Spain) • WP6 Validation (????) • LHCb (CNRS/INFN) • Belle II (Institut Jozef Stefan, University of Maribor and University of Ljubljana, Slovenia) • EISCAT_3D (SNIC, Sweden / EISCAT Scientific Association) • PAO (CESNET, Czech Republic) • BES III (IHEP, China / INFN-Torino, Italy) • DIRAC 4 EGI, multi-community EGI solution (EGI.eu, the Netherlands) • WP7 Dissemination: outreach + Training (CNRS, France) • WP8 Exploitation (ASCAMM, Spain) • WP9 Communication, Internationalization (UvA, the Netherlands)

  20. 3.2 Management structure and procedures • Consortium Board (all partners) • External Advisory Board • External Communities' Coordinators • Internal Communities' Coordinators • Executive Board (1 representative from each Area): Coordinator, Project Manager, Comm./Exploit. Coordinator, Tech. Coordinator • Areas: Design/Develop WPs (2,3,4,5); Integration/Operations WPs (6); Communication/Sustainability WPs (7,8,9)

  21. 3.3 Consortium as a whole

  22. Private Companies • Bull/Dell (??) • ETL (UK) • AlpesLaser (CH)

  23. 3.4 Resources to be committed

  24. Calendar (milestones) • May 23: Close the Contractors • June 11-13: all WPs ready, F2F meeting to close the Work Plan. Deadline for RIs and third Parties • July 9-11: Close proposal (I) • July 25: Proofread -> External review • Aug 18 -> Sep 2: final updates
