
Data Management and Data Processing Support on Array-Based Scientific Data


Presentation Transcript


  1. Data Management and Data Processing Support on Array-Based Scientific Data Yi Wang Advisor: Gagan Agrawal Candidacy Examination

  2. Big Data Is Often Big Arrays • Array data is everywhere • Molecular Simulation: Molecular Data • Life Science: DNA Sequencing Data (Microarray) • Earth Science: Ocean and Climate Data • Space Science: Astronomy Data

  3. Inherent Limitations of Current Tools and Paradigms • Most scientific data management and data processing tools are too heavy-weight • Hard to cope with different data formats and physical structures (variety) • Data transformation and data transfer are often prohibitively expensive (volume) • Prominent Examples • RDBMSs: not suited for array data • Array DBMSs: require costly data ingestion • MapReduce: tied to a specialized file system

  4. Mismatch Between Scientific Data and DBMS • Scientific (Array) Datasets: • Very large but processed infrequently • Read/append only • No resources for reloading data • Popular formats: NetCDF and HDF5 • Database Technologies • For (read-write) data – ACID guaranteed • Assume data reloading/reformatting is feasible

  5. Example Array Data Format - HDF5 • HDF5 (Hierarchical Data Format)
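To make the hierarchy concrete, here is a minimal sketch using the h5py library: an HDF5 file is a tree of groups (directory-like nodes) and datasets (typed, multidimensional arrays), both of which can carry attributes. The file and dataset names ("climate.h5", "/ocean/temperature") are hypothetical.

```python
import h5py

# Build a tiny HDF5 file to illustrate the hierarchy: groups act like
# directories, datasets hold typed multidimensional arrays, and both
# can carry metadata attributes. All names here are hypothetical.
with h5py.File("climate.h5", "w") as f:
    ocean = f.create_group("ocean")                        # group node
    temp = ocean.create_dataset("temperature",
                                shape=(365, 180, 360),     # 3D array
                                dtype="f4")
    temp.attrs["units"] = "degC"                           # attribute

with h5py.File("climate.h5", "r") as f:
    dset = f["/ocean/temperature"]
    print(dset.shape, dset.dtype, dset.attrs["units"])
```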

  6. The Upfront Cost of Using SciDB • High-Level Data Flow • Requires data ingestion • Data Ingestion Steps • Raw files (e.g., HDF5) -> CSV • Load CSV files into SciDB “EarthDB: scalable analysis of MODIS data using SciDB” - G. Planthaber et al.
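As a rough sketch of why this ingestion path is expensive: step 1 alone materializes every array cell as a CSV row before SciDB's bulk loader ever runs. The file and dataset names below are hypothetical.

```python
import csv
import h5py
import numpy as np

# Step 1 of the pipeline: dump an HDF5 array to CSV, one
# "x,y,value" row per cell (names are hypothetical).
with h5py.File("modis.h5", "r") as f, \
        open("modis.csv", "w", newline="") as out:
    data = f["/band1"][...]              # read the whole 2D array
    writer = csv.writer(out)
    for (x, y), v in np.ndenumerate(data):
        writer.writerow([x, y, v])

# Step 2 would feed modis.csv to SciDB's bulk loader. It is exactly
# this dump-and-reload detour that the thesis aims to avoid.
```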

  7. Thesis Statement • Native Data Can Be Queried and/or Processed Efficiently Using Popular Abstractions • Process data stored in the native format, e.g., NetCDF and HDF5 • Support SQL-like operators, e.g., selection and aggregation • Support array operations, e.g., structural aggregations • Support MapReduce-like processing API

  8. Outline • Data Management Support • Supporting a Light-Weight Data Management Layer Over HDF5 • SAGA: Array Storage as a DB with Support for Structural Aggregations • Approximate Aggregations Using Novel Bitmap Indices • Data Processing Support • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats • Future Work

  9. Overall Idea • An SQL Implementation Over HDF5 • Ease-of-use: declarative language instead of low-level programming language + HDF5 API • Abstraction: provides a virtual relational view • High Efficiency • Load data on demand (lazy loading) • Parallel query processing • Server-side aggregation

  10. Functionality • Query Based on Dimension Index Values (Type 1: index-based condition) • Also supported by the HDF5 API • Query Based on Dimension Scales (Type 2: coordinate-based condition) • Uses the coordinate system instead of the physical layout (array subscripts) • Query Based on Data Values (Type 3: content-based condition) • Simple datatype + compound datatype • Aggregate Query • SUM, COUNT, AVG, MIN, and MAX • Server-side aggregation to minimize data transfer
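A sketch of how the three query types could map onto HDF5 reads through h5py (not the system's actual implementation; all file and dataset names are hypothetical):

```python
import numpy as np
import h5py

with h5py.File("climate.h5", "r") as f:
    temp = f["/ocean/temperature"]            # dims: (time, rows, cols)

    # Type 1 (index-based): a direct hyperslab read by array subscripts
    slab = temp[0:10, :, :]

    # Type 2 (coordinate-based): translate dimension-scale values
    # (e.g., time > 2000.0) into index ranges, then read
    time_scale = f["/ocean/time"][...]
    idx = np.where(time_scale > 2000.0)[0]
    slab2 = temp[idx.min():idx.max() + 1, :, :]

    # Type 3 (content-based): load candidate regions, then filter by
    # value, as in SELECT ... WHERE temperature > 25.0
    hot = slab2[slab2 > 25.0]
```

Lazy loading falls out naturally: only the hyperslabs a query touches are read from disk, and an aggregate such as SUM can be computed server-side on the slab before anything is shipped to the client.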

  11. Execution Overview • [Figure: query execution overview, showing a 1D AND-logic condition list, a 2D OR-logic condition list, and a 1D OR-logic condition list sharing the same content-based condition]

  12. Experimental Setup • Experimental Datasets • 4 GB (sequential experiments) and 16 GB (parallel experiments) • 4D: time, cols, rows, and layers • Compared with Baseline Performance and OPeNDAP • Baseline performance: no query parsing • OPeNDAP: translates HDF5 into a specialized data format

  13. Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)

  14. Parallel Query Processing for Type 2 and Type 3 Queries

  15. Outline • Data Management Support • Supporting a Light-Weight Data Management Layer Over HDF5 • SAGA: Array Storage as a DB with Support for Structural Aggregations • Approximate Aggregations Using Novel Bitmap Indices • Data Processing Support • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats • Future Work

  16. Array Storage as a DB • A Paradigm Similar to NoDB • Still maintains DB functionality • But no data ingestion • DB and Array Storage as a DB: Friends or Foes? • When to use DB? • Load once, and query frequently • When to directly use array storage? • Query infrequently, so avoid loading • Our System • Focuses on a set of special array operations - Structural Aggregations

  17. Structural Aggregation Types Non-Overlapping Aggregation Overlapping Aggregation

  18. Grid Aggregation • Parallelization: Easy after Partitioning • Considerations • Data contiguity which affects the I/O performance • Communication cost • Load balancing for skewed data • Partitioning Strategies • Coarse-grained • Fine-grained • Hybrid • Auto-grained
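A toy sketch of how coarse-grained and fine-grained partitioning might assign grids to processes (the strategy names come from the slide; the assignment rules below are illustrative assumptions):

```python
import numpy as np

def coarse_grained(num_grids, nprocs):
    """Contiguous blocks of grids per process: contiguous I/O,
    but skewed grids can unbalance the computation."""
    bounds = np.linspace(0, num_grids, nprocs + 1, dtype=int)
    return [list(range(bounds[p], bounds[p + 1])) for p in range(nprocs)]

def fine_grained(num_grids, nprocs):
    """Round-robin assignment: better load balance for skewed
    data, at the price of non-contiguous reads."""
    return [list(range(p, num_grids, nprocs)) for p in range(nprocs)]

print(coarse_grained(8, 2))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(fine_grained(8, 2))     # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Hybrid partitioning would mix the two, and auto-grained partitioning sizes partitions using the cost model discussed next.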

  19. Partitioning Strategy Decider • Cost Model: analyze the loading cost and the computation cost separately • Loading cost: loading factor × data amount • Computation cost • Exception - Auto-Grained: takes the loading cost and the computation cost as a whole
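A minimal sketch of such a decider, assuming each strategy is characterized by a loading factor and an estimated computation cost (the constants below are made up for illustration):

```python
def total_cost(data_bytes, loading_factor, comp_cost):
    # Cost model from the slide: loading cost (loading factor x data
    # amount) and computation cost are estimated, then combined.
    return loading_factor * data_bytes + comp_cost

def pick_strategy(data_bytes, candidates):
    # candidates: {strategy name: (loading_factor, comp_cost)}
    return min(candidates,
               key=lambda s: total_cost(data_bytes, *candidates[s]))

print(pick_strategy(8 << 30, {
    "coarse-grained": (1.0, 5.0e9),  # contiguous reads, skew-sensitive
    "fine-grained":   (1.8, 3.0e9),  # scattered reads, balanced compute
}))
```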

  20. Overlapping Aggregation • I/O Cost • Reuse the data already in the memory • Reduce the disk I/O to enhance the I/O performance • Memory Accesses • Reuse the data already in the cache • Reduce cache misses to accelerate the computation • Aggregation Approaches • Naïve approach • Data-reuse approach • All-reuse approach

  21. Example: Hierarchical Aggregation • Aggregate 3 grids in a 6 × 6 array • The innermost 2 × 2 grid • The middle 4 × 4 grid • The outermost 6 × 6 grid • (Parallel) sliding aggregation is much more complicated

  22. Naïve Approach Load the innermost grid Aggregate the innermost grid Load the middle grid Aggregate the middle grid Load the outermost grid Aggregate the outermost grid For N grids: N loads + N aggregations

  23. Data-Reuse Approach Load the outermost grid Aggregate the outermost grid Aggregate the middle grid Aggregate the innermost grid For N grids: 1 load + N aggregations

  24. All-Reuse Approach Load the outermost grid • Once an element is accessed, accumulatively update all the aggregation results it contributes to • Outer ring: update only the outermost aggregation result • Middle ring: update both the outermost and the middle aggregation results • Center: update all 3 aggregation results For N grids: 1 load + 1 aggregation
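A minimal sketch of the all-reuse idea on the hierarchical example above: one load of the 6 × 6 array, one scan, and every element updates all of the nested grid sums it belongs to.

```python
import numpy as np

a = np.arange(36, dtype=float).reshape(6, 6)   # the loaded outermost grid
sums = [0.0, 0.0, 0.0]                         # innermost, middle, outermost

for (i, j), v in np.ndenumerate(a):            # single pass over the data
    sums[2] += v                               # every element: 6 x 6 grid
    if 1 <= i <= 4 and 1 <= j <= 4:
        sums[1] += v                           # middle 4 x 4 grid
    if 2 <= i <= 3 and 2 <= j <= 3:
        sums[0] += v                           # innermost 2 x 2 grid

print(sums)   # N grids aggregated with 1 load + 1 aggregation pass
```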

  25. Sequential Performance Comparison • Array slab/data size (8 GB) ratio: from 12.5% to 100% • Coarse-grained partitioning for the grid aggregation • All-reuse approach for the sliding aggregation • SciDB stores 'chunked' arrays: it can even support overlapping chunking to accelerate the sliding aggregation

  26. Parallel Sliding Aggregation Performance • # of nodes: from 1 to 16 • 8 GB data • Sliding grid size: from 3 × 3 to 6 × 6

  27. Outline • Data Management Support • Supporting a Light-Weight Data Management Layer Over HDF5 • SAGA: Array Storage as a DB with Support for Structural Aggregations • Approximate Aggregations Using Novel Bitmap Indices • Data Processing Support • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats • Future Work

  28. Approximate Aggregations Over Array Data • Challenges • Flexible Aggregation Over Any Subset • Dimension-based/value-based/combined predicates • Aggregation Accuracy • Spatial distribution/value distribution • Aggregation Without Data Reorganization • Reorganization is prohibitively expensive • Existing Techniques - All Problematic for Array Data • Sampling: unable to capture both distributions • Histograms: no spatial distribution • Wavelets: no value distribution • New Data Synopses – Bitmap Indices

  29. Bitmap Indexing and Pre-Aggregation • Bitmap Indices • Pre-Aggregation Statistics

  30. Approximate Aggregation Workflow

  31. Running Example SELECT SUM(Array) WHERE Value > 3 AND ID < 4; • Bitmap Indices • Pre-Aggregation Statistics • Predicate bitvector (ID < 4): 11110000 • i1′ = 01000000, Count1 = 1 • i2′ = 10010000, Count2 = 2 • Estimated Sum: 7 × 1/2 + 16 × 2/3 = 14.167 • Precise Sum: 14
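The estimate can be reproduced in a few lines: intersect each bin's bitvector with the predicate bitvector, then scale the bin's pre-aggregated sum by the fraction of its members that survive. The full bin bitvectors below are assumptions chosen to be consistent with the slide's i1′, i2′, and counts.

```python
import numpy as np

pred = np.array([1, 1, 1, 1, 0, 0, 0, 0], bool)      # ID < 4
# Bins whose value ranges satisfy "Value > 3"; each keeps a bitvector
# of member positions plus pre-aggregated (sum, count) statistics.
bins = [
    (np.array([0, 1, 0, 0, 1, 0, 0, 0], bool), 7.0, 2),   # bin 1
    (np.array([1, 0, 0, 1, 0, 0, 1, 0], bool), 16.0, 3),  # bin 2
]

est = 0.0
for bv, s, c in bins:
    hits = np.count_nonzero(bv & pred)   # i' = bin bitvector AND predicate
    est += s * hits / c                  # scale bin sum by surviving fraction
print(est)   # 7*1/2 + 16*2/3 = 14.167 (precise answer: 14)
```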

  32. A Novel Binning Strategy • Conventional Binning Strategies • Equi-width/equi-depth • Not designed for aggregation • V-Optimized Binning Strategy • Inspired by the V-Optimal Histogram • Goal: approximately minimize the Sum Squared Error (SSE) • Unbiased V-Optimized Binning: assumes all data is equally likely to be queried • Weighted V-Optimized Binning: assumes the frequently queried subareas are known a priori

  33. Unbiased V-Optimized Binning • 3 Steps: • Initial Binning: start with equi-depth binning • Iterative Refinement: iteratively adjust bin boundaries • Bitvector Generation: mark the spatial positions of each bin's members
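A sketch of the three steps, with a Lloyd-style boundary update standing in for the refinement rule (the exact refinement used in the thesis may differ):

```python
import numpy as np

def v_optimized_bins(values, k, iters=20):
    v = np.sort(np.asarray(values, float))
    # Step 1: initial equi-depth binning
    edges = np.quantile(v, np.linspace(0.0, 1.0, k + 1))
    for _ in range(iters):
        # assign values to bins, then compute per-bin means
        which = np.clip(np.searchsorted(edges, v, side="right") - 1,
                        0, k - 1)
        means = np.array([v[which == b].mean() if np.any(which == b)
                          else 0.5 * (edges[b] + edges[b + 1])
                          for b in range(k)])
        # Step 2: move interior boundaries to the midpoints of adjacent
        # bin means, which (approximately) reduces the SSE
        edges[1:-1] = 0.5 * (means[:-1] + means[1:])
    # Step 3 (omitted): emit one bitvector per bin marking the spatial
    # positions of that bin's member cells
    return edges
```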

  34. Weighted V-Optimized Binning • Difference: minimize WSSE instead of SSE • Similar binning algorithm • Major Modification • The representative value of each bin is no longer the plain mean
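For intuition, with w_i denoting how often cell i is queried, minimizing the weighted sum squared error per bin yields the weighted mean as the representative value:

```latex
\mathrm{WSSE} = \sum_{b}\sum_{i \in b} w_i\,(v_i - r_b)^2,
\qquad
\frac{\partial\,\mathrm{WSSE}}{\partial r_b} = 0
\;\Longrightarrow\;
r_b = \frac{\sum_{i \in b} w_i v_i}{\sum_{i \in b} w_i}
```

This reduces to the plain mean exactly when all weights are equal, which is why the unbiased variant can keep the mean.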

  35. Experimental Setup • Data Skew • Dense Range: less than 5% of the space but over 90% of the data • Sparse Range: over 95% of the space but less than 10% of the data • 5 Types of Queries • DB: with dimension-based predicates • VBD: with value-based predicates over the dense range • VBS: with value-based predicates over the sparse range • CD: with combined predicates over the dense range • CS: with combined predicates over the sparse range • Ratio of Querying Probabilities – 10 : 1 • 50% of the synthetic data is frequently queried • 25% of the real-world data is frequently queried

  36. SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset • [Figure: accuracy comparison of Equi-Width, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized binning]

  37. SUM Aggregation Accuracy of Different Methods on the Real-World Dataset • [Figure: accuracy comparison of Sampling_2%, Sampling_20% (Equi-Depth), MD-Histogram, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized]

  38. Outline • Data Management Support • Supporting a Light-Weight Data Management Layer Over HDF5 • SAGA: Array Storage as a DB with Support for Structural Aggregations • Approximate Aggregations Using Novel Bitmap Indices • Data Processing Support • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats • Future Work

  39. Scientific Data Analysis Today • “Store-First-Analyze-After” • Reload data into another file system • E.g., load data from PVFS to HDFS • Reload data into another data format • E.g., load NetCDF/HDF5 data to a specialized format • Problems • Long data migration/transformation time • Stresses network and disks

  40. System Overview • Key Feature • Scientific data processing module

  41. Scientific Data Processing Module

  42. Parallel Data Processing Times on 16 GB Datasets • K-Means • KNN

  43. Future Work Outline • Data Management Support • SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices • SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices • Data Processing Support • StreamingMATE: A Novel MapReduce-Like Framework Over Scientific Data Stream

  44. SciSD • Subgroup Discovery • Goal: identify all the subsets that are significantly different from the entire dataset/general population, w.r.t. a target variable • Can be widely used in scientific knowledge discovery • Novelty • Subsets can involve dimensional and/or value ranges • All numeric attributes • High efficiency by frequent bitmap-based approximate aggregations

  45. Running Example

  46. SciCSM • “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.” - Darby Conley, Get Fuzzy, 2001 • Contrast Set Mining • Goal: identify all the filters that can generate significantly different subsets • Common filters: time periods, spatial areas, etc. • Usage: classifier design, change detection, disaster prediction, etc.

  47. Running Example

  48. StreamingMATE • Extend the precursor system SciMATE to process scientific data stream • Generalized Reduction • Reduce data stream to a reduction object • No shuffling or sorting • Focus on the load balancing issues • Input data volume can be highly variable • Topology update: add/remove/update streaming operators
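A minimal sketch of what a generalized-reduction operator might look like, using streaming k-means as the example; the class name and method set are hypothetical, not the actual StreamingMATE API:

```python
import numpy as np

class KMeansReduction:
    """Reduction object: each incoming point is folded directly into
    per-cluster running sums -- no shuffling or sorting phase."""

    def __init__(self, centers):
        self.centers = np.asarray(centers, float)
        k, d = self.centers.shape
        self.sums = np.zeros((k, d))
        self.counts = np.zeros(k, dtype=int)

    def accumulate(self, point):        # local reduction on one element
        c = int(np.argmin(np.linalg.norm(self.centers - point, axis=1)))
        self.sums[c] += point
        self.counts[c] += 1

    def merge(self, other):             # combine two reduction objects
        self.sums += other.sums
        self.counts += other.counts

    def finalize(self):                 # produce the new centers
        nz = self.counts > 0
        self.centers[nz] = self.sums[nz] / self.counts[nz, None]
        return self.centers
```

Because the reduction object is small and mergeable, load balancing reduces to redistributing stream partitions among operators and merging their objects, which fits the topology updates mentioned above.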

  49. StreamingMATE Overview
