

  1. A Virtualization-based Data Management Framework for Big Data Applications Yu Su Advisor: Dr. Gagan Agrawal, The Ohio State University

  2. Motivation: Scientific Data Analysis • Parallel Ocean Program • 3-D Grid: 42 * 2400 * 3600 • > 30 attributes (TEMP, SALT …) • 1.4 GB per attribute • Simulation Speed: > 50 GB • Road-runner EC3 simulation • 4000^3 records • 7 attributes (X, Y, VX, … MASS) • 36 bytes per record • Simulation Speed: 2.3 TB Science is becoming increasingly data driven, creating strong requirements for efficient data analysis

  3. Motivation: Big Data • “Big Data” Challenge: • Fast data generation speed • Slow disk I/O and network speed • The gap will become bigger in the future • Different data formats • Observations: • Scientific analysis is often over data subsets • Community Climate System Model, Data Pipelines from Tomography, X-ray Photon Correlation Spectroscopy • Attribute subsets, spatial subsets, value subsets • Multi-resolution data analysis • Wide-area data transfer protocols

  4. An Example of Ocean Simulation [Figure: a remote data server holds the entire file POP.nc with attributes TEMP, SALT, UVEL, VVEL. Three user requests arrive over the network: “I want to analyze TEMP within the North Atlantic Ocean!” (answered with a data subset), “I want to see the average TEMP of the ocean!” (answered with an aggregation result), and “I want to quickly view the general global ocean TEMP” (answered with data samples). Combining flexible data management with a wide-area data transfer protocol makes each answer more efficient than shipping the entire data file.]

  5. Introduction • A server-side data virtualization method • Standard SQL queries over scientific datasets • Translate SQL into low-level data access code • Data formats: NetCDF, HDF5 • Data subsetting and aggregation • Multiple subsetting and aggregation types • Greatly decreases the data transfer volume • Data sampling • Efficient data analysis with a small accuracy loss • Combine with wide-area transfer protocols • Flexible data management + efficient data transfer • SDQuery_DSI in Globus GridFTP

  6. Thesis Work • Existing Work: • Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid2012) • Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP2012) • Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC2013) • SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC2013) • Future Work: • Correlation Data Analysis among Multiple Variables • Bitmap Indexing • Better Efficiency, More Flexibility • Correlation Data Mining over Scientific Data

  7. Outline • Current Work • Parallel Server-side Data Subsetting and Aggregation • Flexible Data Sampling and Efficient Error Calculation • Combine Data Management with Data Transfer Protocol • Proposed Work • Flexible Correlation Analysis over Multi-Variables • Correlation Mining over Scientific Dataset • Conclusion

  8. Contribution • Server-side subsetting and aggregation • Subsetting: dimensions, coordinates, values • Bitmap indexing: two-phase optimizations • Aggregation: SUM, AVG, COUNT, MAX, MIN • Keep data in its native format (e.g., NetCDF, HDF5) • SciDB, OPeNDAP: huge data loading or transformation cost • Parallel data processing • Data partition strategy • Multiple parallelism levels – files, attributes, blocks • Data visualization • SDQueryReader in Paraview • Visualize only subsets of data

  9. Background: Bitmap Indexing • Widely used in scientific data management • Suitable for floating-point values by binning small ranges • Run-Length Compression (WAH, BBC) • Compresses bitvectors based on runs of continuous 0s or 1s • Can be treated as a small profile of the data
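
To make the binning concrete, here is a minimal Python sketch (not the code from this work) of building an uncompressed bitmap index over a floating-point attribute; the bin edges and flattened array layout are illustrative assumptions:

    import numpy as np

    def build_bitmap_index(values, bin_edges):
        # One boolean bitvector per bin: bit i is 1 iff values[i] falls in that bin.
        # A real system would compress each bitvector with WAH or BBC run-length coding.
        return [(values >= lo) & (values < hi)
                for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

    temp = np.array([0.5, 3.2, 1.1, 2.7, 0.9])        # flattened attribute values
    index = build_bitmap_index(temp, bin_edges=[0, 1, 2, 3, 4])

Together, the bitvectors record which elements fall in each value range, so the set of bins doubles as a small profile of the value distribution.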

  10. Overview of Server-side Data Subsetting and Aggregation: Generate Query Request → Parse the SQL expression → Parse the metadata file → Index Generation → Index Retrieval → Generate data subset based on IDs → Perform data aggregation → Generate Unstructured Grid

  11. Bitmap Index Optimizations • Run-Length Compression (WAH, BBC) • Pros: compression rate, fast bitwise operations • Cons: the ability to locate dimension-based subsets is lost • Value Predicates vs. Dim Predicates • Two traditional methods: • Without bitmap indices: post-filter on values • With bitmap indices (FastBit): post-filter on dim info • Two-phase optimizations: • Index Generation: distributed indices over sub-blocks • Index Retrieval: • Transform dim subsetting conditions into bitvectors • Support bitwise operations between dim and value bitvectors

  12. Optimization 1: Distributed Index Generation • Index Generation: • Generate multiple small indices over sub-blocks of the data • Partition Strategy: • Study the relationship between queries and partitions • Partition the data based on query preferences • α rate: redundancy rate of data elements • Index Retrieval: • Filter the indices based on dim-based query conditions

  13. Partition Strategy • Queries involve both value and dim conditions • Bitmap Indexing + Dim Filter • Worst case: all elements have to be involved • Ideal case: the elements involved are exactly the dim subset • α rate: redundancy rate of data elements • Number of elements in the index / total data size (e.g., if the sub-block indices a query touches cover 2 GB of an 8 GB dataset, α = 0.25) • Partition Strategies: • User queries have preferences • Timestamp, Longitude, Latitude • Study the relationship between queries and partitions • Partition the data based on query preferences • α rate can be greatly decreased

  14. Optimization 2: Index Retrieval • Value-based Predicates: • Find satisfied bitvectors from index files on disk • Dim-based Predicates: • Dynamically generate dim bitvectors that satisfy the current predicates • Fast Bitwise Operations: • Logical AND operations are performed between dim and value bitvectors to generate the point ID set, avoiding any post-filtering
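
A hedged sketch of this retrieval path, assuming the uncompressed NumPy boolean bitvectors from the earlier sketch (real WAH/BBC bitvectors support the same AND on the compressed form); all names are illustrative:

    import numpy as np

    def dim_bitvector(shape, dim_slices):
        # Dynamically build a bitvector that is 1 exactly on the requested dim subset.
        mask = np.zeros(shape, dtype=bool)
        mask[dim_slices] = True
        return mask.ravel()                      # flatten to match the 1-D value bitvectors

    def query(value_bitvectors, satisfied_bins, shape, dim_slices):
        # OR the satisfied value bins, AND with the dim bitvector, return point IDs.
        value_bv = np.zeros(int(np.prod(shape)), dtype=bool)
        for b in satisfied_bins:
            value_bv |= value_bitvectors[b]
        return np.flatnonzero(value_bv & dim_bitvector(shape, dim_slices))

For example, a predicate like “TEMP in [0, 1) on the first 10 time steps” becomes the TEMP bins for that value range ANDed with dim_bitvector(shape, (slice(0, 10), slice(None), slice(None))).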

  15. Parallel Processing Framework • L1: data file • L2: attribute • L3: data block

  16. Experiment Setup • Goals: • Index-based Subsetting vs. Load + Filter in Paraview • Scalability of the Parallel Indexing Method • Parallel Indexing vs. FastQuery • Server-side Aggregation vs. Client-side Aggregation • Datasets: • POP (Parallel Ocean Program) • GCRM (Global Cloud Resolving Model) • Environment: • IBM Xeon Cluster: 8 cores, 2.53 GHz • 12 GB memory

  17. Efficiency Comparison with Filtering in Paraview • Data size: 5.6 GB • Input: 400 queries • Performance depends on the subset percentage • The general index method is better than filtering when the data subset is < 60% • The two-phase optimization achieved a 0.71 – 11.17 speedup compared with the traditional bitmap indexing method • Index m1: traditional bitmap indexing, no optimization • Index m2: uses bitwise operations instead of post-filtering • Index m3: uses both bitwise operations and index partitioning • Filter: load all data + filter

  18. Memory Comparison with Filtering in Paraview • Data size: 5.6 GB • Input: 400 queries • Memory cost depends on the subset percentage • The general index method has a much smaller memory cost than the filtering method • The two-phase optimization adds only a small extra memory cost • Index m1: bitmap indexing, no optimization • Index m2: uses bitwise operations instead of post-filtering • Index m3: uses both bitwise operations and index partitioning • Filter: load all data + filter

  19. Scalability with Different Proc# • Data size: 8.4 GB • Proc#: 6, 24, 48, 96 • Input: 100 queries • X axis: subset percentage • Y axis: time • Each process takes care of one sub-block • Good scalability as the number of processes increases

  20. Compare with FastQuery • FastQuery: • A parallel indexing method based on FastBit • Builds a relational table view over the dataset • Generates parallel indices based on a partition of the table • Pros: a standard, table-based way to process data • Cons: the multi-dimensional structure is lost • Only supports row-based partitioning • Basic reading unit: continuous rows (1-dim segments) • Our method: • Flexible partition strategy • Partition the multi-dim data based on users’ query preferences • Fewer read operations • Basic reading unit: multi-dim blocks

  21. Execution Time Comparison with FastQuery • Data size: 8.4 GB, 48 processes • Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall • Input: 100 queries for each query type • Achieved a 1.41 to 2.12 speedup compared with FastQuery

  22. Parallel Data Aggregation Efficiency • Data size: 16 GB • Process number: 1 – 16 • Input: 60 aggregation queries • Query types: • Only Agg • Agg + Group By + Having • Agg + Group By • Much smaller data transfer volume • Relative speedup: • 4 procs: 2.61 – 3.08 • 8 procs: 4.31 – 5.52 • 16 procs: 6.65 – 9.54

  23. Outline • Current Work • Parallel Server-side Data Subsetting and Aggregation • Flexible Data Sampling and Efficient Error Calculation • Combine Data Management with Data Transfer Protocol • Proposed Work • Flexible Correlation Analysis over Multi-Variables • Correlation Mining over Scientific Dataset • Conclusion

  24. Contributions • Statistical Sampling Techniques: • A subset of individuals to represent the whole population • Information Loss and Error Metrics: • Mean, Variance, Histogram, Q-Q Plot • Challenges: • Sampling accuracy considering data features • Error calculation with high overhead • Support data sampling over bitmap indices • Data samples have better accuracy • Support error prediction before sampling the data • Support data sampling over flexible data subsets • No data reorganization is needed

  25. Data Sampling over Bitmap Indices • Features of Bitmap Indexing: • Each bin (bitvector) corresponds to one value range • Different bins reflect the entire value distribution • Each bin keeps the data’s spatial locality • Contains all space IDs (0-bits and 1-bits) • Row Major, Column Major • Hilbert Curve, Z-Order Curve • Method: • Perform stratified random sampling over each bin • Multi-level indices generate multi-level samples

  26. Stratified Random Sampling over Bins S1: Index Generation S2: Divide each bitvector into equal strides S3: Randomly select a certain % of 1s from each stride
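
A rough Python sketch of steps S2 and S3 for one bin, under the assumption of an uncompressed boolean bitvector; the stride length and RNG choice are illustrative:

    import numpy as np

    def sample_bin(bitvector, stride, pct, rng=None):
        # S2: divide the bitvector into equal strides.
        # S3: randomly keep pct of the 1s in each stride (stratified random sampling).
        rng = rng or np.random.default_rng()
        sampled_ids = []
        for start in range(0, len(bitvector), stride):
            ones = np.flatnonzero(bitvector[start:start + stride]) + start  # 1-bit positions
            if len(ones):
                k = max(1, int(len(ones) * pct))
                sampled_ids.extend(rng.choice(ones, size=k, replace=False))
        return np.array(sampled_ids, dtype=np.int64)

Because every bin is sampled at the same rate, the sample preserves both the value distribution (across bins) and the spatial locality (across strides).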

  27. Error Prediction vs. Error Calculation • Error Calculation workflow: sampling request → draw sample → calculate error metrics → feedback; if the sample is not good, submit a new sampling request and repeat (multi-time sampling, multi-time error calculation) • Error Prediction workflow: predict request → predict error metrics (repeated as needed) → decide the sampling plan → draw the sample once

  28. Error Prediction • Pre-estimate the error metrics before sampling • Calculate error metrics based on bins • Bitmap indices classify the data into bins • Each bin corresponds to one value or value range • Find some representative values for each bin: Vi • Enforce an equal sampling percentage for each bin • Extra metadata: number of 1-bits of each bin: Ci • Compute the number of samples of each bin: Si • Pre-calculate error metrics based on Vi and Si • Representative Values: • Small bin: mean value • Big bin: lower bound, upper bound, mean value
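
As an illustration of why prediction is cheap, a hedged sketch that predicts the sampled mean from bin metadata alone (Vi and Ci follow the slide; the function and the equal-percentage rounding rule are assumptions):

    def predict_mean(V, C, pct):
        # Si: number of samples each bin would contribute under an equal sampling percentage.
        S = [max(1, int(c * pct)) for c in C]
        # The predicted mean uses each bin's representative value -- the raw data is never read.
        return sum(v * s for v, s in zip(V, S)) / sum(S)

    # Three bins with representative values 0.5/1.5/2.5 and 1000/3000/500 elements, 0.1% sampling
    print(predict_mean([0.5, 1.5, 2.5], [1000, 3000, 500], 0.001))

Histogram and Q-Q plot predictions work the same way, substituting the per-bin (Vi, Si) pairs into the metric instead of actual samples.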

  29. Data Subsetting + Data Sampling S1: Find the value subset, e.g., Value = [2, 3), RID = (9, 25) S2: Find the spatial ID subset S3: Perform stratified sampling on the subset (see the sketch below)
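
Reusing sample_bin from the sketch under slide 26, the combination of S1–S3 might look like this; the spatial mask and per-bin layout are assumptions:

    import numpy as np

    def sample_subset(bin_bitvectors, spatial_mask, stride, pct):
        # Restrict each value bin to the queried spatial IDs, then stratified-sample it.
        return [sample_bin(bv & spatial_mask, stride, pct) for bv in bin_bitvectors]

No data reorganization is needed: the subset is expressed as one extra bitwise AND per bin before sampling.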

  30. Experiment Results • Goals: • Accuracy among different sampling methods • Compare predicted error with actual error • Efficiency among different sampling methods • Speedup from combining data sampling with subsetting • Datasets: • Ocean Data – multi-dimensional arrays • Cosmos Data – separate points with 7 attributes • Environment: • Darwin Cluster: 120 nodes, 48 cores, 64 GB memory

  31. Sample Accuracy Comparison • Sampling Methods: • Simple Random Method • Stratified Random Method • KDTree Stratified Random Method • Big Bin Index Random Method • Small Bin Index Random Method • Error Metrics: • Means over 200 separate sectors • Histogram using 200 value intervals • Q-Q Plot with 200 quantiles • Sampling Percentage: 0.1%

  32. Sample Accuracy Comparison • Mean • Q-Q Plot • Histogram Traditional sampling methods cannot achieve good accuracy; the Small Bin method achieves the best accuracy in most cases; the Big Bin method achieves accuracy comparable to the KDTree sampling method.

  33. Predicted Error vs. Actual Error • Means, Histogram, Q-Q Plot for Small Bin Method • Means, Histogram, Q-Q Plot for Big Bin Method

  34. Efficiency Comparison [Panels: Sample Generation Time; Error Calculation Time] Index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower). Error calculation time based on bins is much smaller than that based on the data (> 28 times faster).

  35. Total Time based on Resampling Times [Panel: Total Sampling Time] • Index-based Sampling: • Multi-time error calculations • One-time sampling • Other Sampling Methods: • Multi-time sampling • Multi-time error calculations • X axis: resampling times • Speedup of Small Bin: • 0.91 – 20.12

  36. Speedup of Sampling over Subset [Panels: Subset over Spatial IDs; Subset over Values] X axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%) Y axis: index loading time + sample generation time 25% sampling percentage Speedup: 1.47 – 4.98 for spatial subsetting, 2.25 – 21.54 for value subsetting

  37. Outline • Current Work • Parallel Server-side Data Subsetting and Aggregation • Flexible Data Sampling and Efficient Error Calculation • Combine Data Management with Data Transfer Protocol • Proposed Work • Flexible Correlation Analysis over Multi-Variables • Correlation Mining over Scientific Dataset • Conclusion

  38. Background: Wide-Area Data Transfer Protocols • Efficient data transfers over wide-area networks • Globus GridFTP: • Striped, streaming, parallel data transfer • Reliable and restartable data transfer • Limitation: transfer volume • The basic data transfer unit is a file (GB or TB level) • Strong requirements for transferring data subsets • Goal: integrate core data management functionality with wide-area data transfer protocols

  39. Contribution • Challenges: • How should the method be designed to allow easy use and integration with existing GridFTP installations? • How can users view a remote file and specify the subsets of data? • How can efficient data retrieval be supported under different subsetting scenarios? • How can data retrieval be parallelized to benefit from multi-streaming? • GridFTP SDQuery DSI • Efficient data transfer over flexible file subsets • Dynamic loading / unloading with small overhead • Performance-model-based hybrid data reading • Parallel streaming data reading and transferring

  40. Outline • Current Work • Parallel Server-side Data Subsetting and Aggregation • Flexible Data Sampling and Efficient Error Calculation • Combine Data Management with Data Transfer Protocol • Proposed Work • Flexible Correlation Analysis over Multi-Variables • Correlation Mining over Scientific Dataset • Conclusion

  41. Motivation: Correlation Analysis • Correlation Analysis among Attributes (Variables) • Study relationships among variables • Make scientific discoveries • Two Scenarios: • Basic scientific rule verification and discovery • Feature mining – halo finding, eddy finding • Challenge: • Correlation analysis is useful but extremely time consuming and resource costly • No existing method supports flexible correlation analysis over data subsets

  42. Correlation Metrics • Multi-Dimensional Histogram: • Value distributions of variables • Entropy • A metric showing the variability of the dataset • Low => constant, predictable data • High => random data • Mutual Information • A metric for computing the dependence between two variables • Low => the two variables are independent • High => one variable provides information about the other • Pearson Correlation Coefficient • A metric quantifying the linear correspondence between two variables • Value range: [-1, 1] • < 0: inversely proportional; > 0: proportional; = 0: independent
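
A compact, hedged Python version of the three metrics: entropy and mutual information are computed from (joint) histogram counts, matching the bin-based view used throughout, while Pearson is shown on raw samples. Function names are illustrative:

    import numpy as np

    def entropy(counts):
        p = counts / counts.sum()                # histogram counts -> probabilities
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def mutual_information(joint):
        # joint[i, j]: count of elements in bin i of variable A and bin j of variable B.
        # MI = H(A) + H(B) - H(A, B), computed from the marginals and the joint histogram.
        return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

    def pearson(a, b):
        return np.corrcoef(a, b)[0, 1]           # linear correlation in [-1, 1]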

  43. Our Solution and Contribution • A framework which supports both individual and correlation data analysis based on bitmap indexing • Individual analysis: flexible data subsetting • Correlation analysis: • Interactive queries among multiple variables • Correlation metrics calculated based on indices • Support correlation analysis over data subsets • Support correlation analysis over bitmap indices • Better efficiency, smaller memory cost • Support both static indexing and dynamic indexing • Support correlation analysis over data samples

  44. Use Cases of Correlation Analysis Please enter variable names on which you want to perform correlation queries: TEMP SALT UVEL Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50; Entropy: TEMP(2.19), SALT(1.90), UVEL(1.48) Mutual Information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017 Pearson Correlation: … Histogram: (SALT), (UVEL) Please enter your SQL query: SELECT SALT FROM POP WHERE SALT<0.0346; Entropy: TEMP(2.29), SALT(2.99), UVEL(2.68) Mutual Information: TEMP->UVEL: 0.02, SALT->UVEL: 0.19 Pearson Correlation: … Histogram: (UVEL) Please enter your SQL query: UNDO Entropy: TEMP(2.19), SALT(1.90), UVEL(1.48) Mutual Information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017 Pearson Correlation: … Histogram: (SALT), (UVEL) Please enter your query:

  45. Dynamic Indexing • No Indexing Support: • Load all data for A and B • Filter A and B to generate the subset • Combined Bins: generate (A1, B1)->count1, …, (Am, Bm)->countm by scanning each data element within the data subset • Calculate correlation information based on the combined bins • Dynamic Indexing (one index per variable): • Query bitvectors for A and B (no data loading cost, zero or very small filtering cost) • Combined Bins: generate (A1, B1)->count1, …, (Am, Bm)->countm via bitwise operations between A and B (much faster because the number of bitvectors is much smaller than the number of elements) • Calculate correlation information based on the combined bins (see the sketch below)
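
A minimal sketch of the dynamic-indexing path, again assuming uncompressed boolean bitvectors per variable; the resulting joint counts feed directly into the mutual_information sketch under slide 42:

    import numpy as np

    def combined_bins(bv_a, bv_b, subset_mask=None):
        # Joint (A-bin, B-bin) counts from bitwise ANDs -- no raw data is loaded.
        joint = np.zeros((len(bv_a), len(bv_b)), dtype=np.int64)
        for i, a in enumerate(bv_a):
            if subset_mask is not None:
                a = a & subset_mask              # restrict to the queried data subset
            for j, b in enumerate(bv_b):
                joint[i, j] = np.count_nonzero(a & b)
        return joint

The loop runs over bin pairs (m^2 ANDs for m bins per variable), which is far cheaper than touching every data element when m is small.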

  46. Static Indexing Dynamic Indexing: one index for each variable; bitwise operations are still needed to generate the combined bins. Static Indexing: generate one big index file over multiple variables; only bitvector filtering or combining is needed (extremely small cost).

  47. Outline • Current Work • Parallel Server-side Data Subsetting and Aggregation • Flexible Data Sampling and Efficient Error Calculation • Combine Data Management with Data Transfer Protocol • Proposed Work • Flexible Correlation Analysis over Multi-Variables • Correlation Mining over Scientific Dataset • Conclusion

  48. Correlation Mining • Challenges of Correlation Queries • Users do not know which subsets contain important correlations • They keep submitting queries to explore correlations • Correlation Mining: • Automatically find important correlations • Suggest correlations to users • A bottom-up method: • Generate correlations over basic spatial and value units • Use bitmap indexing to speed up this process • Use association rule mining to find and combine similar correlations

  49. Generate Scientific Association Rules Association Rule Example: t_lon(10.1−15.1), t_lat(25.2−30.2), depth_t(1−10), TEMP(0−1), SALT(0.01−0.02) → Mutual Information(0.23, High)

  50. Feature Mining • Feature Mining based on Correlation Analysis • Sub-halo: correlation between space and velocity • Eddy: correlation between speeds in different directions • OW criterion to find eddies • OW > 0: not an eddy; OW <= 0: might be an eddy • One detection method: • Build v based on row-major order (x, y) • Build u based on column-major order (y, x) • An eddy cannot exist across a long sequence of 1-bits
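
Assuming the slide’s OW test refers to the Okubo-Weiss parameter (an assumption; the slide does not spell it out), a hedged finite-difference sketch on a uniform 2-D velocity grid:

    import numpy as np

    def okubo_weiss(u, v):
        # OW = s_n^2 + s_s^2 - w^2; OW <= 0 marks vorticity-dominated cells (possible eddies).
        du_dy, du_dx = np.gradient(u)            # axis 0 taken as y, axis 1 as x
        dv_dy, dv_dx = np.gradient(v)
        s_n = du_dx - dv_dy                      # normal strain
        s_s = dv_dx + du_dy                      # shear strain
        w = dv_dx - du_dy                        # relative vorticity
        return s_n**2 + s_s**2 - w**2

Cells where okubo_weiss(u, v) <= 0 would become the 1-bits that the bitmap-based detection above reasons over.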
