This talk outlines the use of parallel database systems for big data analytics, covering topics such as DSS queries, basic statistics models, and processing alternatives both inside and outside a DBMS. It discusses benchmarking and optimizations for parallel processing, as well as the challenges and benefits of using SQL as the query language for big data analytics. The talk also highlights the importance of data preparation and the integration of machine learning and text data analysis in the analytics process.
Talk Outline
1. Big Data Analytics: Cubes (DSS queries, basic statistics); Models (machine learning & statistics); Graphs
2. Processing alternatives: Inside a DBMS (SQL queries, UDFs); Outside a DBMS (HDFS, MapReduce, Spark, C++)
3. Benchmarking & optimizations: Benchmarking; Algorithmic (parallel, general, across systems); Specific (system-dependent)
Acknowledgments
Mike Stonebraker, MIT (DBMS-centric)
Divesh Srivastava, AT&T (theory)
Big Data
Four Vs: the three classic Vs (volume, velocity, variety) + veracity
Finer granularity than transactions
No schema required
In general, big data cannot be analyzed directly: pre-processing is needed
Diverse data sources, non-relational, beyond alphanumeric tables
logs: user interactions, social networks, streams, sensors
Big Data Systems
NoSQL: no SQL, really?
Some ACID guarantees, but not OLTP
Web-scale data is not universal: many practical problems are smaller
Data(base) integration and cleaning: harder than in the classic DB setting
SQL remains the de facto query language
Big Data Analytics: DWH + Machine Learning + Text
Data Preparation
• Copy, load
• Data exploration, data quality
• Prep: data integration, cleaning, preprocessing
Analytics
• Cubes
• Machine learning, statistics models
• Graphs
A highly iterative process
Deployment
• Scoring
• Tuning
• Lifecycle maintenance
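To make the data preparation step concrete, here is a minimal sketch of the kind of SQL that builds an analytic data set; the customer and txn tables and all column names are hypothetical.

```sql
-- Hypothetical schema: customer(cid, city), txn(cid, amt, txn_date).
-- Data preparation in one query: join + aggregation producing one
-- row per customer, ready for cubes or model computation.
CREATE TABLE dataset AS
SELECT c.cid,
       c.city,
       COUNT(*)        AS num_txn,   -- derived statistical variables
       SUM(t.amt)      AS total_amt,
       AVG(t.amt)      AS avg_amt,
       MAX(t.txn_date) AS last_txn
FROM customer c
JOIN txn t ON t.cid = c.cid
GROUP BY c.cid, c.city;
```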
Ignored and misunderstood aspects
Preparing data takes a lot of time: SQL helps, but it requires expertise; alternatives are statistical programming in R or SAS, spreadsheet macros, or custom applications in C++/Java
A lot of data, especially tabular data, originates in a DBMS
Emphasis on large n, but many problems involve many "small" files
The analytic data set is generally much smaller than the raw data
Model computation is the most researched topic; deployment is rarely considered
Analytics
Simple: cubes (BI, OLAP, MOLAP); they capture descriptive statistics (histograms, means, quantiles, plots, statistical tests)
Complex:
Mathematical models (ML, statistics, optimization)
Graphs (paths, connectivity, cliques)
Cubes
• Multidimensional cube
• Star/snowflake schema
• Queries: aggregations + joins
• Query optimization: cost-based optimization, heuristics, avoiding the worst plans, relational algebra
• Requirement: ER and physical model
• Best option: still a DBMS
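As a sketch of the kind of DSS query a cube answers, assuming a hypothetical fact table sales(store, product, qty) in a star schema, the CUBE operator (in dialects that support GROUP BY CUBE) aggregates over every combination of the grouping columns:

```sql
-- Aggregates at all four grouping levels: (store, product),
-- (store), (product), and the grand total.
SELECT store, product, SUM(qty) AS total_qty
FROM sales
GROUP BY CUBE (store, product);
```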
Machine Learning Models
Large n or d: cannot fit in RAM; minimize I/O
Multidimensional d: 10s-1000s of dimensions; calls for feature selection and dimensionality reduction
Represented & computed with matrices & vectors
data set: mixed attributes
model: numeric = matrices, discrete = histograms
intermediate computations: matrices, histograms, equations
Machine Learning Algorithms
Behavior with respect to data set X:
Few: one pass or a fixed # of passes (SVD, regression, NB)
Most: iterative until convergence (k-means, C5.0, SVM)
Time complexity: may increase when data spills to disk
Research issues:
Parallel processing; main memory coming back
Incremental and online learning
DBMS: vanilla models on tabular data
Graphs
• G=(V,E), n=|V|
• Large n, but G is sparse
• Typical problems (polynomial, combinatorial, NP-complete):
• transitive closure, reachability
• shortest/critical path
• neighborhood analysis
• clique detection
Graph algorithms
• Combinatorial versus matrix-based
• Complexity:
• P
• NP-complete, NP-hard problems
• EXP
• DBMS: indeed good for many graph problems in P (see the sketch below), but certainly not for harder problems
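A sketch of why a DBMS handles many problems in P: reachability over a hypothetical edge table E(i, j) is a recursive join, shown here with a recursive CTE; the source vertex 1 is arbitrary.

```sql
-- Vertices reachable from vertex 1; UNION removes duplicates,
-- which also guarantees termination on cyclic graphs.
WITH RECURSIVE reach(v) AS (
  SELECT 1
  UNION
  SELECT E.j
  FROM E JOIN reach R ON E.i = R.v
)
SELECT v FROM reach;
```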
Why analytics inside the DBMS?
• Huge data volumes: potentially better results with larger amounts of data; less processing time
• Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security
• Caveats: SQL is not as powerful as C++; limited mathematical functionality; complex DBMS architecture for extending the source code
2. Processing
2.1 Parallel processing
2.2 Inside DBMS: SQL, UDFs, system C++
2.3 Outside DBMS: MapReduce, Spark, C++/Java
2.4 Contrasting technologies
Parallel processing issues
• Contention for shared resources
• Load balancing: skew in data values
• Startup and availability overhead
• Data shuffling: redistribution, transfer, read/write to disk
• Scale-up: better for DBMSs
• Scale-out: harder for a parallel DBMS than for Hadoop
2.2 Inside DBMS
• Assumptions:
• data records are in the DBMS; exporting them is slow
• data set or graph stored as a table
• Programming alternatives:
• SQL queries and UDFs: SQL code generation (JDBC), precompiled UDFs; also: stored procedures, embedded SQL, cursors
• internal C code (direct access to file system and RAM): hard, non-portable
• DBMS advantages:
• important: storage, queries, security
• secondary: recovery, concurrency control, integrity, OLTP
Programming with queries
• Pros:
• isolation from DBMS internals
• declarative, elegant, abstract
• Cons:
• difficult to program (numerical methods)
• no direct binding with arrays/matrices in SQL (see the sketch below)
• slow processing when there are joins (even hash joins)
• internal data conversion is generally necessary
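The "no direct binding with arrays/matrices" point can be made concrete: with no matrix type, a sparse matrix is stored as triples and linear algebra becomes joins plus aggregations. A sketch, with hypothetical tables A(i, j, v) for the matrix and x(j, v) for the vector:

```sql
-- Matrix-vector product A*x as a join + aggregation: declarative
-- and parallelizable, but far from array notation.
SELECT A.i, SUM(A.v * x.v) AS v
FROM A JOIN x ON A.j = x.j
GROUP BY A.i;
```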
Inside DBMS: Accelerate processing with UDFs
• Classification of User-Defined Functions (UDFs):
• Scalar UDF
• Aggregate UDF
• Table UDF
• Programming:
• called in a SELECT statement
• C code or a similar language
• API provided by the DBMS, generally C/C++
• data type mapping
Inside DBMS: UDF pros and cons
• Advantages:
• arrays and flow control
• flexibility in code writing, and no side effects
• no need to modify DBMS internal code
• in general, simple data types
• Limitations:
• OS- and DBMS-architecture dependent, not portable
• no I/O capability, no side effects
• null handling and fixed memory allocation
• memory leaks with arrays (matrices): fenced/protected mode
• debugging: harder than in Java, Scala, R
Inside DBMS: Aggregate UDF (scalar UDF skipped) [JM1998,SIGMOD]
• Table scan; memory allocation of variables in the heap
• GROUP BY extends their computing expressibility, but requires handling nulls
• Advantage: parallel & multithreaded processing
• Drawback: returns a single value, not a table
• Practically all DBMSs: Vertica, SQL Server, PostgreSQL, SciDB, Teradata, Oracle, DB2, ...
• Useful for mathematical model computation (see the sketch below)
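A minimal sketch of an aggregate UDF, in PostgreSQL syntax (all names are hypothetical): a sum-of-squares aggregate, a building block for variance and correlation.

```sql
-- State transition function: adds x^2 to the running state.
CREATE FUNCTION sumsq_step(s double precision, x double precision)
RETURNS double precision
AS $$ SELECT s + x * x $$
LANGUAGE SQL STRICT;          -- STRICT: null inputs are skipped

CREATE AGGREGATE sumsq(double precision) (
  sfunc    = sumsq_step,
  stype    = double precision,
  initcond = '0'
);

-- Called like any built-in aggregate, with GROUP BY
-- (txn is the hypothetical table from the earlier sketch):
SELECT cid, sumsq(amt) FROM txn GROUP BY cid;
```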
Inside DBMS: Internal C code (if source code is available); not popular [LTWZ2005,SIGMOD] [MYC2005,VLDB] [SD2001,CIKM]
• Advantages:
• access to the file system (table record blocks), leverage physical operators (scan, join, sort, search)
• main memory, data structures, libraries
• hardware: multithreading, multicore CPU, RAM, L1/L2 caching
• math libraries: LAPACK
• Disadvantages:
• requires careful integration with subsystems
• not available to end users and practitioners
• may require exposing calls through a data mining language or SQL
Inside DBMS: Query/UDF evaluation: physical operators [DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]
• Serial DBMS (1 CPU, RAID/shared disk):
• table scan
• join: hash/sort-merge, indexed nested loop
• external merge sort
• Parallel DBMS (shared-nothing):
• even row distribution, hashing
• parallel table scan
• parallel joins: large/large (sort-merge, hash); large/short (replicate the short table)
• parallel/distributed sort: aggregation
Outside DBMS
• Plain files: C++, Java, R, SAS, ...
• shared-nothing (DBMS, Hadoop)
• File systems: POSIX | HDFS
• Platforms: DBMS, MapReduce, Spark, graph engines
• Shared memory (HPC, 1000s of cores)
• MPI
• OpenMP
Outside DBMS: plain files, sequential processing. Packages, libraries, Java/C++ [ZHY2009,CIDR] [ZZY2010,ICDE]
• Statistical and DM packages (e.g. R, SAS):
• exported flat files; proprietary file formats
• memory-based (processing data records, models, internal data structures)
• Programming languages (arrays, control statements, libraries)
• Limitation: large numbers of records
• Packages: R, SAS, SPSS, Matlab, WEKA
Outside DBMS: MapReduce, Spark [DG2008,CACM]
• Parallel processing; simple; shared-nothing
• Commodity, diverse hardware (big clusters)
• Functions are programmed in a high-level programming language (e.g. Java, Python, Scala); flexible
• <key,value> pairs processed in two phases:
• map(): computation is distributed and evaluated in parallel; independent mappers
• reduce(): partial results are combined/summarized
• Can be inside/outside the DBMS, depending on the level of integration with the DBMS
Outside DBMS: MapReduce files and processing
• File types:
• text files: portable storage (e.g. CSV files)
• SequenceFiles: efficient processing as blocks
• Processing:
• records are sorted by key before being sent to reducers
• small files are merged to get larger blocks
• Con: intermediate results are stored in the file system (I/O)
Contrasting: DBMS vs MapReduce/Spark

| DBMS | MapReduce/Spark |
|---|---|
| Schema required | No schema |
| Text processing difficult | Text processing easy |
| Simple parsing of files | Customized parsers needed |
| Built-in indexing | Manually added indexes |
| Programming: declarative | Programming: imperative |
| More rigid | Much more flexible |
| Shorter programs with SQL queries and aggregate UDFs | Longer programs in Java/MR; Scala/Spark shorter with map()/reduce() (Spark in RAM) |
| Automatic optimization: data distribution for parallel processing | Manual optimization: the programmer optimizes access |
| Fault tolerance: in transactions, not queries | Fault tolerance in long jobs, per node |
3. Benchmarking & optimizations
• Benchmarking: DBMS vs Spark
• Data set layout
• Optimizations:
• algorithmic
• DBMS-specific: SQL queries, UDFs
• MapReduce: optimize map() and reduce()
• Spark: optimized MapReduce, higher abstraction than Java
Storage: the plan is similar, but the physical operators and their cost do change
• Classical:
• row
• New:
• column (same relational algebra, different cost)
• array (non-relational, linear algebra, different cost)
Optimization: Data Set Storage Layout: Horizontal/Vertical (see the sketch below)
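A sketch of the two layouts for a data set X with n points and d dimensions (table and column names hypothetical):

```sql
-- Horizontal: one row per point, one column per dimension;
-- natural for dense data and scalar expressions.
CREATE TABLE XH (i INT, x1 FLOAT, x2 FLOAT /* ... xd */);

-- Vertical: one row per (point, dimension) entry; suits sparse,
-- high-d data, at the price of joins/aggregations to rebuild rows.
CREATE TABLE XV (i INT, j INT, v FLOAT);
```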
Optimizations: Algorithmic & Systems
Algorithmic
• 90% of research; many efficient algorithms
• accelerate/reduce computations or convergence
• database systems focus: reduce I/O
• approximate solutions
• parallel
Systems (SQL, MapReduce/Spark)
• Hardware: clusters of computers vs multicore CPUs
• Programming: SQL/C++ versus Java/Scala
Algorithmic: models [ZRL1996,SIGMOD]
• Programming: data set available as a flat file; a binary file is required for random access
• May require data structures working in main memory and on disk
• Programming not in SQL: C/C++ are the preferred languages
• Assumption d << n: large n has received more attention
• Issue: d > n produces numerical issues and a large covariance/correlation matrix (larger than X)
Algorithmic Optimizations: models [STA1998,SIGMOD] [ZRL1996,SIGMOD] [O2007,SIGMOD]
Exact model computation:
• data summaries: sufficient statistics (Gaussian pdf), histograms, discretization (see the sketch below)
• accelerate convergence, reduce iterations
• faster matrix operations (*, +), parallel
Approximate model computation:
• sampling: efficient in time, O(s)
• incremental: escape local optima, reseed, gradient descent
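As a sketch of the data-summaries idea: for a Gaussian model the sufficient statistics are n, L = Σ x_i and Q = Σ x_i x_i^T, all computable in one table scan. Assuming the hypothetical horizontal layout XH(i, x1, x2) from above, with d = 2:

```sql
-- One pass over X: the mean is L/n and the covariance matrix
-- follows from n, L, Q, so iterations never rescan the data.
SELECT COUNT(*)     AS n,
       SUM(x1)      AS l1,  SUM(x2)      AS l2,
       SUM(x1 * x1) AS q11, SUM(x1 * x2) AS q12,
       SUM(x2 * x2) AS q22
FROM XH;
```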
DBMS Query Optimizations
• Split queries; the query optimizer falls short
• Join: denormalized storage for the model and intermediate tables; favor hash joins over sort-merge for the data set; secondary indexing for joins (sort-merge join); outer joins
• Aggregation (creating statistical variables): push GROUP BY before the join (see the sketch below); watch out for nulls and high-cardinality columns
• Scans: synchronized scans (share I/O); sampling with O(1) access (truly random is difficult; must handle approximation error)
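A sketch of the push-group-by-before-join rewrite, on the hypothetical customer/txn schema from earlier: aggregating the large table first shrinks the join input. The rewrite is safe here because cid is the join key, but nulls and high-cardinality grouping columns can break the equivalence.

```sql
-- Instead of joining first and aggregating later, aggregate txn
-- first, then join the much smaller per-customer result.
SELECT c.city, SUM(s.total_amt) AS city_amt
FROM customer c
JOIN (SELECT cid, SUM(amt) AS total_amt
      FROM txn
      GROUP BY cid) s ON s.cid = c.cid
GROUP BY c.city;
```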
DBMS SQL Optimizations
SQL queries:
• mathematical equations as queries: joins + aggregations (see the sketch below)
• Turing-complete: SQL code generation + an imperative programming language
UDFs as a query optimization:
• substitute difficult/slow math computations: summaries, gradient descent
• eliminate joins
• RAM: data structures, push processing, but no I/O
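A sketch of "mathematical equations as queries": one k-means assignment step for d = 2, computing squared Euclidean distances from every point in XH to every centroid in a hypothetical table C(k, c1, c2) and keeping the nearest.

```sql
-- Distance as a SQL expression; nearest centroid per point via a
-- window function: one assignment step of k-means.
SELECT i, k
FROM (
  SELECT XH.i, C.k,
         ROW_NUMBER() OVER (
           PARTITION BY XH.i
           ORDER BY (x1 - c1) * (x1 - c1)
                  + (x2 - c2) * (x2 - c2)
         ) AS rnk
  FROM XH CROSS JOIN C
) d
WHERE rnk = 1;
```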
DBMS programming optimizations [HLS2005,TODS] [O2007,TKDE]
• UDFs can substitute for slow SQL query code and complex matrix computations
• Scalar UDFs: vector operations, math functions
• Aggregate UDFs (aka UDAs): compute data set summaries in parallel, update state in RAM, but blocking behavior
• Table UDFs: streaming model; external temporary file; get close to array functionality => a patch solution
Skip: MapReduce Optimizations [ABASR2009,VLDB] [CDDHW2009,VLDB] [SADMPPR2010,CACM]
• Data set keys as input; partition the data set
• Text versus sequential files; loading into the file system may be required
• Parallel processing:
• high-cardinality keys: handle skewed distributions
• reduce row redistribution in map()
• Main memory processing
Skip: MapReduce Common Optimizations [DG2008,CACM] [FPC2009,PVLDB] [PHBB2009,PVLDB]
• Modify block size; disable block replication
• Delay reduce(); chain map()
• Tune M and R (memory allocation and number)
• Several mappers feed the same reducer: M > R
• Avoid full table scans by using subfiles (requires a file-naming convention)
• combine() in map() to shrink intermediate files
• Manage intermediate files as SequenceFiles for efficiency
• SequenceFiles as input, with custom data types
Skip: MapReduce Issues
• Loading files is fast, but converting to binary may be necessary
• Not every analytic task is efficiently computed with MapReduce
• Input key: generally OK if of high cardinality
• Skewed map key distribution: a bottleneck
• Key redistribution: a lot of message passing and I/O
Spark
• Based on MapReduce, but faster and more expressive (Scala, Java)
• RDDs: caching data in RAM
• map(): avoids intermediate map files
• reduce(): in RAM
• Pipelining, like query optimizers
• Scala versus Java: higher abstraction (comparable to SQL?)
• Fault tolerance compromised; better in MR
• MLlib and GraphX: growing
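The same RAM-caching idea is visible through Spark's SQL interface; a sketch, assuming the hypothetical XH table has been registered with Spark:

```sql
-- Spark SQL: pin the table in memory, then reuse it across
-- iterations without re-reading it from HDFS.
CACHE TABLE XH;
SELECT COUNT(*), SUM(x1), SUM(x1 * x1) FROM XH;
```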
Systems optimizations: SQL vs MR/Spark (optimized versions, same hardware)

| Task | SQL | UDF | MR/Spark |
|---|---|---|---|
| Speed: compute model | 2 | 1 | 3 |
| Speed: score data set | 1 | 3 | 2 |
| Programming flexibility | 3 | 2 | 1 |
| Process non-tabular data | 3 | 2 | 1 |
| Loading/copy speed | 2 | 2 | 1 |
| Ability to add optimizations | 3 | 1 | 2 |
| Manipulating data key distribution | 1 | 2 | 3 |
| Immediate processing (push=SQL, pull=MR/Spark) | 2 | 1 | 3 |
DBMS vs MapReduce/Spark

| Task | SQL/UDF/AFL | MR/Spark |
|---|---|---|
| 1-node sequential open source | y | y |
| Parallel open source | y? | y |
| Fault tolerant on long jobs | n | y |
| Libraries | limited | many |
| Arrays and matrices | limited | good |
| Massive parallelism, large # of nodes | ok | y |
Conclusions: DBMS future
• Parallel DBMSs: expensive, but reliable
• Fast data mining algorithms solved? Yes, but not considering that data sets are stored in a DBMS; in-DBMS analytics of relational tabular data
• Main memory query processing, streams
• Faster/parallel load/transfer between DBMS and Hadoop
• General tradeoffs in speed/programming: horizontal layout (relational, data frames) vs vertical layout (web, RDF, sparse matrices)
• Algorithms: one pass versus parallel processing; reduce passes/iterations