MySQL Data Warehousing Survival Guide

Presentation Transcript


  1. MySQL Data Warehousing Survival Guide Marius Moscovici (marius@metricinsights.com) Steffan Mejia (steffan@lindenlab.com)

  2. Topics • The size of the beast • Evolution of a Warehouse • Lessons Learned • Survival Tips • Q&A

  3. Size of the beast • 43 Servers • 36 active • 7 standby spares • 16 TB of data in MySQL • 12 TB archived (pre S3 staging) • 4 TB archived (S3)  • 3.5B rows in main warehouse  • Largest table ~ 500M rows (MySQL)

  4. Warehouse Evolution - First came slaving • Problems: • Reporting slaves easily fall behind • Reporting limited to one-pass SQL

  5. Warehouse Evolution - Then came temp tables • Problems: • Easy to lock replication with temp table creation  • Slaving becomes fragile

  6. Warehouse Evolution - A Warehouse is Born • Problems: • Warehouse workload limited by what can be performed by a single server

  7. Warehouse Evolution - Workload Distributed • Problems: • No Real-Time Application integration support

  8. Warehouse Evolution - Integrate Real Time Data

  9. Lessons Learned - Warehouse Design: Workload exceeds available memory

  10. Lessons Learned - Warehouse Design • Keep join working sets smaller than available memory • Heavily denormalize data for effective reporting • Minimize joins between large tables • Aggressively archive historical data
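A minimal sketch (not from the original deck) of what a heavily denormalized summary table can look like, using the user_event_log_summary table discussed later in the deck; column types and sizes are assumptions. The user's type and status are copied in so reports never have to join back to the large user table:

      CREATE TABLE user_event_log_summary (
          calendar_date DATE        NOT NULL,
          user_id       BIGINT      NOT NULL,
          event_type    VARCHAR(32) NOT NULL,
          event_count   INT         NOT NULL,
          type          VARCHAR(32),              -- denormalized from user
          status        VARCHAR(32),              -- denormalized from user
          PRIMARY KEY (calendar_date, user_id, event_type)
      ) ENGINE = InnoDB;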

  11. Lessons Learned - Data Movement • mysqldump is your friend • Sequence parent/child data loads based on ETL assumptions • Orders without order lines • Order lines without orders • Data Movement Use Cases • Full • Incremental • Upsert (INSERT ... ON DUPLICATE KEY UPDATE)

  12. Full Table Loads • Good for small tables • Works for tables with no primary key  • Data is fully replaced on each load
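A minimal sketch (not from the original deck) of one common way to make a full reload atomic for readers: load a side table, then swap it in with RENAME TABLE. The lookup_channel table and its dump are illustrative assumptions:

      CREATE TABLE lookup_channel_new LIKE lookup_channel;
      -- ... load lookup_channel_new from the source dump here ...
      RENAME TABLE lookup_channel     TO lookup_channel_old,
                   lookup_channel_new TO lookup_channel;
      DROP TABLE lookup_channel_old;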

  13. Incremental Loads • Table contains new rows but no updates • Good for insert-only tables • High-water mark included in the mysqldump WHERE clause
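A minimal sketch (not from the original deck) of the high-water-mark pattern for an insert-only table, assuming an event_log table with an indexed event_time column; names are illustrative:

      -- On the warehouse, find the newest row already loaded
      SELECT IFNULL(MAX(event_time), '2000-01-01')
        INTO @high_water_mark
        FROM event_log;

      -- On the source, the same predicate goes into mysqldump's --where option,
      -- e.g. --where="event_time > '2010-06-01 23:59:59'"
      SELECT *
        FROM event_log
       WHERE event_time > @high_water_mark;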

  14. Upsert Loads • Table contains new and updated rows • Table must have a primary key • Can be used to update only a subset of columns
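A minimal sketch of an upsert load (not from the original deck), assuming the incoming rows land in a staging table; table and column names are illustrative. Only the columns listed in the ON DUPLICATE KEY UPDATE clause are refreshed for rows that already exist:

      INSERT INTO orders (order_id, status, total_amount, updated_at)
      SELECT order_id, status, total_amount, updated_at
        FROM orders_staging
      ON DUPLICATE KEY UPDATE
          status       = VALUES(status),
          total_amount = VALUES(total_amount),
          updated_at   = VALUES(updated_at);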

  15. Lessons Learned - ETL Design • Avoid large joins like the plague • Break ETL jobs into bite-sized pieces • Ensure target data integrity on ETL failure • Use MEMORY staging tables to boost performance

  16. ETL Design - Sample Problem: Build a daily summary of customer event log activity

  17. ETL Design - Sample Solution

  18. ETL Pseudo code - Step 1
      1) Create staging table & find the high-water mark:

      SELECT IFNULL(MAX(calendar_date), '2000-01-01')
        INTO @last_loaded_date
        FROM user_event_log_summary;

      SET max_heap_table_size = <big enough number to hold several days of data>;

      CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

      CREATE INDEX user_idx USING HASH
          ON user_event_log_summary_staging (user_id);

  19. ETL Pseudo code - Step 2
      2) Summarize events:

      INSERT INTO user_event_log_summary_staging
          (calendar_date, user_id, event_type, event_count)
      SELECT DATE(event_time),
             user_id,
             event_type,
             COUNT(*)
        FROM event_log
       WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
       GROUP BY 1, 2, 3;

  20. ETL Pseudo code - Step 3
      3) Set denormalized user columns:

      UPDATE user_event_log_summary_staging log_summary,
             user
         SET log_summary.type   = user.type,
             log_summary.status = user.status
       WHERE user.user_id = log_summary.user_id;

  21. ETL Pseudo code - Step 4
      4) Insert into target table:

      INSERT INTO user_event_log_summary (...)
      SELECT ...
        FROM user_event_log_summary_staging;
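A minimal sketch (not from the original deck) of making this final step restart-safe, assuming whole days are always loaded together and the target table is InnoDB; the column list reuses the illustrative columns from the earlier sketch. This ties back to "ensure target data integrity on ETL failure":

      START TRANSACTION;

      -- Remove any partially loaded days left behind by a failed run
      DELETE FROM user_event_log_summary
       WHERE calendar_date > @last_loaded_date;

      INSERT INTO user_event_log_summary
          (calendar_date, user_id, event_type, event_count, type, status)
      SELECT calendar_date, user_id, event_type, event_count, type, status
        FROM user_event_log_summary_staging;

      COMMIT;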

  22. Functional Partitioning • Benefits depend on • Partition Execution Times • Data Move Times • Dependencies between functional partitions

  23. Functional Partitioning

  24. Job Management • Run everything single-threaded on a server • Handle dependencies between jobs across servers • Smart re-start is key to survival • Implemented a 3-level hierarchy of processing • Process (collection of build steps and data moves) • Build Steps (ETL 'units of work') • Data Moves
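The deck does not show how the three-level hierarchy is stored; a purely illustrative sketch of tracking tables (all names and columns are assumptions) might look like this:

      CREATE TABLE process (
          process_id INT AUTO_INCREMENT PRIMARY KEY,
          name       VARCHAR(64) NOT NULL,
          status     ENUM('pending','running','failed','complete') NOT NULL
      );

      CREATE TABLE build_step (
          build_step_id INT AUTO_INCREMENT PRIMARY KEY,
          process_id    INT NOT NULL,   -- parent process
          run_order     INT NOT NULL,   -- smart re-start resumes at the first non-complete step
          status        ENUM('pending','running','failed','complete') NOT NULL
      );

      CREATE TABLE data_move (
          data_move_id  INT AUTO_INCREMENT PRIMARY KEY,
          process_id    INT NOT NULL,   -- parent process
          source_server VARCHAR(64) NOT NULL,
          target_server VARCHAR(64) NOT NULL,
          status        ENUM('pending','running','failed','complete') NOT NULL
      );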

  25. DW Replication • Similar to other MySQL environments • Commodity hardware  • Master-slave pairs for all databases • Mixed environments can be difficult • Use rsync to create slaves • But not with ssh (on private network) • Monitoring  • Reporting queries need to be monitored • Beware of blocking queries • Only run reporting queries on slave (temp table issues) • Nagios • Ganglia • Custom scripts
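A short sketch of the kind of checks the monitoring scripts can run from the mysql client (the actual Nagios/Ganglia/custom checks are not shown in the deck); both are standard MySQL commands:

      SHOW SLAVE STATUS\G      -- Seconds_Behind_Master shows replication lag on the slave
      SHOW FULL PROCESSLIST;   -- spot long-running reporting queries that block other work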

  26. Infrastructure Planning • Replication latency • Warehouse slave unable to keep up • Disk utilization > 95% • Required frequent re-sync • Options evaluated • Higher-speed conventional disks • RAM increase • Solid-state disks

  27. Optimization • Check / reset HW RAID settings • Use general query log to track ETL / Queries • Application timing  • isolate poor-performing parts of the build • Optimize data storage - automatic roll-off of older data
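A minimal sketch of turning the general query log on around a build to trace ETL statements; the variables are standard MySQL, the file path is illustrative, and the log should only be enabled temporarily since it records every statement:

      SET GLOBAL log_output       = 'FILE';
      SET GLOBAL general_log_file = '/var/log/mysql/general.log';
      SET GLOBAL general_log      = 'ON';
      -- ... run the ETL build, then turn the log back off ...
      SET GLOBAL general_log      = 'OFF';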

  28. Infrastructure Changes • Increased memory 32GB -> 64GB • New servers have 96GB RAM • SSD Solution • 12 & 16 disk configurations • RAID6 vs. RAID10 • 2.0 TB or 1.6 TB formatted capacity • SATA2 HW BBU RAID6 • ~8 TB data on SSD

  29. Results • Sometimes it pays to throw hardware at a problem • 15-hour warehouse builds on old system • 6 hours on optimized system • No application changes

  30. Finally... Archive • Two-tiered solution • Move data into archive tables in a separate DB • Use SELECT to dump data - efficient and fast • Archive server handles migration • Dump data • GPG-encrypt • Push to S3
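A minimal sketch (not from the original deck) of the "use SELECT to dump data" idea, assuming an archive database and a yearly cutoff; table names, the cutoff date, and the file path are illustrative. The flat file written by SELECT ... INTO OUTFILE is what the archive server can then GPG-encrypt and push to S3:

      CREATE TABLE archive_db.event_log_2009 LIKE warehouse.event_log;

      INSERT INTO archive_db.event_log_2009
      SELECT * FROM warehouse.event_log
       WHERE event_time < '2010-01-01';

      SELECT * INTO OUTFILE '/tmp/event_log_2009.tsv'
        FROM archive_db.event_log_2009;

      DELETE FROM warehouse.event_log
       WHERE event_time < '2010-01-01';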

  31. Survival Tips • Efforts to scale are non-linear • As you scale, it becomes increasingly difficult to manage • Be prepared to supplement your warehouse strategy • Dedicated appliance • Distributed processing (Hadoop, etc) • You can gain a great deal of headroom by optimizing I/O • Optimize current disk I/O path • Examine SSD / Flash solutions • Be pragmatic about table designs • It's important to stay ahead of the performance curve • Be proactive - monitor growth, scale early • Monitor everything, including your users • Bad queries can bring replication down
