MarketShare Big Data Analytics

A Big Data analytics architecture for the cloud, with and without buzzwords: a >50 TB analytics pipeline on Amazon Web Services using Hadoop, and the cloud architecture evolution behind it.



Presentation Transcript


  1. MarketShare Big Data Analytics • A Big Data analytics architecture for the cloud

  2. With and Without Buzzwords • Big Data analytics on the Cloud • >50 TB analytics pipeline on Amazon Web Services using Hadoop

  3. MarketShare: Big Data Analytics on the Cloud • Cloud architecture evolution • A real life big data workflow on the cloud • Fixing issues in big data workflow * Source: Interbrand 2011 report MarketShare confidential and proprietary

  4. Cloud + Big Data

  5. Traditional 3 tier architecture • Diagram labels: User, Master App Server, Application Server, Database Server • Eliminate accessibility restrictions

  6. Moving to web based applications • Diagram labels: User (Laptop), Corporate Data Center, Master App Server, Web Server, Application Server(s), Database Server(s), Storage Server(s) • Eliminate Hardware Expenditure

  7. Moving to the cloud • Diagram labels: User (Laptop), Amazon Web Services Cloud, Web Server (EC2 Instance), App Server (EC2 Instance), EC2 Sharded Instance with MySQL • Eliminate Distributed Storage

  8. Moving big data to Hadoop • Diagram labels: User (Laptop), Amazon Web Services Cloud, Web Server (EC2 Instance), App Server (EC2 Instance), EC2 Instance with MySQL, HDFS Cluster • Eliminate distributed systems mgmt.

  9. Compute Elasticity • Diagram labels: User (Laptop), Amazon EC2 permanent Instances, Web Server (EC2 Instance), App Server (EC2 Instance), EC2 Instance with MySQL, Amazon EC2 On Demand Instances, Amazon Elastic MapReduce (on demand hadoop instances)

  10. Storage Elasticity • Diagram labels: User (Laptop), Amazon EC2 permanent Instances, Web Server (EC2 Instance), App Server (EC2 Instance), EC2 Instance with MySQL, Amazon EC2 On Demand Instances, Amazon Elastic MapReduce, Amazon Managed Storage, Amazon Simple Storage Service (S3), Managed Database Services, RDS Database Instance

  11. Network Elasticity • Diagram labels: User (Laptop), Elastic Load Balancer, two groups of Amazon EC2 permanent Instances each with Web Server (EC2 Instance) and App Server (EC2 Instance), Amazon EC2 On Demand Instances, Amazon Elastic MapReduce, Amazon Managed Storage, Amazon Simple Storage Service (S3), RDS Database Instance

  12. Defining the Cloud • Cloud = Managed Storage + Network Elasticity + On Demand Compute

  13. An Example Big Data Workflow

  14. Workflow diagram • Repositories: Raw Data Repository, Customer Data Repository • Data movement: SCP, FTP, hadoop fs -put • Pipeline stages: Transformation, Sequencing, Summarization (each followed by Validation), Statistical Analysis, Report Generation and Visualization • Transformation steps: GUID propagation, Cleaning & truncation, Stats transformation, Joining with dimensions, Transform in Generic Schema • Other labels: Generic Schema, LFIS, LUIS, Funnel • Cluster operations: Estimate Cluster Size, Provision Cluster, Verify Cluster Topology, Fix Cluster & Deploy Binaries, Add Nodes, Rebalance Partitions, Restart, Terminate Cluster • Failures shown: MR job not launched, Mapred job timeout 95% done, Hadoop stopped responding • Exception handling: !<cmd>
      -- GUID Propagation
      CREATE TABLE userid_guid_map AS SELECT DISTINCT user_id, guid FROM dbact_guid;
      CREATE TABLE mapcount AS SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map GROUP BY user_id;
      CREATE TABLE impr_guid AS SELECT i.*, m.guid FROM impr i JOIN userid_guid_map m ON (i.user_id = m.user_id);
      SELECT COUNT(*) FROM impr_guid;
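
The GUID propagation step on this slide can be tried locally. The sketch below is only an illustration: it runs the same three statements in SQLite (via Python's `sqlite3`) instead of on a Hadoop cluster, assumes the impressions table is named `impr` with alias `i`, and uses made-up column names and rows. Treating `mapcount` rows with `cnt > 1` as "ambiguous" users is also an assumption about why that table is built.

```python
import sqlite3


def propagate_guids(act_rows, impr_rows):
    """Attach a guid to each impression through its user_id.

    act_rows:  (user_id, guid) pairs, standing in for dbact_guid
    impr_rows: (impr_id, user_id) pairs, standing in for the impressions table
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dbact_guid (user_id TEXT, guid TEXT)")
    con.execute("CREATE TABLE impr (impr_id INTEGER, user_id TEXT)")
    con.executemany("INSERT INTO dbact_guid VALUES (?, ?)", act_rows)
    con.executemany("INSERT INTO impr VALUES (?, ?)", impr_rows)

    # The three statements from the slide, with spacing and alias repaired.
    con.execute(
        "CREATE TABLE userid_guid_map AS "
        "SELECT DISTINCT user_id, guid FROM dbact_guid"
    )
    con.execute(
        "CREATE TABLE mapcount AS "
        "SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map GROUP BY user_id"
    )
    con.execute(
        "CREATE TABLE impr_guid AS "
        "SELECT i.*, m.guid FROM impr i "
        "JOIN userid_guid_map m ON (i.user_id = m.user_id)"
    )

    rows = con.execute(
        "SELECT impr_id, user_id, guid FROM impr_guid ORDER BY impr_id"
    ).fetchall()
    # Assumption: users mapped to more than one guid are worth flagging.
    ambiguous = [
        user_id
        for (user_id,) in con.execute(
            "SELECT user_id FROM mapcount WHERE cnt > 1 ORDER BY user_id"
        )
    ]
    con.close()
    return rows, ambiguous
```

Because the join is an inner join, impressions whose `user_id` has no mapping are dropped, which is one reason the slide's final `SELECT COUNT(*)` is a useful validation check.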

  15. The big picture

  16. Data Management on the Cloud Big Data + Dynamic Provisioning

  17. Defining the cloud for databases • A dynamically provisioned commodity cluster of virtual machines with the following characteristics • Infinite: a large number of nodes can be commissioned in minutes • Taxi Meter: most services are billed at an hourly usage level
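
A minimal sketch of the "taxi meter" idea, assuming each node is billed per started hour at a flat rate. The rounding rule, the function name and the rate in the test are illustrative assumptions, not figures from the slides.

```python
import math


def hourly_bill(node_runtimes_hours, rate_per_node_hour):
    """Cost of a set of nodes under per-hour metering.

    Assumption: every started hour of a node is billed as a full hour.
    """
    billed_hours = sum(math.ceil(hours) for hours in node_runtimes_hours)
    return billed_hours * rate_per_node_hour
```

Under this model a node that runs for 15 minutes costs the same as one that runs for 60, so work is cheapest when it is packed into whole hours.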

  18. 3 new problems for database query processing • When should resources be added to a data processing system? • Partition management for Low Cost on the cloud • How many of these resources should be permanent? • Materialization with Intermittent Scalability • Where should these resources be added in the stack? • Replication to Improve Query Performance

  19. Transforming Data to Actionable Insights • Cube Materialization: incremental updates, multi-dimensional indexes, multi-dimensional partitions • Figure: Campaign Heat Map (labels: Time, Geography, Publishers; Engagement: High, Medium, Low)

  20. Generic Schema

  21. Life of a Query • Query → Optimize Query → Identify Partitions → Access Partitions → Consolidate Results
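
The stages on this slide can be sketched as a small function. Everything here is hypothetical: partitions are plain Python lists keyed by a partition key, rows carry a `clicks` measure, and the "optimize" step is reduced to de-duplicating the requested keys.

```python
from collections import defaultdict


def run_query(partitions, wanted_keys, group_by):
    """Sum 'clicks' per group over the requested partitions.

    partitions:  dict mapping partition key -> list of row dicts
    wanted_keys: partition keys the query needs (may contain repeats)
    group_by:    name of the row field to group on
    """
    # Optimize Query (stand-in): drop repeated partition keys.
    wanted = set(wanted_keys)

    # Identify Partitions: keep only partitions the query touches.
    selected = [key for key in sorted(partitions) if key in wanted]

    # Access Partitions: compute a partial aggregate per partition.
    partials = []
    for key in selected:
        local = defaultdict(int)
        for row in partitions[key]:
            local[row[group_by]] += row["clicks"]
        partials.append(local)

    # Consolidate Results: merge the partial aggregates.
    total = defaultdict(int)
    for local in partials:
        for group, value in local.items():
            total[group] += value
    return dict(total)
```

Keeping the per-partition aggregation separate from the merge mirrors the slide's split between accessing partitions and consolidating results.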

  22. Tuple Access Layer • Locate the partition for the fast dimension values • Note here that STATE is set to ‘*’ but it has already been materialized • Therefore, we eliminate any sum across all states • On the local partition, do the aggregation to calculate the ‘*’
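
One way to read the ‘*’ handling on this slide: if the all-states total has already been materialized, return it directly; otherwise aggregate the partition's own per-state cells. The sketch below assumes a cube stored as a dict keyed by `(campaign, state)`, which is an illustrative layout rather than the system's actual storage format.

```python
def lookup(cube, campaign, state):
    """Return the measure for (campaign, state), where state may be '*'.

    cube: dict mapping (campaign, state) -> numeric measure
    """
    key = (campaign, state)
    if key in cube:
        # Already materialized (including a stored '*' cell): no summing needed.
        return cube[key]
    if state == "*":
        # Not materialized: aggregate the per-state cells held locally.
        return sum(
            value
            for (c, s), value in cube.items()
            if c == campaign and s != "*"
        )
    raise KeyError(key)
```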

  23. 5 types of partitions maintained in the cloud • Exclusive: EC2 nodes that are allocated for specific keys • Temporary: EC2 nodes that host temporary replicas • Permanent: EC2 nodes that are permanently allocated to service queries • Archive: S3 storage that is not accessible directly by the query engine • Intermittent: EC2 nodes that are allocated by the loading engine

  24. Taxi Meter: Making resources transient

  25. Usage Patterns • Usage patterns vary throughout the day and throughout the week • A couple of periods of heavy usage daily, followed by moderate to low usage • Chart: Computing Resources by time of day (4 – 6am, 8 – 10am, 10 – Noon, Noon – 4pm, 4 – 6pm); labeled activities: Data Loading; Cube Materialization; Users review standard reports looking for campaign exceptions; Users run ad hoc queries to understand campaign exceptions; Users make any necessary campaign adjustments; Quick review of key reports by users prior to heading home

  26. Traditional Computing Approach • Traditional computing approach buys enough computing resources to meet peak usage demand • Even many cloud “solutions” provide only the peak computing power option with no way to dynamically reallocate the computing resources to match the current usage demand • Result: Substantial waste in computing resources and money • Chart: Maximum Computing Resources against Computing Resources used over the same time slots, with $$ markers

  27. “Adaptive” Computing Economics • Finely matching computing resources to user usage patterns can provide a 50% to 90% cost savings versus the traditional computing resource allocation approach • Result: Lower cost with improvements in availability and performance • Chart: Maximum Computing Resources against Adaptive Computing Resources and Computing Resources used over the same time slots, with $$ markers
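
The comparison between peak provisioning and adaptive provisioning can be sketched as simple arithmetic. The demand profile in the usage note is invented for illustration; the slides give only the 50% to 90% range, not the underlying node counts.

```python
def peak_node_hours(demand):
    """Node-hours when capacity is fixed at the peak for the whole period.

    demand: list of (hours, nodes_needed) tuples, one per time slot
    """
    total_hours = sum(hours for hours, _ in demand)
    peak_nodes = max(nodes for _, nodes in demand)
    return peak_nodes * total_hours


def adaptive_node_hours(demand):
    """Node-hours when capacity follows demand slot by slot."""
    return sum(hours * nodes for hours, nodes in demand)


def savings_fraction(demand):
    """Fraction of node-hours saved by following demand instead of the peak."""
    return 1 - adaptive_node_hours(demand) / peak_node_hours(demand)
```

For a hypothetical day of `[(2, 100), (2, 20), (2, 30), (4, 15), (2, 10)]`, peak provisioning uses 1,200 node-hours and adaptive provisioning uses 380, a saving of roughly 68%.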

  28. Intermittent scalability: Using a large number of nodes during load time

  29. Managing CapEx with Role Based Clusters • Single cluster for data cleansing, load and query • 15 TB, 100 nodes • Monthly Cost = $28,800

  30. Role Based Clusters • 2 hours daily for load on 10 nodes • Query on 5 nodes • Monthly Cost = $2,052 • Diagram labels: Ad Server Data, Search Engine Data, Data Cleansing, Build Cube, Hibernate Cube, Query Ready Partitions, UI
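
A rough check of the two monthly figures, assuming a 30-day month, a flat per-node-hour rate, and query nodes that run around the clock. Under those assumptions the single-cluster figure implies $0.40 per node-hour, and the role-based compute alone comes to $1,680. The slide's $2,052 is higher, and the slides do not itemize the difference, so the gap is left unexplained here.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: the slides do not state the month length


def implied_rate(monthly_cost, nodes):
    """Per-node-hour rate implied by an always-on cluster's monthly cost."""
    return monthly_cost / (nodes * HOURS_PER_DAY * DAYS_PER_MONTH)


def role_based_compute_cost(rate, load_nodes, load_hours_per_day, query_nodes):
    """Monthly compute cost of a short-lived load cluster plus query nodes.

    Assumption: query nodes run 24 hours a day; storage and any other
    services are not included.
    """
    load_node_hours = load_nodes * load_hours_per_day * DAYS_PER_MONTH
    query_node_hours = query_nodes * HOURS_PER_DAY * DAYS_PER_MONTH
    return (load_node_hours + query_node_hours) * rate
```

Calling `role_based_compute_cost(implied_rate(28800, 100), 10, 2, 5)` covers 600 load node-hours plus 3,600 query node-hours, or 4,200 node-hours in total.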

  31. Selective replication for hot partitions

  32. Partition level query slowdown • Dynamic statistics • The query execution system logs status for each partition • If a particular partition is regularly lagging behind, it is marked for replication • Static statistics • The query execution system identifies skews in specific partitions • Partitions with size skew etc. are marked for replication • Diagram: Operational (EC2) nodes holding two copies each of partitions P1 to P6

  33. Fixing partition level slowdown • If the query execution system detects SLA violations • Adds two new temporary nodes (Temp 1) • Creates new replicas for the ‘hot’ partitions • Diagram: Operational (EC2) Node 1, Node 2, Node 3 and Temporary (EC2) Temp 1 holding partitions P1 to P6, with extra copies of P3 and P6

  34. Key level query slowdown • Key Level Dynamic statistics • A particular key takes time for materializing various facets of the cube • Diagram: Operational (EC2) nodes holding two copies each of partitions P1 to P6
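
A minimal sketch of how partitions might be marked for replication from the two kinds of statistics on this slide. The thresholds, the use of the median as the "typical" value, and the input layout are all assumptions made for illustration; the slides do not say how lag or skew is measured.

```python
from statistics import median


def mark_for_replication(latencies, sizes, lag_factor=2.0, lag_share=0.5,
                         skew_factor=2.0):
    """Return partitions flagged by dynamic (lag) and static (size) checks.

    latencies: dict partition -> list of per-query times (equal lengths)
    sizes:     dict partition -> partition size
    """
    names = sorted(latencies)
    n_queries = len(latencies[names[0]])

    # Dynamic statistics: count queries where a partition lags the median.
    lag_counts = {p: 0 for p in names}
    for q in range(n_queries):
        typical = median(latencies[p][q] for p in names)
        for p in names:
            if latencies[p][q] > lag_factor * typical:
                lag_counts[p] += 1
    # "Regularly lagging" is taken to mean lagging in most logged queries.
    dynamic = [p for p in names if lag_counts[p] > lag_share * n_queries]

    # Static statistics: flag partitions much larger than the median size.
    typical_size = median(sizes.values())
    static = sorted(p for p, s in sizes.items() if s > skew_factor * typical_size)

    return {"dynamic": dynamic, "static": static}
```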

  35. Fixing key level slowdown • If the query execution system detects SLA violations for a particular key • Adds a new temporary node (Temp 2) • Denormalizes the key such that all data for that key is materialized • Diagram: Temporary (EC2) Node 5 holding the materialized key (KN), alongside Operational (EC2) Node 1, Node 2, Node 3 holding two copies each of partitions P1 to P6

  36. Partitions can be in 5 different states • States: Load (temporary EC2), Archive (S3/EBS), Operational (EC2), Replicated (temporary EC2), Isolated (EC2) • Transitions: post load, archive the partitions; create replicas if partition is hot; isolate keys on separate machines • Data Retrieval Module: retrieve data from materialized; access replicas if base partition is overwhelmed; access base partitions if not materialized
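
The Data Retrieval Module's three rules can be sketched as a routing function. The load metric, the 0.8 overload threshold and the data structures are hypothetical; the slide states only the order of preference.

```python
def choose_source(partition, materialized, replicas, load, overload=0.8):
    """Pick where to read a partition from.

    materialized: set of partitions with materialized results
    replicas:     dict partition -> list of replica node names
    load:         dict partition -> current load on the base copy (0.0 to 1.0)
    """
    # Rule 1: prefer materialized data when it exists.
    if partition in materialized:
        return ("materialized", None)
    # Rule 2: use a replica when the base partition is overwhelmed.
    if load.get(partition, 0.0) > overload and replicas.get(partition):
        return ("replica", replicas[partition][0])
    # Rule 3: otherwise read the base partition.
    return ("base", None)
```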

  37. Next Steps • Lots of challenges in cloud + modeling • Collaboration opportunities • We are hiring!
