
What is Parallel Data Warehouse (PDW) and where does it fit?




  1. What is Parallel Data Warehouse (PDW) and where does it fit? Mike Lampa Director – Business Analytic Solutions

  2. Agenda
  • What is Big Data and Where does PDW Fit?
  • PDW Architecture on Dell hardware
  • MPP and Shared Nothing concepts
  • Data distribution and re-distribution
  • EDW Reference Architecture
  • Data Modeling Guidelines
  • ETL Guidelines
  • Resource Skilling Considerations
  Feel free to ask questions as we move along – the goal is to make this presentation as interactive as possible!
  Confidential

  3. What is Big Data

  4. The Data Explosion
  • 988 exabytes of information in 2010.
  • Every day, 2.5 quintillion bytes of data are created.
  • The volume of data across the enterprise is doubling every 3 years.
  • 80% of enterprise data is unstructured.

  5. Storm Rising in Data Analytics
  Major technology developments are driving three key mega-trends, creating new opportunities for the industry and our customers:
  • Real-time BI and Analytics
  • Scalable Database Architectures
  • “Big Data”
  (Slide diagram: supporting technologies include in-memory, self-service and visualization tools, cloud BI, social analytics, columnar/MPP, NoSQL/NewSQL, and Hadoop with Pig/Hive, spanning structured and unstructured data.)

  6. Big Data complements Analytics (DW & BI)
  (Slide diagram: data flows from data sources through processing infrastructure to knowledge capture and business value. Unstructured sources – crawlers, sensors, apps, devices, bots – feed exploratory analytics and new IP creation by domain specialists and app developers; structured sources – ERP, CRM – feed the DW/PDW and well-defined processing behind data-enriched BI tools for BI users. Models and production analytic applications run on shared infrastructure with data management services.)

  7. PDW Appliance: Architecture & MPP Concepts


  9. What is the PDW Appliance? Microsoft Parallel Data Warehouse
  • Microsoft software running on Dell hardware
  • High-end data warehouse that scales to hundreds of terabytes
  • Massively Parallel Processing (MPP) for high performance
  • Architected with redundancy throughout the system
  • Based on the proven MS SQL Server 2008 R2 platform
  • Low cost of ownership with industry-standard Dell hardware
  • Sold as an appliance with software preloaded
  • Extensive consulting and application services available
  • Microsoft and Dell representatives work together to serve the customer

  10. Microsoft PDW Architecture – Scales for Resilience and High Performance, with a Low Cost of Entry
  • Control Rack plus up to 4 Data Racks; scale by adding data rack(s)
  • Control Nodes (R710, active/passive): PDW engine, admin console, metadata, workspace; client drivers connect over the corporate network
  • Management Servers (R610): Active Directory/DNS, HPC setup/patching, data center monitoring
  • Landing Zone (R510): SSIS instance, loader tool, file staging – the ETL load interface
  • Backup Node (R710 with MD3600f and MD1200s): file store for backups; standard SQL full and differential backups; feeds the corporate backup solution
  • Data Racks: PowerEdge R610 database servers with MD3620f storage nodes, plus a spare database server
  • Interconnects: dual Fibre Channel and dual InfiniBand on a private network

  11. Rack Configuration for Dell MD Appliance – Control Rack plus Data Racks 1–4

  12. PDW Core Concepts
  • Distributed relational database
    • 10 DBMS servers per Data Rack
    • Data distributed across multiple DBMS instances
  • Massively parallel processing
    • Multiple concurrent resources resolve SQL set operations against distributed data
    • The Compute Node architecture supports 10 parallel instances of the DBMS per Data Rack
    • Each DBMS instance works in parallel on its own “distribution” of a single user query
  • Shared nothing computing
    • Resource and data independence are maintained within each DBMS instance
    • Each Compute Node reserves its resources (CPU, memory, disk) for only its distribution of system data
  • Managed by the MPP server (Control Node)
    • Converts schema and metadata from shared nothing to a common logical view
  • Configured for high redundancy

  13. Data Distribution
  • Distributed: A table structure with evenly distributed records across multiple shared nothing databases.
  • Distribution Key: A single column in a Distributed table that is used for hash distribution of records across multiple shared nothing databases.
  • Replicated: A table structure that exists as a full copy on each shared nothing database.
  • Ultra Shared Nothing: Design the database schema with a mix of replicated and distributed tables to minimize data movement between nodes.
    • Dimensions are replicated
    • Facts are distributed
    • Rows are redistributed at run time when a distribution incompatibility is encountered in a SQL set operation.
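The distributed/replicated split above can be sketched as a toy model. This is plain Python, not PDW code: the node count, table names, and rows are invented for illustration.

```python
# Toy sketch (not PDW internals): a distribution key hashes fact rows
# across shared-nothing nodes, while dimension tables are replicated
# as full copies on every node.
NUM_NODES = 4  # hypothetical appliance size

def distribute(rows, key):
    """Hash each row's distribution key to pick its home node."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)
    return nodes

def replicate(rows):
    """Replicated tables get a full copy on every node."""
    return [list(rows) for _ in range(NUM_NODES)]

sales_fact = [{"store_id": s, "qty": q} for s, q in [(1, 5), (2, 3), (1, 7), (3, 2)]]
store_dim = [{"store_id": s, "name": n} for s, n in [(1, "A"), (2, "B"), (3, "C")]]

fact_nodes = distribute(sales_fact, "store_id")
dim_nodes = replicate(store_dim)

# Every fact row lands on exactly one node, and rows sharing a key
# share a node, so a fact-to-dimension join never crosses nodes.
assert sum(len(n) for n in fact_nodes) == len(sales_fact)
```

Because the dimension is present everywhere, each node can join its own fact distribution locally, which is the point of the "ultra shared nothing" design.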

  14. Ultra Shared Nothing Example
  (Slide diagram: four compute nodes, each holding its own distribution of the sales fact (SF) alongside full replicated copies of the dimension tables TD, PD, CD, and MD.)

  15. Redistribution
  • Redistribution: The movement of data between shared nothing database instances to answer distribution-incompatible SQL queries within a PDW appliance.
  • Shuffle: A redistribution technique that leverages the InfiniBand™ network to create temporary distribution-compatible data sets.
    • At least one table in the query plan uses a Distribution Key in its join criteria.
    • Any table that is not joined on its Distribution Key is targeted for Shuffle first. The leftmost table is chosen if multiple tables meet this criterion.
  • Replication: A redistribution technique used to create a temporary full copy of a data set.
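A shuffle can be illustrated with a toy sketch (again plain Python, not DMS internals; the two-node layout and order rows are invented): when the join key is not the distribution key, matching rows may sit on different nodes, so rows are re-hashed on the join key into temporary, distribution-compatible sets before the local joins run.

```python
NUM_NODES = 2  # hypothetical appliance size

def hash_to_node(value):
    return hash(value) % NUM_NODES

def shuffle(nodes, join_key):
    """Re-distribute every row by hashing the join key (a 'shuffle')."""
    out = [[] for _ in range(NUM_NODES)]
    for node in nodes:
        for row in node:
            out[hash_to_node(row[join_key])].append(row)
    return out

# orders were distributed on order_id, but the query joins on customer_id:
# customer 9's two orders start on different nodes.
orders_nodes = [
    [{"order_id": 1, "customer_id": 9}],
    [{"order_id": 2, "customer_id": 9}],
]
shuffled = shuffle(orders_nodes, "customer_id")

# After the shuffle, both customer 9 rows are colocated on one node,
# so the join can proceed node-locally.
assert sorted(len(n) for n in shuffled) == [0, 2]
```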

  16. SMP vs MPP
  MPP:
  • Dedicated resources
  • Scales to petabytes
  • Applicable above ~20 TB
  • Built-in HA and redundancy
  • High concurrency for complex workloads
  SMP:
  • Shared resources
  • Limited scaling
  • Applicable below ~20 TB
  • HA must be architected in

  17. EDW Architecture

  18. EDW Logical Architecture

  19. EDW Information Layers
  (Slide diagram spanning the Data Rack and Landing Zone.)

  20. Data Flow
  • Integration Layer (data in): source → stage via replication, ETL, and load-ready files (LRF); PDW load scripts and load scheduling
  • Base & Package Layer data store: PDWPRD01 (Base & Package, 100 TB); packages are copied for presentation
  • Presentation Layer data store: PDWPRD03 (Presentation, 60 TB), connected over the Dell network and InfiniBand
  • Consumption Layer (data out): SAS, BO, MSAS, and other BI tools

  21. Enabling Packages – Hub and Spoke
  • Physical data marts (packages) may make sense from a consumption perspective.
  • Primary considerations:
    • Business function
    • Type of BI workload
  • Secondary considerations:
    • User size, data volumes & performance
    • Security & sensitivity

  22. Data Modeling Guidelines

  23. PDW Table “Geometries”
  • Replicated: A table structure that exists as a full copy within each PDW Data Node
  • Distributed: A table structure that is hashed and distributed as evenly as possible across all PDW Data Nodes on the appliance

  24. PDW Table Geometry Example
  (Slide diagram: a source-system star schema mapped onto PDW compute and storage nodes. The Sales Fact table (Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp ID, Qty Sold, Dollars Sold) is split into distributions SF 1–SF 5 across the nodes, while the Date, Product/Item, Store, and Promo (Mktg Camp) dimensions are replicated to every node.)

  25. Distribute or Replicate?
  • Use distributed tables when:
    • The table is large – generally > 5 GB
    • For fact/detail tables
    • Full table scans do not provide acceptable performance
  • Use replicated tables when:
    • The table is small – generally < 5 GB
    • For dimension/lookup tables
    • Multiple foreign keys exist and foreign-key joins are common
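The rule of thumb above can be captured as a minimal sketch. The 5 GB threshold is the slide's guideline, not a hard PDW limit, and the helper function is hypothetical.

```python
def choose_geometry(size_gb, is_fact):
    """Return 'DISTRIBUTED' for large or fact/detail tables,
    'REPLICATED' for small dimension/lookup tables (slide guideline)."""
    if size_gb > 5 or is_fact:
        return "DISTRIBUTED"
    return "REPLICATED"

# Large fact table: distribute it across the nodes.
assert choose_geometry(100, is_fact=True) == "DISTRIBUTED"
# Small dimension table: replicate a full copy to every node.
assert choose_geometry(0.5, is_fact=False) == "REPLICATED"
```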

  26. Partitioning Distributed Tables
  • Distributed tables are already segmented by hashed distributions
  • Partitioning further divides rows within a distribution, based on a partition function (e.g., Time_Dim quarter or year)
  • Allows for operational efficiency when adding, loading, dropping, and switching partitions
  • Good for fast loading of an unused partition and then switching it in after loading
  • Partition for manageability
    • Typically on a date key (or integer surrogate)
    • Typically the same as the clustered index key
  • SWITCH partitions OUT for fast deletion of history, or IN to modify or add a specific historical slice

  27. Colocation
  Tables must be designed for performance from the beginning. Performance optimization is not just a DBA task after development is complete!
  • Colocation: Within a PDW appliance, two individual records with identical keys will always belong to the same distribution.
    • The single GREATEST performance consideration
    • Beneficial for join and aggregation performance (e.g., parent & child joins)
    • Distribution compatibility
  • Choosing the right Distribution Key
    • Identify commonly used join keys and/or aggregations
    • Choose a single column that limits skew to < 40% (high domain cardinality & low distribution skew)
    • The Distribution Key should be the first column declared in the Distributed table's DDL
    • Consider surrogate keys when the “business key” is compound
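The skew check above can be sketched numerically. This is an illustrative Python model of evaluating a candidate distribution key, not a PDW tool: the distribution count, the skew formula (max deviation from the average, as a percentage), and the sample key values are assumptions; only the < 40% target comes from the slide.

```python
from collections import Counter

NUM_DISTRIBUTIONS = 8  # hypothetical distribution count

def skew_pct(key_values):
    """Percent by which the fullest distribution exceeds the average."""
    counts = Counter(hash(v) % NUM_DISTRIBUTIONS for v in key_values)
    per_dist = [counts.get(d, 0) for d in range(NUM_DISTRIBUTIONS)]
    avg = sum(per_dist) / NUM_DISTRIBUTIONS
    return 100 * (max(per_dist) - avg) / avg

# High-cardinality key: rows spread evenly, low skew.
even = skew_pct(range(8000))
# Low-cardinality key: everything hashes to one bucket, extreme skew.
lumpy = skew_pct([1] * 8000)
assert even < 40 < lumpy
```

A key with high domain cardinality and even value frequencies keeps each distribution doing a similar share of the work, which is what the < 40% guideline protects.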

  28. Colocation Example
  • Ensure that all compatible Distribution Keys are identical data types.
    • [customer.customer_id integer] = [customer_hist.customer_id integer]
    • [customer.customer_id integer] <> [customer_hist.customer_id char(10)]
  (Slide diagram: a colocation and Distribution Key example.)

  29. Handling Large Dimensions
  Large dimensions (> 5 GB uncompressed): distribute or normalize.
  • Distribute:
    • If possible, distribute the dimension on the same key as the fact surrogate key.
    • If distribution compatibility is not possible, dimension data is “shuffled” on the fly (at query time).
  • Normalize:
    • Normalize the large dimension into smaller tables and replicate the core dimension (a more manageable replication size).
    • Look at the usage pattern: if only a few columns from the dimension are used in most of the major queries, separate high-use columns from low-use columns into separate dimension tables. Both share the same surrogate key.
    • The core dimension is replicated and joined locally to the fact table.
    • Create a view to combine data from the core and outrigger tables to insulate users from the complexity (CAUTION: check the performance).

  30. Multi-Level Partitioning
  • Partitioning
    • Partitioning is a method of distributing a table’s rows among a number of sub-tables (partitions).
    • Partitioning is applied within each distribution.
  • Multi-level partition support
    • Any combination of up to four total range, hash, or list partition schemes.
    • Each new partition level generates new partitions at a multiple of the previous level.
    • Partition values: [range 4 x list 6] will generate 24 partition files per distribution.
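The multiplication rule above is worth making explicit: nested partition levels multiply, so a 4-way range scheme nested with a 6-way list scheme yields 24 partitions within each distribution. A one-line check:

```python
from math import prod

# Partition counts per level, as in the slide's [range 4 x list 6] example.
levels = [4, 6]
partitions_per_distribution = prod(levels)
assert partitions_per_distribution == 24
```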

  31. Benefits of Partitioning
  • Reduce table scans
    • This is the most common use case for partitioning.
    • Relies on query restrictions aligned with range, hash, or list qualifiers.
    • Practice: partition on commonly restricted fields (query based).
  • Minimize memory utilization
    • Join operations: reduces memory requirements per join, and reduces disk spill if session or operation limits are reached.
    • Aggregation: reduces memory requirements to build the result set, and reduces disk spill if session or operation limits are reached.
    • Practice: hash-partition on the join key.

  32. Multi-Level Partitioning Example
  DDL of a multi-level partitioned table:

    CREATE TABLE member (
        memberID     BIGINT      NOT NULL,
        memberType   SMALLINT    NOT NULL,
        lastName     VARCHAR(50) NOT NULL,
        activeStatus CHAR(1)     NOT NULL,
        salesTotal   FLOAT,
        lastLogin    DATE        NOT NULL)
    WITH distribute_on (memberID),
        text compressed,
        partition = ((range on lastLogin
            partition p01 values < '2007_01_01',
            partition p02 values < '2007_04_01',
            partition p03 values < '2007_07_01',
            partition p04 values < '2007_10_01',
            partition p05 values < '2008_01_01',
            partition p06 values >= '2008_01_01')
        SUBpartition (hash on memberID 5 partitions)
        SUBpartition (list on activeStatus
            PARTITION p101 VALUES ('n'),
            PARTITION p102 VALUES ('a'),
            PARTITION p103 VALUES (default)));

  33. ETL Guidelines

  34. PDW Data Loading Design Goals
  • Load data efficiently and non-obtrusively, respecting concurrent queries and loads
  • Reduce table fragmentation as much as possible
  • Provide system recovery capabilities, with minimal impact on concurrent queries, in the event of a data load failure
  • Provide multiple load/ETL options for PDW customers

  35. Data Movement Service (DMS) with PDW
  • Runs as a Windows service on the following nodes:
    • Control
    • Compute
    • Landing Zone
  • Used to quickly move data in parallel between nodes over the InfiniBand network in PDW
  • Uses ADO.NET
    • Uses the SqlClient namespace to select data from SQL Server
    • Uses SqlBulkCopy to insert data into Compute Nodes
  • Two protocols/networks used by DMS:
    • A data transfer network to move data between nodes
    • A message network to send command and status messages to nodes from the DMS Manager
  • DMS interacts closely with the primary PDW Engine service
  • DMS is used for both loading and querying data

  36. ETL Loading Options in PDW
  • DWLoader utility
  • SQL Server Integration Services (SSIS)
  • CREATE TABLE AS SELECT (CTAS)
  • Standard SQL DML statements: INSERT/SELECT

  37. PDW Distributed Table Load – Step 1
  (Slide diagram: the DMS Load Client on the Landing Zone feeds the Control Rack and Data Rack over InfiniBand, through DMS converter, sender, receiver, and writer components on each node.)
  1. DWLoader is invoked (or SSIS starts a load).
  2. The Load Manager creates staging tables.
  3. DMS reads the load data and buffers records to send to the Compute Nodes round-robin.
  4. Each row is converted for bulk insert and hashed based on the distribution column.
  5. The hashed row is sent to the appropriate node's receiver for loading.
  6. The row is bulk inserted into the staging table.

  38. PDW Distributed Table Load – Step 2
  • Step 1 recap: DWLoader creates topology-equivalent staging tables and moves data from the Landing Zone file into the staging tables using DMS.
  • Step 2: DWLoader uses SQL commands to move data from the staging tables to the destination tables.
  • NOTE: the distributions of a table are written in parallel when the multi-transactions option is set to true.
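The two-phase load in steps 1 and 2 can be sketched end to end as a toy model (plain Python, not DWLoader or DMS internals; node count and rows are invented): phase 1 hashes rows from a load file into per-node staging tables; phase 2 stands in for the SQL that moves staging rows into the destination tables, keeping the bulk-insert churn off the destination.

```python
NUM_NODES = 2  # hypothetical appliance size

def phase1_stage(load_rows, dist_key):
    """Hash rows on the distribution column into per-node staging tables."""
    staging = [[] for _ in range(NUM_NODES)]
    for row in load_rows:
        staging[hash(row[dist_key]) % NUM_NODES].append(row)
    return staging

def phase2_move(staging, destination):
    """Stand-in for the per-node SQL move from staging to destination."""
    for node_id, rows in enumerate(staging):
        destination[node_id].extend(rows)
        rows.clear()  # staging is emptied once the move completes
    return destination

dest = [[] for _ in range(NUM_NODES)]
staged = phase1_stage([{"id": i} for i in range(10)], "id")
phase2_move(staged, dest)

# All rows reach the destination; staging tables end up empty.
assert sum(len(n) for n in dest) == 10 and all(not s for s in staged)
```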

  39. Data Loading – DWLoader
  • A command-line utility invoked on the Landing Zone
  • Integrated with DMS
  • Streamlines I/O and minimizes data-loading times through powerful parallel loading against a single text file
  • Optimizes data load speeds while maintaining a performance balance, so as not to seriously degrade concurrently running queries
  • Characteristics of DWLoader:
    • Accommodates initial data loads of large files over 300 GB
    • Achieves data load speeds of up to 2 TB per hour
    • Accommodates multiple, concurrent incremental loads
    • Has settings for canceling loads and showing load status
    • The input file must reside on the Landing Zone
    • Maximum concurrency of 10; subsequent loads are queued

  40. Data Loading – SSIS
  • The SQL Server PDW Destination is an SSIS component that lets you load data into SQL Server PDW by using an SSIS .dtsx package.
  • In the package workflow for SQL Server PDW, you can load and merge data from multiple sources and load data to multiple destinations.
  • Loads occur in parallel, both within a package and among multiple packages running concurrently.
  • SQL Server 2008 R2 SSIS includes:
    • The SQL Server Parallel Data Warehouse Connection Manager
    • The SQL Server Parallel Data Warehouse Destination
  • Like DWLoader, SSIS leverages DMS for parallel load operations.
  • SSIS can run either on the Landing Zone or on a server outside the PDW appliance.

  41. SSIS and PDW Data Types
  When using SSIS to load data from a data source into a SQL Server PDW database:
  • Data is first mapped from the source data types to SSIS data types. This allows data from multiple data sources to map to a common set of data types.
  • Then the data is mapped from SSIS data types to SQL Server PDW data types.

  42. Leading Practices – Data Loading
  • Minimize page breaks (fragmentation) by designing “partition-friendly” loads.
  • If necessary, drop non-clustered indexes before loading and re-index after all loads are complete.
  • There is no benefit to sorting data before it hits the Landing Zone.

  43. Leading Practices – Staging Databases
  • Historical PDW load jobs tend to be the largest; the staging database may be reduced in size for subsequent incremental loads.
  • When creating the staging database, use the following guidelines:
    • Replicated table size should be the estimated size, per Compute Node, of all the replicated tables that will load concurrently.
    • Distributed table size should be the estimated size, per appliance, of all the distributed tables that will load concurrently.
    • Log size is typically similar to the replicated table size.
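The sizing guidelines above reduce to simple arithmetic. A minimal sketch, where the helper and the input sizes are hypothetical and only the rules (replicated size per node, distributed size per appliance, log sized like the replicated set) come from the slide:

```python
def staging_db_size(replicated_gb_per_node, distributed_gb_per_appliance):
    """Estimate staging database components per the slide's guidelines."""
    log_gb = replicated_gb_per_node  # log ~ replicated table size
    return {
        "replicated_gb": replicated_gb_per_node,
        "distributed_gb": distributed_gb_per_appliance,
        "log_gb": log_gb,
        "total_gb": replicated_gb_per_node + distributed_gb_per_appliance + log_gb,
    }

# Hypothetical concurrent load: 50 GB replicated, 400 GB distributed.
sizing = staging_db_size(replicated_gb_per_node=50, distributed_gb_per_appliance=400)
assert sizing["total_gb"] == 500
```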

  44. Leading Practices – SSIS
  • For good PDW loading throughput, it is important to keep a steady stream with minimal starts and stops.
  • PDW connections and queries are very costly to initiate; use fewer of them to do more.
  • Data type conversion in the PDW destination adapter is very expensive. Be sure the input types match the destination types, especially for strings and decimals.
  • Consider performing data transformations after loading into the staging database (ELT instead of ETL).

  45. ETL Guidelines
  • Grouping: Determine the largest set of data that is distribution compatible within the query. This breaks queries into multiple compatible steps.

  46. ETL Guidelines (continued)
  • Joining two distribution-incompatible tables:
    • A scenario where changing the structure of either table is not possible
    • Usually encountered while populating fact tables from underlying BASE tables
    • Create a temporary table, with the required columns from the driver table, distributed on a key that makes distribution compatibility possible
    • This is a more controlled “Shuffle”
    • The temp table or join output can be reused by multiple queries
  • ETL BEST PRACTICE: BREAK YOUR WORKLOAD INTO MULTIPLE, MANAGEABLE, DISTRIBUTION-COMPATIBLE SETS OF QUERIES
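The "controlled shuffle" above can be sketched as a toy model (plain Python, not PDW; the two-node layout, table names, and rows are invented): instead of letting the engine redistribute the whole driver table at query time, project only the needed columns into a temp structure distributed on the join key, then run the now distribution-compatible join node-locally.

```python
NUM_NODES = 2  # hypothetical appliance size

def redistribute(nodes, key, columns):
    """Project the needed columns and re-hash rows on the join key."""
    temp = [[] for _ in range(NUM_NODES)]
    for node in nodes:
        for row in node:
            slim = {c: row[c] for c in columns}
            temp[hash(slim[key]) % NUM_NODES].append(slim)
    return temp

def local_join(a_nodes, b_nodes, key):
    """Each node joins only its own distributions (shared nothing)."""
    out = []
    for a_rows, b_rows in zip(a_nodes, b_nodes):
        for a in a_rows:
            out.extend({**a, **b} for b in b_rows if a[key] == b[key])
    return out

# Driver (BASE) table distributed on region; fact distributed on cust_id.
base = [[{"cust_id": 1, "region": "W", "notes": "x"}],
        [{"cust_id": 2, "region": "E", "notes": "y"}]]
fact = [[{"cust_id": 2, "amt": 10}], [{"cust_id": 1, "amt": 20}]]

# Controlled shuffle: only cust_id and region travel, re-hashed on cust_id.
temp = redistribute(base, "cust_id", ["cust_id", "region"])
joined = local_join(temp, fact, "cust_id")
assert len(joined) == 2
```

The temp structure can then be reused by every query that needs the driver table on that key, which is the payoff the slide calls out.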

  47. Resource Skilling Considerations

  48. Skills Consideration
  • Platform skills – moving from SQL to PDW
    • Retain much of your SQL skills (SQL Server data architecture & DBA, SSIS, etc.)
    • A homogeneous platform as you scale from GB to PB!
  • Design skills:
    • MPP data architecture differs from SMP
      • Think in terms of Distribution Keys vs Primary_Key and Foreign_Key
      • Think in terms of distribution compatibility vs indexes for performance
      • Surrogate keys lend themselves to Distribution Keys
    • MPP ETL architecture differs from SMP
      • More use of load-ready files with upsert logic vs dynamic lookups
      • Staging environment strategies simulate CDC key lookups
      • Surrogate keys add complexity to CDC lookups and key generation

  49. Summary
  • PDW is an MPP appliance from Microsoft on Dell hardware.
  • Keep MPP and Shared Nothing concepts in mind while designing your EDW on PDW.
    • Traditional SMP concepts are neither sufficient nor applicable.
    • Break your workload into manageable, distribution-compatible chunks.
  • PDW supports both normalized and star schemas.
  • Consider grouping data in logical information layers.
  • Use a combination of DWLoader & SSIS depending on the unit of work.
  • Retain your core technology platform skills; augment your DW design skills.

  50. Q & A
