320 likes | 617 Vues
Massive scale with Microsoft SQL Server 2008 R2 Parallel Data Warehouse Edition. Rushabh Mehta Managing Director (India) | Solid Quality Mentors rmehta@solidq.com. About me: Rushabh Mehta. Professional Association for SQL Server President Solid Quality Mentors (Solid Q )
 
                
                E N D
Massive scale with Microsoft SQL Server 2008 R2 Parallel Data Warehouse Edition Rushabh Mehta Managing Director (India) | Solid Quality Mentors rmehta@solidq.com
About me: Rushabh Mehta • Professional Association for SQL Server • President • Solid Quality Mentors (SolidQ) • Business Intelligence Mentor • Managing Director, India • SQL Server MVP • rmehta@solidq.com ◊ www.solidq.com ◊ @rushabhmehta
Agenda • Microsoft Data Warehousing Overview • SMP v/s MPP Architecture • Microsoft Parallel Data Warehouse Architecture and Components
Microsoft’s Commitment to DW and BI FastTrack2009 PDW2010 SQL 6.01995 SQL 6.51996 SQL 7.01998 SQL 20002000 SQL CE2000 64-bit2001 SQL 20052005 SQL 20082008 OLAP and ETL DW Scale Parallel Processing Partitioning Scale to 100s of TB Data Mining Data Profiling Power Pivot Pervasive Insight Managed Reporting Ad-hoc Reporting Compression Load Optimize Data Warehousing VS Integration Resource Governor Multiple sources KPIs • Gartner Leaders Quadrant for Business Intelligence, since 2008 • Gartner Leaders Quadrant for Data Warehouse, since 2008 • Leader in “The Forrester Wave: Enterprise Data Warehousing Platforms, Q1 2009” • Fastest growing of top 5 data warehouse vendors - IDC • Microsoft spends as a company $9.1 billion in research annually
SQL Server Fast Track Data Warehouse Solution to help customers and partners accelerate their data warehouse deployments • A methodfor designing a cost-effective, balanced system for Data Warehouse workloads • Reference hardware configurationsdeveloped in conjunction with hardware partners using this method • Best practices for data layout, loading and management
Fast Track Scope Supporting Systems BI Data Storage Systems Presentation Layer Systems Integration Services ETL Analysis Services Cubes Web Analytic Tools Data Path Reporting Services Presentation Data Presentation Data Subject AreaData Marts SharePoint Services Microsoft Office SharePoint SAN, Storage Array Data Warehouse PerformancePoint Services Data Staging,Bulk Loading Reference Architecture Scope (dashed)
Fast Track Value Proposition Appliance-like time to value Reduces DBA effort; fewer indexes, much higher level of sequential I/O Choice of HW Platforms Dell, HP, Bull, EMC and IBM – more in future Low TCO Through Commodity hardware and value pricing; Lower storage costs. High Scale New reference architectures scale up to 48 TB (assuming 2.5x compression) Reduced Risk Validated by Microsoft; better choice of hardware; application of Best Practice 8
SMP Architecture • SMP = Symmetric Multiprocessing • Two or more identical processors connected to single shared main memory and controlled by single OS instance • Any processor can work on any task • Easily move tasks between processors to balance workload efficiently • All SQL Server implementations up until now have been SMP
MPP Architecture • MPP = Massively Parallel Processing • Uses many separate CPUs running in parallel to execute a single program • Each CPU has its own memory • Applications must be segmented, using high speed communications between nodes
Parallel Data Warehouse Control Rack DataRack Control Rack Data Rack/s
Parallel Data Warehouse Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Client Drivers Management Servers Support / Patching Compute Node + Storage Node PDW Node Dual Fiber Channel Dual Infiniband Landing Zone ETL Load Interface Backup Node Corporate Backup Solution Spare Compute Node Corporate Network Private Network
Compute Nodes Compute Node SQL
Architecture: Compute Server Node Hardware Options Dual Multi-Core Processors TempDB Workspace Enterprise Class DBMS CPU RAM DUAL 4Gb FC CPU Dual InfiniBand
Data Layout • Replicated:A table structure that exists as a full copy within each discrete PDW Node. • Distributed: A table structure that is hashed on a single column and uniformly distributed across all nodes on the appliance. Each distribution is a separate physical table in the DBMS. • Ultra shared nothing: The ability to design a schema of both distributed and replicated tables to minimize data movement between nodes • Small sets of data can be more efficiently stored in full (replicated). • Certain set operations are more efficient against full sets of data (i.e., single node operations).
Date Dim Item Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Store Dim Store Dim ID Store Name Store Mgr Store Size Data Layout DD DD DD DD DD DD ID ID ID ID ID ID SF 1 SF 1 SF 2 SF 4 SF 5 SF 3 SD SD SD SD SD SD PD MD PD PD PD PD Sales Fact Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Promo Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End
Parallel Data Warehouse Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Control Nodes Active / Passive Cluster Client Drivers Client Drivers Management Servers Support / Patching Dual Fiber Channel Dual Infiniband Landing Zone ETL Load Interface Backup Node Corporate Backup Solution Spare Compute Node Corporate Network Private Network
Control Node & Client Drivers • Client connections always go through the control node • The Control Node contains no persistent user data • PDW ‘Secret Sauce’ • Processes SQL requests • Prepares execution plan • Orchestrates distributed execution • Local SQL Server to do final query plan processing / result aggregation • Client Drivers provided by DataDirect • ODBC, OLE-DB, JDBC and ADO.NET client drivers • Available drivers for 32 and 64 bits
PDW Benefits – Massive Parallel Processing Control Rack DataRack Query 1 is standard T-SQL submitted to SQL Server on Control Node Query 1 Query is executed on all 10 Nodes Results are sent back to client ? ? ? ? ? ? ? ? ? ?
PDW Benefits – Massive Parallel Processing Control Rack DataRack Multiple queries are simultaneously executed across all nodes. PDW supports querying while data is loading. ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Blazing fast performance by parallelizing queries on highly optimized ultra shared nothing nodes.
Parallel Data Warehouse Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Client Drivers Management Servers Management Nodes Active / Passive Cluster Support / Patching Support / Patching Dual Fiber Channel Dual Infiniband Landing Zone ETL Load Interface Backup Node Corporate Backup Solution Spare Compute Node Corporate Network Private Network
Management Node • Runs a separate domain controller (Active Directory) • Used for deploying patches to all nodes in the appliance • Holds images in case a node needs reimaging • High Availability using Active / Passive clustering
Parallel Data Warehouse Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Client Drivers Management Servers Support / Patching Dual Fiber Channel Dual Infiniband Landing Zone Landing Zone ETL Load Interface ETL Load Interface Backup Node Corporate Backup Solution Spare Compute Node Corporate Network Private Network
Landing Zone • Provides high capacity storage for data files from ETL processes • Integration services available on the landing zone • Connected to internal network • Available as sandbox for other applications and scripts that run on internal network • Source • Landing Zone Files • Data Loader • Compute Nodes
Parallel Data Warehouse Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Client Drivers Management Servers Support / Patching Dual Fiber Channel Dual Infiniband Landing Zone Backup Node ETL Load Interface Backup Node Corporate Backup Solution Corporate Backup Solution Spare Compute Node Corporate Network Private Network
Backup Node • Coordinated backup across the nodes • Database level backup • Full or differential • Metadata backup • Can restore to a larger appliance • Optional item – 1 size per config • Up to 524TB of capacity • Available in XS, S, M, L and XL
PDW Software Architecture MS BI (AS, RS) Nexus Query Tool 3rd Party Tools (Client Access) Compute Node Compute Nodes Compute Nodes DMS JDBC OLE-DB ODBC ADO.NET IIS SQL Server Control Node User Data Admin Console DMS PDW Services Landing Zone DMS Loader Client SQL SSIS DSQL Core Engine Services DMS Manager SQL OS Backup Node SQL OS DMS SQL Server DW Schema DW Queue DW Configuration DW Authentication Management Node HPC AD Built by DWPU Existing MS software 3rd Party
Conclusion • MPP architecture supports massive scale through increased parallelization and shared-nothing architecture • Microsoft SQL Server 2008 R2 Parallel Data Warehouse Edition brings massive scale wrapped in the simplicity of an appliance
References • Microsoft Parallel Data Warehouse official site http://www.microsoft.com/pdw
Feedback / QnA • Your Feedback is Important! Please take a few moments to fill out our online feedback form at: << Feedback URL – Ask your organizer for this in advance>> For detailed feedback, use the form at http://www.connectwithlife.co.in/vtd/helpdesk.aspx Or email us at vtd@microsoft.com • Use the Question Manager on LiveMeeting to ask your questions now!
Contact • SolidQ www.solidq.com • Email Address rmehta@solidq.com