Business Intelligence & Big Data Analytics. Hamid Djam Principal Architect Business Intelligence & Analytics.
 
                
Disclaimer: EMC makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, “Roadmap Information”). Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby. Roadmap Information is EMC Restricted Confidential and is provided under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.
Why A Complete Big Data Analytics Stack Matters • Big Data is the new source for economic value • The clearest path to competitive advantage • The ultimate manifestation of fact-based decision making • The net new catalyst for business innovation and workplace evolution • The driving force of a new computing paradigm: data computing
Challenges in Today’s DW Environments: Traditional solutions cannot meet new challenges • Critical business insight sits outside the enterprise data warehouse because traditional DW solutions cannot absorb data fast enough • 100s of data marts • ‘Shadow’ databases • Data is everywhere and growing • 44x data growth by 2020 • The enterprise data warehouse holds only 10% of data • Data marts and ‘personal databases’ (e.g. Access, Excel) make up 90% of corporate data • Source: IDC Digital Universe Study, sponsored by EMC, May 2010
DW Challenges Resolved With BI as a Service • Long project duration • Gap in understanding business requirements • Business creating their own data marts • Inconsistent data between IT systems and business systems • BUSINESS: Speed, Agility, Flexibility, Change, Short term • IT: Stability, Security, Control, Standards, Long term • Reference: Nine Secrets to Building an Agile, Adaptable BI Environment, TDWI
EMC IT: Offering IT-as-a-Service • Desktop-as-a-service: Virtual Desktops; Client Devices • Enterprise Applications / Software-as-a-service: MDM; CRM; Governance, risk, compliance Apps; ERP; Business intelligence; Security; Info. Lifecycle Mgmt; Ent. Content Mgmt • Application Platforms / Platform-as-a-service: Integration; Web server; Application Server; Runtime environments; Development tools; App. frameworks • Database Platform: Greenplum; SQL Server; Oracle; … • Infrastructure-as-a-service: Network; Compute; Storage & backup; vBlock Infrastructure
Information Management Core Disciplines
Data Integration • Guarantees data availability where and when it is required • Movement and transformation of enterprise information • Interconnectivity of IT portfolio • Standardized formats and service interfaces – SOA
Master Data Management • Identification and deduplication of shared master data • Cross-referencing and disambiguation • Hierarchy management • Data governance framework and stewardship processes
Content Management • Unstructured data storage and management • Workflow-based publishing & versioning services • Tie-in to enterprise portal and user identity / security strategies
Data Governance • Framework and organization to ensure management of data as a strategic corporate asset • Data stewardship • Policies and procedures; monitoring and measuring
Business Intelligence • Data warehouse methodology – envisioning to deployment • Business use-case- or function-specific data marts / reporting solutions • Moving with agility from reactive to predictive capability
Information Quality • Assurance that trustworthy data is accessible at time of demand • Standardization & cleansing • Business data rule enforcement • Stale data refresh • Augmentation from external sources
Building The Industry’s Only Complete Big Data Analytics “Stack” Analytic Toolsets (Business Analytics, BI, Statistics, etc.) Greenplum Chorus Enterprise Collaboration Platform for Data Greenplum Data Computing Appliances Purpose-built for Big Data Analytics Greenplum HD Hadoop Enterprise & Community Editions Enterprise Analytics Platform for Unstructured Data Greenplum Database Enterprise & Community Editions World’s Most Scalable MPP Database Platform
GREENPLUM DATABASE Industry-Leading Massively Parallel Processing (MPP) Performance
EMC Greenplum Database Is Purpose-built for Big Data • EMC Greenplum is a shared-nothing, massively parallel processing (MPP) data warehouse system • Core principle of data computing is to move the processing dramatically closer to the data and to the people • Fast Data Loading • Extreme Performance & Elastic Scalability • Unified Data Access
Massively Parallel Processing And Linear Performance Scalability • Greenplum 4.0: Database Architecture • Query interfaces: SQL, MapReduce • Master Servers: query planning & dispatch • Network Interconnect • Segment Servers: query processing & data storage • External Sources: loading, streaming, etc.
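The division of labor on this slide (a master that plans and dispatches, segments that store and process their own slice of the data) can be illustrated with a toy sketch. This is not Greenplum code: the segment count, the CRC32-based placement, and all function names are assumptions chosen for illustration, and the per-segment work runs sequentially here rather than in parallel.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical segment count, for illustration only


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment by hashing the distribution key (illustrative hash)."""
    return crc32(key.encode("utf-8")) % num_segments


def load(rows, num_segments: int = NUM_SEGMENTS):
    """Scatter rows across segments; each segment keeps only its own slice."""
    segments = [[] for _ in range(num_segments)]
    for customer, amount in rows:
        segments[segment_for(customer, num_segments)].append((customer, amount))
    return segments


def segment_partial_sum(segment_rows):
    """Work done on one segment: aggregate the locally stored rows only."""
    totals = defaultdict(float)
    for customer, amount in segment_rows:
        totals[customer] += amount
    return dict(totals)


def master_query(segments):
    """Master role: send the same plan to every segment, then gather results."""
    merged = defaultdict(float)
    for seg in segments:  # a real cluster would run these in parallel
        for customer, subtotal in segment_partial_sum(seg).items():
            merged[customer] += subtotal
    return dict(merged)


ROWS = [("acme", 10.0), ("globex", 5.0), ("acme", 2.5),
        ("initech", 7.0), ("globex", 1.0)]
SEGMENTS = load(ROWS)
TOTALS = master_query(SEGMENTS)
```

Because no segment reads another segment's rows, adding segments shrinks the slice each one scans, which is the intuition behind the linear-scalability claim in the slide title.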
Platform Independence Delivers Choice and Flexibility • Data Computing Appliance: Optimized price/performance; Minimum time-to-value; Ideal for production environments • Software-Only: On your x86 hardware; Flexibility for any workload; Ideal for Q/A or DR • Virtualized Infrastructure: Pool resources; Elastic scalability; Ideal for test & development
Mature Enterprise Platform
CLIENT ACCESS & TOOLS • Client Access: ODBC, JDBC, OLEDB, etc. • 3rd Party Tools: BI Tools, ETL Tools, Data Mining, etc. • Admin Tools: GP Performance Monitor, pgAdmin3 for GPDB
PRODUCT FEATURES • Loading & Ext. Access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access • Storage & Data Access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes – Btree, Bitmap, etc. • Language Support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics
GPDB ADAPTIVE SERVICES • Multi-Level Fault Tolerance • Online System Expansion • Workload Management
CORE MPP ARCHITECTURE • Shared-Nothing MPP • Parallel Query Optimizer • Polymorphic Data Storage™ • Parallel Dataflow Engine • gNet™ Software Interconnect • MPP Scatter/Gather Streaming™
EMC GREENPLUM HD Delivering Enterprise-Ready Apache Hadoop
Greenplum HD – Enterprise Ready Hadoop Platform for Unstructured Data • Greenplum Hadoop is faster, more dependable, and easier to use • Faster to address the growth of unstructured data • EMC reliability for the enterprise • Easier to use with existing systems and tools
Why Hadoop? • With the massive growth of unstructured data, the open-source Apache Hadoop software has quickly become an important new data platform and technology • We've seen this first-hand with customers deploying Hadoop alongside Greenplum databases
Why EMC Greenplum HD? • EMC has the technical depth, expertise and critical mass in building the scalable and reliable distributed data processing systems necessary to drive technical innovation into Hadoop • Hadoop needs to become “mission critical” and “easier to use and manage” • HDFS optimizations, workload management, job scheduling, systems management, etc. • Fault tolerance: eliminate single points of failure (SPOF) for the NameNode, JobTracker and other key components underlying Hadoop
Greenplum HD: Hadoop Software Distributions • Introducing Greenplum HD, enterprise-ready Apache Hadoop software distributions • Community Edition software: 100% open source • Enterprise Edition software: advanced features; 100% API compatible
Greenplum HD Data Computing Appliance • Introducing the world’s first high-performance, purpose-built, data co-processing Hadoop appliance • Combining Hadoop and Greenplum Database in one appliance
THE ANSWER MACHINE. DATA IN. DECISIONS OUT. Introducing the Greenplum Data Computing Appliance
Key Architectural Principles • Keep it simple • Build on standard hardware components • Performance comes from our software architecture • Best of breed x86 and Ethernet networking technologies • Benefit from broad ecosystem innovation • Make it modular for easy scaling • SAN connectivity designed in • Focus on Data Computing, not Data Warehousing • Greenplum Database • SAS Analytics • Hadoop
DCA Functional Components (rack diagram) • Administrative Switch • 2 10GE Switches • 2 GPDB Master Servers • 4 GPDB Segment Servers • 8 Segment Servers • Free Functional Blocks
Scale to Multiple Racks In Granular Quarter Rack Increments • 1st Rack + Expansion Rack + . . . • Add ¼ rack increments
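As a rough sizing aid, the quarter-rack increments can be turned into server and segment-instance counts. The 6 primaries and 6 mirrors per segment server come from the HA slides in this deck. The figure of 4 segment servers per quarter rack is an assumption: it is inferred from the SAN Mirror diagram (96 primaries divided by 6 per server gives 16 servers, taken here to be one full rack). The function name `dca_size` is hypothetical.

```python
SERVERS_PER_QUARTER_RACK = 4  # assumption inferred from the deck's diagrams
PRIMARIES_PER_SERVER = 6      # stated on the HA group slides
MIRRORS_PER_SERVER = 6        # stated on the HA group slides


def dca_size(quarter_racks: int) -> dict:
    """Return server and segment-instance counts for N quarter-rack units."""
    if quarter_racks < 1:
        raise ValueError("a DCA needs at least one quarter rack")
    servers = quarter_racks * SERVERS_PER_QUARTER_RACK
    return {
        "racks": quarter_racks / 4,
        "segment_servers": servers,
        "primaries": servers * PRIMARIES_PER_SERVER,
        "mirrors": servers * MIRRORS_PER_SERVER,
    }


# Walk from a quarter rack up to one and a half racks.
for units in (1, 2, 3, 4, 6):
    print(units, dca_size(units))
```

Under these assumptions a full rack (`dca_size(4)`) has 16 segment servers and 96 primary instances, and each added quarter rack contributes 24 more primaries.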
High Availability Built-In • Master Server Data Protection • HW RAID protection for drive failures • Replicated transaction logs for server failure • On server failure: standby server activated; administrator alerted • Segment Server Data Protection • HW RAID protection for drive failures • Mirrored segments for server failures • On server failure: mirrored segments take over with no loss of service • Fast online differential recovery • (Diagram: Master, Master; Segment, Segment, Segment, Segment, …)
GPDB HA Groups And Segment Mirrors • Diagram of a GPDB HA Group and its set of active segment instances: • Segment Server 1: P1 P2 P3 M6 M8 M10 • Segment Server 2: P4 P5 P6 M1 M9 M11 • Segment Server 3: P7 P8 P9 M2 M4 M12 • Segment Server 4: P10 P11 P12 M3 M5 M7 • Note: the number of primary (P) and mirror (M) instances shown is for illustration purposes only. Each Segment Server in a DCA actually supports a total of 12 instances (6 primaries and 6 mirrors)
DCA Can Sustain Up to Four Server Failures Per Rack, One Per HA Group • Diagram of a GPDB HA Group and its set of active segment instances: • Segment Server 1: P1 P2 P3 M6 M8 M10 • Segment Server 2: P4 P5 P6 M1 M9 M11 • Segment Server 3: P7 P8 P9 M2 M4 M12 • Segment Server 4: P10 P11 P12 M3 M5 M7 • Note: the number of primary (P) and mirror (M) instances shown is for illustration purposes only. Each Segment Server in a DCA actually supports a total of 12 instances (6 primaries and 6 mirrors)
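The "one failure per HA group" claim can be checked against the illustrative layout on this slide. The sketch below treats the four segment servers in the diagram as a single HA group, which is an assumption about how the figure was drawn. The server keys and function names are hypothetical, and the model only tracks whether each segment still has a primary or a mirror on a surviving server.

```python
# Layout copied from the slide: Pn = primary of segment n, Mn = its mirror.
LAYOUT = {
    "server1": ["P1", "P2", "P3", "M6", "M8", "M10"],
    "server2": ["P4", "P5", "P6", "M1", "M9", "M11"],
    "server3": ["P7", "P8", "P9", "M2", "M4", "M12"],
    "server4": ["P10", "P11", "P12", "M3", "M5", "M7"],
}


def all_segments(layout):
    """Every segment number that appears anywhere in the layout."""
    return {int(inst[1:]) for insts in layout.values() for inst in insts}


def lost_segments(layout, failed):
    """Segments with neither a primary nor a mirror on a surviving server."""
    alive = {
        int(inst[1:])
        for server, insts in layout.items()
        if server not in failed
        for inst in insts
    }
    return all_segments(layout) - alive


def survives(layout, failed):
    """True if every segment is still served after the given failures."""
    return not lost_segments(layout, failed)


for server in LAYOUT:
    print(server, "down ->", "ok" if survives(LAYOUT, {server}) else "data lost")
```

In this layout every single-server failure is survivable because no primary shares a server with its own mirror, and each server's mirrors are spread over the other three servers so the failover load is shared. A second failure inside the same group does lose segments, which matches the slide's limit of one failure per HA group.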
EMC Dial-Home and Remote Support Built-In • EMC Premium Support • ESRS secure IP connection enabled for DCA racks • Automatic dial home for DCA HW and SW failures • 24x7 remote technical support and troubleshooting • Online support triggers FRU parts shipment • Four-hour on-site support objective • Connection to EMC Support: FTPS or ESRS
Customer Support Services: EMC Greenplum Warranty and Premium Maintenance
One-Year Limited HW Warranty • Secure Self-Help: 24x7 access to eService support tools, including knowledgebase and forums • Remote Technical Support: technical support and remote troubleshooting during normal business hours • Replacement parts shipped for next-business-day arrival
Premium Maintenance • Remote Technical Support: 24x7 technical support and remote troubleshooting; customer-managed case severity level; installation of platform operating system updates • Onsite Support: installation of replacement parts; four-hour response objective • Proactive Service: secure remote monitoring for hardware; notification of engineering technical advisories; built-in tools maximize stability and performance • Secure Self-Help: 24x7 access to eService support tools, including knowledgebase, forums, and appropriately licensed software updates
Data Computing Appliance (DCA) • Purpose-built, highly scalable next-generation data warehousing appliance • Architecturally integrates database, compute, storage, and network into an enterprise-class, easy-to-implement system • Balanced for best price/performance ratio • Available in quarter-, half-, three-quarter-, full-, and multi-rack configurations
High Capacity DCA • Suitable for large-database customers with PB scalability in mind • Increases the data capacity in a rack by three times • Reduced rack space, power, and cooling needs per unit of data • Lowest price-per-unit data warehouse appliance • Available in quarter-, half-, three-quarter-, full-, and multi-rack configurations
Application Specific Configurations • Database • Hadoop (Roadmap Information; see Disclaimer.)
Seamless Infrastructure Integration • Data Protection • Big Data Loading & Staging • Storage Expansion • Disaster Recovery
Seamless Infrastructure Integration • Isilon Scale-Out Storage for Big Data Staging • EMC Data Domain: Efficient Backup & Restore • EMC VMAX SRDF and EMC Data Domain Replication for Disaster Recovery • EMC VMAX SAN Mirror for Advanced Storage Management (Roadmap Information; see Disclaimer.)
Efficient Backup/Restore with EMC Data Domain • Data Domain deduplication is a great fit for Greenplum datasets • Drastic reduction in backup storage requirement • Back up all segment servers in parallel directly to Data Domain • With Greenplum's deduplication-friendly compressed data streams, achieve effective backup rates of up to 6 TB/hr
DCA SAN Mirror (H1 2011) • Default DCA configuration has Segment Primaries and Segment Mirrors on internal storage • SAN Mirror offloads Segment Mirrors to SAN storage • Doubles effective capacity of a DCA • Foundation of SAN leverage: seamless off-host backups; data replication • No performance impact: primaries on internal storage; SAN sized for load and failed segment server • (Diagram: P1 M1 … … P96 M96) (Roadmap Information; see Disclaimer.)
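The "doubles effective capacity" bullet follows from simple arithmetic: when primaries and mirrors share internal storage, half of it holds mirror copies, and moving the mirrors to the SAN frees that half for primary data. The sketch below is a simplification that ignores compression, RAID overhead and temp space. The 36 TB figure and the function names are hypothetical.

```python
def effective_capacity_tb(internal_usable_tb: float, san_mirror: bool) -> float:
    """Space available for primary (user) data on internal storage."""
    if internal_usable_tb < 0:
        raise ValueError("capacity cannot be negative")
    # Default layout: every primary has a same-sized mirror on internal disks.
    return internal_usable_tb if san_mirror else internal_usable_tb / 2


def san_capacity_needed_tb(internal_usable_tb: float, san_mirror: bool) -> float:
    """SAN space needed to hold the mirrors once internal disks are all primaries."""
    return effective_capacity_tb(internal_usable_tb, True) if san_mirror else 0.0


INTERNAL_TB = 36.0  # hypothetical usable internal capacity, not a DCA spec
DEFAULT_TB = effective_capacity_tb(INTERNAL_TB, san_mirror=False)
SAN_MIRROR_TB = effective_capacity_tb(INTERNAL_TB, san_mirror=True)
print(DEFAULT_TB, SAN_MIRROR_TB, san_capacity_needed_tb(INTERNAL_TB, True))
```

The doubling is not free: the SAN has to provide at least as much space as the primaries occupy, and per the slide it must also be sized for the load of a failed segment server.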
GREENPLUM CHORUS The World’s First Enterprise Data Cloud Platform
Greenplum Chorus • Greenplum’s Enterprise Data Cloud (EDC) Platform, enabling: • Self-service provisioning • Data services • Collaborative analytics • Customers deploy Chorus along with VMware and the Greenplum Database to create an agile, self-service analytic infrastructure • Chorus can significantly reduce the time and effort companies need to extract value and insight from their data
Spin up new projects rapidly with self-service provisioning. • Provision instances, both single-node and multi-node. • Provision sandboxes as new databases or schemas. • Import data easily from anywhere in the cloud.
Data is now discoverable, self-documenting, and shared. • Browse schemas and explore data with powerful search and visualization tools. • Attach documents, ask questions, add comments, and build a living data dictionary. • Define data sets, share them with the team, and schedule imports.
Create a collaborative environment for deep analytics on big data. • Create project workspaces with shared files, data, documentation and workflows. • Execute workflows directly in the sandbox, and then track changes to work and results over time. • Control permissions to protect private data. • Publish functions and documentation, to promote common standards and techniques. • Import functions from libraries of in-database analytics functions. • Collaborate within projects, share information across teams.