
Bridge to Big Data


Presentation Transcript


  1. Bridge to Big Data: Hadoop in the Enterprise Architecture. Jim Walker, October 23, 2012, Strata/Hadoop World 2012

  2. Big Data: Organizational Game Changer. Transactions + Interactions + Observations = BIG DATA. [Diagram: data volume and variety grow together, from megabytes of ERP data (purchase detail, purchase and payment records, offer details and history, product/service logs) to gigabytes of CRM data (segmentation, customer touches, support contacts, dynamic pricing, external demographics, business data feeds, affiliate networks, search marketing, dynamic funnels) to terabytes of web data (web logs, user click stream, A/B testing, behavioral targeting, user generated content) to petabytes of big data (mobile web, sentiment, SMS/MMS, speech to text, social interactions & feeds, sensors/RFID/devices, spatial & GPS coordinates, HD video, audio, images). Axis label: Increasing Data Variety and Complexity.]

  3. What is a Data-Driven Business? • DEFINITION: better use of available data in the decision-making process • RULE: key metrics derived from data should be tied to goals • PROVEN RESULTS: firms that adopt data-driven decision making have output and productivity 5-6% higher than would be expected given their investments in and usage of information technology* * "Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?" Brynjolfsson, Hitt and Kim (April 22, 2011)

  4. Big Data: Optimize Outcomes at Scale. [Diagram: "optimize" repeated across many business functions.] Source: Geoffrey Moore, Hadoop Summit 2012 keynote presentation.

  5. Enterprise Big Data Flows. Sources: business transactions & interactions (CRM, ERP, web, mobile, point of sale, DB data), unstructured data, log files, exhaust data, social media, sensors and devices, plus classic data integration & ETL; consumers: business intelligence & analytics (dashboards, reports, visualization). The big data platform sits between them: (1) Capture: collect data from all sources, structured and unstructured; (2) Process: transform, refine, aggregate, analyze, report; (3) Distribute Results: interoperate and share data with applications and analytics; (4) Feedback: use operational data within the big data platform.

  6. Univ. California - Irvine Medical Center: optimizing patient outcomes while lowering costs • The current system (Epic) holds 22 years of patient data across admissions and clinical information • Significant cost to maintain and run the system • Difficult to access, not integrated, stand-alone • Apache Hadoop sunsets the legacy system and augments the new electronic medical records (EMR) system • Migrating legacy data to Apache Hadoop replaced the existing ETL and temporary databases and captures complete information • Eliminating maintenance of the legacy system saves $500K annually • Integrating the data with the EMR and clinical front end enables better patient service and improved research • UC Irvine Medical Center is ranked among the nation's best hospitals by U.S. News & World Report for the 12th year • More than 400 specialty and primary care physicians • Opened in 1976 • 422-bed medical facility

  7. Data Platform Requirements for Big Data • Capture: collect data from all sources, structured and unstructured, at all speeds (batch, async, streaming, real-time) • Process: transform, refine, aggregate, analyze, report • Exchange: deliver data to enterprise data systems; share data with analytic applications and processing • Operate: provision, monitor, diagnose, manage at scale; reliability, availability, affordability, scalability, interoperability. Across all deployment models: operating systems, virtual platforms, cloud platforms, big data appliances

  8. Apache Hadoop & Big Data Use Cases. Big data (transactions, interactions, observations) feeds three use-case patterns, each tied to a business case: Refine, Explore, Enrich.

  9. Operational Data Refinery (Refine): Hadoop as a platform for ETL modernization. Capture • Capture new unstructured data and log files alongside existing sources • Retain inputs in raw form for audit and continuity purposes. Process • Parse and cleanse the data • Apply structure and definition • Join datasets across disparate data sources. Exchange • Push to the existing data warehouse for downstream consumption • Feed operational reporting and online systems. Flow: DB data, unstructured data, and log files are captured and archived, parsed and cleansed, structured and joined, then uploaded to the enterprise data warehouse.
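A minimal sketch of the parse-and-cleanse step above, written as a Hadoop Streaming mapper in Python. The web-log format, field names, and output layout are illustrative assumptions, not part of the deck.

#!/usr/bin/env python3
# Hadoop Streaming mapper sketch: parse and cleanse web-server log lines into
# tab-separated records that downstream jobs (or Hive/Pig) can join and load.
import re
import sys

# Assumed Apache common/combined log prefix: host ident user [time] "METHOD path PROTO" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def main():
    for line in sys.stdin:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # drop malformed lines (cleansing)
        rec = match.groupdict()
        rec["bytes"] = rec["bytes"] if rec["bytes"].isdigit() else "0"
        # Emit a structured, tab-separated record keyed by host.
        print("\t".join([rec["host"], rec["time"], rec["method"],
                         rec["path"], rec["status"], rec["bytes"]]))

if __name__ == "__main__":
    main()

The same script runs unchanged on a single node for testing (cat access.log | ./mapper.py) before being submitted through Hadoop Streaming.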

  10. “Big Bank” Key Benefits • Capture and archive • Retain 3 to 5 years of data instead of 2 to 10 days • Lower costs • Improved compliance • Transform, change, refine • Turn upstream raw dumps into a small list of “new, update, delete” customer records • Convert fixed-width EBCDIC to UTF-8 (Java- and DB-compatible) • Turn raw weblogs into sessions and behaviors • Upload • Insert into Teradata for downstream “as-is” reporting and tools • Insert into a new exploration platform for data scientists to explore
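The slide's "convert fixed-width EBCDIC to UTF-8" step, sketched in Python. The cp037 (EBCDIC US/Canada) code page and the field widths below are illustrative assumptions; real mainframe extracts differ by shop.

# Decode one fixed-width EBCDIC record into named, UTF-8-safe string fields.
FIELD_WIDTHS = [("customer_id", 10), ("name", 30), ("status", 1)]  # assumed layout

def ebcdic_record_to_fields(record: bytes) -> dict:
    text = record.decode("cp037")  # EBCDIC US/Canada code page -> Python str
    fields, offset = {}, 0
    for name, width in FIELD_WIDTHS:
        fields[name] = text[offset:offset + width].strip()
        offset += width
    return fields

Each decoded dict can then be serialized (for example, tab-separated) so that Java tools and databases downstream see plain UTF-8.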

  11. Big Data Exploration & Visualization (Explore): Hadoop as an agile, ad-hoc data mart. Capture • Capture multi-structured data and retain inputs in raw form for iterative analysis. Process • Parse the data into a queryable format • Explore & analyze using Hive, Pig, Mahout and other tools to discover value • Label data and type information for compatibility and later discovery • Pre-compute stats, groupings, and patterns in the data to accelerate analysis. Exchange • Use visualization tools to facilitate exploration and find key insights • Optionally move actionable insights into the EDW or a data mart. Flow: DB data, unstructured data, and log files are captured and archived, structured and joined, categorized into tables, and exposed over JDBC/ODBC to visualization tools, with an optional upload to the EDW/data mart.
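A hedged sketch of the "explore & analyze" step from a client, using the PyHive library over HiveServer2. The host, table, and column names are hypothetical; any JDBC/ODBC client would do the same job.

# Ad-hoc exploration over data already landed in Hadoop, via Hive.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)  # assumed endpoint
cursor = conn.cursor()

# Pre-compute a simple grouping, as the slide suggests (stats, groupings, patterns).
cursor.execute("""
    SELECT session_country, COUNT(*) AS sessions, AVG(pages_viewed) AS avg_pages
    FROM weblog_sessions
    GROUP BY session_country
    ORDER BY sessions DESC
    LIMIT 20
""")
for country, sessions, avg_pages in cursor.fetchall():
    print(country, sessions, round(avg_pages, 2))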

  12. “Hardware Manufacturer” Key Benefits • Capture and archive • Store 10M+ survey forms per year for more than 3 years • Capture text, audio, and systems data in one platform • Structure and join • Unlock freeform text and audio data • De-anonymize customers • Categorize into tables • Create HCatalog tables “customer”, “survey”, “freeform text” • Upload, JDBC • Visualize natural satisfaction levels and groupings • Tag customers as “happy” and report back to the CRM database
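A hedged sketch of the "categorize into tables" step: registering cleansed survey data as a Hive/HCatalog table so JDBC/ODBC visualization tools can query it. Column names, types, and the storage location are illustrative assumptions.

# Register survey data as an external Hive/HCatalog table (schema is assumed).
from pyhive import hive

cursor = hive.connect(host="hive-server.example.com", port=10000).cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS survey (
        customer_id STRING,
        survey_date STRING,
        score INT,
        freeform_text STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/surveys/cleansed'
""")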

  13. Application Enrichment (Enrich): deliver Hadoop analysis to online apps. Capture • Capture data that was once too bulky and unmanageable. Process • Uncover aggregate characteristics across the data • Use Hive, Pig, and MapReduce to identify patterns • Filter useful data out of mass streams (Pig) • Run on micro- or macro-batch schedules. Exchange • Push results to HBase or another NoSQL store for real-time delivery • Use the patterns to deliver the right content/offer to the right person at the right time. Flow: DB data, unstructured data, and log files are captured, parsed, and derived/filtered on scheduled and near-real-time cycles, then served from NoSQL/HBase to low-latency online applications.
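A hedged sketch of the exchange step, pushing batch-computed results into HBase for low-latency serving. It uses the happybase Thrift client; the Thrift host, table name, column family, and row layout are illustrative assumptions.

# Write precomputed recommendations into HBase so online apps can read them fast.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")  # assumed Thrift host
table = connection.table("recommendations")

# One row per customer; columns hold the top offers derived upstream in Hadoop.
with table.batch() as batch:
    batch.put(b"customer:10042", {
        b"reco:offer_1": b"SKU-8841",
        b"reco:offer_2": b"SKU-1203",
    })

An online application then reads row "customer:10042" directly from HBase with millisecond latency.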

  14. “Clothing Retailer” Key Benefits • Capture • Capture weblogs together with sales order history and customer master data • Derive useful information • Compute relationships between products over time (“people who buy shirts eventually need pants”) • Score customer web behavior and sentiment • Connect product recommendations to customer sentiment • Share • Load customer recommendations into HBase for rapid website service
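The "people who buy shirts eventually need pants" idea, sketched locally in plain Python: count how often product pairs co-occur across customers' purchase histories. The sample data is made up; at scale this would run as a MapReduce, Pig, or Hive job over the full order history.

from collections import Counter
from itertools import combinations

orders_by_customer = {
    "c1": ["shirt", "pants", "belt"],
    "c2": ["shirt", "pants"],
    "c3": ["shirt", "socks"],
}

pair_counts = Counter()
for products in orders_by_customer.values():
    for a, b in combinations(sorted(set(products)), 2):
        pair_counts[(a, b)] += 1

# The most frequently co-purchased pairs become candidate recommendations.
for (a, b), count in pair_counts.most_common(3):
    print(f"{a} -> {b}: bought together by {count} customers")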

  15. Hadoop in Enterprise Data Architectures. [Diagram] Existing business infrastructure: web applications, IDE & dev tools, ODS & data marts, applications & spreadsheets, visualization & intelligence, operations, EDW. New tech: Datameer, Tableau, Karmasphere, Splunk, low-latency/NoSQL stores, discovery tools (existing and custom). Hadoop platform: HDFS, MapReduce, Pig, Hive, HBase, HCatalog, WebHDFS, Templeton, Sqoop, Flume, Ambari, Oozie, ZooKeeper, HA. Big data sources (transactions, observations, interactions): CRM, ERP, financials, social media, exhaust data, logs, files.

  16. Hadoop Integration Options. Batch & scheduled integration: Sqoop and data integration tools (Talend, Informatica) move data between existing infrastructure (logs & files, databases & warehouses, applications & spreadsheets, visualization & intelligence) and the cluster (HDFS, HCatalog, Pig, Hive, HBase, MapReduce). Near real-time integration: Flume, WebHDFS REST, and ODBC/JDBC connect the same existing infrastructure to the cluster for streaming and interactive access.
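A hedged sketch of the WebHDFS REST path listed above: listing a directory and reading a file over HTTP with Python's requests library. The NameNode host, port, and paths are assumptions, and the default WebHDFS HTTP port differs across Hadoop versions.

import requests

NAMENODE = "http://namenode.example.com:50070"  # assumed NameNode HTTP endpoint

# LISTSTATUS: enumerate files under a directory.
listing = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/weblogs", params={"op": "LISTSTATUS"}
).json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])

# OPEN: read a file; WebHDFS redirects the request to a DataNode automatically.
content = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/weblogs/part-00000",
    params={"op": "OPEN"},
    allow_redirects=True,
).text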

  17. What Is the Size of Your Cluster? Open-source data management with scale-out storage & distributed processing. Key characteristics • Scalable: efficiently store & process petabytes; linear scale driven by additional processing and storage • Reliable: redundant storage; failover across nodes and racks • Flexible: store all types of data in any format; apply schema on analysis & sharing of data • Economical: use commodity hardware; open-source software guards against vendor lock-in. Storage (HDFS): distributed across "nodes", natively redundant, the NameNode tracks locations. Compute (MapReduce): splits a task across processors "near" the data and assembles results; self-healing, high-bandwidth clustered storage.
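A minimal Hadoop Streaming sketch of the MapReduce shape described above: the mapper runs on nodes near the data and emits key/value pairs, the framework sorts by key, and the reducer assembles per-key results. The word-count task and the invocation details are illustrative assumptions.

#!/usr/bin/env python3
# wordcount.py: acts as the mapper when called with "map", otherwise as the reducer.
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming delivers mapper output sorted by key, so groupby works per word.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

With Hadoop Streaming the same script is submitted once as the -mapper command ("wordcount.py map") and once as the -reducer command ("wordcount.py reduce"); the streaming jar location and remaining options vary by distribution.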

  18. Hadoop: Balancing Innovation & Stability. [Chasm diagram: relative % of customers over time, across innovators/technology enthusiasts, early adopters/visionaries, the chasm, early majority/pragmatists, late majority/conservatives, and laggards/skeptics. The early segments want technology & performance; the later segments want solutions & convenience.] www.hortonworks.com/moore. Source: Geoffrey Moore, Crossing the Chasm.

  19. Where Does It Fit into Your Business?

  20. Hortonworks Data Platform • Simplify deployment to get started quickly and easily • Monitor and manage any size cluster with a familiar console and tools • The only platform to include data integration services to interact with any data • Metadata services open the platform for integration with existing applications • Dependable high-availability architecture • Tested at scale to future-proof your cluster growth • Reduce the risks and cost of adoption • Lower the total cost to administer and provision • Integrate with your existing ecosystem

  21. Next Steps? 1. Download Hortonworks Data Platform: hortonworks.com/download 2. Use the getting started guide: hortonworks.com/get-started 3. Learn more and get support. Training: expert role-based training, courses for admins, developers and operators, a certification program, and custom onsite options (hortonworks.com/training). Hortonworks Support: full-lifecycle technical support across four service levels, delivered by Apache Hadoop experts/committers, forward-compatible (hortonworks.com/support).
