SQLCAT: Big Data – All Abuzz About Hive

SQLCAT: Big Data – All Abuzz About Hive Dipti Sangani SQL Big Data PM Microsoft Dipti.Sangani@microsoft.com Cindy Gross SQLCAT BI/Big Data PM Microsoft http://blogs.msdn.com/cindygross @SQLCindy Cindy.Gross@microsoft.com Ed Katibah SQLCAT Spatial PM Microsoft http://blogs.msdn.com/b/edkatibah/ @Spatial_Ed Ed.Katibah@Microsoft.com

Big Agenda Hive Hadoop Big Data Analytics to Insights

A NEW SET OF QUESTIONS • How do I better predict future outcomes? • How do I optimize my fleet based on weather and traffic patterns? • What’s the social sentiment for my brand or products Advanced ANALYTICS SOCIAL & Web ANALYTICS LIVE DATA FEEDS

NEW OPPORTUNITIES GE Revenue Growth Increases ad revenue by processing 3.5 billion events per day Businesses Innovation Measures and ranks online user influence by processing 3 billion signals per day Operational Efficiencies Uses sentiment analysis and web analytics for its internal cloud Massive Volumes Processes 464 billion rows per quarter, with average query time under 10 secs. Cloud Connectivity Connects across 15 social networks via the cloud for data and API access Real-Time Insight Improves operational decision making for IT managers and users

MANAGE any data, any size, anywhere 010101010101010101 1010101010101010 01010101010101 101010101010 Unified Monitoring, Management & Security Non-Relational Streaming Relational Data Movement

VVVVroom! Volume – beyond what environment can handle Velocity – Need decisions fast Variety – Many formats Variability – Multiple interpretations

Big Data MapReduce, Streaming, Machine Learning, Massively Parallel Processing BASE Not ACID Schema on Read Not Write Scale Out for Pay As You Go Too Big, Complex, or Expensive for Current Environment

BIG DATA REQUIRES AN END-TO-END APPROACH INSIGHT Self-Service Collaboration Corporate Apps Devices DATA ENRICHMENT Discover Combine Refine DATA MANAGEMENT Relational Non-relational Analytical Streaming

Hadoop architecture. Active Directory (Security) Pipeline / workflow (Oozie) Business Intelligence (Excel, PowerView…) Metadata (HCatalog) Graph (Pegasus) Stats processing (RHadoop) Scripting (Pig) Query (Hive) Machine Learning (Mahout) NoSQL Database (HBase) Data Integration ( ODBC / SQOOP/ REST) Distributed Processing (Map Reduce) System Center Log file aggregation (Flume) Distributed Storage (HDFS)

Hive Architecture HiveQL Metastore Hive Web Interface (HWI) Command Line Interface (CLI) JDBC ODBC Thrift Server Compiler, Optimizer, Executor Hadoop Head Node Name Node Data Nodes / Task Nodes Hive

DEMO:Analyzing a Frankenstorm

Behind the Scenes

Get HDInsight Sign up for Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP) Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)

Create Table CREATE EXTERNAL TABLEcensusP (State_FIPSint, County_FIPSint, Population bigint, Pop_Age_Over_69 bigint, Total_Householdsbigint, Median_Household_Incomebigint, KeyIDstring) COMMENT 'US Census Data' PARTITIONED BY (Year string) ROW FORMAT DELIMITED FIELDS TERMINATED by '\t' STORED AS TEXTFILE; ALTER TABLE censusP ADD PARTITION (Year = '2010') LOCATION '/user/demo/census/2010';

Inside a Hive Table DATA TYPES EXTERNAL / INTERNAL PARTITIONED BY | CLUSTERED BY | SKEWED BY Terminators ROW FORMAT DELIMITED | SERDE STORED AS FIELDS/COLLECTION ITEMS/MAP KEYS TERMINATED BY LOCATION

MetaData Metadata is stored in a MetaStore database such as Derby SQL Azure SQL Server View SHOW TABLES 'ce.*';DESCRIBE census; DESCRIBE census.population;DESCRIBE EXTENDED census;DESCRIBE FORMATTED census; SHOW FUNCTIONS "x.*"; SHOW FORMATTED INDEXES ON census;

Data Types Primitives Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double Characters: String Special: Binary, Timestamp Collections STRUCT<City:String, State:String> | Struct (‘Boise’, ‘Idaho’) ARRAY <String> | Array (‘Boise’, ‘Idaho’) MAP <String, String> | Map (‘City’, ‘Boise’, ‘State’, ‘Idaho’) UNIONTYPE <BigInt, String, Float> Properties No fixed lengths NULL handling depends on SerDe

Storage – External and Internal CREATE EXTERNAL TABLE census(…) LOCATION '/user/demo/census'; LOCATION ‘hdfs:///user/demo/census'; LOCATION ‘asv://user/demo/census'; Use EXTERNAL when Data also used outside of Hive Data needs to remain even after a DROP TABLE Use custom location such as ASV Hive should not own data and control settings, directories, etc. Not creating table based on existing table (AS SELECT) And ASV = Azure Storage Vault (blob store) INTERNAL is NOT a keyword, just leave off EXTERNAL

Storage – Partition and Bucket • CREATE EXTERNAL TABLE census (…) • PARTIONED BY(Year string) • CLUSTERED BY (population) into 256 BUCKETS Partition • Directory for each distinct combination of string partition values • Partition key name cannot be defined in table itself • Allows partition elimination • Useful in range searches • Can slow performance if partition is not referenced in query • Buckets • Split data based on hash of a column • One HDFS file per bucket within partition sub-directory • Performance may improve for aggregates and join queries • Sampling

Storage – File formats and Serdes • CREATE EXTERNAL TABLE census (…) • ROW FORMAT DELIMITED • FIELDS TERMINATED by‘\001‘ • STORED AS TEXTFILE, RCFILE, SEQUENCEFILE, AVRO • Format • TEXTFILE is common, useful when data is shared and all alphanumeric • Extensible storage formats via custom input, output formats • Extensible on disk/in-memory representation via custom SerDes

CREATE INDEX • CREATE INDEX census_population • ON TABLE census (population) • AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' • WITH DEFERRED REBUILD • IN TABLE census_population_index; • ALTER INDEX census_population ON census REBUILD; • Key Points • No keys • Index data is another table • Requires REBUILD to include new data • SHOW FORMATTED INDEXES on MyTable; • Indexing May Help • Avoid many small partitions • GROUP BY

Create View CREATE VIEWcensusBigPop (state_fips, county_fips, population) AS SELECT state_fips, county_fips, population FROM census WHERE population > 500000 ORDER BY population; Sample Code SELECT * FROM censusBigPop; DESCRIBE FORMATTED censusBigPop; Key Points Not materialized Can have ORDER BY or LIMIT

Query SELECTc.state_fips, c.county_fips, c.population FROM census c WHEREc.median_household_income > 100000 GROUPBYc.state_fips, c.county_fips, c.population ORDERBYcounty_fips LIMIT100; Key Points Minimal caching, statistics, or optimizer Generally reads entire data set for every query Performance The order of columns, tables can make a difference to performance Use partition elimination for range filtering

Sorting ORDER BY One reducer does final sort, can be a big bottleneck SORT BY Sorted only within each reducer, much faster DISTRIBUTE BY Determines how map data is distributed to reducers SORT BY + DISTRIBUTE BY = CLUSTER BY Can mimic ORDER BY, better perf if even distribution

Joins • Supported Hive Join Types • Equality OUTER - LEFT, RIGHT, FULL LEFT SEMI • Not Supported Non-Equality IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)

Joins • Characteristics • Multiple MapReduce jobs unless same join columns in all tables • Put largest table last in query to save memory • Joins are done left to right in query order • JOIN ON completely evaluated before WHERE starts

EXPLAIN • EXPLAIN SELECT * FROM census; • EXPLAIN SELECT * FROM census WHERE population > 100000; • EXPLAIN EXTENDED SELECT * FROM census; • Characteristics • Does not execute the query • Shows parsing • Lists stages, temp files, dependencies, modes, output operators, etc. • ABSTRACT SYNTAX TREE:(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))STAGE DEPENDENCIES:Stage-0 is a root stageSTAGE PLANS:Stage: Stage-0Fetch Operatorlimit: -1

Configure Hive Configuration Hive default configuration <install-dir>/conf/hive-default.xml Configuration variables <install-dir>/conf/hive-site.xml Hive configuration directory HIVE_CONF_DIR environment variable Log4j configuration <install-dir>/conf/hive-log4j.properties Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log

Why Use Hive BUZZ! Cross-pollinate your existing SQL skills! Makes Hadoop cross-correlations, joins, filters easier Allows storage of intermediate results for faster/easier querying Batch based processing Individual queries still often slower than a relational database E2E insight may be much faster

BI on Big Data Gain Insights Mash-up Hive + other data in Excel Hive data source to PowerPivot for in-memory analytics Power View on top of PowerPivot for spectacular visualizations leading to insights Securely share on SharePoint for collaboration, re-use, centralized data Microsoft on top of Hadoop / Hive includes PowerPivot Power View Analysis Services PDW StreamInsight SQL Server SQL Azure Excel

Big Deal Hive Hadoop Big Data Analytics to Insights

Next Steps Get Involved Read a bit http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx Programming Hive Book http://blogs.msdn.com/cindygross Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP) Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP) Think about how you can fit Big Data into your company data strategy Suggest uses, be prepared to combat misuses

Big Data References • Hadoop: The Definitive Guide by Tom White • SQL Server Sqoophttp://bit.ly/rulsjX • JavaScripthttp://bit.ly/wdaTv6 • Twitter https://twitter.com/#!/search/%23bigdata • Hive http://hive.apache.org • Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj • Hadoop On Azure Videos http://tinyurl.com/6munnx2 • Klouthttp://tinyurl.com/6qu9php • Microsoft Big Data http://microsoft.com/bigdata • Denny Lee http://dennyglee.com/category/bigdata/ • Carl Nolan http://tinyurl.com/6wbfxy9 • Cindy Gross http://tinyurl.com/SmallBitesBigData

Microsoft Big Data at PASS Summit BIA-305-A SQLCAT: Big Data – All Abuzz About HiveWednesday 1015am | Cindy Gross, Dipti Sangani, Ed Katibah BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset” Wednesday 1015am | Darwin Schweitzer Enrich Manage Insight AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop Thursday 1015am | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI Thursday 130pm | Denny Lee, Dave Mariani AD-316-M Harnessing Big Data with Hadoop Friday 8am | Mike Flasko DBA-410-S Big Data Meets SQL Server Friday 945am | David DeWitt AD-315-M NoSQL and Big Data Programmability Friday 415p | Michael Rys

Don’t Miss! Win prizes with new online evaluations Build experience with Hands On Labs NEW: TCC 304 Attend David DeWitt’s spotlight session Big Data Meets SQL Server DBA-410-S, Room 6E Friday, 9:45 AM Be SQL Server 2012 Certified with onsite testing Room 212-214 Find hidden session announcements by following: @sqlserver #sqlpass Visit the SQL Clinic and new “I MADE THAT!” Developer Chalk talks NEW: 4C-3 & 4C-4

PASS Resources Free SQL Server and BI training Free 1-day Training Events Regional Event Local and Virtual User Groups Free Online Technical Training This is Community Learning Center

Thank you for attending this session and the 2012 PASS Summit in Seattle

Please fill out evaluations! SQLCAT: Big Data – All Abuzz About Hive Cindy Gross SQLCAT BI/Big Data PM Microsoft http://blogs.msdn.com/cindygross @SQLCindy Cindy.Gross@microsoft.com Dipti Sangani SQL Big Data PM Microsoft Dipti.Sangani@microsoft.com Ed Katibah SQLCAT Spatial PM Microsoft http://blogs.msdn.com/b/edkatibah/ @Spatial_Ed Ed.Katibah@Microsoft.com

SQLCAT: Big Data – All Abuzz About Hive

SQLCAT: Big Data – All Abuzz About Hive

Presentation Transcript

How to Present your Data on Internet? A guide for beginners

Chapter 2: Data Preprocessing

Data Mining

Data Mining

NoSQL and Big Data Processing Hbase , Hive and Pig, etc.

“An Introduction to Poetry” Billy Collins (b. 1941)

Data Mining

High Scale OLTP Lessons Learned from SQLCAT Performance Labs

Best practices to ensure efficient data models, fast data activation, and performance of your SAP NetWeaver BW 7.3 data

BIG DATA

DATA ANALYSIS

Data Quality

Data Mining: Concepts and Techniques — Chapter 2 —

Data Preprocessing

Data Mining with DB

DATA MINING

UNIT-II Data Preprocessing

Data Collection/Data Capture

Advanced Stored Procedures

Data Mining: Concepts and Techniques

Chapter 2: Data Preprocessing

Hive Awards WInners