Cindy Gross CAT PM SQL, BI, Big Data

HDInsight: Jiving about Hadoop and Hive with CAT Cindy GrossCAT PMSQL, BI, Big Data http://blogs.msdn.com/cindygross/ @SQLCindy Cindy.Gross@microsoft.com

Objectives • Quick Overview: Big Data, Hadoop, HDInsight, Open Source • What Hive is • Why Hive for Hadoop? • Why Hive for SQL Pros? • How Hive fits into Hadoop/HDInsight • Hive is better together with SQL, AS, BI Key Takeaways • How Hive fits • Hive DDL and DML • Formats, Structure • Storage options Why are we here?

Big Data MapReduce, Streaming, Machine Learning, Massively Parallel Processing Schema on Read Not Write BASE Not ACID Scale Out for Pay As You Go Too Big, Complex, or Expensive for Current Environment

A New Set of Questions • How do I better predict future outcomes? • How do I optimize my fleet based on weather and traffic patterns? • What’s the social sentiment for my brand or products Advanced ANALYTICS SOCIAL & Web ANALYTICS LIVE DATA FEEDS

New Opportunities GE Revenue Growth Increases ad revenue by processing 3.5 billion events per day Businesses Innovation Measures and ranks online user influence by processing 3 billion signals per day Operational Efficiencies Uses sentiment analysis and web analytics for its internal cloud Massive Volumes Processes 464 billion rows per quarter, with average query time under 10 secs. Cloud Connectivity Connects across 15 social networks via the cloud for data and API access Real-Time Insight Improves operational decision making for IT managers and users

Big Data Technologies Hadoop MapReduce Distributed Storage & Processing Over Multi-Structured Data (HDInsight) Multiple Data Sources Relational / Structured Databases (SQL Server and PDW) Multi-Dimensional and Tabular / Structured Databases (Analysis Services) Event Driven Processing Enterprise BI and Analytics Tools (Office, Analysis Services) Self Service BI and Analytics Tools (Office, Analysis Services) Cloud (Azure) Flexibility + On-Premises Option

Apache Hadoop, Hortonworks, Hdinsight How it fits together Hortonworks HDInsight Apache Hadoop Open Source Community We Consume Code We Contribute Code Core Code Same Across Distributions Microsoft Partner Heavy Contributors to Open Source Hadoop Trusted in Open Source Community HDInsight Service, HDInsight Server Built on Hortonworks Platform Additional Functionality

HDInsight / Hadoop Architecture JavaScript C#, F#, .NET Legend Red = Core Hadoop Blue = Data processing Purple = Microsoft integration points and value adds Yellow = Data Movement Green = Packages White = Coming Soon Relational (SQL Server) Data Integration ( ODBC / SQOOP/ REST) Pipeline / Workflow (Oozie) Stats processing (RHadoop) Machine Learning (Mahout) Graph (Pegasus) Metadata (HCatalog) PDW Polybase Event Driven Processing Query (Hive) Scripting (Pig) NoSQL Database (HBase) Event Pipeline (Flume) Distributed Processing (MapReduce) Business Intelligence (Excel, Power View, SSAS) Distributed Storage (HDFS) Active Directory (Security) Monitoring & Deployment (System Center) World's Data (Azure Data Marketplace) Azure Storage Vault (ASV)

Hive Architecture HiveQL Metastore Hive Hive Web Interface (HWI) Command Line Interface (CLI) JDBC ODBC Thrift Server Compiler, Optimizer, Executor Hadoop Head Node Name Node Data Nodes / Task Nodes

Enables BI tools via ODBC, structure Structure without full relational modeling Familiar HiveQL - skillset reuse Simplifies Hadoop data access Why Hive for Hadoop?

Batch oriented Data Warehouse focused Entire data sets (table scans) Generates/runs MapReduce (not faster than MR!) Limited indexing, no stats, no cache Programmer is the optimizer Append only (mostly) Hive Characteristics

Someone in your org will be doing it, why not you? Fit projects to appropriate tech Adds to, complements SQL, AS, BI New opportunities for biz and for you Explore, archive, prototype, pre-aggregate, refine algorithms, some self-service Why Hive for SQL Pros?

Updates, OLTP, ACID Subsets, indexes/aggs, built-in optimizer, caching Apps, data, structure, infrastructure already exists Each query has to be fast You know what you need to know Where it makes sense SQL/AS still needed for….

Behind the Scenes

Not Partitioned CREATE EXTERNAL TABLE baconUnPart (type string COMMENT 'type of bacon') COMMENT 'SQL Bacon!' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/demo/bacon'; Partitioned CREATE EXTERNAL TABLE baconPart (type string COMMENT 'type of bacon strips') COMMENT 'SQL Bacon strips' PARTITIONED BY (year string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; ALTER TABLE baconPart ADD PARTITION (Year = ‘1’) LOCATION '/user/demo/bacon1'; ALTER TABLE baconPart ADD PARTITION (Year = ‘2’) LOCATION '/user/demo/bacon2'; Create Table

DATA TYPES EXTERNAL / INTERNAL PARTITIONED BY | CLUSTERED BY | SKEWED BY Terminators ROW FORMAT DELIMITED | SERDE STORED AS Fields/Collection Items/Map Keys TERMINATED BY LOCATION Inside a Hive Table

Metadata is stored in a MetaStore database such as Derby SQL Azure SQL Server See Schema SHOW TABLES 'ba.*'; DESCRIBE baconunpart; DESCRIBE baconunpat.type; DESCRIBE EXTENDED baconunpart; DESCRIBE FORMATTED baconunpart; SHOW FUNCTIONS "x.*"; SHOW FORMATTED INDEXES ON baconunpart; MetaData

Primitives Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double Characters: String Special: Binary, Timestamp Collections STRUCT<City:String, State:String> | Struct (‘Boise’, ‘Idaho’) ARRAY <String> | Array (‘Boise’, ‘Idaho’) MAP <String, String> | Map (‘City’, ‘Boise’, ‘State’, ‘Idaho’) UNIONTYPE <BigInt, String, Float> Properties No fixed string lengths NULL handling depends on SerDe Data Types

CREATE EXTERNAL TABLE baconUnPart(…) LOCATION '/user/demo/bacon'; LOCATION ‘hdfs:///user/demo/bacon'; LOCATION ‘asv://user/demo/bacon'; UseEXTERNALwhen Data also used outside of Hive Data needs to remain even after a DROP TABLE Use custom location such as ASV Hive should not own data and control settings, dirs, etc. UseINTERNALwhen You want Hive to manage the data and storage Short term usage Creating table based on existing table (AS SELECT) Storage – External and Internal

CREATE EXTERNAL TABLE baconPart(…) PARTIONED BY(Year string) CLUSTERED BY(type) into 256 BUCKETS Partition Directory for each distinct combo of string partition values Partition key name cannot be defined in table itself Allows partition elimination Useful in range searches Can slow performance if partition is not referenced in query Buckets Split data based on hash of a column One HDFS file per bucket within partition sub-directory Performance may improve for aggregates and join queries Sampling set hive.enforce.bucketing = true; Storage – Partition and Bucket

CREATE EXTERNAL TABLE baconPart(…) ROW FORMAT DELIMITED FIELDS TERMINATED by‘\001‘ STORED AS TEXTFILE, RCFILE, SEQUENCEFILE, AVRO Format Generally remove headers before loading files TEXTFILE is common, useful when data is shared and alphanumeric Extensible storage formats via custom input, output formats Extensible on disk/in-memory representation via custom SerDes Storage – File Formats

CREATE EXTERNAL TABLE CustomSerDeUsage(…) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION …. SerDes Create your own Java Serialization/Deserialization Includes parse input/output, optimization Usually overrides CREATE TABLE DDL Common SerDes: CSV, XML, JSON, Custom Library: org.apache.hadoop.hive.serde2 Storage –SerDes

ASV://containername@account.blob.core.windows.net/user/demo/ HDFS:///user/demo/ Storage Format HDFS is Hadoop distributed file system ASV is Azure Storage Vault using an API on top of HDFS ASV allows reuse across clusters and with other apps ASV data quickly available to new HDInsight clusters Storage –HDFS and ASV

CREATE INDEX baconPart_idx ON TABLE baconPart (type) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD IN TABLE baconPart_index; ALTER INDEX baconPart_idxON baconPartREBUILD; Key Points No keys Index data is another table Requires REBUILD to include new data SHOW FORMATTED INDEXES on MyTable; Indexing May Help Avoid many small partitions GROUP BY CREATE INDEX

CREATE VIEWbaconOneYear (type) AS SELECT type FROM baconPart WHERE year = 1 ORDER BY type; Sample Code SELECT * FROM baconOneYear;DESCRIBE FORMATTED baconOneYear; Key Points Not materialized Can have ORDER BY or LIMIT Create View

Queries

SELECTc.state_fips, c.county_fips, c.population FROMcensus c WHEREc.median_household_income > 100000 GROUPBYc.state_fips, c.county_fips, c.population ORDERBYcounty_fips LIMIT 100; Key Points Minimal caching, statistics, or optimizer Generally reads entire data set for every query Performance The order of columns, tables can make a difference to performance Use partition elimination for range filtering Query

ORDER BY One reducer does final sort, can be a big bottleneck SORT BY Sorted only within each reducer, much faster DISTRIBUTE BY Determines how map data is distributed to reducers SORT BY + DISTRIBUTE BY = CLUSTER BY Can mimic ORDER BY, better perf if even distribution Sorting

Supported Hive Join Types Equality OUTER - LEFT, RIGHT, FULL LEFT SEMI Not Supported Non-Equality IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN) Characteristics Multiple MapReduce jobs unless same join columns in all tables Put largest table last in query to save memory Joins are done left to right in query order JOIN ON completely evaluated before WHERE starts Joins

EXPLAIN SELECT * FROM baconPart; EXPLAIN SELECT * FROM baconPart WHERE year > 1; EXPLAIN EXTENDED SELECT * FROM baconPart; Characteristics Does not execute the query Shows parsing Lists stages, temp files, dependencies, modes, output operators, etc. ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME baconPart))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FIL E)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)))) STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 EXPLAIN

Configuration

Configuration cd %hive_home%\bin <install-dir> currently: C:\Hadoop\hadoop-1.1.0-SNAPSHOT Hive default configuration <install-dir>/conf/hive-default.xml Configuration variables <install-dir>/conf/hive-site.xml Hive configuration directory HIVE_CONF_DIR environment variable Log4j configuration <install-dir>/conf/hive-log4j.properties Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log Configure Hive

Wrap Up

BUZZ! BI on Big Data Cross-pollinate your existing SQL skills! Makes Hadoop cross-correlations, joins, filters easier Allows storage of intermediate results for faster/easier querying Batch based processing E2E insight may be much faster Get the right projects on the right technologies Why Use Hive

Next Steps Get Involved Read a bit http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx Programming Hive Book http://blogs.msdn.com/cindygross Subscribe to Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP) Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP) Think about how you can fit Big Data into your company data strategy Suggest uses, be prepared to combat misuses

Big Data References • Hadoop: The Definitive Guide by Tom White • SQL Server Sqoophttp://bit.ly/rulsjX • JavaScripthttp://bit.ly/wdaTv6 • Twitter https://twitter.com/#!/search/%23bigdata • Hive http://hive.apache.org • Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj • Hadoop On Azure Videos http://tinyurl.com/6munnx2 • Klouthttp://tinyurl.com/6qu9php • Microsoft Big Data http://microsoft.com/bigdata • Denny Lee http://dennyglee.com/category/bigdata/ • Carl Nolan http://tinyurl.com/6wbfxy9 • Cindy Gross http://tinyurl.com/SmallBitesBigData

Objectives • Quick Overview: Big Data, Hadoop, HDInsight, Open Source • What Hive is • Why Hive for Hadoop? • Why Hive for SQL Pros? • How Hive fits into Hadoop/HDInsight • Hive is better together with SQL, AS, BI Key Takeaways • How Hive fits • Hive DDL and DML • Formats, Structure • Storage options What we covered

HDInsight: Jiving about Hadoop and Hive with CAT Cindy GrossCAT PMSQL, BI, Big Data http://blogs.msdn.com/cindygross/ @SQLCindy Cindy.Gross@microsoft.com

Cindy Gross CAT PM SQL, BI, Big Data