Introduction to Hive

Introduction to Hive, by Liyin Tang (liyintan@usc.edu). Outline: Motivation; Overview; Data Model / Metadata; Architecture; Performance; Cons and Pros; Application; Related Work.

Presentation Transcript


  1. Introduction to Hive Liyin Tang liyintan@usc.edu

  2. Introduction to Hive Outline • Motivation • Overview • Data Model / Metadata • Architecture • Performance • Cons and Pros • Application • Related Work 10/20/2019

  3. Motivation (diagram; labels: Realtime Hadoop Cluster, Scribe MidTier, Web Servers, Scribe Writers, MySQL, Oracle RAC, Hadoop Hive Warehouse) Source: http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

  4. Motivation • Limitations of raw MapReduce: every job has to be expressed in the M/R model; code is not reusable; error prone • For complex jobs: multiple stages of Map/Reduce functions are needed; this is like asking developers to hand-write the physical execution plan for a database query
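
To make the limitation concrete, here is a minimal Python sketch of the plumbing a developer writes by hand for a simple per-key count. It is not Hadoop code: the function names and the click-log data are illustrative assumptions, and the shuffle step is simulated in memory. In a SQL-like language the same job would be a single GROUP BY query.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, 1) for every input record; here the key is the page name.
    for _user, page in records:
        yield page, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

def page_views(records):
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical click log of (user, page) pairs.
clicks = [("u1", "home"), ("u2", "home"), ("u1", "profile")]
print(page_views(clicks))
```

A second aggregation over the result (for example, "top pages") would need another hand-written map/reduce stage chained after this one.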

  5. Overview • Intuition: make unstructured data look like tables, regardless of how it is actually laid out; run SQL-based queries directly against these tables; generate a specific execution plan for each query • What is Hive: a data warehousing system that stores structured data on the Hadoop file system; provides an easy way to query this data by executing Hadoop MapReduce plans

  6. Data Model • Tables • Basic column types (int, float, boolean) • Complex types: List / Map (associative array) • Partitions • Buckets • CREATE TABLE sales( id INT, items ARRAY<STRUCT<id:INT,name:STRING>> ) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS; • SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
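
The bucketing and sampling statements on this slide can be illustrated with a small Python sketch. It assumes that an integer id is assigned to a bucket by a plain modulo; that is a stand-in for whatever hash function the real system applies, and the helper names are hypothetical.

```python
NUM_BUCKETS = 32

def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    # Assumption: integer ids are bucketed by a simple modulo; this stands in
    # for the hash function the real system applies to the clustering column.
    return row_id % num_buckets

def write_buckets(rows, num_buckets=NUM_BUCKETS):
    # Place each row in the bucket chosen by its clustering column (id).
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets

def sample_bucket(buckets, n):
    # BUCKET n OUT OF N is 1-based, so bucket 1 is list index 0.
    return buckets[n - 1]

rows = [{"id": i} for i in range(100)]
buckets = write_buckets(rows)
sample = sample_bucket(buckets, 1)
print([row["id"] for row in sample])  # [0, 32, 64, 96]
```

Because the sample is a whole bucket, reading it touches only about 1/32 of the data instead of scanning the full table.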

  7. Metadata • Database namespace • Table definitions: schema info, physical location in HDFS • Partition data • ORM framework • By default, all metadata is stored in Derby • Any database with a JDBC driver can be configured instead

  8. Architecture (diagram; labels: Map Reduce, HDFS) Source: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

  9. Performance • GROUP BY operation • Efficient execution plans are based on: • Data skew: how evenly the data is distributed across the physical nodes (bottleneck vs. load balance) • Partial aggregation: group rows with the same GROUP BY value as early as possible, using an in-memory hash table in the mapper, earlier than the combiner
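
A rough Python sketch of map-side partial aggregation for a count follows. The function names and the `max_entries` flush threshold are assumptions made for illustration; the sketch only shows the idea of collapsing rows in a hash table before anything leaves the mapper.

```python
from collections import defaultdict

def map_with_partial_aggregation(rows, key_col, max_entries=1000):
    # Keep running counts in an in-memory hash table instead of emitting
    # one (key, 1) pair per input row.
    table = defaultdict(int)
    for row in rows:
        table[row[key_col]] += 1
        if len(table) >= max_entries:
            # Flush when the table reaches max_entries (a hypothetical
            # memory bound), then start over with an empty table.
            yield from table.items()
            table.clear()
    yield from table.items()

def reduce_counts(partials):
    # Final aggregation: merge the partial counts from all mappers.
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)

# Two hypothetical mapper inputs.
mapper1 = [{"country": "US"}, {"country": "US"}, {"country": "FR"}]
mapper2 = [{"country": "US"}]
partials = (list(map_with_partial_aggregation(mapper1, "country"))
            + list(map_with_partial_aggregation(mapper2, "country")))
print(partials)
print(reduce_counts(partials))
```

Here four input rows shrink to three partial results before the shuffle; with heavily repeated keys the reduction is much larger.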

  10. Performance • JOIN operation • Traditional map-reduce join • Early map-side join: very efficient for joining a small table with a large table; keep the smaller table's data in memory first, then join it with one chunk of the larger table at a time; trades space for time 7/20/2010
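
The map-side join described above can be sketched in Python as an in-memory hash join. This is a simplified illustration under stated assumptions: tables are lists of dictionaries, the large table arrives as a list of chunks, and the join is an inner equi-join on one column. The function and sample table names are hypothetical.

```python
def map_side_join(small_table, large_table_chunks, key):
    # Build an in-memory hash table from the small table (the space cost).
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large table chunk by chunk; each chunk is joined locally,
    # so no shuffle or reduce phase is needed for the join itself.
    for chunk in large_table_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}

# Hypothetical data: a small users table and a large clicks table in chunks.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
click_chunks = [
    [{"uid": 1, "page": "home"}],
    [{"uid": 2, "page": "home"}, {"uid": 3, "page": "faq"}],
]
joined = list(map_side_join(users, click_chunks, "uid"))
print(joined)
```

The click with `uid` 3 has no match in the small table and is dropped, as expected for an inner join.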

  11. Performance • SerDe (serializer/deserializer): describes how to load data from a file into a representation that makes it look like a table • Lazy loading: create a field object only when it is needed; this avoids the overhead of creating unnecessary objects in Hive, since object creation is expensive in Java, and so improves performance
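
The lazy-loading idea can be shown with a short Python sketch. The `LazyRow` class is hypothetical and simplified: it splits the raw line into strings up front and defers only the type conversion, whereas a real lazy deserializer would likely defer more of the work.

```python
class LazyRow:
    """Wraps one raw text line and converts a field only when it is read."""

    def __init__(self, raw_line, column_types, delimiter="\t"):
        # Simplification: splitting happens eagerly; only the typed field
        # objects are created on demand.
        self._raw_fields = raw_line.split(delimiter)
        self._types = column_types
        self._cache = {}
        self.conversions = 0  # how many typed field objects were created

    def get(self, index):
        # Convert the field on first access and reuse the object afterwards.
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw_fields[index])
            self.conversions += 1
        return self._cache[index]

row = LazyRow("42\tthe\t3.5", [int, str, float])
print(row.get(0))        # 42
print(row.get(0))        # 42, served from the cache
print(row.conversions)   # 1
```

A query that reads only the first column never pays for converting the other two.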

  12. Hive – Performance • QueryA: SELECT count(1) FROM t; • QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t; • QueryC: SELECT * FROM t; • Map-side time only (including GzipCodec compression/decompression) • * These two features need to be tested with other queries. http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

  13. Pros • An easy way to process large-scale data • Supports SQL-based queries • Provides more user-defined interfaces to extend • Programmability • Efficient execution plans for performance • Interoperability with other database tools

  14. Cons • No easy way to append data: files in HDFS are immutable • Future work: views / variables; more operators; IN/EXISTS semantics; more future work is discussed on the mailing list

  15. Application • Log processing • Daily report • User activity measurement • Data/text mining • Machine learning (training data) • Business intelligence • Advertising delivery • Spam detection

  16. Related Work • Parallel databases: Gamma, Bubba, Volcano • Google: Sawzall • Yahoo: Pig • IBM: JAQL • Microsoft: DryadLINQ, SCOPE

  17. Reference • [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009. • [2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs • [3] Cloudera: http://www.cloudera.com/videos/introduction_to_hive • [4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation

  18. Q & A • Thank you

  19. Back up

  20. Hive Components • Shell interface: like the MySQL shell • Driver: session handles, fetch, execution • Compiler: parse, plan, optimize • Execution engine: DAG of stages; runs map or reduce
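
One way to picture the execution engine's "DAG of stages" is a runner that executes each stage only after the stages it depends on. The Python sketch below is an assumption about the general idea, not Hive's actual engine; the stage names and the `run_plan` helper are hypothetical, and each stage is a plain function rather than a map or reduce job.

```python
from graphlib import TopologicalSorter

def run_plan(stages, dependencies):
    # stages: stage name -> callable that receives the results so far.
    # dependencies: stage name -> set of stage names that must finish first.
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = stages[name](results)
    return results

# Hypothetical three-stage plan: scan, then filter, then count.
stages = {
    "scan": lambda done: [3, 1, 2],
    "filter": lambda done: [x for x in done["scan"] if x > 1],
    "count": lambda done: len(done["filter"]),
}
dependencies = {"scan": set(), "filter": {"scan"}, "count": {"filter"}}
results = run_plan(stages, dependencies)
print(results["count"])  # 2
```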

  21. Motivation • MapReduce motivation • Data processing: > 1 TB • Massively parallel • Locality • Fault tolerant

  22. Hive Usage • hive> show tables; • hive> create table SHAKESPEARE (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile; • hive> load data inpath 'shakespeare_freq' into table shakespeare;

  23. Hive Usage • hive> load data inpath 'shakespeare_freq' into table shakespeare; • hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;

  24. Hive Usage @ Facebook • Statistics per day: 4 TB of compressed new data added per day; 135 TB of compressed data scanned per day; 7500+ Hive jobs per day • Hive simplifies Hadoop: ~200 people/month run jobs on Hadoop/Hive; analysts (non-engineers) use Hadoop through Hive; 95% of jobs are Hive jobs http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
