
Processing Data using Amazon Elastic MapReduce and Apache Hive


Presentation Transcript


  1. Processing Data using Amazon Elastic MapReduce and Apache Hive Team Members: Frank Paladino, Aravind Yeluripiti

  2. Project Goals • Creating an Amazon Elastic MapReduce cluster (running on Amazon EC2) to process queries against large data sets using Hive • Choosing two public data sets available on Amazon Web Services • Running queries against these data sets • Comparing HiveQL and SQL

  3. Definitions • Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. • Amazon Elastic MapReduce is a web service which provides easy access to a Hadoop MapReduce cluster in the cloud. • Apache Hive is an open source data warehouse built on top of Hadoop which provides a SQL-like interface over MapReduce.

  4. Starting an interactive Hive session • Register for an AWS account • Create a key pair in the EC2 console • Launch an EMR job flow that starts an interactive Hive session using the key pair • Use the master node's public DNS name and the key pair to SSH into the master node of the EMR cluster as the user “hadoop” • Start Hive

  5. Data set 1: DISASTERS WORLDWIDE FROM 1900-2008 • Available at: http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008 • Info: disaster data from 1900-2008, organized by start and end date, country (and sub-location), disaster type (and sub-type), disaster name, cost, and persons killed and affected by the disaster • Details • Number of rows: 17,829 • Size: 451 kB (compressed), 1.5 MB (uncompressed)

  6. Create table • Statement
  CREATE EXTERNAL TABLE emdata (
    start string,
    ende string,
    country string,
    locatione string,
    type string,
    subtype string,
    name string,
    killed string,
    cost string,
    affected string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://cmpt592-assignment5/input/';
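The queries on the following slides use count(distinct id), but no id column appears in the definition above, so the raw file presumably carries a per-disaster identifier as an extra tab-separated field. A minimal sketch of what the DDL would look like with that assumed column (the name id and its position as the first field are assumptions, not part of the original slides):

  -- Sketch only: assumes an extra identifier field in the raw data.
  CREATE EXTERNAL TABLE emdata (
    id string,        -- assumed disaster identifier referenced by count(distinct id) below
    start string,
    ende string,      -- renamed, presumably to avoid the Hive keyword END
    country string,
    locatione string, -- likewise avoids the keyword LOCATION
    type string,
    subtype string,
    name string,
    killed string,
    cost string,
    affected string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://cmpt592-assignment5/input/';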

  7. Q1: • Get the total number of disasters that occurred in each country • Query: SELECT count(distinct id), country FROM emdata GROUP BY country;

  8. Q1: • Output:

  9. Q2: • Get the number and type of disasters in a given country • Query: SELECT count(distinct id), type FROM emdata WHERE country='Afghanistan' GROUP BY type;

  10. Q2: • Output:

  11. Q3: • Get the number, type, and subtype of disasters in a given country • Query: SELECT count(distinct id), type, subtype FROM emdata WHERE country='Afghanistan' GROUP BY type, subtype;

  12. Q3: • Output:

  13. Q4: • Get the total casualties and type of disasters in a given country only when casualties > 100 • Query: SELECT sum(killed), type FROM emdata WHERE country='Afghanistan' and killed>100 GROUP BY type;

  14. Q4: • Output:

  15. Q5: • Get the total casualties and name of country for a certain type of disaster when casualties > 500 • Query: SELECT sum(killed), country FROM emdata WHERE type='Flood' and killed>500 GROUP BY country;

  16. Q5: • Output:

  17. Q6: • Get the total cost and name of country for a certain type of disaster when cost > 1,000,000 • Query: SELECT sum(cost), country FROM emdata WHERE type='Flood' and cost>1000000 GROUP BY country;

  18. Q6: • Output:

  19. Data set 2: Google Books n-grams • Available at: s3://datasets.elasticmapreduce/ngrams/books/ • Details • Size: 2.2 TB • Source: Google Books • Created On: January 5, 2011 6:11 PM GMT • Last Updated: January 21, 2012 2:12 AM GMT • Processing n-gram data using Amazon Elastic MapReduce and Apache Hive • Calculating the top trending topics per decade • http://aws.amazon.com/articles/5249664154115844

  20. Data set 2: Google Books n-grams • N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. • The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token. For example, the sentence • The yellow dog played fetch. • would produce the following 2-grams:  ["The", "yellow"] ["yellow", "dog"] ["dog", "played"] ["played", "fetch"] ["fetch", "."]  or the following 3-grams: ["The", "yellow", "dog"] ["yellow", "dog", "played"] ["dog", "played", "fetch"] ["played", "fetch", "."]

  21. Data set 2: Google Books n-grams • Two settings to efficiently process data from S3 • hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat; • hive> set mapred.min.split.size=134217728; • These tell Hive to use an InputFormat that will split a file into pieces for processing, and not to split those pieces any smaller than 128 MB.

  22. Creating input table • In order to process any data, we first define the source of the data • Data set used: English 1-grams • Rows: 472,764,897 • Compressed size: 4.8 GB

  23. Creating input table • Statement to define data
  CREATE EXTERNAL TABLE english_1grams (
    gram string,
    year int,
    occurrences bigint,
    pages bigint,
    books bigint
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS SEQUENCEFILE
  LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/';

  24. Creating input table • Output

  25. Count the number of rows • Statement SELECT count(*) FROM english_1grams; • Output

  26. Normalizing the data • The data in its current form is very raw. • It contains punctuation, numbers (typically years), and is case sensitive. • First we want to create a table to store the results of the normalization. In the process, we can also drop unnecessary columns. • Statement CREATE TABLE normalized ( gram string, year int, occurrences bigint );

  27. Normalizing the data • Output

  28. Inserting data into normalized table • Read raw data and insert into normalized table • Statement INSERT OVERWRITE TABLE normalized SELECT lower(gram), year, occurrences FROM english_1grams WHERE year >= 1890 AND gram REGEXP "^[A-Za-z+'-]+$";

  29. Inserting data into normalized table • Output

  30. Finding word ratio by decade • More books are printed over time, • so every word tends to have more occurrences in later decades. • We only care about the relative usage of a word over time, • so we want to ignore the growth in the size of the corpus. • This can be done by dividing the number of occurrences of the word by the total number of occurrences of all words in the same decade, as sketched below.
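The denominator of that ratio is just the total number of word occurrences per decade. A minimal sketch of that aggregation over the normalized table, with the decade key taken from the first three characters of the year (this is the same inner query that reappears on slide 32):

  -- Sketch of the denominator: total word occurrences per decade.
  -- substr('1957', 0, 3) = '195', so all years 1950-1959 share one decade key.
  SELECT substr(year, 0, 3) AS decade,
         sum(occurrences)   AS total
  FROM normalized
  GROUP BY substr(year, 0, 3);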

  31. Finding word ratio by decade • Create a table to store this data • Statement CREATE TABLE by_decade ( gram string, decade int, ratio double ); • Calculate the total number of word occurrences by decade. Then • Join this data with the normalized table in order to calculate the usage ratio.

  32. Finding word ratio by decade • Statement
  INSERT OVERWRITE TABLE by_decade
  SELECT a.gram, b.decade, sum(a.occurrences) / b.total
  FROM normalized a
  JOIN (
    SELECT substr(year, 0, 3) as decade, sum(occurrences) as total
    FROM normalized
    GROUP BY substr(year, 0, 3)
  ) b
  ON substr(a.year, 0, 3) = b.decade
  GROUP BY a.gram, b.decade, b.total;

  33. Finding word ratio by decade • Output

  34. Calculating changes per decade • With the dataset normalized by decade, we can calculate changes from one decade to the next. • This can be achieved by joining the dataset with itself, • matching rows where the n-grams are equal and the decades differ by one. • This lets us compare ratios for a given n-gram from one decade to the next.

  35. Calculating changes per decade • Statement
  SELECT a.gram as gram,
         a.decade as decade,
         a.ratio as ratio,
         a.ratio / b.ratio as increase
  FROM by_decade a
  JOIN by_decade b
  ON a.gram = b.gram and a.decade - 1 = b.decade
  WHERE a.ratio > 0.000001 and a.decade >= 190
  DISTRIBUTE BY decade
  SORT BY decade ASC, increase DESC;

  36. Calculating changes per decade • Output

  37. Final output

  38. Hive (Hadoop) vs. SQL (RDBMS) • Hive (Hadoop) • Intended to scale to hundreds or thousands of machines. • Optimized for full table scans; jobs incur substantial overhead in submission and scheduling. • Best for data transformation, summarization, and analysis of large volumes of data, but not appropriate for applications requiring fast query response times. • Read-oriented, and therefore not appropriate for transaction processing that requires frequent writes. • SQL (RDBMS) • Typically runs on a single large machine and provides no support for executing map and reduce functions over the tables. • Best when referential integrity is required and frequent small updates are performed. • Tables are indexed and cached, so small amounts of data can be retrieved very quickly.
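As a small illustration of the last two points, the same lookup is valid in both HiveQL and SQL; only the execution differs. Assuming a copy of the emdata table in an RDBMS with an index on country (an assumption for illustration, not part of the project), the database can answer from the index, while Hive schedules a MapReduce job and scans the full table:

  -- Identical syntax in HiveQL and SQL; only the execution model differs.
  -- RDBMS: index seek on country, returns in milliseconds.
  -- Hive:  launches a MapReduce job and scans every row of emdata.
  SELECT type, killed
  FROM emdata
  WHERE country = 'Afghanistan';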

  39. References • Disasters worldwide from 1900-2008 dataset • URL: http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008 • Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce • URL: http://aws.amazon.com/articles/5249664154115844 • Google Books Ngrams • URL: http://aws.amazon.com/datasets/8172056142375670 • Apache Hadoop • URL: http://hadoop.apache.org/ • Amazon Elastic Compute Cloud • URL: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html • Getting Started Guide • URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html

  40. Acknowledgements • Frank Paladino

  41. Acknowledgements • Dr. Aparna Varde
