490 likes | 727 Vues
Processing Data using Amazon Elastic MapReduce and Apache Hive. Team Members Frank Paladino Aravind Yeluripiti. Project Goals. Creating an Amazon EC2 instance to process queries against large datasets using Hive. Choosing two public data sets available on Amazon
E N D
Processing Data using Amazon Elastic MapReduce and Apache Hive Team Members Frank Paladino AravindYeluripiti
Project Goals • Creating an Amazon EC2 instance to process queries against large datasets using Hive. • Choosing two public data sets available on Amazon • Running queries against these datasets • Comparing HiveQL and SQL
Definitions • HadoopMapReduceis a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. • Amazon Elastic MapReduce is a web service which provides easy access to a HadoopMapReduce cluster in the cloud. • Apache Hive is an open source data warehouse built on top of Hadoop which provides a SQL-like interface over MapReduce.
Starting an interactive hive session • Register for an AWS account • Create a key pair in EC2 console • Launch an EMR job flow to start an interactive Hive session using the key pair • Use the domain name and key pair to SSH into the master node of the Amazon EC2 cluster as the user “hadoop” • Start hive
Data set 1: DISASTERS WORLDWIDE FROM 1900-2008 • Available at: • http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008 • Info • Disaster data from 1900 – 2008, • organized by • start and end date, • country (and sub-location), • disaster type (and sub-type), • disaster name, cost, and persons killed and affected by the disaster. • Details • Number of rows: 17829 • Size: 451 kB (Compressed) , 1.5 MB (uncompressed)
Create table CREATE EXTERNAL TABLE emdata ( start string, ende string, country string, locatione string, type string, subtype string, name string, killed string, cost string, affected string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://cmpt592-assignment5/input/';
Q1: • Get the total number of disasters occurred in each country • Query: SELECT count(distinct id), country FROM emdata GROUP BY country;
Q1: • Output:
Q2: • Get the number and type of disasters in a given country • Query: SELECT count(distinct id), type FROM emdata WHERE country='Afghanistan' GROUP BY type;
Q2: • Output:
Q3: • Get the number, type, and subtype of disasters in a given country • Query: SELECT count(distinct id), type, subtype FROM emdata WHERE country='Afghanistan' GROUP BY type, subtype;
Q3: • Output:
Q4: • Get the total casualties and type of disasters in a given country only when casualties > 100 • Query: SELECT sum(killed), type FROM emdata WHERE country='Afghanistan' and killed>100 GROUP BY type;
Q4: • Output:
Q5: • Get the total casualties and name of country for a certain type of disaster when casualties > 500 • Query: SELECT sum(killed), country FROM emdata WHERE type='Flood' and killed>500 GROUP BY country;
Q3: • Output:
Q6: • Get the total cost and name of country for a certain type of disaster when cost > 1 000 000 • Query: SELECT sum(cost), country FROM emdata WHERE type='Flood' and cost>1000000 GROUP BY country;
Q3: • Output:
Data set 2: Google Books n-grams • Available at: s3://datasets.elasticmapreduce/ngrams/books/ • Details • Size:2.2 TB • Source: Google Books • Created On: January 5, 2011 6:11 PM GMT • Last Updated: January 21, 2012 2:12 AM GMT • Processing n-gram data using Amazon Elastic MapReduce and Apache Hive. • Calculating the top trending topics per decade • http://aws.amazon.com/articles/5249664154115844
Data set 1: Google Books n-grams • N-grams are fixed size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. • The n grams in this dataset were produced by passing a sliding window of the text of books and outputting a record for each new token. For example, the following sentence. • The yellow dog played fetch. • Would produce the following 2-grams: ["The", "yellow"] ["yellow", 'dog"] ["dog", "played"] ["played", "fetch"] ["fetch", "."] Or the following 3-grams: ["The", "yellow", "dog"] ["yellow", "dog", "played"] ["dog", "played", "fetch"] ["played", "fetch", "."]
Data set 1: Google Books n-grams • Two settings to efficiently process data from S3 • hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat; • hive> set mapred.min.split.size=134217728; • Used to tell • Hive to use an InputFormat that will split a file into pieces for processing, and • Tell it not to split them into pieces any smaller than 128 MB.
Creating input table • In order to process any data • Define the source of data • Data set used – English 1-grams • Rows : 472,764,897 • Compressed Size : 4.8 GB
Creating input table • Statement to define data CREATE EXTERNAL TABLE english_1grams ( gram string, year int, occurrences bigint, pages bigint, books bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/';
Creating input table • Output
Count the number of rows • Statement Select count(*) from english_1grams; • Output
Normalizing the data • The data in its current form is very raw. • It contains punctuation, numbers (typically years), and is case sensitive. • First we want to create a table to store the results of the normalization. In the process, we can also drop unnecessary columns. • Statement CREATE TABLE normalized ( gram string, year int, occurrences bigint );
Normalizing the data • Output
Inserting data into normalized table • Read raw data and insert into normalized table • Statement INSERT OVERWRITE TABLE normalized SELECT lower(gram), year, occurrences FROM english_1grams WHERE year >= 1890 AND gram REGEXP "^[A-Za-z+'-]+$";
Inserting data into normalized table • Output
Finding word ratio by decade • More books are printed over time, • so every word has a tendency to have more occurrences in later decades. • We only care about the relative usage of that word over time, • so we want to ignore the change in size of corpus. • This can be done by finding the ratio of occurrences of this word over the total number of occurrences of all words.
Finding word ratio by decade • Create a table to store this data • Statement CREATE TABLE by_decade ( gram string, decade int, ratio double ); • Calculate the total number of word occurrences by decade. Then • Join this data with the normalized table in order to calculate the usage ratio.
Finding word ratio by decade • Statement INSERT OVERWRITE TABLE by_decade SELECT a.gram, b.decade, sum(a.occurrences) / b.total FROM normalized a JOIN ( SELECT substr(year, 0, 3) as decade, sum(occurrences) as total FROM normalized GROUP BY substr(year, 0, 3) ) b ON substr(a.year, 0, 3) = b.decade GROUP BY a.gram, b.decade, b.total;
Finding word ratio by decade • Output
Calculating changes per decade • With a normalized dataset by decade we can get down to calculating changes by decade. • This can be achieved by joining the dataset on itself. • We'll want to join rows where the n-grams are equal and the decade is off by one. • This lets us compare ratios for a given n-gram from one decade to the next.
Calculating changes per decade • Statement SELECT a.gram as gram, a.decade as decade, a.ratio as ratio, a.ratio / b.ratio as increase FROM by_decade a JOIN by_decade b ON a.gram = b.gram and a.decade - 1 = b.decade WHERE a.ratio > 0.000001 and a.decade >= 190 DISTRIBUTE BY decade SORT BY decade ASC, increase DESC;
Calculating changes per decade • Output
Hive (Hadoop) vs. SQL (RDBMS) • Hive (Hadoop) • Intended for scaling to hundred and thousands of machines. • Optimized for full table scans and jobs incur substantial overhead in submission and scheduling. • Best for data transformation, summarization, and analysis of large volumes of data but not appropriate for applications requiring fast query response times. • Read based and therefore not appropriate for transaction processing requiring write operations. • SQL (RDBMS) • Typically run on a single large machine and do not provide support for executing map and reduce functions on the tables. • RDBMS systems are best for when referential integrity are required and frequent small updates are performed. • Tables are indexed and cached so small amounts of data can be retrieved very quickly.
References • Disasters worldwide from 1900-2008 dataset • URL: http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008 • Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce • URL: http://aws.amazon.com/articles/5249664154115844 • Google Books Ngrams • URL:http://aws.amazon.com/datasets/8172056142375670 • Apache Hadoop • URL: http://hadoop.apache.org/ • Amazon Elastic Compute Cloud • URL: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html • Getting started Guide • URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html
Acknowledgements • Frank Paladino
Acknowledgements • Dr AparnaVarde