
Hadoop and Amazon Web Services


Presentation Transcript


  1. Hadoop and Amazon Web Services Ken Krugler

  2. Overview

  3. Welcome • I’m Ken Krugler • Using Hadoop since The Dark Ages (2006) • Apache Tika committer • Active developer and trainer • Using Hadoop with AWS for… • Large scale web crawling • Machine learning/NLP • ETL/Solr indexing

  4. Course Overview • Assumes you know the basics of Hadoop • Focus is on how to use Elastic MapReduce • From n00b to knowledgeable in 10 modules: Getting Started, Running Jobs, Clusters of Servers, Dealing with Data, Wikipedia Lab, Command Line Tools, Debugging Tips, Hive and Pig, Hive Lab, Advanced Topics

  5. Why Use Elastic MapReduce? • Reduce hardware & OPS/IT personnel costs • Pay for what you actually use • Don’t pay for people you don’t need • Don’t pay for capacity you don’t need • More agility, less wait time for hardware • Don’t waste time buying/racking/configuring servers • Many server classes to choose from (micro to massive) • Less time doing Hadoop deployment & version mgmt • Optimized Hadoop is pre-installed

  6. Getting Started

  7. 30 Seconds of Terminology • AWS – Amazon Web Services • S3 – Simple Storage Service • EC2 – Elastic Compute Cloud • EMR – Elastic MapReduce

  8. The Three Faces of AWS • Three ways to interact with AWS • Via web browser – the AWS Console • Via command line tools – e.g. “elastic-mapreduce” CLI • Via the AWS API – Java, Python, Ruby, etc. • We’re using the AWS Console for the intro • The “Command Line Tools” module is later • Details of CLI & API found in online documentation • http://aws.amazon.com/documentation/elasticmapreduce/

  9. Getting an Amazon Account • All AWS services require an account • Signing up is simple • Email address/password • Requires credit card, to pay for services • Uses phone number to validate account • End result is an Amazon account • Has an account ID (looks like xxxx-yyyy-zzzz) • Let’s go get us an account • Go to http://aws.amazon.com • Click the “Sign Up Now” button

  10. Credentials • You have an account with a password • This account has: • An account name (AWS Test) • An account id (8310-5790-6469) • An access key id (AKIAID4SOXLXJSFNG6SA) • A secret access key (jXw5qhiBrF…) • A canonical user id (10d8c2962138…) • Let’s go look at our account settings… • http://console.aws.amazon.com • Select “Security Credentials” from account menu
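
The access key id and secret access key are what any programmatic client uses to sign requests. Purely as an illustration (the deck itself sticks to the AWS Console), here is a minimal sketch of plugging those credentials into boto3, the current Python SDK; the region is an assumption and the secret key stays truncated as on the slide.

```python
# Minimal sketch, not part of the deck: the same credentials shown in the
# console drive programmatic access, here via boto3 (pip install boto3).
import boto3

session = boto3.Session(
    aws_access_key_id="AKIAID4SOXLXJSFNG6SA",   # access key id from the console
    aws_secret_access_key="jXw5qhiBrF...",      # secret access key (keep this private)
    region_name="us-east-1",                    # assumed region
)
s3 = session.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])   # sanity check
```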

  11. Getting an EC2 Key Pair • Go to https://console.aws.amazon.com/ec2 • Click on the “Key Pairs” link at the bottom-left • Click on the “Create Key Pair” button • Enter a simple, short name for the key pair • Click the “Create” button • Let’s go make us a key pair…
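
For reference only (the deck does this in the console), the same key pair could be created through the API. A hedged boto3 sketch; the region is an assumption and "aws-test" is the name used later in the deck.

```python
# Sketch only: the deck creates the key pair in the EC2 console.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption
resp = ec2.create_key_pair(KeyName="aws-test")
with open("aws-test.pem", "w") as f:
    f.write(resp["KeyMaterial"])   # save it now; AWS never returns the private key again
```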

  12. Amazon S3 Bucket • EMR saves data to S3 • Hadoop job results • Hadoop job log files • S3 data is organized as paths to files in a “bucket” • You need to create a bucket before running a job • Let’s go do that now…
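
The console is the path the deck takes; for completeness, the same bucket can be created with one API call. A sketch assuming boto3 and the us-east-1 region (bucket names must be globally unique, so "aws-test-kk" is just the deck's example).

```python
# Sketch: API equivalent of creating the bucket in the AWS Console.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="aws-test-kk")   # us-east-1 needs no LocationConstraint
```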

  13. Summary • At this point we are ready to run Hadoop jobs • We have an AWS account - 8310-5790-6469 • We created a key pair – aws-test • We created an S3 bucket – aws-test-kk • In the next module we’ll run a custom Hadoop job

  14. Running a Hadoop Job

  15. Overview of Running a Job • Upload job jar & input data to S3 • Create a new Job Flow • Wait for completion, examine results

  16. Setting Up the S3 Bucket • One bucket can hold all the elements for a job • Hadoop job jar – aws-test-kk/job/wikipedia-ngrams.jar • Input data – aws-test-kk/data/enwiki-split.xml • Results – aws-test-kk/results/ • Logs – aws-test-kk/logs/ • We can use the AWS Console to create directories • And upload files too • Let’s go set up the bucket now…
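
The deck uploads these through the AWS Console; the API equivalent is a couple of upload_file calls. A boto3 sketch, assuming local copies of the jar and the data file (local file names are assumptions; the S3 keys match the layout above).

```python
# Sketch: uploading the job jar and input data instead of using the console.
import boto3

s3 = boto3.client("s3")
bucket = "aws-test-kk"
s3.upload_file("wikipedia-ngrams.jar", bucket, "job/wikipedia-ngrams.jar")
s3.upload_file("enwiki-split.xml", bucket, "data/enwiki-split.xml")
# results/ and logs/ need no setup: S3 has no real directories, only key prefixes
```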

  17. Creating the Job Flow • A Job Flow has many settings: • A user-friendly name • The type of the job (custom jar, streaming, Hive, Pig) • The type and number of servers • The key pair to use • Where to put log files • And a few other less common settings • Let’s go create a job flow…
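
The deck defines the Job Flow in the AWS Console; the sketch below expresses roughly the same settings through boto3's run_job_flow. The release label, instance type, and IAM roles are assumptions (boto3 and current EMR releases postdate this deck, and the m1 types are retired); only the -outputdir argument shown on a later slide is passed to the jar, any other arguments it needs are in the lab's README.

```python
# Sketch, not the deck's method: the console Job Flow expressed via boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
resp = emr.run_job_flow(
    Name="wikipedia-ngrams",                      # user-friendly name
    LogUri="s3://aws-test-kk/logs/",              # where to put log files
    ReleaseLabel="emr-5.36.0",                    # assumption; the deck predates release labels
    Instances={
        "MasterInstanceType": "m5.xlarge",        # the deck's m1 types are retired
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                       # 1 master + 2 slaves
        "Ec2KeyName": "aws-test",                 # the key pair created earlier
        "KeepJobFlowAliveWhenNoSteps": False,     # tear the cluster down after the step
    },
    Steps=[{
        "Name": "wikipedia-ngrams step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://aws-test-kk/job/wikipedia-ngrams.jar",
            # plus whatever input arguments the lab README specifies
            "Args": ["-outputdir", "s3n://aws-test-kk/results/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",            # default roles, created once with
    ServiceRole="EMR_DefaultRole",                #   `aws emr create-default-roles`
)
print("Job flow id:", resp["JobFlowId"])
```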

  18. Monitoring a Job • AWS Console displays information about the job • State – starting, running, shutting down • Elapsed time – duration • Normalized Instance Hours – cost • You can also terminate a job • Let’s go watch our job run…
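
The same state and normalized instance hours the console shows can also be read back through the API. A boto3 sketch; the j-… cluster id is a placeholder for whatever your Job Flow reports.

```python
# Sketch: polling the job the way the AWS Console does, via boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"                    # placeholder id from run_job_flow/console
cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
print(cluster["Status"]["State"])                 # e.g. STARTING, RUNNING, TERMINATED
print(cluster["NormalizedInstanceHours"])         # the cost-related figure in the console

# API equivalent of the console's Terminate button:
# emr.terminate_job_flows(JobFlowIds=[cluster_id])
```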

  19. Viewing Job Results • My job puts its results into S3 (-outputdir s3n://xxx) • The Hadoop cluster “goes away” at end of job • So anything in HDFS will be tossed • Persistent Job Flow doesn’t have this issue • Hadoop writes job log files to S3 • Using location specified for job (aws-test-kk/logs/) • Let’s go look at the job results…
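
Since the results and logs land in S3, they can be listed and pulled down without touching the now-terminated cluster. A boto3 sketch; the part-file name is an assumption about typical Hadoop output naming.

```python
# Sketch: browsing the job's S3 output programmatically instead of in the console.
import boto3

s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="aws-test-kk", Prefix="results/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# e.g. download one reducer output file (name is an assumption):
# s3.download_file("aws-test-kk", "results/part-r-00000", "part-r-00000")
```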

  20. Summary • Jobs can be defined using the AWS Console • Code and input data are loaded from S3 • Results and log files are saved back to S3 • In the next module we’ll explore server options

  21. Clusters of Servers

  22. Servers for Clusters in EMR • Based on EC2 instance type options • Currently eleven to choose from • See http://aws.amazon.com/ec2/instance-types/ • Each instance type has a regular name and an API name • E.g. “Small (m1.small)” • Each instance type has five attributes, including… • Memory • CPUs • Local storage

  23. Server Details • Uses Xen virtualization • So sometimes a server “slows down” • Currently m1.large uses: • Linux version 2.6.21.7-2.fc8xen • Debian 5.0.8 • CPU has X virtual cores and Y “EC2 Compute Units” • 1 compute unit ≈ 1GHz Xeon processor (circa 2007) • E.g. 6.5 EC2 Compute Units • (2 virtual cores with 3.25 EC2 Compute Units each)

  24. Pricing • Instance types have per-hour cost • Price is combination of EC2 base cost + EMR extra • http://aws.amazon.com/elasticmapreduce/pricing/ • Some typical combined prices • Small $0.10/hour • Large $0.40/hour • Extra Large $0.80/hour • Spot pricing is based on demand

  25. The Large (m1.large) Instance Type • Key attributes • 7.5GB memory • 2 virtual cores • 850GB local disk (2 drives) • 64-bit platform • Default Hadoop configuration • 4 mappers, 2 reducers • 1600MB child JVM size • 200MB sort buffer (io.sort.mb) • Let’s go look at the server…

  26. Typical Configurations • Use m1.small for the master • NameNode & JobTracker don’t need lots of horsepower • Up to 50 slaves, otherwise bump to m1.large • Use m1.large for slaves - ‘balanced’ jobs • Reasonable CPU, disk space, I/O performance • Use m1.small for slaves – external bottlenecks • E.g. web crawling, since most time spent waiting • Slow disk I/O performance, slow CPU

  27. Cluster Compute Instances • Lots of cores, faster network • 10 Gigabit Ethernet • Good for jobs with… • Lots of CPU cycles – parsing, NLP, machine learning • Lots of map-to-reduce data – many groupings • Cluster Compute Eight Extra Large Instance • 60GB memory • 8 real cores (88 EC2 Compute Units) • 3.3TB disk

  28. Dealing with Data

  29. Data Sources & Sinks • S3 – Simple Storage Service • Primary source of data • Other AWS Services • SimpleDB, DynamoDB • Relational Database Service (RDS) • Elastic Block Store (EBS) • External via APIs • HTTP (web crawling) is most common

  30. S3 Basics • Data stored as objects (files) in buckets • <bucket>/<path> • “key” to file is path • No real directories, just path segments • Great as persistent storage for data • Reliable – designed for 99.999999999% durability • Scalable – up to petabytes of data • Fast – highly parallel requests

  31. S3 Access • Via HTTP REST interface • Create (PUT/POST), Read (GET), Delete (DELETE) • Java API/tools use this same API • Various command line tools • s3cmd – two different versions exist • Or via your web browser
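
Those REST verbs map one for one onto SDK calls. A boto3 sketch (the Java API on the slide wraps the same HTTP interface), using the deck's bucket and a throwaway key:

```python
# Sketch: the PUT / GET / DELETE operations above, issued through boto3.
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="aws-test-kk", Key="demo/hello.txt", Body=b"hello")         # Create (PUT)
data = s3.get_object(Bucket="aws-test-kk", Key="demo/hello.txt")["Body"].read()  # Read (GET)
s3.delete_object(Bucket="aws-test-kk", Key="demo/hello.txt")                     # Delete (DELETE)
```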

  32. S3 Access via Browser • Browser-based • AWS Management Console • S3Fox Organizer – Firefox plug-in • Let’s try out the browser-based solutions…

  33. S3 Buckets • Name of the bucket… • Must be unique across ALL users • Should be DNS-compliant • General limitations • 100 buckets per account • Can’t be nested – no buckets in buckets • Not limited by • Number of files/bucket • Total data stored in bucket’s files

  34. S3 Files • Every file (aka object) • Lives in a bucket • Has a path which acts as the file’s “key” • Is identified via bucket + path • General limitations • Can’t be modified (no random write or append) • Max size of 5TB (5GB per upload request)

  35. Fun with S3 Paths • AWS Console uses <bucket>/<path> • For specifying location of job jar • AWS Console uses s3n://<bucket>/<path> • For specifying location of log files • The s3cmd tool uses s3://<bucket>/<path>

  36. S3 Pricing • Varies by region – numbers below are “US Standard” • Data in is (currently) free • Data out is also free within same region • Otherwise starts at $0.12/GB, drops w/volume • Per-request cost varies, based on type of request • E.g. $0.01 per 10K GET requests • Storage cost is per GB-month • Starts at $0.140/GB, drops w/volume
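
To make the numbers concrete, a rough back-of-the-envelope using only the prices quoted on this slide (they change over time, and the request volume is an invented example):

```python
# Rough worked example with the slide's "US Standard" prices (they do change).
storage = 500 * 0.140                    # 500GB stored for a month      -> $70.00
transfer = 100 * 0.12                    # 100GB out, cross-region       -> $12.00
requests = (1_000_000 / 10_000) * 0.01   # 1M GETs at $0.01 per 10K GETs -> $1.00
print(storage, transfer, requests)
```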

  37. S3 Access Control List (ACL) • Read/Write permissions on per-bucket basis • Read == listing objects in bucket • Write == create/overwrite/delete objects in bucket • Read permissions on per-object (file) basis • Read == read object data & metadata • Also read/write ACP permissions on bucket/object • Reading & writing ACL for bucket or object • FULL_CONTROL means all valid permissions

  38. S3 ACL Grantee • Who has what ACLs for each bucket/object? • Can be individual user • Based on Canonical user ID • Can be “looked up” via account’s email address • Can be a pre-defined group • Authenticated Users – any AWS user • All Users – anybody, with or without authentication • Let’s go look at some bucket & file ACLs…
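
Reading and granting these permissions is also scriptable. A boto3 sketch using the deck's bucket and data file; the canned "public-read" ACL is the API's shorthand for granting read to the All Users group.

```python
# Sketch: inspecting and setting ACLs via boto3 rather than the console.
import boto3

s3 = boto3.client("s3")
print(s3.get_bucket_acl(Bucket="aws-test-kk")["Grants"])   # bucket-level grants
print(s3.get_object_acl(Bucket="aws-test-kk",
                        Key="data/enwiki-split.xml")["Grants"])

# Grant read on one object to the All Users group via a canned ACL:
s3.put_object_acl(Bucket="aws-test-kk",
                  Key="data/enwiki-split.xml", ACL="public-read")
```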

  39. S3 ACL Problems • Permissions set on bucket don’t propagate • Objects created in bucket have ACLs set by creator • Read permission on bucket ≠ able to read objects • So you can “own” a bucket (have FULL_CONTROL) • But you can’t read the objects in the bucket • Though you can delete the objects in your bucket

  40. S3 and Hadoop • Just another file system • s3n://<bucket>/<path> • But bucket name must be valid hostname • Works with DistCp as source and/or destination • E.g. hadoop distcp s3n://bucket1/ s3n://bucket2/ • Tweaks for Elastic MapReduce • Multi-part upload – files bigger than 5GB • S3DistCp – file patterns, compression, grouping, etc.
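
On current EMR releases, S3DistCp is usually run as an extra step; the sketch below adds one to an existing cluster via boto3. The command-runner.jar wrapper, the s3-dist-cp flags, and the groupBy pattern are assumptions based on EMR documentation, not something this deck shows.

```python
# Sketch (assumptions noted above): adding an S3DistCp step to a running cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                  # placeholder cluster id
    Steps=[{
        "Name": "copy and group small files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",          # EMR's generic step launcher
            "Args": ["s3-dist-cp",
                     "--src", "s3://bucket1/",
                     "--dest", "s3://bucket2/",
                     "--groupBy", ".*(part-).*"], # concatenate matching files
        },
    }],
)
```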

  41. Map-Reduce Lab

  42. Wikipedia Processing Lab • Lab covers running typical Hadoop job using EMR • Code parses Wikipedia dump (available in S3) • <page><title>John Brisco</title>…</page> • One page per line of text, thus no splitting issues • Output is top bigrams (character pairs) and counts • e.g. ‘th’ occurred 2,578,322 times • Format is tab-separated value (TSV) text file
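
The lab's actual code ships as a Java job jar; purely to make the counting concrete, here is a small local sketch of the same character-bigram logic in Python (not the lab's code, and the sample line is just the slide's example page).

```python
# Not the lab's Java code: a local sketch of the same bigram (character pair) counting.
from collections import Counter

def bigrams(line):
    """Yield adjacent character pairs, the same pairs the job's mapper would emit."""
    for a, b in zip(line, line[1:]):
        yield a + b

counts = Counter()
for line in ["<page><title>John Brisco</title>...</page>"]:   # one page per line
    counts.update(bigrams(line))

for pair, n in counts.most_common(10):    # job output: top bigrams + counts, as TSV
    print(f"{pair}\t{n}")
```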

  43. Wikipedia Processing Lab - Requirements • You should already have your AWS account • Download & expand the Wikipedia Lab • http://elasticmapreduce.s3.amazonaws.com/training/wikipedia-lab.tgz • Follow the instructions in the README file • Located inside of expanded lab directory • Let’s go do that now…

  44. Command Line Tools

  45. Why Use Command Line Tools? • Faster in some cases than AWS Console • Able to automate via shell scripts • More functionality • E.g. dynamically expanding/shrinking cluster • And have a job flow with more than one step • Easier to do interactive development • Launching cluster without a step • Hive interactive mode

  46. Why Not Use Command Line Tools? • Often requires Python or Ruby • Extra local configuration • Windows users have additional pain • PuTTY & setting up a private key for ssh access

  47. EMR Command Line Client • Ruby script for command line interface (CLI) • elastic-mapreduce <command> • See http://aws.amazon.com/developertools/2264 • Steps to install & configure • Make sure you have Ruby 1.8 installed • Download the CLI tool from the page above • Edit your credentials.json file

  48. Using the elastic-mapreduce CLI • Editing the credentials.json file • Located inside of the elastic-mapreduce directory • Enter your credentials (access id, private key, etc) • Set the target AWS region • Add the elastic-mapreduce directory to your path • E.g. in .bashrc, add export PATH=$PATH:xxx • Let’s give it a try…

  49. s3cmd Command Line Client • Python script for interacting with S3 • Supports all standard file operations • List files or buckets – s3cmd ls s3://<bucket> • Delete bucket – s3cmd rb s3://<bucket> • Delete file – s3cmd del s3://<bucket>/<path> • Put file – s3cmd put <local file> s3://<bucket> • Get file – s3cmd get s3://<bucket>/<path> <local path> • Etc…

  50. Using s3cmd • Download it from: • http://sourceforge.net/projects/s3tools/files/latest/download?source=files • Expand/install it: • Add to your shell path • Run `s3cmd --configure` • Enter your credentials • Let’s go try that…
