
Coursework II: Google MapReduce in GridSAM


Presentation Transcript


  1. Coursework II: Google MapReduce in GridSAM Steve Crouch s.crouch@software.ac.uk, stc@ecs School of Electronics and Computer Science

  2. Contents • Introduction to Google’s MapReduce • Applications of MapReduce • The coursework • Extending a basic MapReduce framework provided in pseudocode • Coursework deadline: 27th March, 4pm • Hand in via the ECS Coursework Handin System

  3. Google MapReduce MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI 2004. http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf

  4. Google’s Need for a Distributed Programming Model and Infrastructure • Google implements many computations over very large amounts of data • Input: e.g. crawled documents, web request logs, etc. • Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc. • Input usually very large (> 1TB) • Computations need to be distributed for timeliness of results • Want to do this in an easy, but scalable and robust way; provide a programming model (with a suitable abstraction) for the distributed processing aspects • Google realised that many of these computations follow a map/reduce approach • a map operation is applied to a set of logical input “records” to generate intermediate key/value pairs • a reduce operation is applied to all intermediate values sharing the same key to combine the data in a useful way • Used as the basis for a rewrite of their production indexing system!

  5. History of MapReduce – Inspired by Functional Programming! • Functional operations only create new data structures and do not alter existing ones • Order of operations does not matter • Emphasis on data flow • e.g. higher-order functions in Lisp (the definitions below are written in ML-style syntax)
   • map() – applies a function to each value in a sequence
       fun map f [ ] = [ ]
         | map f (x::xs) = (f x) :: (map f xs)
   • reduce() – combines all elements of a sequence using a binary operator
       fun reduce f c [ ] = c
         | reduce f c (x::xs) = f x (reduce f c xs)
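The same two higher-order operations exist in standard Java as map() and reduce() on java.util.stream. A minimal sketch under the assumption of Java 8+ (this is illustration only, not part of the coursework materials; the class name MapReduceIdiom is invented for the example):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MapReduceIdiom {
        public static void main(String[] args) {
            List<Integer> xs = Arrays.asList(1, 2, 3, 4);

            // map: apply a function to every element, producing a new list
            List<Integer> doubled = xs.stream()
                                      .map(x -> x * 2)
                                      .collect(Collectors.toList());

            // reduce: combine all elements with a binary operator and an identity value
            int sum = xs.stream().reduce(0, (a, b) -> a + b);

            System.out.println(doubled);  // [2, 4, 6, 8]
            System.out.println(sum);      // 10
        }
    }

As with the functional definitions above, neither call mutates the input list; each produces a new result, which is what makes the operations safe to distribute.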

  6. Looking at map and reduce Another Way… • map(): • Delegates or distributes the computation for each piece of data to a given function, creating a new set of data • Each computation cannot see the effects of the other computations • The order of computation is irrelevant • reduce() takes this created data and reduces it to something we want • map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?
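Because each per-element computation is independent and free of side effects, a runtime may evaluate the map calls in any order, or in parallel, without changing the reduced result. A minimal Java sketch of that property using a parallel stream (the class name IndependentMaps is invented for the example):

    import java.util.Arrays;
    import java.util.List;

    public class IndependentMaps {
        public static void main(String[] args) {
            List<String> words = Arrays.asList("Hello", "world", "there");

            // Each mapping is independent and has no side effects, so the runtime
            // is free to run them in any order (possibly in parallel) without
            // changing the result of the final reduction.
            int totalLength = words.parallelStream()
                                   .map(String::length)
                                   .reduce(0, Integer::sum);

            System.out.println(totalLength);  // 15
        }
    }

This is the property MapReduce exploits: the map calls can be farmed out to many machines and the final reduce still sees the same set of intermediate values.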

  7. Applying the Programming Model to the Data Distributed Computing Seminar: Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.

  8. For Example… • Counting the number of occurrences of each word in a large collection of documents:

     map(String key, String value):
       // key: document name
       // value: document contents
       for each word w in value:
         EmitIntermediate(w, "1");

     reduce(String key, Iterator values):
       // key: a word
       // values: a list of counts
       int result = 0;
       for each v in values:
         result += ParseInt(v);
       Emit(AsString(result));

   • map outputs each word with an occurrence count of 1
   • reduce sums together all counts emitted for each word
   • Data flow (diagram): doc1 “Hello world” → map() → (Hello, 1), (world, 1); doc2 “Hello there” → map() → (Hello, 1), (there, 1); reduce() → Hello: 2, world: 1, there: 1
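For comparison, a minimal single-machine Java version of the same word count. The WordCount class, the hard-coded documents and the in-memory grouping are illustrative only and are not the coursework framework (which runs the map and reduce steps on separate workers):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordCount {

        // map(): emit an intermediate (word, 1) pair for every word in the document.
        // Here the "emit" just appends to a shared in-memory table, which also does
        // the grouping-by-key that the real framework performs between map and reduce.
        static void map(String docName, String contents, Map<String, List<Integer>> intermediate) {
            for (String w : contents.split("\\s+")) {
                if (!intermediate.containsKey(w)) {
                    intermediate.put(w, new ArrayList<Integer>());
                }
                intermediate.get(w).add(1);          // EmitIntermediate(w, "1")
            }
        }

        // reduce(): sum together all counts emitted for one word.
        static int reduce(String word, List<Integer> counts) {
            int result = 0;
            for (int c : counts) {
                result += c;                         // result += ParseInt(v)
            }
            return result;                           // Emit(AsString(result))
        }

        public static void main(String[] args) {
            Map<String, List<Integer>> intermediate = new HashMap<String, List<Integer>>();

            // In the real framework each map() call runs on a different worker;
            // here the two documents are simply processed in turn.
            map("doc1", "Hello world", intermediate);
            map("doc2", "Hello there", intermediate);

            for (Map.Entry<String, List<Integer>> e : intermediate.entrySet()) {
                System.out.println(e.getKey() + ": " + reduce(e.getKey(), e.getValue()));
            }
            // Hello: 2, world: 1, there: 1 (in some order)
        }
    }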

  9. How it Works in Practice
   1. User program: splits the work into M 64MB pieces; the program starts up across compute nodes as either Master or Worker (with exactly 1 Master)
   2. Master assigns M map tasks and R reduce tasks to idle workers (either one map or one reduce task each)
   3. A map Worker: • parses key/value pairs out of its input • passes each key/value pair to the map function • buffers intermediate keys/values in memory
   4. Periodically, the map Worker writes intermediate key/value pairs to disk and informs the Master of their locations, who forwards them to the reduce Workers
   5/6. When notified of the locations by the Master, a reduce Worker remotely reads in the data, sorts and groups it by key, passes it to the reduce function, and appends the results to an output file
   7. When all maps and reduces are done, the Master wakes up the user program, which resumes
   "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
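One detail behind steps 4 and 5/6: intermediate pairs must be routed so that all values for a given key end up at the same reduce Worker. The paper's default partitioning function is hash(key) mod R. A minimal Java sketch of that routing decision (the Partitioner class name is invented for the example):

    public class Partitioner {

        // Decide which of the R reduce tasks receives a given intermediate key.
        // Every pair with the same key hashes to the same reduce task, so each
        // reduce Worker sees all of the values for the keys it owns.
        static int partition(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            int r = 4;  // R reduce tasks, numbered 0..R-1
            for (String key : new String[] { "Hello", "world", "there" }) {
                System.out.println(key + " -> reduce task " + partition(key, r));
            }
        }
    }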

  10. Coursework: Part II

  11. Learning Objectives: • To develop a general architectural and operational understanding of typical production-level grid software. • To develop the programming skills required to drive typical services on a production-level grid.

  12. Tasks • Download and install the GridSAM server and client • (a) Extend some Java code stubs (which use the GridSAM Java API) to submit jobs to GridSAM and monitor them • (b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files

  13. Coursework: Part II – Installing GridSAM

  14. Pre-Requisites • Pre-requisites: • Client and Server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu) • May work on other Linux distributions, but no exhaustive testing has been done • Tested on the undergrad Linux boxes • Requires Java JDK 6 (not the JRE) or above • Beware: • Firewalls blocking port 8080 and your FTP port between client and server – add exceptions • VPNs can cause problems with staging data to/from GridSAM

  15. Preparation/Installation • Java 7 recommended • Note: you may need to upgrade your Java • Ensure JAVA_HOME is set and Java is on your path • Install the client… • Download gridsam-2.3.0-client.zip from the coursework page • unzip gridsam-2.3.0-client.zip (into a file path that contains no spaces) • cd gridsam-2.3.0-client • java SetupGridSAM • Install the server (Linux only)… • You can reuse your Apache Tomcat 5.5.28/6.0.32 from mgrid (see the mgrid install slides) • Download gridsam.war from the coursework page • Shut down Tomcat, copy gridsam.war into apache-tomcat-6.0.32/webapps and restart Tomcat • Check the log files in apache-tomcat-6.0.32/webapps/gridsam/WEB-INF/logs if any problems occur

  16. Coursework Materials • Download COMP3019-materials.tgz from the coursework page • Copy it to the gridsam-2.3.0-client directory • Unpack it; you’ll find some GridSAMExample* files • Run ./GridSAMExampleCompile to check compilation • The code is not complete; completing it is the coursework! • GridSAMExampleRun won’t work until you have done the coursework • Note the server.domain and port in the script – you need to change these to point at your server (use HTTP not HTTPS!!) • Use the scripts and Java code as a basis • Refer to the API docs on the coursework page as required • To obtain the job status, use e.g.: jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString(); • Calling job.getLastKnownStage().getState().toString() directly won’t work
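A minimal polling sketch built around the status call above. The import path and the JobManager type are assumptions based on the GridSAM API docs linked from the coursework page, so check them against the GridSAMExample stubs; the terminal state names and the 5-second interval are also assumptions:

    import org.icenigrid.gridsam.core.JobManager;  // assumption: package path as in the GridSAM API docs

    public class StatusPoller {

        // Poll a submitted job until it appears to have reached a terminal state.
        // jobManager and jobID are assumed to come from the GridSAMExample stubs;
        // the terminal state names checked here are an assumption.
        static void waitForJob(JobManager jobManager, String jobID) throws Exception {
            String jobStage = "";
            while (!jobStage.equalsIgnoreCase("done") && !jobStage.equalsIgnoreCase("failed")) {
                jobStage = jobManager.findJobInstance(jobID)
                                     .getLastKnownStage().getState().toString();
                System.out.println("Job " + jobID + " is in state: " + jobStage);
                Thread.sleep(5000);  // poll every 5 seconds
            }
        }
    }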

  17. The Coursework • See the coursework handout on the COMP3019 page: • http://www.ecs.soton.ac.uk/~stc/COMP3019 • Notes for Part 1: • When specifying multiple arguments to your m-grid applet, there is a single string you can use as an argument • Consider how you pass the two necessary arguments (i.e. a character and a text file) as a single argument into the applet • To load the text file into your applet, package it into the jar file along with the code, and use the following in the applet: • InputStream in = getClass().getResourceAsStream("textfile.txt"); • Part 2 (GridSAM) Notes: • If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu) • When you want to check the status of a job, use e.g. jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString(); • Calling job.getLastKnownStage().getState().toString() directly won’t work
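For Part 1, one possible way to pack the two arguments into a single string and load the packaged text file, shown as a minimal applet sketch. The parameter name "arg", the ';' delimiter, the character-counting body and the CountApplet class name are all illustrative assumptions; check the coursework handout for what the applet must actually compute:

    import java.applet.Applet;
    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    public class CountApplet extends Applet {

        @Override
        public void init() {
            try {
                // Single argument string, e.g. "e;textfile.txt" (the parameter name
                // and the ';' delimiter are illustrative choices, not part of the handout)
                String arg = getParameter("arg");
                String[] parts = arg.split(";");
                char target = parts[0].charAt(0);
                String fileName = parts[1];

                // Load the text file packaged in the jar alongside the class files
                InputStream in = getClass().getResourceAsStream(fileName);
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                int count = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    for (char c : line.toCharArray()) {
                        if (c == target) {
                            count++;
                        }
                    }
                }
                reader.close();

                // Illustrative output only; adapt to the handout's required result
                System.out.println("'" + target + "' occurs " + count + " times in " + fileName);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }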

  18. Coursework: Part II – Running a Command Line Example

  19. Example using File Staging • Objectives: submit a simple job with data input and output requirements and monitor its progress • [Diagram: the OMII GridSAM Client submits JSDL to the OMII GridSAM Server and monitors the job; the OMII GridSAM FTP Server stages 2 input files in and 1 output file out]

  20. JSDL Example • Gridsam-2.3.0/examples/remotecat-staging.jsdl • Change the ftp URLs to match your ftp server (e.g. ftp://anonymous:anonymous@localhost:55521/concat.sh):

    <JobDescription>
      <JobIdentification> … </JobIdentification>
      <Application>
        <POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
          <Executable>bin/concat</Executable>
          <Argument>dir2/subdir1/file2.txt</Argument>
          <Output>stdout.txt</Output>
          <Error>stderr.txt</Error>
          <Environment name="FIRST_INPUT">dir1/file1.txt</Environment>
        </POSIXApplication>
      </Application>
      …

  21. JSDL Example

      <DataStaging>
        <FileName>bin/concat</FileName>
        <CreationFlag>overwrite</CreationFlag>
        <Source>
          <URI>ftp://ftp.do:55521/concat.sh</URI>
        </Source>
      </DataStaging>
      <DataStaging>
        <FileName>dir1/file1.txt</FileName>
        <CreationFlag>overwrite</CreationFlag>
        <Source>
          <URI>ftp://ftp.do:55521/input1.txt</URI>
        </Source>
      </DataStaging>
      <DataStaging>
        <FileName>dir2/subdir1/file2.txt</FileName>
        <CreationFlag>overwrite</CreationFlag>
        <Source>
          <URI>ftp://ftp.do:55521/input2.txt</URI>
        </Source>
      </DataStaging>
      <DataStaging>
        <FileName>stdout.txt</FileName>
        <CreationFlag>overwrite</CreationFlag>
        <DeleteOnTermination>true</DeleteOnTermination>
        <Target>
          <URI>ftp://ftp.do:55521/stdout.txt</URI>
        </Target>
      </DataStaging>
    </JobDescription>
    </JobDefinition>

  22. Set up the GridSAM Client’s FTP Server • To allow GridSAM to retrieve input and store output • In gridsam-2.3.0-client directory:

    > ./gridsam.sh GridSAMFTPServer -p 55521 -d examples/
    2010-04-29 08:20:59,250 WARN [GridSAMFTPServer] (main:) ../data/examples/ is exposed through FTP at ftp://anonymous@152.78.237.90:55521/
    2010-04-29 08:20:59,268 WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.
    FtpServer.server.config.root.dir = ../data/examples/
    FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp1215306750
    FtpServer.server.config.port = 55521
    FtpServer.server.config.self.host = 152.78.237.90
    Started FTP

  • Exposes the examples directory through FTP on port 55521 (anonymous access!) • Create input1.txt and input2.txt in this directory with some text in them

  23. CLI Example: Submit to GridSAM Server • Ensure Java is on your path • In gridsam-2.3.0-client directory: • Submit to GridSAM server:

    > ./gridsam.sh GridSAMSubmit -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j examples/remotecat-staging.jsdl

  • Unique job ID is returned • i.e. UID is urn:gridsam:<characters>

  24. CLI Example: Monitoring the Job • Monitor the job until completion:

    > ./gridsam.sh GridSAMStatus -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j <unique_job_id>

  • <unique_job_id> is the entire urn:gridsam:<characters> string • Job progress is indicated by the current state: • Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done • When complete, the output resides in the stdout.txt file in the examples/ directory

  25. What to Hand In • Submit: source code, results files, parameter files and output • Other parts that require written answers should form a separate document: • In text, Microsoft Word or PDF • Up to 800 words in length, not including any source or trace output • Submission via ECS Coursework Handin system: Single Zip file: source, results, parameter files, output & written answers
