Workflow Management

Presentation Transcript

  1. Workflow Management CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Apache Oozie

  3. Problem!
  • "Okay, Hadoop is great, but how do people actually do this?" – A Real Person
  • Package jobs?
  • Chaining actions together?
  • Run these on a schedule?
  • Pre- and post-processing?
  • Retry failures?

  4. Apache Oozie: Workflow Scheduler for Hadoop
  • Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs
  • Workflow jobs are DAGs of actions
  • Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability
  • Supports several types of jobs:
    • Java MapReduce
    • Streaming MapReduce
    • Pig
    • Hive
    • Sqoop
    • DistCp
    • Java programs
    • Shell scripts

  5. Why should I care?
  • Retry jobs in the event of a failure
  • Execute jobs at a specific time or when data is available
  • Correctly order job execution based on dependencies
  • Provide a common framework for communication
  • Use the workflow to couple resources instead of some home-grown code base

  6. Layers of Oozie
  • Bundles
  • Coordinators
  • Workflows
  • Actions

  7. Actions
  • Have a type, and each type has a defined set of configuration variables
  • Each action must specify what to do on success and what to do on failure
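As a sketch, a typed action node with its mandatory success/failure transitions might look like the following (the action name, node names, and script are illustrative placeholders):

```xml
<action name="my-pig-action">
  <!-- the element name ("pig") is the action type; its children are the
       type-specific configuration variables -->
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>cleanup.pig</script>
  </pig>
  <!-- every action declares both transitions -->
  <ok to="next-action"/>
  <error to="fail"/>
</action>
```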

  8. Workflow DAGs
  [Diagram: an example workflow DAG — start → M/R streaming job → (OK) → fork → parallel Java Main and Pig job → M/R job branches → join → decision (MORE loops back, ENOUGH continues) → Java Main → (OK) → FS job → (OK) → end]

  9. Workflow Language
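The workflow language consists of a small set of control-flow nodes (start, end, kill, decision, fork, join) plus action nodes. As a hedged illustration, the decision and fork/join control nodes might be written like this (node names and the predicate are placeholders):

```xml
<!-- decision: route control flow based on an EL predicate -->
<decision name="size-check">
  <switch>
    <case to="big-path">${fs:fileSize(inputDir) gt 1073741824}</case>
    <default to="small-path"/>
  </switch>
</decision>

<!-- fork/join: run actions in parallel, then synchronize -->
<fork name="parallel-work">
  <path start="pig-node"/>
  <path start="java-node"/>
</fork>
<join name="sync" to="next-node"/>
```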

  10. Oozie Workflow Application
  • An HDFS directory containing:
    • Definition file: workflow.xml
    • Configuration file: config-default.xml
    • App files: lib/ directory with JARs and other dependencies

  11. WordCount Workflow

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker></job-tracker>
      <name-node>hdfs://</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

  [Diagram: Start → M-R wordcount → (OK) → End; (Error) → Kill]

  12. Coordinators
  • Oozie executes workflows based on:
    • Time dependency
    • Data dependency
  [Diagram: Oozie client → WS API → Oozie Coordinator (running in Tomcat) checks data availability and triggers an Oozie Workflow, which runs on Hadoop]

  13. Time Triggers

<coordinator-app name="coord1" start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                 frequency="15" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name>
          <value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

  14. Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}"...>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

  15. Bundles
  • Bundles are higher-level abstractions that batch a set of coordinators together
  • There are no explicit dependencies between the coordinators, but they can be used to define a pipeline
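A minimal, hypothetical bundle definition batching two coordinators might look like this (the bundle name, coordinator names, and HDFS paths are placeholders):

```xml
<bundle-app name="my-pipeline" xmlns="uri:oozie:bundle:0.1">
  <!-- each coordinator runs independently; together they form a pipeline -->
  <coordinator name="ingest-coord">
    <app-path>hdfs://bar:9000/apps/ingest-coord</app-path>
  </coordinator>
  <coordinator name="report-coord">
    <app-path>hdfs://bar:9000/apps/report-coord</app-path>
  </coordinator>
</bundle-app>
```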

  16. Interacting with Oozie
  • Read-only web console
  • CLI
  • Java client
  • Web service endpoints
  • Directly with the Oozie DB using SQL

  17. Extending Oozie
  • Minimal workflow language containing a handful of controls and actions
  • Extensible via custom action nodes
  • Creating a custom action requires:
    • A Java implementation, extending ActionExecutor
    • An implementation of the action's XML schema, which defines the action's configuration parameters
    • Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR
    • Extending oozie-site.xml to register information about the custom executor

  18. What do I need to deploy a workflow?
  • coordinator.xml
  • workflow.xml
  • Libraries
  • A properties file
    • Contains things like the NameNode and ResourceManager addresses and other job-specific properties
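A hypothetical job properties file for such a deployment might look like the following (the host names, ports, and paths are placeholders):

```properties
# Cluster endpoints (placeholders)
nameNode=hdfs://bar:9000
jobTracker=foo:9001

# Where the workflow application lives in HDFS
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app/my_job

# Job-specific parameters referenced as ${...} in workflow.xml
inputDir=/data/in
outputDir=/data/out
```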

  19. Configuring Workflows
  • Three mechanisms to configure a workflow:
    • config-default.xml
    • (the job properties file)
    • Job arguments
  • Processed as such:
    • Use all of the parameters from the command-line invocation
    • Anything unresolved? Use
    • Use config-default.xml for everything else

  20. Okay, I've built those
  • Now you can put it in HDFS and run it:

hdfs dfs -put my_job oozie/app
oozie job -run

  21. Java Action
  • A Java action will execute the main method of the specified Java class
  • Java classes should be packaged in a JAR and placed in the workflow application's lib directory:
    • wf-app-dir/workflow.xml
    • wf-app-dir/lib
    • wf-app-dir/lib/myJavaClasses.JAR

  22. Java Action

$ java -Xms512m a.b.c.MyMainClass arg1 arg2

<action name='java1'>
  <java>
    ...
    <main-class>a.b.c.MyMainClass</main-class>
    <java-opts>-Xms512m</java-opts>
    <arg>arg1</arg>
    <arg>arg2</arg>
    ...
  </java>
</action>

  23. Java Action Execution
  • Executed as a MapReduce job with a single task
  • So you need to supply the MapReduce information:

<action name='java1'>
  <java>
    <job-tracker></job-tracker>
    <name-node></name-node>
    ...
    <configuration>
      <property>
        <name>abc</name>
        <value>def</value>
      </property>
    </configuration>
  </java>
</action>

  24. Capturing Output
  • How do you pass a parameter from a Java action to the actions that follow it?
    • Add the <capture-output/> element to your Java action
    • Reference the parameter in your following actions
    • Write some Java code to link them

  25.

<action name='java1'>
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name></name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <main-class>org.apache.oozie.test.MyTest</main-class>
    <arg>${outputFileName}</arg>
    <capture-output/>
  </java>
  <ok to="pig1"/>
  <error to="fail"/>
</action>

  26.

<action name='pig1'>
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name></name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <script>script.pig</script>
    <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>

  27.

public static void main(String[] args) {
  String fileName = args[0];
  try {
    // Oozie tells the action where to write its output properties
    File file = new File(System.getProperty("oozie.action.output.properties"));
    Properties props = new Properties();
    props.setProperty("PASS_ME", "123456");
    OutputStream os = new FileOutputStream(file);, "");
    os.close();
    System.out.println(file.getAbsolutePath());
  } catch (Exception e) {
    e.printStackTrace();
  }
  System.exit(0);
}

  28. Web Console

  29. Coordinators

  30. Coordinator Details

  31. Job Details

  32. Job DAG

  33. Job Details

  34. Action Details

  35. Job Tracker

  36. A Use Case: Hourly Jobs
  • Replace a CRON job that runs a bash script once a day:
    • A Java main class pulls data from a file stream and dumps it to HDFS
    • A MapReduce job runs on the files
    • An email is sent to a person when finished
  • Start within X amount of time
  • Complete within Y amount of time
  • And retry Z times on failure

  37.

<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
  <start to="java-node"/>
  <action name="java-node">
    <java>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <main-class></main-class>
    </java>
    <ok to="mr-node"/>
    <error to="fail"/>
  </action>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <configuration>
        ...
      </configuration>
    </map-reduce>
    <ok to="email-node"/>
    <error to="fail"/>
  </action>
  ...
  <action name="email-node">
    <email xmlns="uri:oozie:email-action:0.1">
      <to></to>
      <cc></cc>
      <subject>Email notification</subject>
      <body>The wf completed</body>
    </email>
    <ok to="myotherjob"/>
    <error to="errorcleanup"/>
  </action>
  <end name="end"/>
  <kill name="fail"/>
</workflow-app>

  38.

<?xml version="1.0"?>
<coordinator-app name="daily_job_coord" frequency="${coord:days(1)}"
                 start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
    </workflow>
    <sla:info>
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <sla:should-start>${X * MINUTES}</sla:should-start>
      <sla:should-end>${Y * MINUTES}</sla:should-end>
      <sla:alert-contact></sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>

  39. Review
  • Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff
  • Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow
  • XML is gross

  40. References