
Monitoring Infrastructure for Grid Scientific Workflows






Presentation Transcript


1. Monitoring Infrastructure for Grid Scientific Workflows
Bartosz Baliś, Marian Bubak
Institute of Computer Science and ACC CYFRONET AGH, Kraków, Poland

2. Outline
• Challenges in Monitoring of Grid Scientific Workflows
• GEMINI infrastructure
• Event model for workflow execution monitoring
• On-line workflow monitoring support
• Information model for recording workflow executions

3. Motivation
• Monitoring of Grid scientific workflows is important in many scenarios
  • On-line & off-line performance analysis, dynamic resource reconfiguration, on-line steering, performance optimization, provenance tracking, experiment mining, experiment repetition, …
• Consumers of monitoring data: humans (provenance) and processes
• On-line & off-line scenarios
• Historic records: provenance, retrospective analysis (enhancement of next executions)

4. Grid Scientific Workflows
• Traditional scientific applications
  • Parallel
  • Homogeneous
  • Tightly coupled
• Scientific workflows
  • Distributed
  • Heterogeneous
  • Loosely coupled
  • Legacy applications often in the backends
  • Grid environment
• ⇒ Challenges for monitoring arise

5. Challenges
• Monitoring infrastructure that conceals workflow heterogeneity
  • Event subscription and instrumentation requests
• Standardized event model for Grid workflow execution
  • Currently events tightly coupled to workflow environments
• On-line monitoring support
  • Existing Grid information systems not suitable for fast notification-based discovery
• Monitoring information model to record executions

6. GEMINI: monitoring infrastructure
• Monitors: query & subscription engine, event caching, services
• Sensors: lightweight collectors of events
• Mutators: manipulation of monitored entities (e.g. dynamic instrumentation)
• Standardized, abstract interfaces for subscription and instrumentation
• Complex Event Processing: subscription management via continuous querying
• Event representation
  • XML: self-describing, extensible, but poor performance
  • Google protocol buffers: under investigation
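To make the event-representation option concrete, below is a minimal sketch of what an XML-serialized workflow execution event could look like, written in Python with the standard xml.etree module. The element names (event, workflowId, taskId, timestamp) are illustrative assumptions, not the actual GEMINI event schema.

# Minimal sketch of an XML-serialized workflow execution event.
# Element names are illustrative, not the actual GEMINI schema.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def make_event(event_type, workflow_id, task_id):
    """Build a self-describing monitoring event as an XML element."""
    event = ET.Element("event", type=event_type)
    ET.SubElement(event, "workflowId").text = workflow_id
    ET.SubElement(event, "taskId").text = task_id
    ET.SubElement(event, "timestamp").text = datetime.now(timezone.utc).isoformat()
    return event

print(ET.tostring(make_event("started.task", "Wf_1234", "task_7"), encoding="unicode"))

XML keeps events self-describing and extensible at the cost of verbosity; a binary encoding such as protocol buffers trades that flexibility for compactness and parsing speed, which is presumably why it is listed as under investigation.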

7. Outline
• Event model for workflow execution monitoring
• On-line workflow monitoring support
• Information model for recording workflow executions

8. Workflow execution events
• Motivation: capture commonly used monitoring measurements concerning workflow execution
• Attempts to standardize monitoring events exist, but they are oriented to resource monitoring
  • GGF DAMED 'Top N'
  • GGF NMWG Network Performance Characteristics
• Typically monitoring systems introduce a single event type for application events

9. Workflow Execution Events – taxonomy
• Extension of GGF DAMED Top N events
• Extensible hierarchy; example extensions:
  • Loop entered – started.codeRegion.loop
  • MPI app invocation – invoking.application.MPI
  • MPI calls – started.codeRegion.call.MPISend
  • Application-specific events
• Events for initiators and performers
  • Invoking, invoked; started, finished
• Events for various execution levels
  • Workflow, task, code region, data operations
• Events for various execution states
  • Failed, suspended, resumed, …
• Events for execution metrics
  • Progress, rate
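The dotted names above form an extensible hierarchy, so a subscriber can cover a whole subtree of event types by naming a prefix. A minimal Python sketch of such prefix-based matching follows; the matching semantics are an assumption made for illustration, not taken from the GEMINI specification.

# Sketch: dotted event-type names form a hierarchy, so subscribing to the
# prefix "started.codeRegion" also covers more specific events such as
# "started.codeRegion.call.MPISend". Matching semantics are assumed here.
def matches(subscribed_type: str, event_type: str) -> bool:
    sub = subscribed_type.split(".")
    evt = event_type.split(".")
    return evt[:len(sub)] == sub

assert matches("started.codeRegion", "started.codeRegion.call.MPISend")
assert not matches("started.workflow", "finished.workflow")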

10. Outline
• Event model for workflow execution monitoring
• On-line workflow monitoring support
• Information model for recording workflow executions

11. On-line Monitoring of Grid Workflows
• Motivation
  • Reaction to time-varying resource availability and application demands
  • Up-to-date execution status
• Typical scenario: 'subscribe to all execution events related to workflow Wf_1234'
• Distributed producers, not known a priori
• Prerequisite: automatic resource discovery of workflow components
• New producers are automatically discovered and transparently receive appropriate active subscription requests
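The behaviour described above can be sketched as a small coordinator that remembers active subscriptions per workflow and replays them to every newly discovered producer. Class and method names (SubscriptionCoordinator, forward) are hypothetical; this illustrates only the forwarding logic, not the actual GEMINI interfaces.

# Active subscriptions are kept per workflow; when a new producer for that
# workflow is discovered, it transparently receives the pending requests.
from collections import defaultdict

class SubscriptionCoordinator:
    def __init__(self):
        self.subscriptions = defaultdict(list)  # workflow id -> subscription requests
        self.producers = defaultdict(list)      # workflow id -> known producers

    def subscribe(self, workflow_id, request):
        self.subscriptions[workflow_id].append(request)
        for producer in self.producers[workflow_id]:
            producer.forward(request)            # producers already known

    def producer_discovered(self, workflow_id, producer):
        self.producers[workflow_id].append(producer)
        for request in self.subscriptions[workflow_id]:
            producer.forward(request)            # late-joining producers catch up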

12. Resource discovery in workflow monitoring
• Challenge: complex execution life cycle of a Grid workflow
  • Abstract workflows: mapping of tasks to resources at runtime
  • Many services involved: enactment engines, resource brokers, schedulers, queue managers, execution managers, …
  • No single place to subscribe for notifications about new workflow components
• Discovery for monitoring must proceed bottom-up: (1) local discovery, (2) global advertisement, (3) global discovery
• Problem: current Grid information services are not suitable
  • Oriented towards query performance
  • Slow propagation of resource status changes
  • Example: average delay from event occurrence to notification in the EGEE infrastructure ~ 200 seconds (Berger et al., Analysis of Overhead and Waiting Times in the EGEE Production Grid)

13. Resource discovery: solution
• What kind of resource discovery is required?
  • Identity-based, not attribute-based
  • Full-blown information service functionality not needed
  • Just a simple, efficient key-value store
• Solution: a DHT infrastructure federated with the monitoring infrastructure to store shared state of monitoring services
  • Key = workflow identifier
  • Value = producer record (monitoring service URL, etc.)
  • Multiple values (= producers) can be registered
• Efficient key-value stores
  • OpenDHT
  • Amazon Dynamo: efficiency, high availability, scalability; lack of strong data consistency ('eventual consistency')
  • Avg get/put delay ~ 15/30 ms; 99th percentile ~ 200/300 ms (DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store)
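The intended use of the key-value store can be sketched as follows; a plain in-process dict stands in for the real DHT (OpenDHT or a Dynamo-style store), and the producer-record fields and URL are illustrative only.

# Key = workflow identifier, value = producer record; multiple producers may
# register under the same workflow. A dict stands in for the DHT here.
from collections import defaultdict

dht = defaultdict(list)

def register_producer(workflow_id: str, monitor_url: str):
    """Local discovery -> global advertisement."""
    dht[workflow_id].append({"monitor_url": monitor_url})

def discover_producers(workflow_id: str):
    """Global discovery: identity-based lookup by workflow id, no attribute queries."""
    return dht[workflow_id]

register_producer("Wf_1234", "https://node42.example.org/gemini/monitor")
print(discover_producers("Wf_1234"))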

  14. Monitoring + DHT (simplified architecture)

  15. DHT-based scenario

16. Evaluation
• Goal:
  • Measure performance & scalability
  • Comparison with centralized approach
• Main characteristic measured:
  • Delay between the occurrence of a new workflow component and the beginning of data transfer, for different workloads
• Two methodologies:
  • Queuing network models with multiple classes, analytical solution
  • Simulation models (CSIM simulation package)

17. 1st methodology: Queuing Networks
• Solved analytically
(a) DHT solution QN model
(b) Centralized solution QN model
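For reference, the standard relations for an open queueing network that such an analytical solution builds on are shown below for a single request class; these are textbook formulas, not reproduced from the slides, with arrival rate \lambda and service demand D_k at station k.

\begin{align*}
  U_k &= \lambda\, D_k            && \text{(utilization of station $k$)}\\
  R_k &= \frac{D_k}{1 - U_k}      && \text{(residence time at station $k$, queueing included)}\\
  R   &= \sum_k R_k               && \text{(end-to-end delay)}
\end{align*}

The multi-class models used in the evaluation generalize these relations per workload class.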

18. 2nd methodology: discrete-event simulation
• CSIM simulation package

19. Input parameters for models
• Workload intensity
  • Measured in job arrivals per second
  • Taken from EGEE: 3,000 to 100,000 jobs per day (up to about 1.2 arrivals per second)
  • Large-scale production infrastructure
  • Assumed range: from 0.3 to 10 job arrivals per second
• Service demands
  • Monitors and Coordinator: prototypes built and measured
  • DHT: from available reports on large-scale deployments
  • OpenDHT, Amazon Dynamo

  20. Service demand matrices

  21. Results (centralized model)

  22. Results (DHT model)

23. Scalability comparison: centralized vs. DHT
• Conclusion: the DHT solution scales as expected, but the centralized solution can still handle relatively large workloads before saturation

24. Outline
• Event model for workflow execution monitoring
• On-line workflow monitoring support
• Information model for recording workflow executions

25. Information model for wf execution records
• Motivation: need for structured information about past experiments executed as scientific workflows in e-Science environments
  • Provenance querying
  • Mining over past experiments
  • Experiment repetition
  • Execution optimization based on history
• State of the art
  • Monitoring information models do exist, but for resource monitoring (GLUE), not execution monitoring
  • Provenance models are not sufficient
  • Repositories for performance data are oriented towards event traces or simple performance-oriented information
• Experiment Information (ExpInfo) model
  • Ontologies used to describe the model and represent the records

26. ExpInfo model
A simplified example with particular domain ontologies

27. ExpInfo model: set of ontologies
• General experiment information
  • Purpose, execution stages, input/output data sets
• Provenance information
  • Who, where, why, data dependencies
• Performance information
  • Duration of computation stages, scheduling, queueing, performance metrics (possible)
• Resource information
  • Physical resources (hosts, containers) used in the computation
• Connection with domain ontologies
  • Data sets with Data ontology
  • Execution stages with Application ontology
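As a concrete illustration of how such a record could be represented with ontologies, here is a minimal sketch using Python and rdflib; the class URIs reuse the exp-protos namespace that appears in the aggregation rules shown later, while the property names (hasComputation, duration, executedOn) and the literal values are hypothetical.

# Minimal sketch of an ExpInfo-style record as RDF individuals.
# Property names and values are hypothetical; only the namespace is taken
# from the aggregation rules.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EXP = Namespace("http://www.virolab.org/onto/exp-protos/")
g = Graph()
g.bind("exp", EXP)

g.add((EXP["experiment-1234"], RDF.type, EXP.Experiment))
g.add((EXP["computation-1"], RDF.type, EXP.Computation))
g.add((EXP["experiment-1234"], EXP.hasComputation, EXP["computation-1"]))
g.add((EXP["computation-1"], EXP.duration, Literal(42.0)))  # seconds
g.add((EXP["computation-1"], EXP.executedOn, Literal("node42.example.org")))

print(g.serialize(format="turtle"))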

28. Aggregation of data to information
• From monitoring events to ExpInfo records
• Standardized process described by aggregation rules and derivation rules
• Aggregation rules specify how to instantiate individuals
  • Ontology classes are associated with aggregation rules through object properties
• Derivation rules specify how to compute attributes, including object properties (= associations between individuals)
  • Attributes are associated with derivation rules via annotations
• The Semantic Aggregator collects wf execution events and produces ExpInfo records according to the aggregation and derivation rules

29. Aggregation rules

ExperimentAggregation:
  eventTypes = started.workflow, finished.workflow
  instantiatedClass = http://www.virolab.org/onto/exp-protos/Experiment
  ecidCoherency = 1

ComputationAggregation:
  eventTypes = invoking.wfTask, invoked.wfTask
  instantiatedClass = http://www.virolab.org/onto/exp-protos/Computation
  acidCoherency = 2

30. Derivation rules

The simplest case – an XML element mapped directly to a functional property:

<ext-ns:Derivation ID="OwnerLoginDeriv">
  <element>MonitoringData/experimentStarted/ownerLogin</element>
</ext-ns:Derivation>

A more complex case – which XML elements are needed and how to compute an attribute:

<ext-ns:Derivation ID="DurationDeriv">
  <delegate>cyfronet.gs.aggregator.delegates.ExpPlugin</delegate>
  <delegateParam>software.execution.started.application/time</delegateParam>
  <delegateParam>software.execution.finished.application/time</delegateParam>
</ext-ns:Derivation>
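For illustration only, the computation a duration-derivation delegate performs could look like the Python sketch below: the attribute value is derived from the timestamps of the two events named as delegateParam entries above. This is not the actual ExpPlugin code, and the timestamp format is an assumption.

# Hypothetical duration derivation: combine the 'started' and 'finished'
# application event timestamps into a single duration attribute.
from datetime import datetime

def derive_duration(started_time: str, finished_time: str) -> float:
    """Duration in seconds, assuming ISO-8601 timestamps in the events."""
    t0 = datetime.fromisoformat(started_time)
    t1 = datetime.fromisoformat(finished_time)
    return (t1 - t0).total_seconds()

print(derive_duration("2008-06-02T10:15:00", "2008-06-02T10:15:42"))  # -> 42.0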

31. Applications
• Coordinated Traffic Management
  • Executed within the K-Wf Grid infrastructure for workflows
  • Workflows with legacy backends
  • Instrumentation & tracing
• Drug Resistance application
  • Executed within the ViroLab virtual laboratory for infectious diseases (virolab.cyfronet.pl)
  • Recording executions, provenance querying, visual ontology-based querying based on the ExpInfo model

32. Conclusion
• Several monitoring challenges specific to Grid scientific workflows
• Standardized taxonomy for workflow execution events
• DHT infrastructure to improve the performance of resource discovery and enable on-line monitoring
• Information model for recording workflow executions

33. Future Work
• Enhancement of event & information models
  • Work in progress; requires extensive review of existing systems to enhance the event taxonomy, event data structures and the information model
  • Model enhancement & validation
• Performance of large-scale deployment
  • Classification of workflows w.r.t. generated workloads
  • (Preliminary study: S. Ostermann, R. Prodan, and T. Fahringer, A Trace-Based Investigation of the Characteristics of Grid Workflows)
• Information model for workflow status
  • Similar to resource status in information systems
