Transparent Cross-Border Migration of Parallel Multi Node Applications
Dominic Battré, Matthias Hovestadt, Odej Kao, Axel Keller, Kerstin Voss
Cracow Grid Workshop 2007
Outline
Axel Keller
HPC4U: Highly Predictable Clusters for Internet-Grids, EC-funded project in FP6
AssessGrid: Advanced Risk Assessment & Management for Trustable Grids, EC-funded project in FP6
• Motivation
• The Software Stack
• Cross-Border Migration
• Summary
The Gap between Grid and RMS
• The user asks for an SLA
• Grid middleware realizes the job by means of a local RMS
• BUT: RMSs offer best effort only
• Need: an SLA-aware RMS
[Figure: the user request with an SLA ("Guaranteed!") goes to the grid middleware, which passes it on to the RMSs of machines M1, M2 and M3; these only offer "Best Effort!", leaving reliability and quality of service open]
HPC4U: Highly Predictable Clusters for Internet-Grids
• Objective: software-only solution for an SLA-aware, fault-tolerant infrastructure, offering reliability and QoS, and acting as an active Grid component
• Key features
  • System-level checkpointing
  • Job migration
  • Job types: sequential and MPI-parallel
  • Planning-based scheduling
HPC4U: Planning-Based Scheduling
[Figure: with queuing, new jobs wait in queues; with planning, new jobs are placed into a plan spanning machine and time]
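Planning-based scheduling gives every accepted job a concrete start time in a machine × time plan instead of holding it in a queue. The following is a minimal sketch, assuming identical nodes and integer time slots; the classes and methods are hypothetical and not the HPC4U or OpenCCS interface. It places a new job at the earliest slot where enough nodes stay free for its whole duration, which also lets small jobs fill gaps in front of larger ones.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    job: str
    start: int   # first time slot of the reservation
    end: int     # first time slot after the reservation
    nodes: int


@dataclass
class Plan:
    total_nodes: int
    reservations: list = field(default_factory=list)

    def used(self, t: int) -> int:
        """Number of nodes already reserved in time slot t."""
        return sum(r.nodes for r in self.reservations if r.start <= t < r.end)

    def fits(self, start: int, duration: int, nodes: int) -> bool:
        """True if `nodes` nodes are free in every slot of [start, start + duration)."""
        return all(self.used(t) + nodes <= self.total_nodes
                   for t in range(start, start + duration))

    def place(self, job: str, duration: int, nodes: int, earliest: int = 0):
        """Reserve the earliest feasible start at or after `earliest`.

        Free capacity only grows where a reservation ends, so the earliest
        feasible start is either `earliest` or the end of some reservation.
        """
        if nodes > self.total_nodes:
            return None
        candidates = sorted({earliest} | {r.end for r in self.reservations if r.end > earliest})
        for start in candidates:
            if self.fits(start, duration, nodes):
                res = Reservation(job, start, start + duration, nodes)
                self.reservations.append(res)
                return res
        return None
```

For example, on a 4-node plan, a 3-node job A placed first starts at 0, a 2-node job B must wait until A ends at slot 3, and a 1-node job C placed afterwards still starts at 0 because one node is free next to A.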
HPC4U: Software Stack
[Figure: User/Broker interface and CLI on top; Negotiation; RMS with Scheduler; SSC; Storage, Process and Network subsystems; Cluster]
HPC4U: Checkpointing Cycle
[Figure: the RMS coordinating the Process, Network and Storage subsystems]
1. Checkpoint job and halt it
2. In-transit packets
3. Return: "Checkpoint completed!"
4. Snapshot
5. Link to snapshot
6. Resume job
7. Job running again
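The seven steps of the checkpointing cycle can be read as one orchestration routine. The sketch below is illustrative only: `Cluster` and its methods are hypothetical stand-ins for the Process, Network and Storage subsystems, and the point it shows is the ordering, with the storage snapshot presumably taken while the job is halted so that the process checkpoint and the file-system state belong together.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for the Process, Network and Storage subsystems."""
    log: list = field(default_factory=list)
    running: bool = True

    def halt_and_checkpoint(self, job: str) -> str:   # Process subsystem
        self.running = False
        self.log.append(f"checkpoint {job}")
        return f"{job}.ckpt"

    def save_in_transit_packets(self, job: str):      # Network subsystem
        self.log.append(f"save packets {job}")

    def snapshot(self, job: str) -> str:              # Storage subsystem
        self.log.append(f"snapshot {job}")
        return f"{job}.snap"

    def resume(self, job: str):                       # Process subsystem
        self.running = True
        self.log.append(f"resume {job}")


def checkpoint_cycle(cluster: Cluster, job: str) -> dict:
    """Run the cycle in slide order and return the checkpoint/snapshot link."""
    ckpt = cluster.halt_and_checkpoint(job)            # 1. checkpoint job and halt it
    cluster.save_in_transit_packets(job)               # 2. in-transit packets
    # 3. "Checkpoint completed!" would be reported to the RMS here
    snap = cluster.snapshot(job)                       # 4. snapshot
    record = {"checkpoint": ckpt, "snapshot": snap}    # 5. link to snapshot
    cluster.resume(job)                                # 6. resume job, 7. running again
    return record
```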
Cross-Border Migration: Intra Domain
Cross-Border Migration: Target Retrieval
Cross-Border Migration: Checkpoint Migration
Cross-Border Migration: Remote Execution
Cross-Border Migration: Result Migration
[Figure, repeated on each of these five slides: two sites side by side, each running the HPC4U stack (User/Broker interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Storage, Process and Network subsystems, Cluster)]
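The migration slides step through target retrieval, checkpoint migration, remote execution and result migration. As an illustration only, the ordering and the one point where migration can stop early can be written down as below; `migrate`, the site names and the `accepts` callback (standing in for SLA negotiation with a destination) are hypothetical, and the behavior when no destination accepts is an assumption, since the slides do not cover that case.

```python
from enum import Enum


class Step(Enum):
    TARGET_RETRIEVAL = "find a destination that accepts the job's SLA"
    CHECKPOINT_MIGRATION = "transfer checkpoint and snapshot to the destination"
    REMOTE_EXECUTION = "restart the job at the destination from the checkpoint"
    RESULT_MIGRATION = "transfer the results back to the source"


def migrate(job, sites, accepts):
    """Return (destination, executed steps) for migrating `job` to another site.

    `sites` lists candidate destinations; `accepts(site, job)` models the SLA
    negotiation. If no site accepts, only target retrieval has been performed.
    """
    steps = [Step.TARGET_RETRIEVAL]
    destination = next((s for s in sites if accepts(s, job)), None)
    if destination is None:
        return None, steps
    steps += [Step.CHECKPOINT_MIGRATION, Step.REMOTE_EXECUTION, Step.RESULT_MIGRATION]
    return destination, steps
```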
Cross-Border Migration: Using Globus
• WS-AG implementation based on GT4
• Developed in the EU project AssessGrid
• Source specifies SLA / file-staging parameters
• Subset of JSDL (POSIX jobs)
• Resource determination via broker
• Source directly contacts destination
• Destination pulls migration data via GridFTP
• Destination pushes result data back to source
• Source uses WSRF event notification
[Figure: the HPC4U stack (User/Broker interface, CLI, CRM, Negotiation, PP, RMS, Scheduler, SSC, Storage, Process, Network, Cluster) connected through WS-AG to a broker]
Ongoing Work: Introducing Risk Management
• Topic of the EU project AssessGrid
• Incorporated in the SLA
• Provider
  • estimates the risk of agreeing to an SLA
  • considers the probability of failure in the schedule
  • bases the assessment on historical data
[Figure: the stack from the previous slide, extended by Risk Assessor, Consultant Service and Monitoring components]
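The slide does not state how the probability of failure (PoF) is derived from historical data. One simple model, shown here as a stand-in and not as AssessGrid's method, assumes independent nodes with a constant failure rate estimated from observed failures per node-hour; a parallel job without further fault tolerance then fails if any of its nodes fails during its runtime.

```python
import math


def failure_rate(failures: int, node_hours: float) -> float:
    """Estimate a constant per-node failure rate (failures per node-hour)."""
    if node_hours <= 0:
        raise ValueError("node_hours must be positive")
    return failures / node_hours


def probability_of_failure(rate: float, nodes: int, hours: float) -> float:
    """PoF of a job, assuming independent, exponentially distributed node failures.

    The job fails if at least one of its `nodes` nodes fails within `hours`.
    """
    return 1.0 - math.exp(-rate * nodes * hours)
```

With 12 failures in 100,000 observed node-hours, a 16-node job running for 10 hours gets a PoF of 1 - exp(-0.0192), slightly below 2%.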
Summary: Best Effort Is Not Enough
Cross-border migration and risk assessment provide new means to increase the reliability of Grid computing.
More Information
• Read the paper
• AssessGrid: www.assessgrid.eu
• HPC4U: www.hpc4u.eu
• OpenCCS: www.openccs.eu
Thanks for your attention!
Backup Slides
Scheduling Aspects
• Execution time
  • Exact start time
  • Earliest start time, latest finish time
• User provides stage-in files by time X
• Provider keeps stage-out files until time Y
• Provisional reservations
• Job priorities
• Job suspension
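The execution-time variants differ in how much freedom they leave the planning-based scheduler. A minimal check of a planned start against them might look as follows; the field names and integer time units are assumptions, and an empty constraint is treated as "any start time (best effort)".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeConstraint:
    """Execution-time guarantee of an SLA (hypothetical fields, integer time units)."""
    exact_start: Optional[int] = None      # exact start time
    earliest_start: Optional[int] = None   # earliest start time
    latest_finish: Optional[int] = None    # latest finish time


def satisfies(c: TimeConstraint, start: int, runtime: int) -> bool:
    """Check whether a planned start time and runtime honour the constraint."""
    if c.exact_start is not None:
        return start == c.exact_start
    if c.earliest_start is not None and start < c.earliest_start:
        return False
    if c.latest_finish is not None and start + runtime > c.latest_finish:
        return False
    return True  # no constraint given: any start time (best effort)
```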
HPC4U
Motivation: Fault Tolerance
• Commercial Grid users need SLAs
• Providers are cautious about adoption
  • Reason: business-case risk
  • Missed deadlines due to system failures
  • Penalties to be paid
• Solution: prevention with fault tolerance
• Fault-tolerance mechanisms are available, but
  • application modification is mandatory
  • an overall solution (system software, process, storage, file system, network) is required
  • combination with Grid migration is missing
HPC4U Objective
• Software-only solution for an SLA-aware, fault-tolerant infrastructure, offering reliability and QoS, and acting as an active Grid component
• Key features
  • Definition and implementation of SLAs
  • Resource reservation for guaranteed QoS
  • Application-transparent fault tolerance
HPC4U: Concept
• SLA negotiation as an explicit statement of expectations and obligations in a business relationship between provider and customer
• Reservation of CPU, storage and network for the desired time interval
• Job start in a checkpointing environment
• In case of system failure: job migration or restart with respect to the SLA
Phases of Operation
[Figure: timeline from negotiation (ending in acceptance or rejection of the SLA) through pre-runtime, runtime (stage-in, computation, stage-out) and post-runtime; it also marks the negotiation time, the lifetime of the SLA, and the allocation of system resources]
• Negotiation of SLA
• Pre-runtime: configuration of resources
  • e.g. network, storage, compute nodes
• Runtime: stage-in, computation, stage-out
• Post-runtime: re-configuration
Phase: Pre-Runtime
• Task of the pre-runtime phase
  • Configuration of all allocated resources
  • Goal: fulfill the requirements of the SLA
• Reconfiguration affects all HPC4U elements
  • Resource Management System, e.g. configuration of assigned compute nodes
  • Storage subsystem, e.g. initialization of a new data partition
  • Network subsystem, e.g. configuration of the network infrastructure
Phase: Runtime
• Runtime phase = lifetime of the job in the system
  • adherence to the SLA has to be assured
  • FT mechanisms have to be utilized
• The phase consists of three distinct steps
  • Stage-in: transmission of required input data from the Grid customer to the compute resource
  • Computation: execution of the application
  • Stage-out: transmission of generated output data from the compute resource back to the Grid customer
Phase: Post-Runtime
• Task of the post-runtime phase: re-configuration of all resources
  • e.g. re-configuration of the network
  • e.g. deletion of checkpoint datasets
  • e.g. deletion of temporary data
• Counterpart to the pre-runtime phase
• Allocation of resources ends
  • Update of schedules in RMS and storage
  • Resources are available for new jobs
Process
Subsystems
• Process subsystem
  • checkpointing of network
  • cooperative checkpointing protocol (CCP)
• Network subsystem
  • checkpoint network state
• Storage subsystem
  • provision of storage
  • provision of snapshot
Storage
Storage Subsystem
[Figure: the Virtual Storage Manager (VSM) sits between the computing interface and the storage resources (VSM-SR interface) and maps a logical space (data layout strategies) onto the physical space of Storage Resource 1 and Storage Resource 2]
• Functionalities
  • Negotiates the storage part of the SLA
  • Provides storage capacity at a given QoS level
  • Provides FT mechanisms
• Requirement: manage multiple jobs running on the same SR
Data Container Concept
• Idea: create a storage environment for applications at a desired QoS level, abstracting from the physical devices
• Components:
[Figure: the job issues file I/O (read, write, open, …) to a file system inside the data container; the file system issues block I/O (read, write, ioctl) to a block address mapping onto a logical volume in logical space (data layout policies, e.g. simple striping); block I/O then reaches the physical devices of the storage resource]
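The block address mapping translates logical block addresses into device addresses according to the data layout policy. For the simple striping case named on the slide, a sketch could look like this; the stripe unit and device count are illustrative parameters, not values from HPC4U.

```python
def stripe(logical_block: int, devices: int, stripe_unit: int) -> tuple:
    """Map a logical block to (device index, physical block) under simple striping.

    Runs of `stripe_unit` consecutive logical blocks are placed round-robin
    on the devices.
    """
    if logical_block < 0 or devices < 1 or stripe_unit < 1:
        raise ValueError("invalid arguments")
    stripe_no, offset = divmod(logical_block, stripe_unit)
    device = stripe_no % devices
    physical = (stripe_no // devices) * stripe_unit + offset
    return device, physical
```

With two devices and a stripe unit of four blocks, logical blocks 0 to 3 land on device 0, blocks 4 to 7 on device 1, and block 8 wraps around to device 0 at physical block 4.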
Data Container Properties
Storage part of the SLA:
• Data container section
  • Size
  • File system type
  • Number of nodes that need to access the data container (private/shared)
• Performance section
  • Application I/O profile → benchmark
  • Bandwidth (in MB/s or IO/s)
  • Or default configuration
• Dependability section
  • Data redundancy type (within a cluster)
  • Snapshot needed or not
  • Data replication or not (between clusters)
• Job-specific section
  • Job's time to schedule and time to finish
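The four sections of the storage SLA map naturally onto a record type. The sketch below uses hypothetical field names and units (the slide does not give the HPC4U schema) and adds a small consistency check of the kind a VSM might run before negotiating.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataContainerSpec:
    """Storage part of an SLA; field names and units are illustrative only."""
    # Data container section
    size_gb: int
    fs_type: str
    shared: bool                           # several nodes (shared) or one node (private)
    # Performance section: a bandwidth figure, or None for the default configuration
    bandwidth_mb_s: Optional[float] = None
    # Dependability section
    redundancy: str = "none"               # data redundancy type within a cluster
    snapshot: bool = False
    replication: bool = False              # replication between clusters
    # Job-specific section
    start: int = 0                         # job's time to schedule
    finish: int = 0                        # job's time to finish

    def validate(self) -> list:
        """Return a list of problems; an empty list means the spec looks consistent."""
        problems = []
        if self.size_gb <= 0:
            problems.append("size must be positive")
        if self.finish <= self.start:
            problems.append("finish must be after start")
        if self.bandwidth_mb_s is not None and self.bandwidth_mb_s <= 0:
            problems.append("bandwidth must be positive")
        return problems
```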
Fault Tolerance Mechanisms
• RAID: tolerate the failure of one or more disks
• RAIN: tolerate the failure of one or more nodes
• Implementation: hardware or software
• Storage FT mechanisms rely on special data layouts
• Storage snapshot (software)
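Tolerating a single disk failure (RAID) or node failure (RAIN) rests on a redundant data layout. The slide does not say which layouts HPC4U uses; as an illustration of the principle, XOR parity, as in RAID levels 4 and 5, lets any one lost block be recomputed from the remaining blocks and the parity block.

```python
def xor_blocks(blocks) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("blocks must have equal size")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def parity(data_blocks) -> bytes:
    """Parity block stored on an extra disk (or, for RAIN, an extra node)."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity_block) -> bytes:
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])
```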
Data Container Snapshot
• Provides an instantaneous copy of data containers
• Technique used: copy-on-write (COW)
  • creates multiple copies of data without duplicating all the data blocks
• Combined with a checkpoint, it allows the application to restart from a previous running stage
• Impact on SR performance
  • taken into account at negotiation time
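Copy-on-write makes a snapshot instantaneous because nothing is copied when it is taken; a block's old content is saved only when that block is first overwritten afterwards. A minimal in-memory sketch of the technique follows; the `Volume` class is hypothetical, and it simplifies by saving the old content separately into each open snapshot.

```python
class Volume:
    """Block store with copy-on-write snapshots (illustrative, in memory)."""

    def __init__(self):
        self.blocks = {}      # block number -> current data
        self.snapshots = []   # per snapshot: block number -> data at snapshot time

    def snapshot(self) -> int:
        """Create a snapshot instantly; no data blocks are copied yet."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, n: int, data: bytes):
        """Before overwriting block n, preserve its old content in every
        snapshot that has not saved that block yet."""
        old = self.blocks.get(n)
        for snap in self.snapshots:
            if n not in snap:
                snap[n] = old
        self.blocks[n] = data

    def read_snapshot(self, sid: int, n: int):
        """Snapshot view: the preserved block if present, else the unchanged current block."""
        snap = self.snapshots[sid]
        return snap[n] if n in snap else self.blocks.get(n)
```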
Snapshot: Single-Node Job Restart after Node Failure
[Figure: after a node failure, the job is restarted on another node; both nodes access a storage resource with a redundant data layout]
1. Restore the job's data from the previous snapshot
2. Start the data container
3. Restore the job's state from the previous checkpoint
4. Job restart
Characteristics:
• The job is running on a single node
• The data container is private to that node
• The data container snapshot resides on the same storage resource
Interfaces with Other Components
[Figure: the storage subsystem; the VSM connects to the RMS via the VSM-RMS interface and to storage resources (SR) via the VSM-SR interface over the network (socket, RDMA, …); wrappers connect Exanodes data containers and classical storage arrays (SR_type1, SR_type2); labels: open source, proprietary, client-server, callbacks]
AssessGrid
Grid Fabric Layer with Risk Assessor
• Negotiation Manager
  • Agreement / Agreement Factory WS
  • checks whether an offer complies with the template
  • initiates file transfers
• Scheduler
  • creates tentative schedules for offers
• Risk Assessor
• Consultant Service
  • records data
• Monitoring
  • runtime behavior
Precautionary Fault Tolerance
• How many spare resources are available at execution time?
• Use of a planning-based scheduler
Estimating Risk for a Job Execution
• Use of a planning-based scheduler
• How much slack time is available for fault tolerance?
• How much effort should be undertaken for fault tolerance?
• What is the considered risk of resource failure?
[Figure: time line from the earliest start time to the latest finish time, divided into execution time and slack time]
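Following the figure, slack time = (latest finish time - earliest start time) - execution time. A short sketch of that arithmetic, plus a hypothetical recovery model that the slide does not specify: each failure is assumed to cost a fixed restart overhead plus the work lost since the last checkpoint, so the slack bounds how many failures the job can absorb without missing its latest finish time.

```python
import math


def slack_time(earliest_start: float, latest_finish: float, execution_time: float) -> float:
    """Time left for fault tolerance if the job starts at its earliest start time."""
    window = latest_finish - earliest_start
    if execution_time > window:
        raise ValueError("execution time exceeds the time window")
    return window - execution_time


def recoveries_possible(slack: float, restart_overhead: float, lost_work: float) -> int:
    """Failures the slack can absorb (hypothetical model: fixed cost per recovery)."""
    cost = restart_overhead + lost_work
    if cost <= 0:
        raise ValueError("cost per recovery must be positive")
    return math.floor(slack / cost)
```

A job with a 10-hour window and 6 hours of execution time has 4 hours of slack; at 0.5 hours restart overhead and 1.5 hours of lost work per failure, that covers two recoveries.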
Risk Assessment
• Estimate the risk of agreeing to an SLA
  • consider the risk of resource failure
  • estimate the risk for a job execution
  • initiate precautionary FT mechanisms
[Figure: scale from low risk through middle risk to high risk]
Risk Management at Job Execution
[Figure: events, risk assessment, risk management decisions and actions; factors shown: business model (price, penalty), weekend/holiday/workday, schedule (SLAs, best effort), redundancy measures]
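The slide names the inputs to risk management decisions (price and penalty from the business model, the schedule, available redundancy measures) but not the decision rule. One common way to weigh them, shown here as an assumption rather than AssessGrid's method, is to choose the option with the lowest expected cost: the measure's own cost plus the remaining probability of failure times the SLA penalty. `Measure` and its numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    name: str
    cost: float        # what the provider pays to apply the measure
    pof_after: float   # probability of failure once the measure is applied


def choose_measure(pof: float, penalty: float, measures: list) -> str:
    """Pick the option with the lowest expected cost (cost + PoF x penalty).

    "none" means running the job without additional redundancy.
    """
    best_name, best_cost = "none", pof * penalty
    for m in measures:
        expected = m.cost + m.pof_after * penalty
        if expected < best_cost:
            best_name, best_cost = m.name, expected
    return best_name
```

With a 1000 EUR penalty, a 10% PoF makes cheap checkpointing worthwhile, a 1% PoF does not justify any measure, and a very large penalty favors the more expensive spare node.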
Detection of Bottlenecks
• Consultant Service
  • Analysis of SLA violations
    • Estimated risk for the job
    • Planned FT mechanisms
    • Monitoring information (job, resources)
  • Data mining
    • Find connections between SLA violations
    • Detect weak points in the provider's infrastructure
WS-AG
Components
Implementation with Globus Toolkit 4
• Why Globus?
  • Utility: authentication, authorization, delegation, RFT, MDS, WS-Notification
  • Impact
• Problem 1: GRAM (Grid Resource Allocation and Management)
  • State machine, incl. file staging, delegation of credentials, RSL
  • Cannot be used: written for batch schedulers, not for planning schedulers
• Problem 2: deviations from the WS-AG specification
  • Different namespaces for WS-A, WS-RF
Implementation with Globus Toolkit 4
• Technical challenges
  • xs:anyType
    • Wrote custom serializers/deserializers
  • Substitution groups
    • Used in ItemConstraint (creation constraints)
    • Cannot be mapped to Java by Axis
    • Replaced by xs:anyType, used as a DOM tree
  • CreationConstraints
    • Namespace prefixes in XPaths are meaningless
    • Need for WSDL and interpretation of xs:all, xs:choice, and friends
Context
[Agreement structure: Context, Terms, Creation Constraints]
<wsag:Context>
  …
  <wsag:AgreementInitiator>
    <AG:DistinguishedName>/C=DE/O=…</AG:DistinguishedName>
  </wsag:AgreementInitiator>
  <wsag:AgreementResponder>EPR</…>
  <AG:ServiceUsers>
    <AG:ServiceUser>DN</…>
  </AG:ServiceUsers>
  …
</wsag:Context>
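The context identifies the agreement initiator and the service users by their certificate distinguished names. A small sketch of reading them with Python's standard ElementTree: the `wsag` URI is the WS-Agreement namespace, while the `AG` URI below is a placeholder because the AssessGrid namespace is not given on the slide, and the DNs in the sample are made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsag": "http://schemas.ggf.org/graap/2007/03/ws-agreement",
    "AG": "urn:example:assessgrid",  # placeholder: the real AssessGrid namespace is not shown
}

# Made-up sample context with hypothetical DNs
SAMPLE = """
<wsag:Context xmlns:wsag="http://schemas.ggf.org/graap/2007/03/ws-agreement"
              xmlns:AG="urn:example:assessgrid">
  <wsag:AgreementInitiator>
    <AG:DistinguishedName>/C=XX/O=Example/CN=Alice</AG:DistinguishedName>
  </wsag:AgreementInitiator>
  <AG:ServiceUsers>
    <AG:ServiceUser>/C=XX/O=Example/CN=Bob</AG:ServiceUser>
  </AG:ServiceUsers>
</wsag:Context>
"""


def parse_context(xml_text: str) -> dict:
    """Extract the initiator's DN and the DNs of all service users."""
    root = ET.fromstring(xml_text.strip())
    initiator = root.findtext("wsag:AgreementInitiator/AG:DistinguishedName", namespaces=NS)
    users = [u.text for u in root.findall("AG:ServiceUsers/AG:ServiceUser", NS)]
    return {"initiator": initiator, "service_users": users}
```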
Terms, SDTs
[Agreement structure: Context, Terms, Creation Constraints]
• Conjunction of terms
• Common structure of templates
  • WS-AG is too powerful/difficult to support fully
• Service Description Term (one)
  • assessgrid:ServiceDescription (extension of the abstract ServiceTermType)
    • jsdl:POSIXExecutable (executable, arguments, environment)
    • jsdl:Application (mis-)used for libraries
    • jsdl:Resources
    • jsdl:DataStaging (zero or more)
    • assessgrid:PoF (upper bound)
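The service description carries a subset of JSDL. The sketch below emits only the JSDL 1.0 part for a POSIX job; in JSDL 1.0 the executable, arguments and environment live in `jsdl-posix:POSIXApplication` inside `jsdl:Application`. The `assessgrid:ServiceDescription` wrapper and `assessgrid:PoF` are left out because their schema is not given here, the fixed `overwrite` creation flag is a simplification, and the staging URI in the test is made up.

```python
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
ET.register_namespace("jsdl", JSDL)
ET.register_namespace("jsdl-posix", POSIX)


def q(ns: str, tag: str) -> str:
    """Qualified name in ElementTree's {namespace}tag notation."""
    return f"{{{ns}}}{tag}"


def posix_job(executable, args=(), env=None, cpus=1, stage_in=()):
    """Build a JSDL job definition for a POSIX application.

    `stage_in` is a sequence of (file name, source URI) pairs.
    """
    root = ET.Element(q(JSDL, "JobDefinition"))
    desc = ET.SubElement(root, q(JSDL, "JobDescription"))
    app = ET.SubElement(ET.SubElement(desc, q(JSDL, "Application")), q(POSIX, "POSIXApplication"))
    ET.SubElement(app, q(POSIX, "Executable")).text = executable
    for a in args:
        ET.SubElement(app, q(POSIX, "Argument")).text = a
    for name, value in (env or {}).items():
        ET.SubElement(app, q(POSIX, "Environment"), name=name).text = value
    res = ET.SubElement(desc, q(JSDL, "Resources"))
    ET.SubElement(ET.SubElement(res, q(JSDL, "TotalCPUCount")), q(JSDL, "Exact")).text = str(cpus)
    for file_name, uri in stage_in:
        ds = ET.SubElement(desc, q(JSDL, "DataStaging"))
        ET.SubElement(ds, q(JSDL, "FileName")).text = file_name
        ET.SubElement(ds, q(JSDL, "CreationFlag")).text = "overwrite"
        ET.SubElement(ET.SubElement(ds, q(JSDL, "Source")), q(JSDL, "URI")).text = uri
    return ET.tostring(root, encoding="unicode")
```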
Terms, GuaranteeTerms
[Agreement structure: Context, Terms, Creation Constraints]
• No hierarchy, but two meta guarantees
  • ProviderFulfillsAllObligations, e.g. reward: 1000 EUR, penalty: 1000 EUR
  • ConsumerFulfillsAllObligations, e.g. reward: 0 EUR, penalty: 1000 EUR
  • The first violation is responsible for the failure
    • If there was no hardware problem, the user is at fault
• Other guarantees
  • Execution time
    • Any start time (best effort)
    • Exact start time
    • Earliest start time, latest finish time
  • User provides stage-in files by time X
  • Provider keeps stage-out files until time Y
[Figure labels: no timely execution; no stage-out]
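The rule that the first violation decides which meta guarantee failed can be sketched as follows. Treating every non-hardware violation as the user's fault follows the slide's simplification; the event names, the `settle` function and its return format are hypothetical.

```python
from typing import Optional


def settle(violations: list, provider_penalty: float, consumer_penalty: float) -> Optional[tuple]:
    """Decide which meta guarantee failed, based on the earliest violation.

    `violations` is a list of (time, kind) pairs; kind "hardware" counts against
    the provider, any other kind (e.g. late stage-in files) against the consumer.
    Returns (guilty party, penalty), or None if nothing was violated.
    """
    if not violations:
        return None
    _, kind = min(violations)  # earliest violation by time
    if kind == "hardware":
        return ("provider", provider_penalty)
    return ("consumer", consumer_penalty)
```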