Deploying a High Throughput Computing Cluster

Jim Basney and Miron Livny
Presented by Vishal Singh

Seminar Overview
I] Introduction
• Primary Goal of Condor
• Condor Overview
II] Challenges of deploying an HTC environment
• Layered Software Architecture
• Protocol flexibility
• Remote file access
• Checkpointing
III] System administration of an HTC environment
• Access policies
• Reliability
• System log file management
• Security
IV] Summary

Goals
Globus: The Globus project is developing fundamental technologies needed to build computational grids. Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations.
Condor: The goal of the Condor project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources.

Condor Overview
Three entities:
• Customer Agent: Manages a queue of application descriptions and sends resource requests to the matchmaker.
• Resource Agent: Implements the policies of the resource owner and sends resource offers to the matchmaker.
• Matchmaker: Finds matches between the resource requests and the resource offers and notifies the agents when a match is found.
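The interaction between the three entities above can be sketched in a few lines of Python. This is only an illustration of the matchmaking idea, not Condor's actual algorithm: the attribute names (memory_mb, arch), the machine names, and the first-fit strategy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ResourceOffer:
    # Sent by a resource agent. Field names are illustrative assumptions.
    machine: str
    memory_mb: int
    arch: str


@dataclass
class ResourceRequest:
    # Sent by a customer agent on behalf of one queued application.
    job_id: str
    min_memory_mb: int
    arch: str


def satisfies(offer: ResourceOffer, request: ResourceRequest) -> bool:
    """True when the offered machine meets the request's requirements."""
    return offer.arch == request.arch and offer.memory_mb >= request.min_memory_mb


def match(requests: List[ResourceRequest],
          offers: List[ResourceOffer]) -> List[Tuple[str, str]]:
    """Pair each request with the first unused compatible offer (first fit)."""
    free = list(offers)
    matches = []
    for req in requests:
        for offer in free:
            if satisfies(offer, req):
                matches.append((req.job_id, offer.machine))
                free.remove(offer)  # an offer is matched at most once
                break
    return matches


offers = [ResourceOffer("ws1", 64, "INTEL"), ResourceOffer("ws2", 128, "SPARC")]
requests = [
    ResourceRequest("job1", 96, "SPARC"),
    ResourceRequest("job2", 32, "INTEL"),
    ResourceRequest("job3", 32, "INTEL"),  # stays queued: no free INTEL machine left
]
matches = match(requests, offers)
```

In a real deployment the matchmaker would then notify both agents of each pair; unmatched requests such as job3 simply wait for a later matchmaking cycle.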

Four Primary Challenges
• Evolution of network protocols
• Remote file access
• Utilization of heterogeneous resources
• Utilization of non-dedicated resources

Layered Software Architecture
Reason: Portability of the HTC system
• Network API: provides both connection-oriented and connectionless, reliable and unreliable interfaces.
• Process management API: provides the ability to create, suspend, unsuspend, and kill a process.
• Workstation statistics API: reports the information necessary to
  1. implement the resource owner policies
  2. verify that customer application requirements are met.
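One way to picture the layering is an abstract interface that upper layers program against, with one implementation per platform underneath. The sketch below covers only the process management API; the class and function names are hypothetical, and the in-memory FakeProcessManager stands in for a real operating-system-specific layer.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProcessManager(ABC):
    """Hypothetical portable interface; one subclass per operating system."""

    @abstractmethod
    def create(self, command: List[str]) -> int:
        """Start a process and return an identifier for it."""

    @abstractmethod
    def suspend(self, pid: int) -> None:
        """Pause the process without losing its state."""

    @abstractmethod
    def unsuspend(self, pid: int) -> None:
        """Resume a previously suspended process."""

    @abstractmethod
    def kill(self, pid: int) -> None:
        """Terminate the process."""


class FakeProcessManager(ProcessManager):
    """In-memory stand-in so upper layers can be exercised without real processes."""

    def __init__(self) -> None:
        self._next_pid = 1
        self.states: Dict[int, str] = {}

    def create(self, command: List[str]) -> int:
        pid = self._next_pid
        self._next_pid += 1
        self.states[pid] = "running"
        return pid

    def suspend(self, pid: int) -> None:
        if self.states[pid] == "running":
            self.states[pid] = "suspended"

    def unsuspend(self, pid: int) -> None:
        if self.states[pid] == "suspended":
            self.states[pid] = "running"

    def kill(self, pid: int) -> None:
        self.states[pid] = "killed"


# Upper-layer code depends only on the interface, never on a platform.
def owner_returned(pm: ProcessManager, pid: int) -> None:
    pm.suspend(pid)


def owner_left(pm: ProcessManager, pid: int) -> None:
    pm.unsuspend(pid)


pm = FakeProcessManager()
pid = pm.create(["./simulate"])  # hypothetical application command
owner_returned(pm, pid)
state_while_owner_present = pm.states[pid]
owner_left(pm, pid)
state_after_owner_left = pm.states[pid]
pm.kill(pid)
final_state = pm.states[pid]
```

Porting to a new platform then means writing one new subclass, while the policy code in owner_returned and owner_left stays unchanged.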

[Figure: Layered resource management architecture in Condor]

Protocol Flexibility
Why? It is inconvenient to update the components of an HTC system frequently, so new features are often not deployed until a future major system upgrade.
• A general-purpose data format may help.
• [Figure: Example of protocol data format]
• Backward compatibility is ensured.
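The slide does not reproduce the data format itself, so the following is only a sketch of the general idea using JSON as a stand-in for a self-describing attribute list. The attribute names (machine, memory_mb, disk_mb) and the two decoder versions are assumptions. Old receivers ignore attributes they do not know, and new receivers supply defaults for attributes an old sender omits, so both directions keep working.

```python
import json


def encode(attrs: dict) -> bytes:
    """Serialize a message as named attributes rather than fixed positions."""
    return json.dumps(attrs, sort_keys=True).encode("utf-8")


def decode_v1(payload: bytes) -> dict:
    """Older component: knows only 'machine' and 'memory_mb'."""
    attrs = json.loads(payload.decode("utf-8"))
    # Unknown attributes (such as 'disk_mb') are simply not read.
    return {"machine": attrs["machine"], "memory_mb": attrs.get("memory_mb", 0)}


def decode_v2(payload: bytes) -> dict:
    """Newer component: also understands 'disk_mb'."""
    attrs = json.loads(payload.decode("utf-8"))
    return {
        "machine": attrs["machine"],
        "memory_mb": attrs.get("memory_mb", 0),
        # Default used when the sender predates this attribute.
        "disk_mb": attrs.get("disk_mb", 0),
    }


old_msg = encode({"machine": "ws1", "memory_mb": 64})
new_msg = encode({"machine": "ws1", "memory_mb": 64, "disk_mb": 500})

old_reads_new = decode_v1(new_msg)  # new sender, old receiver
new_reads_old = decode_v2(old_msg)  # old sender, new receiver
```

With a fixed binary layout, adding disk_mb would have required upgrading every component at once; with named attributes, components can be upgraded one at a time.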

Remote File Access
Three implementation options (the third is covered on the next slide)
• Distributed file system
  Guarantees an HTC application access to its data files from any workstation in the cluster.
  - Requires authentication of the customer application to the file system.
  - Privileges need to be assigned.
• Data file staging
  - Large data files result in high start-up and tear-down costs.

Remote File Access (cont.)
• Redirect file I/O system calls
  The HTC environment must interpose itself between the application and the operating system and service file system calls (system call interposition).
How?
• Link the application with an interposition library, or trap system calls through the operating system.
• The HTC environment invokes an RPC to service the call.
Benefits
- No file system requirements on the remote workstation.
Drawbacks
- Many high-latency operations reduce the performance of the application.
- Developing and maintaining a portable interposition system is difficult.
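The interposition idea can be mimicked at the library level in Python. This is a simplified sketch, not how an interposition library is actually built: the "RPC" is a direct function call, and HomeFileServer, InterposedFile, and the file path are hypothetical names. It does show both the benefit (the remote side needs no file system of its own) and the drawback (every read is a round trip).

```python
class HomeFileServer:
    """Runs where the data lives and services forwarded file operations."""

    def __init__(self, files):
        self._files = dict(files)  # path -> bytes
        self.calls = 0             # counts round trips, to expose the latency cost

    def handle(self, op, path, offset, length):
        self.calls += 1
        if op != "read":
            raise ValueError("unsupported operation: " + op)
        return self._files[path][offset:offset + length]


class InterposedFile:
    """Stands in for a local file object on the remote workstation."""

    def __init__(self, rpc, path):
        self._rpc = rpc    # callable that forwards the request home
        self._path = path
        self._pos = 0      # file offset is tracked on the application side

    def read(self, n):
        data = self._rpc("read", self._path, self._pos, n)
        self._pos += len(data)
        return data


server = HomeFileServer({"/home/user/input.dat": b"0123456789"})
f = InterposedFile(server.handle, "/home/user/input.dat")

first = f.read(4)
second = f.read(4)
third = f.read(4)          # only two bytes remain
round_trips = server.calls
```

Three small reads cost three round trips here; over a wide-area link, that per-call latency is exactly the performance drawback listed above.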

Checkpointing
What is a checkpoint? A snapshot of the state of an executing program.
Uses
• Provide reliability
• Enable preemptive-resume scheduling
Can be:
• Kernel-level checkpointing
  - Often not provided by workstation operating systems.
• User-level checkpointing
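A real user-level checkpoint saves the process image (stack, data segments, open file state). As a much simpler analogue, the sketch below has an application save its own state with pickle and resume from it after a simulated preemption. The function name, the stop_after parameter, and the file name are assumptions for illustration.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, stop_after=None):
    """Sum 0..total_steps-1, saving a checkpoint after every step.

    stop_after simulates preemption after that many steps in this run.
    Returns the final sum, or None if the run was preempted.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as fh:
            state = pickle.load(fh)      # resume from the last snapshot
    else:
        state = {"step": 0, "acc": 0}    # fresh start

    done_here = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_here >= stop_after:
            return None                  # preempted; the checkpoint stays on disk
        state["acc"] += state["step"]
        state["step"] += 1
        done_here += 1
        tmp = ckpt_path + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(state, fh)
        # Rename over the old file so a crash never leaves a half-written checkpoint.
        os.replace(tmp, ckpt_path)
    return state["acc"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "job.ckpt")

first_run = run(10, ckpt, stop_after=4)  # preempted after four steps
second_run = run(10, ckpt)               # resumes at step 4 and finishes
```

The second run does not repeat the first four steps, which is what makes preemptive-resume scheduling affordable on non-dedicated workstations.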

Progress
I] Introduction
• Primary Goal of Condor
• Condor Overview
II] Challenges of deploying an HTC environment
• Layered Software Architecture
• Protocol flexibility
• Remote file access
• Checkpointing
III] System administration of an HTC environment
• Access policies
• Reliability
• System log file management
• Security
IV] Summary

System Administration
The administrator has to answer to:
• Resource owners
  - Must enforce the access policies of the resource owners.
• Customers
  - Must ensure that customers receive valuable services from the HTC environment.
• Policy makers
  - Must demonstrate that the HTC environment is meeting its stated goals.

Access policies
• One method of policy specification is through expressions.
• A policy answers the question of who can use a resource and when.
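A policy expression can be thought of as a boolean function over the state of the machine and the identity of the requester. The sketch below is an illustration only: the attributes (keyboard_idle_s, load_avg, hour), the thresholds, and the user names are invented for the example and do not come from the slides or from Condor's own expression language.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    keyboard_idle_s: int   # seconds since the owner last touched the keyboard
    load_avg: float        # load caused by the owner's own processes
    hour: int              # local hour of day, 0-23


@dataclass
class Request:
    user: str


TRUSTED_USERS = {"alice", "bob"}  # hypothetical: the "who" part of the policy


def may_start(machine: MachineState, request: Request) -> bool:
    """Hypothetical owner policy answering who may use the machine and when."""
    idle = machine.keyboard_idle_s >= 15 * 60 and machine.load_avg < 0.3
    after_hours = machine.hour >= 18 or machine.hour < 8
    return request.user in TRUSTED_USERS and (idle or after_hours)


busy_daytime = MachineState(keyboard_idle_s=30, load_avg=1.2, hour=14)
idle_daytime = MachineState(keyboard_idle_s=3600, load_avg=0.05, hour=14)
busy_night = MachineState(keyboard_idle_s=30, load_avg=1.2, hour=23)

decisions = [
    may_start(busy_daytime, Request("alice")),    # owner is working
    may_start(idle_daytime, Request("alice")),    # machine is idle
    may_start(busy_night, Request("alice")),      # after hours
    may_start(idle_daytime, Request("mallory")),  # user not trusted
]
```

Because the policy is just an expression over named attributes, each resource owner can state a different rule without any change to the software that evaluates it.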

Access policies (cont.)
• Policies can be optimized for throughput, e.g.:
  - For low-bandwidth networks, a longer Vacate interval may be negotiated.
  - A ‘Vacate’ need not be attempted when the chances of a successful checkpoint are low.
• The administrator may steer matchmaking to utilize resources efficiently when network bandwidth is limited.
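The trade-off behind the Vacate interval is simple arithmetic: a checkpoint of a given size needs size divided by bandwidth seconds to leave the machine, and attempting it is only worthwhile if that fits in the interval. The sketch below assumes that model; the function names and the example sizes and bandwidths are hypothetical, and it ignores protocol overhead and competing traffic.

```python
def checkpoint_transfer_s(ckpt_mb: float, bandwidth_mb_per_s: float) -> float:
    """Seconds needed to move a checkpoint of ckpt_mb megabytes off the machine."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return ckpt_mb / bandwidth_mb_per_s


def should_attempt_vacate(ckpt_mb: float, bandwidth_mb_per_s: float,
                          vacate_interval_s: float) -> bool:
    """Attempt the checkpoint only if it can finish within the vacate interval."""
    return checkpoint_transfer_s(ckpt_mb, bandwidth_mb_per_s) <= vacate_interval_s


# A 64 MB checkpoint with a 120 s vacate interval (example numbers only).
fast_lan = should_attempt_vacate(64, 1.0, 120)     # 64 s needed
slow_link = should_attempt_vacate(64, 0.25, 120)   # 256 s needed

# Interval an owner on the slow link would have to grant for the vacate to succeed.
needed_interval_s = checkpoint_transfer_s(64, 0.25)
```

On the slow link the options are the two named above: negotiate a longer interval (at least 256 seconds in this example) or skip the vacate and accept the lost work.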

Reliability
Complications
• Distinguish between normal and abnormal terminations
• Choose the correct checkpoint to use for restart
• Decide when it is safe to restart the application
• The ‘problem of one bad node’ in HTC
Heuristically determine
- if an application fails consistently on different nodes
- if different applications fail on the same node
Implication: An HTC environment must be prepared for failures and must automate recovery from common failures.
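The two heuristics above can be sketched as a pass over a failure history. The record layout, the threshold of two, and the choice to discount failures of already-suspect applications when judging nodes are all assumptions made for this example, not a description of Condor's behavior.

```python
from collections import defaultdict


def diagnose(records, threshold=2):
    """Split failures into suspect applications and suspect nodes.

    records: iterable of (job, node, succeeded) tuples.
    A job is suspect if it failed on at least `threshold` distinct nodes.
    A node is suspect if at least `threshold` distinct non-suspect jobs failed on it.
    """
    records = list(records)

    nodes_by_job = defaultdict(set)
    for job, node, ok in records:
        if not ok:
            nodes_by_job[job].add(node)
    bad_jobs = {j for j, nodes in nodes_by_job.items() if len(nodes) >= threshold}

    jobs_by_node = defaultdict(set)
    for job, node, ok in records:
        # Failures of a suspect job say little about the node it ran on.
        if not ok and job not in bad_jobs:
            jobs_by_node[node].add(job)
    bad_nodes = {n for n, jobs in jobs_by_node.items() if len(jobs) >= threshold}

    return bad_jobs, bad_nodes


history = [
    ("jobA", "n1", False), ("jobA", "n2", False), ("jobA", "n3", False),
    ("jobB", "n4", False), ("jobC", "n4", False),
    ("jobB", "n2", True), ("jobC", "n1", True),
]
bad_jobs, bad_nodes = diagnose(history)
```

Here jobA fails everywhere and is flagged as a faulty application, while n4 fails jobs that succeed elsewhere and is flagged as the one bad node; an automated recovery step could then hold jobA and stop matching jobs to n4.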

Problem Diagnosis via System Logs
System logs are the primary tools for diagnosing system failures.
[Figure: HTC environment logs]

Monitoring and Accounting
The HTC environment provides system monitoring and accounting facilities to the administrator.
Observations:
1. Approximately 100 resources were added to the cluster during the month.
2. Resource availability followed a daily cyclic pattern, with more resources available for HTC during the night.
3. On average, more resources were available on weekends than on weekdays.

Security
An HTC environment is potentially vulnerable to:
• Resource attack
  - An unauthorized user gains access to a resource via the HTC environment.
  - An authorized user violates the resource owner’s access policy.
• Customer attack
  - A customer’s account or data files are compromised via the HTC environment.
Steps to be taken
- Protecting the resources requires an effective user authentication mechanism.
- The HTC environment must ensure that all resource agents are trustworthy.
- Unencrypted network streams and buffer-overflow attacks are potential vulnerabilities.

Summary
• The HTC software must be portable, reliable, and maintainable.
• A layered architecture with flexible network protocols provides such a framework.
• Remote file access and checkpointing allow HTC to utilize distributively owned, non-dedicated resources.
• Development and maintenance costs must be balanced.
• The HTC software must provide secure services with effective logging.

Conclusion
Deploying an HTC environment means efficiently managing all the complexities described for the three parties involved: resource owners, customers, and policy makers. It is not exotic scheduling algorithms and mechanisms that make an HTC environment successful, but an emphasis on usability, flexibility, reliability, and maintainability.
Condor website: http://www.cs.wisc.edu/condor
