Development and Implementation of the UniGrid Platform for Grid Computing in Taiwan

Taiwan UniGrid • Yeh-Ching Chung, Department of Computer Science, National Tsing Hua University, Hsin-Chu 300, Taiwan

Outline • Introduction • Portal • Broker and Scheduler • Resource Information Service • Storage Service • Applications • Conclusion

Introduction (1) • The purpose of grid computing is to integrate various resources within a large network environment. • The purpose of the UniGrid project is to build a platform for academic research using grid-related technologies in Taiwan.

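Introduction (2) • Nine institutes jointly develop the system • National Center for High-performance Computing (NCHC) • Department of Computer Science, National Tsing Hua University • Institute of Information Science, Academia Sinica • Department of Computer Science and Information Engineering, National Dong Hwa University • Department of Computer Science, Tunghai University • Department of Computer Science and Information Engineering, Chung Hua University • Department of Information Management, Providence University • Department of E-Commerce, Hsing Kuo University of Management • Department of Atmospheric Sciences, National Taiwan University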

Introduction (3) • All institutes that participate in the UniGrid project contribute some resources. • These resources can be used in collaboration for large scale applications.

Introduction (4) • System Architecture

Portal • The UniGrid portal provides an interface for UniGrid users to use the resources available in the UniGrid system. • Functionalities of the portal • System status monitoring • Single sign-on • User workflow management • Project information

System Status Monitoring (1) • UniGrid users can examine the status of system resources through the portal. • The portal gathers the current system information from the information service and presents this information to the users.

System Status Monitoring (2) • Screenshot of the system status monitoring web page

Single Sign-On (1) • Single sign-on is a mechanism whereby a single authentication permits a user to access all resources for which the user has access permission, without entering multiple passwords. • All user account information is kept in a database at the portal site. • When a user requests a service, the user's verification data is passed to that service. • The request is granted only if the identity is confirmed by the verification web service.
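The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not describe the token format or how the verification web service is called, so the classes VerificationService and GridService, the HMAC-signed token, and the in-memory account table are all hypothetical stand-ins.

```python
import hashlib
import hmac
import secrets


class VerificationService:
    """Hypothetical stand-in for the portal-side verification web service."""

    def __init__(self):
        self._secret = secrets.token_bytes(32)  # signing key known only to this service
        self._accounts = {}                     # username -> (salt, password hash)

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def _sign(self, username):
        return hmac.new(self._secret, username.encode(), hashlib.sha256).hexdigest()

    def register(self, username, password):
        salt = secrets.token_bytes(16)
        self._accounts[username] = (salt, self._hash(password, salt))

    def sign_on(self, username, password):
        """Authenticate once; return a signed token, or None on failure."""
        record = self._accounts.get(username)
        if record is None:
            return None
        salt, stored = record
        if not hmac.compare_digest(stored, self._hash(password, salt)):
            return None
        return f"{username}:{self._sign(username)}"

    def verify(self, token):
        """Return the username carried by a valid token, else None."""
        username, sep, signature = token.rpartition(":")
        if not sep or not username:
            return None
        expected = self._sign(username)
        if hmac.compare_digest(signature.encode(), expected.encode()):
            return username
        return None


class GridService:
    """Hypothetical grid service that defers identity checks to the verifier."""

    def __init__(self, name, verifier, allowed_users):
        self.name = name
        self.verifier = verifier
        self.allowed_users = set(allowed_users)

    def handle(self, token, request):
        user = self.verifier.verify(token)
        if user is None or user not in self.allowed_users:
            return f"{self.name}: denied"
        return f"{self.name}: {request} accepted for {user}"
```

A user would call sign_on once and then present the same token to every GridService instance; each service grants the request only after verify confirms the identity and the user appears in that service's permission list.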

Single Sign-On (2) • User identity verification through single sign-on service

User Workflow Management (1) • A UniGrid user can design and save his own workflows at the UniGrid portal. • A user can select any workflow he designed and execute the workflow through the UniGrid portal. • A user can also monitor the status of his workflow through the UniGrid portal.

User Workflow Management (2) • Structure of a workflow [Figure labels: workflow, parallel execution, sequential execution]
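One plausible reading of the workflow structure, sketched in Python: stages run one after another, and the jobs inside a stage run side by side. The figure itself is not reproduced here, so this mapping of "sequential" to stages and "parallel" to jobs is an assumption, and run_workflow and run_job are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(stages, run_job):
    """Run stages in order; run the jobs of each stage concurrently.

    stages:  list of stages, each a list of job descriptions
    run_job: callable that executes one job and returns its result
    """
    results = []
    for stage in stages:
        # The next stage starts only after every job of this stage has finished.
        with ThreadPoolExecutor(max_workers=max(1, len(stage))) as pool:
            results.append(list(pool.map(run_job, stage)))
    return results
```

For example, run_workflow([["a", "b"], ["c"]], str.upper) runs "a" and "b" concurrently, waits for both, then runs "c".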

User Workflow Management (3) • The workflows of each user are stored in the portal storage in XML format.
<flow name="testflow" numstages="3">
  <stage name="stage1" numjobs="1">
    <job id="0">
      <sortkey>1</sortkey>
      <runtype>mpi</runtype>
      <workdir>/home/test/</workdir>
      <filename>mm_mpi</filename>
      <runrp>true</runrp>
      <datafile/>
      <argu>256</argu>
      <otherurl/>
      <cpuno>4</cpuno>
    </job>
  </stage>
  …
</flow>
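A workflow file in this layout can be read with Python's standard xml.etree.ElementTree module, as in the sketch below. The sample is shortened to the single stage shown on the slide (the remaining stages are elided there), so numstages is set to 1 here. How the fields are interpreted is an assumption: the slides do not define them, and joining workdir and filename into one command path is a guess made for illustration. The function name parse_workflow is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample: only the one stage that the slide spells out.
WORKFLOW_XML = """
<flow name="testflow" numstages="1">
  <stage name="stage1" numjobs="1">
    <job id="0">
      <sortkey>1</sortkey>
      <runtype>mpi</runtype>
      <workdir>/home/test/</workdir>
      <filename>mm_mpi</filename>
      <runrp>true</runrp>
      <datafile/>
      <argu>256</argu>
      <otherurl/>
      <cpuno>4</cpuno>
    </job>
  </stage>
</flow>
"""


def parse_workflow(xml_text):
    """Turn a workflow document into plain dictionaries and lists."""
    root = ET.fromstring(xml_text.strip())
    stages = []
    for stage in root.findall("stage"):
        jobs = []
        for job in stage.findall("job"):
            jobs.append({
                "id": job.get("id"),
                "runtype": job.findtext("runtype"),
                # Assumption: the executable lives at workdir + filename.
                "command": job.findtext("workdir", "") + job.findtext("filename", ""),
                "args": job.findtext("argu") or "",
                "cpus": int(job.findtext("cpuno", "1")),
            })
        stages.append({"name": stage.get("name"), "jobs": jobs})
    return {"name": root.get("name"), "stages": stages}
```

Calling parse_workflow(WORKFLOW_XML) yields one stage named stage1 holding one MPI job that requests 4 CPUs.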

User Workflow Management (4) • Screenshot of the workflow editing web page

User Workflow Management (5) • When a user submits a workflow, the portal passes the selected workflow information to the broker. • Upon receiving an execution request, the resource broker finds the resources required for that workflow and schedules its execution.

User Workflow Management (6)

User Workflow Management (7) • Users can examine the execution status of their workflows through the portal’s workflow monitoring system. • All workflow execution information is stored in a database on the machine where the resource broker is installed. • The portal queries the database and obtains the current status of a particular workflow. • The status information is processed and presented in the form of web pages.

User Workflow Management (8) • Screenshot of the workflow monitoring web page

User Workflow Management (9) • Screenshot of the UniGrid workflow management web page

Broker & Scheduler (1) • The broker provides a uniform interface to access available resources in the UniGrid system. • The broker uses the resource information service to obtain the current status of the resources in the system. • After this information is gathered, the broker allocates the resources that meet the requirements of the current job. • The jobs are then passed to the corresponding local schedulers to be executed locally.
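The allocation step can be illustrated with a small Python sketch. It assumes a very simple resource model (free CPUs and memory per site) and a "most free CPUs wins" tie-break; the slides specify neither, so Site, Job, allocate, and dispatch are hypothetical names and the selection rule is a placeholder for whatever policy the real broker applies.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_cpus: int
    memory_mb: int


@dataclass
class Job:
    name: str
    cpus: int
    memory_mb: int


def allocate(job, sites):
    """Pick a site that meets the job's requirements, or None if none does."""
    candidates = [s for s in sites
                  if s.free_cpus >= job.cpus and s.memory_mb >= job.memory_mb]
    if not candidates:
        return None
    # Assumed policy: prefer the site with the most free CPUs.
    return max(candidates, key=lambda s: s.free_cpus)


def dispatch(jobs, sites):
    """Map each job name to the site whose local scheduler would receive it."""
    plan = {}
    for job in jobs:
        site = allocate(job, sites)
        if site is None:
            plan[job.name] = None       # no site currently satisfies the job
            continue
        site.free_cpus -= job.cpus      # reserve the CPUs for this job
        plan[job.name] = site.name
    return plan
```

In a real deployment the Site values would come from the resource information service, and the plan entries would be handed to each site's local scheduler.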

Broker & Scheduler (2) • Broker workflow

Broker & Scheduler (3) • Each participating organization has a local scheduler (Condor) installed to schedule the jobs assigned to that organization. • Condor • A scheduler for large collections of distributively owned computing resources • Developed by researchers at the University of Wisconsin-Madison • Specialized for compute-intensive jobs • Uses the “ClassAd” mechanism to match job requirements to machine status and schedules the jobs according to the matching results
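The idea behind ClassAd matchmaking can be sketched in Python: both the job and the machine advertise attributes plus a requirements expression, a match needs both expressions to hold, and the job's rank orders the acceptable machines. This is a conceptual model, not Condor's ClassAd language or API. The attribute names (OpSys, Memory, KFlops, Owner) mirror common Condor attributes, but the values, the use of Python callables for expressions, and the function names are assumptions.

```python
def matches(job_ad, machine_ad):
    """A match requires the job to accept the machine and the machine to accept the job."""
    return (job_ad["requirements"](machine_ad)
            and machine_ad["requirements"](job_ad))


def best_match(job_ad, machine_ads):
    """Return the matching machine ranked highest by the job, or None."""
    candidates = [m for m in machine_ads if matches(job_ad, m)]
    if not candidates:
        return None
    return max(candidates, key=job_ad["rank"])


# Illustrative ads with invented values.
JOB = {
    "Owner": "alice",
    "requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 512,
    "rank": lambda m: m["KFlops"],          # prefer faster machines
}

MACHINES = [
    {"Name": "node1", "OpSys": "LINUX", "Memory": 1024, "KFlops": 500,
     "requirements": lambda job: True},
    {"Name": "node2", "OpSys": "LINUX", "Memory": 2048, "KFlops": 900,
     "requirements": lambda job: job["Owner"] == "bob"},   # owner policy excludes alice
    {"Name": "node3", "OpSys": "LINUX", "Memory": 256, "KFlops": 950,
     "requirements": lambda job: True},                    # too little memory
    {"Name": "node4", "OpSys": "LINUX", "Memory": 512, "KFlops": 300,
     "requirements": lambda job: True},
]
```

Here best_match(JOB, MACHINES) selects node1: node2 is faster but its owner policy rejects the job, and node3 fails the job's memory requirement.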

Related Research (1) • Tools have been developed to simulate different load sharing and scheduling policies on computing grid and analyze their performance • Queuing methods • Independent clusters • Multiple queues • Forwarding to no-need-to-wait site • Forwarding to shortest-queue site • Forwarding to least-load site, load=

Related Research (2) • Queuing methods (cont’d.) • Single queue • Multi-pool centralized queue • Single-pool centralized queue • One big cluster • Two-level scheduling • Empty queue only • Shortest queue first • Least load first • Two-level local queues • Forwarding to shortest-queue site

Related Research (3) • Scheduling policies • Non-FCFS • Multi-pool centralized queue • Single-pool centralized queue • FCFS • Two-level scheduling • The performance of Non-FCFS is three times better than FCFS

Related Research (4) • Implementation Approaches • Multi-Pool Centralized Queue • Global queue scheduling in the broker, no local queuing system • Global queue scheduling in the broker, with processor availability confirmed through the local queuing system • Single-Pool Centralized Queue • Global queue scheduling in the broker, no local queuing system

Related Research (5) • Two-Level Scheduling (Empty-Queue-Only Multi-Pool Grid) • Global queue in the broker, local queues in the local queuing systems

Related Research (6) • Simulation results

Related Research (7) • Simulation results (cont’d.)

Related Research (8) • Discussion • Non-FCFS methods can effectively improve the overall system utilization and performance. • The smallest-first non-FCFS policy outperforms all other policies in terms of waiting time and waiting ratio. • As far as the worst case is concerned, the backfilling policy is superior because it does not allow jobs to be delayed by backfilling activities.
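The difference between FCFS and a smallest-first policy can be seen in a toy simulation. The Python sketch below is not the simulator used in the study: it assumes a single pool, that all jobs are already queued at time 0, and that "smallest" means fewest processors requested. The function name simulate and the workload are invented for illustration, and the numbers say nothing about the magnitudes reported in the slides.

```python
import heapq


def simulate(jobs, total_procs, smallest_first=False):
    """Return the average waiting time for jobs queued at time 0.

    jobs: list of (procs, runtime) tuples in submission order
    """
    if not jobs:
        return 0.0
    # Smallest-first reorders the queue by processor count; FCFS keeps it as is.
    queue = sorted(jobs, key=lambda j: j[0]) if smallest_first else list(jobs)
    running = []                       # min-heap of (finish_time, procs)
    free, now, total_wait = total_procs, 0.0, 0.0
    for procs, runtime in queue:
        if procs > total_procs:
            raise ValueError("job requests more processors than the pool has")
        # The head of the queue blocks until enough processors are released.
        while free < procs:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        total_wait += now              # submitted at 0, so wait equals start time
        free -= procs
        heapq.heappush(running, (now + runtime, procs))
    return total_wait / len(jobs)
```

With the workload [(4, 10), (1, 2), (1, 2), (2, 3)] on 4 processors, FCFS makes the three small jobs wait behind the large one, while smallest-first starts them immediately and delays only the large job.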

Outline • Introduction • Portal • Broker & Scheduler • Resource Information Service • Storage Service • Applications • Conclusion

Resource Information Services • The resource information service provides information about current resource status; this information can be used by other services of the system. • Functionalities of the resource information service • Information system • Performance visualization of MPI parallel program execution

Information System (1) • Provides an interface for other services to query various information about computing nodes • The statistics about the individual nodes are obtained using MDS (Monitoring & Discovery Service) provided by the Globus Toolkit • The current network status between machines is gathered using NWS (Network Weather Service) • Automatic update of node information • When a computing node is added or removed

Information System (2) • The Network Weather Service (NWS) • A distributed system that periodically monitors and dynamically forecasts the performance that various network and computational resources can deliver over a given time interval • Developed by researchers at UCSB • It uses numerical models to generate forecasts of what the conditions will be for a given time frame • Because this functionality is analogous to weather forecasting, the system is called the Network Weather Service
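To make "numerical models to generate forecasts" concrete, here is a minimal Python sketch of one way to forecast from a measurement history: run several simple predictors over the past data and report the prediction of whichever has accumulated the lowest error. The three predictors (last value, sliding mean, sliding median), the window size, and the function name forecast are assumptions chosen for illustration; the slides do not say which models NWS actually uses.

```python
from statistics import mean, median

# Candidate predictors: each maps a non-empty window of past measurements to a guess.
FORECASTERS = {
    "last": lambda h: h[-1],
    "mean": lambda h: mean(h),
    "median": lambda h: median(h),
}


def forecast(history, window=5):
    """Return (name, prediction) from the predictor with the lowest cumulative absolute error."""
    if not history:
        raise ValueError("history must contain at least one measurement")
    errors = {name: 0.0 for name in FORECASTERS}
    # Replay the history: predict each measurement from the ones before it.
    for i in range(1, len(history)):
        past = history[max(0, i - window):i]
        for name, predict in FORECASTERS.items():
            errors[name] += abs(predict(past) - history[i])
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](history[-window:])
```

On an oscillating bandwidth trace such as [10, 20, 10, 20, 10, 20] with window=2, the "last value" predictor is wrong at every step, so an averaging predictor is selected instead.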

Information System (3)

Information System (4) • Screenshot of the node status webpage

Performance Visualization of MPI Programs (1) • Input: any application (depending on the availability of a compiler on the grid platform) • Output: performance visualization of the execution of this application

Performance Visualization of MPI Programs (2) • Execution of a Parallel Application using 4 computing nodes

Related Research (1) • Communication localization & data partitioning techniques in cluster-based grid system • Localized communication enhances performance of parallel applications on grid • Adaptive data partitioning for identical cluster & non-identical cluster grid topology • In-core & out-of-core applications

Related Research (2) • Communication localization techniques for identical clusters [Figure labels: original communication patterns, localized communication patterns]

Related Research (3) • Communication localization techniques for non-identical clusters [Figure label: original communication table]

Related Research (4) • Communication localization techniques for non-identical clusters (cont’d.) [Figure label: localized communication table]
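What "localized" means can be shown with a short Python sketch that counts how many messages cross cluster boundaries under two different process-to-cluster mappings. This only illustrates the goal, not the localization technique itself, which the slides present as figures. The message list, the mappings, and the name count_remote are invented for the example.

```python
def count_remote(messages, cluster_of):
    """Count messages whose sender and receiver sit in different clusters."""
    return sum(1 for src, dst in messages if cluster_of[src] != cluster_of[dst])


# Four processes exchanging data in pairs (0 with 2, 1 with 3).
MESSAGES = [(0, 2), (2, 0), (1, 3), (3, 1)]

# Original placement: consecutive ranks share a cluster.
ORIGINAL = {0: "A", 1: "A", 2: "B", 3: "B"}

# Localized placement: communicating partners share a cluster.
LOCALIZED = {0: "A", 2: "A", 1: "B", 3: "B"}
```

With these invented inputs, every message crosses clusters under ORIGINAL, while none does under LOCALIZED, which is the effect a localization scheme aims for on the slower inter-cluster links.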

Storage Service • The goal of the storage service is to provide a collaborative space where UniGrid users can share their data and resources with others. • Components of the storage service • Virtual storage system • Data management system

Virtual Storage System (1) • Virtual storage system architecture

Virtual Storage System (2) • The virtual storage system is implemented in Java as a web service • UniGrid services access the virtual storage system when they need to fetch or modify users’ data files • A client program is available for users to manage their own storage space • The files are stored on a master file server, and replicas of the files are distributed to other machines
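The master-plus-replicas arrangement can be modeled in a few lines. The real system is a Java web service; the Python sketch below is only a conceptual model under assumptions that the slides do not confirm: in-memory dictionaries stand in for file servers, each file gets a fixed number of replicas, and replicas go to the least-loaded machines. The class name VirtualStorage and its methods are hypothetical.

```python
class VirtualStorage:
    """Toy model: one master store plus replica stores on other machines."""

    def __init__(self, replica_names, copies=2):
        self.master = {}                                    # path -> data
        self.replicas = {name: {} for name in replica_names}
        self.copies = min(copies, len(self.replicas))

    def put(self, path, data):
        """Write to the master, then place replicas; return the replica names used."""
        self.master[path] = data
        for store in self.replicas.values():
            store.pop(path, None)                           # drop stale copies first
        # Assumed placement rule: machines holding the fewest files, ties by name.
        targets = sorted(self.replicas,
                         key=lambda n: (len(self.replicas[n]), n))[:self.copies]
        for name in targets:
            self.replicas[name][path] = data
        return targets

    def get(self, path, master_up=True):
        """Read from the master, or from any replica if the master is unavailable."""
        if master_up:
            return self.master[path]
        for store in self.replicas.values():
            if path in store:
                return store[path]
        raise KeyError(path)
```

For example, after storage = VirtualStorage(["r1", "r2", "r3"]) and storage.put("/a", b"x"), the call storage.get("/a", master_up=False) still returns the data from a replica.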

Virtual Storage System (3)
