Public Computing - Challenges and Solutions Yi Pan Professor and Chair of CS Professor of CIS Georgia State University Atlanta, Georgia, USA AINA 2007 May 21, 2007
Outline • What is Grid Computing? • Virtual Organizations • Types of Grids • Grid Components • Applications • Grid Issues • Conclusions
Outline (continued) • Public Computing and the BOINC Architecture • Motivation for New Scheduling Strategies • Scheduling Algorithms • Testing Environment and Experiments • MD4 Password Hash Search • Avalanche Photodiode Gain and Impulse Response • Gene Sequence Alignment • Peer to Peer Model and Experiments • Conclusion and Future Research
What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed • Standards allow for transportation of power • Standards define interface with grid • Non-trivial overhead of managing movement and storage of power • Economies of scale compensate for this overhead allowing for cheap, accessible power
A Computational “Power Grid” • Goal is to make computation a utility • Computational power, data services, peripherals (Graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way • Standards allow for transportation of these services • Standards define interface with grid • Architecture provides for management of resources and controlling access • Large amounts of computing power should be accessible from anywhere in the grid
Virtual Organizations • Independent organizations come together to pool grid resources • Component organizations could be different research institutions, departments within a company, individuals donating computing time, or anything with resources • Formation of the VO should define participation levels, resources provided, expectations of resource use, accountability, economic issues such as charge for resources • Goal is to allow users to exploit resources throughout the VO transparently and efficiently
Types of Grids • Computational Grid • Data Grid • Scavenging Grid • Peer-to-Peer • Public Computing
Computational Grids • Traditionally used to connect high performance computers between organizations • Increases utilization of geographically dispersed computational resources • Provides more parallel computational power to individual applications than is feasible for a single organization • Most traditional grid projects concentrate on these types of grids • Examples: Globus and OGSA
Data Grids • Distributed data sources • Queries of distributed data • Sharing of storage and data management resources • The D0 Particle Physics Data Grid allows access to both compute and data resources for huge amounts of physics data • Google
Scavenging Grids • Harness idle cycles on systems especially user workstations • Parallel application must be quite granular to take advantage of large amounts of weak computing power • Grid system must support terminating and restarting work when systems cease idling • Condor system from University of Wisconsin
Peer-to-Peer • Converging technology with traditional grids • In contrast with traditional grids, has little central infrastructure and high fault tolerance • Highly scalable for participation but difficult to locate and monitor resources • Current P2P systems like Gnutella, Freenet, and FastTrack concentrate on data services
Public Computing • Also converging with grid computing • Often communicates through a central server in contrast with peer-to-peer technologies • Again scalable with participation • Adds even greater impact of multiple administrative domains as participants are often untrusted and unaccountable
Public Computing Examples • SETI@Home (http://setiathome.ssl.berkeley.edu/) – Search for Extraterrestrial Intelligence in radio telescope data, a distributed network computing project (UC Berkeley) • Has more than 5 million participants • “The most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million. SETI@home currently gets about 15 TeraFLOPS and has cost $500K so far.”
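To make the comparison in the quote concrete, a rough cost-per-TeraFLOPS calculation using only the figures quoted above (a back-of-the-envelope sketch, not an exact benchmark):

asci_white_cost_per_tflops = 110_000_000 / 12   # about $9.2M per TeraFLOPS
seti_cost_per_tflops = 500_000 / 15             # about $33K per TeraFLOPS
print(asci_white_cost_per_tflops / seti_cost_per_tflops)  # roughly 275x cheaper per TeraFLOPS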
More Public Computing Examples • Folding@Home project (http://folding.stanford.edu) for molecular simulation aimed at new drug discovery • Distributed.net (http://distributed.net) for cracking the 64-bit RC5 encryption challenge – used more than 300,000 nodes over 1,757 days
Grid Components • Authentication and Authorization • Resource Information Service • Monitoring • Scheduler • Fault Tolerance • Communication Infrastructure
Authentication and Authorization • Important for allowing users to cross the administrative boundaries in a virtual organization • System security for jobs outside the administrative domain currently rudimentary • Work being done on sandboxing, better job control, development environments
Resource Information Service • Used in resource discovery • Leverages existing technologies such as LDAP, UDDI • Information service must be able to report very current availability and load data • Balanced with overhead of updating data
Monitoring • Raw performance characteristics are not the only measurement of resource performance • Current and expected loads can have a tremendous impact • Balance between accurate performance data and additional overhead of monitoring systems and tracking that data
Scheduler • Owners of systems interested in maximizing throughput • Users interested in maximizing runtime performance • Both offer challenges with crossing administrative boundaries • Unique issues such as co-allocation and co-location • Interesting work being done in scheduling, such as market-based scheduling
Fault Tolerance • More work exploring fault tolerance in grid systems leveraging peer-to-peer and public computing research • Multiple administrative domains in VO challenge the reliability of resources • Faults can refer not only to resource failure but violation of service level agreements (SLA) • Impact on fault tolerance if there is no accountability for failure
Communication Infrastructure • Currently most grids have robust communication infrastructure • As more grids are deployed and used, more attention must be paid to network QoS and reservation • Most large applications are currently data rich • P2P and Public Computing have experience in communication-poor environments
Applications • Embarrassingly parallel, data poor applications in the case of pooling large amounts of weak computing power • Huge data-intensive, data rich applications that can take advantage of multiple, parallel supercomputers • Application specific grids like Cactus and Nimrod
Grid Issues • Site autonomy • Heterogeneous resources • Co-allocation • Metrics for resource allocation • Language for utilizing grids • Reliability
Site autonomy • Each component of the grid could be administered by an individual organization participating in the VO • Each administrative domain has its own policies and procedures surrounding its resources • Most scheduling and resource management work must be distributed to support this
Heterogeneous resources • Grid resources will have not only heterogeneous platforms but heterogeneous workloads • Applications truly exploiting grid resources will need to scale from idle cycles on workstations to clusters and huge vector-based HPC systems • Not only computational power, but also storage, peripherals, reservability, availability, and network connectivity
Co-allocation • Unique challenges of reserving multiple resources across administrative domains • Capabilities of resource management may be different for each component of a composite resource • Failure of allocating components must be handled in a transaction-like manner • Acceptable substitute components may assist in co-allocating a composite resource
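A minimal sketch of the transaction-like handling described above: reserve every component of a composite resource, or roll back all reservations made so far. The Resource class and its reserve/release methods are illustrative assumptions, not part of any real grid middleware.

class Resource:
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.reserved = False

    def reserve(self):
        # A real implementation would contact the resource's local manager,
        # which may refuse based on its own site policies or current load.
        if self.available:
            self.reserved = True
        return self.reserved

    def release(self):
        self.reserved = False


def co_allocate(components):
    # Reserve all components or none (transaction-like semantics).
    acquired = []
    for resource in components:
        if resource.reserve():
            acquired.append(resource)
        else:
            # One component failed: roll back every reservation already held.
            for held in acquired:
                held.release()
            return False
    return True


# Example: the unavailable storage element causes the whole request to fail,
# and the already-reserved cluster is released again.
print(co_allocate([Resource("cluster"), Resource("storage", available=False)]))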
Metrics for resource allocation • Different scheduling approaches measure performance differently • Historical performance • Throughput • Storage • Network connectivity • Cost • Application specific performance • Service level
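One simple way to combine such heterogeneous metrics is a weighted score per candidate resource; the metric names and weights below are purely illustrative assumptions, not a prescribed model.

def allocation_score(metrics, weights):
    # Weighted sum over normalized metric values; higher means more attractive.
    return sum(weights[name] * metrics[name] for name in weights)

weights = {"throughput": 2.0, "historical_performance": 1.0,
           "network_connectivity": 1.5, "cost": -1.0}   # higher cost lowers the score
candidate = {"throughput": 0.8, "historical_performance": 0.6,
             "network_connectivity": 0.9, "cost": 0.4}
print(allocation_score(candidate, weights))              # 3.15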
Language for utilizing grids • Much of the work in grids is protocol or language work • Expressive languages needed for negotiating service level, reporting performance or resource capabilities, security, and reserving resources • Protocol work in authentication and authorization, data transfer, and job management
Summary about Grids • Grids offer tremendous computation and data storage resources not available in single systems or single clusters • Application and algorithm design and deployment still either rudimentary or application specific • Universal infrastructure still in development • Unique challenges still unsolved especially in regard to fault tolerance and multiple administrative domains
Public Computing • Aggregates idle workstations connected to the Internet for performing large scale computations • Initially seen in volunteer projects such as Distributed.net and SETI@home • Volunteer computers periodically download work from a project server and complete the work during idle periods • Currently used in projects that have large workloads on the scale of months or years with trivially parallelizable tasks
BOINC Architecture • Berkeley Open Infrastructure for Network Computing • Developed as a generic public computing framework • Next generation architecture for the SETI@home project • Open source and encourages use in other public computing projects
BOINC lets you donate computing power to the following projects • Climateprediction.net: study climate change • Einstein@home: search for gravitational signals emitted by pulsars • LHC@home: improve the design of the CERN LHC particle accelerator • Predictor@home: investigate protein-related diseases • SETI@home: look for radio evidence of extraterrestrial life • Cell Computing: biomedical research (Japanese; requires nonstandard client software)
Motivation for New Scheduling Strategies • Many projects require large-scale computational resources but are not at the scale of current public computing projects • Grid and cluster scale projects are very popular in many scientific computing areas • Current public computing scheduling does not scale down to these smaller projects
Motivation for New Scheduling Strategies • Grid scale scheduling for public computing would make public computers a viable alternative or complementary resource to grid systems • Public computing has the potential to offer a tremendous amount of computing resources from idle systems of organizations or volunteers • Scavenging grid projects such as Condor indicate interest in harnessing these resources in the grid research community
Scheduling Algorithms • Current BOINC scheduling algorithm • New scheduling algorithms • First Come, First Serve with target workload of 1 workunit (FCFS-1) • First Come, First Serve with target workload of 5 workunits (FCFS-5) • Ant Colony Scheduling Algorithm
BOINC Scheduling • Originally designed for “unlimited” work • Clients can request as much work as desired up to a specified limit • Smaller, limited computational jobs face the challenge of requiring more accurate scheduling • Too many workunits assigned to a node leads to either redundant computation by other nodes or exhaustion of available workunits • Too few workunits assigned leads to increased communication overhead
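A minimal sketch of the request-driven behavior described above, where the server grants whatever the client asks for up to a fixed cap; the names and cap value are assumptions for illustration, not the actual BOINC server code.

MAX_WORKUNITS_PER_REQUEST = 20        # assumed per-request cap

def handle_work_request(requested, available_workunits):
    # Grant up to the requested number of workunits, bounded by the cap and
    # by the work actually left; over-eager clients can drain the pool.
    granted = min(requested, MAX_WORKUNITS_PER_REQUEST, len(available_workunits))
    return [available_workunits.pop() for _ in range(granted)]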
New Scheduling Strategies • New strategies target computational problems on the scale of many hours or days • Four primary goals: • Reduce application execution time • Increase resource utilization • No reliance on client supplied information • Remain application neutral
First Come First Serve Algorithms • Naïve scheduling algorithms based solely on the frequency of client requests for work • Server-centric approach which does not depend on client supplied information for scheduling • At each request for work, the server compares the number of workunits already assigned to a node and sends work to the node based on a target worklevel • Two algorithms tested targeting either a workload of one workunit (FCFS-1) or five workunits (FCFS-5)
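A sketch of the FCFS policy under the stated assumptions: the server tracks how many workunits each node currently holds and tops the requester up to the target worklevel (1 for FCFS-1, 5 for FCFS-5). All names are illustrative.

assigned_count = {}   # node id -> workunits currently held (decremented as results return)

def fcfs_schedule(node_id, available_workunits, target_worklevel=1):
    # First come, first serve: top the requesting node up to the target worklevel.
    current = assigned_count.get(node_id, 0)
    deficit = max(0, target_worklevel - current)
    granted = min(deficit, len(available_workunits))
    assigned_count[node_id] = current + granted
    return [available_workunits.pop() for _ in range(granted)]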
Ant Colony Algorithms • Meta-heuristic modeling the behavior of ants searching for food • Ants make decisions based on pheromone levels • Decisions affect pheromone levels to influence future decisions
Ant Colony Algorithms • Initial decisions are made at random • Ants leave a trail of pheromones along their path • Subsequent ants use pheromone levels to decide • Still random since initial trails were random
Ant Colony Algorithms • Shorter paths complete quicker, leading to feedback from the pheromone trail • An ant at the destination now bases its return decision on pheromone levels • Decisions begin to become ordered
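A toy sketch of the feedback loop described above: options are chosen with probability proportional to pheromone levels, trails evaporate, and good outcomes deposit extra pheromone. The constants and names are illustrative assumptions, not the scheduling algorithm evaluated in the experiments.

import random

pheromone = {"path_a": 1.0, "path_b": 1.0}   # equal levels => initial choices are random
EVAPORATION = 0.1                             # fraction of pheromone lost each iteration

def choose(options):
    # Pick an option with probability proportional to its pheromone level.
    total = sum(pheromone[o] for o in options)
    r = random.uniform(0, total)
    cumulative = 0.0
    for o in options:
        cumulative += pheromone[o]
        if r <= cumulative:
            return o
    return options[-1]

def reinforce(option, quality):
    # Evaporate all trails, then deposit pheromone proportional to outcome quality,
    # so shorter (faster-completing) paths get reinforced more often.
    for o in pheromone:
        pheromone[o] *= (1.0 - EVAPORATION)
    pheromone[option] += quality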