Public Computing - Challenges and Solutions
Presentation Transcript
Public Computing - Challenges and Solutions Yi Pan Professor and Chair of CS Professor of CIS Georgia State University Atlanta, Georgia, USA AINA 2007 May 21, 2007
Outline • What is Grid Computing? • Virtual Organizations • Types of Grids • Grid Components • Applications • Grid Issues • Conclusions
Outline (continued) • Public Computing and the BOINC Architecture • Motivation for New Scheduling Strategies • Scheduling Algorithms • Testing Environment and Experiments • MD4 Password Hash Search • Avalanche Photodiode Gain and Impulse Response • Gene Sequence Alignment • Peer to Peer Model and Experiments • Conclusion and Future Research
What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed • Standards allow for transportation of power • Standards define interface with grid • Non-trivial overhead of managing movement and storage of power • Economies of scale compensate for this overhead allowing for cheap, accessible power
A Computational “Power Grid” • Goal is to make computation a utility • Computational power, data services, peripherals (Graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way • Standards allow for transportation of these services • Standards define interface with grid • Architecture provides for management of resources and controlling access • Large amounts of computing power should be accessible from anywhere in the grid
Virtual Organizations • Independent organizations come together to pool grid resources • Component organizations could be different research institutions, departments within a company, individuals donating computing time, or anything with resources • Formation of the VO should define participation levels, resources provided, expectations of resource use, accountability, economic issues such as charge for resources • Goal is to allow users to exploit resources throughout the VO transparently and efficiently
Types of Grids • Computational Grid • Data Grid • Scavenging Grid • Peer-to-Peer • Public Computing
Computational Grids • Traditionally used to connect high performance computers between organizations • Increases utilization of geographically dispersed computational resources • Provides more parallel computational power to individual applications than is feasible for a single organization • Most traditional grid projects concentrate on these types of grids • Globus and OGSA
Data Grids • Distributed data sources • Queries of distributed data • Sharing of storage and data management resources • The D0 Particle Physics Data Grid allows access to both compute and data resources for huge amounts of physics data • Google
Scavenging Grids • Harness idle cycles on systems especially user workstations • Parallel application must be quite granular to take advantage of large amounts of weak computing power • Grid system must support terminating and restarting work when systems cease idling • Condor system from University of Wisconsin
Peer-to-Peer • Converging technology with traditional grids • Contrasts with grids in having little infrastructure and high fault tolerance • Highly scalable for participation but difficult to locate and monitor resources • Current P2P systems like Gnutella, Freenet, and FastTrack concentrate on data services
Public Computing • Also converging with grid computing • Often communicates through a central server in contrast with peer-to-peer technologies • Again scalable with participation • Adds even greater impact of multiple administrative domains as participants are often untrusted and unaccountable
Public Computing Examples • SETI@Home (http://setiathome.ssl.berkeley.edu/) – Search for Extraterrestrial Intelligence in radio telescope data (UC Berkeley) – a distributed network computation searching for extraterrestrial civilizations • Has more than 5 million participants • “The most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million. SETI@home currently gets about 15 TeraFLOPs and has cost $500K so far.”
More Public Computing Examples • Folding@Home project (http://folding.stanford.edu) for molecular simulation aimed at new drug discovery • Distributed.net (http://distributed.net) for cracking RC5 64-bit encryption algorithm – used more than 300,000 nodes over 1757 days
Grid Components • Authentication and Authorization • Resource Information Service • Monitoring • Scheduler • Fault Tolerance • Communication Infrastructure
Authentication and Authorization • Important for allowing users to cross the administrative boundaries in a virtual organization • System security for jobs outside the administrative domain currently rudimentary • Work being done on sandboxing, better job control, development environments
Resource Information Service • Used in resource discovery • Leverages existing technologies such as LDAP, UDDI • Information service must be able to report very current availability and load data • Balanced with overhead of updating data
Monitoring • Raw performance characteristics are not the only measurement of resource performance • Current and expected loads can have a tremendous impact • Balance between accurate performance data and additional overhead of monitoring systems and tracking that data
Scheduler • Owners of systems interested in maximizing throughput • Users interested in maximizing runtime performance • Both offer challenges with crossing administrative boundaries • Unique issues such as co-allocation and co-location • Interesting work being done in scheduling like market based scheduling
Fault Tolerance • More work exploring fault tolerance in grid systems leveraging peer-to-peer and public computing research • Multiple administrative domains in VO challenge the reliability of resources • Faults can refer not only to resource failure but violation of service level agreements (SLA) • Impact on fault tolerance if there is no accountability for failure
Communication Infrastructure • Currently most grids have robust communication infrastructure • As more grids are deployed and used, more attention must be paid to network QoS and reservation • Most large applications are currently data rich • P2P and Public Computing have experience in communication-poor environments
Applications • Embarrassingly parallel, data poor applications in the case of pooling large amounts of weak computing power • Huge data-intensive, data rich applications that can take advantage of multiple, parallel supercomputers • Application specific grids like Cactus and Nimrod
Grid Issues • Site autonomy • Heterogeneous resources • Co-allocation • Metrics for resource allocation • Language for utilizing grids • Reliability
Site autonomy • Each component of the grid could be administered by an individual organization participating in the VO • Each administrative domain has its own policies and procedures surrounding their resources • Most scheduling and resource management work must be distributed to support this
Heterogeneous resources • Grid resources will have not only heterogeneous platforms but heterogeneous workloads • Applications truly exploiting grid resources will need to scale from idle cycles on workstations, huge vector based HPCs, to clusters • Not only computation power, also storage, peripherals, reservability, availability, network connectivity
Co-allocation • Unique challenges of reserving multiple resources across administrative domains • Capabilities of resource management may be different for each component of a composite resource • Failure of allocating components must be handled in a transaction-like manner • Acceptable substitute components may assist in co-allocating a composite resource
Metrics for resource allocation • Different scheduling approaches measure performance differently • Historical performance • Throughput • Storage • Network connectivity • Cost • Application specific performance • Service level
Language for utilizing grids • Much of the work in grids is protocol or language work • Expressive languages needed for negotiating service level, reporting performance or resource capabilities, security, and reserving resources • Protocol work in authentication and authorization, data transfer, and job management
Summary about Grids • Grids offer tremendous computation and data storage resources not available in single systems or single clusters • Application and algorithm design and deployment still either rudimentary or application specific • Universal infrastructure still in development • Unique challenges still unsolved especially in regard to fault tolerance and multiple administrative domains
Public Computing • Aggregates idle workstations connected to the Internet for performing large scale computations • Initially seen in volunteer projects such as Distributed.net and SETI@home • Volunteer computers periodically download work from a project server and complete the work during idle periods • Currently used in projects that have large workloads on the scale of months or years with trivially parallelizable tasks
BOINC Architecture • Berkeley Open Infrastructure for Network Computing • Developed as a generic public computing framework • Next generation architecture for the SETI@home project • Open source and encourages use in other public computing projects
BOINC lets you donate computing power to the following projects • Climateprediction.net: study climate change • Einstein@home: search for gravitational signals emitted by pulsars • LHC@home: improve the design of the CERN LHC particle accelerator • Predictor@home: investigate protein-related diseases • SETI@home: Look for radio evidence of extraterrestrial life • Cell Computing biomedical research (Japanese; requires nonstandard client software)
Motivation for New Scheduling Strategies • Many projects require large-scale computational resources but are not at the current public computing scale • Grid and cluster scale projects are very popular in many scientific computing areas • Current public computing scheduling does not scale down to these smaller projects
Motivation for New Scheduling Strategies • Grid scale scheduling for public computing would make public computers a viable alternative or complementary resource to grid systems • Public computing has the potential to offer a tremendous amount of computing resources from idle systems of organizations or volunteers • Scavenging grid projects such as Condor indicate interest in harnessing these resources in the grid research community
Scheduling Algorithms • Current BOINC scheduling algorithm • New scheduling algorithms • First Come, First Serve with target workload of 1 workunit (FCFS-1) • First Come, First Serve with target workload of 5 workunits (FCFS-5) • Ant Colony Scheduling Algorithm
BOINC Scheduling • Originally designed for “unlimited” work • Clients can request as much work as desired up to a specified limit • Smaller, limited computational jobs face the challenge of requiring more accurate scheduling • Too many workunits assigned to a node leads to either redundant computation by other nodes or exhaustion of available workunits • Too few workunits assigned leads to increased communication overhead
New Scheduling Strategies • New strategies target computational problems on the scale of many hours or days • Four primary goals: • Reduce application execution time • Increase resource utilization • No reliance on client supplied information • Remain application neutral
First Come First Serve Algorithms • Naïve scheduling algorithms based solely on the frequency of client requests for work • Server-centric approach which does not depend on client supplied information for scheduling • At each request for work, the server compares the number of workunits already assigned to a node and sends work to the node based on a target worklevel • Two algorithms tested targeting either a workload of one workunit (FCFS-1) or five workunits (FCFS-5)
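The FCFS idea on this slide can be sketched as server-side logic. This is an illustrative sketch, not BOINC code: the class and method names (`FCFSScheduler`, `request_work`, `report_done`) are assumptions, and `target` corresponds to the worklevel of FCFS-1 or FCFS-5.

```python
from collections import defaultdict

class FCFSScheduler:
    """Hypothetical server-centric FCFS-k scheduler: on each request,
    top the client up to a fixed target worklevel, using no
    client-supplied information beyond its identity."""

    def __init__(self, workunits, target=1):
        self.queue = list(workunits)      # unassigned workunits
        self.assigned = defaultdict(int)  # workunits currently held per node
        self.target = target              # 1 for FCFS-1, 5 for FCFS-5

    def request_work(self, node_id):
        """Serve a client's request for work: compare workunits already
        assigned to the node against the target and send the difference."""
        needed = self.target - self.assigned[node_id]
        sent = []
        while needed > 0 and self.queue:
            sent.append(self.queue.pop(0))
            self.assigned[node_id] += 1
            needed -= 1
        return sent

    def report_done(self, node_id, n=1):
        """Client reports completed workunits, freeing part of its quota."""
        self.assigned[node_id] = max(0, self.assigned[node_id] - n)

# A node's first request fills it to the target; a repeat request
# sends nothing until it reports results back.
sched = FCFSScheduler(range(10), target=5)
first = sched.request_work("nodeA")
repeat = sched.request_work("nodeA")
```

Keeping the target small limits both redundant computation (too many workunits per node) and workunit exhaustion, at the cost of more frequent requests.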
Ant Colony Algorithms • Meta-heuristic modeling the behavior of ants searching for food • Ants make decisions based on pheromone levels • Decisions affect pheromone levels to influence future decisions
Ant Colony Algorithms • Initial decisions are made at random • Ants leave a trail of pheromones along their path • Subsequent ants use pheromone levels to decide • Still random since initial trails were random
Ant Colony Algorithms • Shorter paths will complete more quickly, leading to feedback from the pheromone trail • An ant at the destination now bases its return decision on pheromone level • Decisions begin to become ordered
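The feedback loop in these slides can be shown with a minimal sketch. The two paths, the deposit rule (inversely proportional to path length), and the evaporation rate are illustrative assumptions, not parameters of any specific scheduler.

```python
import random

paths = {"short": 2.0, "long": 5.0}   # path -> length (cost)
pheromone = {p: 1.0 for p in paths}   # equal trails, so first choices are random

def choose_path(rng):
    """Pick a path with probability proportional to its pheromone level."""
    total = sum(pheromone.values())
    r = rng.random() * total
    for p, level in pheromone.items():
        r -= level
        if r <= 0:
            return p
    return p  # guard against floating-point rounding

def run_ant(rng):
    """One ant traverses a path; shorter paths finish sooner, so they
    receive a larger deposit -- the positive feedback from the slides."""
    p = choose_path(rng)
    pheromone[p] += 1.0 / paths[p]
    return p

def evaporate(rate=0.05):
    """Evaporation keeps early random trails from dominating forever."""
    for p in pheromone:
        pheromone[p] *= (1.0 - rate)

rng = random.Random(42)
for _ in range(500):
    run_ant(rng)
    evaporate()
```

After many ants, the initially random decisions become ordered: the shorter path accumulates the higher pheromone level and attracts most of the traffic.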