Effective HDPC: Infrastructure, Expansion, and Resilience

Neil Skrypuch COSC 3P93 3/21/2007 Highly Distributed Parallel Computing

Overview • a network of computers all working towards a similar goal • network consists of many nodes, few servers • nodes perform computing and send results to a server • servers distribute jobs • node machines do not communicate with each other

Pros

Relatively Simple • don't need to worry about special interconnections • don't need to worry about cluster booting

Non-Homogeneous Network • can work across different computer architectures, OSes, etc • computers can be of varying speeds • doesn't require the fastest or most expensive computers • computers can be distributed anywhere in the world

Infrastructure • infrastructure for HDPC already exists almost everywhere • anyone with a network of computers is already ready for HDPC • lots of programs already exist that take advantage of HDPC

Expansion • expansion is painless • there are no special constraints on the “shape” of the network • not fast enough yet? keep adding more computers until it is

Resilience to Failure • it doesn't matter if one or more nodes die • only the reliability of the central server(s) matters

Cons

Suitability • not all problems are suited to HDPC • highly communication-bound problems are a poor fit for HDPC

Server Dependence • central server dependence is a double-edged sword • if the central server becomes unavailable, everything grinds to a halt

Network (In)security • how to verify if a client should be allowed to join the network? • protecting data sent over the network • verifying integrity and authenticity of data sent over the network

Network (Un)reliability • nodes that temporarily lose connectivity may be temporarily useless

Dealing With the Issues

Server Dependence • the central server need not be a single server • server itself may be clustered • countless ways to cluster servers

Clustering With a Database • allow nodes to talk directly to the database • cluster the database over multiple servers • multi-master replication • single master replication • lots more...

Server Hierarchy • multiple tiers of servers may also be used • could be considered recursive HDPC • very similar to the tree architecture of supercomputers

Lost Nodes • define a maximum amount of time to wait for a node's response • use redundancy • assume some nodes will always be lost • send duplicate jobs to multiple nodes simultaneously
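
The timeout and duplicate-job ideas above can be sketched as a small in-memory dispatcher. This is a minimal sketch under stated assumptions, not a production scheduler: the `Dispatcher` and `Job` classes and the `timeout_s` and `replicas` parameters are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: int
    payload: str
    # node name -> time at which the job was handed to that node
    assigned: dict = field(default_factory=dict)
    result: object = None

class Dispatcher:
    def __init__(self, timeout_s, replicas):
        self.timeout_s = timeout_s  # maximum wait for a node's response
        self.replicas = replicas    # how many nodes work on a job at once
        self.jobs = {}

    def add_job(self, job_id, payload):
        self.jobs[job_id] = Job(job_id, payload)

    def next_job(self, node, now=None):
        """Hand `node` an unfinished job with fewer than `replicas` live copies."""
        now = time.monotonic() if now is None else now
        for job in self.jobs.values():
            if job.result is not None or node in job.assigned:
                continue
            # Assignments older than the timeout are treated as lost nodes.
            live = sum(
                1 for t in job.assigned.values() if now - t < self.timeout_s
            )
            if live < self.replicas:
                job.assigned[node] = now
                return job
        return None

    def submit(self, node, job_id, result):
        """Record the first result that arrives; later duplicates are ignored."""
        job = self.jobs[job_id]
        if job.result is None:
            job.result = result
```

A node that never answers simply ages out: once its assignment is older than `timeout_s`, the job becomes available to another node without any explicit failure detection.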

Network (In)security • not as big of an issue as one might think • encryption and public key infrastructures mitigate most confidentiality and authenticity concerns • redundancy is useful for both reliability and security
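
The point that redundancy helps security as well as reliability can be illustrated with a simple vote over duplicate results: a single faulty or malicious node cannot get a wrong answer accepted on its own. The `accept_result` function and its `quorum` parameter are hypothetical names for this sketch.

```python
from collections import Counter

def accept_result(results, quorum):
    """Return the value reported by at least `quorum` nodes, else None.

    `results` is the list of (hashable) answers that different nodes
    returned for the same job.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

print(accept_result(["0x3f", "0x3f", "0xff"], quorum=2))
print(accept_result(["0x3f", "0xff"], quorum=2))
```

When no value reaches the quorum, the server would typically send the job out again rather than trust any single answer.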

Work Buffering • taking larger portions of work at a time • temporary connectivity issues pose less of a problem this way • a node can continue working without talking to a central server for longer
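
Work buffering can be sketched as a node that fetches several work units per round trip and queues its results locally until the server is reachable again. The `BufferedNode` class and its method names are assumptions made for this illustration; the server is modelled as a plain `deque` of work units.

```python
from collections import deque

class BufferedNode:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = deque()  # work units fetched but not yet computed
        self.finished = []      # results waiting to be uploaded

    def refill(self, server_queue):
        """Take up to `batch_size` units from the server in one round trip."""
        while server_queue and len(self.pending) < self.batch_size:
            self.pending.append(server_queue.popleft())

    def work(self, compute):
        """Process one buffered unit; requires no contact with the server."""
        if not self.pending:
            return False
        unit = self.pending.popleft()
        self.finished.append((unit, compute(unit)))
        return True

    def flush(self):
        """Hand over all buffered results, e.g. once connectivity returns."""
        out, self.finished = self.finished, []
        return out

server = deque(range(10))
node = BufferedNode(batch_size=4)
node.refill(server)
while node.work(lambda x: x * x):
    pass
print(node.flush())
```

A larger `batch_size` lets the node ride out longer outages, at the cost of more work being stranded if the node is lost for good.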

Where is HDPC Useful?

Combinatorics • search • enumeration • generation

Cryptography • brute force cipher cracking • gives a glimpse of the future, in terms of what the average person will be able to crack

Artificial Intelligence • genetic algorithms • genetic programming • alpha-beta search

Graphics • ray tracing • animation • fractal generation and calculation

Simulation • weather and climate modeling • particle physics

Guidelines for Suitability • most problems involving a large search tree are well suited to HDPC • anything that can be broken down into smaller, self-contained, chunks is a good candidate for HDPC
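
As a concrete illustration of "smaller, self-contained chunks", a search over a range of candidates can be cut into sub-ranges that each node scans independently. The `split_range` and `search_chunk` helpers are hypothetical names used only for this sketch.

```python
def split_range(start, stop, chunk_size):
    """Yield (lo, hi) half-open sub-ranges that together cover [start, stop)."""
    for lo in range(start, stop, chunk_size):
        yield lo, min(lo + chunk_size, stop)

def search_chunk(lo, hi, predicate):
    """The work one node would do: scan a chunk with no outside communication."""
    return [x for x in range(lo, hi) if predicate(x)]

def is_hit(x):
    # Toy predicate standing in for, e.g., "this key decrypts the message".
    return x % 97 == 0

hits = []
for lo, hi in split_range(0, 1000, 128):
    hits.extend(search_chunk(lo, hi, is_hit))
print(hits)
```

Because no chunk depends on another chunk's result, the chunks can be handed to nodes in any order and combined by simple concatenation.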

How Well Does HDPC Work?

Folding@Home • ~200,000 non-dedicated nodes • 240 TFLOPS • approximately 40 central servers, unknown speeds

SETI@Home • ~200,000 non-dedicated nodes • 288 TFLOPS • 10 central servers, all relatively modest

Blue Gene/L • currently the fastest supercomputer (November 2006 TOP500 list) • not HDPC • 65,536 dedicated nodes • 280 TFLOPS • cost about $100,000,000 US

HDPC Works Well • typical speedup is close to linear • cost is substantially less than a comparable supercomputer • nodes can also be general purpose computers
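
The figures quoted on the three preceding slides can be put on a common footing by dividing throughput by node count. The short calculation below uses only the slide numbers; the node counts for the volunteer projects are approximate, so the output is a rough comparison rather than a benchmark.

```python
# (TFLOPS, node count) as quoted on the slides.
SYSTEMS = {
    "Folding@Home": (240, 200_000),
    "SETI@Home": (288, 200_000),
    "Blue Gene/L": (280, 65_536),
}

def gflops_per_node(tflops, nodes):
    """Average throughput per node in GFLOPS (1 TFLOPS = 1000 GFLOPS)."""
    return tflops * 1000 / nodes

for name, (tflops, nodes) in SYSTEMS.items():
    print(f"{name}: {gflops_per_node(tflops, nodes):.2f} GFLOPS per node")
```

On these numbers each volunteer node contributes between 1 and 1.5 GFLOPS on average, against a little over 4 GFLOPS for a dedicated Blue Gene/L node, so the HDPC projects match the supercomputer's total throughput by using roughly three times as many, individually slower, nodes.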

Why Does HDPC Work Well?

Infrastructure Reuse • in general, new hardware investments are not necessary • creating new infrastructure is expensive and time-consuming • it's easy to justify using things you already have for additional purposes • there are tons of idle CPUs at any given time, so why not use them?

Low Barrier to Entry • anyone with a couple of networked computers can start experimenting

Painlessly Scalable • smooth curve upwards for both cost and performance

Simpler to Program • doesn't require as much “thinking in parallel” in comparison to other approaches • thinking in parallel is hard and fundamentally different from thinking serially • pushes the heavy lifting onto the database instead of the application programmer

Commodity Hardware is Fast • a typical desktop machine today is more powerful than a supercomputer from 15 years ago • and costs orders of magnitude less • and outputs much less heat • and takes up much less space • and consumes much less power

The Future • supercomputers will become faster • HDPC will become even faster than supercomputers • as both the number of computers and their speed increase • both supercomputers and HDPC will fill their own separate niches

Questions and Discussion

References • http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats • http://www.boincstats.com/stats/project_graph.php?pr=sah • http://www.boincstats.com/stats/project_graph.php?pr=bo • http://www.itjungle.com/tlb/tlb033004-story04.html • http://setiathome.berkeley.edu/sah_status.html • http://fah-web.stanford.edu/serverstat.html • http://top500.org/list/2006/11/100
