Lecture 11: Unix Clusters


Presentation Transcript


  1. Lecture 11: Unix Clusters Assoc. Prof. Guntis Barzdins, Assist. Girts Folkmanis, University of Latvia, Dec 10, 2004

  2. Moore’s Law - Density

  3. Moore's Law and Performance • The performance of computers is determined by architecture and clock speed. • Clock speed doubles over a roughly 3-year period due to on-chip scaling laws. • Processors using identical or similar architectures gain performance directly as a function of Moore's Law. • Improvements in internal architecture can yield gains beyond those predicted by Moore's Law.

  4. Future of Moore's Law • Short-term (1-5 years) • Will keep operating (prototypes already exist in the lab) • Fabrication cost will go up rapidly • Medium-term (5-15 years) • Exponential growth rate will likely slow • A trillion-dollar industry is motivated to keep it going • Long-term (>15 years) • May need new technology (chemical or quantum) • We can do better (e.g., the human brain) • I would not close the patent office

  5. Different kinds of PC clusters • High Performance Computing Cluster • Load Balancing • High Availability

  6. High Performance Computing Cluster (Beowulf) • Started in 1994 • Donald Becker of NASA assembled the world's first cluster from 16 DX4 PCs and 10 Mbit/s Ethernet • Also called a Beowulf cluster • Built from commodity off-the-shelf hardware • Applications include data mining, simulations, parallel processing, weather modelling, computer graphics rendering, etc.
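A Beowulf node runs ordinary MPI programs over its commodity network. As a minimal illustration, assuming an MPI implementation such as MPICH or LAM/MPI is installed together with the usual mpicc/mpirun wrappers, a "hello world" that reports which node each process landed on could look like the sketch below.

/* Minimal MPI "hello world" -- the kind of program a Beowulf cluster runs.
 * Assumes an MPI implementation (MPICH, LAM/MPI); build with mpicc,
 * launch with mpirun. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes    */
    MPI_Get_processor_name(node, &len);      /* which cluster node we run on */

    printf("process %d of %d on node %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}

Built with "mpicc hello.c -o hello" and launched with, for example, "mpirun -np 16 ./hello", it starts one process per slot in the node pool, much like Becker's original 16-node machine.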

  7. Examples of Beowulf cluster • Scyld Cluster O.S. by Donald Becker • http://www.scyld.com • ROCKS from NPACI • http://www.rocksclusters.org • OSCAR from open cluster group • http://oscar.sourceforge.net • OpenSCE from Thailand • http://www.opensce.org

  8. Cluster Sizing Rule of Thumb • System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most about 2048 nodes for most HPC applications • Max socket connections • Direct-access message tag lists & buffers • NFS / storage system clients • Debugging • Etc. • It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters

  9. Apple Xserve G5 with Xgrid Environment • Alternative to a Beowulf PC cluster • Server node + 10 compute nodes • Dual-CPU G5 nodes (2 GHz, 1 GB memory) • Gigabit Ethernet interconnect • 3 TB Xserve RAID array • Xgrid offers an 'easy' pool-of-processors computing model • MPI available for legacy code

  10. Xgrid Computing Environment • Suitable for loosely coupled distributed computing • Controller distributes tasks to agent processors (tasks include data and code) • Collects results when agents finish • Distributes more chunks to agents as they become free and join the cluster/grid • [Diagram: Xgrid client → Xgrid controller → Xgrid agents, with server storage]
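Xgrid's own protocol is Apple-specific, but the controller/agent behaviour described above is the classic task-farm pattern. The sketch below illustrates the same idea in plain MPI, not the Xgrid API; NTASKS and the squaring step are arbitrary stand-ins for real job chunks.

/* Task-farm sketch of the controller/agent model: rank 0 hands out work
 * chunks, agents return results and immediately get more work or a stop.
 * Plain MPI, used here only to illustrate the Xgrid-style pattern. */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   40
#define TAG_WORK  1
#define TAG_STOP  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* controller */
        int sent = 0, received = 0, result;
        MPI_Status st;
        for (int a = 1; a < size; a++) {              /* seed every agent  */
            int tag = (sent < NTASKS) ? TAG_WORK : TAG_STOP;
            MPI_Send(&sent, 1, MPI_INT, a, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) sent++;
        }
        while (received < sent) {                     /* collect results   */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            received++;
            int tag = (sent < NTASKS) ? TAG_WORK : TAG_STOP;
            MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) sent++;              /* free agent rejoins */
        }
    } else {                                          /* agent */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;                     /* stand-in for work */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}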

  11. Xgrid Work Flow

  12. Cluster Status • Offline: turned off • Unavailable: turned on, but busy with other non-cluster tasks • Working: computing on this cluster job • Available: waiting to be assigned cluster work

  13. Rocky's Tachy Tach Cluster Status Displays • The tachometer illustrates the total processing power available to the cluster at any time • The level will change when running on a cluster of desktop workstations, but will stay steady when monitoring a dedicated cluster

  14. Load Balancing Cluster • PC clusters deliver load-balancing performance • Commonly used for busy FTP and web servers with a large client base • A large number of nodes share the load
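As a purely conceptual sketch (not the code of any particular balancer), the simplest dispatch policy a front-end can apply is round-robin over the node pool:

/* Conceptual round-robin dispatch: each incoming request is handed to the
 * next backend node in turn.  NODES and the request loop are arbitrary. */
#include <stdio.h>

#define NODES 4

static int pick_backend(void)
{
    static int next = 0;              /* rotates through the node pool */
    return next++ % NODES;
}

int main(void)
{
    for (int request = 0; request < 10; request++)
        printf("request %d -> node %d\n", request, pick_backend());
    return 0;
}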

  15. High Availability Cluster • Avoid downtime of services • Avoid single points of failure • Always built with redundancy • Almost all load-balancing clusters also provide HA capability

  16. Examples of Load Balancing and High Availability Cluster • RedHat HA cluster • http://ha.redhat.com • Turbolinux Cluster Server • http://www.turbolinux.com/products/tcs • Linux Virtual Server Project • http://www.linuxvirtualserver.org/

  17. High Availability Approach: Redundancy + Failover • Redundancy eliminates Single Points Of Failure (SPOF) • Automatically detect failures (hardware, network, applications) • Automatic recovery from failures (no human intervention)
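The detect-and-recover loop can be pictured with a toy heartbeat monitor. In the sketch below, check_peer() and take_over_service() are hypothetical placeholders; real packages such as Linux-HA's heartbeat or TSA add quorum, fencing and resource scripts on top of this basic idea.

/* Toy heartbeat/failover loop: if the peer stays silent longer than
 * DEADTIME, take over its service.  check_peer() and take_over_service()
 * are hypothetical placeholders, not part of any real HA package. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HEARTBEAT_INTERVAL 2          /* seconds between checks               */
#define DEADTIME          10          /* silence after which the peer is dead */

static int check_peer(void)          { return 1; }  /* did a heartbeat arrive? */
static void take_over_service(void)  { puts("failover: taking over service"); }

int main(void)
{
    time_t last_seen = time(NULL);
    int active = 0;                               /* are we running the service? */

    for (;;) {
        if (check_peer())
            last_seen = time(NULL);               /* peer is alive              */
        if (!active && time(NULL) - last_seen > DEADTIME) {
            take_over_service();                  /* automatic, no human needed */
            active = 1;
        }
        sleep(HEARTBEAT_INTERVAL);
    }
}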

  18. Real-Time Disk Replication: DRBD (Distributed Replicated Block Device)

  19. IBM Supported Solutions • Tivoli System Automation (TSA) for Multi-Platform: proprietary IBM solution • Used across all eServers and ia32 hardware from any vendor • Available on Linux, AIX, OS/400 • Rules-based recovery system • Over 1000 licenses since 2003 • Linux-HA (Heartbeat): open source project • Multi-platform solution for IBM eServers, Solaris, BSD • Packaged with several Linux distributions • Strong focus on ease-of-use, security, simplicity, low cost • >10K clusters in production since 1999

  20. HPCC Cluster and Parallel Computing Applications • Message Passing Interface • MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) • LAM/MPI (http://lam-mpi.org) • Mathematical libraries • fftw (fast Fourier transform) • pblas (parallel basic linear algebra software) • atlas (a collection of mathematical libraries) • sprng (scalable parallel random number generator) • MPITB (MPI toolbox for MATLAB) • Quantum chemistry software • gaussian, qchem • Molecular dynamics solvers • NAMD, gromacs, gamess • Weather modelling • MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)
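All of the libraries above build on plain message passing. As a minimal MPICH/LAM-style example (the range 1..1,000,000 is arbitrary), each rank sums its own slice and a collective reduce combines the partial sums on rank 0:

/* Parallel sum with a collective reduce: each rank sums a strided slice
 * of 1..1000000, MPI_Reduce combines the partial sums on rank 0. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local = 0, total = 0;
    for (long long i = rank + 1; i <= 1000000; i += size)   /* this rank's slice */
        local += i;

    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %lld\n", total);            /* should be 500000500000 */

    MPI_Finalize();
    return 0;
}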

  21. MOSIX and openMosix • MOSIX: a software package that enhances the Linux kernel with cluster capabilities. The enhanced kernel supports clusters of any size built from x86/Pentium-based boxes. MOSIX allows automatic and transparent migration of processes to other nodes in the cluster, while standard Linux process-control utilities such as 'ps' show all processes as if they were running on the node they originated from. • openMosix: a spin-off of the original MOSIX. The first version of openMosix is fully compatible with the last version of MOSIX, but will go in its own direction from here.
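Because migration is transparent, a MOSIX/openMosix workload needs no cluster API at all: ordinary fork()ed, CPU-bound processes are exactly what the enhanced kernel spreads across nodes, while 'ps' on the home node still lists them all. A sketch (the harmonic-sum loop is just a stand-in for real work):

/* MOSIX/openMosix-friendly workload: plain fork()ed CPU-bound children.
 * No cluster API is used; the enhanced kernel decides transparently whether
 * to migrate each child, and `ps` on the home node still shows them all. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static double burn_cpu(long iterations)
{
    double x = 0.0;
    for (long i = 1; i <= iterations; i++)
        x += 1.0 / (double)i;                  /* stand-in for real work */
    return x;
}

int main(void)
{
    const int nworkers = 8;

    for (int w = 0; w < nworkers; w++) {
        if (fork() == 0) {                     /* child: migration candidate */
            printf("worker %d (pid %d): %f\n", w, getpid(), burn_cpu(200000000L));
            _exit(0);
        }
    }
    for (int w = 0; w < nworkers; w++)
        wait(NULL);                            /* parent stays on the home node */
    return 0;
}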

  22. MOSIX architecture (3/9) • Preemptive process migration: any user process, transparently and at any time, can migrate to any available node. The migrating process is divided into two contexts: • system context (deputy), which may not be migrated from the home workstation (UHN) • user context (remote), which can be migrated to a diskless node

  23. MOSIX architecture (4/9) • Preemptive process migration [diagram: a process migrating from the master node to a diskless node]

  24. Multi-CPU Servers

  25. Benchmark - Memory (Stream throughput, MB/s; both systems with 4x 1 GB DDR266 DIMMs, Avent Techn.)
System / 1x Stream / 2x Stream / 4x Stream
2x Opteron, 1.8 GHz, HyperTransport: 1006 – 1671 / 975 – 1178 / 924 – 1133
2x Xeon, 2.4 GHz, 400 MHz FSB: 1202 – 1404 / 561 – 785 / 365 – 753
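The figures above come from a STREAM-style memory bandwidth test. A much-simplified triad kernel (array size, timing and the reported number are chosen only for illustration; the real STREAM benchmark is far more careful) shows what is being measured:

/* Simplified STREAM-style "triad" bandwidth sketch: a[i] = b[i] + s*c[i].
 * Arrays are sized to defeat the caches; timing is deliberately crude. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)                  /* 3 x 128 MB of doubles */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];             /* two loads, one store  */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("triad bandwidth: %.0f MB/s\n",    /* 3 arrays moved through memory */
           3.0 * N * sizeof(double) / secs / 1e6);
    free(a); free(b); free(c);
    return 0;
}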

  26. Sybase DBMS Performance

  27. Multi-CPU Hardware and Software

  28. Service Processor (SP) • Dedicated SP on-board • PowerPC based • Own IP name/address • Front panel • Command line interface • Web-server • Remote administration • System status • Boot/Reset/Shutdown • Flash the BIOS

  29. Unix Scheduling

  30. Process Scheduling • When to run the scheduler: (1) process creation (2) process exit (3) process blocks (4) interrupt occurs • Non-preemptive – a process runs until it blocks or gives up the CPU (events 1-3) • Preemptive – a process runs for some time unit, then the scheduler selects another process to run (events 1-4)
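On Linux and Solaris, the preemptive, time-sliced behaviour can also be requested explicitly through the POSIX scheduling API. A small sketch follows; SCHED_RR normally requires root privileges, and the priority value 10 is arbitrary.

/* Ask the scheduler for a preemptive round-robin real-time policy and
 * report the time slice it will grant.  Typically needs root. */
#include <stdio.h>
#include <sched.h>
#include <time.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };   /* arbitrary priority */

    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");         /* usually EPERM without root */
        return 1;
    }

    struct timespec slice;
    sched_rr_get_interval(0, &slice);         /* our time quantum */
    printf("SCHED_RR time slice: %ld.%09ld s\n",
           (long)slice.tv_sec, slice.tv_nsec);
    return 0;
}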

  31. Solaris Overview • Multithreaded, symmetric multi-processing • Preemptive kernel with protected data structures • Interrupts handled using threads • MP support: per-CPU dispatch queues, one global kernel preempt queue • System threads • Priority inheritance • Turnstiles rather than wait queues
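Priority inheritance, which Solaris applies to kernel locks through turnstiles, is also exposed to applications via POSIX mutex attributes. A minimal sketch, assuming the platform supports the PTHREAD_PRIO_INHERIT protocol:

/* A POSIX mutex with priority inheritance: a low-priority holder is boosted
 * to the priority of the highest-priority thread blocked on the mutex. */
#include <stdio.h>
#include <pthread.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);
    puts("holding a priority-inheritance mutex");
    pthread_mutex_unlock(&lock);

    pthread_mutex_destroy(&lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

Compile with -lpthread; the inheritance only matters once threads of different priorities contend for the lock.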

  32. Linux • Today Linux scales very well in SMP systems with up to 4 CPUs. • Linux on 8 CPUs is still competitive, but between 4-way and 8-way systems the price per CPU increases significantly. • For SMP systems with more than 8 CPUs, classic Unix systems are the best choice. • With Oracle Real Application Clusters (RAC), small 4- or 8-way systems can be clustered to overcome today's Linux limitations. • Commodity, inexpensive 4-way Intel boxes, clustered with Oracle 9i RAC, help reduce TCO.
