The IEEE Task Force on Cluster Computing (TFCC) was established in 1999 due to the growing interest in cluster computing. Its mission includes promoting research, developing technical standards, and coordinating activities across academia, industry, and funding agencies. Over the years, TFCC has organized workshops, published newsletters, and sponsored conferences to foster collaboration within the community. Today, it serves as a key international forum in the field, supporting a range of activities and initiatives crucial for advancing cluster computing technologies.
The IEEE CS Task Force on Cluster Computing (TFCC) • William Gropp, Mathematics and Computer Science, Argonne National Lab, www.mcs.anl.gov/~gropp • Thanks to Mark Baker, University of Portsmouth, UK, http://www.dcs.port.ac.uk/~mab
A Little History • By 1998 there was clearly enormous interest in clusters, so it seemed natural to set up a focused group in this area. • A Cluster Computing Task Force was proposed to the IEEE CS. • The TFCC was approved and started operating in February 1999 – it has been running for just over 2 years. gropp@mcs.anl.gov
Proposed Activities • Act as an international forum to promote cluster computing research and education, and participate in setting up technical standards in this area. • Be involved with issues related to the design, analysis, and development of cluster systems, as well as the applications that use them. • Sponsor professional meetings, produce publications, set guidelines for educational programs, and help coordinate academic, funding agency, and industry activities. • Organize events and hold a number of workshops spanning the range of activities sponsored by the Task Force. • Publish a bi-annual newsletter to help the community keep abreast of activities in the field. gropp@mcs.anl.gov
IEEE CS Task Forces • A TF is expected to have a finite term of existence, normally a period of 2-3 years – continued existence beyond that point is generally not appropriate. • A TF is expected either to increase its scope of activities such that establishment of a Technical Committee (TC) is warranted, or to be merged into existing TCs. • The TFCC will submit an application to the IEEE CS to become a TC later this year. gropp@mcs.anl.gov
Why a separate TFCC? • It brings together all the activities and technologies used in Cluster Computing into one area – so instead of tracking four or five IEEE TCs there is one... • Cluster Computing is NOT just Parallel, Distributed, OSs, or the Internet; it is a mix of them all, and consequently different. • The TFCC is an appropriate body for focusing activities and publications associated with Cluster Computing. gropp@mcs.anl.gov
http://www.ieeetfcc.org gropp@mcs.anl.gov
TFCC Mailing Lists • Currently three email lists have been set up: • tfcc-l@bucknell.edu – a discussion list open to anyone interested in the TFCC; see the TFCC page for information on how to subscribe. • tfcc-exe@port.ac.uk – a closed executive committee mailing reflector. • tfcc-adv@port.ac.uk – a closed advisory committee mailing reflector. gropp@mcs.anl.gov
Annual Conference – ClusterXY • 1st IEEE International Workshop on Cluster Computing (Cluster 1999), Melbourne, Australia, December 1999, about 105 attendees from 16 countries. http://www.clustercomp.org • 2nd IEEE International Conference on Cluster Computing (Cluster 2000), Chemnitz, Germany, November, 2000, anticipate 160 attendees. http://www.tu-chemnitz.de/cluster2000 • 3rd IEEE International Conference on Cluster Computing (Cluster 2001), Newport Beach, California, October 8-11, 2001, expect 250-300 attendees. http://andy.usc.edu/cluster2001 gropp@mcs.anl.gov
Associated Events - GRID’XY • 1st IEEE/ACM International Workshop on Grid Computing (Grid2000), Bangalore, India, December 17, 2000 (attendees from 15 countries). http://www.gridcomputing.org • 2nd IEEE/ACM International Workshop on Grid Computing (Grid2001), at SC2001, November 2001 gropp@mcs.anl.gov
Supercomputing • “Birds of a Feather” sessions at SC99 and SC2000. • The aim of these meetings is to gather interested parties and bring them up to date, and also to put together a set of short talks and start discussion on a variety of topics… • There will probably be another at SC01, depending on community interest. gropp@mcs.anl.gov
Other Activities • Book donation program • Cluster Computing Archive • www.ieeetfcc.org/ClusterArchive.html • TopClusters Project • www.TopClusters.org • TFCC Whitepaper • www.dcs.port.ac.uk/~mab/tfcc/WhitePaper • TFCC Newsletter • www.eg.bucknell.edu/~hyde/tfcc gropp@mcs.anl.gov
TopClusters Project • http://www.TopClusters.org • TFCC collaboration with Top500 project. • Numeric, I/O, Web, Database, and Application level benchmarking of clusters. • Joint BOF with Top500 at SC2000 on Cluster-based benchmarking. • Ongoing effort… gropp@mcs.anl.gov
TFCC Whitepaper • A Whitepaper on Cluster Computing, submitted to the International Journal of High-Performance Applications and Supercomputing, November 2000 • A snapshot of the state of the art of Cluster Computing. • Preprint: www.dcs.port.ac.uk/~mab/tfcc/WhitePaper/ gropp@mcs.anl.gov
TFCC Membership • Over 300 registered members • Free membership is open to all, but a few benefits may be restricted (e.g., reduced conference registration fees for IEEE members) • Over 450 on the TFCC mailing list <tfcc-l@bucknell.edu> gropp@mcs.anl.gov
Future Plans • We plan to submit an application to the IEEE CS Technical Activities Board (TAB) to attain full Technical Committee status. • The TAB sees the TFCC as a success, and we hope that our application will be successful. • If we achieve TC status, we will need the continuing assistance of the TFCC's current volunteers, and we will need to encourage new ones… gropp@mcs.anl.gov
Summary • A successful conference series has been started, with commercial sponsorship. • Promoting Cluster-based technologies through TFCC sponsorship. • Helping the community with our book donation program. • Engendering debate and discussion through our mailing list. • Keeping the community informed with our information-rich TFCC Web site. gropp@mcs.anl.gov
Scalable Clusters • TopClusters.org list: • 26 clusters with 128+ nodes • 8 with 500+ nodes • 34 with 64-127 nodes • Most run Linux • Most dedicated to applications • Where are scalable tools developed and tested? • Caveats: • Does not include MPP-like systems (IBM SP, SGI Origin, Compaq, Intel TFLOPS, etc.) • Not a complete list • Only clusters explicitly contributed to TopClusters.org gropp@mcs.anl.gov
What is Scalability? • Most common definition in use: • Works for n+1 nodes if it works for n, for small n • Practical definition • Operations complete “fast enough” • 0.5 to 3 seconds for “interactive” • Operations are reliable • Approach to scalability must not be fragile gropp@mcs.anl.gov
Issues in Clusters and Scalability • Developing and Testing Tools • Requires convenient access to large-scale system • Can this co-exist with production computing? • Too many different tools • Why not adopt Unix philosophy? • Example solution: Scalable Unix Tools • Following slides thanks to Rusty Lusk and Emil Ong gropp@mcs.anl.gov
What Are the Scalable Unix Tools? • Parallel versions of common Unix commands like ps, ls, cp, …, with appropriate semantics • A few new commands in the same spirit but without a serial counterpart • Designed for users • New this spring: release of a high-performance implementation based on MPI • One of the original “official” Ptools projects • Original definition published • Proceedings of the Scalable High Performance Computing Conference • http://www.mcs.anl.gov/~gropp/papers/1994/shpcc-paper.ps gropp@mcs.anl.gov
Motivation • Basic Unix commands (ls, grep, find, …) are quintessential tools. • Simple syntax and semantics (except maybe find's syntax) • They share the same component interface (lines of text, stdin, stdout) • Unix redirection ( <, >, and especially | ) allows tools to be easily combined into powerful command lines • “Old-fashioned”: no GUI, little interactivity gropp@mcs.anl.gov
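A generic illustration of that composability (standard commands only, not from the original slides):
  # Count how many processes each user is running by chaining ps, awk, sort, and uniq
  ps -ef | awk '{print $1}' | sort | uniq -c | sort -rn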
Motivation, continued • Many parallel machines have Unix and at least partially distinct file systems on each node. • A user needs simple and familiar ways to • Copy a file to local file space on each node • Find all processes running on all nodes • Test for conditions on all nodes • Avoid getting swamped with output • On large machines these commands are not useful unless they take advantage of parallelism in their execution. gropp@mcs.anl.gov
Design Goals • Familiar to Unix users • Similar names (we chose pt<Unix-name>) • Same arguments, similar semantics • Interact well with traditional Unix commands, facilitating construction of powerful command lines • Run at interactive speeds (requires scalability in parallel process manager startup and handling of I/O) gropp@mcs.anl.gov
Part I: Parallel Versions of Traditional Commands • ptcp ptmv ptrm ptln ptmkdir ptrmdir ptchmod ptchgrp ptchown pttest[ao] • Select nodes to run on by • -all • -m <file of hostnames> • -M <hostlist>, e.g. • ‘donner dasher blitzen’ • ‘ccn%d@1-32,42,65-96’ gropp@mcs.anl.gov
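A hedged usage sketch based on the options above; the file and directory names are hypothetical, and the cp-like argument order is assumed from the "same arguments, similar semantics" design goal:
  # Stage an input file onto local disk on nodes ccn1..ccn32 (hypothetical paths)
  ptcp -M 'ccn%d@1-32' input.dat /local/input.dat
  # Remove it from every node afterwards
  ptrm -all /local/input.dat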
Part II: Traditional Commands Producing Lots of Output • ptcat, ptls, ptfind • Have the potential to produce lots of output, where the source of each line is also of interest • With the -h option, output is tagged with the originating host:
  ptls -M node%d@1-3 -h
  [node1] myfile1
  [node2]
  [node3] myfile1 myfile2
gropp@mcs.anl.gov
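Because the tools emit plain lines of text, they compose with ordinary Unix filters; a hedged sketch (the file path and version string are illustrative, and ptcat is assumed to mirror cat's behavior):
  # Count how many nodes are running a 2.2-series Linux kernel
  ptcat -all /proc/version | grep -c ' 2\.2\.'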
Performance of ptcp • Copying a single 10 MB file to 241 nodes in 14 seconds • [Charts: Time to Copy 10MB File; Total Bandwidth] gropp@mcs.anl.gov
Watching ptcp
  # Copy a big file to all nodes (in the background, so its progress can be watched)
  ptcp -all bigfile BIGFILE &
  # Repeatedly report each node's progress and feed the tagged lines to ptdisp
  # (ls -s reports 1K blocks; a 10 MB file is ~9800 blocks, so blocks/98 is a percentage)
  while true; do
    ptexec -all 'echo "`hostname`: `ls -s BIGFILE | \
      awk \"{print \\"percentage\\" \$1/98 \\" blue red\\"}\"`"' | ptdisp -h
  done
gropp@mcs.anl.gov
Percentage of Completion • [ptdisp screenshots: per-node copy progress display] gropp@mcs.anl.gov
Availability • Open source • Get it from http://www.mcs.anl.gov/sut • All source and man pages • Configure, make on Linux, Solaris, Irix, AIX • Needs an MPI implementation with mpirun • Developed with Linux, MPICH, and MPD on Chiba City at Argonne gropp@mcs.anl.gov
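A minimal build sketch, assuming the GNU-style configure/make flow the slide describes; the archive name is hypothetical and exact steps may differ:
  # Fetch the source from http://www.mcs.anl.gov/sut, then:
  tar xzf sut.tar.gz      # hypothetical archive name
  cd sut
  ./configure             # slide lists Linux, Solaris, Irix, AIX
  make
  # An MPI implementation providing mpirun (e.g. MPICH with MPD) must already be installed.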
Chiba City Scalability Testbed • http://www-unix.mcs.anl.gov/chiba/ gropp@mcs.anl.gov
Some Other Efforts in Scalable Clusters • Large Programs • DOE Scientific Discovery through Advanced Computing (SciDAC) • NSF Distributed Terascale Facility (DTF) • OSCAR • Goal is a “cluster in a box” CD • PVFS (Parallel Virtual File System) • Many Smaller Efforts • www.beowulf.org, etc. • Commercial Efforts • Scyld, etc. gropp@mcs.anl.gov