
The Terabyte Analysis Machine

The Terabyte Analysis Machine: Computational Observational Astronomy. James Annis / Fermilab, Experimental Astrophysics. Group members: James Annis, Gabriele Garzoglio, Kurt Ruthsmandorfer, Chris Stoughton.


Presentation Transcript


  1. The Terabyte Analysis Machine: Computational Observational Astronomy. James Annis / Fermilab

  2. Experimental Astrophysics
     The Terabyte Analysis Machine (TAM) is a research project of Fermilab’s Experimental Astrophysics group. The members directly involved:
     • James Annis
     • Gabriele Garzoglio
     • Kurt Ruthsmandorfer
     • Chris Stoughton

  3. The Sloan Digital Sky Survey
     • Dedicated telescope
       • Extremely wide field optics
       • Open-air, no dome design
     • The imaging camera
       • 138 Megapixels
       • 5 band passes
       • 2.5 degree field of view
     • The twin spectrographs
       • 660 fibers
       • 3 degree field of view
       • High resolution (0.3 nm)
       • 400-900 nm

  4. SDSS Science Aims
     • 2-dimensional map of one quarter of the entire sky
     • 3-dimensional map of the local universe (out to ~5 billion light years)
     • Figure legend: cyan, main survey galaxies; magenta, bright red galaxies

  5. The Data Volume
     • 15 Terabytes of corrected frames (2-d map)
     • 4 Terabytes of atlas images, binned sky and masks
     • 1 Terabyte of complex object catalogs
       • 120+ attributes
       • Radial profiles
       • Links to atlas image, spectra
     • 0.1 Terabyte of spectra catalogs (3-d map)

  6. The Archetype Problems
     • Search for A-stars or high-z quasars
       • Exist in a small part of color-color space
       • I/O dominated
     • Search for clusters of galaxies
       • Spatial overdensity
       • Color-color-magnitude overdensity
       • CPU dominates; perhaps I/O-CPU balanced
     • Search for weak lensing signals
       • Measure specialized ellipticity on all atlas images
       • CPU dominated
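The first archetype, selecting objects that occupy a small part of color-color space, amounts to one pass over the catalog with a cheap test per object, which is why it tends to be I/O dominated. A minimal Python sketch of such a selection follows. The field names, the sample magnitudes, and the box boundaries are hypothetical and are not SDSS selection criteria.

```python
from typing import Dict, List

# Hypothetical color box; these boundaries are illustrative only.
UG_RANGE = (0.9, 1.4)    # allowed range of u - g
GR_RANGE = (-0.4, 0.0)   # allowed range of g - r


def in_color_box(obj: Dict[str, float]) -> bool:
    """Return True if the object's (u-g, g-r) colors fall inside the box."""
    ug = obj["u"] - obj["g"]
    gr = obj["g"] - obj["r"]
    return UG_RANGE[0] <= ug <= UG_RANGE[1] and GR_RANGE[0] <= gr <= GR_RANGE[1]


def select_candidates(catalog: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """One pass over the catalog, one cheap comparison per object."""
    return [obj for obj in catalog if in_color_box(obj)]


# Made-up objects for demonstration.
catalog = [
    {"id": 1, "u": 17.2, "g": 16.0, "r": 16.2},  # u-g ~ 1.2, g-r ~ -0.2: inside
    {"id": 2, "u": 19.0, "g": 18.6, "r": 18.1},  # u-g ~ 0.4, g-r ~ 0.5: outside
    {"id": 3, "u": 18.5, "g": 17.4, "r": 17.5},  # u-g ~ 1.1, g-r ~ -0.1: inside
]
candidates = select_candidates(catalog)
print([obj["id"] for obj in candidates])
```

Because the per-object work is a handful of subtractions and comparisons, the run time of a filter like this is set almost entirely by how fast the catalog can be read.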

  7. Design Notes
     • Money -is- an object
       • Use commodity computers and learn to cluster
     • Our regime is that of the embarrassingly parallel
       • Do without low-latency parallel message passing
     • Have significant disk local to each machine
       • Can turn I/O-dominated problems into CPU-bound ones
     • Have a global data store
       • Enable data redistribution
     • Ease of use
       • Human thought cycles are the scarcest resource
       • Preserve reasonable interactive performance
       • Demand ease of use
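The embarrassingly parallel regime described above can be pictured as independent workers that each process their own partition and never exchange messages; only the small per-partition results are combined at the end. A minimal Python sketch, in which threads stand in for cluster nodes and both the partition contents and the `count_bright` task are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_bright(magnitudes: List[float], limit: float = 18.0) -> int:
    """Per-node task: count objects brighter (numerically smaller) than limit."""
    return sum(1 for m in magnitudes if m < limit)


# Hypothetical data, already split so that each "node" holds one partition.
partitions = [
    [17.1, 19.3, 16.8],
    [18.5, 17.9],
    [20.1, 15.2, 17.7, 18.0],
]

# Each partition is processed independently; no worker talks to another.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    per_node_counts = list(pool.map(count_bright, partitions))

# Only the per-partition counts are gathered and combined.
total_bright = sum(per_node_counts)
print(per_node_counts, total_bright)
```

On a real cluster the partitions would sit on each node's local disk and a batch system would launch the jobs; the structure of the computation is the same.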

  8. Technology Notes
     • Beowulf clusters are commodity items
       • The rise of the Internet economy created demand
       • Dual PIII are at sweet spot
     • Commodity interconnect
       • Fast Ethernet is nearly free
       • Gigabit Ethernet uplink
     • Commodity local disk
       • EIDE disk is half the cost of SCSI, 2/3rds as fast
     • Global data store
       • We prefer a high-speed server-less design
       • Fibre-channel
         • 100 MB/s/channel
         • All nodes have access to all disks
       • Global File System
         • Open source
         • Server-less
         • Device based file locks
         • Journaling

  9. The Terabyte Analysis Machine (architecture diagram; labels only)
     • Gigabit Uplink To Network
     • Interactive Node
     • Fibre Channel Switch
     • Fast Ethernet Switch
     • 1 Terabyte of Disk
     • 6 Compute Nodes
     • 0.5 Terabyte of Local Disk

  10. The Terabyte Analysis Machine
     • System integrator
       • Linux NetworX
       • Cluster control box
     • Compute Nodes
       • Linux NetworX
       • Dual 600 MHz Pentium III
       • ASUS motherboard
       • 1 Gig RAM
       • 2x36 Gig EIDE disks
       • Qlogic 2100 HBA
     • Ethernet
       • Cisco Catalyst 2948G
     • Fibre Channel
       • Gadzoox Capellix 3000
     • Global Disk
       • DotHill SanNet 4200
       • Dual Fibre Channel controllers
       • 10x73 Gig Seagate Cheetah SCSI disk
     • Software
       • Linux 2.2.16
       • Qlogic drivers
       • GFS V3.0
       • Condor
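The component counts can be checked against the capacities quoted on the architecture diagram. A short Python calculation, assuming 6 compute nodes (from the diagram), 2x36 Gig disks per node, and 10x73 Gig global disks, and ignoring RAID or file system overhead, which the slides do not state:

```python
COMPUTE_NODES = 6          # from the architecture diagram
LOCAL_DISKS_PER_NODE = 2   # "2x36 Gig EIDE disks"
LOCAL_DISK_GB = 36
GLOBAL_DISKS = 10          # "10x73 Gig Seagate Cheetah SCSI disk"
GLOBAL_DISK_GB = 73

# Raw capacities, before any redundancy or formatting overhead.
local_total_gb = COMPUTE_NODES * LOCAL_DISKS_PER_NODE * LOCAL_DISK_GB
global_raw_gb = GLOBAL_DISKS * GLOBAL_DISK_GB

print(f"local disk across nodes: {local_total_gb} GB")
print(f"global disk (raw):       {global_raw_gb} GB")
```

The totals, 432 GB local and 730 GB raw global, appear to correspond to the rounded "0.5 Terabyte of Local Disk" and "1 Terabyte of Disk" on the diagram.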

  11. GFS: The Global File System
     • Open source (GPL’d)
     • Sistina Software (ex-University of Minnesota)
     • High performance 64-bit files and file system
     • Distributed, server-less metadata
     • Data synchronization via global, disk based locks
     • Journaling and node cast-out
     • Three major pieces:
       • The network storage pool driver
       • The file system
       • The locking modules

  12. GFS Performance
     • Test setup
       • 5 nodes
       • One 5-disk RAID
     • Results
       • RAID limited at 95 Mbytes/sec
       • At >15 threads, disk head move limited
       • Linear performance before these limits is encouraging
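To put the RAID limit in context, the time for one sequential pass over a data set follows from its size and the sustained bandwidth. A small Python calculation, assuming decimal units (1 Terabyte = 10^12 bytes, 1 Mbyte = 10^6 bytes) and taking the 1 Terabyte object catalog from the data volume slide as the example; a real scan would add seek and locking overhead:

```python
def scan_time_hours(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time for one sequential pass at a sustained bandwidth, in hours."""
    return size_bytes / bandwidth_bytes_per_s / 3600.0


catalog_bytes = 1e12   # 1 Terabyte object catalog (decimal units assumed)
raid_limit = 95e6      # 95 Mbytes/sec RAID limit from the test results

hours = scan_time_hours(catalog_bytes, raid_limit)
print(f"one full pass: {hours:.2f} hours")
```

Under these assumptions a single pass over the catalog through the shared RAID takes roughly three hours, which illustrates why the design also places significant disk local to each node.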

  13. Shared Disk File Systems
     • GFS
       • Linux, FreeBSD
       • Distributed metadata
       • Test setup: 5 machines; one 5-disk RAID-5; 5 reads and 1 write; 1 GB files, 64k blocks
       • Write: 5.1 MB/s
       • Read: 30.0, 30.0 MB/s
       • Aggregate 65 MB/s / 90 MB/s (72% utilization)
     • cXFS (Ramon Pasetes, CD)
       • IRIX
       • Metadata server
       • Test setup: 6 machines; two 9-disk RAID-5; 3 reads and 3 writes; 1 GB files, 64k blocks
       • Write: 36.5, 28.4, 28.4 MB/s
       • Read: 11.5, 11.6, 11.9 MB/s
       • Aggregate 124 MB/s / 180 MB/s (69% utilization)
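The utilization percentages are the first number on each Aggregate line divided by the second. Reading the second number as the bandwidth available to the test is an interpretation; the slide does not label it. A short Python check of the arithmetic:

```python
def utilization_percent(aggregate_mb_s: float, available_mb_s: float) -> float:
    """Aggregate throughput as a percentage of the available bandwidth."""
    return 100.0 * aggregate_mb_s / available_mb_s


# Aggregate figures as quoted on the slide.
gfs_util = utilization_percent(65, 90)
cxfs_util = utilization_percent(124, 180)

print(f"GFS:  {gfs_util:.0f}%")
print(f"cXFS: {cxfs_util:.0f}%")
```

Both results round to the quoted 72% and 69%.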

  14. Science Database Research
     • The SDSS Science Archive (SX)
       • Objectivity and specialized astronomical code
     • Analysis engines
       • Programs that query SX over sockets
     • Hash machines
       • Redistribute data over local nodes for processing
     • Distance machine
       • Repartition and re-index data using schemes optimized for particular problems
       • Optimized K’th nearest neighbor searches (range searching) may take cluster finding to I/O dominated
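The hash machine idea, redistributing catalog objects over the compute nodes so that each node can process its share locally, can be sketched as mapping a spatial cell index to a node number. A minimal Python sketch; the 1-degree cell size, the cell-to-node formula, and the sample coordinates are assumptions for illustration, not the project's actual scheme:

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

N_NODES = 6       # number of compute nodes on the TAM diagram
CELL_DEG = 1.0    # hypothetical cell size in degrees
N_DEC_CELLS = int(180.0 / CELL_DEG) + 1  # declination cells covering -90..+90


def node_for(ra: float, dec: float) -> int:
    """Map a sky position (degrees) to a node via its (ra, dec) cell index."""
    ra_cell = math.floor(ra / CELL_DEG)
    dec_cell = math.floor((dec + 90.0) / CELL_DEG)
    # Deterministic combination of the two cell indices, so objects in the
    # same cell always land on the same node.
    return (ra_cell * N_DEC_CELLS + dec_cell) % N_NODES


def redistribute(
    objects: List[Tuple[float, float]]
) -> Dict[int, List[Tuple[float, float]]]:
    """Group (ra, dec) positions by the node that should hold them."""
    buckets: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for ra, dec in objects:
        buckets[node_for(ra, dec)].append((ra, dec))
    return dict(buckets)


# Made-up positions; the first two fall in the same 1-degree cell.
objects = [(10.2, -1.3), (10.7, -1.9), (200.5, 30.1), (355.9, 12.0)]
buckets = redistribute(objects)
for node, members in sorted(buckets.items()):
    print(node, members)
```

A scheme like this keeps objects from one cell together, but neighbors across a cell boundary may land on different nodes, so a cluster-finding job would likely need overlapping cells or a second pass along the boundaries.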

  15. Summary
     • TAM is an innovative University class* analysis cluster
     • Primary technical challenge
       • Server-less data sharing amongst cluster nodes
     • Primary research challenge
       • Database redistribution
     • Primary science challenge
       • Locate the clusters in the first 1000 sq-degrees of SDSS data
     *In Fermilab terms, “trailer class”
