
Large Scale Computations in Astrophysics: Towards a Virtual Observatory

This presentation discusses the challenges and potential solutions for handling the massive volumes of astronomical data generated by modern telescopes and surveys, and the need for a Virtual Observatory to facilitate data exploration and analysis. It also explores the similarities and differences between the data challenges of astrophysics and those of high-energy physics.

Presentation Transcript


  1. ACAT2000, Fermilab, Oct 19, 2000 Large Scale Computations in Astrophysics: Towards a Virtual Observatory Alex Szalay Department of Physics and Astronomy The Johns Hopkins University

  2. Nature of Astronomical Data • Imaging • 2D map of the sky at multiple wavelengths • Derived catalogs • subsequent processing of images • extracting object parameters (400+ per object) • Spectroscopic follow-up • spectra: more detailed object properties • clues to physical state and formation history • lead to distances: 3D maps • Numerical simulations • All inter-related!
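
To make the inter-relations on this slide concrete, here is a minimal sketch (with purely illustrative field names, not the actual SDSS schema) of how a derived catalog row ties imaging measurements to spectroscopic follow-up:

```python
# A toy catalog record: imaging-derived parameters plus optional spectroscopy.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogObject:
    obj_id: int
    ra_deg: float                                 # sky position from the imaging pipeline
    dec_deg: float
    fluxes: dict = field(default_factory=dict)    # per-band fluxes, e.g. {"u": ..., "g": ...}
    params: dict = field(default_factory=dict)    # the remaining ~400 derived parameters
    redshift: Optional[float] = None              # filled in after spectroscopic follow-up

    @property
    def has_distance(self) -> bool:
        # A redshift turns the 2D catalog position into a point in a 3D map.
        return self.redshift is not None
```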

  3. Imaging Data

  4. 3D Maps

  5. N-body Simulations

  6. Trends • Future dominated by detector improvements • Moore’s Law growth in CCD capabilities • Gigapixel arrays on the horizon • Improvements in computing and storage will track growth in data volume • Investment in software is critical, and growing [Figure: total area of 3m+ telescopes in the world in m², and total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.]
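
A back-of-the-envelope check of the growth rates quoted above (the 25-year factors of 30 and 3000 are from the slide; the doubling-time formula is the only addition):

```python
import math

years = 25
growth_glass, growth_pixels = 30, 3000          # factors quoted on the slide

# Implied doubling time for exponential growth: T2 = years / log2(growth factor)
t2_glass = years / math.log2(growth_glass)      # ~5.1 years for telescope area
t2_pixels = years / math.log2(growth_pixels)    # ~2.2 years, a Moore's-law-like pace for CCDs
print(f"glass doubles every {t2_glass:.1f} yr, pixels every {t2_pixels:.1f} yr")
```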

  7. The Age of Mega-Surveys • The next generation of mega-surveys and archives will change astronomy, due to • top-down design • large sky coverage • sound statistical plans • well controlled systematics • The technology to store and access the data is here: we are riding Moore’s law • Data mining will lead to stunning new discoveries • Integrating these archives is a task for the whole community => Virtual Observatory

  8. Ongoing surveys MACHO 2MASS DENIS SDSS GALEX FIRST DPOSS GSC-II COBE MAP NVSS ROSAT OGLE ... • Large number of new surveys • multi-TB in size, 100 million objects or more • individual archives planned, or under way • Multi-wavelength view of the sky • more than 13 wavelengths covered within 5 years • Impressive early discoveries • finding exotic objects by unusual colors • L, T dwarfs, high-z quasars • finding objects by time variability • gravitational microlensing

  9. The Necessity of the VO • Enormous scientific interest in the survey data • The environment to exploit these huge sky surveys does not exist today! • 1 Terabyte at 10 Mbyte/s takes about 1 day to transfer • Hundreds of intensive queries and thousands of casual queries per day • Data will reside at multiple locations, in many different formats • Existing analysis tools do not scale to Terabyte data sets • Acute need in a few years; a solution will not just happen by itself
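
The transfer-time figure above is simple arithmetic; a quick sketch of the estimate:

```python
terabyte = 1e12              # bytes
rate = 10e6                  # 10 MByte/s, the rate quoted on the slide
seconds = terabyte / rate
print(f"{seconds / 86400:.1f} days to move 1 TB")   # ~1.2 days for a single full transfer
```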

  10. VO - The Challenges • Size of the archived data • 40,000 square degrees is 2 trillion pixels • one band: 4 Terabytes • multi-wavelength: 10-100 Terabytes • time dimension: 10 Petabytes • Current techniques inadequate • new archival methods • new analysis tools • new standards • Hardware/networking requirements • scalable solutions required • Transition to the new astronomy
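
A back-of-the-envelope reconstruction of the sizes on this slide; the pixel scale and bytes per pixel are assumptions chosen to reproduce the quoted figures, not values from the slide:

```python
sq_deg = 40_000
arcsec2_per_sq_deg = 3600 ** 2                  # 1 degree = 3600 arcsec on a side
pixel_scale = 0.5                               # arcsec/pixel (assumed)
pixels = sq_deg * arcsec2_per_sq_deg / pixel_scale ** 2
print(f"{pixels:.1e} pixels per band")          # ~2.1e12, i.e. about 2 trillion

bytes_per_pixel = 2                             # 16-bit raw counts (assumed)
print(f"one band ~ {pixels * bytes_per_pixel / 1e12:.0f} TB")   # ~4 TB, matching the slide
```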

  11. VO: A New Initiative • Priority in the Astronomy and Astrophysics Survey • Enable new science not previously possible • Maximize impact of large current and future efforts • Create the necessary new standards • Develop the software tools needed • Ensure that the community has network and hardware resources to carry out the science

  12. New Astronomy - Different! • Data “Avalanche” • the flood of Terabytes of data is already happening, whether we like it or not • our present techniques of handling these data do not scale well with data volume • Systematic data exploration • will have a central role • statistical analysis of the “typical” objects • automated search for the “rare” events • Digital archives of the sky • will be the main access to data • hundreds to thousands of queries per day

  13. Examples: Data Pipelines

  14. Examples: Rare Events • Discovery of several new objects by SDSS & 2MASS • SDSS T-dwarf (June 1999)

  15. Examples: Reprocessing • Gravitational lensing: 28,000 foreground galaxies over 2,045,000 background galaxies in test data (McKay et al. 1999)

  16. Examples: Galaxy Clustering • Shape of fluctuation spectrum • cosmological parameters and initial conditions • The new surveys (SDSS) are the first when log N ~ 30 • Starts with a query • Compute correlation function • all pairwise distances: N², N log N possible • Power spectrum • optimal: the Karhunen-Loeve transform • signal-to-noise eigenmodes • N³ in the number of pixels • Needs to be done many times over
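
As an illustration of the N² pair-counting step behind the correlation function, here is a toy sketch using a spatial tree (random points stand in for a real catalog; the tree brings the cost down to roughly N log N):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
galaxies = rng.uniform(0.0, 100.0, size=(10_000, 3))   # toy 3D positions (illustrative units)

tree = cKDTree(galaxies)
r_bins = np.logspace(-1, 1.5, 20)                      # separation bins
# Cumulative pair counts within each radius; differencing gives counts per bin,
# i.e. the DD term of a correlation-function estimator.
cum_counts = tree.count_neighbors(tree, r_bins)
dd = np.diff(cum_counts)
print(dd[:5])
```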

  17. Relation to the HEP Problem • Similarities • need to handle large amounts of data • data is located at multiple sites • data should be highly clustered • substantial amounts of custom reprocessing • need for a hierarchical organization of resources • scalable solutions required • Differences of Astro from HEP • data migration is in opposite direction • the role of small queries is more important • relations between separate data sets (same sky) • data size currently smaller, we can keep it all on disk

  18. Data Migration Path [Diagram: tiered resource hierarchy, Tier 0 through Tier 3 with a user portal; the data migration path runs in opposite directions for Astro and HEP.]

  19. Queries are I/O limited • In our applications there are few fixed access patterns • one cannot build indices for all possible queries • worst case scenario is a linear scan of the whole table • Increasingly large differences between • Random access • controlled by seek time (5-10 ms), <1000 random I/Os per second • Sequential I/O • dramatic improvements, 100 MB/sec per SCSI channel is easy • reached 215 MB/sec on a single 2-way Dell server • Often much faster to scan than to seek • Good layout => more sequential I/O
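
A small worked example of why scanning often beats seeking, using the seek time and sequential rate from the slide (the 8 KB random-read size is an assumption):

```python
table_bytes = 1e12                    # a 1 TB table
seek_s = 7.5e-3                       # ~7.5 ms average seek (middle of the 5-10 ms range above)
random_io_bytes = 8e3                 # 8 KB per random read (assumed)
seq_rate = 100e6                      # 100 MB/sec per SCSI channel, as quoted

random_hours = (table_bytes / random_io_bytes) * seek_s / 3600
scan_hours = table_bytes / seq_rate / 3600
print(f"random access: ~{random_hours:,.0f} h   sequential scan: ~{scan_hours:.1f} h")
```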

  20. Distributed Archives • Networks are slower than disks: • minimize data transfer • run queries locally • I/O will scale linearly with nodes • 1 GB/sec aggregate I/O engine can be built for <$100K • Non-trivial problems in • load balancing • query parallelization • queries across inhomogeneous data sources • These problems are not specific to astronomy • commercial solutions are around the corner
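
A minimal scatter-gather sketch of “run queries locally, minimize data transfer”: each node scans only its own partition and returns just the matching rows. The function names and toy data are illustrative, not part of any SDSS interface.

```python
from concurrent.futures import ThreadPoolExecutor

def local_query(partition, predicate):
    """Runs on (or next to) the node that holds `partition`; ships back only matches."""
    return [row for row in partition if predicate(row)]

def distributed_query(partitions, predicate):
    # Fan the query out to every node, then merge the (small) result sets.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: local_query(p, predicate), partitions)
    return [row for part in results for row in part]

# Toy usage: three "nodes", each holding a slice of an (id, magnitude) catalog.
partitions = [[(i, 15 + (i % 10)) for i in range(start, start + 1000)]
              for start in (0, 1000, 2000)]
bright = distributed_query(partitions, lambda row: row[1] < 17)
print(len(bright))        # only the selected rows cross the "network"
```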

  21. Geometric Approach • Main problem • fast, indexed searches of Terabytes in N-dimensional space • searches are not axis-parallel • simple B-tree indexing does not work • Geometric approach • use the geometric nature of the data • quantize data into containers of ‘friends’ • objects of similar colors • close on the sky • clustered together on disk • containers represent a coarse-grained map of the data • multidimensional index-tree (e.g. k-d tree)
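
A toy sketch of the geometric idea: index objects jointly on sky position and colour so that non-axis-parallel ‘friends’ searches stay local. The colour weighting and search radius are purely illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
n = 100_000
ra = rng.uniform(0.0, 360.0, n)                 # toy sky positions (degrees)
dec = rng.uniform(-5.0, 5.0, n)                 # (spherical geometry ignored in this toy)
g_r = rng.normal(0.6, 0.3, n)                   # toy g-r colour

# Scale the colour axis so one "distance" mixes angular and colour similarity.
points = np.column_stack([ra, dec, 10.0 * g_r])
tree = cKDTree(points, leafsize=64)             # leaves act as coarse containers of friends

# Everything within a joint position/colour radius of the first object:
neighbours = tree.query_ball_point(points[0], r=2.0)
print(len(neighbours))
```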

  22. Geometric Indexing • “Divide and Conquer” partitioning of the attributes (3 + N + M) • Sky position (3 attributes): Hierarchical Triangular Mesh • Multiband fluxes (N = 5+ attributes): split as a k-d tree, stored as an R-tree of bounding boxes • Other attributes (M = 100+): regular indexing techniques

  23. SDSS: Distributed Archive [Diagram: User Interface and Analysis Engine connect to a Master SX Engine over an Objectivity Federation; multiple slave servers each run Objectivity on local RAID storage.]

  24. Computing Virtual Data • Analyze large output volumes next to the database • send results only (`Virtual Data’): the system `knows’ how to compute the result (Analysis Engine) • Analysis: different CPU to I/O ratio than database • multilayered approach • Highly scalable architecture required • distributed configuration – scalable to data grids • Multiply redundant network paths between data-nodes and compute-nodes • `Data-wolf’ cluster
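
One way to read “the system ‘knows’ how to compute the result” is a registry of named derivations whose outputs are computed next to the data, cached, and shipped back on demand. This is a hypothetical sketch of that pattern, not the actual SX/Analysis Engine interface.

```python
import hashlib, json

_recipes = {}     # registered derivations: name -> function
_cache = {}       # in a real grid this would be a shared, persistent result store

def recipe(name):
    def register(fn):
        _recipes[name] = fn
        return fn
    return register

def materialize(name, **params):
    """Compute the named 'virtual data' product once, server-side; return only the result."""
    key = hashlib.sha1(json.dumps([name, params], sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = _recipes[name](**params)
    return _cache[key]

@recipe("pair_counts")
def pair_counts(r_max, n_bins):
    # Placeholder: a real recipe would run the tree-based counting next to the database.
    return [0] * n_bins

print(materialize("pair_counts", r_max=10.0, n_bins=20))
```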

  25. SDSS Data Flow

  26. A Data Grid Node • Hardware requirements • large distributed database engines with a few GByte/s aggregate I/O speed • high speed (>10 Gbit/s) backbones cross-connecting the major archives • scalable computing environment with hundreds of CPUs for analysis [Diagram: a compute layer of ~200 CPUs in compute nodes, an interconnect layer at 1 Gbit/s per node with a 10 Gbit/s link to other nodes, and a database layer of Objectivity servers on RAID delivering 2 GBytes/sec aggregate.]

  27. SDSS in GriPhyN • Two Tier 2 Nodes (FNAL + JHU) • testing framework on real data in different scenarios • FNAL node • reprocessing of images • fast and full regeneration of catalogs from the images on disk • gravitational lensing, finer morphological classification • JHU node • statistical calculations, integrated with catalog database • tasks require lots of data, can be run in parallel • various statistical calculations, likelihood analyses • power spectra, correlation functions, Monte-Carlo

  28. Clustering of Galaxies Generic features of galaxy clustering: • Self-organized clustering driven by long-range forces • These lead to clustering on all scales • Clustering hierarchy: the distribution of galaxy counts is approximately lognormal • Scenarios: ‘top-down’ vs ‘bottom-up’

  29. Clustering of Computers • Problem sizes have lognormal distribution • multiplicative process • Optimal queuing strategy • run smallest job in queue • median scale set by local resources: largest jobs never finish • Always need more computing • ‘infall’ to larger clusters nearby • asymptotically long-tailed distribution of compute power • Short range forces: supercomputers • Long range forces: onset of high speed networking • Self-organized clustering of computing resources • the Computational Grid
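
A toy simulation of the queuing argument on this slide: job sizes drawn from a lognormal distribution, the smallest job in the queue always run first, and the largest jobs spilling over (‘infall’) to a bigger cluster. All numbers are illustrative.

```python
import heapq
import numpy as np

rng = np.random.default_rng(2)
job_sizes = rng.lognormal(mean=4.0, sigma=2.0, size=1000)   # illustrative CPU-hours

queue = [(size, i) for i, size in enumerate(job_sizes)]
heapq.heapify(queue)                          # shortest-job-first: smallest always on top

local_capacity = 100 * float(np.median(job_sizes))   # toy cut: what local resources can finish
finished_locally, exported = [], []
while queue:
    size, i = heapq.heappop(queue)
    (finished_locally if size <= local_capacity else exported).append(i)

print(f"run locally: {len(finished_locally)}, exported to a larger cluster: {len(exported)}")
```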

  30. Conclusions • Databases have become an essential part of astronomy: most data access will soon be via digital archives • Data at separate locations, distributed worldwide, evolving in time: move queries, not data! • Computations in both processing and analysis will be substantial: need to create a ‘Virtual Data Grid’ • Problems similar to HEP, with lots of commonalities, but the data flow is more complex • Interoperability of archives is essential: the Virtual Observatory is inevitable • More information: www.voforum.org, www.sdss.org
