This paper explores the significant statistical challenges faced in data-intensive astrophysics, particularly through analysis of galaxy surveys. It emphasizes the evolution of angular and cosmic microwave background surveys from 1970 to 2020, highlighting the Sloan Digital Sky Survey and the Cosmic Genome Project. The work discusses innovative methods such as robust incremental PCA to analyze galaxy spectra, tackling issues of sparse signals and noise in high-dimensional data. Various techniques, including Principal Component Pursuit, are presented to enhance the extraction of meaningful astronomical insights from vast datasets.
Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)
The Age of Surveys • Angular Galaxy Surveys (obj) • 1970 Lick 1M • 1990 APM 2M • 2005 SDSS 200M • 2011 PS1 1000M • 2020 LSST 30000M • CMB Surveys (pixels) • 1990 COBE 1000 • 2000 Boomerang 10,000 • 2002 CBI 50,000 • 2003 WMAP 1 Million • 2008 Planck 10 Million • Time Domain • QUEST • SDSS Extension survey • Dark Energy Camera • Pan-STARRS • LSST… • Galaxy Redshift Surveys (obj) • 1986 CfA 3500 • 1996 LCRS 23000 • 2003 2dF 250000 • 2008 SDSS 1000000 • 2012 BOSS 2000000 • 2012 LAMOST 2500000 • Petabytes/year …
Sloan Digital Sky Survey • “The Cosmic Genome Project” • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 2.5 Terapixels of images => 5 Tpx • 10 TB of raw data => 120TB processed • 0.5 TB catalogs => 35TB in the end • Started in 1992, finished in 2008 • Extra data volume enabled by • Moore’s Law • Kryder’s Law
Analysis of Galaxy Spectra • Sparse signal in large dimensions • Much noise, and very rare events • 4Kx1M SVD problem, perfect for randomized algorithms • Motivated our work on robust incremental PCA
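The remark that a 4K x 1M SVD is "perfect for randomized algorithms" can be illustrated with a minimal random-projection SVD. This is a generic sketch, not the authors' code; the function name `randomized_svd` and the parameters `oversample` and `n_iter` are assumptions made for illustration.

```python
import numpy as np

def randomized_svd(X, rank, oversample=10, n_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of X by random projection."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    k = min(rank + oversample, n_rows, n_cols)
    # Sample the column space of X with a Gaussian test matrix.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((n_cols, k)))
    # A few power iterations sharpen the separation of singular values.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    # Solve the small problem in the reduced basis, then map back.
    B = Q.T @ X
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]
```

The expensive full SVD is replaced by a few passes over X plus an SVD of a matrix with only `rank + oversample` rows, which is what makes the approach attractive when one dimension is in the millions.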
Galaxy Properties from Galaxy Spectra Spectral Lines Continuum Emissions
Galaxy Diversity from PCA • PC 1st [Average Spectrum] • 2nd [Stellar Continuum] • 3rd [Finer Continuum Features + Age] • 4th [Age]: Balmer series hydrogen lines • 5th [Metallicity]: Mg b, Na D, Ca II Triplet
Streaming PCA • Initialization • Eigensystem of a small, random subset • Truncate at p largest eigenvalues • Incremental updates • Mean and the low-rank A matrix • SVD of A yields new eigensystem • Randomized algorithm! T. Budavari, D. Mishin 2011
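The initialization and incremental update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Budavari and Mishin implementation: the class name `StreamingPCA`, the exponential forgetting factor `gamma`, and the exact form of the low-rank matrix A are assumptions. A is built so that A Aᵀ equals the down-weighted previous covariance plus the contribution of the new observation.

```python
import numpy as np

class StreamingPCA:
    """Rank-p eigensystem updated one observation at a time (illustrative)."""

    def __init__(self, init_batch, p, gamma=0.99):
        self.p, self.gamma = p, gamma
        # Initialization: eigensystem of a small subset, truncated at p.
        self.mean = init_batch.mean(axis=0)
        _, s, Vt = np.linalg.svd(init_batch - self.mean, full_matrices=False)
        self.basis = Vt[:p].T                      # shape (d, p)
        self.eigvals = s[:p] ** 2 / len(init_batch)

    def update(self, x):
        g = self.gamma
        # Running mean with exponential forgetting (assumed weighting).
        self.mean = g * self.mean + (1.0 - g) * x
        y = x - self.mean
        # Low-rank A with A A^T = g * E diag(eigvals) E^T + (1 - g) * y y^T.
        A = np.column_stack([self.basis * np.sqrt(g * self.eigvals),
                             np.sqrt(1.0 - g) * y])
        # The SVD of the (d, p + 1) matrix A yields the new eigensystem.
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        self.basis = U[:, :self.p]
        self.eigvals = s[:self.p] ** 2
```

Each update costs one SVD of a d x (p + 1) matrix, so the full data set never has to be held in memory.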
Robust PCA • PCA minimizes σ_RMS of the residuals r = y − Py • Quadratic formula: r² extremely sensitive to outliers • We optimize a robust M-scale σ² (Maronna 2005) • Implicitly given by • Fits in with the iterative method! • Outliers can be processed separately
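The equation that followed "Implicitly given by" did not survive in this transcript. The sketch below assumes the textbook M-scale definition, in which σ² solves (1/N) Σₙ ρ(rₙ²/σ²) = δ for a bounded loss ρ, and finds it by fixed-point iteration. The bisquare-type ρ, the breakdown parameter `delta = 0.5`, and the tuning constant `c2` are illustrative assumptions, not values taken from the talk.

```python
import numpy as np

def rho_bisquare(t, c2=1.0):
    """Bounded loss in t = r^2 / sigma^2; saturates at 1 for t >= c2."""
    u = np.clip(t / c2, 0.0, 1.0)
    return 1.0 - (1.0 - u) ** 3

def m_scale(residuals, delta=0.5, c2=1.0, tol=1e-10, max_iter=1000):
    """Return sigma^2 solving mean(rho(r^2 / sigma^2)) = delta."""
    r2 = np.asarray(residuals, dtype=float) ** 2
    sigma2 = np.median(r2)
    if sigma2 == 0.0:
        sigma2 = r2.mean()  # fallback when most residuals vanish
    for _ in range(max_iter):
        # Fixed-point step: rescale sigma^2 until the mean loss equals delta.
        new = sigma2 * rho_bisquare(r2 / sigma2, c2).mean() / delta
        if abs(new - sigma2) <= tol * sigma2:
            return new
        sigma2 = new
    return sigma2
```

Because ρ is capped at 1, a residual of any size contributes at most 1/N to the mean, which is why a handful of extreme outliers barely moves σ² while it dominates the RMS.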
Eigenvalues in Streaming PCA Classic Robust
Examples with SDSS Spectra Built on top of the Incremental Robust PCA • Principal Component Pursuit (I. Csabai et al) • Importance sampling (C-W Yip et al)
Principal component pursuit * E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009. Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection) • Low rank approximation of data matrix: X • Standard PCA: • works well if the noise distribution is Gaussian • outliers can cause bias • Principal component pursuit • “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low • NP-hard problem • The L1 trick: • numerically feasible convex problem (Augmented Lagrange Multiplier)
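The "L1 trick" replaces rank and outlier count with the nuclear norm and the L1 norm: minimize ‖L‖* + λ‖S‖₁ subject to L + S = X. The following is a minimal sketch of one augmented Lagrange multiplier scheme for that convex problem (the inexact ALM variant with a growing penalty). The function names, the default λ = 1/sqrt(max(m, n)), the starting penalty, and the growth factor are assumptions chosen for illustration; they are not stated in the slides.

```python
import numpy as np

def shrink(M, tau):
    """Soft-thresholding: proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(X, lam=None, tol=1e-7, max_iter=200, growth=1.5):
    """Split X into low-rank L and sparse S with L + S close to X."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    norm_two = np.linalg.norm(X, 2)
    norm_fro = np.linalg.norm(X)
    # Dual variable and penalty initialised as in common inexact-ALM codes.
    Y = X / max(norm_two, np.abs(X).max() / lam)
    mu = 1.25 / norm_two
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(max_iter):
        S = shrink(X - L + Y / mu, lam / mu)
        L = svd_threshold(X - S + Y / mu, 1.0 / mu)
        R = X - L - S                    # constraint residual
        Y = Y + mu * R
        mu *= growth
        if np.linalg.norm(R) <= tol * norm_fro:
            break
    return L, S
```

Passing `lam` explicitly allows other weightings of the sparse term to be tried.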
Testing on Galaxy Spectra • Slowly varying continuum + absorption lines • Highly variable “sparse” emission lines • This is the simple version of PCP: the positions of the lines are known • but there are many of them, so automatic detection can be useful • spiky noise can bias standard PCA • DATA: • Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.) • SDSS 1M galaxy spectra • Morphological subclasses • Robust averages + first few PCA directions
PCA PCA reconstruction Residual
Principal component pursuit Low rank Sparse Residual λ=0.6/sqrt(n), ε=0.03
Not Every Data Direction is Equal • [Figure: A = C X; A has axes Galaxy ID and Wavelength, C has axes Galaxy ID and Selected Wavelengths] • Procedure: 1. Perform SVD of A = U Σ Vᵀ 2. Pick number of eigenvectors = K 3. Calculate leverage score p_j = Σ_i (Vᵀ_ij)² / K, summed over i = 1…K • Mahoney and Drineas 2009
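The three-step procedure can be written in a few lines. This is a minimal sketch assuming A is stored with galaxies as rows and wavelength pixels as columns; the function name `leverage_scores` is hypothetical.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage score of each column (wavelength) of A.

    A is assumed to have shape (n_galaxies, n_wavelengths).
    """
    # Step 1: SVD of A; the rows of Vt are the right singular vectors.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Steps 2 and 3: keep the top k vectors, sum their squared entries
    # per column, and divide by k so that the scores sum to one.
    return (Vt[:k] ** 2).sum(axis=0) / k
```

Since each right singular vector has unit norm, the scores form a probability distribution over wavelengths and can be used directly as sampling probabilities.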
Wavelength Sampling Probability k = 2 c = 7 k = 4 c = 16 k = 6 c = 25 k = 8 c = 29
Ranking Astronomical Line Indices • Subspace Analysis of Spectra Cutouts: • Orthogonality • Divergence • Commonality (Worthey et al. 94; Trager et al. 98) (Yip et al. 2012 in prep.)
Identify Informative Regions “NewMethod” 1. Pick the λ with largest Pλ 2. Define its region of influence using λ Pλ convergence. Mask λ’s from future selection. 3. Go back to Step 1, or quit. “MahoneySecond” 1. Over-select λ’s relative to the targeted number. 2. Merge selected λ’s if two pixels lie within a certain distance. 3. Quit.
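One possible reading of the “MahoneySecond” recipe is sketched below. The slides do not specify how the over-selection is drawn or how merging is chained, so the choices here are assumptions: the top-scoring pixels are taken deterministically, and neighbouring picks are chained into one region whenever the gap between them is below `merge_dist`. The function name `select_regions` and its parameters are hypothetical.

```python
import numpy as np

def select_regions(wavelengths, p, n_target, overselect=10, merge_dist=30.0):
    """Over-select high-probability pixels, then merge close neighbours.

    Returns a list of (start, end) wavelength intervals.
    """
    wavelengths = np.asarray(wavelengths, dtype=float)
    p = np.asarray(p, dtype=float)
    # Over-select: take more pixels than the targeted number of regions.
    n_pick = min(n_target * overselect, len(p))
    top = np.argsort(p)[::-1][:n_pick]
    picked = np.sort(wavelengths[top])
    # Merge: extend the current region while the next pick is close enough.
    regions = [[picked[0], picked[0]]]
    for w in picked[1:]:
        if w - regions[-1][1] < merge_dist:
            regions[-1][1] = w
        else:
            regions.append([w, w])
    return [(float(a), float(b)) for a, b in regions]
```

For example, with pixels at 4000, 4010 and 4020 selected alongside isolated pixels at 4100 and 4200, a 30-unit merge distance yields one extended region and two single-pixel regions.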
Identifying New Line Indices, Objectively (Yip et al. 2012 in prep.)
New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)
NewMethod vs MahoneySecond NM M2
Angle between Subspaces JHU Lick
λ Pλ JHU Lick
Importance Sampling and Galaxies • Lick indices are ad hoc • The new indices are objective • Recover atomic lines • Recover molecular bands • Recover Lick indices • Informative regions are orthogonal to each other, in contrast to Lick • Future • Emission line indices • More accurate parameter estimation of galaxies
Summary • Astronomy has always been data-driven… now becoming more generally accepted • Non-incremental changes on the way • Science is moving increasingly from hypothesis-driven to data-driven discoveries • Need randomized, incremental algorithms • Best result in 1 min, 1 hour, 1 day, 1 week • New computational tools and strategies … not just statistics, not just computer science, not just astronomy, not just genomics…