
Science In An Exponential World

Presentation Transcript


  1. Science In An Exponential World Alexander Szalay, JHU Jim Gray, Microsoft Research

  2. Evolving Science • Thousand years ago: Science was empirical • Describing natural phenomena • Last few hundred years: Theoretical branch • Using models, generalizations • Last few decades: A computational branch • Simulating complex phenomena • Today: Data exploration (e-science) • Synthesizing theory, experiment and computation with advanced data management and statistics → new algorithms!

  3. Exponential World of Data • Astronomers have a few hundred TB now • 1 pixel (byte) / sq arc second ~ 4TB • Multi-spectral, temporal, … → 1PB • They mine it looking for • new (kinds of) objects or more of interesting ones (quasars), density variations in multi-D space, spatial and parametric correlations • Data doubles every year • Same access for everyone
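
To make the doubling-time claim concrete, here is a minimal sketch of how yearly doubling reaches the petabyte scale. The ~200 TB starting point is an illustrative figure taken from the slide's "few hundred TB", not a quoted measurement:

```python
# Minimal sketch: yearly doubling of archive size, starting from an
# assumed ~200 TB (illustrative figure, not a quoted measurement).
start_tb = 200.0
for year in range(6):
    size_tb = start_tb * 2 ** year
    print(f"year {year}: {size_tb:,.0f} TB ({size_tb / 1000:.2f} PB)")
# Doubling every year crosses 1 PB between years 2 and 3.
```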

  4. The Challenges • Exponential data growth: Distributed collections, soon Petabytes • The pipeline: Data Collection → Discovery and Analysis → Publishing • New analysis paradigm: Data federations, move analysis to the data • New publishing paradigm: Scientists are publishers and curators

  5. Roles
                  Authors         Publishers        Curators         Consumers
     Traditional  Scientists      Journals          Libraries        Scientists
     Emerging     Collaborations  Project www site  Bigger Archives  Scientists
  Publishing Data • Exponential growth • Projects last at least 3-5 years • Data sent upwards only at the end of the project • Data will never be centralized • More responsibility on projects • Becoming Publishers and Curators • Data will reside with projects • Analyses must be close to the data

  6. Making Discoveries • Where are discoveries made? • At the edges and boundaries • Going deeper, collecting more data, using more dimensions • Metcalfe’s law • Utility of computer networks grows as the number of possible connections: O(N²) • Federating data • Federation of N archives has utility O(N²) • Possibilities for new discoveries grow as O(N²)
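
A quick sketch of the O(N²) claim: the number of distinct pairwise federations of N archives is N(N−1)/2, which grows quadratically:

```python
# Sketch: pairwise "connections" among N federated archives grow as O(N^2).
def pairwise_federations(n_archives: int) -> int:
    # Each unordered pair of archives is one possible cross-match/federation.
    return n_archives * (n_archives - 1) // 2

for n in (2, 10, 100):
    print(n, "archives ->", pairwise_federations(n), "possible pairings")
# 2 -> 1, 10 -> 45, 100 -> 4950: utility grows roughly as N^2 / 2.
```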

  7. Data Access Is Hitting a Wall: FTP and GREP Are Not Adequate • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (= 1 $/GB) • … 2 days and 1K$ • … 3 years and 1M$ • Oh!, and 1 PB ~ 4,000 disks • At some point you need indices to limit search, and parallel data search and analysis • This is where databases can help • If there is too much data to move around, take the analysis to the data! • Do all data manipulations at the database • Build custom procedures and functions in the database
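
The slide's round numbers follow from simple throughput arithmetic. A sketch, where the 10 MB/s sequential scan rate is an assumed, era-appropriate figure rather than one taken from the slide:

```python
# Sketch: why sequential scan (grep) stops scaling with data volume.
# The scan rate is an assumption for illustration (~2006-era single disk).
SCAN_MB_PER_S = 10.0

def scan_time(size_mb: float) -> str:
    seconds = size_mb / SCAN_MB_PER_S
    return f"{seconds:,.0f} s (~{seconds / 86_400:,.1f} days)"

for label, mb in [("1 GB", 1e3), ("1 TB", 1e6), ("1 PB", 1e9)]:
    print(label, "->", scan_time(mb))
# 1 PB at 10 MB/s is ~3 years of scanning: you need indexes and parallelism.
```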

  8. Next-Generation Data Analysis • Looking for • Needles in haystacks – the Higgs particle • Haystacks: Dark matter, Dark energy • Needles are easier than haystacks • ‘Optimal’ statistics have poor scaling • Correlation functions are N², likelihood techniques N³ • For large data sets main errors are not statistical • As data and computers grow with Moore’s Law, we can only keep up with N log N • Take cost of computation into account • Controlled level of accuracy • Best result in a given time, given our computing resources • Requires combination of statistics and computer science • New algorithms
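
A sketch of why the 'optimal' estimators cannot keep up: if compute budgets grow only in proportion to the data size N, the gap between N log N and N² or N³ becomes unbridgeable:

```python
import math

# Sketch: relative cost of common estimators as sample size N grows.
# Correlation functions scale ~N^2, likelihood techniques ~N^3, while an
# affordable budget tracks ~N log N (compute growing in step with data).
for n in (1e6, 1e8, 1e10):
    print(f"N={n:.0e}: NlogN={n * math.log2(n):.1e}, "
          f"N^2={n ** 2:.0e}, N^3={n ** 3:.0e}")
```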

  9. Our E-Science Projects • Sloan Digital Sky Survey/ SkyServer • Virtual Observatory • Wireless Sensor Networks • Analyzing Large Numerical Simulations • Fast Spatial Search Techniques • Commonalities • Web services • Analysis inside the database!

  10. Why Is Astronomy Special? • Especially attractive for the general public • It has no commercial value – “worthless!” (Jim Gray) • No privacy concerns, freely share results with others • Great for experimenting with algorithms • It is real and well documented • High-dimensional (with confidence intervals) • Spatial, temporal • Diverse and distributed • Many different instruments from many different places and many different times → Virtual Observatory • The questions are interesting • There is a lot of it (soon Petabytes)

  11. Features of the SDSS • Goal • Create the most detailed map of the Northern sky in 5 years • “The Cosmic Genome Project” • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Automated data reduction • 150 man-years of development • Very high data volume • 40 TB of raw data • 5 TB processed catalogs • Data is public • Participants: The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, New Mexico State University, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Inst. Heidelberg • Funding: Sloan Foundation, NSF, DOE, NASA

  12. The Imaging Survey • Drift scan of 10,000 square degrees • 24k x 1M pixel “panoramic” images in 5 colors – broad-band filters (u,g,r,i,z) • 2.5 Terapixels of image

  13. The Spectroscopic Survey • Expanding universe • Redshift = distance • SDSS redshift survey • 1 million galaxies • 100,000 quasars • 100,000 stars • Two high throughput spectrographs • Spectral range 3900-9200 Å • 640 spectra simultaneously • R=2000 resolution, 1.3 Å • Features • Automated reduction of spectra • Very high sampling density and completeness
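
The "redshift = distance" shorthand is the low-redshift Hubble relation. A minimal sketch, assuming H₀ ≈ 70 km/s/Mpc (a standard round value, not stated on the slide) and valid only for small z:

```python
# Sketch: low-redshift distance from the Hubble law, d ~ c*z / H0.
# H0 = 70 km/s/Mpc is an assumed round value; valid only for z << 1.
C_KM_S = 299_792.458   # speed of light, km/s
H0 = 70.0              # Hubble constant, km/s/Mpc (assumption)

def distance_mpc(z: float) -> float:
    return C_KM_S * z / H0

for z in (0.01, 0.05, 0.1):
    print(f"z={z}: ~{distance_mpc(z):,.0f} Mpc")
# z=0.1 -> ~428 Mpc; beyond z~0.1 a proper cosmological model is needed.
```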

  14. SkyServer • Sloan Digital Sky Survey: Pixels + Objects • About 500 attributes per “object”, 400M objects • Spectra for 1M objects • Currently 2.4TB fully public • Prototype eScience lab • Moving analysis to the data • Fast searches: Color, spatial • Visual tools • Join 2.5 Terapix pixels with objects • Prototype in data publishing • 160 million web hits in 5 years • http://skyserver.sdss.org/
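
"Moving the analysis to the data" means sending SQL to the archive instead of downloading catalogs. A hedged sketch of the style of a fast color search (the table and column names mirror the public SkyServer schema, but treat them as illustrative and check the live schema before use):

```python
# Sketch of a server-side "color cut" search in the SkyServer style.
# PhotoObj and its columns (objID, ra, dec, g, r) follow the public SDSS
# schema but are illustrative here; only the results cross the network.
color_cut_query = """
SELECT TOP 100 objID, ra, dec, g, r, g - r AS gr_color
FROM PhotoObj
WHERE g - r > 1.0          -- red objects only
  AND r BETWEEN 15 AND 19  -- bright enough to be well measured
ORDER BY gr_color DESC
"""
print(color_cut_query)  # submitted to the archive's SQL interface
```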

  15. The SkyServer Experience • Sloan Digital Sky Survey: Pixels + Objects • About 500 attributes per “object”, 400M objects • Currently 2.4TB fully public • Prototype eScience lab (800 users) • Moving analysis to the data • Fast searches: Color, spatial • Visual tools • Join pixels with objects • Prototype in data publishing • 180 million web hits in 5 years • 930,000 distinct users • http://skyserver.sdss.org/

  16. SkyServer Traffic

  17. Public Data Release Versions • June 2001: EDR • Early Data Release • July 2003: DR1 • Contains 30% of final data • 150 million photo objects • 3 versions of the data • Target, Best, Runs • Total catalog volume 5TB • Published releases served ‘forever’ • EDR, DR1, DR2, …, now at DR5 • Next: Include e-mail archives, annotations • O(N²) – only possible because of Moore’s Law!

  18. Spatial Information For Users • What surveys covered this part of the sky? • What is the common area of these surveys? • Is this point in the survey? • Give me all objects in this region • Cross-match these two catalogs • Give me the cumulative counts over areas • Compute fast spherical transforms of densities • Interpolate sparsely sampled functions

  19. Spatial Queries In SQL • Regions and convexes • Boolean algebra of spherical polygons • Indexing using spherical quadtrees • Hierarchical Triangular Mesh • Fast Spatial Joins of billions of points • Zone algorithm • All implemented in T-SQL and C#, running inside SQL Server 2005
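
The zone algorithm mentioned above divides the sphere into declination stripes so that a radius match only compares points in neighboring stripes. A minimal Python sketch of the core indexing idea (the production version runs as T-SQL inside SQL Server; this only shows the bucketing step):

```python
import math
from collections import defaultdict

# Sketch of the zone algorithm's core idea: bucket points by declination
# stripe so a radius-r match only inspects a few neighboring stripes.
ZONE_HEIGHT_DEG = 0.1  # stripe height; should be >= the match radius

def zone_id(dec_deg: float) -> int:
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))

def build_zones(points):  # points: [(ra_deg, dec_deg), ...]
    zones = defaultdict(list)
    for ra, dec in points:
        zones[zone_id(dec)].append((ra, dec))
    return zones

def candidates(zones, ra, dec):
    z = zone_id(dec)
    for zz in (z - 1, z, z + 1):        # own zone plus neighbors
        yield from zones.get(zz, [])    # still needs an exact distance test
```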

  20. Things Can Get Complex

  21. Simulations • Cosmological simulations have 10⁹ particles and produce over 30 TB of data (Millennium) • Build up dark matter halos • Track merging history of halos • Use it to assign star formation history • Combination with spectral synthesis • Realistic distribution of galaxy types • Too few realizations (now 50) • Hard to analyze the data afterwards → need a DB • What is the best way to compare to real data?
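
Tracking halo merging histories is naturally a tree walk. A hedged sketch, where the halo records and the progenitor links are hypothetical stand-ins for whatever the simulation database actually stores:

```python
# Sketch: walking a halo merger tree. The dict-of-lists layout and halo
# names are hypothetical; real simulation databases (e.g. Millennium)
# expose similar progenitor/descendant links queryable via SQL.
progenitors = {          # descendant halo id -> progenitor halo ids
    "haloA": ["haloB", "haloC"],
    "haloB": ["haloD"],
}

def merger_history(halo_id: str, depth: int = 0) -> None:
    print("  " * depth + halo_id)
    for prog in progenitors.get(halo_id, []):
        merger_history(prog, depth + 1)

merger_history("haloA")  # prints the full merging history of haloA
```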

  22. Trends CMB Surveys 1990 COBE 1000 2000 Boomerang 10,000 2002 CBI 50,000 2003 WMAP 1 Million 2008 Planck 10 Million Angular Galaxy Surveys 1970 Lick 1M 1990 APM 2M 2005 SDSS 200M 2008 VISTA 1000M 2012 LSST 3000M Time Domain QUEST SDSS Extension survey Dark Energy Camera PanStarrs: 1PB by 2007 LSST: 100PB by 2020 Galaxy Redshift Surveys 1986 CfA 3500 1996 LCRS 23000 2003 2dF 250000 2005 SDSS 750000 Petabytes/year by the end of the decade…

  23. Exploration Of Turbulence • We can finally “put it all together” • Large scale range, scale-ratio O(1,000) • Three-dimensional in space • Time-evolution and Lagrangian approach (follow the flow) • Unique turbulence database • We are creating a database of O(2,000) consecutive snapshots of a 1,024³ simulation of turbulence • Close to 100 Terabytes • Treat it as an experiment
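
The storage figure follows from straightforward arithmetic. A sketch, with the per-point field count and numeric precision as assumptions (the slide does not state them):

```python
# Sketch: storage for O(2,000) snapshots of a 1024^3 turbulence run.
# Fields-per-point and bytes-per-value are assumptions for illustration.
grid_points = 1024 ** 3          # ~1.07e9 points per snapshot
snapshots = 2_000
fields = 4                       # e.g. 3 velocity components + pressure (assumed)
bytes_per_value = 4              # single precision (assumed)

total_bytes = grid_points * snapshots * fields * bytes_per_value
print(f"~{total_bytes / 1e12:,.0f} TB")
# ~34 TB at these assumptions; extra fields, precision, or derived
# quantities push the archive toward the ~100 TB quoted on the slide.
```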

  24. Wireless Sensor Networks • Will use 200 wireless (Intel) sensors, monitoring • Air temperature, moisture • Soil temperature, moisture, at least in two depths (5cm, 20 cm) • Light (intensity, composition) • Gases (O2, CO2, CH4, …) • Long-term continuous data • Small (hidden) and affordable (many) • Less disturbance • >200 million measurements/year • Collaboration with Microsoft • Complex database of sensor data and samples
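
The >200 million measurements/year figure is easy to sanity-check. A sketch, with the per-node channel count and sampling interval as assumptions (neither is stated on the slide):

```python
# Sketch: yearly measurement volume for the sensor deployment.
# Channels-per-node and sample interval are assumptions for illustration.
nodes = 200
channels = 6            # air T, air moisture, 2x soil T/moisture, light (assumed)
sample_every_s = 60     # one sample per channel per minute (assumed)

samples_per_year = 365 * 24 * 3600 // sample_every_s
per_year = nodes * channels * samples_per_year
print(f"{per_year:,} measurements/year")   # ~630 million at these rates,
# comfortably above the >200 million quoted on the slide.
```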

  25. Current Sensor Database • Using sensor deployment at JHU (Szlavecz talk) • 10 motes * 5 months = 8M data points • SQL Server 2005 database • Adapted from astronomy: NVO + SkyServer • Started with “20 queries” • Rich metadata stored in database • Data access via web services • Graphical interface • DataCube under construction in collaboration with Stuart Ozer (multidimensional summary of data)

  26. The Big Picture • Experiments and Instruments, Other Archives, Literature, and Simulations all feed facts into the archive; users pose questions and get answers back • The Big Problems • Data ingest • Managing a petabyte • Common schema • How to organize it? How to reorganize it? • How to coexist with others • Query and Visualization tools • Support/training • Performance • Execute queries in a minute • Batch query scheduling

  27. Summary • Data growing exponentially • Requires a new model • Having more data makes it harder to extract knowledge • Information at your fingertips • Students see the same data as professionals • More data coming: Petabytes/year by 2010 • Need scalable solutions • Move analysis to the data! • Same thing happening in all sciences • High energy physics, genomics/proteomics, medical imaging, oceanography… • E-Science: An emerging new branch of science • We need multiple skills in a world of increasing specialization…

  28. Microsoft Computational Science Workshopat the Johns Hopkins University Oct 13-15, 2006

  29. © 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
