Data Intensive Engineering and Science

Presentation Transcript

  1. Data Intensive Engineering and Science • Gaia and Virtualisation • Bringing the process to the data – Virtualization? • William O’Mullane • Gaia Science Operations Development Manager • European Space Astronomy Centre (ESAC) • Madrid, Spain • http://www.rssd.esa.int/Gaia

  2. Satellite • Mission: • Stereoscopic census of the Galaxy • µas astrometry for G<20 (10^9 sources) • Radial velocities for G<16 • Photometry for G<20 • Status: • ESA Cornerstone 6 – ESA provides the hardware and launch • Launch: Spring 2012 • Satellite in development by EADS/Astrium

  3. Cruise to L2 (graphic: EADS Astrium)

  4. Lissajous Scanning • Lissajous orbit around L2 • Full sky covered 3-fold every six months • 5-year coverage (Lindegren)

  5. Giga Pixel Focal Plane • 106 CCDs, 938 million pixels, 2800 cm² (104.26 cm × 42.35 cm) • Sky Mapper, Astrometric Field, Blue and Red Photometer, and Radial-Velocity Spectrometer CCDs, plus Wave Front Sensors and Basic Angle Monitors • (Diagram: star image motion across the focal plane in 10 s)

  6. Pixels and images… • Need the astrometric centroid of the CCD image determined to an accuracy of 1% of the pixel size! • There will be 10^12 images • Images are ‘windowed’ on board • Only binned windows are downlinked • We never actually get to see the Gaia ‘picture’ • Millimag photometry is also difficult (calibration) • Spectra – serious blending problems • And then there is radiation (CTI effects)

  7. Processing • Done by the community: the Data Processing and Analysis Consortium (DPAC) • ~360 active (meaning >=10% of their time) participants • Divided into 9 Coordination Units (CUs) • Top-level Executive (DPACE) and Project Office (PO) • 6 Data Processing Centres to run the software • All code in Java (only one exception) • For portability – it has to run until 2020 • Maintainability, testability etc.: JUnit, Hudson • Easier to write CORRECT code in a higher-level language • Fewer core dumps – OK, so the core dump is replaced by the NPE (NullPointerException) • Several relational databases: Oracle, PostgreSQL, MySQL, and Derby for testing
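The JUnit/Hudson practice named on this slide can be sketched as follows; `GaiaPhotometry` and its methods are hypothetical helpers for illustration, not actual DPAC code (in DPAC this kind of check would live in a JUnit test run by Hudson on every build):

```java
// Minimal sketch of an automatically testable utility class.
// GaiaPhotometry is a hypothetical helper, not actual DPAC code.
public class GaiaPhotometry {

    /** Relative flux for magnitude m: f = 10^(-0.4 m). */
    public static double magToFlux(double mag) {
        return Math.pow(10.0, -0.4 * mag);
    }

    /** Magnitude from relative flux: m = -2.5 log10(f). */
    public static double fluxToMag(double flux) {
        return -2.5 * Math.log10(flux);
    }

    public static void main(String[] args) {
        // The round trip must be the identity to numerical precision.
        double m = 15.3;
        if (Math.abs(fluxToMag(magToFlux(m)) - m) > 1e-9) {
            throw new AssertionError("round trip failed");
        }
        System.out.println("ok");
    }
}
```

Keeping numerical helpers free of database or file dependencies, as here, is what makes the "testability" bullet cheap to satisfy.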

  8. Architecture • Highly distributed • Multiple independent DPCs and CUs • Want/need to decouple them to reduce dependencies and risk • Hub and spokes: maximum flexibility for the CUs and DPCs, minimum number of ICDs (ICD = Interface Control Document)

  9. ESA/SOC and DPAC • The Science Operations Centre (SOC) is funded by ESA (part of Gaia CAC) • Will carry out the science operations of Gaia • Gets the data to DPAC for processing • Initial processing software runs at ESAC • SOC has also been embedded directly in the DPAC structure since the outset: • Provides architecture and technical advice/guidance (CU1) • CU1 also has CNES and other DPC people • All CU leaders sit in the Executive • Provides one of the six Data Processing Centres • Provides technical support for the core processing • Specifically, significant effort on the Astrometric Solution

  10. AGIS • Astrometric Global Iterative Solution (Lindegren, Lammers) • Provides a rigid, independent reference frame for Gaia observations • Rotated to the ICRS using quasars • Perhaps about 10% of all the processing • Only deals with about 10% of the data (well-behaved stars, to make the grid) • Block-iterative solution using a Gauss-Seidel “preconditioner” with simple iterations • Moving to Conjugate Gradient • Collaboration with Yoshiyuki Yamada for Nano-JASMINE processing • Picardo – last year, Parache now

  11. Global Iterative Solution • Sky scans (highest accuracy along scan; scan width 0.7°) • 1. Objects are matched in successive scans • 2. Attitude and calibrations are updated • 3. Object positions etc. are solved • 4. Higher-order terms are solved • 5. More scans are added • 6. The whole system is iterated

  12. AGIS – Observation Model • Symbolically, the centroid of a star image is modelled as a function of: • the 6 astrometric parameters of the source • the attitude quaternion q(t), represented by cubic-spline coefficients • a fixed geometric calibration, plus chromaticity and a CTI shift • global parameters (e.g. PPN γ) • with white Gaussian noise of known σ
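The formula this slide shows survives only as scattered labels, so it can be reconstructed only schematically; a hedged sketch of the observation model, with the symbols as named on the slide and the exact functional form assumed:

```latex
% Along-scan field angle for observation l of source i (schematic):
%   s_i  : the 6 astrometric parameters of the source
%   q(t) : attitude quaternion, given by cubic-spline coefficients
%   c    : fixed geometric calibration (+ chromaticity + CTI shift)
%   g    : global parameters (e.g. PPN gamma)
\eta_l = f\bigl(\mathbf{s}_i,\ \mathbf{q}(t_l),\ \mathbf{c},\ \mathbf{g}\bigr)
          + \epsilon_l,
\qquad
\epsilon_l \sim \mathcal{N}(0,\ \sigma_l^{2})
\quad \text{(white Gaussian, known $\sigma_l$)}
```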

  13. AGIS – How? • Block-iterative least-squares solution of the over-determined system of equations • Initialize S, A, C, G, then update (order of operations may vary): • S: one star at a time • A: one attitude interval at a time • C: one calibration unit at a time • G: for the whole data set, + renormalise* • Iterate until convergence, then renormalise** S and adjust A • * defines the origin of the instrument axes • ** defines the origin of the celestial axes
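The block-iterative scheme above can be sketched as a loop over parameter blocks; `Block`, its `update()` method and the toy convergence test are illustrative placeholders under our own naming, not the actual DPAC interfaces:

```java
// Sketch of the AGIS block-iterative outer loop: update each parameter
// block (S, A, C, G) in turn with the others held fixed, and repeat
// until the largest parameter change falls below a tolerance.
import java.util.Arrays;
import java.util.List;

public class BlockIteration {

    /** One parameter block; update() solves it with the other blocks
     *  held fixed and returns the largest parameter change it made. */
    interface Block { double update(); }

    /** Sweep the blocks until the largest change is below tol;
     *  returns the number of outer iterations used. */
    static int solve(List<Block> blocks, double tol, int maxIter) {
        for (int iter = 1; iter <= maxIter; iter++) {
            double maxChange = 0.0;
            for (Block b : blocks) {              // S, then A, then C, then G
                maxChange = Math.max(maxChange, b.update());
            }
            if (maxChange < tol) return iter;     // converged
        }
        return maxIter;
    }

    public static void main(String[] args) {
        // Toy block whose "error" halves on every update: 2^-10 < 1e-3.
        double[] err = {1.0};
        Block toy = () -> { err[0] *= 0.5; return err[0]; };
        System.out.println(solve(Arrays.asList(toy), 1e-3, 100)); // prints 10
    }
}
```

The real system interleaves the renormalisation steps noted on the slide; the sketch only shows the convergence-driven outer loop.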

  14. AGIS Architecture • Datatrains drive through the AGIS Database, passing observations to the algorithms (Source, Attitude, Calibration and Global updaters). There can be as many Datatrains in parallel as we wish. • Data Access Layer: ElementaryTakers, GaiaTable, Store, ObjectFactory, AstroElementary • Optimised AGIS Database

  15. Scheduling • Very simple… keep all machines busy all the time! • Busy = CPU ~90% • Post jobs (Job1, Job2, … JobN) on a whiteboard (like the OPUS blackboard) • Trains/workers mark jobs – and do them • Mark finished – repeat until done • A previous attempt had much more general scheduling; it was also ~1000 times slower.
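The whiteboard idea above can be sketched in a few lines of plain Java: jobs are posted on a shared board, and each worker repeatedly takes one, runs it, and marks it finished until none remain. Class and method names here are illustrative, not the actual AGIS code:

```java
// Sketch of whiteboard scheduling: a shared queue of jobs drained by
// a fixed pool of workers, keeping every worker busy until done.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Whiteboard {

    /** Post nJobs on the board, let nWorkers drain it, and return the
     *  number of jobs marked finished. */
    static int runJobs(int nJobs, int nWorkers) {
        Queue<Integer> board = new ConcurrentLinkedQueue<>();
        for (int job = 1; job <= nJobs; job++) board.add(job);   // post jobs

        AtomicInteger finished = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            pool.submit(() -> {
                Integer job;
                while ((job = board.poll()) != null) {   // take the next job
                    // ... run the job, e.g. one DataTrain pass ...
                    finished.incrementAndGet();          // mark it finished
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println(runJobs(100, 4));  // prints 100
    }
}
```

There is no central scheduler deciding who gets what: idle workers simply pull the next job, which is why the slide can claim every machine stays busy.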

  16. Some important architecture points • From the outset we try in Gaia to: • Keep it just as simple as possible • Isolate algorithms from data • Already tried to virtualize algorithms • Let data drive the system (DataTrain) • Algorithms are mostly not allowed to ‘query’ • Specific data access patterns – data organised accordingly • Similar to the Ferris-Wheel idea (Szalay), but no hopping on/off! • Access any piece of data on disk exactly once • Preload some data on each node • E.g. 5 years of attitude quaternions fit in 150–250 MB • Be distributed • Try to avoid large memory processes • Again, in some cases it makes sense…
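The data-driven DataTrain pattern above can be sketched as follows: the train makes one sequential pass over the stored observations and pushes each one to every registered algorithm, so the algorithms never query the store themselves. The types and names here are illustrative placeholders, not the actual DPAC interfaces:

```java
// Sketch of the DataTrain pattern: one pass over the data in storage
// order, each observation read exactly once and pushed to all algorithms.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DataTrain {

    static class Observation {
        final long sourceId;
        Observation(long sourceId) { this.sourceId = sourceId; }
    }

    private final List<Consumer<Observation>> algorithms = new ArrayList<>();

    /** Algorithms subscribe; they receive data instead of querying for it. */
    void register(Consumer<Observation> algorithm) { algorithms.add(algorithm); }

    /** Drive once through the store; returns the number of observations seen. */
    int run(Iterable<Observation> store) {
        int n = 0;
        for (Observation obs : store) {          // single sequential pass
            for (Consumer<Observation> a : algorithms) a.accept(obs);
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        DataTrain train = new DataTrain();
        int[] seen = {0};
        train.register(obs -> seen[0]++);        // a toy "algorithm"
        List<Observation> store = new ArrayList<>();
        for (long id = 0; id < 5; id++) store.add(new Observation(id));
        System.out.println(train.run(store) + " " + seen[0]);  // prints "5 5"
    }
}
```

Inverting control this way is what lets the data be organised for exactly-once sequential disk access, rather than for ad hoc queries.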

  17. Notes on the AGIS Implementation • Highly distributed: usually running on >40 nodes; has run on >100 (1400 threads). • Only uses Java – no special MPI libraries needed; new languages come with almost all you need. • The hard part is breaking the problem into distributable parts – no language really helps with that. • Truly portable – can run on laptops, desktops, clusters and even the Amazon cloud.
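The "no MPI libraries needed" point can be illustrated with plain `java.util.concurrent`: partial results are computed over data chunks in parallel and then combined, with the standard library providing the thread pool and futures. The names here are illustrative, and as the slide says, the genuinely hard part (the problem-specific partitioning) is not shown:

```java
// Sketch of scatter/compute/reduce with only the Java standard library:
// split an array into chunks, sum each chunk on a pool thread, combine.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReduce {

    /** Sum data using nThreads parallel chunk sums. */
    static double parallelSum(double[] data, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Callable<Double>> chunks = new ArrayList<>();
            int size = (data.length + nThreads - 1) / nThreads;
            for (int t = 0; t < nThreads; t++) {
                final int lo = t * size, hi = Math.min(data.length, lo + size);
                chunks.add(() -> {               // partial sum of one chunk
                    double s = 0.0;
                    for (int i = lo; i < hi; i++) s += data[i];
                    return s;
                });
            }
            double total = 0.0;
            for (Future<Double> f : pool.invokeAll(chunks)) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = new double[100];
        for (int i = 0; i < 100; i++) data[i] = i + 1;   // 1..100
        System.out.println(parallelSum(data, 4));         // prints 5050.0
    }
}
```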

  18. AGIS Evolution – selected iterations • Assuming: • Need 40 iterations • More complexity • Scaling to the full data volume • Availability of a 25× better machine (~10 TFLOP) • The final AGIS run would take ~50 days at ESAC

  19. Efficiency – an aside • High load, low network, high CPU! • Some HPC people tell us we should rewrite in C (to save energy) • On SOME machines C is faster • The energy bill is large (see later) • The (re)coding effort is also large – IMHO it costs more than the energy • Maintainability? (to 2020) • Supercomputer centres seem to have very specific macros to include in C code to make it efficient for THEIR machine • That looks a little like a virtual machine • Why not provide a better JVM for their machine? • Or the Windows CLR?

  20. Virtualization • Started looking at virtualization ~2007 • Seemed ideal for the multiple test setups needed (VMware) • Agreed on a cloud experiment in 2009 (with Parsons) • Had to be convincing – running AGIS was the obvious choice • Already 4 years in development • In Java – so it’s portable, right?!

  21. AGIS on the cloud • Took one person less than one week to get running (Parsons, Olias). • Main problem: DB config • Also found a scalability problem in our code (we had never had one hundred nodes before) • It ran at similar performance to our cheap in-house cluster – EC2 is indeed no supercomputer • An Oracle image was available already • The AGIS image was straightforward to construct but time consuming – better get it correct! • The availability of a large number of nodes is very interesting – not affordable in house.

  22. Cost effectiveness of EC2 • AGIS runs intermittently with a growing data volume • In-house estimate for 2015: ~1.1 MEuro (machines) + 3 MEuro (energy bill – more?) = ~4 MEuro • In fact spending on machines is staggered – buy machines as the data volume increases • Estimate on Amazon at today’s prices: 340 KEuro for the final run + 1.7 MEuro for the intermittent runs (less data) = ~2 MEuro • Possibility to use more nodes and finish faster! • Reckon you still need an in-house machine to avoid wasting time testing on EC2 • The old nut: vendor lock-in? (Sayeed railways…)

  23. Final cloudy thought – the title! • The Gaia Archive will have multi-parameter data in time series • Solar System sources • Galactic sources • Extragalactic sources • The cloud seems an ideal way to allow complex access to an archive – Tony Hey tells us MS is doing it; Amazon offers free storage for public datasets… • Make the data available as a database on the cloud • Provide a VM to the user • The user codes in their favourite language directly against the DB API • BUT it runs local to the data! • Should we consider a new type of archive? • Could VO = Virtualized Observatory?

  24. Questions ?? Ariane V188 carrying Herschel and Planck (May 14 2009)

  25. AGIS matrix • Unknowns: sources s1, s2, s3, …; attitude a1, a2, a3, …; calibration c • Source block: 5·10^8 parameters; attitude: 4·10^7; calibration: ~10^6 • The matrix is sparse: the diagonal blocks are filled, the off-diagonal regions are sparse or zero • Gauss-Seidel pre-conditioner (Lammers)
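The Gauss-Seidel idea named on this slide can be illustrated on a tiny dense system; in AGIS the "unknowns" updated in turn are whole parameter blocks of a block-sparse normal matrix, but the mechanics are the same. This is a toy under our own naming, not DPAC code:

```java
// Toy Gauss-Seidel iteration for A x = b: each unknown is updated in
// place using the most recent values of all the others, so improvements
// propagate within a single sweep.
public class GaussSeidelToy {

    /** Run a fixed number of Gauss-Seidel sweeps; A must be such that
     *  the iteration converges (e.g. diagonally dominant). */
    static double[] solve(double[][] A, double[] b, int sweeps) {
        int n = b.length;
        double[] x = new double[n];
        for (int s = 0; s < sweeps; s++) {
            for (int i = 0; i < n; i++) {
                double sum = b[i];
                for (int j = 0; j < n; j++) if (j != i) sum -= A[i][j] * x[j];
                x[i] = sum / A[i][i];   // newest values used immediately
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] A = {{4, 1}, {1, 3}};   // diagonally dominant
        double[] b = {9, 7};
        double[] x = solve(A, b, 50);      // converges to (20/11, 19/11)
        System.out.printf("%.6f %.6f%n", x[0], x[1]);
    }
}
```

In AGIS each inner division by `A[i][i]` corresponds to solving one block (one star, one attitude interval, one calibration unit) exactly while the rest are frozen.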

  26. Data Train load

  27. EC2 Instance Types

  28. Architecture in the Cloud • Data Trains request AstroElementaries in a range (x,y) through the data access layer (ObjectFactory, GaiaTable, Store), feeding the Source, Attitude, Calibration and Global collectors • 1 × Large instance (AGIS AMI, Elastic IP): Convergence Monitor, RunManager • <n> × Extra Large or High-CPU Large instances (AGIS AMI): Data Trains • 1 × Large instance (Oracle AMI, Elastic IP): AGIS DB • 3 × Extra Large instances (AGIS AMI): Attitude Update Server

  29. FLOPS and FLOP count estimates • Current hardware: 18 dual-processor, single-core Xeon blades; 8 dual-processor, quad-core Xeon blades; 5 TB FibreChannel SAN • Gives about 400 GFLOPS • Run time of one outer iteration: 1 h per 10^6 stars, so 1 cycle with 40 iterations takes about 2 d • CPU occupancy >90% (I/O never a problem) • FLOP count estimates: 1.4·10^20 for creation of the final catalogue; 2.2·10^19 for the final cycle [50 d on a 10 TFLOP machine] • The estimate is based on a simple model and is regularly updated
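The bracketed "50 d on a 10 TFLOP machine" figure can be checked by simple arithmetic: 2.2·10^19 FLOP at an assumed ~50% sustained fraction of a 10 TFLOP/s peak gives roughly the quoted runtime. The 50% efficiency is our assumption for the check, not a number from the slide:

```java
// Back-of-envelope check of the final-cycle runtime quoted above.
public class FlopEstimate {
    public static void main(String[] args) {
        double flop = 2.2e19;        // final-cycle FLOP count (slide)
        double peak = 10e12;         // 10 TFLOP/s machine (slide)
        double efficiency = 0.5;     // assumed sustained fraction of peak
        double seconds = flop / (peak * efficiency);
        double days = seconds / 86400.0;
        System.out.printf("%.0f days%n", days);   // ~51 days, matching ~50 d
    }
}
```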