
Database technology for seismology

Martin Kersten, Jennie Zhang. February 2011.


Presentation Transcript


  1. Database technology for seismology Martin Kersten Jennie Zhang February 2011

  2. Trends in database technology • One size does not fit all … • … all exhibit similar problems • Database technology needs in science • Web services [out of scope] • Workflow management [out of scope] • Meta-data catalogs • Data vaults and UDFs • Scale-up and scale-out [out of scope] • Array query languages

  3. Trends in database technology • One size does not fit all … • … all exhibit similar problems • MySQL, PostgreSQL, Oracle, DB2, Microsoft SQL Server, … • Have not been designed for data warehouses • Have not been designed for science • Have not been designed for signal processing

  4. Row versus column stores • Traditional database systems were designed for online transaction processing • How to move money from A to B • Modern database systems are designed for business intelligence • Who should I trust to lend money?

  5. Row versus column stores • The application domains dictate the storage and processing schemes • Row-store storage • All records are organized as a linear sequence of multiple fields • An insurance company record contains around 150 fields • A stellar object contains around 500 fields • A factory assembly line may contain 10,000 fields

  6. Row versus column stores • The application domains dictate the storage and processing schemes • Row-store processing • Retrieve all fields of a particular record • Non-discriminative • Update the field of a record • Finally get a raise • Fast search using indices • A 15% storage overhead saves a lot of search time

  7. Storing relations in MonetDB • Virtual OID: seqbase=1000 (increment=1)

  8. BAT data structure • BAT: binary association table, with a Head and a Tail column • The BAT data structure is an array (C array) • BAT heap: consecutive memory block (array), memory-mapped file • Tail heap: best-effort duplicate elimination for strings (~ dictionary encoding) • Hash tables, T-trees, R-trees, ...
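The column layout on slides 7 and 8 can be illustrated with a small Python sketch. It is a toy model, not MonetDB's implementation: the class names `BAT` and `StringBAT`, the station codes, and the use of Python's `array` module as the "consecutive memory block" are all assumptions made for illustration. The head is virtual (an OID is just `seqbase` plus the position), and string tails store offsets into a duplicate-free heap.

```python
from array import array


class BAT:
    """Toy binary association table: a dense (virtual) head and an array tail."""

    def __init__(self, seqbase, typecode="q"):
        self.seqbase = seqbase        # first OID; later OIDs are seqbase + position
        self.tail = array(typecode)   # consecutive memory block holding the values

    def append(self, value):
        self.tail.append(value)
        return self.seqbase + len(self.tail) - 1   # OID assigned to the new value

    def fetch(self, oid):
        # Positional lookup: a dense head needs no index structure.
        return self.tail[oid - self.seqbase]


class StringBAT(BAT):
    """String column: the tail stores positions in a heap of distinct strings."""

    def __init__(self, seqbase):
        super().__init__(seqbase, "q")
        self.heap = []      # distinct strings (dictionary-style encoding)
        self.lookup = {}    # string -> position in the heap

    def append(self, value):
        pos = self.lookup.get(value)
        if pos is None:
            pos = len(self.heap)
            self.heap.append(value)
            self.lookup[value] = pos
        return super().append(pos)

    def fetch(self, oid):
        return self.heap[super().fetch(oid)]


# One relation (station, year) stored as two aligned columns.
# The station codes are made up for this example.
station = StringBAT(seqbase=1000)
year = BAT(seqbase=1000)
for s, y in [("STA1", 2010), ("STA2", 2010), ("STA1", 2011)]:
    station.append(s)
    year.append(y)
```

Because both columns share the same `seqbase`, a record is reassembled by fetching the same OID from each column, for example `station.fetch(1002)` and `year.fetch(1002)`.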

  9. Why may database systems fail? • Meta-data model • XML or NoSQL or flat files • Synchronise on the data catalog • Computational paradigm • Database systems use the relational model • Scientists use an array model (e.g. Matlab) • Storage cost • Database systems index data • Scientists use structured files (e.g. FITS, NetCDF, MSEED) • Experience • Database query formulation is “hard” for scientists • Data manipulation is “hard” for database developers

  10. Meta-data, the astro case • Astronomy explores space using a variety of instruments, with highly different characteristics • Astronomers in the 90’s were convinced that everyone should learn to program C++ • Astronomers needed a shared catalog to correlate observations • 1997-2003 J. Gray + A. Szalay bridged the gap

  11. SDSS: a project to make a map of a large part of the Universe • 230 million object images • 1 million spectra • 4TB catalog data • 9TB images • SkyServer provides public access to SDSS for astronomers, students, and the wider public

  12. SkyServer schema • 446 columns • >600 million rows • Vertical fragment of 100+ popular columns • Materialized join of Photo and Spectra

  13. LOFAR example • Every second a 2K x 2K image of the sky • Number crunching to extract ca. 1000 light sources • Sent to the database for spatial matching and checking for transients • Database growth ca. 50TB/yr • Single MonetDB/SQL instance on an 8-core, 16 GB machine

  14. SciLens project • Explore the other sciences to • Characterise the challenges for database researchers • Create real-world showcases (e.g. SkyServer) • Derive the database technology challenge from astronomy, seismology, remote sensing, ... • Develop novel query language techniques • Provide an experimentation facility for cooperative projects

  15. The SciLens platform

  16. Loading Mseed data into a DBMS • ORFEUS has > 3.3M compact Mseed files • Loading (meta-)data into a database structure consumes both time and space • Option 1: use INSERT statements • Option 2: use COPY INTO with CSV files • Option 3: use binary attachment

  17. Estimated loading time • INSERT INTO mseed VALUES (…): 28 years if not careful • COPY INTO mseed FROM 'repro.csv' USING '\t', '\n': 340 days + conversion • COPY INTO mseed FROM ('repro.mseed'): 10 days with some care (200ms)
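The estimates on this slide can be sanity-checked with a little arithmetic. The sketch below assumes that the "(200ms)" figure is a per-file cost, which the slide does not state, and uses the lower bound of 3.3M files from the previous slide. The helper names are illustrative.

```python
FILES = 3_300_000          # ORFEUS file count (lower bound) from the previous slide
SECONDS_PER_DAY = 86_400


def load_days(seconds_per_file, files=FILES):
    """Total loading time in days for a given per-file cost."""
    return files * seconds_per_file / SECONDS_PER_DAY


def seconds_per_file(total_days, files=FILES):
    """Per-file cost implied by a total loading time in days."""
    return total_days * SECONDS_PER_DAY / files


binary_days = load_days(0.2)               # assumption: 200 ms per file
csv_cost = seconds_per_file(340)           # implied per-file cost of the CSV route
insert_cost = seconds_per_file(28 * 365)   # implied per-file cost of plain INSERTs
```

Under that assumption the binary route comes to roughly 7.6 days, the same order as the slide's "10 days with some care", while the CSV and INSERT estimates imply roughly 9 seconds and 4.5 minutes per file respectively.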

  18. Data Vaults • The database system and science repositories should act as a symbiotic organism • Data is loaded dynamically upon demand at light speed • All mseed record headers can be easily handled in a modern database system • (~ 100 M records)
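A minimal sketch of the data-vault idea, assuming a catalog of record headers is kept resident and file payloads are read only when a query touches them. The class `DataVault`, the `read_payload` callback and the header fields are hypothetical names for illustration, not MonetDB's interface.

```python
class DataVault:
    """Keeps header metadata resident; payloads stay in the repository files."""

    def __init__(self, read_payload):
        self.read_payload = read_payload   # hypothetical callback: path -> list of samples
        self.headers = []                  # (station, start, end, path) rows
        self.cache = {}                    # path -> samples already fetched
        self.loads = 0                     # number of files actually read

    def register(self, station, start, end, path):
        """Record a file's header without touching its payload."""
        self.headers.append((station, start, end, path))

    def query(self, station, t0, t1):
        """Return samples of all files for `station` that overlap [t0, t1]."""
        result = []
        for sta, start, end, path in self.headers:
            if sta == station and start <= t1 and end >= t0:
                if path not in self.cache:
                    # Load on demand, the first time a query needs this file.
                    self.cache[path] = self.read_payload(path)
                    self.loads += 1
                result.extend(self.cache[path])
        return result
```

In this sketch a query that selects one station and a short time range reads only the matching files; everything else stays in the repository untouched.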

  19. DON’T MOVE THE DATA

  20. The holy grail An Array-DBMS

  21. Science DBMS landscape

  22. Use case • Rietbrock: Chile earthquake … 2TB of wave fronts … filter by STA/LTA … remove false positives … window-based 3 min cuts … heuristic tests … interactive response required … • How can a database system help? • Scanning 2TB on a modern PC takes >3 hours

  23. Use case, a SciQL dream • Rietbrock: Chile earthquake create array mseed ( tick timestamp dimension[1988:*], data decimal(8,6), station string ); -- > 3 × 10^11 events (300 billion)

  24. Use case, a SciQL dream • Rietbrock: … filter by STA/LTA -- average by window of 5 seconds select A.tick, avg(A.data) from mseed A group by A[tick:tick + 5 seconds]

  25. Use case, a SciQL dream • Rietbrock: … filter by STA/LTA select A.tick from mseed A, mseed B where A.tick = B.tick and avg(A.data) / avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]

  26. Use case, a SciQL dream • Rietbrock: … filter by STA/LTA create view candidates( station string, tick timestamp, ratio float ) as select A.station, A.tick, avg(A.data) / avg(B.data) as ratio from mseed A, mseed B where A.tick = B.tick and avg(A.data) / avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
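The windowed averages and the ratio test in the three queries above can be mimicked outside a DBMS. The sketch below is a plain-Python reading of them for a single station, assuming a fixed sampling rate and non-negative amplitudes (for example absolute values), neither of which the slides specify. The function names and the `rate_hz` parameter are illustrative.

```python
def forward_mean(data, start, width):
    """Mean of data[start:start + width], truncated at the end of the series."""
    window = data[start:start + width]
    return sum(window) / len(window)


def candidates(data, rate_hz, delta, short_s=5, long_s=15):
    """Return (index, ratio) pairs where short-window mean / long-window mean > delta.

    Both windows start at the same sample, mirroring A[tick:tick + 5 seconds]
    and B[tick:tick + 15 seconds] joined on A.tick = B.tick.
    """
    short_n = short_s * rate_hz
    long_n = long_s * rate_hz
    hits = []
    for i in range(len(data)):
        lta = forward_mean(data, i, long_n)
        if lta == 0:
            continue                      # ratio undefined; skip this tick
        ratio = forward_mean(data, i, short_n) / lta
        if ratio > delta:
            hits.append((i, ratio))
    return hits
```

This loop recomputes every window from scratch, which is fine for a sketch; a running sum would avoid the repeated work on long traces.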

  27. Use case, a SciQL dream • Rietbrock: … remove false positives -- remove isolated errors by direct environment -- using wave propagation statics create table neighbors( head string, tail string, delay timestamp, weight float)

  28. Use case, a SciQL dream • Rietbrock: … remove false positives select A.tick, B.tick from candidates A, candidates B, neighbors N where A.station = N.head and B.station = N.tail and B.tick = A.tick + N.delay and B.ratio * N.weight < A.ratio;

  29. Use case, a SciQL dream • Rietbrock: … remove false positives delete from candidates select A.tick from candidates A, candidates B, neighbors N where A.station = N.head and B.station = N.tail and B.tick = A.tick + N.delay and B.ratio * N.weight < A.ratio;
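A literal reading of the neighbour join and the delete above can be sketched in Python. Several things here are assumptions: ticks and delays are plain integers, candidates are identified by (station, tick) rather than by tick alone, and the function names are illustrative. The sketch follows the condition exactly as written on the slides and makes no claim about whether that condition is the seismologically intended one.

```python
def matching_pairs(candidates, neighbors):
    """Pair candidate A with candidate B as in the slide's join.

    candidates: list of (station, tick, ratio)
    neighbors:  list of (head, tail, delay, weight)
    A pair is produced when B's station is a neighbour of A's, B fires exactly
    `delay` after A, and B.ratio * weight < A.ratio.
    """
    ratio_at = {(s, t): r for s, t, r in candidates}
    pairs = []
    for a_station, a_tick, a_ratio in candidates:
        for head, tail, delay, weight in neighbors:
            if head != a_station:
                continue
            b_ratio = ratio_at.get((tail, a_tick + delay))
            if b_ratio is not None and b_ratio * weight < a_ratio:
                pairs.append(((a_station, a_tick), (tail, a_tick + delay)))
    return pairs


def prune(candidates, neighbors):
    """Drop every candidate that appears as the A side of a matching pair."""
    doomed = {a for a, _ in matching_pairs(candidates, neighbors)}
    return [c for c in candidates if (c[0], c[1]) not in doomed]
```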

  30. Use case, a SciQL dream • Rietbrock: … window-based 3 min cuts … heuristic tests select B.station, myfunction(B.data) from candidates A, mseed B where A.tick = B.tick group by distinct B[tick:tick + 3 minutes]; -- using a User Defined Function written in C.
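The window cut with a user-defined function might look as follows in Python for a single station. This is a sketch under assumptions: ticks are sorted integers in seconds, the argument `fn` stands in for the slide's `myfunction` (which the slide says would be written in C), and `cut_windows` is an illustrative name.

```python
from bisect import bisect_left


def cut_windows(ticks, values, candidate_ticks, width, fn):
    """Apply fn to the samples in [t, t + width) for each distinct candidate tick t.

    ticks must be sorted and aligned with values.
    """
    results = {}
    for t in sorted(set(candidate_ticks)):
        lo = bisect_left(ticks, t)           # first sample at or after t
        hi = bisect_left(ticks, t + width)   # first sample at or after t + width
        if lo < hi:
            results[t] = fn(values[lo:hi])
    return results
```

For a 3-minute cut, `width` would be 180 when ticks are in seconds, for example `cut_windows(ticks, values, hits, 180, max)`.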

  31. Use case • Rietbrock: … interactive response required … The query over 2TB of seismic data will be handled before he finishes his coffee.

  32. An Array-DBMS • An array DBMS is yet to be provided to the community • Major implementation impediments: • Integration with the existing SQL stack • Integration of the proper math libraries

  33. Conclusions • Catalogs: relational database systems are effective for building community-based meta-data catalogs (SkyServer) • Streaming: SQL-based processing can perform well in streaming applications (LOFAR) • Data vaults: shared responsibility of science repositories is feasible [Development] • Arrays: relational and array-based declarative query processing [Research & development]
