
Why Don’t Scientists Use Databases? Peter Buneman Division of Informatics University of Edinburgh


Presentation Transcript


  1. Why Don’t Scientists Use Databases?
     Peter Buneman
     Division of Informatics, University of Edinburgh
     Digital Libraries grant IIS 98-17444 (NSF, DARPA, NLM, LoC, NEH, NASA)
     http://db.cis.upenn.edu
     http://db.cis.upenn.edu/Research/provenance.html

  2. Why Don’t Scientists Use (Relational) Databases Much?
     • Thanks to:
        • The ontologists and astronomers at Edinburgh
        • The database and bio-informatics groups at Penn
        • Aleri Inc.
     • Special thanks (material stolen from):
        • Sanjeev Khanna, Wang-Chiew, Keishi Tajima, Susan Davidson, Fidel Salas

  3. Scientific Data is Ubiquitous
     • 500 or so public molecular biology databases.
        • much discovery in silico
     • Vast amounts of satellite imagery
        • maintaining it is very expensive
     • Terabytes of astronomical data (not image data)
     • Linguistic corpora are essential research tools -- also in terabytes

  4. Relational DBs -- tabular data

     Id   Name        Address
     123  H. Simpson  Springfield
     456  L. Simpson  Springfield
     321  A. Jones    London

     Title       Dept     Teacher
     Algorithms  CompSci  Dr. Deadhead
     Voltaire    French   Prof. lePew
     Geometry    Math     Dr. Obtuse

     Id   Course      Grade
     456  Geometry    A
     123  Algorithms  D
     456  Voltaire    A
     321  Geometry    B
     321  Algorithms  C

     • Useful information is obtained by combining tables.
     • Efficient algorithms for
        • combining and indexing tables
        • transaction processing (updates and multiple users)
     • Relational databases are a multi giga-$ industry
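As a rough illustration of "combining tables", the three tables on this slide can be joined to list each grade next to the student's name and the course teacher. This is only a sketch: the slide does not name the tables, so `student`, `course` and `enrolment` are assumed names, and SQLite (via Python's standard `sqlite3` module) stands in for any relational system.

```python
import sqlite3


def grades_with_teachers():
    """Join the slide's three tables; table names are assumptions."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
        CREATE TABLE course (title TEXT PRIMARY KEY, dept TEXT, teacher TEXT);
        CREATE TABLE enrolment (id INTEGER, course TEXT, grade TEXT);
        INSERT INTO student VALUES
            (123, 'H. Simpson', 'Springfield'),
            (456, 'L. Simpson', 'Springfield'),
            (321, 'A. Jones', 'London');
        INSERT INTO course VALUES
            ('Algorithms', 'CompSci', 'Dr. Deadhead'),
            ('Voltaire', 'French', 'Prof. lePew'),
            ('Geometry', 'Math', 'Dr. Obtuse');
        INSERT INTO enrolment VALUES
            (456, 'Geometry', 'A'),
            (123, 'Algorithms', 'D'),
            (456, 'Voltaire', 'A'),
            (321, 'Geometry', 'B'),
            (321, 'Algorithms', 'C');
    """)
    # Two joins: enrolment -> student on id, enrolment -> course on title.
    rows = con.execute("""
        SELECT s.name, e.course, c.teacher, e.grade
        FROM enrolment AS e
        JOIN student AS s ON s.id = e.id
        JOIN course AS c ON c.title = e.course
        ORDER BY s.name, e.course
    """).fetchall()
    con.close()
    return rows
```

Calling `grades_with_teachers()` returns one tuple per enrolment, each combining columns from all three tables.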

  5. Reasons for Mismatch
     • Scientific data sets are too large (image data, huge analyses)
     • Scientific data is too complex
     • Relational databases don’t work well with arrays and scientific computation
     • Schema evolution and history are important
     • Databases are too expensive

  6. Swissprot -- a curated database (1 of ~100,000 entries)

     ID 11SB_CUCMA STANDARD; PRT; 480 AA.
     AC P13744;
     DT 01-JAN-1990 (REL. 13, CREATED)
     DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
     DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
     DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
     OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
     OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
     OC VIOLALES; CUCURBITACEAE.
     RN [1]
     RP SEQUENCE FROM N.A.
     RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
     RX MEDLINE; 88166744.
     RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
     RL EUR. J. BIOCHEM. 172:627-632(1988).
     RN [2]
     RP SEQUENCE OF 22-30 AND 297-302.
     RA OHMIYA M., HARA I., MASTUBARA H.;
     RL PLANT CELL PHYSIOL. 21:157-167(1980).
     CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
     CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
     CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
     CC DISULFIDE BOND.
     CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
     DR EMBL; M36407; G167492; -.
     DR PIR; S00366; FWPU1B.
     DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.
     KW SEED STORAGE PROTEIN; SIGNAL.
     FT SIGNAL 1 21
     FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
     FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).
     FT CHAIN 297 480 DELTA CHAIN (BASIC).
     FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
     FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
     FT CONFLICT 27 27 S -> E (IN REF. 2).
     FT CONFLICT 30 30 E -> S (IN REF. 2).
     SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
     TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
     KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
     //

     Metadata:
     ...
     OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
     OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
     OC VIOLALES; CUCURBITACEAE.
     RN [1]
     RP SEQUENCE FROM N.A.
     RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
     RX MEDLINE; 88166744.
     RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
     RL EUR. J. BIOCHEM. 172:627-632(1988).
     ...

     Data ???:
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
     FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
     ...

  7. Record (inadequate) of history

     DT 01-JAN-1990 (REL. 13, CREATED)
     DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
     DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

     Hierarchical data. Order important.

     RN [1]
     RP SEQUENCE FROM N.A.
     RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
     RX MEDLINE; 88166744.
     RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
     RL EUR. J. BIOCHEM. 172:627-632(1988).
     RN [2]
     RP SEQUENCE OF 22-30 AND 297-302.
     RA OHMIYA M., HARA I., MASTUBARA H.;
     RL PLANT CELL PHYSIOL. 21:157-167(1980).

  8. Tree data (recursive query processing?)

     OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
     OC VIOLALES; CUCURBITACEAE.

     Array indices (array operations?)

     FT SIGNAL 1 21
     FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
     FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).
     FT CHAIN 297 480 DELTA CHAIN (BASIC).
     FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
     FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
     FT CONFLICT 27 27 S -> E (IN REF. 2).
     FT CONFLICT 30 30 E -> S (IN REF. 2).
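To make the two questions on this slide concrete, here is a small sketch that reads the OC lines as a path in a taxonomy tree and the FT lines as ranges of indices into the sequence array. The parsing assumes the whitespace-separated layout shown in this transcript (real Swissprot files use fixed columns), and the function names are hypothetical.

```python
OC_LINES = [
    "OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;",
    "OC VIOLALES; CUCURBITACEAE.",
]
FT_LINES = [
    "FT SIGNAL 1 21",
    "FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.",
    "FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).",
    "FT CHAIN 297 480 DELTA CHAIN (BASIC).",
]


def lineage(oc_lines):
    """Return the classification as a root-to-leaf path (tree data)."""
    text = " ".join(line[2:].strip() for line in oc_lines)
    return [taxon.strip() for taxon in text.rstrip(".").split(";") if taxon.strip()]


def parse_feature(line):
    """Split an FT line into (kind, start, end, description)."""
    _, kind, start, end, *rest = line.split()
    return kind, int(start), int(end), " ".join(rest)


def features_at(ft_lines, position):
    """Features whose 1-based residue range covers `position` (array indices)."""
    return [f for f in map(parse_feature, ft_lines) if f[1] <= position <= f[2]]
```

A question such as "which features cover residue 300?" is a range query over array indices, and "is this organism an angiosperm?" is a test on a path in a tree; neither is a natural fit for a plain relational table.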

  9. Structure in comments = schema evolution

     CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
     CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
     CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
     CC DISULFIDE BOND.
     CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

  10. To turn Swissprot into tables requires:
     • 20 - 30 tables
        • nothing extraordinary by relational standards, but
        • huge query to reconstruct original form
     • Invented keys
     • Queries on order and arrays
     • Recursive query processing
     • Also need to deal with schema evolution
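The need for invented keys and explicit order can be seen even on a fragment of the entry. The sketch below flattens the reference block into two tables; the table layout, the `position` column and the function name are assumptions made for illustration, not the schema the talk had in mind.

```python
ENTRY = """\
AC P13744;
RN [1]
RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL EUR. J. BIOCHEM. 172:627-632(1988).
RN [2]
RA OHMIYA M., HARA I., MASTUBARA H.;
RL PLANT CELL PHYSIOL. 21:157-167(1980).
"""


def to_tables(text):
    """Flatten references into (references, authors) row lists.

    (accession, ref_no) is an invented compound key, and `position`
    records the author order that a relational table would otherwise lose.
    """
    references, authors = [], []
    accession, ref_no, position = None, None, 0
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "AC":
            accession = value.rstrip(";")
        elif code == "RN":
            ref_no, position = int(value.strip("[]")), 0
            references.append({"entry": accession, "ref_no": ref_no, "location": None})
        elif code == "RA":
            for name in value.rstrip(";").split(","):
                position += 1
                authors.append((accession, ref_no, position, name.strip()))
        elif code == "RL":
            references[-1]["location"] = value
    return references, authors
```

Rebuilding the original reference block from these rows already takes a join plus a sort on `position`; repeating that for every line type is what makes the reconstruction query so large.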

  11. Curated Databases
     • Useful scientific databases are often curated: they are created/maintained with a great deal of “manual” labour.
     [Figure: “Database people’s idea of what happens” (DB1, DB2 and the query “select xyz from pqr where abc”) contrasted with “What really happens”]

  12. Database Inter-dependence is Complex
     [Figure: links among the databases GERD, EpoDB, TRRD, BEAD, TransFac, GenBank, GAIA and Swissprot]

  13. Three new topics
     • Annotation
        • how do I annotate a data element, and how is this passed through queries?
     • Archiving
        • how do we keep all the old versions of a database?
     • Vertical partitioning
        • combining databases and vector processing.

  14. Data Annotation (Khanna, Tan)
     • Some databases (e.g. biology and linguistics) are designed to accommodate annotations
     • Also a need for ad hoc (unanticipated) annotations.
     • How are annotations communicated?
     • How are they passed through queries?
     • No general techniques or principles.

  15. Sharing annotations (courtesy of Wang-Chiew Tan)

     NYRestaurants (Source Table)
     Restaurant          Cost  Type      Zip
     Peacock Alley       $$$   French    10022
     Bull & Bear         $$$   Seafood   10022
     Pacifica            $     Chinese   10013
     Soho Kitchen & Bar  $     American  10022

     All Restaurants (View 1)
     Restaurant          Cost  Type
     Peacock Alley       $$$   French
     Bull & Bear         $$$   Seafood
     Pacifica            $     Chinese
     Soho Kitchen & Bar  $     American

     Cheap Restaurants (View 2)
     Restaurant          Cost  Type
     Pacifica            $     Chinese
     Soho Kitchen & Bar  $     American

     Annotations shown on the slide:
     • “Serves fine French Cuisine in elegant setting. Jackets required.”
     • “Extensive wine list!”
     • “Yummy chicken curry!!”

  16. Annotation looks simple but ...
     • Computing how an annotation should move through a query is intractable
     • Equivalent queries may not carry annotations in the same way
     • New insights are needed!
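One naive propagation rule, sketched below, is to copy an annotation to every view cell that was copied from the annotated source cell. This is an illustrative assumption, not the scheme studied by Khanna and Tan; the cells to which the two notes are attached are also guesses, since the transcript does not preserve that information.

```python
SOURCE = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]
# Annotations keyed by (row index, column); the placement is hypothetical.
NOTES = {
    (0, "Restaurant"): ["Extensive wine list!"],
    (2, "Restaurant"): ["Yummy chicken curry!!"],
}


def view(rows, notes, keep, columns):
    """Select rows with `keep`, project onto `columns`, and copy along
    any annotation attached to a cell that survives into the view."""
    out_rows, out_notes = [], {}
    for i, row in enumerate(rows):
        if not keep(row):
            continue
        j = len(out_rows)
        out_rows.append({c: row[c] for c in columns})
        for c in columns:
            if (i, c) in notes:
                out_notes[(j, c)] = list(notes[(i, c)])
    return out_rows, out_notes


ALL_COLUMNS = ["Restaurant", "Cost", "Type"]
all_rows, all_notes = view(SOURCE, NOTES, lambda r: True, ALL_COLUMNS)
cheap_rows, cheap_notes = view(SOURCE, NOTES, lambda r: r["Cost"] == "$", ALL_COLUMNS)
type_rows, type_notes = view(SOURCE, NOTES, lambda r: True, ["Type"])
```

Even under this simple rule, a view that projects away the annotated column (`type_rows`) silently loses every note, which hints at why the way a query is written affects which annotations it carries.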

  17. How do we Build Archival Databases? [Khanna, Tajima, Tan]
     • Many scientific databases keep archives. It’s important to preserve the state of knowledge as it was in the past
     • Archive frequently: space consuming
     • Archive infrequently: delay in getting recent information published.

  18. The dangers of electronic documents
     Report of a DOE “bioinformatics summit” ca. 1994
     http://www.ornl.gov/hgmis/publicat/miscpubs/bioinfo/inf_rep2.html#AppenI

     Then:
     APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
     Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully relationalized sequence database is available, none of the queries in this appendix can be answered. ...

     Now:
     APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
     Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered. ...

     (No archive/edition! No footnote!)

  19. Examples from Bioinformatics
     • Swissprot. New version produced every four months.
        • Old versions are kept.
        • Difficult to get at most recent data
     • OMIM. New version produced every day
        • Old versions are not kept
        • Impossible to reconstruct past states of the data

  20. Current approaches use “diff”

     Line number, Version 1:
      1 <DB>
      2 <Person>
      3 <Name>Joe</>
      4 <DateOfBirth>March</>
      5 <Address>South Street</>
      6 <Zip>12345</>
      7 </>
      8 <Person>
      9 <Name>Jane</>
     10 <DateOfBirth>May</>
     11 <Address>Pine Street</>
     12 <Zip>67890</>
     13 </>
     14 </>

     Line number, Version 2:
      1 <DB>
      2 <Person>
      3 <Name>Jane</>
      4 <DateOfBirth>May</>
      5 <Address>South Street</>
      6 <Zip>12345</>
      7 </>
      8 <Person>
      9 <Name>Joe</>
     10 <DateOfBirth>March</>
     11 <Address>Pine Street</>
     12 <Zip>67890</>
     13 </>
     14 </>

     Output of line diff (versions 1-2):
     3,4c
     <Name>Jane</>
     <DateOfBirth>May</>
     9,10c
     <Name>Joe</>
     <DateOfBirth>March</>

     • need to preserve “object continuity” through time
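The line diff above reports that two names and two dates of birth changed. If each person is instead identified by a key, the same pair of versions reads as two people changing address. The sketch below assumes that Name is the key and uses hypothetical Python dictionaries in place of the XML.

```python
V1 = {
    "Joe": {"DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    "Jane": {"DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
}
V2 = {
    "Jane": {"DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    "Joe": {"DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
}


def keyed_changes(old, new):
    """Compare two versions object by object, matching objects on their key.

    Returns (key, field, old value, new value) for every field that differs
    in an object present in both versions.
    """
    changes = []
    for key in sorted(old.keys() & new.keys()):
        for field, old_value in old[key].items():
            new_value = new[key].get(field)
            if old_value != new_value:
                changes.append((key, field, old_value, new_value))
    return changes
```

Here `keyed_changes(V1, V2)` reports address and zip changes for Jane and for Joe and no change to any name or date of birth, which is the "object continuity" the line diff fails to preserve.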

  21. A Sequence of Versions

  22. “Pushing” time down
     [Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

  23. Experimental Results (OMIM)
     [Plot: size (bytes × 10^6) against number of versions. Legend: archive, inc diff, version, compressed inc diff (gzip), compressed archive (XMill)]
     • Uncompressed: archive size is
        • 1.01 times diff repository size
        • 1.04 times size of largest version
     • Compressed: archive size is between 0.94 and 1 times compressed diff repository size
        • gzip - unix compression tool
        • XMill - XML compression tool

  24. Experimental Results (Swissprot)
     [Plot: size (bytes × 10^6) against number of versions. Legend: archive, inc diff, version, compressed inc diff (gzip), compressed archive (XMill)]
     • Uncompressed: archive size is
        • 1.08 times diff repository size
        • 1.92 times size of largest version
     • Compressed: archive size is between 0.59 and 1 times compressed diff repository size
        • gzip - unix compression tool
        • XMill - XML compression tool

  25. The Bottom Line
     • We have built an archiver, using XML as the base format
     • We can build a year of archives (archive as often as you like) for a 14% increase on the size of the most recent database
     • Based on keys -- preserves object history
     • Works well with compression
     • Obtaining an old archive is no more expensive than getting the current version.
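A toy version of a key-based archive can convey the idea of merging all versions into one structure and "pushing" the version stamps down to the values. This is a simplified sketch under assumed data structures (nested dictionaries keyed by object key, field and value), not the XML archiver described on the slide.

```python
V1 = {
    "Joe": {"DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    "Jane": {"DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
}
V2 = {
    "Jane": {"DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    "Joe": {"DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
}


def merge(archive, version, snapshot):
    """Fold one snapshot into the archive.

    archive[key][field][value] is the set of version numbers in which
    that object's field held that value; unchanged values are stored once.
    """
    for key, fields in snapshot.items():
        node = archive.setdefault(key, {})
        for field, value in fields.items():
            node.setdefault(field, {}).setdefault(value, set()).add(version)
    return archive


def retrieve(archive, version):
    """Rebuild the snapshot that was merged in as `version`."""
    snapshot = {}
    for key, node in archive.items():
        fields = {
            field: value
            for field, values in node.items()
            for value, versions in values.items()
            if version in versions
        }
        if fields:
            snapshot[key] = fields
    return snapshot


archive = merge(merge({}, 1, V1), 2, V2)
```

In this sketch a value that never changes, such as Joe's date of birth, is stored once with the stamp `{1, 2}`, and any version is recovered with a single pass over the archive.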

  26. Vertical partitioning
     • An old idea revisited
     • Fusion of array processing languages and database query languages
     • Substantial use on Wall Street!!!

  27. Conventional Storage (Horizontal partitioning)
     • Rows are stored contiguously.
     • Order is not preserved
     [Figure: rows laid out on disk pages]

  28. Problems with conventional storage
     • Unanticipated queries will probably read the whole database
          SELECT AVG(SQRT(shoe_size))
          FROM employee
          WHERE hat_size > shoe_size
       (this only needs two fields)
     • Order of rows is “random” and does not support order-sensitive functions: moving window averages, convolutions, etc.

  29. Vertical partitioning (vectorisation)
     • Columns are stored contiguously.
     • Order is preserved
     [Figure: columns laid out on disk pages]

  30. Advantages of Vertical Partitioning
     • Faster queries.
        • A query that reads 2 columns in 100 does 2% of the I/O (I/O cost dominates)
        • A few columns can often reside in memory.
     • Computation on order
        • Can use both SQL and vector processing languages
     • Downside: deletions are horribly expensive.
        • but deletions are uncommon in scientific DBs
     • Vertical partitioning can also be performed on hierarchical structures -- like Swissprot -- and XML
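The following sketch mimics a vertically partitioned table with one Python list per column. The column names echo the hat-size and shoe-size query from the earlier slide; the values and function names are made up for illustration, and plain lists stand in for columns stored contiguously on disk.

```python
import math

# One list per column; the row order is the list order. Values are hypothetical.
COLUMNS = {
    "name": ["a", "b", "c", "d"],
    "hat_size": [7.0, 9.0, 6.5, 10.0],
    "shoe_size": [9.0, 4.0, 8.0, 9.0],
}


def avg_sqrt_shoe(columns):
    """Average of sqrt(shoe_size) over rows with hat_size > shoe_size.

    Only the two columns the query mentions are touched.
    """
    hats, shoes = columns["hat_size"], columns["shoe_size"]
    roots = [math.sqrt(s) for h, s in zip(hats, shoes) if h > s]
    return sum(roots) / len(roots)


def moving_average(values, window):
    """Order-sensitive operation: mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def io_fraction(columns_read, columns_total):
    """Share of the data read if all columns are equally wide (an assumption)."""
    return columns_read / columns_total
```

With equally wide columns, `io_fraction(2, 100)` gives the 2% figure quoted above, and `moving_average` relies on the column keeping its order, which row storage with "random" order cannot offer.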

  31. Many other issues
     • Heterogeneous data integration
        • a perennial problem
        • can it be done by the end-users?
     • Distributed query evaluation against redundant, constrained data.
     • Data provenance
     • Data streams
     • and many more
     All these involve hard, fundamental problems in Computer Science
