
The Gedeon Project: Data, Metadata and Databases Yves DENNEULIN LIG laboratory, Grenoble


Presentation Transcript


  1. The Gedeon Project: Data, Metadata and Databases • Yves DENNEULIN, LIG laboratory, Grenoble • Laboratoire LIP6 • ACI MD

  2. Context and goals • Heterogeneous metadata management on grids • Clusters of clusters • High-level queries using metadata • Easy and flexible deployment and configuration • Minimal overhead • Various interfaces • Initial target application domains • Biocomputing (lots of metadata, little data) • Microscopic imaging (lots of data, little metadata)

  3. The Gedeon middleware • Metadata management on lightweight grids • Records of (attribute, value) pairs stored in files • Flexible requests • Can be combined through scripting • Various interfaces • Command line (tools) • Libraries • Virtual FS (legacy applications support) • Deployment “à la carte” • Composition of various data sources • Performance • Dedicated I/O library • Semantic caching

  4. Outline • General architecture • Gedeon internal structure • Composition of various data sources • Practical use • « dual » cache • Conclusion

  5. Example of a deployment [Diagram: a client with a query interface (API, FS, GUI, ...) and a cache talks to a local proxy; cached servers « close » to the client are linked by interconnect middleware to the storage sites, each with its own local proxy and caches]

  6. Gedeon components • Gedeon Kernel • fuple • I/O Library • Evaluate the queries • lowerG • Operators to compose bases • Remote access • Interface • API lowerG • Virtual FS • Cache [Diagram labels: application, lowerG, vSGF, fuple, network, local proxy, cache]

  7. What is inside the sources? • Records of attribute/value pairs • Example record: Id 457 • classifA Bacteria • classifB Clostridia • taille 26 • ref
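A record like the one on this slide can be pictured as a plain mapping from attribute names to values. The sketch below is only an illustration: the slides do not show Gedeon's file format or the fuple API, so the `Record` alias, the `parse_record` helper and the one-pair-per-line layout are all assumptions.

```python
from typing import Dict

Record = Dict[str, str]  # hypothetical: attribute name -> value


def parse_record(text: str) -> Record:
    """Parse 'attribute value' lines into a record (assumed text layout)."""
    record: Record = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue
        attr = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        record[attr] = value
    return record


example = parse_record("""
Id 457
classifA Bacteria
classifB Clostridia
taille 26
""")
```

Keeping values as strings leaves type decisions to the operators that read them, which is in line with the "type control with the operators" remark made later in the deck.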

  8. Example of composition of sources • Metadata can be local or copies [Diagram: a client queries sites S1, S2 and S3, combined through the union (+), join (J) and round robin (RR) operators]

  9. Union (+) • Unify storage space + parallel evaluation [Diagram: a source holding records A1, A2, A3, A4, ... and a source holding records B1, B2, B3, B4, ... are presented as one source holding records A1, B1, A2, B2, A3, B3, A4, B4, ...]
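The union operator can be sketched as a wrapper that sends the same query to every source at once and concatenates the answers. This is a minimal Python illustration, not the lowerG implementation: `make_source`, `union` and the idea of a source as a callable taking a predicate are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]
Source = Callable[[Predicate], List[Record]]  # hypothetical source interface


def make_source(records: List[Record]) -> Source:
    """Wrap an in-memory list as a source that evaluates a predicate locally."""
    def evaluate(predicate: Predicate) -> List[Record]:
        return [r for r in records if predicate(r)]
    return evaluate


def union(*sources: Source) -> Source:
    """Present several sources as one; each evaluates the query in parallel."""
    def evaluate(predicate: Predicate) -> List[Record]:
        with ThreadPoolExecutor(max_workers=len(sources)) as pool:
            partial_results = pool.map(lambda s: s(predicate), sources)
            return [r for part in partial_results for r in part]
    return evaluate


source_a = make_source([{"Id": "A1"}, {"Id": "A2"}])
source_b = make_source([{"Id": "B1"}, {"Id": "B2"}])
both = union(source_a, source_b)
all_ids = sorted(r["Id"] for r in both(lambda r: True))
```

Because each source filters its own records, only matching records travel back to the caller, which is one plausible reading of "parallel evaluation" on the slide.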

  10. Round Robin: fault tolerance [Diagram: one client reaches Source 1 and Source 2 through an RR operator]

  11. Round Robin: load balancing [Diagram: two clients reach Source 1 and Source 2 through an RR operator]
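The two round robin slides can be illustrated with one small class: requests rotate over the replicas (load balancing), and a failing replica is skipped in favour of the next one (fault tolerance). The `RoundRobin` class, the callable sources and the use of `ConnectionError` to signal an unreachable source are assumptions for this sketch, not part of Gedeon.

```python
from typing import Callable, List, Sequence

Source = Callable[[str], str]  # hypothetical: takes a request, returns an answer


class RoundRobin:
    """Rotate requests over replicated sources, skipping unreachable ones."""

    def __init__(self, sources: Sequence[Source]) -> None:
        if not sources:
            raise ValueError("at least one source is required")
        self._sources: List[Source] = list(sources)
        self._next = 0

    def query(self, request: str) -> str:
        # Try each source at most once, starting from the next one in turn.
        for _ in range(len(self._sources)):
            source = self._sources[self._next]
            self._next = (self._next + 1) % len(self._sources)
            try:
                return source(request)
            except ConnectionError:
                continue  # fault tolerance: fall through to the next replica
        raise ConnectionError(f"all {len(self._sources)} sources failed")


def source_1(request: str) -> str:
    return f"source 1: {request}"


def source_2(request: str) -> str:
    return f"source 2: {request}"


def broken(request: str) -> str:
    raise ConnectionError("source is down")


balanced = RoundRobin([source_1, source_2])
balanced_answers = [balanced.query("q") for _ in range(3)]

tolerant = RoundRobin([broken, source_2])
tolerant_answers = [tolerant.query("q") for _ in range(2)]
```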

  12. Join operator (J) • Enrich a source with another [Diagram: one source holds records Id 457 (A1 v1, A2 v2, A3 v3) and Id 458 (A1 v4, A2 v5, A3 v6); another holds Id 457 (An vAn1) and Id 458 (An vAn2); joining on Id gives Id 457 (A1 v1, A2 v2, A3 v3, An vAn1) and Id 458 (A1 v4, A2 v5, A3 v6, An vAn2), ...]
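The join on this slide amounts to merging, for each Id, the attributes found in a second source into the record from the first. A minimal Python sketch follows; the `join` function, its `key` parameter and the rule that the primary source wins on conflicting attributes are assumptions, since the slides do not say how conflicts are handled.

```python
from typing import Dict, List

Record = Dict[str, str]


def join(primary: List[Record], secondary: List[Record], key: str = "Id") -> List[Record]:
    """Enrich each primary record with attributes of the matching secondary record."""
    by_key = {r[key]: r for r in secondary if key in r}
    joined: List[Record] = []
    for record in primary:
        extra = by_key.get(record.get(key), {})
        merged = dict(extra)
        merged.update(record)  # assumption: primary values win on conflict
        joined.append(merged)
    return joined


base = [
    {"Id": "457", "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": "458", "A1": "v4", "A2": "v5", "A3": "v6"},
]
annotations = [
    {"Id": "457", "An": "vAn1"},
    {"Id": "458", "An": "vAn2"},
]
enriched = join(base, annotations)
```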

  13. Outline • General architecture • Gedeon internal structure • Composition of various data sources • Practical use • « dual » cache • Conclusion

  14. Tools 1/2 • Libraries • CLI • Operations • sort • projection • select • index • ...

  15. Tools 2/2 • Examples • sort • CLI: $> cat mesmeta.g | fsort 'taille' > trie_taille.g • Library: sort(attr='taille') • index (file .Id.idx) • create_idx(attr='Id') • search_idx('Id', 'P0123')
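The sort and index operations named on this slide can be mimicked in memory to show what they compute. The functions below (`sort_records`, `create_index`, `search_index`) are hypothetical stand-ins written for this example, not the Gedeon library calls; in particular, sorting numerically when values parse as numbers and placing records without the attribute last are assumptions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def sort_records(records: List[Record], attr: str) -> List[Record]:
    """Sort by an attribute: numeric values first, then text, then missing."""
    def sort_key(record: Record) -> Tuple[int, float, str]:
        value = record.get(attr)
        if value is None:
            return (2, 0.0, "")
        try:
            return (0, float(value), "")
        except ValueError:
            return (1, 0.0, value)
    return sorted(records, key=sort_key)


def create_index(records: List[Record], attr: str) -> Dict[str, List[int]]:
    """Map each value of attr to the positions of the records holding it."""
    index: Dict[str, List[int]] = {}
    for position, record in enumerate(records):
        if attr in record:
            index.setdefault(record[attr], []).append(position)
    return index


def search_index(index: Dict[str, List[int]], records: List[Record], value: str) -> List[Record]:
    """Return the records whose indexed attribute equals value."""
    return [records[p] for p in index.get(value, [])]


records = [
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
]
by_size = sort_records(records, "taille")
id_index = create_index(records, "Id")
hit = search_index(id_index, records, "457")
```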

  16. Language for the requests • Simple ($, type control with the operators) • Regular expressions • Second order (conditions can also range over attribute names)

  17. Select expression • Query: Select $Id>459 • Input records: (Id 457, classifA Bacteria, classifB Clostridia, taille 26); (Id 459, classifB Bacteria, taille 47); (Id 460, classifA Fermicutes) • Result: (Id 460, classifA Fermicutes)

  18. Select using regexp • Query: Select $classifB==/.*a$/ • Input records: (Id 457, classifA Bacteria, classifB Clostridia, taille 26); (Id 459, classifB Bacteria, taille 47); (Id 460, classifA Fermicutes) • Result: (Id 457, classifA Bacteria, classifB Clostridia, taille 26); (Id 459, classifB Bacteria, taille 47)

  19. Select using 2nd order logic • Query: Select $/classif[AB]/==Bacteria && $taille>=36 • Input records: (Id 457, classifA Bacteria, classifB Clostridia, taille 26); (Id 459, classifB Bacteria, taille 47); (Id 460, classifA Fermicutes) • Result: (Id 459, classifB Bacteria, taille 47)
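The three select slides can be replayed with a small evaluator. The Python below is a sketch written for this transcript, not the Gedeon query engine: it covers only the forms shown on the slides (`$attr op value`, a `/regex/` value with `==`, a `/regex/` in place of the attribute name, and `&&`). Treating a second-order term as true when any matching attribute satisfies it, matching attribute names in full, and comparing numerically for the ordering operators are assumptions.

```python
import operator
import re
from typing import Dict, List

Record = Dict[str, str]

# One term: $name or $/name-regex/, an operator, then a literal or /regex/.
TERM = re.compile(
    r"^\$(?:/(?P<attr_re>.+?)/|(?P<attr>\w+))\s*"
    r"(?P<op>==|>=|<=|>|<)\s*(?P<rhs>.+)$"
)
ORDERING = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def _compare(value: str, op: str, rhs: str) -> bool:
    if op == "==":
        if len(rhs) >= 2 and rhs.startswith("/") and rhs.endswith("/"):
            return re.search(rhs[1:-1], value) is not None
        return value == rhs
    try:
        return ORDERING[op](float(value), float(rhs))
    except ValueError:
        return False  # assumption: non-numeric values never satisfy an ordering


def _term_matches(record: Record, term: str) -> bool:
    m = TERM.match(term.strip())
    if m is None:
        raise ValueError(f"cannot parse term: {term!r}")
    if m["attr_re"] is not None:
        # Second order: the attribute name itself is given as a pattern.
        names = [n for n in record if re.fullmatch(m["attr_re"], n)]
    else:
        names = [m["attr"]] if m["attr"] in record else []
    return any(_compare(record[n], m["op"], m["rhs"].strip()) for n in names)


def select(records: List[Record], query: str) -> List[Record]:
    """Keep the records that satisfy every '&&'-separated term of the query."""
    terms = query.split("&&")
    return [r for r in records if all(_term_matches(r, t) for t in terms)]


records = [
    {"Id": "457", "classifA": "Bacteria", "classifB": "Clostridia", "taille": "26"},
    {"Id": "459", "classifB": "Bacteria", "taille": "47"},
    {"Id": "460", "classifA": "Fermicutes"},
]
simple = select(records, "$Id>459")
regexp = select(records, "$classifB==/.*a$/")
second_order = select(records, "$/classif[AB]/==Bacteria && $taille>=36")
```

Run on the three records from the slides, the queries keep the same records as the slides show: Id 460, then Ids 457 and 459, then Id 459.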

  20. Virtual FS interface • Just a specific file-oriented interface • Data and metadata can be anywhere in the grid • Definition of logical directories • Ex: cd '$classifB==|.*a$|' • « and » between directories • 1 filename = value of a metadata: logical view • Example session:
      /fs_virt/$classifB==|.*a$|> ls
      457 459
      /fs_virt/$classifB==|.*a$|> cat * > /tmp/mater
      /fs_virt/$classifB==|.*a$|>

  21. Outline • General architecture • Gedeon internal structure • Composition of various data sources • Practical use • « dual » cache • Conclusion

  22. Dual cache (1) • 2 cooperative caches • Cache of requests (R, {id, ...}) -> saves computing power • Cache of data (id, {attr, ...}) -> saves bandwidth • Semantic cache • Can evaluate a query using the data in the cache • Can generate a remainder to complement the data cached

  23. Example • Refinement of a request • '$OC==/Eukaryota/' -> (R, Lid={id1, id2, ...}) • '$OC==/Eukaryota/ && $year>=1998' -> Select(*Lid, '$year>=1998')
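The refinement on this slide can be sketched with two dictionaries: a query cache mapping a request to its list of ids, and a data cache mapping an id to its record. The `DualCache` class, the `fetch_ids` and `fetch_record` callbacks and the sample data are hypothetical, and the remainder is passed as a Python predicate rather than parsed from the query text, which is a simplification.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class DualCache:
    """Query cache (request -> ids) cooperating with a data cache (id -> record)."""

    def __init__(self, fetch_ids: Callable[[str], List[str]],
                 fetch_record: Callable[[str], Record]) -> None:
        self.query_cache: Dict[str, List[str]] = {}
        self.data_cache: Dict[str, Record] = {}
        self._fetch_ids = fetch_ids
        self._fetch_record = fetch_record

    def ids_for(self, request: str) -> List[str]:
        if request not in self.query_cache:  # miss: the server evaluates R
            self.query_cache[request] = self._fetch_ids(request)
        return self.query_cache[request]

    def record(self, rid: str) -> Record:
        if rid not in self.data_cache:  # miss: the record is transferred once
            self.data_cache[rid] = self._fetch_record(rid)
        return self.data_cache[rid]

    def refine(self, cached_request: str, remainder: Callable[[Record], bool]) -> List[str]:
        """Evaluate 'cached_request && remainder' over the cached id list only."""
        return [rid for rid in self.ids_for(cached_request) if remainder(self.record(rid))]


# Stand-in for the remote servers, counting how often they are contacted.
REMOTE_RECORDS = {
    "id1": {"OC": "Eukaryota", "year": "1997"},
    "id2": {"OC": "Eukaryota", "year": "2001"},
    "id3": {"OC": "Archaea", "year": "2003"},
}
REMOTE_ANSWERS = {"$OC==/Eukaryota/": ["id1", "id2"]}
calls = {"queries": 0, "records": 0}


def fetch_ids(request: str) -> List[str]:
    calls["queries"] += 1
    return list(REMOTE_ANSWERS[request])


def fetch_record(rid: str) -> Record:
    calls["records"] += 1
    return dict(REMOTE_RECORDS[rid])


cache = DualCache(fetch_ids, fetch_record)
first = cache.ids_for("$OC==/Eukaryota/")
refined = cache.refine("$OC==/Eukaryota/", lambda r: int(r["year"]) >= 1998)
refined_again = cache.refine("$OC==/Eukaryota/", lambda r: int(r["year"]) >= 1998)
```

In this sketch the refined request never goes back to the servers for the id list, and the second refinement reuses the records fetched by the first, which mirrors the "save computing power" and "save bandwidth" roles of the two caches.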

  24. Dual cache (2) • Distributed semantic cache • Typically used inside communities • Lots of common requests • No location constraints • Members of the community can be geographically scattered • Distributed data cache • Minimize time and data transfer • Cooperation between topologically close sites

  25. Dual cache (3) [Diagram: servers and sites in Rennes and Grenoble; the query cache follows semantic locality (Community Archaea, Community Eukaryota), the object cache follows geographic locality]

  26. Dual cache (4) • Work in progress on the notion of distance • Find geographical proximity • Find common interests between communities • Create hybrid communities based on their requests • Could be used to change the cache parameters • Manual and/or automatic

  27. Conclusion • A data integration middleware • Handling of metadata • Distributed and modular • Deployment can be done according to architectural/organisational constraints • Definition of a dual cache infrastructure • Reflects both organisation and use • Prototype in use • Packaging and documentation needed

  28. Questions?
