A Metadata Binding Store for Distributed Scientific Data

Yin Chen, Malcolm Atkinson, Stuart Aitken. UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009.

Presentation Transcript


  1. A Metadata Binding Store for Distributed Scientific Data. Yin Chen, Malcolm Atkinson, Stuart Aitken. UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009

  2. MOTIVATION
     • Scientific data and metadata are generated at great speed and in high volume
     • Metadata are the key to data access, discovery, preservation, provenance, and interpretation
     • Data and metadata are often created independently
     • We view the relationship between data and metadata as a binding
     • Hypothesis: a binding service is useful for serving distributed scientific data at various scales

  3. IS BINDING A PROBLEM?
     • EurExpress Project, EU-funded under FP6, 2005-2009
     • Aims to capture >20,000 genes via RNA in situ hybridization (ISH)
     • Generates a digital 'transcriptome atlas'
     • Nov. 2009: 19,411 assays, 15,715 annotations, ~5 TB of data
     [Figure: EurExpress data pipeline. Labels: Genepaint robotics; 14.5-day mouse embryo; section slides; automatic ISHs (8 EU bio labs); high-resolution gene expression images; ISH management (LIME system); Gene Expression Data Repository; template; metadata; images; annotation (FIATAS); Alicante]

  4. REAL-WORLD OBSERVATIONS
     • Information inconsistency
     • Significant human operating errors
     • Consistency checking became more difficult as the data grew
     • The bindings have to be managed efficiently!
     [Figures: number of gene expression images without metadata; number of probe genes mismatched with the template design]
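The first inconsistency on this slide, images that have no metadata, amounts to a set difference once bindings are recorded explicitly. The following is a minimal sketch under stated assumptions: bindings are represented as (image URI, metadata URI) pairs, and the function name and the example.org URIs are hypothetical, not taken from EurExpress.

```python
def find_unbound(image_uris, bindings):
    """Return the image URIs that no binding links to a metadata URI.

    `bindings` is an iterable of (image_uri, metadata_uri) pairs; this
    representation is an assumption made for illustration only.
    """
    bound = {image_uri for image_uri, _metadata_uri in bindings}
    return sorted(set(image_uris) - bound)


# Hypothetical example data.
images = [
    "http://example.org/images/a1",
    "http://example.org/images/a2",
    "http://example.org/images/a3",
]
known_bindings = [
    ("http://example.org/images/a1", "http://example.org/meta/m1"),
    ("http://example.org/images/a3", "http://example.org/meta/m3"),
]
unbound = find_unbound(images, known_bindings)
```

With the bindings held in one place, a check of this kind stays a single pass over the binding set rather than a manual comparison of two repositories.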

  5. DESIGN PRINCIPLES
     • A binding system manages bindings
     • Federate references to data and metadata
     • The data warehousing approach is no longer feasible
        • Data become too large, too dynamic, and too unwieldy to copy
        • No permission to copy
        • Freshness of copied data
     • Generic approach, independent of the data resources
     • Can be combined with other services
     • Allows bindings to be shared among user communities; scalable
     • Design principle: simple
        • Minimize internal complexity: no conflict
        • Maximize external integrity: less overlap

  6. A SIMPLE BINDING STORE
     • Binding data model
        • Binding ID: UUID; needs no central registration authority; unlimited
        • Binding subject/object: URIs, used by most web-accessible data resources
        • Binding description: tags; efficient, flexible
     • Binding APIs
        • Manipulation operations
        • Discovery operations
        • Delivery operations
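The data model on this slide (a UUID as the binding ID, URIs for subject and object, tags as the description) and its three API groups can be illustrated with a small in-memory sketch. This is not the authors' OGSA-DAI implementation: the class names, method names, and the dictionary returned by `deliver` are all hypothetical, chosen only to show one operation from each group.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Binding:
    subject_uri: str                 # URI of the bound subject (e.g. a data item)
    object_uri: str                  # URI of the bound object (e.g. its metadata)
    tags: frozenset = frozenset()    # free-form tags describing the binding
    # UUIDs are generated locally, so no central registration authority is needed.
    binding_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class BindingStore:
    """Hypothetical in-memory store; the real system is a distributed service."""

    def __init__(self):
        self._bindings = {}

    # Manipulation operations
    def create(self, subject_uri, object_uri, tags=()):
        binding = Binding(subject_uri, object_uri, frozenset(tags))
        self._bindings[binding.binding_id] = binding
        return binding.binding_id

    def delete(self, binding_id):
        return self._bindings.pop(binding_id, None) is not None

    # Discovery operations
    def find_by_tag(self, tag):
        return [b for b in self._bindings.values() if tag in b.tags]

    def find_by_subject(self, subject_uri):
        return [b for b in self._bindings.values() if b.subject_uri == subject_uri]

    # Delivery operation: return the references only, never the data itself.
    def deliver(self, binding_id):
        b = self._bindings[binding_id]
        return {
            "id": b.binding_id,
            "subject": b.subject_uri,
            "object": b.object_uri,
            "tags": sorted(b.tags),
        }
```

The store holds references only, in line with the design principle of federating data and metadata rather than copying them.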

  7. IMPLEMENTATION
     • Grid technology: OGSA-DAI
        • OGSA-DAI server activities
        • OGSA-DAI client activities
        • OGSA-DAI client toolkits
     • Service Proxy APIs: a programmable interface for users
     • Command-line UI
        • Not included in current work

  8. EVALUATION
     • Uses a workload modelling and simulation method
     • No binding data are available
     • Observations from wwPDB, BADC, EurExpress, NanoCMOS, Flickr
     • Creation patterns, access patterns, and content patterns are observed
     • Simulation of the real-world observations

  9. WORKLOAD MODELLING
     [Figure panels. Creation workloads: number of annotations per day; new PDB structures per month; number of data files per day. Access workloads: number of accesses per day; tag behaviours]

  10. WORKLOAD SIMULATION
     [Figure panels: probability of interval occurrence; Hidden Markov Model; two Poisson processes, two uniform distributions; Poisson process; uniform distribution; trend; Weibull distribution; Zipf's distribution with α=0.2, α=0.9, and α=0.4]
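Two of the building blocks named on this slide, a Poisson process and a Zipf distribution, can be sketched with the Python standard library. The sketch below is an illustration under stated assumptions, not the authors' simulator (which the deck says used SSJ and Colt): it omits the Hidden Markov Model, the uniform distributions, and the Weibull trend, and the arrival rate, time horizon, and number of tags are hypothetical. Only α=0.9 is taken from the slide. Because α is below 1, the Zipf law is defined here over a finite set of ranks with weights proportional to 1/k^α.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Event times of a homogeneous Poisson process on [0, horizon).

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1/rate, so the times are built by summing exponential gaps.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def zipf_weights(n_ranks, alpha):
    """Unnormalised finite Zipf weights: rank k gets weight 1 / k**alpha."""
    return [1.0 / (k ** alpha) for k in range(1, n_ranks + 1)]


def simulate_tagging(rate, horizon, n_tags, alpha, seed=0):
    """Return (time, tag_rank) events: Poisson arrivals, Zipf-chosen tags."""
    rng = random.Random(seed)
    times = poisson_arrivals(rate, horizon, rng)
    ranks = rng.choices(
        range(1, n_tags + 1),
        weights=zipf_weights(n_tags, alpha),
        k=len(times),
    )
    return list(zip(times, ranks))


# Hypothetical parameters: 5 events per time unit over 100 time units, 50 tags.
events = simulate_tagging(rate=5.0, horizon=100.0, n_tags=50, alpha=0.9, seed=1)
```

A fixed seed keeps a run repeatable, which matters when the same synthetic workload has to be replayed against different store configurations.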

  11. EXPERIMENT SETUP
     • Intel(R) Core2 2.66GHz, 7GB RAM, 144GB HD, 100Mbps network connection, Red Hat 4.1, Tomcat 5.5, OD 3.1, MySQL 6.0, R 2.9
     • SSJ, Colt, benchmark script
     • 10 runs per configuration; means, SEs, and 95% CIs collected
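The summary statistics named on this slide (mean, standard error, 95% confidence interval over 10 runs) can be computed as in the sketch below. The deck does not say how the intervals were derived; this sketch assumes a two-sided Student-t interval, which for 10 runs uses the critical value 2.262 (9 degrees of freedom). The sample values are hypothetical placeholders, not measurements from the experiments.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
# Only the entry needed for 10 runs (9 degrees of freedom) is included.
T_CRIT_95 = {9: 2.262}


def summarise(samples):
    """Return (mean, standard error, (ci_low, ci_high)) for one configuration."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    se = statistics.stdev(samples) / math.sqrt(n)
    t = T_CRIT_95[n - 1]
    return mean, se, (mean - t * se, mean + t * se)


# Hypothetical response times for 10 runs of one configuration.
runs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mean, se, (ci_low, ci_high) = summarise(runs)
```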

  12. EXPERIMENT RESULTS
     • Robust to different types of workloads
     • Robust to small- through large-scale workloads
     • Robust to both independent and combined workloads
     • Stressed by ultra-scale workloads

  13. FUTURE WORK: A SCALABLE BINDING STORE
     • Cloud computing promises to be scalable
     • Our evaluation of Hadoop

  14. BINDING APPLICATIONS
     • The Web is moving towards Web 3.0
     • Binding index
     • Combination with metadata management tools
     • Mashup applications

  15. ACKNOWLEDGEMENT
     • National e-Science Centre: research group, support team, middleware team
     • MRC HGU Biomedical Statistical Analyse Section: Prof Richard Baldock, Dr Duncan Davidson
     • Newcastle HDBR: Prof Susan Lindsay, Steven N. Lisgo
     • EDINA Geo Research & Data Library: Chris Higgins, Dr David Medyckyj-Scott
     • Data resources: DGEMap, EurExpress (Prof Richard Baldock, Lalit Kumar); NanoCMOS (Dr Clive Davenhall, Prof Richard Sinnott)
     • Technical support: OGSA-DAI team
     • Research materials: COBrA-CT, OntoGrid (Prof Carole Goble, Dr Oscar Corcho); MyGrid (Dr Phillip Lord)
