OGSA-DAI Requirements Gathering Exercise 2nd DIALOGUE workshop eSI, 9-10 February 2006
OGSA-DAI Requirements Gathering
• Aims
  • learn more about the data access and integration challenges that other projects are facing
  • use this information to inform the future development of the OGSA-DAI software
• Timescale
  • Nov 2005 – Jan 2006
• Gatherers
  • Ally Hume
  • Amy Krause
  • Tom Sugden
Projects
• AstroGrid (www.astrogrid.org): distributed queries over large astronomy databases.
• AutoMed and ISpider (www.doc.ic.ac.uk/automed/ and www.ispider.man.ac.uk): model-based data integration, and a Grid-based informatics platform for proteomics.
• CancerGrid (www.cancergrid.org): storage and analysis of distributed data containing clinical trial and lab data.
• ESSC (www.nerc-essc.ac.uk): environmental and atmospheric simulations.
• GOLD (www.goldproject.ac.uk): provides infrastructure for virtual organisations.
• NTRAC (www.ntrac.org.uk): similar to CancerGrid.
Structure of Meeting Reports
• Data
  • the kind of data that the project is concerned with, including the structure, quantity and types of data resource.
• Queries
  • the types of queries that are performed against this data, including the query languages used and the typical size of result sets.
• The problem
  • the main problems that the project is currently facing with regard to data access and integration.
• What Can OGSA-DAI Provide?
  • the functionality that the project would like OGSA-DAI to provide.
• Checklist
  • summarises the importance of various aspects of data access and integration for the project.
AstroGrid
• A number of distributed databases, each containing astronomical data captured from a different modality.
• Almost all the tables in these databases contain a spatial coordinate for each feature and some numerical attributes associated with that feature.
• The project wants to run distributed queries using its own algorithmic, domain-specific joins.
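One common example of a coordinate-based, domain-specific join is a positional cross-match: rows from two catalogues are paired when their sky positions lie within a small angular radius. The sketch below is illustrative only and is not taken from AstroGrid. The catalogue layout (`id`, `ra`, `dec` in degrees), the sample rows and the 2 arcsecond radius are all assumptions, and the nested loop stands in for whatever indexed algorithm a real system would use.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(left, right, radius_deg):
    """Naive O(n*m) positional join: pair rows whose positions are within radius_deg."""
    matches = []
    for a in left:
        for b in right:
            if angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"]) <= radius_deg:
                matches.append((a["id"], b["id"]))
    return matches

# Hypothetical catalogues from two modalities (made-up values).
optical = [{"id": "opt-1", "ra": 10.0000, "dec": -5.0000, "mag": 17.2},
           {"id": "opt-2", "ra": 150.2500, "dec": 2.1000, "mag": 19.8}]
xray = [{"id": "x-1", "ra": 10.0002, "dec": -5.0001, "flux": 3.1e-14},
        {"id": "x-2", "ra": 200.0000, "dec": 40.0000, "flux": 8.7e-15}]

# Match radius of 2 arcseconds, expressed in degrees.
pairs = cross_match(optical, xray, radius_deg=2.0 / 3600)
```

A join like this cannot be written as a plain equality predicate, which is one reason it is hard to push down into a standard distributed SQL query.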
AutoMed and ISpider
• Middleware to transform schemas from different data sources (relational databases, XML documents, etc.) and evaluate distributed queries expressed in their own IQL language.
• By creating a path of schema transformations, it is possible to federate multiple data sources so that they appear as a single data source to the user.
• How to optimise distributed queries using metadata such as data size, occurrence of indexes, performance rates, etc.
• How to fit AutoMed into a grid architecture.
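The idea of a transformation path can be sketched in a few lines. This is a simplified illustration, not AutoMed's API: the `rename` and `drop` helpers, the attribute names and the sample rows are all hypothetical, and the steps are applied to rows rather than to schema constructs. Each source gets its own pathway into a shared target schema, after which a single query runs over the combined result.

```python
def rename(old, new):
    """Return a transformation step that renames one attribute."""
    def step(row):
        out = dict(row)
        out[new] = out.pop(old)
        return out
    return step

def drop(name):
    """Return a transformation step that removes one attribute."""
    def step(row):
        return {k: v for k, v in row.items() if k != name}
    return step

def apply_pathway(rows, pathway):
    """Apply each step of a pathway, in order, to every row."""
    for step in pathway:
        rows = [step(r) for r in rows]
    return rows

# Two hypothetical sources with different schemas (made-up values).
source_a = [{"prot_id": "P1", "mass_da": 15200.0, "lab": "A"}]
source_b = [{"accession": "P2", "mass": 48100.5}]

# One pathway per source, each ending at the shared schema (accession, mass).
pathway_a = [rename("prot_id", "accession"), rename("mass_da", "mass"), drop("lab")]
pathway_b = []  # source_b already matches the shared schema

federated = apply_pathway(source_a, pathway_a) + apply_pathway(source_b, pathway_b)

# A single query over the federated view.
heavy = [r["accession"] for r in federated if r["mass"] > 20000]
```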
CancerGrid
• By analysing laboratory data and correlating it with hospital and trials data, it is hoped that new subsets of patients can be discovered who respond best to particular treatments.
• Security is a major concern because many of the owners of data are aware of the value of their data and consequently are concerned about who has access to it.
• A good means of transforming trial forms (XML documents) into a format suitable for automatic insertion into relational tables is required.
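A minimal sketch of the XML-to-relational step, using only the Python standard library, might look like the following. The form layout, element and attribute names, and the `trial_entry` table are invented for illustration; the real CancerGrid forms are not described in the source.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trial form; the real form schema is not given in the source.
FORM_XML = """
<trialForm trial="T-001">
  <patient id="P-17">
    <field name="age">54</field>
    <field name="tumour_grade">2</field>
  </patient>
  <patient id="P-18">
    <field name="age">61</field>
    <field name="tumour_grade">3</field>
  </patient>
</trialForm>
"""

def form_to_rows(xml_text):
    """Flatten one XML form into tuples matching the trial_entry table."""
    root = ET.fromstring(xml_text.strip())
    trial = root.get("trial")
    rows = []
    for patient in root.findall("patient"):
        fields = {f.get("name"): f.text for f in patient.findall("field")}
        rows.append((trial, patient.get("id"),
                     int(fields["age"]), int(fields["tumour_grade"])))
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial_entry "
             "(trial TEXT, patient_id TEXT, age INTEGER, tumour_grade INTEGER)")
# Parameterised insert, so form values are never spliced into the SQL text.
conn.executemany("INSERT INTO trial_entry VALUES (?, ?, ?, ?)", form_to_rows(FORM_XML))
conn.commit()

stored = conn.execute(
    "SELECT patient_id, age FROM trial_entry ORDER BY patient_id").fetchall()
```

The mapping here is hard-coded. A general solution would need to derive it from the form definition, which is the part the project identified as missing.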
ESSC
• Dealing with large data sets of between 2 and 3 terabytes, stored mostly on a single machine. The user requests portions of data, often assembled from various files.
• Uniform web service interfaces are provided for accessing data sets using the standard APIs associated with the binary data file formats that are used (netCDF, GRIB, HDF, etc.).
• The queries used by ESSC are currently synchronous, which causes request timeout problems when the resulting datasets are large. The project is sceptical of current WS-Notification implementations that require open ports on client machines.
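One way to avoid both long-held synchronous requests and callbacks into the client is a submit-and-poll pattern: the service returns a job identifier immediately, and the client asks for status until the result is ready. The sketch below shows the pattern in-process with a thread pool. The `JobService` class and `extract_subset` function are hypothetical stand-ins, not part of OGSA-DAI or ESSC's services.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

class JobService:
    """Toy asynchronous service: submit returns at once, results are polled for."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}

    def submit(self, func, *args):
        """Start the work in the background and return a job identifier."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Cheap call that reports whether the job has finished."""
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id):
        """Return the job's result (blocks if the job is still running)."""
        return self._jobs[job_id].result()

    def close(self):
        self._pool.shutdown()

def extract_subset(n_values):
    """Stand-in for a slow extraction assembled from several data files."""
    time.sleep(0.2)
    return list(range(n_values))

service = JobService()
job = service.submit(extract_subset, 5)

# All traffic is client-initiated, so no inbound port is needed on the client.
while service.status(job) != "done":
    time.sleep(0.05)

data = service.result(job)
service.close()
```

Polling costs extra round trips compared with notification, but every request is short, so large result sets no longer run into request timeouts.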
GOLD
• Aims to develop an infrastructure to facilitate collaboration within virtual organisations (VOs).
• Data storage services will be used for capturing interactions amongst parties of a VO in order to facilitate auditing and VO-playback.
• Data analysis services will be used for performing particular types of analysis of data existing mostly in relational database back-ends.
• The primary concern is managing security policies and service access rights of different types of user dynamically.
NTRAC
• Aims to build platforms that bring different systems together.
• Many of the data resources that they are accessing are stored in private networks (e.g. NHS patient information) with no open gateway to the public.
• Researchers want to mine the data to find people to recruit into studies.
Prioritised Requirements
Notes on requirements
• Prioritised based on a judgement of their importance to the various projects that were investigated.
• Whether or not they are within the scope of the OGSA-DAI project, or have already been satisfied by OGSA-DAI, is not considered here.
• Frequent mention of the non-functional requirement: ease-of-use.
  • Some concern that installation and configuration remain too complex when compared with typical WAR-based web service deployment.
• Hope to publish the full document in the near future
  • let me know if you want a copy
Conclusions
• Efficient transportation of large quantities of data between heterogeneous data resources is a crucial requirement for several projects from distinct domains.
  • This is also an implicit requirement for projects requiring data federation and distributed query processing.
  • If we could solve this problem, it would be of great benefit to these projects, and also to higher-level middleware projects such as OGSA-DQP.
• Security remains a major concern because of the commercial and sensitive nature of much data.
  • Projects want a generalised, role-based mechanism for exposing different views of data resources to different users, and managing these views dynamically.
  • Is this outside the scope of data integration middleware?
• While we were previously aware of most of the requirements described in this document, associating them with actual projects can help with prioritisation.
Further information
• The OGSA-DAI Project Site:
  • http://www.ogsadai.org.uk
• The DAIS-WG site:
  • http://forge.gridforum.org/projects/dais-wg/
• OGSA-DAI Users Mailing list
  • users@ogsadai.org.uk
  • General discussion on grid DAI matters
• Formal support for OGSA-DAI releases
  • http://bugs.ogsadai.org.uk/
• OGSA-DAI training courses