Application of International GeoSample Number (IGSN) to Sample Collections Sri Vinay Geoinformatics for Geochemistry (GfG) Program Lamont Campus of Columbia University 2007 September 25
Presentation Outline • Unique identifiers and their application to sample and data management • System for Earth SAmple Registration (SESAR) and International GeoSample Number (IGSN) • Current Status and Activities of SESAR • IGSN Implementation Strategies • Discussion
Unique Identifiers An identifier is an unambiguous label which specifies an entity. Unique identifiers are widely used to designate physical objects, assisting in trading (e.g., the Universal Product Code bar code system), and the extension of similar principles to digital and abstract entities is a prerequisite for digital commerce of rights and intellectual content. Although the design of unique identification schemes is a technical problem, it is also a business issue with implications for what is identified and how identified items are made available.
“In a dynamic and distributed information environment, the effective management of both metadata records and the resources they describe requires a systematic way of generating and assigning unique identifiers.” (N. Friesen 2002: Recommendations for Globally Unique, Location-Independent, Persistent Identifiers) Examples: • URN:NBN:fi-fe976238 • tel:+1-816-555-1212 • DOI:10.1000/ISSN1047-935X
Life Sciences - Bioinformatics “The World-Wide Web provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics. However, several limits and inadequacies have become apparent, one of which is the inability to programmatically identify locally named objects that may be widely distributed over the network. This shortcoming limits our ability to integrate multiple knowledgebases, each of which gives partial information of a shared domain, as is commonly seen in Bioinformatics” (Clark, T., Martin S., Liefeld T., 2004: Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics. Vol.5 (1), 59-70.) LSID = Life Science Identifier URN:LSID:ncbi.nlm.nih.gov:GenBank.accession:NT_001063:2
Geosciences - Geoinformatics Kai Lin (SDSC): “Ontology Based Resource Registration and Integration in GEON”, lecture, July 2005
Sample Naming in the Geosciences Examples from the PetDB Database Sample names are duplicated. Sample names are modified or changed.
Geosciences - Geoinformatics • Integration of data in a distributed system requires unique identification of samples. • Currently, naming of samples is ambiguous. • Different samples have identical names. • Samples are renamed. • Metadata that allow unique identification are often missing for terrestrial samples. • Institutions have their own naming protocols, with no assurance that names are unique on a global scale. • Access to information about the samples • Need to ensure proper evaluation and facilitate interpretation of sample-based data. • Links to physical specimens • to make observations & measurements and the science derived from them reproducible. • to allow discovery & re-use of samples for improved use of existing collections.
Urgency to Act • Growing number of data systems with sample-based data • Growing demand for ‘fine-grained’ access to data at the level of individual samples • New technologies for linking and integrating data (interoperability) • Increasing need to share samples
Generating Unique IDs: Options • “Registration-based schemes” • Require a central clearinghouse • Register personal or institutional names • Register prefix or namespace (e.g. URN) • Register metadata that allow the central clearinghouse to generate identifiers • Schemes without registration • use a computational process (naming protocol) to produce an ID based on metadata • No central authority
No-Registration Scheme • Risk of incorrect application of naming protocol • Risk of name duplication • Identifier might grow to impracticable length to ensure uniqueness • Metadata missing for legacy samples • Easy implementation
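To make these risks concrete, here is a minimal Python sketch of a registration-free naming protocol. The field names (`cruise`, `station`, `sampler`), the sample values, and the `derive_id` function are hypothetical and are not taken from any scheme discussed in this presentation. The sketch only illustrates that an ID computed from metadata collides when distinguishing metadata are missing, and gets longer as fields are added to keep it unique.

```python
def derive_id(metadata, fields):
    """Build an ID by joining selected metadata fields (hypothetical protocol)."""
    # Missing metadata is rendered as "NA", which is where collisions creep in.
    return "-".join(str(metadata.get(f, "NA")).upper() for f in fields)

# Two distinct legacy samples whose station metadata was never recorded.
sample_a = {"cruise": "xy01", "sampler": "dredge"}
sample_b = {"cruise": "xy01", "sampler": "dredge"}
# A well-documented sample.
sample_c = {"cruise": "xy01", "station": 12, "sampler": "dredge"}

short_fields = ["cruise", "station"]
long_fields = ["cruise", "station", "sampler"]

id_a = derive_id(sample_a, short_fields)
id_b = derive_id(sample_b, short_fields)
id_c_short = derive_id(sample_c, short_fields)
id_c_long = derive_id(sample_c, long_fields)

print(id_a, id_b, id_a == id_b)   # distinct samples end up with the same ID
print(id_c_short, id_c_long)      # more fields give a longer ID
```

A central clearinghouse avoids both effects because uniqueness is checked at registration time rather than inferred from metadata.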
SESAR - A Centralized Approach • Response to urgent need for unique ID • Easier to prevent duplicate registrations • Easier to ensure links between parent and child samples • Provide a central access point for Peer2Peer registration • Facilitate international collaboration • Build a Global Sample Catalog
SESAR – A Centralized Approach • Proposed to NSF in July 2004 • SGER (EAR) award received for September 2004 - August 2005 • First presented to community at Marine Curators’ Meeting at LDEO, September 2004 • Supplement received in Sept 2005 until May 2006 • Workshop at SDSC January 2005 • Proposal to NSF August 2005 • Three year grant awarded in April 2006 (NSF-OCE).
International GeoSample Number: A Global Unique Identifier for Earth Samples Example: IGSN:SIO001324 (unique user code followed by a string of random characters) • Managed at central clearinghouse (SESAR) • Strict syntax (9 characters: letters [A-Z] & numbers [0-9]) • Fits sample labels • Fits data tables in publications • Allows 2,176,782,336 sample identifiers per registrant • Generated by SESAR or by users • Does not replace personal or institutional names
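The syntax rule and the capacity figure can be checked with a short Python sketch. The function names are illustrative. The split into a 3-character user code and a 6-character sample part is an assumption inferred from the stated capacity (36^6 = 2,176,782,336); the slide itself only gives the total length of 9.

```python
import re

ALPHABET_SIZE = 26 + 10          # letters A-Z plus digits 0-9
IGSN_LENGTH = 9                  # total length stated on the slide
USER_CODE_LENGTH = 3             # assumption: inferred from the stated capacity

# Nine characters drawn from A-Z and 0-9, nothing else.
IGSN_PATTERN = re.compile(r"[A-Z0-9]{9}")

def is_valid_igsn(value):
    """Return True if value matches the 9-character syntax."""
    return IGSN_PATTERN.fullmatch(value) is not None

def split_igsn(value):
    """Split an IGSN into (user_code, sample_part) under the 3 + 6 assumption."""
    if not is_valid_igsn(value):
        raise ValueError(f"not a valid IGSN: {value!r}")
    return value[:USER_CODE_LENGTH], value[USER_CODE_LENGTH:]

ids_per_registrant = ALPHABET_SIZE ** (IGSN_LENGTH - USER_CODE_LENGTH)

print(is_valid_igsn("SIO001324"))
print(split_igsn("SIO001324"))
print(f"{ids_per_registrant:,}")
```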
Benefits of the IGSN & SESAR • Ability to unambiguously identify samples • allows linking & integrating data for a single sample • advances interoperability among digital data management systems & the development of Geoinformatics. • helps build more comprehensive data sets for samples. • fosters new cross-disciplinary approaches in science. • aids preservation and curation; orphaned samples can be identified. • ensures proper linking of data from samples and sub-samples. • facilitates sharing of samples.
SESAR: Status • Basic version of system functional since Fall 2004 • Nearly 3.6 Million GeoObjects registered • All DSDP/ODP GeoObjects (holes, cores, core sections, core samples) • Dredge and core collections from Scripps, WHOI, Lamont, Antarctic Research Facility (ARF) • >40,000 mineral specimens from Harvard Museum • Rocks & minerals from the US Polar Rock Repository • IGSN implemented in Geoscience data systems (e.g. EarthChem, MetPetDB, PaleoStrat, CoreWall) • Revised & extended version to be released in phases by end of 2007
SESAR: Sample Registration • Obtain account via website • Set up login/password • Get a unique user code • Submit sample information • Via Batch Registration Forms (.xls workbooks) • Via web site (currently off-line for upgrade) • Via web services (under development)
Registration via Spreadsheet Forms Available Batch Registration Forms • Coring GeoObjects • Dredges/trawl/grabs • Individual samples • Sections, Suites, & Sequences
Registration via Web Services: Under Development • Registration of objects via collaborating data systems • Automatically register samples when sample metadata are entered into collaborating data systems (e.g. IODP, MGDS) • Eliminates redundant metadata submission • Systems communicate via web services • Starting with REST-based services; SOAP could be supported in the future. • Authentication • Investigating different technologies including GEON/GAMA • Metadata exchange and validation • XML schema
SESAR Service “MyGeoSamples” Assists investigators in managing their samples. • Current Services: • Long-term preservation of information about samples • Lists of personal sample collections • Store images, field notes, etc.
SESAR Service “MyGeoSamples” • Services “Under Construction” • Search & sort personal sample collections • Create maps of sample locations • Establish links to data (publications, data systems) • Download tabular sample information to spreadsheets (Figure: Antarctic Research Facility, FSU, ca. 7,000 cores)
SESAR Service “MyGeoSamples” Extended Services for Sample Curation? • Potential Services: • Modules to manage administrative metadata (customizable) • Modules for creating & operating web interfaces to collections • Advantages • No IT infrastructure required (except a computer and an internet connection) • No maintenance or risk & contingency management required • Access from anywhere by authorized individuals. • Platform independent
The SESAR Global Sample Catalog • SESAR integrates the World’s sample collections • Allows users to find/discover existing samples • Provides access to “sample profiles” • View sample information in SESAR as provided • Link to the specimen’s ‘home’ (archive) • Link to data (publications, databases)
The Challenges • Multiple systems and catalogs • Data Management Systems for Science Programs • Ridge2000 - MGDS • MARGINS - MGDS • IODP • Domain Specific Catalogs • NGDC – IMLGS • National Catalogs • Canadian National Sample Management System • SESAR • Issues • Redundancies • Unacceptable demands on investigators • Inconsistencies • Fragmentation • Competition rather than collaboration • Adoption • Sample curation • Data publication • Diversity of collections • Repositories • Museums • Individual Investigators • Structured science & field programs • Metadata requirements • Sample types & relations • Vocabularies • Global Scope • Data Generated by International Collaborations • IODP • ICDP • InterMARGINS, InterRidge • Data are shared globally • Scientific literature • Web-based repositories • Samples are shared globally
IGSN Implementation Strategies • Work with investigators, curators and repositories to define & integrate registration process and IGSN into existing sample and data management workflows • Joint Workshop of SESAR & NGDC, February 26 & 27, 2007, Boulder, CO • Registration of repository and museum collections ongoing • Advance adoption of IGSN • Work with editors to make IGSN a requirement for data publication (e.g. Editors’ Round Table, Societies) • Work with funding agencies, large science programs (e.g. IODP, MARGINS, ANDRILL), CI projects (e.g. GEON, CHRONOS), and repositories on sample and data archiving policies • Work with CI Partners on system design & interoperability • Interoperability Workshop, January 2005 at SDSC • Working with GEON on authentication scheme • Working with IODP and KU/EarthChem on web services
Editors’ Breakout* - Reporting Data: • Published paper is point of record. All data should be reported. No “representative data”, no “data can be obtained from author”, no data available at personal websites • Submission to databases should be strongly encouraged • Unique sample identifier (IGSN) • This may solve the problem of poor sample metadata • This system is being implemented. • Essential component of a successful database: contains sample metadata, allows samples to be followed through their analytical history. • Tracks samples and subsamples. • We should start using it now. *At the GERM Meeting, May 2006; recommendations of the Editors’ Breakout presented by Steve Goldstein
Support by Funding Agencies “We have also funded an effort (SESAR) to uniquely identify all samples so that various analyses on the same samples can be cross referenced and listed. I would also like you to indicate in your dissemination plan that your suite of samples will be registered with SESAR.” Letter of NSF Program Manager (OCE/MG&G) to a PI, processing paperwork for a grant (January 2007)
“Government, educational, and private sector organizations, individually as well as collectively, are encouraged to aggressively address the following Geoscience data-preservation challenges: • identifying, organizing, documenting, and cataloging existing data collections, preferably in a digital format; • constructing logical linkages and search engines that facilitate access to organizations and their geoscience sample and data collections; • dedicating adequate space, physical and digital, for storing and efficient accessing of existing and future samples and data sets;”
Joint Workshop of SESAR & NGDC IMLGS, Boulder, CO, February 26 & 27, 2007 • Define procedures & best-practices for • Creating & assigning IGSNs • Submitting metadata for GeoObjects to SESAR • Work towards an integrated system of sample catalogs • Recommend ways to define & implement standards for metadata and vocabularies • Identify possibilities for streamlining procedures for submission of sample metadata to catalogs
Workshop Recommendations • Streamlined Registration Process • Registration process should be simple • Options to integrate easily into existing sample and data management workflows • Ability to adopt required metadata from existing forms in use to avoid redundant metadata submission to multiple systems • Support automated registration from other systems via web services to avoid manual/redundant metadata submission • Best Practices • Objects should receive an IGSN at the time of labeling • Objects should have an IGSN before being distributed among multiple investigators and users • Parent objects should be registered before child objects • Metadata should include geospatial info (coordinates preferred)
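The "parents before children" best practice amounts to a topological ordering of the sample hierarchy. A minimal Python sketch, assuming each object records at most one parent; the object names and the `registration_order` helper are hypothetical and do not describe SESAR's implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical objects: each key maps to its parent (None for a top-level object).
parent_of = {
    "core-sample-1": "core-section-1",
    "core-section-1": "core-1",
    "core-1": "hole-A",
    "hole-A": None,
    "core-sample-2": "core-section-1",
}

def registration_order(parent_of):
    """Return object names ordered so that every parent precedes its children."""
    sorter = TopologicalSorter()
    for child, parent in parent_of.items():
        if parent is None:
            sorter.add(child)            # no prerequisite
        else:
            sorter.add(child, parent)    # parent must be registered first
    return list(sorter.static_order())   # raises CycleError on circular links

order = registration_order(parent_of)
print(order)
```

Submitting a batch in this order means every child can reference a parent IGSN that already exists.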
Workshop Recommendations • Batch Registration Forms • It is preferred that forms for the MGDS, IMLGS, and SESAR have the same column headers, with the metadata listed under each header clearly defined. The order of the headers can vary. • An XML schema for sample metadata should be developed to which the metadata in any spreadsheet can be exported. • SESAR Batch Registration Forms should be customizable, e.g. buttons beneath the header should allow users to hide unnecessary columns. Columns for metadata that are identified as ‘recommended’ should always be visible. • SESAR should develop a manual for filling out the forms. The manual should include instructions regarding definition of parent – child relations. It needs to be decided if a site should get an IGSN. It is possible to link multiple stations taken at one site by including the site name as metadata. • Vocabularies and Classification Schemes • Adopt from existing standards as much as possible and work with repositories and other systems to use common schemes • It is preferable for different systems (MGDS, IMLGS, SESAR) to allow multiple vocabularies • List allowed vocabularies on the Marine Metadata Initiative (MMI) web site.
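The recommendation to export spreadsheet metadata to XML can be sketched in Python as follows. The column headers, element names, and the `ABC` user code are all hypothetical, since the XML schema itself had not yet been defined at the time of the workshop; the sheet is represented as CSV text so the example stays self-contained.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for a batch registration sheet saved as CSV; headers are hypothetical.
SHEET = """igsn,name,parent_igsn,latitude,longitude
ABC000001,dredge-7,,-12.5,45.25
ABC000002,dredge-7-rock-1,ABC000001,-12.5,45.25
"""

def rows_to_xml(csv_text):
    """Convert each spreadsheet row into a <sample> element (hypothetical schema)."""
    root = ET.Element("samples")
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = ET.SubElement(root, "sample")
        for header, value in row.items():
            if value:                       # skip empty cells
                ET.SubElement(sample, header).text = value
    return root

root = rows_to_xml(SHEET)
print(ET.tostring(root, encoding="unicode"))
```

Because elements are keyed by header name rather than by column position, the order of the headers can vary between forms, as the recommendation allows.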
Registration Procedures to Support Integration with Existing Workflows: Under Implementation • Trusted Agents • A registrant can apply to become a Trusted Agent. Trusted Agents are authorized to generate unique IGSNs within their registered name space (user code). They can use tools, e.g. Excel, on the ship or in the field to generate IGSNs within their given name space, have the samples labeled with the IGSN, and submit the IGSN along with metadata via web services within a short time frame. Trusted Agents must sign an MOU outlining the policies and procedures for handling IGSNs. • Example IODP: Name Space “DR0”, “DR1”,… • Trusted Agent Operation: 1. Ship/Field: generate label with IGSN 2. Data System: ingest IGSN & metadata 3. Submit metadata & IGSN to SESAR (web services)
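A Trusted Agent needs a way to produce IGSNs offline without duplicates. One simple option, shown here as a Python sketch, is a base-36 counter within the registered name space. This is an assumption made for illustration: the presentation does not specify how agents generate the characters after the user code, and the helper names are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # the 36 allowed characters

def to_base36(number, width):
    """Encode a non-negative integer as a fixed-width base-36 string."""
    if number < 0 or number >= 36 ** width:
        raise ValueError("number does not fit in the requested width")
    chars = []
    for _ in range(width):
        number, remainder = divmod(number, 36)
        chars.append(DIGITS[remainder])
    return "".join(reversed(chars))

def generate_igsns(namespace, start, count):
    """Yield `count` IGSNs inside `namespace`, starting at counter value `start`."""
    width = 9 - len(namespace)           # total IGSN length is 9 characters
    for n in range(start, start + count):
        yield namespace + to_base36(n, width)

labels = list(generate_igsns("DR0", start=35, count=3))
print(labels)  # ['DR000000Z', 'DR0000010', 'DR0000011']
```

The agent only has to persist the last counter value between sessions; uniqueness across agents follows from each agent holding a distinct name space.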
Registration Procedures to Support Integration with Existing Workflows: Under Implementation • Pre-Assigned IGSNs • Upon request, SESAR provides forms (spreadsheets) with pre-assigned IGSNs to chief scientists/investigators/repositories to take on ship/field. Forms filled with metadata should be submitted to SESAR post-collection. E.g.: SCRIPPS. • Other systems or repositories pre-populate their existing forms with IGSNs obtained from SESAR and provide them to chief scientists. E.g.: MGDS provides forms with IGSNs to PIs in advance of R2K and MARGINS cruises. Post-cruise, MGDS will submit the sample metadata to SESAR. • Workflow via SESAR forms: 1. Get forms with IGSN from SESAR 2. Ship/Field: enter metadata with IGSN 3. Submit forms with metadata and IGSN to SESAR • Workflow via a data system: 1. Data system gets IGSN from SESAR 2. Data system provides forms with IGSN 3. Ship/Field: enter metadata with IGSN 4. Forms with metadata and IGSN return to the data system 5. Data system submits metadata & IGSN to SESAR (web services)
Collaboration with Repositories & Systems: Ongoing • IODP • Registered DSDP/ODP holes, cores, core sections, core samples • “Trusted Agent” arrangement in progress • MGDS • Registered existing dredges, cores, and core samples • Incorporating IGSN into existing MGDS forms • LDEO (Lamont) • Registered existing dredge and core collections • WHOI • Registering existing dredge and core collections • Future arrangements like “Trusted Agent” to be discussed • SIO (SCRIPPS) • Used SESAR forms with pre-assigned IGSNs on cruise for dredge collections • Metadata need to be updated
Collaboration with Repositories & Systems: Ongoing • Antarctic Research Facility (ARF) • Registering existing dredge and core collections • US Polar Rock Repository • Registered existing rocks and minerals • Need pre-assigned IGSNs and web service registration • Harvard Museum • Registered existing mineral specimens • Project for adding simple sample curation module in progress • OSU • Start with IGSN for historic samples • Then become trusted agent and issue IGSNs to new samples including those given to PIs • NGDC • May register some orphaned historical samples • Work with curators/repository and SESAR to streamline and standardize metadata fields and entry forms
Collaboration with Repositories & Systems: Ongoing • Canadian National Marine Geoscience Collections • Likely to register existing collections • May become “Trusted Agent” in future • Limnological Research Center (LRC/LacCore) • Likely to register via batch registration forms • May use pre-assigned IGSNs or become “Trusted Agent” in future • USGS • Discussions are on-going with USGS to make them aware of SESAR effort • Plan to contact state geological surveys • Other Repositories • Efforts are under way to reach out and propose suitable process • OSU model may be most applicable (First register legacy samples and then become trusted agent or use pre-assigned IGSNs) • Could offer sample curation module for small operations