This article explores the importance of knowledge sharing and collaborative problem solving in biodiversity informatics, specifically focusing on the Species 2000 vision. It discusses the challenges faced in cataloguing and classifying species and introduces the SPICE project as a solution. The article also highlights the interactive use of SPICE and the different types of requests that can be made.
Knowledge Sharing and Collaborative Problem Solving in Biodiversity Informatics Andrew C. Jones Cardiff University, UK
The Species 2000 vision • To enumerate all known species of plants, animals, fungi and microbes on Earth as the baseline dataset for studies of global biodiversity • To provide a simple access point enabling users to link from Species 2000 to other data systems for all groups of organisms, using direct species-links • To enable users worldwide to verify the scientific name, status and classification of any known species through species checklist data drawn from an array of participating databases • (More recently) to provide a “synonymy server” for use as a service by other applications needing to obtain suitable scientific names, e.g. for querying biological data sets
Need for a catalogue • Suppose we wished to retrieve all locations where specimens of Caragana arborescens have been collected, from various specimen distribution databases. • A taxonomic checklist might include: Caragana arborescens Lam. [accepted name] Caragana sibirica Medikus [synonym] • Classification of organisms is based on opinion regarding • what the groups are • identification of individuals • So we need to use both these names as search terms • In practice the problem might be far worse
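The need to search under every name of a taxon can be illustrated with a minimal Python sketch. The `ChecklistEntry` class and `search_terms` function are hypothetical names introduced here for illustration; they are not part of Species 2000 or SPICE. The checklist data is the Caragana example from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChecklistEntry:
    """One taxon in a checklist: an accepted name plus its synonyms (hypothetical structure)."""
    accepted_name: str
    synonyms: List[str] = field(default_factory=list)


def search_terms(checklist, name):
    """Return every name recorded for the taxon that `name` belongs to.

    The accepted name comes first, followed by the synonyms.
    """
    for entry in checklist:
        names = [entry.accepted_name, *entry.synonyms]
        if name in names:
            return names
    # Name not in the checklist: fall back to searching it as given.
    return [name]


checklist = [
    ChecklistEntry("Caragana arborescens Lam.", ["Caragana sibirica Medikus"]),
]

# Starting from the synonym still yields both names to use as search terms.
terms = search_terms(checklist, "Caragana sibirica Medikus")
print(terms)
```

Each returned name would then be sent as a separate query to the specimen distribution databases, so that records stored under the synonym are not missed.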
SPICE for Species 2000: Meeting the Computing challenges • The SPICE for Species 2000 project aimed to: • build a federated ‘registry’ of scientific names organised by taxon (species, etc.) • accommodate GSD (Global Species Database) heterogeneity • accommodate GSD autonomy & instability • ensure scalability • Funding: • SPICE was funded by the UK BBSRC/EPSRC Bioinformatics panel • EuroCat – new EU-funded project to augment the SPICE catalogue of life & develop/maintain SPICE software
SPICE Project Staff Cardiff – Prof. Alex Gray, Dr. Andrew Jones, Prof. Nick Fiddian, Dr. Xuebiao Xu, (Mr. Nick Pittas). Object and Knowledge-based Systems Group, Department of Computer Science, Cardiff University, PO Box 916, Cardiff CF24 3XF Email: {W.A.Gray|Andrew.C.Jones|N.Fiddian|X.Xu|N.Pittas}@cs.cf.ac.uk Telephone +44 (0)29 2087 4812 Reading – Prof. Frank Bisby, Prof. Sir Ghillean Prance and Dr. Sue Brandt. Centre for Plant Diversity & Systematics, The University of Reading, Reading RG6 6AS Email: {F.A.Bisby|S.M.Brandt}@reading.ac.uk Telephone +44 (0) 118 378 6437 Southampton – Dr. Richard White and Mr. John Robinson. Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX Email: {R.J.White|J.S.Robinson}@soton.ac.uk Telephone +44 (0)23 8059 2021 Royal Botanic Gardens, Kew - Prof. Peter Crane, Dr. Don Kirkup, Ms. Sally Hinchcliffe, Mr. Graham Christian and others Natural History Museum, London - Prof. Paul Henderson, Mr. Charles Hussey and others BIOSIS UK - Mr. Michael Dadd, Ms. Judith Howcroft and others
Basic uses for the catalogue • User wishes to check taxonomy of some organisms interactively; or • User wishes to access or store data (observations, gene sequences; …) associated with a given species: • Catalogue gives information about accepted name/synonyms • Can use all names for retrieval, for example • May well want to use the accepted name provided by SPICE for storing new data.
The “standard data” • Comprises the information about a species which Species 2000 wishes to provide: • AVCNameWithRefs • SynonymWithRefs • CommonNameWithRefs • Family • Comment • Scrutiny • DataLink • Geography • Minimalistic CDM devised: • The basic information needed for a catalogue of life; • If GSD can’t be wrapped to conform, probably doesn’t contain required information
Request Types 0-5 • Again, a fairly simple set of operations is required: • Type 0: Get CDM version compliance for a GSD • Type 1: Search for a name in a GSD • Type 2: Fetch “standard data” about a chosen species • Type 3: Get information about a GSD • Type 4: Move up the taxonomic hierarchy • Type 5: Move down the taxonomic hierarchy
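As a rough illustration of how the six request types might be used together, the Python sketch below numbers them as on the slide and answers three of them from a toy in-memory database. The `RequestType` member names, the `InMemoryGSD` class and its `handle` method are assumptions made for this sketch; the real SPICE wrappers are not shown in these slides. The sample record reuses the Abrus names and identifier from the Type 1 response extract.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """Request types 0-5 as listed on the slide (member names are hypothetical)."""
    CDM_VERSION = 0          # Get CDM version compliance for a GSD
    SEARCH_NAME = 1          # Search for a name in a GSD
    FETCH_STANDARD_DATA = 2  # Fetch "standard data" about a chosen species
    GSD_INFO = 3             # Get information about a GSD
    MOVE_UP = 4              # Move up the taxonomic hierarchy
    MOVE_DOWN = 5            # Move down the taxonomic hierarchy


class InMemoryGSD:
    """Toy stand-in for a wrapped GSD; handles only types 1, 2 and 3."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # species id -> record dict

    def handle(self, request_type, argument=None):
        if request_type == RequestType.SEARCH_NAME:
            needle = argument.lower()
            # Match the search string against accepted names and synonyms alike.
            return [
                species_id
                for species_id, rec in self.records.items()
                if any(needle in n.lower() for n in [rec["accepted"], *rec["synonyms"]])
            ]
        if request_type == RequestType.FETCH_STANDARD_DATA:
            return self.records[argument]
        if request_type == RequestType.GSD_INFO:
            return {"name": self.name, "species_count": len(self.records)}
        raise NotImplementedError(f"request type {int(request_type)} is not sketched here")


gsd = InMemoryGSD(
    "toy GSD",
    {1571: {"accepted": "Abrus precatorius L.", "synonyms": ["Abrus abrus (L.) Wright"]}},
)

# Interactive use: search for a name (type 1), then fetch the chosen species (type 2).
hits = gsd.handle(RequestType.SEARCH_NAME, "abrus")
record = gsd.handle(RequestType.FETCH_STANDARD_DATA, hits[0])
print(hits, record["accepted"])
```

Keeping the operation set this small is what makes it plausible to wrap very different databases behind one interface.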
Type 1 response (XML) extract <type1result> <SPECIESNAME> <SYNONYMWITHAVC> <SYNONYM> <FULLNAME> <GENUS>Abrus</GENUS> <SPECIES>abrus</SPECIES> <AUTHORITY>(L.) Wright</AUTHORITY> </FULLNAME> <INFRASPECIFICPORTION> </INFRASPECIFICPORTION> <SYNONYMSTATUS>synonym</SYNONYMSTATUS> </SYNONYM> <AVCNAME> <FULLNAME> <GENUS>Abrus</GENUS> <SPECIES>precatorius</SPECIES> <AUTHORITY>L.</AUTHORITY> </FULLNAME> <AVCSTAT>accepted</AVCSTAT> <IDL>1571</IDL> </AVCNAME> </SYNONYMWITHAVC> </SPECIESNAME> <SPECIESNAME> …
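A client consuming such a response could extract the synonym/accepted-name pairs roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`. The sample string repeats the first `SPECIESNAME` element from the extract, with closing tags added here as an assumption so that the fragment is well-formed; the helper names `full_name` and `parse_type1` are hypothetical.

```python
import xml.etree.ElementTree as ET

# First SPECIESNAME element of the extract; the closing tags after </AVCNAME>
# are added for this sketch so that the fragment parses.
SAMPLE = """<type1result>
  <SPECIESNAME>
    <SYNONYMWITHAVC>
      <SYNONYM>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>abrus</SPECIES>
          <AUTHORITY>(L.) Wright</AUTHORITY>
        </FULLNAME>
        <INFRASPECIFICPORTION> </INFRASPECIFICPORTION>
        <SYNONYMSTATUS>synonym</SYNONYMSTATUS>
      </SYNONYM>
      <AVCNAME>
        <FULLNAME>
          <GENUS>Abrus</GENUS>
          <SPECIES>precatorius</SPECIES>
          <AUTHORITY>L.</AUTHORITY>
        </FULLNAME>
        <AVCSTAT>accepted</AVCSTAT>
        <IDL>1571</IDL>
      </AVCNAME>
    </SYNONYMWITHAVC>
  </SPECIESNAME>
</type1result>"""


def full_name(element):
    """Join GENUS, SPECIES and AUTHORITY of an element's FULLNAME child."""
    name = element.find("FULLNAME")
    parts = [name.findtext(tag, default="").strip() for tag in ("GENUS", "SPECIES", "AUTHORITY")]
    return " ".join(part for part in parts if part)


def parse_type1(xml_text):
    """Return one dict per SYNONYMWITHAVC element in a type 1 result."""
    root = ET.fromstring(xml_text)
    results = []
    for pair in root.iter("SYNONYMWITHAVC"):
        synonym = pair.find("SYNONYM")
        accepted = pair.find("AVCNAME")
        results.append({
            "synonym": full_name(synonym),
            "synonym_status": synonym.findtext("SYNONYMSTATUS"),
            "accepted": full_name(accepted),
            "accepted_status": accepted.findtext("AVCSTAT"),
            "id": accepted.findtext("IDL"),
        })
    return results


results = parse_type1(SAMPLE)
print(results[0])
```

The sketch only covers the `SYNONYMWITHAVC` case shown in the extract; other element types that a full type 1 result may contain are not handled.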
SPICE architecture User (Web browser) User (Web browser) …… CORBA User Server module (HTTP) CAS knowledge repository (taxonomic hierarchy, annual checklist, genus and other caches, ...) Common Access System (CAS) ‘Query’ co-ordinator …… Wrapper (e.g. CGI/XML + ODBC) Wrapper (e.g. JDBC) (in some cases, generic) CORBA ‘wrapper’ element of GSD Wrapper GSD GSD
Why a federation of autonomous, heterogeneous GSDs? • Taxonomists have specialist knowledge of a limited range of organisms, and want to make their data available in various ways • So • the hierarchy is divided into sectors, with an individual or group of scientists responsible for each • scientists are given control over their databases • we accommodate existing heterogeneous GSDs; also new ones built for various purposes • This helps assure taxonomic data quality (peer review of GSDs is also used)
Specialist GSDs mean better data quality than non-specialist ones … • … but data quality problems still arise: • “Non-overlapping” sectors may, in fact, overlap • GSDs may be inconsistent taxonomically • GSDs may be formed by merging two or more other databases, mutually inconsistent
LITCHI Project A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases
Project Staff Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF Frank Bisby, Sue Brandt Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS John Robinson, Richard White Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX
Summary • We modelled the knowledge integrity rules in a taxonomic treatment • The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later) • Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases
Example 1 Checklist A • Caragana arborescens Lam. [accepted name] Caragana sibirica Medikus [synonym] Checklist B • Caragana sibirica Medikus [accepted name] Caragana arborescens Lam. [synonym]
Example 2 Treatment A recognises one genus, Cytisus • Genus Cytisus: Cytisus multiflorus, Cytisus praecox, Cytisus scoparius, Cytisus striatus Treatment B recognises two genera, Cytisus and Sarothamnus • Genus Cytisus: Cytisus multiflorus, Cytisus praecox • Genus Sarothamnus: Sarothamnus scoparius, Sarothamnus striatus In the case of the species Cytisus scoparius, Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius); Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)
Example of a rule • In each of the 2 examples, merging the checklists would lead to violation of: • “A full name which is not a pro-parte name may not appear as both an accepted name and a synonym in the same checklist” • (Violations of other rules help the user to distinguish the taxonomic causes; there are various options to repair this violation) • violation :- accepted_name(N,A,C1,L,T1), synonym(N,A,C2,L,T2), (\+pro_parte(C1); \+pro_parte(C2)).
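The same rule can be mirrored in Python to show what happens when the two checklists of Example 1 are merged. This is a sketch, not LITCHI code: `NameRecord` and `find_violations` are hypothetical names, and the check follows the Prolog clause, flagging a name/author pair that appears as both accepted name and synonym unless both occurrences are pro parte.

```python
from typing import NamedTuple


class NameRecord(NamedTuple):
    """One name occurrence in a checklist (hypothetical structure)."""
    name: str
    author: str
    role: str               # "accepted" or "synonym"
    pro_parte: bool = False


def find_violations(records):
    """Return name/author pairs that are both accepted name and synonym.

    As in the Prolog clause, a pair is reported when at least one of the
    two occurrences is not pro parte.
    """
    accepted = [r for r in records if r.role == "accepted"]
    synonyms = [r for r in records if r.role == "synonym"]
    violations = set()
    for a in accepted:
        for s in synonyms:
            same_name = (a.name, a.author) == (s.name, s.author)
            if same_name and not (a.pro_parte and s.pro_parte):
                violations.add((a.name, a.author))
    return sorted(violations)


checklist_a = [
    NameRecord("Caragana arborescens", "Lam.", "accepted"),
    NameRecord("Caragana sibirica", "Medikus", "synonym"),
]
checklist_b = [
    NameRecord("Caragana sibirica", "Medikus", "accepted"),
    NameRecord("Caragana arborescens", "Lam.", "synonym"),
]

# Each checklist is consistent on its own; the conflict appears only after merging.
merged = checklist_a + checklist_b
print(find_violations(checklist_a))
print(find_violations(merged))
```

Detecting the violation is only the first step; deciding which treatment to follow remains a taxonomic judgement.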
LITCHI: current status • Good selection of rules (for botanical nomenclature) • A research project, now in need of re-engineering: • Implemented in Prolog & Visual Basic; not portable • Uses XDF file format for data import/export
Some future developments of LITCHI • BiodiversityWorld • BiodiversityWorld is not funded to develop LITCHI at all, but will be able to take advantage of LITCHI developments for ‘taxonomically intelligent navigation’ • EuroCat • Re-engineer LITCHI to work with GSDs wrapped to SPICE CDM 1.2 • Use for • Intra- and inter-GSD consistency checking • Navigation between resources organised according to differing taxonomies, e.g. for access to regional hubs • Use in conjunction with, and for generating, ‘cross-maps’
LITCHI in (future) use Checklist A Checklist B Read into system Taxonomic intelligence • Conflict detection • Conflict display • Conflict repair (not necessarily used in this context) • Rules • Conflict description • Possible repairs Write Cross-map
BiodiversityWorld • Problem solving environment for biodiversity informatics on the GRID • UK BBSRC-funded • Universities of Reading, Cardiff & Southampton, and The Natural History Museum, London
BiodiversityWorld – The Challenge: Some difficult biodiversity questions • How should conservation efforts be concentrated? • (example of Biodiversity Richness & Conservation Evaluation) • Where might a species be expected to occur, under present or predicted climatic conditions? • (example of Bioclimatic modelling and Climate Change) • Is geography a good predictor of relationship between lineages? (e.g. are the more closely related species found near each other?) • (example of Phylogenetic Analysis & Biogeography)
Some relevant resource types • Data sources: • Catalogue of life • Species Information Sources (SISs) • Species geography • Descriptive data • Specimen distribution • Geographical • Boundaries of geographical & political units • Climate surfaces • Genetic sequences • Analytic tools: • Biodiversity richness assessment – various metrics • Bioclimatic modelling – bioclimatic ‘envelope’ generation • Phylogenetic analysis (generation of phylogenetic trees)
Some challenges … • Finding the resources • Knowing how to use these heterogeneous resources • Originally constructed for various reasons • Often little thought was given to standards or interoperability • One important specific issue: using appropriate scientific name for SIS queries (hence SPICE for Species 2000)
Our vision • Biodiversity Problem Solving Environment – • Heterogeneous diverse resources • Flexible workflows • Main challenges centre around metadata, interoperability, etc; • High-performance computing secondary (though relevant) • Our previous GRAB demonstrator illustrates some Bioclimatic Modelling elements, with a fixed workflow …
Typical GRAB display Applet monitoring communication between GRAB server and GRAB databases Web browser ‘front-end’ to the GRAB server
Why the GRID for BiodiversityWorld (or even GRAB)? • HPC; mobility of data & programs • Resource discovery • OGSA (Open Grid Services Architecture) – not Globus-specific – gives Web Services & life cycle management, etc. • Workflow for orchestrating resources, etc.
BiodiversityWorld architecture Taxonomic index (SPICE Catalogue of Life) Analytic tool Analytic tool GSD GSD GSD Proxy GSD Proxy Proxy • Ontology: • Metadata • Intelligent links • Resource & Analytic tool descriptions • Maintenance tools BioD-GRID • Problem Solving Environment: • Broker agents • Facilitator agents • Presentation agents Proxy Proxy Proxy User Problem Solving Environment User Interface Thematic Data source Abiotic Data source Local tools
Bioclimatic modelling Case Study - Leucaena leucocephala • Leucaena leucocephala (Lam.) De Wit • Native of Central America • Widely introduced around the tropics • Widely utilised around the globe for: • Wood • Forage • Soil enrichment and erosion control • Regarded as an invasive weed in some areas
Workflow • Our PSE should provide flexible support for development of complex workflows for: • experimental design of in silico biodiversity-related experiments • repeatability • modification of experiments
Typical workflow START Distributed Array of GSDs Enquiry name(s) Species 2000 Catalogue of Life STAGE 1 Returns list of accepted taxa, synonyms and common names Enquiry: select ‘data’ for ‘taxon set’ Distributed array of thematic data sources STAGE 2 Return dataset composed of homologous responses from multiple thematic data sources Analytical Toolbox Reference to Abiotic datasets STAGE 3 Presentation and storage of results
Initial test workflow SPICE Climate Submit scientific name; retrieve accepted name & synonyms for species Climate surfaces Retrieve distribution maps for species of interest Localities ClimateSpace Model Model of climatic conditions where species is currently found World or regional maps Climate Base Maps Possibly different climate surfaces (e.g. predicted climate) Prediction Prediction of suitable regions for species of interest
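The stages of this test workflow can be sketched end to end in Python. Everything here is an assumption made for illustration: the function names are hypothetical, the grid cells and climate values are invented toy data, and the model is a simple rectangular envelope (the minimum and maximum of each climate variable over the known localities) standing in for whatever bioclimatic model the project actually used.

```python
def resolve_names(catalogue, name):
    """Name resolution: return the accepted name and synonyms for an enquiry name."""
    for accepted, synonyms in catalogue.items():
        if name == accepted or name in synonyms:
            return [accepted, *synonyms]
    return [name]


def collect_localities(sources, names):
    """Localities: gather records held under any of the names, from every source."""
    cells = []
    for source in sources:
        for name in names:
            cells.extend(source.get(name, []))
    return cells


def fit_envelope(climate, localities):
    """Model: min/max range of each climate variable where the species occurs."""
    temperatures = [climate[cell][0] for cell in localities]
    rainfalls = [climate[cell][1] for cell in localities]
    return (min(temperatures), max(temperatures)), (min(rainfalls), max(rainfalls))


def predict(climate, envelope):
    """Prediction: every cell whose climate falls inside the envelope."""
    (t_min, t_max), (r_min, r_max) = envelope
    return sorted(
        cell
        for cell, (temperature, rainfall) in climate.items()
        if t_min <= temperature <= t_max and r_min <= rainfall <= r_max
    )


# Toy data, invented for this sketch: cell id -> (mean temperature, annual rainfall).
climate = {"c1": (5.0, 300.0), "c2": (8.0, 450.0), "c3": (7.0, 400.0), "c4": (20.0, 1200.0)}
catalogue = {"Caragana arborescens Lam.": ["Caragana sibirica Medikus"]}
# Two specimen sources that file the same species under different names.
sources = [
    {"Caragana arborescens Lam.": ["c1"]},
    {"Caragana sibirica Medikus": ["c2"]},
]

names = resolve_names(catalogue, "Caragana sibirica Medikus")
localities = collect_localities(sources, names)
envelope = fit_envelope(climate, localities)
suitable = predict(climate, envelope)
print(suitable)
```

Passing a different climate table to `predict` than the one used in `fit_envelope` corresponds to the slide's "possibly different climate surfaces (e.g. predicted climate)". Without the name resolution step, one of the two sources would return nothing and the envelope would be fitted to a single locality.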
BiodiversityWorld – much more complex than SPICE • Much more heterogeneity • diverse kinds of databases and tools • Much greater range of data quality and terminology problems, e.g. • accuracy of “point data” • country names • …
Role/use of metadata • Descriptive • Create electronic book for user • Create workflows • necessary transformations • provenances • interoperability • Locate appropriate elements • Rerun processing (possibly with modifications)
Conclusion • The field of biodiversity informatics presents various challenges including: • taxonomic/naming • heterogeneity & autonomy • data quality • need for extensive metadata