Acknowledgements

Acknowledgements • Chris Catton, BioImage Development Manager: ImageStore Ontology and SABO developer • Simon Sparks, BioImage Software Engineer: OWLBase query engine developer • John Pybus, BioImage Systems Manager • Chris Wilson, SABO research project • Chris Holland, ImageBLAST research project • Ruth Dalton, SABO research project • European Commission funding of the ORIEL Project (IST-2001-32688)

Outline of my presentation • Expert knowledge and tacit knowledge • The Semantic Web and ontologies • Ontologies in biology • The BioImage Database: its purpose, structure and ontology usage • Enabling ‘smart queries’ by importing external ontologies into BioImage • ImageBLAST: hypersearches across distributed biological databases • Concluding remarks and cautionary tales

This is a fairly straightforward article, but nowhere in it are you told that: • Caenorhabditis elegans is a nematode worm, one of the handful of model organisms for which the complete genome has been sequenced, or that • A transcription factor binds to nuclear DNA to control the readout of genetic information from a particular gene • These facts are so basic to the paper that they are assumed

Expert knowledge and tacit knowledge • Mutual understanding within any field of knowledge is based on a shared conceptualisation developed by scholars over the years • This shared conceptualisation is often implicit in scholars’ choice of vocabulary and theories when speaking or writing • Furthermore, in order to communicate at the highest level (as in the Nature paper), scholars must assume that those listening to or reading their words are part of this community and share the conceptualisation • Much of what is communicated in a paper or an academic lecture is first a reinforcement and then an extension of the shared tacit knowledge • It is this assumed tacit knowledge, every bit as much as the technical jargon, that makes scientific literature so impenetrable to non-specialists • My next few slides are designed to make explicit some of the key points relating to ontologies, for the benefit of those for whom this may be new

Electronic communication of complex knowledge • In human society, much of our knowledge is implicit or tacit: we know more than we think we know! • However, today, as more and more knowledge is held on-line, more and more communication needs to be machine-to-machine (M2M), from one computer to another • To accomplish such communication successfully, and to permit semantic reasoning over distributed information resources • such tacit knowledge must be made explicit, and • the meaning of information must be specified unambiguously • This is difficult, and demands meticulous attention to detail • The next slide illustrates what I mean . . .

This is not a panda • This is not a photograph of a panda • This is not even a projected digital image of a photograph of a panda • What is this? This is a caption for a projected digital image of a photograph of a panda

In biology, meanings may be complex • In normal conversation, “daughter” means a female human child conceived by sexual intercourse between mother and father, and then born after a gestation of nine months within the mother’s uterus • In non-mammalian animal species, development is usually from eggs • But sex is not always required: female aphids can give birth to daughters by parthenogenesis, without the need for fertilization of the eggs by male sperm • And in the field of cell biology, the word “daughter” has an entirely separate meaning: two genetically identical “daughter cells” are produced every time a single cell divides • Biological ontologies thus have to capture the context in which the word “daughter” is used, in order to apply the correct meaning

What is the Semantic Web, and how can it help? • The concept of the Semantic Web was first clearly articulated in 2001 in an eponymous Scientific American article by Tim Berners-Lee, Jim Hendler and Ora Lassila • While the World Wide Web permits access to data in human-readable form, the Semantic Web provides access to information structured in a formal logical manner, such that computers can reason over it, extracting meaning • It involves three technologies, each resting hierarchically on the previous one: • The use of XML as a markup language more expressive than HTML • RDF triples that permit one to make simple logical statements (subject-verb-object) written in XML, in a form that a computer can understand • The use of ontologies – formal representations of a particular domain of knowledge (e.g. the GO ontology about genes and gene products) – written in a high level ontology language such as OWL (W3C’s Web Ontology Language), which is itself expressed as a set of RDF statements

RDF triples • An RDF triple might state that a mouse is_a mammal, informing the computer that an entity ‘mouse’ is included in the more general category of ‘mammal’ • This has the advantage that mouse inherits all class properties previously defined for mammal, such as the possession of four legs and fur • By using several RDF triples referring to the same subject, multiple attributes can be defined: • Subject (Entity) = Mouse (class) / This mouse (instance) • Property (Attribute) = is_a / has_location / has_identifier • Object (Value) = Mammal / Oxford / 667 • In RDF, the statement “This mouse is located in Oxford” is simply (namespace declarations omitted): <rdf:RDF> <rdf:Description rdf:about="Mouse"> <Location>Oxford</Location> </rdf:Description> </rdf:RDF>
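The inheritance idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how RDF tooling works internally: triples are plain tuples, and the names `mouse_667`, `instance_of`, `has_legs` and `has_covering` are assumptions made up for the example.

```python
# Triples are (subject, property, object) tuples; the data mirror the slide.
# The identifiers mouse_667, instance_of, has_legs and has_covering are
# hypothetical names chosen for this sketch.
TRIPLES = {
    ("Mouse", "is_a", "Mammal"),
    ("Mammal", "has_legs", "4"),
    ("Mammal", "has_covering", "fur"),
    ("mouse_667", "instance_of", "Mouse"),
    ("mouse_667", "has_location", "Oxford"),
    ("mouse_667", "has_identifier", "667"),
}


def objects(subject, prop):
    """All objects of triples with the given subject and property."""
    return {o for s, p, o in TRIPLES if s == subject and p == prop}


def ancestors(cls):
    """All classes reachable from cls by following is_a upwards."""
    found = []
    frontier = [cls]
    while frontier:
        for parent in objects(frontier.pop(), "is_a"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


def properties_of(instance):
    """Own properties plus those inherited from the class and its ancestors."""
    classes = []
    for cls in objects(instance, "instance_of"):
        classes += [cls] + ancestors(cls)
    result = {}
    for source in [instance] + classes:
        for s, p, o in TRIPLES:
            if s == source and p not in ("is_a", "instance_of"):
                result.setdefault(p, o)  # more specific sources win
    return result
```

Calling `properties_of("mouse_667")` collects the location and identifier stated for the instance together with the leg count and fur inherited from Mammal.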

What type of animal is shown in this image? • Ailuropoda melanoleuca • German taxonomists claimed it was a bear • British taxonomists claimed it was a racoon • US taxonomists weren’t quite sure

Today, the balance of opinion is “bear” • So what is an ontology? “An ontology is a formal explicit specification of a shared conceptualisation” • The role of an ontology is to facilitate the understanding, sharing, re-use and integration of knowledge through the construction of an explicit domain model • A panda is only a bear because we all now say it is!

We understand taxonomic hierarchies • Mouse is_a Rodent is_a Mammal is_a Vertebrate is_a Animal • In an ontology, one can express more complex relationships about a mouse, other than just its taxonomy

A partial ontology of ‘mouse’ (diagram) • Colony is_a Group of organisms • Mouse member_of Colony • Mouse has_species_name Mus musculus • Mouse has_ID 667 • Mouse has_mode_of_locomotion Running • Running is_a Locomotion • Running hypothesised_function Escape • Leg proper_part_of Mouse (has_cardinality: 4) (has_position: front / rear) (has_handedness: left / right) (has_length: number) • Leg used_for Locomotion • Fur proper_part_of Mouse (default_colour: white) (has_length: number unit) (has_density: number per unit area)

How do you build an ontology? • You need to define all the terms within a domain of knowledge, and specify the relationships they have to one another • The structure of these relationships is a Directed Acyclic Graph, in which child terms can have more than one parent • The relationships of a child term to its two (or more) parent terms can be different, as shown in the previous example: • mouse is_a rodent – type relationship • mouse member_of colony – collective relationship
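As a rough sketch of the structure described above, the relationships can be held as labelled edges, with a check that the graph really is acyclic. The edge list reuses terms from the earlier slides; the function names and the lower-case identifiers are assumptions for illustration only.

```python
from collections import defaultdict

# (child, relationship, parent) edges; a term may have several parents,
# each reached through a different kind of relationship.
EDGES = [
    ("mouse", "is_a", "rodent"),
    ("mouse", "member_of", "colony"),
    ("rodent", "is_a", "mammal"),
    ("colony", "is_a", "group_of_organisms"),
]


def parents(term, edges=EDGES):
    """The (relationship, parent) pairs recorded for a term."""
    return {(rel, parent) for child, rel, parent in edges if child == term}


def is_acyclic(edges):
    """Depth-first search; False if a path returns to a term it is still visiting."""
    graph = defaultdict(list)
    for child, _rel, parent in edges:
        graph[child].append(parent)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return False
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in list(graph) if colour[n] == WHITE)
```

Here `parents("mouse")` yields both the type relationship and the collective relationship, and adding an edge such as `("mammal", "is_a", "mouse")` would make `is_acyclic` return `False`.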

The thinking crow problem To properly annotate videos of Betty, we need to be able to structure not only people’s interpretations of the world, but also Betty’s view of what is going on!

Biological ontologies • There is good ontological coverage of the genes and gene products of model organisms in the form of the Gene Ontology (http://www.geneontology.org) • But until very recently little work had been done at the other end of the biological spectrum, in the field of animal behaviour • However, my department is full of people undertaking whole animal biology • To be able to include their images and videos within the BioImage Database, we decided to develop a draft standard animal behaviour ontology, SABO • SABO is an upper level ontology designed to cover all of animal behaviour, built around Niko Tinbergen’s four questions: “How does it work? How did it develop? How is it used? and How did it evolve?” • Because interpretations of behavioural events can be very subjective, we have been careful to separate fact from hypothesis in the design of SABO, with emphasis on the authority for any claims

Fact and hypothesis in SABO

For example, a courtship event

Courtship behaviour in ducks • Male mallard ducks attract their mates using a “grunt-whistle”, which Konrad Lorenz hypothesised in 1941 was derived from body shaking • Using the SABO ontology, this can be recorded in the following RDF triples: • Grunt-Whistle (a type of courtship behaviour) generates hypothesis Hypothesis About Evolutionary Origin (an ontology class) • Hypothesis About Evolutionary Origin hypothesised evolutionary origin Body Shaking (a type of behaviour) • Hypothesis About Evolutionary Origin has author “Lorenz, Konrad” (instance data) • Hypothesis About Evolutionary Origin has date “1941” (instance data)
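The four triples above can be written down and queried as follows. This is a minimal sketch, not SABO itself: the underscored property names and the node identifier `hyp1` (standing for one instance of the Hypothesis About Evolutionary Origin class) are assumptions introduced for the example.

```python
# The slide's four triples, plus two is_a statements for context.
# Property names with underscores and the node "hyp1" are hypothetical.
TRIPLES = [
    ("Grunt-Whistle", "is_a", "Courtship Behaviour"),
    ("Grunt-Whistle", "generates_hypothesis", "hyp1"),
    ("hyp1", "is_a", "Hypothesis About Evolutionary Origin"),
    ("hyp1", "hypothesised_evolutionary_origin", "Body Shaking"),
    ("hyp1", "has_author", "Lorenz, Konrad"),
    ("hyp1", "has_date", "1941"),
]


def value(subject, prop):
    """First object recorded for (subject, prop), or None."""
    for s, p, o in TRIPLES:
        if s == subject and p == prop:
            return o
    return None


def hypotheses_about(behaviour):
    """Each hypothesis attached to a behaviour, with the authority for it."""
    out = []
    for s, p, o in TRIPLES:
        if s == behaviour and p == "generates_hypothesis":
            out.append({
                "claim": value(o, "hypothesised_evolutionary_origin"),
                "author": value(o, "has_author"),
                "date": value(o, "has_date"),
            })
    return out
```

Because the hypothesis is a node in its own right, the claim and the authority for it stay separate from the observed behaviour, which is the point of the design.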

The Ethodata Ontology • SABO was used as one of the two starting points for a recent Animal Behaviour Metadata Workshop held at Cornell University, at which leading international ethologists worked together to create an Animal Behavior Metadata Standard • Our introduction of formal ontologies to this community was greatly helped by the fact that Chris Wilson, who had worked with us on SABO, recently started a Ph.D. at Cornell with Jack Bradbury, the workshop organiser • The Workshop output is a human-readable hierarchy of defined ethological terms, the draft Animal Behavior Metadata Standard (ethodata.comm.nsdl.org) • The Workshop has commissioned us to develop this hierarchy into a fully-fledged computable ontology of animal behaviour, for the benefit of the whole ethological community • Based on the draft Animal Behavior Metadata Standard and on SABO, and written in OWL, this has the new agreed name of the Ethodata Ontology • We have already made a start on this work, and will use it to enter structured ethological image metadata into the BioImage Database

A view of the BioImage home page structure • www.bioimage.org • Note the hierarchical browse categories and the alternative Browse / Search arrangement

The BioImage Database Project • The value of digital image information depends upon how easily it can be located, searched for relevance, and retrieved • Detailed descriptive metadata about the images are essential, and without them, digital image repositories become little more than meaningless and costly data graveyards • The BioImage Database aims to provide a searchable database of high-quality multidimensional research images of biological specimens, both ‘raw’ and processed, with detailed supporting metadata concerning: • the biological specimen itself • the experimental procedure • details of image formation and subsequent digital processing • the people, institutions and funding agencies involved • the curation and provenance of the image and its metadata • to provide rich and accurate search results to queries over our data • and to integrate such multi-dimensional digital image data with other life science resources by providing links to literature and ‘factual’ databases

The organisation within BioImage • The basic unit of organisation within the BioImage Database is the BioImage Study, roughly equivalent to a scientific publication • A BioImage Study will contain one or more Image Sets, each corresponding to a particular scientific experiment or investigation • Each Image Set will contain one or more Images on a common theme • Such an Image may be of any form or dimensionality • a 2D image, a 3D image, a video, or a 4D (x, y, z, time) image set • Users may browse or search the BioImage Database • by Study, by Image Set or by Image • For each representation, a thumbnail representative image and core metadata of the results (title, authors, description, LSID) are initially presented, and deeper metadata is available by clicking the title • Browses and searches may then be progressively refined
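The Study / Image Set / Image containment described above might be modelled like this. It is a sketch under assumptions: the class and field names are invented for the example and are not the BioImage schema, and the LSID shown is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Image:
    title: str
    dimensionality: str  # "2D", "3D", "video" or "4D"


@dataclass
class ImageSet:
    title: str
    images: List[Image] = field(default_factory=list)


@dataclass
class Study:
    title: str
    authors: List[str]
    description: str
    lsid: str
    image_sets: List[ImageSet] = field(default_factory=list)

    def core_metadata(self) -> Dict[str, object]:
        """The fields shown first in a results list."""
        return {"title": self.title, "authors": self.authors,
                "description": self.description, "lsid": self.lsid}

    def images(self) -> List[Image]:
        """Every image in the study, across all of its image sets."""
        return [img for s in self.image_sets for img in s.images]


study = Study(
    title="Example study",
    authors=["A. Researcher"],
    description="Placeholder description",
    lsid="urn:lsid:example.org:study:1",  # hypothetical identifier
    image_sets=[ImageSet("Experiment 1",
                         [Image("Section", "2D"), Image("Time series", "4D")])],
)
```

Browsing by Study, by Image Set or by Image then corresponds to walking this hierarchy at different depths.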

The basic BioImage metadata model (diagram) • Preparation: Cell or organism; Researcher; Experimental study conditions or manipulations; giving the Subject or specimen • Image capture: Subject or specimen; Photographer or microscopist; Camera or microscope, illumination, focus, etc.; giving Image sets of multidimensional images, including videos • So people are related to objects and conditions / equipment through events

The structure of the BioImage Database

Things to note about the architecture: external • User submission, searching and browsing activities are all mediated by the ImageStore Ontology • Submission forms are generated dynamically from the ontology, to suit the type of submission • Thus, for instance, people submitting light microscopy images are not asked for the accelerating voltage of their electron microscope • There is complete separation of content from presentation • Presentation to users is via HTML, while SOAP is used to communicate with Web Service clients • The Struts controller orchestrates data transfer between the system and the user • This permits simple customization of the appearance of the data
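The idea of generating submission forms from the ontology can be illustrated with a toy class hierarchy. This is not the ImageStore Ontology: the class names, field names and the dictionary layout are assumptions chosen to reproduce the light versus electron microscopy example.

```python
# A toy stand-in for an ontology: each class names its parent and the
# metadata fields it introduces. All names here are hypothetical.
ONTOLOGY = {
    "Microscopy": {"parent": None, "fields": ["specimen", "magnification"]},
    "LightMicroscopy": {"parent": "Microscopy", "fields": ["illumination"]},
    "ElectronMicroscopy": {"parent": "Microscopy",
                           "fields": ["accelerating_voltage_kV"]},
}


def form_fields(cls):
    """Fields for a submission form: inherited ones first, then the class's own."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = ONTOLOGY[cls]["parent"]
    fields = []
    for name in reversed(chain):
        fields.extend(ONTOLOGY[name]["fields"])
    return fields
```

With this layout, `form_fields("LightMicroscopy")` never includes the accelerating voltage, and adding a field to the ontology changes the form without touching the form code.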

Multilingual capabilities enabled by Struts • Achieved simply by re-setting the default language of the user’s browser • This shows the Access Control Interface • The same HTML page is being viewed in both cases, using alternate resource bundles

Things to note about the architecture: internal • Data are exchanged within the system in XML format, using the BioImage schema • There is no hard-coded ‘business logic’ - structures and semantics are generated at run time • The ImageStore Ontology is the central data model • This single point of control greatly simplifies database maintenance, since changes are automatically and dynamically propagated throughout the system • The entire BioImage database structure can be automatically regenerated from the ImageStore Ontology whenever this is required (for example in a new form after updating the ImageStore Ontology), using metadata from a previous XML dump • This allows easy migration to a new DBMS, e.g. from PostgreSQL to Oracle • OWLBase is used to reference the ontology and to mediate data transfers • OWLBase thus provides an abstraction layer for submissions and queries

The ImageStore Ontology • The ImageStore Ontology was constructed using the Jena toolkit (www.hpl.hp.com/semweb) and our own open source Ontology Organiser, an ontology constraint propagator and datatype manager • ImageStore: • uses a subset of the class model of the Advanced Authoring Format (sourceforge.net/projects/aaf and www.aafassociation.org) to describe media objects • uses a subset of MPEG-7 to describe multimedia content, and • has its own data model to describe scientific experiments • It is currently written in DAML+OIL • We are in the process of upgrading BioImage to use Jena 2, which will permit us to convert the ImageStore Ontology into OWL

What is required of an image ontology? • Such a generic image ontology as the ImageStore Ontology must describe all aspects of the images themselves: • their acquisition (including details of who took the original micrograph, where, when, under what conditions, for what purpose, etc.) • the media object itself (source and derivation, image type, dynamic range, resolution, format, codec, etc.) • the denotation of the referent (a description of exactly what is recorded by the image, e.g. the nature, age and pre-treatment of the subject), and • the connotation of the referent (i.e. the interpretation, meaning, purpose or significance imparted to the image by a human, its relevance to its creator and others, and its semantic relationship to other images). • In addition to these ancillary metadata about the image, there is yet a further need to record semantic content metadata related directly to the information content of the images or videos themselves • These semantic content metadata carry very high information value, since they relate directly to spatial (or spatio-temporal) features that are of most immediate relevance to human understanding of media content, namely “Where, when and why is what happening to whom?”

Image description – separating fact from hypothesis • BioImage Study title: Xklp1: a Xenopus kinesin-like protein essential for spindle organisation and chromosome positioning • Denotation (raw fact): Immunofluorescence localization of Xklp1 in XL177 cells • Connotation (interpretation): Xklp1 is involved in chromosome localization during mitosis in embryonic Xenopus cells, since it is positioned at the metaphase plate • Source: Vernos et al., 1995

Representing fact and hypothesis within ImageStore • Narrative world • Media world • Real world

The BioImage advanced search interface • The Advanced Search Interface permits Boolean searches, search restrictions, and re-use of previous searches in combination with new terms

Automated SQL query generation

Stage one: user inputs a query “Find images of bears”

Stage two: the ontology reasons over the request

Stage three: OWLBase converts the request to SQL

Stage four: metadata is retrieved from the database

Stage five: metadata is returned to OWLBase as XML

In summary: • Queries are made by our ontology-driven database query engine, OWLBase • OWLBase passes a query via the ImageStore ontology to the underlying PostgreSQL metadata relational database • The database returns metadata of studies matching the search term: • authors • title • description • network locator (URI) for the representative thumbnail image • IDs of all the component datasets and images • These XML data are then used to populate the HTML Study Results Web page that is displayed to the user • Many of these items link to deeper metadata • If the user now clicks on one of the nodes linking to deeper metadata, a new OWLBase query is initiated that returns information about that component
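The first four stages summarised above can be imitated end to end in a short sketch. It is not OWLBase: the subclass table, the `study` table and its columns are hypothetical, and Python's built-in sqlite3 stands in for the PostgreSQL metadata database.

```python
import sqlite3

# Hypothetical fragment of a taxonomy: each term lists its direct subclasses.
SUBCLASSES = {
    "bear": ["giant panda", "polar bear"],
    "giant panda": [],
    "polar bear": [],
}


def expand(term):
    """Stage two: the term itself plus everything the ontology places beneath it."""
    terms = [term]
    for child in SUBCLASSES.get(term, []):
        terms.extend(expand(child))
    return terms


def build_sql(terms):
    """Stage three: a parameterised SQL query over the expanded terms."""
    placeholders = ", ".join("?" for _ in terms)
    return ("SELECT title, subject FROM study "
            f"WHERE subject IN ({placeholders}) ORDER BY title")


def search(conn, term):
    """Stages one to four: query in, matching study metadata out."""
    terms = expand(term)
    return conn.execute(build_sql(terms), terms).fetchall()


# An in-memory database with a made-up table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study (title TEXT, subject TEXT)")
conn.executemany("INSERT INTO study VALUES (?, ?)", [
    ("Study A", "giant panda"),
    ("Study B", "mouse"),
    ("Study C", "polar bear"),
])
```

A call such as `search(conn, "bear")` finds the panda and polar bear studies even though neither record contains the word "bear"; the final stage, returning the rows as XML, is left out of the sketch.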

Search result, showing Studies

What’s so special? • For each query, OWLBase builds in memory an RDF ‘knowledge graph’ representing the structure of the components of each of the matching studies • As the user clicks on nodes linking to deeper metadata, each new OWLBase query return is used to extend the RDF graph of the resource • In this way, the in-memory representation of the relevant metadata is built up dynamically and incrementally, as required • At present, this would not seem to provide much additional functionality over and above a conventional relational database SQL query system • However, the fact that the searches use the ImageStore Ontology and build up an OWLBase RDF graph opens the possibility of three novel advances: • Use of external third-party ontologies • Smart queries within the BioImage Database and • Hypersearches across distributed resources

‘People’ metadata within BioImage • People have attributes: • First and last names, dates of birth, addresses, phone numbers, etc. • People have various affiliations: • Current membership of an institution, e.g. a university • Former membership of another institution, e.g. undertook the research while a postdoc there • Simultaneous membership of a third organisation, e.g. an international research project partnership • People have grants: • “The work in this BioImage Study was funded by BBSRC” • People may have different roles within a BioImage Study: • This person planned the study – Principal investigator • That person prepared the specimen – Technician • A third person undertook the electron microscopy – Postdoc • Together they wrote the Nature paper – Authors

Use of external ontologies • Because all BioImage queries are passed through the ImageStore ontology, and because ImageStore can be extended using external third-party ontologies, we have the possibility of using such external ontologies to enhance BioImage searches • In its simplest form, this can just be used to simplify metadata submission • For example, an organisation such as a pharmaceutical company might choose to use an instance of the BioImage Database System internally, behind its own firewall, for the organization of its own confidential research images • If that company already had an ontology-controlled database of all its employees’ details, there would be no need to re-enter those metadata for each image these people wished to record – all that would be required would be to link the BioImage Database System to the employee records ontology • But external ontologies can do much more for us . . .

Using external biological ontologies within BioImage • Biological content can be described using external ontologies – currently • the GO ontology (www.geneontology.org) for genes and gene products, and • the NCBI taxonomy (www.ncbi.nlm.nih.gov/Taxonomy) to identify species • and soon others will also be used, e.g. the Ethodata Ontology • We have already implemented the display of an interactive taxonomic hierarchy that permits the user to browse by narrowing or broadening the scope of the results displayed after a query, by clicking at different points in the taxonomy • Thus the images of specimens derived from all rodents can be refined to show only those from mice, or broadened to show all mammalian images • Similar modification of other parameters is also possible • For instance from confocal fluorescence images to real-time confocal images or to all fluorescence images (these relationships being structured within the ImageStore Ontology) • At present we can use third party ontologies only if we pre-import them • We wish now to extend this functionality by creating dynamic access to external ontologies that are published in XML on the Web, thus ensuring that we always access the most recent version

Smart queries within the BioImage Database • We propose next to use external ontologies to undertake semantically rich searches of the BioImage Database that can handle • synonyms (‘mouse’ and ‘Mus musculus’) • hierarchies (‘rodent’ and ‘mammal’) • exclusions (not a computer mouse) • and related terms (‘laboratory animal’ and ‘model species’) • This would free us from conventional ‘Google-like’ searching by exact keyword matching, the results of which are rather unpredictable! • We do not yet know how this Semantic Web approach to database querying will scale with increasing database size, and we will need to undertake comparative research after implementing it
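A small sketch shows how synonyms, hierarchies and exclusions could change search results compared with exact keyword matching. Everything here is an assumption for illustration: the lookup tables, the records, and the use of a `context` tag to exclude the computer mouse. Related terms are not covered.

```python
# Hypothetical lookup tables standing in for external ontologies.
SYNONYMS = {"mus musculus": "mouse"}
PARENT = {"mouse": "rodent", "rat": "rodent", "rodent": "mammal"}

# Made-up records; "context" is an assumed tag used to exclude homonyms.
RECORDS = [
    {"id": 1, "subject": "Mus musculus", "context": "biology"},
    {"id": 2, "subject": "rat", "context": "biology"},
    {"id": 3, "subject": "mouse", "context": "computing"},
]


def canonical(term):
    """Map a term to its preferred name (synonym handling)."""
    t = term.lower()
    return SYNONYMS.get(t, t)


def is_under(term, ancestor):
    """True if term equals ancestor or lies below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def smart_search(query, context="biology"):
    """IDs of records in the given context whose subject falls under the query term."""
    target = canonical(query)
    return [r["id"] for r in RECORDS
            if r["context"] == context
            and is_under(canonical(r["subject"]), target)]
```

Exact matching on the string "mouse" would return only record 3, the computer mouse, whereas `smart_search("mouse")` returns record 1 and `smart_search("rodent")` returns records 1 and 2.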

Hypersearches of distributed information sources • At present, the BioImage Database gives users the straightforward capability of linking out from a BioImage study, dataset or image via standard Web hyperlinks to relevant material elsewhere on the Web • For example, the Advanced Search Interface enables users to enter BioImage queries of the type: “Retrieve all images of Drosophila testes showing expression of the gene always early (aly)”, and then enables users to link out from these BioImage studies both to the gene sequences and to literature publications of relevance • What we cannot do at present, however, is to send complex queries across a set of databases, of the type: “Retrieve images of whole Drosophila, Xenopus and mouse embryos showing the comparative neural expression of the most anterior of their Hox genes at different developmental stages, and show me these gene sequences aligned to maximise homology” • We wish to investigate how to undertake complex integrated ‘hypersearches’ simultaneously over the BioImage Database and relevant ontology-enabled and Web Services-enabled sequence, structural and literature databases

How to implement hypersearches • The conventional way to search across disparate databases would be to map their schemas onto some common system, and then use that to distribute a query across them in a manner that each database can understand • Our approach is somewhat different, and relies on the fact that OWLBase dynamically builds up an RDF representation of the information space of interest, and that external ontologies can be integrated with ImageStore • Specifically, we plan to import relevant sub-graphs from published external ontologies (i.e. class data rather than instance data) dynamically into the RDF graph being built up within OWLBase during each query • We will then use this extended graph to structure the hypersearches, by providing ‘internal’ knowledge about the structure of external databases • OWLBase will thus act as more than just a query engine: it will build dynamic graphs of relationships between entities within BioImage and entities outside it, and then run queries over that larger graph
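The planned import of external sub-graphs could look roughly like the following. This is a speculative sketch of the idea only, since the slide describes future work: the triples, property names and the `study_1` node are all invented for the example.

```python
# Hypothetical class-level triples published by an external ontology.
EXTERNAL = {
    ("Hox gene", "is_a", "gene"),
    ("gene", "described_in", "sequence database"),
    ("Hox gene", "expressed_in", "embryo"),
    ("protein", "is_a", "gene product"),
}


def subgraph(triples, start):
    """Triples reachable from start by following subject -> object links."""
    selected, frontier, seen = set(), [start], {start}
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if s == node:
                selected.add((s, p, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append(o)
    return selected


def extend(graph, external, term):
    """Merge only the relevant part of the external ontology into the query graph."""
    return graph | subgraph(external, term)


# The graph built so far for one query (made-up instance data).
query_graph = {("study_1", "shows_expression_of", "Hox gene")}
extended = extend(query_graph, EXTERNAL, "Hox gene")
```

Only the triples connected to "Hox gene" are pulled in, so the unrelated statement about proteins stays outside the query graph; the extended graph now records that genes are described in a sequence database, which is the kind of ‘internal’ knowledge a hypersearch could use to decide where to send the next part of a query.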
