
Knowledge Organization Systems and Information Discovery



  1. Knowledge Organization Systems and Information Discovery Douglas Tudhope Inaugural Lecture

  2. Acknowledgements Research team members and collaborators • Ceri Binding (University of Glamorgan) • Andreas Vlachidis (University of Glamorgan) • Keith May, English Heritage (EH) • Stuart Jeffrey, Julian Richards, Archaeology Data Service (ADS) Archaeology Department, University of York

  3. Collaborative acknowledgements Harith Alani Steve Harris Paul Beynon-Davies Traugott Koch Dorothee Block Marianne Lykke Daniel Cunliffe Brian Matthews Emlyn Everitt Stuart Lewis Kora Golub Hugh Mackay Rachel Heery Jim Moon Chris Jones Renato Souza Iolo Jones Carl Taylor

  4. Information Discovery • Literal string match (eg Google) is good for some kinds of searches: specific concrete topics where all we want are some relevant results - we don't care how many we miss! • Google is less good at more conceptual (re)search topics where it is important to be sure nothing important has been missed, eg medical, legal, scholarly research ------------- • Searching data and documents is a recent general research focus, variously termed eScience, Digital Humanities, Cyberinfrastructure - data.gov.uk is a recent initiative for government data

  5. Words are tricky! "When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean--neither more nor less." (Lewis Carroll) • Various potential problems with literal string search • Different words mean the same thing • The same word means different things • Trivial spelling differences, a particular choice of synonym, or a slightly different perspective in the choice of concept can all affect results - How to address this issue?

  6. This lecture • Brief look at the history of work on this topic at Glamorgan • Examples from recent AHRC funded research on cross search of different archaeological datasets and reports - try to give a general flavour • Discuss some current research issues

  7. This lecture • Part of a general move towards a (more) machine understandable Web

  8. Machine readable vs machine understandable What we say to the machine: <h1>The Cat in the Hat</h1> <ul> <li>ISBN: 0007158440</li> <li>Author: Dr. Seuss</li> <li>Publisher: Collins</li> </ul> What the machine understands: <h1>asdplubgithmys</h1> <ul> <li>jvfr: 0007158440</li> <li>vuyrok: Dr. Seuss</li> <li>Publisher: Collins</li> </ul>

  9. (More) machine understandable What we say to the machine: <h1>Title: The Cat in the Hat</h1> <ul> <li>ISBN: 0007158440</li> <li>Author: Dr. Seuss</li> <li>Publisher: Collins</li> </ul> What the machine understands: <h1>asdplubgithmys</h1> <ul> <li>jvfr: 0007158440</li> <li>vuyrok: Dr. Seuss</li> <li>Publisher: Collins</li> </ul>

  10. (More) machine understandable Book ID Author Publisher --------------- conceptual structure (ontology) What we say to the machine: <h1>Title: The Cat in the Hat</h1> <ul> <li>ISBN: 0007158440</li> <li>Author: Dr. Seuss</li> <li>Publisher: Collins</li> </ul> What the machine understands: <h1>asdplubgithmys</h1> <ul> <li>jvfr: 0007158440</li> <li>vuyrok: Dr. Seuss</li> <li>Publisher: Collins</li> </ul>

  11. (More) machine understandable Book ID Author Publisher --------------- conceptual structure (ontology) --------------- vocabularies for terminology and knowledge organization What we say to the machine: <h1>Title: The Cat in the Hat</h1> <ul> <li>ISBN: 0007158440</li> <li>Author: Dr. Seuss</li> <li>Publisher: Collins</li> </ul> What the machine understands: <h1>asdplubgithmys</h1> <ul> <li>jvfr: 0007158440</li> <li>vuyrok: Dr. Seuss</li> <li>Publisher: Collins</li> </ul> Theodor Geisel
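
To make the "conceptual structure plus vocabularies" point concrete, here is a minimal sketch (Python with rdflib) of the same book description expressed as explicit statements against shared vocabularies, so a machine can tell the ISBN from the author. The choice of schema.org and Dublin Core here is illustrative, not something the slides prescribe.

```python
# A minimal sketch: the book description as machine-understandable statements.
# Vocabulary choices (schema.org, Dublin Core) are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

SCHEMA = Namespace("http://schema.org/")
g = Graph()

book = URIRef("http://example.org/book/0007158440")    # hypothetical identifier
g.add((book, RDF.type, SCHEMA.Book))                    # "this thing is a book"
g.add((book, SCHEMA.isbn, Literal("0007158440")))       # the ID field is an ISBN
g.add((book, DC.title, Literal("The Cat in the Hat")))
g.add((book, DC.creator, Literal("Dr. Seuss")))         # a name authority could also
                                                        # relate this to "Theodor Geisel"
g.add((book, DC.publisher, Literal("Collins")))

print(g.serialize(format="turtle"))
```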

  12. Knowledge Organization Systems • Knowledge Organization Systems eg classifications, thesauri and ontologies help semantic interoperability • Reduce ambiguity by defining terms and providing synonyms • Organise concepts via semantic relationships

  13. EH Monuments Type Thesaurus Knowledge Organization Systems • Knowledge Organization Systems - classifications, thesauri and ontologies help semantic interoperability • Reduce ambiguity by defining terms and providing synonyms • Organise concepts via semantic relationships

  14. EH Monuments Type Thesaurus Knowledge Organization Systems • Knowledge Organization Systems - classifications, thesauri and ontologies help semantic interoperability • Reduce ambiguity by defining terms and providing synonyms • Organise concepts via semantic relationships
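
As an aside on how such thesaurus entries are typically represented, a minimal SKOS sketch is shown below: a concept carries a preferred label, synonyms as alternative labels, and broader/narrower links to other concepts. The URIs, labels and the broader concept chosen here are invented for illustration and are not taken from the EH thesaurus itself.

```python
# A minimal SKOS sketch of a thesaurus-style concept entry (illustrative data).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/monument-types/")   # hypothetical concept scheme
g = Graph()

g.add((EX.hearth, RDF.type, SKOS.Concept))
g.add((EX.hearth, SKOS.prefLabel, Literal("HEARTH", lang="en")))
g.add((EX.hearth, SKOS.altLabel, Literal("FIREPLACE", lang="en")))   # synonym
g.add((EX.hearth, SKOS.broader, EX.domestic_feature))                # invented hierarchy
g.add((EX.domestic_feature, SKOS.narrower, EX.hearth))

print(g.serialize(format="turtle"))
```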

  15. Origins of research Polytechnic of Wales Research Assistantship (collaborating with Paul Beynon-Davies, Chris Jones - Carl Taylor’s PhD) Experimental museum exhibit Extract of collections database - Pontypridd Historical and Cultural Centre

  16. Origins of research Polytechnic of Wales Research Assistantship (collaborating with Paul Beynon-Davies, Chris Jones - Carl Taylor’s PhD) Experimental museum exhibit Extract of collections database - Pontypridd Historical and Cultural Centre Hard to generalise and maintain if based on manual linking of information • dynamic implicit links - in this case based on the Social History and Industrial Classification (SHIC) and indexing for place and time period

  17. Indexing on subject, period, place

  18. Similar or different?

  19. FACET - Faceted Access to Cultural hEritage Terminology Subsequent EPSRC funded project with Science Museum, National Railway Museum and J. Paul Getty Trust - Art & Architecture Thesaurus (AAT) Aims: • Integration of thesaurus into user interface • Semantic query expansion

  20. FACET research question • “The major problem lies in developing a system whereby individual parts of subject headings containing multiple AAT terms are broken apart, individually exploded hierarchically, and then reintegrated to answer a query with relevance” • (Toni Petersen, AAT Director) • Example Query: mahogany, dark yellow, brocading, Edwardian, armchair • for National Railway Museum collection - eg royal carriage
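
The general idea behind such semantic query expansion can be sketched as follows: split the query into its component terms, expand each one through thesaurus relationships with weights that decay as the traversal moves further from the original concept, and score records by combining the best match for each query term. This is only a toy illustration of the principle, with an invented mini-thesaurus and weights; it is not the FACET implementation.

```python
# Toy sketch of weighted semantic query expansion (invented thesaurus and weights).
THESAURUS = {
    # concept: [(related concept, weight of the relationship), ...]
    "armchair":  [("chair", 0.9), ("easy chair", 0.95)],
    "mahogany":  [("hardwood", 0.8)],
    "brocading": [("brocade", 0.95), ("weaving", 0.7)],
}

def expand(term, depth=2, weight=1.0, seen=None):
    """Traverse thesaurus relations, multiplying weights as we move away."""
    seen = seen if seen is not None else {}
    if term in seen and seen[term] >= weight:
        return seen
    seen[term] = weight
    if depth > 0:
        for neighbour, rel_weight in THESAURUS.get(term, []):
            expand(neighbour, depth - 1, weight * rel_weight, seen)
    return seen

def match_score(query_terms, record_terms):
    """Each query term contributes its best expanded match found in the record."""
    score = 0.0
    for q in query_terms:
        expanded = expand(q)
        score += max((w for t, w in expanded.items() if t in record_terms), default=0.0)
    return score / len(query_terms)

query = ["mahogany", "brocading", "armchair"]
record = {"easy chair", "hardwood", "brocade", "Edwardian"}
print(match_score(query, record))   # a ranked partial match, not all-or-nothing
```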

  21. FACET Web Demonstrator- Semantic Query Expansion

  22. FACET Web Demonstrator- how to generalise? FACET - more sophisticated search but still a single database How to generalise to multiple datasets and thesauri? How to connect with text documents?

  23. STAR Semantic Technologies for Archaeological Resources • AHRC funded project(s) with English Heritage and the ADS Generalise previous methods to :- • Different datasets with different structures • Reports of excavations ADS OASIS Grey Literature Library (unpublished reports) Online AccesS to the Index of archaeological investigationS

  24. STAR Semantic Technologies for Archaeological Resources • Currently excavation datasets are isolated, with different terminology systems • Currently no connection with grey literature excavation reports Aims • Cross search archaeological datasets and associated grey literature at a conceptual level

  25. STAR Semantic Technologies for Archaeological Resources • Need for integrating conceptual framework and terminology control via thesauri and glossaries • EH (Keith May) designed an ontology describing the archaeological process

  26. The archaeological process • Events in the present and events in the past, related by the place in which they occur and the physical remains in that place • Activities in the present investigate the remains of the past (affecting them in the process)

  27. Events in the present Excavation // Drawing and Photography Survey // Sampling Treatments and Processing Classification // Grouping and Phasing Measuring including scientific dating Recording of observations Dissemination // Interpretation // Analysis

  28. Events in the past have results in the present • Events shaping natural environment geological, environmental and biological processes

  29. Events in the past have results in the present • Events shaping natural environment geological, environmental and biological processes • Events concerned with object production, disposal or loss (how ‘finds’ produced and later deposited in archaeological context)

  30. Events in the past have results in the present • Events shaping natural environment geological, environmental and biological processes • Events concerned with object production, disposal or loss (how ‘finds’ produced and later deposited in archaeological context) • Construction, modification and destruction events relating to human buildings

  31. Events in the past have results in the present • Conceptual framework to model these archaeological events (an EH extension of a standard cultural heritage ontology) • Need to move beyond simple Who – What – Where – When model typically used in state of the art cultural heritage databases

  32. Typical ‘Advanced Search’ model - does not deal with events Typical Who - What - Where - When advanced search user interface: Who (and/or) What (and/or) Where (and/or) When → Resources

  33. Typical ‘Advanced Search’ limitations Typical Who - What - Where - When model - needs more semantics Archaeological ‘find’ (eg coin) Archaeological ‘context’ (eg hearth) Who (and/or) What (and/or) Where (and/or) When → Resources

  34. Typical ‘Advanced Search’ limitations Need to define relationships between entities and allow multiple connections Archaeological ‘find’ (eg coin) Archaeological ‘context’ (eg hearth) When photo was taken? When ‘find’ originally made? When ‘find’ deposited? Who (and/or) What (and/or) Where (and/or) When → Resources

  35. Typical ‘Advanced Search’ limitations Assigning dates and classifying are important ‘events’ in the present - outcomes of the archaeological process (interpretations can differ) Who made dating judgment? Archaeological ‘find’ (eg coin) Archaeological ‘context’ (eg hearth) When photo was taken? When ‘find’ originally made? When ‘find’ deposited? Who (and/or) What (and/or) Where (and/or) When → Resources

  36. Broader conceptual framework (ontology) Modeling multiple interpretations – linked to underlying data within the ontology → ‘multivocality’ in archaeology [diagram: the find/context questions - Who made dating judgment? When photo was taken? When ‘find’ originally made? When ‘find’ deposited? Archaeological ‘find’ (eg coin), archaeological ‘context’ (eg hearth) - repeated for each of several alternative interpretations, alongside the Who (and/or) What (and/or) Where (and/or) When → Resources model]

  37. Broader conceptual framework (ontology) EH extension of CIDOC Conceptual Reference Model (CRM) - explicit modelling of archaeological events – complicated!
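
A rough flavour of event-centred modelling is sketched below in Python/rdflib: the coin's production, its deposition in the hearth, and the present-day dating judgement are each first-class events with their own time-spans and actors. The class and property names follow CIDOC CRM conventions (E12 Production, P4 has time-span, E13 Attribute Assignment), but the URIs, periods and identifiers are invented for illustration; the real CRM-EH extension is considerably richer.

```python
# Sketch of event-centred modelling in the spirit of the CIDOC CRM (illustrative data).
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX  = Namespace("http://example.org/excavation/")     # hypothetical excavation data

g = Graph()
coin, hearth = EX.find_coin_1, EX.context_hearth_7

production = EX.production_event_1   # when the coin was originally made (past)
deposition = EX.deposition_event_1   # when it ended up in the hearth (past)
dating     = EX.dating_judgement_1   # the interpretive act in the present

g.add((production, RDF.type, CRM.E12_Production))
g.add((production, CRM.P108_has_produced, coin))
g.add((production, CRM["P4_has_time-span"], EX.roman_period))        # invented period

g.add((deposition, RDF.type, CRM.E5_Event))
g.add((deposition, CRM.P12_occurred_in_the_presence_of, coin))
g.add((deposition, CRM["P4_has_time-span"], EX.late_roman_period))   # invented period

g.add((dating, RDF.type, CRM.E13_Attribute_Assignment))
g.add((dating, CRM.P14_carried_out_by, EX.site_specialist))          # who made the judgement

print(g.serialize(format="turtle"))
```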

  38. STAR general architecture [diagram] Components: EH thesauri and CRM ontology; archaeological datasets (CRM); grey literature indexing (CRM); STAR datasets (expressed in terms of the CRM); STAR web services; STAR client applications (Windows applications, browser components) offering full text search, browsing the concept space, navigation via expansion, and cross search of archaeological datasets

  39. Natural Language Processing (NLP) of archaeological grey literature • Extract key concepts in same semantic representation as for data. • Allows unified searching of different datasets and grey literature • in terms of same underlying conceptual structure “ditch containing prehistoric pottery dating to the Late Bronze Age”
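
The essence of this indexing step can be illustrated with a toy gazetteer matcher: spot terms from the controlled vocabularies in free text and emit annotations tied to concept identifiers, so that text and structured data share the same semantic layer. The vocabulary entries and concept identifiers below are invented, and the project's actual pipeline is far more sophisticated (phrase grammar, context, negation).

```python
# Toy illustration of gazetteer-style concept spotting in grey-literature text.
import re

GAZETTEER = {
    # surface term: (semantic type, concept identifier) - invented examples
    "ditch":           ("context/monument type", "ex:ditch"),
    "pottery":         ("find type",             "ex:pottery"),
    "late bronze age": ("period",                "ex:late_bronze_age"),
}

def annotate(text):
    lowered = text.lower()
    annotations = []
    for term, (semantic_type, concept_id) in GAZETTEER.items():
        for m in re.finditer(re.escape(term), lowered):
            annotations.append({
                "span": (m.start(), m.end()),
                "text": text[m.start():m.end()],
                "type": semantic_type,
                "concept": concept_id,
            })
    return sorted(annotations, key=lambda a: a["span"])

sentence = "ditch containing prehistoric pottery dating to the Late Bronze Age"
for a in annotate(sentence):
    print(a)
```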

  40. NLP output – what the machine sees!

  41. STAR Demonstrator – search for a conceptual pattern An Internet Archaeology publication on one of the (Silchester Roman) datasets we used in STAR discusses the finding of a coin within a hearth. -- does the same thing occur in any of the grey literature reports? Requires comparison of extracted data with NLP indexing in terms of the ontology.

  42. STAR Demonstrator – search for a conceptual pattern • Research paper reports finding a coin in a hearth – does the same pattern exist elsewhere?
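
Behind the demonstrator, a pattern like "a coin found within a hearth" amounts to a structured query over the CRM-indexed data and text. The sketch below shows roughly what such a query could look like in SPARQL (run here via rdflib); the property names and concept URIs are placeholders, since the real CRM-EH pattern involves a longer chain of event and deposition properties.

```python
# Sketch of the "coin within a hearth" pattern as a SPARQL query (placeholder vocabulary).
from rdflib import Graph

g = Graph()
# g.parse("star_extract.ttl")   # hypothetical export of datasets plus NLP annotations

PATTERN = """
PREFIX ex: <http://example.org/vocab/>

SELECT ?find ?context ?source WHERE {
  ?find    ex:has_type     ex:coin .      # placeholder typing property
  ?context ex:has_type     ex:hearth .
  ?find    ex:found_within ?context .     # stands in for the CRM-EH deposition chain
  ?context ex:reported_in  ?source .      # dataset record or grey-literature report
}
"""

for row in g.query(PATTERN):
    print(row["find"], row["context"], row["source"])
```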

  43. Current issues and goals • Apply research outcomes in practice (knowledge transfer): semantic terminology services - ‘rubbish example’ using the ADS Archaeology Image Bank • NLP challenges: negation! Negative findings? • Multivocality in archaeology: broader picture of the research issues

  44. Archaeology is rubbish! • Google search for archaeology rubbish

  45. ADS Archaeology Image Bank Example No results when searching for rubbish or refuse – what to do?

  46. STAR Semantic Terminology Services - concept expansion (as web service) → midden

  47. MIDDEN n dunghill, refuse heap midden dunghill, compost heap, refuse heap, ... muddle, mess ... dirty slovenly person ... midden mavis or midden raker --- searchers of refuse heaps (Concise Scots Dictionary - Mairi Robinson, Scottish National Dictionary Association)

  48. ADS Archaeology Image Bank Example No results when searching for rubbish or refuse – try midden!
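
The terminology service idea is that a search client asks for expansions of the user's term before querying the image bank, as in the sketch below. The endpoint URL, parameters and response shape are invented for illustration; the real STAR services define their own interface.

```python
# Sketch of a client calling a (hypothetical) concept-expansion web service.
import requests

def expand_term(term, thesaurus="monument_types"):
    resp = requests.get(
        "https://example.org/terminology/expand",   # hypothetical endpoint
        params={"term": term, "thesaurus": thesaurus, "relations": "synonyms,narrower"},
        timeout=10,
    )
    resp.raise_for_status()
    return [c["label"] for c in resp.json()["concepts"]]   # assumed response shape

# "rubbish" finds nothing directly, but expansion can surface the indexing
# term actually used, e.g. MIDDEN, which can then be searched instead.
print(expand_term("rubbish"))
```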

  49. NLP challenges – not just negation detection
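
Why negation matters: a report saying "no evidence of Roman occupation" must not be indexed as a positive finding. Real negation handling needs proper scope detection; the cue-window check below is only a toy illustration of the problem.

```python
# Toy illustration of negation cues near an extracted concept (not a real detector).
NEGATION_CUES = ("no ", "not ", "without ", "absence of ")

def is_negated(sentence, term):
    """Return True if a negation cue appears shortly before the term."""
    s = sentence.lower()
    pos = s.find(term.lower())
    if pos == -1:
        return False
    window = s[max(0, pos - 40):pos]
    return any(cue in window for cue in NEGATION_CUES)

print(is_negated("No evidence of Roman occupation was found in this trench.", "Roman occupation"))  # True
print(is_negated("The ditch contained Roman occupation debris.", "Roman occupation"))               # False
```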
