
Thesaurus Building

Thesaurus Building. Martin Doerr. Center for Cultural Informatics Institute of Computer Science Foundation for Research and Technology - Hellas. Athens June 17, 2013. Overview. Motivation and Definitions Words, Terms and Concepts Knowledge Organisation Systems Thesaurus structure


Presentation Transcript


  1. Thesaurus Building Martin Doerr Center for Cultural Informatics Institute of Computer Science Foundation for Research and Technology - Hellas Athens June 17, 2013

  2. Overview • Motivation and Definitions • Words, Terms and Concepts • Knowledge Organisation Systems • Thesaurus structure • Thesaurus construction • Examples

  3. Motivation • The simple idea of standardising expressions of classification for better communication: • Results in encyclopedias and knowledge bases; touches language engineering and cognitive science. • Becomes a major issue of electronic communication and information access. • Questions: Is this item well characterized by this term? • Would every expert expect to find this object under this same term? • If not, would such terms be variants of the same concept? • Is there a unique answer to “what is this”? • Does this database contain descriptions of things falling under this term? • Historically: Roget’s Thesaurus to assist writers with better words…

  4. Words, Terms, Concepts • Words • Constituents of natural languages. Categorical meaning, in contrast to “proper names”. Multiple senses depend on context. (Example: “order”) • Term • Constituent of expert language. A word with a specific (categorical) meaning, either defined in a scientific document or common to an expert group and discipline. (Example: “hepatitis A”) • Concept • A class or set of items grouped together on the basis of some implicit or explicit criterion or rule. The criterion can be unconscious or even innate! (Example: “δημόσιος υπάλληλος”, Greek for “civil servant”). • A concept is not a term and not a language element!

  5. Functions of Terminology • Unambiguous scientific expression • Use in expert discussions, expert opinions (diagnoses!) and scientific publication. Defined in disciplinary dictionaries. • Research • Defined ad hoc to discriminate items in a research project (archaeology!). Infer function from form, provenance from form, etc. • Data search • Find all items (publications, objects etc.) possibly relevant for my research question. • Unfortunately, each function needs a different approach!

  6. From Words to Concepts • Terms are created by • selecting or inventing a word, often a compound (“black-figure pottery”) • fixing an expert group (“classical archaeologists”), • fixing a scientific context (“antique Greek vases”) • A term alone makes no sense (“registration”) • A concept is detected • as the sense of a term, or one sense of a word, or the use of words in a text • by analyzing context-specific use (written definitions, interviews, dialogues). • A concept may be created (for the first time) by expressing/writing definitions. • A concept is formally identified • by assigning an identifier to a description (“definition”) sufficient to clarify its meaning and disambiguate it from other concepts.

  7. From Words to Concepts • Understanding • Comes from disambiguating the concepts (senses) behind words (terms) in a context. • This can be unconscious, • conscious by context analysis, • or achieved by asking for clarification (dialogue) • Databases and database records are contexts • Therefore humans can understand a word in a data field • Computers do not understand senses • Therefore machines cannot relate (retrieve) records by common sense • Therefore senses must be identified to machines as entities • and be related to terms

  8. Classification • Concepts are many, words are few • LCSH: 500.000 concepts, only general subjects; millions in our mind; UMLS: over 5.000.000 concepts. • Words: some 60.000 in our mind, some 400.000 in a language, some 30.000 in a typical dictionary. • A typical thesaurus: 10.000 to 100.000 concepts • One word may have some dozens of meanings (referred concepts) • One concept may be referred to by several words or terms • Terms are noun phrases, composed of words • Concepts are used to classify things in texts and database records, either by referring to words, terms or concept identifiers.

  9. Purpose of Classification 1 • Organise a Universe of Discourse by concepts for cognition and comprehension • recognition of discriminant attributes, attribute distribution • for generalisation of observation • for inferences from evidence to cause • exclusive, avoiding “mixed forms”, prototypical, selective on reality • Communication of conceptualisation • presentation of a domain of discourse • help for exploration of a topic • descriptive, rich, detailed, fuzzy, “cautious”, incomplete

  10. Purpose of Classification 2 • Determination of items in an automated communication process • widely agreed-on naming for kinds of objects we share in a cultural space • e.g. artefact, kris (Malayan), • analogous or constructive classification of kinds of objects out of our cultural space with terms from our space • e.g. knife, dagger = puukko (Finnish) • information seeking by constraining attribute values • e.g. weapons, 18th century, south-east Asia • Surrogate role, poor, binary, standardised, comprehensive, recall-oriented rather than detailed. • For electronic communication, prescribe few, mandatory high-level terms; in data records, also refer to all good expert terms. • From here on, we only talk about this function.

  11. Knowledge Organisation Systems • For electronic communication • Organize terms, concepts and their relationships into • digital (machine-readable) dictionaries for human comprehension • such that machines can make inferences humans would approve. • Such inferences are • identity (get all cats by “cat”) • generalization (get “cats” by “felines”) • related terms (get Heraklion by “Candia”, get “bridge construction” by “bridges”, get Heraklion by “Crete”) • We call these KOS • E.g., LCSH, AAT, GeoNames, term lists, ULAN…
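
The three kinds of inference named on slide 11 can be sketched as a small query expansion. This is only an illustrative sketch under stated assumptions: the dictionary layout, the function name `expand_query` and all concept identifiers are invented here, and treating “Candia” as a second label of the Heraklion concept is a modelling choice of this example, not something the slides prescribe.

```python
# Hypothetical miniature KOS; identifiers and layout are assumptions for illustration.
LABELS = {  # term -> concept id (identity: every label leads to one concept)
    "cat": "c:cat", "cats": "c:cat",
    "felines": "c:feline",
    "Heraklion": "c:heraklion", "Candia": "c:heraklion",
    "Crete": "c:crete",
    "bridges": "c:bridges",
}
NARROWER = {"c:feline": {"c:cat"}}  # generalization links (broader -> narrower)
RELATED = {                         # associative links
    "c:crete": {"c:heraklion"},
    "c:bridges": {"c:bridge_construction"},
}


def expand_query(term):
    """Return the concept ids a search for `term` should retrieve."""
    start = LABELS[term]
    found = {start}
    # Generalization: follow narrower links transitively.
    stack = [start]
    while stack:
        current = stack.pop()
        for child in NARROWER.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    # Related terms: one associative step from the queried concept.
    found |= RELATED.get(start, set())
    return found


print(expand_query("felines"))
print(expand_query("Crete"))
```

Records would then be retrieved by concept identifier rather than by string match, which is why the senses have to exist as entities in their own right.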

  12. Kinds of KOS • A dictionary is a listing of words and phrases giving information such as spelling, morphology and part of speech, senses, definitions, usage, origin, and equivalents in other languages (bi- or multilingual dictionary). • A controlled vocabulary is a limited list of terms to be used in a database field. Only an authority may add terms. • Authority files are lists of persons (authors) or places (also gazetteers) together with recommended names (controlled). • A classification system is a structure that organizes concepts into a (mono)hierarchy in order to partition some material following a sequence of decision criteria.

  13. Kinds of KOS • An ontology “is a logical theory… • …accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualization of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models.” We use “ontology” only to formally describe the meaning of information structures • A thesaurus is a controlled vocabulary of categorical terms related to concepts, and with semantic relationships between concepts. • A monolingual thesaurus has terms from one expert group or community • A multilingual thesaurus relates terms and concepts from two or more expert groups or communities (see next slide)

  14. Multilingual thesauri • Translated thesauri: • Each concept is optimally interpreted in words of another or multiple languages, to allow speakers of those languages to understand it better. • Correlated thesauri: • Multiple thesauri with terms and concepts from respective groups, and a set of concept-based mappings between the different thesauri of that aggregate, in order to process queries across different terminologies. • Interlingua: • Concepts are created by fusing each cluster of similar concepts from different social groups into a new concept. One term from each user group is attached to the new concept as the identifier to be used by this group. The interlingua provides the sharing of concepts between social groups, e.g. as a legal basis used by the European Commission like the EBTI. Note that the interlingua may not contain any of the original concepts of any user group; it contains a set of compromises to remove interpretational differences. Its concepts may again be translated and correlated to other thesauri.

  15. Multilingual Thesauri [Diagram; labels shown: English Heritage thesaurus (English vocabulary), Mérimée thesaurus (French vocabulary), merged, interlingua, interthesaurus correlations, linguistic translation (+/-).]

  16. Thesaurus Structure • Nodes and Links • Nodes for concepts and terms • Nodes are reference objects with accepted identity. • Links for semantic relations concept-concept, concept-term. • Links express opinions and constitute the thesaurus. • 3 dimensions to specialize links • By meaning. E.g. synonymity: who used this expression for that concept, when, and in which context... • By version. When introduced, when withdrawn. • By opinion. E.g. who says that this concept is subordinate to that one... • 2 dominant standards: ISO 2788 / ISO 5964 and SKOS
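
The three dimensions for specializing links (meaning, version, opinion) can be sketched as qualifiers on a link object. This is a hedged illustration, not a data model taken from the talk or from any standard: the class `Link`, its field names, the helper functions and all identifiers and years below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    source: str                        # concept identifier
    relation: str                      # e.g. "BT", "RT", "UF"
    target: str                        # concept or term identifier
    context: Optional[str] = None      # meaning: in which context it holds
    introduced: Optional[int] = None   # version: year the link was added
    withdrawn: Optional[int] = None    # version: year the link was withdrawn
    asserted_by: Optional[str] = None  # opinion: who says so


def valid_in(link, year):
    """True if the link had been introduced and not yet withdrawn in `year`."""
    started = link.introduced is None or link.introduced <= year
    not_ended = link.withdrawn is None or year < link.withdrawn
    return started and not_ended


# Invented sample data: one editor's BT link is later replaced by another's.
LINKS = [
    Link("concept:0001", "BT", "concept:0002",
         introduced=1990, withdrawn=2005, asserted_by="editor A"),
    Link("concept:0001", "BT", "concept:0003",
         introduced=2005, asserted_by="editor B"),
]


def broader_in(concept, year):
    """Broader concepts of `concept` according to the links valid in `year`."""
    return sorted(l.target for l in LINKS
                  if l.source == concept and l.relation == "BT" and valid_in(l, year))


print(broader_in("concept:0001", 2000))
print(broader_in("concept:0001", 2010))
```

Keeping the nodes stable and putting all change into the links means a concept identifier can survive a reorganisation of the hierarchy.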

  17. ISO2788-1986 • Standard about the methodology, entities and relationships of a thesaurus, but not the format • Entities: • thesaurus • preferred term • non-preferred term • compound term • node label (facet indicator) • facets • Does not yet clearly distinguish concepts and terms. • Getty Research Institute uses the term “descriptor” for representing concepts

  18. SKOS • Simple Knowledge Organization System (SKOS) • It provides a model for expressing the basic structure and content of multilingual concept schemes such as • thesauri, • classification schemes, • taxonomies, • subject-heading systems, • or any other type of structured controlled vocabulary. • It is the first widely accepted encoding format in RDF • Introduces persistent concept identifiers • Tends to be abused for place names (gazetteers) and person lists (particulars)…

  19. Thesaurus Concepts (SIS-TMS) [Diagram of a generalisation (isA) hierarchy; node labels shown: ThesaurusNotion, ThesaurusExpression, ThesaurusConcept, Term, HierarchyTerm (concept), TopTerm (concept), Descriptor, ObsoleteDescriptor, NodeLabel, Preferred Term, Non-Preferred Term, AlternativeTerm, UsedForTerm, ObsoleteTerm.]

  20. Logical Thesaurus Structure [Diagram; labels shown: ThesaurusNotionType, Hierarchy, Facet, ObjectFacet, TopTerm, Descriptor, HierarchyTerm, M1_Class, S_Class, Token, Semantic part, Functional part; link labels: In, belongs to, BT; example values: Single Built Works, <single built works>, Object Genres, fortifications.]

  21. Thesaurus Structure: Concept Record • Intrathesaurus relations (ISO 2788) • Hierarchical Relations (from Concept/Descriptor, to Concept/Descriptor) • BT (Broader Term) • BTP (Broader Term Partitive) = actual kind of RT • BTG (Broader Term Generic) = actual BT (IsA) • Associative Relations (from Concept/Descriptor, to Concept/Descriptor) • RT (Related Term) = “world of ontologies” • Equivalence Relations (from Concept/Descriptor, to Term/Language) • ALT (Alternative Term) • UF (Used For Term), often extended by group/language • Now all thesauri also use a concept identifier (possibly an LOD identifier).

  22. Thesaurus Structure: Linking Concepts • Interthesaurus relations (ISO 5964): • partial equivalence; SKOS: broader equivalence (is subset of), narrower equivalence (is superset of) • exact equivalence (same set as) • inexact equivalence (overlaps with), good for FTR only • single to multiple equivalence (future!)
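
The set-based readings on slide 22 (“same set as”, “is subset of”, “is superset of”, “overlaps with”) can be made concrete by comparing the sets of items two concepts cover. This is a sketch for intuition only: the function name `mapping_relation` and the sample item sets are assumptions, and real mappings are usually editorial judgments rather than computed from instance sets.

```python
def mapping_relation(source_items, target_items):
    """Name the interthesaurus relation from a source concept to a target
    concept, given the sets of items each concept covers."""
    a, b = set(source_items), set(target_items)
    if a == b:
        return "exact equivalence"     # same set as
    if a < b:
        return "broader equivalence"   # source is a subset of the target
    if a > b:
        return "narrower equivalence"  # source is a superset of the target
    if a & b:
        return "inexact equivalence"   # the two sets merely overlap
    return "no equivalence"


# Invented example extensions (object ids classified under each concept).
daggers = {"obj1", "obj2"}
knives = {"obj1", "obj2", "obj3"}
weapons = {"obj2", "obj3", "obj4"}

print(mapping_relation(daggers, knives))
print(mapping_relation(knives, weapons))
```

The single-to-multiple case from the slide is not covered here; it would need a target expressed as a combination of several concepts.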

  23. Thesaurus Structure • HIERARCHICAL RELATIONSHIPS. AAT Definition: Broader and narrower (parent/child) relationships between concepts. Hierarchical relationships are generally either whole/part or genus/species; in the AAT, most hierarchical relationships are genus/species (e.g., chalice is a type of drinking vessel). Relationships may be polyhierarchical, meaning that each child may be linked to multiple parents. • Broader term (BT). Also called a broader context. A vocabulary record to which another record or multiple records are subordinate in a hierarchy. In thesauri, the relationship indicator for this type of term is BT. Variations on the notation include BTG (broader term generic), BTP (broader term partitive), BTI (broader term instance), BT1 (broader term level 1), BT2 (broader term level 2), etc. • Narrower term. Also called narrower context. A record to which another record or multiple records are superordinate in a hierarchy (for example, Brewster chair is a narrower term to armchair). In thesauri, the relationship indicator for this type of term is NT. Variations on the notation include NTG (narrower term generic), NTP (narrower term partitive), NTI (narrower term instance), NT1 (narrower term level 1), NT2 (narrower term level 2), etc. • Do not use BT1, BT2, BTI. NT must always be the inverse of BT. Do not use BT for BTP!
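
The rule that NT must always be the inverse of BT is easy to enforce mechanically: store only BT and derive NT, or check a stored NT table against BT. The sketch below assumes a simple dictionary representation; the function names are invented, while the example terms come from slides 23 and 27.

```python
from collections import defaultdict

# BT links: narrower concept -> set of broader concepts (polyhierarchy allowed).
BT = {
    "Brewster chair": {"armchair"},
    "chalice": {"drinking vessel"},
    "carmine (lake)": {"colorant (material)", "lake (pigment)"},
}


def derive_nt(bt):
    """Build the NT table as the exact inverse of the BT table."""
    nt = defaultdict(set)
    for narrower, broaders in bt.items():
        for broader in broaders:
            nt[broader].add(narrower)
    return dict(nt)


def inverse_violations(bt, nt):
    """Return (narrower, broader) pairs recorded in only one direction."""
    bt_pairs = {(n, b) for n, bs in bt.items() for b in bs}
    nt_pairs = {(n, b) for b, ns in nt.items() for n in ns}
    return sorted(bt_pairs ^ nt_pairs)


NT = derive_nt(BT)
print(NT["armchair"])
print(inverse_violations(BT, NT))
```

Deriving one direction from the other avoids the inconsistency altogether; the check is useful when both tables are imported from an external source.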

  24. Thesaurus Structure • ASSOCIATIVE RELATIONSHIPS. AAT Definition: The relationships between concepts that are closely related conceptually, but the relationship is not hierarchical because it is not whole/part or genus/species. • Related term (RT). A concept that is associatively (not hierarchically) linked to another concept in a thesaurus. In thesauri, the relationship indicator for this type of term is RT. • We encourage defining specializations of RT

  25. Thesaurus Structure • “Equivalence relationships”. AAT Definition The relationships between synonymous terms or names that refer to the same concept, typically distinguishing preferred terms (descriptors) and non-preferred terms (variants, or ALTs and UFs). • Alternate descriptor (ALT). A variant form of a descriptor available for use; usually a singular form or a different part of speech than the descriptor (for example, lithograph is an alternate descriptor for the plural descriptor, lithographs). The relationship indicator for this type of term is ALT. • Used for term. Also called a UF. In thesaurus jargon, a term that is not a descriptor and not an alternate descriptor. If the thesaurus is being used as an authority, a used for term is not authorized for indexing. Used for terms typically comprise spelling or grammatical variants of the descriptor or have true synonymity with the descriptor. • These are now “labels” in SKOS, concept-to-string links.

  26. Thesaurus Structure • Scope note (AAT Definition): • A note that describes how the term should be used within the context of the AAT, and provides descriptive information about the concept or expands upon information recorded in other fields. The Scope Note in AAT is analogous to the Descriptive Note in ULAN and TGN.

  27. Example Thesaurus Record • Carmine (lake) • Scope Note, SN: A generic name for two closely related organic red lakes that are obtained from scale insects, cochineal and kermes. Neither pigment is permanent enough for use in fine art because they discolor in sunlight. They were replaced first by madder and alizarin, then later by synthetic organic red colors. • Broader Terms, BT: colorant (material), lake (pigment) • Alternative Terms, ALT: carmine lake • Related Terms, RT: cochineal (colorant), kermes (colorant) • Used For, UF: carmine lake, carmin (lake), Karmesin lake, new red lake, Kugel lake, Parisian lake, Munich lake, Venetian lake, Karmin (Lack)
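
The example record on slide 27 maps naturally onto a small record type with one field per relation. The following is a minimal sketch, assuming a hypothetical class `ConceptRecord`, a hypothetical helper `find_by_label` and an invented concept identifier; the field values are copied from the slide, with the scope note shortened to its first sentence.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptRecord:
    concept_id: str                                   # persistent identifier (invented here)
    descriptor: str                                   # preferred term
    scope_note: str = ""                              # SN
    broader: list = field(default_factory=list)       # BT
    related: list = field(default_factory=list)       # RT
    alternative: list = field(default_factory=list)   # ALT
    used_for: list = field(default_factory=list)      # UF

    def labels(self):
        """All strings that lead to this concept."""
        return [self.descriptor, *self.alternative, *self.used_for]


CARMINE = ConceptRecord(
    concept_id="example:0001",
    descriptor="Carmine (lake)",
    scope_note=("A generic name for two closely related organic red lakes "
                "that are obtained from scale insects, cochineal and kermes."),
    broader=["colorant (material)", "lake (pigment)"],
    related=["cochineal (colorant)", "kermes (colorant)"],
    alternative=["carmine lake"],
    used_for=["carmine lake", "carmin (lake)", "Karmesin lake", "new red lake",
              "Kugel lake", "Parisian lake", "Munich lake", "Venetian lake",
              "Karmin (Lack)"],
)


def find_by_label(records, term):
    """Return the records whose descriptor, ALT or UF matches `term` (case-insensitive)."""
    wanted = term.casefold()
    return [r for r in records if wanted in (l.casefold() for l in r.labels())]


for record in find_by_label([CARMINE], "Karmin (Lack)"):
    print(record.concept_id, record.descriptor)
```

A cataloguer can enter any of the listed variants, while the database stores only the concept identifier.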

  28. AAT term record

  29. AAT term record

  30. Thesaurus Construction • Global knowledge and isolated sources • Most thesauri are small, the agreement of a few experts, integrated into one local database, seen from a specific view, in one language. • Some thesauri cover large “general” subjects, and fail in specialisation. • Scientists and scholars share systems of global concepts. • Thesauri should be organised by domains • Examples of different scope and scale: • General purpose authorities, high-level: AAT, LCSH, RAMEAU, SWD • Specialized vocabularies: Beasley, SHIC, ACM • Use CIDOC CRM for global concepts • Relate your concepts to as many thesauri as possible via persistent identifiers. • Make sure a concept keeps its identity after an update.

  31. Thesaurus Construction • Distinguish use case: • thesauri for keyword search in free text (not my talk today) • thesauri to fill in database (metadata) fields • The process • Define a purpose/function • Engineer terms from existing vocabularies, dictionaries, interviews • Engineer concepts from terms, term use, interviews • Relate concepts and terms • Write concept records • It is a collaborative problem • Manage information for common reference, expressions of opinion, agreement, disagreement • Think of long term maintenance: Only a curated KOS can be used.

  32. Thesaurus Construction • Define a purpose, for example (from D. Soergel), • A classification of diseases for diagnosis • A classification of medical procedures for insurance billing • A classification of medical outcomes to assist with treatment evaluation • A classification of commodities for customs • A classification of educational objectives for instructional development • A classification of occupations for matching job applicants with job openings and for pay scale • A classification of skills for employee task assignments • In cultural heritage, think of research questions or preservation functions

  33. Engineering Terms • Words and terms depend on social group and context: • Natural language, dialect, scientific language, slang • Σπίνος - Fringilla coelebs - chaffinch, σκυλάκι - ορχεοειδές, … • Can be traditional, missing, phrases, “coined”, ad hoc • γιαταγάνι, kalathoi, gilded chairs, the Web, let’s call it... • Appear in different grammatical forms or combination rules • pre-coordinated: “rugs, Persian”, “Persian rugs”, • post-coordinated: “Persia + rug” • Use “coined terms” if necessary. Use “post-coordination” (the software will do it)
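
Post-coordination (“Persia + rug”) means that records are indexed with simple descriptors and the combination happens at query time, typically as a set intersection. The sketch below illustrates this under assumptions: the inverted index, the record numbers and the function name `post_coordinate` are invented for the example.

```python
# Hypothetical inverted index: descriptor -> ids of the records indexed with it.
INDEX = {
    "Persia": {1, 2, 5},
    "rug": {2, 3, 5},
    "19th century": {5, 6},
}


def post_coordinate(index, *descriptors):
    """Combine simple descriptors at query time by intersecting their record sets."""
    if not descriptors:
        return set()
    result = set(index.get(descriptors[0], set()))
    for descriptor in descriptors[1:]:
        result &= index.get(descriptor, set())
    return result


print(post_coordinate(INDEX, "Persia", "rug"))
print(post_coordinate(INDEX, "Persia", "rug", "19th century"))
```

A pre-coordinated vocabulary would instead need one heading per combination (“Persian rugs”, “19th century Persian rugs”, and so on), which grows quickly with the number of facets.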

  34. Engineering Terms • Concept-term relationships (terminological structure) • Controlling synonyms (term → preferred synonym): • Teenager → Adolescent • Teen → Adolescent • Youth (young person) → Adolescent • Pubescent → Adolescent • Black → African American • Afro-American → African American • Alcoholism → Alcohol dependence • Inheritance → Heredity • Ultrasonic cardiography → Echocardiography
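
Synonym control as listed on slide 34 amounts to a lookup from any entry term to its preferred synonym. A minimal sketch, assuming a plain dictionary with case-folded keys and a hypothetical function `preferred`; terms without an entry are returned unchanged, which is a design choice of this example.

```python
# Entry term (case-folded) -> preferred synonym, following the slide's list.
USE = {
    "teenager": "Adolescent",
    "teen": "Adolescent",
    "youth (young person)": "Adolescent",
    "pubescent": "Adolescent",
    "black": "African American",
    "afro-american": "African American",
    "alcoholism": "Alcohol dependence",
    "inheritance": "Heredity",
    "ultrasonic cardiography": "Echocardiography",
}


def preferred(term):
    """Map a term to its preferred synonym; unknown terms pass through unchanged."""
    return USE.get(term.casefold(), term)


for entry in ["Teen", "Inheritance", "Heredity"]:
    print(entry, "->", preferred(entry))
```

Applying the same mapping at indexing time and at query time makes a search for any variant retrieve the same records.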

  35. Engineering Terms Stepwise reduction of a set of terms

  36. Engineering Terms: Stepwise reduction of a set of terms • Table columns: (1) Morphological variants consolidated, (2) Spelling variants consolidated, (3) Synonyms consolidated, (4) Quasi-synonyms consolidated, (5) Descriptors for post-combination ISAR system • Cell values as extracted: Disease | Disease | Disease | Disease, illness | Disease, illness; Illness | Illness | Illness; Sickness | Sickness | Sickness; Ailment | Ailment | Ailment • Following the lines from right to left, the searcher finds in column 1 all the terms and spelling variants to use.

  37. Engineering Terms: Disambiguating homonyms • Administration 1 (management) • Administration 2 (drugs) • Läufer 1 (Sportler) English: runner (athlete) • Läufer 2 (Teppich) English: long, narrow rug • Läufer 3 (Schach) English: bishop (chess) • Discharge 1 (from hospital or program) German: Entlassung • Discharge 2 (from organization or employment) Preferred synonym: Dismissal German: Entlassung • Discharge 3 (medical symptom) German: Absonderung, Ausfluss • Discharge 4 (into a river) German: Ausfluss • Discharge 5 (electrical) German: Entladung (which also means unloading)
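
The “Discharge” example shows why translations have to attach to senses rather than to words: each sense gets its own identifier, and a word in either language may point to several of them. The sketch below is an illustration under assumptions: the list layout, the identifiers `discharge-1` to `discharge-5` and the function `senses_for` are invented; the qualifiers and German equivalents come from the slide.

```python
# One entry per sense; the ids are hypothetical, the content follows slide 37.
SENSES = [
    {"id": "discharge-1", "en": "Discharge",
     "qualifier": "from hospital or program", "de": ["Entlassung"]},
    {"id": "discharge-2", "en": "Discharge",
     "qualifier": "from organization or employment", "de": ["Entlassung"]},
    {"id": "discharge-3", "en": "Discharge",
     "qualifier": "medical symptom", "de": ["Absonderung", "Ausfluss"]},
    {"id": "discharge-4", "en": "Discharge",
     "qualifier": "into a river", "de": ["Ausfluss"]},
    {"id": "discharge-5", "en": "Discharge",
     "qualifier": "electrical", "de": ["Entladung"]},
]


def senses_for(word):
    """Return the ids of all senses a word (English or German) may refer to."""
    return [s["id"] for s in SENSES if word == s["en"] or word in s["de"]]


def qualified_label(sense):
    """Render an unambiguous display label such as 'Discharge (electrical)'."""
    return f'{sense["en"]} ({sense["qualifier"]})'


print(senses_for("Ausfluss"))
print([qualified_label(s) for s in SENSES if s["id"] in senses_for("Entlassung")])
```

Neither language disambiguates all five senses on its own, so the qualifier (or the identifier) has to be recorded with the data.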

  38. Classifying by Term: A case • E.g. searching for comparative studies • How do I spell it? Ushabti, ushabty, ushebti, shawabty? Will it be written the same everywhere? • Should I call it: “grave goods” (AAT), “burial figurines”, “dolls”, “afterlife helpers”, “personality surrogate”, “burial ritual”? • And what about “χαρώνειο, δανάκη”? • Should I call it: “toll”, “cheap coin”, “afterlife helper”, “corpse equipment”, “burial gift”, “burial rites”? • Would “grave goods” be distinctive enough?

  39. Using Classification for Querying • How to find the characteristic term itself? • How to discover related literature? • Relevant abstractions are not standardized • How to make statistics even about the same item? • The same items can be referred to in a thousand ways • How to do comparative studies by features? • Implicit features are not declared, explicit features need systematic documentation

  40. Thesauri and Classification: A Case of a Term • Analyzing a term: • What is an ushebti, what a shawabty? • What did it mean, and when? • What was it made for? • How was it made? • Where was it used? • Ideas, concepts, rather than words • Multiple aspects of interest!

  41. Concepts • A concept is a class or set of entities which are grouped together on the basis of some criterion or rule • Inner representation: the personal comprehension • cannot be communicated • A set of entities characterised by explicit properties (rules) • “objective”, allows reasoning about analogous objects from other cultures/domains • BUT: how to characterise properties? • find discriminative attributes (what is an elephant?) • non-verbal characteristics (aquarelle etc.) • often difficult, misleading, impossible

  42. Concepts • the “words of mentalese”, the common language of the human mind • basis for communication in foreign languages • completely unknown • A set of entities characterised by common agreement • depends on a social group (must be noted!!) • covers everything people can recognise and agree on (implicit mentalese) • does not allow for reasoning about analogous objects • also called “primitive concept” • This is what we need most (possibly plus rules)

  43. Thesauri and Classification: Concepts • Concepts are relative • to scope: fuzzy bounds, e.g. knife, weapon, seat, • outer bounds for retrieval, inner for science, if negated inner bounds for retrieval… • to purpose: weapon, friend, stone building, school house, neoclassic building • there are essential classes (related to reason for existence) • construction-related, morphological, functional, contextual • Concepts are related • by nature: coffin - container, coffin - funerary object, bathtub - container • polyhierarchies of genus-species OR isA OR generalization OR subclass-superclass (provides also a notion of similarity) • associative: bridge - bridge construction, house - roof

  44. Engineering Concepts • Concepts can be • natural and explicit - there is a term for it in some language • natural implicit (hidden) - there is no word • English “parts & accessories”, “too” translated to Greek • terms need to be invented (“coined term”) • new - like “the Web” • compounds - “blue rugs”, “19th century Persian rugs”, open problem • Natural concepts are the best, but often others are needed • often contextually overloaded (sword, ushebti) • typically need contextual redefinition to become precise (AAT “knife”) • or need to be combined with other terms • In particular, generic concepts often lack a term!

  45. Engineering Concepts • Quality problem: Is classification reliable? • Completeness, at least partial? • Do things not classified by one term not belong to this term? • Can at least partial sets of data be identified that are completely classified with respect to term x? • Can I find things that may belong to term x under term x? • Classification for retrieval must be “inclusive” and complete for a given collection

  46. Engineering Concepts • Objects in particular can be seen under different aspects • E.g.: School house, all-wooden building, 18th century American style • Characteristic aspects: • functional • morphological • constructive • contextual • Need to make the aspect explicit (open problem). • Interesting problem: repurposing resources for other aspects.

  47. Concept Definition • By “Scope Note”: • A statement that clarifies the meaning and usage of a term within the thesaurus • Definition by properties, occurrence, similarities • Definition of scope - limitations and distinctions • Guidance of users to similar, overlapping, associated concepts • Context of usage, purpose, view • Origin and history of the term and concept • Reference to literature (“literature warrant”) • Examples. • Often the scope note only reminds us of a certain meaning we share, and restricts it. Examples are most helpful as a reminder!

  48. Thesauri and Classification: Concept Definition • Assisted by example • A particular instance (e.g. Mona Lisa for “painting”) • Optically • by graphics, drawing, images of models • Assisted by semantic placement • Generalizations / specializations • Associations to other concepts = co-occurrence in certain contexts, producer-process-product relations etc. • Synonyms, similar concepts, translations

  49. S. R. Ranganathan • Three cognitive “planes”: • Idea plane - Verbal plane - Notational plane • Confusing them hinders analysis and problem solution: • Missing terms for existing ideas (concepts are many, words are few) and • notational limitations inhibit idea plane work. • The invention of the “facets” • Priority of the idea plane (= concept, not term) • Conceptual structures are multidimensional • Shelving of books is no argument, a taxonomy is not an index. • Colon Classification is a system of library classification developed by S. R. Ranganathan between 1925 and 1965. It uses five primary categories, or facets, to further specify the sorting of a publication. Collectively, they are called PMEST.
