Knowledge Engineering and Semi-Automatic Population of Medical Ontologies Using NLP Methodologies Munich 11.06.2007 Pinar Oezden Wennerberg email@example.com
Agenda • Knowledge Engineering and Ontology • Definitions, methodologies, guidelines • Medical Terminology and Natural Language Processing (NLP) • The problem of medical terminology • The context: users, tasks, types of information in the medical domain • The role of NLP and knowledge engineering • Motivation for Semi-Automatic Ontology Population • The knowledge acquisition bottleneck • Vast amount of knowledge available in (un- / semi-)structured text, WWW, databases etc. • One example approach • Ontology population via Supervised Machine Learning (ML) • Challenges
Knowledge Engineering and Ontologies • Some definitions: • Humans and software agents need knowledge about the world in order to reach good decisions • Such knowledge is typically stored in knowledge bases • “Knowledge engineering is the process of building a knowledge base” • “A knowledge engineer is someone who • investigates a particular domain, • determines what concepts and relations are important in that domain, • and creates a formal representation of objects and relations in that domain”. (Russell & Norvig, 1995)
Knowledge Engineering and Ontologies • An ontology specifies a finite, controlled, extensible and machine-processable vocabulary for a given knowledge base • Consists of concepts, properties, relations, axioms… • Knowledge engineering guidelines • Decide what to talk about and decide on the vocabulary • Encode general knowledge and a specific problem case • Execute queries and verify inference (Russell & Norvig, 1995)
Medical Terminologies and Natural Language Processing (NLP) • Problem statement: • Numerous heterogeneous medical terminologies and coding schemes exist that need to interoperate • e.g. Systematized Nomenclature of Medicine (SNOMED) for coding patient notes, ICD (International Classification of Diseases), ICD-9-CM for billing purposes, RIZIV, IDEWE, ICPC-2, ATC etc. • Existing efforts: UMLS, GALEN, MeSH, etc.
Medical Terminologies and Natural Language Processing (NLP) • Definition of context: • Information types to be collected are about • Individuals (e.g. medical records) • Groups (e.g. data about epidemiology, public health…) • Institutions (e.g. planning, management in hospitals, clinics) • Domain-specific knowledge (e.g. state-of-the-art publications, proceedings) • Domain-relevant tasks • Data entry, query and retrieval about patients • Information sharing and integration from different applications and medical records
Medical Terminologies and Natural Language Processing (NLP) • [Diagram of related fields: Question Answering, Information Extraction, Knowledge Representation and Reasoning, Natural Language Processing, Machine Learning, Information Retrieval, Knowledge Discovery / Text Mining, Ontology Engineering] • Adapted from Jena University, www.julielab.de
Motivation for Semi-Automatic Ontology Population • The knowledge acquisition bottleneck • Ideally the knowledge engineer interviews the domain expert to learn about the domain, i.e. to acquire knowledge • Expensive in time and resources • Domain experts are not always available • Availability of vast amounts of knowledge • In resources such as medical databases, journals, publications, conference proceedings, medical reports etc. • World Wide Web
Ontology Population via Supervised Machine Learning • Problem statement • Identify and extract relevant knowledge (terms, phrases, relations, facts) in text, e.g. • Terms: “health disorder”, “malfunction”, “sickness”, “illness”, “maladie”, “Krankheit” → Disease • Smoking causes cancer → <Smoking, Cancer> • Goal • Assign them to the appropriate concepts of the ontology as instances • Concept: Disease • Relation: causes
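The goal on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the lookup table `TERM_TO_CONCEPT`, the `Ontology` class and the helper `concept_for` are hypothetical names introduced here, and a hand-written dictionary stands in for whatever a learned extractor would produce.

```python
from collections import defaultdict

# Hypothetical lookup table: surface terms (in any language) -> ontology concept.
TERM_TO_CONCEPT = {
    "health disorder": "Disease",
    "malfunction": "Disease",
    "sickness": "Disease",
    "illness": "Disease",
    "maladie": "Disease",
    "krankheit": "Disease",
}


class Ontology:
    """Toy container: concept -> instances, relation -> (subject, object) pairs."""

    def __init__(self):
        self.instances = defaultdict(set)
        self.relations = defaultdict(set)

    def add_instance(self, concept, term):
        self.instances[concept].add(term)

    def add_relation(self, relation, subject, obj):
        self.relations[relation].add((subject, obj))


def concept_for(term):
    """Return the concept a term maps to, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


onto = Ontology()
for term in ["illness", "Krankheit", "maladie"]:
    concept = concept_for(term)
    if concept is not None:
        onto.add_instance(concept, term)

# "Smoking causes cancer" becomes an instance of the relation "causes".
onto.add_relation("causes", "Smoking", "Cancer")
```

Unknown terms return `None` rather than being forced into a concept, which leaves room for the user to decide.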
Ontology Population via Supervised Machine Learning • Processes • Annotate (i.e. supervised) • <CAU>Smoking</CAU> <CAU-R>causes</CAU-R> <DIS>cancer</DIS> • CAU: DiseaseCause, CAU-R: causalRelation, DIS: Disease • Learn and extract from a training set (i.e. ideal world) • Extract from the test set (i.e. unknown world) • Apply the learned rules to new documents to discover and extract new knowledge
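As a rough illustration of how such annotations could be read into training examples, the sketch below parses the inline tags with a regular expression. It assumes well-formed closing tags of the form `</CAU>`; the names `TAG_NAMES`, `TAG_PATTERN` and `parse_annotations` are hypothetical and not part of any particular annotation tool.

```python
import re

# Tag abbreviations as given on the slide.
TAG_NAMES = {"CAU": "DiseaseCause", "CAU-R": "causalRelation", "DIS": "Disease"}

# Matches <TAG>text</TAG>; the closing tag must repeat the opening tag name.
TAG_PATTERN = re.compile(r"<(?P<tag>[A-Z-]+)>(?P<text>.*?)</(?P=tag)>")


def parse_annotations(sentence):
    """Return (label, text) pairs for every annotated span in the sentence."""
    pairs = []
    for match in TAG_PATTERN.finditer(sentence):
        tag = match.group("tag")
        # Unknown tags keep their abbreviation instead of raising an error.
        pairs.append((TAG_NAMES.get(tag, tag), match.group("text")))
    return pairs


annotated = "<CAU>Smoking</CAU> <CAU-R>causes</CAU-R> <DIS>cancer</DIS>"
examples = parse_annotations(annotated)
```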
Ontology Population via Supervised Machine Learning • Learn and extract from a training set (i.e. ideal world) • Recognize syntactic constructs such as NPs, VPs, PPs • Generate extraction rules • Rule for concept Disease • Disease :- <NP “smoking”><VP “causes”><NP DIS> • Rule for concept DiseaseCause • DiseaseCause :- <NP CAU><VP “causes”><NP “cancer”> • Rule for relation causalRelation • causalRelation :- <NP “smoking”><VP CAU-R><NP “cancer”> • Classify • Disease: cancer • DiseaseCause: smoking • causalRelation: causes
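One possible way to represent and apply such a rule is sketched below. It assumes the sentence has already been chunked into (chunk type, text) pairs by some upstream parser; `Slot`, `Rule` and `apply_rule` are hypothetical names, and the rule is written by hand here rather than learned.

```python
from typing import NamedTuple, Optional, Tuple


class Slot(NamedTuple):
    chunk: str               # syntactic chunk type, e.g. "NP" or "VP"
    literal: Optional[str]   # required text, or None for the slot to extract


class Rule(NamedTuple):
    concept: str
    slots: Tuple[Slot, ...]


def apply_rule(rule, chunks):
    """Return the text filling the open slot if the chunks match, else None."""
    if len(chunks) != len(rule.slots):
        return None
    extracted = None
    for slot, (chunk_type, text) in zip(rule.slots, chunks):
        if slot.chunk != chunk_type:
            return None
        if slot.literal is None:
            extracted = text          # open slot: this is what we extract
        elif slot.literal != text.lower():
            return None               # literal slot does not match
    return extracted


# Disease :- <NP "smoking"><VP "causes"><NP DIS>
DISEASE_RULE = Rule(
    "Disease",
    (Slot("NP", "smoking"), Slot("VP", "causes"), Slot("NP", None)),
)

# A new, hand-chunked sentence from the "unknown world".
new_sentence = [("NP", "Smoking"), ("VP", "causes"), ("NP", "emphysema")]
found = apply_rule(DISEASE_RULE, new_sentence)
```

Applied to the new sentence, the rule proposes "emphysema" as a candidate instance of Disease; a sentence with a different verb or subject yields no extraction.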
Ontology Population via Supervised Machine Learning • Possible problems • More than one value was extracted for a given relation • Entities from different classes were extracted (multiple concept assignment, i.e. ambiguity) • Nothing was extracted • Possible solutions • Present all possible values to the user and let the user decide • Assist the user with the decision by assigning confidence scores to possible values • i.e. how strongly the system believes that what it suggests is relevant/true • Provide context information via text highlighting to justify the system’s confidence • Provide empty data entry slots for users to enter their own knowledge
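The confidence-score idea could look roughly like the sketch below. The scoring scheme is an assumption made for illustration only: confidence is taken as the share of extractions that agree on a value, which is not necessarily how a real system would score candidates. The function name `rank_candidates` is hypothetical.

```python
from collections import Counter


def rank_candidates(extracted_values):
    """Return (value, confidence) pairs, highest confidence first.

    Confidence is assumed to be the fraction of extractions that
    produced the value. An empty result signals that nothing was
    extracted, so the caller can show an empty data entry slot.
    """
    counts = Counter(extracted_values)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(value, count / total) for value, count in counts.most_common()]


# Several rules fired on different documents for the same slot.
candidates = rank_candidates(["cancer", "cancer", "bronchitis", "cancer"])
```

The user is then shown every candidate with its score instead of a single forced answer.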
Challenges • General challenges • It is difficult to eliminate the knowledge acquisition problem entirely • Due to the sensitivity of the domain (human health), domain experts cannot be avoided completely • Computer scientists need to work together with domain experts to a certain extent • Systems should be usable by non-technicians • Multilinguality • Healthcare workers, patients and administrators should have access to information in their own language
Challenges • Knowledge/ontology engineering specific challenges • Implicit information (typical for natural language), i.e. not explicit and therefore not machine-processable • Different levels of detail (granularity) are required to meet different expectations • i.e. provide sufficient detail but abstract away irrelevancies • Poly-hierarchies to support multiple views • may lead to ambiguities, contradictions • Adaptability, extensibility for changing user demands and for standards • Expressivity vs. computational tractability • Achieving consensus between practitioners
Questions? • Evaluation • How do we know if we have a good system? • Should practitioners evaluate the efficiency and reliability of the developed systems?