
Ontology Learning from Text


Presentation Transcript


  1. Ontology Learning from Text Methods & Tools Polyxeni Katsiouli Pervasive Computing Research Group Communication Networks Laboratory Department of Informatics and Telecommunications University of Athens – Greece 18/5/2007

  2. Definition of Ontology ‘A formal, explicit specification of a shared conceptualization’
  • formal: must be machine understandable
  • shared: not private to some individual, but accepted by a group
  • explicit: types of concepts and constraints must be clearly defined
  • conceptualization: an abstract model of some phenomenon in the world, formed by identifying the relevant concepts of that phenomenon

  3. Main elements of an ontology [diagram]:
  • Object property (relation), e.g. wasWrittenBy, with a domain and a range
  • Datatype property (attribute), e.g. hasTitle, with range xsd:string
  • Hierarchy of concepts (is-a relations)

  4. Definition of Ontology Learning • The application of a set of methods and techniques used for building an ontology from scratch • Uses distributed and heterogeneous knowledge and information sources • Allows a reduction in the time and effort needed in the ontology development process

  5. Ontology Learning methods from… • Unstructured sources: involves NLP techniques, morphological and syntactic analysis, etc. • Semi-structured sources: elicit an ontology from sources that have some predefined structure, such as XML Schema • Structured data: extract concepts and relations from knowledge contained in structured data, such as databases

  6. Ontology Learning ‘Layer Cake’ (from top to bottom):
  • Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
  • Relations: cure(domain: Doctor, range: Disease)
  • Taxonomy: is_a(Doctor, Person)
  • Concepts: Disease := <I, E, L>
  • Synonyms: {disease, illness}
  • Terms: disease, illness, hospital

  7. Part 1 Terms Extraction disease, illness, hospital

  8. Terms • Linguistic realizations of domain-specific concepts • Are the basis of the ontology learning process • Term extraction implies: • Linguistic processing part-of-speech tagging, morphological analysis, etc. • Statistical processing compares the distribution of terms between corpora

  9. Terms Extraction: Process • Run a Part-Of-Speech (POS) tagger over the domain corpus • Identify possible terms by constructing patterns over the tags, such as: Adj-Noun, Noun-Noun, Adj-Noun-Noun, … • Ignore proper names • Keep only the terms relevant to the domain by applying statistical metrics
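
A minimal sketch of the pattern-matching step, assuming the corpus has already been POS-tagged with Penn-Treebank-style tags; the patterns, tag names and example fragment are illustrative only:

```python
# Candidate term patterns over POS-tag prefixes: Adj-Noun, Noun-Noun, Adj-Noun-Noun
PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("JJ", "NN", "NN")]

def extract_candidate_terms(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs produced by a POS tagger."""
    candidates = []
    for i in range(len(tagged_tokens)):
        for pattern in PATTERNS:
            window = tagged_tokens[i:i + len(pattern)]
            if len(window) < len(pattern):
                continue
            # match on tag prefixes; proper names (NNP) are ignored, as on the slide
            if all(tag.startswith(p) and not tag.startswith("NNP")
                   for (word, tag), p in zip(window, pattern)):
                candidates.append(" ".join(word for word, _ in window))
    return candidates

# Hypothetical pre-tagged fragment
tagged = [("chronic", "JJ"), ("heart", "NN"), ("disease", "NN"),
          ("affects", "VBZ"), ("many", "JJ"), ("patients", "NNS")]
print(extract_candidate_terms(tagged))
# ['chronic heart', 'chronic heart disease', 'heart disease', 'many patients']
```

The candidates would then be filtered with the statistical metrics of the next slides (TFIDF, chi-square, mutual information).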

  10. Linguistic Analysis: an example [annotated example showing increasing levels of analysis: tokenization ([table] [2005-06-01] [JohnSmith]), morphology ([work~ing V]), semantic/sense tagging ([table N:ARTIFACT] [table N:furniture]), chunking ([[the] [large] [table] NP] [[in] [the] [corner] PP]), chunk-internal structure ([[the SPEC] [large MOD] [table HEAD] NP]), grammatical functions ([[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ]S]), and anaphora resolution ([[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ:X1]…] … [[It SUBJ:X1] [was PRED] still available…])]

  11. Statistical Analysis Statistical metrics used in terms extraction: Term weighting (TFIDF) Chi-square Mutual Information

  12. TFIDF The most popular weighting scheme: a word is more important when it appears several times in a document and appears in fewer documents.
  • tf(w): term frequency (number of occurrences of the word in a document)
  • df(w): document frequency (number of documents containing the word)
  • N: number of all documents
  • tfidf(w) = tf(w) · log(N / df(w)): relative importance of the word in the document
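
A direct translation of these definitions into code, using the standard tf(w) · log(N / df(w)) form; the corpus below is a made-up toy example:

```python
import math
from collections import Counter

def tfidf(word, doc_tokens, corpus):
    """TF-IDF of `word` in one document, relative to a corpus of token lists."""
    tf = Counter(doc_tokens)[word]                 # tf(w): occurrences in the document
    df = sum(1 for doc in corpus if word in doc)   # df(w): documents containing the word
    n = len(corpus)                                # N: number of all documents
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n / df)

# Hypothetical toy corpus of tokenized documents
corpus = [["disease", "illness", "hospital", "disease"],
          ["hospital", "doctor"],
          ["car", "engine"]]
print(tfidf("disease", corpus[0], corpus))   # appears twice, in 1 of 3 documents
print(tfidf("hospital", corpus[0], corpus))  # appears once, in 2 of 3 documents
```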

  13. Part 2 Synonyms {disease, illness}

  14. Synonyms • Identification of terms that share semantics, i.e., potentially refer to the same concept • Methods for extracting synonyms • Based on WordNet • Latent Semantic Indexing (LSI)

  15. WordNet • A lexical database for the English language • Nouns, verbs, adjectives & adverbs are grouped into sets of synonyms (synsets) • Synsets are interlinked by means of conceptual-semantic and lexical relations
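
A small sketch of looking up synsets with NLTK's WordNet interface (requires nltk and a one-time download of the WordNet corpus); the word 'illness' is just an example:

```python
# pip install nltk, then once: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Every sense (synset) of "illness" and the lemmas grouped in it
for synset in wn.synsets("illness", pos=wn.NOUN):
    print(synset.name(), synset.definition())
    print("  lemmas:", synset.lemma_names())

# Candidate synonyms: all lemmas that share a synset with the word
synonyms = {lemma for s in wn.synsets("illness", pos=wn.NOUN)
            for lemma in s.lemma_names()}
print(synonyms)
```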

  16. Adapting WordNet to a specific domain • Partition the set of synonymy relations defined in WordNet into three classes: • relations irrelevant in the specific domain • relations that are relevant but incorrect in the specific domain • relations that are relevant and correct in the specific domain • Remove relations from the first two classes and include relations from the third class • Rank the remaining synsets according to their frequency in the corpus

  17. Latent Semantic Indexing (LSI) • LSI is an NLP technique for analyzing relationships between a set of documents and the terms they contain • It uses a term-document matrix which describes the occurrences of terms in documents (Vector Space Model) • Example: [the slide shows a sample term-document matrix]
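
A minimal sketch of the idea with a made-up term-document matrix and a truncated SVD (the core operation behind LSI); terms and counts are invented for illustration:

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents),
# counting how often each term occurs in each document (Vector Space Model).
terms = ["disease", "illness", "hospital", "car"]
A = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 4]], dtype=float)

# Truncated SVD: keep k latent dimensions; terms with similar usage patterns
# end up close together in the reduced space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "disease" and "illness" co-occur with "hospital", so their latent vectors
# should be more similar to each other than either is to "car".
print(cosine(term_vectors[0], term_vectors[1]))   # disease vs illness
print(cosine(term_vectors[0], term_vectors[3]))   # disease vs car
```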

  18. Part 3 Concepts Disease:=<I, E, L>

  19. Concepts = Intension, Extension, Lexicon. A term may indicate a concept if we can define its:
  • Intension: an (in)formal definition of the set of objects that this concept describes. Example: a disease is an impairment of health or a condition of abnormal functioning
  • Extension: the set of objects that the definition of this concept describes. Example: influenza, cancer, heart disease
  • Lexical realizations: the term itself and its multilingual synonyms. Example: disease, illness, maladie

  20. Part 4 Taxonomy Induction is_a (Doctor, Person)

  21. Concept Hierarchy Extraction Basic methods used for taxonomy extraction: • With the use of WordNet • Lexico-syntactic patterns • Machine Readable Dictionaries • Co-occurrence analysis • Linguistic approaches

  22. Taxonomy Extraction with WordNet • Given two terms t1 and t2, check if they stand in a hypernym relation with regard to WordNet • Normalize the number of hypernym paths by dividing by the number of senses of t1 (a path is a sequence of edges connecting the two synsets) • Example: there are 4 different hypernym paths between the synsets of ‘country’ and ‘region’, and ‘country’ has 5 senses, so isa(country, region) = 4/5 = 0.8
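
A sketch of this score with NLTK's WordNet interface; path and sense counts depend on the WordNet version, so the 'country'/'region' example may not reproduce 0.8 exactly:

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus (nltk.download("wordnet"))

def isa_score(t1, t2):
    """Number of hypernym paths from senses of t1 that pass through a sense of t2,
    normalized by the number of senses of t1."""
    t2_synsets = set(wn.synsets(t2, pos=wn.NOUN))
    t1_senses = wn.synsets(t1, pos=wn.NOUN)
    if not t1_senses:
        return 0.0
    paths_through_t2 = 0
    for synset in t1_senses:
        for path in synset.hypernym_paths():   # every path from the synset up to the root
            if any(node in t2_synsets for node in path):
                paths_through_t2 += 1
    return paths_through_t2 / len(t1_senses)

print(isa_score("country", "region"))
```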

  23. Lexico-syntactic patterns – Hearst • Aim: the acquisition of hyponym lexical relations from text • Uses a set of predefined lexico-syntactic patterns which: occur frequently and in many text genres, indicate the relation of interest, and can be recognized with little or no pre-encoded knowledge • Main idea: match these patterns in texts to retrieve is_a relations • Precision with respect to WordNet: 55.45%

  24. Lexico-syntactic patterns – Hearst
  • NP0 such as {NP1, NP2, …, (and | or)} NPn — ‘Vehicles such as cars, trucks and bikes…’ → is_a(car, vehicle), is_a(truck, vehicle), is_a(bike, vehicle)
  • such NP as {NP,}* {(or | and)} NP — ‘Such fruits as oranges, nectarines or apples…’ → is_a(orange, fruit), is_a(nectarine, fruit), is_a(apple, fruit)
  • NP {, NP}* {,} {or | and} other NP — ‘Swimming, running, or other activities…’ → is_a(swimming, activity), is_a(running, activity)
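
A minimal regex sketch of the first pattern ('NP0 such as NP1, NP2 … and NPn'); it treats single words as NPs, whereas a real implementation matches over NP chunks, so it is illustrative only:

```python
import re

# Simplified: single words stand in for NPs; real systems match over NP chunks.
SUCH_AS = re.compile(r"(\w+)\s+such as\s+((?:\w+,\s*)*\w+(?:\s+(?:and|or)\s+\w+)?)")

def hearst_such_as(text):
    """Extract is_a relations with the pattern 'NP0 such as NP1, NP2 ... and NPn'."""
    relations = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower().rstrip("s")     # crude singularization
        hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", match.group(2))
        for hyponym in hyponyms:
            relations.append(("is_a", hyponym.lower().rstrip("s"), hypernym))
    return relations

print(hearst_such_as("Vehicles such as cars, trucks and bikes need fuel."))
# [('is_a', 'car', 'vehicle'), ('is_a', 'truck', 'vehicle'), ('is_a', 'bike', 'vehicle')]
```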

  25. Lexico-syntactic patterns – Hearst
  • NP {,} including {NP,}* {or | and} NP — ‘Injuries, including broken bones, wounds and bruises…’ → is_a(broken bone, injury), is_a(wound, injury), is_a(bruise, injury)
  • NP {,} especially {NP,}* {or | and} NP — ‘Publications, especially papers and books…’ → is_a(paper, publication), is_a(book, publication)

  26. Machine Readable Dictionaries • A method for extracting taxonomies which goes back to the 80’s • Main idea: exploit the regularity of dictionary entries to find a suitable hypernym for the defined word Example: spring “the season between winter and summer and in which leaves and flowers appear” is_a (spring, season)

  27. MRDs: Exceptions
  • The hypernym can be preceded by an expression such as ‘a kind of’, ‘a sort of’, or ‘a type of’. The problem is solved by keeping an exception list with words such as ‘kind’, ‘sort’, ‘type‘ and taking the head of the NP following the preposition ‘of’. Example: hornbeam: “a type of tree with a hard wood, sometimes used in hedges” → is_a(hornbeam, tree)
  • The word can be defined in terms of a part-of or membership relation. Example: republican: “a member of a political party advocating republicanism” → the correct relation is part_of(republican, political party), not is_a(republican, political party)
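
A rough sketch of this kind of definition analysis; the NP-head heuristic and the boundary-word list are my own simplifications, not the original method:

```python
import re

KIND_WORDS = {"kind", "sort", "type"}

def hypernym_from_definition(definition):
    """Heuristic hypernym extraction from a dictionary gloss: take the head of the
    first NP after the leading article; if the gloss starts with 'a kind/sort/type of',
    take the head of the NP following 'of' instead."""
    words = re.findall(r"[a-z]+", definition.lower())
    if words and words[0] in {"a", "an", "the"}:          # drop a leading article
        words = words[1:]
    if len(words) >= 3 and words[0] in KIND_WORDS and words[1] == "of":
        words = words[2:]                                 # exception list: skip 'type of' etc.
        if words and words[0] in {"a", "an", "the"}:
            words = words[1:]
    boundary = {"of", "with", "in", "which", "that", "between"}
    head = []
    for w in words:
        if w in boundary:
            break
        head.append(w)
    return head[-1] if head else None                     # last word of the NP as its head

print(hypernym_from_definition(
    "the season between winter and summer and in which leaves and flowers appear"))  # season
print(hypernym_from_definition(
    "a type of tree with a hard wood, sometimes used in hedges"))                    # tree
print(hypernym_from_definition(
    "a member of a political party advocating republicanism"))  # member -> signals the part-of exception
```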

  28. Co-occurrence analysis • A certain term t1 is more specific than a term t2 if t2 also appears in all the documents in which t1 appears (document-based subsumption) • Term x subsumes term y iff P(x | y) = n(x, y) / n(y) ≈ 1, where n(x, y) is the number of documents in which x and y co-occur and n(y) the number of documents that contain y
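
A small sketch of document-based subsumption over a toy corpus; the threshold parameter is my addition (the slide effectively uses a value of 1):

```python
def subsumes(x, y, docs, threshold=1.0):
    """Document-based subsumption sketch: x subsumes y if P(x|y) >= threshold.
    `docs` is a list of sets of terms; a slightly lower threshold is often
    used in practice (an assumption here, not from the slide)."""
    docs_with_y = [d for d in docs if y in d]
    if not docs_with_y:
        return False
    n_xy = sum(1 for d in docs_with_y if x in d)   # n(x, y)
    p_x_given_y = n_xy / len(docs_with_y)          # P(x|y) = n(x, y) / n(y)
    return p_x_given_y >= threshold

# Hypothetical toy corpus: "disease" appears in every document containing "influenza"
docs = [{"disease", "influenza", "hospital"},
        {"disease", "influenza"},
        {"disease", "cancer"}]
print(subsumes("disease", "influenza", docs))   # True: disease subsumes influenza
print(subsumes("influenza", "disease", docs))   # False
```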

  29. Linguistic Approaches • Modifiers typically restrict or narrow down the meaning of the modified noun Example: is_a (international credit card, credit card)

  30. Part 5 Relations (non-taxonomic) cure (domain:Doctor, range:Disease)

  31. Extracting relations & attributes • Specific relations • Part-of • Qualia (Formal, Constitutive, Telic, Agentive) • General relations • Exploiting linguistic structure • Attributes

  32. Learning attributes: Introduction • Attributes: relations with a datatype as range • Typically expressed in texts using the preposition of, the verb have or genitive constructs, e.g. ‘the color of the car’, ‘the car’s color’, ‘every car has a color’ • Values of attributes are expressed using copula constructs, adjectives or expressions specific to the attribute in question, e.g.: • ‘the car is red’ (copula + value) • ‘the red car’ (adjective) • ‘the baby weighs 3 kg’ (specific expression)

  33. Classification of attributes To systematize the learning process, attributes are classified according to their range

  34. An approach to learning attributes • Tokenize & part-of-speech tag the corpus • Apply the following patterns to extract adjective/noun pairs: (\w+{DET})? (\w+{NN})+ is{VBZ} \w+{JJ} and (\w+{DET})? \w+{JJ} (\w+{NN})+ • These pairs are weighted using the conditional probability P(a | n) = f(n, a) / f(n), where f(n, a) is the joint frequency of adjective a and noun n and f(n) the frequency of noun n • For each of the adjectives we look up the corresponding attributes in WordNet (JJ: adjective, DET: determiner, NN: noun, VBZ: verb, 3rd person singular present)
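
A minimal sketch of the two patterns and the weighting, run over an already POS-tagged toy corpus; tag names and the example are illustrative:

```python
from collections import Counter

def adjective_noun_pairs(tagged_tokens):
    """Extract (noun, adjective) pairs with the two slide patterns:
    '(DET)? NN is{VBZ} JJ' (e.g. "the car is red") and '(DET)? JJ NN' (e.g. "the red car")."""
    words = [w for w, _ in tagged_tokens]
    tags = [t for _, t in tagged_tokens]
    pairs = []
    for i in range(len(tagged_tokens)):
        # Pattern 1: NN is{VBZ} JJ
        if (tags[i] == "NN" and i + 2 < len(tags)
                and tags[i + 1] == "VBZ" and words[i + 1] == "is"
                and tags[i + 2] == "JJ"):
            pairs.append((words[i], words[i + 2]))
        # Pattern 2: JJ NN
        if tags[i] == "JJ" and i + 1 < len(tags) and tags[i + 1] == "NN":
            pairs.append((words[i + 1], words[i]))
    return pairs

def weights(pairs, tagged_tokens):
    """P(a | n) = f(n, a) / f(n): joint frequency of noun n with adjective a
    divided by the corpus frequency of n."""
    f_na = Counter(pairs)
    f_n = Counter(w for w, t in tagged_tokens if t == "NN")
    return {(n, a): f_na[(n, a)] / f_n[n] for (n, a) in f_na}

tagged = [("the", "DET"), ("car", "NN"), ("is", "VBZ"), ("red", "JJ"),
          ("a", "DET"), ("red", "JJ"), ("car", "NN")]
pairs = adjective_noun_pairs(tagged)
print(weights(pairs, tagged))  # {('car', 'red'): 1.0}
```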

  35. “Meronymy” / “part-of” relations Given a “seed” word, find parts of that word in a large corpus of text. Patterns (format: word-type{TAG} …; NN = noun, NN-PL = plural noun, PREP = preposition, POS = possessive, JJ = adjective):
  • whole{NN[-PL]} ’s{POS} part{NN[-PL]} — e.g. ‘…building’s basement…’
  • part{NN[-PL]} of{PREP} {the|a}{DET} mods{[JJ|NN]}* whole{NN} — e.g. ‘…basement of a building…’
  Reported accuracy: 55%
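
A crude regex sketch of these two surface patterns over raw text, as a stand-in for the POS-based matching; the seed word and sentence are made up:

```python
import re

def part_of_candidates(text, whole):
    """Find candidate parts of a 'seed' whole using the two slide patterns:
    "whole's part" and "part of the|a (modifiers) whole". Regexes over raw
    text stand in for the POS-based patterns, so expect noisy matches."""
    whole_re = re.escape(whole)
    candidates = []
    # Pattern 1: whole's part, e.g. "building's basement"
    for m in re.finditer(rf"\b{whole_re}'s\s+(\w+)", text, flags=re.IGNORECASE):
        candidates.append(m.group(1))
    # Pattern 2: part of the|a (modifiers)* whole, e.g. "basement of a building"
    for m in re.finditer(rf"\b(\w+)\s+of\s+(?:the|a)\s+(?:\w+\s+)*?{whole_re}\b",
                         text, flags=re.IGNORECASE):
        candidates.append(m.group(1))
    return candidates

text = "The building's basement flooded. We inspected the basement of the old building."
print(part_of_candidates(text, "building"))  # ['basement', 'basement']
```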

  36. Qualia structures The meaning of a lexical element is described in terms of four roles:
  • Formal: typing information about the object (e.g., its hypernym)
  • Constitutive: physical properties of an object (e.g., weight, material, parts)
  • Telic: the purpose or function of an object, expressed either by a verb or by a nominal
  • Agentive: typically a verb denoting an action which brings the object into existence
  Example: qualia structure for knife — Formal: artifact_tool; Constitutive: blade, handle, …; Telic: cut_act; Agentive: make_act

  37. Qualia Structures: Learning Approach • Aim: to automatically learn qualia structures from the WWW • Based on the idea of matching certain lexico-syntactic patterns conveying a standard relation

  38. Qualia Structures: Learning Process • Clues: search engine queries indicating the relation of interest • Calculate the weight of a candidate qualia element e for the term t using the Jaccard coefficient • Pipeline: word → generate clues → download Google abstracts → POS tagging → matching regular expressions → statistical weighting → weighted qualia structure (QS)
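
A sketch of the Jaccard weighting in the standard set-overlap form, computed from hypothetical search-engine hit counts; the exact counts the original approach plugs in may differ:

```python
def jaccard_weight(hits_t, hits_e, hits_t_and_e):
    """Jaccard coefficient between term t and candidate qualia element e:
        J(t, e) = |t & e| / (|t| + |e| - |t & e|)
    where the counts are, e.g., numbers of search-engine hits."""
    denominator = hits_t + hits_e - hits_t_and_e
    return hits_t_and_e / denominator if denominator else 0.0

# Hypothetical hit counts for t = "knife" and e = "cut"
print(jaccard_weight(hits_t=120_000, hits_e=300_000, hits_t_and_e=45_000))  # 0.12
```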

  39. Qualia Structure: Patterns (1/2) [tables of lexico-syntactic patterns for the Formal and Telic roles]

  40. Qualia Structure: Patterns (2/2) [table of lexico-syntactic patterns for the Constitutive role]

  41. Relations by syntactic analysis (OntoLT) The mapping rule SubjToClass_PredToSlot_DObjToRange maps the subject to the domain, the predicate (verb) to a slot or relation, and the direct object to its range. Example: ‘The player kicked the ball to the net’ → relation: kick (domain: player, range: ball)

  42. RelExt: A tool for Relation Extraction • Identifies relevant triples (pairs of concepts connected by a relation) over concepts from an existing ontology • Is based on the fact that verbs express a relation between two classes that specify the domain and range • Extracts relevant verbs & their grammatical arguments and computes corresponding relations through statistical & linguistic processing • Was developed in the context of the SmartWeb project to provide intelligent information services for the FIFA World Cup 2006

  43. RelExt: Linguistic processing • Linguistic annotation: the SCHUG system was used; it provides a multi-layer XML format for a given text (dependency structure, lemmatization, POS) • NER (Named Entity Recognition): performed to map instances of football players to existing ontology classes • Concept tagging: maps synonyms for given terms to the corresponding ontology concepts • Pipeline: corpus → linguistic annotation → NER & concept tagging → annotated corpus

  44. RelExt: Statistical Processing • Relevance measure: a χ² test is used to compute a relevance ranking • Co-occurrence measure • Relation extraction • Pipeline: frequencies in the BNC and NZZ corpora → relevance measure → relevance scores for heads and predicates → co-occurrence measure → co-occurrence scores (heads ↔ predicates)

  45. Part 6 Axioms & Rules x, y (sufferFrom(x, y)  ill(x)

  46. DIRT: Discovery of Inference Rules from Text • An unsupervised method for discovering inference rules from text, such as: X is author of Y ≈ X wrote Y, X caused Y ≈ Y is blamed on X, X manufactures Y ≈ X’s Y factory • It is based on the Distributional Hypothesis: words that occur in the same contexts tend to be similar

  47. DIRT: Distributional Hypothesis • The Distributional Hypothesis is applied to dependency trees • If two paths tend to link the same sets of words, their meanings are hypothesized to be similar

  48. DIRT: Dependency trees • The inference rules discovered by DIRT are between paths in dependency trees • The trees are generated by the Minipar parser • Minipar represents its grammar as a network where nodes represent grammatical categories and links represent syntactic relationships • [Table: a subset of the dependency relations in Minipar output]

  49. DIRT: Dependency trees Example: “John found a solution to the problem” [dependency tree: found –subj→ John, found –obj→ solution, solution –det→ a, solution –mod→ to, to –pcomp→ problem, problem –det→ the]
  • Links represent dependency relationships, directed from the head to the modifier
  • Labels represent types of dependency relations
  • Each link between two words represents a direct semantic relationship
  • Path between “John” and “problem”: N:subj:V ← find → V:obj:N → solution → N:to:N, meaning “X finds solution to Y”
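
A small sketch of extracting the path between two words, with the slide's parse hand-encoded as a table; a real system would obtain the tree from a parser such as Minipar:

```python
# A hand-encoded dependency parse of the slide's example sentence
# ("John found a solution to the problem"). Each entry: word -> (head, relation).
TREE = {
    "found":    (None, None),
    "John":     ("found", "subj"),
    "solution": ("found", "obj"),
    "a":        ("solution", "det"),
    "to":       ("solution", "mod"),
    "problem":  ("to", "pcomp"),
    "the":      ("problem", "det"),
}

def path_to_root(word):
    """Words from `word` up to the root of the tree."""
    chain = [word]
    while TREE[chain[-1]][0] is not None:
        chain.append(TREE[chain[-1]][0])
    return chain

def dependency_path(x, y):
    """Words on the path between x and y through their lowest common ancestor."""
    up_x, up_y = path_to_root(x), path_to_root(y)
    common = next(w for w in up_x if w in up_y)          # lowest common ancestor
    left = up_x[:up_x.index(common) + 1]
    right = list(reversed(up_y[:up_y.index(common)]))
    return left + right

print(dependency_path("John", "problem"))
# ['John', 'found', 'solution', 'to', 'problem']
# i.e. the DIRT path "X finds solution to Y" once the slot fillers
# John and problem are abstracted to X and Y.
```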

  50. DIRT: Paths in Dependency Trees • Transformation rule: connect the prepositional complement directly to the word modified by the preposition • Each link between two words represents a direct semantic relationship • A path represents an indirect semantic relationship between two content words
