Automated Ontology Elicitation and Usage Adrian Silvescu Computer Science Dept., Iowa State University
Outline • Introduction, Motivation and Philosophy • Abstraction • Super-Structuring • Ontology Usage • Ontology Elicitation Framework • Conclusions
The Problem: Ontology Elicitation • How to get computers to figure out what an application domain (the world) is all about? • How to get computers to derive automatically the set of relevant concepts, entities and relations pertaining to a certain application domain (or the world)? • Example: Stock – News data is about: companies, units, CEOs, shares, earnings, cutting jobs, …
Time Warner Shares Idle Ahead of Third-Quarter Earnings Report • NEW YORK (AP) -- Shares of Media giant Time Warner Inc. were little changed Monday ahead of the company's third-quarter earnings report as investors wonder exactly what chairman Dick Parsons might say about its troubled America Online unit. There was heavy speculation this week the company -- which once had the "AOL" ticker symbol and called itself AOL Time Warner -- might announce job cuts at the online giant. A number of news organizations reported Tuesday that AOL is expected to cut about 700 jobs by the end of the year. There has also been continued speculation Time Warner might even spin off the unit
Main Paradigm for Ontology Elicitation • There are at least two essential “moves” that people make when attempting to make sense or "structure" a new application domain: Super-Structuring and Abstraction. • Super-Structuring is the process of grouping and subsequently naming a set of entities that occur within "proximity" of each other, into a more complex structural unit. • Abstraction, on the other hand, establishes that a set of entities belong to the same category, based on a certain notion of “similarity” among these entities, and subsequently names that category.
Structuralism vs. Functionalism • Structuralism: Meaning (and “similarity” in meaning thereof) is defined in terms of structural features [e.g., to define a chair we say that it has four legs, a rectangular surface on top of them and a back]. [OOP classes] • Functionalism: Functionalism, on the other hand, asserts that the meaning of an object resides in how it is used [e.g., a chair is something that I can sit on, and if you intentionally kill somebody with that chair, then the chair is for all practical purposes a weapon]. Briefly stated, the functionalist claim is that “meaning” is about ends and not about the means by which these ends are met. [OOP interfaces] • These are two extremes … mix.
Inspirations for the main paradigm • Object Oriented Programming / Software Engineering / UML – these are the main principles used: • Composition = Super-Structuring • Inheritance = Abstraction • Philosophy of linguistics – Words are similar in meaning if they are used in similar contexts – “distributional semantics” hypothesis • [Zellig Harris, [British Empiricists, (Jeremy Bentham)]].
Abstractions and Super-Structures more concrete • SuperStructures – Proximal entities should be SuperStructured into a higher level unit if they occur together in the data significantly above chance levels. • Abstractions – Entities are similar and should be abstracted together if they occur within similar contexts
Example
Step 1 data: Mary loves John. Sue loves Curt. Mary hates Curt.
Abstractions 1:
A1 -> Mary | Sue because they have similar right contexts: loves.
A2 -> John | Curt because they have similar left contexts: loves.
Step 2 data: [Mary, A1] loves [John, A2]. [Sue, A1] loves [Curt, A2]. [Mary, A1] hates [Curt, A2].
Abstractions 2:
A3 -> loves | hates because of high similarity between their left and right contexts.
This illustrates how abstraction begets more abstraction (A3 not possible on the raw data).
Step 3 data: [Mary, A1] [loves, A3] [John, A2]. [Sue, A1] [loves, A3] [Curt, A2]. [Mary, A1] [hates, A3] [Curt, A2].
Structures 3:
S1 -> A1 A3 because it occurs three times
S2 -> A3 A2 because it occurs three times
This illustrates how abstraction begets structuring (S1 and S2 not possible on the raw data).
Structures 4:
S3 -> S1 A2
S4 -> A1 S2
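The first two steps of this example can be reproduced with a small sketch. It assumes sentence-boundary markers `<s>` and `</s>`, uses the Jensen-Shannon divergence summed over left and right contexts as the distance, and simplifies annotation to replacing each name with its abstraction label (the slide keeps both, as in [Mary, A1]). The function names are illustrative, not taken from the original work.

```python
from collections import Counter
from math import log2

def contexts(sentences):
    """Return (left, right): token -> Counter of neighbouring tokens."""
    left, right = {}, {}
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]  # boundary markers are an assumption
        for prev, tok, nxt in zip(padded, padded[1:], padded[2:]):
            left.setdefault(tok, Counter())[prev] += 1
            right.setdefault(tok, Counter())[nxt] += 1
    return left, right

def js(c1, c2):
    """Jensen-Shannon divergence (base 2) between two count vectors."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = 0.0
    for key in set(c1) | set(c2):
        p, q = c1[key] / n1, c2[key] / n2
        m = (p + q) / 2
        if p > 0:
            total += 0.5 * p * log2(p / m)
        if q > 0:
            total += 0.5 * q * log2(q / m)
    return total

def context_distance(a, b, sentences):
    """Sum of JS divergences of the left and right contexts of a and b."""
    left, right = contexts(sentences)
    return js(left[a], left[b]) + js(right[a], right[b])

raw = [["Mary", "loves", "John"],
       ["Sue", "loves", "Curt"],
       ["Mary", "hates", "Curt"]]

# Step 1: on the raw data the names are closer to each other than the verbs are.
d_names = context_distance("Mary", "Sue", raw)
d_verbs_raw = context_distance("loves", "hates", raw)

# Step 2: after annotating with A1 and A2 the verbs have identical contexts.
abstraction = {"Mary": "A1", "Sue": "A1", "John": "A2", "Curt": "A2"}
annotated = [[abstraction.get(tok, tok) for tok in sent] for sent in raw]
d_verbs_annotated = context_distance("loves", "hates", annotated)
```

On the raw data `d_names` is smaller than `d_verbs_raw`, and after annotation `d_verbs_annotated` drops to zero, which is the sense in which abstraction begets more abstraction.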
Algorithm [Silvescu and Honavar, 2003]
repeat
  top_ka_abstractions = Abstract(sequence_data)
  top_ks_structures = SuperStructure(sequence_data)
  sequence_data = Annotate sequence_data with the new abstractions and structures
until a limit criterion has been reached
• SuperStructure (S -> AB) - returns the topmost ks structures made out of two components, ranked according to an (independence) measure of whether A and B occur together by chance (e.g., KL(P(AB)||P(A)P(B)))
• Abstraction (S -> A | B) - returns the topmost ka abstractions (clusters) of two entities, ranked according to the similarity between the probability distributions of their left and right contexts (e.g., JensenShannon(context(A), context(B)))
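For reference, the two measures named above have the following standard definitions for discrete distributions P and Q over the same outcomes. How the slides instantiate P(AB), P(A) and P(B) from counts is not stated, so any concrete estimator is an assumption.

```latex
\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)},
\qquad
\mathrm{JS}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\quad M = \tfrac{1}{2}\,(P + Q)
```

KL is zero exactly when P = Q, so KL(P(AB)||P(A)P(B)) is zero when A and B are independent; JS is symmetric and bounded, which makes it convenient as a distance between context distributions.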
Validation: “Usefulness for Predictive tasks within the application domain” • Application domains examples: text data from news and protein sequences • Validation: use the structures and abstractions as additional features for learning predictive tasks within the domain: • Text categorization – for text data • Function prediction – for protein sequences • Experiments aimed at assessing the contribution of the new features vs. the basic ones (words for text and amino-acids for protein sequences respectively )
Summary • An algorithm that • puts together Super-Structuring and Abstraction • re-annotates the data set after deriving a set of Abstractions and Super-Structures, allowing further ontology elicitation that would not be possible on the raw data.
Outline • Introduction, Motivation and Philosophy • Abstraction • Super-Structuring • Ontology Usage • Ontology Elicitation • Conclusions
Abstraction • Entities that occur in similar contexts should be abstracted together • We will explore next one way to “operationalize” this intuition in a class-conditional scenario • More exactly, suppose we have a set of mushrooms that are either edible or poisonous, and each mushroom is represented by a set of values associated with nominal attributes such as: • odor (9), gill attachment (4), gill spacing (3), … where the numbers in parentheses are the number of values that each attribute can take • Then we can ask how “similar” the odor values are in the context of the associated mushrooms being poisonous or edible
Class Context Distributions • y ~ P(Class|Odor=y) = (.7, .3) • s ~ P(Class|Odor=s) = (.68, .32) • Distance(y,s) = JS(P(Class|Odor=y), P(Class|Odor=s))
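A minimal sketch of this distance, using the two class distributions from the slide. The third distribution `p_a` is a made-up value added only for contrast, and base-2 logarithms are an assumption.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_y = (0.70, 0.30)  # P(Class | Odor = y), from the slide
p_s = (0.68, 0.32)  # P(Class | Odor = s), from the slide
p_a = (0.05, 0.95)  # hypothetical distribution for a dissimilar odor value

d_ys = js(p_y, p_s)  # small: y and s are good candidates for one abstraction
d_ya = js(p_y, p_a)  # much larger: y and a should stay apart
```

Because the two slide distributions differ by only 0.02 in each component, `d_ys` is close to zero, while `d_ya` is orders of magnitude larger.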
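AVT-Learner (Abstractions) [Kang, Silvescu, Zhang and Honavar, 2004] • [Figure: attribute value taxonomy over Odor values, annotated “Most similar!”. Root: Odor. Internal nodes: {m,s,y,f,c,p}, {s,y,f,c,p}, {s,y,f}, {a,l,n}, {s,y}, {c,p}, {a,l}. Leaves: {m}, {y}, {s}, {f}, {c}, {p}, {a}, {l}, {n}.]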
Using Abstractions • Annotate the data with the set of features given by abstractions and feed it to a classifier (Propositionalization). • For each abstraction add a binary feature that indicates whether a value from that abstraction is present in the current example • Use algorithms that are especially designed to handle abstractions
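The propositionalization step can be sketched as follows. The abstraction sets are taken from the Odor taxonomy in this deck; the feature names and the dictionary layout are illustrative assumptions.

```python
def propositionalize(example, abstractions):
    """Add one binary feature per abstraction.

    example:      dict mapping attribute name -> nominal value
    abstractions: dict mapping feature name -> (attribute name, set of values)
    """
    return {
        name: int(example.get(attribute) in values)
        for name, (attribute, values) in abstractions.items()
    }

# Abstractions over odor values (sets as listed in the taxonomy slide).
odor_abstractions = {
    "odor_in_sy": ("odor", {"s", "y"}),
    "odor_in_syf": ("odor", {"s", "y", "f"}),
    "odor_in_cp": ("odor", {"c", "p"}),
    "odor_in_al": ("odor", {"a", "l"}),
}

mushroom = {"odor": "y", "gill_spacing": "c"}
features = propositionalize(mushroom, odor_abstractions)
annotated = {**mushroom, **features}  # original attributes plus the new binary ones
```

The annotated example can then be passed to any standard classifier that handles nominal or binary attributes.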
Using Ontologies • Propositionalization • Specially designed algorithms
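Naïve Bayes classifiers • [Figure: Naïve Bayes network with class node Edibility and attribute nodes Gill spacing, Gill attachment, Odor] • Let C = {c1,…,ck} be the set of all possible classes and x = <a1,…,an> a new example to be classified. • $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(a_i \mid c_j)$ • where the probabilities are estimated from the data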
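A minimal sketch of a Naïve Bayes classifier over nominal attributes. The slides say only that the probabilities are estimated from the data; Laplace smoothing (`alpha`) and the toy mushroom rows below are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
        self.values = defaultdict(set)            # attribute index -> values seen in training
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def log_score(self, row, label):
        """log P(c) + sum_i log P(a_i | c), with smoothed estimates."""
        n = sum(self.class_counts.values())
        score = log(self.class_counts[label] / n)
        for i, value in enumerate(row):
            num = self.value_counts[(i, label)][value] + self.alpha
            den = self.class_counts[label] + self.alpha * len(self.values[i])
            score += log(num / den)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda c: self.log_score(row, c))

# Toy rows of (odor, gill spacing); values are invented for illustration.
rows = [("f", "c"), ("y", "c"), ("s", "w"), ("a", "c"), ("l", "w"), ("n", "w")]
labels = ["poisonous", "poisonous", "poisonous", "edible", "edible", "edible"]
model = NaiveBayes().fit(rows, labels)
```

Working in log space avoids underflow when the number of attributes n is large.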
AVT-Based Learning Example [Zhang, Silvescu and Honavar, 2002][Zhang and Honavar 2003;2004]
Outline • Introduction, Motivation and Philosophy • Abstraction • Super-Structuring • Ontology Elicitation Framework • Ontology Usage • Conclusions
SuperStructuring • SuperStructures – Proximal entities should be SuperStructured into a higher level unit if they occur together in the data significantly above chance levels. • Proximity , Topology • Independence tests KL( P(AB)||P(A)P(B) ) • Compression
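One possible reading of the independence test for sequence data is sketched below. It assumes that for a candidate pair (A, B) the joint distribution is over two binary indicators at adjacent positions (left token is A, right token is B), so the KL score equals the mutual information of the indicators. Only pairs that co-occur above chance are kept. The toy sentences are invented.

```python
from collections import Counter
from math import log2

def superstructure_scores(sequences):
    """Score adjacent pairs (A, B) by KL(P(X, Y) || P(X) P(Y)), where
    X = [left token is A] and Y = [right token is B]."""
    bigrams = Counter()
    for seq in sequences:
        bigrams.update(zip(seq, seq[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first[a] += count
        second[b] += count
    scores = {}
    for (a, b), n_ab in bigrams.items():
        pa, pb = first[a] / n, second[b] / n
        if n_ab / n <= pa * pb:
            continue  # not above chance: skip
        joint = {
            (1, 1): n_ab / n,
            (1, 0): (first[a] - n_ab) / n,
            (0, 1): (second[b] - n_ab) / n,
            (0, 0): (n - first[a] - second[b] + n_ab) / n,
        }
        kl = 0.0
        for (x, y), pxy in joint.items():
            px = pa if x else 1 - pa
            py = pb if y else 1 - pb
            if pxy > 0:
                kl += pxy * log2(pxy / (px * py))
        scores[(a, b)] = kl
    return scores

sentences = [
    ["new", "york", "stock", "rose"],
    ["new", "york", "shares", "fell"],
    ["stock", "fell", "in", "new", "york"],
    ["shares", "rose"],
]
scores = superstructure_scores(sentences)
best = max(scores, key=scores.get)
```

In this toy corpus "new" is always followed by "york", so that pair receives the highest score and would be the first candidate super-structure.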
Experimental setup • Sequence data => natural topology: left, right • We ran Super-Structuring 10 times, each time selecting the 100 best-scoring pairs according to the KL score, for a total of 1000 new features • We then fed the resulting dataset, annotated with these 1000 new features, to a Naïve Bayes classifier
Experimental Results • On protein sequences – function prediction: • Selected ~ 1000 functional classes from SWISSPROT that contain more than 10 proteins in each class • Super-Structuring improves prediction by 6.54 % (10-fold cross-validation) vs. Naïve Bayes • On text data – text classification: • Reuters 21578 news documents • ~12000 of them are assigned to classes based on the type of news they report (e.g., earn, corn, grain, …) • Super-Structuring improves prediction by 1.43 % (10-fold cross-validation) vs. Naïve Bayes
Generative Models for Sequence Classification MFDLSKFLPVITPLMIDTAKLCMSSAVSAY … -> Ligase MGLGWLLYQMGYVKKDFIANQTEGFEDWLA … -> Kinase Naïve Bayes Case:
Naïve Bayes k-grams – Ontology usage • [Figure: Naïve Bayes over single symbols S1, …, S6 compared with Naïve Bayes over overlapping pairs S1S2, S2S3, S3S4, S4S5, S5S6]
Naïve Bayes ^ k (NBk) [Silvescu, Andorf, Dobbs and Honavar 2004] [Andorf, Silvescu, Dobbs and Honavar 2004] • [Figure: Naïve Bayes over S1, …, S6 compared with NBk over overlapping pairs S1S2, …, S5S6 with shared symbols S2, S3, S4, S5; labelled JTT]
Markov Models of size k • [Figure: chain over S1, …, S6]
Markov Models of size k - General • [Figure: dependency structure over Si-k, …, Si-1, Si]
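A generative sequence classifier of this kind can be sketched as a class-conditional Markov model in which each symbol depends on the k preceding symbols. This is a plain order-k Markov model with Laplace smoothing, not the authors' NBk implementation; the padding symbol `^`, the smoothing and the two-letter toy sequences are assumptions.

```python
from collections import Counter, defaultdict
from math import log

class MarkovClassifier:
    def __init__(self, k=1, alphabet_size=20, alpha=1.0):
        self.k = k                          # number of preceding symbols used as context
        self.alphabet_size = alphabet_size  # 20 for amino acids; 2 in the toy example
        self.alpha = alpha                  # Laplace smoothing constant (an assumption)

    def _events(self, seq):
        padded = "^" * self.k + seq         # pad so the first symbols have a context
        for i in range(self.k, len(padded)):
            yield padded[i - self.k:i], padded[i]

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.ngram = defaultdict(Counter)    # class -> counts of (context, symbol)
        self.context = defaultdict(Counter)  # class -> counts of context
        for seq, label in zip(sequences, labels):
            for ctx, sym in self._events(seq):
                self.ngram[label][(ctx, sym)] += 1
                self.context[label][ctx] += 1
        return self

    def log_score(self, seq, label):
        """log P(c) + sum_i log P(s_i | s_{i-k} ... s_{i-1}, c)."""
        n = sum(self.class_counts.values())
        score = log(self.class_counts[label] / n)
        for ctx, sym in self._events(seq):
            num = self.ngram[label][(ctx, sym)] + self.alpha
            den = self.context[label][ctx] + self.alpha * self.alphabet_size
            score += log(num / den)
        return score

    def predict(self, seq):
        return max(self.class_counts, key=lambda c: self.log_score(seq, c))

# Toy two-letter sequences; class X alternates symbols, class Y repeats them.
train = ["ABABABAB", "BABABA", "AABBAABB", "BBAABBAA"]
labels = ["X", "X", "Y", "Y"]
model = MarkovClassifier(k=1, alphabet_size=2).fit(train, labels)
```

With k = 0 the context is empty and the model reduces to Naïve Bayes over single symbols, which cannot separate these two toy classes because both use A and B equally often.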
Experimental Results – Protein function classification (I) Results for the Kinase (GO0004672) dataset. There were a total of 396 proteins and five classes. Species refers to what species the sequences came from. K refers to the number of dependencies used in each algorithm.
Experimental Results – Protein function classification (II) Results for the Kinase (GO0004672) dataset. There were a total of 396 proteins and five classes. Species refers to what species the sequences came from. K refers to the number of dependencies used in each algorithm.
Experimental Results – Protein function classification (III) Results for the Kinase(GO0004672) / Ligase(GO001684) dataset. There was a total of 396 proteins and two classes. Species refers to what species the sequences came from. K refers to the number of dependencies used in each algorithm.
Outline • Introduction, Motivation and Philosophy • Abstraction • Super-Structuring • Ontology Usage • Ontology Elicitation Framework • Conclusions
AVT-MMk • Stopping criterion for Cut Pushing: MDL Score • [Figure: Markov model dependency structure over Si-k, …, Si-1, Si]
The MDL Score • [Equation: MDL score of M given D] where M is the model and D is the data • In the classification scenario we use CMDL, which replaces the log likelihood LL with the conditional log likelihood and applies the penalty factor only to the part of the model that contains class dependencies
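An MDL score is commonly written as a penalized log likelihood. One standard form is sketched here as an assumption; it is not necessarily the exact expression or sign convention used in the slides. |M| denotes the number of model parameters, |D| the number of training examples, and |M_class| the number of parameters that involve the class variable.

```latex
\mathrm{MDL}(M \mid D) = LL(D \mid M) - \frac{\log |D|}{2}\,|M|,
\qquad
\mathrm{CMDL}(M \mid D) = CLL(D \mid M) - \frac{\log |D|}{2}\,|M_{\mathrm{class}}|
```

Under this convention larger scores are better: a refinement of the model is accepted only if the gain in (conditional) log likelihood outweighs the penalty for the added parameters.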
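Taxonomies over k-grams • [Figure: sequence positions Si-k, …, Si, Si+1]
Multiple Taxonomies over 1-grams • [Figure: sequence positions Si-k, …, Si, Si+1]
AVT-Bayes Networks • [Figure: nodes S1, S2, S3, S4 and a Bayesian network over A, B, C, D, E]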
Outline • Introduction, Motivation and Philosophy • Abstraction • Super-Structuring • Ontology Usage • Ontology Elicitation Framework • Conclusions
Ontology elicitation framework • Repeat • 1. Propose New Concepts - Using the Data • e.g., by using Abstraction and SuperStructuring • 2. Evaluate the desirability of the proposed New Concepts according to a Scoring Function • 3. Select Posited Concepts according to the ranking of the New Concepts by the Scoring Function • 4. Annotate Data with Posited Concepts • until a termination criterion is reached
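The four steps above can be sketched as a generic loop with pluggable proposal, scoring and annotation functions. The plug-ins shown here are deliberately simple stand-ins: they propose adjacent pairs, score them by raw frequency instead of KL, and annotate by merging the pair into one token. All function names and defaults are illustrative assumptions.

```python
def elicit(data, propose, score, annotate, top_k=1, min_score=1, max_rounds=10):
    """Repeat: propose, score, select, annotate, until nothing scores above min_score."""
    posited = []
    for _ in range(max_rounds):
        scores = {c: score(c, data) for c in propose(data)}           # steps 1 and 2
        ranked = sorted(scores, key=lambda c: (-scores[c], repr(c)))  # ties broken by name
        selected = [c for c in ranked[:top_k] if scores[c] > min_score]  # step 3
        if not selected:
            break                                                     # termination criterion
        data = annotate(data, set(selected))                          # step 4
        posited.extend(selected)
    return data, posited

def propose_pairs(seqs):
    """Candidate super-structures: every adjacent pair of tokens."""
    return {pair for seq in seqs for pair in zip(seq, seq[1:])}

def count_score(pair, seqs):
    """Stand-in scoring function: number of occurrences of the pair."""
    return sum(1 for seq in seqs for p in zip(seq, seq[1:]) if p == pair)

def merge_pairs(seqs, pairs):
    """Annotate by replacing each selected pair with a single tuple token."""
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) in pairs:
                out.append((seq[i], seq[i + 1]))
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

# Sequences already annotated with abstractions A1, A3, A2.
step3 = [["A1", "A3", "A2"], ["A1", "A3", "A2"], ["A1", "A3", "A2"]]
final_data, posited = elicit(step3, propose_pairs, count_score, merge_pairs)
```

On this input the loop first posits the pair (A1, A3) and then the pair of that structure with A2, after which no candidates remain and the loop stops. Swapping in a KL-based or MDL-based scoring function changes only the `score` argument.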
Issues • Controlling the Ontology Elicitation process • 1. Mixing of Abstractions and Superstructuring at each step - base case • 2. Use Superstructures in order to improve modeling accuracy and abstractions in order to decrease model size and gain generalization accuracy [i.e., associated with better statistical significance due to larger sample size]
Types of Ontology elicitation • Model Free vs. Model Based • Model free Scores: such as KL and JS • Likelihood / MDL Score • Conditional vs. Unconditional • The KL and JS [Model Free] and MDL [Model Based] become class conditional
Other Topics under exploration • Probabilistic Model for Shallow parsing • Conditional Probabilistic Models & Regularization • Relational Learning
A Probabilistic Generative Model for Shallow Parsing
Current and Future work • Concept formation - Identifying the extensions of well-separated concepts from the structures and abstractions derived by the proposed method, and further generalizing these concept extensions by learning the corresponding intensional definitions, thus completing the ontology elicitation process. • Other operations - generalizing the approach to incorporate more operations (such as sub-structuring, maybe causality) • More tasks - validation on more predictive tasks (interaction site prediction and information extraction respectively) and on more than one task at a time (~ multitask learning) + incorporation of the task(s) bias into the feature / ontology derivation loop. • Arbitrary Topologies - extending the method from sequences to arbitrary topologies (e.g., graphs, in order to accommodate relational models) • More Domains - Exploring other application domains such as Metabolic Pathways, reinforcement learning, … • More ways to use ontologies
Summary • A framework for Ontology Elicitation • Learning Algorithms that use ontologies • Ontologies • improve classification accuracy • contribute to the comprehensibility of the model / domain
Timeline • 1. [Nov04-Dec04] Experiments with Model free ontology Elicitation • 2. [Nov04-Jan05] Experiments with AVT-NBk and AVT-BN • 3. [Nov04-Feb05] Experiments with the Probabilistic model for Shallow Parsing • 4. [Jan05-Feb05] Write paper on Model free ontology Elicitation • 5. [Jan05-Mar05] Write paper on AVT-NBk and AVT-BN • 6. [Feb05-Apr05] Write paper on the Probabilistic model for Shallow Parsing • 7. [Summer05, Fall05] Write thesis • 8. [Fall05] Defend
Publications [On Ontology Elicitation and usage] • [Silvescu and Honavar, 2004b] Silvescu, A. and Honavar, V. (2004). A Graphical Model for Shallow Parsing Sequences. In: Proceedings of the AAAI-04 Workshop on Text Extraction and Mining, San Jose, California. • [Silvescu et al., 2004a] Silvescu, A., Andorf, C., Dobbs, D., and Honavar, V. (2004). Learning NBk. To be submitted to SIGMOD/PODS, Baltimore, Maryland. • [Andorf et al., 2004] Andorf, C., Silvescu, A., Dobbs, D. and Honavar, V. (2004). Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families. In: Fifth International Conference on Knowledge Based Computer Systems (KBCS 2004). India. • [Kang et al., 2004] Kang, D-K., Silvescu, A., Zhang, J. and Honavar, V. (2004). Generation of Attribute Value Taxonomies from Data for Accurate and Compact Classifier Construction. In: IEEE International Conference on Data Mining. • [Silvescu and Honavar, 2003] Silvescu, A. and Honavar, V. (2003). Ontology elicitation: Structural Abstraction = Structuring + Abstraction + Multiple Ontologies. Learning@Snowbird Workshop, Snowbird, Utah. • [Zhang et al., 2002] Zhang, J., Silvescu, A., and Honavar, V. (2002). Ontology-Driven Induction of Decision Trees at Multiple Levels of Abstraction. In: Proceedings of Symposium on Abstraction, Reformulation, and Approximation. Berlin: Springer-Verlag.