Learning the Semantic Meaning of a Concept from the Web

Yang Yu, Master’s Thesis Defense, August 03, 2006

The Problem
• Manually preparing training data for text-classification-based ontology mapping is expensive.
[Figure: LIVING_THINGS ontology with the classes LIVING_THINGS, ANIMAL, PLANT, HUMAN, CAT, TREE, GRASS, MAN, WOMAN, ARBOR, FRUTEX]

The Thesis
• Automatically collect training data for the concepts defined in an ontology.
• Benefits
  • Reduce the amount of human work
  • Fully automated ontology mapping
[Figure: search engine, http://www.google.com/]

Overview
• Background
  • The Semantic Web and ontology
  • Ontology mapping
• Proposal
• System
• Experimental results
  • WEAPONS ontology
  • LIVING_THINGS ontology
• Discussion and conclusion

Semantic Web and Ontology
• What is it?
  • “an extension of the current web”
• An example
  • Query: Find all types of jets that are made in the USA
[Figure: example graph with the relations Made-in (target WA) and WA partOf USA]

Ontology Mapping
• Interoperability problem
  • Independently developed ontologies for the same or overlapping domains
• Mapping
  • r = f(Ci, Cj), where i = 1, …, n and j = 1, …, m
  • r ∈ {equivalent, subClassOf, superClassOf, complement, overlapped, other}

Approaches to Ontology Mapping
• Manual mapping
• String matching
• Text classification
  • The semantic meaning of a concept is reflected in the training data that uses the concept
  • Probabilistic feature model
  • Classification
  • Results depend heavily on the training data

Motivation
• Preparing exemplars manually is costly
• Billions of documents are available on the web
• Search engines

The Proposal
• Use each concept defined in an ontology as a query and process the search results to obtain exemplars
• Verification
  • Build a prototype system
  • Check ontology mapping results

System overview – Part I
[Figure: pipeline. Ontology A → Parser → Queries → Retriever (Search Engine) → Links to Web Pages → Retriever (WWW) → HTML Docs → Processor → Text Files]

The parser (Query expansion)
Concept        Query
living+things  living+things
animal         living+things+animal
plant          living+things+plant
cat            living+things+animal+cat
human          living+things+animal+human
man            living+things+animal+human+man
woman          living+things+animal+human+woman
tree           living+things+plant+tree
grass          living+things+plant+grass
frutex         living+things+plant+tree+Frutex
arbor          living+things+plant+tree+arbor
[Figure: example hierarchy with the classes FOOD, FRUIT, APPLE, ORANGE; the query for APPLE is FOOD+FRUIT+APPLE]
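The query table above follows one rule: a concept's query is the chain of class names from the root down to the concept, joined with "+". A minimal Python sketch of that rule follows. The `PARENT` dictionary and the `expand_query` function are hypothetical names introduced only for illustration, and the child-to-parent links are read off the queries in the table; this is not the thesis's actual parser.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent encoding of the LIVING_THINGS hierarchy,
# inferred from the queries on the slide (an assumption, not the real parser input).
PARENT: Dict[str, Optional[str]] = {
    "living things": None,
    "animal": "living things",
    "plant": "living things",
    "cat": "animal",
    "human": "animal",
    "man": "human",
    "woman": "human",
    "tree": "plant",
    "grass": "plant",
    "frutex": "tree",
    "arbor": "tree",
}


def expand_query(concept: str, parent: Dict[str, Optional[str]]) -> str:
    """Join the class names on the path from the root to `concept` with '+'."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        # Spaces inside a class name also become '+', as in "living+things".
        path.append(node.replace(" ", "+"))
        node = parent[node]
    return "+".join(reversed(path))


queries = {concept: expand_query(concept, PARENT) for concept in PARENT}
```

Including the ancestors in the query narrows the search to the intended sense of a word such as "arbor", at the cost of longer queries for deep classes.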

The retriever

The processor

Naïve Bayes text classifier
• Bow toolkit
  • McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
  • rainbow -d model --index dir/*
  • rainbow -d model --query
• Bayes Rule
• Naïve Bayes text classifier

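Bayes Rule
$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}$$
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
• P(A | B): posterior
• P(A): prior
• P(B): normalizing constant
[Figure: events A and B, with their overlap labeled P(A, B)]
Source: Mitchell, Tom. Machine Learning. McGraw Hill, 1997.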

Naïve Bayes classifier
• A text classification problem
  • “What’s the most probable classification of the new instance given the training data?”
  • vj: category j
  • (a1, a2, …, an): attributes of a new document
• So naïve
Source: Mitchell, Tom. Machine Learning. McGraw Hill, 1997.
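The equations on this slide did not survive extraction. For reference, the standard textbook formulation in Mitchell (1997), written with the slide's symbols, is sketched below; it is assumed, not confirmed, that the slide showed these same expressions. Here V denotes the set of categories.

```latex
\begin{align}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \\
v_{NB}  &= \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\end{align}
```

The last line is the "naïve" step: the attributes are assumed conditionally independent given the category, so the joint likelihood factors into a product of per-attribute terms.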

System overview – Part II
[Figure: components shown are Ontology A, Ontology B, Text Files (A), Text Files (B), Model Builder, Rainbow (twice), Feature Model, Calculator, Mapping Results]

The model builder
• Mutually exclusive and exhaustive
• Leaf classes
• C+ and C-
[Figure: the LIVING_THINGS ontology shown twice, with the classes LIVING_THINGS, ANIMAL, PLANT, HUMAN, CAT, TREE, GRASS, MAN, WOMAN, ARBOR, FRUTEX]
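One reading of the "leaf classes" bullet is that the leaves of the hierarchy form a mutually exclusive and exhaustive set of categories for the classifier. Under that assumption, the sketch below collects the leaves of the LIVING_THINGS hierarchy. The `CHILDREN` dictionary and `leaf_classes` function are hypothetical, and the parent-child links are inferred from the parser's queries rather than taken from the ontology file.

```python
from typing import Dict, List

# Hypothetical parent -> children encoding, inferred from the parser's queries.
CHILDREN: Dict[str, List[str]] = {
    "LIVING_THINGS": ["ANIMAL", "PLANT"],
    "ANIMAL": ["HUMAN", "CAT"],
    "PLANT": ["TREE", "GRASS"],
    "HUMAN": ["MAN", "WOMAN"],
    "TREE": ["ARBOR", "FRUTEX"],
}


def leaf_classes(root: str, children: Dict[str, List[str]]) -> List[str]:
    """Return the classes under `root` that have no subclasses, depth first."""
    kids = children.get(root, [])
    if not kids:
        return [root]
    leaves: List[str] = []
    for kid in kids:
        leaves.extend(leaf_classes(kid, children))
    return leaves


leaves = leaf_classes("LIVING_THINGS", CHILDREN)
```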

The calculator
• The Naïve Bayes text classifier tends to give extreme values (1/0)
• Tasks
  • Feed exemplars to the classifier one by one
  • Keep records of the classification results
  • Take averages and generate a report

An Example of the Calculator
[Figure: 200 exemplars of APC are fed to the classifier, whose categories are TANK-VEHICLE, AIR-DEFENSE-GUN, and SAUDI-NAVAL-MISSILE-CRAFT]
Category in WeaponsA.n3    Num. of exemplars
TANK-VEHICLE               170
AIR-DEFENSE-GUN            20
SAUDI-NAVAL-MISSILE-CRAFT  10
• P(TANK-VEHICLE | APC) = 170 / 200 = 0.85
• P(AIR-DEFENSE-GUN | APC) = 20 / 200 = 0.10
• P(SAUDI-NAVAL-MISSILE-CRAFT | APC) = 10 / 200 = 0.05
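The arithmetic of this example can be sketched in a few lines of Python. The sketch assumes each exemplar's classification is reduced to a single winning category, which is reasonable when the classifier's outputs are close to 1 or 0; `estimate_conditionals` is a hypothetical helper, not the thesis's calculator.

```python
from collections import Counter
from typing import Dict, Iterable


def estimate_conditionals(predictions: Iterable[str]) -> Dict[str, float]:
    """Estimate P(category | new class) as the fraction of exemplars
    of the new class that the classifier assigned to each category."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one classified exemplar is required")
    return {category: n / total for category, n in counts.items()}


# The 200 APC exemplars from the slide, one predicted category per exemplar.
predictions = (
    ["TANK-VEHICLE"] * 170
    + ["AIR-DEFENSE-GUN"] * 20
    + ["SAUDI-NAVAL-MISSILE-CRAFT"] * 10
)
probs = estimate_conditionals(predictions)
```

With the slide's counts, `probs["TANK-VEHICLE"]` comes out at 0.85, matching P(TANK-VEHICLE | APC) above.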

Experiments with WEAPONS ontology
• Information Interpretation and Integration Conference (http://www.atl.lmco.com/projects/ontology/i3con.html)
• WeaponsA.n3 and WeaponsB.n3
  • Each defines over 80 classes
  • More than 60 classes are leaf classes
  • Similar structure

Part of WeaponsA.n3
[Figure: partial class hierarchy of WeaponsA.n3. Classes shown: WEAPON, CONVENTIONAL-WEAPON, ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, WARPLANE, SUPER-ETENDARD, AIRCRAFT-CARRIER, PATROL-CRAFT, TANK-VEHICLE]

Part of WeaponsB.n3
[Figure: partial class hierarchy of WeaponsB.n3. Classes shown: WEAPON, CONVENTIONAL-WEAPON, ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, WARPLANE, FIGHTER-PLANE, AIRCRAFT-CARRIER, PATROL-WATERCRAFT, TANK-VEHICLE, FIGHTER-ATTACK-PLANE, LIGHT-TANK, APC, LIGHT-AIRCRAFT-CARRIER, PATROL-BOAT-RIVER, PATROL-BOAT, SUPER-ETENDARD-FIGHTER]

Expected Results
[Figure: the classes AIRCRAFT-CARRIER, PATROL-CRAFT, SUPER-ETENDARD, and TANK-VEHICLE shown together with classes from part of WeaponsB.n3: FIGHTER-PLANE, LIGHT-AIRCRAFT-CARRIER, PATROL-WATERCRAFT, APC, FIGHTER-ATTACK-PLANE, LIGHT-TANK, SUPER-ETENDARD-FIGHTER, PATROL-BOAT-RIVER, PATROL-BOAT]

A Typical Report
• P(APC | Ci), where i = 1, …, 63

Classes with highest conditional probability
New class               Whole file            Prob  Sentences with keywords   Prob
LIGHT-AIRCRAFT-CARRIER  AIRCRAFT-CARRIER      0.65  AIRCRAFT-CARRIER          0.57
APC                     SILKWORM-MISSILE-MOD  0.46  SELF-PROPELLED-ARTILLERY  0.36
SUPER-ETENDARD-FIGHTER  SILKWORM-MISSILE-MOD  0.66  MRBM                      0.51
FIGHTER-ATTACK-PLANE    SILKWORM-MISSILE-MOD  0.83  MRBM                      0.38
PATROL-WATERCRAFT       SILKWORM-MISSILE-MOD  0.28  PATROL-CRAFT              0.52
PATROL-BOAT-RIVER       SILKWORM-MISSILE-MOD  0.65  PATROL-CRAFT              0.54
PATROL-BOAT             SILKWORM-MISSILE-MOD  0.51  PATROL-CRAFT              0.66
LIGHT-TANK              SILKWORM-MISSILE-MOD  0.56  TANK-VEHICLE              0.3
FIGHTER-PLANE           AIRCRAFT-CARRIER      0.49  MRBM                      0.38
Slide callouts:
• P(TANK-VEHICLE | APC) = 0.28
• P(SUPER-ETENDARD | SUPER-ETENDARD-FIGHTER) = 0.21

Different numbers of exemplars (whole)
New class               Group-whole-50        Prob  Group-whole-100       Prob
LIGHT-AIRCRAFT-CARRIER  SILKWORM-MISSILE-MOD  0.60  AIRCRAFT-CARRIER      0.65
APC                     SILKWORM-MISSILE-MOD  0.65  SILKWORM-MISSILE-MOD  0.46
SUPER-ETENDARD-FIGHTER  SILKWORM-MISSILE-MOD  0.74  SILKWORM-MISSILE-MOD  0.66
FIGHTER-ATTACK-PLANE    SILKWORM-MISSILE-MOD  0.83  SILKWORM-MISSILE-MOD  0.83
PATROL-WATERCRAFT       SILKWORM-MISSILE-MOD  0.64  SILKWORM-MISSILE-MOD  0.28
PATROL-BOAT-RIVER       SILKWORM-MISSILE-MOD  0.89  SILKWORM-MISSILE-MOD  0.65
PATROL-BOAT             SILKWORM-MISSILE-MOD  0.64  SILKWORM-MISSILE-MOD  0.51
LIGHT-TANK              SILKWORM-MISSILE-MOD  0.62  SILKWORM-MISSILE-MOD  0.56
FIGHTER-PLANE           SILKWORM-MISSILE-MOD  0.80  AIRCRAFT-CARRIER      0.49

Different numbers of exemplars (sentence)
New class               Group-sentence-50   Prob  Group-sentence-100        Prob
LIGHT-AIRCRAFT-CARRIER  AIRCRAFT-CARRIER    0.44  AIRCRAFT-CARRIER          0.57
APC                     TANK-VEHICLE        0.54  SELF-PROPELLED-ARTILLERY  0.36
SUPER-ETENDARD-FIGHTER  HY-4-C-201-MISSILE  0.4   MRBM                      0.51
FIGHTER-ATTACK-PLANE    ICBM                0.19  MRBM                      0.38
PATROL-WATERCRAFT       PATROL-CRAFT        0.49  PATROL-CRAFT              0.52
PATROL-BOAT-RIVER       PATROL-CRAFT        0.36  PATROL-CRAFT              0.54
PATROL-BOAT             PATROL-CRAFT        0.37  PATROL-CRAFT              0.66
LIGHT-TANK              TANK-VEHICLE        0.59  TANK-VEHICLE              0.3
FIGHTER-PLANE           MRBM                0.38  MRBM                      0.38

Comparison of mapping accuracy of different groups of experiments
Group of experiments  Mapping accuracy (judged by whether the desired class was mapped)
Group-whole-50        0%
Group-whole-100       11%
Group-sentence-50     67%
Group-sentence-100    56%
Slide annotation: Higher Conditional Probability
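The accuracy figures can be checked against the per-class results reported earlier. The sketch below recomputes the Group-sentence-50 figure. The `expected` mapping is an assumption reconstructed from the class names, since the Expected Results figure did not survive as text; in particular, the targets assumed for the two fighter classes are a guess. `mapping_accuracy` is a hypothetical helper.

```python
from typing import Dict


def mapping_accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of new classes whose top-ranked class is the desired one."""
    hits = sum(
        1 for new_class, target in expected.items()
        if predicted.get(new_class) == target
    )
    return hits / len(expected)


# Top-ranked class per new class, copied from the Group-sentence-50 column.
group_sentence_50 = {
    "LIGHT-AIRCRAFT-CARRIER": "AIRCRAFT-CARRIER",
    "APC": "TANK-VEHICLE",
    "SUPER-ETENDARD-FIGHTER": "HY-4-C-201-MISSILE",
    "FIGHTER-ATTACK-PLANE": "ICBM",
    "PATROL-WATERCRAFT": "PATROL-CRAFT",
    "PATROL-BOAT-RIVER": "PATROL-CRAFT",
    "PATROL-BOAT": "PATROL-CRAFT",
    "LIGHT-TANK": "TANK-VEHICLE",
    "FIGHTER-PLANE": "MRBM",
}

# ASSUMPTION: desired targets inferred from the class names, not given in the text.
expected = {
    "LIGHT-AIRCRAFT-CARRIER": "AIRCRAFT-CARRIER",
    "APC": "TANK-VEHICLE",
    "SUPER-ETENDARD-FIGHTER": "SUPER-ETENDARD",
    "FIGHTER-ATTACK-PLANE": "SUPER-ETENDARD",
    "PATROL-WATERCRAFT": "PATROL-CRAFT",
    "PATROL-BOAT-RIVER": "PATROL-CRAFT",
    "PATROL-BOAT": "PATROL-CRAFT",
    "LIGHT-TANK": "TANK-VEHICLE",
    "FIGHTER-PLANE": "SUPER-ETENDARD",
}

accuracy = mapping_accuracy(group_sentence_50, expected)
```

Under these assumptions six of the nine new classes map to the desired class, which rounds to the 67% reported for Group-sentence-50.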

Experiment with LIVING_THINGS ontology
• P(MAN | HUMAN)
• P(WOMAN | HUMAN)
• Find a mapping for GIRL
[Figure: HUMAN with subclasses MAN and WOMAN]

Actual Experiment Results: L-1
Results of experiment (1)
Conditional probability  First 50 exemplars  First 100 exemplars  First 200 exemplars
P(MAN | HUMAN)           0.75                0.58                 0.62
P(WOMAN | HUMAN)         0.24                0.41                 0.38
[Figure: HUMAN with subclasses MAN and WOMAN]

Actual Experiment Results: L-3
Comparison between different numbers of exemplars (sentence)
Conditional probability  First 50 exemplars  First 100 exemplars  First 200 exemplars
P(ANIMAL | GIRL)         0.66                0.53                 0.77
P(PLANT | GIRL)          0.34                0.47                 0.23
P(HUMAN | GIRL)          0.86                0.56                 0.43
P(CAT | GIRL)            0.01                0.15                 0.01
P(DOG | GIRL)            0.13                0.29                 0.56
P(PYCNOGONID | GIRL)     0                   0                    0
P(MAN | GIRL)            0.02                0.03                 0
P(WOMAN | GIRL)          0.98                0.97                 1

Actual Experiment Results: Different Queries
Queries augmented with class properties
Concept        Query
living+things  Living+things
animal         Living+things+animal+Animalia
plant          Living+things+plant+Plantae
cat            Living+things+animal+Animalia+cat+Felidae
human          Living+things+animal+Animalia+human+intelligent
man            Living+things+animal+Animalia+human+intelligent+man+male
woman          Living+things+animal+Animalia+human+intelligent+woman+female
tree           Living+things+plant+Plantae+tree
grass          Living+things+plant+Plantae+grass
frutex         Living+things+plant+Plantae+tree+Frutex
arbor          Living+things+plant+Plantae+tree+arbor

Actual Experiment Results: L-4
Results of experiment (1) with new queries
Conditional probability  Whole  Keyword sentences
P(MAN | HUMAN)           0.91   0.93
P(WOMAN | HUMAN)         0.09   0.07
Results of experiment (2) with new queries
Conditional probability  Whole  Keyword sentences
P(ANIMAL | GIRL)         0.9    0.83
P(PLANT | GIRL)          0.1    0.17
P(HUMAN | GIRL)          0.78   0.83
P(CAT | GIRL)            0.22   0.17
P(MAN | GIRL)            0.14   0.16
P(WOMAN | GIRL)          0.86   0.84
[Figure: HUMAN with subclasses MAN and WOMAN]

Limitation 1: An exemplar is not a sample of a concept
• An exemplar is a combination of strings that represents some usage of a concept.
• An exemplar is not an instance of a concept.
• The conditional probability we calculate is therefore only an estimate.
[Figure: HUMAN with subclasses MAN and WOMAN]

Limitation 2: Popularity does not equal relevancy
• Limited by a search engine’s algorithm
  • PageRank™
  • Popularity does not equal relevancy
• Weights cannot be specified for words in a search query

Limitation 3: Relevancy does not equal similarity
[Figure: the search results for concept A contain text related to concept A, which includes text against concept A, text for concept A (i.e., the desired exemplars), and text for a related concept B]

Related Research
• UMBC OntoMapper
  • Prasad, Sushama; Peng, Yun; and Finin, Tim. A Tool for Mapping between Two Ontologies Using Explicit Information. AAMAS 2002 Workshop on Ontologies and Agent Systems, 2002.
• CAIMEN
  • Lacher, Martin S. and Groh, Georg. Facilitating the Exchange of Explicit Knowledge through Ontology Mappings. Proc. of the Fourteenth International FLAIRS Conference, 2001.
• GLUE
  • Doan, AnHai; Madhavan, Jayant; Dhamankar, Robin; Domingos, Pedro; and Halevy, Alon. Learning to Match Ontologies on the Semantic Web. WWW2002, May 2002.
• Google Conditional Probability
  • P(HUMAN | MAN) = 1.77 billion / 2.29 billion = 0.77
  • P(HUMAN | WOMAN) = 0.6 billion / 2.29 billion = 0.26
  • Wyatt, D.; Philipose, M.; and Choudhury, T. Unsupervised Activity Recognition Using Automatically Mined Common Sense. Proceedings of AAAI-05, pp. 21-27.
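The Google conditional probability figures above are ratios of search hit counts. A minimal sketch of that arithmetic follows; the counts are the ones quoted on the slide, which term each count belongs to is taken from the slide as given, and `hit_count_ratio` is a hypothetical helper that does not query any search engine.

```python
def hit_count_ratio(numerator_hits: float, denominator_hits: float) -> float:
    """Estimate a conditional probability as a ratio of two search hit counts."""
    if denominator_hits <= 0:
        raise ValueError("denominator hit count must be positive")
    return numerator_hits / denominator_hits


# Hit counts as quoted on the slide (in absolute numbers of pages).
p_human_given_man = hit_count_ratio(1.77e9, 2.29e9)
p_human_given_woman = hit_count_ratio(0.6e9, 2.29e9)
```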

Conclusion and Future Work
• Text retrieved from the web can be used as exemplars for text-classification-based ontology mapping
• Many parameters affect the quality of the exemplars
• The processed documents contain noise
• Future work
  • Clustering

Questions
