
Learning the Semantic Meaning of a Concept from the Web


Presentation Transcript


  1. Learning the Semantic Meaning of a Concept from the Web Yang Yu and Yun Peng May 30, 2007 yangyu1@umbc.edu, ypeng@umbc.edu

  2. The Problem • Manually preparing training data (exemplars) for each concept in text-classification-based ontology mapping is expensive. [Figure: the LIVING_THINGS taxonomy (ANIMAL, PLANT, HUMAN, CAT, TREE, GRASS, MAN, WOMAN, ARBOR, FRUTEX), with each concept needing its own exemplars]

  3. Our Approach • Automatically collecting training data from the web (e.g., via http://www.google.com/) • Benefits • Reduces the amount of human work

  4. Overview • Background • The Semantic Web and ontology • Ontology Mapping • Approach • Prototype System • Experimental Results • WEAPONS ontology • LIVING_THINGS ontology • Limitations and Conclusions

  5. Semantic Web and Ontology Mapping • The Semantic Web • “an extension of the current web” • ontology files and programs that use them • Ontology Mapping • addresses the interoperability problem • Mapping • r = f(Ci, Cj), where i = 1, …, n and j = 1, …, m • r ∈ {equivalent, subClassOf, superClassOf, complement, overlapped, other}
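
A minimal sketch, not from the paper, of how the mapping r = f(Ci, Cj) and its relation set could be represented in Python (all names here are illustrative):

    from enum import Enum
    from itertools import product

    class Relation(Enum):
        EQUIVALENT = "equivalent"
        SUBCLASS_OF = "subClassOf"
        SUPERCLASS_OF = "superClassOf"
        COMPLEMENT = "complement"
        OVERLAPPED = "overlapped"
        OTHER = "other"

    def map_ontologies(classes_a, classes_b, decide):
        """Apply a decision function f to every concept pair (Ci, Cj) and record the relation."""
        return {(ci, cj): decide(ci, cj) for ci, cj in product(classes_a, classes_b)}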

  6. Approaches to Ontology Mapping • Manual mapping • String Matching • Text classification • the semantic meaning of a concept can be reflected in the training data (exemplars) that use the concept • Probabilistic feature model • Classification • Results highly dependent on the quality of exemplars
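
As a rough illustration of the text-classification idea (the prototype itself uses the Rainbow toolkit; this sketch substitutes scikit-learn and assumes a dict of exemplars per concept):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def build_feature_model(exemplars):
        """exemplars: {concept_name: [exemplar text, ...]} collected for one ontology."""
        texts, labels = [], []
        for concept, docs in exemplars.items():
            texts.extend(docs)
            labels.extend([concept] * len(docs))
        vectorizer = CountVectorizer(stop_words="english")
        model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
        return vectorizer, model

    def classify(vectorizer, model, document):
        """Return P(concept | document) for every concept in the feature model."""
        probs = model.predict_proba(vectorizer.transform([document]))[0]
        return dict(zip(model.classes_, probs))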

  7. Motivation and Proposal • Preparing exemplars manually is costly • Billions of documents available on the web • Search engines

  8. The Proposal • Use the concept defined in an ontology, together with its semantic information, to form a query, then process the search results to obtain exemplars • Verification • Build a prototype system • Check the ontology mapping results
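
A hedged sketch of this proposal, where search_web and page_to_text are hypothetical stand-ins for the search-engine retriever and HTML processor of the prototype:

    def collect_exemplars(concept, ancestor_labels, search_web, page_to_text, max_pages=50):
        """Form a query from the concept and its semantic context, then harvest exemplars."""
        query = "+".join(ancestor_labels + [concept])     # e.g. "Living+things+animal"
        exemplars = []
        for url in search_web(query)[:max_pages]:         # links returned by the search engine
            text = page_to_text(url)                      # strip HTML, keep plain text
            if text:
                exemplars.append(text)
        return exemplars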

  9. System Overview – Part I [Diagram: Ontology A → Parser → Queries → Retriever → Search Engine / WWW → Links to Web Pages → Retriever → HTML Docs → Processor → Text Files] The Processor outputs either 1. the whole file, or 2. only the sentences containing the search keywords
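
The second output mode above (keeping only sentences that contain the search keywords) might look roughly like this; the sentence splitter is a simplification:

    import re

    def keyword_sentences(text, keywords):
        """Keep only the sentences of a processed page that mention a search keyword."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        hits = [s for s in sentences if any(k.lower() in s.lower() for k in keywords)]
        return " ".join(hits)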

  10. System Overview – Part II [Diagram: Text Files (A) and Ontology A go through Rainbow and the Model Builder to produce the Feature Model; Text Files (B) from Ontology B are classified by Rainbow against that model, and the Calculator turns the raw results into the Mapping Results]

  11. The model builder [Diagram: two copies of the LIVING_THINGS taxonomy (ANIMAL, PLANT, HUMAN, CAT, TREE, GRASS, MAN, WOMAN, ARBOR, FRUTEX)] • Mutually exclusive and exhaustive • Leaf classes • C+ and C-
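
A small sketch of the leaf-class selection the model builder relies on, using a plain dict as a stand-in for the parsed ontology:

    def leaf_classes(tree, root):
        """Return the leaf classes under `root`; these form the mutually
        exclusive, exhaustive categories of the feature model."""
        children = tree.get(root, [])
        if not children:
            return [root]
        leaves = []
        for child in children:
            leaves.extend(leaf_classes(tree, child))
        return leaves

    living_things = {
        "LIVING_THINGS": ["ANIMAL", "PLANT"],
        "ANIMAL": ["HUMAN", "CAT"],
        "HUMAN": ["MAN", "WOMAN"],
        "PLANT": ["TREE", "GRASS"],
        "TREE": ["ARBOR", "FRUTEX"],
    }
    # leaf_classes(living_things, "LIVING_THINGS")
    # -> ['MAN', 'WOMAN', 'CAT', 'ARBOR', 'FRUTEX', 'GRASS']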

  12. The calculator • The naïve Bayes text classifier tends to give extreme values (near 1 or 0) • Conditional probabilities are therefore calculated from the raw classification data by taking the average
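
A sketch of that averaging step, assuming each exemplar of a concept has already been classified and yields a {class: probability} dict:

    def average_conditional_probabilities(per_exemplar_results):
        """Average the raw per-exemplar classification results so that
        extreme (near 0/1) naive Bayes outputs do not dominate."""
        n = len(per_exemplar_results)
        totals = {}
        for result in per_exemplar_results:
            for cls, p in result.items():
                totals[cls] = totals.get(cls, 0.0) + p
        return {cls: total / n for cls, total in totals.items()}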

  13. An Example of the Calculator • 200 exemplars retrieved for the concept APC are run through the classifier against the ontology for weapons (WeaponsA.n3):
      Categories in WeaponsA.n3      Num. of exemplars
      TANK-VEHICLE                   170
      AIR-DEFENSE-GUN                 20
      SAUDI-NAVAL-MISSILE-CRAFT       10
      P(TANK-VEHICLE | APC) = 170 / 200 = 0.85
      P(AIR-DEFENSE-GUN | APC) = 20 / 200 = 0.10
      P(SAUDI-NAVAL-MISSILE-CRAFT | APC) = 10 / 200 = 0.05
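
The same calculation expressed as a tiny counting sketch over hard class assignments, reproducing the numbers on this slide:

    from collections import Counter

    def probabilities_from_counts(assignments):
        """assignments: the category assigned to each classified exemplar."""
        counts = Counter(assignments)
        total = len(assignments)
        return {cat: n / total for cat, n in counts.items()}

    apc = ["TANK-VEHICLE"] * 170 + ["AIR-DEFENSE-GUN"] * 20 + ["SAUDI-NAVAL-MISSILE-CRAFT"] * 10
    # probabilities_from_counts(apc)
    # -> {'TANK-VEHICLE': 0.85, 'AIR-DEFENSE-GUN': 0.1, 'SAUDI-NAVAL-MISSILE-CRAFT': 0.05}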

  14. Experiments with WEAPONS ontology • WeaponsA.n3 and WeaponsB.n3 • Information Interoperation and Integration Conference (http://www.atl.lmco.com/projects/ontology/i3con.html) • Both have over 80 classes defined • More than 60 classes are leaf classes

  15. Part of WeaponsA.n3 [Diagram: WEAPON → CONVENTIONAL-WEAPON, with branches ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, and WARPLANE; classes shown include TANK-VEHICLE, AIRCRAFT-CARRIER, PATROL-CRAFT, and SUPER-ETENDARD]

  16. Part of WeaponsB.n3 [Diagram: WEAPON → CONVENTIONAL-WEAPON, with branches ARMORED-COMBAT-VEHICLE, MODERN-NAVAL-SHIP, WARPLANE, and FIGHTER-PLANE; classes shown include TANK-VEHICLE, LIGHT-TANK, APC, AIRCRAFT-CARRIER, LIGHT-AIRCRAFT-CARRIER, PATROL-WATERCRAFT, PATROL-BOAT, PATROL-BOAT-RIVER, FIGHTER-ATTACK-PLANE, and SUPER-ETENDARD-FIGHTER]

  17. Expected Results [Diagram: the expected mappings between classes of WeaponsA.n3 (AIRCRAFT-CARRIER, SUPER-ETENDARD, PATROL-CRAFT, TANK-VEHICLE) and the new classes of WeaponsB.n3 (FIGHTER-PLANE, LIGHT-AIRCRAFT-CARRIER, PATROL-WATERCRAFT, APC, FIGHTER-ATTACK-PLANE, LIGHT-TANK, SUPER-ETENDARD-FIGHTER, PATROL-BOAT-RIVER, PATROL-BOAT)]

  18. A Typical Report • P(APC | Ci), where i = 1 … 63 • …

  19. Classes with the highest conditional probability for each new class:
      New Classes              Whole file                Prob    Sentences with Keywords     Prob
      LIGHT-AIRCRAFT-CARRIER   AIRCRAFT-CARRIER          0.65    AIRCRAFT-CARRIER            0.57
      APC                      SILKWORM-MISSILE-MOD      0.46    SELF-PROPELLED-ARTILLERY    0.36
      SUPER-ETENDARD-FIGHTER   SILKWORM-MISSILE-MOD      0.66    MRBM                        0.51
      FIGHTER-ATTACK-PLANE     SILKWORM-MISSILE-MOD      0.83    MRBM                        0.38
      PATROL-WATERCRAFT        SILKWORM-MISSILE-MOD      0.28    PATROL-CRAFT                0.52
      PATROL-BOAT-RIVER        SILKWORM-MISSILE-MOD      0.65    PATROL-CRAFT                0.54
      PATROL-BOAT              SILKWORM-MISSILE-MOD      0.51    PATROL-CRAFT                0.66
      LIGHT-TANK               SILKWORM-MISSILE-MOD      0.56    TANK-VEHICLE                0.30
      FIGHTER-PLANE            AIRCRAFT-CARRIER          0.49    MRBM                        0.38

  20. Experiment with LIVING_THINGS ontology [Diagram: HUMAN with subclasses MAN and WOMAN] • P(MAN | HUMAN) • P(WOMAN | HUMAN) • Find a mapping for GIRL

  21. Experiment Results (1) [Diagram: HUMAN with subclasses MAN and WOMAN] • P(MAN | HUMAN) = 0.62 • P(WOMAN | HUMAN) = 0.38

  22. Experiment Results (2)
      With clustering on exemplars:
      P(ANIMAL | GIRL) = 0.83   P(PLANT | GIRL) = 0.17
      P(HUMAN | GIRL) = 0.92    P(CAT | GIRL) = 0.08
      P(WOMAN | GIRL) = 0.63    P(MAN | GIRL) = 0.37
      Without clustering on exemplars:
      P(ANIMAL | GIRL) = 0.76   P(PLANT | GIRL) = 0.23
      P(HUMAN | GIRL) = 0.70    P(CAT | GIRL) = 0.30
      P(WOMAN | GIRL) = 1       P(MAN | GIRL) = 0
      clusty.com with additional classes:
      P(DOG | GIRL) = 0.56   P(HUMAN | GIRL) = 0.43   P(CAT | GIRL) = 0.01   P(PYCNOGONID | GIRL) = 0

  23. Additional Experiments: Different Queries (queries augmented with class properties)
      Concept         Query
      living+things   Living+things
      animal          Living+things+animal+Animalia
      plant           Living+things+plant+Plantae
      cat             Living+things+animal+Animalia+cat+Felidae
      human           Living+things+animal+Animalia+human+intelligent
      man             Living+things+animal+Animalia+human+intelligent+man+male
      woman           Living+things+animal+Animalia+human+intelligent+woman+female
      tree            Living+things+plant+Plantae+tree
      grass           Living+things+plant+Plantae+grass
      frutex          Living+things+plant+Plantae+tree+Frutex
      arbor           Living+things+plant+Plantae+tree+arbor
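
One way the augmented queries above could be generated: walk the path from the root down to the concept and append each class name together with its properties. A sketch under that assumption (the property table here is only partial):

    def build_query(path, properties):
        """path: class names from the root down to the concept;
        properties: {class_name: [property terms]} taken from the ontology."""
        terms = []
        for cls in path:
            terms.append(cls)
            terms.extend(properties.get(cls, []))
        return "+".join(terms)

    props = {"animal": ["Animalia"], "human": ["intelligent"], "man": ["male"]}
    # build_query(["Living", "things", "animal", "human", "man"], props)
    # -> "Living+things+animal+Animalia+human+intelligent+man+male"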

  24. Experiment Results (3) [Diagram: HUMAN with subclasses MAN and WOMAN]
      Results of experiment (1) with new queries:
      Conditional Probability   Whole   Keyword Sentences
      P(MAN | HUMAN)            0.91    0.93
      P(WOMAN | HUMAN)          0.09    0.07
      Results of experiment (2) with new queries:
      Conditional Probability   Whole   Keyword Sentences
      P(ANIMAL | GIRL)          0.9     0.83
      P(PLANT | GIRL)           0.1     0.17
      P(HUMAN | GIRL)           0.78    0.83
      P(CAT | GIRL)             0.22    0.17
      P(MAN | GIRL)             0.14    0.16
      P(WOMAN | GIRL)           0.86    0.84

  25. Limitation 1: Relevancy ≠ similarity [Diagram: the search results for concept A contain text related to concept A, text against concept A, text for a related concept B, and only a subset of text for concept A itself, i.e. the desired exemplars]

  26. Limitation 2: “Conditional Probability” [Diagram: HUMAN with subclasses MAN and WOMAN] • An exemplar is a combination of strings that represents some usage of a concept. • An exemplar is not an instance of a concept. • The way we calculate conditional probability is therefore an estimate.

  27. Limitation 3: Popularity ≠ relevancy • Limited by a search engine’s algorithm • PageRank™ • Popularity does not equal relevancy • Weights cannot be specified for individual words in a search query

  28. Related Research • UMBC OntoMapper • Sushama Prasad, Yun Peng, and Tim Finin, A Tool for Mapping between Two Ontologies Using Explicit Information, AAMAS 2002 Workshop on Ontologies and Agent Systems, 2002. • CAIMEN • Martin S. Lacher and Georg Groh, Facilitating the Exchange of Explicit Knowledge through Ontology Mappings, Proc. of the Fourteenth International FLAIRS Conference, 2001. • GLUE • AnHai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos, and Alon Halevy, Learning to Match Ontologies on the Semantic Web, WWW 2002, May 2002. • Google Conditional Probability • P(HUMAN | MAN) = 1.77 billion / 2.29 billion = 0.77 • P(HUMAN | WOMAN) = 0.6 billion / 2.29 billion = 0.26 • D. Wyatt, M. Philipose, and T. Choudhury, Unsupervised Activity Recognition Using Automatically Mined Common Sense, Proceedings of AAAI-05, pp. 21–27.

  29. Conclusion and Future Work • Text retrieved from the web can be used as exemplars for text-classification-based ontology mapping • Many parameters affect the quality of the exemplars • The processed documents still contain noise • Future work • Clustering • Restrict the search to highly relevant sites and web resources

  30. Questions • Thank you • yangyu1@umbc.edu • ypeng@umbc.edu
