Ontology Learning and Population using Heterogeneous Sources on the Web

Victor de Boer
OLP-AIO’s Workshop, March 16th, 2005

About me
• Victor de Boer
• Artificial Intelligence @ UvA
• Graduated in Human Memory modelling
• AiO since Jan 1st, 2004
• Supervisors: Bob Wielinga and Maarten van Someren
• MultimediaN
  • (Mn-9c: VU, CWI, DEN)

Outline
• Introduction and Research Questions
• Ontology Learning and Population Task
• My approach: Redundancy-based
• Case Study
• Results
• Further Research
• Questions / Discussion

Intro and Research Questions
• Backbone of Semantic Web:
  • Ontologies
  • Content
• Manual construction has its flaws and is also very time-consuming.
• Web contains a lot of knowledge: let’s use it.
• My research questions:
  • How can we automatically construct, enrich and populate ontologies using heterogeneous sources on the Web?
  • And how can these ontologies help us in extracting more information? (bootstrap)

OLP Task Description
• Ontology Learning:
  • Concepts:
    • NERC, LSI, …
  • Hierarchical Structure
    • Hearst Patterns, …
  • Other relations
• Ontology Population
  • Instances
  • Relation Instances
• Ontology Enrichment
[Diagram: concepts C1, C2, C3, C4 with instances I1, I2, I3]
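
Hearst patterns are named above as one way to learn hierarchical structure. As a rough illustration, the sketch below applies a single such pattern ("X such as A, B and C") with a regular expression. The function name `hearst_such_as` is hypothetical, and the sketch assumes every term is a single word, which real systems do not.

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Simplifying assumption: every term is a single word token.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or ))*\w+)")


def hearst_such_as(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the enumeration after "such as" into its individual terms.
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


pairs = hearst_such_as("He admired painters such as Monet, Renoir and Degas.")
```

Each pair reads "hyponym is a kind of hypernym". Multi-word terms and other pattern variants would need noun-phrase chunking rather than `\w+`.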

Relation Instantiation
• We have:
  • two concepts C1 and C2,
  • a relation R(C1, C2),
  • and instances I1 of C1 and I2 of C2.
• Find for which instances the relation R holds.
• Examples:
  • <Country, has_city, City>
  • <Movie, has_director, Director>
  • <Artstyle, has_artist, Artist>
• Information Extraction!
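
The task statement above can be written down as a small function: given the instances of C1 and C2, keep the pairs for which R is judged to hold. This is only a sketch of the problem shape. The `holds` callback is a hypothetical stand-in for whatever evidence test is used, and the toy facts are illustrative.

```python
from itertools import product


def instantiate_relation(instances_c1, instances_c2, holds):
    """Return the (i1, i2) pairs for which the relation R is judged to hold.

    `holds` is a hypothetical evidence test supplied by the caller; the
    slides do not prescribe how it is implemented.
    """
    return [
        (i1, i2)
        for i1, i2 in product(instances_c1, instances_c2)
        if holds(i1, i2)
    ]


# Toy lookup standing in for real evidence for <Country, has_city, City>.
known = {("France", "Paris"), ("Netherlands", "Amsterdam")}
pairs = instantiate_relation(
    ["France", "Netherlands"],
    ["Amsterdam", "Paris"],
    lambda country, city: (country, city) in known,
)
```

Enumerating every pair is only feasible for small instance sets; in practice the candidate pairs would come from the extracted documents.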

Approaches
• Current approaches:
  • NLP-based: work well for natural-language documents
  • Wrapper-like: work well with (semi-)structured documents
  • Neither is a generic approach
• My approach:
  • Use generic methods, applicable to heterogeneous sources, combining information to collect evidence for the relation. Redundancy of information should compensate for the loss of subtlety.

Case Study: Domain
• Art and Architecture Thesaurus (AAT)
• Union List of Artist Names (ULAN)
• Relation: <aat:style, aua:has_artist, ulan:artist>
• Find instances of this relation
[Diagram label: Has_artist]

Case Study: Method
[Diagram. Components: AAT, seed list, manual wrapper, Person Name Extractor, ULAN-check. Example names shown: Otto Dix, S. Freud, George Grosz. Score: “George Grosz” + 0.5]
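
The method diagram does not spell out how scores are computed, so the following is only a sketch of one redundancy-based scheme consistent with its components. It assumes that a page's weight is the fraction of seed names it mentions and that a candidate's score is the mean page weight over all pages. A plain substring lookup against a list of known artist names stands in for the Person Name Extractor and the ULAN check. The function name and all parameters are hypothetical.

```python
from collections import defaultdict


def score_candidates(pages, seeds, known_artists):
    """Rank non-seed artists by redundancy-weighted evidence across pages.

    Assumptions (not from the slides): a page's weight is the fraction of
    seed names it mentions, and a candidate's score is the mean page weight
    over all pages, counting 0 for pages that do not mention the candidate.
    """
    if not pages or not seeds:
        return []
    scores = defaultdict(float)
    for text in pages:
        lowered = text.lower()
        # Page weight: how strongly the seeds tie this page to the style.
        weight = sum(seed.lower() in lowered for seed in seeds) / len(seeds)
        if weight == 0:
            continue  # no seed evidence, so the page is not used
        # Stand-in for name extraction plus the ULAN check: substring lookup
        # against a list of known artist names.
        for name in known_artists:
            if name not in seeds and name.lower() in lowered:
                scores[name] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, total / len(pages)) for name, total in ranked]


pages = [
    "Otto Dix and George Grosz exhibited together.",
    "S. Freud wrote about dreams.",
    "George Grosz, a contemporary of Otto Dix.",
]
ranking = score_candidates(pages, ["Otto Dix"], ["Otto Dix", "George Grosz"])
```

Names that are not in the known-artist list never receive a score, and a candidate mentioned on many seed-rich pages rises in the ranking, which is the redundancy effect the approach relies on.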

Case Study: Results
• Impressionism: 200 pages (+/-120 used)
Seed Artists: Degas, Gauguin, Boudin, Morisot, Caillebotte, Seurat, Monet, Renoir, Manet
name ; score ; ULAN id
sisley, alfred ; 0.08 ; ulan#19582
cassatt, mary ; 0.0780414 ; ulan#8671
cezanne, paul ; 0.0764626 ; ulan#9730
bazille, frederic ; 0.0394824 ; ulan#2147
signac, paul ; 0.0265291 ; ulan#19142
guillaumin, armand ; 0.0263668 ; ulan#11549
gustave courbet ; 0.0218521 ; ulan#12992
bonnard, pierre ; 0.0149454 ; ulan#4215
henri matisse ; 0.0134152 ; ulan#5698
camille corot ; 0.0128969 ; ulan#10536
d'orsay ; 0.0123066 ; ulan#28304
auguste rodin ; 0.0115357 ; ulan#17831
theodore rousseau ; 0.011157 ; ulan#18605
childe hassam ; 0.0107054 ; ulan#12300

Case Study: Results
• Evaluation problems
• 18 Impressionists (Gold Standard)
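
One common way to compare a ranked list with a gold standard is precision and recall over the top k results. The slides do not say which measure was used, so this is an assumption, and the names in the usage lines are placeholders rather than real data.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked names against a gold set."""
    top_k = ranked[:k]
    hits = sum(1 for name in top_k if name in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Placeholder names only; not taken from the case study.
ranked = ["a", "b", "c", "d"]
gold = {"a", "c", "e"}
precision, recall = precision_recall_at_k(ranked, gold, k=2)
```

With a small gold standard, correct names missing from it are counted as errors, which is one reason evaluation is a problem for this kind of task.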

Assumptions, Limitations
• Conclusions:
  • It seems to work
  • Evaluation is a problem
• Assumptions:
  • The redundancy of information we extract by using multiple, heterogeneous sources compensates for what we lose by not using more ‘sophisticated’ methods
  • R must be a one-to-many relation (no functional properties)
  • C1 must be ‘googlable’
  • C2 must be ‘extractable’

Further Research
• Collect more results (how robust is it?)
• Different domains
• More heterogeneous sources (databases), offline dictionaries, …
• Use page classification/trustworthiness
• Evaluation
• Use ontological information

Questions?
