Maintaining Information Integration Ontologies

Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios Vouros. Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”, http://www.iit.demokritos.gr/skel.



  1. Maintaining Information Integration Ontologies
     Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios Vouros
     Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
     http://www.iit.demokritos.gr/skel
     DCAG, Ulm, December 6, 2003

  2. Structure of the talk
     • Information integration in CROSSMARC
     • Semi-automated ontology enrichment
     • Clustering “synonyms”
     • Conclusions

  3. CROSSMARC Objectives
     Develop technology for Information Integration that can:
     • crawl the Web for interesting Web pages,
     • extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
     • process Web pages written in several languages,
     • be customized semi-automatically to new domains and languages,
     • deliver integrated information according to personalized profiles.

  4. CROSSMARC Architecture
     [Architecture diagram; surviving label: Ontology]

  5. CROSSMARC Ontology
     • Meta-conceptual layer
         • Embodies domain-independent semantics
     • Conceptual layer
         • Contains relevant concepts of each domain
     • Instance layer
         • Contains relevant individuals of each domain
     • Lexical layer
         • Language-dependent realizations of domain information

  6. CROSSMARC Ontology
     Lexicon:
         <node idref="OV-d0e1041">
           <synonym>Intel Pentium III</synonym>
           <synonym>Pentium III</synonym>
           <synonym>P3</synonym>
           <synonym>PIII</synonym>
         </node>
         …
     Ontology:
         <description>Laptops</description>
         <features>
           <feature id="OF-d0e5">
             <description>Processor</description>
             <attribute type="basic" id="OA-d0e7">
               <description>Processor Name</description>
               <discrete_set type="open">
                 <value id="OV-d0e1041">
                   <description>Intel Pentium 3</description>
                 </value>
                 …
     Greek Lexicon:
         <node idref="OA-d0e7">
           <synonym>Όνομα Επεξεργαστή</synonym>
         </node>
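The link between a lexicon entry and an ontology value on this slide is the shared identifier (`idref="OV-d0e1041"` in the lexicon, `id="OV-d0e1041"` in the ontology). A minimal Python sketch of how such a lexicon fragment could be turned into a lookup table follows. The `<lexicon>` wrapper element and the function name `build_synonym_index` are assumptions made for illustration; the slide shows only the `<node>` fragment, and this is not CROSSMARC's own code.

```python
import xml.etree.ElementTree as ET

# The <node> element is copied from the slide. The <lexicon> root is a
# hypothetical wrapper, added only so the fragment parses as one document.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(xml_text):
    """Map each lower-cased synonym string to the ontology id it realizes."""
    index = {}
    for node in ET.fromstring(xml_text).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index


index = build_synonym_index(LEXICON_XML)
print(index.get("piii"))       # known surface form of OV-d0e1041
print(index.get("pentium 2"))  # unknown string: no entry in the lexicon
```

A string that is missing from the index is exactly the case that the enrichment process has to resolve: it may be a new instance or a new surface form of an existing one.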

  7. Structure of the talk
     • Information integration in CROSSMARC
     • Semi-automated ontology enrichment
     • Clustering “synonyms”
     • Conclusions

  8. Ontology Enrichment
     [Diagram labels: A-box, T-box, Conceptualization, Instances, Evolving nature of ontology, Ontology Maintenance, part of, Ontology Enrichment]
     An ontology captures knowledge in a static way, as it is a snapshot of knowledge from a particular point of view that governs a certain domain of interest in a specific time period.

  9. Ontology Enrichment
     • We concentrate on instances (knowledge of the domain of interest).
     • Highly evolving domain (e.g. laptop descriptions)
         • New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not exist in the ontology.
         • New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
     • The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

  10. Ontology Enrichment
      [Diagram of the enrichment cycle. Labels: Annotating Corpus Using Domain Ontology; machine learning; Corpus; Additional annotations; Multi-Lingual Domain Ontology; Information extraction; Ontology Enrichment / Population; Validation; Domain Expert]

  11. Results: Annotation phase only

  12. Results: Full enrichment cycle
      [Figure panels: 25% of the initial ontology; 50% of the initial ontology; 75% of the initial ontology]

  13. Structure of the talk
      • Information integration in CROSSMARC
      • Semi-automated ontology enrichment
      • Clustering “synonyms”
      • Conclusions

  14. Enrichment with synonyms
      • So far, only enrichment with instances that participate in the ‘instance of’ relationship has been supported.
      • The number of instances for validation increases with the size of the corpus and the ontology.
      • There is a need for supporting the enrichment of the ‘synonymy’ relationship (in different languages and domains).
      We approach this problem using … ONTOLOGY LEARNING

  15. Enrichment with synonyms
      • Issues to be handled:
          • Automatically discover different surface appearances of an instance (CROSSMARC synonymy relationship).
      Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
      Orthographical: ‘Intel p3’ - ‘intell p3’
      Lexicographical: ‘Hewlett Packard’ - ‘HP’
      Combination: ‘IntellPentium 3’ - ‘P III’

  16. Compression-based Clustering
      • COCL (COmpression-based CLustering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
      • CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.
      • COCL iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
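To make the score concrete, here is a minimal Python sketch of one possible reading of CCDiff. It assumes that the "code length of a cluster" is the number of bits needed to encode all characters of the cluster's strings with a Huffman code built from their character frequencies, and that the cost of storing the tree itself is ignored. The slide does not spell out these details, so the function names `huffman_code_length` and `cc_diff` and the exact definition are assumptions, not the authors' implementation.

```python
import heapq
from collections import Counter


def huffman_code_length(strings):
    """Bits needed to encode all characters of `strings` with a Huffman
    code built from their character frequencies (tree cost ignored)."""
    counts = Counter("".join(strings))
    if not counts:
        return 0
    if len(counts) == 1:
        # Degenerate alphabet: charge one bit per character.
        return sum(counts.values())
    heap = list(counts.values())
    heapq.heapify(heap)
    total = 0
    # The encoded length equals the sum of the weights of all internal
    # nodes created while the Huffman tree is built.
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def cc_diff(cluster, candidate):
    """Increase in the cluster's code length when `candidate` is added."""
    return huffman_code_length(cluster + [candidate]) - huffman_code_length(cluster)


cluster = ["Intel Pentium III", "Pentium III"]
for candidate in ["PIII", "Hewlett Packard"]:
    print(candidate, cc_diff(cluster, candidate))
```

Under this reading, a candidate that reuses the characters already frequent in a cluster adds few bits, while one that introduces many unseen characters adds more. The raw difference also grows with the length of the candidate, which a real implementation may need to account for.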

  17. Compression-based Clustering
      Given CLUSTERS and candidate INSTANCES
      while INSTANCES is not empty do
          for each instance in INSTANCES
              compute CCDiff for every cluster in CLUSTERS
          end for each
          select the instance from INSTANCES that maximizes the difference between its two smallest CCDiffs
          if min(CCDiff) of instance > threshold
              create new cluster
              assign instance to new cluster
              remove instance from INSTANCES
              calculate code model for the new cluster
              add new cluster to CLUSTERS
          else
              assign instance to cluster of min(CCDiff)
              remove instance from INSTANCES
              recalculate code model for the cluster
          end if
      end while
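The loop above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: `cc_diff` uses a simplified Huffman code length (encoded bits only, tree cost ignored), the code model is recomputed from the cluster members on every call rather than stored, and the names `cocl`, `cc_diff`, and `huffman_code_length` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_length(strings):
    """Bits to encode all characters of `strings` with a Huffman code
    built from their character frequencies (assumed cost model)."""
    counts = Counter("".join(strings))
    if len(counts) <= 1:
        return sum(counts.values())  # 0 if empty, else one bit per character
    heap = list(counts.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def cc_diff(cluster, candidate):
    """Increase in code length of `cluster` when `candidate` is added."""
    return huffman_code_length(cluster + [candidate]) - huffman_code_length(cluster)


def cocl(clusters, instances, threshold):
    """Assign each instance to its closest cluster or open a new one."""
    clusters = [list(c) for c in clusters]  # do not mutate the caller's data
    pending = list(instances)
    while pending:
        # Score every pending instance against every current cluster.
        scored = []
        for inst in pending:
            diffs = sorted((cc_diff(c, inst), i) for i, c in enumerate(clusters))
            scored.append((inst, diffs))

        # Pick the instance whose best cluster is most clearly separated
        # from its second-best one (gap between the two smallest CCDiffs).
        def gap(entry):
            diffs = entry[1]
            return diffs[1][0] - diffs[0][0] if len(diffs) > 1 else 0

        inst, diffs = max(scored, key=gap)
        pending.remove(inst)
        if not diffs or diffs[0][0] > threshold:
            clusters.append([inst])            # too far from everything
        else:
            clusters[diffs[0][1]].append(inst)  # join the closest cluster
    return clusters


result = cocl([["abab"], ["cdcd"]], ["ab", "cd", "xyz"], threshold=5)
print(result)
```

Because the code length is recomputed from the members each time, the "recalculate code model" steps of the pseudocode are implicit here. Caching one frequency table per cluster would avoid the repeated work on larger inputs.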

  18. Results - Evaluation
      We incrementally hide one cluster at a time and measure the ability of the algorithm to discover the hidden clusters.
      Recall: 100%
      Precision: 75%
      • Concept Generation Scenario
      • Instance Matching Scenario
      [Table: Dataset characteristics]

  19. Structure of the talk
      • Information integration in CROSSMARC
      • Semi-automated ontology enrichment
      • Clustering “synonyms”
      • Conclusions

  20. Conclusions
      • CROSSMARC is a complete multi-lingual information integration system.
      • Ontology Maintenance is crucial in evolving domains.
      • Ontology Enrichment helps the adaptation of the system to new domains, saving time and effort.
      • Machine-learning based information extraction can assist the discovery of new instances.
      • Compression-based clustering discovers string similarities that support the enrichment with different surface appearances of an instance (“synonyms”).

  21. References
      • B. Hachey, C. Grover, V. Karkaletsis, A. Valarakos, M. T. Pazienza, M. Vindigni, E. Cartier, J. Coch, Use of Ontologies for Cross-lingual Information Management in the Web, In Proceedings of the Ontologies and Information Extraction International Workshop held as part of EUROLAN 2003, Romania, July 28 - August 8, 2003.
      • M. T. Pazienza, A. Stellato, M. Vindigni, A. Valarakos, V. Karkaletsis, Ontology Integration in a Multilingual e-Retail System, In Proceedings of the HCI International Conference, Volume 4, pp. 785-789, Heraklion, Crete, Greece, June 22-27, 2003.
      • A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras, A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning, In RANLP, 2003.
      • A. Valarakos, G. Sigletos, V. Karkaletsis, G. Paliouras, G. Vouros, A Methodology for Enriching a Multi-Lingual Domain Ontology using Machine Learning, In Proceedings of the 6th ICGL workshop on Text Processing for Modern Greek: from Symbolic to Statistical Approaches, held as part of the 6th International Conference in Greek Linguistics, Rethymno, Crete, 20 September, 2003.
      • A. Valarakos, G. Paliouras, V. Karkaletsis, G. Vouros, A Name-Matching Algorithm for Ontology Enrichment, In Proceedings of the Hellenic Artificial Intelligence Conference (SETN’04), Samos, May, 2004.
