
Ontology-based Annotation


Presentation Transcript


  1. Ontology-based Annotation Sergey Sosnovsky @PAWS@SIS@PITT

  2. Outline
     • O-based Annotation
     • Conclusion
     • Questions

  3. Why Do We Need Annotation / What is Added by O-based Annotation
     What is added by O-based annotation:
     • Ontology-driven processing (effective formal reasoning)
     • Connecting to other O-based services (O-mapping, O-visualization, …)
     • Unified vocabulary
     • Connecting to the rest of SW knowledge
     Why we need annotation:
     • Annotation-based services
       • Integration of dispersed information (knowledge-based linking)
       • Better indexing and retrieval (based on the document semantics)
       • Content-based adaptation (modeling document content in terms of the domain model)
     • Knowledge management
       • Organizations' repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
     • Collaboration support
       • Knowledge sharing and communication

  4. Definition
     O-based annotation is the process of creating a mark-up of Web documents using a pre-existing ontology and/or populating knowledge bases with marked-up documents.
     [Example RDF graph: the sentence "Michael Jordan plays basketball" is annotated as the triple our:MichaelJordan our:plays our:Basketball, with our:MichaelJordan rdf:type our:Athlete and our:Basketball rdf:type our:Sports.]
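A minimal sketch of how the example above could be written down as RDF triples in Python with rdflib; the our: namespace URI is an invented placeholder, not one taken from the original slides.

```python
from rdflib import Graph, Namespace, RDF

# Hypothetical namespace standing in for the slides' "our:" prefix.
OUR = Namespace("http://example.org/our#")

g = Graph()
g.bind("our", OUR)

# "Michael Jordan plays basketball" annotated against the ontology:
g.add((OUR.MichaelJordan, RDF.type, OUR.Athlete))      # our:MichaelJordan rdf:type our:Athlete
g.add((OUR.Basketball, RDF.type, OUR.Sports))          # our:Basketball   rdf:type our:Sports
g.add((OUR.MichaelJordan, OUR.plays, OUR.Basketball))  # the annotation triple itself

print(g.serialize(format="turtle"))
```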

  5. List of Tools
     • O-based annotation tools: AeroDAML / AeroSWARM, Annotea / Annozilla, Armadillo, AktiveDoc, COHSE, GOA, KIM Semantic Annotation Platform, MagPie, Melita, MnM, OntoAnnotate, Ontobroker, OntoGloss, ONTO-H, Ont-O-Mat / S-CREAM / CREAM, Ontoseek, Pankow, SHOE Knowledge Annotator, Seeker, Semantik, SemTag, SMORE, Yawas, …
     • Information Extraction tools: Alembic, Amilcare / T-REX, Annie, Fastus, Lasie, Proteus, SIFT, …

  6. Important Characteristics
     • Automation of annotation (manual / semiautomatic / automatic / editable)
     • Ontology-related issues:
       • pluggable ontology (yes / no)
       • ontology language (RDFS / DAML+OIL / OWL / …)
       • local / anywhere access
       • ontology elements available for annotation (concepts / instances / relations / triples)
       • where annotations are stored (in the annotated document / on a dedicated server / where specified)
       • annotation format (XML / RDF / OWL / …)
     • Annotated documents:
       • document kinds (text / multimedia)
       • document formats (plain text / HTML / PDF / …)
       • document access (local / Web)
     • Architecture / interface / interoperability: standalone tool / web interface / web component / API / …
     • Annotation scale (large – the WWW size / small – a hundred)
     • Existing documentation / tutorial availability
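One possible way to record these characteristics when comparing tools is a simple record type. This is only an illustrative sketch; the field names and value conventions below are ad-hoc choices, not an established schema.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationTool:
    """Characteristics of an O-based annotation tool (fields mirror the slide)."""
    name: str
    automation: str                # "manual" / "semiautomatic" / "automatic" / "editable"
    pluggable_ontology: bool
    ontology_language: str         # "RDFS" / "DAML+OIL" / "OWL" / ...
    ontology_access: str           # "local" / "anywhere"
    annotatable_elements: list = field(default_factory=list)  # concepts, instances, relations, triples
    annotation_storage: str = ""   # in the document / dedicated server / where specified
    annotation_format: str = ""    # "XML" / "RDF" / "OWL" / ...
    document_kinds: list = field(default_factory=list)        # text, multimedia
    document_formats: list = field(default_factory=list)      # plain text, HTML, PDF, ...
    document_access: str = ""      # "local" / "web"
    interface: str = ""            # standalone tool / web interface / web component / API
    annotation_scale: str = ""     # "large (WWW-size)" / "small (~a hundred documents)"
    has_documentation: bool = False
```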

  7. SMORE
     [Same example RDF graph as on slide 4: "Michael Jordan plays basketball".]
     • Manual annotation
     • OWL-based markup
     • Simultaneous ontology modification (if necessary)
     • ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
     • Post-annotation O-based inference

  8. Problems of Manual Annotation
     • Expensive / time-consuming
     • Difficult / error-prone
     • Subjective (two people annotating the same documents annotate 15–30% of them differently)
     • Never-ending: new documents, new versions of ontologies
     • Annotation storage problem: where?
     • Trusting the owner's annotation: incompetence, spam (Google does not use <META> info)
     Solution: dedicated automatic annotation services ("search-engine"-like)

  9. Automatic O-based Annotation
     • Supervised: MnM, S-CREAM, Melita & AktiveDoc
     • Unsupervised: SemTag – Seeker, Armadillo, AeroSWARM

  10. MnM
     • Ontology-based annotation interface:
       • Ontology browser (rich navigation capabilities)
       • Document browser (usually a Web browser)
       • Annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
       • A built-in or external ML component classifies the main corpus of documents
     • Activity flow (a simplified sketch follows this slide):
       • Markup: a human user manually annotates a training set of documents with ontology elements
       • Learn: a learning algorithm is run over the marked-up corpus to learn the extraction rules
       • Extract: an IE mechanism is selected and run over a set of documents
       • Review: a human user inspects the results and corrects them if necessary
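A highly simplified sketch of the Markup → Learn → Extract → Review flow. The toy rule learner here only memorizes the word immediately preceding each annotated fragment; real IE components such as Amilcare learn much richer contextual rules, so this is an illustration of the workflow, not of the actual algorithm.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text)

# --- Markup: a human annotates a small training set as (text, [(concept, fragment), ...]) ---
training_docs = [
    ("The athlete Michael Jordan plays basketball in Chicago.",
     [("Athlete", "Michael Jordan"), ("Sports", "basketball")]),
]

# --- Learn: memorize the word that precedes each annotated fragment, per concept ---
def learn_rules(docs):
    rules = defaultdict(set)
    for text, annotations in docs:
        tokens = tokenize(text)
        for concept, fragment in annotations:
            first_word = tokenize(fragment)[0]
            if first_word in tokens:
                i = tokens.index(first_word)
                if i > 0:
                    rules[concept].add(tokens[i - 1])  # preceding-word "trigger"
    return rules

# --- Extract: apply the learned triggers to an unseen document ---
def extract(text, rules):
    tokens = tokenize(text)
    found = []
    for concept, triggers in rules.items():
        for i, tok in enumerate(tokens[:-1]):
            if tok in triggers:
                found.append((concept, tokens[i + 1]))
    return found

rules = learn_rules(training_docs)
# --- Review: a human would inspect these proposals and correct them if necessary ---
print(extract("The athlete Serena Williams plays tennis professionally.", rules))
# e.g. [('Athlete', 'Serena'), ('Sports', 'tennis')]
```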

  11. Amilcare and T-REX
     • Amilcare:
       • Automatic IE component
       • Used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, OntoMat, SemantiK)
       • Released to about 50 industrial and academic sites
       • Java API
       • Recently succeeded by T-REX

  12. Pankow
     • Input: a Web page.
     • Step 1: the Web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger finds candidate proper nouns).
       Result 1: a set of candidate proper nouns.
     • Step 2: the system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns.
       Result 2: a set of hypothesis phrases.
     • Step 3: Google is queried for the hypothesis phrases.
       Result 3: the number of hits for each hypothesis phrase.
     • Step 4: the system sums up the query results to a total for each instance-concept pair, then categorizes the candidate proper nouns into their highest-ranked concepts.
       Result 4: an ontologically annotated Web page.
     (A simplified sketch of this counting loop follows.)
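A sketch of the Pankow-style counting step under two assumptions: the linguistic patterns below are only a tiny sample of the patterns the real system uses, and web_hit_count is a hypothetical stand-in for asking a search engine how many pages contain a phrase. The instance and concept names are invented for illustration.

```python
# Candidate proper nouns from Step 1 and candidate ontology concepts.
candidate_instances = ["Nice", "Laksmi"]
candidate_concepts = ["city", "river", "hotel"]

# A small sample of linguistic patterns used to build hypothesis phrases (Step 2).
PATTERNS = [
    "{instance} is a {concept}",
    "the {concept} {instance}",
]

def web_hit_count(phrase):
    """Hypothetical stand-in for Step 3: number of search-engine hits for `phrase`."""
    fake_counts = {"Nice is a city": 5120, "the city Nice": 1430}
    return fake_counts.get(phrase, 0)

# Step 4: sum hits per (instance, concept) pair and keep the best concept per instance.
def categorize(instances, concepts):
    annotation = {}
    for inst in instances:
        totals = {}
        for concept in concepts:
            totals[concept] = sum(
                web_hit_count(p.format(instance=inst, concept=concept)) for p in PATTERNS
            )
        best = max(totals, key=totals.get)
        if totals[best] > 0:  # leave the instance unannotated if no pattern matched at all
            annotation[inst] = best
    return annotation

print(categorize(candidate_instances, candidate_concepts))  # e.g. {'Nice': 'city'}
```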

  13. SemTag – Seeker
     • IBM-developed
     • ~264 million Web pages
     • ~72 thousand concepts (TAP taxonomy)
     • 434 million automatically disambiguated semantic tags
     • Spotting pass:
       • Documents are retrieved from the Seeker store and tokenized
       • Tokens are matched against the TAP concepts
       • Each resulting label is saved with ten words on either side as a "window" of context around the particular candidate object
     • Learning pass:
       • A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy; the TBD (taxonomy-based disambiguation) algorithm is used
     • Tagging pass:
       • The "windows" are scanned once more to disambiguate each reference and determine a TAP object
       • A record is entered into a database of final results containing the URL, the reference, and any other associated metadata
     (A simplified sketch of the spotting and tagging passes follows.)
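A toy sketch of the spotting and tagging passes: label spotting with a ±10-word context window, plus a much-simplified stand-in for TBD that picks the candidate concept whose corpus-level terms overlap the context most. The mini-taxonomy and term sets are invented for illustration and are not the TAP data.

```python
import re

# Invented mini-taxonomy: an ambiguous label can denote several TAP-style concepts.
LABEL_TO_CONCEPTS = {"jaguar": ["Animal/BigCat", "Product/Car"]}

# Stand-in for the corpus-wide term distributions gathered in the learning pass.
CONCEPT_TERMS = {
    "Animal/BigCat": {"jungle", "prey", "species", "habitat"},
    "Product/Car":   {"engine", "dealer", "sedan", "drive"},
}

def spotting_pass(text, window=10):
    """Find label occurrences and keep `window` words on either side as context."""
    tokens = re.findall(r"\w+", text.lower())
    spots = []
    for i, tok in enumerate(tokens):
        if tok in LABEL_TO_CONCEPTS:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            spots.append((tok, context))
    return spots

def tagging_pass(spots):
    """Very rough TBD stand-in: pick the concept whose terms overlap the context most."""
    tags = []
    for label, context in spots:
        ctx = set(context)
        best = max(LABEL_TO_CONCEPTS[label],
                   key=lambda c: len(CONCEPT_TERMS[c] & ctx))
        tags.append((label, best))
    return tags

doc = "The jaguar stalked its prey through the jungle at dusk."
print(tagging_pass(spotting_pass(doc)))  # e.g. [('jaguar', 'Animal/BigCat')]
```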

  14. Conclusions
     • Web-document annotation is a necessary thing
     • O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
     • Manual annotation is a bad thing; automatic annotation is a good thing:
       • Supervised O-based annotation:
         • Useful O-based interface for annotating the training set
         • Traditional IE tools for textual classification
       • Unsupervised O-based annotation:
         • COHSE – matches concept names from the ontology and a thesaurus against tokens from the text
         • Pankow – uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
         • SemTag – uses concept names to match tokens and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment

  15. Questions?
