
A Synergistic Semantic Annotation Model December 2007


Presentation Transcript


  1. A Synergistic Semantic Annotation Model, December 2007 • Yihong Ding, http://www.deg.byu.edu/ding/

  2. Grand challenge: a new-generation World Wide Web • The current Web • Enormous amount of content • Feasible for humans to read/write • But … • There is simply too much content to read • The future Web • Even more content, but machine-processable • Feasible for humans and machines to read/write • Key issue • Converting non-machine-processable content to machine-processable content, i.e., semantic annotation

  3. Semantic annotation, the general picture [figure labels: AptRental Ontology; Data Extraction/Instance Recognition Engine]

  4. Semantic annotation, the general picture [figure label: AptRental Ontology]

  5. Ontology • Definition: Explicit, formal specifications of conceptualizations • Unique identity of each concept • Unique identity of each relationship among concepts • Logic derivation rules underneath every declared relationship • Annotation: • 533-0293 is-a AptRental:ContactPhone • $1250 is-a AptRental:MonthlyRate • 533-0293 is-about AptRentalAd-instance-1 • $1250 is-about AptRentalAd-instance-1 • Ontology: • AptRentalAd has ContactPhone • AptRentalAd has MonthlyRate • Logic derivation: • To rent the apartment that costs $1250 monthly please call 533-0293. (machine understanding)
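The annotation and derivation on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the triple layout, the `group_by_instance` helper, and the `describe` rule are assumptions made for the example, not the representation used by the author's system.

```python
from collections import defaultdict

# Annotations from the slide, written as (value, relation, target) triples.
ANNOTATIONS = [
    ("533-0293", "is-a", "AptRental:ContactPhone"),
    ("$1250", "is-a", "AptRental:MonthlyRate"),
    ("533-0293", "is-about", "AptRentalAd-instance-1"),
    ("$1250", "is-about", "AptRentalAd-instance-1"),
]


def group_by_instance(annotations):
    """Join is-a and is-about triples into {instance: {concept: value}}."""
    concept_of = {value: target for value, rel, target in annotations if rel == "is-a"}
    instances = defaultdict(dict)
    for value, rel, target in annotations:
        if rel == "is-about" and value in concept_of:
            instances[target][concept_of[value]] = value
    return dict(instances)


def describe(ad):
    """Hypothetical derivation rule: an ad with both fields yields the sentence."""
    rate = ad["AptRental:MonthlyRate"]
    phone = ad["AptRental:ContactPhone"]
    return f"To rent the apartment that costs {rate} monthly please call {phone}."


ads = group_by_instance(ANNOTATIONS)
sentence = describe(ads["AptRentalAd-instance-1"])
print(sentence)
```

The point of the sketch is that the ontology relationships (AptRentalAd has ContactPhone, AptRentalAd has MonthlyRate) are what let a program combine two separately annotated strings into one statement about the same ad.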

  6. Automated semantic annotation, methods • Layout-driven method (e.g. [Mukherjee et al. 03]) • Machine-learning-based method (e.g. [Handschuh et al. 02]) • Rule-based method (e.g. [Dill et al. 03]) • NLP-based method (e.g. [Popov et al. 03]) • Ontology-based method (e.g. [Ding et al. 06])

  7. Ontology-based annotation

  8. Data extraction ontology • Standard ontology with an epistemological extension (instance recognizer) • BedroomNr recognizer: external representation, context phrase, exception phrase (marked X in the figure) • Sample ad: CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d, views, 1700 sq ft. $1250 mo. Call 533-0293
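A rough sketch of what an instance recognizer with these three parts might look like, applied to the sample ad. The class, the regular expressions, the look-ahead window, and the "bedroom set" exception are all assumptions chosen for illustration; the slides do not give the actual recognizer definitions.

```python
import re
from dataclasses import dataclass


@dataclass
class InstanceRecognizer:
    concept: str
    external_representation: str  # regex for the value itself
    context_phrase: str           # regex that must directly follow the value
    exception_phrase: str = ""    # regex that vetoes the match if it follows the value
    window: int = 20              # characters inspected after the value (assumed)

    def find(self, text):
        hits = []
        for m in re.finditer(self.external_representation, text):
            tail = text[m.end():m.end() + self.window]
            if not re.match(rf"\s*(?:{self.context_phrase})", tail, re.IGNORECASE):
                continue  # no supporting context, so not an instance
            if self.exception_phrase and re.match(
                rf"\s*(?:{self.exception_phrase})", tail, re.IGNORECASE
            ):
                continue  # context present, but the exception phrase rules it out
            hits.append(m.group())
        return hits


AD = ("CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d,views, "
      "1700 sq ft. $1250 mo. Call 533-0293")

bedroom_nr = InstanceRecognizer(
    concept="BedroomNr",
    external_representation=r"\b[1-9]\b",
    context_phrase=r"bdrm|bedrooms?|br\b",
    exception_phrase=r"(?:bdrm|bedroom)\s+set",  # hypothetical: furniture, not rooms
)
monthly_rate = InstanceRecognizer(
    concept="MonthlyRate",
    external_representation=r"\$\d{3,5}",
    context_phrase=r"mo\b|per month",
)

print(bedroom_nr.find(AD))
print(monthly_rate.find(AD))
```

In this sketch the three isolated "2" tokens in the ad all match the external representation, and only the context phrase decides which one is a bedroom count. That is why such recognizers are insensitive to page layout but costly to write by hand.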

  9. Ontology-based annotation [figure: the sample ad "CAPITOL HILL Luxury 2 bdrm 2 bath, 2 grg, w/d, views, 1700 sq ft. $1250 mo. Call 533-0293" annotated with BedroomNr, BathNr, Feature, MonthRate, and ContactPhone; labels: External representation, Context Phrase, Context Keyword]

  10. Ontology-based annotation: strengths and weakness • Strengths • Ignores layout differences • Ignores layout changes • Less maintenance once built • Weakness • Instance recognizers are expensive to build

  11. Layout-driven annotation

  12. Layout-driven annotation

  13. Layout-driven annotation: strengths and weakness • Strengths • Accurate • Simple and straightforward • Requires less domain knowledge • Weakness • Layout patterns are expensive to maintain

  14. Problem • How to overcome the weaknesses while retaining the strengths at the same time?

  15. Observation [figure labels: resilient; accurate; Conceptual Annotator (ontology-based annotation); Structural Annotator (layout-driven annotation); Extraction Domain ontology; Domain ontology; Layout Patterns; A Document; Annotated Document]

  16. Synergistic model [figure labels: A Document; Extraction Domain ontology; Conceptual Annotator (ontology-based annotation); Pattern Generation; Layout Patterns; Structural Annotator (layout-driven annotation); Instance Recognizer Enrichment; Annotated Document]

  17. Pattern Generation • Get the annotated outputs from the ontology-based annotator • Apply HTML-structure analysis and produce a typical layout pattern for each extracted field • If applicable, produce a sequential dependency between the generated layouts • If applicable, produce simple heuristic rules such as “if A then B” between the generated layouts
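The first two steps above can be sketched with Python's standard `html.parser`. This is a simplified illustration under stated assumptions: a "layout pattern" is taken to be the tag/class path of the text node containing the extracted value, the "typical" pattern is the most frequent path, and the sample pages and field values are invented. Sequential dependencies and heuristic rules are not covered.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def generate_patterns(pages):
    """pages: (html, {field: value}) pairs, values from the ontology-based annotator."""
    votes = {}
    for html, fields in pages:
        collector = PathCollector()
        collector.feed(html)
        for field, value in fields.items():
            for path, text in collector.nodes:
                if value in text:
                    votes.setdefault(field, Counter())[path] += 1
    # The most frequent path is taken as the typical layout pattern for the field.
    return {field: counter.most_common(1)[0][0] for field, counter in votes.items()}


def apply_patterns(html, patterns):
    """Structural annotation: return the first text node found at each field's path."""
    collector = PathCollector()
    collector.feed(html)
    by_path = {}
    for path, text in collector.nodes:
        by_path.setdefault(path, text)
    return {f: by_path[p] for f, p in patterns.items() if p in by_path}


page = ('<div class="ad"><h2 class="loc">CAPITOL HILL</h2>'
        '<p class="body">Luxury 2 bdrm 2 bath. $1250 mo.</p>'
        '<span class="tel">533-0293</span></div>')
fields = {"Location": "CAPITOL HILL", "ContactPhone": "533-0293"}
patterns = generate_patterns([(page, fields)])

new_page = ('<div class="ad"><h2 class="loc">PROVO</h2>'
            '<p class="body">Cozy studio. $600 mo.</p>'
            '<span class="tel">555-0100</span></div>')
print(apply_patterns(new_page, patterns))
```

Here the pattern learned from one annotated page picks up a location on a second page that no recognizer has seen, as long as the layout is unchanged. A field such as BedroomNr, which shares a text node with other values, would need something finer than a node path.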

  18. Instance recognizer enrichment • Get the annotated outputs from the layout-driven annotator • Apply the results to the current corresponding instance recognizers • If a value is recognized, continue • Otherwise, • for dictionary-type recognizers, insert the new value • for regular-expression-type recognizers, try to generate a new regular expression and alert the user to check it
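The enrichment procedure above might look roughly like the following. The recognizer data layout, the `generalize` heuristic for proposing a regular expression, and the sample values are all hypothetical; the slides do not say how new expressions are generated. Proposed expressions are returned as alerts and are not installed until a user has checked them.

```python
import re


def generalize(value):
    """Propose a regex for a value: digit runs and letter runs become counted classes."""
    parts = []
    for m in re.finditer(r"(?P<d>\d+)|(?P<a>[A-Za-z]+)|(?P<o>.)", value, re.DOTALL):
        tok = m.group()
        if m.lastgroup == "d":
            parts.append(rf"\d{{{len(tok)}}}")
        elif m.lastgroup == "a":
            parts.append(rf"[A-Za-z]{{{len(tok)}}}")
        else:
            parts.append(re.escape(tok))
    return "".join(parts)


def enrich(recognizers, layout_output):
    """Apply layout-driven results to the recognizers; return alerts for the user."""
    alerts = []
    for field, value in layout_output:
        rec = recognizers[field]
        if rec["type"] == "dictionary":
            if value not in rec["entries"]:
                rec["entries"].add(value)  # dictionary-type: insert directly
        elif not any(re.fullmatch(p, value) for p in rec["patterns"]):
            # regex-type: propose a pattern, but leave installation to the user
            alerts.append((field, value, generalize(value)))
    return alerts


recognizers = {
    "Location": {"type": "dictionary", "entries": {"CAPITOL HILL"}},
    "ContactPhone": {"type": "regex", "patterns": [r"\d{3}-\d{4}"]},
}
layout_output = [
    ("Location", "PROVO"),
    ("ContactPhone", "533-0293"),        # already recognized
    ("ContactPhone", "(801) 555-0100"),  # new format
]
alerts = enrich(recognizers, layout_output)
for field, value, candidate in alerts:
    print(f"check {field}: {value!r} suggests {candidate}")
```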

  19. Preliminary results: Apartment Rental domain • Ontology-based annotation • 90% precision and recall on average for nearly all fields • Exceptions: Location and Contact Name • Layout-driven annotation • Nearly 100% precision and recall on Location and Contact Name • Lower recall on fields such as BedroomNr • Pattern generation • Works well on well-structured fields such as Location • Less successful on semi-structured fields such as BedroomNr • Instance recognizer enrichment • Good results even with poorly constructed initial instance recognizers
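For readers unfamiliar with the two measures, per-field precision and recall can be computed as below. The gold and predicted annotations are invented toy data for one field and are unrelated to the results reported on this slide.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (document, value) pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy BedroomNr annotations: (ad id, value). Not real experimental data.
gold = {("ad1", "2"), ("ad2", "3"), ("ad3", "1"), ("ad4", "2")}
predicted = {("ad1", "2"), ("ad2", "3"), ("ad3", "4")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy data one wrong value lowers precision, while the missed ad lowers only recall. "Lower recall on fields such as BedroomNr" therefore means instances were missed, not mislabeled.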

  20. Summary • Automatically produce layout patterns using outputs of ontology-based annotation • Automatically enrich domain-specific instance recognizers using outputs of layout-driven annotation • A new synergistic annotation model that retains original strengths and minimizes original weaknesses • An annotation system that self-improves its performance during its execution

  21. Future work • Dynamically tuning annotation based on user perspectives • Ensemble of various annotators • Collaborative annotation

  22. Thank you • Yihong Ding ding@cs.byu.edu • (801) 422-7604 • 2262 TMCB, Brigham Young University • Provo, UT 84601 • Data Extraction Research Lab at Brigham Young University • http://www.deg.byu.edu • Homepage, my virtual home on Web 1.0 • http://www.deg.byu.edu/ding/ • Thinking Space, my virtual home on Web 2.0 • http://yihongs-research.blogspot.com/
