
Semiautomatic Generation of Resilient Data-Extraction Ontologies

Yihong Ding, Data Extraction Group, Brigham Young University. Sponsored by NSF.


Presentation Transcript


  1. Semiautomatic Generation of Resilient Data-Extraction Ontologies
     Yihong Ding
     Data Extraction Group, Brigham Young University
     Sponsored by NSF

  2. Wrapper-Driven Data Extraction
     • Web data extraction
       • Obtain user-specified information from Web documents
     • Wrapper
       • Converts implicit HTML data into explicit formatted data
       • Data-source-specific, high performance
       • Examples: SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …

  3. Common Problem of Wrappers
     [Figure: SoftMealy finite-state transducer. States: b, U, N, _U, _N. Transition labels: s<b,U> / "U=" + next_token; s<U,N> / "N=" + next_token; s<b,N> / "N=" + next_token; s<U,U> / ε; s<N,N> / ε; ? / next_token; ? / ε]
     Sample input:
       <LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>
     • Resiliency
       • Fixed domain
       • Changeable layout
     • Scalability
       • Unchanged existing wrapper
       • Extendable domain and functions

  4. Data-Extraction Ontology
     • Structure
       • Object sets
       • Relationship sets
       • Participation constraints
       • Data frames
     • Pros: resilient and scalable
     • Cons: hard to create
       • Knowledge requirements
       • Tedious and error-prone work
     Example:
       Car [-> object];
       Car [0:1] has Make [1:*];
       Make matches [10] constant
         { extract "\baudi\b"; };
       end;
       Car [0:1] has Model [1:*];
       Model matches [25] constant
         { extract "80"; context "\baudi\S*\s*80\b"; };
       end;
       Car [0:1] has Mileage [1:*];
       Mileage matches [8] constant
         { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
       end;
       Car [0:1] has Price [1:*];
       Price matches [8] constant
         { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
       end;
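The data frames on this slide pair an `extract` pattern with an optional `context` pattern and an optional `substitute` rule. The following is a minimal Python sketch of how such frames might be applied to one advertisement. It rests on several assumptions that the slide does not state: that matching is case-insensitive, that `extract` is applied inside each `context` match when a context is given, and that `substitute` rewrites the extracted value. The function name `apply_data_frames`, the dictionary layout, and the sample ad text are hypothetical.

```python
import re

# Patterns transcribed from the slide; the dictionary layout is an assumption.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}


def apply_data_frames(text, frames=DATA_FRAMES):
    """Return {concept: [values]} for every frame that matches the text."""
    results = {}
    for concept, frame in frames.items():
        values = []
        if "context" in frame:
            # Assumed semantics: find the context first, then extract inside it.
            for ctx in re.finditer(frame["context"], text, re.IGNORECASE):
                hit = re.search(frame["extract"], ctx.group(0), re.IGNORECASE)
                if hit:
                    values.append(hit.group(0))
        else:
            values = [m.group(0)
                      for m in re.finditer(frame["extract"], text, re.IGNORECASE)]
        if "substitute" in frame:
            pattern, replacement = frame["substitute"]
            values = [re.sub(pattern, replacement, v) for v in values]
        if values:
            results[concept] = values
    return results


# Hypothetical advertisement, not taken from the presentation.
sample_ad = "Audi 80, 95k miles, $4500 obo"
extracted = apply_data_frames(sample_ad)
print(extracted)
```

Under these assumptions the `context` pattern keeps the bare `80` from matching unrelated numbers, and the `substitute` rule normalizes `95k` to `95000`.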

  5. Motif of Ontology Generation
     [Diagram labels: Sample Documents; Human Brain; Concepts of Interest; Knowledge Base; Concepts with Relations; Data-Extraction Ontology]

  6. Thesis Statement
     • Given: knowledge base
     • Input: sample Web pages of interest
     • Output: a data-extraction ontology for the domain of interest
     • Between input and output: the work of this thesis

  7. Ontology-Generation Procedure
     [Diagram labels: Knowledge Sources; pre-processing; Integrated Knowledge Base; training documents; pre-processing; clean records; Concept Selection; Relation Retrieval; Constraint Discovery; interact if necessary; Data Extraction Ontology; test documents; Extraction Processing; Results Storage; Result Evaluation]

  8. Primary Knowledge Source
     • Requirements
       • Available
       • General in coverage
       • Rich in meaningful relationships
       • Encoded in or easily converted to XML
     • Mikrokosmos (μK) Ontology
       • Developed by NMSU jointly with U.S. DoD
       • Contains over 5000 concepts
       • Averages 14 links per concept
       • Represented in XML format

  9. Integrated Knowledge Base
     • μK Ontology
     • Lexicons
     • Data-Frame Library
     • Synonym Dictionary (WordNet)

  10. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  11. Domain Specification
     • Training documents
       • Data-rich
       • Narrow in topic breadth
     • Preprocessing

  12. Example – Car Advertisement
     Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PLGreat Condition, $10,800, Call 798-3446
     Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
     Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
     Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah

  13. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  14. Concept Selection
     • Selection strategies
       • Compare a string with the name of a concept
       • Compare a string with the values belonging to a concept
       • Apply data-frame recognizers to recognize a string
     [Figure labels: KB; <PHONE-NR>]
     Sample record: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah
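The three selection strategies can be sketched in Python as follows. This is only an illustration under stated assumptions: the miniature knowledge base `TOY_KB`, its concept entries (including the `COLOR` concept and its values), and the recognizer patterns are all hypothetical and are not the contents of the knowledge base used in the presentation. Only the sample record comes from the slide.

```python
import re

# Hypothetical miniature knowledge base; entries are illustrative only.
TOY_KB = {
    "CENTURY": {},                                   # matched by name only
    "COLOR": {"values": {"green", "red"}},           # matched by known values
    "PHONE-NR": {"recognizer": r"\b\d{3}-\d{4}"},    # matched by a data frame
    "PRICE": {"recognizer": r"\$\d{1,3}(?:,\d{3})*"},
}


def select_concepts(record, kb):
    """Return {concept: [strategies that fired]} for one training record."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", record)]
    selected = {}
    for concept, entry in kb.items():
        reasons = []
        # Strategy 1: a string equals the name of a concept.
        if concept.lower() in tokens:
            reasons.append("name")
        # Strategy 2: a string is one of the values belonging to a concept.
        if any(t in entry.get("values", ()) for t in tokens):
            reasons.append("value")
        # Strategy 3: a data-frame recognizer matches somewhere in the record.
        recognizer = entry.get("recognizer")
        if recognizer and re.search(recognizer, record):
            reasons.append("data frame")
        if reasons:
            selected[concept] = reasons
    return selected


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, "
          "1-877-228-9486, OREM Utah")
selected = select_concepts(record, TOY_KB)
print(selected)
```

Note that the name strategy selects `CENTURY` here even though "Century" is a car model in this record, which is the kind of conflict the next slide addresses.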

  15. Concept Selection
     • Reasons for conflict
       • Synonymy
       • Polysemy
     • Conflict resolution
       • Same string, only one meaning
       • Favor longer over shorter
       • Context decides meaning
     [Figure labels: KB; <PRICE>; <MILEAGE>; "by keyword identification"; "price"]
     Sample record: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.
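One way to read the three conflict-resolution rules is as a ranking over candidate matches: prefer longer spans, break ties by nearby keywords, and let each string keep only one meaning. The sketch below implements that reading. It is an assumption about how the rules combine, not the presentation's actual procedure; the candidate list (including the contrived `YEAR` match on "1250"), the keyword table, and the 20-character context window are all hypothetical.

```python
def resolve_conflicts(text, candidates, keywords, window=20):
    """Keep non-overlapping (start, end, concept) spans.

    Longer spans win over shorter ones; equal-length spans are ranked by
    how many of the concept's keywords appear near the span.
    """
    def context_score(candidate):
        start, end, concept = candidate
        nearby = text[max(0, start - window):end + window].lower()
        return sum(kw in nearby for kw in keywords.get(concept, ()))

    ranked = sorted(candidates,
                    key=lambda c: (c[1] - c[0], context_score(c)),
                    reverse=True)
    kept = []
    for start, end, concept in ranked:
        # "Same string, only one meaning": reject anything overlapping a kept span.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, concept))
    return sorted(kept)


text = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250"
i = text.index("13,695")
j = text.index("221-1250")
candidates = [
    (i, i + 6, "PRICE"),        # "13,695" read as a price
    (i, i + 6, "MILEAGE"),      # "13,695" read as a mileage
    (j, j + 8, "PHONE-NR"),     # "221-1250"
    (j + 4, j + 8, "YEAR"),     # "1250", a contrived shorter competitor
]
KEYWORDS = {"PRICE": ("retail", "$"), "MILEAGE": ("miles", "mileage")}

resolved = resolve_conflicts(text, candidates, KEYWORDS)
print([concept for _, _, concept in resolved])  # ['PRICE', 'PHONE-NR']
```

In this toy run the keyword "Retail" tips "13,695" toward `PRICE`, and the longer phone-number span displaces the shorter `YEAR` candidate.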

  16. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  17. Relationship Retrieval
     [Figure labels: KB; <AUTOMOBILE>; <MILEAGE>; <YEAR>; <PRICE>; <PHONE-NR>; <AUDIO-MEDIA-ARTIFACT>; <CENTURY>]

  18. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  19. Constraint Discovery
     Sample records (each annotated with <AUTOMOBILE> and <PRICE>):
       02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
       00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, 1-877-228-9486, OREM Utah
     Discovered constraint:
       AUTOMOBILE [0:1] IsA.ARTIFACT.CostofProduction PRICE [1:1]
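The slide does not spell out how participation constraints are derived from the training records. One plausible ingredient, shown here purely as an assumption, is to count how often a concept's recognizer fires in each record and report the observed minimum and maximum. The function names, the recognizer patterns, and the `[min:max]` formatting rule (writing `*` for any maximum above one) are hypothetical; the records are the four car advertisements from the example slide. This sketch is not claimed to reproduce the constraint shown above.

```python
import re

RECORDS = [
    "00 GrandAM SE, Sunfire Red, CD, AC, PW, PLGreat Condition, $10,800, Call 798-3446",
    "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250",
    "02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200To Apply By Phone, "
    "1-877-228-9486, OREM Utah",
]

# Hypothetical recognizers, standing in for data frames from a library.
PRICE_PATTERN = r"\$\d{1,3}(?:,\d{3})*"
PHONE_PATTERN = r"\d{3}-\d{4}"


def observed_cardinality(records, pattern):
    """Return (min, max) number of matches of the pattern per record."""
    counts = [len(re.findall(pattern, record)) for record in records]
    return min(counts), max(counts)


def format_constraint(low, high):
    """Render an observed range as [min:max], using * for 'more than one'."""
    return f"[{low}:{high if high <= 1 else '*'}]"


price_range = observed_cardinality(RECORDS, PRICE_PATTERN)
phone_range = observed_cardinality(RECORDS, PHONE_PATTERN)
print("Price", format_constraint(*price_range))
print("PhoneNr", format_constraint(*phone_range))
```

With only four records the observed ranges are fragile: Record 2 lacks a dollar sign, so the price recognizer misses it, which illustrates why a lack of training examples leads to constraint errors.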

  20. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  21. Ontology Generation
     • Concept nodes → object sets
     • Paths → relationship sets
     • Discovered constraints → participation constraints
     • Concept recognizers → data frames
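The "paths → relationship sets" mapping can be illustrated with a breadth-first search over a knowledge-base graph, where the link and node names along the path become the relationship-set name. The graph fragment below is hypothetical (in particular the `HasPart` link); it is arranged so that the path from `AUTOMOBILE` to `PRICE` spells out the name `IsA.ARTIFACT.CostOfProduction` that appears in the generated ontology. Breadth-first search is my choice for the sketch, not necessarily the retrieval method used in the presentation.

```python
from collections import deque

# Hypothetical fragment of a knowledge-base graph: node -> [(link, neighbor)].
GRAPH = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE"), ("HasPart", "ARTIFACT-PART")],
    "PRICE": [],
    "ARTIFACT-PART": [],
}


def find_path(graph, source, target):
    """Breadth-first search; returns [link, node, link, node, ...] or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for link, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [link, neighbor]))
    return None


def relationship_set(graph, source, target):
    """Name a relationship set after the path that connects two concepts."""
    path = find_path(graph, source, target)
    if path is None:
        return None
    name = ".".join(path[:-1])  # drop the final node, which is the target itself
    return f"{source.title()} {name} {target.title()}"


generated = relationship_set(GRAPH, "AUTOMOBILE", "PRICE")
print(generated)  # Automobile IsA.ARTIFACT.CostOfProduction Price
```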

  22. Automatically Generated Ontology -- Car Advertisement
       (01) {Automobile [-> object];}
       (02) {Automobile [0:1] has Mileage [1:1];}
       (03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
       (12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
       (20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}

  23. Ontology-Generation Procedure
     [Process diagram repeated from slide 7]

  24. Updating Strategies
     • Remove all bad relationship sets
     • Modify remaining incorrect relationship sets
       • Substitute incorrect object sets
       • Reduce long n-ary relationship sets
       • Fix participation constraints
       • Adjust names or re-arrange sequences
     • Add new relationship sets

  25. Final Ontology
       Car [-> object]
       Car [0:1] has Year [1:*]
       Car [0:1] has Mileage [1:*]
       Car [0:1] has Price [1:*]
       PhoneNr [1:*] is for Car [0:1]
       PhoneNr [0:1] has Extension [1:*]
       Car [0:*] has Feature [1:*]
       Car [0:1] has Make [1:*]
       Car [0:1] has Model [1:*]

  26. Evaluation Criteria
     • Basic measures
       • POG (Precision of Ontology Generation)
       • ROG (Recall of Ontology Generation)
     • Human constraints
       • PROG (Pseudo-ROG)
       • Comparing with an expert-created ontology
     • Knowledge base constraints
       • EPROG (Effective-PROG)
     • Correctness dependency
       • DEPROG (Dependent-EPROG)
       • For example: relationship sets depend on object sets
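The slide names the measures without giving formulas. Assuming POG and ROG follow the standard definitions of precision and recall (correct generated components over all generated components, and over all components in the reference ontology, respectively), they could be computed as in the sketch below. The function name and the two object-set lists are hypothetical illustrations, and the adjusted variants (PROG, EPROG, DEPROG) are not modeled because their exact definitions are not given here.

```python
def precision_recall(generated, expected):
    """Standard precision and recall over two collections of components."""
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical object sets: what a generator produced vs. an expert's ontology.
generated_sets = ["Car", "Mileage", "Price", "Year", "PhoneNr", "Truck"]
expert_sets = ["Car", "Year", "Mileage", "Price", "PhoneNr",
               "Extension", "Feature", "Make", "Model"]

pog, rog = precision_recall(generated_sets, expert_sets)
print(f"POG = {pog:.2f}, ROG = {rog:.2f}")
```

Here five of the six generated object sets are correct, and five of the nine expert object sets are recovered.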

  27. Evaluation Results

  28. Discussion of Results
     • Bottleneck: cannot generate what is not in the knowledge base
     • Object sets
       • Concept-selection procedure works well
       • Desired concept not shown in training records
         • Rarely occurring concept → not severe even if we don't fix the error
         • Example: extension
       • Aggregation and union
         • USAddressCity, USAddressState, USAddressZipCode → Location
         • CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
       • Close-meaning concepts: FurniturePart → Furnished

  29. Discussion of Results
     • Relationship sets
       • Binary relationship sets over 95%
       • Most errors due to incorrectly generated object sets
       • Semantically incorrect relationship sets
         • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
       • n-ary relationship sets (usually huge)
     • Participation constraints
       • Error due to lack of training examples
       • How much is enough?

  30. Knowledge Base Extensibility
     • Add SALT -- a new knowledge source
     • Successfully integrated into existing KB
     • Sample new relationship set (DOE abstract domain)
       • CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

  31. Conclusion
     • Experimented with knowledge-base construction and extension
     • Standardized application domain specification
     • Generated data-extraction ontologies from a specified domain and an integrated knowledge base
     • Showed DEPROG results of more than 70% on average and over 90% for well-defined domains

  32. Future Work
     • Build a general-purpose knowledge source for data-extraction usage
     • Study more about data frames
       • Can a system correctly identify concepts with data frames?
       • Can a system update a data frame to fit a special situation?
       • Can a system generate a data frame from a collection of information of interest?
