
Presentation Transcript


  1. Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents. 1 Kouznetsov A, 2 Shoebottom B, 1 Baker CJO. 1 Department of Computer Science and Applied Statistics, University of New Brunswick, Saint John, Canada; 2 Innovatia, Inc., Saint John, Canada

  2. Motivation: Why Ontology-Centric? • Problem: to respond to information requests in a timely way, contact center workers must search through many types of knowledge resources • Challenge: increase quality of service while decreasing contact center costs • Solution: use an ontology-centric platform • less escalation to more experienced workers • less time spent resolving cases • greatly reduced training time

  3. Motivation: Why Text Mining? • Problem: highly educated experts spend significant time populating the ontology • Challenge: reduce the workload • Solution: apply text mining, a semi-automatic method for extracting information, specifically named entities and their relations, from text and populating a domain ontology.

  4. Focus • We focus on accurately extracting relations between named entities and populating them as object properties between A-box individuals in an OWL-DL ontology.

  5. Populate A-box Object Property: Single Property (diagram). T-Box: domain class Man, object property hasSister, range class Woman. A-Box: domain instance Samuel, range instance Mary. Question: should hasSister(Samuel, Mary) be asserted?
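As a concrete illustration, here is a minimal OWLAPI sketch (OWLAPI is one of the implementation tools listed on slide 9) of asserting such an A-box object property; the namespace, class names and ontology handling are illustrative, not the authors' code.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Minimal sketch: asserting hasSister(Samuel, Mary) as an A-box object property.
// The namespace IRI is a hypothetical placeholder.
public class PopulateSingleProperty {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/telecom#";               // illustrative namespace
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLOntology ontology = manager.createOntology(IRI.create(ns));

        OWLNamedIndividual samuel = factory.getOWLNamedIndividual(IRI.create(ns + "Samuel"));
        OWLNamedIndividual mary   = factory.getOWLNamedIndividual(IRI.create(ns + "Mary"));
        OWLObjectProperty hasSister = factory.getOWLObjectProperty(IRI.create(ns + "hasSister"));

        // A-box assertion: hasSister(Samuel, Mary)
        manager.addAxiom(ontology,
                factory.getOWLObjectPropertyAssertionAxiom(hasSister, samuel, mary));
    }
}
```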

  6. Populate A-box Object Property: Multiple Properties (diagram). T-Box: domain class Man, range class Woman, object properties hasMother and hasSister. A-Box: domain instance Samuel, range instance Mary. Question: should hasSister(Samuel, Mary), hasMother(Samuel, Mary), or both be asserted?

  7. A more complicated case: for domain instance Samuel and range instance Mary, which of hasSister, hasMother and hasSameLastName should be asserted?

  8. Methodology • Ontology-based information retrieval applies natural language processing (NLP) to link text segments, named entities and relations between named entities to existing ontologies • The algorithm leverages a customized gazetteer list, including lists specific to object property synonyms • A-box property candidates are scored using functions of the distance between co-occurring terms • A-box properties are predicted and populated based on these scores (thresholds, fuzzy approach)

  9. Main Implementation tools • Java • GATE/JAPE • OWLAPI

  10. Semi-Automatic Ontology Population Pipeline (diagram). Pre-processing: source documents (XML), term list (Excel), synonym lists, connecting resources, and separation of text segments into sentences, tables and other text segments. Text segment processing: named entity recognition and single/multi relation extraction using the unpopulated ontology (OWL). Ontology population: the populated ontology supports reasoning, visualizing and visual queries.

  11. Populating the Ontology (diagram). A Relation Framework extracts A-box candidates from the ontology; the Scoring Framework's co-occurrence-based score generator assigns scores to each candidate; the Decision Framework's decision module combines the scores with labelled data and thresholds to decide which candidates are asserted, and reasoning is applied to the result.

  12. Co-occurrence-Based Score Generator (diagram). For each A-box candidate, the Relations Framework supplies a RelationObject with all related content; a gazetteer built from the synonym lists and a tokenizer feed the score calculator, whose fragments processor and integrator produce the candidate's scores (a light version of the generator is also shown).

  13. Generation of Scores • A Relation Collection Framework processes Relation objects • A Relation object integrates an object property with: • all types of related text fragments • the ontology objects involved • and intermediate and final score-processing results, identified as Domain Class : Domain Instance : Object Property : Range Class : Range Instance
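A minimal Java sketch of what such a Relation object might look like; all class and field names are assumptions inferred from this slide (the example values in the comments come from slides 24-26), not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a Relation object: it ties one candidate A-box object
// property to its ontology objects, the text fragments that mention it, and the
// scores computed for it.
public class RelationCandidate {
    String domainClass;      // e.g. "Telecommunications_Chassis"
    String domainInstance;   // e.g. "8010co_Chassis"
    String objectProperty;   // e.g. "hasChassis_Shipping_Accessories"
    String rangeClass;       // e.g. "Telecommunications_Chassis_Screws"
    String rangeInstance;    // e.g. "Screws"

    List<String> relatedFragments = new ArrayList<>(); // sentences, table cells, headers
    List<Double> fragmentScores   = new ArrayList<>(); // intermediate per-fragment scores
    double integratedScore;                             // sum over all fragments
    double normalizedScore;                             // integrated score / norm

    // The colon-separated identifier format used in the example slides
    String key() {
        return domainClass + ":" + domainInstance + ":" + objectProperty + ":"
                + rangeClass + ":" + rangeInstance;
    }
}
```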

  14. Score Generator: Details. Score Calculator: • calculates scores for the text fragments associated with the Relation • the current version is based on the distance between co-occurring entities and the number of text fragments with co-occurrence • includes a Text Fragments Processor and an Integrator

  15. 2-Term and 3-Term Scoring System (diagram). The legacy 2-term system builds a gazetteer from the domain and range synonym lists, tokenizes each sentence, and passes it to the score processor to produce a sentence score; the new 3-term system adds the object property synonym list to the gazetteer.

  16. Multiple Formats Score Generation Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines: • Table Processing • Sentence Processing • Other segments

  17. Extensible Data Model (diagram). A Corpus contains Documents (Doc ID); each Document contains Document Segments: Text Segments (broken into Sentences) and Table Segments (broken into Table Header, Column Headers, Row Headers and Data Cells); every element carries an ID and Content. Options for further text segments: sections, paragraphs, bullet lists, headings.
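A minimal Java sketch of this data model, assuming the element names shown on the slide; the class structure and fields are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Corpus -> Document -> segments; every element carries an ID and content.
class Corpus   { List<Document> documents = new ArrayList<>(); }
class Document { String docId; List<Segment> segments = new ArrayList<>(); }

abstract class Segment { String id; String content; }

// Text side: sentences (with options for sections, paragraphs, bullet lists, headings)
class TextSegment extends Segment { List<Sentence> sentences = new ArrayList<>(); }
class Sentence    extends Segment { }

// Table side: table header, column/row headers and data cells
class TableSegment extends Segment {
    TableHeader header;
    List<ColumnHeader> columnHeaders = new ArrayList<>();
    List<RowHeader>    rowHeaders    = new ArrayList<>();
    List<DataCell>     dataCells     = new ArrayList<>();
}
class TableHeader  extends Segment { }
class ColumnHeader extends Segment { }
class RowHeader    extends Segment { }
class DataCell     extends Segment { }
```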

  18. A-Box Property Candidate List (diagram). Ontology processing of the 102 T-box object properties yields 399 A-box object property candidates and a gazetteer list; text mining over the corpus finds 256 candidates with an occurrence of the domain or range individual, and 143 candidates with a co-occurrence of both domain and range individuals, which feed A-box property population.
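A hedged Java sketch of the candidate filtering implied by these counts, reusing the RelationCandidate sketch above; plain substring matching stands in here for the gazetteer- and synonym-based matching the actual pipeline uses.

```java
import java.util.ArrayList;
import java.util.List;

// Keep only candidates whose domain and range individuals co-occur in at least
// one text segment; the remaining candidates go forward to scoring.
class CandidateFilter {
    List<RelationCandidate> filterByCooccurrence(List<RelationCandidate> candidates,
                                                 List<String> segments) {
        List<RelationCandidate> kept = new ArrayList<>();
        for (RelationCandidate c : candidates) {
            for (String segment : segments) {
                if (segment.contains(c.domainInstance) && segment.contains(c.rangeInstance)) {
                    kept.add(c);
                    break;
                }
            }
        }
        return kept;
    }
}
```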

  19. Evidence for A-box Object Property Candidates (diagram). For each current A-box object property candidate, the A-box scoring collects evidence from text segments (sentences) and table segments (table headers, column headers, row headers, data cells), each carrying an ID and content; evidence is grouped into segments where the domain and range co-occur and segments where only the domain or the range occurs.

  20. Table Segments: Primary Scoring (diagram). For the current A-box object property candidate, the domain, property and range are matched against the content of a table segment's table header, column header, row header and data cells, and the matches feed the A-box scoring.

  21. Table Segments: Secondary Scoring (diagram). As in the primary scoring, the domain, property and range of the current candidate are matched against the table header, column header, row header and data cell content of a table segment for A-box scoring.

  22. Sentence Scoring • A-box object property score for a sentence: SentenceScore = 1/(distance+1) + Bonus • Integrated object property score over all related sentences: IntegratedScore = SUM(SentenceScore) • The integrated score is then combined with the table scores • Normalized object property score: NormalizedScore = IntegratedScore / Norm
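A minimal Java sketch of these formulas; method and variable names are ours, not the authors'.

```java
import java.util.List;

class SentenceScoring {
    // SentenceScore = 1/(distance + 1) + Bonus
    double sentenceScore(int distance, double bonus) {
        return 1.0 / (distance + 1) + bonus;
    }

    // IntegratedScore = SUM(SentenceScore), combined here with the table scores
    double integratedScore(List<Double> sentenceScores, double tableScore) {
        double sum = tableScore;
        for (double s : sentenceScores) sum += s;
        return sum;
    }

    // NormalizedScore = IntegratedScore / Norm (Norm is defined on slide 28)
    double normalizedScore(double integratedScore, double norm) {
        return integratedScore / norm;
    }
}
```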

  23. Sentence Scoring Examples. Score = 1/(distance+1) + Bonus, where D = domain synonym, R = range synonym, P = object property synonym. • Distance 1000, bonus 0: score = 1/(1000+1) + 0 = 0.00099 • Distance 4, bonus 0: score = 1/(4+1) + 0 = 0.2 • Distance 6, bonus 3: score = 1/(6+1) + 3 = 3.14 • Distance 4, bonus 10: score = 1/(4+1) + 10 = 10.2

  24. Example: Sentence Type 1. Distance 1000, bonus 0, score = 1/(1000+1) + 0 = 0.00099. Candidate: Telecommunications_Chassis : 8010co_Chassis : hasChassis_Shipping_Accessories : Telecommunications_Chassis_Screws : Screws • Domain synonyms: 8010co chassis, 8010co Chassis, 8010 CO chassis, 8010co, 8010CO chassis • Property synonyms: need, have, require, has • Range synonyms: Screws, screws • Sentence before cleaning: ["<Paragraph></Action> <Figure Numbered="Unnumbered" Position="Inline" TextSize="medium" Width="column" frame="all" id="DLM-11334063" xml:lang="en"> <image border-style="none" border-width="medium" xml:lang="en" href="ERGNN46205-301Loosening_screws_on_the_SDM_FW4_8010co_chassis33b.png"/></Figure> </Step> <Step xml:lang="en"><Action><Paragraph xml:lang="en">Rotate the insert/extract levers to eject the 8660 SDM from the chassis.] • Final score = 9.99000999000999E-4, best bonus = 0.0, final distance = 1000.0

  25. Example: Sentence Type 2. Candidate: Telecommunications_Chassis : Chassis : hasChassis_Components : Telecommunications_Chassis_Power_Supply : Power_Supply • Property synonyms: have, has • Domain synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis • Range synonyms: Power Supply, transformer, power supply, power module, Power supply • Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other. • Final score = 0.05, best bonus = 0.0, final distance = 19

  26. Example: Sentence Type 4. Candidate: Telecommunications_Chassis_Power_Supply : Power_Supply : isPart_of_Chassis : Telecommunications_Chassis : Chassis • Property synonyms: used in, include • Domain synonyms: Power Supply, transformer, power supply, power module, Power supply • Range synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis • Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other. • Final score = 10.05, best bonus = 10.0, final distance = 19

  27. Bonus Calculation. Bonus = BonusConstant × (number of tokens in the matched property synonym). • Distance 6, bonus constant 10, 1 token in property: score = 1/(6+1) + 1*10 = 10.14 • Distance 6, bonus constant 10, 2 tokens in property: score = 1/(6+1) + 2*10 = 20.14 • Example sentence: "Device X does not support Device Y" • Property "support" (1 token): score = 1/(3+1) + 1*10 = 10.25 • Property "not support" (2 tokens): score = 1/(3+1) + 2*10 = 20.25
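A small Java sketch of the bonus rule; with a bonus constant of 10 it reproduces the worked examples above, so multi-word property synonyms such as "not support" outrank single-word ones such as "support".

```java
class BonusCalculator {
    // Bonus = bonusConstant * number of tokens in the matched property synonym
    double bonus(double bonusConstant, int tokensInPropertySynonym) {
        return bonusConstant * tokensInPropertySynonym;
    }
}
```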

  28. Normalization • Norm coefficient for an A-box object property: Norm = Log(1.0 + (NSD + 1.0/Cd) * (NSR + 1.0/Cr)), where NSD = number of sentences in which the domain occurs, Cd = cardinality of the domain synonym list, NSR = number of sentences in which the range occurs, Cr = cardinality of the range synonym list.
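A minimal Java sketch of the norm coefficient; the slide does not state the logarithm base, so the natural logarithm is assumed here.

```java
class Normalizer {
    // Norm = log(1 + (NSD + 1/Cd) * (NSR + 1/Cr))
    double norm(int nsd, int cd, int nsr, int cr) {
        return Math.log(1.0 + (nsd + 1.0 / cd) * (nsr + 1.0 / cr));
    }
}
```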

  29. Gold Standard and Evaluation Framework (diagram). The population pipeline from slide 10 (pre-processing of source documents, term list and synonym lists; separation of text segments into sentences, tables and bullet lists; named entity and single/multi relation extraction; ontology population with reasoning, visualizing and visual queries) is extended with a prediction evaluation framework: a knowledge engineer supplies labels against the T-box and A-box ontology, the labels are imported into a gold standard database, predicted properties are evaluated and the database updated, and an evaluation report is produced alongside the populated ontology.

  30. Thresholds: Decision Boundary • All scores for each A-box property candidate are summed over the eligible sources of evidence for the A-box in question • A threshold is applied to the total • Trade-off: recall vs. precision
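A hedged Java sketch of the threshold decision: the summation over eligible evidence sources and the threshold comparison follow this slide, while the names are illustrative. Raising the threshold trades recall for precision.

```java
import java.util.List;

class ThresholdDecision {
    // Populate the A-box property only when the summed evidence clears the threshold.
    boolean populate(List<Double> evidenceScores, double threshold) {
        double total = 0.0;
        for (double s : evidenceScores) total += s;
        return total >= threshold;
    }
}
```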

  31. Results for Tables: Baseline result Focus on Positive class Recall and Positive class Precision • Class of interest (Positive class) • Recall =0.80 • Precision=0.85

  32. Results for Tables: Continued Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.25 • Precision=1.0

  33. Results for Tables: Continued Focus on Positive class Recall • Class of interest (Positive class) • Recall = 1.0 • Precision = 0.775

  34. Results for Sentences Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.14 • Precision=1.0

  35. Results for Sentences and Tables Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.4 • Precision=1.0 • Synergetic effect of using Sentences and Tables (wrt Precision=1.0): Recall (sentences)= 0.14 Recall (tables)= 0.25 Recall (sentences & tables)= 0.4

  36. Advantages • Improve quality of the knowledge base • Managing the argumentation process (KB vs. KE) • Iterative improvement of accuracy • Tier 1 doing Tier 2 tasks (improved service) • Tier 1 (high precision): KB query • Tier 2 (high recall): knowledge integration • Facilitate information processing without a knowledge engineer • Reduce workload (cost savings)

  37. Improve Quality of the Knowledge Base • Offline task performed by a knowledge engineer • Disambiguation • The expert can pay special attention to significant inconsistencies between human and machine outputs, such as highly scored A-box candidates labelled as negatives • Human expert and machine committee vs. a single human expert

  38. Real-Time Integration of New Evidence • Online, by a call centre worker, at the knowledge-use stage • Extract additional object properties from new documents for emergency cases • High positive-precision scenario • Offline, by a senior call centre worker, at the knowledge-use stage • Extract additional object properties from new documents for questions not answered online • High positive-recall scenario

  39. Reduce Workload • Online and offline • Automatically extracted evidence • Ranked solutions with a reported level of confidence

  40. Gold Standard Corpus and Evaluation Framework (diagram; repeats the population pipeline and prediction evaluation framework shown on slide 29).

  41. Future Work: Extend Literature Scheme • Sections • Paragraphs • Bullet Lists • Connect with Headings and Topics
