
Presentation Transcript


  1. Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents. 1 Kouznetsov A, 2 Shoebottom B, 1 Baker CJO. 1 Department of Computer Science and Applied Statistics, University of New Brunswick, Saint John, Canada; 2 Innovatia, Inc., Saint John, Canada

  2. Motivation: Why Ontology-Centric? • Problem: to respond to information requests in a timely way, contact center workers must search through many types of knowledge resources • Challenge: increase quality of service while decreasing contact center costs • Solution: use an ontology-centric platform • less escalation to more experienced workers • less time spent resolving cases • greatly reduced training time

  3. Motivation: Why Text Mining? • Problem: highly educated experts spend significant time populating the ontology • Challenge: reduce the workload • Solution: apply text mining, a semi-automatic method for extracting information, specifically named entities and their relations, from text and populating a domain ontology.

  4. Focus • We focus on accurately extracting relations between named entities and populating them as object properties between A-box individuals in an OWL-DL ontology.

  5. Populate A-box Object Property: Single Property (diagram). T-Box: domain class Man, object property hasSister, range class Woman. A-Box: domain instance Samuel, range instance Mary. Question: should hasSister(Samuel, Mary) be asserted?
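As a concrete illustration, here is a minimal OWLAPI sketch (OWLAPI is one of the implementation tools listed on slide 9) of asserting such an A-box object property; the namespace, class names and ontology handling are illustrative, not the authors' code.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Minimal sketch: asserting hasSister(Samuel, Mary) as an A-box object property.
// The namespace IRI is a hypothetical placeholder.
public class PopulateSingleProperty {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/telecom#";               // illustrative namespace
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLOntology ontology = manager.createOntology(IRI.create(ns));

        OWLNamedIndividual samuel = factory.getOWLNamedIndividual(IRI.create(ns + "Samuel"));
        OWLNamedIndividual mary   = factory.getOWLNamedIndividual(IRI.create(ns + "Mary"));
        OWLObjectProperty hasSister = factory.getOWLObjectProperty(IRI.create(ns + "hasSister"));

        // A-box assertion: hasSister(Samuel, Mary)
        manager.addAxiom(ontology,
                factory.getOWLObjectPropertyAssertionAxiom(hasSister, samuel, mary));
    }
}
```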

  6. Populate A-box Object Property: Multiple Properties (diagram). T-Box: domain class Man, range class Woman, object properties hasMother and hasSister. A-Box: domain instance Samuel, range instance Mary. Question: should hasSister(Samuel, Mary), hasMother(Samuel, Mary), or both be asserted?

  7. A more complicated case: for domain instance Samuel and range instance Mary, which of hasSister, hasMother and hasSameLastName should be asserted?

  8. Methodology • Ontology-based information retrieval applies natural language processing (NLP) to link text segments, named entities and relations between named entities to existing ontologies • The algorithm leverages a customized gazetteer list, including lists specific to object property synonyms • A-box property candidates are scored using functions of the distance between co-occurring terms • A-box properties are predicted and populated based on these scores (thresholds, fuzzy approach)

  9. Main Implementation tools • Java • GATE/JAPE • OWLAPI

  10. Semi-Automatic Ontology Population Pipeline (diagram). Pre-processing: source documents (XML), term list (Excel), synonym lists, connecting resources, and separation of text segments into sentences, tables and other text segments. Text segment processing: named entity recognition and single/multi relation extraction using the unpopulated ontology (OWL). Ontology population: the populated ontology supports reasoning, visualizing and visual queries.

  11. Populating the Ontology (diagram). A Relation Framework extracts A-box candidates from the ontology; the Scoring Framework's co-occurrence-based score generator assigns scores to each candidate; the Decision Framework's decision module combines the scores with labelled data and thresholds to decide which candidates are asserted, and reasoning is applied to the result.

  12. Co-occurrence-Based Score Generator (diagram). For each A-box candidate, the Relations Framework supplies a RelationObject with all related content; a gazetteer built from the synonym lists and a tokenizer feed the score calculator, whose fragments processor and integrator produce the candidate's scores (a light version of the generator is also shown).

  13. Generation of Scores • A Relation Collection Framework processes Relation objects • A Relation object integrates an object property with: • all types of related text fragments • the ontology objects involved • and intermediate and final score-processing results, identified as Domain Class : Domain Instance : Object Property : Range Class : Range Instance
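A minimal Java sketch of what such a Relation object might look like; all class and field names are assumptions inferred from this slide (the example values in the comments come from slides 24-26), not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a Relation object: it ties one candidate A-box object
// property to its ontology objects, the text fragments that mention it, and the
// scores computed for it.
public class RelationCandidate {
    String domainClass;      // e.g. "Telecommunications_Chassis"
    String domainInstance;   // e.g. "8010co_Chassis"
    String objectProperty;   // e.g. "hasChassis_Shipping_Accessories"
    String rangeClass;       // e.g. "Telecommunications_Chassis_Screws"
    String rangeInstance;    // e.g. "Screws"

    List<String> relatedFragments = new ArrayList<>(); // sentences, table cells, headers
    List<Double> fragmentScores   = new ArrayList<>(); // intermediate per-fragment scores
    double integratedScore;                             // sum over all fragments
    double normalizedScore;                             // integrated score / norm

    // The colon-separated identifier format used in the example slides
    String key() {
        return domainClass + ":" + domainInstance + ":" + objectProperty + ":"
                + rangeClass + ":" + rangeInstance;
    }
}
```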

  14. Score Generator: Details. Score Calculator: • calculates scores for the text fragments associated with the Relation • the current version is based on the distance between co-occurring entities and the number of text fragments with co-occurrence • includes a Text Fragments Processor and an Integrator

  15. 2-Term and 3-Term Scoring System (diagram). The legacy 2-term system builds a gazetteer from the domain and range synonym lists, tokenizes each sentence, and passes it to the score processor to produce a sentence score; the new 3-term system adds the object property synonym list to the gazetteer.

  16. Multiple Formats Score Generation Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines: • Table Processing • Sentence Processing • Other segments

  17. Extensible Data Model (diagram). A Corpus contains Documents (Doc ID); each Document contains Document Segments: Text Segments (broken into Sentences) and Table Segments (broken into Table Header, Column Headers, Row Headers and Data Cells); every element carries an ID and Content. Options for further text segments: sections, paragraphs, bullet lists, headings.
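A minimal Java sketch of this data model, assuming the element names shown on the slide; the class structure and fields are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Corpus -> Document -> segments; every element carries an ID and content.
class Corpus   { List<Document> documents = new ArrayList<>(); }
class Document { String docId; List<Segment> segments = new ArrayList<>(); }

abstract class Segment { String id; String content; }

// Text side: sentences (with options for sections, paragraphs, bullet lists, headings)
class TextSegment extends Segment { List<Sentence> sentences = new ArrayList<>(); }
class Sentence    extends Segment { }

// Table side: table header, column/row headers and data cells
class TableSegment extends Segment {
    TableHeader header;
    List<ColumnHeader> columnHeaders = new ArrayList<>();
    List<RowHeader>    rowHeaders    = new ArrayList<>();
    List<DataCell>     dataCells     = new ArrayList<>();
}
class TableHeader  extends Segment { }
class ColumnHeader extends Segment { }
class RowHeader    extends Segment { }
class DataCell     extends Segment { }
```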

  18. A-Box Property Candidate List (diagram). Ontology processing of the 102 T-box object properties yields 399 A-box object property candidates and a gazetteer list; text mining over the corpus finds 256 candidates with an occurrence of the domain or range individual, and 143 candidates with a co-occurrence of both domain and range individuals, which feed A-box property population.
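A hedged Java sketch of the candidate filtering implied by these counts, reusing the RelationCandidate sketch above; plain substring matching stands in here for the gazetteer- and synonym-based matching the actual pipeline uses.

```java
import java.util.ArrayList;
import java.util.List;

// Keep only candidates whose domain and range individuals co-occur in at least
// one text segment; the remaining candidates go forward to scoring.
class CandidateFilter {
    List<RelationCandidate> filterByCooccurrence(List<RelationCandidate> candidates,
                                                 List<String> segments) {
        List<RelationCandidate> kept = new ArrayList<>();
        for (RelationCandidate c : candidates) {
            for (String segment : segments) {
                if (segment.contains(c.domainInstance) && segment.contains(c.rangeInstance)) {
                    kept.add(c);
                    break;
                }
            }
        }
        return kept;
    }
}
```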

  19. Evidence for A-box Object Property Candidates (diagram). For each current A-box object property candidate, the A-box scoring collects evidence from text segments (sentences) and table segments (table headers, column headers, row headers, data cells), each carrying an ID and content; evidence is grouped into segments where the domain and range co-occur and segments where only the domain or the range occurs.

  20. Table Segments: Primary Scoring (diagram). For the current A-box object property candidate, the domain, property and range are matched against the content of a table segment's table header, column header, row header and data cells, and the matches feed the A-box scoring.

  21. Table Segments: Secondary Scoring (diagram). As in the primary scoring, the domain, property and range of the current candidate are matched against the table header, column header, row header and data cell content of a table segment for A-box scoring.

  22. Sentence Scoring • A-box object property score for a sentence: SentenceScore = 1/(distance+1) + Bonus • Integrated object property score over all related sentences: IntegratedScore = SUM(SentenceScore) • The integrated score is then combined with the table scores • Normalized object property score: NormalizedScore = IntegratedScore / Norm
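A minimal Java sketch of these formulas; method and variable names are ours, not the authors'.

```java
import java.util.List;

class SentenceScoring {
    // SentenceScore = 1/(distance + 1) + Bonus
    double sentenceScore(int distance, double bonus) {
        return 1.0 / (distance + 1) + bonus;
    }

    // IntegratedScore = SUM(SentenceScore), combined here with the table scores
    double integratedScore(List<Double> sentenceScores, double tableScore) {
        double sum = tableScore;
        for (double s : sentenceScores) sum += s;
        return sum;
    }

    // NormalizedScore = IntegratedScore / Norm (Norm is defined on slide 28)
    double normalizedScore(double integratedScore, double norm) {
        return integratedScore / norm;
    }
}
```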

  23. Sentence Scoring Examples. Score = 1/(distance+1) + Bonus, where D = domain synonym, R = range synonym, P = object property synonym. • Distance 1000, bonus 0: score = 1/(1000+1) + 0 = 0.00099 • Distance 4, bonus 0: score = 1/(4+1) + 0 = 0.2 • Distance 6, bonus 3: score = 1/(6+1) + 3 = 3.14 • Distance 4, bonus 10: score = 1/(4+1) + 10 = 10.2

  24. Example: Sentence Type 1. Distance 1000, bonus 0, score = 1/(1000+1) + 0 = 0.00099. Candidate: Telecommunications_Chassis : 8010co_Chassis : hasChassis_Shipping_Accessories : Telecommunications_Chassis_Screws : Screws • Domain synonyms: 8010co chassis, 8010co Chassis, 8010 CO chassis, 8010co, 8010CO chassis • Property synonyms: need, have, require, has • Range synonyms: Screws, screws • Sentence before cleaning: ["<Paragraph></Action> <Figure Numbered="Unnumbered" Position="Inline" TextSize="medium" Width="column" frame="all" id="DLM-11334063" xml:lang="en"> <image border-style="none" border-width="medium" xml:lang="en" href="ERGNN46205-301Loosening_screws_on_the_SDM_FW4_8010co_chassis33b.png"/></Figure> </Step> <Step xml:lang="en"><Action><Paragraph xml:lang="en">Rotate the insert/extract levers to eject the 8660 SDM from the chassis.] • Final score = 9.99000999000999E-4, best bonus = 0.0, final distance = 1000.0

  25. Example: Sentence Type 2. Candidate: Telecommunications_Chassis : Chassis : hasChassis_Components : Telecommunications_Chassis_Power_Supply : Power_Supply • Property synonyms: have, has • Domain synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis • Range synonyms: Power Supply, transformer, power supply, power module, Power supply • Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other. • Final score = 0.05, best bonus = 0.0, final distance = 19

  26. Example: Sentence Type 4. Candidate: Telecommunications_Chassis_Power_Supply : Power_Supply : isPart_of_Chassis : Telecommunications_Chassis : Chassis • Property synonyms: used in, include • Domain synonyms: Power Supply, transformer, power supply, power module, Power supply • Range synonyms: chassis, switch chassis, 8000 series, Chassis, CO chassis • Sentence after cleaning: In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other. • Final score = 10.05, best bonus = 10.0, final distance = 19

  27. Bonus Calculation. Bonus = BonusConstant × (number of tokens in the matched property synonym). • Distance 6, bonus constant 10, 1 token in property: score = 1/(6+1) + 1*10 = 10.14 • Distance 6, bonus constant 10, 2 tokens in property: score = 1/(6+1) + 2*10 = 20.14 • Example sentence: "Device X does not support Device Y" • Property "support" (1 token): score = 1/(3+1) + 1*10 = 10.25 • Property "not support" (2 tokens): score = 1/(3+1) + 2*10 = 20.25
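A small Java sketch of the bonus rule; with a bonus constant of 10 it reproduces the worked examples above, so multi-word property synonyms such as "not support" outrank single-word ones such as "support".

```java
class BonusCalculator {
    // Bonus = bonusConstant * number of tokens in the matched property synonym
    double bonus(double bonusConstant, int tokensInPropertySynonym) {
        return bonusConstant * tokensInPropertySynonym;
    }
}
```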

  28. Normalization • Norm coefficient for an A-box object property: Norm = Log(1.0 + (NSD + 1.0/Cd) * (NSR + 1.0/Cr)), where NSD = number of sentences in which the domain occurs, Cd = cardinality of the domain synonym list, NSR = number of sentences in which the range occurs, Cr = cardinality of the range synonym list.
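A minimal Java sketch of the norm coefficient; the slide does not state the logarithm base, so the natural logarithm is assumed here.

```java
class Normalizer {
    // Norm = log(1 + (NSD + 1/Cd) * (NSR + 1/Cr))
    double norm(int nsd, int cd, int nsr, int cr) {
        return Math.log(1.0 + (nsd + 1.0 / cd) * (nsr + 1.0 / cr));
    }
}
```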

  29. Gold Standard and Evaluation Framework (diagram). The population pipeline from slide 10 (pre-processing of source documents, term list and synonym lists; separation of text segments into sentences, tables and bullet lists; named entity and single/multi relation extraction; ontology population with reasoning, visualizing and visual queries) is extended with a prediction evaluation framework: a knowledge engineer supplies labels against the T-box and A-box ontology, the labels are imported into a gold standard database, predicted properties are evaluated and the database updated, and an evaluation report is produced alongside the populated ontology.

  30. Thresholds: Decision Boundary • All scores for each A-box property candidate are summed over the eligible sources of evidence for the A-box in question • A threshold is applied to the total • Trade-off: recall vs. precision
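A hedged Java sketch of the threshold decision: the summation over eligible evidence sources and the threshold comparison follow this slide, while the names are illustrative. Raising the threshold trades recall for precision.

```java
import java.util.List;

class ThresholdDecision {
    // Populate the A-box property only when the summed evidence clears the threshold.
    boolean populate(List<Double> evidenceScores, double threshold) {
        double total = 0.0;
        for (double s : evidenceScores) total += s;
        return total >= threshold;
    }
}
```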

  31. Results for Tables: Baseline result Focus on Positive class Recall and Positive class Precision • Class of interest (Positive class) • Recall =0.80 • Precision=0.85

  32. Results for Tables: Continued Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.25 • Precision=1.0

  33. Results for Tables: Continued Focus on Positive class Recall • Class of interest (Positive class) • Recall = 1.0 • Precision = 0.775

  34. Results for Sentences Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.14 • Precision=1.0

  35. Results for Sentences and Tables Focus on Positive class Precision • Class of interest (Positive class) • Recall =0.4 • Precision=1.0 • Synergetic effect of using Sentences and Tables (wrt Precision=1.0): Recall (sentences)= 0.14 Recall (tables)= 0.25 Recall (sentences & tables)= 0.4

  36. Advantages • Improve quality of the knowledge base • Managing the argumentation process (KB vs. KE) • Iterative improvement of accuracy • Tier 1 doing Tier 2 tasks (improved service) • Tier 1 (high precision): KB query • Tier 2 (high recall): knowledge integration • Facilitate information processing without a knowledge engineer • Reduce workload (cost savings)

  37. Improve Quality of the Knowledge Base • Offline task performed by a knowledge engineer • Disambiguation • The expert can pay special attention to significant inconsistencies between human and machine outputs, such as highly scored A-box candidates labelled as negatives • Human expert and machine committee vs. a single human expert

  38. Real-Time Integration of New Evidence • Online, by a call centre worker, at the knowledge-use stage • Extract additional object properties from new documents for emergency cases • High positive-precision scenario • Offline, by a senior call centre worker, at the knowledge-use stage • Extract additional object properties from new documents for questions not answered online • High positive-recall scenario

  39. Reduce Workload • Online and offline • Automatically extracted evidence • Ranked solutions with a reported level of confidence

  40. Gold Standard Corpus and Evaluation Framework (diagram; repeats the population pipeline and prediction evaluation framework shown on slide 29).

  41. Future Work: Extend Literature Scheme • Sections • Paragraphs • Bullet Lists • Connect with Headings and Topics
