
Peppering knowledge sources with SALT

This paper discusses the use of the SALT project to enhance ontology generation by integrating large-scale termbases with ontological structures and converting them into machine-readable XML format.





Presentation Transcript


  1. Peppering knowledge sources with SALT (Boosting conceptual content for ontology generation) Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby Brigham Young University lonz@byu.edu, {ding,embley}@cs.byu.edu, akm@byu.edu AAAI 2002 WS

  2. Acknowledgements • Co-authors (Embley, Ding) • EU Fifth Framework IST/HLT 3.4.1 • NSF Information and Intelligent Systems grant IIS-0083127 • Gerhard Budin (Eurodicautom data) • Sergei Nirenburg (Mikrokosmos ontology) AAAI 2002 WS

  3. Outline • Termbases and lexicons: (re)use(s) • The SALT and TIDIE projects • Data modeling and data resources • Termbase conversion • Ontology generation • Results and evaluation • Conclusions AAAI 2002 WS

  4. Termbases • Terminology databases for humans in multilingual documentation industry • Several models, formats; often concept-oriented in nature • Termium, Eurodicautom, etc. AAAI 2002 WS

  5. Lexicons • NLP applications: IR, MT, NLU, speech understanding • Widely varying data formats • Description at various levels of linguistic theory AAAI 2002 WS

  6. Sharing resources • Integration is the trend • Lexicons (OLIF for MT system lexicons) • Termbases (MARTIF for human termbases) • Lexicons and termbases • Needed: principled data-modeling approach • Wide variety of information to be treated • Wide range of formats currently in use AAAI 2002 WS

  7. The SALT project • SALT: Standards-based Access service to multilingual Lexicons and Terminologies (www.ttt.org/salt) • International cooperation, standards for coding and interchange of linguistic data, and the combining of technologies • Several partners (BYU TRG, KSU, etc.) • Data-modeling approach that addresses the problem of interchange among diverse collections of such data, including their ontological substructure

  8. The SALT approach • Goal: provide • Modularity • differentiate core structure vs. data category specification • Coherence • use a meta-model • Flexibility • Support interoperable alternative representations • Modular meta-model approach • Implemented in various settings • Ongoing refinement: model’s coverage AAAI 2002 WS

  9. The TIDIE project • TIDIE: Target-based Independent-of-Document Information Extraction (www.deg.byu.edu) • Ontology-based data extraction • Conceptual modeling of real-world applications • Narrow, data-rich domains • Leverage (or build) custom ontologies for target-based extraction AAAI 2002 WS

  10. [Diagram: leverage this (Information Extraction, Schema Matching between Source and Target) … to do this (Information exchange)]

  11. Information Extraction • Examine documents and retrieve information from them to fill in a user-supplied template • Requires some user-oriented specification of the information sought • Our approach: finding, extracting, structuring, and synthesizing information is easier given a conceptual-model-based ontology

  12. Extracting pertinent information from documents AAAI 2002 WS

  13. A Conceptual Modeling Solution [Diagram: the Car object set linked by "has" relationships to Year, Make, Model, Mileage, Price, and Feature; PhoneNr "is for" Car and "has" Extension; each link carries participation constraints such as 0..1, 0..* and 1..*]

  14. Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0..1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; }, … End;
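The Year data frame on this slide can be read as three steps: find a context match, pull the two digits out of it, and prefix them with "19". The following Python sketch illustrates that reading only; the function name `extract_years` is hypothetical, and the actual extraction engine may apply the rule differently.

```python
import re

# Patterns copied from the slide's Year data frame.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")  # two-digit year followed by a comma
EXTRACT = re.compile(r"\d{2}")                     # the digits to keep


def extract_years(text):
    """Return four-digit years for every two-digit year found in context.

    Hypothetical helper: it mimics the slide's 'substitute "^" -> "19"'
    step by prefixing the extracted digits with "19".
    """
    years = []
    for context_match in CONTEXT.finditer(text):
        digits = EXTRACT.search(context_match.group(0))
        if digits:
            years.append("19" + digits.group(0))
    return years


print(extract_years("'89, Subaru SW $1900"))
```

The leading `([^\$\d]|^)` group is what keeps a price such as `$95,000` from being mistaken for a year.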

  15. Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stero 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (363)835-8597 0002 1998 Elandra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction AAAI 2002 WS

  16. Lexical resources for data modeling • Information extraction also requires knowledge representations with terminological and conceptual content. • Extraction ontology knowledge sources must: • be of a general nature • contain meaningful relationships • already exist in machine-readable form • have a straightforward conversion into XML. • This paper: create, leverage large-scale termbase • some ontological structure • reformatted according to the SALT standard • converted into μK-compliant XML for use by the ontology generator AAAI 2002 WS

  17. Eurodicautom • Well-known, widely-used termbase • > 1 million concept entries • Wide range of topics • Entries are multilingual • Entry information: sources cited, input/approval dates, … • Single-word terms (e.g. “generator”) or multi-word expressions (e.g. “black humus”) • Entries each have Lenoch subject-area code • Hierarchical representation for classifying terms (and by extension their related concepts) AAAI 2002 WS

  18. Partial Eurodicautom entry %%CM AG4 CH6 GO6 %%DA %%VE lavmosetørv %%RF A.Klougart %%EN %%VE black humus %%RF CILF,Dict.Agriculture,ACCT,1977 %%IT %%VE humus nero %%RF BTB %%ES %%VE humus negro %%RF CILF,Dict.Agriculture,ACCT,1977 %%SV %%VE sumpjord %%RF Mats Olsson,SLU(1997) AAAI 2002 WS
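A minimal Python sketch of how such a native entry might be read into a dictionary. It assumes that `%%CM` carries the subject codes (as stated on a later slide), that two-letter tags such as `%%DA` or `%%EN` open a language section, and that `%%VE` and `%%RF` hold the term and its source reference. The last two readings are inferred from the sample, and the function and key names are hypothetical.

```python
import re


def parse_entry(raw):
    """Parse one flattened Eurodicautom entry into codes and per-language terms."""
    entry = {"codes": [], "langs": {}}
    current_lang = None
    # Each field is "%%TAG value", where the value runs up to the next "%%".
    for tag, value in re.findall(r"%%(\w+)\s*([^%]*)", raw):
        value = value.strip()
        if tag == "CM":
            entry["codes"] = value.split()
        elif tag in ("VE", "RF"):
            if current_lang is not None:
                key = "term" if tag == "VE" else "source"
                entry["langs"][current_lang][key] = value
        else:
            # Assumed: any other tag is a language code opening a new section.
            current_lang = tag
            entry["langs"][current_lang] = {}
    return entry


sample = ("%%CM AG4 CH6 GO6 %%DA %%VE lavmosetørv %%RF A.Klougart "
          "%%EN %%VE black humus %%RF CILF,Dict.Agriculture,ACCT,1977")
print(parse_entry(sample)["langs"]["EN"]["term"])
```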

  19. Sample Lenoch codes AD Public Administration - Private Administration - Offices AD1 general aspects of the subject field AD2 public and private organisations AD3 publications & documentary search AD31 documentation and information systems AD4 administrative staff AD5 public procurement AD51 expropriation in the public interest TEH testing methods TEH1 general aspects of testing methods TEH2 non-destructive testing TEH21 chemical tests TEH22 photometrical testing TEH221 X-ray spectrometrical testing AAAI 2002 WS

  20. Converting the termbase • Use several thousand English terms and their subject codes • The %%CM line of the sample entry lists three Lenoch codes: • AG4 (representing the subclass AGRONOMY) • CH6 (representing ANALYTICAL-CHEMISTRY) • GO6 (representing GEOMORPHOLOGY) • Convert termbase entries via the SALT-developed TBX termbase exchange framework • XML-based refinement of MARTIF • Convert to the μK XML format used by the ontology engine • Result: TBX-mediated conversion from native Eurodicautom terms to the final XML-specified ontology (μK) • Lenoch codes re-interpreted as typical hierarchical relations (e.g. IS-A and SUBCLASS)
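One way to picture the re-interpretation of Lenoch codes as hierarchical relations is sketched below in Python. It assumes, based only on the sample codes shown earlier (AD, AD3, AD31; TEH, TEH2, TEH22, TEH221), that a code's parent is obtained by dropping its last digit. The helper names are hypothetical and the real conversion may use a different rule.

```python
def parent_code(code):
    """Return the assumed parent of a Lenoch code, or None for a top-level code."""
    # Assumption: each trailing digit adds one level of depth.
    return code[:-1] if code and code[-1].isdigit() else None


def hierarchy_relations(labels):
    """Derive IS-A and SUBCLASSES triples from a {code: label} mapping."""
    relations = []
    for code, label in labels.items():
        parent = parent_code(code)
        if parent in labels:
            relations.append((label, "IS-A", labels[parent]))
            relations.append((labels[parent], "SUBCLASSES", label))
    return relations


labels = {
    "TEH": "testing methods",
    "TEH2": "non-destructive testing",
    "TEH22": "photometrical testing",
    "TEH221": "X-ray spectrometrical testing",
}
for triple in hierarchy_relations(labels):
    print(triple)
```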

  21. Conversion process SALT Eurodicautom (native) Eurodicautom (TBX) Eurodicautom (μK) Lenoch AAAI 2002 WS

  22. Eurodicautom-TBX encoding <?xml version="1.0" encoding="UTF-8"?> <martif xmlns="x-schema:XLTcsV04.xml" lang="en" type="DXLT"> <martifHeader> <fileDesc> <sourceDesc> <p>sample Eurodicautom entry</p> </sourceDesc> </fileDesc> <encodingDesc> <p type="DCSName">DXLTdv04.xml</p> </encodingDesc> </martifHeader> <text> <body> <termEntry id="eid-EDIC-BTB-DAG77-63"> <admin type="originatingInstitution">BTB</admin> <admin type="projectSubset">DAG77</admin> <descrip type="reliabilityCode">4</descrip> <langSet lang="pt"> <ntig> <termGrp> <term>souto</term> <termNote type="termType">fullForm</termNote> </termGrp> <admin type="conceptID">BTB-DAG77-63</admin> <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin> </ntig> <ntig> <termGrp> <term>minifúndio</term> <termNote type="termType">fullForm</termNote> </termGrp> <admin type="conceptID">BTB-DAG77-63</admin> <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin> </ntig> </langSet> </termEntry> AAAI 2002 WS
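To show how terms might be pulled back out of such an encoding, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The fragment is a trimmed, namespace-free cut of the entry above, made up for illustration; a real TBX file declares a default namespace, so element lookups would need to be namespace-qualified. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free fragment of the slide's entry (illustration only).
FRAGMENT = """<termEntry id="eid-EDIC-BTB-DAG77-63">
  <langSet lang="pt">
    <ntig><termGrp><term>souto</term></termGrp></ntig>
    <ntig><termGrp><term>minifúndio</term></termGrp></ntig>
  </langSet>
</termEntry>"""


def terms_by_language(xml_text):
    """Map each langSet's language code to the list of terms it contains."""
    root = ET.fromstring(xml_text)
    result = {}
    for lang_set in root.iter("langSet"):
        terms = [term.text for term in lang_set.iter("term")]
        result.setdefault(lang_set.get("lang"), []).extend(terms)
    return result


print(terms_by_language(FRAGMENT))
```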

  23. Derived XML ontology <RECORD> <CONCEPT>xenobiotic substances</CONCEPT> <SLOT>SUBCLASSES</SLOT> <FACET>VALUE</FACET> <FILLER>hazardous raw materials</FILLER> <UID>0</UID> </RECORD> <RECORD> <CONCEPT>physical nuisances</CONCEPT> <SLOT>SUBCLASSES</SLOT> <FACET>VALUE</FACET> <FILLER>ambient light</FILLER> <UID>0</UID> </RECORD> <RECORD> <CONCEPT>financial statistics</CONCEPT> <SLOT>IS-A</SLOT> <FACET>VALUE</FACET> <FILLER>economic statistics</FILLER> <UID>0</UID> </RECORD> …
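A record of this shape could be generated from a (concept, slot, filler) triple as in the Python sketch below. The element names are taken from the slide; the function name and the default `uid` of "0" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def make_record(concept, slot, filler, uid="0"):
    """Serialize one ontology triple as a RECORD element (hypothetical helper)."""
    record = ET.Element("RECORD")
    fields = (
        ("CONCEPT", concept),
        ("SLOT", slot),
        ("FACET", "VALUE"),  # the slide's records all use the VALUE facet
        ("FILLER", filler),
        ("UID", uid),
    )
    for tag, text in fields:
        ET.SubElement(record, tag).text = text
    return ET.tostring(record, encoding="unicode")


print(make_record("financial statistics", "IS-A", "economic statistics"))
```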

  24. Ontology generation • Goal: specify an ontology for information extraction purposes • Problem: complex, tedious, costly • Ideally: automatically generate schemas, ontologies • Source: natural-language text, tables, etc. AAAI 2002 WS

  25. Ontology generation overview AAAI 2002 WS

  26. Knowledge sources • Mikrokosmos (μK) ontology • About 5,000 hierarchically-arranged concepts • Fairly high connectivity (about 14 inter-concept links per node) • Fairly general content, inheritance of properties • Data frame library • regular-expression templates for matching structured low-level lexical items (e.g. measurements, dates, currency expressions, and phone numbers) • provide information for conceptual matching via inheritance • Lexicons (e.g. onomastica, WordNet synsets) • Domain-specific training documents

  27. Knowledge integration AAAI 2002 WS

  28. Methodology • Preprocess input knowledge sources: • Integrate: map lexicon content and data frame templates to nodes in the merged ontology • Extract: match information from training documents collection • Parse, tokenize, regularize lexical content • Generate the ontology: four-stage generation process • concept selection • relationship retrieval • constraint discovery • refinement of the output ontology AAAI 2002 WS

  29. Processing input documents AAAI 2002 WS

  30. Concept selection • Finding which subset of the ontology's concepts is of interest to a user • Concepts are selected via string matches between textual content and the ontological data • Three different selection heuristics • concept-name matching • concept-value matching • data-frame pattern matching • String matching plus: • word synonym matching: WordNet synonym sets • multi-word term matching: bag-of-words (CAPITAL-CITY is considered a synonym of capital and city)
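The concept-name heuristic with bag-of-words matching can be sketched in Python as follows. This is a simplified illustration: the function names are hypothetical, and the `synonyms` dictionary is a stand-in for WordNet synonym sets rather than a call to any real WordNet API.

```python
import re


def concept_words(name):
    """Split a concept name such as CAPITAL-CITY into its bag of words."""
    return set(name.lower().split("-"))


def select_concepts(text, concept_names, synonyms=None):
    """Return {concept: matched words} for concepts whose words occur in text."""
    synonyms = synonyms or {}
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    selected = {}
    for name in concept_names:
        words = concept_words(name)
        # Widen the bag with any synonyms supplied for its words.
        for word in list(words):
            words |= set(synonyms.get(word, ()))
        hits = words & tokens
        if hits:
            selected[name] = sorted(hits)
    return selected


print(select_concepts("Capital--Kabul, Other cities--Kandahar",
                      ["CAPITAL-CITY", "POPULATION"]))
```

Because any single word of the bag is enough to trigger a match, this heuristic favors recall and leaves spurious selections to the later conflict-resolution step.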

  31. Concept selection algorithm PROCEDURE ConceptSelection(Tdoc, Kbase) SourceDoc = Parse(Tdoc); PrimarySelectedConceptsList = MikroSelection(M-Ontology); SecondarySelectedConceptsList = DataFrameSelection(DF-Library); ConflictHandling(); SelectedSubgraphGeneration(); AAAI 2002 WS

  32. Basic Selection Strategy • Select from Mikrokosmos Ontology • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  33. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>:17.7 million. • Agriculture:Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  34. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Afghanistan<Nation> • smaller than Texas<USState>. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>:17.7 million. • Agriculture:Wheat<FoodStuff><AgriculturalProduct>, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  35. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  36. Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • extract result based on the data frames • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  37. Concept conflict resolution • Arrive at an internally consistent set of selected concepts. • Two levels of resolution • Document-level resolution • Knowledge-source resolution • Criteria: lexical occurrence, proximity, length and distribution of words and terms • Preferences from among knowledge sources specifying matches • Other default strategies AAAI 2002 WS

  38. Document-Level Conflict • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
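Proximity is one of the criteria named for resolving such conflicts. The Python sketch below shows a deliberately simple version of it: among the competing concepts for a value, pick the one with a cue word closest to that value in the text. The function name, the cue-word lists, and the character-distance measure are all assumptions made for illustration, not the system's actual procedure.

```python
def resolve_by_proximity(text, value, candidates):
    """Pick the candidate concept whose cue word lies nearest to value.

    candidates maps each concept name to a list of hypothetical cue words.
    Returns None if the value or every cue word is absent from the text.
    """
    lowered = text.lower()
    position = lowered.find(value.lower())
    if position < 0:
        return None
    best, best_distance = None, None
    for concept, cues in candidates.items():
        for cue in cues:
            index = lowered.find(cue.lower())
            while index >= 0:
                distance = abs(index - position)
                if best_distance is None or distance < best_distance:
                    best, best_distance = concept, distance
                index = lowered.find(cue.lower(), index + 1)
    return best


text = "Area: 648,000 sq. km. Capital--Kabul"
cues = {"Area": ["area", "sq. km"], "Mileage": ["miles", "mileage"]}
print(resolve_by_proximity(text, "648,000", cues))
```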

  39. Concept-Level Conflict • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000<Area> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million<Population>. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS

  40. Relationship retrieval • Ontology structure: directed graph, nodes are concepts • Conceptual relationship: all paths connecting concepts generated at given stage • Theoretical solution: find all the paths in the graph (NP-complete) • When multiple paths do exist, take the shortest path between 2 concepts (Cf. μK Onto-Search algorithm) • Dijkstra’s (polynomial) algorithm to compute the most salient relationships between concepts • Distance threshold on path length to prune weak relationships • Construct schemas, or linked conceptual configurations, from the relationships posited in the previous step. • Primary concept selected (or posited): highest connectivity • Cardinalities inferred from observed relationships AAAI 2002 WS
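The shortest-path step with a distance threshold can be sketched in Python as below. It assumes every link has unit cost and that the ontology is given as an adjacency list of labeled edges; the function name and graph layout are hypothetical, and the toy graph uses only the CapitalCity, CITY, and Nation links shown on the participation-constraint slides.

```python
import heapq


def shortest_relationship(graph, source, target, max_length):
    """Dijkstra search with unit edge costs, pruned at max_length links.

    graph maps a concept to a list of (relation, neighbor) pairs.
    Returns the path as a list of (relation, concept) steps, or None.
    """
    queue = [(0, source, [])]
    visited = set()
    while queue:
        distance, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        if distance == max_length:
            continue  # threshold reached: do not expand weaker relationships
        for relation, neighbor in graph.get(node, ()):
            if neighbor not in visited:
                step = path + [(relation, neighbor)]
                heapq.heappush(queue, (distance + 1, neighbor, step))
    return None


graph = {"CapitalCity": [("IsA", "CITY")], "CITY": [("PartOf", "Nation")]}
print(shortest_relationship(graph, "CapitalCity", "Nation", max_length=3))
```

The returned steps correspond to a dotted relationship path of the form `IsA.CITY.PartOf`.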

  41. Participation Constraints • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

  42. Participation Constraints (2) • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<City>, • Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City> • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*] AAAI 2002 WS

  43. Refining results • Output ontology: may require hand-crafting • can be done in a text editor (flat ASCII ontology) • Considerable expertise required: • markup syntax • specification of conceptual relations. • familiarity with regular-expression writing • Possible solution: ontology editors for typical end-users • With rich enough knowledge sources and a good set of training documents, however, we believe that the generation of extraction ontologies can be fully automatic. AAAI 2002 WS

  44. Testing the system • Input: various U.S. Department of Energy abstracts • Knowledge base: • μK ontology • Energy sub-hierarchy of Eurodicautom terms (300)

  45. Sample application document The trend in supply and demand of fuel and the fuels for electric power generation, iron manufacturing and transportation were reviewed from the literature published in Japan and abroad in 1986. FY 1986 was a turning point in the supply and demand of energy and also a serious year for them because the world crude oil price dropped drastically and the exchange rate of yen rose rapidly since the end of 1985 in Japan as well. The fuel consumption for steam power generation in FY 1986 shows the negative growth for two successive years as much as 98.1%, or 65,730,000 kl in heavy oil equivalent, to that in the previous year. The total energy consumption in the iron and steel industry in 1986 was 586 trillion kcal (626 trillion kcal in the previous year). The total sales amount of fuel in 1986 was 184,040,000 kl showing a 1.5% increase from that in the previous year. The concept Best Mix was proposed as the ideal way in the energy industry. (21 figs, 2 tabs, 29 refs) AAAI 2002 WS

  46. Sample output -- energy2 Information Ontology energy2 [-> object]; energy2 [0:*] has Alloy [1:*]; energy2 [0:*] has Consumption [1:*]; energy2 [0:*] has CrudeOil [1:*]; energy2 [0:*] has ForProfitCorporation [1:*]; energy2 [0:*] has FossilRawMaterials [1:*]; energy2 [0:*] has Gas [1:*]; energy2 [0:*] has Increase [1:*]; energy2 [0:*] has LinseedOil [1:*]; energy2 [0:*] has MetallicSolidElement [1:*]; energy2 [0:*] has Ores [1:*]; energy2 [0:*] has Produce [1:*]; energy2 [0:*] has RawMaterials [1:*]; energy2 [0:*] has RawMaterialsSupply [1:*]; Alloy [0:*] MadeOf.SOLIDELEMENT.Subclasses MetallicSolidElement [0:*]; Alloy [0:*] IsA.METAL.StateOfMatter.SOLID.Subclasses CrudeOil [0:*]; Alloy [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*]; AmountAttribute [0:*] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT Consumption [0:*] IsA.FINANCIALEVENT.Agent Human [0:*]; ControlEvent [0:*] IsA.SOCIALEVENT.Agent Human [0:*]; ControlEvent [0:*] IsA.SOCIALEVENT.Location.PLACE.Subclasses Nation [0:*]; CountryName [0:*] NameOf Nation [0:*]; CountryName [0:*] IsA.REPRESENTATIONALOBJECT.OwnedBy Human [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.OwnedBy Human [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Combine [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Display [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*]; Custom [0:*] IsA.ABSTRACTOBJECT.ThemeOf.MENTALEVENT.Subclasses AddUp [0:*]; Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.Subclasses Gas [0:*]; Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.OwnedBy Human [0:*]; ForProfitCorporation [0:*] OwnedBy Human [0:*]; ForProfitCorporation [0:*] IsA.CORPORATION.HasNationality Nation [0:*]; Gas [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*]; Gas [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*]; LinseedOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*];

  47. Evaluation • Several dozen relationships are generated • Correct: relationship is posited between the concept CRUDE-OIL and the action PRODUCE; the role is Theme, meaning that one can PRODUCE CRUDE-OIL • Incorrect: relationship between GAS and GROW • Precision: relatively low (around 75%) due to high number of matches • Recall: better (around 90%) • Note: it’s easier for a human to refine the system’s output by rejecting spurious relationships (i.e. deleting false positives) than to specify relationships that the system has missed. AAAI 2002 WS
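For reference, precision is the share of generated relationships that are correct, and recall is the share of correct relationships that were generated. The Python sketch below computes both over sets of relationship pairs. The two-item example is a toy built from the relationships named on this slide, not the evaluation data behind the reported figures, and the function name is hypothetical.

```python
def precision_recall(proposed, correct):
    """Return (precision, recall) for proposed versus correct relationships."""
    proposed, correct = set(proposed), set(correct)
    true_positives = len(proposed & correct)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Toy illustration only: one correct and one spurious relationship.
proposed = {("CRUDE-OIL", "PRODUCE"), ("GAS", "GROW")}
correct = {("CRUDE-OIL", "PRODUCE")}
print(precision_recall(proposed, correct))
```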

  48. How to improve results • Less general, more focused ontologies • Richer ontological structure • More types of hierarchical relationships (beyond IS-A and its inverse, SUB-CLASSES) • Deeper hierarchies (maximum 4 in Lenoch) • Note: TBX supports several data types for conceptual encoding AAAI 2002 WS

  49. Related work • Lexical chaining in NLP • extracting and associating chains of word-based relationships from text • relating words and terms to resources like WordNet • Widely used in text categorization, automatic summarization, and topic detection and tracking • Our contributions: • integrating disparate knowledge sources for similar tasks • discovering and generating a compatible set of ontological relationships

  50. Conclusions • The knowledge acquisition bottleneck impacts ontology construction for information extraction. • Terminographers and lexicographers codify information that can be advantageous for work in semantic-based processing. • Integrating these two disparate areas, it is possible to leverage large-scale terminological and conceptual information with relationship-rich semantic resources in order to reformulate, match, and merge retrieved information of interest to a user. • Possible future applications: • Knowledge-focused personal agents • Customized search, filtering, and extraction tools • Individually tailored views of data via integration, organization, and summarization • Lots of work still to be done… AAAI 2002 WS
