
Terminology Retrieval: towards a synergy between thesaurus and free text searching

VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002. Anselmo Peñas, Felisa Verdejo and Julio Gonzalo, Dpto. Lenguajes y Sistemas Informáticos, UNED.


Presentation Transcript


  1. VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002. Terminology Retrieval: towards a synergy between thesaurus and free text searching. Anselmo Peñas, Felisa Verdejo and Julio Gonzalo, Dpto. Lenguajes y Sistemas Informáticos, UNED

  2. Overview
     • Motivation
     • Objectives
     • Proposed approach: Terminology Retrieval
     • Website Term Browser
     • Evaluation
     • Conclusions

  3. Multilingual Thesaurus
     60. EDUCATIONAL SYSTEM
     Education
       NT1 adult education
         RT adult (10)
         RT lifelong learning
       NT1 basic education
         RT* transition from basic to secondary education
         RT didactic continuity (50)
       NT1 distance education
         UF distance learning
         UF distance study
         UF distance training
         UF ODL
         UF open and distance learning
       NT1 informal education
       NT1 lifelong learning
         UF continuing education
         UF lifelong education
         UF recurrent education
         RT adult education
       (…)
     Designed for
     • Indexing and searching in a specific subject area
     • Vocabulary control
     • Promoting consistency
     • Cross-language
     • Guiding users about which terms to use
     • Navigate the thesaurus
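
The thesaurus excerpt above can be read as a small data structure. The following is a minimal sketch, assuming the usual thesaurus reading of the labels (NT = narrower term, RT = related term, UF = "used for", a non-preferred synonym). The `Descriptor` class and `build_lookup` function are hypothetical names introduced for illustration, not part of the presented system.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """One preferred term with the relation types shown in the excerpt."""
    term: str
    narrower: list = field(default_factory=list)  # NT: narrower terms
    related: list = field(default_factory=list)   # RT: related terms
    used_for: list = field(default_factory=list)  # UF: non-preferred synonyms


def build_lookup(descriptors):
    """Map every preferred and non-preferred term to its preferred descriptor."""
    lookup = {}
    for d in descriptors:
        lookup[d.term.lower()] = d.term
        for synonym in d.used_for:
            lookup[synonym.lower()] = d.term
    return lookup


# Entries copied from the slide excerpt (only part of the hierarchy).
descriptors = [
    Descriptor(
        "distance education",
        used_for=["distance learning", "distance study", "distance training",
                  "ODL", "open and distance learning"],
    ),
    Descriptor(
        "lifelong learning",
        used_for=["continuing education", "lifelong education",
                  "recurrent education"],
        related=["adult education"],
    ),
]

lookup = build_lookup(descriptors)
print(lookup["odl"])                   # distance education
print(lookup["continuing education"])  # lifelong learning
```

This is the "vocabulary control" role in miniature: a user who types a non-preferred form is guided to the descriptor used for indexing.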

  4. Multilingual Thesaurus Problems
     • Construction & management (high cost)
     • Indexing
       • Manual keyword assignment
       • Errors in automatic keyword assignment
     • Domain specific
       • A new domain needs a new thesaurus
     • Specialist oriented (users must know the preferred descriptors)
       • Less specialized audiences get poorer results

  5. Objectives
     • Develop a model
       • to help users express and refine their information needs
       • to help users overcome language barriers
       • Bringing the collection terminology to users
         • Morpho-syntactic, semantic & translingual variations
       • Without the need for thesaurus construction
     • Establish an appropriate evaluation framework

  6. Proposed approach
     [Diagram. Labels: Free Text Searching; Controlled Vocabulary Searching; Automatic Terminology Extraction; NLP Techniques; Information Retrieval; Terminology Retrieval & Term browsing (Website Term Browser)]

  7. Terminology Retrieval
     From Automatic Terminology Extraction...
     Obtain lists of terms relevant for a specific domain
     • Term Extraction
     • Term Weighting
     • Term Selection
     ... to Terminology Retrieval
     Retrieve terms relevant for an information need
     • The user query points to the relevant terms
     • No truncation of terminology lists
     • Favor recall by relaxing term extraction patterns
     ... & Browsing
     • Navigate through relevant terminology
     • Access information from retrieved terms
     • Bridge the gap between query and collection vocabularies
     • Cross-Language
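
The contrast between extraction and retrieval on this slide can be illustrated with a toy example. This is only a sketch: the term weights are invented document frequencies, and `select_top` and `retrieve` are hypothetical helpers, not the system's functions.

```python
# Hypothetical candidate terms with invented document frequencies as weights.
terms = {
    "adult education": 12,
    "distance education": 7,
    "didactic continuity": 1,
}


def select_top(terms, k):
    """Terminology extraction style: keep only the k best-weighted terms."""
    return sorted(terms, key=lambda t: (-terms[t], t))[:k]


def retrieve(terms, query_lemma):
    """Terminology retrieval style: keep every term containing the query lemma."""
    return sorted(t for t in terms if query_lemma in t.split())


print(select_top(terms, 2))           # ['adult education', 'distance education']
print(retrieve(terms, "continuity"))  # ['didactic continuity']
```

Truncating the list drops the rare term, while the query-driven lookup still finds it, which is the sense of "no truncation of terminology lists".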

  8. Terminology Retrieval Requires
     • Phrase indexing and retrieval
     • Query expansion and translation
       • To retrieve terminology variations
         • Morpho-syntactic variations
         • Semantic variations
         • Translingual variations
     • Noise in retrieval
       • Ambiguity reduction
         • Co-occurrence of expansion words in the same phrase

  9. Indexing Steps
     • Text pre-processing and listing of words
     • Word tagging (oriented to phrase detection)
     • Phrase detection & lemmatization of components
     • Document indexing & statistics (document frequency)
     • Phrase selection (Subsumption & Lexicalization degree)
     • Phrase indexing
     [Diagram. Labels: Lemma; Phrase; Document]
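
The indexing steps above can be sketched roughly as follows. This is an illustration under strong assumptions: the input is already tagged and lemmatized, a phrase is any maximal run of two or more noun/adjective lemmas (a stand-in for the real extraction patterns, which the slides do not specify), and the phrase selection step (subsumption and lexicalization degree) is omitted. All names are hypothetical.

```python
from collections import defaultdict


def detect_phrases(tagged):
    """Yield maximal runs of noun (N) / adjective (A) lemmas of length >= 2."""
    run = []
    # The sentinel token flushes a run that reaches the end of the document.
    for lemma, tag in tagged + [("", "END")]:
        if tag in {"N", "A"}:
            run.append(lemma)
        else:
            if len(run) >= 2:
                yield tuple(run)
            run = []


def build_indexes(docs):
    """Build phrase -> documents and lemma -> phrases indexes."""
    phrase_docs = defaultdict(set)    # its size per phrase is the document frequency
    lemma_phrases = defaultdict(set)  # lets a query lemma reach its phrases
    for doc_id, tagged in docs.items():
        for phrase in detect_phrases(tagged):
            phrase_docs[phrase].add(doc_id)
            for lemma in phrase:
                lemma_phrases[lemma].add(phrase)
    return phrase_docs, lemma_phrases


# Toy pre-tagged, lemmatized documents (invented for this example).
docs = {
    "d1": [("nuclear", "A"), ("test", "N"), ("ban", "N"), ("be", "V"), ("sign", "V")],
    "d2": [("the", "D"), ("nuclear", "A"), ("test", "N"), ("ban", "N"), ("treaty", "N")],
    "d3": [("distance", "N"), ("education", "N"), ("grow", "V")],
}

phrase_docs, lemma_phrases = build_indexes(docs)
print(len(phrase_docs[("distance", "education")]))  # 1
print(len(lemma_phrases["nuclear"]))                # 2
```

Indexing phrases by their component lemmas is what later allows a single expanded or translated query word to reach every phrase that contains it.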

  10. Ambiguity Reduction
     Query expansion and translation, by query word:
     • Tratados
       Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
       Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
     • de Prohibición
       Expansion: embargo, entredicho, interdicción, interdicto, proscripción
       Translation: ban, interdiction, prohibition, proscription
     • de Pruebas
       Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
       Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
     • Nucleares
       Expansion: nuclear
       Translation: nuclear
     Candidate combinations: Nuclear taste proscription process? Nuclear test ban treaty?
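
One way to picture ambiguity reduction by co-occurrence is the sketch below. It keeps only indexed phrases in which translations of at least two different query words appear together. The translation sets are a subset of those on the slide. The indexed phrases, the threshold of two and the `rank_phrases` function are assumptions made for illustration.

```python
# Subset of the translation candidates listed on the slide.
translations = {
    "tratados": {"accord", "pact", "process", "treat", "treatise", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "pruebas": {"proof", "sample", "taste", "test", "trial"},
    "nucleares": {"nuclear"},
}

# Hypothetical phrases assumed to be present in the phrase index.
indexed_phrases = [
    ("nuclear", "test", "ban", "treaty"),
    ("taste", "test"),
    ("peace", "process"),
    ("nuclear", "test"),
]


def rank_phrases(translations, phrases, min_covered=2):
    """Rank phrases by how many distinct query words they cover."""
    scored = []
    for phrase in phrases:
        words = set(phrase)
        # Two candidates of the same query word count only once.
        covered = sum(1 for cands in translations.values() if cands & words)
        if covered >= min_covered:
            scored.append((covered, phrase))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored]


print(rank_phrases(translations, indexed_phrases))
# [('nuclear', 'test', 'ban', 'treaty'), ('nuclear', 'test')]
```

Unlikely combinations such as "nuclear taste proscription process" never need to be ruled out explicitly: they simply do not occur as phrases in the collection, so they are never retrieved.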

  11. [Architecture diagram. Labels, in reading order:]
     • Query: Tokenising (tok1 tok2 tok3)
     • Lexicon: Lemmatising (lem11 lem12 ... lem31 lem32 ...)
     • EWN & Dic.: Expansion / Translation (exp11 exp12 ... tran11 tran12 ...)
     • Phrase index: Phrase retrieval, Term ranking, terms
     • Document index: Document retrieval, Document ranking, documents
     • Retrieval

  12. [Screenshot. Labels: Query in Spanish; Hierarchy of terms; Ranking of documents; English, Spanish, Catalan]

  13. [Screenshot. Annotations:]
     • Translingual
     • Morpho-syntactic variations (permutation, insertion)
     • Semantic variations
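
The permutation and insertion variations named above can be approximated at the lemma level. This is a rough sketch that assumes stop words are already removed and treats a candidate as a variant when it contains all lemmas of the term plus a bounded number of extra words. `is_variant` and `max_insertions` are hypothetical, and the presented system may detect variants differently.

```python
def is_variant(term, candidate, max_insertions=1):
    """Return True if candidate is a permutation/insertion variant of term.

    Both arguments are tuples of lemmas with stop words already removed.
    """
    term_set, cand_set = set(term), set(candidate)
    if not term_set <= cand_set:
        return False  # a term lemma is missing from the candidate
    return len(cand_set - term_set) <= max_insertions


term = ("adult", "education")
print(is_variant(term, ("education", "adult")))           # True (permutation)
print(is_variant(term, ("adult", "basic", "education")))  # True (insertion)
print(is_variant(term, ("adult", "literacy")))            # False
```

Comparing lemma sets ignores word order, which covers permutation, and the bound on extra lemmas limits how much inserted material is still accepted as the same term.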

  14. Evaluation of Terminology Retrieval
     Compare
     • Terminology Retrieval over 42,406 web pages (200 MB)
     • Hand-crafted Multilingual Thesaurus (1051 descriptors)

  15. Evaluation of Terminology Retrieval
     Recall of mono-lexical terms (lemmas)
     • Monolingual: 85% - 95%
     • Translingual: 55% - 65%
     Recall of poly-lexical terms (phrases)
     • Monolingual: 40% - 65%
     • Translingual: 10% - 45%
     Loss of recall due to
     • Phrase extraction (mainly POS tagging): 3% - 17%
     • Phrase indexing (mainly lemmatization): 2% - 34%
     • Phrase selection: 12% - 37%
     • Lack of connections between different languages in EWN
     • Lack of adjective hierarchies in EWN
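
Recall in this comparison can be read as the fraction of thesaurus descriptors that Terminology Retrieval manages to return. The sketch below shows that computation on invented toy data. The descriptor and retrieved sets are not from the evaluation, and the exact protocol used by the authors is not given on the slides.

```python
def recall(descriptors, retrieved):
    """Fraction of reference descriptors that appear among the retrieved terms."""
    descriptors = set(descriptors)
    if not descriptors:
        return 0.0
    return len(descriptors & set(retrieved)) / len(descriptors)


# Toy data, invented for illustration only.
thesaurus_descriptors = {
    "adult education", "basic education",
    "distance education", "lifelong learning",
}
retrieved_terms = {
    "adult education", "distance education",
    "lifelong learning", "education system",
}

print(recall(thesaurus_descriptors, retrieved_terms))  # 0.75
```

Retrieved terms that are not descriptors, such as "education system" here, do not affect recall. They would only matter for a precision measure, which the slides do not report.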

  16. Conclusions
     A search model based on extraction, retrieval and browsing of terminology has been developed
     • User oriented
       • Interaction over terminological information
     • Intermediate way between free-text searching and thesaurus-guided searching
     • Without the need for thesaurus construction
     • Bringing the collection terminology to users
       • Morpho-syntactic & semantic variations
       • Translinguality

  17. Conclusions
     An evaluation framework for Terminology Retrieval and Term Browsing has been established
     • Points the way to improve Terminology Retrieval
     • Users appreciate Term Browsing
     • WTB phrasal information can substantially complement the document ranking provided by the search engines
