Hybrid Term Extraction Methods: A Comparative Study
This presentation compares two hybrid approaches to automatic term and collocation extraction. Both extracted lists are evaluated against a human-created reference list using precision, recall and F-measure, which highlights where automated extraction can be improved. The study emphasizes the balance between lexical coverage, adequacy and practicality when building the reference list.
Comparative Analysis of Automatic Term and Collocation Extraction • Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec • Faculty of Humanities and Social Sciences, Department of Information Sciences; Faculty of Electrical Engineering and Computing • INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
Overview • Introduction • Reasons for extraction • Research • Resources & tools • Extracted lists • Evaluation • Precision, recall, F-measure • Conclusion
I. Introduction • Monolingual and multilingual resources • Helpful • Integrated • Require human intervention • EU pre-accession activities • Speed up + consistency • Used in further research and practice
List: • Terms (Member State, European Union) • Collocations (adopt a/the resolution, decided as follows) • Multi-word units (depend on, well-being) • Term extraction process: • Term extraction (term acquisition) – identification • Term recognition – verification
II. Research • Resources • 10 documents – legislation, Cro-Eng • Tools • TermeX tool (FER) – list A • SDL Multi Term Extract + NooJ (FF) – list B • Reference list • Evaluation – reference list
Reference list • 470 terms and collocations • Exclude unigrams • Balance between lexical coverage, adequacy, practicality • terms (NPs: 346/470) • collocations (VPs)
Reference list • Contains: • Terms (acquiring company, applicant country) • Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to) • Names and abbreviations (Economic and Monetary Union EMU, European Union EU) • Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)
List B • Language-independent, statistically based SDL Multi Term Extract tool • Frequency threshold set to 4 • Filtered by the list of stop-words -> 369 candidates • Language-dependent NooJ tool • 36 local grammars -> 512 candidates
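The statistical step behind list B (frequency threshold plus stop-word filtering) can be illustrated with a minimal Python sketch. It is not the SDL tool's actual algorithm, which the slides do not describe. The stop-word list, the 2-to-4-word n-gram range and the rule of rejecting candidates that start or end with a stop word are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "for", "by"}

def candidate_ngrams(tokens, min_n=2, max_n=4):
    """Yield every contiguous n-gram (as a tuple) with min_n <= n <= max_n."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def extract_candidates(tokens, min_freq=4, stop_words=STOP_WORDS):
    """Keep n-grams seen at least min_freq times that neither start
    nor end with a stop word (an assumed filtering rule)."""
    counts = Counter(candidate_ngrams(tokens))
    return {
        " ".join(gram): freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }

# Toy input, not taken from the study's corpus.
tokens = (
    "the member state shall notify the member state of the european union "
    "and the member state of the european union shall inform the member state"
).split()
candidates = extract_candidates(tokens, min_freq=4)
print(candidates)
```

On this toy input only "member state" survives: "the member state" is equally frequent but starts with a stop word, and "european union" falls below the threshold of 4.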
List A • TermeX • Lexical association measures (AMs) • 14 AMs (PMI, Dice, Chi-square,…) • Lemmatization • POS filtering • Frequency threshold set to?
List A • Extracted terms ranked by AM value • 1816 candidates • AMs used: • 2-grams – PMI • 3-grams, 4-grams – heuristic extensions • Noun phrases only
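To make the ranking of 2-grams by an association measure concrete, the following Python sketch computes PMI and the Dice coefficient from raw counts and sorts the bigrams by PMI. It is an illustration under simple assumptions (maximum-likelihood probabilities, no lemmatization or POS filtering), not TermeX's implementation. The function name `bigram_scores` and the toy sentence are hypothetical.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (pmi, dice)} using maximum-likelihood estimates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c_xy in bigrams.items():
        p_xy = c_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        pmi = math.log2(p_xy / (p_x * p_y))
        # Dice(x, y) = 2 * c(x, y) / (c(x) + c(y))
        dice = 2 * c_xy / (unigrams[x] + unigrams[y])
        scores[(x, y)] = (pmi, dice)
    return scores

# Toy input, not taken from the study's corpus.
tokens = "european union law applies in the european union and in the member state".split()
scores = bigram_scores(tokens)

# Rank candidates by PMI, highest first.
ranked = sorted(scores, key=lambda bg: scores[bg][0], reverse=True)
for bg in ranked[:5]:
    pmi, dice = scores[bg]
    print(" ".join(bg), round(pmi, 3), round(dice, 3))
```

PMI tends to favour rare pairs: in this toy text "member state" (seen once) scores higher than "european union" (seen twice). This is one reason a frequency threshold is usually applied before ranking.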
Results • Evaluation • F1-measure (precision, recall) • True positives calculated by taking into account inflection (suffix stripping)
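The evaluation described above can be sketched in Python as follows. The suffix-stripping rules are a crude, hypothetical stand-in (plural "-ies" and "-s" only), since the slides do not say which rules the study used. The term lists are toy examples built around terms quoted in the slides, not the study's data.

```python
# Hypothetical suffix rules (suffix, replacement); the study's rules are not given.
SUFFIX_RULES = (("ies", "y"), ("s", ""))

def strip_suffix(word):
    """Apply the first matching rule if at least 3 letters of stem remain."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def normalise(term):
    """Lower-case a term and strip suffixes word by word."""
    return " ".join(strip_suffix(w) for w in term.lower().split())

def evaluate(extracted, reference):
    """Return (precision, recall, F1) over suffix-stripped term sets."""
    ext = {normalise(t) for t in extracted}
    ref = {normalise(t) for t in reference}
    tp = len(ext & ref)  # true positives: normalised forms found in both lists
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = ["Member State", "applicant country", "crime prevention bodies", "entry into force"]
extracted = ["member states", "applicant countries", "crime prevention body",
             "official journal", "third parties"]

precision, recall, f1 = evaluate(extracted, reference)
print(precision, recall, round(f1, 3))
```

Because both lists are reduced to sets of normalised forms, inflectional variants of the same term count once. Here three of the five extracted candidates match three of the four reference entries.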
Results • List A unsatisfactory • Low recall – misses verb phrases and terms consisting of more than 4 words • Low precision – ranked list, can be improved with a cut-off (true positives are better ranked) • List B modest • can be improved with lemmatization, handling of upper/lower case, more detailed local grammars
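The effect of a cut-off on a ranked candidate list can be shown with a small Python sketch. The ranked list and reference list below are invented toy data, and `precision_recall_at_k` is a hypothetical helper. The sketch only illustrates why precision rises when true positives are concentrated near the top of the ranking.

```python
def precision_recall_at_k(ranked, reference, k):
    """Precision and recall when only the top-k ranked candidates are kept."""
    top = ranked[:k]
    ref = set(reference)
    tp = sum(1 for term in top if term in ref)
    precision = tp / len(top) if top else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

# Toy data: candidates sorted by association-measure value, best first.
ranked = ["member state", "european union", "official journal",
          "crime prevention", "of the", "in accordance"]
reference = ["member state", "european union", "crime prevention", "applicant country"]

for k in (2, 4, 6):
    p, r = precision_recall_at_k(ranked, reference, k)
    print(k, p, r)
```

In this toy case a tighter cut-off trades recall for precision: keeping 2 candidates gives precision 1.0 at recall 0.5, while keeping all 6 gives precision 0.5 at recall 0.75.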
Conclusion • Comparison of two hybrid approaches to term extraction • Human-created lists differ from extracted lists • human knowledge, experience and intuition • Space for improvement – automatic extraction combined with human intervention
Thank you!