This paper presents a novel approach to automatic indexing of medical documents, specifically MEDLINE, by introducing AMTEx, a method aimed at improving term extraction efficiency. Unlike MMTx, which relies on UMLS for mapping, AMTEx uses the MeSH thesaurus to enhance the process. Key features include multi-word term extraction, candidate evaluation, and term expansion through linguistic and statistical criteria. Experimental results demonstrate AMTEx outperforms existing methods in precision and recall, contributing to better indexing and retrieval of biomedical documents.
Automatic Document Indexing in Large Medical Collections • Advisor : Dr. Hsu • Presenter : Shu-Ya Li • Authors : Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis, Evangelos E. Milios • 2006, HIKM
Outline • Motivation • Objective • Current Approach : MMTx • Method : AMTEx • C/NC-value method • Use of MeSH Thesaurus as lexical resource • Experiments • Conclusion • Personal Opinions
Motivation • MMTx, the U.S. NLM approach, maps biomedical documents to UMLS term concepts • Limitations of MMTx in term extraction: term over-generation; term concept diffusion; unrelated terms added to the final candidate list • MMTx focuses on UMLS rather than MeSH, but MEDLINE indexing is based on MeSH • Goal: improve the efficiency of automatic indexing of medical documents
Objective • We propose a new method, AMTEx • It improves the efficiency of automatic term extraction by using the C/NC-value method • It targets indexing and retrieval of MEDLINE documents, based on extracting document terms and mapping them to the MeSH Thesaurus
Current Approach : MMTx • Maps arbitrary text to UMLS Metathesaurus concepts: • Parsing (syntactic analysis - linguistic filter) • Variant Generation (uses SPECIALIST Lexicon) • Candidate Retrieval (mapping process to Metathesaurus Concepts) • Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)
MMTx Example • Parsing: shallow syntactic analysis of the input text; linguistic filtering isolates noun phrases, e.g. the term “ocular complications” is analysed as: • Variant Generation: e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, … • Candidate Retrieval: candidate Metathesaurus concepts for the variant “osa”: osa [osa antigen], osa [osa gene product], osa [osa protein], osa [obstructive sleep apnea] • Candidate Evaluation (candidate: score): Obstructive Sleep Apnea: 1000; Sleep Apnea: 901; Apnea: 827; …; Sleeping: 793; Sleepy: 755 • The limitations of MMTx in term extraction: term over-generation; term concept diffusion; unrelated terms added to the final candidate list
Method - AMTEx • [Figure: AMTEx pipeline] • Input: Document d, MeSH Ontology • Steps: C/NC-value Multi-word Term Extraction & Term Ranking; Term Mapping; Single-word Term Extraction; Term Variant Generation; Term Expansion • Resource: MeSH Thesaurus • Output: MeSH Term Lists
Step 1 & 2 : C/NC-value Multi-word Term Extraction & Ranking • Part-of-Speech Tagging • Linguistic filtering • Term Extraction - C-value • Term Ranking - NC-value • Keep terms up to threshold T1
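As an illustration of the term extraction step, here is a minimal Python sketch of the C-value score as commonly defined for the C/NC-value method: a candidate's frequency is weighted by log2 of its length in words, and a candidate nested inside longer candidates has the average frequency of those longer candidates subtracted. The function names and the toy frequencies are assumptions made for this example, not taken from the slides; POS tagging, the linguistic filter, NC-value re-ranking and the T1 cut-off are not shown.

```python
import math


def is_nested(short, long_):
    """True if the word tuple `short` occurs contiguously inside `long_`."""
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))


def c_values(freq):
    """Compute C-value scores.

    freq: dict mapping a candidate term (tuple of words) to its frequency.
    Returns a dict mapping each candidate to its C-value.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates that contain `a` as a nested term.
        containers = [b for b in freq if len(b) > len(a) and is_nested(a, b)]
        weight = math.log2(len(a))  # 0 for single words
        if not containers:
            scores[a] = weight * f_a
        else:
            avg_container_freq = sum(freq[b] for b in containers) / len(containers)
            scores[a] = weight * (f_a - avg_container_freq)
    return scores


# Toy frequencies, invented for illustration only.
toy_freq = {
    ("obstructive", "sleep", "apnea"): 4,
    ("sleep", "apnea"): 10,
}
toy_scores = c_values(toy_freq)
```

Because the weight is log2 of the term length, single-word candidates score 0 under this formula, which is one reason the method is applied to multi-word terms and single words are handled in a separate step.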
Step 3 : Term Mapping • Candidate terms are mapped to terms of the MeSH Thesaurus (simple string matching). • Only candidate terms matching MeSH are retained. • Multi-word candidates not matching MeSH may contain (shorter) MeSH terms.
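The mapping step can be sketched as a plain dictionary lookup. This is a minimal Python sketch, assuming MeSH terms are available as a collection of strings; the case-insensitive comparison and whitespace normalisation are assumptions of this example (the slide only says "simple string matching"), and `map_to_mesh` is a hypothetical helper name.

```python
def map_to_mesh(candidates, mesh_terms):
    """Split candidate terms into MeSH matches and non-matches.

    candidates: iterable of candidate term strings.
    mesh_terms: iterable of MeSH term strings.
    Returns (matched, unmatched): matched holds the MeSH spelling of each hit,
    unmatched holds the candidates that had no exact match.
    """
    # Assumption: matching ignores case and repeated whitespace.
    index = {" ".join(t.lower().split()): t for t in mesh_terms}
    matched, unmatched = [], []
    for cand in candidates:
        key = " ".join(cand.lower().split())
        if key in index:
            matched.append(index[key])
        else:
            unmatched.append(cand)
    return matched, unmatched


# Toy data, invented for illustration only.
toy_mesh = ["Apnea", "Sleep"]
toy_matched, toy_unmatched = map_to_mesh(["apnea", "obstructive sleep apnea"], toy_mesh)
```

Keeping the unmatched multi-word candidates in a separate list makes it straightforward to inspect them for shorter MeSH terms afterwards.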
Step 4 : Single-word Term Extraction • Applies to multi-word terms not matching MeSH • Multi-word terms are split into single-word terms • Single-word terms are validated against MeSH • Matched MeSH terms are added to the term list
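A minimal Python sketch of this fallback, assuming whitespace tokenisation and case-insensitive lookup (both are assumptions of the example; `single_word_matches` is a hypothetical helper name):

```python
def single_word_matches(unmatched_terms, mesh_terms):
    """Split unmatched multi-word terms and keep the words that are MeSH terms.

    unmatched_terms: iterable of multi-word candidate strings with no MeSH match.
    mesh_terms: iterable of MeSH term strings.
    Returns the matched MeSH terms, in first-seen order and without duplicates.
    """
    index = {t.lower(): t for t in mesh_terms}
    found = []
    for term in unmatched_terms:
        for word in term.lower().split():
            hit = index.get(word)
            if hit is not None and hit not in found:
                found.append(hit)
    return found


# Toy data, invented for illustration only.
toy_hits = single_word_matches(["obstructive sleep apnea"], ["Sleep", "Apnea"])
```

Here "obstructive" is dropped because it is not in the toy MeSH list, while "Sleep" and "Apnea" are added to the term list.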
Step 5 : Term Variant Generation • Inflectional variants of the extracted terms are identified during term extraction (C/NC-value) • Stemmed term-forms are also available in MeSH and are added to the list of terms
Step 6 : Term Expansion • Each term in the list is expanded with neighbor terms in MeSH • The expansion may include terms more than one level higher or lower than the original term, depending on T2
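A minimal Python sketch of hierarchical expansion. The slide does not define T2, so this example assumes it can be read as a maximum number of levels to move up or down in the hierarchy; that reading, the `expand_term` name, and the toy hierarchy (which is not copied from MeSH) are all assumptions.

```python
def expand_term(term, parents, max_levels):
    """Collect ancestors and descendants of `term` within `max_levels` levels.

    parents: dict mapping a term to a list of its direct parent terms.
    max_levels: assumed stand-in for T2 (levels to move up or down).
    Returns the set of neighbor terms, excluding `term` itself.
    """
    # Invert the parent map to obtain direct children.
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)

    def walk(edges):
        frontier, reached = {term}, set()
        for _ in range(max_levels):
            frontier = {n for node in frontier for n in edges.get(node, ())} - reached
            reached |= frontier
        return reached

    return walk(parents) | walk(children)


# Toy hierarchy, invented for illustration only.
toy_parents = {
    "central sleep apnea": ["sleep apnea"],
    "sleep apnea": ["apnea"],
    "apnea": ["respiration disorders"],
}
one_level = expand_term("sleep apnea", toy_parents, 1)
two_levels = expand_term("sleep apnea", toy_parents, 2)
```

With one level the toy term gains its direct parent and child; with two levels the grandparent is added as well, which mirrors the slide's "more than one level higher or lower".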
Experiments • Precision and Recall measures • Dataset: 61 full MEDLINE documents, from the PMC database of NCBI PubMed; each document is paired with its MeSH index terms, manually assigned by experts • Ground Truth: the set of MeSH index terms of each document • Benchmark method: MMTx against AMTEx
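For reference, precision and recall against a ground-truth set of index terms can be computed as in the following Python sketch. The term lists are made up for illustration and are not results from the paper.

```python
def precision_recall(extracted, ground_truth):
    """Return (precision, recall) of extracted terms against ground-truth terms."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    hits = len(extracted & ground_truth)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up term lists for one hypothetical document.
toy_extracted = ["Apnea", "Sleep", "Humans", "Mice"]
toy_truth = ["Apnea", "Sleep", "Polysomnography"]
toy_precision, toy_recall = precision_recall(toy_extracted, toy_truth)
```

In this toy case two of the four extracted terms are correct (precision 0.5) and two of the three expert-assigned terms are found (recall 2/3).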
Conclusion - AMTEx • designed for indexing and retrieval of MEDLINE documents • focuses on multi-word term extraction using valid linguistic & statistical criteria • based on MeSH - similarly to human indexing • selectively expands to term variants & synonyms • outperforms the current benchmark MMTx method, reaching better precision & recall
Personal Opinions • Advantage • Drawback • … • Application • …