
MUCHMORE Multilingual Concept Hierarchies for Medical Information Organization and Retrieval

Project Overview
• Application: Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval
• Research & Development: Developing Novel, Hybrid (Corpus-/Concept-Based) Methods for Handling this Scenario
• Evaluation: Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods

User Perspective (ZInfo)
Vision: BAIK Model
• MuchMore
• Provide Relevant Medical Information
• … for a Specific Patient Problem
• … Automatically, from the Web
• … Independent of Language

User Perspective (ZInfo)
User Requirements
• Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient
• Retrieval and Relevance Ranking of Evidence-Based Medical Literature, Language Independent
• Summarization and Filtering of Results According to a User Profile

User Perspective (ZInfo)
User Evaluation
• Evaluate Usefulness
  - Query Generation
  - Relevance for Decisions in Diagnostics and Treatment
• Use for Medical Cases
  - Part of Postgraduate Course in Medical Informatics
• Problematic Issues
  - Different medical profiles, schools, experience, specialty
  - What is relevant for one user may mean little or nothing to another
  - Evidence-based medicine criteria exist only for a small fraction of medicine

MuchMore Prototype
• Overview of Prototype Functionality
• Relation between Functionality and User Requirements
• Issues Addressed by Research and Development within MuchMore

R&D in MuchMore
Semantic Annotation Based CLIR
• Corpus Annotation (DFKI, ZInfo)
  - PoS, Morphology, Phrases, Grammatical Functions
  - Term and Relation Tagging
• Term Extraction (XRCE, EIT, CMU, CSLI)
  - Bilingual Lexicon Extraction, Extension of Semantic Resources
• Sense Disambiguation (CSLI, DFKI)
  - Tuning and Extension of Semantic Resources
  - Combining Sense Disambiguation Methods
• Relation Extraction (DFKI, CSLI)
  - Grammatical Function Tagging
  - Extracting Semantic Relation Indicators
  - Extracting Novel Semantic Relations
• Semantic Indexing/Retrieval (EIT, DFKI)

R&D in MuchMore
Additional Approaches in CLIR
• Corpus Based CLIR
  - Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI)
  - Pseudo Relevance Feedback: PRF (CMU)
  - Generalized Vector Space Model: GVSM (CMU)
• Text Classification Based CLIR (CMU)
  - Hierarchical/Flat kNN with MeSH
• Summarization (CMU)
  - Query, Genre Specific

Corpus Annotation
Annotation Evaluation
• Corpus
  - ~9000 English and German Medical Abstracts from 41 Journals, Springer LINK Web Site, ~1M Tokens for each Language
• PoS
  - Lexicon Update, Remaining Error Rate ~1.5% (EN)
  - Example: "Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate." pos=VB > pos_correct=NN
• Morphology
  - Incorrect, e.g.: Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie
• Term and Relation Tagging
  - Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query

Term Extraction XRCE (Aims and Resources)
Aim
• Bilingual Lexicon Extraction
  - From Comparable Corpora at Word Level
  - From Parallel Corpora at Word and Term (Multi-Word) Level
• Bilingual Extension of Semantic Resource (MeSH)
Resources
• Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH)
• Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)

Term Extraction XRCE (Results of Best Method)
Optimal Combination of Resources, Retaining only the 10 Best Translations for each Candidate
1. word-to-word, comparable corpora: F1 = 0.84
2.a word-to-word, parallel corpora: F1 = 0.98
2.b term-to-term, parallel corpora: F1 = 0.85
Evaluating Separately with Individual Resources (F1)
• Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56
3. MeSH Extension: 1453 new multi-word terms (synonyms or new term entries) extracted from the Springer corpus and added

Term Extraction EIT (Similarity Thesauri)
Method
• Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus
Results
• Single Word Terms (Springer Abstracts)
  - German-English: 104,904 / English-German: 49,454
• Multiword Terms (Phrase Lexicon Generated from ICD10)
  - German Phrases: 354 / English Phrases: 665
  - Bilingual Phrasal Entries Generated: German-English: 225 / English-German: 246
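The frequency comparison named in the EIT method can be illustrated with a small sketch. This is one possible reading, not the EIT implementation: the function name `domain_terms`, the ratio criterion, the add-one smoothing, and the `min_ratio` cutoff are all assumptions made for illustration.

```python
from collections import Counter

def domain_terms(medical_tokens, general_tokens, min_ratio=2.0, top_n=10):
    """Rank single words that are relatively more frequent in the medical
    corpus than in the general corpus (hypothetical scoring scheme)."""
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    vocab_size = len(set(med) | set(gen))
    scored = []
    for word, count in med.items():
        med_rel = count / med_total
        # Add-one smoothing so words unseen in the general corpus get a finite ratio.
        gen_rel = (gen[word] + 1) / (gen_total + vocab_size)
        ratio = med_rel / gen_rel
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in scored[:top_n]]

medical = "the biopsy showed the tumor and the biopsy".split()
general = "the cat and the dog saw the house".split()
terms = domain_terms(medical, general)
```

In this toy input, function words such as "the" fall below the cutoff because they are about equally frequent in both corpora, while "biopsy" ranks first.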

Term Extraction CMU (EBT Bilingual Lexicon)
Method
• For each word in one language, count how often each word of the other language occurs in the translations of the sentences containing that word. These co-occurrence counts may be restricted using word-alignment techniques.
• Apply a variable threshold to filter out uncommon co-occurrences, which are unlikely to be translations. The result is a lexicon listing candidate translations and their relative frequencies.
Results
• ~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts; Estimated Error Rate: < 10%)
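The two method steps above can be sketched roughly as follows. This is a simplified illustration, not the CMU system: `extract_lexicon` is a hypothetical name, the threshold here is a single fixed fraction rather than the variable threshold the slide mentions, and no word-alignment restriction is applied.

```python
from collections import Counter, defaultdict

def extract_lexicon(sentence_pairs, threshold=0.5):
    """Build candidate translations from sentence-level co-occurrence counts.

    sentence_pairs: iterable of (source_sentence, target_sentence) strings.
    threshold: minimum fraction of a source word's sentences whose translation
               must contain the target word (assumed fixed for simplicity).
    """
    cooccurrence = defaultdict(Counter)
    source_freq = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        for s in source_words:
            source_freq[s] += 1
            cooccurrence[s].update(target_words)
    lexicon = {}
    for s, counts in cooccurrence.items():
        kept = {t: c / source_freq[s] for t, c in counts.items()
                if c / source_freq[s] >= threshold}
        # Most frequent candidates first.
        lexicon[s] = dict(sorted(kept.items(), key=lambda p: (-p[1], p[0])))
    return lexicon

pairs = [
    ("das Knie schmerzt", "the knee hurts"),
    ("das Knie ist geschwollen", "the knee is swollen"),
    ("der Kopf schmerzt", "the head hurts"),
]
lexicon = extract_lexicon(pairs, threshold=0.75)
```

Note that on this toy data "the" survives the filter for "knie" alongside "knee", which is the kind of noise the word-alignment restriction mentioned on the slide would help reduce.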

Term Extraction CSLI (Infomap System)
• Represent English and German words as vectors, produced by recording the number of co-occurrences of the word in question with each of a set of content-bearing words.
• Use a (cosine) similarity measure on these rows to find "nearest neighbours".
[Figure: co-occurrence matrix of English and German words (examples shown: ligament, knee joint, Kreuzband, Kniegelenk) against 1,000 (English) content-bearing words]
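A minimal sketch of the vector representation and cosine lookup described above. It assumes, as one plausible setup, that German words are counted against the English content-bearing words of the aligned English text; the function names and the toy data are hypothetical and this is not the Infomap code.

```python
import math

def build_vectors(doc_pairs, content_words):
    """Map every English and German word to a row of co-occurrence counts
    with the (English) content-bearing words of the aligned document pair."""
    index = {w: i for i, w in enumerate(content_words)}
    vectors = {}
    for en_text, de_text in doc_pairs:
        en_tokens = en_text.lower().split()
        de_tokens = de_text.lower().split()
        context = [index[t] for t in en_tokens if t in index]
        for word in set(en_tokens) | set(de_tokens):
            row = vectors.setdefault(word, [0] * len(content_words))
            for i in context:
                row[i] += 1
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbours(word, vectors, n=3):
    """Return the n words whose rows are most similar to the row of `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, row))
              for other, row in vectors.items() if other != word]
    scored.sort(key=lambda p: (-p[1], p[0]))
    return scored[:n]

doc_pairs = [
    ("ligament knee injury", "kreuzband knie verletzung"),
    ("ligament knee surgery", "kreuzband operation"),
    ("heart valve surgery", "herzklappe operation"),
]
content_words = ["ligament", "knee", "heart", "valve"]
vectors = build_vectors(doc_pairs, content_words)
neighbours = nearest_neighbours("kreuzband", vectors)
```

With this toy data, "kreuzband" and "ligament" end up with identical rows, so their cosine similarity is maximal, while "kreuzband" and "herzklappe" share no context at all.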

WSD: Terms, Senses
Semantic Resource Extension and Tuning
• Extension (DFKI)
  - Morphological Analysis (Decomposition)
    Entzündungsgewebe (inflammation tissue) HYPONYM Gewebe, Körpergewebe (body tissue)
    Gewebe, Stoff, Textilstoff (textile)
  - Semantic Similarity (Co-Occurrence Patterns)
    Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, ....
• Tuning (CSLI, DFKI)
  - Aligning Clusters with Senses
    C0043210|GER|P|L1254343|PF|S1496289|Frauen|3|
    C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|

WSD: Algorithm
Combination of Methods (Task, Domain, General)
• Bilingual Sense Selection (CSLI)
  - 1 Sense in L1 vs. >1 Sense in L2
  - English: blood vessel (C0005847) vs. vessel (polysaccharide) (C0148346)
  - German: Blutgefaesse = blood vessel (C0005847)
• Collocations and Senses (CSLI)
  - For an ambiguous single-word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term.
  - Single-word term: abortion
    1) a natural process C0000786 (T047)
    2) a medical procedure C0000811 (T061)
  - Multiword terms:
    recurrent abortion C0000809 (T047) => sense 1
    induced abortion C0000811 (T061) => sense 2
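The collocation heuristic can be sketched in a few lines. The function name `select_sense` and the corpus counts below are invented for illustration; only the two multiword terms and their sense assignments come from the slide.

```python
def select_sense(ambiguous_term, multiword_senses, corpus_counts):
    """Pick the sense of the most frequent unambiguous multiword term
    that contains the ambiguous single-word term."""
    candidates = [
        (corpus_counts.get(term, 0), term)
        for term in multiword_senses
        if ambiguous_term in term.split()
    ]
    if not candidates:
        return None
    count, best_term = max(candidates)
    # Without any corpus evidence the heuristic makes no decision.
    return multiword_senses[best_term] if count > 0 else None

# Sense assignments from the slide; the counts are hypothetical.
multiword_senses = {
    "recurrent abortion": "sense 1 (T047, natural process)",
    "induced abortion": "sense 2 (T061, medical procedure)",
}
corpus_counts = {"recurrent abortion": 3, "induced abortion": 12}
chosen = select_sense("abortion", multiword_senses, corpus_counts)
```

Under these made-up counts the heuristic selects sense 2, because "induced abortion" is the more frequent multiword term.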

WSD: Algorithm
Combination of Methods (Task, Domain, General)
• Domain Specific Senses (DFKI)
  - Concept Relevance in Domain Corpus
  - Mineral
    0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium
    4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon
• Instance-Based Learning (DFKI)
  - Unsupervised Context Models (n-grams)
  - Training (Learn Class Models)
    He drank <milk LIQUID>
    He drank <coffee LIQUID>
    He drank <tea LIQUID>
    He drank <chocolate FOOD, LIQUID>
  - Application (Apply Class Models)
    He drank <chocolate FOOD, LIQUID>
    He drank <Java GEOGRAPHICAL, LIQUID>

WSD: Evaluation
Lexical Sample Evaluation Corpora (Medical)
• Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8), DE: 4059 (1.5)
• Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)

Relation Extraction
Grammatical Function Tagging (DFKI)
• Robust, Shallow Grammatical Function Tagger
• EM Model (Trained on Frankfurter Rundschau: 35M Tokens; Adaptation on Medical Corpora Under Development)
  - 1.5M 'Types': Verb, Voice, Function, Nom-Head-Argument, e.g. abarbeiten ACT SUBJ Politiker
  - Use of PoS Information; Use of Chunk Information Planned
  - Tags for SUBJ, OBJ, IOBJ, ACT/PAS
  - German Available, English under Development
• Example: Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten. (English: "30 patients who had to undergo elective aortocoronary bypass surgery were examined.")

Relation Extraction
Semantic Relation Indicators (DFKI, CSLI)
Novel Semantic Relations (DFKI, CSLI)
• Cluster 1: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ...
  - Verbs: differentiate, conclude, discriminate, diagnose, illustrate
• Cluster 2: T101/T169, T101/T184, T101/T048, ...
  - Verbs: suffer, demonstrate, progress, develop, die
• Cluster 3: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ...
  - Verbs: reduce, treat, follow, diagnose, cure
Semantic types: T047: Disease; T048: Mental Dysfunction; T060: Diagnostic Procedure; T101: Patient; T121: Pharm. Substance; T169: Funct. Concept (Syndrome); T184: Sign or Symptom

Summarization (CMU)
Extractive Summarization
Maximal Marginal Relevance (MMR)
• Find passages most relevant to query
• Maximize information novelty (minimize passage redundancy)
• Assemble extracted passages for summary

$$\operatorname{Argmax}^{k}_{d_i \in C}\left[\lambda\, S(Q, d_i) - (1-\lambda)\max_{d_j \in R} S(d_i, d_j)\right]$$

Q = query, d = document, S = similarity function
λ = tradeoff factor between relevance & novelty
k = number of passages to include in summary
Applications
• Re-ranking retrieved documents from IR Engine
• Ranking passages from a document for inclusion in summaries
• Ranking passages from topically-related document cluster for cluster summary
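A greedy selection loop is a common way to apply the MMR criterion; the sketch below assumes that C is the set of candidate passages and R the set of passages already selected, which the slide does not spell out. The Jaccard word overlap is only a stand-in for the similarity function S, and the names `mmr_select` and `jaccard` are hypothetical.

```python
def jaccard(a, b):
    """Toy similarity: word overlap between two strings (stand-in for S)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def mmr_select(query, candidates, similarity, k, lam=0.7):
    """Greedily pick k passages, trading off relevance to the query
    against redundancy with the passages already selected."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(d):
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * similarity(query, d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = "knee ligament injury"
passages = [
    "knee ligament injury treatment",
    "knee ligament injury therapy",
    "ligament surgery outcome",
]
novel = mmr_select(query, passages, jaccard, k=2, lam=0.3)
relevant_only = mmr_select(query, passages, jaccard, k=2, lam=1.0)
```

With a low λ the second pick is the less relevant but non-redundant third passage; with λ = 1 the criterion reduces to plain relevance ranking and the near-duplicate second passage is chosen instead.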

Summarization (CMU)
MuchMore Application
• INDICATIVE and QUERY-RELEVANT
• MMR applies to English and German
  - Genre-based specialization (e.g. include conclusions for scientific articles)
  - Linguistic specialization possible
• Summarization should apply when retrieving FULL articles
  - query-driven summaries instead of generic abstracts

Technical Evaluation
Test Data
• Test Collection: Springer Abstracts (German and English)
• Query Set: 25 of 126 Selected by ZInfo
• Relevance Assessments
  - Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant
  - Pool Size: 500 Documents Based on 18 Runs Done by CMU, CSLI and EIT
  - German (ZInfo): 959 Relevant Documents
  - English (CMU): 500 Relevant Documents (1 judge), 964 Relevant Documents (3 judges)

Technical Evaluation
Methods Evaluated
• Corpus Based
  - Similarity Thesaurus (EIT)
  - Example-based Translation (CMU)
  - Pseudo Relevance Feedback (CMU)
  - Generalized Vector Space Model (CMU)
• Hybrid
  - Classification (CMU)
    Hierarchical: kNN, Rocchio
    Flat: kNN, Rocchio-style Classifier
  - Semantic Annotation + Extraction (DFKI, XRCE)
    UMLS / XRCE Terms & Semantic Relations
    EuroWordNet Terms
  - Semantic Annotation + Similarity Thesaurus

Technical Evaluation
TREC-Style Performance Measurements
• Overall Performance
  - 11-point Average Precision (Interpolated)
• Performance in the High-Precision Area
  - Assumption: User Wants to Get Most Relevant Documents Top-Ranked within the Result List
  - Average Interpolated Precision at Recall of 0.1
  - Exact Precision after 10 Retrieved Documents
  - Applied to Experiments Evaluating Semantic Annotations
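For readers unfamiliar with these measures, the sketch below computes them for a single query using the standard definitions (interpolated precision at recall r is the highest precision at any rank where recall is at least r). The function names and the toy ranking are illustrative assumptions, not the project's evaluation scripts, and `relevant` is assumed to be non-empty.

```python
def precision_at(ranking, relevant, n=10):
    """Exact precision after n retrieved documents."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n

def interpolated_precision(ranking, relevant, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best

def eleven_point_average(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranking, relevant, r) for r in levels) / 11

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
avg_11pt = eleven_point_average(ranking, relevant)
p_at_recall_01 = interpolated_precision(ranking, relevant, 0.1)
p_at_2 = precision_at(ranking, relevant, n=2)
```

In a full evaluation these per-query values would be averaged over all queries in the query set.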

Technical Evaluation
Results: Corpus Based Methods
Data Sets
• EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English and 9640 Documents for German
• CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each Language

Technical Evaluation
Results: Hybrid Methods
Categorization (Preliminary Results)
• Reuters-21578: 10,000+ documents, 90 categories
• Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories
• Reuters Koller & Sahami subsets (ICML'98): 138 to 939 documents, 6-11 categories in a set
• OHSUMED: 233,445 documents, 14,321 categories

Technical Evaluation
Results: Hybrid Methods
Semantic Annotation + Extraction
• Data Set: Full Springer Corpus
• Weighting Scheme: Coordination Level Matching (CLM)
  - 1st Pass: Documents Containing Matching Terms or Semantic Relations are Preferred
  - 2nd Pass: All Features Using lnu.ltn
• Rel. Assessments: German

Technical Evaluation
Results: Hybrid Methods
Semantic Annotation + Similarity Thesaurus
• Data Set: Full Springer Corpus
• Weighting Scheme: Coordination Level Matching (CLM)
• Rel. Assessments: German

Technical Evaluation
Summary of the Results
• Assumption: CLIR achieves up to 75% of Monolingual Baseline (11pt Average Precision)
• Corpus-based Methods (Compared to Monolingual PRF)
  - German – English: PRF: 81%, EBT: 77%, EIT: 66%
  - English – German: PRF: 113%, EBT: 106%, EIT: 60%
• Hybrid Methods (Compared to Monolingual EIT)
  - German – English: 73% (UMLS Terms & SemRels)
  - English – German: 50% (UMLS Terms & SemRels)
  - English – German: 80% (UMLS Terms & SemRels & XRCE Terms)
  - German – English: 74% (SimThes & UMLS Terms & SemRels)
  - English – German: 80% (SimThes & UMLS Terms & SemRels)
  - English – German: 92% (SimThes & UMLS Terms & SemRels & XRCE Terms)

Management
Deviations from the Work Plan
• Corpus Collection
  - Comparable Medical Document Corpora are Very Difficult to Obtain; Anonymization Must be Validated by Hospital CIO
  - Work with "Shuffled" Parallel Corpus
  - Radiology Reports (~600,000) Available in German, to be Obtained for English
• Corpus Annotation
  - More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon)
• Relation Extraction
  - More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction

Management
Future Prospects and Activities
• R&D Topics
  - Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)
  - Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé)
• Related Projects and Workshops
  - Project Proposal IKAR/OS on KM & Visualization in Life Sciences
  - OntoWeb SIG on LT in Ontology Development and Use
  - MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002)
  - ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)
