MCORES: a system for noun phrase coreference resolution for clinical records. 2012 SHARPn Summit “Secondary Use”. Andreea Bodnari,1 Peter Szolovits,1 Ozlem Uzuner2. 1 MIT, CSAIL, Cambridge, MA, USA. 2 Department of Information Studies, University at Albany SUNY, Albany, NY, USA.
MCORES: a system for noun phrase coreference resolution for clinical records
2012 SHARPn Summit “Secondary Use”
Andreea Bodnari,1 Peter Szolovits,1 Ozlem Uzuner2
1 MIT, CSAIL, Cambridge, MA, USA
2 Department of Information Studies, University at Albany SUNY, Albany, NY, USA
10.16.2012 - Rochester, MN
Outline
• Medical coreference resolution system (MCORES)
• Experimental results
• Conclusion
Why coreference resolution?
• Electronic Medical Records (EMRs) – large information repositories
• Clinical information requires processing
  • Lower level: sentence parsing, tokenization
  • Higher level: coreference resolution, semantic disambiguation
• Coreference resolution: a fundamental step in text processing
Data: i2b2/VA corpus
• English medical corpus provided by the i2b2 National Center for Biomedical Computing
• De-identified medical discharge summaries
• Source: PH & BIDMC
• Content: 230 (PH) + 196 (BIDMC) discharge summaries
• Annotated concepts and coreference chains
• Concept types: Persons, Problems, Treatments, Tests, Pronouns
Coreference resolution algorithm
1. NP Instance Creation
2. Feature Generation
3. Classification
4. Output Clustering
1. NP instance creation
• Markables of the same semantic category are paired together
• MCORES creates positive instances only from neighboring markable pairs in a chain
1 Instance creation akin to McCarthy and Lehnert
1. NP instance creation
Table 3: Distribution of coreferent and non-coreferent instances per semantic category over instances containing exact, partial, and no textual overlap.
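The instance creation step can be sketched as follows. This is a minimal illustration, not the MCORES implementation: the `Markable` type and `create_instances` function are hypothetical names, and the handling of negatives is an assumption (any same-category pair from different chains is negative, and non-neighboring pairs inside one chain are skipped), since the slides only describe how positive instances are formed.

```python
from itertools import combinations
from typing import NamedTuple, Optional


class Markable(NamedTuple):
    """Hypothetical markable record; field names are illustrative only."""
    text: str
    category: str            # e.g. "problem", "treatment", "test", "person"
    position: int            # order of appearance in the document
    chain_id: Optional[int]  # gold chain id, or None for a singleton


def create_instances(markables):
    """Pair markables of the same semantic category.

    Positive instances come only from neighboring markables in a chain.
    ASSUMPTION: pairs from different chains are negative, and
    non-neighboring pairs within one chain are skipped.
    """
    by_category = {}
    for m in sorted(markables, key=lambda m: m.position):
        by_category.setdefault(m.category, []).append(m)

    instances = []
    for group in by_category.values():
        # Record, for each chained markable, the position of the chain
        # member that immediately precedes it.
        last_in_chain = {}
        previous_in_chain = {}
        for m in group:
            if m.chain_id is not None:
                if m.chain_id in last_in_chain:
                    previous_in_chain[m.position] = last_in_chain[m.chain_id]
                last_in_chain[m.chain_id] = m.position

        # combinations() over the sorted group yields (earlier, later) pairs.
        for antecedent, anaphor in combinations(group, 2):
            same_chain = (antecedent.chain_id is not None
                          and antecedent.chain_id == anaphor.chain_id)
            if not same_chain:
                instances.append((antecedent, anaphor, False))
            elif previous_in_chain.get(anaphor.position) == antecedent.position:
                instances.append((antecedent, anaphor, True))
    return instances
```

Restricting positives to chain neighbors keeps the positive class small and local, which is why a separate clustering step is needed later to recover full chains.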
2. Feature Generation
• Multi-perspective features
  • Antecedent perspective
  • Anaphor perspective
  • Greedy perspective
  • Stingy perspective
• Phrase-level lexical
• Sentence-level lexical
• Syntactic
• Semantic
• Miscellaneous
2. Feature Generation (lexical)
Phrase-level lexical
• Token overlap*
• Normalized token overlap
• Edit-distance
• Normalized edit-distance
Sentence-level lexical
• Sentence-level token overlap*
• Filtered sentence-level token overlap*
• Left and right mention overlap
  • stingy and greedy perspectives only
* multi-perspective feature
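Two of the phrase-level lexical features can be sketched as below. The slides do not define the four perspectives, so the reading used here is an assumption: the antecedent and anaphor perspectives normalize the shared-token count by the respective phrase length, greedy takes the larger of the two values, and stingy the smaller. Function names, whitespace tokenization, and lowercasing are likewise illustrative choices.

```python
def token_overlap(antecedent, anaphor):
    """Token overlap seen from four perspectives.

    ASSUMPTION: perspectives differ only in the normalizer; greedy keeps
    the higher score, stingy the lower one.
    """
    a = set(antecedent.lower().split())
    b = set(anaphor.lower().split())
    shared = len(a & b)
    from_antecedent = shared / len(a) if a else 0.0
    from_anaphor = shared / len(b) if b else 0.0
    return {
        "antecedent": from_antecedent,
        "anaphor": from_anaphor,
        "greedy": max(from_antecedent, from_anaphor),
        "stingy": min(from_antecedent, from_anaphor),
    }


def edit_distance(s, t):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(s, t):
    """Edit distance divided by the longer string's length (range 0 to 1)."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0
```

For example, `token_overlap("chest pain", "the chest pain")` scores the pair differently depending on whether the shorter or the longer phrase sets the normalizer, which is the kind of contrast a multi-perspective feature is meant to expose to the classifier.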
2. Feature Generation (syntactic & semantic)
Syntactic
• Number agreement
• Noun overlap*
• Surname match
Semantic
• UMLS CUI overlap*
• UMLS CUI token overlap*
• UMLS semantic type overlap*
• Anaphor UMLS semantic type
* multi-perspective feature
2. Feature Generation (miscellaneous)
• Token distance
• Mention distance
• All-mention distance
• Sentence distance
• Section match
• Section distance
3. Classification
• C4.5 decision tree algorithm
  • Flexible
  • Readable prediction model
• Classify pairs of markables based on values of the feature vectors
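C4.5 chooses the attribute to split on by gain ratio (information gain divided by the entropy of the split itself). The sketch below computes that criterion for a single discrete feature; it illustrates the selection rule only, not a full C4.5 tree, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of one discrete feature with respect to the labels."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)

    # Expected label entropy after splitting on the feature.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    info_gain = entropy(labels) - remainder

    # Split information penalizes features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0
```

A feature that separates coreferent from non-coreferent pairs perfectly gets a gain ratio of 1, while one that is independent of the label gets 0; the resulting tree stays readable because each node is a single threshold or value test on one feature.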
4. Output Clustering
• Classifier makes pairwise predictions only
• Pairwise predictions are clustered into coreference chains
• Aggressive-merge1 clustering algorithm: a prediction [M1] - [M2] is merged with all preceding pairwise predictions linked to [M1] or [M2]
1 Aggressive-merge algorithm proposed by McCarthy and Lehnert
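The merge rule amounts to taking the transitive closure of the positive pairwise predictions. A minimal sketch using a union-find structure is shown below; the choice of union-find and the name `aggressive_merge` are assumptions made for illustration, not details from the slides.

```python
def aggressive_merge(markables, positive_pairs):
    """Cluster markables into chains from positive pairwise predictions.

    Every positive prediction (a, b) merges the chains that currently
    contain a and b, so any link is enough to join two chains.
    """
    parent = {m: m for m in markables}

    def find(m):
        # Walk up to the chain representative, halving the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in positive_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for m in markables:
        chains.setdefault(find(m), []).append(m)
    # Singletons are not coreference chains, so they are dropped.
    return [chain for chain in chains.values() if len(chain) > 1]
```

Because one positive link joins two whole chains, a single false positive can fuse unrelated chains; the approach favors recall, which fits the positive-instance design that only links chain neighbors.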
Evaluation
• Feature set evaluation
• Perspectives evaluation
• Performance evaluation against
  • In-house baseline
  • Third-party systems (RECONCILE ACL09 & BART)
• Evaluation metric: unweighted averages of Recall, Precision, and F-measure over
  • MUC
  • B³
  • CEAF
  • BLANC
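As an illustration of the scoring, the sketch below computes B³ precision, recall, and F-measure for one document and then averages several metrics without weights. It assumes key and response share the same markables (a markable missing from one side is treated as a singleton there); function names and the dictionary layout are hypothetical.

```python
def b_cubed(key_chains, response_chains):
    """B-cubed precision, recall, and F-measure over all mentions.

    ASSUMPTION: a mention absent from one side counts as a singleton there.
    """
    key_of = {m: frozenset(c) for c in key_chains for m in c}
    response_of = {m: frozenset(c) for c in response_chains for m in c}
    mentions = set(key_of) | set(response_of)
    if not mentions:
        return 0.0, 0.0, 0.0

    precision_sum = recall_sum = 0.0
    for m in mentions:
        key = key_of.get(m, frozenset([m]))
        response = response_of.get(m, frozenset([m]))
        shared = len(key & response)
        precision_sum += shared / len(response)
        recall_sum += shared / len(key)

    precision = precision_sum / len(mentions)
    recall = recall_sum / len(mentions)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def unweighted_average(scores):
    """Average (precision, recall, F) triples, one per metric, equally."""
    count = len(scores)
    return tuple(sum(triple[i] for triple in scores.values()) / count
                 for i in range(3))
```

A call such as `unweighted_average({"MUC": muc, "B3": b_cubed(key, response), "CEAF": ceaf, "BLANC": blanc})` would give each metric equal influence, where `muc`, `ceaf`, and `blanc` stand for triples produced by scorers not shown here.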
Discussion
• MCORES’ advantage comes from linking markables with no token overlap
• Phrase-level sub-MCORES performs similarly to MCORES
• Greedy perspective system is the most favorable single-perspective system
• Multi-perspective system performs as well as or better than single-perspective systems
• Error analysis
  • MCORES fails to classify misspelled person pairs
  • Medical problems: false positives due to the difference between new and recurring events
  • Treatments: false positives due to medications with different routes of administration
  • Tests: false positives due to the large number of full-overlap instances that did not corefer
Conclusion
• Developed a coreference resolution system for the medical domain (MCORES)
• MCORES innovates through a multi-perspective and knowledge-based feature set
• MCORES outperforms third-party systems and an in-house baseline, improving coreference resolution on clinical records