
An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain

Integrating Data for Analysis, Anonymization, and Sharing. An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain. Wendy W. Chapman, PhD. Division of Biomedical Informatics University of California, San Diego. Overview.


Presentation Transcript


  1. Integrating Data for Analysis, Anonymization, and Sharing An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain Wendy W. Chapman, PhD Division of Biomedical Informatics University of California, San Diego

  2. Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • iDASH • Opportunities for sharing and collaboration in NLP

  3. NLP Success “IBM's computer could very well herald a whole new era in medicine.” ComputerWorld, February 17, 2011 Dr. Watson?? “Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer ‘Watson’ has enrolled in medical school at Columbia University,” New York Daily News, February 18, 2011

  4. Clinical NLP Since the 1960s Why has clinical NLP had little impact on clinical care?

  5. Barriers to Development • Sharing clinical data difficult • Have not had shared datasets for development and evaluation • Modules trained on general English not sufficient • Insufficient common conventions and standards for annotations • Data sets are unique to a lab • Not easily interchangeable

  6. Limited collaboration • Clinical NLP applications silos and black boxes • Have not had open source applications • Reproducibility is formidable • Open source release not always sufficient • Software engineering quality not always great • Mechanisms for reproducing results are sparse

  7. Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH

  8. Security & Privacy Concerns Institutions are reluctant to share data • Clinical texts have many patient identifiers • 18 HIPAA identifiers • Names • Addresses • Items not regulated by HIPAA • tight end for the Steelers • Unique cases • 50s-year-old woman who is pregnant • Sensitive information • HIV status

  9. Lack of user-centered development and scalability • Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)

  10. Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH

  11. iDASH • integrating Data • Analysis • Anonymization • Sharing Data Software/Tools Computational Resources

  12. Disincentives to Share iDASH aims to minimize these disincentives • ‘Scooping’ by faster analysts • Exposure of potential errors in data • Resources for preparing data submissions • Maintaining data • Interacting with potential users takes time • Threat of privacy breach when human subjects are involved • Do not have policies in place • Fallible de-identification, anonymization algorithms

  13. nlp-ecosystem.ucsd.edu

  14. HIPAA &/or FISMA Compliant Cloud Digital Informed consent • Access control • De-identification • Query counts • Artificial data generators Privacy preserving Informed Consent Registry Customizable DUAs Researcher access

  15. Schemas Bibliography Research Tutorials Guidelines Resources Education NLP Ecosystem UCSD Clinical Data Data Evaluation Workbench De-Identification MT Samples Tools & Services Collaborative Development Tools TextVect Virtual Machines Annotation Admin & eHOST Registry 2011 summer internship program funded by NIH U54HL108460

  16. Collaborative Effort to Build Ecosystem Evaluation Workbench De-Identification Tools & Services Collaborative Knowledge Authoring TextVect Increase access to NLP Virtual Machines Annotation Environment Decrease Burden of Developing NLP Registry

  17. Increase ability to find NLP tools orbit

  18. Registry: orbit.nlm.nih.gov Len D’Avolio, Dina Demner-Fushman

  19. Increase access to clinical text De-identification service

  20. De-identification Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery • Several available de-identification modules • Need to adapt to local text • Efficient • Secure • Customizable ensemble de-identification system • Build a de-identified corpus • Incorporate existing de-id modules • Launch as virtual machine • Iterative training, evaluation, and modification by user • Correct mistakes • Add regular expressions
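
The customizable ensemble idea on this slide (combine existing de-id modules, then let the user add regular expressions for local text) can be sketched as below. This is a minimal illustration, not the iDASH system: `regex_module`, `deidentify`, the patterns, and the placeholder tags are all hypothetical, and real modules would be trained or rule-based detectors rather than two toy regexes.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, identifier label)
Module = Callable[[str], List[Span]]

def regex_module(pattern: str, label: str) -> Module:
    """Wrap one regular expression as a de-id module (hypothetical helper)."""
    compiled = re.compile(pattern)
    def detect(text: str) -> List[Span]:
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]
    return detect

def deidentify(text: str, modules: List[Module]) -> str:
    """Run every module and replace the union of detected spans with tags."""
    spans = sorted(s for module in modules for s in module(text))
    out, cursor = [], 0
    for start, end, label in spans:
        if start >= cursor:
            out.append(text[cursor:start])
            out.append(f"[{label}]")
        # An overlapping span is absorbed into the previous tag, so its
        # tail is still removed rather than leaked.
        cursor = max(cursor, end)
    out.append(text[cursor:])
    return "".join(out)

# Stand-ins for "existing de-id modules" (assumed patterns).
MODULES = [
    regex_module(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
    regex_module(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "DATE"),
]

# Iterative step: the user spots a missed local identifier and adds a rule.
MODULES.append(regex_module(r"\bMRN\s*\d+\b", "MRN"))

note = "Seen 03/14/2011. MRN 48213. SSN 123-45-6789."
print(deidentify(note, MODULES))
```

Taking the union of all module outputs favors recall, which is usually the safer error direction for de-identification; a voting scheme would trade some of that recall for precision.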

  21. Increase access to textual features TextVect

  22. TextVect NLM: Abhishek Kumar

  23. Decrease the Burden of Customizing an NLP Application collaborative Knowledge Authoring Support Service (cKASS)

  24. Customizing an IE App IE Output User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Map

  25. Customizing an IE App IE Output Dry cough Productive cough Cough Hacking cough Bloody cough User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Which concepts?

  26. Customizing an IE App IE Output Temp 38.0C Low-grade temperature User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy What is a fever?
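
The "What is a fever?" question amounts to a user-specific definition layered over the IE output. A minimal sketch, assuming the IE output arrives as short strings like those on the slide; the threshold, the phrase list, and the function names are hypothetical choices that each user would set for themselves.

```python
import re

FEVER_THRESHOLD_C = 38.0          # assumption: cutoff chosen by the user
FEVER_PHRASES = {"fever"}         # assumption: phrases the user counts as fever

# Matches outputs such as "Temp 38.0C" or "temperature 101.3 F".
TEMP_PATTERN = re.compile(
    r"temp(?:erature)?\s*(\d+(?:\.\d+)?)\s*([CF])\b", re.IGNORECASE
)

def to_celsius(value: float, unit: str) -> float:
    """Convert a reading to Celsius; unit is 'C' or 'F'."""
    return value if unit.upper() == "C" else (value - 32.0) * 5.0 / 9.0

def is_fever(ie_output: str,
             threshold_c: float = FEVER_THRESHOLD_C,
             phrases=FEVER_PHRASES) -> bool:
    """Decide whether one IE output string maps to the user's Fever concept."""
    if ie_output.strip().lower() in phrases:
        return True
    match = TEMP_PATTERN.search(ie_output)
    if match:
        reading = to_celsius(float(match.group(1)), match.group(2))
        return reading >= threshold_c
    return False

print(is_fever("Temp 38.0C"))                      # default definition
print(is_fever("Temp 38.0C", threshold_c=38.5))    # stricter user
print(is_fever("Low-grade temperature",
               phrases={"fever", "low-grade temperature"}))
```

Two users running the same IE application can therefore disagree on the same output, which is why the definition has to live in the user's knowledge base rather than in the extractor.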

  27. Customizing an IE App IE Output NECK: no adenopathy Disorder: adenopathy Negation: negated User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Section mapping
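
The section-mapping example on this slide can be read as a lookup keyed on section and disorder: "adenopathy" found under NECK corresponds to the user's concept Cervical Lymphadenopathy, and the negation flag decides whether it is present or absent. A minimal sketch under that reading; the `Finding` type and `SECTION_MAP` table are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Finding:
    """One IE output item: where it was found, what it is, and its negation."""
    section: str
    disorder: str
    negated: bool

# Hypothetical user-authored mapping: (section, disorder) -> user concept.
SECTION_MAP: Dict[Tuple[str, str], str] = {
    ("NECK", "adenopathy"): "Cervical Lymphadenopathy",
}

def map_finding(finding: Finding,
                section_map: Dict[Tuple[str, str], str] = SECTION_MAP
                ) -> Optional[Tuple[str, str]]:
    """Return (user concept, 'present' | 'absent'), or None if unmapped."""
    key = (finding.section.upper(), finding.disorder.lower())
    concept = section_map.get(key)
    if concept is None:
        return None
    return concept, "absent" if finding.negated else "present"

# "NECK: no adenopathy" -> Disorder: adenopathy, Negation: negated
print(map_finding(Finding("NECK", "adenopathy", negated=True)))
```

The same disorder under a different section (for example an axillary exam) returns `None` here, so it is not mistaken for the cervical concept.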

  28. KOS-IE: Knowledge Organization Systems for Information Extraction

  29. Compile information helpful for IE

  30. Collaborative Knowledge Base Development: cKASS Radiologist NLP Tools • Physician • Radiologist • Nurse • Clinical Researcher • Knowledge Engineer. Decision Support System User KB Shared KB External KB LQ Wang, M Conway, F Fana, M Tharp, D Hillert

  31. Knowledge Authoring Augment user KB with lexical variants, synonyms, and related concepts • User-driven authoring • Top-down: Provide access to external knowledge sources • UMLS, Specialist Lexicon, Bioportal • Bottom-up: Annotate to derive synonyms • Recommendation-based authoring • Generate lexical variants • Mine external knowledge sources • Mine patient records
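
The "generate lexical variants" step of recommendation-based authoring can be illustrated with a few naive string rules. This is only a sketch: the rules and function names are hypothetical, and a production system would more plausibly draw variants from resources such as the Specialist Lexicon named on the slide.

```python
from typing import Dict, Iterable, Set

def lexical_variants(term: str) -> Set[str]:
    """Generate simple variants: case folding, hyphen/space, naive plural."""
    base = term.lower().strip()
    variants = {base}
    if "-" in base:
        variants.add(base.replace("-", " "))
    if " " in base:
        variants.add(base.replace(" ", "-"))
    for v in list(variants):
        # Naive English pluralization; irregular forms are not handled.
        suffix = "es" if v.endswith(("s", "x", "ch", "sh")) else "s"
        variants.add(v + suffix)
    return variants

def augment_kb(user_kb: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Return a copy of the user KB with variants added for every term."""
    return {
        concept: set().union(*(lexical_variants(t) for t in terms))
        for concept, terms in user_kb.items()
    }

print(sorted(lexical_variants("Dry cough")))
print(augment_kb({"Cough": ["cough"]}))
```

In a recommendation workflow the generated variants would be shown to the user as suggestions to accept or reject, not written into the knowledge base automatically.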

  32. Decrease the Burden of Evaluation & Error Analysis Evaluation workbench

  33. Evaluation Workbench • Compare the output of two NLP annotators on clinical text • NLP system vs human annotation • View annotations • Calculate outcome measures • Drill down to all levels of annotation • Document-level • Perform error analysis • Future versions will support formal error analysis
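
Comparing two annotators and calculating outcome measures can be sketched as set arithmetic over annotations. The sketch assumes exact-match comparison on (document, start, end, classification) tuples and reports precision, recall, and F1; the slide does not say which measures or matching criteria the workbench uses, so these are illustrative choices.

```python
from typing import Dict, Set, Tuple

# (document id, start offset, end offset, classification)
Annotation = Tuple[str, int, int, str]

def outcome_measures(reference: Set[Annotation],
                     system: Set[Annotation]) -> Dict[str, object]:
    """Score `system` against `reference` using exact span and label match."""
    true_pos = reference & system
    false_pos = system - reference
    false_neg = reference - system
    tp, fp, fn = len(true_pos), len(false_pos), len(false_neg)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Kept for drill-down during error analysis.
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
    }

human = {("d1", 0, 10, "cough"), ("d1", 20, 25, "fever"),
         ("d2", 5, 12, "dyspnea")}
nlp = {("d1", 0, 10, "cough"), ("d1", 20, 25, "fever"),
       ("d2", 30, 36, "wheezing"), ("d2", 40, 45, "fever")}

scores = outcome_measures(human, nlp)
print(scores["precision"], scores["recall"], scores["f1"])
```

Returning the false positives and false negatives alongside the scores mirrors the slide's point that the numbers are a starting place for drilling into individual annotations.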

  34. Levels of Annotation • Document • Report classified as Shigellosis • Group • Section classified as Past Medical History Section • Utterance • Group of text classified as Sentence • Snippet • “chest pain” classified as CUI 058273 • Word • “pain” classified as noun • Token • “.” classified as EOS marker

  35. Select Classifications to View Document & annotations Outcome Measures for Selected Annotations Report List Attributes for Selected Annotation Relationships for Selected Annotation VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova

  36. Decrease the Burden of Annotation Annotation Environment

  37. Challenges to Annotating • Time consuming • Recruiting & training annotators for high agreement • Expensive • Domain experts especially expensive • Need for annotation by multiple people • Challenging to design annotation task • How many annotators? • How should I quantify quality of annotations? • Logistically challenging • Managing files and batches of reports • Setting up annotation tool • Reinventing the wheel • Hasn’t someone created a schema for this before?
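
For the question "How should I quantify quality of annotations?", one common answer for two annotators labeling the same items is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slide does not name a measure, so this is offered as one option; the function name and toy labels are illustrative.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa suits tasks where every item receives a label; for span-marking tasks, where the set of "items" is not fixed in advance, pairwise F1 between annotators is often reported instead.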

  38. How can we reduce the burden of annotation?

  39. iDASH Annotation Environment Goal: provide an environment to decrease the burden of annotation for research and application Annotator Registry eHOST Annotation Admin Web application iDASH cloud Client app on your computer VA, SHARP, and NIGMS : S Duvall, B South, G Savova, N Elhadad, H Hochheiser

  40. Annotator Registry • Enlist for annotation • Certify for annotation tasks • Personal health information • Part-of-speech tagging • UMLS mapping • Set pay rate • Searchable • Available for inclusion in new annotation task http://idash.ucsd.edu/nlp-annotator-registry

  41. Annotation Admin: Intended Users & Uses Users • NLP researchers • Annotation administrators Uses • Manage annotation projects – who annotates what • Currently done with hundreds of files on hard drive • Integrate with annotation tool (eHOST) • Download batches of raw reports to annotators • Upload and store annotated reports • Manage simple annotation projects • Facilitate distributed annotation
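
The "who annotates what" bookkeeping described here can be sketched as splitting reports into batches and assigning each batch to more than one annotator. This is a hypothetical illustration of the task, not Annotation Admin's actual logic; the function names, report IDs, and round-robin scheme are assumptions.

```python
from typing import Dict, List, Sequence

def make_batches(report_ids: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split report IDs into consecutive batches of at most batch_size."""
    return [list(report_ids[i:i + batch_size])
            for i in range(0, len(report_ids), batch_size)]

def assign_batches(num_batches: int, annotators: Sequence[str],
                   per_batch: int = 2) -> Dict[str, List[int]]:
    """Give each batch to `per_batch` distinct annotators, round-robin."""
    if per_batch > len(annotators):
        raise ValueError("not enough annotators for the requested overlap")
    assignments: Dict[str, List[int]] = {name: [] for name in annotators}
    for batch_index in range(num_batches):
        for offset in range(per_batch):
            name = annotators[(batch_index + offset) % len(annotators)]
            assignments[name].append(batch_index)
    return assignments

reports = [f"report_{n:03d}" for n in range(10)]
batches = make_batches(reports, batch_size=4)
plan = assign_batches(len(batches), ["ann_a", "ann_b", "ann_c"], per_batch=2)
print(plan)
```

Assigning every batch to two annotators is what makes agreement measurement possible later, at the cost of doubling the annotation effort.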

  42. Annotation Admin 1. Assign annotators to a task

  43. 2. Create a Schema

  44. 3. Assign users and set time expectations

  45. 4. Keep track of progress

  46. Collaborative Effort to Build Resources Evaluation Workbench De-Identification Tools & Services Collaborative Knowledge Authoring TextVect Increase access to NLP Virtual Machines Annotation Environment Decrease Burden of Developing NLP Registry

  47. Conclusion • More demand for EHR data • NLP has potential to extend value of narrative clinical reports • There have been many barriers • To development • To deployment • Recent developments facilitate collaboration & sharing • Common annotation conventions • Privacy algorithms • Shared datasets • Hosted environments • iDASH hopes to facilitate • Development of NLP • Application of NLP

  48. Integrating Data for Analysis, Anonymization, and Sharing Questions | Discussion iDASH/ShARe Workshop on Annotation September 29, 2012 La Jolla, CA wwchapman@ucsd.edu Division of Biomedical Informatics University of California, San Diego
