De-identifying Pathology Reports for Pathology Informatics

James Gardner, Li Xiong (Department of Math and Computer Science)
Fusheng Wang, Andrew Post, Joel Saltz (Center for Comprehensive Informatics)

Introduction
• The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI)
• De-identifying pathology reports is critical to enabling secondary use of medical records for research
• HIDE (Health Information DE-identification) is an open-source de-identification tool based on statistical de-identification techniques

HIPAA Identifiers
1. Names
2. All geographical subdivisions smaller than a state
3. All elements of dates (except year)
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images or comparable images
18. Any other unique identifying number, characteristic, or code
• Either these identifiers have to be removed, or
• A qualified statistical expert must determine that the risk of identifying an individual is very small

HIDE Overview
• Uses a state-of-the-art named entity recognition technique, Conditional Random Fields (CRFs), to extract PHI
  • Previous tools such as DE-ID and HMS Scrubber use rule-based approaches, which are labor intensive and not portable
• Provides flexible de-identification options, including full de-identification and statistical de-identification
  • Previous tools allow only simple removal or substitution of the PHI
• Provides an easy-to-use web-based interface
• Integrated with caTIES and caTissue (in progress)

PHI Extraction
• Uses a state-of-the-art NLP technique, Conditional Random Fields
  • High accuracy, easy to train, portable
• Combines different feature sets and sampling techniques
  • Feature sets: dictionary, affix, regular expression, and context
• Can use default models or custom-trained models
• Web interface for annotating and training custom models
  • A set of reports is loaded and manually labeled
  • The labeled documents are used to train a model that automatically de-identifies new reports
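The four feature sets named above can be illustrated with a minimal Python sketch. This is not HIDE's code: the function `token_features`, the tiny `NAME_DICTIONARY`, and the patterns in `REGEX_FEATURES` are hypothetical stand-ins that only show what dictionary, affix, regular-expression, and context features for a single token might look like before they are passed to a CRF trainer.

```python
import re

# Hypothetical mini-dictionary; a real system would load much larger name lists.
NAME_DICTIONARY = {"james", "mary", "john"}

# Hypothetical regular-expression features for common PHI shapes.
REGEX_FEATURES = {
    "looks_like_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "looks_like_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}


def token_features(tokens, i, window=1):
    """Build a feature dict for tokens[i] from the four feature families."""
    token = tokens[i]
    feats = {
        "lower": token.lower(),
        # Dictionary feature: is the token a known name?
        "in_name_dict": token.lower() in NAME_DICTIONARY,
        # Affix features: short prefix and suffix of the token.
        "prefix3": token[:3],
        "suffix3": token[-3:],
    }
    # Regular-expression features: does the token match a PHI-like shape?
    for name, pattern in REGEX_FEATURES.items():
        feats[name] = bool(pattern.match(token))
    # Context features: neighbouring tokens within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"ctx[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
    return feats


tokens = ["Patient", "John", "seen", "on", "03/14/2009"]
features_per_token = [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such dicts per report, paired with manually assigned PHI labels, is the typical input shape for CRF toolkits; which toolkit HIDE itself uses is not stated in the slides.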

HIDE: De-identification Options
• Full de-identification
  • Safe harbor: all 18 HIPAA identifiers removed or substituted
• Partial de-identification
  • Limited data set: all direct HIPAA identifiers removed or substituted (except dates and address elements other than street/P.O. Box)
• Configurable de-identification
  • A configurable set of identifiers removed or substituted
• Statistical de-identification
  • Advanced anonymization that provides a rigorous statistical privacy guarantee while preserving the utility of the data
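The difference between the full, partial, and configurable options comes down to which identifier categories are substituted. A minimal Python sketch of that idea follows. It is an illustration only, not HIDE's API: the function `redact`, the label names, and the sample sentence are all hypothetical, and the spans are assumed to be non-overlapping `(start, end, label)` triples produced by an earlier extraction step.

```python
def redact(text, spans, labels_to_remove):
    """Replace each span whose label is selected with a [LABEL] placeholder.

    Assumes spans are non-overlapping (start, end, label) triples.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels_to_remove:
            continue  # this identifier category is kept as-is
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text, phrase, label):
    """Locate a phrase in the text and return it as a labeled span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Made-up example sentence and labels.
text = "John Smith, MRN 12345, seen 03/14/2009."
spans = [
    span_of(text, "John Smith", "NAME"),
    span_of(text, "12345", "MRN"),
    span_of(text, "03/14/2009", "DATE"),
]

full = redact(text, spans, {"NAME", "MRN", "DATE"})  # every category
limited = redact(text, spans, {"NAME", "MRN"})       # dates retained
```

Here `full` substitutes every labeled span, while `limited` leaves the date in place, mirroring the limited-data-set option described above.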

Statistical De-identification Example
• De-identification satisfying k-anonymity (k=2): every record is indistinguishable from the other records in its group, and every group contains at least k records
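The k-anonymity condition can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the function `is_k_anonymous`, the field names, and the sample records are hypothetical and do not come from the slides or from HIDE.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Made-up records after generalization: ages binned, ZIP codes truncated.
generalized = [
    {"age": "30-39", "zip": "303**", "diagnosis": "carcinoma"},
    {"age": "30-39", "zip": "303**", "diagnosis": "benign"},
    {"age": "40-49", "zip": "300**", "diagnosis": "carcinoma"},
    {"age": "40-49", "zip": "300**", "diagnosis": "benign"},
]

# The same records before generalization: each one is unique.
raw = [
    {"age": 34, "zip": "30322", "diagnosis": "carcinoma"},
    {"age": 37, "zip": "30307", "diagnosis": "benign"},
    {"age": 41, "zip": "30002", "diagnosis": "carcinoma"},
    {"age": 48, "zip": "30033", "diagnosis": "benign"},
]

ok_generalized = is_k_anonymous(generalized, ["age", "zip"], k=2)
ok_raw = is_k_anonymous(raw, ["age", "zip"], k=2)
```

Only the quasi-identifier columns (`age`, `zip`) are grouped; the sensitive value (`diagnosis`) is left untouched, which is how generalization keeps the data useful.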

Study 1: PHI Extraction on Emory Pathology Reports (100 reports, 10-fold cross validation)
• Precision: true positives over the sum of true positives and false positives
• Recall (sensitivity): true positives over total actual positives
• F1: combination of the two, 2 * precision * recall / (precision + recall)
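As a worked illustration of these three definitions, here is a minimal Python sketch. The function name `precision_recall_f1` is hypothetical, and the counts used in the example are invented for arithmetic purposes only; they are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: PHI tokens correctly extracted (true positives)
    fp: non-PHI tokens wrongly extracted (false positives)
    fn: PHI tokens that were missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 90 correct extractions, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these invented counts, precision is 90/100 = 0.9, recall is 90/120 = 0.75, and F1 is 2 * 0.9 * 0.75 / 1.65, roughly 0.818.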

Study 2: PHI Extraction on i2b2 Reports
• Based on 669 discharge summaries, 10-fold cross validation
• Good precision and recall for most individual PHI identifiers
• Good overall precision and recall for PHI extraction

Study 3: Impact of Different Feature Sets
• Dictionary (d), affix (a), regular expression (r), and context (c) features are listed in order of increasing importance for statistical CRF-based PHI extraction

Integrating HIDE with caTIES
• caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports
• caTIES supports pluggable de-identifiers through its CaTIES_DeIdentifier interface
• HIDEDeIdentifier implements this interface and calls the HIDE client API
• Added a HIDE de-identification option to the caTIES installer
• HIDE has been bundled with caTIES since release v3.7 (May 2010)

Integrating HIDE with caTissue (in Progress)
• caTissue uses caTIES V2.x, refactored into caTissue's workflow
• HIDE integration with caTissue is similar to the integration with caTIES
• Implementation and evaluation are under way
• Goal: integration of pathology reports into the caTissue installation at Winship Cancer Institute at Emory University

Ongoing Development
• Continue development of the HIDE/caTissue integration
• Usability improvement: simplified installation process
• System improvements
  • Efficiency and scalability of the system
  • Support for multiple file formats
  • Additional statistical de-identification options

HIDE Demo http://www.mathcs.emory.edu/hide/demos

Thank you http://www.mathcs.emory.edu/hide Li Xiong (lxiong@mathcs.emory.edu)
