Securing Electronic Health Records without Impeding the Flow of Information

Rakesh Agrawal*, Microsoft Search Labs, Mountain View, CA, rakesha@microsoft.com
Christopher Johnson, IBM Almaden Research Center, San Jose, CA, johnsocm@us.ibm.com
* Based on work done while the author was at IBM Almaden

Based on joint work with • Roberto Bayardo • Alvin Cheung • Alexandre Evfimievski • Tyrone Grandison • Jerry Kiernan • Kristen Lefevre • Ramakrishnan Srikant • Yirong Xu

Thesis
• Technology alone cannot solve the complex problem of securely managing health information; at the same time, policy and law need to be informed of what is technically feasible and in what timeframe.
• By advancing technology, we can:
  • change the mix of legislation, societal norms, market forces, and technology comprising the solution; and
  • improve the overall quality of the solution.

Outline
• Illustrate thesis with technology examples based on Hippocratic database work
• Recommendations for
  • Policy designers and legislators
  • Solution developers
  • Scientists and researchers

Hippocratic Database Technologies
GOAL: Create a new generation of information systems that protect the privacy, security, and ownership of data while not impeding the flow of information.
• Active Enforcement: Data item level enforcement of disclosure policies and patient preferences
• Compliance Auditing: Determine whether data has been accessed in violation of specified policies
• Privacy-Preserving Data Mining: Preserves privacy at individual level, allowing accurate data mining models at aggregate level
• Optimal k-anonymization: De-identifies records in a way that maintains truthful data but is not prone to data linkage attacks
• Sovereign Information Integration: Selective, minimal sharing across autonomous data sources, without trusted third party

Active Enforcement
• Privacy Policy: Organizations define a set of policies describing who may access data (users or roles), for what purposes the data may be accessed (purposes) and to whom the data may be disclosed (recipients).
• Consent: Data subjects are given control, through opt-in and opt-out choices, over who may see their data and under what circumstances.
• Disclosure Control: Database enforces privacy policies and data subject consent choices with respect to all data access.
  • Provides cell-level disclosure control.
  • Application modification not required.
  • Database agnostic; does not require changes to the database engine.
• Active Enforcement system intercepts and rewrites incoming queries to comply with policies, subject choices, and context.
• Rewritten queries benefit from all of the optimizations and performance enhancements provided by the underlying engine (e.g. parallelism).

Example result table from the slide (row 2 is absent and some cells show "-"):
#  Name    Age  Phone
1  Adam    25   111-1111
3  Bob     -    333-3333
4  Daniel  40   -

Diagram labels: Policy Creation, Negotiation, Installation, Policy Parser; Patient Preferences & Data Collection; Patient Preferences & Policy Matching; Data Retrieval; Application; Enforcement JDBC/ODBC Driver; DATABASE (Installed Policy, Patient Records)
VLDB 02, WWW 03, VLDB 04

Query Modification Example (Disclose Name only of Patients who have opted-in)

Original query:
    SELECT Name
    FROM Patients
    WHERE Age < 20

Modified query:
    SELECT CASE WHEN EXISTS
             (SELECT Name_Choice
              FROM Patient_Choices
              WHERE Patients.Patient# = Patient_Choices.Patient#
                AND Patient_Choices.Name_Choice = 1)
           THEN Name ELSE null END
    FROM Patients
    WHERE Age < 20
      AND EXISTS
             (SELECT Patient#_Choice
              FROM Patient_Choices
              WHERE Patients.Patient# = Patient_Choices.Patient#
                AND Patient_Choices.Patient#_Choice = 1)
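The rewrite above can be tried end to end. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module rather than the JDBC/ODBC driver the slides describe. The column names patient_id, record_choice and name_choice are hypothetical stand-ins for Patient#, Patient#_Choice and Name_Choice, and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical schema and data; names are simplified from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE patient_choices (
        patient_id    INTEGER PRIMARY KEY,
        record_choice INTEGER,  -- 1 = the patient's row may be disclosed
        name_choice   INTEGER   -- 1 = the Name cell may be disclosed
    );
    INSERT INTO patients VALUES
        (1, 'Adam', 15), (2, 'Bob', 17), (3, 'Carol', 19), (4, 'Daniel', 40);
    INSERT INTO patient_choices VALUES
        (1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 1);
""")

# The query as the application issued it.
ORIGINAL = "SELECT name FROM patients WHERE age < 20 ORDER BY patient_id"

# The same query after rewriting: the row-level choice becomes an extra
# selection condition, and the cell-level choice masks Name with NULL.
REWRITTEN = """
    SELECT CASE WHEN EXISTS (SELECT 1 FROM patient_choices c
                             WHERE c.patient_id = p.patient_id
                               AND c.name_choice = 1)
                THEN p.name ELSE NULL END
    FROM patients p
    WHERE p.age < 20
      AND EXISTS (SELECT 1 FROM patient_choices c
                  WHERE c.patient_id = p.patient_id
                    AND c.record_choice = 1)
    ORDER BY p.patient_id
"""

original_rows = conn.execute(ORIGINAL).fetchall()
rewritten_rows = conn.execute(REWRITTEN).fetchall()
print(original_rows)   # three patients under 20
print(rewritten_rows)  # Carol's row is dropped, Bob's name is masked
```

In this sketch the application still issues only the original query; the two choice columns are the sole place where patient consent is recorded, which is what allows enforcement without modifying the application.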

[Figure: elapsed time (seconds, 0 to 40) versus choice selectivity (%, 0 to 100); series labels: Modified External Multiple, Unmodified, Modified Internal]
• Measured performance of a query selecting all records from a 5 million-record table
• Compared performance of original and modified queries for varied choice selectivity
• Not surprisingly, performance is actually better for modified queries when privacy enforcement is used as an additional selection condition
• Able to use indexes on choice values
• Shows the importance of database-level privacy enforcement for performance

Audit Scenario
• Jane has not been feeling well and decides to consult her doctor.
• The doctor uncovers that Jane's blood sugar level is high and suspects diabetes.
• Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests.
• Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes.
• The doctor must now review disclosures of Jane's information in order to understand the circumstances of the disclosure, and take appropriate action.

Audit Expression
Who has accessed Jane's disease information?
    audit T.disease
    from Customer C, Treatment T
    where C.cid = T.pcid and C.name = 'Jane'

Problem Statement
• Given
  • A log of queries
  • An audit expression specifying sensitive data
• NOT Given
  • Log of data accesses
• Precisely and efficiently identify
  • Those queries that accessed the data specified by the audit expression in the past

Compliance Auditing
• Audits whether particular data has been disclosed in violation of the specified policies.
• Audit expression specifies what potential data disclosures need monitoring.
• Identifies logged queries that accessed the specified data.
• Auditors can analyze the circumstances of violations.
• Make necessary corrections to procedures, policies, security.

Query Audit Log (example):
ID  Timestamp  Query     User        Purpose    Recipient
1   2004-02…   Select …  B. Jones    Marketing  PharmaCo.
2   2004-02…   Select …  S. Roberts  Treatment  S. Roberts

Diagram labels:
• Query with purpose, recipient: generate audit record for each query (Query Audit Log)
• Updates, inserts, delete: database triggers track updates to base tables (Backlog, Data Tables)
• Audit query: Database Layer Audit returns IDs of log queries having accessed data specified by the audit query
VLDB 04
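The idea of identifying logged queries that accessed audited data can be illustrated with a deliberately simplified test: a logged query is flagged when removing the audited rows would change its result. This is a sketch under stated assumptions, not the VLDB 04 algorithm. It replays queries against a single snapshot instead of backlog tables, and the table contents, log entries and the helper names make_db and suspicious_queries are hypothetical.

```python
import sqlite3

def make_db():
    """Build a small snapshot of the Customer and Treatment tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (cid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE treatment (pcid INTEGER, disease TEXT);
        INSERT INTO customer VALUES (1, 'Jane'), (2, 'Kevin');
        INSERT INTO treatment VALUES (1, 'diabetes'), (2, 'asthma');
    """)
    return conn

# Hypothetical query log: (id, user, purpose, SQL text).
QUERY_LOG = [
    (1, "B. Jones", "Marketing",
     "SELECT c.name, t.disease FROM customer c JOIN treatment t ON c.cid = t.pcid"),
    (2, "S. Roberts", "Treatment",
     "SELECT t.disease FROM treatment t WHERE t.pcid = 2"),
    (3, "B. Jones", "Marketing",
     "SELECT c.name FROM customer c"),
]

# Rows named by the audit expression: Jane's disease information.
DELETE_AUDITED_ROWS = """
    DELETE FROM treatment
    WHERE pcid IN (SELECT cid FROM customer WHERE name = 'Jane')
"""

def suspicious_queries(log):
    """Return IDs of logged queries whose result depends on the audited rows."""
    flagged = []
    for query_id, _user, _purpose, sql in log:
        full = make_db()
        with_rows = sorted(full.execute(sql).fetchall())
        reduced = make_db()
        reduced.execute(DELETE_AUDITED_ROWS)
        without_rows = sorted(reduced.execute(sql).fetchall())
        if with_rows != without_rows:
            flagged.append(query_id)
    return flagged

flagged_ids = suspicious_queries(QUERY_LOG)
print(flagged_ids)
```

Query 3 reads Jane's name but not her disease, so it is not flagged; only the query whose result depends on her treatment row is. An auditor would then look at the user, purpose and recipient recorded for the flagged IDs.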

Overhead on Updates
• 7x if all tuples are updated
• 3x if a single tuple is updated
• Negligible by using Recovery Log to build Backlog tables

[Figure: Audit Query Execution Time]

Privacy Preserving Data Mining
• Preserves privacy at the individual patient level, but allows accurate data mining models to be constructed at the aggregate level.
• Adds random noise to individual values to protect patient privacy.
• EM algorithm estimates original distribution of values given randomized values + randomization function.
• Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
Diagram labels: individual values (Kevin's LDL, Kevin's weight, Julie's LDL; e.g. 126 | 210 | ... and 128 | 130 | ...) pass through a Randomizer (e.g. 126+35 = 161), giving 161 | 165 | ... and 129 | 190 | ...; Reconstruct distribution of LDL; Reconstruct distribution of weight; Data Mining Algorithms; Data Mining Model
Sigmod00, KDD02, Sigmod05
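The randomize-then-reconstruct idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the noise is Gaussian, the candidate true values are a fixed grid, the sample values are invented, and the function names randomize, reconstruct and variance are hypothetical. The update inside reconstruct is an iterative Bayesian (EM-style) estimate of the original distribution from the noisy values and the known noise density.

```python
import math
import random

def randomize(values, sigma, rng):
    """Add independent zero-mean Gaussian noise to every value."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def reconstruct(randomized, sigma, grid, iterations=100):
    """Estimate the probability of each grid point as the original value."""
    # likelihood[i][k] is proportional to P(observed w_i | original = grid[k]).
    likelihood = [[math.exp(-((w - a) ** 2) / (2 * sigma ** 2)) for a in grid]
                  for w in randomized]
    estimate = [1.0 / len(grid)] * len(grid)  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * len(grid)
        for row in likelihood:
            weights = [l * p for l, p in zip(row, estimate)]
            total = sum(weights)
            for k, weight in enumerate(weights):
                updated[k] += weight / total  # posterior over grid points
        estimate = [u / len(randomized) for u in updated]
    return estimate

def variance(points, weights):
    """Variance of a discrete distribution given points and probabilities."""
    mean = sum(p * w for p, w in zip(points, weights))
    return sum(w * (p - mean) ** 2 for p, w in zip(points, weights))

rng = random.Random(0)
true_values = [120.0] * 300 + [210.0] * 300  # invented sample values
noisy = randomize(true_values, 30.0, rng)    # what the collector sees
grid = [float(g) for g in range(20, 311, 10)]
estimate = reconstruct(noisy, 30.0, grid)

uniform = [1.0 / len(noisy)] * len(noisy)
true_var = variance(true_values, uniform)
noisy_var = variance(noisy, uniform)
estimated_var = variance(grid, estimate)
print(true_var, noisy_var, estimated_var)
```

Only the noisy values and the noise parameters enter reconstruct, so no individual's original value is needed to recover the aggregate shape.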

Optimal k-Anonymization
Goal: De-identify patient data such that it retains its integrity, but is resistant to data linkage attacks.
Motivation: Naïve de-identification methods are prone to data linkage attacks, which combine subject data with publicly available information to re-identify represented individuals.

Samarati and Sweeney k-Anonymity* Method
A k-anonymized data set has the property that each record is indistinguishable from at least k-1 other records within the data set.

Optimal k-Anonymization
We have developed a k-anonymization algorithm that finds optimal k-anonymizations under two representative cost measures and variations of k.

Process of k-Anonymization
• Data Suppression: Involves deleting particular cell values or entire tuples.
• Value Generalization: Entails replacing specific values, such as a telephone number, with more general ones, such as the area code alone.

Advantages of Optimal k-anonymization
• Truthful: Unlike other disclosure protection techniques that use data scrambling, swapping, or adding noise, all information within a k-anonymized dataset is truthful.
• Secure: More secure than other de-identification methods, which may inadvertently reveal confidential information.

Original data:
Name   Address               City   Age  Diagnosis
Eric   7, rue du Mont Dore   Paris  26   Influenza
Paul   13, rue des Canettes  Paris  42   Hypertens.
Marc   48, rue du Four       Paris  47   Diabetes
Henri  21, rue du Mont Dore  Paris  28   Asthma

k-anonymized data (k=2, on name, address, age):
Name  Address       City   Age    Diagnosis
*     17th Arrond.  Paris  20-29  Influenza
*     6th Arrond.   Paris  40-49  Hypertens.
*     6th Arrond.   Paris  40-49  Diabetes
*     17th Arrond.  Paris  20-29  Asthma

* P. Samarati and L. Sweeney. "Generalizing Data to Provide Anonymity when Disclosing Information." In Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, 188, 1998.
ICDE05
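The k-anonymity property itself is easy to check mechanically. The sketch below applies suppression and generalization to the slide's four records and verifies the result for k=2. It is only a checker with a hand-chosen generalization, not the optimal search algorithm of the ICDE05 work; the function names and the street-to-district lookup (copied from the slide's example rows) are assumptions made for illustration.

```python
from collections import Counter

# Lookup taken from the slide's example rows; hypothetical beyond those rows.
STREET_TO_DISTRICT = {
    "7, rue du Mont Dore": "17th Arrond.",
    "21, rue du Mont Dore": "17th Arrond.",
    "13, rue des Canettes": "6th Arrond.",
    "48, rue du Four": "6th Arrond.",
}

ORIGINAL = [
    {"name": "Eric", "address": "7, rue du Mont Dore", "age": 26, "diagnosis": "Influenza"},
    {"name": "Paul", "address": "13, rue des Canettes", "age": 42, "diagnosis": "Hypertens."},
    {"name": "Marc", "address": "48, rue du Four", "age": 47, "diagnosis": "Diabetes"},
    {"name": "Henri", "address": "21, rue du Mont Dore", "age": 28, "diagnosis": "Asthma"},
]

QUASI_IDENTIFIERS = ("name", "address", "age")

def generalize(row):
    """Suppress the name, generalize address to district and age to a decade."""
    decade = row["age"] // 10 * 10
    return {
        "name": "*",
        "address": STREET_TO_DISTRICT[row["address"]],
        "age": f"{decade}-{decade + 9}",
        "diagnosis": row["diagnosis"],  # sensitive value is kept truthful
    }

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

anonymized = [generalize(row) for row in ORIGINAL]
print(is_k_anonymous(ORIGINAL, QUASI_IDENTIFIERS, 2))
print(is_k_anonymous(anonymized, QUASI_IDENTIFIERS, 2))
```

Every value in the anonymized rows remains true of the patient it describes, which is the "truthful" property the slide contrasts with noise-based methods.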

Sovereign Information Integration
• Separate databases due to statutory, competitive, or security reasons.
  • Selective, minimal sharing on a need-to-know basis.
• Example: Among those patients who took a particular drug, how many with a specified DNA sequence had an adverse reaction?
  • Researchers must not learn anything beyond counts.
• Algorithms for computing joins and join counts while revealing minimal additional information.

Minimal Necessary Sharing
R = {a, u, v, x}, S = {b, u, v, y}
• R ∩ S = {u, v}
  • R must not know that S has b and y
  • S must not know that R has a and x
• Count (R ∩ S)
  • R and S do not learn anything except that the result is 2.
Diagram labels: DNA Sequences, Drug Reactions, Medical Research Inst.
Sigmod 03, DIVO 04
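One known way to compute an intersection count without exchanging plaintext values is commutative encryption, where each party raises hashed values to a private exponent modulo a prime. The sketch below simulates both parties in one process using the slide's sets R and S. It is an illustration under stated assumptions, not the authors' implementation: the modulus, the hash-to-group mapping and all function names are hypothetical toy choices, and the parameters are not suitable for real use.

```python
import hashlib
import math
import random

P = 2 ** 127 - 1  # a Mersenne prime used as a toy modulus (assumption)

def hash_to_group(value):
    """Map a string to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key(rng):
    """Pick a private exponent coprime to P-1 so encryption is a permutation."""
    while True:
        exponent = rng.randrange(3, P - 1)
        if math.gcd(exponent, P - 1) == 1:
            return exponent

def encrypt(elements, key):
    """Commutative encryption: (x^a)^b == (x^b)^a modulo P."""
    return [pow(x, key, P) for x in elements]

def intersection_count(r_values, s_values, rng):
    key_r, key_s = make_key(rng), make_key(rng)
    # Each party encrypts its own hashed values; sorting hides the input order.
    r_once = sorted(encrypt([hash_to_group(v) for v in r_values], key_r))
    s_once = sorted(encrypt([hash_to_group(v) for v in s_values], key_s))
    # The lists are exchanged and each party applies its key to the other's list.
    r_twice = set(encrypt(r_once, key_s))  # computed by S
    s_twice = set(encrypt(s_once, key_r))  # computed by R
    # Equal plaintexts give equal double encryptions, so only the count leaks.
    return len(r_twice & s_twice)

count = intersection_count(["a", "u", "v", "x"], ["b", "u", "v", "y"], random.Random(1))
print(count)
```

Because neither party ever sees the other's values under fewer than one layer of the other's key, the doubly encrypted lists can be compared for equality without revealing which values matched.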

Recommendations
• Policy Makers & Legislators
  • Continuous technology monitoring and understanding to inform policies and laws (current and new)
  • Invest in research
• Solution Developers (Technologists)
  • Design-in ethical considerations (e.g. respect for privacy, safeguard against misuse); they can't be afterthoughts
  • Engage in dialog with policy makers and legislators to educate them on performance implications of the policies/laws

Recommendations for Researchers
"Asking questions is easy: it's answering them that's hard."

Policy Specification • How to determine if the policy specification accurately captures the intent of the policy maker? (The person specifying the policy is usually not a computer scientist.) • How to help the patient understand the policy and the implications of his or her choices? • How to design a policy language that reconciles the goals of understandability and efficient computation?

Sticky Policies
• Healthcare organizations should be assured that original policy controls will be enforced over data after transfer to other entities.
• Transferees of patient data should be capable of applying source disclosure policies to any information in their databases.
• Database should enforce source and enterprise policies and resolve any conflicts among policies.
Diagram labels: Hospital 1 (Patient Records DB; policies, patient data); patient data + policy annotations; Hospital 2 (data compliant with source and enterprise policies)

Data Pointillism
• > 14B records with ChoicePoint
• Data from > 22,000 sources in RDC's GRID
• > 550 companies compiling databases of private information
• Accuracy? Limits?
• How to allow someone to verify data?
• Identifying and correcting errors?
• Usage control?

Massively Distributed Data Management
Example devices: 512MB SanDisk Cruzer, $47.99; Transcend 40GB Portable Hard Disk (USB, 95mm x 71.5mm x 15mm), $189
• What if patient data is stored on personal devices?
• Pervasive monitoring devices will also collect patient data.
• How to protect the security of these devices?
• Enable selective sharing of information stored on devices?
• Distributed backup in the network to prevent data loss?

Data Life Cycle Management
• Healthcare organizations must define data retention policies based on legal requirements and patient specifications:
  • HIPAA: 6 years (21 years for pediatric care)
  • Medicare: 5 to 7 years
  • AHA & AHIMA: at least 10 years
• Data compression vs. encryption
• How to remove expired data and forget persistent data?
• How to establish truthfulness of data?

Interoperability
• Sovereign health information systems must be able to communicate with one another, using standard data formats and clinical vocabularies.
• Examples of current efforts include:
  • HL7 messaging standards
  • SNOMED-CT vocabularies
  • CDA and CCR document standards
• Much work remains to be done to make systems interoperable.
• Mass collaboration might be useful in defining clinical vocabularies and taxonomies.

Concluding Remarks
• Hippocratic Database technologies protect the security of electronic health records and patient privacy without impeding the flow of information.
• We need not sacrifice security or privacy to gain value from EHRs for diagnosis, treatment, and research.
• We must focus on:
  • Deriving value from bits we know how to manage.
  • Demonstrating what could not be done before.

Thank you!
Papers: rakesh.agrawal-family.com
Collaborations: rakesh.agrawal@microsoft.com, johnsocm@us.ibm.com

References

Active Enforcement
• R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. "Hippocratic Databases." 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
• K. Lefevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, D. DeWitt. "Limiting Disclosure in Hippocratic Databases." Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Compliance Auditing
• R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau and R. Srikant. "Auditing Compliance with a Hippocratic Database." Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Privacy-Preserving Data Mining
• R. Agrawal and R. Srikant. "Privacy-Preserving Data Mining." Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
• A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. "Privacy Preserving Mining of Association Rules." Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.

References (continued)

Optimal k-Anonymization
• R. J. Bayardo and R. Agrawal. "Data Privacy Through Optimal k-Anonymization." To appear in Proc. of the 21st Int'l Conf. on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.

Sovereign Information Integration
• R. Agrawal, A. Evfimievski, R. Srikant. "Information Sharing Across Private Databases." ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
• R. Agrawal, D. Asonov and R. Srikant. "Enabling Sovereign Information Sharing Using Web Services." Proc. of the ACM SIGMOD Conference on Management of Data, Paris, France, June 2004.
