A semantic based methodology to classify and protect sensitive data in medical records

Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano. Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy

Rationale • Introduction to challenges in e-health; • Motivation and open challenges; • Proposal of access control policies; • Methodology to extract the relevant information to protect and to apply the proper security policy; • A case study; • Conclusions and future work

The Electronic Health • E-Health challenges: • To provide value-added services to healthcare actors (patients, doctors, etc.); • To enhance the efficiency and reduce the costs of complex information systems. • The term e-health covers many meanings; we focus on those aspects of telemedicine that involve not only technological but also procedural aspects; • In particular, we are witnessing a gradual adoption of innovative IT solutions for e-health but, at present, the major open issue is the coexistence of two different domains:

The coexistence of old and new systems from a security point of view • Modern eHealth systems are designed to enforce fine-grained access control policies, and medical records are structured a priori so that the different fields can be managed properly, but... • eHealth is also applied in contexts where new information systems have not yet been developed and "documental systems" have been introduced in some form instead. As a result, documental systems today let users access a digitized version of a medical record whose critical parts have not been classified beforehand.

Unstructured medical record data and actors • Actors are not aware that structuring data is important for data processing and protection. • Security problem • Private data (the critical parts) can be accessed by unauthorized actors. • Fine-grained access control cannot be enforced on digitized unstructured documents. • Solution • Extract the relevant information from the records; • Enforce access control policies.

Motivation and our proposal • The problem: "documental systems" allow access to the digitized version of a medical record (unstructured data) whose critical parts have not been classified beforehand. • We propose a semantic-based method to locate the resource being accessed and to associate with it the proper security rule to apply. • The access control model is still based on fine-grained data classification.

Semantic method for resource classification • Knowledge extraction by means of several text analysis methodologies. [Figure: the four steps of the method, with a running example]

Step 1 - Text Preprocessing: Tokenization and Normalization • Goal: • extraction of the relevant lexical units. • Text tokenization: • segmentation of a sentence into minimal units of analysis (tokens); • disambiguation of punctuation marks, aimed at token separation; • separation of continuous strings (i.e. strings not separated by blank spaces) that are to be considered independent tokens: for example, the Italian string "c'era" contains two independent tokens (c' + era). This segmentation can be performed by special tools, called tokenizers, which include glossaries of well-known expressions to be regarded as medical domain tokens and mini-grammars containing heuristic rules that regulate token combinations. • Text normalization: • variations of the same lexical expression should be reported in a unique way: • (i) words that take on a different meaning depending on whether they are written in lowercase or capital letters; • (ii) acronyms and abbreviations ("USA" or "U.S.A.").
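The tokenization and normalization rules above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tokenizer used by the authors: the regular expression, the function names `tokenize` and `normalize`, and the choice to lowercase everything except acronyms are all hypothetical, and the glossary of medical expressions and the mini-grammars are left out.

```python
import re

# Assumed acronym shape: two or more "letter + dot" groups, e.g. "U.S.A."
ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.){2,}")

# Alternatives are tried left to right: dotted acronym, elided form ("c'"),
# plain word, single punctuation mark.
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+['’]|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens; an elided form such as "c'era" yields c' + era."""
    return TOKEN_RE.findall(text)


def normalize(token):
    """Map variants of the same expression to a single form."""
    if ACRONYM_RE.fullmatch(token):
        # "U.S.A." and "u.s.a." both become "USA"
        return token.replace(".", "").upper()
    if len(token) > 1 and token.isalpha() and token.isupper():
        # All-caps words are treated as acronyms and left unchanged
        return token
    return token.lower()


tokens = tokenize("C'era U.S.A.")
normalized = [normalize(t) for t in tokens]
```

Lowercasing every non-acronym token is a simplification: case (i) above would need a list of words whose meaning depends on capitalization, which this sketch does not model.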

Step 2 - Morpho-syntactic analysis: POS tagging and Lemmatization • Goal: • extraction of word categories. • Part-of-speech (POS) tagging: • assignment of a grammatical category (noun, verb, etc.) to each lexical unit; • word-category disambiguation: the vocabulary of the documents of interest is compared with an external lexical resource; • Key-Word In Context (KWIC) analysis. • Lemmatization: • reduction of inflected forms to their respective lemma.
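As a rough illustration of this step, the sketch below looks tokens up in a tiny lexicon to obtain a lemma and a POS tag, and produces KWIC lines for a keyword. The `LEXICON` dictionary, the tag names, and both function names are hypothetical stand-ins for the external lexical resource mentioned above; a real system would rely on a full lexical resource and a trained tagger.

```python
# Hypothetical mini-lexicon: inflected form -> (lemma, POS tag)
LEXICON = {
    "patient": ("patient", "NOUN"),
    "patients": ("patient", "NOUN"),
    "reported": ("report", "VERB"),
    "reports": ("report", "VERB"),
    "pains": ("pain", "NOUN"),
}


def tag_and_lemmatize(tokens):
    """Return (token, lemma, pos) triples; unknown words get the tag "UNK"."""
    result = []
    for token in tokens:
        lemma, pos = LEXICON.get(token.lower(), (token.lower(), "UNK"))
        result.append((token, lemma, pos))
    return result


def kwic(tokens, keyword, window=2):
    """Key-Word In Context: (left context, keyword, right context) per hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


tokens = "the patient reported chest pains".split()
tagged = tag_and_lemmatize(tokens)
contexts = kwic(tokens, "reported", window=1)
```

The KWIC lines give an analyst the neighbouring words needed to disambiguate forms such as "reports", which the lexicon alone cannot assign to noun or verb.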

Step 3 - Relevant Terms Recognition • Goal: • identification of the terms useful to characterize the sections of interest. • TF-IDF (Term Frequency - Inverse Document Frequency): relevant lexical items are frequent and concentrated in few documents. $W_{t,d} = f_{t,d} \cdot \log(N / D_t)$, where $f_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the number of documents in the corpus and $D_t$ is the number of documents containing $t$. • term frequency (tf) corresponds to the number of times a given term occurs in the resource; • inverse document frequency (idf) concerns the distribution of the term across all the sections of the medical records: it relies on the principle that the importance of a term is inversely proportional to the number of documents in the corpus where the term occurs.
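The weighting formula above can be computed directly, as in the following sketch. It assumes that each section of a medical record is treated as one document, that sections are already tokenized, and that the logarithm is natural (the slides do not state the base); the function name `tf_idf` and the sample sections are illustrative only.

```python
import math
from collections import Counter


def tf_idf(sections):
    """Compute W[t,d] = f[t,d] * log(N / D_t) for every term of every section.

    sections: list of token lists, one per section (document).
    Returns one {term: weight} dictionary per section.
    """
    n_docs = len(sections)
    # D_t: number of sections in which term t occurs at least once
    doc_freq = Counter()
    for tokens in sections:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in sections:
        term_freq = Counter(tokens)  # f[t,d]: occurrences of t in this section
        weights.append({
            term: freq * math.log(n_docs / doc_freq[term])
            for term, freq in term_freq.items()
        })
    return weights


# Made-up sample sections, for illustration only
sections = [
    ["fever", "fever", "patient"],
    ["patient", "diagnosis"],
    ["patient", "therapy"],
]
weights = tf_idf(sections)
```

In this sample, "patient" occurs in every section and therefore gets weight zero, while "fever", which is frequent in a single section, gets the highest weight there.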

Step 4 - Identification of Concepts of Interest • Goal: • cluster the relevant terms into synsets (sets of semantically equivalent terms) in order to associate each section with its semantic concept.
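One simple way to picture this step is a lookup of the relevant terms in a table of synsets, followed by a vote for the concept with the most matching terms. The sketch below assumes exactly that; the `SYNSETS` table, the concept names and the majority rule are hypothetical, and the slides do not say how the authors build or score their synsets.

```python
# Hypothetical synsets: concept -> set of semantically equivalent lemmas
SYNSETS = {
    "diagnosis": {"diagnosis", "diagnostic", "finding"},
    "therapy": {"therapy", "treatment", "cure"},
}


def group_by_concept(terms):
    """Group the relevant terms of a section by the concept whose synset contains them."""
    groups = {}
    for term in terms:
        for concept, synset in SYNSETS.items():
            if term in synset:
                groups.setdefault(concept, []).append(term)
    return groups


def section_concept(terms):
    """Pick the concept with the most matching terms, or None if nothing matches."""
    groups = group_by_concept(terms)
    if not groups:
        return None
    return max(groups, key=lambda concept: len(groups[concept]))


relevant_terms = ["treatment", "cure", "finding", "fever"]
groups = group_by_concept(relevant_terms)
concept = section_concept(relevant_terms)
```

Here the section is labelled "therapy" because two of its relevant terms fall in that synset and only one in "diagnosis".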

Security Policies • At the end of the semantic analysis process, a medical record can be seen as composed of several sections (resources) that can be properly protected; • A security policy is a set of rules structured as an ACL of triples $(s_j, a_i, r_k)$, where: • $s_j \in S = \{s_1, \ldots, s_m\}$, the set of actors; • $a_i \in A = \{a_1, \ldots, a_h\}$, the set of actions; • $r_k \in R = \{r_1, \ldots, r_h\}$, the set of resources.
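A policy of this shape maps naturally onto a set of (actor, action, resource) triples. The following sketch assumes a default-deny reading, in which a request is granted only if a matching rule exists; the actor, action and resource names are invented for illustration and do not come from the case study.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """One ACL entry (s_j, a_i, r_k)."""
    actor: str
    action: str
    resource: str


# Hypothetical policy; the names below are illustrative only
POLICY = {
    Rule("doctor", "read", "diagnosis"),
    Rule("doctor", "write", "diagnosis"),
    Rule("nurse", "read", "therapy"),
}


def is_allowed(policy, actor, action, resource):
    """Default deny: grant access only when an explicit rule matches."""
    return Rule(actor, action, resource) in policy


doctor_can_write = is_allowed(POLICY, "doctor", "write", "diagnosis")
nurse_can_read_diagnosis = is_allowed(POLICY, "nurse", "read", "diagnosis")
```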

Medical Record Policy (Use Case) [Figure: policy for the medical record, relating actors, actions and resources]

Action-actors identification • Given the policy and a resource $r^* \in R$, it is easy to locate the set of all allowed rules: $L_{r^*} = \{(s_j, a_i, r^*) \mid r^* \in R,\ a_i \in A^* \subseteq A,\ s_j \in S^* \subseteq S\}$
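The set of allowed rules for a resource is a simple filter over the policy. The sketch below assumes the policy is stored as a set of (actor, action, resource) tuples; the function name `allowed_rules` and the sample entries are hypothetical. It also derives the subsets S* and A* of actors and actions that appear in the selected rules.

```python
def allowed_rules(policy, resource):
    """Return L_r*: every rule of the policy that refers to the given resource."""
    return {(s, a, r) for (s, a, r) in policy if r == resource}


# Hypothetical policy as (actor, action, resource) triples
policy = {
    ("doctor", "read", "diagnosis"),
    ("doctor", "write", "diagnosis"),
    ("nurse", "read", "therapy"),
}

rules = allowed_rules(policy, "diagnosis")
actors = {s for (s, _, _) in rules}   # S*: actors allowed on the resource
actions = {a for (_, a, _) in rules}  # A*: actions allowed on the resource
```

A resource that no rule mentions yields an empty set, which under a default-deny reading means that nobody may access it.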

System behavior: an example

Conclusions and Future work • We have proposed a semantic approach to classify document parts (resources) from a security point of view; • It makes it possible to associate a set of security rules with the resources; • It is a promising method that can strongly help in facing the security issues that arise once data are made available to new potential applications. • Future work: • To validate the methodology in other e-government fields; • To implement a system that extracts and classifies resources online and enforces fine-grained policies with acceptable performance.
