
Privacy Policy Evaluation






Presentation Transcript


  1. Privacy Policy Evaluation A tool for automatic natural language privacy policy analysis

  2. The Aim and the Look and Feel

  3. Purposes • A tool to help the user perform a privacy level assessment of a website • The tool should be able to: • analyze the privacy policies of a web site (written in natural language) • verify their compliance with the Privacy Principles (PP) • e.g. the ones specified by the EU • verify their compliance with the user's preferences

  4. [Architecture diagram] Components: user side and server side; SPARCLE Policy Generator; user's privacy preferences DB; natural-language privacy policy (defined and published by the web site, which saves data); set of goals; text analysis; privacy best practice and EU privacy principles DB; goal extraction (XACML); privacy enforcement; privacy evaluation.

  5. [Architecture diagram] Components: user side (end-user defines preferences, stored in a DB) and server side (web server publishes the natural-language privacy policy and saves data); text mining; EU privacy principles DB; principles-compliance checking; privacy enforcement; compliance results.

  6. User’s Properties Settings

  7. User's Preferences View

  8. Privacy Policy Results View

  9. Example of a real privacy policy for a hospital • Example 1: • O'Connor (hospital) is the sole owner and user of all information collected on the Web site. • O'Connor does not sell, give, or disclose the information (statistical or personal) gathered on the Web site to any third party. • Example 2: • We may use and disclose medical information about you without your consent if necessary for reviews preparatory to research, but none of your medical information would be removed from the Hospital in the course of such reviews.

  10. (Key) Action list

  11. Example of goal • Approaches • Looking for keywords (i.e. look for the verbs we are interested in) • Full-text analysis (i.e. analyze each sentence and verify which goals, if any, it corresponds to)

  12. The Architecture

  13. The Architecture [diagram] Components: privacy policy document; user's settings; privacy evaluation GUI; sentence splitter; machine learning module; text mining; sentence classification; sentence layout organization; results shown to the user.

  14. Text Mining

  15. Text Mining versus Data Mining • Text mining is different from data mining • Why? • According to Witten, 'text mining' denotes any system that analyzes large quantities of natural-language text and detects lexical or linguistic usage patterns • Data mining looks for patterns in data, while text mining looks for patterns in text • In data mining the information is hidden and automatic tools are needed to discover it, while in text mining the information is present and clear, not hidden at all • In text mining the problem is the human resource: it is impossible for humans to read the whole text by themselves • A good output from text mining is a comprehensible summary of the salient features in a large body of text [1] I. H. Witten, "Text mining," in Practical Handbook of Internet Computing, Boca Raton, Florida: Chapman & Hall/CRC Press, 2005, pp. 1-22.

  16. Bag of Words • A document can be represented as a bag of words: • i.e. a list of all the words of the document with a counter of how often each word appears in it • Some words (the stop words) are discarded • A query is seen as a set of words • The documents are consulted to verify which ones satisfy the query • A relevance-ranking technique makes it possible to assess the importance of each term with respect to a collection of documents or to a single document • this allows each document to be ranked against the query • In web search engines the query is usually composed of a few words, while in expert information-retrieval systems queries may be far more complex
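The bag-of-words representation and query matching described in this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the tool: the stop-word list, function names, and scoring rule are simplified assumptions.

```python
from collections import Counter
import re

# Assumed toy stop-word list; a real system would use a full list
# (e.g. the etc/English.stp file mentioned later in these slides).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "on", "or", "we", "with"}

def bag_of_words(text):
    """Tokenize, drop stop words, and count how often each word appears."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def match_score(bag, query_words):
    """Crude relevance score: total occurrences of the query words."""
    return sum(bag[w] for w in query_words)

doc = bag_of_words("We may share your data with third parties. Data is stored securely.")
print(doc["data"])                           # 2
print(match_score(doc, ["share", "data"]))   # 3
```

A real ranking would weight terms (e.g. TF-IDF) rather than use raw counts, but the counting structure is the same.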

  17. Text categorization (1) • Is the assignment of documents (or parts of documents) to predefined categories according to their contents • Automatic text categorization helps detect the topics that a document covers • the dominant approach is to use machine learning to infer categories automatically from a training set of pre-classified documents • the categories are simple labels with no associated semantics • in our case the categories will be the individual privacy principles we are interested in • Machine learning techniques for text categorization are based on rules and decision trees

  18. Text categorization (2) • Example for the principle "Transfer to Third Parties" • if (share & data) • or (sell & data) • or (third party & data) • then "THIRD PARTY" • Rules like this can be produced using standard machine learning techniques • The training data contains a substantial sample of documents (paragraphs) for each category • Each document (or paragraph) is a positive instance for the category it represents and a negative one for all the other categories • Typical approaches extract features from each document and use the feature vectors as input to a scheme that learns how to classify documents • The keywords can be the features and the word occurrences can be the feature values. Using the vector of features it is possible to build a model for each category • the model predicts whether or not a category can be associated with a given document, based on the words in it and on their occurrence • With the bag of words the semantics is lost (only words are accounted for), but experiments show that adding semantics does not significantly improve the results
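The keyword rule above for "Transfer to Third Parties" can be written down directly. This is a toy sketch (the function name and plain substring matching are assumptions); a learned classifier would induce such rules from the training set instead of hard-coding them.

```python
def is_third_party(sentence):
    """Rule from the slide: (share & data) or (sell & data) or (third party & data)."""
    s = sentence.lower()
    return "data" in s and any(k in s for k in ("share", "sell", "third party"))

print(is_third_party("We may share your data with partners."))  # True
print(is_third_party("We protect your privacy at all times."))  # False
```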

  19. Approach to solve our problems • Category Definition • Define the categories we are interested in • e.g. the privacy principles • Annotation • Select a certain number of privacy policies (verify in the literature the correct number) and annotate the sentences (in each policy) related to each category • Bag of words • for each category define the bag of words (this can be an automated process) • e.g. security has: {https, login, password, security, protocol, …} • Training set • The annotated documents are our training set • Machine Learning • Apply machine learning algorithms to learn the classification rules • Usage • use the classification rules to classify the sentences of new privacy policies into the right category
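As a rough, stdlib-only illustration of the bag-of-words, training, and usage steps above: build one bag of words per category from annotated sentences, then classify a new sentence by cosine similarity to each category bag. The slides use Weka for the real classifier; everything here (function names, toy training data) is an assumption for illustration only.

```python
from collections import Counter, defaultdict
import math
import re

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(annotated):
    """annotated: list of (category, sentence) pairs -> one word bag per category."""
    bags = defaultdict(Counter)
    for category, sentence in annotated:
        bags[category].update(tokens(sentence))
    return bags

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(bags, sentence):
    """Assign the new sentence to the category with the most similar bag."""
    bag = Counter(tokens(sentence))
    return max(bags, key=lambda c: cosine(bags[c], bag))

# Hypothetical two-category training set.
training = [
    ("THIRD PARTY", "we do not sell or share your data with third parties"),
    ("SECURITY", "passwords and https protect your login data"),
]
model = train(training)
print(classify(model, "your password is sent over a secure https protocol"))  # SECURITY
```

A real pipeline would add stop-word removal, stemming, and a learned model per category, as the later slides describe.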

  20. 1. Category Definition (a) • The categories are chosen as “privacy principle” adapted from • Directive 95/46/EC • Directive 2002/58/EC 2002 • and the OECD Guidelines

  21. 1. Category Definition (b)

  22. 2. Annotation (a) • One problem for the annotation part is verifying the quality of the annotators • Usually, the authors of the experiment are also the ones annotating the documents for the training set • Their quality has to be assessed by comparing their annotations with those of other users • A way to do this is presented in the following slides

  23. 2. Annotation (b) • Ask 4-5 persons to annotate the same privacy policy • For each sentence, annotate whether it is relevant for one of the ten principles (categories) • Build a similarity matrix to verify how close the annotators are to each other • The similarity can be measured with an F-value (or a cosine-similarity measure) • Bad if < 0.5 • more or less OK if between 0.5 and 0.75 • Very good if > 0.75

  24. 2. Annotation (c) • If the annotations of A0 and A1 (i.e. the annotations of the authors of the experiment) are on average good, then the authors can be considered good annotators • Let Fxy be the value in the cell AxAy (the agreement coefficient)

  25. 2. Annotation (d) • How can we compute Fxy, the agreement value between the annotators Ax and Ay (regarding a given category C)? • If • a = number of paragraphs both Ax and Ay say belong to category C • b = number of paragraphs where Ax says they belong to C and Ay says no • c = number of paragraphs where Ax says they do not belong to C and Ay says yes • d = number of paragraphs where both Ax and Ay say they do not belong to C • then, taking Ax as the reference annotator, Fxy can be computed from: • Recall = a / (a + b) • Precision = a / (a + c) • F-Value = 2 * Precision * Recall / (Precision + Recall) • Kappa-Value = (Po - Pe) / (1 - Pe), where n = a + b + c + d, the observed agreement is Po = (a + d) / n, and the chance agreement is Pe = ((a + b)(a + c) + (c + d)(b + d)) / n^2 • Maybe the recall is the most useful formula in our case
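Under the definitions of a, b, c, and d above, the agreement measures can be computed as follows. This is a sketch; the kappa term uses the standard Cohen's kappa chance-agreement formulation, and the function name is an assumption.

```python
def agreement(a, b, c, d):
    """Agreement measures from a 2x2 contingency table:
    a = both say yes, b = Ax yes / Ay no, c = Ax no / Ay yes, d = both say no.
    Ax is taken as the reference annotator."""
    recall = a / (a + b)
    precision = a / (a + c)
    f_value = 2 * precision * recall / (precision + recall)
    n = a + b + c + d
    p_o = (a + d) / n                                         # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)   # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return recall, precision, f_value, kappa

# Hypothetical counts: 8 shared positives, 2 disagreements each way, 8 shared negatives.
r, p, f, k = agreement(a=8, b=2, c=2, d=8)
print(round(f, 2), round(k, 2))  # 0.8 0.6
```

With these numbers the F-value of 0.8 would fall in the "very good" band from the previous slide, while kappa, which discounts chance agreement, is lower.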

  26. 3. Bag of Words • A set of words relevant to the document • Bag-of-words selection (a.k.a. feature selection) can be made by selecting all the words of a document (in our case a document is a sentence) • Each word of the bag of words should be given a weight • Such a process can be performed automatically

  27. 4. Training Set • The training set is the set of privacy policies we annotate and to which we apply machine learning algorithms • The policies have to be chosen according to one of two criteria: • belonging to the same domain (e.g. e-health) • belonging to domains as different as possible • The size of the training set needs to be defined

  28. 5. Machine Learning • WEKA can be used for machine learning, and WordNet for the relationships between words • Cyc is a tool providing an ontology of all common-sense world knowledge • not recommended • First of all, use the cosine similarity to make a sort of pre-test • Look at the normal decision tree, which should give rules as its result (??)

  29. Tools • Weka • TagHelper Tools • download the program + the documentation • Repeat the example shown in TagHelperManual.ppt to understand how it is possible to create a model starting from annotated documents • The file AccessMyrecordsPrivacyPolicy.xls is the annotation result for the privacy policy available at http://www.accessmyrecords.com/privacy_policy.htm • The model has been built on these annotations • You can test the model using the file EMC2.xls

  30. Master Thesis Requirements

  31. Task 1 - Objectives • Literature study on text mining • I. H. Witten, "Text mining," in Practical Handbook of Internet Computing, Boca Raton, Florida: Chapman & Hall/CRC Press, 2005, pp. 1-22. • Elisa will provide Yuanhao with a defined set of categories (privacy principles) • and guidelines to follow to associate each sentence of a privacy policy with a category • Yuanhao will select 2 privacy policies (e-health domain) and will proceed to annotate them according to the categories and guidelines indicated above • Elisa will also annotate these privacy policies • Another 4 persons need to be chosen to annotate the 2 above-mentioned privacy policies (these persons will be provided with the same annotation guidelines mentioned above)

  32. Task 1 - OUTPUT • At the end of Task 1 the 6 annotators (Elisa, Yuanhao + 4) will provide annotated privacy policies in the following format: • an xls file with two columns. The first column indicates the category (e.g. choice, use, security) and the second column indicates the sentence of the privacy policy associated with that category • Note that every sentence of the privacy policy needs to be associated with a category

  33. TASK 2 - Objectives • Validation of Elisa and Yuanhao as annotators • The method described in slides Annotation (c) and (d) needs to be applied • The annotators need to be compared • The method to do that needs to be formalized • i.e. a measure to verify the agreement level between two annotators needs to be found (e.g. precision, recall, F-measure)

  34. Task 2 - OUTPUT • Elisa and Yuanhao are shown to be valid annotators • They can start to annotate the training set of privacy policies

  35. HIGH PRIORITY Task 3 - Objectives • Annotate the privacy policies for the training set • 10 privacy policies (in the e-health domain) may be sufficient (find support in the literature) • Build up the model using machine learning techniques • OUTPUT • The model for privacy policy categorization

  36. HIGH PRIORITY Task 3: Suggestions • Read the paper (very useful) • F. Sebastiani, "Machine learning in automated text categorization" • The TagHelper tool can be useful (it is an open-source tool) • Read the paper for technical details about it (not everything is of interest for us): • C. Rosé et al., "Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning" • Look at the classes FeatureSelector and EnglishTokenizer to see how the tool treats stop words and how it creates its own bag of words • etc/English.stp contains the file with the English stop words • This system uses WEKA classifiers • Rainbow • a program that performs statistical text classification. It is based on the Bow library. For more information about obtaining the source and citing its use, see the Bow home page. • Verify whether other tools are suitable to execute the following steps: • Stop-word elimination • Stemming • Bag-of-words (feature) extraction • Word reduction (if necessary) • Machine learning model (classifier construction: verify which algorithm is better to use) • Also, start writing the report about this so I can read it once back

  37. HIGH PRIORITY TASK 4 – Objectives • Apply the model (output of task 3) to new privacy policies • Measure and document the quality of the model • Precision • Recall • F-Value

  38. TASK 5 - Objectives • Build up the GUI able to analyze privacy policies and show the different categories (principles) covered by each policy • The GUI may be web-based • The user should have the possibility to set the categories (privacy principles) he is interested in verifying • The GUI should be based on the look & feel shown in the very first slides of this presentation
