Presentation Transcript

  1. Bettina Berendt, Dept. Computer Science, K.U. Leuven. WEB MINING and PRIVACY: foes or friends? – with special attention to location (SPACE) privacy




  5. What is Web Mining? And who am I? Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources. Web mining areas: Web content mining; Web structure mining; Web usage mining. Navigation, queries, content access & creation.

  6. Why Web / data mining? "The database of intentions" (J. Battelle)


  8. Location-based services and augmented reality

  9. Semiotically augmented reality: semapedia and related ideas

  10. Mobile Social Web


  12. What's special about spatial information? 1. Interpreting: rich inferences from spatial position to personal properties and/or identity are possible. Pos(A, 9-17) = P1 → workplace(A, P1). Pos(A, 20-6) = P2 → home(A, P2). An even richer "database of intentions"?! Pos(A, now) = P3 & temp(P3, now, hot) → wants(A, ice-cream) (location-based services). Pos(A, t in 13-18) = Pos(Demonstration, 13-18) → suspicious(A) (ex. Dresden phone surveillance case 2011).
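
The first two inference rules on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the trace format (hour of day, place label), the function names, and the "most frequent place wins" heuristic are hypothetical and not taken from the talk.

```python
from collections import Counter

def most_frequent_place(observations, hours):
    """Return the place seen most often during the given hours, or None."""
    counts = Counter(place for hour, place in observations if hour in hours)
    return counts.most_common(1)[0][0] if counts else None

def infer_home_and_work(observations):
    """Apply Pos(A, 9-17) -> workplace and Pos(A, 20-6) -> home."""
    work_hours = set(range(9, 17))                         # 09:00 up to 17:00
    night_hours = set(range(20, 24)) | set(range(0, 6))    # 20:00 up to 06:00
    return {
        "workplace": most_frequent_place(observations, work_hours),
        "home": most_frequent_place(observations, night_hours),
    }

# Hypothetical trace for person A: (hour of day, place label)
trace_a = [(10, "P1"), (14, "P1"), (12, "P4"), (15, "P1"),
           (22, "P2"), (2, "P2"), (5, "P2"), (21, "P3")]
result = infer_home_and_work(trace_a)
print(result)
```

Even this crude heuristic needs nothing beyond timestamped positions, which is the point of the slide: the interpretation step is cheap.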

  13. What's special about spatial information? 2. Sending, or: opt-out impossible?! Physically: you cannot be nowhere. Corollary: you cannot be in two places at once → limits on identity-building. Contractually: rental car with tracking, ... Culturally I: opt-out may preclude basics of identity construction (no mobile phone/internet communication). Culturally II: opt-out is considered suspicious in itself (ex. A. Holm surveillance case 2007).

  14. FOES ?


  16. Behaviour on the Web (and elsewhere) Data

  17. (Web) data analysis and mining Data Privacy problems!

  18. Technical background of the problem: • The dataset allows for Web mining (e.g., which search queries lead to which site choices), • it violates k-anonymity (e.g. "Lilburn" → a likely k = #inhabitants of Lilburn)
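
As a rough illustration of the k-anonymity notion used here: k is the size of the smallest group of records that share the same values on the attributes an attacker can observe. The sketch below assumes a toy log with hypothetical fields (`town`, `topic`); it is not the dataset discussed in the talk.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Hypothetical query log: one record per user
log = [
    {"user": 1, "town": "Lilburn", "topic": "gardening"},
    {"user": 2, "town": "Lilburn", "topic": "dogs"},
    {"user": 3, "town": "Lilburn", "topic": "dogs"},
    {"user": 4, "town": "Atlanta", "topic": "dogs"},
    {"user": 5, "town": "Atlanta", "topic": "dogs"},
]

k_town = k_anonymity(log, ["town"])                   # town alone
k_town_topic = k_anonymity(log, ["town", "topic"])    # town plus one more attribute
print(k_town, k_town_topic)
```

Each additional attribute can only split groups further, so k shrinks as more of a user's queries are combined.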


  20. Inferences. Data mining / machine learning: inductive learning of models ("knowledge") from data. Privacy-relevant: (re-)identification: inferences towards identity; profiling: inferences towards properties. Application of the inferred knowledge.

  21. What is identity merging? Or: Is this the same person?

  22. Data integration: an example. Paper published by the MovieLens team (collaborative-filtering movie ratings), who were considering publishing a ratings dataset; see [Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]. Public dataset: users mention films in forum posts. Private dataset (may be released, e.g. for research purposes): users' ratings. Film IDs can easily be extracted from the posts. Observation: every user will talk about items from a sparse relation space (those, generally few, films s/he has seen). Generalisation with more robust de-anonymization attacks and different data: [Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: Proc. 30th IEEE Symposium on Security and Privacy 2009]

  23. Merging identities: the computational problem. Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset. Rank these users u by their likelihood of being t. Evaluate: if t is in the top k of this list, then t is k-identified. Count the percentage of users who are k-identified. E.g. measure likelihood by TF.IDF (m: item).
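
The ranking step can be sketched as follows. This is a simplified stand-in, not the scoring used in the cited paper: mentions are treated as binary, each item m is weighted by an IDF-style term log(N / number of users who rated m), and all names and data are hypothetical.

```python
import math

def rank_candidates(mentioned_items, ratings):
    """Rank users in `ratings` by IDF-weighted overlap with the mentioned items."""
    n_users = len(ratings)
    # Document frequency: how many users rated each item
    df = {}
    for items in ratings.values():
        for m in items:
            df[m] = df.get(m, 0) + 1

    def score(items):
        # Rare items shared with the target's mentions count more than popular ones
        return sum(math.log(n_users / df[m]) for m in mentioned_items if m in items)

    return sorted(ratings, key=lambda u: score(ratings[u]), reverse=True)

def is_k_identified(target, mentioned_items, ratings, k):
    """True if the target appears among the top k ranked candidates."""
    return target in rank_candidates(mentioned_items, ratings)[:k]

# Hypothetical private ratings dataset: user -> set of rated items
ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "D"},
    "u3": {"A", "E", "F"},
    "u4": {"A", "B"},
}

rare_mention = is_k_identified("u3", {"A", "E"}, ratings, k=1)
popular_only = is_k_identified("u3", {"A"}, ratings, k=1)
print(rare_mention, popular_only)
```

Mentioning a single rarely rated item is enough to put the target at the top of the list, whereas an item everyone rated carries no weight.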

  24. Results

  25. What do you think helps?

  26. What is classification (and prediction)?

  27. Predicting political affiliation from Facebook profile and link data (1): Most Conservative Traits Lindamood et al. 09 & Heatherly et al. 09

  28. Predicting political affiliation from Facebook profile and link data (2): Most Liberal Traits per Trait Name Lindamood et al. 09 & Heatherly et al. 09

  29. What is collaborative filtering? "People like what people like them like"

  30. User-based Collaborative Filtering • Idea: People who agreed in the past are likely to agree again • To predict a user’s opinion for an item, use the opinion of similar users • Similarity between users is decided by looking at their overlap in opinions for other items

  31. Example: User-based Collaborative Filtering

  32. Similarity between users • How similar are users 1 and 2? • How similar are users 1 and 4? • How do you calculate similarity?

  33. Popular similarity measures: cosine-based similarity; adjusted cosine-based similarity; correlation-based similarity
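
A minimal sketch of two of these measures (cosine-based and correlation-based) for a pair of users, computed over the items both have rated. The rating dictionaries and user names are hypothetical; restricting the computation to co-rated items is one common convention, assumed here.

```python
import math

def co_rated(a, b):
    """Items rated by both users."""
    return sorted(set(a) & set(b))

def cosine_similarity(a, b):
    items = co_rated(a, b)
    dot = sum(a[i] * b[i] for i in items)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in items))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in items))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson_similarity(a, b):
    items = co_rated(a, b)
    if len(items) < 2:
        return 0.0
    mean_a = sum(a[i] for i in items) / len(items)
    mean_b = sum(b[i] for i in items) / len(items)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in items)
    var_a = sum((a[i] - mean_a) ** 2 for i in items)
    var_b = sum((b[i] - mean_b) ** 2 for i in items)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

# Hypothetical ratings: item -> rating
user1 = {"i1": 5, "i2": 3, "i3": 1}
user2 = {"i1": 10, "i2": 6, "i3": 2}   # same taste, more generous scale
user4 = {"i1": 1, "i2": 3, "i3": 5}    # opposite taste

print(cosine_similarity(user1, user2), pearson_similarity(user1, user4))
```

The correlation-based measure centres each user's ratings on their mean, so it separates "opposite taste" from "same taste" more sharply than plain cosine on raw ratings.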

  34. Algorithm 1: using entire matrix. Aggregation function: often weighted sum. Weight depends on similarity.

  35. Algorithm 2: K-Nearest-Neighbour. Neighbours are people who have historically had the same taste as our user. Aggregation function: often weighted sum. Weight depends on similarity.
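
Both algorithms can be sketched with one function: with `k=None` every other user who rated the item contributes (Algorithm 1), otherwise only the k most similar ones do (Algorithm 2). The similarity function below (inverse of the mean absolute rating difference) is a simple placeholder chosen for this sketch, and the users and ratings are hypothetical.

```python
def similarity(a, b):
    """Placeholder similarity: 1 / (1 + mean absolute difference on co-rated items)."""
    items = set(a) & set(b)
    if not items:
        return 0.0
    mean_diff = sum(abs(a[i] - b[i]) for i in items) / len(items)
    return 1.0 / (1.0 + mean_diff)

def predict(target, item, ratings, k=None):
    """Similarity-weighted average of other users' ratings for `item`."""
    neighbours = [
        (similarity(ratings[target], ratings[u]), ratings[u][item])
        for u in ratings
        if u != target and item in ratings[u]
    ]
    neighbours = [(s, r) for s, r in neighbours if s > 0]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    if k is not None:
        neighbours = neighbours[:k]          # keep the k nearest neighbours only
    total = sum(s for s, _ in neighbours)
    if total == 0:
        return None                          # nobody comparable rated this item
    return sum(s * r for s, r in neighbours) / total

# Hypothetical user-item ratings
ratings = {
    "ann": {"i1": 5, "i2": 7},
    "bob": {"i1": 5, "i2": 7, "i3": 8},
    "cat": {"i1": 1, "i2": 2, "i3": 4},
}

all_users = predict("ann", "i3", ratings)        # Algorithm 1
nearest_one = predict("ann", "i3", ratings, k=1) # Algorithm 2 with k = 1
print(all_users, nearest_one)
```

With the whole matrix, dissimilar users still pull the prediction towards their ratings, only with a small weight; restricting to the nearest neighbours removes that pull.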


  37. Summary: lots of data → lots of privacy threats (and opportunities). The Web incites one of the semiotically richest (and often machine-processable) types of interaction. Space incites data-rich types of interaction. → Two rich sources of "the database of intentions".


  39. How many people see an ad? Television: sample viewers, extrapolate to population. Web: count viewers/clickers through clickstream. City streets: count pedestrians / motorists? Too many streets! → Solution intuition: sample streets, predict.

  40. Fraunhofer IAIS (2007): predict frequencies based on similar streets. Street segments modelled as vectors: spatial / geometric information; type of street, direction, speed class, …; demographic, socio-economic data about vicinity; nearby points of interest (buffer around segment, count #POI). KNN algorithm: frequency of a street segment = weighted sum of frequencies from the most similar k segments in the sample. Dynamic + selective calculation of distance to counter the huge numbers of segments and measurements.
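
The KNN step can be sketched as below. This is a generic nearest-neighbour regression under assumptions, not the Fraunhofer IAIS implementation: the two features (speed class, POI count), the Euclidean distance, and the inverse-distance weights are all hypothetical choices, and the dynamic, selective distance calculation mentioned on the slide is not modelled.

```python
import math

def predict_frequency(segment, sample, k=2):
    """Weighted average of measured frequencies of the k most similar segments.

    `sample` is a list of (feature_vector, measured_frequency) pairs.
    """
    nearest = sorted(sample, key=lambda s: math.dist(segment, s[0]))[:k]
    # Closer segments get larger weights
    weights = [1.0 / (1.0 + math.dist(segment, features)) for features, _ in nearest]
    weighted = sum(w * freq for w, (_, freq) in zip(weights, nearest))
    return weighted / sum(weights)

# Hypothetical measured segments: (speed class, number of nearby POIs) -> frequency
sample = [
    ((1.0, 5.0), 100.0),
    ((1.0, 7.0), 200.0),
    ((3.0, 40.0), 2000.0),
]

estimate = predict_frequency((1.0, 6.0), sample, k=2)
print(estimate)
```

In practice the features would need scaling so that no single attribute dominates the distance.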


  42. IP filtering: a deterministic classification model, IP → country
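
A deterministic IP → country model is essentially a table lookup. The sketch below uses Python's standard `ipaddress` module; the table itself is hypothetical (the networks are address ranges reserved for documentation, and the country codes attached to them are made up for illustration).

```python
import ipaddress

# Hypothetical lookup table: documentation-only networks mapped to made-up countries
COUNTRY_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "BE",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}

def country_of(ip):
    """Return the country code for an IP address, or None if no rule matches."""
    address = ipaddress.ip_address(ip)
    for network, country in COUNTRY_BY_NETWORK.items():
        if address in network:
            return country
    return None

print(country_of("192.0.2.17"), country_of("203.0.113.5"))
```

Unlike the learned models on the other slides, the same input always yields the same class here, and errors come only from an outdated or incomplete table.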

  43. Where do people live who will buy the Koran soon? Technical background of the problem: • A mashup of different data sources: Amazon wishlists, Yahoo! People (addresses), Google Maps • each with insufficient k-anonymity, which allows for attribute matching and thereby inferences
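
Attribute matching across two sources amounts to a join on shared attributes. The sketch below is a toy illustration with entirely hypothetical records and field names; it does not call any of the services named on the slide.

```python
def link_records(wishlists, directory):
    """Join wishlist entries to directory entries on (name, city).

    Only unique matches are kept: those are the cases where k = 1
    and the wishlist item can be tied to a single address.
    """
    index = {}
    for entry in directory:
        index.setdefault((entry["name"], entry["city"]), []).append(entry)

    linked = []
    for wish in wishlists:
        matches = index.get((wish["name"], wish["city"]), [])
        if len(matches) == 1:
            linked.append({
                "name": wish["name"],
                "item": wish["item"],
                "address": matches[0]["address"],
            })
    return linked

# Hypothetical data sources
wishlists = [
    {"name": "A. Example", "city": "Springfield", "item": "Book X"},
    {"name": "B. Sample", "city": "Shelbyville", "item": "Book Y"},
]
directory = [
    {"name": "A. Example", "city": "Springfield", "address": "1 Hypothetical Street"},
    {"name": "B. Sample", "city": "Shelbyville", "address": "2 Made-up Road"},
    {"name": "B. Sample", "city": "Shelbyville", "address": "9 Invented Avenue"},
]

linked = link_records(wishlists, directory)
print(linked)
```

Neither source reveals both the item and the address on its own; the inference only becomes possible once the two are joined.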

  44. Traffic prediction: space data + Web data + ... Multiple views on traffic: weather, major events, incident reports. Example incident report: Operator ID: Nick; Heading: INCIDENT; Message: INCIDENT INFORMATION Cleared 1637: I-405 SB JS I-90 ACC BLK RL CCTV 1623 – WSP, FIR ON SCENE. • Event store • Learning • Reasoning. E.g. LARKC project: I. Celino, D. Dell'Aglio, E. Della Valle, R. Grothmann, F. Steinke and V. Tresp: Integrating Machine Learning in a Semantic Web Platform for Traffic Forecasting and Routing. IRMLeS 2011 Workshop at ESWC 2011.


  46. Recall (a simple view): Cryptographic privacy solutions. Data: not all!

  47. "Privacy-preserving data mining". Data: not all!

  48. Privacy-preserving data mining (PPDM). Database inference problem: "The problem that arises when confidential information can be derived from released data by unauthorized users." Objective of PPDM: "develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process." Approaches: Data distribution: decentralized holding of data. Data modification: aggregation/merging into coarser categories; perturbation, blocking of attribute values; swapping values of individual records; sampling. Data or rule hiding: push the support of sensitive patterns below a threshold.
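
Three of the data-modification approaches (aggregation into coarser categories, perturbation, swapping) can be sketched as follows. The function names, the 10-year age bands, and the uniform noise are assumptions made for this illustration and are not taken from a specific PPDM algorithm.

```python
import random

def generalize_age(age, width=10):
    """Aggregation: replace an exact age by a coarser band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb(values, scale, rng):
    """Perturbation: add bounded uniform noise to each numeric value."""
    return [v + rng.uniform(-scale, scale) for v in values]

def swap_values(records, attribute, rng):
    """Swapping: permute one attribute's values across records."""
    shuffled = [r[attribute] for r in records]
    rng.shuffle(shuffled)
    return [{**r, attribute: v} for r, v in zip(records, shuffled)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Hypothetical records
records = [
    {"id": 1, "age": 37, "income": 2500},
    {"id": 2, "age": 41, "income": 3100},
    {"id": 3, "age": 29, "income": 1900},
]

bands = [generalize_age(r["age"]) for r in records]
noisy_incomes = perturb([r["income"] for r in records], scale=100, rng=rng)
swapped = swap_values(records, "income", rng)
print(bands, noisy_incomes, swapped)
```

Swapping keeps the overall distribution of the attribute intact while breaking the link to individual records; perturbation and aggregation trade precision for privacy more directly.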

  49. Example 1: Collaborative filtering

  50. Collaborative filtering: idea and architecture. Basic idea of collaborative filtering: "Users who liked this also liked ..." → generalize from "similar profiles". Standard solution: at the community site / centralized: compute, from all users and their ratings/purchases, etc., a global model. To derive a recommendation for a given user: find "similar profiles" in this model and derive a prediction. Mathematically: depends on simple vector computations in the user-item space.