
New Directions in Data Mining


Presentation Transcript


  1. New Directions in Data Mining Ramakrishnan Srikant

  2. Directions • Electronic Commerce & Web • Catalog Integration (WWW 2001, with R. Agrawal) • Searching with Numbers (WWW 2002, with R. Agrawal) • Security • Privacy

  3. Catalog Integration • B2B electronics portal: 2000 categories, 200K datasheets • [Diagram: mapping categories of the New Catalog into the Master Catalog]

  4. Catalog Integration (cont.) • After integration:

  5. Goal • Use affinity information in new catalog. • Products in same category are similar. • Accuracy boost depends on match between two categorizations.

  6. Intuition behind Algorithm • [Diagrams: Standard Algorithm vs. Enhanced Algorithm]

  7. Problem Statement • Given • master categorization M: categories C1, C2, …, Cn • set of documents in each category • new categorization N: set of categories S1, S2, …, Sm • set of documents in each category • Standard Alg: Compute Pr(Ci | d) • Enhanced Alg: Compute Pr(Ci | d, S)

  8. Enhanced Naïve Bayes classifier • Use tuning set to determine w. • Defaults to standard Naïve Bayes if w = 0. • Only affects classification of borderline documents.
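The weighting scheme on this slide can be sketched in Python: a standard Naïve Bayes log-score plus w times the log of a category's "affinity", i.e. how often the standard classifier sends documents from the same new-catalog category S into that master category. The function name, data layout, and Laplace smoothing are illustrative assumptions, not the paper's code; only the role of w (w = 0 recovers standard Naïve Bayes) follows the slide.

```python
import math

def enhanced_classify(word_probs, priors, doc_words, s_counts, w=1.0):
    """Sketch of an enhanced Naive Bayes decision for one document d
    from new-catalog category S.

    word_probs[c][t] : Pr(t | c) from standard Naive Bayes training
    priors[c]        : Pr(c) over master categories
    s_counts[c]      : how many of S's documents the standard classifier
                       assigned to master category c
    w                : affinity weight; w = 0 gives standard Naive Bayes
    """
    total = sum(s_counts.values())
    scores = {}
    for c in priors:
        # standard Naive Bayes log-score (tiny floor for unseen words)
        log_p = math.log(priors[c])
        for t in doc_words:
            log_p += math.log(word_probs[c].get(t, 1e-9))
        # affinity of category c for documents of S (Laplace-smoothed)
        affinity = (s_counts.get(c, 0) + 1) / (total + len(priors))
        scores[c] = log_p + w * math.log(affinity)
    return max(scores, key=scores.get)
```

With w = 0 the affinity term is ignored; a larger w only flips borderline documents toward categories where the rest of S already landed, matching the slide's claim.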

  9. Yahoo & Google • 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software • Typical match: 69%, 15%, 3%, 3%, 1%, …. • Merging Yahoo into Google • 30% fewer errors (14.1% absolute difference in accuracy) • Merging Google into Yahoo • 26% fewer errors (14.3% absolute difference) • Open Problems: SVM, Decision Tree, ...

  10. Directions • Electronic Commerce & Web • Catalog Integration (WWW 2001, with R. Agrawal) • Searching with Numbers (WWW 2002, with R. Agrawal) • Security • Privacy

  11. Data Extraction is hard • Synonyms for attribute names and units. • "lb" and "pounds", but no "lbs" or "pound". • Attribute names are often missing. • No "Speed", just "MHz Pentium III" • No "Memory", just "MB SDRAM" • 850 MHz Intel Pentium III • 192 MB RAM • 15 GB Hard Disk • DVD Recorder: Included • Windows Me • 14.1 inch display • 8.0 pounds

  12. Searching with Numbers

  13. Why does it work? • Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names. • Non-overlapping attributes: • Memory: 64 - 512 MB, Disk: 10 - 40 GB • Correlations: • Memory: 64 - 512 MB, Disk: 10 - 100 GB still fine. • [Screen shot]
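The conjecture above suggests a simple scoring rule: pair each query number with the closest value in a document and sum the relative distances. This is a deliberately simplified sketch (it matches numbers greedily and independently, whereas the WWW 2002 paper reasons about assignments of query numbers to distinct attributes); the function name is hypothetical.

```python
def match_score(query_nums, doc_values):
    """Score a document against a numbers-only query: for each query
    number take the relative distance to the nearest attribute value
    in the document, and sum. Lower scores mean better matches."""
    score = 0.0
    for q in query_nums:
        score += min(abs(q - v) / max(abs(q), abs(v), 1e-9)
                     for v in doc_values)
    return score
```

For the query "850 128", a laptop listing (850 MHz, 128 MB, 15 GB) scores 0 and outranks a (500 MHz, 64 MB, 10 GB) listing, without ever extracting attribute names.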

  14. Empirical Results

  15. Incorporating Hints • Use simple data extraction techniques to get hints. • Names/Units in query matched against Hints. • Open Problem: Rethink data extraction in this context.

  16. Other Problems • P. Domingos & M. Richardson, Mining the Network Value of Customers, KDD 2001 • Polling over the Web

  17. Directions • Electronic Commerce & Web • Security • Privacy

  18. Security

  19. Some Hard Problems • Past may be a poor predictor of future • Abrupt changes • Reliability and quality of data • Wrong training examples • Simultaneous mining over multiple data types • Richer patterns

  20. Directions • Electronic Commerce & Web • Security • Privacy • Motivation • Randomization Approach • Cryptographic Approach • Challenges

  21. Growing Privacy Concerns • Popular Press: • Economist: The End of Privacy (May 99) • Time: The Death of Privacy (Aug 97) • Govt. directives/commissions: • European directive on privacy protection (Oct 98) • Canadian Personal Information Protection Act (Jan 2001) • Special issue on internet privacy, CACM, Feb 99 • S. Garfinkel, "Database Nation: The Death of Privacy in the 21st Century", O'Reilly, Jan 2000

  22. Privacy Concerns (2) • Surveys of web users • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99) • 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

  23. Technical Question • Fear: • "Join" (record overlay) was the original sin. • Data mining: new, powerful adversary? • The primary task in data mining: development of models about aggregated data. • Can we develop accurate models without access to precise information in individual data records?

  24. Directions • Electronic Commerce & Web • Security • Privacy • Motivation • Randomization Approach • R. Agrawal and R. Srikant, “Privacy Preserving Data Mining”, SIGMOD 2000. • Cryptographic Approach • Challenges

  25. Web Demographics • Volvo S40 website targets people in their 20s • Are visitors in their 20s or 40s? • Which demographic groups like/dislike the website?

  26. Randomization Approach Overview • Clients hold true records (30 | 70K | ..., 50 | 40K | ...) • Each record passes through a Randomizer before leaving the client • Server sees only randomized records (65 | 20K | ..., 25 | 60K | ...) • Reconstruct distribution of Age, Salary, ... • Data Mining Algorithms → Model
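The client-side half of the pipeline above is just additive noise; a minimal sketch, assuming uniform perturbation of numeric attributes (the helper name and default width are illustrative):

```python
import random

def randomize(record, half_width=30.0, rng=None):
    """Client-side perturbation: add independent uniform noise in
    [-half_width, +half_width] to each numeric attribute before the
    record leaves the client. Only the noisy record is ever sent."""
    rng = rng or random.Random()
    return [v + rng.uniform(-half_width, half_width) for v in record]
```

So a true record (30, 70K) might arrive at the server as (42.7, 69987.3); the server then works only with the collection of such noisy records.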

  27. Reconstruction Problem • Original values x1, x2, ..., xn • from probability distribution X (unknown) • To hide these values, we use y1, y2, ..., yn • from probability distribution Y • Given • x1+y1, x2+y2, ..., xn+yn • the probability distribution of Y • Estimate the probability distribution of X.

  28. Intuition (Reconstruct single point) • Use Bayes' rule for density functions

  29. Intuition (Reconstruct single point) • Use Bayes' rule for density functions

  30. Reconstructing the Distribution • Combine estimates of where point came from for all the points: • Gives estimate of original distribution.

  31. Reconstruction: Bootstrapping • fX0 := Uniform distribution • j := 0 // Iteration number • repeat • fXj+1(a) := (1/n) Σi [ fY(zi − a) fXj(a) / ∫ fY(zi − t) fXj(t) dt ], where zi = xi + yi (Bayes' rule) • j := j+1 • until (stopping criterion met) • Converges to maximum likelihood estimate. • D. Agrawal & C.C. Aggarwal, PODS 2001.
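The bootstrapping loop on this slide can be sketched directly, discretizing X onto a grid of candidate values: each iteration computes, for every randomized value z_i, a Bayes posterior over the grid, then averages the posteriors. This is an illustrative sketch (grid discretization and the stopping rule are my assumptions), not the SIGMOD 2000 implementation.

```python
def reconstruct(z, bins, noise_pdf, iters=30):
    """Reconstruct the distribution of X from randomized values
    z_i = x_i + y_i, given the noise density noise_pdf of Y.
    `bins` is a grid of candidate values for X."""
    f = [1.0 / len(bins)] * len(bins)        # fX0 := uniform
    for _ in range(iters):
        new = [0.0] * len(bins)
        for zi in z:
            # Pr(zi | X=a) * f(a) for each candidate a, then normalize
            lik = [noise_pdf(zi - a) * fa for a, fa in zip(bins, f)]
            s = sum(lik)
            for k in range(len(bins)):
                new[k] += lik[k] / s         # Bayes posterior over bins
        f = [v / len(z) for v in new]        # average over all n points
    return f
```

Run on values all equal to 10 with uniform noise on [-5, +5], the estimate piles its mass back near 10 even though no individual z_i reveals its x_i, which is exactly the point of the approach.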

  32. Seems to work well!

  33. Recap: Why is privacy preserved? • Cannot reconstruct individual values accurately. • Can only reconstruct distributions.

  34. Classification • Naïve Bayes • Assumes independence between attributes. • Decision Tree • Correlations are weakened by randomization, not destroyed.

  35. Algorithms • “Global” Algorithm • Reconstruct for each attribute once at the beginning • “By Class” Algorithm • For each attribute, first split by class, then reconstruct separately for each class. • Associate each randomized value with a reconstructed interval & run decision-tree classifier. • See SIGMOD 2000 paper for details.

  36. Experimental Methodology • Compare accuracy against • Original: unperturbed data without randomization. • Randomized: perturbed data but without making any corrections for randomization. • Test data not randomized. • Synthetic data benchmark from [AGI+92]. • Training set of 100,000 records, split equally between the two classes.

  37. Synthetic Data Functions • F3 • ((age < 40) and • (((elevel in [0..1]) and (25K <= salary <= 75K)) or • ((elevel in [2..3]) and (50K <= salary <= 100K))) or • ((40 <= age < 60) and ... • F4 • (0.67 x (salary+commission) - 0.2 x loan - 10K) > 0

  38. Quantifying Privacy • Add a random value between -30 and +30 to age. • If randomized value is 60 • know with 90% confidence that age is between 33 and 87. • Interval width ∝ amount of privacy. • Example: (Interval Width: 54) / (Range of Age: 100) ⇒ 54% randomization level @ 90% confidence
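The arithmetic behind the slide's metric: with noise uniform on [-a, +a], the interval that contains the true value with confidence c has width 2·a·c, and the randomization level normalizes that width by the attribute's range. A one-line sketch (function name is mine):

```python
def randomization_level(noise_half_width, confidence, attr_range):
    """Privacy metric from the talk: with noise uniform on [-a, +a],
    the confidence-c interval around the true value has width 2*a*c;
    normalize by the attribute's range to get the randomization level."""
    interval_width = 2 * noise_half_width * confidence
    return interval_width / attr_range
```

For the slide's example: noise in [-30, +30], 90% confidence, age range 100 gives an interval of width 54 (e.g. [33, 87] around an observed 60) and a 54% randomization level.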

  39. Acceptable loss in accuracy

  40. Accuracy vs. Randomization Level

  41. Directions • Electronic Commerce & Web • Security • Privacy • Motivation • Randomization Approach • Cryptographic Approach • Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000. • Challenges

  42. Inter-Enterprise Data Mining • Problem: Two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information. • Horizontally partitioned. • Records (users) split across companies. • Example: Credit card fraud detection model. • Vertically partitioned. • Attributes split across companies. • Example: Associations across websites.

  43. Cryptographic Adversaries • Malicious adversary: can alter its input, e.g., define input to be the empty database. • Semi-honest (or passive) adversary: Correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages.

  44. Yao's two-party protocol • Party 1 with input x • Party 2 with input y • Wish to compute f(x,y) without revealing x,y. • Yao, “How to generate and exchange secrets”, FOCS 1986.

  45. Private Distributed ID3 • Key problem: find attribute with highest information gain. • We can then split on this attribute and recurse. • Assumption: Numeric values are discretized, with n-way split.

  46. Information Gain • Let • T = set of records (dataset), • T(ci) = set of records in class i, • T(aj) = set of records with value(A) = aj, • T(ci, aj) = set of records in class i with value(A) = aj. • Entropy(T) = − Σi (|T(ci)| / |T|) log (|T(ci)| / |T|) • Gain(T, A) = Entropy(T) − Σj (|T(aj)| / |T|) Entropy(T(aj)) • Need to compute • Σj Σi |T(aj, ci)| log |T(aj, ci)| • Σj |T(aj)| log |T(aj)|.
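The two formulas above, written out on plain class counts (this is the standard ID3 computation the slide quotes, not anything specific to the private protocol; expanding the entropies is what reduces the gain to the Σ |T| log |T| sums the slide lists):

```python
import math

def entropy(class_counts):
    """Entropy(T) = -sum_i |T(ci)|/|T| * log2(|T(ci)|/|T|)."""
    n = sum(class_counts)
    return -sum(c / n * math.log2(c / n) for c in class_counts if c > 0)

def gain(t_counts, split_counts):
    """Gain(T, A) = Entropy(T) - sum_j |T(aj)|/|T| * Entropy(T(aj)).
    t_counts: class counts for T; split_counts: class counts of each
    partition T(aj) induced by attribute A."""
    n = sum(t_counts)
    return entropy(t_counts) - sum(sum(s) / n * entropy(s)
                                   for s in split_counts)
```

For example, a 2+2 dataset split perfectly by an attribute has Entropy(T) = 1 bit and Gain(T, A) = 1 bit.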

  47. Selecting the Split Attribute • Given v1 known to party 1 and v2 known to party 2, compute (v1 + v2) log (v1 + v2) and output random shares. • Party 1 gets Answer - d • Party 2 gets d, where d is a random number • Given random shares for each attribute, use Yao's protocol to compute information gain.
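The sharing arithmetic on this slide is easy to illustrate, with one large caveat: in the real Lindell-Pinkas protocol the shares of (v1 + v2) log (v1 + v2) are produced by a secure two-party computation, so neither party ever sees both inputs or the answer. The sketch below only shows what "output random shares" means, computing the answer in the clear (function name and share range are mine):

```python
import math
import random

def share_x_log_x(v1, v2, rng=None):
    """Additively share (v1+v2)*log(v1+v2): party 2 gets a random d,
    party 1 gets answer - d. Each share alone looks random; only their
    sum reveals the value. (Toy illustration only -- the actual protocol
    computes the shares without revealing v1, v2, or the answer.)"""
    rng = rng or random.Random()
    answer = (v1 + v2) * math.log(v1 + v2)
    d = rng.uniform(0, 1e6)      # party 2's random share
    return answer - d, d         # (party 1's share, party 2's share)
```

Given such shares for every attribute, the parties can feed them into Yao's protocol to select the attribute of highest information gain without exchanging the shares themselves.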

  48. Summary (Cryptographic Approach) • Solves different problem (vs. randomization) • Efficient with semi-honest adversary and small number of parties. • Gives the same solution as the non-privacy-preserving computation (unlike randomization). • Will not scale to individual user data. • Can we extend the approach to other data mining problems? • J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002

  49. Directions • Electronic Commerce • Security • Privacy • Motivation • Randomization Approach • Cryptographic Approach • Challenges

  50. Challenges • Application: Privacy-Sensitive Security Profiling • Privacy Breaches • Clustering & Associations
