Detecting Data Leakage Panagiotis Papadimitriou papadimitriou@stanford.edu Hector Garcia-Molina hector@cs.stanford.edu
Leakage Problem [Figure: user profiles (Name: Sarah, Sex: Female, …; Name: Mark, Sex: Male, …) for Jeremy, Sarah, Mark and Kathryn; applications App. U1 and App. U2; other sources, e.g. Sarah’s network] Stanford Infolab
Outline • Problem Description • Guilt Models (e.g., Pr{U1 leaked data} = 0.7, Pr{U2 leaked data} = 0.2) • Distribution Strategies
Problem Entities
Agents’ Data Requests • Sample: e.g., 100 profiles of Stanford people • Explicit: e.g., all people who added the application (the example we used so far), or all Stanford profiles
Guilt Models (1/3) • p: posterior probability that a leaked profile comes from other sources (e.g. Sarah’s network) • Guilty Agent: an agent who leaks at least one profile • Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
Guilt Models (2/3) • Agents leak all their data items OR nothing • Agents leak each of their data items independently • [Figure: leak scenarios with probabilities p², p(1-p), (1-p)p and (1-p)²]
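The independent-leak model can be made concrete with a small sketch. The slides do not state a formula, so the one used here is an assumption: each leaked profile came from other sources with probability p, and otherwise from one of the agents holding it, chosen uniformly. This gives Pr{Gi|S} = 1 - ∏ (1 - (1-p)/|Vt|) over the leaked profiles t held by Ui, where Vt is the set of agents holding t. The function name and the example allocation are hypothetical.

```python
from math import prod


def guilt_probability(agent, allocation, leaked, p):
    """Estimate Pr{G_i | S} under the independent-leak assumption.

    allocation: dict mapping agent name -> set of profiles it received
    leaked:     set S of leaked profiles
    p:          probability that a leaked profile came from other sources

    ASSUMPTION (not on the slides): a leaked profile that did not come
    from other sources was leaked by one of its holders, chosen uniformly.
    """
    factors = []
    for item in leaked & allocation[agent]:
        # Number of agents that received this profile (|V_t|).
        holders = sum(1 for items in allocation.values() if item in items)
        # Probability that this agent did NOT leak this particular profile.
        factors.append(1 - (1 - p) / holders)
    # Guilty means the agent leaked at least one profile.
    return 1 - prod(factors)


# Hypothetical example: U1 holds both leaked profiles, U2 only one.
allocation = {"U1": {"Sarah", "Mark"}, "U2": {"Sarah"}}
leaked = {"Sarah", "Mark"}
for name in allocation:
    print(name, round(guilt_probability(name, allocation, leaked, p=0.2), 2))
```

In this example U1 scores higher than U2 because "Mark" was given only to U1, so that profile can be blamed on nobody else but other sources.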
Guilt Models (3/3) [Figure: two panels, "Independently" and "NOT Independently", each with axes Pr{G1} and Pr{G2}]
The Distributor’s Objective (1/2) [Figure: agents U1 to U4, their requests, the distributed sets Ri, and the leaked set S] • Pr{G1|S} >> Pr{G2|S}, Pr{G1|S} >> Pr{G4|S}
The Distributor’s Objective (2/2) • To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize [objective formula not preserved] • Intuition: Minimizing data sharing among agents makes the leaked data reveal the guilty agents
Distribution Strategies – Sample (1/4) • Set T has four profiles: Kathryn, Jeremy, Sarah and Mark • There are 4 agents: U1, U2, U3 and U4 • Each agent requests a sample of any 2 profiles of T for a market survey
Distribution Strategies – Sample (2/4) [Figure: two distributions across agents U1 to U4, labeled "Poor" and "Minimize"]
Distribution Strategies – Sample (3/4) • Optimal Distribution • Avoid full overlaps and minimize [objective formula not preserved] [Figure: distribution across U1, U2, U3 and U4]
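The four-profile example can be explored with a short sketch. The objective formula is not preserved in this transcript, so the total pairwise overlap used here (the sum of |Ri ∩ Rj| over all agent pairs) is an assumption standing in for it. "Full overlap" is taken to mean that two agents receive identical sets. The function names and the three allocations are hypothetical.

```python
from itertools import combinations


def total_overlap(allocation):
    """Sum of |Ri ∩ Rj| over all pairs of agents (assumed overlap measure)."""
    sets = list(allocation.values())
    return sum(len(a & b) for a, b in combinations(sets, 2))


def has_full_overlap(allocation):
    """True if two agents received exactly the same set of profiles."""
    sets = list(allocation.values())
    return any(a == b for a, b in combinations(sets, 2))


# T = {Kathryn, Jeremy, Sarah, Mark}; each agent gets 2 profiles.
identical = {u: {"Kathryn", "Jeremy"} for u in ("U1", "U2", "U3", "U4")}
paired = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"},
    "U3": {"Sarah", "Mark"}, "U4": {"Sarah", "Mark"},
}
ring = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"},
    "U3": {"Sarah", "Mark"}, "U4": {"Mark", "Kathryn"},
}

for name, alloc in (("identical", identical), ("paired", paired), ("ring", ring)):
    print(name, total_overlap(alloc), has_full_overlap(alloc))
```

Under this measure the "paired" and "ring" allocations have the same total overlap, but only "ring" avoids full overlaps. With "paired", a leak of Kathryn and Jeremy cannot distinguish U1 from U2, which is why the slide asks for both conditions.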
Distribution Strategies – Sample (4/4)
Distribution Strategies
Sample Data Requests • The distributor has the freedom to select the data items to provide the agents with • General Idea: Provide agents with sets of data that are as disjoint as possible • Problem: There are cases where the distributed data must overlap, e.g., |R1|+…+|Rn|>|T|
Explicit Data Requests • The distributor must provide agents with the data they request • General Idea: Add fake data to the distributed data to minimize the overlap of distributed data • Problem: Agents can collude and identify fake data • NOT COVERED in this talk
Conclusions • Data Leakage • Modeled as a maximum likelihood problem • Data distribution strategies that help identify the guilty agents