## Privacy Issues in Scientific Workflow Provenance


Sudeepa Roy¹

Joint work with Susan B. Davidson¹, Sanjeev Khanna¹, and Sarah Cohen Boulakia²

¹ University of Pennsylvania, USA
² Laboratoire de Recherche en Informatique, France

**Workflow**

- Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment)
- Vertex ≡ module (program)
- Edge ≡ dataflow
- Run: an execution of the workflow
  - Actual data appears on the edges

[Figure: example workflow over DNA sequence data, with modules Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, and Construct Trees]

**Need for provenance**

- Typical provenance queries:
  - Whether d1 depends on d2
  - How d1 depends on d2
- How has this tree been generated?
- Which sequences have been used to produce this tree?

[Figure: biologist's workspace showing the workflow (Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees) applied to DNA sequence data]

**Workflow privacy for provenance queries**

- Analysts want access to the sequence of module executions and intermediate data
- But workflows may capture medical diagnoses, proprietary biological experiments, etc.
- Many "private" components in such workflows!
  - Proprietary modules (functionality)
  - Personal information, medical records (data)
  - Even the entire process! (provenance)

Privacy Issues in Workflow Provenance

**In this talk…**

- Identify important privacy concerns in scientific workflows w.r.t.
  provenance queries
  - Module Privacy
  - Data Privacy
  - Provenance Privacy
- Propose a model for Module Privacy
  - From Davidson-Khanna-Panigrahi-Roy '10
- Discuss future directions

**Privacy Concerns in Scientific Workflows**

**Example 1: Module Privacy**

- Module functionality should be kept secret
  - From the patient's standpoint: the output should not be guessable from the input data values
  - From the module owner's standpoint: no one should be able to simulate the module and use it elsewhere

[Figure: workflow on a patient record P = (X1, X2, X3, X4, X5), with attributes such as gender, smoking habits, familial environment, blood pressure, and blood test report. Split Entries passes the projections (X1, X2, X3) and (X1, X2, X4, X5) to the modules Check for Cancer and Check for Infectious Disease; their outputs ("P has cancer?", "P has an infectious disease?") feed Create Report, which produces the report. Example module rule: If X1 > 60 OR (X2 < 800 AND X5 = 1) AND …]

**Example 2: Data Privacy**

- Robots are used to perform microarray analysis
- Data must be normalized to be interpreted correctly
- Normalization data should be kept secret
- Microarray companies provide normalization methods
- Data from other groups is used in normalization

[Figure: microarray data obtained from the experiment is normalized to produce normalized data]

**Example 3: Provenance Privacy**

- M1 compares the entire protein against already annotated genomes
- M2 compares domains of proteins (more precise but more time-consuming)
- The provenance should be kept secret

[Figure: input Protein; modules M1 and M2; output Protein + Functional annotation]

**Privacy concerns at a glance**

- Module Privacy
  - Functionality is private: (x, f(x))
- Data Privacy
  - Data items are private
- Provenance Privacy
  - How data items are generated is private

[Figure: patient-record workflow on P = (X1, X2, X3, X4) (smoking habits, blood pressure, blood test report, …) with projections (X1, X2, X3) and (X1, X2, X4); labels include Split Entries, Check for Infectious Disease, Check for Cancer, Check for Cancer DB, "P has infectious disease?", "P has cancer?", and Create Report]
**Formal Study of Privacy in Workflows**

**The questions we want to answer...**

- Can we preserve privacy of private components in a workflow and maximize utility w.r.t. provenance queries, with provable guarantees on both privacy and utility of the solution?
  - Private components: we identified them!
  - How do we measure privacy?
  - What information can we hide?
  - How do we measure utility?
  - How do we find a good solution?

**Module Privacy – A formal study…**

… from our recent work

**Our workflow model**

- A directed acyclic graph
- D = {d1, d2, …, d6}: data items
- Each edge carries a data item
- Data sharing: each data item is produced by a unique module but can be on multiple edges

[Figure: workflow with modules v1, v2, v3; initial input data d1, d2 enter v1; intermediate data d3, d4, with d4 on two edges; final output data d5, d6]

**Run – An execution of the workflow**

[Figure: the same workflow annotated with the values of one run: d1 = 0, d2 = 1, d3 = 0, d4 = 1, d5 = 1, d6 = 0]

**Owner vs. User of the workflow**

- Owner owns the workflow
- User executes the workflow on different inputs, sees the output, and may want to see some intermediate data
- Each module in the workflow is "private"
  - User has no a priori knowledge of the functions
  - Owner cannot show all intermediate data!

[Figure: the same run: d1 = 0, d2 = 1, d3 = 0, d4 = 1, d5 = 1, d6 = 0]

**Owner vs. User of the workflow**

- Privacy to the owner vs. loss of utility to the user
  - There is a cost to the user of hiding each data item
- Owner chooses which subset of data to hide
  - These data values are not shown across all runs of the workflow
  - Connections are always shown
- Hide a subset of data with minimum cost that ensures privacy

[Figure: the same run with d4 hidden: d1 = 0, d2 = 1, d3 = 0, d4 = ?, d5 = 1, d6 = 0]
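The workflow model, a run, and the owner's choice of hidden data items can be sketched in a few lines of Python. This is only an illustration: the edge layout in `CONSUMERS`, the per-item values in `COST`, and the helper names are assumptions, since the slides list only the modules v1 to v3, the data items d1 to d6, and the run values.

```python
# Hypothetical encoding of the slide's workflow: each data item is produced by
# exactly one module (None marks an initial input) and may feed several modules.
PRODUCER = {"d1": None, "d2": None, "d3": "v1", "d4": "v1", "d5": "v2", "d6": "v3"}
CONSUMERS = {  # assumed edge layout; d4 is shared by two edges
    "d1": ["v1"], "d2": ["v1"], "d3": ["v2"],
    "d4": ["v2", "v3"], "d5": [], "d6": [],
}

# One run assigns a value to every data item (values taken from the slides).
RUN = {"d1": 0, "d2": 1, "d3": 0, "d4": 1, "d5": 1, "d6": 0}

# Assumed cost to the user of hiding each data item.
COST = {"d1": 1, "d2": 1, "d3": 2, "d4": 2, "d5": 3, "d6": 3}


def edges():
    """List (source, data item, target) triples; connections are always shown."""
    out = []
    for d, targets in CONSUMERS.items():
        src = PRODUCER[d] or "input"
        for t in targets or ["output"]:
            out.append((src, d, t))
    return out


def user_view(run, hidden):
    """Mask the hidden data items; the same items are hidden in every run."""
    return {d: ("?" if d in hidden else v) for d, v in run.items()}


def hiding_cost(hidden):
    """Total utility lost by the user for a chosen hidden set."""
    return sum(COST[d] for d in hidden)


view = user_view(RUN, {"d4"})
print(view)
print(hiding_cost({"d4"}))
```

Hiding `d4` reproduces the masked run from the last slide: the value is replaced by `?` on both edges that carry it, while the graph structure returned by `edges()` stays fully visible.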
**Module Privacy**

- A module f = a function
- For every input x to f, the value f(x) should not be revealed
  - Enough equivalent possible f(x) values w.r.t. visible information ("enough" according to the required privacy guarantee)
- There is a fork, a knife and a spoon in this figure

**Module Privacy**

- Standalone module privacy: the module is not part of a workflow
- In-network module privacy: the module belongs to a workflow

Let us take a look at "standalone module privacy" first.

**Standalone Privacy: An example…**

- A module f decides which diseases a person may have, based on the region where the person lives and the region the person visited recently
- Four regions: A, B, C, D
- Three diseases: D1, D2, D3
- R denotes where a person lives (A or B); V denotes where the person visited recently (C or D)
- f(R, V) = (D1, D2, D3)

[Figure: the four regions, with A and B associated with R and C and D associated with V]

**Standalone Privacy: An example…**

- R = 1 if a person lives in region A, 0 if the person lives in B
- V = 1 if a person recently visited region C, 0 if the person visited D
- D1 = 1 if a person is susceptible to disease D1 and 0 otherwise; similarly for D2 and D3
- $D_1 = R \lor V$ (may have D1 if the person either lives in A or visited C)
- $D_2 = \lnot(R \land V)$ (may not have D2 only if the person lives in A and visited C)
- $D_3 = \lnot(R \lor V)$ (may have D3 if and only if the person lives in B and visited D)

**Γ-Standalone Privacy of functions**

- Hide a subset of input and output data values
- Γ-standalone-privacy: for all inputs x, there are at least Γ possible values of f(x)
  - Similar to the notion of L-diversity (MKGV'07)
- Example: hide D2, D3
  - 4 possible outputs for each input
  - E.g., can map f(0, 0) to (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)

**Many options give the same privacy…**

- Hide V, D3
  - Can map f(0, 0) to (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)
  - Gives 4-privacy

But not all…
- Hide R, V
  - Over all possible executions there are only 3 outputs
  - Can map f(0, 0) to (0, 1, 1), (1, 1, 0), (1, 0, 0)
  - Does not give 4-privacy

**Consistent Functions**

- Two functions are consistent w.r.t. some visible attributes if their "tables" are the same w.r.t. the visible values
- Γ-standalone-privacy: for all inputs x, consistent functions map x to at least Γ possible values of f(x)

**Standalone to In-network Privacy**

- Carefully choosing a subset gives the desired standalone privacy, through consistent functions
- How is in-network privacy (enough possible f(x) values in a network) different?
- Claim: If consistent functions give Γ possible values of f(x) when f is standalone, then they also give Γ possible values of f(x) when f is in a network.
  - Why? Pick a consistent function for each module; they are consistent with the network as a whole

**A toy workflow**

- f1 is the same module as before
- Module f2 decides which vaccine (V1 or V2) a person needs to take based on the diseases
  - V1 = 1, V2 = 0 if D1 = 1
  - Otherwise, V1 = 0 and V2 = D1 D2

[Figure: R and V enter f1, which outputs D1, D2, D3; D1 and D2 enter f2, which outputs V1 and V2]

**Consistent Sequence of Functions**

⟨g1, …, gn⟩ is consistent with ⟨f1, …, fn⟩ if all individual fi-s are consistent with gi-s

- f1:
  - $D_1 = R \lor V$, $D_2 = \lnot(R \land V)$, $D_3 = \lnot(R \lor V)$
- f2:
  - V1 = 1, V2 = 0 if D1 = 1
  - Otherwise, V1 = 0 and V2 = D2 D3
- g1:
  - $D_1 = \lnot(R \lor V)$, $D_2 = \lnot(R \land V)$, $D_3 = \lnot(R \lor V)$
- g2:
  - V1 = 1, V2 = 0 if D1 = 0
  - Otherwise, V1 = 0 and V2 = D2 D3

[Figure: the toy workflow, with R and V entering f1 and D1, D2 entering f2, which outputs V1 and V2]

f1 is consistent with both f1, g1; f2 with both f2, g2; ⟨g1, g2⟩ is consistent with ⟨f1, f2⟩, ⟨f1, g2⟩ is not

**Our prior results (Davidson-Khanna-Panigrahi-Roy '10)**

Γ-InNetwork Privacy of Functions

Informal definition: Each input x of each function can be mapped to Γ different outputs by consistent functions from consistent sequences

- We give a
  (correct!) proof of the claim
- Use the above connection to find a minimum-cost data subset to hide that ensures privacy for every module
  - NP-complete

**Related Work**

- Access control in scientific workflows:
  - Chebotko et al. (2008), Gil et al. (2007, 2010)
  - But no formal notion of privacy or of the quality of the solution
- Privacy-preserving data mining techniques:
  - Formal analysis of privacy and utility in social networks, statistical databases, …
  - K-anonymity, L-diversity, differential privacy
  - Not exactly suited for workflow-related applications
    - Different query format
    - Adding noise may not be useful
- Secure provenance of workflows:
  - Braun et al., Hasan et al., ...

**Future Work and Open Problems**

**Future Work**

- Module Privacy (ongoing work)
  - How do we handle a combination of private and public modules?
- Data Privacy
  - Hiding a data value may not be enough: how much is revealed from the displayed data values?
- Provenance Privacy (ongoing work)
  - Reachability between pairs of modules is private
- Connect theory with practice

**Thank you**

Questions?
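Returning to the standalone-privacy example from the talk: the output counts quoted there can be checked by brute force. The sketch below is an illustration under one stated assumption, namely that an output y is "possible" for input x exactly when some input x' agreeing with x on the visible inputs has f(x') agreeing with y on the visible outputs. That reading reproduces the numbers on the slides, but the function names and this formulation are not taken from the talk.

```python
from itertools import product

INPUTS = ("R", "V")
OUTPUTS = ("D1", "D2", "D3")


def f(r, v):
    """Example module: D1 = R or V, D2 = not (R and V), D3 = not (R or V)."""
    return (int(r or v), int(not (r and v)), int(not (r or v)))


def possible_outputs(x, hidden):
    """Outputs that x could map to, given only the visible attributes (assumed reading)."""
    vis_in = [i for i, a in enumerate(INPUTS) if a not in hidden]
    vis_out = [i for i, a in enumerate(OUTPUTS) if a not in hidden]
    result = set()
    for x2 in product((0, 1), repeat=len(INPUTS)):
        if any(x2[i] != x[i] for i in vis_in):
            continue  # x2 differs from x on a visible input, so it cannot stand in for x
        y2 = f(*x2)
        for y in product((0, 1), repeat=len(OUTPUTS)):
            # hidden output positions are free; visible ones must match f(x2)
            if all(y[i] == y2[i] for i in vis_out):
                result.add(y)
    return result


def privacy_level(hidden):
    """Largest Gamma such that every input has at least Gamma possible outputs."""
    all_inputs = product((0, 1), repeat=len(INPUTS))
    return min(len(possible_outputs(x, hidden)) for x in all_inputs)


for hidden in ({"D2", "D3"}, {"V", "D3"}, {"R", "V"}):
    print(sorted(hidden), privacy_level(hidden))
```

Under this reading, hiding {D2, D3} or {V, D3} yields a privacy level of 4, while hiding {R, V} yields only 3, in line with the slides. Hiding two attributes is therefore not enough by itself; which two are hidden matters.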