The Challenge of Assuring Data Trustworthiness

The Challenge of Assuring Data Trustworthiness Elisa Bertino CERIAS and CS Department Purdue University bertino@cs.purdue.edu 1

Motivations • Data trustworthiness is critical for making “good” decisions • Few efforts have been devoted to investigate approaches for assessing how trusted the data are • No techniques exist able to protect against data deception 4

Approaches Integrity models and techniques From the security area: Biba Model Clark-Wilson Model Signature techniques Physical integrity Semantic integrity Data quality Reputation techniques 5

Challenges Data trustworthiness is a multi-faceted concepts • It means different things to different people or applications • The prevention of unauthorized and improper data modification • The quality of data • The consistency and correctness of data • Different definitions require different approaches. • Access control, workflows, information-flow, constraints, etc. • We need a unified perspective and approaches to manage and coordinate a variety of mechanisms 6

Challenges The trustworthiness of data is versatile • It is hard to quantify • It may change, independent from direct modifications • Time, real-world facts • Its implication may vary, depending on applications • High trustworthiness is always preferred • However, high trustworthiness often has high costs • We need flexible systems in which application-dependent policies can be specified and enforced 7

The Trust Fabric

Data Trustworthiness Assessment Based on Provenance in Data Streams An Example

DataStreams Everywhere where the data comes from = Data Provenance • New computing environments • Ubiquitous/mobile computing, embedded systems, and sensor networks • New applications • Traffic control systems monitoring data from mobile sensors • Location based services (LBSs) based on user's continuously changing location • e-healthcare systems monitoring patient medical conditions • Real-time financial analysis • What are we interested in? • Data is originated by multiple distributed sources • Data is processed by multiple intermediate agents Assessing data trustworthiness is crucial for mission critical applications • Knowing where the data comes from is crucial for assessing data trustworthiness 11

What is Provenance? • In general, the origin, or history of something is known as its provenance. • In the context of computer science, data provenancerefers to information documenting how data came to be in its current state - where it originated, how it was generated, and the manipulations it underwent since its creation. 12

Focus of Our Work assessed based on Data Trustworthiness Data Provenance in Data Stream Environments 13

An Example Application: Battlefield Monitoring Sensor Network Region A Region B Region C

What Makes It Difficult to Solve? • Data stream nature • Data arrives rapidly  real-time processing requirement  high performance processing • Unbounded in size  not possible to store the entire set of data items • Dynamic/adaptive processing • Sometimes, only approximate (or summary) data are available • Provenance nature • Annotation  increased as it is transmitted from the source to the server (i.e., snowballing effect) • Interpretation semantics differ from usual data • Network nature • Provenance processing in the intermediate node (e.g., provenance information can be merged/separated/manipulated) • Hierarchical structure for network and provenance 15

Our Solution : A Cyclic Framework for Assessing Data Trustworthiness

Modeling Sensor Networks and Data Provenance server node intermediate nodes terminalnodes • A sensor network be a graph, G(N,E) • N = { ni|ni is a network node of which identifier is i} : a set of sensor nodes • a terminal node generates a data item and sends it to one or more intermediate or server nodes • an intermediate node receives data items from terminal or intermediate nodes, and it passes them to intermediate or server nodes • a server node receives data items and evaluates continuous queries based on those items • E = { ei,j | e i,j is an edge connecting nodes ni and nj.} : a set of edges connecting sensor nodes • A data provenance, pd • pd is a subgraph of G (a) a physical network (b) a simple path (c) an aggregate path (d) an arbitrary graph

Assessing Trustworthiness  Computing Trust Scores data arrives incrementally in data stream environments trust score of the node affects the trust score of the data created by the node Node Trust Scores Data Trust Scores trust score of the data affects the trust score of the sensor nodes that created the data • Trust scores: quantitative measures of trustworthiness • Data trust scores: indicate about how much we can trust the data items • Node trust scores: indicate about how much we can trust the sensor nodes collect correct data Scores provide an indication about the trustworthiness of data items/sensor nodes and can be used for comparison or ranking purpose • Interdependency between data and node trust scores 18

A Cyclic Framework for Computing Trust Scores A set of data items of the same event in a current window 6 Current trust scores of nodes ( ) 1 Current trust scores of data items ( ) 2 Next trust scores of nodes ( ) 5 Intermediate trust scores of data items ( ) + + 3 Intermediate trust scores of nodes ( ) 4 Next trust scores of data items ( ) • Trust score of a data item d • The current trust score of d is the score computed from the current trust scores of its related nodes. • The intermediate trust score of d is the score computed from a set (d ) D of data items of the same event. • The next trust score of d is the score computed from its current and intermediate scores. • Trust score of a sensor node n • The intermediate trust score of n is the score computed from the (next) trust scores of data items. • The next trust score of n is the score computed from its current and intermediate scores. • The current trust score of n, is the score assigned to that node at the last stage. 19

Intermediate Trust Scores of Data (in more detail) A set of data items of the same event in a current window Current trust scores of nodes ( ) Current trust scores of data items ( ) 6 1 2 5 Next trust scores of nodes ( ) Intermediate trust scores of data items ( ) + + 3 Intermediate trust scores of nodes ( ) Next trust scores of data items ( ) 4 Data trust scores are adjusted according to the data value similarities and the provenance similarities of a set of recent data items (i.e., history) • The more data items have similar values, the higher the trust scores of these items are • Different provenances of similar data values may increase the trustworthiness of data items 20

Using Data Value and Provenance Similarities a probability density function , where x is the value of a data item d ) • Setting based on data value similarities • with the mean μand variance σ2 of the history data set D, we assume the current input data follow a normal distribution N (μ, σ2) • because the mean μ is determined by the majority values in D, • if x is close to the mean, it is more similar to the other values; • if x is far from the mean, it is less similar to the other values. • with this observation, we obtain the initial intermediate score of d (whose value is vd) as the integral area of f(x) 21

Using Data Value and ProvenanceSimilarities (cont’d) • Adjusting with provenance similarities • we define the similarity function between two provenances pi, pj as sim(pi, pj) • sim(pi, pj) returns a similarity value in [0, 1] • it can be computed from the tree or graph similarity measuring algorithms • from the observation of value and provenance similarities, given two data items d, tD, their values vd, vt , and their provenances pd, pt (here, notation ‘’ means “is similar to”, and notation ‘’ means “is not similar to”) • if pdpt and vdvt, the provenance similarity makes a small positive effect on ; • if pdpt and vdvt, the provenance similarity makes a large negative effect on ; • if pdpt and vd vt, the provenance similarity makes a large positive effect on ; • if pdpt and vdvt, the provenance similarity makes a small positive effect on ; • then, we first calculate the adjustable similarity between d and t, where dist(vd, vt) is a distance between two values, δ1 is a threshold that vd and vt are treated to be similar; δ2 is a threshold to be not similar • with the (normalized) sum of adjustable similarity of d, we adjust vd to

Computing Next Trust Scores The next trust core is computed as cdsd+ (1- cd) Where is constant ranging in [0,1] • If is small trust scores evolve fast • If it large trust scores evolve slowly • In the experiments we set it to 1/2

Incremental Evolution of Trust Scores • Two evolution schemes • Immediate mode • evolves trust scores whenever a new data item arrives • pros: provides high accurate trust scores • cons: incurs a heavy computation overhead, thus not feasible when the arrival rate of data items is very high • Batch mode • accumulates a certain amount of input data items, and then evolves trust scores only once for the accumulated data items • pros: reduces the computation overhead so as to make the cyclic framework scalable over the input rate of data items and the size of sensor networks • cons: the accuracy of trust scores can be low compared with the immediate mode • Batch mode in detail • Two stages: Stall Stage(data accumulation)/Evolution Stage(score evaluation) • The evolution stage is triggered when a threshold is reached • Use confidence interval concept to trigger the evolution only when the current status significantly changed from the last evaluation • we use a confidence level  as the threshold • trigger only when the mean of accumulated data falls out of the confidence interval of  in the normal distribution of the last evaluation stage • an example  = 95%

Experimental Evaluation • Simulation • Sensor network as an f-ary complete tree whose fanout and depth are f and h, respectively • Synthetic data that has a single attribute whose values follow a normal distribution with mean μi and variance σi2 for each event i (1 ≤ i ≤ Nevent) • Data items for an event are generated at Nassign leaf nodes and the interval between the assigned nodes is Ninterleave • The number of data items in windows (for evaluating intermediate trust scores) is ω < notation and default values > • Goal of the experiments • Showing efficiency and effectiveness of our cyclic framework • Showing efficiency and effectiveness of batch mode compared to immediate mode 25

Experiment 1Computation Efficiency of the Cyclic Framework • Measure the elapsed time for processing a data item with our cyclic framework • For showing scalability, we varies 1) the size of sensor networks (i.e., h) and 2) the number of data items for evaluating data trust scores (i.e., ω) • Shows affordable computation overhead and scalability both with the size of sensor network and the number of data items in windows 26

Experiment 2Effectiveness of the Cyclic Framework • Inject incorrect data items into the sensor network, and then observed the change of trust scores of data items • For observing the effect of provenance similarities, we vary the interleaving factor (i.e., Ninterleave)  if Ninterleave increases, the provenance similarity decreases • Graph (a) shows the changes in the trust scores when incorrect data items are injected, and Graph (b) shows when the correct data items are generated again • In both cases, we can see that our cyclic frame evolves trust scores correctly • The results also show that our principles • different values with similar provenance result in a large negative effect • similar values with different provenance result in a large positive effect are correct 27

Experiment 3Immediate vs. Batch • Measure the average elapsed time for processing a data item (for efficiency) and measure the average difference of trust scores (for effectiveness) • For showing the sensitivity on frequency of the evolution stage, we varies the batch threshold (i.e., confidence level )  the smaller  means a more frequent invocation of the evolution stage • From the results, we can see that • the performance advantage of the batch mode is high when  is large, and • the batch mode does not significantly reduce the accuracy compared with the immediate mode 28

Discussion • How do we use trust scores • Notion of confidence policy • Situation awareness • How do we improve data assessment • Use of semantic knowledge • Dynamic integration of new data sources, also heterogeneous • How do we deal with rapidly changing values • User awareness • Triggering additional actions, for example collecting more evidence • Sensor node sleep/awake times based on data trust scores (required and observed) • How do we securely convey provenance • Data watermarking techniques • How do we deal with privacy/confidentiality • Privacy-preserving data matching techniques

Conclusions • We address the problem of assessing data trustworthiness based on its provenance in sensor networks • We propose a solution providing a systematic approach for computing and evolving the trustworthiness levels of data items and network nodes • A similar approach is being used for assessing information about locations of individuals • Future work • more accurate computation of trust scores • secure delivery of provenance information • trust scores for aggregation and join in sensor networks • extend a streaming data management system with our techniques 30

Thank You! • Questions? • Elisa Bertino bertino@cs.purdue.edu • Funding sources: • NSF – Project “A Comprehensive Approach for Data Quality and Provenance in Sensor Networks” • NSF – Project “IPS: Security Services for Healthcare Applications” • Northrop Grumman NGIT Cybersecurity Research Consortium • Air Force Office of Scientific Research (AFOSR) – Project “Systematic Control and Management of Data Integrity, Quality and Provenance for Command and Control Applications” • The sponsors of CERIAS

The Challenge of Assuring Data Trustworthiness

The Challenge of Assuring Data Trustworthiness

Presentation Transcript

The Data Center Challenge

The risk data challenge

assuring

Authentication Trustworthiness

Assessing the Trustworthiness of Location Data Based on Provenance

Trustworthiness

Trustworthiness

Trustworthiness

Trustworthiness

Trustworthiness

Assuring Data Quality

Assuring the Quality of your COSF Data

Trustworthiness

Trustworthiness

The Challenge of the New Data

Analysis of trustworthiness of LEGO

Assuring the Quality of Data from the Child Outcomes Summary Form

The Challenge of Community LiDAR data

The Challenge of Teaching Data Visualization

Data Analysis for Assuring the Quality of your COSF Data

Assuring statistical quality of Administrative data

Trustworthiness Vocabulary