The Role of Agents in Distributed Data Mining: Issues and Benefits

The Role of Agents in Distributed Data Mining: Issues and Benefits. Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2. 1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany

Presentation Transcript


  1. The Role of Agents in Distributed Data Mining: Issues and Benefits Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2, Gianluca Moro 2, Claudio Sartori 2 1 Deduction and Multiagent Systems, German Research Center for Artificial Intelligence, Saarbruecken, Germany 2 Department of Electronics, Computer Science and Systems, University of Bologna, Bologna, Italy

  2. Distributed Data Mining (DDM) • Data sets • Massive • Inherently distributed • Networks • Limited bandwidth • Limited computing resources at nodes • Privacy and security • Sensitive data • Share goals, not data AgentLink III: TFG1 IIA4WE, Roma

  3. Centralized solution • Apply traditional DM algorithms to data retrieved from different sources and stored in a data warehouse • May be impractical or even impossible for some business settings • Autonomy of data sources • Data privacy • Scalability (~TB/d)

  4. Agents and DDM • DDM exploits distributed processing and problem decomposability • Is there any real added value of using concepts from agent technology in DDM? • Few DDM algorithms use agents • Evidence that cooperation among distributed DM processes may allow effective mining even without centralized control • Autonomy, adaptivity, deliberative reasoning naturally fit into the DDM framework

  5. State of the Art • BODHI • Mobile agent platform/Framework for collective DM on heterogeneous sites • PADMA • Clustering homogeneous sites • Agent based text classification/visualization • JAM • Metalearning, classifiers • Papyrus • Wide area DDM over clusters • Move data/models/results to minimize network load

  6. Agents for DDM (pros) • Autonomy of data sources • Scalability of DM to massive distributed data • Multi-strategy DDM • Collaborative DM

  7. Agents for DDM (against) • Need to enforce minimal privileges at a data source • Unsolicited access to sensitive data • Eavesdropping • Data tampering • Denial of service attacks

  8. The Inference Problem • Work in statistical DB (mid 70’s) • Integration/aggregation at the summary level is inherent in DDM • Infer sensitive data even from partial integration to a certain extent and with some probability (inference problem) • Existing DDM systems are not capable of coping with the inference problem

  9. Data Clustering • Popular problem • Statistics (cluster analysis) • Pattern Recognition • Data Mining • Decompose multivariate data set into groups of objects • Homogeneity within groups • Separation between groups

  10. DE-clustering • Clustering based on non-parametric density estimation • Construct an estimate of the probability density function from the data set • Objects “attracted” by a local maximum of the estimate belong to the same cluster
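
The idea of objects being "attracted" by a local maximum can be sketched in one dimension. This is only an illustration under stated assumptions, not the authors' algorithm: the estimate is an unnormalized Gaussian kernel sum, the climb uses fixed-size steps, and the names `density`, `climb`, `de_cluster`, `step` and `tol` are hypothetical.

```python
import math

def density(x, data, h):
    """Unnormalized Gaussian kernel density estimate at x (assumed kernel)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def climb(x, data, h, step=0.01, max_iter=10_000):
    """Move uphill in fixed steps until neither neighbour is higher."""
    for _ in range(max_iter):
        here = density(x, data, h)
        left = density(x - step, data, h)
        right = density(x + step, data, h)
        if left <= here and right <= here:
            break
        x = x - step if left > right else x + step
    return x

def de_cluster(data, h, step=0.01, tol=0.1):
    """Objects whose climbs end near the same local maximum share a label."""
    modes, labels = [], []
    for xi in data:
        m = climb(xi, data, h, step)
        for k, mode in enumerate(modes):
            if abs(m - mode) <= tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
labels = de_cluster(data, h=0.5)
```

With two well-separated groups of points, every object in a group climbs to the same maximum, so the groups receive distinct labels.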

  11. Kernel Density Estimation • The higher the number of data objects in the neighbourhood of x, the higher the density at x • A data object xi exerts more influence on the value of the estimate at x than any data object farther from x than xi • The influence of data objects is radial

  12. Formalizing Density Estimators • The density estimate at a space object x is proportional to a sum of weights • The sum consists of one weight for every data object • Weight is a monotonically decreasing function (kernel) of the distance between x and xi, scaled by a factor h (window width)
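
In symbols, the description above can be written as follows. The notation is an assumption made here for illustration: N data objects x_1, ..., x_N, a distance d, a kernel K and the window width h.

```latex
\hat{\varphi}(x) \;\propto\; \sum_{i=1}^{N} K\!\left(\frac{d(x, x_i)}{h}\right)
```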

  13. Kernel Functions • Uniform kernel

  14. Kernel Functions • Triangular kernel

  15. Kernel Functions • Epanechnikov’s kernel

  16. Kernel Functions • Gaussian kernel
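
The four kernels named on slides 13 to 16 can be sketched in their usual one-dimensional textbook forms. The normalizations used on the slides may differ, and the function names and the helper `kde` are hypothetical.

```python
import math

def uniform(u):
    """Uniform (square pulse) kernel: constant on [-1, 1]."""
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    """Triangular kernel: decreases linearly to zero at |u| = 1."""
    return 1 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    """Epanechnikov kernel: parabolic on [-1, 1]."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def gaussian(u):
    """Gaussian kernel: standard normal density, unbounded support."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h, kernel):
    """One weight per data object, scaled by the window width h."""
    return sum(kernel(abs(x - xi) / h) for xi in data) / (len(data) * h)
```

For example, `kde(0.0, [0.0, 10.0], 2.0, triangular)` counts only the object at 0, since the object at 10 lies outside the window.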

  17. Example (1/2) • Uniform kernel, h=250

  18. Example (2/2) • Gaussian kernel, h=250

  19. Distributed Data Clustering (1/2) • Clustering algorithm A( ) • Homogeneous distributed data clustering problem for A: • Data set S • Sites Lj • Lj stores data set Dj, with S = ∪j Dj

  20. Distributed Data Clustering (2/2) • Problem: find clustering Cj in the data space of Lj such that: • Cj agrees with A(S) (correctness requirement) • Time/communication costs are minimized (efficiency requirement) • The size of data transferred out of the data space of any Lj is minimized (privacy requirement)

  21. Traditional (centralized) solution • Gather all local data sets into one centralized repository (e.g., a data warehouse) • Run A( ) on the centralized data set • Unsatisfied privacy requirement • Unsatisfied efficiency requirement for some A( )

  22. Sampling • Goal: given some class of functions of type $f:\mathbb{R}\to\mathbb{R}$, represent every member as a sampling series $f(x)=\sum_k f(x_k)\,S_k(x)$, where: • $\{x_k\}$ is a collection of points of $\mathbb{R}$ • $\{S_k\}$ is some set of suitable expansion functions

  23. Example • The class of polynomials of degree 1 • Sampling points • Expansion functions • Finite sum
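
A polynomial of degree 1 is fully determined by its values at two points, so its sampling series is a finite sum of two terms. The sketch below assumes sampling points x0 = 0 and x1 = 1 and uses the two Lagrange basis functions as expansion functions. The points actually chosen on the slide are not known, and the names `s0`, `s1` and `reconstruct` are hypothetical.

```python
def s0(x, x0=0.0, x1=1.0):
    """Expansion function attached to sampling point x0 (equals 1 at x0, 0 at x1)."""
    return (x - x1) / (x0 - x1)

def s1(x, x0=0.0, x1=1.0):
    """Expansion function attached to sampling point x1 (equals 0 at x0, 1 at x1)."""
    return (x - x0) / (x1 - x0)

def reconstruct(f, x, x0=0.0, x1=1.0):
    """Finite sampling series: f(x0) * s0(x) + f(x1) * s1(x)."""
    return f(x0) * s0(x, x0, x1) + f(x1) * s1(x, x0, x1)

def f(x):
    """An arbitrary polynomial of degree 1."""
    return 3 * x + 2

value = reconstruct(f, 7.5)
```

Only the two samples f(0) and f(1) are used, yet the series reproduces f at any x.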

  24. Band-limited Functions • Function f of one real variable • Range of frequencies of a function f: the support of the Fourier transform of f • Any function whose range of frequencies is confined to a bounded set B is called band-limited to B (the band-region)

  25. Example: sinc function

  26. Sampling Theorem • If f is band-limited with band-region then
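
The classical one-dimensional sampling theorem can be illustrated numerically. The sketch assumes the normalized sinc, sin(πx)/(πx), the band-region [-π, π] and a unit sampling interval, so that f(t) = Σ f(k) sinc(t - k). The slide may use a different scaling. The test function and the truncation bound `K` are hypothetical choices.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def f(t):
    """sinc(t/2)^2 is band-limited to [-pi, pi] (assumed test function)."""
    return sinc(t / 2) ** 2

def sampling_series(func, t, K=200):
    """Truncated sampling series using the samples func(k), k = -K..K."""
    return sum(func(k) * sinc(t - k) for k in range(-K, K + 1))

approx = sampling_series(f, 0.3)
exact = f(0.3)
```

The series uses only integer samples of f, and the truncated sum already agrees closely with the exact value between sampling points.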

  27. Sampling Theorem (scaled multidimensional version) • Let where is the -th component of a vector • If f is band-limited to B then

  28. Sampling Density Estimates (1/4) • Additivity of density estimates of a distributed data set

  29. Sampling Density Estimates (2/4) • The sampling series of the density estimate is also additive where
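
Additivity can be checked directly: the unnormalized kernel sum over the whole data set equals the sum of the kernel sums over the local data sets, and the same holds sample by sample on a shared grid. The sketch assumes a Gaussian kernel and two hypothetical sites. `kernel_sum`, `D1`, `D2` and `grid` are illustrative names.

```python
import math

def kernel_sum(x, data, h):
    """Unnormalized Gaussian kernel sum at x (one weight per data object)."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

h = 1.0
D1 = [1.0, 1.5, 2.0]          # data held by a first site (hypothetical)
D2 = [6.0, 6.5]               # data held by a second site (hypothetical)
S = D1 + D2                   # the global data set, never assembled in practice

grid = [k * 0.5 for k in range(17)]   # shared sampling points 0.0 .. 8.0
samples1 = [kernel_sum(g, D1, h) for g in grid]
samples2 = [kernel_sum(g, D2, h) for g in grid]

# Summing local samples point by point gives the samples of the global estimate.
summed = [a + b for a, b in zip(samples1, samples2)]
direct = [kernel_sum(g, S, h) for g in grid]
```

Note that additivity holds for the unnormalized sum. Dividing each local estimate by its own local count before summing would break it.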

  30. Sampling Density Estimates (3/4) • Truncation errors • The support of a kernel function is not bounded in general • Aliasing errors • The support of the Fourier transform of a kernel function is not bounded in general, so kernel functions are not band-limited

  31. Sampling Density Estimates (4/4) • The sampling series of a density estimate can only be approximated • Trade-off between the number of samples and accuracy • Define a minimal multidimensional rectangle outside which samples are negligible • Define a vector of sampling intervals such that the aliasing error is negligible

  32. The KDEC scheme • Every site Lj: • Computes a local density estimate of its data Dj • Samples the local estimate • Sends the samples to H • Waits for the samples of the global density estimate • Reconstructs the global density estimate from its samples • Applies DE-clustering to Dj and the reconstructed global estimate • Helper H: • Waits for the samples of local density estimates • Sums the samples in order • Sends the summation back to each Lj
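
The message flow of the scheme can be simulated in a single process. This is a minimal sketch under stated assumptions, not the authors' implementation: one-dimensional data, an unnormalized Gaussian kernel, a sampling interval `tau` of half the window width, and sinc interpolation for the reconstruction. All names are hypothetical, and the final DE-clustering step is left out.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def local_samples(data, grid, h):
    """Site step: sample the local (unnormalized) density estimate on the grid."""
    return [sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in data) for g in grid]

def helper_sum(all_samples):
    """Helper step: add the sample vectors position by position."""
    return [sum(column) for column in zip(*all_samples)]

def reconstruct(x, samples, grid, tau):
    """Site step: rebuild the global estimate at x from its samples."""
    return sum(s * sinc((x - g) / tau) for s, g in zip(samples, grid))

h, tau = 1.0, 0.5
grid = [-5.0 + k * tau for k in range(41)]      # agreed sampling region -5 .. 15
sites = {"L1": [1.0, 1.5, 2.0], "L2": [6.0, 6.5, 7.0]}

sent = [local_samples(D, grid, h) for D in sites.values()]   # sites -> helper
global_samples = helper_sum(sent)                            # helper -> sites

x = 3.3
approx = reconstruct(x, global_samples, grid, tau)
exact = sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
            for D in sites.values() for xi in D)
```

Only the 41 sample values per site cross the network. Each site can then evaluate the global estimate anywhere in the sampling region and cluster its own data against it.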

  33-37. The KDEC scheme (diagram frames: Helper, Site1, Site2)

  38. Properties of the approach • Communication complexity depends only on the number of samples • Data objects are never transmitted over the network • Local clusters are close to global clusters which can be obtained using DE-cluster • Time complexity does not exceed the time complexity of centralized DE-clustering

  39. Window width and sampling frequency • Good estimates when h is not less than a small multiple of the smallest distance between objects • As , the number of samples rarely exceeds the number of data points

  40. Complexity • Site j • Sampling: O(q(N) Sam) • DE-cluster: O(|Dj|q(Dj)) • Helper • Summation of samples: O(Sam) • Communication • Time: O(Sam) • Volume: O(M Sam)

  41. Complexity (centralized approach) • Site j • Transmission/Reception of data objects: O(|Dj|) • Helper • Global DE-clustering: O(N q(N)) • Communication: • Time: O(N) • Volume: O(N)

  42. Stationary agent-based KDEC • The helper engages site agents to agree on: • Kernel function • Window width • Sampling frequencies • Sampling region • The global sampled form of the estimate is computed in a single stage

  43. Mobile agent-based KDEC • At site Ln the visiting agent: • Negotiates kernel function, window width, sampling frequencies, sampling region • Carries the sum of samples collected at Lm, m<n, in its data space • The global sampled form of the estimate is returned to the interested agents

  44. A Hierarchical Scheme • Additivity allows the scheme to be extended to trees of arbitrary arity • Local sampled density estimates are propagated upwards in partial sums, until the global sampled DE is computed at the root and returned to the leaves • May provide more protection against disclosure of DEs
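
Because the sampled estimates are additive, a mobile agent's running sum and a hierarchical propagation of partial sums yield the same global samples. The sketch below uses small integer sample vectors so the comparison is exact. The representation (leaves as lists, inner nodes as tuples) and all names are hypothetical.

```python
def add(a, b):
    """Add two sample vectors position by position."""
    return [x + y for x, y in zip(a, b)]

def itinerary_sum(site_samples):
    """Mobile agent: carry a running sum from site to site."""
    carried = [0] * len(site_samples[0])
    for samples in site_samples:
        carried = add(carried, samples)
    return carried

def tree_sum(node):
    """Hierarchy: a leaf is a list of samples, an inner node a tuple of children."""
    if isinstance(node, list):
        return node
    total = None
    for child in node:
        part = tree_sum(child)          # partial sum propagated upwards
        total = part if total is None else add(total, part)
    return total

leaves = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
tree = ((leaves[0], leaves[1]), (leaves[2],), leaves[3])   # mixed arity

by_agent = itinerary_sum(leaves)
by_tree = tree_sum(tree)
```

In the tree, an inner node only ever sees partial sums from its subtrees, which is the property the slide associates with better protection of individual estimates.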

  45. Inference and Trustworthiness • Inference problem for kernel density estimates • Goal of inference attacks: exploit information contained in a density estimate to infer the data objects • Trustworthiness of helpers • Trustworthy helper: no bit of information written to memory by a process for the Helper procedure is sent to a system peripheral by a different process

  46. Inference Attacks on Kernel Density Estimates • Let g be extensionally equal to a density estimate • For example, g is the reconstructed density estimate (sampling series)

  47. Inference Attacks on Kernel Density Estimates • Simple strategy: Search the density estimate or its derivatives for discontinuities • Example: The kernel is the square pulse • For each pair of projections of objects on an axis there is a pair of projections of discontinuities on that axis having the same distance as the objects’ projections • If h is known then the objects can be inferred easily • If the kernel has discontinuous derivatives, then the same technique applies to the derivatives

  48. Inference Attacks on Kernel Density Estimates • If g is not continuous at x, an object lies at distance h from x (example with h=250)
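
The discontinuity search can be sketched for a one-dimensional square-pulse kernel. The attacker is assumed to know h and to be able to evaluate g, but not to see the data. Each upward jump of g at position x then reveals an object at x + h. Integer coordinates, the scan range and the names `make_estimate` and `infer_objects` are hypothetical, and the scan would miss an object whose upward jump coincides with another object's downward jump.

```python
def make_estimate(data, h):
    """Unnormalized square-pulse estimate: counts objects within distance h."""
    def g(x):
        return sum(1 for xi in data if abs(x - xi) <= h)
    return g

def infer_objects(g, lo, hi, h, step=1):
    """Scan g for upward jumps; a jump at x reveals an object at x + h."""
    found = []
    prev = g(lo)
    x = lo + step
    while x <= hi:
        cur = g(x)
        for _ in range(cur - prev):     # a jump of size m reveals m objects
            found.append(x + h)
        prev = cur
        x += step
    return found

secret = [300, 1000, 1400]              # data objects hidden from the attacker
g = make_estimate(secret, h=250)
recovered = infer_objects(g, lo=0, hi=2000, h=250)
```

Here the scan recovers every object exactly, because the coordinates lie on the scan grid. With real-valued data the accuracy is limited by the step size.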

  49. Inference Attacks on Kernel Density Estimates • If the kernel is infinitely differentiable the problem is more difficult • Select space objects and attempt to solve a nonlinear system of equations

  50. Attack Scenarios • Single-site attack • One of the sites attempts to infer the data objects from the global density estimate • Unable to associate a specific data object to a specific site • Site coalition attack • A coalition computes the sum of the density estimates of all the other sites as a difference • Special case: the coalition includes all sites but one, so the attack potentially reveals the data objects at that site
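
The special case of the coalition attack follows directly from additivity: subtracting the coalition's own sampled estimates from the global samples leaves exactly the sampled estimate of the remaining site. The sketch assumes an unnormalized Gaussian kernel and three hypothetical sites, with L1 and L3 colluding against L2.

```python
import math

def local_samples(data, grid, h):
    """Sampled (unnormalized) Gaussian density estimate of one site."""
    return [sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in data) for g in grid]

h = 1.0
grid = [float(k) for k in range(11)]            # sampling points 0 .. 10
sites = {"L1": [1.0, 2.0], "L2": [5.0], "L3": [8.0, 9.0]}

local = {name: local_samples(data, grid, h) for name, data in sites.items()}
global_samples = [sum(column) for column in zip(*local.values())]

# The coalition knows its own local samples and the global ones.
coalition = ["L1", "L3"]
leaked = list(global_samples)
for name in coalition:
    leaked = [a - b for a, b in zip(leaked, local[name])]

peak_at = grid[leaked.index(max(leaked))]       # location of the victim's object
```

The leaked vector matches the local samples of L2, and in this toy case its peak already points at the single object held there.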
