
Data Mining Algorithms for Large-Scale Distributed Systems


Presentation Transcript


  1. Data Mining Algorithms for Large-Scale Distributed Systems • Presenter: Ran Wolff • Joint work with Assaf Schuster • 2003

  2. What is Data Mining? • The automatic analysis of large databases • The discovery of previously unknown patterns • The generation of a model of the data

  3. Main Data Mining Problems • Association rules – description: “he who does this and that will usually do some other thing too” • Classification – fraud, churn: “these attributes indicate good behavior, those indicate bad behavior” • Clustering – analysis: “there are three types of entities”

  4. Examples – Classification • Customers purchase artifacts in a store • Each transaction is described in terms of a vector of features • The owner of the store tries to predict which transactions are fraudulent • Example: young men who buy small electronics during rush hours • Solution: do not accept checks

  5. Examples – Associations • Amazon tracks user queries • Suggests to each user additional books he would likely be interested in • A supermarket finds out that “people who buy diapers also buy beer” • Place diapers and beer at opposite sides of the supermarket

  6. Examples – Clustering • Resource location • Find the best location for k distribution centers • Feature selection • Find 1000 concepts which summarize a whole dictionary • Extract the meaning of a document by replacing each word with the appropriate concept • “car” for “auto”, etc.

  7. Why Mine Data of LSD Systems? • Data mining is good • It is otherwise difficult to monitor an LSD system: lots of data, spread across the system, impossible to collect • Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to monitor just a few nodes

  8. An Example • Peers in the Kazaa network reveal to the system which files they have on their disks, in exchange for access to the files of their peers • The result is a database of the recreational preferences of 2M peers • Mining it, you could discover that Matrix fans are also keen on Radiohead songs • Promote RH performances in Matrix Reloaded • Ask RH to write the music for Matrix IV

  9. What is so special about this problem? • Huge systems – huge amounts of data • Dynamic setting • System – nodes join / depart • Data – constantly updated • Ad-hoc solution • Fast convergence

  10. Our Work • We developed an association rule mining algorithm that works well in LSD systems • Local and therefore scalable • Asynchronous and therefore fast • Dynamic and therefore robust • Accurate – not approximate • Anytime – you get early results fast

  11. In a Teaspoon • A distributed data mining algorithm can be described as a series of distributed decisions • These decisions are reduced to majority votes • We developed a majority voting protocol which has all of those good qualities • The outcome is an LSD association rule mining algorithm (still to come: classification)

  12. Problem Definition – Association Rule Mining (ARM)
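  Only the slide title survives in the transcript; the formal statement was in the slide graphics. For reference, the standard formulation of ARM, in notation assumed here rather than taken from the slides:

      supp(X) = |\{ t \in DB : X \subseteq t \}|
      an itemset X is frequent             \iff  supp(X) \ge MinFreq \cdot |DB|
      a rule X \Rightarrow Y is confident  \iff  supp(X \cup Y) \ge MinConf \cdot supp(X)

  The ARM problem: given DB, MinFreq and MinConf, find every rule X \Rightarrow Y whose itemset X \cup Y is frequent and which is confident.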

  13. Solution to Traditional ARM
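  Again only the title is in the transcript. The traditional, single-database solution it refers to is presumably Apriori-style level-wise mining; a minimal Python sketch, assuming set-valued transactions (all names here are illustrative, not the slide's):

      def apriori(transactions, min_freq):
          """Level-wise frequent-itemset mining; transactions are sets of items."""
          n = len(transactions)
          items = {i for t in transactions for i in t}
          level = [frozenset([i]) for i in items]
          frequent = []
          while level:
              # One scan of the database counts the support of every candidate.
              counts = {c: sum(1 for t in transactions if c <= t) for c in level}
              level = [c for c, cnt in counts.items() if cnt >= min_freq * n]
              frequent.extend(level)
              # Join frequent k-itemsets into (k+1)-candidates.  Classical Apriori
              # also prunes candidates that have an infrequent subset.
              k1 = len(level[0]) + 1 if level else 0
              level = list({a | b for a in level for b in level if len(a | b) == k1})
          return frequent

  The point of the following slides is that this scheme assumes one node can scan the whole database, which is exactly what an LSD system does not allow (slide 7: the data is spread across the system and impossible to collect).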

  14. Large-Scale Distributed ARM

  15. Solution of LSD-ARM • No termination • Anytime solution • Recall • Precision
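  Because the algorithm never terminates, the ad-hoc rule set R held by a node is judged against the true rule set R* with the usual measures (the 95% / 10% figures of slide 29 are stated in these terms):

      recall    = |R \cap R^*| / |R^*|    (the fraction of the true rules already discovered)
      precision = |R \cap R^*| / |R|      (the fraction of the reported rules that are true)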

  16. Majority Vote in LSD Systems • An unknown number of nodes vote 0 or 1 • Nodes may dynamically change their vote • Edges are dynamically added / removed • An infrastructure • detects failures • ensures message integrity • maintains a communication forest • Each node should decide whether the global majority is of 0s or of 1s
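  In symbols, with each node u casting a vote s^u \in \{0,1\} of weight c^u = 1, every node must decide the sign of the same global quantity, where \lambda is the majority ratio (1/2 for a simple majority; the general \lambda is what the ARM reduction of slides 26-27 needs):

      decide whether   \sum_u s^u - \lambda \sum_u c^u   is positive (majority of 1s) or not

  Per slide 20, a tie counts as a negative decision.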

  17. Majority Vote in LSD Systems – cont. • Because of the dynamic setting, the algorithm never terminates • Instead, we measure the percentage of correct outputs • In static periods that percentage ought to converge to 100% • In stationary periods we will show it converges to a different percentage • Assume the overall percentage of ones remains the same, but the nodes holding them constantly switch

  18. LSD-Majority Algorithm • Nodes communicate by exchanging messages <s, c> • Node u maintains: • su – its vote, cu – one (for now) • <suv, cuv> – the last <s, c> it had sent to v • <svu, cvu> – the last <s, c> it had received from v
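  A sketch of this per-node state in Python; the field names follow the slide's <s, c> notation, while the class itself and the neighbors set (maintained by the infrastructure of slide 16) are illustrative:

      class Node:
          """Per-node state of the LSD-Majority protocol (sketch)."""
          def __init__(self, vote):
              self.s = vote           # su  - the node's own vote (0 or 1)
              self.c = 1              # cu  - one, "for now"
              self.sent = {}          # v -> (s_uv, c_uv), the last <s, c> sent to v
              self.received = {}      # v -> (s_vu, c_vu), the last <s, c> received from v
              self.neighbors = set()  # current communication-tree neighbors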

  19. LSD-Majority – cont. • Node u calculates two quantities: • one that captures the current knowledge of u • one that captures the current agreement between u and v
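  The two formulas themselves were part of the slide graphics and did not survive the transcript. A reconstruction in the <s, c> notation of slide 18, with \lambda the majority ratio (treat the exact definitions as an assumption):

      \Delta^{u}  = \Big( s^u + \sum_{v \in N(u)} s^{vu} \Big) - \lambda \Big( c^u + \sum_{v \in N(u)} c^{vu} \Big)
                    (the current knowledge of u: its own vote plus everything its neighbors reported)

      \Delta^{uv} = ( s^{uv} + s^{vu} ) - \lambda ( c^{uv} + c^{vu} )
                    (the current agreement between u and v: the last messages exchanged on that edge)

  u's ad-hoc output is "the majority is of 1s" whenever \Delta^{u} is positive.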

  20. LSD-Majority – Rationale • It is OK if the current knowledge of u is more extreme than what it has agreed upon with v • The opposite is not OK: v might assume u supports its decision more strongly than u actually does • Tie breaking prefers a negative decision
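  In the reconstructed \Delta notation (again an assumption, not the slide's own formula), this rationale becomes the condition under which u must send v a fresh message:

      send to v  \iff  ( \Delta^{uv} > 0 \;\wedge\; \Delta^{uv} > \Delta^{u} )  \;\vee\;  ( \Delta^{uv} \le 0 \;\wedge\; \Delta^{uv} < \Delta^{u} )

  that is, whenever the agreement claims a stronger position than u's own knowledge supports, on either side of the threshold (with a tie treated as a negative position).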

  21. LSD-Majority – The Protocol

  22. LSD-Majority – The Protocol • The same decision procedure is applied whenever • a message is received • su changes • an edge fails or recovers
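  The protocol body itself was a figure on the slide. A Python sketch of the event handler, reusing the Node state sketched after slide 18 and the reconstructed \Delta quantities; the send condition and the message contents are assumptions consistent with slides 19-20, not the slide's own pseudocode:

      LAMBDA = 0.5  # majority ratio; 1/2 gives a simple majority (slides 26-27 use MinFreq / MinConf)

      def knowledge(node):
          # Delta_u: the node's own vote plus everything its neighbors reported.
          s = node.s + sum(sv for sv, cv in node.received.values())
          c = node.c + sum(cv for sv, cv in node.received.values())
          return s - LAMBDA * c

      def agreement(node, v):
          # Delta_uv: the last messages exchanged between u and v.
          s_uv, c_uv = node.sent.get(v, (0, 0))
          s_vu, c_vu = node.received.get(v, (0, 0))
          return (s_uv + s_vu) - LAMBDA * (c_uv + c_vu)

      def on_event(node, send):
          """Run after every event: the infrastructure has already updated
          node.received (message arrival), node.s (vote change), or
          node.neighbors (edge failure / recovery)."""
          delta_u = knowledge(node)
          for v in node.neighbors:
              delta_uv = agreement(node, v)
              # Slide 20: knowledge more extreme than the agreement is fine;
              # if the agreement is the more extreme one, u must update v.
              if (delta_uv > 0 and delta_uv > delta_u) or \
                 (delta_uv <= 0 and delta_uv < delta_u):
                  # Tell v everything u knows, except what v itself reported.
                  s_new = node.s + sum(s for w, (s, _) in node.received.items() if w != v)
                  c_new = node.c + sum(c for w, (_, c) in node.received.items() if w != v)
                  node.sent[v] = (s_new, c_new)
                  send(v, (s_new, c_new))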

  23. LSD-Majority – Example

  24. LSD-Majority Results

  25. Proof of Correctness • Will be given in class

  26. Back from Majority to ARM • To decide whether an itemset is frequent or not
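  The formula here was in the slide graphics; presumably it is the observation that, with the database partitioned into DB_u across the nodes,

      X is frequent  \iff  \sum_u supp_u(X) \ge MinFreq \cdot \sum_u |DB_u|

  which is exactly a \lambda-majority vote with s^u = supp_u(X), c^u = |DB_u| and \lambda = MinFreq (this is the point at which c^u stops being "one").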

  27. Back from Majority to ARM • To decide whether a rule is confident or not
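  Likewise (again a reconstruction rather than the slide's own formula):

      X \Rightarrow Y is confident  \iff  \sum_u supp_u(X \cup Y) \ge MinConf \cdot \sum_u supp_u(X)

  a majority vote with s^u = supp_u(X \cup Y), c^u = supp_u(X) and \lambda = MinConf.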

  28. Additionally • Create candidates based on the ad-hoc solution • Create rules on-the-fly rather than upon termination • Our algorithm outputs the correct rules without specifying their global frequency and confidence

  29. Eventual Results • By the time the database has been scanned once, in parallel, the average node has discovered 95% of the rules, and fewer than 10% of its reported rules are false.
