
Data Mining: Extracting Knowledge from Past Data

Data Mining: Extracting Knowledge from Past Data. Ming-Syan Chen Network Database Laboratory Electrical Engineering Department National Taiwan University. Outline. An introduction to data mining Challenging issues on data mining. Data Mining. Data mining: Knowledge discovery in databases


Presentation Transcript


  1. Data Mining: Extracting Knowledge from Past Data Ming-Syan Chen Network Database Laboratory Electrical Engineering Department National Taiwan University

  2. Outline • An introduction to data mining • Challenging issues on data mining NTU

  3. Data Mining • Data mining: Knowledge discovery in databases • extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases • Relevant fields: AI, database, statistics • We are buried in data, but looking for knowledge NTU

  4. Knowledge Discovering Process [Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

  5. Mining Capabilities • Association • Classification • Clustering • Traversal patterns • Sequential patterns • and many others NTU

  6. E.g., Mining Association Rules • Transaction data analysis: mining association rules • Given: (1) a database of transactions; (2) each transaction (tx) has a list of items purchased • Find all association rules: the presence of one set of items implies the presence of another set of items in the same tx • Two primary approaches: (1) Apriori-based (2) FP-tree-based

  7. Two Parameters • Confidence (how true) • the rule X&Y => Z has 90% conf. means 90% of customers who bought X and Y also bought Z • Support (how useful the rule is) • useful rules should have some minimum tx support NTU
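
The two parameters above can be computed directly from a transaction list. The following is a minimal sketch, not the presenter's code: the function names `support` and `confidence` and the toy database `toy_db` are assumptions introduced for illustration only.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for tx in transactions if items <= tx) / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    lhs = set(antecedent)
    both = lhs | set(consequent)
    n_lhs = sum(1 for tx in transactions if lhs <= tx)
    if n_lhs == 0:
        raise ValueError("antecedent never occurs in the database")
    n_both = sum(1 for tx in transactions if both <= tx)
    return n_both / n_lhs


# Hypothetical toy database (not from the slides).
toy_db = [
    {"X", "Y", "Z"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"Y"},
]

# Rule X&Y => Z: 2 of 5 transactions contain X, Y and Z;
# 3 transactions contain X and Y, and 2 of those also contain Z.
rule_support = support(toy_db, {"X", "Y", "Z"})
rule_confidence = confidence(toy_db, {"X", "Y"}, {"Z"})
```

A rule is normally reported only when both values clear user-chosen minimum thresholds.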

  8. Applications • Propose industry-specific applications based on the needs of different industries

  9. Remarks • Data mining is very application dependent • Small team with good skill and domain knowledge • Lots of work has been done in other areas • Emerging issues: • Journals: ACM TODS, ACM TKDD (from 2007), IEEE TKDE, DMKD, KAIS, IS, Pattern Recognition • Conferences: SIGKDD, ICDM (from 2001), SIAM-SDM (from 2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR, WI, PAKDD, etc.

  10. What is the Next for Data Mining • Privacy-preserving mining • Data stream mining • Mining for bioinformatics • Mining to assist content-based data management NTU

  11. Data Streams: Computation Model [Figure: data streams enter a stream processing engine, which maintains a synopsis in memory and outputs an (approximate) answer] • Stream processing requirements • Single pass: each record is examined at most once • Bounded storage: limited memory for storing the synopsis • Real-time: per-record processing time must be low
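
One way to see the three requirements together is a frequent-items summary. The slide does not name an algorithm; the sketch below uses the Misra-Gries summary as an illustrative stand-in. It reads each record once, keeps at most k - 1 counters as the in-memory synopsis, and returns approximate counts. The function name `misra_gries` and the sample stream are assumptions.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Single-pass frequent-items summary with at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n is
    guaranteed to appear in the result; the stored counts are lower bounds.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:                      # single pass over the stream
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:          # bounded storage
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream in which "a" is the only heavy item.
sample_stream = ["a", "b", "a", "c", "a", "d", "a"]
summary_k2 = misra_gries(sample_stream, k=2)
summary_k3 = misra_gries(sample_stream, k=3)
```

The decrement step touches every counter, but each decrement cancels an earlier increment, so the amortized cost per record stays constant. The answer is approximate: counts underestimate the true frequencies by at most n/k.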

  12. Outline • An introduction to data mining • Challenging issues on data mining

  13. Challenging Issues for Data Mining • Identifying data source for desired knowledge • Mining purposes: knowledge or auxiliary meta data • Data collection methods (in Web, wireless, tx) • Different types of data from different environment • Usefulness and certainty of mining results • Support and confidence • Interactive mining with different data granularities • e.g., generalized association rules NTU

  14. Issues (cont’d) • Mining in data streaming environments • Look at data only once; the amount of data is huge • incremental mining (temporal and spatial) • Efficiency and scalability of mining algorithms • Sampling methods (frequency tuned wrt data or wrt result accuracy) • Hardware-enhanced mining • E.g., PDA, STB, devices for LBS NTU

  15. Issues (cont’d) • Interestingness of mining results • Have to know the original likelihood • Evaluation of mining results • How to measure the advantage gained • Expression of various kinds of mining results • Protection of privacy and data security • Data hiding NTU

  16. Ongoing Works in NetDB Lab • Web usage mining • Web content mining • Mining in mobile environments • Scalable clustering techniques tuned with domain knowledge • Incremental mining (temporal and spatial) • Hardware-enhanced mining NTU

  17. Summary • Data mining is an area of growing importance • Increasing demand for intelligence • Fast advance in IT techniques • Mining will be of increasing impact to Web and wireless applications. • Huge amount of digital data • Nature of applications and their users NTU

  18. [Figure: data mining system architecture, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; database and data warehouse; with a knowledge base alongside]

  19. Incremental Mining • Due to the increasing use of record-based databases, recent important applications call for incremental mining • Such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few

  20. Incremental Mining • To mine the transaction database for a fixed amount of most recent data (say, data in the last 12 months) • One must not only add the new data (i.e., data in the new month) to the mining process, but also remove the old data (i.e., data in the most obsolete month) from it.
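
The add-new, drop-oldest bookkeeping can be sketched as a sliding window over per-month partitions. This is a simplified illustration, not the lab's algorithm: it tracks single-item counts only, whereas a real incremental miner would maintain candidate itemsets. The class name `SlidingWindowItemCounts` and its methods are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Iterable, List, Set, Tuple


class SlidingWindowItemCounts:
    """Item counts over the most recent `window_size` partitions (e.g. months)."""

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self._partitions: Deque[Tuple[Counter, int]] = deque()
        self.counts: Counter = Counter()
        self.n_transactions = 0

    def add_partition(self, transactions: Iterable[Set[str]]) -> None:
        """Add one new partition and evict the oldest if the window is full."""
        part: Counter = Counter()
        n = 0
        for tx in transactions:
            part.update(set(tx))             # count each item once per tx
            n += 1
        self._partitions.append((part, n))
        self.counts.update(part)
        self.n_transactions += n
        if len(self._partitions) > self.window_size:
            old, old_n = self._partitions.popleft()
            self.counts.subtract(old)        # remove the most obsolete data
            self.counts = +self.counts       # drop items whose count hit zero
            self.n_transactions -= old_n

    def frequent_items(self, min_support: float) -> List[str]:
        """Items whose support within the current window is >= min_support."""
        if self.n_transactions == 0:
            return []
        return sorted(
            item
            for item, c in self.counts.items()
            if c / self.n_transactions >= min_support
        )


# Hypothetical data: a window of two months, fed three months of transactions.
window = SlidingWindowItemCounts(window_size=2)
window.add_partition([{"a", "b"}, {"a"}])    # month 1
window.add_partition([{"b"}])                # month 2
window.add_partition([{"c"}, {"c"}])         # month 3 evicts month 1
```

Keeping a separate counter per partition means eviction costs only as much as the evicted month, with no rescan of the months that remain in the window.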

  21. E.g., Redundant Rules • For the same support and confidence, if we have a rule {a,d}=>{c,e,f,g}, what else do we have? • {a,d}=>{c,e,f} • {a}=>{c,e,f,g} • {a,d,c}=>{e,f,g} • {a}=>{d,c,e,f,g}
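
The slide leaves the question open. Since the support of an itemset can only stay the same or grow when items are removed from it, {a,d}=>{c,e,f} and {a,d,c}=>{e,f,g} always reach at least the support and confidence of the original rule. The two rules with the smaller antecedent {a} keep the support, but their confidence can drop. The sketch below checks this on a hypothetical six-transaction database; `rule_stats` and `db` are illustrative names, and the data is made up.

```python
from typing import List, Set, Tuple


def rule_stats(db: List[Set[str]], lhs: Set[str], rhs: Set[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule lhs => rhs over db."""
    n_lhs = sum(1 for tx in db if lhs <= tx)
    if n_lhs == 0:
        raise ValueError("antecedent never occurs in the database")
    n_both = sum(1 for tx in db if (lhs | rhs) <= tx)
    return n_both / len(db), n_both / n_lhs


# Hypothetical database chosen so that {a} occurs more often than {a, d}.
db = [
    {"a", "d", "c", "e", "f", "g"},
    {"a", "d", "c", "e", "f", "g"},
    {"a", "d", "c", "e", "f"},
    {"a"},
    {"a"},
    {"a"},
]

base = rule_stats(db, {"a", "d"}, {"c", "e", "f", "g"})          # sup 2/6, conf 2/3
smaller_rhs = rule_stats(db, {"a", "d"}, {"c", "e", "f"})        # sup 3/6, conf 3/3
moved_item = rule_stats(db, {"a", "d", "c"}, {"e", "f", "g"})    # sup 2/6, conf 2/3
smaller_lhs = rule_stats(db, {"a"}, {"c", "e", "f", "g"})        # sup 2/6, conf 2/6
smallest_lhs = rule_stats(db, {"a"}, {"d", "c", "e", "f", "g"})  # sup 2/6, conf 2/6
```

On this data the first two derived rules meet any thresholds the original rule meets, so reporting them adds nothing, while the two rules with antecedent {a} fall to a confidence of 1/3.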

  22. E.g., Generalized Asso. Rules • Which data granularities should be used for data mining • To mine meaningful rules (proper data units) and be as specific as possible • similar dilemma for other mining capabilities NTU

  23. Taxonomy:
        Clothes → Outerwear, Shirts
        Outerwear → Jackets, Ski Pants
        Footwear → Shoes, Hiking Boots

      Database:
        Tx    Items bought
        100   Shirt
        200   Jacket, Hiking Boots
        300   Ski Pants, Hiking Boots
        400   Shoes
        500   Shoes
        600   Jacket

      Frequent itemsets:
        Itemset                    Support
        Jacket                     2
        Outerwear                  3
        Clothes                    4
        Shoes                      2
        Hiking Boots               2
        Footwear                   4
        Outerwear, Hiking Boots    2
        Clothes, Hiking Boots      2
        Outerwear, Footwear        2
        Clothes, Footwear          2

      Rules (min. sup 30%, min. conf 60%):
        Rule                        Sup    Conf
        Outerwear → Hiking Boots    33%    66%
        Outerwear → Footwear        33%    66%
        Hiking Boots → Outerwear    33%    100%
        Hiking Boots → Clothes      33%    100%

      However, the more specific rules fall below the minimum support:
        Jacket → Hiking Boots       16%    50%
        Ski Pants → Hiking Boots    16%    100%
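
The counts on this slide can be reproduced by extending each transaction with the ancestors of its items before counting. The sketch below is one way to do that, not the presenter's code; the names `PARENT`, `ancestors`, `extend`, `count`, and `rule` are assumptions, and item names use the singular forms from the database table.

```python
from typing import Dict, Set, Tuple

# Taxonomy from the slide, stored as child -> parent.
PARENT: Dict[str, str] = {
    "Jacket": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirt": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# Transaction database from the slide.
DB: Dict[int, Set[str]] = {
    100: {"Shirt"},
    200: {"Jacket", "Hiking Boots"},
    300: {"Ski Pants", "Hiking Boots"},
    400: {"Shoes"},
    500: {"Shoes"},
    600: {"Jacket"},
}


def ancestors(item: str) -> Set[str]:
    """All taxonomy ancestors of an item (empty for a root)."""
    found: Set[str] = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found


def extend(tx: Set[str]) -> Set[str]:
    """A transaction plus every ancestor of every item in it."""
    extended = set(tx)
    for item in tx:
        extended |= ancestors(item)
    return extended


def count(itemset: Set[str]) -> int:
    """Number of extended transactions containing the whole itemset."""
    return sum(1 for tx in DB.values() if itemset <= extend(tx))


def rule(lhs: Set[str], rhs: Set[str]) -> Tuple[float, float]:
    """(support, confidence) of lhs => rhs over the extended database."""
    both = count(lhs | rhs)
    return both / len(DB), both / count(lhs)


general = rule({"Outerwear"}, {"Hiking Boots"})    # support 2/6, confidence 2/3
specific = rule({"Jacket"}, {"Hiking Boots"})      # support 1/6, confidence 1/2
```

The general rule clears a 30% minimum support while the specific one does not, which is the granularity dilemma the example illustrates.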

  24. E.g., Interestingness of Rules • In a school of 5000 students • 60% (3000) play basketball and 75% (3750) eat cereal; and 40% (2000) do both • Say, minimal sup is 2000 and min conf is 60%, one gets the rule • “play basketball => eat cereal” so ... does that mean promoting the basketball activities will help the sales of cereal? NTU

  25. Interestingness (Cont’d) • In fact, P(A and B)/P(A) should be greater than P(B) for the rule “A=>B” to be interesting • What about the rule {A,K}=>{B,L,V}: when is it interesting?
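
Applying this criterion to the school example (5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both) takes only a few divisions. The ratio of confidence to P(B) is commonly called lift; that name and the function `interestingness` are not from the slides. A plausible reading for a multi-item rule such as {A,K}=>{B,L,V} is the same test with the itemsets in place of A and B, but that is an assumption, since the slide leaves the question open.

```python
from typing import Tuple


def interestingness(n_total: int, n_a: int, n_b: int, n_ab: int) -> Tuple[float, float, float]:
    """Return (confidence of A => B, P(B), lift).

    The rule counts as interesting under the slide's criterion when
    confidence > P(B), i.e. when lift > 1.
    """
    conf = n_ab / n_a          # P(A and B) / P(A)
    p_b = n_b / n_total        # baseline likelihood of B
    return conf, p_b, conf / p_b


# Numbers from the school example: A = plays basketball, B = eats cereal.
conf, p_b, lift = interestingness(n_total=5000, n_a=3000, n_b=3750, n_ab=2000)
```

Here the confidence is about 66.7%, which passes the 60% threshold but is lower than the 75% of all students who eat cereal. The lift is below 1, so playing basketball is associated with eating cereal less often than average, and the rule is misleading despite meeting both thresholds.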

  26. Related Training • Database • AI: machine learning • Statistics NTU
