Data Mining: Extracting Knowledge from Past Data

Ming-Syan Chen, Network Database Laboratory, Electrical Engineering Department, National Taiwan University

Outline • An introduction to data mining • Challenging issues on data mining NTU

Data Mining • Data mining: knowledge discovery in databases • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases • Relevant fields: AI, databases, statistics • We are buried in data, but looking for knowledge

Knowledge Discovery Process (figure) • Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge

Mining Capabilities • Association • Classification • Clustering • Traversal patterns • Sequential patterns • and many others

E.g., Mining Association Rules • Transaction data analysis: mining association rules • Given: (1) a database of transactions; (2) each transaction (tx) has a list of items purchased • Find all association rules: the presence of one set of items implies the presence of another set of items in the same tx • Two primary approaches: (1) Apriori-based; (2) FP-tree-based
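As a rough illustration of the Apriori-based approach named on this slide, the sketch below finds frequent itemsets level by level: candidates of size k are joined from frequent itemsets of size k-1, pruned if any subset is infrequent, and then counted in one scan of the database. The function name `apriori_frequent_itemsets`, the `min_count` parameter, and the toy transactions are assumptions made for illustration; they do not come from the slides.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy database of transactions (not from the slides).
toy_db = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori_frequent_itemsets(toy_db, min_count=3)
```

With `min_count=3`, the four single items and the pair {bread, milk} are frequent in this toy database; every other pair appears in only two transactions.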

Two Parameters • Confidence (how true the rule is) • A rule X&Y => Z with 90% confidence means that 90% of customers who bought X and Y also bought Z • Support (how useful the rule is) • Useful rules should have some minimum tx support
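The two parameters can be computed directly from transaction counts. The following is a minimal sketch; the helper names `count`, `support`, and `confidence` and the toy data are assumptions for illustration only.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))

def support(transactions, itemset):
    """Fraction of all transactions that contain itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing antecedent, the fraction that also contain consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    n_both = count(transactions, frozenset(antecedent) | frozenset(consequent))
    return n_both / n_antecedent

# Hypothetical data: 10 customers bought X and Y, 9 of them also bought Z,
# and 10 other customers bought something unrelated.
toy_db = [{"X", "Y", "Z"}] * 9 + [{"X", "Y"}] + [{"W"}] * 10

rule_confidence = confidence(toy_db, {"X", "Y"}, {"Z"})  # 9 / 10
rule_support = support(toy_db, {"X", "Y", "Z"})          # 9 / 20
```

Here the rule X&Y => Z has 90% confidence, matching the slide's example, and a support of 45% of all transactions.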

Applications • Industry-specific applications are proposed according to the needs of different industries

Remarks • Data mining is very application dependent • A small team with good skills and domain knowledge • Lots of work has been done in other areas • Emerging issues: • Journals: ACM TODS, ACM TKDD (from 2007), IEEE TKDE, DMKD, KAIS, IS, Pattern Recognition • Conferences: SIGKDD, ICDM (from 2001), SIAM-SDM (from 2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR, WI, PAKDD, etc.

What Is Next for Data Mining • Privacy-preserving mining • Data stream mining • Mining for bioinformatics • Mining to assist content-based data management

Data Streams: Computation Model • Figure: data streams enter a stream processing engine, which keeps a synopsis in memory and outputs an (approximate) answer • Stream processing requirements • Single pass: each record is examined at most once • Bounded storage: limited memory for storing the synopsis • Real-time: per-record processing time must be low
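To make the three requirements concrete, here is a sketch of one well-known synopsis that satisfies them, the Misra-Gries frequent-items summary. The slides do not name this technique; it is swapped in purely as an example. It reads each record once, keeps at most k - 1 counters, and returns an approximate answer: any item that occurs more than n/k times in a stream of n records is guaranteed to remain in the summary, with its count possibly underestimated.

```python
def misra_gries(stream, k):
    """One-pass frequent-items synopsis using at most k - 1 counters."""
    counters = {}
    for item in stream:                  # single pass over the stream
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:      # bounded storage
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            # This step costs O(k), but it is amortized over the increments.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream: "a" occurs 6 times out of 10 records.
toy_stream = list("aabacadaea")
summary = misra_gries(toy_stream, k=3)
```

The counts in `summary` are lower bounds on the true frequencies, which is the kind of approximate answer the computation model on the slide allows.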

Outline • An introduction to data mining • Challenging issues on data mining

Challenging Issues for Data Mining • Identifying the data source for desired knowledge • Mining purposes: knowledge or auxiliary metadata • Data collection methods (in Web, wireless, tx) • Different types of data from different environments • Usefulness and certainty of mining results • Support and confidence • Interactive mining with different data granularities • E.g., generalized association rules

Issues (cont’d) • Mining in data streaming environments • Look at data only once; the amount of data is huge • Incremental mining (temporal and spatial) • Efficiency and scalability of mining algorithms • Sampling methods (frequency tuned with respect to the data or to result accuracy) • Hardware-enhanced mining • E.g., PDA, STB, devices for LBS

Issues (cont’d) • Interestingness of mining results • Have to know the original likelihood • Evaluation of mining results • How to measure the advantage gained • Expression of various kinds of mining results • Protection of privacy and data security • Data hiding

Ongoing Works in NetDB Lab • Web usage mining • Web content mining • Mining in mobile environments • Scalable clustering techniques tuned with domain knowledge • Incremental mining (temporal and spatial) • Hardware-enhanced mining

Summary • Data mining is an area of growing importance • Increasing demand for intelligence • Fast advances in IT techniques • Mining will have an increasing impact on Web and wireless applications • Huge amount of digital data • Nature of applications and their users

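Figure: components of a data mining system, from the user interface down to the data sources • Graphical user interface • Pattern evaluation • Knowledge base • Data mining engine • Database or data warehouse server • Data cleaning, data integration, filtering • Database, data warehouse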

Incremental Mining • With the increasing use of record-based databases, important recent applications call for incremental mining • Such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few

Incremental Mining • Goal: mine the transaction database over a fixed amount of the most recent data (say, the data of the last 12 months) • One must not only include new data (i.e., data of the new month) in the mining process, but also remove old data (i.e., data of the most obsolete month) from it
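The add-new, remove-old bookkeeping described above can be sketched with per-partition counts kept in a sliding window. This is a minimal illustration that tracks single-item counts only; the class name `SlidingWindowCounts`, its methods, and the toy data are assumptions, not the lab's actual algorithm.

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Item counts over the most recent window_size partitions (e.g., months)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.partitions = deque()   # one Counter per partition, oldest first
        self.totals = Counter()     # counts over the current window

    def add_partition(self, transactions):
        # Count each item once per transaction in the new partition.
        part = Counter()
        for t in transactions:
            part.update(set(t))
        self.partitions.append(part)
        self.totals.update(part)            # include the new data
        if len(self.partitions) > self.window_size:
            oldest = self.partitions.popleft()
            self.totals.subtract(oldest)    # remove the most obsolete data
            for item in oldest:
                if self.totals[item] <= 0:
                    del self.totals[item]

# Hypothetical usage with a window of 2 months.
window = SlidingWindowCounts(window_size=2)
window.add_partition([{"a", "b"}, {"a"}])   # month 1
window.add_partition([{"a"}, {"c"}])        # month 2
window.add_partition([{"b", "c"}])          # month 3 pushes out month 1
```

Because each partition keeps its own counts, sliding the window costs work proportional to the new and the expired partition only, instead of a rescan of the whole window.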

E.g., Redundant Rules • For the same support and confidence, if we have a rule {a,d}=>{c,e,f,g}, which of the following rules do we also have? • {a,d}=>{c,e,f} • {a}=>{c,e,f,g} • {a,d,c}=>{e,f,g} • {a}=>{d,c,e,f,g}
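One way to approach the question on this slide, offered as a sketch rather than as the author's own argument: if a known rule X => Y meets the thresholds, then a candidate rule X' => Y' is guaranteed to meet them too whenever X is a subset of X' and X' ∪ Y' is a subset of X ∪ Y. In that case the candidate's itemset is at least as frequent as X ∪ Y, and its antecedent is at most as frequent as X, so neither support nor confidence can drop. The function name `is_guaranteed_by` is an assumption for illustration, and the check is a sufficient condition only.

```python
def is_guaranteed_by(known, candidate):
    """True if candidate must meet the support/confidence thresholds met by known.

    Each rule is a pair (antecedent, consequent) of item collections.
    Sufficient condition only: False means "not guaranteed", not "fails".
    """
    known_x, known_y = (frozenset(s) for s in known)
    cand_x, cand_y = (frozenset(s) for s in candidate)
    return known_x <= cand_x and (cand_x | cand_y) <= (known_x | known_y)

known_rule = ("ad", "cefg")
candidates = [
    ("ad", "cef"),
    ("a", "cefg"),
    ("adc", "efg"),
    ("a", "dcefg"),
]
results = [is_guaranteed_by(known_rule, c) for c in candidates]
```

Under this condition the first and third candidates follow from the known rule, so they are redundant. The second and fourth have the smaller antecedent {a}, which may be much more frequent than {a,d}, so their confidence can fall below the threshold.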

E.g., Generalized Association Rules • Which data granularities should be used for data mining? • Goal: mine meaningful rules (proper data units) while being as specific as possible • Similar dilemma for other mining capabilities

Taxonomy:
  Clothes → Outerwear, Shirts
  Outerwear → Jackets, Ski Pants
  Footwear → Shoes, Hiking Boots
Database:
  Tx    Items bought
  100   Shirt
  200   Jacket, Hiking Boots
  300   Ski Pants, Hiking Boots
  400   Shoes
  500   Shoes
  600   Jacket
Frequent itemsets:
  Itemset                    Support
  Jacket                     2
  Outerwear                  3
  Clothes                    4
  Shoes                      2
  Hiking Boots               2
  Footwear                   4
  Outerwear, Hiking Boots    2
  Clothes, Hiking Boots      2
  Outerwear, Footwear        2
  Clothes, Footwear          2
Rules (minimum support 30%, minimum confidence 60%):
  Rule                         Sup    Conf
  Outerwear → Hiking Boots     33%    66%
  Outerwear → Footwear         33%    66%
  Hiking Boots → Outerwear     33%    100%
  Hiking Boots → Clothes       33%    100%
However, the more specific rules do not qualify:
  Rule                         Sup    Conf
  Jacket → Hiking Boots        16%    50%
  Ski Pants → Hiking Boots     16%    100%
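The counts in this example can be reproduced by extending every transaction with the ancestors of its items in the taxonomy and then counting as usual. The sketch below takes the taxonomy and database from the slide (with singular item names, as in the database); the helper names `with_ancestors`, `count`, and `rule_stats` are assumptions for illustration.

```python
# Taxonomy from the slide, stored as child -> parent.
PARENT = {
    "Jacket": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirt": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# Transaction database from the slide.
DATABASE = {
    100: {"Shirt"},
    200: {"Jacket", "Hiking Boots"},
    300: {"Ski Pants", "Hiking Boots"},
    400: {"Shoes"},
    500: {"Shoes"},
    600: {"Jacket"},
}

def with_ancestors(items):
    """Return the items together with all of their ancestors in the taxonomy."""
    extended = set(items)
    for item in items:
        while item in PARENT:
            item = PARENT[item]
            extended.add(item)
    return extended

EXTENDED = [with_ancestors(t) for t in DATABASE.values()]

def count(itemset):
    """Support count of itemset over the extended transactions."""
    return sum(1 for t in EXTENDED if set(itemset) <= t)

def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = count(set(antecedent) | set(consequent))
    return both / len(EXTENDED), both / count(antecedent)

general = rule_stats({"Outerwear"}, {"Hiking Boots"})   # (2/6, 2/3)
specific = rule_stats({"Jacket"}, {"Hiking Boots"})     # (1/6, 1/2)
```

The general rule clears both thresholds, while the specific rule Jacket → Hiking Boots falls below the 30% minimum support, which is the granularity dilemma the example illustrates.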

E.g., Interestingness of Rules • In a school of 5000 students • 60% (3000) play basketball, 75% (3750) eat cereal, and 40% (2000) do both • Say the minimum support is 2000 and the minimum confidence is 60%; one gets the rule • “play basketball => eat cereal” • So does that mean promoting basketball activities will help the sales of cereal?

Interestingness (Cont’d) • In fact, P(A and B)/P(A) should be greater than P(B) for the rule “A=>B” to be interesting • What about the rule {A,K}=>{B,L,V}: when is it interesting?
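Applying this test to the school example gives a quick check. The ratio of P(A and B)/P(A) to P(B) is commonly called lift; that term and the function name `lift` are not used on the slides and are introduced here only for illustration.

```python
def lift(n_total, n_a, n_b, n_both):
    """Ratio of the rule's confidence P(B|A) to the baseline P(B).

    A value above 1 means A makes B more likely; below 1, less likely.
    """
    confidence = n_both / n_a    # P(A and B) / P(A)
    baseline = n_b / n_total     # P(B)
    return confidence / baseline

# Numbers from the school example: 5000 students, 3000 play basketball,
# 3750 eat cereal, 2000 do both.
rule_confidence = 2000 / 3000
cereal_baseline = 3750 / 5000
basketball_cereal_lift = lift(5000, 3000, 3750, 2000)
```

The rule's confidence is about 66.7%, which passes the 60% threshold but is lower than the 75% of all students who eat cereal. The ratio is 8/9, below 1, so playing basketball is negatively associated with eating cereal and the rule is misleading despite meeting both thresholds.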

Related Training • Database • AI: machine learning • Statistics
