ROUGH SETS and FCA: Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation. DOMINIK ŚLĘZAK www.infobright.com www.infobright.org
Contents • Rough Sets & Feature Selection • Association Reducts • Conceptual Reducts • Building Ensembles • Towards Clustering • Rough Sets & Infobright Story • Rough & Granular Computation • Knowledge Structure Formation
Rough Sets • Rough set theory, proposed by Z. Pawlak in 1982, is an approximate reasoning model • In applications, it focuses on deriving approximate knowledge from databases • It provides good results in domains such as Web analysis, finance, industry, multimedia, medicine, and bioinformatics
Decision Systems IF (H=Normal) AND (T=Mild) THEN (S=Yes) It corresponds to a data block included in the positive region of the decision class “Yes”
Rules and Approximations • Lower & Upper Approximations • POS(Sport?|B)
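A minimal sketch of lower and upper approximations and of the positive region POS(Sport?|B). The decision table is an assumption: the slides' table is not in this transcript, so the classic 14-object weather data (O = Outlook, T = Temperature, H = Humidity, W = Wind) is used, which reproduces the positive regions listed on the Conceptual Reducts slide. The helper names (`classes`, `lower`, `upper`, `pos`) are illustrative only.

```python
# Assumed decision table (not part of the slides' text): classic weather data.
# Columns: Outlook, Temperature, Humidity, Wind, Sport?
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # 1
    ("Sunny", "Hot", "High", "Strong", "No"),         # 2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # 3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # 4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # 5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # 6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # 7
    ("Sunny", "Mild", "High", "Weak", "No"),          # 8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # 9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # 10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # 11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # 12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # 13
    ("Rain", "Mild", "High", "Strong", "No"),         # 14
]
COL = {"O": 0, "T": 1, "H": 2, "W": 3}  # attribute letter -> column index
DECISION = 4


def classes(B):
    """Indiscernibility classes of attribute set B, as sets of 1-based object ids."""
    groups = {}
    for u, row in enumerate(ROWS, start=1):
        key = tuple(row[COL[a]] for a in sorted(B))
        groups.setdefault(key, set()).add(u)
    return list(groups.values())


def lower(X, B):
    """Lower approximation: union of the B-classes fully contained in X."""
    return set().union(*(c for c in classes(B) if c <= X))


def upper(X, B):
    """Upper approximation: union of the B-classes that intersect X."""
    return set().union(*(c for c in classes(B) if c & X))


def pos(B):
    """Positive region: union of the lower approximations of all decision classes."""
    result = set()
    for value in {row[DECISION] for row in ROWS}:
        X = {u for u, row in enumerate(ROWS, start=1) if row[DECISION] == value}
        result |= lower(X, B)
    return result


YES = {u for u, row in enumerate(ROWS, start=1) if row[DECISION] == "Yes"}
print(sorted(lower(YES, "TH")))  # [10, 11, 13]
print(sorted(pos("O")))          # [3, 7, 12, 13]
```

Objects 10 and 11 form the block (T=Mild, H=Normal) covered by the rule on the Decision Systems slide, which is why they fall into the lower approximation of "Yes".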
Feature Reduction (Selection) • Reducts: optimal attribute subsets that approximate the pre-defined target concepts, or the whole data source, well enough • Feature reduction is one of the steps of the knowledge discovery in databases (KDD) process • In real-world situations, we may agree to slightly decrease the quality if it leads to a significantly simpler knowledge model
Association Reducts ({a,b,c},{d,e}) ({a,b,d,f},{c,e}) ({a,b,f},{e}) ({a,c,e},{b,d}) ({a,c,f},{d}) ({a,d,e},{b,c}) ({a,d,f},{c}) ({a,e,f},{b}) ({b,c,d},{a,e}) ({b,d,e},{a,c}) ({b,e,f},{a}) ({c,d,f},{a}) ({c,e,f},{a,b,d})
Association Reducts as Association Rules in Indiscernibility Tables
Most Interesting Reducts • Given an association reduct (C,D), we evaluate it with the value F(|C|,|D|) • The function F: N × N → R should satisfy: IF n1 < n2 THEN F(n1,m) > F(n2,m); IF m1 < m2 THEN F(n,m1) < F(n,m2) • F(|C|,|D|) is maximized subject to # from the space of approximation parameters • This maximization problem is NP-hard
What can # actually mean? 1) |POS(d|B)| 2) Disc(d|B) = Disc(B ∪ {d}) − Disc(B), where Disc(X) = |{(u1,u2): X(u1) ≠ X(u2)}| 3) Relative Gain R(d|B) = 4) Entropy H(d|B) = H(B ∪ {d}) − H(B)
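The discernibility and entropy measures above can be computed directly from their definitions. The sketch below uses a hypothetical four-row table with attributes a, b and decision d (not taken from the slides). The function names are illustrative, and Disc counts ordered pairs, as the set notation on the slide suggests.

```python
from collections import Counter
from itertools import product
from math import log2

# Hypothetical toy table: conditional attributes a, b and decision d.
TABLE = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
]


def project(row, X):
    """Value vector X(u) of one row on the attribute set X."""
    return tuple(row[x] for x in sorted(X))


def disc(X):
    """Disc(X): number of ordered pairs (u1, u2) with X(u1) != X(u2)."""
    return sum(1 for r1, r2 in product(TABLE, repeat=2)
               if project(r1, X) != project(r2, X))


def disc_gain(d, B):
    """Disc(d|B) = Disc(B ∪ {d}) − Disc(B): pairs discerned by d but not by B."""
    return disc(set(B) | {d}) - disc(B)


def entropy(X):
    """Shannon entropy (in bits) of the value combinations on X."""
    counts = Counter(project(row, X) for row in TABLE)
    n = len(TABLE)
    return -sum(c / n * log2(c / n) for c in counts.values())


def cond_entropy(d, B):
    """H(d|B) = H(B ∪ {d}) − H(B)."""
    return entropy(set(B) | {d}) - entropy(B)


print(disc_gain("d", {"a"}), cond_entropy("d", {"a"}))            # 2 0.5
print(disc_gain("d", {"a", "b"}), cond_entropy("d", {"a", "b"}))  # 0 0.0
```

Both measures drop to zero exactly when B determines d on this table, which is what makes them usable as approximation criteria.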
Conceptual Reducts • A reduct is a pair (X,B), where X ⊆ U, POS(B) = X, and POS(C) ⊊ X for any C ⊊ B • (∅,∅) ({3,7,12,13},{O}) ({1-3,7,9,12,13},{O,T}) ({1-3,7-9,11-13},{O,H}) ({3-7,10,12-14},{O,W}) ({10,11,13},{T,H}) ({2,5,9},{T,W}) ({5,9,10,13},{H,W}) ({1-3,7-13},{O,T,H}) ({1-14},{O,T,W}) ({1-14},{O,H,W}) ({2,5,9-11,13},{H,T,W})
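The pairs listed above can be reproduced by brute force over all attribute subsets. This sketch assumes the slides' table is the classic 14-object weather data (an assumption, since the table is not in this transcript); with that table the enumeration yields exactly the twelve pairs on the slide. Checking single-attribute removals is enough because POS is monotone in B.

```python
from itertools import combinations

# Assumed decision table (not part of the slides' text): classic weather data.
# Columns: Outlook, Temperature, Humidity, Wind, Sport?
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # 1
    ("Sunny", "Hot", "High", "Strong", "No"),         # 2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # 3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # 4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # 5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # 6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # 7
    ("Sunny", "Mild", "High", "Weak", "No"),          # 8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # 9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # 10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # 11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # 12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # 13
    ("Rain", "Mild", "High", "Strong", "No"),         # 14
]
COL = {"O": 0, "T": 1, "H": 2, "W": 3}
DECISION = 4


def pos(B):
    """Objects whose B-indiscernibility class carries a single decision value."""
    groups = {}
    for u, row in enumerate(ROWS, start=1):
        key = tuple(row[COL[a]] for a in sorted(B))
        groups.setdefault(key, []).append(u)
    result = set()
    for members in groups.values():
        if len({ROWS[u - 1][DECISION] for u in members}) == 1:
            result.update(members)
    return result


def conceptual_reducts():
    """All pairs (X, B) with POS(B) = X and POS(C) != X for every proper subset C of B."""
    found = []
    attrs = sorted(COL)
    for k in range(len(attrs) + 1):
        for B in combinations(attrs, k):
            X = pos(B)
            # POS is monotone, so testing the removal of each single attribute suffices.
            if all(pos(set(B) - {a}) != X for a in B):
                found.append((X, set(B)))
    return found


for X, B in conceptual_reducts():
    print(sorted(X), sorted(B))
```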
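Reduct “Lattice” • |B| = 0: (∅,∅) • |B| = 1: ({3,7,12,13},{O}) • |B| = 2: ({1-3,7,9,12,13},{O,T}) ({1-3,7-9,11-13},{O,H}) ({3-7,10,12-14},{O,W}) ({10,11,13},{T,H}) ({2,5,9},{T,W}) ({5,9,10,13},{H,W}) • |B| = 3: ({1-3,7-13},{O,T,H}) ({1-14},{O,T,W}) ({1-14},{O,H,W}) ({2,5,9-11,13},{H,T,W})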
Most Interesting Reducts • Given a conceptual reduct (X,B), we evaluate it with the value F(|X|,|B|) • The function F: N × N → R should satisfy: IF n1 < n2 THEN F(n1,m) < F(n2,m); IF m1 < m2 THEN F(n,m1) > F(n,m2) • So we should maximize F(|X|,|B|) or... • ... shall we rather search for ensembles?
“Good” Ensembles of Reducts • Reducts with minimal cardinalities (or rules) • Reducts with minimal pairwise intersections • Challenge: how to modify the existing attribute reduction methods to search for such “good” ensembles? [Figure: reducts R1, R2, R3 drawn as subsets of ATTRIBUTES]
Hybrid Genetic Algorithm (1) • Genetic part, where each chromosome encodes a permutation σ of the attributes • Heuristic part, where permutations are put into the following algorithm: • LET LEFT = A • FOR i = 1 TO |A| REPEAT • LET LEFT ← LEFT \ {a_σ(i)} • IF NOT (LEFT # d) THEN UNDO the removal of a_σ(i) • EVALUATE REDUCT LEFT
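A sketch of the heuristic part only (the genetic search over permutations is not shown). Two assumptions are made here: the approximation criterion # is replaced by the exact condition POS(LEFT) = POS(A), and the data is the classic 14-object weather table, which is not part of the slides' text. The function name `reduct_from_permutation` is illustrative.

```python
# Assumed decision table (not part of the slides' text): classic weather data.
# Columns: Outlook, Temperature, Humidity, Wind, Sport?
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # 1
    ("Sunny", "Hot", "High", "Strong", "No"),         # 2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # 3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # 4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # 5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # 6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # 7
    ("Sunny", "Mild", "High", "Weak", "No"),          # 8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # 9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # 10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # 11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # 12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # 13
    ("Rain", "Mild", "High", "Strong", "No"),         # 14
]
COL = {"O": 0, "T": 1, "H": 2, "W": 3}
DECISION = 4


def pos(B):
    """Objects whose B-indiscernibility class carries a single decision value."""
    groups = {}
    for u, row in enumerate(ROWS, start=1):
        key = tuple(row[COL[a]] for a in sorted(B))
        groups.setdefault(key, []).append(u)
    result = set()
    for members in groups.values():
        if len({ROWS[u - 1][DECISION] for u in members}) == 1:
            result.update(members)
    return result


def reduct_from_permutation(perm):
    """Try to drop attributes in the order given by perm; undo a drop that breaks the criterion."""
    target = pos(set(COL))   # exact criterion (assumption): keep POS(LEFT) = POS(A)
    left = set(COL)          # LET LEFT = A
    for a in perm:
        left.discard(a)      # LET LEFT <- LEFT \ {a}
        if pos(left) != target:
            left.add(a)      # UNDO
    return left


print(sorted(reduct_from_permutation("OTHW")))  # ['H', 'O', 'W']
print(sorted(reduct_from_permutation("HOTW")))  # ['O', 'T', 'W']
```

Different permutations lead to different reducts, which is what lets the genetic part explore the space of reducts by evolving permutations rather than attribute subsets.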
Hybrid Genetic Algorithm (2) • LET (LEFT,RIGHT) = (∅,A) • FOR i = 1 TO |U|+|A| REPEAT: IF σ(i) ∈ {1,...,|U|} THEN IF u_σ(i) ∈ POS(RIGHT) THEN LET LEFT ← LEFT ∪ {u_σ(i)}; IF σ(i) ∈ {|U|+1,...,|U|+|A|} THEN IF POS(RIGHT \ {a_σ(i)}) ⊇ LEFT THEN LET RIGHT ← RIGHT \ {a_σ(i)} • EVALUATE REDUCT (LEFT,RIGHT)
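A sketch of this second heuristic, where the permutation mixes objects and attributes. It assumes the lost relation symbol in the attribute test is ⊇, and it again uses the classic 14-object weather table as a stand-in for the slides' data. In the code the permutation is a plain list in which integers are object ids and strings are attribute names, which is an illustrative encoding rather than the slides' index ranges.

```python
# Assumed decision table (not part of the slides' text): classic weather data.
# Columns: Outlook, Temperature, Humidity, Wind, Sport?
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # 1
    ("Sunny", "Hot", "High", "Strong", "No"),         # 2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # 3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # 4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # 5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # 6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # 7
    ("Sunny", "Mild", "High", "Weak", "No"),          # 8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # 9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # 10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # 11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # 12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # 13
    ("Rain", "Mild", "High", "Strong", "No"),         # 14
]
COL = {"O": 0, "T": 1, "H": 2, "W": 3}
DECISION = 4


def pos(B):
    """Objects whose B-indiscernibility class carries a single decision value."""
    groups = {}
    for u, row in enumerate(ROWS, start=1):
        key = tuple(row[COL[a]] for a in sorted(B))
        groups.setdefault(key, []).append(u)
    result = set()
    for members in groups.values():
        if len({ROWS[u - 1][DECISION] for u in members}) == 1:
            result.update(members)
    return result


def conceptual_reduct_from_permutation(perm):
    """perm mixes object ids (int, 1..14) and attribute names (str), each exactly once."""
    left, right = set(), set(COL)            # (LEFT, RIGHT) = (∅, A)
    for item in perm:
        if isinstance(item, int):            # an object: commit to covering it if possible
            if item in pos(right):
                left.add(item)
        elif pos(right - {item}) >= left:    # an attribute: drop it if LEFT stays covered
            right.discard(item)
    return left, right


perm = [3, "T", "H", "W", "O", 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
left, right = conceptual_reduct_from_permutation(perm)
print(sorted(left), sorted(right))  # [3, 7, 12, 13] ['O']
```

Objects placed early in the permutation force attributes to be kept, while attributes placed early are dropped freely, so the permutation controls the trade-off between |X| and |B|.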
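Reduct “Lattice” once more • |B| = 0: (∅,∅) • |B| = 1: ({3,7,12,13},{O}) • |B| = 2: ({1-3,7,9,12,13},{O,T}) ({1-3,7-9,11-13},{O,H}) ({3-7,10,12-14},{O,W}) ({10,11,13},{T,H}) ({2,5,9},{T,W}) ({5,9,10,13},{H,W}) • |B| = 3: ({1-3,7-13},{O,T,H}) ({1-14},{O,T,W}) ({1-14},{O,H,W}) ({2,5,9-11,13},{H,T,W})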
Feature Clustering / Selection • Frequent occurrence of representatives in reducts yields splitting clusters • Rare occurrence of pairs of close representatives yields merging clusters • Diagram labels: CLUSTERS OF ATTRIBUTES; REDUCTS WITH CLUSTER REPRESENTATIVES; FEEDBACK • Grużdź, Ihnatowicz, Ślęzak: Interactive gene clustering – a case study of breast cancer microarray data. Inf. Systems Frontiers 8 (2006).
How about groups of rows (1) Data-based knowledge models, classifiers... Database indices, data partitioning, data sorting... Difficulty with fast updates of structures...
Two-Level Computing Large Data (10TB) & Mixed Workloads
SELECT MAX(A) FROM T WHERE B>15; DATA STEP 1 STEP 2 STEP 3
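The slide does not spell out the three steps, so the following is only one plausible reading, assuming the data is split into packs of rows and that per-pack min/max statistics are available without opening a pack. `Pack` and `rough_max_a` are hypothetical names and the sample packs are made up.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    a: list  # values of column A in this pack
    b: list  # values of column B in this pack

    def __post_init__(self):
        # Statistics assumed to be kept separately from the (compressed) data.
        self.min_b, self.max_b, self.max_a = min(self.b), max(self.b), max(self.a)


def rough_max_a(packs, threshold):
    """MAX(A) over rows with B > threshold; returns (result, number of packs opened)."""
    # Step 1: classify packs by B statistics only.
    relevant = [p for p in packs if p.min_b > threshold]             # every row qualifies
    suspect = [p for p in packs if p.min_b <= threshold < p.max_b]   # some rows may qualify
    # Packs with max_b <= threshold are irrelevant and never touched.

    # Step 2: relevant packs give a lower bound on the answer from statistics alone.
    best = max((p.max_a for p in relevant), default=None)

    # Step 3: open only the suspect packs that could still beat the bound.
    opened = 0
    for p in sorted(suspect, key=lambda p: p.max_a, reverse=True):
        if best is not None and p.max_a <= best:
            break
        opened += 1
        for a, b in zip(p.a, p.b):
            if b > threshold and (best is None or a > best):
                best = a
    return best, opened


packs = [
    Pack(a=[1, 2, 3], b=[1, 2, 3]),   # irrelevant: all B <= 15
    Pack(a=[10, 20], b=[16, 30]),     # relevant: all B > 15
    Pack(a=[5, 15], b=[10, 20]),      # suspect, but max(A) = 15 cannot beat 20
    Pack(a=[25, 50], b=[20, 5]),      # suspect, must be opened
]
print(rough_max_a(packs, 15))  # (25, 1)
```

In this made-up example only one of the four packs has to be read row by row; the rest of the query is resolved from the statistics.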
Knowledge Structures (Nodes) Order Detail Table – assume many more rows Supplier/Part Table – assume many more rows
DATA – Best Inspiration • New Objectives • New Schemas • New Volumes • New Queries • New Types • New KNs • ...........
References (Unfinished List) • D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse - An Analytic Data Warehouse for Ad-hoc Queries. VLDB 2008: 1337-1345. • D. Ślęzak: Rough Sets and Few-Objects-Many-Attributes Problem - The Case Study of Analysis of Gene Expression Data Sets. FBIT 2007: 437-440. • D. Ślęzak: Rough Sets and Functional Dependencies in Data - Foundations of Association Reducts. To appear. • ......
THANK YOU!! www.infobright.com www.infobright.org slezak@infobright.com