Lattice Representation of Data. Dr. Alex Pogel Physical Science Laboratory New Mexico State University. Basic Idea. Replace tabular representation by lattice representation in order to reveal hierarchical structure Basic definitions Information in the lattice
Lattice Representation of Data Dr. Alex Pogel Physical Science Laboratory New Mexico State University
Basic Idea
Replace tabular representation by lattice representation in order to reveal hierarchical structure.
• Basic definitions
• Information in the lattice
• Carving up epidemiological data
Ganter & Wille: Formal Concept Analysis (FCA)
Barwise & Seligman: Information Flow
Input data
The base data structure is a {0,1}-table:
• a set G of objects (represented by rows) and
• a set M of attributes (represented by columns);
• an entry of 1 indicates that object g has attribute m.
Input data, mathematically
A binary relation I from G to M, that is, a subset of G × M, interpreted as an indication of which objects g have which attributes m, via (g, m) ∈ I.
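As a minimal sketch of the step from table to relation, the snippet below turns a small {0,1}-table into the set I of (object, attribute) pairs. The three animals and their rows are made up for illustration and are not taken from the talk's data set.

```python
# Hypothetical toy table: rows are objects (G), columns are attributes (M).
objects = ["dog", "whale", "trout"]
attributes = ["warm-blooded", "airbreather", "lives in water"]
table = [
    [1, 1, 0],  # dog
    [1, 1, 1],  # whale
    [0, 0, 1],  # trout
]

# The binary relation I: (g, m) is in I exactly when the table entry is 1.
I = {
    (g, m)
    for g, row in zip(objects, table)
    for m, entry in zip(attributes, row)
    if entry == 1
}

print(("whale", "lives in water") in I)  # True
print(("trout", "airbreather") in I)     # False
```

Storing I as a set of pairs mirrors the "subset of G × M" wording directly; a dictionary from objects to attribute sets would serve equally well.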
Key Definitions
The notion of “formal concept” is based on natural mappings that arise from the binary relation I (interpret G and M as before):
• to each subset H of G, we associate the set a(H) of all attributes that the objects in H have in common, a: P(G) → P(M);
• to each subset N of M, we associate the set o(N) of all objects satisfying every attribute in N, o: P(M) → P(G).
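The two mappings can be sketched in a few lines. The context below (three animals, three attributes) is a hypothetical toy example, not the talk's data; the function names a and o simply follow the slide.

```python
# Hypothetical toy context: objects G, attributes M, relation I.
G = {"dog", "whale", "trout"}
M = {"warm-blooded", "airbreather", "lives in water"}
I = {
    ("dog", "warm-blooded"), ("dog", "airbreather"),
    ("whale", "warm-blooded"), ("whale", "airbreather"),
    ("whale", "lives in water"),
    ("trout", "lives in water"),
}

def a(H):
    """Attributes that all objects in H have in common."""
    return {m for m in M if all((g, m) in I for g in H)}

def o(N):
    """Objects that satisfy every attribute in N."""
    return {g for g in G if all((g, m) in I for m in N)}

shared = a({"dog", "whale"})       # warm-blooded and airbreather
in_water = o({"lives in water"})   # whale and trout
everything = a(set())              # all of M: an empty H imposes no constraint
```

Note that both maps reverse inclusion: enlarging H can only shrink a(H), and enlarging N can only shrink o(N).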
Key Definitions
The attribute subsets N of M such that a(o(N)) = N are called formal concepts in FCA (strictly, the concept is the pair (o(N), N), of which N is the intent). They are called closed sets in mathematics, as a(o(–)) is a closure operator on M.
A formal concept can be identified geometrically within a data table by reshuffling rows and columns such that
• object-attribute relations are maintained and
• a maximal rectangle of 1s appears.
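The "maximal rectangle" reading can be checked mechanically: a pair (extent, intent) is a concept when the rectangle is full of 1s and cannot be extended by a further row or column. The sketch below uses a hypothetical toy context, and the helper name is_maximal_rectangle is an assumption made for illustration.

```python
# Hypothetical toy context (not the talk's data).
G = {"dog", "whale", "trout"}
M = {"warm-blooded", "airbreather", "lives in water"}
I = {
    ("dog", "warm-blooded"), ("dog", "airbreather"),
    ("whale", "warm-blooded"), ("whale", "airbreather"),
    ("whale", "lives in water"),
    ("trout", "lives in water"),
}

def is_maximal_rectangle(extent, intent):
    """True if extent x intent is all 1s and no row or column can be added."""
    full = all((g, m) in I for g in extent for m in intent)
    # A further object fits if it has every attribute of the intent.
    extra_object = any(all((g, m) in I for m in intent) for g in G - extent)
    # A further attribute fits if every object of the extent has it.
    extra_attribute = any(all((g, m) in I for g in extent) for m in M - intent)
    return full and not extra_object and not extra_attribute

concept = is_maximal_rectangle({"dog", "whale"}, {"warm-blooded", "airbreather"})
too_small = is_maximal_rectangle({"dog"}, {"warm-blooded"})
print(concept, too_small)  # True False
```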
Closure System Arises
Taking all closed sets together we obtain a closure system (a.k.a. a topped intersection structure, in Davey-Priestley), which is always a complete lattice (an ordered set in which every subset has both a supremum and an infimum).
Examples:
• the extended reals ℝ ∪ {−∞, +∞} with ≤ (ℝ alone is a lattice but not complete, since unbounded subsets have no supremum),
• P(S) with inclusion,
• any topology with inclusion, …
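For a small context the closure system can be listed by brute force: apply a(o(–)) to every attribute subset and keep the fixed points. This is only a sketch on a hypothetical toy context; it is exponential in the number of attributes and is not how a real FCA tool would enumerate concepts.

```python
from itertools import combinations

# Hypothetical toy context (not the talk's data).
G = {"dog", "whale", "trout"}
M = {"warm-blooded", "airbreather", "lives in water"}
I = {
    ("dog", "warm-blooded"), ("dog", "airbreather"),
    ("whale", "warm-blooded"), ("whale", "airbreather"),
    ("whale", "lives in water"),
    ("trout", "lives in water"),
}

def a(H):
    return frozenset(m for m in M if all((g, m) in I for g in H))

def o(N):
    return frozenset(g for g in G if all((g, m) in I for m in N))

def closure(N):
    """The closure operator a(o(N)) on attribute sets."""
    return a(o(N))

all_subsets = [
    frozenset(c)
    for size in range(len(M) + 1)
    for c in combinations(sorted(M), size)
]
closed_sets = {N for N in all_subsets if closure(N) == N}

# "Topped": the full attribute set M is closed.
topped = frozenset(M) in closed_sets
# "Intersection structure": the intersection of two closed sets is closed.
meet_closed = all((x & y) in closed_sets for x in closed_sets for y in closed_sets)
print(len(closed_sets), topped, meet_closed)  # 4 True True
```

In this toy case the four closed sets, ordered by inclusion, form a diamond-shaped lattice.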
Full list: difficult, redundant. All implications that hold for the data, with up to three attributes in their premise; 125 have positive support.
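A list of this kind can be sketched by brute force: for every premise of up to three attributes, the strongest valid conclusion is the closure of the premise minus the premise itself. The context below is a hypothetical toy example, and taking "support" to mean the number of objects satisfying the premise is an assumption about the tool's definition.

```python
from itertools import combinations

# Hypothetical toy context (not the talk's data).
G = {"dog", "whale", "trout"}
M = {"warm-blooded", "airbreather", "lives in water"}
I = {
    ("dog", "warm-blooded"), ("dog", "airbreather"),
    ("whale", "warm-blooded"), ("whale", "airbreather"),
    ("whale", "lives in water"),
    ("trout", "lives in water"),
}

def a(H):
    return frozenset(m for m in M if all((g, m) in I for g in H))

def o(N):
    return frozenset(g for g in G if all((g, m) in I for m in N))

implications = []
for size in range(0, 4):  # premises with 0 to 3 attributes
    for premise in combinations(sorted(M), size):
        premise = frozenset(premise)
        conclusion = a(o(premise)) - premise  # what the premise forces beyond itself
        support = len(o(premise))             # assumed: objects satisfying the premise
        if conclusion and support > 0:
            implications.append((premise, conclusion, support))

for premise, conclusion, support in implications:
    print(sorted(premise), "->", sorted(conclusion), "support", support)
```

Even on three attributes the output is redundant (for example, two of the four implications found only restate that warm-blooded and airbreather coincide), which is the problem a basis is meant to address.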
Duquenne-Guigues Basis. 20 implications generate the full list and serve as a basis (by analogy with linear algebra); they are ordered by support value.
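What "generate" means can be sketched with forward chaining: an implication follows from a set of implications if closing its premise under that set reaches its conclusion. The four rules below are implications quoted elsewhere in the talk; whether each of them belongs to the actual 20-element basis is not claimed here, and the function names are assumptions for illustration. This does not compute the Duquenne-Guigues basis itself.

```python
# Implications quoted in the talk, as (premise, conclusion) pairs.
rules = [
    ({"warm-blooded"}, {"airbreather"}),
    ({"fourlegged"}, {"airbreather"}),
    ({"pet"}, {"warm-blooded"}),
    ({"fur"}, {"fourlegged", "warm-blooded"}),
]

def close_under(attrs, implications):
    """Smallest superset of attrs that respects every implication."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed

def follows(premise, conclusion, implications):
    """True if premise -> conclusion is derivable from the given implications."""
    return set(conclusion) <= close_under(premise, implications)

print(follows({"fur"}, {"airbreather"}, rules))  # True: derived, not listed
print(follows({"airbreather"}, {"fur"}, rules))  # False
```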
Implication Reads Upwards. At top right: warm-blooded implies airbreather. It is 1st in the basis; its high support is indicated in lime green.
A Subinterval of the Lattice. fourlegged implies airbreather; pet implies warm-blooded (iguana?); and fur implies fourlegged and warm-blooded (platypus?).
Original data preserved animals 26 and 27 share the attributes “lives in water”, “is warm-blooded” and “is an airbreather”
Color-coded support. The similarity in color between “livestock” and the concept node below it yields the association rule livestock implies fur, with 79% confidence and 11% support (bottom).
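Support and confidence of an association rule can be sketched as simple counts. The definitions used here (support = share of all objects satisfying premise and conclusion; confidence = that count divided by the count satisfying the premise) are the usual ones and are assumed, not confirmed, to match the tool. The ten animals are invented and do not reproduce the 79% and 11% figures above.

```python
# Hypothetical animals, each given as its set of attributes (not the talk's data).
animals = [
    {"livestock", "fur"},
    {"livestock", "fur"},
    {"livestock", "fur"},
    {"livestock"},
    {"fur"},
    {"fur", "pet"},
    {"pet"},
    set(),
    set(),
    set(),
]

def count(attrs):
    """Number of animals having every attribute in attrs."""
    return sum(1 for animal in animals if attrs <= animal)

def support(premise, conclusion):
    return count(premise | conclusion) / len(animals)

def confidence(premise, conclusion):
    return count(premise | conclusion) / count(premise)

rule_support = support({"livestock"}, {"fur"})        # 3 of 10 animals
rule_confidence = confidence({"livestock"}, {"fur"})  # 3 of 4 livestock animals
print(rule_support, rule_confidence)  # 0.3 0.75
```

An implication in the strict sense is the special case with confidence 1.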
Visual Vocabulary. Small subdiagrams (specifically, meet-subsemilattices) can be recognized as complex sentences.
3 unordered attribute concepts (diagram nodes a, b, c). Note: the top element is really irrelevant, but adding it makes everything we’ll look at a lattice instead of just a meet semilattice (definition: an ordered set closed under finite meets, i.e. greatest lower bounds).
Here’s the best known outcome: no non-trivial implications (diagram nodes a, b, c).
W over V: a & c → b (diagram nodes a, b, c).
Diamond in diamond. Under condition c, a and b are equivalent (diagram nodes a, b, c).
Convergence: any two imply the third (diagram nodes a, b, c).
Two Complex Sentences. So, we can read that, for nocturnal animals and pets, the attributes fourlegged and warm-blooded are equivalent, and that the only implication between the attributes “nocturnal,” “fur” and “pet” is: pet and nocturnal implies fur.
Lung Cancer and Smoking. Nearly half of these 30+ year smokers have lung cancer.
Bird-keeping and Smoking. Association rules involving bird-keeping and smoking.
Limitations as KDD Process
• Needs attention given to data preparation
• Needs more built-in verification of discovered rules
• No domain-specific constructions (advantage?)
• Does not scale without clustering (universal?)
Epidemiological functions
               Lung Cancer   No Lung Cancer
BirdKeep Yes        33             34
BirdKeep No         16             64
Plan to add odds ratio calculation, via click. OR = 3.9
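The odds ratio quoted on this slide can be checked from the four counts, assuming the standard cross-product definition OR = (exposed cases × unexposed non-cases) / (exposed non-cases × unexposed cases). The function and parameter names are chosen for illustration only.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio of a 2x2 exposure/outcome table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

# Counts from the slide: bird-keepers 33 / 34, non-bird-keepers 16 / 64.
or_birdkeeping = odds_ratio(33, 34, 16, 64)
print(round(or_birdkeeping, 1))  # 3.9
```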
Support for improvement. Traditional diagram improvement algorithms are based solely upon the order structure. We are now moving towards the inclusion of support values in these algorithms. I will talk about this topic in detail in July, here at DIMACS, as part of the Applications of Lattice Theory workshop.