An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Advisor: Dr. Hsu • Presenter: Jing-Wei Lin

Outline • Motivation • Objective • Introduction • SOM and AOI • GSOM and EAOI • Exploratory clustering and pattern extraction • Experimental results • Conclusions

Motivation • A successful integration relies on appropriate individual techniques. However, the traditional self-organizing map (SOM) and attribute-oriented induction (AOI) have some drawbacks. • The traditional SOM cannot directly handle categorical data. • AOI may fail to preserve the major values of an attribute, leading to over-generalization. For example, if the salary data contain 30 records for Taipei but only one each for Taoyuan and Hsinchu, the value that AOI produces to represent income in North Taiwan may be misleading.

SOM • The SOM is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid, usually two-dimensional, and preserves the topological relationships of the original data.

AOI • Attribute-oriented induction extracts data patterns from a large amount of data and produces a set of concise rules, which represent the general patterns hidden in the data. Note: AOI is a technique for extracting data characteristics from relational databases.

Objective • To propose a generalized self-organizing map (GSOM) and an extended attribute-oriented induction (EAOI), which not only overcome the drawbacks of the original algorithms but also provide additional analysis capabilities.

Introduction • Among unsupervised clustering techniques, much attention has been paid to the self-organizing map (SOM), which projects high-dimensional data onto a low-dimensional grid without losing the topological order of the data. • Among pattern extraction techniques, attribute-oriented induction (AOI) is a popular and effective approach.

Introduction (cont.) • The integrated analysis framework works as follows: train the GSOM using preprocessed data, perform data clustering visually and in an exploratory manner on the trained map, and then extract the characteristics of individual clusters using the EAOI.

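Introduction (cont.) • The GSOM is able to directly handle categorical data: because binary transformation loses information or leaves it incomplete, a concept hierarchy tree is used instead, with a weight assigned to each link, so that the exact distance between categorical values can be computed.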

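Introduction (cont.) • The EAOI offers the additional capability of preserving major values in the data: on top of the traditional AOI, it also considers the number of occurrences of each value to capture how attribute values are distributed, and it introduces a "major value" indicator for categorical data to solve the over-generalization problem.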

SOM • Training on SOM essentially involves two steps： • The identifying：each training pattern compares with all the units of the map and identifies the best matching unit (BMU) that is most similar to the training pattern. • The adjusting：the BMU and its neighbors are updated to resemble the training pattern.
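The two training steps can be sketched as follows. This is a minimal illustration for numeric data only, not the presenters' implementation: the Euclidean distance, the Gaussian neighborhood function, and the names `find_bmu`, `adjust`, `alpha`, and `radius` are all assumptions made for the example.

```python
import math

def find_bmu(units, pattern):
    """Identifying step: index of the unit closest to the pattern (Euclidean)."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], pattern))

def adjust(units, grid, bmu, pattern, alpha, radius):
    """Adjusting step: move the BMU and its grid neighbors toward the pattern.

    units  -- list of weight vectors (modified in place)
    grid   -- grid coordinates of each unit on the map
    alpha  -- learning rate (assumed constant within one call)
    radius -- neighborhood radius on the grid, must be > 0
    """
    for i, unit in enumerate(units):
        grid_dist = math.dist(grid[i], grid[bmu])
        if grid_dist <= radius:
            # Gaussian neighborhood weight: 1 at the BMU, smaller farther away.
            h = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
            units[i] = [w + alpha * h * (p - w) for w, p in zip(unit, pattern)]
```

A training loop would call `find_bmu` and then `adjust` for every pattern, shrinking `alpha` and `radius` over time.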

Problem with the SOM • The conventional SOM cannot directly handle categorical attributes. • The binary transformation approach has at least four disadvantages. • (1) Similarity information among categorical values is not conveyed • (2) When the domain of a categorical attribute is large, the transformation increases the dimensionality of the transformed relation • (3) Maintenance is difficult • (4) The names of binary attributes fail to preserve the semantics of the original categorical attribute
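Disadvantage (1) can be seen in a small sketch. The three values below are hypothetical examples, and the binary transformation is assumed to be a plain one-hot encoding: after encoding, every pair of distinct values is equally far apart, so the fact that two of them are both drinks is lost.

```python
import math

VALUES = ["Coke", "Pepsi", "Cookie"]  # hypothetical categorical domain

def one_hot(value):
    """Binary transformation: one 0/1 attribute per categorical value."""
    return [1.0 if value == v else 0.0 for v in VALUES]

def encoded_distance(a, b):
    """Euclidean distance between two values after binary transformation."""
    return math.dist(one_hot(a), one_hot(b))
```

Here `encoded_distance("Coke", "Pepsi")` equals `encoded_distance("Coke", "Cookie")`, and the encoding needs one dimension per value, which also illustrates disadvantage (2).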

AOI • The induction method mainly includes two steps, attribute removal and attribute generalization. • Attribute removal: attributes with too many distinct values and attributes whose meaning is redundant are removed. • Attribute generalization: for each remaining attribute, the original attribute values, which are more specific, are replaced by values closer to the root of its concept hierarchy, which are more general.
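Attribute generalization can be sketched as climbing a concept hierarchy until few enough distinct values remain. The hierarchy, the function name, and the rule of lifting all values one level at a time are assumptions made for illustration.

```python
# Hypothetical concept hierarchy: child -> parent.
PARENT = {
    "Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan",
    "Tainan": "South_Taiwan", "Kaohsiung": "South_Taiwan",
    "North_Taiwan": "Taiwan", "South_Taiwan": "Taiwan",
}

def generalize_attribute(values, threshold):
    """Replace values by their parents until at most `threshold` distinct values remain."""
    values = list(values)
    while len(set(values)) > threshold:
        lifted = [PARENT.get(v, v) for v in values]  # the root has no parent
        if lifted == values:  # everything is already at the root
            break
        values = lifted
    return values
```

With a threshold of 2, the five cities collapse to `North_Taiwan` and `South_Taiwan`. Note that the counts behind each generalized value are not visible in the result, which is the weakness the EAOI addresses.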

Problems with handling major values and numeric attributes • The traditional AOI is incapable of revealing major values and suffers from having to discretize numeric attributes. • Regarding the construction of concept hierarchies for numeric attributes, there are two problems: • (1) Subjectivity of the construction: the criteria for building the concept hierarchy are chosen subjectively by people, so similar data can end up in different categories. • (2) The generalization of boundary values: for example, when 50 to 100 is defined as the middle level, 49.9 and 50 differ only slightly, yet 49.9 is assigned to the low level.

GSOM-Distance hierarchy • To alleviate the drawbacks resulting from binary transformation, we propose the distance hierarchy: a concept hierarchy extended with weights, which serves as the mechanism for representing and measuring the distance between categorical values.

GSOM-Distance hierarchy (cont.) • The least common ancestor of two points X and Y, denoted as LCA(X, Y), is the deepest node that is an ancestor of both X and Y, e.g., LCA(X, Z)=Drink.

GSOM-Distance hierarchy (cont.) • The least common point of two points X and Y, denoted as LCP(X, Y), is defined as one of the three cases: (1) either X or Y if they are at the same position (i.e., equivalent); (2) Y if Y is an ancestor of X; otherwise (3) LCP(X, Y)=LCA(X, Y)

GSOM-Distance hierarchy (cont.) • The distance between two points in a distance hierarchy is the total weight between them. Let X=(NX, dX) and Y=(NY, dY) be the two points; the distance between X and Y is defined as $|X - Y| = d_X + d_Y - 2\,d_{LCP(X,Y)}$, where $d_{LCP(X,Y)}$ is the distance from the root to LCP(X, Y). Note: the offset dX represents the distance from the root of the hierarchy to X.
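A minimal sketch of this distance, assuming the formula |X − Y| = dX + dY − 2·dLCP(X, Y). The hierarchy, the unit link weights, and the function names are hypothetical. The sketch also uses a shortcut that is my own derivation rather than something stated in the slides: the offset of the least common point equals the smallest of dX, dY, and the depth of the least common ancestor of the two anchors.

```python
# Hypothetical distance hierarchy: child -> parent, with a weight on each link.
PARENT = {"Drink": "Any", "Appetizer": "Any",
          "Coke": "Drink", "Pepsi": "Drink", "Cookie": "Appetizer"}
WEIGHT = {node: 1.0 for node in PARENT}  # weight of the link node -> parent

def path_to_root(node):
    """Nodes from `node` up to and including the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    """Total link weight between the root and `node`."""
    return sum(WEIGHT[n] for n in path_to_root(node) if n in PARENT)

def lca(a, b):
    """Least common ancestor of two nodes (a node counts as its own ancestor)."""
    ancestors = set(path_to_root(a))
    for n in path_to_root(b):
        if n in ancestors:
            return n
    raise ValueError("nodes are not in the same hierarchy")

def dh_distance(x, y):
    """Distance between two points, each given as (anchor, offset from root)."""
    (nx, dx), (ny, dy) = x, y
    d_lcp = min(dx, dy, depth(lca(nx, ny)))  # offset of the least common point
    return dx + dy - 2 * d_lcp
```

For instance, `dh_distance(("Coke", 2.0), ("Pepsi", 2.0))` goes up one link to Drink and down one link, giving 2.0, while Coke and Cookie only meet at the root and are 4.0 apart.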

GSOM-Distance between a pattern and a map unit (cont.) • For example, assume that a two-dimensional pattern x=(x1, x2)=(Coke, 9), Dom(x2)=[5, 20], and distance hierarchies dh1 (categorical) and dh2 (numeric) are given as shown in Fig. x1=Coke is mapped to X=(Coke, 2) in dh1. x2=9 is mapped to X=(MAX, 4) in dh2, i.e., an offset of 9−5=4 from the lower bound of the domain. Note: dhi=Xi-Leaf distance

GSOM-Distance between a pattern and a map unit (cont.) • Assume a unit m consists of n components, m=[m1, m2, …, mn]. Each mi, which can be categorical or numeric, is composed of two parts: (N, d). • That is, mi=(N, d) is mapped to a point M with the value (N, d), denoted as dhi(mi)=M=(N, d), indicating that the anchor of the mapping point M is N and its offset from the root is d.

GSOM-Distance between a pattern and a map unit (cont.) • Suppose x, m, and dh represent a training pattern, a map unit, and a set of distance hierarchies, respectively. Then the distance between x and m is defined as $d(x, m) = \left( \sum_{i=1}^{n} |dh_i(x_i) - dh_i(m_i)|^2 \right)^{1/2}$ • For example, the differences between the paired mapping points of x and m are |(Coke, 2)-(Coke, 0.3)|=1.7 and |(MAX, 4)-(MAX, 6)|=2, respectively, making the distance between x and m $(1.7^2 + 2^2)^{1/2} \approx 2.62$. (Note: this allows categorical data to be handled without binary transformation.)
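The worked example can be checked with a short sketch. It assumes that the distance is the square root of the summed squared component differences, as the example's arithmetic suggests, and that all categorical links have weight 1. The root paths and function names are hypothetical.

```python
import math

# Hypothetical categorical hierarchy, stored as root-to-leaf paths (unit link weights).
ROOT_PATHS = {
    "Coke": ("Any", "Drink", "Coke"),
    "Pepsi": ("Any", "Drink", "Pepsi"),
    "Cookie": ("Any", "Appetizer", "Cookie"),
}

def lca_depth(a, b):
    """Depth of the least common ancestor of two leaves (the root has depth 0)."""
    shared = 0
    for na, nb in zip(ROOT_PATHS[a], ROOT_PATHS[b]):
        if na != nb:
            break
        shared += 1
    return float(shared - 1)

def component_distance(x, m):
    """Distance between two mapping points, each given as (anchor, offset)."""
    (nx, dx), (nm, dm) = x, m
    if nx == "MAX" and nm == "MAX":  # numeric: degenerate one-link hierarchy
        return abs(dx - dm)
    d_lcp = min(dx, dm, lca_depth(nx, nm))
    return dx + dm - 2 * d_lcp

def pattern_unit_distance(x_points, m_points):
    """Square root of the summed squared component distances."""
    return math.sqrt(sum(component_distance(x, m) ** 2
                         for x, m in zip(x_points, m_points)))

x = [("Coke", 2.0), ("MAX", 4.0)]  # pattern (Coke, 9) with Dom(x2) = [5, 20]
m = [("Coke", 0.3), ("MAX", 6.0)]  # map unit from the slide's example
print(round(pattern_unit_distance(x, m), 2))  # prints 2.62
```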

GSOM-Adaptation of a unit component • Let X=(P, dX), M=(Q, dM), δ be the adjusting amount, and NLCA be the least common ancestor of the anchors P and Q. • Case 1: new M is (Q, dM+δ) • Case 2: new M is (P, dM+δ) • Case 3: new M is (Q, dM−δ) • Case 4: new M is (P, 2dNLCA−dM+δ)

GSOM-Adaptation of a unit component • For a numeric component, the adjusting process is simpler due to its degenerated hierarchy. Let X=(MAX, dX), M=(MAX, dM), and δ be the adjusting amount. If dM > dX, the new M is (MAX, dM−δ); otherwise it is (MAX, dM+δ).
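The numeric rule is small enough to write out directly. In this sketch the adjusting amount is written `delta` and is passed in by the caller; how it is derived from the learning rate and neighborhood function is not shown on the slide and is left out here. The function name is hypothetical.

```python
def adapt_numeric(m, x, delta):
    """Move the numeric unit component m toward the pattern component x.

    Both points are (anchor, offset) pairs with anchor "MAX".
    """
    (_, d_m), (_, d_x) = m, x
    # The unit moves down when it lies above the pattern, up otherwise.
    new_offset = d_m - delta if d_m > d_x else d_m + delta
    return ("MAX", new_offset)
```

For example, a unit at offset 6.0 adapted toward a pattern at offset 4.0 with `delta=0.5` ends up at offset 5.5.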

EAOI • For the exploration of major values, we introduce a parameter, the majority threshold β. If some values (i.e., major values) take up a major portion (exceeding β) of an attribute, the EAOI preserves those major values and generalizes the other, non-major values. If β is set to 1, the EAOI degenerates to the AOI. Note: 0<β<=1 • Besides the cluster and class dimensions, the EAOI also reports the mean and standard deviation of numeric attributes, which addresses the bias in data characteristics that the traditional AOI can introduce.

EAOI (cont.) • Algorithm: An EAOI algorithm for major values and alternative processing of numeric attributes • Input: A relation W with an attribute set A; a set of concept hierarchies; generalization threshold θ and majority threshold β. • Output: A generalized relation P.

EAOI (cont.) • Method: 1. Determine whether to generalize numeric attributes. 2. For each attribute Ai to be generalized in W, 2.1 Determine whether Ai should be removed, and if not, determine its minimum desired generalization level Li in its concept hierarchy. 2.2 Construct its major-value set Mi according to θ and β. 2.3 For each v ∈ Dom(Ai), if v ∉ Mi, construct the mapping pair as (v, vLi−MLi); otherwise, as (v, v). 3. Derive the generalized relation P by replacing each value v by its mapping value and computing other aggregate values.
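Steps 2.2 and 2.3 can be sketched for a single categorical attribute. The slides do not spell out how the major-value set is built, so the rule used here is an assumption: take the most frequent values, at most θ of them, until their combined share exceeds β, and return no major values if that never happens. The one-level hierarchy and the function names are also hypothetical.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy: value -> generalized value.
PARENT = {"Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan"}

def major_values(values, theta, beta):
    """Assumed rule: top values (at most theta) whose combined share exceeds beta."""
    counts = Counter(values)
    majors, covered = [], 0
    for value, count in counts.most_common():
        if len(majors) == theta:
            return []
        majors.append(value)
        covered += count
        if covered > beta * len(values):
            return majors
    return []

def generalize(values, theta, beta):
    """Keep major values, generalize the rest, and count the resulting values."""
    majors = major_values(values, theta, beta)
    # Simplification: all values share one parent, so every major value is excluded.
    suffix = "-{" + ",".join(sorted(majors)) + "}" if majors else ""
    mapping = {v: v if v in majors else PARENT[v] + suffix for v in set(values)}
    return Counter(mapping[v] for v in values)

cities = ["Taipei"] * 30 + ["Taoyuan", "Hsinchu"]
print(generalize(cities, theta=3, beta=0.75))
print(generalize(cities, theta=3, beta=1.0))
```

With β=0.75, Taipei is preserved and the two remaining records become `North_Taiwan-{Taipei}`; with β=1 no value can exceed the threshold, so everything is generalized to `North_Taiwan`, as in the plain AOI.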

Exploratory clustering and pattern extraction • The GSOM alone is incapable of extracting clusters' characteristics, whereas the EAOI alone will result in over-generalization if the data are diverse and not clustered before generalization. • Three kinds of patterns can be analyzed: cluster characteristics, discriminant rules, and characteristic rules.

Exploratory clustering and pattern extraction (cont.) • Cluster characteristics: extracted by the EAOI from each cluster Ci, cluster characteristics are expressed as a set of patterns, each paired with its support. • For example, C1: {[(City=Taipei, Salary=(51000, 0));0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000));0.03]} represents two patterns, with supports of 97% and 3%, extracted from C1.
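A pattern of this kind pairs a generalized categorical value with the (mean, standard deviation) of a numeric attribute and a support. The sketch below assumes the rows have already been generalized, uses the population standard deviation (the slides do not say which variant is meant), and runs on made-up data; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_characteristics(rows):
    """rows: (generalized categorical value, numeric value) pairs of one cluster.

    Returns one (category, (mean, std), support) pattern per category.
    """
    groups = defaultdict(list)
    for category, number in rows:
        groups[category].append(number)
    return [
        (category, (mean(numbers), pstdev(numbers)), len(numbers) / len(rows))
        for category, numbers in groups.items()
    ]

# Made-up cluster: 97 Taipei records and 3 records from the rest of North Taiwan.
rows = [("Taipei", 51000)] * 97 + [
    ("North_Taiwan-{Taipei}", 50000),
    ("North_Taiwan-{Taipei}", 51000),
    ("North_Taiwan-{Taipei}", 52000),
]
patterns = cluster_characteristics(rows)
```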

Exploratory clustering and pattern extraction (cont.) • Discriminant rules: for instance, IF C1:{[(City=Taipei, Salary=(51000, 0));0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000));0.03]} → {A(0.7), B(0.3)} indicates that C1 has two patterns with supports of 97% and 3%, respectively, and that these patterns imply Class A with 70% confidence or Class B with 30% confidence.

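Exploratory clustering and pattern extraction (cont.) • Characteristic rules: • IF Drink ((birthPlace=Taichung, company=Business Administration, amt=(200, 3.4), (C2, 0.8)) or (birthPlace=Taipei, company=College of Management-Business Administration, amt=(150, 2.1), (C1, 0.2))) states that the class "Drink" contains two rules. In the first, 80% of the records belong to Cluster 2, characterized by Taichung and Business Administration, with a mean purchase amount of 200 and a standard deviation of 3.4. In the second, 20% belong to Cluster 1, characterized by Taipei and College of Management-Business Administration, with a mean purchase amount of 150 and a standard deviation of 2.1. The major value is "Business Administration".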

Experimental results- Synthetic data • This experiment aims to compare the results by using the conventional SOM and AOI with those of the GSOM and EAOI on a synthetic, mixed dataset. • We designed a dataset of 400 tuples, which has four attributes plus one class attribute, as shown in Table 1.

Experimental results- Synthetic data • The hierarchies for the attributes are shown in Fig. The hierarchies of Age and Amount are used only by the traditional AOI.

Experimental results- Synthetic data • The map size is 64 units, the learning rate is a linear function, the neighborhood radius function is set to the side length of the map, and the training time T is at least 10 times the map size.

Experimental results- Synthetic data • Training results after a training time of 12,000 (figure panels: GSOM, SOM).

Experimental results- Synthetic data • We further use the EAOI and the AOI to extract discriminant rules for the four groups formed on the GSOM. The parameters are set as follows: the attribute generalization threshold θ=3 and the majority threshold β=0.75.

Experimental results- UCI adult dataset • The dataset has 15 attributes: eight categorical, six numeric, and one class attribute, Salary, which indicates whether the salary is over 50K (>50K) or at most 50K (<=50K).

Experimental results- UCI adult dataset • We use three criteria to cluster the training results.

Experimental results- UCI adult dataset • For instance, the second criterion (d <=2.828) merges Clusters 4 and 7 of the GSOM in Fig. 10(a) and merges Clusters 1, 2, 5, 6, 10, 12, and 13 of the SOM in Fig. 10(b).

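Experimental results- UCI adult dataset • The average categorical utility (ACU) of a set of clusters is calculated as follows. • $ACU = \frac{1}{K} \sum_{k=1}^{K} \frac{|C_k|}{|D|} \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]$ • where K is the number of clusters, |Ck| is the size of Cluster k, |D| is the dataset size, P(Ai=Vij|Ck) is the conditional probability that the attribute Ai has the value Vij given the cluster Ck, and P(Ai=Vij) is the overall probability of Ai having Vij in the entire data set.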
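A small sketch of the computation, assuming the standard category-utility form averaged over the clusters: each cluster contributes its relative size times the summed difference between squared conditional and squared overall value probabilities. The function name and the toy data are hypothetical.

```python
from collections import Counter

def average_category_utility(clusters):
    """clusters: list of clusters, each a list of tuples of categorical values."""
    data = [row for cluster in clusters for row in cluster]
    n_attrs = len(data[0])
    # Overall value counts per attribute, over the whole data set.
    overall = [Counter(row[i] for row in data) for i in range(n_attrs)]
    total = 0.0
    for cluster in clusters:
        gain = 0.0
        for i in range(n_attrs):
            within = Counter(row[i] for row in cluster)
            for value, count in overall[i].items():
                p_cond = within[value] / len(cluster)  # P(Ai=Vij | Ck)
                p_all = count / len(data)              # P(Ai=Vij)
                gain += p_cond ** 2 - p_all ** 2
        total += len(cluster) / len(data) * gain
    return total / len(clusters)
```

Clusters that separate the values cleanly score higher: two pure clusters over a balanced binary attribute give 0.25, while two clusters that each mirror the overall distribution give 0.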

Experimental results- UCI adult dataset • We compute the ACU of categorical values of clusters formed by the three clustering criteria at the leaf level and Level 1 of the distance hierarchies, and the increased rate, as shown in

Experimental results- UCI adult dataset • The expected entropy of a class attribute C in a set of clusters measures how the class values are distributed in the clusters: • $E(C) = \sum_{k} \frac{|C_k|}{|D|} \left( -\sum_{j} P(C = V_j \mid C_k) \log P(C = V_j \mid C_k) \right)$ • where Vj denotes one of the possible values that C can take, |Ck| is the size of Cluster k, and |D| is the dataset size. • The chaining effect results in a reduced number of clusters and an increased expected entropy.
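The measure can be sketched as a size-weighted average of per-cluster class entropies. The base-2 logarithm, the function name, and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter

def expected_entropy(clusters):
    """clusters: list of clusters, each a list of class labels."""
    total = sum(len(labels) for labels in clusters)
    result = 0.0
    for labels in clusters:
        counts = Counter(labels)
        # Entropy of the class distribution inside this cluster (in bits).
        entropy = -sum((n / len(labels)) * math.log2(n / len(labels))
                       for n in counts.values())
        result += len(labels) / total * entropy
    return result
```

Merging two pure clusters of different classes into one raises the value from 0 to 1 bit, which matches the observation that fewer, chained clusters come with a higher expected entropy.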

Experimental results- UCI adult dataset • The Salary class distributions in the clusters are shown in Table 6, where Clusters 4 and 1 have the largest ratios of >50K. Clusters 5, 3, and 7 have much lower ratios of >50K than the dataset as a whole. • We use the EAOI and the AOI to extract cluster patterns. The parameters were set as follows: the attribute generalization threshold θ=4 and the majority threshold β=0.75.

Experimental results- UCI adult dataset • Tables 7 and 8 show a portion of the patterns extracted from Clusters 4, 2, and 7 by both methods.

Experimental results- Sales data • In another experiment, we used a subset of the sales records of a store at a university from 4/12/1999 to 7/17/2000.
