Mapping Nominal Values to Numbers for Effective Visualization
Presented by Matthew O. Ward
Geraldine Rosario, Elke Rundensteiner, David Brown, Matthew Ward
Computer Science Department, Worcester Polytechnic Institute
Supported by NSF grant IIS-0119276.
Presented at InfoVis2003, October 20, 2003.
Visualizing Nominal Variables
• Most data visualization tools are designed for numeric variables.
• What if a variable is nominal?
• Most tools that are designed for nominal variables cannot handle a large number of values.
Goals
Main goal: To display data sets containing nominal variables in visual exploration tools
Sub-goals: For each nominal variable
• To provide order and spacing to the values
• To group similar values together
Desired Features of the Solution:
• Data-driven
• Multivariate
• Scalable
• Distance-preserving
• Association-preserving
Proposed Approach
Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach:
• Distance: transform the data so that the distance between two nominal values can be calculated (based on the variable's relationship with other variables). Techniques: Multiple Correspondence Analysis, Focused Correspondence Analysis
• Quantification: assign order and spacing to the nominal values. Technique: Modified Optimal Scaling
• Classing: determine which values are similar to each other and can be grouped together. Technique: Hierarchical Cluster Analysis
Each step can be accomplished using more than one technique.
Distance-Quantification-Classing Approach
• Input: target variable & data set with nominal variables
• DISTANCE STEP produces: transformed data for distance calculation
• QUANTIFICATION STEP produces: nominal-to-numeric mapping
• CLASSING STEP produces: classing tree
Example Input to Output
Task: Pre-process color based on its patterns across quality and size.
Data:
• Quality (3): good, ok, bad
• Color (6): blue, green, orange, purple, red, white
• Size (10): a to j

Input: Observed Counts, COLOR by QUALITY
          Good    Ok   Bad  Total
Blue       187   727   546   1460
Green      267   538   356   1161
Orange     276   411   191    878
Purple     155   436   361    952
Red        283   307   357    947
White      459   366   327   1152
Total     1627  2785  2138   6550

Output of DQC: nominal-to-numeric mapping
Nominal  Numeric
Blue       -0.02
Green      -0.54
Orange      0.55
Purple      0
Red        -0.50
White       0.57

Output of DQC: classing tree with leaf order blue, purple, green, red, orange, white
Distance Step: Correspondence Analysis
• How strong is the association between COLOR and QUALITY?
• Can we find similar COLORs based on their association with QUALITY?

Observed Counts, COLOR by QUALITY
          Good    Ok   Bad  Total
Blue       187   727   546   1460
Green      267   538   356   1161
Orange     276   411   191    878
Purple     155   436   361    952
Red        283   307   357    947
White      459   366   327   1152
Total     1627  2785  2138   6550

Row Percentages
          Good    Ok   Bad  Total
Blue        13    50    37    100
Green       23    46    31    100
Orange      31    47    22    100
Purple      16    46    38    100
Red         30    32    38    100
White       40    32    28    100

Similar profiles: (blue, purple)
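The row-percentage comparison above can be sketched in a few lines. This is a minimal illustration, assuming the chi-square distance between row profiles (the distance conventionally used in correspondence analysis) as the similarity measure; the slide itself only shows the percentages and names the closest pair. Variable and function names are illustrative.

```python
import numpy as np

colors = ["Blue", "Green", "Orange", "Purple", "Red", "White"]

# Observed counts from the slide; columns are Good, Ok, Bad.
counts = np.array([
    [187, 727, 546],
    [267, 538, 356],
    [276, 411, 191],
    [155, 436, 361],
    [283, 307, 357],
    [459, 366, 327],
], dtype=float)

# Row profiles: each row divided by its total, so every row sums to 1.
row_profiles = counts / counts.sum(axis=1, keepdims=True)
row_percentages = np.round(100 * row_profiles).astype(int)

# Column masses: the average profile over the whole table.
col_masses = counts.sum(axis=0) / counts.sum()


def chi2_distance(i, j):
    """Chi-square distance between row profiles i and j (assumed measure)."""
    diff = row_profiles[i] - row_profiles[j]
    return float(np.sqrt(np.sum(diff ** 2 / col_masses)))


# Distance for every pair of colors, then the most similar pair.
pairs = {
    (colors[i], colors[j]): chi2_distance(i, j)
    for i in range(len(colors))
    for j in range(i + 1, len(colors))
}
closest = min(pairs, key=pairs.get)
print(row_percentages)
print("Most similar profiles:", closest)
```

On these counts the closest pair is blue and purple, which agrees with the "similar profiles" note on the slide.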
[Figure: Focused Correspondence Analysis (FCA) analyzes color against quality and size; Multiple Correspondence Analysis (MCA) analyzes size, quality, and color against quality, color, and size]
Similar row profiles: (blue, purple), …
Similar column profiles: (ok, bad), …
Similar column profiles are combined to produce fewer independent dimensions. [Singular Value Decomposition, etc.]

Coordinates for Independent Dimensions
          Dim1   Dim2
Blue     -0.02  -0.28
Green    -0.54   0.14
Orange    0.55   0.10
Purple    0     -0.25
Red      -0.50   0.20
White     0.57   0.19
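As a rough sketch of how a Singular Value Decomposition yields coordinates on independent dimensions, the following runs simple correspondence analysis on the COLOR by QUALITY table only. It is not the FCA or MCA computation from the talk: the coordinates on the slide also draw on size, so the numbers produced here are not expected to match them. Variable names are illustrative.

```python
import numpy as np

# Observed counts, COLOR (rows) by QUALITY (columns: Good, Ok, Bad).
counts = np.array([
    [187, 727, 546],
    [267, 538, 356],
    [276, 411, 191],
    [155, 436, 361],
    [283, 307, 357],
    [459, 366, 327],
], dtype=float)

P = counts / counts.sum()   # correspondence matrix (relative frequencies)
r = P.sum(axis=1)           # row masses
c = P.sum(axis=0)           # column masses

# Standardized residuals from the independence model r * c.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD: squared singular values are the inertia carried by each dimension.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the rows (one row per color).
row_coords = (U * sing) / np.sqrt(r)[:, None]

total_inertia = float(np.sum(sing ** 2))
print("Inertia per dimension:", sing ** 2)
print("Row coordinates (first two dimensions):")
print(row_coords[:, :2])
```

A 6 by 3 table has at most two non-trivial dimensions, so the third singular value is numerically zero. The total inertia equals the table's chi-square statistic divided by the grand total, which ties the coordinates back to the strength of association between COLOR and QUALITY.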
Quantification Step: Modified Optimal Scaling

Coordinates for Independent Dimensions
          Dim1   Dim2
Blue     -0.02  -0.28
Green    -0.54   0.14
Orange    0.55   0.10
Purple    0     -0.25
Red      -0.50   0.20
White     0.57   0.19

Nominal-to-numeric mapping
Nominal  Numeric
Blue       -0.02
Green      -0.54
Orange      0.55
Purple      0
Red        -0.50
White       0.57
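In this example the nominal-to-numeric mapping coincides with the Dim1 coordinates. The sketch below simply reads the mapping off Dim1 and derives the resulting order and spacing; it does not reproduce how Modified Optimal Scaling chooses or combines dimensions, which the slide does not detail. Names are illustrative.

```python
# Coordinates for independent dimensions, copied from the slide: (Dim1, Dim2).
coords = {
    "Blue": (-0.02, -0.28),
    "Green": (-0.54, 0.14),
    "Orange": (0.55, 0.10),
    "Purple": (0.0, -0.25),
    "Red": (-0.50, 0.20),
    "White": (0.57, 0.19),
}

# Assumption for this sketch: the numeric value is the Dim1 coordinate,
# which is what the example mapping on the slide shows.
mapping = {name: dims[0] for name, dims in coords.items()}

# Order: nominal values sorted by their numeric value.
ordered = sorted(mapping, key=mapping.get)

# Spacing: gap between each value and the next one along the axis.
gaps = [
    (a, b, round(mapping[b] - mapping[a], 2))
    for a, b in zip(ordered, ordered[1:])
]

print(ordered)
for a, b, gap in gaps:
    print(f"{a} -> {b}: {gap}")
```

The small gaps fall inside the pairs (green, red), (blue, purple), and (orange, white), and the large gaps fall between them, which is the order and spacing a numeric axis would then display.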
Classing Step: Hierarchical Cluster Analysis

Coordinates for Independent Dimensions [from FCA]
          Dim1   Dim2  Counts
Blue     -0.02  -0.28    1460
Green    -0.54   0.14    1161
Orange    0.55   0.10     878
Purple    0     -0.25     952
Red      -0.50   0.20     947
White     0.57   0.19    1152

Cluster Analysis weighted by counts
[Figure: classing tree; vertical axis: info loss, 0 to 100; leaf order: blue, purple, green, red, orange, white]
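One way to picture a count-weighted cluster analysis is a Ward-style agglomeration in which each merge costs w_i * w_j / (w_i + w_j) times the squared distance between cluster centroids. That cost, used here as a stand-in for the slide's "info loss", is an assumption of this sketch, as are the function and variable names; the talk does not spell out the exact linkage.

```python
import numpy as np


def weighted_ward(names, coords, weights):
    """Greedy agglomeration; returns a list of (members, merge_cost)."""
    clusters = [
        ([n], np.asarray(x, dtype=float), float(w))
        for n, x, w in zip(names, coords, weights)
    ]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair whose merge adds the least weighted variance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, xi, wi = clusters[i]
                _, xj, wj = clusters[j]
                cost = wi * wj / (wi + wj) * float(np.sum((xi - xj) ** 2))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        mi, xi, wi = clusters[i]
        mj, xj, wj = clusters[j]
        # New cluster: union of members, weighted centroid, summed weight.
        merged = (mi + mj, (wi * xi + wj * xj) / (wi + wj), wi + wj)
        merges.append((mi + mj, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


names = ["Blue", "Green", "Orange", "Purple", "Red", "White"]
coords = [(-0.02, -0.28), (-0.54, 0.14), (0.55, 0.10),
          (0.0, -0.25), (-0.50, 0.20), (0.57, 0.19)]
counts = [1460, 1161, 878, 952, 947, 1152]

merges = weighted_ward(names, coords, counts)
for members, cost in merges:
    print(members, round(cost, 2))
```

With the slide's coordinates and counts, the first three merges pair blue with purple, green with red, and orange with white, matching the leaf order of the classing tree.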
Experimental Evaluation
• Wrong quantification and classing can introduce artificial patterns and cause errors in interpretation
• Evaluation measures:
  • Perception: believability, quality of visual display
  • Statistical: quality of classing, quality of quantification
  • Computational: space (FCA uses less space), run time (MCA is faster)
Test Data Sets * UCI Repository of Machine Learning Databases
Believability and Quality of Visual Display
Given two displays resulting from different nominal-to-numeric mappings:
• Which mapping gives a more believable ordering and spacing?
• Based on your domain knowledge, are the values that are positioned close together similar to each other?
• Are the values that are positioned far from the rest of the values really outliers?
• Which display has less clutter?
Believability and Quality of Visual Display
Automobile Data: alphabetical order, equal spacing
Are these patterns believable?
Believability and Quality of Visual Display
Automobile Data: FCA
Are these patterns believable?
Quality of Classing
• Classing A is better than classing B if, given a classing tree, the rate of information loss with each merge is slower
• Depends on the data set
[Figure: information loss due to classing for one variable. The lower the line, the slower the info loss and the better the classing.]
To compare two classings: calculate the difference between the lines, then summarize.
Quality of Quantification
A quantification is good if:
• Data points that are close together in nominal space are also close together in numeric space
• Two variables that are highly associated with each other have quantified versions that are also highly correlated
MCA gives better quantification for most data sets, based on the average squared correlation measure.
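The second criterion can be illustrated with a small sketch. It takes one plausible reading of an "average squared correlation": the mean of the squared Pearson correlations over all pairs of quantified variables. The exact definition used in the evaluation is not given on the slide, so the function name, the pairing scheme, and the toy columns below are all assumptions.

```python
from itertools import combinations

import numpy as np


def average_squared_correlation(columns):
    """Mean squared Pearson correlation over all pairs of quantified columns.

    `columns` maps a variable name to a 1-D sequence of numeric
    (already quantified) values, one entry per record.
    """
    squared = [
        np.corrcoef(columns[a], columns[b])[0, 1] ** 2
        for a, b in combinations(columns, 2)
    ]
    return float(np.mean(squared))


# Toy, made-up quantified columns (not from the talk's data sets).
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with x
    "z": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with x and y
}

score = average_squared_correlation(columns)
print(round(score, 4))
```

Under this reading, a mapping that keeps associated variables correlated after quantification scores closer to 1, so two candidate mappings for the same data set can be compared by their scores.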
Summary
• DQC is a general-purpose approach for pre-processing nominal variables for data analysis techniques requiring numeric variables (linear regression) or low-cardinality nominal variables (association rules)
• DQC is multivariate, data-driven, scalable, distance-preserving, and association-preserving
• FCA is a viable alternative to MCA when memory space is limited
• Quality of classing and quantification:
  • depends on the strength of associations within the data set
  • is in the eye of the user
Next Steps
• Stress test the technique with more experiments
• Perform a user study that measures the quality of the visual display resulting from MCA vs. FCA
• Further investigate tuning parameters and sensitivity to characteristics of the data set
• Mixed or numeric variables as analysis variables
• Cascaded Focused Correspondence Analysis
Related Work
• Visualizing nominal data: CA plots [Fri99], sieve diagrams, mosaic displays, fourfold displays, Dimensional Stacking, TreeMaps
• Quantification: optimal scaling, homogeneity analysis [Gre93]
• Classing nominal variables: loss of inertia [Gre93], decision trees, concept hierarchy
• Clustering nominal variables: k-prototypes [Hua97b]
For further information
XmdvTool Homepage: http://davis.wpi.edu/~xmdv
Email: xmdv@cs.wpi.edu
Code is free for research and education.
Contact author: ger@wpi.edu