Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Presentation Transcript


  1. Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map Daniel X. Pape Community Architectures for Network Information Systems dpape@canis.uiuc.edu www.canis.uiuc.edu CSNA’98 6/18/98

  2. Overview • Self-Organizing Map (SOM) Algorithm • U-Matrix Algorithm for SOM Visualization • SOM Navigation Application • Document Representation and Collection Examples • Problems and Optimizations • Future Work

  3. Basic SOM Algorithm • Input • Number (n) of Feature Vectors (x) • format: vector name: a, b, c, d • examples:
      1: 0.1, 0.2, 0.3, 0.4
      2: 0.2, 0.3, 0.3, 0.2

  4. Basic SOM Algorithm • Output • Neural network Map of (M) Nodes • Each node has an associated Weight Vector (m) of the same dimensionality as the input feature vectors • Examples:
      m1: 0.1, 0.2, 0.3, 0.4
      m2: 0.2, 0.3, 0.3, 0.2

  5. Basic SOM Algorithm • Output (cont.) • Nodes laid out in a grid:

  6. Basic SOM Algorithm • Other Parameters • Number of timesteps (T) • Learning Rate (eta)

  7. Basic SOM Algorithm

    SOM() {
        foreach timestep t {
            foreach feature vector fv {
                wnode = find_winning_node(fv)
                update_local_neighborhood(wnode, fv)
            }
        }
    }

    find_winning_node(fv) {
        foreach node n {
            compute distance of weight vector m to feature vector fv
        }
        return node with the smallest distance
    }

    update_local_neighborhood(wnode, fv) {
        foreach node n in the neighborhood of wnode {
            m = m + eta * (fv - m)
        }
    }
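
A minimal runnable version of this loop, sketched in Python with NumPy. The rectangular grid, the Gaussian neighborhood, and the linear decay schedules for eta and the neighborhood radius are illustrative assumptions, not details given in the talk:

    import numpy as np

    def som(data, rows=10, cols=10, T=100, eta0=0.5):
        """Train a SOM on data of shape (n, d); return a (rows, cols, d) weight array."""
        n, d = data.shape
        rng = np.random.default_rng(0)
        weights = rng.random((rows, cols, d))
        # grid coordinates of every node, used for neighborhood distances
        coords = np.indices((rows, cols)).transpose(1, 2, 0).astype(float)
        sigma0 = max(rows, cols) / 2.0
        for t in range(T):
            frac = 1.0 - t / T
            eta = eta0 * frac                # linearly decaying learning rate
            sigma = sigma0 * frac + 0.5      # shrinking neighborhood radius
            for fv in data:
                # find_winning_node: the node whose weight vector is closest to fv
                dists = np.linalg.norm(weights - fv, axis=2)
                winner = np.unravel_index(np.argmin(dists), dists.shape)
                # update_local_neighborhood: pull nearby nodes' weights toward fv
                grid_d2 = ((coords - np.asarray(winner, dtype=float)) ** 2).sum(axis=2)
                h = np.exp(-grid_d2 / (2.0 * sigma ** 2))    # Gaussian falloff
                weights += eta * h[:, :, None] * (fv - weights)
        return weights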

  8. U-Matrix Visualization • Provides a simple way to visualize cluster boundaries on the map • Simple algorithm: for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors • This average distance measures how similar a node is to its neighbors (sketched below)
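
In code, this is only a few lines. The Python sketch below assumes a trained (rows, cols, d) weight array like the one produced above and uses 4-connected neighbors, one common choice:

    import numpy as np

    def u_matrix(weights):
        """For each node, average the distances between its weight vector and
        those of its immediate (4-connected) neighbors."""
        rows, cols, _ = weights.shape
        u = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                dists = [np.linalg.norm(weights[r, c] - weights[nr, nc])
                         for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                         if 0 <= nr < rows and 0 <= nc < cols]
                u[r, c] = np.mean(dists)
        return u    # high values mark cluster boundaries, low values cluster interiors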

  9. U-Matrix Visualization • Interpretation • One can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain • The result is a landscape that represents the document space: the valleys (dark areas) are the clusters of data, and the mountains (light areas) are the boundaries between the clusters

  10. U-Matrix Visualization • Example: • a dataset of random three-dimensional points, arranged in four obvious clusters

  11. U-Matrix Visualization Four (color-coded) clusters of three-dimensional points

  12. U-Matrix Visualization Oblique projection of a terrain derived from the U-Matrix

  13. U-Matrix Visualization Terrain for a real document collection

  14. Current Labeling Procedure • Feature vectors are encoded as 0’s and 1’s • Weight vectors have real values from 0 to 1 • Sort each weight vector’s dimensions by element value • the dimension with the greatest value gives the “best” noun phrase for that node • Aggregate nodes that share the same “best” noun phrase into groups (see the sketch below)
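
A small sketch of this grouping in Python; the function name and the (rows, cols, d) weight layout are assumptions carried over from the earlier sketches:

    import numpy as np

    def label_nodes(weights, noun_phrases):
        """Give each node the noun phrase whose weight element is largest, then
        group nodes that share the same 'best' phrase."""
        best = np.argmax(weights, axis=2)        # per node, index of largest element
        groups = {}
        for (r, c), idx in np.ndenumerate(best):
            groups.setdefault(noun_phrases[idx], []).append((r, c))
        return groups                            # phrase -> list of node coordinates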

  15. U-Matrix Navigation • 3D Space-Flight • Hierarchical Navigation

  16. Document Data • Noun phrases extracted • Set of unique noun phrases computed • each noun phrase becomes a dimension of the data set • Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase

  17. Document Data • Example: • 10 total noun phrases: alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death • each element of the feature vector will be a 1 or a 0: • 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0 • 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
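
The sketch below reproduces this encoding in Python, hard-coding the slide’s ten noun phrases and choosing document contents that match the example vectors:

    # The ten unique noun phrases from the example, in the slide's order
    vocabulary = ["alexander", "king", "macedonians", "darius", "philip",
                  "horse", "soldiers", "battle", "army", "death"]

    # Noun phrases present in each document (chosen to match the example vectors)
    doc_phrases = {
        1: {"alexander", "king", "philip", "horse"},
        2: {"king", "darius", "soldiers", "battle", "army", "death"},
    }

    # Each document becomes a binary vector over the vocabulary
    feature_vectors = {
        doc: [1 if phrase in phrases else 0 for phrase in vocabulary]
        for doc, phrases in doc_phrases.items()
    }
    # feature_vectors[1] == [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
    # feature_vectors[2] == [0, 1, 0, 1, 0, 0, 1, 1, 1, 1]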

  18. Document Collection Examples

  19. Problems • As document sets get larger, the feature vectors get longer and use more memory • Execution time grows unrealistically long

  20. Solutions? • Need algorithm refinements for sparse feature vectors • Need a faster way to do the find_winning_node() computation • Need a better way to do the update_local_neighborhood() computation

  21. Sparse Vector Optimization • Intelligent support for sparse feature vectors • saves on memory usage • greatly improves speed of the weight vector update computation
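
One piece of such support, sketched under the assumption that feature vectors are binary (as they are here): store only the indices of a vector’s 1-elements and expand the squared Euclidean distance so that only those indices are touched. This speeds up the distance side of find_winning_node():

    import numpy as np

    def sparse_sq_dist(m, nz, m_sq_norm):
        """Squared distance between a dense weight vector m and a binary feature
        vector represented only by its array of nonzero indices nz:
            ||m - x||^2 = ||m||^2 - 2 * sum(m[nz]) + |nz|
        since x[i] is 1 on nz and 0 elsewhere. Only |nz| elements are touched;
        ||m||^2 is cached per node and refreshed whenever m changes."""
        return m_sq_norm - 2.0 * m[nz].sum() + len(nz)

    # usage: cache m_sq_norm = float(m @ m) per node, recompute after each update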

  22. Faster find_winning_node() • SOM weight vectors become partially ordered very quickly

  23. Faster find_winning_node() U-Matrix Visualization of an Initial, Unordered SOM

  24. Faster find_winning_node() Partially Ordered SOM after 5 timesteps

  25. Faster find_winning_node() • Don’t do a global search for the winner • Start the search from the last known winner’s position • Pro: • usually finds the new winner very quickly • Con: • this local search can sometimes get stuck in a local minimum
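
A sketch of such a local search in Python, hill-climbing on the grid from the previous winner; the 8-connected neighborhood and the stopping rule are assumptions:

    import numpy as np

    def find_winner_local(weights, fv, start):
        """Hill-climb from the last known winner: repeatedly move to whichever
        neighboring node is closer to fv, stopping at a local minimum. Fast once
        the map is partially ordered, but it can get stuck in that minimum."""
        rows, cols, _ = weights.shape
        r, c = start
        best = np.linalg.norm(weights[r, c] - fv)
        improved = True
        while improved:
            improved = False
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        d = np.linalg.norm(weights[nr, nc] - fv)
                        if d < best:
                            best, r, c, improved = d, nr, nc, True
        return r, c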

  26. Better Neighborhood Update • Nodes get told to “update” quite often • Weight vector is made public only during a find_winner() search • With local find_winning_node() search, a lazy neighborhood weight vector update can be performed

  27. Better Neighborhood Update • Cache update requests • each node stores the winning node and feature vector for each update request • The node performs the update computations called for by the stored requests only when asked for its weight vector • The number of cached requests can be further reduced by averaging the feature vectors in the cache (sketched below)
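
A sketch of a node with such a cache in Python; the LazyNode class and its Gaussian neighborhood factor are hypothetical, illustrating the idea rather than the talk’s exact implementation:

    import numpy as np

    class LazyNode:
        """Node that caches update requests and applies them only when its
        weight vector is actually read (i.e. during a find_winner() search)."""
        def __init__(self, pos, weight, eta=0.1, sigma=1.0):
            self.pos = np.asarray(pos, dtype=float)
            self._m = np.asarray(weight, dtype=float)
            self._pending = []    # cached (winner position, feature vector) pairs
            self.eta, self.sigma = eta, sigma

        def request_update(self, winner_pos, fv):
            # just record the request; no computation yet
            self._pending.append((np.asarray(winner_pos, float), np.asarray(fv, float)))

        def weight(self):
            # perform the deferred updates now that the weight vector is needed
            for wpos, fv in self._pending:
                d2 = ((self.pos - wpos) ** 2).sum()
                h = np.exp(-d2 / (2.0 * self.sigma ** 2))    # neighborhood factor
                self._m += self.eta * h * (fv - self._m)
            self._pending.clear()
            # a further reduction: replace the loop with a single update toward
            # the average of the cached feature vectors
            return self._m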

  28. New Execution Times

  29. Future Work • Parallelization • Label Problem

  30. Label Problem • Current Procedure not very good • Cluster boundaries • Term selection

  31. Cluster Boundaries • Image processing • Geometric

  32. Cluster Boundaries • Image processing example:

  33. Term Selection • Too many unique noun phrases • Too many dimensions in the feature vector data • “Knee” of frequency curve
