320 likes | 415 Vues
Document Maps. Slawomir Wierzchon , Mieczyslaw Klopotek Michal Draminski Krzysztof Ciesielski Mariusz Kujawiak Institute of Computer Science, Polish Academy of Sciences Warsaw.
E N D
Document Maps Slawomir Wierzchon , Mieczyslaw Klopotek Michal Draminski Krzysztof Ciesielski Mariusz Kujawiak Institute of Computer Science, Polish Academy of Sciences Warsaw Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
Agenda • Motivation • What is a document map • Map creation • Clustering • Experimental results • Future directions
Motivation • The Web as well as intranets become increasingly content-rich: simple ranked lists or even hierarchies of results seem not to be adequate anymore • A good way of presenting massive document sets in an understandable form will be crucial in the near future
Document map • Many attempts have been made to visualize sets of dicuments not just like a list, but rather in two dimensions • A document map is a mapping of a set of documents to 2-D representing their inter-relationships
A relationship • A link between hypertext documents • Citation in the bibliography • Content similarity
A tree of relations with central subject (Inxight – Tree Studio)
Selforganizing map (WebSOM)dissimilarity of grouops of documents
Future research – hypergeometric representation (Fish-Eye eEffect)
The preparation of documents is done by an indexer, which turns a document into a vector-space model representation • Indexer also identifies frequent phrases in document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed - extreme entropy and extremely frequent terms excluded • The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • ‘The best’ (wrt some similarity measure) map is used bythe query processor in response to the user’s query
How are the maps created • A modified WebSOM method is used: • compact reference vectors representation • broad-topic initialization method • joint winner search method • multi-level (hierarchical) maps • multi-phase document clustering: • initial grouping to identify major topics • Initial document grouping • WEBSOM on document groups • fuzzy cell clusters extraction and labelling
Document model in search engines My dog likes this food dog • In the so-called vector model a document is considered as a vector in space spanned by the words it contains. food When walking, I take some food walk
Document model in search engines dog • The relevance of a document to a query or to another document is measured as cosine of angle between the query and the document. food Query: walk walk
Reference vector representation • Vectors are sparse by nature • During learning process they become even sparser • Represented as a balanced red-black trees • Tolerance threshold imposed • Terms (dimensions) below threshold are removed • Significant complexity reduction without negative quality impact
Topic-sensitive initialization • Inter-topic similarities important both for map learning and visualization/cluster extraction • Simple approach: • Use LSI to select K main broad topics • Select K map cells (evenly spread over the map) as the fixpoints for individual topics • Initialize selected fixpoints with broad topics • Initialize remaining cells with „in-between values”
Clustering document vectors r x m Mocna zmiana położenia (gruba strzałka) Document space 2D map Important difference to general clustering: not only clusters with similar documents, but also neighboring clusters similar
Joint winner search • Global winner search: accurate but slow • Local winner search: faster but can be inaccurate during rapid changes • Start with single phase of global search • Document movements become more smooth during learning process: usually local search is enough • Use global search when occassional sudden moves occur (eg. outliers, neighbourhood width decrease)
Hierarchical maps • Bottom-up approach • Feasible (with joint winner search method) • Start with most detailed map • Compute weighted centroids of map areas • Use them as seeds for coarser map • Top-down approach is possible but requires fixpoints
Clustering document groups • Numerous methods exists but none of them directly applicable: • Extremely fuzzy structure of topical groups in SOM cells • Neccesity of taking into account similiarity measures both in original document space and in the map space • Outlier-handling problem during cluster formation • No a priori estimation of the number of topical groups • Fuzzy C-MEANS on lattice of map cells applied • Graph theoretical approach (density- and distance- based MST) combined with fuzzy clustering • Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy
Experiments with map convergence • We examined the convergence of the maps to a stable state depending on: • type of alpha function (search radius reduction) • type of winner search method • type of initialization method
Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size (total number of cells) • Optimization methods: • dictionary optimization • reference vector representation • Map quality assessment: • Compare with ‘ideal’ map (e.g. without optimizations) • Identical initialization and learning parameters • Compute sum of squared distances of location of each document on both maps
Future research • Maps for joint term-citation model, taking into account between-group link flow direction • Fully distributed map creation • Adaptive document retrieval and clustering: • Bayesian network based relevance measure • Survival models for document update rate estimation • Dead link propagation methods for page freshness estimation • We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
Future research • Bayesian networks will be applied in particular to: • measure relevance and classify documents • accelerate document clustering processes • construct a thesaurus supporting query enrichment • keyword extraction • between-topic dependencies estimation
Thank you! Any questions?