Exploring Document Space: Visualization Techniques for Text Documents

Visualization of Document Corpus
Blaž Fortuna, Marko Grobelnik, Dunja Mladenič

Motivation
We have a large collection of text documents:
• What are the main topics in the set?
• Which documents are related, and how?
• Which topics are related, and how?
• How can we enable the user to explore the document space?

Document representation
• Bag-of-words
• Documents are encoded as vectors
• Each element of the vector corresponds to the frequency of one word
• These vectors live in a very high-dimensional space (dimensionality == number of distinct words in the collection)
Example document: Computers are used in increasingly diverse ways in Mathematics and the Physical and Life Sciences. This workshop aims to bring together researchers in Mathematics, Computer Science, and Sciences to explore the links between their disciplines and to encourage new collaborations.
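
A minimal sketch of the bag-of-words encoding described above, assuming simple lowercase whitespace tokenization (the slides do not say how words are extracted). The function name `bag_of_words` and the toy documents are illustrative only.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors): one word-frequency vector per document."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    # One vector dimension per distinct word in the whole collection.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = [
    "computers are used in mathematics",
    "mathematics and computer science",
    "life sciences and physical sciences",
]
vocabulary, vectors = bag_of_words(docs)
```

Even for three short documents the vectors already have one dimension per distinct word, which is why real collections end up with tens of thousands of dimensions.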

Problem
• Documents in bag-of-words representation live in a very high-dimensional space (usually >10,000 dimensions!)
• For visualisation the number of dimensions must be reduced to just 2!

The Big Picture
[Diagram: >10,000 dimensions -> Latent Semantic Indexing (LSI) -> >100 dimensions]

Latent Semantic Indexing
What is LSI?
• A linear technique for finding words with similar meaning, based on their co-occurrences in the documents
• Similar words are grouped into latent variables (concepts); one word can appear in more than one concept
• Documents are described by these concepts instead of words (== much lower dimension)
Background
• Uses Singular Value Decomposition (SVD) to find the best low-dimensional approximation of the documents
• Latent variables are the basis vectors of this low-dimensional subspace
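
The SVD step can be sketched with NumPy as follows. This is an illustration under assumptions, not the authors' implementation: the helper name `lsi`, the words-by-documents matrix layout, and the toy matrix are all made up for the example.

```python
import numpy as np

def lsi(term_doc, k):
    """Describe documents by the top-k latent concepts via truncated SVD.

    term_doc: (n_words, n_docs) matrix of word frequencies (assumed layout).
    Returns (concepts, doc_coords):
      concepts   -- (n_words, k) basis vectors of the latent subspace
      doc_coords -- (k, n_docs) coordinates of each document in that basis
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    concepts = U[:, :k]
    doc_coords = s[:k, None] * Vt[:k, :]
    return concepts, doc_coords

# Toy matrix: rows are words, columns are documents.
# Words 0-1 co-occur in documents 0-1, words 2-3 in documents 2-3.
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])
concepts, doc_coords = lsi(X, k=2)

# Best rank-2 approximation of the original matrix.
approx = concepts @ doc_coords
```

In this toy case documents 0 and 1 receive identical concept coordinates, because the two words they share are merged into a single latent concept.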

The Big Picture
[Diagram: >10,000 dimensions -> Latent Semantic Indexing (LSI) -> >100 dimensions -> Multidimensional scaling (MS) -> 2 dimensions]

Multidimensional scaling
• Non-linear technique for dimensionality reduction
• Finds positions of points in a lower-dimensional space so that the Euclidean distances between them best match the original distances
• Iterative gradient descent algorithm
• We use it to position documents in the two-dimensional plane
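
A rough sketch of gradient-descent multidimensional scaling. The slides do not give the exact objective or step size, so the raw stress function (sum of squared differences between embedded and original distances), the fixed learning rate, and the names `mds` and `stress` are assumptions made for illustration.

```python
import numpy as np

def stress(Y, distances):
    """Sum over pairs i<j of (embedded distance - original distance)^2."""
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return ((dist - distances) ** 2).sum() / 2.0

def mds(distances, dim=2, steps=2000, lr=0.01, seed=0):
    """Place n points in `dim` dimensions by gradient descent on the stress."""
    n = distances.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, dim))                  # random starting layout
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]       # (n, n, dim) pairwise offsets
        dist = np.linalg.norm(diff, axis=2)
        dist = np.maximum(dist, 1e-12)             # guard against division by zero
        coeff = (dist - distances) / dist          # diagonal is harmless: diff is 0 there
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

# Four points in 3-D that cannot be laid out exactly in the plane.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
Y = mds(distances)
```

In the pipeline above, `distances` would be computed between documents in the LSI concept space, and the rows of `Y` would be the document positions on the map.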

Landscape generation
• The density of points (documents) is used to generate a landscape.
• The landscape is used as a background: lighter is higher.
• Clusters of high density can be emphasized by drawing contour lines.
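
One way to turn point density into a height map is a Gaussian kernel density estimate on a regular grid. The slides do not state which density estimator is used, so the kernel, the `bandwidth` parameter, the unit-square extent, and the name `density_landscape` are assumptions for this sketch.

```python
import numpy as np

def density_landscape(points, grid_size=51, bandwidth=0.1):
    """Gaussian kernel density of 2-D points on a grid over the unit square.

    Returns a (grid_size, grid_size) array scaled to [0, 1];
    rows index y, columns index x.
    """
    axis = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(axis, axis)
    grid = np.stack([gx, gy], axis=-1)                       # (g, g, 2)
    # Squared distance from every grid cell to every document position.
    sq_dist = ((grid[:, :, None, :] - points[None, None, :, :]) ** 2).sum(axis=-1)
    height = np.exp(-sq_dist / (2.0 * bandwidth ** 2)).sum(axis=-1)
    return height / height.max()

# Three documents close together and one isolated document.
points = np.array([[0.20, 0.20],
                   [0.22, 0.20],
                   [0.20, 0.22],
                   [0.80, 0.80]])
height = density_landscape(points)
```

The resulting array can be rendered as a grayscale background (lighter for higher values), and contour lines can be drawn at fixed height levels.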

Keywords
Each point in the plane can be assigned a set of keywords by averaging the TF-IDF vectors of the documents close to that point.
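
The averaging step might look like the sketch below. How "close" is defined is not stated in the slides; a fixed radius with unweighted averaging is assumed here, and the function name `keywords_at`, the vocabulary, and the TF-IDF values are invented for illustration.

```python
import numpy as np

def keywords_at(point, positions, tfidf, vocabulary, radius=0.2, top=3):
    """Average the TF-IDF vectors of documents within `radius` of `point`
    and return the `top` highest-weighted words (empty list if none are near)."""
    near = np.linalg.norm(positions - point, axis=1) <= radius
    if not near.any():
        return []
    mean_vector = tfidf[near].mean(axis=0)
    best = np.argsort(mean_vector)[::-1][:top]
    return [vocabulary[i] for i in best]

vocabulary = ["kernel", "svm", "text", "corpus"]
positions = np.array([[0.10, 0.10],      # 2-D map positions of three documents
                      [0.15, 0.10],
                      [0.90, 0.90]])
tfidf = np.array([[0.9, 0.4, 0.0, 0.0],  # one TF-IDF vector per document
                  [0.7, 0.6, 0.1, 0.0],
                  [0.0, 0.0, 0.8, 0.6]])

left = keywords_at(np.array([0.1, 0.1]), positions, tfidf, vocabulary, top=2)
right = keywords_at(np.array([0.9, 0.9]), positions, tfidf, vocabulary, top=2)
```

Zooming in corresponds to querying with a smaller `radius`, so fewer documents contribute to the average and the keywords become more specific.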

Keywords
The user can also zoom in and check the keywords for a specific area.

Demo on two document collections
• Documents == scientific papers from the PASCAL network; only the abstract text is used
• Documents == researchers from the PASCAL network; each researcher is described by the abstracts of the papers he/she co-authored

Trip into the third dimension

Thank you for listening! • Questions?
