Data Visualisation

HCI 0283 Lecture 8 Document Visualisation Data Visualisation

Visualising the Non-Visual • How do we gain insight from text documents without having to read them? • How do we transform a text file into images? • How can the user use these images to answer questions? • These questions have become more and more important as more information is stored in electronic formats • Huge numbers of documents are available in organised digital libraries and far more are available in unorganised collections – web sites • These documents are available in many different forms and formats

Visualising the Non-Visual • Some examples • A week’s worth of news stories on news.bbc.co.uk • A collection of scientific papers such as the ACM library at portal.acm.org • A collection of witness statements and forensic reports taken by police as part of a murder investigation • ... “So much has already been written about everything that you can’t find out anything about it” (James Thurber)

Queries • More usefully we can identify the sort of question which the owner of a collection is likely to ask • To do this we need either • A set of keywords representing their interests • A document which is a representative sample of their interests • Likely questions include • Which documents are likely to be of interest to me? • What other documents might be sufficiently close to my interest to be worth consideration? • Are there any other documents whose title might trigger thoughts that are useful to my search? • How are my keywords actually distributed in this document?

Queries • Many ingenious diagrams have been developed to answer these and similar questions • Unfortunately they will never provide the whole answer sought by the user • In many cases the answer to a question causes the user to modify the question • The output of an algorithm must allow the user to interpret the results easily and interact with them – preferably with pointers to new information

TileBars • The TileBars system is a good illustration of the issues that can arise • This accepts a set of topics and a collection of documents • Suppose we have a collection of medical journals and wish to look for articles on research into the prevention of osteoporosis • Our topics would then be osteoporosis, prevention and research

User Query Osteoporosis Prevention Research TileBars

TileBars • For each document in the collection TileBars displays a reminder of the colours and the TileBar itself • Each vertical segment represents a paragraph of the document • The horizontal arrays show the relative appearance of the selected terms – darker sections contain more instances of the term than lighter sections • The user can then select to view that segment by clicking upon it – the corresponding paragraph is displayed with the topic words highlighted in the appropriate colour The effects of...
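The counting behind one TileBar can be sketched as follows. This is a minimal illustration, not the TileBars implementation: it assumes paragraphs are separated by blank lines and that terms are matched as whole lowercase words. The names `tilebar_rows` and `shade` and the sample text are hypothetical.

```python
import re

def tilebar_rows(document, topics):
    """Return {topic: [occurrences per paragraph]} for one document."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    rows = {}
    for topic in topics:
        pattern = re.compile(r"\b" + re.escape(topic.lower()) + r"\b")
        rows[topic] = [len(pattern.findall(p.lower())) for p in paragraphs]
    return rows

def shade(count, max_count):
    """Map a count to a grey level in 0..1 (0 = white, 1 = darkest)."""
    return 0.0 if max_count == 0 else count / max_count

# Hypothetical three-paragraph document.
sample = (
    "Osteoporosis affects bone density. Research on osteoporosis is growing.\n\n"
    "Prevention relies on diet and exercise.\n\n"
    "Further research into prevention is needed."
)
rows = tilebar_rows(sample, ["osteoporosis", "prevention", "research"])
for topic, counts in rows.items():
    print(topic, counts)
```

Each list is one horizontal row of the TileBar, and each entry is one vertical segment; `shade` turns a count into the darkness of that segment.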

Issues • Document visualisation is not information retrieval • Keywords may be ambiguous • Petrol (UK) = Gas (US) • Needs a dictionary of synonyms • Context – Any tool intended to support document visualisation must display enough of the document text to allow the user to assess the document’s content

Dimensionality • Documents are often characterised by very large sets of numbers • For each document we store the frequency of occurrence of each word • These numbers may be weighted depending upon whether they are keywords • It is impossible for humans to visualise spaces of hundreds of dimensions...

General Scheme • Most document visualisation schemes follow the same three stages: • Analysis, e.g. generation of first-order statistics, higher-order statistics, semantic data • Algorithms, e.g. clustering, placement of centroids in multi-dimensional space • Visualisation, e.g. display of encoded data

Analysis • The essential descriptors of a collection of text are extracted • This is usually the relative occurrences of words in the text • This allows the system to generate a ‘shorthand’ description of the document • Any images are removed, as are common words such as a, the, of etc...
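A minimal sketch of the analysis stage, assuming plain-text input and a deliberately tiny stop list. The name `describe` and the stop list are illustrative assumptions; real systems use much longer lists.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list of common words to discard.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}

def describe(text):
    """Return the relative frequency of each non-stop word in text."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

profile = describe("The prevention of osteoporosis is the aim of the research")
print(profile)
```

The resulting dictionary is the "shorthand" description: one number per distinct word, which is why real collections end up with hundreds of dimensions.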

Algorithms • Now algorithms generate an efficient and flexible representation of the documents • The high-dimensional data produced by the analysis cannot be shown directly and must be processed in some manner • Algorithms are used to transform it by clustering and projection onto a 2-D or 3-D space
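As one possible illustration of the clustering step, here is a small k-means sketch in plain Python. K-means is an assumed choice, since the lecture does not name a specific algorithm, and the vectors are hypothetical word counts. Projection onto 2-D is not shown.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, iterations=20):
    """Cluster vectors into k groups; returns (centroids, labels)."""
    # Assumption: the first k vectors seed the centroids
    # (deterministic, but not a robust initialisation).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, labels

# Hypothetical word counts over the vocabulary
# (osteoporosis, prevention, football, goal).
docs = [
    [5, 3, 0, 0],
    [0, 0, 4, 6],
    [4, 4, 0, 1],
    [0, 1, 5, 5],
]
centroids, labels = k_means(docs, k=2)
print(labels)
```

Documents that share a label belong to the same cluster, and each centroid is the average word-frequency vector of its cluster.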

Visualisation • The data produced by the algorithm is then • Encoded • Presented • Made sensitive to interaction • In TileBars this is the stage that presents the TileBars and document titles and allows the user to select a document and paragraph for examination • This is the bit we’re really interested in

Clouds • A recent innovation in document visualisation has been the appearance of “clouds” • Tag clouds – originally developed for Flickr, these show how many times a particular tag has been applied to an item • Word clouds – showing the frequencies of words in a given text. This is becoming a popular way of analysing political speeches but can also be used to get a general feeling of the contents of a document • Data clouds – used to highlight numerical data such as share prices using colour for direction and size for percentage change • Collocate cloud – used for collections of texts to show how often words are found in close proximity to each other • Users scan clouds rather than reading them, so it can be difficult to find the smaller words, particularly if they’re on the edge of the display
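The sizing rule behind a word cloud can be sketched as below. This assumes simple linear scaling between a minimum and maximum font size; tools such as tagcrowd.com or Wordle may scale differently. The name `cloud_sizes` and its parameters are hypothetical, and no stop words are removed here.

```python
import re
from collections import Counter

def cloud_sizes(text, top_n=5, min_pt=10, max_pt=40):
    """Return [(word, font size in points)] for the top_n most frequent words."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    sizes = []
    for word, n in top:
        # Linear scaling; if every count is equal, all words get max_pt.
        scale = (n - lowest) / span if span else 1.0
        sizes.append((word, round(min_pt + scale * (max_pt - min_pt))))
    return sizes

sizes = cloud_sizes("heat heat heat power power steam", top_n=3)
print(sizes)
```

The least frequent words land at `min_pt`, which illustrates why small words are easy to miss when a cloud is scanned rather than read.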

“God Bless Iceland” Word Cloud Created using tagcrowd.com

Wikipedia “Geothermal Power” Word Cloud Created using tagcrowd.com

“God Bless Iceland” Wordle Created using www.wordle.net

Wikipedia “Geothermal Power” Wordle Created using www.wordle.net

Data Cloud – Share Trades Data cloud graph showing the closing percentage increase or decrease of the New York Stock Exchange. Only the top 500 stocks by volume traded are displayed. Data is from Oct 2001, source: NYSE/Wikipedia

Collocate Cloud Created using Scottish Corpus of Texts and Speeches

Coffee Time!

Galaxies • Suppose we have two documents • The analysis phase will produce two high-dimensional vectors, one for each document, assigning a frequency to each word • One document may be designated as the reference document to define the type of content the user is searching for instead of using topic keywords • The analysis stage then computes a single measure of similarity between the two documents • It might perform a clustering process to identify the centroid of each cluster • When projected in 2-D this produces a galaxy of clusters, allowing the user to interactively select and investigate the individual clusters
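One common way to reduce two word-frequency vectors to a single similarity measure is cosine similarity. The lecture does not say which measure is used, so this is an assumed choice, and the function name and sample strings are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Similarity in 0..1 between two documents' word-count vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(n * n for n in a.values()))
            * math.sqrt(sum(n * n for n in b.values())))
    return dot / norm if norm else 0.0

reference = "osteoporosis prevention research"
candidate = "osteoporosis research funding"
similarity = cosine_similarity(reference, candidate)
print(round(similarity, 3))
```

A value near 1 means the candidate uses words in much the same proportions as the reference document; a value of 0 means they share no words.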

Themescapes • Instead of a galaxy, the clusters may be used to generate a 3D landscape called a themescape • This is a ‘thematic terrain’ that shows the primary themes and their relative prevalences • Elevation in the landscape represents the strength of the theme • Documents are represented by small points • Documents with similar content are placed closer together

Themescapes • Peaks appear where there is a collection of closely-related documents • Valleys are more interesting because they contain fewer documents and more unique content • The user can drill down into a landscape to obtain more detail • Moving a focus circle to an area of interest displays a brief list of the topics in the circle • Next a list of the titles in the circle is shown • Pausing on a title opens a short summary, while clicking on it opens the document
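A very rough sketch of how peaks and valleys could arise, assuming each document has already been projected to an (x, y) position in the unit square and that elevation is simply the number of documents in each grid cell. A real themescape weights elevation by theme strength; the name `elevation_grid` and the points are hypothetical.

```python
def elevation_grid(points, size):
    """Count documents per cell of a size x size grid over the unit square."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid

# Hypothetical 2-D positions: three related documents and one outlier.
points = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.05), (0.9, 0.8)]
grid = elevation_grid(points, size=2)
print(grid)
```

The cell holding three documents would be drawn as a peak, while the single outlier sits in low terrain.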

Focus Circle & Topics

Short Summaries

Flagging Documents

Themescapes • This approach has the advantage of using familiar landscape metaphors • It can be used equally for single paragraphs and entire documents • Once the themescape has been generated users can query it either by entering keywords or by selecting keywords from an automatically-generated list

Galaxy of News • In previous examples it has been clear how navigation is related to a fixed database structure • Galaxy of News removes the requirement for a data structure instead using a network of symbols (e.g. keywords, events, times, locations) • Inter-symbol weights are proportional to the number of documents in which the symbols occur together • Symbols are sorted spatially – those with high inter-symbol weights are placed close to each other in two dimensions • The third dimension – depth – provides progressively more detail • The user’s movement around space determines what appears in his view
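The inter-symbol weights can be sketched as a co-occurrence count. This assumes each article has already been reduced to a set of symbols and takes the weight of a pair to be the number of articles containing both. The function name and the sample articles are hypothetical.

```python
from collections import Counter
from itertools import combinations

def inter_symbol_weights(documents):
    """documents: iterable of symbol sets. Returns Counter {(s1, s2): count}."""
    weights = Counter()
    for symbols in documents:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(symbols)), 2):
            weights[pair] += 1
    return weights

# Hypothetical articles, each reduced to keywords, locations and times.
articles = [
    {"election", "london", "monday"},
    {"election", "london"},
    {"storm", "london"},
]
weights = inter_symbol_weights(articles)
for pair, count in weights.most_common():
    print(pair, count)
```

Pairs with the highest counts are the ones a layout step would place closest together in the 2-D view.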

Galaxy of News • This also highlights good visual design • Keyword fonts are adjusted as the user moves • Keyword transparency is used to show depth • Colour is adjusted as the user moves between keyword groups • Transparency of text lines between keywords and articles is adjusted • BUT it may be difficult to navigate through rather than to browse

Kohonen Maps • Document visualisation can also be based on neural network principles • Kohonen’s self-organising feature map algorithm takes a set of input objects (documents) each represented by an N-dimensional vector of ‘features’ (keywords) and maps them onto the nodes of a 2-D grid
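A compact sketch of self-organising map training in plain Python. The linear decay of learning rate and neighbourhood radius, the Gaussian neighbourhood, the grid size and the feature vectors are all assumptions chosen for illustration, not parameters taken from the lecture.

```python
import math
import random

def best_matching_unit(grid, vector):
    """Return (row, col) of the node whose weights are closest to vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            dist = sum((w - x) ** 2 for w, x in zip(weights, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a rows x cols map on the feature vectors and return its weights."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        # Assumption: rate and radius both shrink linearly over the epochs.
        fraction = 1 - epoch / epochs
        rate = 0.5 * fraction
        radius = max(rows, cols) / 2 * fraction + 0.5
        for vector in vectors:
            br, bc = best_matching_unit(grid, vector)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winner on the grid move more.
                    grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                    node = grid[r][c]
                    for i in range(dim):
                        node[i] += rate * influence * (vector[i] - node[i])
    return grid

# Hypothetical keyword-presence vectors for four documents.
docs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
som = train_som(docs, rows=3, cols=3)
for doc in docs:
    print(doc, "->", best_matching_unit(som, doc))
```

After training, each document is displayed at the grid position of its best matching unit, so documents with similar keywords tend to land on nearby nodes.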

Kohonen Maps • Regions occur where the response patterns of nodes are similar • The result of this algorithm reveals the main contents of the document set • Red labels on the map highlight the main areas related to the chosen topic • For instance, an SOM based on 1287 documents on Kohonen maps…

Kohonen Maps • This provides a global view of the document collection • There is sufficient content to permit • Browsing • Creation of an internal model • Navigation to focus on small areas of interest • The user can click on a dot to see the corresponding title or drag an area to see a list of titles in that area

Summary • Huge amounts of data are stored in non-graphical formats: text documents • We can search through the contents using keywords, but how do we display the results? • Analysis → Algorithms → Visualisation • Visualisation is often through clouds, galaxies, themescapes and Kohonen (self-organising) maps

Coming Soon… • Next lecture: Presentation • Homework: Read chapter 10 of Information Visualisation (Spence)
