Kirrkirr: Software for the Flexible and Interactive Visualization of a Structured Warlpiri Dictionary
Christopher Manning, Computer Science and Linguistics, Stanford University
Kevin Jansz, Linguistics, University of Sydney
Nitin Indurkhya, Applied Science, Nanyang Technological University
http://www.sultry.arts.usyd.edu.au/kirrkirr/
Research Program: Lexicon
• A language is more than individual words with definitions
  • It is a vast network of associations between words, and within and across the concepts that words represent
• The aim of this work is to give a wide variety of users, not just linguists, a better understanding of this conceptual map
• Traditional paper dictionaries offer very limited ways of making such networks visible
• On a computer, there are no such limitations on how information can be displayed
Research: Computational Lexicography
• Dictionaries on computers are now commonplace
• But there has been little attempt to exploit the potential of the new medium
• Most present a plain, search-oriented rendering of the paper version
• Goal: fun dictionary tools that are effective for browsing and language learning (cf. Kegl 1995)
• Like flicking through a paper dictionary, but better
• Innovative ways of representing and linking dictionary information, through creative use of computer software
• Should improve user support and incidental learning
• Focus: exploration/dissemination, not creation
Initial focus: Warlpiri
• Warlpiri is an Australian Aboriginal language spoken in the Tanami Desert (NW of Alice Springs)
• Several factors influenced this choice:
  • Rich lexical materials have been collected by linguists over decades (Ken Hale, MIT, from the 1950s; Simpson, Nash, Laughren, Hoogenraad), resulting in the most comprehensive lexical databases for any Australian language
  • Warlpiri is the first language of a relatively large community of people, and there is reasonable vernacular literacy
  • Until now, results have not been produced in a format the community can use (only raw printouts), which is not really acceptable. Fixing this is also good science: subtle linguistic judgments need speaker involvement.
Educational goals
• Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second-language learners) are not met. Focus: school kids.
• The low level of literacy in the region makes an e-dictionary potentially more useful than a paper edition:
  • It depends less on good knowledge of spelling and alphabetical order
  • It builds on the captivating qualities of computers
  • Multimedia content, including word pronunciations, is a considerable help as well
Kirrkirr: A Warlpiri dictionary browser (Jansz 1998; Jansz, Manning and Indurkhya 1999)
• An environment for the interactive exploration of dictionaries
• Although our work so far has been only with Warlpiri, the design is general (any XML dictionary)
• Attempts to make fuller use of graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information
• Written in Java, it can be run either over the web (needs bandwidth) or locally (here Java's main advantage is cross-platform support: Win/Mac/Unix)
• Originally JDK 1.1.6 + Swing, now Java 2
Overview
Kirrkirr provides various modules:
• Animated network layout of word relationships
• Formatted dictionary entries
• Semantic domains display
• A notes facility for ‘jotting in the margin’ annotations
• Multimedia: audio, pictures
• Advanced searching interfaces
• Others in planning: formatting (XSL) editing, figuration patterns, semantic domain browsing, terminology sets
• These attempt to cater to users with different interests and competence levels
The lexical database
• Original text materials are stored in an ad hoc markup format using backslash codes, with some (rather odd) nesting of structural tags [origin: runoff]
• These are converted to XML using an error-correcting, stack-based parser (written in Perl)
• The inconsistency and flexibility of dictionary entries made this a surprisingly difficult task
• Innumerable structural errors, inconsistencies and typos have accumulated from years of hand maintenance in text editors and via regexps
• The heuristic, content-sensitive parser imposes data integrity
• XML gives the data an explicit, manipulable structure
• The result remains a portable text file
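To make the idea of an error-correcting, stack-based conversion concrete, here is a minimal sketch in Python (not the authors' Perl parser). The markup convention is an assumption made for illustration: a code such as \hw opens an element and \ehw closes it. The tag names, the function name and the repair rules are likewise hypothetical; the real backslash format and heuristics are not described in these slides.

```python
import re
from xml.sax.saxutils import escape

# A backslash code, a run of plain text, or a stray backslash.
TOKEN = re.compile(r"\\([a-z]+)|([^\\]+)|(\\)")

def backslash_to_xml(text, tags):
    """Convert '\\tag ... \\etag' markup to XML, repairing bad nesting.

    `tags` is the set of known structural codes (hypothetical names).
    """
    out, stack = [], []

    def close_down_to(tag):
        # Pop and close open elements until `tag` itself has been closed.
        while stack:
            top = stack.pop()
            out.append(f"</{top.upper()}>")
            if top == tag:
                break

    for m in TOKEN.finditer(text):
        code = m.group(1)
        if code in tags:                                   # opening code
            stack.append(code)
            out.append(f"<{code.upper()}>")
        elif code and code[0] == "e" and code[1:] in tags:  # closing code
            if code[1:] in stack:
                close_down_to(code[1:])   # also closes anything left open inside
            # otherwise: a close with no matching open is silently dropped
        else:
            # Plain text, an unknown code, or a stray backslash: keep as text.
            piece = m.group(0).strip()
            if piece:
                out.append(escape(piece))

    while stack:                          # close whatever is still open at the end
        out.append(f"</{stack.pop().upper()}>")
    return "".join(out)
```

For example, with tags {"entry", "hw", "def"}, the input \entry \hw word \ehw \def a gloss \eentry is missing its \edef, and the stack closes DEF automatically before closing ENTRY. A production converter would need far more content-sensitive heuristics than this.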
Kirrkirr’s XML Index Process
[Diagram. Labelled components: Kirrkirr Dictionary Browser, XML Parser, XML Document Object, XSL file, XSL Processor, HTML document. The Index in Memory maps each headword to a file position in the XML Formatted Warlpiri dictionary file, which has the form <DICTIONARY> <ENTRY> ... </ENTRY> <ENTRY> ... </ENTRY> <ENTRY> ... </ENTRY> </DICTIONARY> and is read across the file system or web.]
XML Indexing
• We currently use ad hoc indexing of one large XML file
• This gives adequate speed and memory use, but requires a modified XML parser to extract and parse a single entry
• We have also experimented with an XQL version using a PDOM (GMD-IPSI): more flexible, but slower
• Parsed entries are cached
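The headword-to-file-position index can be illustrated with a short Python sketch. This is a simplified stand-in, not Kirrkirr's Java implementation or its modified parser: it assumes entries are delimited by plain <ENTRY> tags (as in the diagram) and that the headword sits in a hypothetical <HW> element. The names build_index and EntryStore are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from functools import lru_cache

ENTRY = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HEADWORD = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)   # <HW> is a hypothetical tag

def build_index(path):
    """Map each headword to the (offset, length) of its entry in the file."""
    with open(path, "rb") as f:
        data = f.read()            # one pass over the file at startup
    index = {}
    for m in ENTRY.finditer(data):
        hw = HEADWORD.search(m.group(0))
        if hw:
            index[hw.group(1).decode("utf-8")] = (m.start(), m.end() - m.start())
    return index

class EntryStore:
    """Parse single entries on demand and cache the parsed result."""

    def __init__(self, path, cache_size=256):
        self.path = path
        self.index = build_index(path)
        self.load = lru_cache(maxsize=cache_size)(self._load)

    def _load(self, headword):
        offset, length = self.index[headword]
        with open(self.path, "rb") as f:
            f.seek(offset)         # jump straight to the entry
            return ET.fromstring(f.read(length))
```

Only the small index stays in memory; each lookup reads and parses just the bytes of one entry, and repeated lookups of the same headword are served from the cache.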
Performance: Startup time
• Impact on startup time [200 MHz Pentium]: [figure not preserved in this transcript]
Visualization of dictionary information
• For dictionaries with only simple textual content behind them, little can be done beyond an on-line reflection of the printed page
• But we would like to be able to do more
  • We want to know a word’s relationships to other words, and the patterning in these relationships
• In a computational approach, the program can mediate between lexical data and the user
• The interface can select which information to present and how to present it (according to the user’s preferences and abilities) in many different ways
Graph-based visualisation (Jansz 1998; Jansz, Manning and Indurkhya 1999)
• Classic graph layout problem
• Adapts work by Eades et al. (1998) and Huang et al. (1998) on visualisation and navigation of WWW document linkages
• Uses the spring algorithm. Its big advantage is that it is an iterative updating algorithm, which makes interactivity easy:
  • It wiggles and people can play with it, clicking to sprout nodes
• A major goal was clarity and simplicity of the graph: the software maintains a set of focus nodes to prevent overcrowding
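The iterative nature of a spring layout can be seen in a small Python sketch. This is a generic spring embedder written for illustration, not the specific formulation of Eades et al. or the Kirrkirr code; the force constants, the rest length and the function names are all assumptions.

```python
import math

def spring_step(pos, edges, rest=1.0, k_spring=0.1, k_repel=0.05):
    """Return new positions after one iteration of a simple spring embedder.

    pos   : dict mapping node -> (x, y)
    edges : set of (node, node) pairs, treated as undirected
    """
    nodes = list(pos)
    force = {n: [0.0, 0.0] for n in nodes}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9      # avoid division by zero
            pull = -k_repel / (dist * dist)        # every pair repels
            if (a, b) in edges or (b, a) in edges:
                pull += k_spring * (dist - rest)   # linked pairs act as springs
            fx, fy = pull * dx / dist, pull * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in nodes}

def layout(pos, edges, iterations=200):
    """Repeat the update; a GUI would redraw after each step instead."""
    for _ in range(iterations):
        pos = spring_step(pos, edges)
    return pos
```

Because each step only nudges the nodes, an interface can redraw between steps, which produces the "wiggling" animation, and a user can drag a node or add new ones at any time while the layout simply keeps relaxing.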
Formatted dictionary entries
• Produced automatically and online from the XML using XSLT, a tree transformation language
• XSL allows easy modelling of some user preferences
• One can leave out information such as part of speech or detailed definitions, or rearrange it
• We provide several stylesheets to choose from
• This issue is surprisingly important: many users find information overload confusing and demotivating
• Can produce a bilingual or monolingual dictionary
• Can also be used for print dictionaries (via RTF or TeX). We have produced a couple of samples.
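The effect of switching stylesheets, leaving fields out or reordering them, can be mimicked in a few lines of Python. This is only a stand-in for illustration: it is not XSLT and not how Kirrkirr formats entries, and the element names HW, POS and DEF are hypothetical.

```python
import xml.etree.ElementTree as ET

def format_entry(entry_xml, show=("HW", "POS", "DEF")):
    """Render an entry as text, keeping only the fields listed in `show`.

    The order of `show` decides the order of the output, so one tuple plays
    the role of one stylesheet (a user preference profile).
    """
    entry = ET.fromstring(entry_xml)
    parts = []
    for field in show:
        for element in entry.iter(field):
            if element.text and element.text.strip():
                parts.append(element.text.strip())
    return " | ".join(parts)
```

A profile for young learners might pass show=("HW", "DEF") to hide the part of speech, while another could list the definition first; the underlying XML entry is the same in every case.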
Rich typology of link types
• The semantic links present in a dictionary (synonym, antonym, hyponym, subentry, variant, coverbs, …) solve a major problem of the web: we have many link types, each with a clear semantic interpretation
• We use consistent colour-coding of text and network edges to show these link types
  • Gives a richer browsing experience
  • You can tell where you are going before clicking
• Dictionary-given links are supplemented by links derived from collocational analysis of Warlpiri texts
  • Uses log-likelihood ratios (Dunning 1993)
  • Works reasonably successfully from 1/4 million words
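For readers unfamiliar with the log-likelihood ratio, the statistic for a word pair is computed from a 2x2 table of co-occurrence counts as G2 = 2 * sum(O * ln(O / E)), where O is an observed count and E the count expected if the two words were independent. The Python sketch below shows that calculation; the function and argument names are chosen for illustration and this is not the code used for the Warlpiri texts.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table of counts.

    k11: both words occur together     k12: first word without the second
    k21: second word without the first k22: neither word occurs
    The counts must not all be zero.
    """
    table = ((k11, k12), (k21, k22))
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:              # a zero cell contributes nothing
                expected = row_sums[i] * col_sums[j] / total
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

A score of zero means the pair co-occurs exactly as often as chance predicts; larger scores indicate stronger evidence of association, so candidate links can be ranked by score and cut off at a chosen threshold.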
Semantic domain browsing
• A common request of teachers and users is to view words via semantic domains
Educational advantages/usability
• Work at PARC and elsewhere (Pirolli et al. 1996) has stressed the role of browsing as well as searching in information access
• Browsing provides a context for learning
• A student can opportunistically explore words that are related in various ways
• Important semantic relationships can be understood
• People continually see alphabetical order and word spellings, but don’t need to know them to use Kirrkirr
• “Fuzzy spelling” in searches supports users with poor spelling. It usually finds what you wanted.
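The slides do not say how fuzzy spelling is implemented. As one plausible sketch, approximate matching against the headword list can be done with Python's standard difflib module, which ranks candidates by string similarity. The function name, the cutoff value and the sample headword list are assumptions made for the example.

```python
import difflib

def fuzzy_lookup(query, headwords, limit=5, cutoff=0.6):
    """Return up to `limit` headwords that resemble `query`, best match first."""
    by_lowercase = {h.lower(): h for h in headwords}   # compare case-insensitively
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[h] for h in hits]
```

With the placeholder list ["kirrkirr", "placeholder", "sample"], the misspelled query "Kirkir" still finds "kirrkirr". A system tuned for one language would more likely apply spelling rules specific to that language's orthography rather than generic string similarity.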
Other components
• Multimedia (currently pictures and audio)
  • Can hear pronunciations, which gives a much better understanding of pronunciation than phonetic symbols
  • Pictures of plants and animals are more intelligible than descriptions
  • (Future: videos of Warlpiri sign language …)
• Advanced search page
  • Search various fields, regular expressions, fuzzy spelling, etc.
• Notes: one can annotate dictionary entries (to correct or personalise)
Interim Conclusions
• Kirrkirr is a prototype of what one can do to develop new ways to organize and visualize lexicons
• We have addressed the challenge of making dictionary information accessible and usable by creating an application that mediates between well-structured data and users’ needs and insights in searching, browsing and presentation
• This year the interface has started to be used regularly in Warlpiri schools (one school at the moment, hopefully more to follow soon):
  • “Look it up on that thing!”