The Challenge of Finding Information in Long Documents

David J Harper
The Robert Gordon University, Smart Web Technologies Centre, School of Computing, Aberdeen, Scotland
LIDA 2003 Invited Paper

Preamble
• Information retrieval research has focussed largely on document retrieval, and rather less on within-document retrieval
• Within-document retrieval is just part of a range of tools and techniques that address “retrieval-with-reading” activities
• Explore language modelling as a principled basis for “retrieval-with-reading” techniques or tools

Outline of Talk
• Categorisation of retrieval-with-reading activities
• Review of retrieval-with-reading techniques and tools
• Language Modelling 101
• ProfileSkim: Relevance Profiling Tool
• Applying Language Modelling to retrieval-with-reading activities
• Concluding Remarks

Categorisation of Reading Activities
Reading to …
• … select a document
  • Buying a book
  • Opening a webpage retrieved by search engine
  • Deciding to read document
• … extract/locate specific information
  • Finding a quotation in a book
  • Locating contact details on a webpage
• … reference information (more generally)
  • Finding supporting information for a legal case
  • Finding related work

Categorisation of Reading Activities (cont)
Reading to …
• … write a document
  • Usually involves a complex mix of other reading activities
• … explore the information space from a given “pivot” document
  • Follow up bibliographic references in a paper
  • Follow hypertext links in web pages
  • Find similar documents
• … understand a document in depth
  • Reading a book/paper cover-to-cover
  • Skimming a book/paper

Reading to Select a Document
• Enabled by various forms of document summarisation or overview
• Summarisation of documents, e.g. automatic abstracting or extracting
• Snippet summarisation of web pages retrieved by search engines:
  • Generic summarisation
  • Query-biased summarisation
• Overviews of document structure/content

Reading to Select a Document (Example 1)
Query-biased web page summarisation
• Generating summaries for use in ranked retrieval display
• Summaries based on distribution of words in document (title, headings, body), biased towards query words
• Top-scoring sentences used in summary
• User experiments confirm that query-biased summaries are better than general summaries
• Tombros and Sanderson 1998

Reading to Select a Document (Example 2)
• TileBars: compact visualisation of retrieved documents with respect to query (topic), showing:
  • the relative length of each document,
  • the frequency of the topic words in the document, and
  • the distribution of the topic words with respect to the document and to each other
• Hearst 1995

Reading to Extract Specific Information
• Information extraction techniques that extract factoids (and usually populate a database) based on templates, e.g. extracting contact details from web pages, Ask Jeeves
• Passage (or snippet) retrieval, where the passage contains the desired specific information
• Browsing tools and techniques:
  • Query term highlighting within retrieved documents
  • Find function in web browser/word processing package (woeful)

Reading to Reference Information in a Document
• Reading tools that integrate document overviews (e.g. table of contents) and document view
• Passage retrieval, providing that passages rather than documents are retrieved
• Within-document retrieval tools
• ProfileSkim: passage retrieval in context

Reading to Write a Document
• Interleaving of writing and reading sub-tasks
• Mix of different kinds of reading activities
• Example: Remembrance Agent
  • Augments user while writing (unobtrusive)
  • Displays documents (emails, notes, online documents) relevant to user’s current context
  • Monitors writing/browsing activity and displays one-line summaries in document editor (Emacs)
  • Rhodes and Starner 1996

Reading to Explore from Pivot Document
• Follow up references, papers by same author, same group, etc. CiteSeer is the obvious tool on the Web
• Find nearest-neighbour documents by essentially using the pivot document as a query, e.g. “More Like This” function
• Explore category in which document is located, e.g. documents in NLM MeSH category, web pages in Yahoo! category
• Follow hard-wired hypertext links
  • Within and between document cross references
• Follow “soft” hypertext links
  • Use chunk of document text as a query [Plagiarism Story]

Reading to Understand or Study a Document
• In general, will involve a mix of other kinds of reading activity
• Annotation (including ability to add dynamic cross references) and “clipping” are arguably as important as reading

“Reading” of Multi-media Documents
• Kinds of reading activity equally applicable to multimedia documents
• Reading to select: video or soundtrack
• Reading to extract: quotation in audio speech
• Reading to reference: scene/shot retrieval in a video

Language Modelling 101
• (Simple) statistical representation of a “chunk” of text, e.g. of a document, paragraph, etc.
• Simplest model is the “bag of words” model, which essentially:
  • Counts frequencies of words (tokens) in text
  • Interprets counts as a probability distribution
• Use distributions to compare different text chunks!!

“Bag of Words” Example
Consider relevance of this document with respect to queries: { TREC, experiment } and { precision, recall }

Word          Prob
evaluation    0.05
retrieval     0.15
information   0.15
system        0.15
TREC          0.25
experiment    0.15
precision     0.05
recall        0.05
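The comparison the slide invites can be worked through in a few lines. The sketch below is only an illustration: it assumes the simplest unsmoothed scoring, where the probability of a query is the product of the probabilities of its words under the document model. The function name `query_likelihood` is hypothetical; the probabilities are the ones on the slide.

```python
from math import prod

# Word probabilities copied from the "Bag of Words" example slide.
doc_model = {
    "evaluation": 0.05, "retrieval": 0.15, "information": 0.15,
    "system": 0.15, "TREC": 0.25, "experiment": 0.15,
    "precision": 0.05, "recall": 0.05,
}


def query_likelihood(query, model):
    """Probability of generating each query word independently from the model.

    Assumption: no smoothing, so a word missing from the model gives 0.
    """
    return prod(model.get(word, 0.0) for word in query)


p_trec = query_likelihood(["TREC", "experiment"], doc_model)
p_eval = query_likelihood(["precision", "recall"], doc_model)
print(p_trec > p_eval)  # the document matches { TREC, experiment } better
```

Under this scoring the document is fifteen times more likely to generate { TREC, experiment } (0.25 × 0.15) than { precision, recall } (0.05 × 0.05).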

Language Modelling 101 (cont)
• Language models can be built over any chunks of text:
  • Collection or (arbitrary) set of documents
  • Entire document
  • Parts of document
• Given Text1 and Text2, and corresponding language models ModelT1 and ModelT2, we can use them to:
  • Compare similarity of texts by comparing models: ModelT1 <-> ModelT2, e.g. document <-> document
  • Decide if a text could be “generated” from another text: Probability of (ModelT1 -> Text2), e.g. document -> query, often expressed as Prob( Query | ModelDocument )
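Both uses of a language model can be sketched briefly. The slides do not say which similarity measure or smoothing scheme is intended, so the choices here are assumptions: Jensen-Shannon divergence for ModelT1 <-> ModelT2, and linear interpolation with a background model for the generation probability. All function names and the `lam` parameter are hypothetical.

```python
import math
from collections import Counter


def unigram_model(text):
    """Bag-of-words model: relative frequency of each whitespace token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed measure for ModelT1 <-> ModelT2).

    0.0 means identical models; 1.0 means no shared vocabulary.
    """
    mix = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in set(p) | set(q)}

    def kl(a):
        return sum(pa * math.log2(pa / mix[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def generation_prob(text, model, background, lam=0.5):
    """P(text | model), smoothed by interpolating with a background model.

    The interpolation weight lam is an assumed value, not from the slides.
    """
    prob = 1.0
    for word in text.lower().split():
        prob *= lam * model.get(word, 0.0) + (1 - lam) * background.get(word, 0.0)
    return prob
```

For example, `generation_prob("language models", unigram_model(doc), unigram_model(collection))` scores how plausibly the query could have been generated from `doc`, where `doc` and `collection` are any two non-empty strings.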

Using Language Models for Retrieval Processes
[Figure: Document 1, Document 2, …, Document D, each with its own model (Model of 1, Model of 2, …, Model of D), and a Query]
• ModelT1 <-> ModelT2: similarity of text chunks, e.g. document with document
• Pr (ModelT1 -> Text2): matching based on probability of generating one text chunk from another, e.g. query from document

ProfileSkim
• Developed to support retrieval within long documents
• Within-document retrieval tool: supports reading to extract and reading to reference
• Main concept: relevance profiling based on language modelling
• Harper et al 2002, 2003

Overview of ProfileSkim Tool
[Screenshot callouts: file to skim; skim query; tile being visited; highlighted query term variants]

Relevance Profile Meter (1)
[Figure: relevance profile meter plotting retrieval status value against word position; callouts: document tile; click and visit ...]

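Relevance Profiling Process
[Figure: a sliding window moves over the document and P(query | window) is computed at each position; the document is divided into tiles, and the maximum value within a tile becomes the tile RSV (tile max -> tile RSV)]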
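The “tile max -> tile RSV” step can be sketched as follows. The slide does not say how tile boundaries are chosen, so fixed-length tiles are an assumption here, and `tile_rsvs`, `rank_tiles`, and `tile_size` are hypothetical names. The input is a list with one window RSV per word position.

```python
def tile_rsvs(profile, tile_size):
    """Collapse a per-word relevance profile into one score per tile.

    Each tile takes the maximum window RSV among its word positions.
    Assumption: tiles are consecutive runs of tile_size word positions.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    return [max(profile[start:start + tile_size])
            for start in range(0, len(profile), tile_size)]


def rank_tiles(profile, tile_size):
    """Return tile indices ordered from highest to lowest tile RSV."""
    scores = tile_rsvs(profile, tile_size)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Taking the maximum rather than the mean means a single strongly matching passage is enough to flag its tile as worth visiting.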

Profile Generation using Language Modelling
• Sliding window of fixed size (N words)
• Compute “retrieval status value” RSVwindow at each word position in the document
• RSVwindow = P( generate query | window )
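A minimal sketch of profile generation follows. Several details are assumptions rather than part of the slide: the window is centred on each word position (and is shorter at the document edges), the window model is smoothed by interpolation with the whole-document model, and the weight `lam` is arbitrary. `relevance_profile` is a hypothetical name, not the ProfileSkim implementation.

```python
from collections import Counter


def relevance_profile(doc_tokens, query_tokens, window_size=5, lam=0.8):
    """Return RSVwindow = P(query | window) for every word position.

    Assumptions: window centred on the position, truncated at the edges;
    window model interpolated with the document model using weight lam.
    """
    doc_counts = Counter(doc_tokens)
    n = len(doc_tokens)
    half = window_size // 2
    profile = []
    for pos in range(n):
        window = doc_tokens[max(0, pos - half):pos + half + 1]
        win_counts = Counter(window)
        rsv = 1.0
        for term in query_tokens:
            p_window = win_counts[term] / len(window)
            p_doc = doc_counts[term] / n
            rsv *= lam * p_window + (1 - lam) * p_doc
        profile.append(rsv)
    return profile


tokens = "a b c trec experiment d e f g h".split()
profile = relevance_profile(tokens, ["trec", "experiment"])
```

Plotting `profile` against word position gives the kind of curve a relevance profile meter displays: high where query words cluster, low elsewhere.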

Query-biased summarisation: Using LM
• Select representative paragraph for a retrieved document based on query
• Choose paragraph (para) where: Mpara <-> Mdoc is largest AND Pr (Mpara -> Query) is largest
[Figure: Document and its language models Mdoc, Mpara1, Mpara2, etc.; Paragraph; Query]
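The paragraph selection rule can be sketched as below. The slide does not specify how the two criteria are combined or measured, so this sketch makes loud assumptions: similarity Mpara <-> Mdoc is taken as one minus the Jensen-Shannon divergence, Pr (Mpara -> Query) is smoothed with the document model, and the two are simply multiplied. All names and the `lam` weight are hypothetical, and paragraphs are assumed to be non-empty.

```python
import math
from collections import Counter


def model(tokens):
    """Unigram model: relative frequency of each token."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}


def js_similarity(p, q):
    """1 - Jensen-Shannon divergence (bits); an assumed stand-in for Mpara <-> Mdoc."""
    mix = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in set(p) | set(q)}

    def kl(a):
        return sum(pa * math.log2(pa / mix[w]) for w, pa in a.items())

    return 1.0 - (0.5 * kl(p) + 0.5 * kl(q))


def query_prob(query, m_para, m_doc, lam=0.7):
    """Pr(Mpara -> Query), interpolated with the document model (assumed smoothing)."""
    prob = 1.0
    for word in query:
        prob *= lam * m_para.get(word, 0.0) + (1 - lam) * m_doc.get(word, 0.0)
    return prob


def select_paragraph(paragraphs, query):
    """Index of the paragraph maximising similarity x query probability."""
    para_tokens = [p.lower().split() for p in paragraphs]
    m_doc = model([t for toks in para_tokens for t in toks])
    q = [w.lower() for w in query]

    def score(toks):
        m_para = model(toks)
        return js_similarity(m_para, m_doc) * query_prob(q, m_para, m_doc)

    return max(range(len(paragraphs)), key=lambda i: score(para_tokens[i]))
```

Multiplying the two scores is only one way to read the “AND”; ranking by each criterion separately and combining the ranks would be an equally defensible choice.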

Soft hyperlinks: Using LM
• Given selected text within document, generate soft links to other (relevant) documents
• Assume text model of web (say) Mweb
• Compare Mweb and Mselect to choose the set of terms that contribute MOST to the divergence
• Use chosen terms to query the Web, and generate soft links
• Note: Can mix Mselect and Mdoc to obtain better model of selected text!
[Figure: Document (Mdoc); Selected Text (Mselect); Soft-linked Documents]
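The term-selection step can be sketched by ranking terms on their individual contribution to a divergence between the two models. The slide does not name the divergence, so Kullback-Leibler divergence of Mselect from Mweb is an assumption here, as is the small floor probability for words absent from the web model. `top_divergence_terms`, `k`, and `unseen_prob` are hypothetical names.

```python
import math
from collections import Counter


def top_divergence_terms(selected_tokens, web_model, k=3, unseen_prob=1e-6):
    """Return the k terms contributing most to KL(Mselect || Mweb).

    Each term contributes p_select * log(p_select / p_web).
    Assumption: words missing from web_model get probability unseen_prob.
    """
    counts = Counter(selected_tokens)
    total = len(selected_tokens)
    contribution = {}
    for word, count in counts.items():
        p_select = count / total
        p_web = web_model.get(word, unseen_prob)
        contribution[word] = p_select * math.log(p_select / p_web)
    return sorted(contribution, key=contribution.get, reverse=True)[:k]


# Toy background probabilities, invented purely for illustration.
web_model = {"the": 0.05, "of": 0.03, "retrieval": 0.0001, "language": 0.0005}
selected = "the language of retrieval the retrieval".split()
terms = top_divergence_terms(selected, web_model, k=2)
```

Common words such as “the” score low because the web model already predicts them well, so the chosen query terms are those that characterise the selected text.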

Reading to write: Using LM (exercise for reader)
• As you are writing a document, a tool suggests parts of other documents that may be relevant, cf. Remembrance Agent

Reading in Context
• Reading documents is generally done in the context of a larger task, and the pattern of reading activities will depend on the task.
• Task: writing a research proposal for EU Framework 6:
  • Reading FP6 Programme Call (and many related documents): reading to extract and reference
  • Reading to reference documents supporting proposal
  • Reading to extract ancillary information, e.g. contact details from web pages (say)
• Can you think of any searching/reading environment that supports such a complex set of interactions?

Concluding Remarks
• Reading (long) documents to find information raises interesting challenges for the field of information retrieval
• A variety of reading activities should be supported, preferably within an information seeking (with reading) environment
• Language models enable us to model text chunks at various levels of granularity, and thus provide a principled foundation for “retrieval-with-reading” techniques and tools

Reading List
• Hearst, M. A.: TileBars: visualization of term distribution information in full text information access. In: Proceedings of CHI '95 (1995) 56-66.
• Whittaker, S., Hirschberg, J., Choi, J., Hindle, D., Pereira, F. and Singhal, A.: SCAN: Designing and evaluating user interfaces to support retrieval from speech archives. In: Proceedings of ACM SIGIR '99. ACM Press (1999) 26-33.
• Kaszkiel, M. and Zobel, J.: Passage Retrieval Revisited. In: Proceedings of the Twentieth International ACM-SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997. ACM Press (1997) 178-185.
• Kaszkiel, M.: Indexing and Retrieval of Passages in Full-Text Databases. PhD thesis, RMIT Computer Science Technical Report (RT-17), May 2000 (2000).

Reading List (cont)
• Kaszkiel, M., Zobel, J. and Sacks-Davis, R.: Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, Vol 17, No. 4 (1999) 406-439.
• Landauer, T., Egan, D., Remde, J., Lesk, M., Lochbaum, C. and Ketchum, D.: Enhancing the usability of text through computer delivery and formative evaluation: The SuperBook project. In: McKnight, C., Dillon, A. and Richardson, J. (eds): Hypertext: A Psychological Perspective. Ellis Horwood (1993) 71-136.
• Marchionini, G.: Information Seeking in Electronic Environments. Cambridge University Press, Cambridge (1995).
• Byrd, D.: A Scrollbar-based Visualization for Document Navigation. In: Proceedings of ACM Digital Libraries 99. ACM Press (1999).
• de Kretser, O. and Moffat, A.: Effective Document Presentation with a Locality-Based Similarity Heuristic. In: Proceedings of the Twenty Second International ACM-SIGIR Conference on Research and Development in Information Retrieval, Berkeley, August 1999. ACM Press (1999) 113-120.

Reading List (cont)
• Tombros, A. and Sanderson, M.: Advantages of Query Biased Summaries in Information Retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 2-10.
• Ponte, J. and Croft, W. B.: A language modeling approach to information retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 275-281.
• Song, F. and Croft, W. B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999) 279-280.
• Schilit, B. N., Golovchinsky, G. and Price, M. N.: Beyond paper: Supporting Active Reading with free-form digital ink annotations. In: Proceedings of CHI98. ACM Press (1998) 149-156.

Reading List (cont)
• Harper, D. J., Coulthard, S. and Sun, Y.: A Language Modelling Approach to Relevance Profiling for Document Browsing. In: Procs JCDL 2002, Oregon, USA (2002) 76-83.
• Harper, D. J., Koychev, I. and Sun, Y.: Query-Based Document Skimming: A User-Centred Evaluation. In: Procs 25th European Conference on IR Research, LNCS 2622, Springer (2003) 377-392.
• Rhodes, B. J. and Starner, T.: Remembrance Agent: A continuously running automated retrieval system. In: Proceedings of The First International Conference on The Practical Application Of Intelligent Agents and Multi Agent Technology (PAAM '96), (1996) 487-495.
