Thanks to Bill Arms, Marti Hearst

Documents

Last time
• Big O: growth of work
• Size of information: continues to grow
• IR is an old field; it goes back to the 1940s
• The search engine is the most popular information retrieval model
• New ones are still being built

Focus on documents
• A document is what we:
  • Crawl (harvest)
  • Index
  • Retrieve with a query
  • Evaluate
  • Rank
• IR is an iterative process

IR is an Iterative Process
[Diagram labels: Repositories, Goals, Workspace]
Assume the search engine has been built and is working.

[Diagram, built up over four slides. Query side: User's Information Need, text input, Parse, Query. Collection side: Collections, Pre-process, Index. The Query and the Index meet at Rank or Match. The final slide adds Evaluation and Query Reformulation.]
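The stages named in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular search engine: the function names (tokenize, build_index, search), the whitespace tokenizer, the toy collection, and the count-based ranking are all assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Parse step: lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()

def build_index(collection):
    """Pre-process step: turn a collection (doc_id -> text) into an inverted index."""
    index = defaultdict(dict)  # term -> {doc_id: number of occurrences}
    for doc_id, text in collection.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query_text):
    """Rank-or-match step: score each document by summed counts of query terms."""
    scores = defaultdict(int)
    for term in tokenize(query_text):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
collection = {
    "d1": "information retrieval is an old field",
    "d2": "search engines retrieve documents",
    "d3": "documents documents everywhere",
}
index = build_index(collection)

first_try = search(index, "documents")            # initial query
reformulated = search(index, "search documents")  # query reformulation
print(first_try)     # ['d3', 'd2']
print(reformulated)  # ['d2', 'd3']
```

The second call stands in for the reformulation loop: after evaluating the first result list, the user adds a term and the ranking changes.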

Definitions
Collections consist of documents
• Document
  • The basic unit which we will automatically index
  • Usually a body of text, which is a sequence of terms
  • Has to be digital
• Tokens or terms
  • Basic units of a document, usually consisting of text
  • A semantic word or phrase, numbers, dates, etc.
• Collection, repository, or corpus
  • A particular collection of documents
  • Sometimes called a database
• Query
  • A request for documents on a topic
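To make these definitions concrete, the sketch below treats a collection as a list of documents and breaks each document into tokens that may be words, numbers, or dates. The regular expression, the sample sentences, and the names TOKEN_RE, tokens, and vocabulary are assumptions chosen for illustration; real tokenizers handle many more cases.

```python
import re

# Assumed token pattern: ISO-style dates first, then numbers, then words.
TOKEN_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?|[A-Za-z]+")

def tokens(document):
    """Return the tokens (terms) of one document, in order."""
    return TOKEN_RE.findall(document)

# A collection (corpus, repository) is a set of documents.
collection = [
    "Indexed 12 documents on 2002-09-15.",
    "A query is a request for documents.",
]

# The vocabulary is the set of distinct lowercased terms in the collection.
vocabulary = {t.lower() for doc in collection for t in tokens(doc)}

print(tokens(collection[0]))
# ['Indexed', '12', 'documents', 'on', '2002-09-15']
```

Putting the date alternative first matters: regular-expression alternation tries options left to right, so the date is kept as one token instead of being split into three numbers.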

Document Collections
Many on the web
• From the text Search Engines: IR in Practice
• Document collections
• Collections
• Corpus collections at UW
• Some are searchable, but there is a cost to download

Collection vs. documents vs. terms
[Diagram labels: Collection, Document, Terms or tokens]

What is a Document?
• A document is a digital object with an operational definition
  • Indexable (usually digital)
  • Can be queried and retrieved
• Many types of documents
  • Text or part of text
  • Web page
  • Image
  • Audio
  • Video
  • Data
  • Email
  • Etc.

Text Documents
A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
• Mixed text, a combination of the above.
Examples?
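One possible illustration of the difference, assuming a small XML record as the fielded document: free text yields a single token sequence, while fielded text yields one token sequence per tagged section. The tag names (doc, title, author, body), the sample content, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def free_text_tokens(text):
    """Free text: one continuous sequence of tokens."""
    return text.lower().split()

def fielded_tokens(xml_text):
    """Fielded text: a separate token sequence for each tagged section."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").lower().split() for child in root}

free = "Text Documents a sequence of words and other symbols"

fielded = """<doc>
  <title>Text Documents</title>
  <author>Example Author</author>
  <body>a sequence of words and other symbols</body>
</doc>"""

flat = free_text_tokens(free)
fields = fielded_tokens(fielded)

print(flat[:2])         # ['text', 'documents']
print(fields["title"])  # ['text', 'documents']
print(sorted(fields))   # ['author', 'body', 'title']
```

With fields preserved, a query could be restricted to one section, such as the title only; in free text the same words are indistinguishable from the body. A mixed document would combine both, for example a record whose body field is itself free text.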

Why the focus on text?
• Language is the most powerful query model
• Language can be treated as text
• Text has many interesting properties
• Others?

What we covered
• Documents are the atoms of IR
• Index terms or tokens in documents
• Terms or tokens will be text
• Interested in collections of documents
  • Repository
  • Corpus
  • Document collection
