
What is a document?




Presentation Transcript


  1. What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? • quotation? “Managing senior programmers is like herding cats.” Dave Platt or… • paper/article? • video?

  2. Basic IR: Documents Assume: • free text from a quotation through a book (unstructured or semi-structured data) • English • available electronically (on-line repositories) • generally, too many documents to store locally in an index. • generally, infer semantics through low level units (e.g., terms) and metadata

  3. Logical View of Documents [Figure: document processing stages labeled structure; accents, spacing; stopwords; noun groups; stemming; and automatic or manual indexing, taking docs from full text to index terms.] (Figure taken from on-line course resources for Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto)

  4. Structure • Metadata is information on the organization of the data. • external to meaning: length, author, date… • subject matter: subject codes, keywords, taxonomic indicators • Organizational Conventions: • articles have a title, author list, abstract, sections, etc. • web pages have headings, title, keywords, etc.

  5. Markup Languages Markup is extra syntax that describes formatting, attributes, semantics, etc. Tags give directions and delimit the beginning and end of each marked-up region. Examples: TeX, Standard Generalized Markup Language (SGML), eXtensible Markup Language (XML), HyperText Markup Language (HTML).
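To make the role of tags concrete, here is a minimal sketch that separates HTML markup from the text it wraps, using Python's standard `html.parser` module. The names `TextExtractor` and `strip_tags` are hypothetical, chosen only for this illustration, and real pages would need more care (scripts, styles, entities in attributes).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_tags("<title>Herding cats</title><p>A <b>metaphor</b></p>"))
```

Whether the stripped tags are thrown away or kept as structural hints (for example, weighting title text more heavily) is a design decision left open here.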

  6. Term Separators: Accents, Spacing, etc… • Lexical analysis divides text into distinct terms. • usually disregard punctuation, numbers, spaces • Decisions: • how to treat case and hyphens? • disregard comments? • whether and how to use formatting directives?
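The lexical analysis decisions above can be sketched as a small tokenizer. This is an illustration under stated assumptions, not a standard implementation: the function name `tokenize` and its flags are hypothetical, terms are restricted to ASCII letters after accent stripping, and numbers and punctuation are simply dropped.

```python
import re
import unicodedata


def tokenize(text, fold_case=True, split_hyphens=True, strip_accents=True):
    """Split text into terms, dropping punctuation, numbers, and spaces."""
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if fold_case:
        text = text.lower()
    # Either break at hyphens or keep hyphenated words as a single term.
    pattern = r"[A-Za-z]+" if split_hyphens else r"[A-Za-z]+(?:-[A-Za-z]+)*"
    return re.findall(pattern, text)


if __name__ == "__main__":
    print(tokenize("Web-based café, 2002!"))
    print(tokenize("Web-based café, 2002!", split_hyphens=False))
```

Each flag corresponds to one of the decisions on the slide, so the effect of a choice on the resulting term set can be compared directly.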

  7. Information in Terms Information entropy quantifies information content: $H = -\sum_{i \in s} p_i \log_2 p_i$, where s is the set of terms and $p_i$ is the relative frequency of term i.
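As a worked illustration, the sketch below computes entropy over a list of terms, assuming the standard Shannon form (the negated sum of p times log base 2 of p over all distinct terms) with p taken as a term's share of all term occurrences. The function name `term_entropy` is hypothetical.

```python
import math
from collections import Counter


def term_entropy(terms):
    """Shannon entropy, in bits, of the term distribution in `terms`."""
    counts = Counter(terms)
    total = sum(counts.values())
    # p = count / total is the relative frequency of each distinct term.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Two equally likely terms carry one bit per term.
    print(term_entropy(["theory", "practice", "theory", "practice"]))
```

A text that repeats a single term has entropy 0, and entropy grows as occurrences spread more evenly over more distinct terms.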

  8. Term Distribution Zipf’s Law approximates the distribution of term frequencies in a text. The frequency of the i-th most frequent term is $1/i^{Q}$ times that of the most frequent term, where 1.5 < Q < 2.0. [Plot: frequency (Freq) versus terms.]
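A short numeric sketch of Zipf's Law follows, assuming the usual form in which the term at rank i occurs 1/i^Q times as often as the top-ranked term. The function name `zipf_frequency` and the example count of 1000 are hypothetical.

```python
def zipf_frequency(rank, top_frequency, q=1.5):
    """Predicted frequency of the term at `rank` (1 = most frequent)."""
    return top_frequency / rank ** q


if __name__ == "__main__":
    # If the most frequent term occurs 1000 times, predict the next few ranks.
    for rank in (1, 2, 3, 4):
        print(rank, round(zipf_frequency(rank, 1000), 1))
```

The steep drop-off is what makes a handful of very frequent terms account for a large share of a text.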

  9. Stop Words • Stop words are words that either • appear so frequently that they do not distinguish documents (e.g., “www”) or • have a more syntactic than semantic role (e.g., “the”). Advantage: Filtering out stop words shortens the document description and focuses attention on terms that convey more information. Disadvantage: May reduce recall.
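Stop word filtering can be sketched as a set lookup. The `STOP_WORDS` list below is a tiny hypothetical sample chosen for this illustration; practical systems use much longer, collection-specific lists.

```python
# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = frozenset({
    "a", "and", "as", "be", "into", "is", "of", "put",
    "such", "the", "through", "to", "will",
})


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not in the stop list (case-insensitive)."""
    return [t for t in terms if t.lower() not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("The theory will be put into practice".split()))
```

The recall risk is visible here: a query made up entirely of stop words would match nothing once those words are removed from the index.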

  10. Vocabulary Size Heaps’ Law models the size of the vocabulary as $V = K n^{b}$, a function of: • the size of the text (n), • a baseline (10 < K < 100), • a growth factor (0 < b < 1). [Plot: vocabulary size (Voc) versus text size.]
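The sketch below evaluates Heaps' Law numerically, assuming the usual form V = K * n^b. The function name `heaps_vocabulary_size` and the default values K = 50 and b = 0.5 are hypothetical picks from within the ranges given above, not fitted parameters.

```python
def heaps_vocabulary_size(n, k=50.0, b=0.5):
    """Estimated number of distinct terms in a text of n term occurrences."""
    return k * n ** b


if __name__ == "__main__":
    # Vocabulary grows sublinearly: 100x more text, only 10x more vocabulary.
    for n in (10_000, 1_000_000):
        print(n, heaps_vocabulary_size(n))
```

Because b is below 1, the vocabulary keeps growing with the text but ever more slowly, which is useful when estimating how large an index will get.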

  11. Noun Groups • Further focus the term set by filtering for particular subsets, selected manually (e.g., classifications or index terms). • Discard terms that are not nouns*. • Fix spelling errors. • Use a thesaurus to combine similar words. *According to the Google web site, the top 20 gaining queries of 2002 contain only nouns.

  12. Stemming • Grammars permit minor modifications of terms that change their type rather than their meaning, e.g., plurals, gerunds, some prefixes and suffixes… • Stemming reduces a term to just its core (stem). Advantages: reduces the set of terms, combines terms with the same meaning. Disadvantage: may reduce precision by incorrectly combining meanings (e.g., “skies” and “ski”).
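To show both the benefit and the pitfall, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm or any other published stemmer; the suffix list, the minimum stem length of 3, and the name `naive_stem` are assumptions made for this illustration.

```python
# Hypothetical suffix list, tried in order; the first usable match wins.
SUFFIXES = ("ing", "ies", "es", "s")


def naive_stem(term):
    """Strip one common suffix, keeping a stem of at least 3 letters."""
    term = term.lower()
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


if __name__ == "__main__":
    for word in ("topics", "skies", "ski"):
        print(word, "->", naive_stem(word))
```

Plurals such as "topics" collapse onto their singular as intended, but "skies" and "ski" also end up with the same stem, which is exactly the kind of incorrect merge described above.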

  13. Putting it together: Document The purpose of the course is to teach theory and practice underlying the construction of Web based information systems. As such, the course will devote equal time to information retrieval and software engineering topics. The theory will be put into practice through a semester long team programming project. 48 words, 307 characters

  14. Putting it together: Stop Word Removal purpose course teach theory practice underlying construction Web based information course devote equal time information retrieval software engineering topics theory practice semester long team programming project 26 words, 213 chars

  15. Putting it together: Only Nouns purpose course theory practice construction Web information course equal time information retrieval software engineering topics theory practice semester team programming project 21 words, 179 chars

  16. Putting it together: Stemming & Alphabetizing construct course course engineer equal informat informat practice practice program project purpose retrieve semester software team theory theory time topic web 21 words, 161 chars

  17. Indexing • Terms remaining after document processing must be stored to facilitate retrieval. • Typically, they are stored in an inverted index. More on that later…
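As a preview of the idea, the sketch below builds a minimal in-memory inverted index: a mapping from each term to the documents that contain it. The function name `build_inverted_index` and the integer document ids are hypothetical, and real indexes typically also store positions and frequencies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it.

    `docs` maps a document id to that document's list of processed terms.
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    docs = {1: ["course", "theory", "web"], 2: ["theory", "team"]}
    print(build_inverted_index(docs)["theory"])
```

Looking up a query term is then a single dictionary access rather than a scan over every document.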
