Zoekmachines. Gertjan van Noord 2014. Lecture 2: vocabulary, posting lists.

Presentation Transcript


  1. Zoekmachines (Search Engines), Gertjan van Noord, 2014. Lecture 2: vocabulary, posting lists

  2. Agenda for today • Questions Chapter 1 • Chapter 2: Term vocabulary & posting lists • Chapter 2: Posting lists with positions • Homework/lab assignment

  3. Chapter 2 Overview: Preprocessing of documents • choose the unit of indexing (granularity) • tokenization (removing punctuation, splitting into words) • stop list? • normalization: case folding, stemming versus lemmatizing, ... • extensions to postings lists

  4. Tokens, types and terms
      • token: each separate word in the text
      • type: same words belong to one type
      • (index) term: finally included in the index
      • an index term is an equivalence class of tokens and/or types

  5. Tokens, types and terms. Example: The Lord of the Rings • Number of tokens? 5 • Number of types? 4 • Number of terms? 4? 2? 1?
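
The counts on this slide can be reproduced with a short sketch. It assumes, as the slide's answer of 4 types implies, that "The" and "the" count as one type, and it uses a hypothetical two-word stop list to arrive at the answer 2 for terms. Neither choice comes from the lecture itself.

```python
import re

# Hypothetical stop list, chosen only to illustrate the "2 terms" answer.
STOP_WORDS = {"the", "of"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

title = "The Lord of the Rings"
tokens = tokenize(title)                           # every word occurrence
types = {t.lower() for t in tokens}                # distinct words, case folded
terms = {t for t in types if t not in STOP_WORDS}  # what ends up in the index

print(len(tokens), len(types), len(terms))  # prints: 5 4 2
```

Without the stop list the number of terms stays at 4. One possible reading of the slide's "1?" is that the whole title is indexed as a single phrase term.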

  6. Equivalence classes • Casefolding • Diacritics • Stemming/lemmatisation • Decompounding • Synonym lists • Variant spellings

  7. Equivalence classes • Implicit: mapping rules • Relational: query expansion • Relational: double indexing • The mapping should be done both when indexing and when querying
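
A minimal sketch of implicit equivalence classing by mapping rules, covering only case folding and diacritics. The function names `normalize`, `build_index` and `search` and the sample documents are illustrative assumptions, not part of the lecture. The point is that the same mapping runs at indexing time and at query time.

```python
import unicodedata
from collections import defaultdict

def normalize(token):
    """Map a token to its equivalence class: fold case, strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(docs):
    """docs: {doc_id: text}. Returns {normalized term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query_word):
    # The query goes through the same mapping as the indexed text.
    return index.get(normalize(query_word), set())

docs = {1: "Résumé tips", 2: "resume writing", 3: "RESUME template"}
index = build_index(docs)
print(search(index, "Resumé"))  # all three documents match
```

If the query were not normalized, "Resumé" would find nothing, because only the folded form "resume" exists in the index.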

  8. Words and word forms
      • Inflection (D: verbuiging/vervoeging): changing a word to express person, case, aspect, ...
        • for determiners, nouns, pronouns, adjectives: declension (D: verbuiging)
        • for verbs: conjugation (D: vervoeging)
      • Derivation (D: afleiding): formation of a new word from another word, e.g. by adding an affix (prefix or suffix) or changing the grammatical category

  9. Inflection examples (E: English, D: Dutch, G: German)
      • Determiners. E: the; D: de, het; G: der, des, dem, den, die, das
      • Adjectives. E: young; D: jong, jonge; G: junger, junge, junges, jungen
      • Nouns. E: man, men; D: man, mannen; G: Mann, Mannes, ...
      • Verbs. E: write / writes / wrote / written; D: schrijf / schrijft / schrijven / schreef / schreven / geschreven; G: schreibe / schreibst / schreibt / schreiben / schrieben / geschrieben

  10. Derivation examples
      • to browse -> a browser
      • red -> to redden, reddish
      • Google -> to google
      • arm(s) -> to arm, to disarm -> disarmament, disarming

  11. Stemming and lemmatizing
      • Example 1
        verb forms: inform, informs, informed, informing
        derivations: information, informative, informal??
        stem: inform
        lemma: inform, information, informative, informal
      • Example 2
        verb forms: sing, sings, sang, sung, singing
        derivations: singer, singers, song, songs
        stem: sing, sang, sung, song
        lemma: sing, singer, song

  12. Discussion: Why is stemming used when lemmatizing is much more precise? Lemmatizing is a more complex process; it needs: • a vocabulary (problem: new words) • morphological analysis (knowledge of inflection rules) • syntactic analysis, parsing (noun or verb?)
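
To make the contrast concrete, here is a deliberately crude suffix-stripping stemmer. It is a toy written for this example (the suffix list and the minimum stem length of 3 are assumptions), not the Porter stemmer or any stemmer used in the course. It needs no vocabulary and no parsing, which is why stemming is cheap, but it cannot relate irregular forms.

```python
# Toy suffix list, checked in this order; an assumption for illustration.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["inform", "informs", "informed", "informing"]:
    print(w, "->", crude_stem(w))   # every form maps to "inform"

for w in ["sing", "sings", "singing", "sang", "sung"]:
    print(w, "->", crude_stem(w))   # "sang" and "sung" stay unchanged
```

A lemmatizer would map "sang" and "sung" to "sing", but only with a vocabulary and morphological knowledge that this rule-based approach does without.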

  13. Compound splitting: Marketingjargon -> marketing AND jargon • Increased recall (more documents retrieved) • Decreased precision • Must be applied to both query and index! • But what to do with the query marketing jargon? • And with spreekwoord appel boom (Dutch: "proverb apple tree")?
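
A minimal sketch of dictionary-based compound splitting, assuming a small hypothetical vocabulary and compounds of exactly two parts. Real decompounders handle linking elements and longer compounds; this only illustrates why the same split has to be applied to the index and to the query.

```python
# Hypothetical vocabulary of known simple words.
VOCABULARY = {"marketing", "jargon", "appel", "boom"}

def split_compound(word, vocabulary=VOCABULARY):
    """Return [head, tail] if the word is two known words glued together,
    otherwise return the (lowercased) word on its own."""
    word = word.lower()
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [word]

print(split_compound("Marketingjargon"))  # ['marketing', 'jargon']
print(split_compound("appelboom"))        # ['appel', 'boom']
print(split_compound("jargon"))           # ['jargon']
```

After splitting, a document containing "Marketingjargon" also matches the query "marketing jargon", but so does any document that merely mentions both words separately, which is the precision cost noted on the slide.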

  14. Chapter 2 Overview: Preprocessing of documents • choose the unit of indexing (granularity) • tokenization (removing punctuation, splitting into words) • stop list? • normalization: case folding, stemming versus lemmatizing, ... • extensions to postings lists

  15. Efficient merging of postings • For X AND Y, we have to intersect 2 lists • Most documents will contain only one of the two terms

  16. Recall basic intersection algorithm
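
The slide's figure is not part of this transcript. As a reminder, the standard linear merge of two sorted postings lists can be sketched as follows; the function name and the sample lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists of doc IDs sorted in ascending order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```

Each step advances at least one pointer, so the merge takes at most len(p1) + len(p2) comparisons.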

  17. Skip pointers

  18. Skip pointers • Make the intersection of 2 lists more efficient • think of millions of list items • How many skip pointers and where? • Trade-off: • More pointers: skips can be taken often, but each skip is small. • Fewer pointers … • Heuristic: distance √n, evenly distributed
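
A sketch of intersection with skip pointers placed at the √n heuristic from the slide. Storing the skips in a separate dictionary (`skip_targets`) is an assumption made for brevity; a real index would store them inside the postings list. Doc IDs are assumed to be unique and sorted.

```python
import math

def skip_targets(n):
    """Return {index: target index} for skips spaced about sqrt(n) apart."""
    step = math.isqrt(n)
    if step < 2:
        return {}
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when safe."""
    skips1, skips2 = skip_targets(len(p1)), skip_targets(len(p2))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Skip only if the target does not overshoot the other list's doc ID.
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return answer

p1 = list(range(0, 100, 3))
p2 = list(range(0, 100, 5))
print(intersect_with_skips(p1, p2))
```

A skip is taken only when the doc ID at its target is still less than or equal to the current doc ID in the other list, so no common document can be jumped over.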

  19. Skip pointers: useful? • Yes, certainly in the past • With very fast CPUs they are less important • Most useful in a rather static index • If a list keeps changing, they are less effective

  20. Extensions of the simple term index
      To support phrase queries:
      • “information retrieval”
      • “retrieval of information”
      Different approaches:
      • biword indexes
      • phrase indexes
      • positional indexes
      • combinations

  21. Biword and phrase indexes
      • Holding terms together in the index
      • Simple biword index: every pair of adjacent words is a term, e.g. retrieval of, of information
      • Sophisticated: a POS tagger selects nouns; word sequences of the form N X* N become terms, e.g. retrieval of this information
      • Phrase index: includes word sequences of variable length
      • Terms of 1 and 2 words are both included
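
A minimal sketch of the simple biword index, using whitespace tokenization and three made-up documents. A phrase query is answered by intersecting the postings of its consecutive word pairs.

```python
from collections import defaultdict

def biwords(tokens):
    """Return every pair of adjacent tokens as one string term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """docs: {doc_id: text}. Returns {biword: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Docs containing all biwords of the phrase (may include false positives)."""
    parts = biwords(phrase.lower().split())
    if not parts:
        return set()
    result = set(index.get(parts[0], set()))
    for bw in parts[1:]:
        result &= index.get(bw, set())
    return result

docs = {
    1: "retrieval of information",
    2: "information retrieval of documents",
    3: "retrieval of this information",
}
index = build_biword_index(docs)
print(phrase_query(index, "retrieval of information"))  # {1}
```

For phrases longer than two words the result can contain false positives, because the biwords may occur in different places of the same document.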

  22. Positional index
      Add to the postings list, for each doc, the list of positions of the term
      • for phrase queries
      • for proximity search
      Example:
      [information, 4] : [1:<4,22,35>, 2:<5,17,30>, …]
      [retrieval, 2] : [1:<5,20>, 2:<18,31>]
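
A sketch of phrase and proximity matching on the slide's example postings. Only the two documents shown on the slide are included; the elided postings are left out. The nested-dictionary layout and the function names are assumptions for illustration.

```python
# Postings copied from the slide: term -> {doc_id: [positions]}.
positional_index = {
    "information": {1: [4, 22, 35], 2: [5, 17, 30]},
    "retrieval": {1: [5, 20], 2: [18, 31]},
}

def phrase_matches(index, first, second):
    """Return {doc_id: [positions of `first` directly followed by `second`]}."""
    hits = {}
    p1, p2 = index.get(first, {}), index.get(second, {})
    for doc_id in sorted(p1.keys() & p2.keys()):
        following = set(p2[doc_id])
        starts = [pos for pos in p1[doc_id] if pos + 1 in following]
        if starts:
            hits[doc_id] = starts
    return hits

def near(index, a, b, window):
    """Return docs where `a` and `b` occur within `window` positions."""
    pa, pb = index.get(a, {}), index.get(b, {})
    return {
        doc_id
        for doc_id in pa.keys() & pb.keys()
        if any(abs(x - y) <= window for x in pa[doc_id] for y in pb[doc_id])
    }

print(phrase_matches(positional_index, "information", "retrieval"))
# {1: [4], 2: [17, 30]}
print(near(positional_index, "retrieval", "information", 2))
# {1, 2}
```

The phrase “information retrieval” is found at position 4 in doc 1 and at positions 17 and 30 in doc 2, because those are the places where "retrieval" occurs exactly one position later.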

  23. Combination schemes
      • Often queried combinations: phrase index
        • names of persons and organizations
        • esp. combinations of common terms (!)
        • find out from the query log
      • For other phrases: a positional index
      • Williams et al.: next word index added

  24. H.E. Williams, J. Zobel, and D. Bahle (2004), Fast Phrase Querying With Combined Indexes (ACM Digital Library): Phrase querying with a combination of three approaches (next word index, phrase index and inverted file) ... is more than 60% faster on average than using an inverted index alone ... requires structures that total only 20% of the size of the collection. We conclude that our approaches make stopping unnecessary and allow fast query evaluation for all phrase queries.

  25. A nextword index (Williams et al.)
      Figure labels: doc ID, number of matching docs, number of occurrences in doc, position
      Posting format: docfreq,(<doc,freq,[pos, pos,..]>,<doc, freq, [..]
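
A rough in-memory sketch of the nextword idea, following the posting layout on the slide: for each first word, each word that follows it gets its own positional postings. The nested dictionaries, function names and sample documents are assumptions; the on-disk structures and compression described by Williams et al. are not modeled.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """docs: {doc_id: text}.
    Returns {first word: {next word: {doc_id: [positions of first word]}}}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, (first, nxt) in enumerate(zip(tokens, tokens[1:])):
            index[first][nxt].setdefault(doc_id, []).append(pos)
    return index

def pair_postings(index, first, nxt):
    """Return (docfreq, [(doc, freq, [pos, ...]), ...]) for a word pair."""
    by_doc = index.get(first, {}).get(nxt, {})
    return len(by_doc), [(d, len(p), p) for d, p in sorted(by_doc.items())]

docs = {
    1: "the lord of the rings",
    2: "lord of the flies and the lord of the manor",
}
index = build_nextword_index(docs)
print(pair_postings(index, "lord", "of"))
# (2, [(1, 1, [1]), (2, 2, [0, 6])])
```

A two-word phrase is answered by a single lookup, and a pair of common words such as "of the" gets a much shorter list than either word has on its own.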
