
Information Retrieval (2)

Information Retrieval (2). Prof. Dragomir R. Radev radev@umich.edu. IR WINTER 2010. … 3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes . …. Document preprocessing.


Presentation Transcript


  1. Information Retrieval (2) Prof. Dragomir R. Radev radev@umich.edu

  2. IR WINTER 2010 … 3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. …

  3. Document preprocessing • Dealing with formatting and encoding issues • Hyphenation, accents, stemming, capitalization • Tokenization: • Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can’t • Example: “The New York-Los Angeles flight”
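
The tokenization examples on this slide can be explored with a small sketch. The two functions below are illustrative assumptions, not the tokenizer used in the course: one splits on every non-word character, the other keeps internal apostrophes and hyphens. Neither recovers "New York" or "Los Angeles" as single units, which is the point of the flight example.

```python
import re

def split_on_punctuation(text):
    # Hypothetical strategy 1: every run of word characters is a token.
    return re.findall(r"\w+", text)

def keep_internal_marks(text):
    # Hypothetical strategy 2: allow apostrophes or hyphens inside a token,
    # so "can't" and "555-1212" stay whole.
    return re.findall(r"\w+(?:['’-]\w+)*", text)

phrase = "The New York-Los Angeles flight"
print(split_on_punctuation(phrase))   # "York" and "Los" become separate tokens
print(keep_internal_marks(phrase))    # "York-Los" stays glued together
```

Splitting on punctuation breaks "can't" into two pieces, while keeping internal marks fuses "York-Los"; a real tokenizer needs extra knowledge (for example a list of multiword names) to do better.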

  4. Non-English languages • Arabic: كتاب • Japanese: この本は重い。 • German: Lebensversicherungsgesellschaftsangestellter

  5. Document preprocessing • Normalization: • Casing (cat vs. CAT) • Stemming (computer, computation) • Soundex • Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi • Index reduction • Dropping stop words (“and”, “of”, “to”) • Problematic for “to be or not to be”
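
A minimal sketch of case folding plus stop-word removal shows why the "to be or not to be" example is problematic. The stop list here is a tiny assumed one (the slide names only "and", "of", "to"); real lists are much longer.

```python
# Assumed, deliberately tiny stop list for illustration only.
STOP_WORDS = {"and", "of", "to", "the", "a", "in", "be", "or", "not"}

def index_terms(text):
    # Case folding: "cat" and "CAT" map to the same term.
    tokens = text.lower().split()
    # Index reduction: drop every token that is on the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

print(index_terms("The cat and the CAT"))   # both spellings survive as "cat"
print(index_terms("To be or not to be"))    # nothing is left to index
```

With this stop list the famous quotation produces no index terms at all, so a query for it could never match.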

  6. Porter’s algorithm • Example: the word “duplicatable” • duplicatable → duplicat (rule 4) • duplicat → duplicate (rule 1b1) • duplicate → duplic (rule 3) • Another rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step may be applied.

  7. Porter’s algorithm

  8. Links • http://maya.cs.depaul.edu/~classes/ds575/porter.html • http://www.tartarus.org/~martin/PorterStemmer/def.txt

  9. Approximate string matching • The Soundex algorithm (Odell and Russell) • Uses: • spelling correction • hash function • non-recoverable

  10. The Soundex algorithm 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions 2. Assign the following numbers to the remaining letters after the first: • b, f, p, v: 1 • c, g, j, k, q, s, x, z: 2 • d, t: 3 • l: 4 • m, n: 5 • r: 6

  11. The Soundex algorithm 3. If two or more letters with the same code were adjacent in the original name, omit all but the first 4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits • Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300; these are the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively • Some problems: Rogers and Rodgers, Sinclair and StClair receive different codes
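
The four Soundex steps can be sketched as follows. This is a literal reading of the rules as stated on the slides (adjacency is checked in the original name); other Soundex variants treat "h" and "w" differently, and the function name is an assumption.

```python
# Letter-to-digit table from step 2.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    # Walk over adjacent pairs of the ORIGINAL name (step 3).
    for prev, ch in zip(letters, letters[1:]):
        code = CODES.get(ch)
        if code is None:              # step 1: a, e, h, i, o, u, w, y are dropped
            continue
        if code == CODES.get(prev):   # step 3: same code as the adjacent letter
            continue
        digits.append(code)
    # Step 4: pad with zeros or truncate to the form LDDD.
    return (letters[0].upper() + "".join(digits) + "000")[:4]

for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd"]:
    print(name, soundex(name))
```

Running it on the slide's examples reproduces E460, G200, H416, K530 and L300, and it also shows the problem case: Rogers and Rodgers end up with different codes.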

  12. IR WINTER 2010 … 4. Word distributions. The Zipf distribution. The Benford distribution. Heaps’ law. TF*IDF. …

  13. Word distributions • Words are not distributed evenly! • Same goes for letters of the alphabet (ETAOIN SHRDLU), city sizes, wealth, etc. • Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people or it takes 80% of the effort to build the easier 20% of the system), more examples coming up…

  14. Shakespeare • Romeo and Juliet: • And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; • … • A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html(visited in Dec. 2006)

  15. The BNC (Adam Kilgarriff)
      Rank  Frequency  Word  Part of speech
      1     6187267    the   det
      2     4239632    be    v
      3     3093444    of    prep
      4     2687863    and   conj
      5     2186369    a     det
      6     1924315    in    prep
      7     1620850    to    infinitive-marker
      8     1375636    have  v
      9     1090186    it    pron
      10    1039323    to    prep
      11    887877     for   prep
      12    884599     i     pron
      13    760399     that  conj
      14    695498     you   pron
      15    681255     he    pron
      16    680739     on    prep
      17    675027     with  prep
      18    559596     do    v
      19    534162     at    prep
      20    517171     by    prep
      Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2), 1997, pp. 135-155.

  16. Stop words • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63
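
The type and token figures quoted for Moby Dick can be computed for any text with a few lines. This is a sketch under simple assumptions: tokens come from a plain whitespace split, and the function names are made up for illustration.

```python
from collections import Counter

def token_type_ratio(tokens):
    # Tokens are word occurrences; types are distinct words.
    return len(tokens) / len(set(tokens))

def top_k_coverage(tokens, k):
    # Fraction of all tokens accounted for by the k most frequent types.
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

tokens = "the cat and the dog and the bird".split()
print(token_type_ratio(tokens))    # 8 tokens over 5 types
print(top_k_coverage(tokens, 2))   # share of tokens that are "the" or "and"
```

For the slide's numbers the same arithmetic gives 2256 / 859 ≈ 2.63 tokens per type, and 1132 / 2256 ≈ 0.50 coverage for the top 65 types.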

  17. Zipf’s law • Rank × Frequency ≈ Constant

  18. Zipf's law is fairly general! • Frequency of accesses to web pages • in particular the access counts on the Wikipedia page, • with s approximately equal to 0.3 • page access counts on Polish Wikipedia (data for late July 2003) • approximately obey Zipf's law with s about 0.5 • Words in the English language • for instance, in Shakespeare’s play Hamlet with s approximately 0.5 • Sizes of settlements • Income distributions amongst individuals • Size of earthquakes • Notes in musical performances http://en.wikipedia.org/wiki/Zipf's_law http://www.nslij-genetics.org/wli/zipf/ http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml

  19. Zipf’s law (cont’d) • Limitations: • Low and high frequencies • Lack of convergence • Power law with coefficient c = -1 • $y = kx^{c}$ • Li (1992): text typed at random, one letter at a time and including spaces, also shows a Zipf-like word distribution
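
One way to check how closely a frequency list follows the power law $y = kx^{c}$ is to fit a straight line to log(frequency) against log(rank); the slope estimates c. The sketch below uses ordinary least squares in pure Python. The function name is an assumption, and the fit is only a rough diagnostic, given the limitations at low and high frequencies noted above.

```python
import math

def zipf_exponent(frequencies):
    # Sort so that rank 1 is the most frequent item.
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Least-squares slope of log(frequency) versus log(rank).
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Ideal Zipfian data: frequency proportional to 1 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
print(zipf_exponent(ideal))   # very close to -1
```

The function needs at least two strictly positive frequencies. On real word counts the estimated slope usually drifts away from -1 at both ends of the rank range.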

  20. Heaps’ law • Size of vocabulary: $V(n) = K n^{\beta}$ • In English, K is between 10 and 100, β is between 0.4 and 0.6. • [Figure: V(n) plotted against n] http://en.wikipedia.org/wiki/Heaps%27_law
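
A short sketch of the vocabulary-size formula, with V(n) = K · n^β, alongside a helper that measures the actual vocabulary growth of a token stream so the two can be compared. The parameter values in the example (K = 10, β = 0.5) are picked from the ranges quoted on the slide purely for illustration.

```python
def heaps_vocabulary(n, K, beta):
    # Predicted number of distinct words after n tokens.
    return K * n ** beta

def vocabulary_growth(tokens):
    # Observed number of distinct words after each token.
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

print(heaps_vocabulary(1_000_000, K=10, beta=0.5))   # about 10,000 types
print(vocabulary_growth("a b a c b d".split()))
```

Fitting K and β to the observed growth curve of a corpus is a separate step that is not shown here.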

  21. Heaps’ law (cont’d) • Related to Zipf’s law: generative models • Zipf’s and Heaps’ law coefficients change with language • Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.

  22. The Benford law • The first digit of a random number is d with probability $\log_{10}(1 + 1/d)$. • Leading ones are much more frequent than leading nines. • Useful in forensic accounting, political science: • Mebane, Walter R., Jr. 2006. Detecting Attempted Election Theft: Vote Counts, Voting Machines and Benford's Law
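
The first-digit probabilities follow directly from the formula log10(1 + 1/d). The sketch below tabulates them for d = 1 to 9; the function name is an assumption.

```python
import math

def benford_probability(d):
    # Probability that the leading digit equals d (d in 1..9).
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

The nine probabilities sum to 1, and a leading 1 (about 30%) is more than six times as likely as a leading 9 (under 5%).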

  23. IDF: Inverse document frequency • TF * IDF is used for automated indexing and for topic discrimination: • N: number of documents • $d_k$: number of documents containing term k • $f_{ik}$: absolute frequency of term k in document i • $w_{ik}$: weight of term k in document i • $idf_k = \log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$
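
A minimal sketch of the IDF definition, idf_k = log2(N / d_k) + 1. The weight w_ik = f_ik · idf_k used below is an assumption read off the name "TF * IDF"; the slide does not spell the weight formula out. Documents are assumed to be lists of already-tokenized terms.

```python
import math
from collections import Counter

def idf(n_docs, doc_freq):
    # idf_k = log2(N / d_k) + 1
    return math.log2(n_docs / doc_freq) + 1

def tfidf_weights(docs):
    n_docs = len(docs)
    # d_k: number of documents that contain term k at least once.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)   # f_ik: absolute frequency in document i
        # Assumed weighting: w_ik = f_ik * idf_k
        weights.append({t: f * idf(n_docs, doc_freq[t])
                        for t, f in term_freq.items()})
    return weights

docs = [["a", "b", "a"], ["a", "c"], ["a", "b"], ["a"]]
print(tfidf_weights(docs))
```

A term that occurs in every document gets idf 1, so the "+ 1" keeps it from vanishing entirely, while rarer terms are boosted.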

  24. Vector-based matching • The cosine measure: $sim(D, C) = \frac{\sum_k (d_k \cdot c_k \cdot idf(k))}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$
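
The cosine measure can be sketched as below, assuming the form sim(D, C) = Σ_k (d_k · c_k · idf(k)) / sqrt(Σ_k d_k² · Σ_k c_k²), with the usual square-root normalization in the denominator. Vectors are represented as dictionaries from term to weight; that representation and the function name are illustrative choices.

```python
import math

def sim(d, c, idf):
    # Numerator: only terms present in both vectors contribute.
    shared = set(d) & set(c)
    numerator = sum(d[k] * c[k] * idf[k] for k in shared)
    # Denominator: product of the two vector lengths.
    denominator = math.sqrt(sum(v * v for v in d.values()) *
                            sum(v * v for v in c.values()))
    return numerator / denominator if denominator else 0.0

doc = {"shuttle": 3, "space": 4}
centroid = {"shuttle": 3, "space": 4}
flat_idf = {"shuttle": 1.0, "space": 1.0}
print(sim(doc, centroid, flat_idf))   # identical vectors, idf of 1
```

Because idf appears only in the numerator, the score is not bounded by 1 when idf values exceed 1; it reduces to the plain cosine when every idf is 1.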

  25. Asian and European news
      • Asian news: 622.941 deng; 306.835 china; 196.725 beijing; 153.608 chinese; 152.113 xiaoping; 124.591 jiang; 108.777 communist; 102.894 body; 85.173 party; 71.898 died; 68.820 leader; 43.402 state; 38.166 people
      • European news: 97.487 nato; 92.151 albright; 74.652 belgrade; 46.657 enlargement; 34.778 alliance; 34.778 french; 33.803 opposition; 32.571 russia; 14.095 government; 9.389 told; 9.154 would; 8.459 their; 6.059 which

  26. Other topics
      • 120.385 shuttle; 99.487 space; 90.128 telescope; 70.224 hubble; 59.992 rocket; 50.160 astronauts; 49.722 discovery; 47.782 canaveral; 47.782 cape; 40.889 mission; 35.778 florida; 27.063 center
      • 74.652 compuserve; 65.321 massey; 55.989 salizzoni; 29.996 bob; 27.994 online; 27.198 executive; 15.890 interim; 15.271 chief; 11.647 service; 11.174 second; 6.781 world; 6.315 president

  27. Readings • 2: MRS9 • 3: MRS13, MRS14
