1 / 41

Information Retrieval

Information Retrieval. January 14, 2005. Handout #2. Course Information. Instructor: Dragomir R. Radev (radev@si.umich.edu) Office: 3080, West Hall Connector Phone: (734) 615-5225 Office hours: M 11-12 & Th 12-1 or via email Course page: http://tangra.si.umich.edu/~radev/650/

lorene
Télécharger la présentation

Information Retrieval

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Information Retrieval January 14, 2005 Handout #2

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M 11-12 & Th 12-1 or via email • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

  3. Evaluation

  4. Relevance • Difficult to change: fuzzy, inconsistent • Methods: exhaustive, sampling, pooling, search-based

  5. Contingency table retrieved not retrieved relevant w x n1 = w + x not relevant y z N n2 = w + y

  6. Precision and Recall w Recall: w+x w Precision: w+y

  7. Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by AltaVista. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.

  8. [From Salton’s book]

  9. Interpolated average precision (e.g., 11pt) Interpolation – what is precision at recall=0.5?

  10. Issues • Why not use accuracy A=(w+z)/N? • Average precision • Average P at given “document cutoff values” • Report when P=R • F measure: F=(b2+1)PR/(b2P+R) • F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R • When do F and F1 report the wrong results?

  11. Kappa • N: number of items (index i) • n: number of categories (index j) • k: number of annotators

  12. Kappa example (from Manning, Schuetze, Raghavan)

  13. Kappa (cont’d) • P(A) = 370/400 • P (-) = (10+20+20+70)/800 = 0.2125 • P (+) = (10+20+300+300)/800 = 0.7878 • P (E) = 0.2125 * 0.2125 + 0.7878 * 0.7878 = 0.665 • K = (0.925-0.665)/(1-0.665) = 0.776 • Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good

  14. Relevance collections • TREC ad hoc collections, 2-6 GB • TREC Web collections, 2-100GB

  15. Sample TREC query <top> <num> Number: 305 <title> Most Dangerous Vehicles <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. </top> LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172LA021790-0136LA092289-0167LA111189-0013LA120189-0179LA020490-0021LA122989-0063LA091389-0119LA072189-0048FT944-15615LA091589-0101LA021289-0208

  16. <DOCNO> LA031689-0177 </DOCNO> <DOCID> 31701 </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... </TEXT> <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC>

  17. TREC (cont’d) • http://trec.nist.gov/tracks.html • http://trec.nist.gov/presentations/presentations.html

  18. Word distribution models

  19. Shakespeare • Romeo and Juliet: • And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; • … • A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html

  20. The BNC (Adam Kilgarriff) • 1 6187267 the det • 2 4239632 be v • 3 3093444 of prep • 4 2687863 and conj • 5 2186369 a det • 6 1924315 in prep • 7 1620850 to infinitive-marker • 8 1375636 have v • 9 1090186 it pron • 10 1039323 to prep • 11 887877 for prep • 12 884599 i pron • 13 760399 that conj • 14 695498 you pron • 15 681255 he pron • 16 680739 on prep • 17 675027 with prep • 18 559596 do v • 19 534162 at prep • 20 517171 by prep Kilgarriff, A. Putting Frequencies in the Dictionary.International Journal of Lexicography10 (2) 1997. Pp 135--155

  21. Stop lists • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63

  22. Zipf’s law Rank x Frequency  Constant

  23. Zipf's law is fairly general! • Frequency of accesses to web pages • in particular the access counts on the Wikipedia page, • with s approximately equal to 0.3 • page access counts on Polish Wikipedia (data for late July 2003) • approximately obey Zipf's law with s about 0.5 • Words in the English language • for instance, in Shakespeare’s play Hamlet with s approximately 0.5 • Sizes of settlements • Income distributions amongst individuals • Size of earthquakes • Notes in musical performances http://en.wikipedia.org/wiki/Zipf's_law

  24. Zipf’s law (cont’d) • Limitations: • Low and high frequencies • Lack of convergence • Power law with coefficient c = -1 • Y=kxc • Li (1992) – typing words one letter at a time, including spaces

  25. Heap’s law • Size of vocabulary: V(n) = Knb • In English, K is between 10 and 100, β is between 0.4 and 0.6. V(n) http://en.wikipedia.org/wiki/Heaps%27_law n

  26. Heap’s law (cont’d) • Related to Zipf’s law: generative models • Zipf’s and Heap’s law coefficients change with language Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc.CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.

  27. Indexing

  28. Methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic

  29. LOC subject headings A -- GENERAL WORKSB -- PHILOSOPHY. PSYCHOLOGY. RELIGIONC -- AUXILIARY SCIENCES OF HISTORYD -- HISTORY (GENERAL) AND HISTORY OF EUROPEE -- HISTORY: AMERICAF -- HISTORY: AMERICAG -- GEOGRAPHY. ANTHROPOLOGY. RECREATIONH -- SOCIAL SCIENCESJ -- POLITICAL SCIENCEK -- LAWL -- EDUCATIONM -- MUSIC AND BOOKS ON MUSICN -- FINE ARTSP -- LANGUAGE AND LITERATUREQ -- SCIENCER -- MEDICINES -- AGRICULTURET -- TECHNOLOGYU -- MILITARY SCIENCEV -- NAVAL SCIENCEZ -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL) http://www.loc.gov/catdir/cpso/lcco/lcco.html

  30. Medicine CLASS R - MEDICINE Subclass R R5-920 Medicine (General) R5-130.5 General works R131-687 History of medicine. Medical expeditions R690-697 Medicine as a profession. Physicians R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R711-713.97 Directories R722-722.32 Missionary medicine. Medical missionaries R723-726 Medical philosophy. Medical ethics R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying R727-727.5 Medical personnel and the public. Physician and the public R728-733 Practice of medicine. Medical practice economics R735-854 Medical education. Medical schools. Research R855-855.5 Medical technology R856-857 Biomedical engineering. Electronics. Instrumentation R858-859.7 Computer applications to medicine. Medical informatics R864 Medical records R895-920 Medical physics. Medical radiology. Nuclear medicine

  31. Finding the most frequent terms in a document • Typically stop words: the, and, in • Not content-bearing • Terms vs. words • Luhn’s method

  32. Luhn’s method E FREQUENCY WORDS

  33. Computing term salience • Term frequency (IDF) • Document frequency (DF) • Inverse document frequency (IDF)

  34. Scripts to compute tf and idf cd /clair4/class/ir-w03/tf-idf ./tf.pl 053.txt | sort -nr +1 | more ./tfs.pl 053.txt | sort -nr +1 | more ./stem.pl reasonableness ./build-df.pl ./idf.pl | sort -n +2 | more

  35. Applications of TFIDF • Cosine similarity • Indexing • Clustering

  36. Variants of TF*IDF • E.g., Okapi (Robertson) • TF/(k+TF) • k is from 1 to 2

  37. Vector-based matching • The cosine measure S (dk . ck .idf(k)) sim (D,C) = k S S (dk)2 . (ck)2 k k

  38. IDF: Inverse document frequency TF * IDF is used for automated indexing and for topicdiscrimination: N: number of documentsdk: number of documents containing term kfik: absolute frequency of term k in document iwik: weight of term k in document i idfk = log2(N/dk) + 1 = log2N - log2dk + 1

  39. Asian and European news 622.941 deng 306.835 china 196.725 beijing 153.608 chinese 152.113 xiaoping 124.591 jiang 108.777 communist 102.894 body 85.173 party 71.898 died 68.820 leader 43.402 state 38.166 people 97.487 nato 92.151 albright 74.652 belgrade 46.657 enlargement 34.778 alliance 34.778 french 33.803 opposition 32.571 russia 14.095 government 9.389 told 9.154 would 8.459 their 6.059 which

  40. Other topics 120.385 shuttle 99.487 space 90.128 telescope 70.224 hubble 59.992 rocket 50.160 astronauts 49.722 discovery 47.782 canaveral 47.782 cape 40.889 mission 35.778 florida 27.063 center 74.652 compuserve 65.321 massey 55.989 salizzoni 29.996 bob 27.994 online 27.198 executive 15.890 interim 15.271 chief 11.647 service 11.174 second 6.781 world 6.315 president

  41. Software • KEA: http://www.nzdl.org/Kea/ • Example: • Paper: “Protocols for secure, atomic transaction execution in electronic commerce” • Author: anonymity, atomicity, auction, electronic commerce, privacy, real-time, security, transaction • Kea: atomicity, auction, customer, electronic commerce, intruder, merchant, protocol, security, third party, transaction

More Related