
Introduction to Information Retrieval and Web-based Searching Methods

Presentation Transcript


  1. Introduction to Information Retrieval and Web-based Searching Methods Mark Sanderson, University of Sheffield m.sanderson@shef.ac.uk, dis.shef.ac.uk/mark/ ©Mark Sanderson, Sheffield University

  2. Contents • Introduction • Ranked retrieval • Models • Evaluation • Advanced ranking • Future • Sources ©Mark Sanderson, Sheffield University

  3. Aims • To introduce you to basic notions in the field of Information Retrieval with a focus on Web based retrieval issues. • To squeeze it all into 4 hours, including coffee breaks • If it’s not covered in here, hopefully there will at least be a reference ©Mark Sanderson, Sheffield University

  4. Objectives • At the end of this you will be able to… • Demonstrate the workings of document ranking • Remove suffixes from words. • Explain how recall and precision are calculated. • Exploit Web specific information when searching. • Outline the means of automatically expanding users’ queries. • List IR publications. ©Mark Sanderson, Sheffield University

  5. Introduction • What is IR? • General definition • Retrieval of unstructured data • Most often it is • Retrieval of text documents • Searching newspaper articles • Searching on the Web • Other types • Image retrieval ©Mark Sanderson, Sheffield University

  6. Typical interaction • User has information need. • Expresses it as a query • in their natural language? • IR system finds documents relevant to the query. ©Mark Sanderson, Sheffield University

  7. Text • No computer understanding of document or query text • Use “bag of words” approach • Pay no heed to inter-word relations: • syntax, semantics • Bag does characterise document • Not perfect, words are • ambiguous • used in different forms or synonymously ©Mark Sanderson, Sheffield University

  8. To recap • [Diagram: stored documents and the user’s query are each processed; the retrieval part of the IR system matches the processed query against the document store and returns retrieved, relevant(?) documents] ©Mark Sanderson, Sheffield University

  9. Processing • “The destruction of the amazon rain forests” • Case normalisation • Stop word removal. • From a fixed list • “destruction amazon rain forests” • Suffix removal, also known as stemming. • “destruct amazon rain forest” • Documents processed as well ©Mark Sanderson, Sheffield University
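To make the processing pipeline above concrete, here is a minimal Python sketch; the stop list and the pass-through stemmer are illustrative stand-ins rather than the components of any particular system.

    STOP_WORDS = {"the", "of", "a", "an", "and", "or", "to", "in", "on"}  # tiny illustrative stop list

    def process(text, stem=lambda term: term):
        """Case-normalise, remove stop words, then stem each remaining term."""
        terms = text.lower().split()                       # case normalisation + crude tokenisation
        terms = [t for t in terms if t not in STOP_WORDS]  # stop word removal from a fixed list
        return [stem(t) for t in terms]

    # Without a stemmer this prints ['destruction', 'amazon', 'rain', 'forests'];
    # plugging in a suffix stripper such as Porter's gives roughly "destruct amazon rain forest".
    print(process("The destruction of the amazon rain forests"))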

  10. Different forms - stemming • Matching the query term “forests” • to “forest” and “forested” • Stemmers remove affixes • removal of suffixes - worker • prefixes? - megavolt • infixes? - un-bloody-likely • Stick with suffixes ©Mark Sanderson, Sheffield University

  11. Plural stemmer • Plurals in English • If word ends in “ies” but not “eies”, “aies” • “ies” -> “y” • if word ends in “es” but not “aes”, “ees”, “oes” • “es” -> “e” • if word ends in “s” but not “us” or “ss” • “s” -> “” • First applicable rule is the one used ©Mark Sanderson, Sheffield University

  12. Plural stemmer reference • Good review of stemming • Frakes, W. (1992): Stemming algorithms, in Frakes, W. & Baeza-Yates, B. (eds.), Information Retrieval: Data Structures & Algorithms: 131-160 ©Mark Sanderson, Sheffield University

  13. Plural stemmer • Examples • Forests - ? • Statistics - ? • Queries - ? • Foes - ? • Does - ? • Is - ? • Plus - ? • Plusses - ? ©Mark Sanderson, Sheffield University
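A minimal sketch (not Sanderson’s own code) of the three plural rules from slide 11, run over the examples above; the odd outputs for “does”, “is” and “plusses” are exactly what makes such simple stemmers interesting.

    def plural_stem(word):
        """Apply the first matching rule from slide 11."""
        w = word.lower()
        if w.endswith("ies") and not w.endswith(("eies", "aies")):
            return w[:-3] + "y"
        if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
            return w[:-2] + "e"
        if w.endswith("s") and not w.endswith(("us", "ss")):
            return w[:-1]
        return w

    for word in ["forests", "statistics", "queries", "foes", "does", "is", "plus", "plusses"]:
        print(word, "->", plural_stem(word))
    # -> forest, statistic, query, foe, doe, i, plus, plusse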

  14. Take more off? • What about • “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc, etc. • Porter, M.F. (1980): An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137 • Three pages of rules • What about • “bring”, “table”, “prism”, “bed”, “thing”? • When to strip, when to stop ©Mark Sanderson, Sheffield University

  15. CVCs • Porter used a pattern of letters • [C](VC)^m[V] • Tree - m=? • Trouble - m=? • Troubles - m=? • if m = 0 (or sometimes 1), stop • Syllables? • Pinker, S. (1994): The Language Instinct ©Mark Sanderson, Sheffield University
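A sketch of how Porter’s measure m can be computed from the [C](VC)^m[V] pattern, following Porter’s definition that “y” is a consonant except when it follows another consonant; on the slide’s examples it gives m = 0 for “tree”, 1 for “trouble” and 2 for “troubles”.

    def is_consonant(word, i):
        """A letter other than a, e, i, o, u, and other than y preceded by a consonant."""
        ch = word[i]
        if ch in "aeiou":
            return False
        if ch == "y":
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def measure(word):
        """Porter's m: the number of VC sequences when the word is viewed as [C](VC)^m[V]."""
        word = word.lower()
        pattern = "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))
        runs = "".join(ch for i, ch in enumerate(pattern) if i == 0 or ch != pattern[i - 1])
        return runs.count("vc")

    for w in ["tree", "trouble", "troubles"]:
        print(w, measure(w))   # 0, 1, 2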

  16. Problems • Porter doesn’t always return words • “query”, “queries”, “querying”, etc • -> “queri” • Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 191-202 • Xu, J., Croft, W.B. (1998): Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81 ©Mark Sanderson, Sheffield University

  17. Is it used? • Research says it is useful • Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84 • Web search engines hardly use it • Why? • Unexpected results • computer, computation, computing, computational, etc. • User expectation? • Foreign languages? ©Mark Sanderson, Sheffield University

  18. Ranked retrieval • Everything processed into a bag… • …calculate relevance score between query and every document • Sort documents by their score • Present top scoring documents to user. ©Mark Sanderson, Sheffield University

  19. The scoring • For each document • Term frequency (tf) • t: Number of times term occurs in document • dl: Length of document (number of terms) • Inverse document frequency (idf) • n: Number of documents term occurs in • N: Number of documents in collection ©Mark Sanderson, Sheffield University
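The equations on this slide did not survive the transcript. One common textbook formulation using the symbols above (a sketch, not necessarily the exact weights on the original slide) is

    \mathrm{tf} = \frac{t}{dl}, \qquad
    \mathrm{idf} = \log\frac{N}{n}, \qquad
    \mathrm{score}(q, d) = \sum_{term \in q \cap d} \mathrm{tf} \cdot \mathrm{idf}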

  20. TF • More often a term is used in a document • More likely document is about that term • Depends on document length? • Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, B. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392 • Watch out for mistake: not unique terms. • Problems with spamming ©Mark Sanderson, Sheffield University

  21. Spamming the tf weight • Searching for Jennifer Anniston? • Hidden spam text, with the following block of names repeated four times to inflate term frequencies: SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK ©Mark Sanderson, Sheffield University

  22. IDF • Some query terms better than others? • In general, fair to say that… • “amazon” > “forest” ≈ “destruction” > “rain” ©Mark Sanderson, Sheffield University

  23. To illustrate • [Venn diagram: the set of all documents, with the relevant documents as a subset] ©Mark Sanderson, Sheffield University

  24. To illustrate • [Venn diagram: within all documents, the set of documents containing “amazon”] ©Mark Sanderson, Sheffield University

  25. To illustrate • [Venn diagram: within all documents, the set of documents containing “rain”] ©Mark Sanderson, Sheffield University

  26. IDF and collection context • IDF sensitive to the document collection content • General newspapers • “amazon” > “forest” ≈ “destruction” > “rain” • Amazon book store press releases • “forest” ≈ “destruction” > “rain” > “amazon” ©Mark Sanderson, Sheffield University

  27. Very successful • Simple, but effective • Core of most weighting functions • tf (term frequency) • idf (inverse document frequency) • dl (document length) ©Mark Sanderson, Sheffield University

  28. Robertson’s BM25 • Q is a query containing terms T • w is a form of IDF • k1, b, k2, k3 are parameters. • tf is the document term frequency. • qtf is the query term frequency. • dl is the document length (arbitrary units). • avdl is the average document length. ©Mark Sanderson, Sheffield University
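The BM25 equation itself is missing from the transcript. As published in the Okapi TREC papers it is, approximately,

    \sum_{T \in Q} w^{(1)} \,
      \frac{(k_1 + 1)\,tf}{K + tf} \cdot
      \frac{(k_3 + 1)\,qtf}{k_3 + qtf}
      \;+\; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl},
    \qquad K = k_1\Bigl((1 - b) + b\,\frac{dl}{avdl}\Bigr)

where w^{(1)} is the IDF-style relevance weight of term T; the query-length correction involving k_2 is often dropped in later presentations.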

  29. Reference for BM25 • Popular weighting scheme • Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96 ©Mark Sanderson, Sheffield University

  30. Getting the balance • Documents with all the query terms? • Just those with high tf•idf terms? • What sorts of documents are these? • Search for a picture of Arbour Low • Stone circle near Sheffield • Try Google and AltaVista ©Mark Sanderson, Sheffield University

  31. [Result screenshots: one retrieved page is very short and contains only “arbour”; another is longer, with many occurrences of “arbour” but no “low”] ©Mark Sanderson, Sheffield University

  32. [Result screenshot for the query “arbour low”: Arbour Low documents do exist] ©Mark Sanderson, Sheffield University

  33. [Result screenshot: lots of Arbour Low documents] • Disambiguation? ©Mark Sanderson, Sheffield University

  34. Result • From Google • “The Stonehenge of the north” ©Mark Sanderson, Sheffield University

  35. Caveat • Search engines don’t say much • Hard to know how they work ©Mark Sanderson, Sheffield University

  36. Boolean searching? • Start with query • “amazon” & “rain forest*” & (“destroy” | “destruction”) • Break collection into two unordered sets • Documents that match the query • Documents that don’t • User has complete control but… • …not easy to use. ©Mark Sanderson, Sheffield University
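A toy sketch of Boolean retrieval over an inverted index using set algebra; the document ids are made up, and the phrase/wildcard “rain forest*” is crudely approximated here by ANDing the two terms.

    # A toy inverted index: term -> set of document ids (illustrative data only)
    index = {
        "amazon":      {1, 3, 5},
        "rain":        {1, 2, 3, 7},
        "forest":      {1, 3, 4, 7},
        "destruction": {3, 6},
        "destroy":     {1, 9},
    }

    # "amazon" & "rain forest*" & ("destroy" | "destruction") as set operations
    matches = (index["amazon"]
               & (index["rain"] & index["forest"])
               & (index["destroy"] | index["destruction"]))
    print(sorted(matches))   # an unordered set of matching documents: [1, 3] for this toy index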

  37. Boolean • Two forms of query/retrieval system • Ranked retrieval • Long championed by academics • Boolean • Rooted in commercial systems from 1970s • Koenig, M.E. (1992): How close we came, in Information Processing and Management, 28(3): 433-436 • Modern systems • Hybrid of both ©Mark Sanderson, Sheffield University

  38. Don’t need Boolean? • Ranking found to be better than Boolean • But lack of specificity in ranking • destruction AND (amazon OR south american) AND rain forest • destruction, amazon, south american, rain forest • Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17 ©Mark Sanderson, Sheffield University

  39. Models • Mathematically modelling the retrieval process • So as to better understand it • Draw on work of others • Vector space • Probabilistic ©Mark Sanderson, Sheffield University

  40. Vector Space • Document/query is a vector in N space • N = number of unique terms in collection • If term in doc/qry, set that element of its vector • Angle between vectors = similarity measure • Cosine of angle (cos(0) = 1) • Doesn’t model term dependence • [Diagram: document vector D and query vector Q separated by angle θ] ©Mark Sanderson, Sheffield University
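A minimal sketch of the cosine measure over sparse term-weight vectors, with Python dictionaries standing in for the N-dimensional vectors.

    import math

    def cosine(doc_vec, qry_vec):
        """Cosine of the angle between two sparse term-weight vectors."""
        dot = sum(w * qry_vec.get(t, 0.0) for t, w in doc_vec.items())
        norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
        norm_q = math.sqrt(sum(w * w for w in qry_vec.values()))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    print(cosine({"amazon": 1, "forest": 1}, {"amazon": 1, "forest": 1}))  # identical vectors: 1.0
    print(cosine({"amazon": 1}, {"rain": 1}))                              # no shared terms: 0.0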

  41. Model references • w_x,y - weight of vector element • Vector space • Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36 • Any of the Salton SMART books ©Mark Sanderson, Sheffield University

  42. Modelling dependence • Latent Semantic Indexing (LSI) • Reduce dimensionality of N space • Bring related terms together. • Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E. (1988): Information retrieval using a singular value decomposition model of latent semantic structure, in Proceeding of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 465-480 • Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing: 554-566 ©Mark Sanderson, Sheffield University
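A small sketch of the dimensionality reduction behind LSI using numpy’s singular value decomposition; the term-document matrix is invented for illustration, and a real system would compare queries and documents directly in the reduced k-dimensional space rather than reconstructing the matrix.

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents (illustrative data only)
    A = np.array([
        [1, 1, 0, 0],   # "car"
        [1, 0, 1, 0],   # "automobile"
        [0, 0, 0, 1],   # "forest"
    ], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                   # keep only the k largest singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(A_k, 2))                 # related terms ("car"/"automobile") are drawn together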

  43. Probabilistic • Assume independence ©Mark Sanderson, Sheffield University
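The formula on this slide was not captured. Under the term-independence assumption, the classic Robertson/Sparck Jones relevance weight (from the 1976 paper cited on the next slide) is

    w_t = \log \frac{p_t\,(1 - q_t)}{q_t\,(1 - p_t)}
        \;\approx\;
        \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}

where p_t and q_t are the probabilities of term t occurring in relevant and non-relevant documents, R is the number of known relevant documents and r the number of them containing t; with no relevance information it reduces to an IDF-like weight.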

  44. Model references • Probabilistic • Original papers • Robertson, S.E. & Sparck Jones, K. (1976): Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146. • Van Rijsbergen, C.J. (1979): Information Retrieval • Chapter 6 • Survey • Crestani, F., Lalmas, M., van Rijsbergen, C.J., Campbell, I. (1998): “Is This Document Relevant? ...Probably”: A Survey of Probabilistic Models in Information Retrieval, in ACM Computing Surveys, 30(4): 528-552 ©Mark Sanderson, Sheffield University

  45. Recent developments • Probabilistic language models • Ponte, J., Croft, W.B. (1998): A Language Modelling Approach to Information Retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 275-281 ©Mark Sanderson, Sheffield University
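A sketch of the query-likelihood idea: documents are ranked by the probability that a language model estimated from each document generates the query. The linear smoothing shown here is a common later formulation; Ponte & Croft’s original estimator differs in detail.

    P(Q \mid D) = \prod_{t \in Q} \Bigl( \lambda\,\frac{tf_{t,D}}{dl_D} + (1 - \lambda)\,P(t \mid \text{collection}) \Bigr)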

  46. Evaluation • Measure how well an IR system is doing • Effectiveness • Number of relevant documents retrieved • Also • Speed • Storage requirements • Usability ©Mark Sanderson, Sheffield University

  47. Effectiveness • Two main measures • Precision is easy • P at rank 10. • Recall is hard • Total number of relevant documents? ©Mark Sanderson, Sheffield University
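A minimal sketch of the two measures; the ranking and relevant set are placeholders. Precision is the fraction of retrieved documents that are relevant, recall the fraction of all relevant documents that are retrieved, which is why recall needs the complete relevant set.

    def precision_at_k(ranking, relevant, k=10):
        """Fraction of the top-k retrieved documents that are relevant (e.g. P at rank 10)."""
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    def recall(ranking, relevant):
        """Fraction of all relevant documents that were retrieved; needs the full relevant set."""
        return sum(1 for doc in ranking if doc in relevant) / len(relevant)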

  48. Test collections • Test collection • Set of documents (few thousand-few million) • Set of queries (50-400) • Set of relevance judgements • Humans check all documents! • Use pooling • Take top 100 from every submission • Remove duplicates • Manually assess these only. ©Mark Sanderson, Sheffield University
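A sketch of the pooling step described above; the runs are placeholders for the ranked lists submitted by participating systems.

    def pool(runs, depth=100):
        """Union of the top-`depth` documents from each submitted run, duplicates removed;
        only the pooled documents are judged by the human assessors."""
        pooled = set()
        for run in runs:                 # each run is a ranked list of document ids
            pooled.update(run[:depth])
        return pooled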

  49. Test collections • Small collections (~3 MB) • Cranfield, NPL, CACM - title (& abstract) • Medium (~4 GB) • TREC - full text • Large (~100 GB) • VLC track of TREC • Compare with reality (~10 TB) • CIA, GCHQ, large search services ©Mark Sanderson, Sheffield University

  50. Where to get them • Cranfield, NPL, CACM • www.dcs.gla.ac.uk/idom/ • TREC, VLC • trec.nist.gov ©Mark Sanderson, Sheffield University
