
Quality of a search engine



  1. Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8

  2. Is it good? • How fast does it index? • Number of documents/hour • (Average document size) • How fast does it search? • Latency as a function of index size • Expressiveness of the query language

  3. Measures for a search engine • All of the preceding criteria are measurable • The key measure: user happiness… useless answers won’t make a user happy

  4. Happiness: elusive to measure • Commonest approach is given by the relevance of search results • How do we measure it? • Requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A binary assessment of either Relevant or Irrelevant for each query-doc pair

  5. Evaluating an IR system • Standard benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years • Other doc collections: marked by human experts, for each query and for each doc, Relevant or Irrelevant • On the Web everything is more complicated since we cannot mark the entire corpus!

  6. General scenario [diagram: the Retrieved and Relevant sets shown as overlapping subsets of the collection]

  7. Precision vs. Recall • Precision: % docs retrieved that are relevant [issue: “junk” found] • Recall: % docs relevant that are retrieved [issue: “info” found] [diagram: Retrieved and Relevant sets within the collection, as on slide 6]

  8. How to compute them • Precision: fraction of retrieved docs that are relevant • Recall: fraction of relevant docs that are retrieved • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)
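
A minimal Python sketch of these two formulas (the toy document sets are illustrative, not from the slides):

    def precision_recall(retrieved, relevant):
        tp = len(retrieved & relevant)   # relevant docs that were retrieved
        fp = len(retrieved - relevant)   # retrieved but not relevant ("junk")
        fn = len(relevant - retrieved)   # relevant but missed
        p = tp / (tp + fp) if retrieved else 0.0
        r = tp / (tp + fn) if relevant else 0.0
        return p, r

    retrieved = {1, 2, 3, 4}
    relevant = {2, 4, 5, 6, 7}
    print(precision_recall(retrieved, relevant))   # (0.5, 0.4)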

  9. Some considerations • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • Precision usually decreases

  10. Precision-Recall curve • We measure Precision at various levels of Recall • Note: it is an AVERAGE over many queries [plot: precision (y-axis) vs. recall (x-axis), with sample points marked]

  11. A common picture [plot: typical precision-recall curve; precision (y-axis) falls as recall (x-axis) grows]

  12. F measure • Combined measure (weighted harmonic mean): 1/F = α·(1/P) + (1−α)·(1/R) • People usually use the balanced F1 measure • i.e., with α = ½, thus 1/F = ½·(1/P + 1/R), i.e. F1 = 2PR/(P+R) • Use this if you need to optimize a single measure that balances precision and recall.
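
A sketch of this formula in Python, writing the weight as alpha (the character lost in the transcript):

    def f_measure(p, r, alpha=0.5):
        # 1/F = alpha*(1/P) + (1-alpha)*(1/R); alpha = 1/2 gives balanced F1
        if p == 0 or r == 0:
            return 0.0
        return 1.0 / (alpha / p + (1 - alpha) / r)

    print(f_measure(0.5, 0.4))   # F1 = 2PR/(P+R) ≈ 0.444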

  13. Recommendation systems Paolo Ferragina Dipartimento di Informatica Università di Pisa

  14. Recommendations • We have a list of restaurants • with positive and negative ratings for some of them • Which restaurant(s) should I recommend to Dave?

  15. Basic Algorithm • Recommend the most popular restaurants • say # positive votes minus # negative votes • What if Dave does not like Spaghetti?
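
A minimal sketch of this basic algorithm (the restaurant names and votes are illustrative):

    ratings = {                        # restaurant -> list of +1 / -1 votes
        "Straits Cafe": [1, 1, 1, -1],
        "Spaghetti Hut": [1, -1, -1],
    }
    popularity = {r: sum(v) for r, v in ratings.items()}
    print(max(popularity, key=popularity.get))   # "Straits Cafe" (score 2 vs -1)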

  16. Smart Algorithm • Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e., Estie), and then recommend something this person likes. • Perhaps recommend Straits Cafe to Dave • But do you want to rely on one person’s opinions?
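
A sketch of this nearest-neighbour idea, assuming each user is a rating vector with +1 (like), -1 (dislike), 0 (unrated); the user/restaurant data is illustrative:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    users = {                       # rows: users, columns: restaurants
        "Dave":  [ 1, -1,  0,  0],
        "Estie": [ 1, -1,  1,  0],
        "Carl":  [-1,  1,  0,  1],
    }
    dave = users["Dave"]
    best = max((u for u in users if u != "Dave"),
               key=lambda u: cosine(dave, users[u]))
    # recommend what the most similar user liked and Dave hasn't rated
    recs = [i for i, (d, b) in enumerate(zip(dave, users[best])) if d == 0 and b == 1]
    print(best, recs)               # Estie is most similar; recommend restaurant 2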

  17. Main idea [diagram: bipartite graph linking users U, V, W, Y to documents d1-d7] What do we suggest to U?

  18. A glimpse at XML retrieval (eXtensible Markup Language) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 10

  19. XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML has fixed markup tags; XML does not • HTML can be formalized as an XML language (XHTML)

  20. XML Example (visual) [figure: rendered view of the XML fragment shown on the next slide]

  21. XML Example (textual)
  <chapter id="cmds">
    <chaptitle>FileCab</chaptitle>
    <para>This chapter describes the commands that manage the
    <tm>FileCab</tm>inet application.</para>
  </chapter>

  22. Basic Structure • An XML doc is an ordered, labeled tree • character data: leaf nodes contain the actual data (text strings) • element nodes: each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value, • can have child nodes
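
This tree view is easy to see in code. A sketch using Python's standard-library ElementTree to parse the fragment from slide 21:

    import xml.etree.ElementTree as ET

    doc = '''<chapter id="cmds">
      <chaptitle>FileCab</chaptitle>
      <para>This chapter describes the commands that manage the
      <tm>FileCab</tm>inet application.</para>
    </chapter>'''

    root = ET.fromstring(doc)
    print(root.tag, root.attrib)      # element name and attributes: chapter {'id': 'cmds'}
    for child in root:                # child element nodes, in document order
        print(child.tag, (child.text or "").strip())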

  23. XML: Design Goals • Separate syntax from semantics to provide a framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the standard of (semi)structured information (do some of the work now done by databases)

  24. Why Use XML? • Represent semi-structured data • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free

  25. Data vs. Text-centric XML • Data-centric XML: used for messaging between enterprise applications • Mainly a recasting of relational data • Text-centric XML: used for annotating content • Rich in text • Demands good integration of text retrieval functionality • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

  26. IR Challenges in XML • There is no document unit in XML • How do we compute tf and idf? • Indexing granularity • Need to go to document for retrieving or displaying a fragment • E.g., give me the Abstracts of Papers on existentialism • Need to identify similar elements in different schemas • Example: employee

  27. XQuery: SQL for XML? • Simple attribute/value • /play/title contains “hamlet” • Path queries • title contains “hamlet” • /play//title contains “hamlet” • Complex graphs • Employees with two managers • What about relevance ranking?
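
XQuery itself is beyond these slides, but the two path queries can be mimicked with the small XPath subset in Python's standard-library ElementTree (the play document below is illustrative):

    import xml.etree.ElementTree as ET

    play = ET.fromstring(
        "<play><title>Hamlet</title>"
        "<act><scene><title>Elsinore</title></scene></act></play>")

    print([t.text for t in play.findall("title")])     # /play/title, direct children: ['Hamlet']
    print([t.text for t in play.findall(".//title")])  # /play//title, any depth: ['Hamlet', 'Elsinore']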

  28. Data structures for XML retrieval • Inverted index: give me all elements matching text query Q • We know how to do this – treat each element as a document • Give me all elements below any instance of the Book element (Parent/child relationship is not enough)

  29. Positional containment [figure: postings in Doc 1; Play spans positions 27-1122 and 2033-5790, Verse spans 431-867, Term:droppeth occurs at 720, so droppeth is under Verse under Play] • Containment can be viewed as merging postings.
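
A sketch of that merge, using the extents recoverable from the figure (Verse: 431-867, droppeth at 720); it assumes both postings lists are sorted and element extents do not overlap:

    def contained(term_positions, element_extents):
        # report term positions that fall inside some element extent;
        # both lists are sorted, so a two-pointer merge suffices
        out, i = [], 0
        for start, end in element_extents:
            while i < len(term_positions) and term_positions[i] < start:
                i += 1
            while i < len(term_positions) and term_positions[i] <= end:
                out.append(term_positions[i])
                i += 1
        return out

    print(contained([720], [(431, 867)]))   # [720]: droppeth occurs under Verse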

  30. Summary of data structures • Path containment etc. can essentially be solved by positional inverted indexes • Retrieval consists of “merging” postings • All the compression tricks are still applicable • Complications arise from insertion/deletion of elements, text within elements • Beyond the scope of this course

  31. Search Engines Advertising

  32. Classic approach… • Socio-demo • Geographic • Contextual

  33. Search Engines vs Advertisement • First generation: use only on-page, web-text data • Word frequency and language • Second generation: use off-page, web-graph data • Link (or connectivity) analysis • Anchor text (how people refer to a page) • Third generation: answer “the need behind the query” • Focus on “user need”, rather than on the query • Integrate multiple data sources • Click-through data • Pure search vs paid search: ads shown on search results (who pays more), pioneered by Goto/Overture; in 2003 Google/Yahoo adopted the new model • All players now have: SE, Adv platform + network

  34. The new scenario • SEs make possible: • aggregation of interests • unlimited selection (Amazon, Netflix, ...) • Incentives for specialized niche players • The biggest money is in the smallest sales!

  35. Two new approaches • Sponsored search: Ads driven by search keywords (and user-profile issuing them) AdWords

  36. [figure: +$ / -$ money-flow diagram]

  37. Two new approaches • Sponsored search: Ads driven by search keywords (and user-profile issuing them) • Context match: Ads driven by the content of a web page (and user-profile reaching that page) AdWords AdSense

  38. Econ & IR: How does it work? • Match ads to the query or page content • Order the ads • Pricing on a click-through
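
A hedged sketch of these three steps under one common scheme (assumed here, not stated on the slide): rank matched ads by bid × estimated click-through rate, and charge per click just enough to keep the slot (a generalized second-price rule):

    ads = [                 # (advertiser, bid in $, estimated CTR)
        ("A", 2.00, 0.10),
        ("B", 1.50, 0.20),
        ("C", 1.00, 0.08),
    ]
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    for (name, bid, ctr), nxt in zip(ranked, ranked[1:]):
        price = nxt[1] * nxt[2] / ctr   # min bid that would keep this slot
        print(name, round(price, 2))    # B pays 1.0, A pays 0.8 per click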

  39. Web usage data! • Visited pages • Clicked banners • Web searches • Clicks on search results

  40. Dictionary problem

  41. A new game • Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, ranking depends on clicks • For advertisers: • What words to buy, how much to pay • SPAM is an economic activity • For search engine owners: • How to price the words • Find the right Ad • Keyword suggestion, geo-coding, business control, language restriction, proper Ad display
