Quality of a search engine. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 8
Is it good? • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of the query language
Measures for a search engine • All of the preceding criteria are measurable • The key measure: user happiness … useless answers won’t make a user happy
Happiness: elusive to measure • The most common approach is to measure the relevance of search results • How do we measure it? • Requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system • Standard benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years • Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant • On the Web everything is more complicated, since we cannot mark the entire corpus!!
General scenario [diagram: the Retrieved set and the Relevant set overlap inside the whole collection]
Precision vs. Recall • Precision: % docs retrieved that are relevant [issue: “junk” found] • Recall: % docs relevant that are retrieved [issue: “info” found]
How to compute them • Precision: fraction of retrieved docs that are relevant • Recall: fraction of relevant docs that are retrieved • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)
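To make the two definitions concrete, here is a minimal Python sketch (an illustration, not code from the lecture) that computes Precision and Recall from a set of retrieved doc IDs and a set of relevant doc IDs:

```python
# A minimal sketch: Precision and Recall from the retrieved and relevant sets.
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)   # relevant docs we retrieved
    fp = len(retrieved - relevant)   # junk we retrieved
    fn = len(relevant - retrieved)   # relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant (P = 0.75),
# and we found 3 of the 6 relevant docs (R = 0.5).
P, R = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(P, R)   # 0.75 0.5
```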
Some considerations • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • Precision usually decreases
Precision-Recall curve • We measure Precision at various levels of Recall • Note: it is an AVERAGE over many queries [plot: sampled (recall, precision) points]
A common picture [plot: precision decreases as recall increases]
F measure • Combined measure (weighted harmonic mean): 1/F = α (1/P) + (1 − α) (1/R) • People usually use the balanced F1 measure • i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), i.e. F1 = 2PR/(P + R) • Use this if you need to optimize a single measure that balances precision and recall.
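A minimal sketch of the same formula in Python (illustrative; the α parameter and the F1 special case follow the definition above):

```python
# Weighted harmonic mean F; balanced F1 is the special case alpha = 1/2.
def f_measure(P, R, alpha=0.5):
    # 1/F = alpha * (1/P) + (1 - alpha) * (1/R)
    if P == 0 or R == 0:
        return 0.0
    return 1.0 / (alpha / P + (1 - alpha) / R)

# Balanced F1 = 2PR / (P + R): with P = 0.75 and R = 0.5,
# F1 = 2 * 0.375 / 1.25 = 0.6
print(round(f_measure(0.75, 0.5), 3))   # 0.6
```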
Recommendation systems. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Recommendations • We have a list of restaurants, with like/dislike ratings given by some users to some of them • Which restaurant(s) should I recommend to Dave?
Basic Algorithm • Recommend the most popular restaurants • say # positive votes minus # negative votes • What if Dave does not like Spaghetti?
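A minimal sketch of this basic algorithm (the ratings below are made-up numbers, not data from the lecture):

```python
# Score each restaurant by (# positive votes - # negative votes)
# and recommend the top one. Example ratings are invented.
ratings = {
    "Straits Cafe":  {"pos": 4, "neg": 1},
    "Luigi's":       {"pos": 2, "neg": 2},
    "Spaghetti Hut": {"pos": 5, "neg": 0},
}

def most_popular(ratings):
    return max(ratings, key=lambda r: ratings[r]["pos"] - ratings[r]["neg"])

print(most_popular(ratings))  # "Spaghetti Hut" -- useless if Dave dislikes spaghetti
```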
Smart Algorithm • Basic idea: find the person “most similar” to Dave according to cosine similarity (i.e. Estie), and then recommend something this person likes • Perhaps recommend Straits Cafe to Dave • Do you want to rely on one person’s opinions?
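A minimal sketch of the smart algorithm under assumed data: ratings encoded as +1 (like), −1 (dislike), 0 (unrated); the vectors below are invented for illustration:

```python
import math

# Find the user most similar to Dave by cosine similarity, then
# recommend something that user liked and Dave has not rated yet.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

restaurants = ["Straits Cafe", "Luigi's", "Spaghetti Hut", "Taqueria"]
dave  = [0, 1, -1, 1]                  # Dave's ratings (0 = unrated)
users = {"Estie": [1, 1, -1, 1], "Bob": [-1, 0, 1, -1]}

best = max(users, key=lambda name: cosine(dave, users[name]))
picks = [r for r, d, s in zip(restaurants, dave, users[best]) if d == 0 and s > 0]
print(best, picks)   # Estie ['Straits Cafe']
```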
Main idea [bipartite graph: users U, V, W, Y connected to the docs d1–d7 they liked] • What do we suggest to U?
A glimpse on XML retrieval (eXtensible Markup Language). Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 10
XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML has a fixed set of markup tags; XML does not • HTML can be formalized as an XML language (XHTML)
XML Example (textual) <chapter id="cmds"> <chaptitle> FileCab </chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application. </para> </chapter>
Basic Structure • An XML doc is an ordered, labeled tree • character data: leaf nodes contain the actual data (text strings) • element nodes: each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value • can have child nodes
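This tree view maps directly onto standard XML parsers. A minimal sketch using Python's xml.etree.ElementTree on the chapter example above (the walk helper is ours, not part of the library):

```python
import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
  <tm>FileCab</tm>inet application.</para>
</chapter>"""

def walk(elem, depth=0):
    # element node: a name (tag) plus a set of attributes
    print("  " * depth, elem.tag, elem.attrib)
    if elem.text and elem.text.strip():
        # character data: leaf text directly under this node
        print("  " * (depth + 1), repr(elem.text.strip()))
    for child in elem:          # child element nodes, in document order
        walk(child, depth + 1)
        if child.tail and child.tail.strip():
            # character data that follows a child element ("tail" text)
            print("  " * (depth + 1), repr(child.tail.strip()))

walk(ET.fromstring(doc))
```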
XML: Design Goals • Separate syntax from semantics to provide a framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the standard of (semi)structured information (do some of the work now done by databases)
Why Use XML? • Represent semi-structured data • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free
Data vs. Text-centric XML • Data-centric XML: used for messaging between enterprise applications • Mainly a recasting of relational data • Text-centric XML: used for annotating content • Rich in text • Demands good integration of text retrieval functionality • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
IR Challenges in XML • There is no document unit in XML • How do we compute tf and idf? • Indexing granularity • Need to go to document for retrieving or displaying a fragment • E.g., give me the Abstracts of Papers on existentialism • Need to identify similar elements in different schemas • Example: employee
XQuery: SQL for XML? • Simple attribute/value • /play/title contains “hamlet” • Path queries • title contains “hamlet” • /play//title contains “hamlet” • Complex graphs • Employees with two managers • What about relevance ranking?
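For the structural part of such path queries, here is a minimal sketch using the XPath subset built into Python's xml.etree.ElementTree (the play document is a made-up example; the "contains" text predicate and relevance ranking would need extra machinery):

```python
import xml.etree.ElementTree as ET

play = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><scene><title>Hamlet at Elsinore</title></scene></act></play>"
)

# /play/title: title as a direct child of play
print([t.text for t in play.findall("title")])     # ['Hamlet']
# /play//title: title anywhere below play
print([t.text for t in play.findall(".//title")])  # ['Hamlet', 'Hamlet at Elsinore']
```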
Data structures for XML retrieval • Inverted index: give me all elements matching text query Q • We know how to do this – treat each element as a document • Give me all elements below any instance of the Book element (Parent/child relationship is not enough)
Positional containment • Example: droppeth under Verse under Play [diagram: Doc 1: Play spans positions 27–5790; Verse elements span 431–867 and 1122–2033; the term droppeth occurs at position 720] • Containment can be viewed as merging postings
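A minimal sketch of containment-as-merging, assuming element postings are stored as (start, end) position ranges and term postings as positions (the numbers mirror the example above; a real system would do a linear merge of sorted postings lists instead of nested loops):

```python
# Keep the term occurrences whose position falls inside a Verse range
# that is itself contained in a Play range (Doc 1 from the example).
def contained(positions, ranges):
    return [p for p in positions if any(s <= p <= e for s, e in ranges)]

play_ranges  = [(27, 5790)]
verse_ranges = [(431, 867), (1122, 2033)]
droppeth     = [720]

verses_in_play = [(s, e) for s, e in verse_ranges
                  if any(ps <= s and e <= pe for ps, pe in play_ranges)]
print(contained(droppeth, verses_in_play))   # [720]
```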
Summary of data structures • Path containment etc. can essentially be solved by positional inverted indexes • Retrieval consists of “merging” postings • All the compression tricks are still applicable • Complications arise from insertion/deletion of elements, text within elements • Beyond the scope of this course
Search Engines and Advertising
Classic approach… • Socio-demographic • Geographic • Contextual
Search Engines vs Advertisement • First generation: use only on-page, web-text data • Word frequency and language • Second generation: use off-page, web-graph data • Link (or connectivity) analysis • Anchor text (how people refer to a page) • Third generation: answer “the need behind the query” • Focus on the “user need”, rather than on the query • Integrate multiple data sources • Click-through data • Pure search vs paid search: ads shown next to search results go to whoever pays more (Goto/Overture); by 2003 Google and Yahoo adopt the new model • All players now have: SE, Adv platform + network
The new scenario • SEs make possible • aggregation of interests • unlimited selection (Amazon, Netflix, ...) • Incentives for specialized niche players • The biggest money is in the smallest sales!!
Two new approaches • Sponsored search: ads driven by the search keywords (and the profile of the user issuing them) → AdWords • Context match: ads driven by the content of a web page (and the profile of the user reaching that page) → AdSense
How does it work? [diagram: intersection of Econ and IR] • Match Ads to the query or page content • Order the Ads • Pricing on a click-through
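A minimal sketch of one common realization of these three steps (an assumption for illustration, not stated on the slide): order ads by bid × estimated click-through rate, and charge only on a click:

```python
# Rank ads by expected revenue bid * CTR; the advertiser pays per click.
# Bids and CTR estimates below are made-up numbers.
ads = [
    {"ad": "A", "bid": 0.50, "ctr": 0.10},
    {"ad": "B", "bid": 0.80, "ctr": 0.04},
    {"ad": "C", "bid": 0.30, "ctr": 0.20},
]

ranked = sorted(ads, key=lambda a: a["bid"] * a["ctr"], reverse=True)
for a in ranked:
    print(a["ad"], round(a["bid"] * a["ctr"], 3))
# C 0.06, A 0.05, B 0.032 -- C wins despite the lowest bid;
# payment happens only when the user actually clicks the ad.
```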
Web usage data!!! • Visited pages • Clicked banners • Web searches • Clicks on search results
A new game • Similar to web searching, but: the Ad-DB is smaller, Ad items are small pages, and ranking depends on clicks • For advertisers: • What words to buy, how much to pay • SPAM is an economic activity • For search engine owners: • How to price the words • Find the right Ad • Keyword suggestion, geo-coding, business control, language restriction, proper Ad display