This article by Paolo Ferragina from the University of Pisa explores the critical aspects of search engine evaluation, focusing on performance metrics such as indexing speed, search latency, and the effectiveness of query languages. It highlights the importance of user happiness, emphasizing the need for relevance in search results. Additionally, it discusses standard benchmarks like TREC and challenges inherent in information retrieval systems, including precision-recall measures. The discussion extends to recommendation systems and the advantages of XML in representing semi-structured information.
Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8
Is it good? • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of the query language
Measures for a search engine • All of the preceding criteria are measurable • The key measure: user happiness …useless answers won’t make a user happy
Happiness: elusive to measure • The most common approach measures the relevance of search results • How do we measure it? • Requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system • Standard benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years • Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant • On the Web everything is more complicated, since we cannot mark the entire corpus!!
General scenario • [Figure: the Retrieved and Relevant sets as overlapping subsets of the whole collection]
Precision vs. Recall • Precision: % docs retrieved that are relevant [issue: "junk" found] • Recall: % docs relevant that are retrieved [issue: "info" found]
How to compute them • Precision: fraction of retrieved docs that are relevant • Recall: fraction of relevant docs that are retrieved • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)
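A minimal sketch of these two formulas in Python, computing tp, fp, fn from toy sets of retrieved and relevant doc IDs (the IDs are hypothetical):

# Precision and recall from sets of retrieved and relevant doc IDs.
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)   # true positives: relevant docs we returned
    fp = len(retrieved - relevant)   # false positives: "junk" we returned
    fn = len(relevant - retrieved)   # false negatives: relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 2 relevant docs were missed.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9}))   # (0.75, 0.6)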
Some considerations • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • Precision usually decreases
Precision-Recall curve • We measure Precision at various levels of Recall • Note: it is an AVERAGE over many queries • [Figure: measured (recall, precision) points along a curve]
A common picture • [Figure: a typical precision-recall curve, with precision falling as recall grows]
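One standard way such averaged curves are produced is 11-point interpolated precision; the slides do not fix the exact method, so this sketch assumes it. Interpolated precision at recall r is the maximum precision observed at any recall >= r, averaged over queries:

# 11-point interpolated precision, averaged over queries (assumed method).
def interpolated_precision(points, recall_level):
    # points: list of (recall, precision) measurements for one query.
    vals = [p for r, p in points if r >= recall_level]
    return max(vals) if vals else 0.0

def eleven_point_average(per_query_points):
    levels = [i / 10 for i in range(11)]   # recall = 0.0, 0.1, ..., 1.0
    return [round(sum(interpolated_precision(q, r) for q in per_query_points)
                  / len(per_query_points), 2) for r in levels]

# Two toy queries, each a list of (recall, precision) measurements.
q1 = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
q2 = [(0.33, 0.5), (0.67, 0.4), (1.0, 0.43)]
print(eleven_point_average([q1, q2]))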
F measure • Combined measure (weighted harmonic mean): 1/F = α(1/P) + (1 − α)(1/R) • People usually use the balanced F1 measure • i.e., with α = ½, thus 1/F = ½(1/P + 1/R), i.e. F1 = 2PR/(P + R) • Use this if you need to optimize a single measure that balances precision and recall.
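The same formula in code; alpha is the α above, and the default 0.5 gives the balanced F1:

# Weighted harmonic mean of precision and recall; alpha = 0.5 gives F1.
def f_measure(p, r, alpha=0.5):
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

print(f_measure(0.75, 0.6))   # F1 = 2PR/(P+R) = 0.666...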
Recommendation systems Paolo Ferragina Dipartimento di Informatica Università di Pisa
Recommendations • We have a list of restaurants • with positive and negative ratings for some of them, from some users • Which restaurant(s) should I recommend to Dave?
Basic Algorithm • Recommend the most popular restaurants • say # positive votes minus # negative votes • What if Dave does not like Spaghetti?
Smart Algorithm • Basic idea: find the person "most similar" to Dave according to cosine similarity (i.e., Estie), and then recommend something this person likes • Perhaps recommend Straits Cafe to Dave • Do you want to rely on one person's opinions?
Main idea • [Figure: a bipartite user-item graph, with users U, V, W, Y linked to the items d1...d7 they rated] • What do we suggest to U?
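A minimal sketch of the "smart" algorithm on toy data: ratings are +1/-1 (0 = unrated). Estie and Straits Cafe come from the slides; the other names and the rating values are hypothetical.

import math

def cosine(u, v):
    # Cosine similarity between two rating vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, others, items):
    # Find the user most cosine-similar to the target...
    best = max(others, key=lambda name: cosine(target, others[name]))
    # ...and suggest items they liked that the target has not rated yet.
    return [it for it, t, o in zip(items, target, others[best]) if t == 0 and o > 0]

items = ["Straits Cafe", "Spaghetti House", "Sushi Bar"]   # partly hypothetical
dave  = [0, -1, 1]
users = {"Estie": [1, -1, 1], "Carl": [-1, 1, 0]}
print(recommend(dave, users, items))   # ['Straits Cafe']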
A glimpse at XML retrieval (eXtensible Markup Language) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 10
XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML has a fixed set of markup tags; XML does not • HTML can be formalized as an XML language (XHTML)
XML Example (textual)
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
Basic Structure • An XML doc is an ordered, labeled tree • character data: leaf nodes contain the actual data (text strings) • element nodes: each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value • can have child nodes
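A small sketch using Python's standard xml.etree.ElementTree to walk the chapter example above and print the tree structure: element nodes carry a tag name and attributes, character data sits in the leaves (tails of mixed content omitted for brevity).

import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
  <tm>FileCab</tm>inet application.</para>
</chapter>"""

def walk(elem, depth=0):
    # Each element node: a name (the tag) plus name/value attributes.
    print("  " * depth, "element:", elem.tag, "attrs:", elem.attrib)
    if elem.text and elem.text.strip():
        print("  " * (depth + 1), "text:", elem.text.strip()[:40])
    for child in elem:        # child nodes, in document order
        walk(child, depth + 1)

walk(ET.fromstring(doc))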
XML: Design Goals • Separate syntax from semantics to provide a framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the standard of (semi)structured information (do some of the work now done by databases)
Why Use XML? • Represent semi-structured information • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free
Data vs. Text-centric XML • Data-centric XML: used for messaging between enterprise applications • Mainly a recasting of relational data • Text-centric XML: used for annotating content • Rich in text • Demands good integration of text retrieval functionality • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
IR Challenges in XML • There is no document unit in XML • How do we compute tf and idf? • Indexing granularity • Need to go to document for retrieving or displaying a fragment • E.g., give me the Abstracts of Papers on existentialism • Need to identify similar elements in different schemas • Example: employee
XQuery: SQL for XML? • Simple attribute/value • /play/title contains "hamlet" • Path queries • title contains "hamlet" • /play//title contains "hamlet" • Complex graphs • Employees with two managers • What about relevance ranking?
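ElementTree supports a small XPath-like subset, enough to sketch the two path queries above; the "contains" predicate is done in Python, since ElementTree's subset lacks it (the toy document is hypothetical).

import xml.etree.ElementTree as ET

play = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><scene><title>Elsinore</title></scene></act></play>")

# /play/title contains "hamlet"  ->  'title' as a direct child of the root
print([t.text for t in play.findall("title") if "hamlet" in t.text.lower()])

# /play//title contains "hamlet" ->  'title' at any depth below the root
print([t.text for t in play.findall(".//title") if "hamlet" in t.text.lower()])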
Data structures for XML retrieval • Inverted index: give me all elements matching text query Q • We know how to do this – treat each element as a document • Give me all elements below any instance of the Book element (Parent/child relationship is not enough)
Positional containment • Example: "droppeth" under Verse under Play • In Doc:1, a Play element spans positions 27-5790, two Verse elements span 431-867 and 1122-2033, and Term:droppeth occurs at position 720, hence inside the first Verse and inside the Play • Containment can be viewed as merging postings.
Summary of data structures • Path containment etc. can essentially be solved by positional inverted indexes • Retrieval consists of “merging” postings • All the compression tricks are still applicable • Complications arise from insertion/deletion of elements, text within elements • Beyond the scope of this course
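A sketch of containment as a postings merge, using the positions from the Play/Verse example above: element postings are (start, end) intervals, term postings are single positions, and one linear merge intersects them.

# Containment as a postings merge: keep the term positions that fall
# inside some element interval. Both lists are sorted, so one pass suffices.
def contained(term_positions, element_intervals):
    out, i = [], 0
    for pos in term_positions:
        while i < len(element_intervals) and element_intervals[i][1] < pos:
            i += 1                     # skip elements that end before pos
        if i < len(element_intervals) and element_intervals[i][0] <= pos:
            out.append(pos)            # pos lies within the current element
    return out

verses = [(431, 867), (1122, 2033)]    # Verse postings for Doc:1
plays = [(27, 5790)]                   # Play postings for Doc:1
droppeth = [720]                       # term postings for Doc:1

# "droppeth under Verse under Play": intersect step by step.
print(contained(contained(droppeth, verses), plays))   # [720]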
Search Engines and Advertising
Classic approach… • Socio-demographic • Geographic • Contextual
Search Engines vs Advertisement • First generation -- use only on-page, web-text data • Word frequency and language • Second generation -- use off-page, web-graph data • Link (or connectivity) analysis • Anchor-text (how people refer to a page) • Third generation -- answer "the need behind the query" • Focus on the "user need", rather than on the query • Integrate multiple data-sources • Click-through data • Pure search vs paid search: ads shown on search go to whoever pays more (Goto/Overture, then Google/Yahoo from 2003) • New model: all players now have a SE, an Adv platform + network
The new scenario • SEs make possible • aggregation of interests • unlimited selection (Amazon, Netflix, ...) • Incentives for specialized niche players • The biggest money is in the smallest sales!!
Two new approaches • Sponsored search (AdWords): ads driven by search keywords (and the user-profile issuing them) • Context match (AdSense): ads driven by the content of a web page (and the user-profile reaching that page)
Econ IR: how does it work? • Match Ads to the query or page content • Order the Ads • Pricing on a click-through
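The slides do not fix an ordering or pricing rule; a common scheme (assumed here) ranks ads by bid x estimated click-through rate and charges, per click, just enough to keep the ad's position (a generalized second-price auction). A minimal sketch:

# A hedged sketch of one common scheme, not necessarily the one the slides
# have in mind: rank by bid * estimated CTR, charge per click the smallest
# bid that would preserve the ad's rank (generalized second price).
def rank_and_price(ads):
    # ads: list of (name, bid_per_click, estimated_ctr)
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    priced = []
    for i, (name, bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, nbid, nctr = ranked[i + 1]
            price = nbid * nctr / ctr    # just enough to beat the next ad
        else:
            price = 0.01                 # hypothetical reserve price
        priced.append((name, round(price, 2)))
    return priced

ads = [("A", 2.00, 0.05), ("B", 1.50, 0.10), ("C", 1.00, 0.08)]
print(rank_and_price(ads))   # B first (score 0.15), then A (0.10), then C (0.08)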
Web usage data!!! • Visited pages • Clicked banners • Web searches • Clicks on search results
A new game • Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, and ranking depends on clicks • For advertisers: • What words to buy, how much to pay • SPAM is an economic activity • For search engine owners: • How to price the words • Find the right Ad • Keyword suggestion, geo-coding, business control, language restriction, proper Ad display