Quality of a search engine. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 8
Is it good? • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of the query language
Measures for a search engine • All of the preceding criteria are measurable • The key measure: user happiness … useless answers won’t make a user happy
Happiness: elusive to measure • The most common approach is to measure the relevance of search results • How do we measure it? • Requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system • Standard benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years • Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant • On the Web everything is more complicated, since we cannot mark the entire corpus!!
General scenario [diagram: the Retrieved set and the Relevant set overlap inside the whole collection]
Precision vs. Recall • Precision: % docs retrieved that are relevant [issue: “junk” found] • Recall: % docs relevant that are retrieved [issue: “info” found]
How to compute them • Precision: fraction of retrieved docs that are relevant • Recall: fraction of relevant docs that are retrieved • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)
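To make the two definitions concrete, here is a minimal Python sketch (an illustration, not code from the lecture) that computes Precision and Recall from a set of retrieved doc IDs and a set of relevant doc IDs:

```python
# A minimal sketch: Precision and Recall from the retrieved and relevant sets.
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)   # relevant docs we retrieved
    fp = len(retrieved - relevant)   # junk we retrieved
    fn = len(relevant - retrieved)   # relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant (P = 0.75),
# and we found 3 of the 6 relevant docs (R = 0.5).
P, R = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(P, R)   # 0.75 0.5
```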
Some considerations • Can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • Precision usually decreases
Precision-Recall curve • We measure Precision at various levels of Recall • Note: it is an AVERAGE over many queries [plot: sampled (recall, precision) points]
A common picture [plot: precision decreases as recall increases]
F measure • Combined measure (weighted harmonic mean): 1/F = α (1/P) + (1 − α) (1/R) • People usually use the balanced F1 measure • i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), i.e. F1 = 2PR/(P + R) • Use this if you need to optimize a single measure that balances precision and recall.
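A minimal sketch of the same formula in Python (illustrative; the α parameter and the F1 special case follow the definition above):

```python
# Weighted harmonic mean F; balanced F1 is the special case alpha = 1/2.
def f_measure(P, R, alpha=0.5):
    # 1/F = alpha * (1/P) + (1 - alpha) * (1/R)
    if P == 0 or R == 0:
        return 0.0
    return 1.0 / (alpha / P + (1 - alpha) / R)

# Balanced F1 = 2PR / (P + R): with P = 0.75 and R = 0.5,
# F1 = 2 * 0.375 / 1.25 = 0.6
print(round(f_measure(0.75, 0.5), 3))   # 0.6
```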
Recommendation systems. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Recommendations • We have a list of restaurants, with like/dislike ratings given by some users to some of them • Which restaurant(s) should I recommend to Dave?
Basic Algorithm • Recommend the most popular restaurants • say # positive votes minus # negative votes • What if Dave does not like Spaghetti?
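A minimal sketch of this basic algorithm (the ratings below are made-up numbers, not data from the lecture):

```python
# Score each restaurant by (# positive votes - # negative votes)
# and recommend the top one. Example ratings are invented.
ratings = {
    "Straits Cafe":  {"pos": 4, "neg": 1},
    "Luigi's":       {"pos": 2, "neg": 2},
    "Spaghetti Hut": {"pos": 5, "neg": 0},
}

def most_popular(ratings):
    return max(ratings, key=lambda r: ratings[r]["pos"] - ratings[r]["neg"])

print(most_popular(ratings))  # "Spaghetti Hut" -- useless if Dave dislikes spaghetti
```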
Smart Algorithm • Basic idea: find the person “most similar” to Dave according to cosine similarity (i.e. Estie), and then recommend something this person likes • Perhaps recommend Straits Cafe to Dave • Do you want to rely on one person’s opinions?
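A minimal sketch of the smart algorithm under assumed data: ratings encoded as +1 (like), −1 (dislike), 0 (unrated); the vectors below are invented for illustration:

```python
import math

# Find the user most similar to Dave by cosine similarity, then
# recommend something that user liked and Dave has not rated yet.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

restaurants = ["Straits Cafe", "Luigi's", "Spaghetti Hut", "Taqueria"]
dave  = [0, 1, -1, 1]                  # Dave's ratings (0 = unrated)
users = {"Estie": [1, 1, -1, 1], "Bob": [-1, 0, 1, -1]}

best = max(users, key=lambda name: cosine(dave, users[name]))
picks = [r for r, d, s in zip(restaurants, dave, users[best]) if d == 0 and s > 0]
print(best, picks)   # Estie ['Straits Cafe']
```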
Main idea [bipartite graph: users U, V, W, Y connected to the docs d1–d7 they liked] • What do we suggest to U?
A glimpse on XML retrieval (eXtensible Markup Language). Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading 10
XML vs HTML • HTML is a markup language for a specific purpose (display in browsers) • XML is a framework for defining markup languages • HTML has a fixed set of markup tags; XML does not • HTML can be formalized as an XML language (XHTML)
XML Example (textual) <chapter id="cmds"> <chaptitle> FileCab </chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application. </para> </chapter>
Basic Structure • An XML doc is an ordered, labeled tree • character data: leaf nodes contain the actual data (text strings) • element nodes: each labeled with • a name (often called the element type), and • a set of attributes, each consisting of a name and a value • can have child nodes
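This tree view maps directly onto standard XML parsers. A minimal sketch using Python's xml.etree.ElementTree on the chapter example above (the walk helper is ours, not part of the library):

```python
import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
  <tm>FileCab</tm>inet application.</para>
</chapter>"""

def walk(elem, depth=0):
    # element node: a name (tag) plus a set of attributes
    print("  " * depth, elem.tag, elem.attrib)
    if elem.text and elem.text.strip():
        # character data: leaf text directly under this node
        print("  " * (depth + 1), repr(elem.text.strip()))
    for child in elem:          # child element nodes, in document order
        walk(child, depth + 1)
        if child.tail and child.tail.strip():
            # character data that follows a child element ("tail" text)
            print("  " * (depth + 1), repr(child.tail.strip()))

walk(ET.fromstring(doc))
```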
XML: Design Goals • Separate syntax from semantics to provide a framework for structuring information • Allow tailor-made markup for any imaginable application domain • Support internationalization (Unicode) and platform independence • Be the standard of (semi)structured information (do some of the work now done by databases)
Why Use XML? • Represent semi-structured data • XML is more flexible than DBs • XML is more structured than simple IR • You get a massive infrastructure for free
Data vs. Text-centric XML • Data-centric XML: used for messaging between enterprise applications • Mainly a recasting of relational data • Text-centric XML: used for annotating content • Rich in text • Demands good integration of text retrieval functionality • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
IR Challenges in XML • There is no document unit in XML • How do we compute tf and idf? • Indexing granularity • Need to go to document for retrieving or displaying a fragment • E.g., give me the Abstracts of Papers on existentialism • Need to identify similar elements in different schemas • Example: employee
XQuery: SQL for XML? • Simple attribute/value • /play/title contains “hamlet” • Path queries • title contains “hamlet” • /play//title contains “hamlet” • Complex graphs • Employees with two managers • What about relevance ranking?
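For the structural part of such path queries, here is a minimal sketch using the XPath subset built into Python's xml.etree.ElementTree (the play document is a made-up example; the "contains" text predicate and relevance ranking would need extra machinery):

```python
import xml.etree.ElementTree as ET

play = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><scene><title>Hamlet at Elsinore</title></scene></act></play>"
)

# /play/title: title as a direct child of play
print([t.text for t in play.findall("title")])     # ['Hamlet']
# /play//title: title anywhere below play
print([t.text for t in play.findall(".//title")])  # ['Hamlet', 'Hamlet at Elsinore']
```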
Data structures for XML retrieval • Inverted index: give me all elements matching text query Q • We know how to do this – treat each element as a document • Give me all elements below any instance of the Book element (Parent/child relationship is not enough)
Positional containment • Example: droppeth under Verse under Play [diagram: Doc 1: Play spans positions 27–5790; Verse elements span 431–867 and 1122–2033; the term droppeth occurs at position 720] • Containment can be viewed as merging postings
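A minimal sketch of containment-as-merging, assuming element postings are stored as (start, end) position ranges and term postings as positions (the numbers mirror the example above; a real system would do a linear merge of sorted postings lists instead of nested loops):

```python
# Keep the term occurrences whose position falls inside a Verse range
# that is itself contained in a Play range (Doc 1 from the example).
def contained(positions, ranges):
    return [p for p in positions if any(s <= p <= e for s, e in ranges)]

play_ranges  = [(27, 5790)]
verse_ranges = [(431, 867), (1122, 2033)]
droppeth     = [720]

verses_in_play = [(s, e) for s, e in verse_ranges
                  if any(ps <= s and e <= pe for ps, pe in play_ranges)]
print(contained(droppeth, verses_in_play))   # [720]
```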
Summary of data structures • Path containment etc. can essentially be solved by positional inverted indexes • Retrieval consists of “merging” postings • All the compression tricks are still applicable • Complications arise from insertion/deletion of elements, text within elements • Beyond the scope of this course
Search Engines and Advertising
Classic approach… • Socio-demographic • Geographic • Contextual
Search Engines vs Advertisement • First generation: use only on-page, web-text data • Word frequency and language • Second generation: use off-page, web-graph data • Link (or connectivity) analysis • Anchor text (how people refer to a page) • Third generation: answer “the need behind the query” • Focus on the “user need”, rather than on the query • Integrate multiple data sources • Click-through data • Pure search vs paid search: ads shown next to search results go to whoever pays more (Goto/Overture); by 2003 Google and Yahoo adopt the new model • All players now have: SE, Adv platform + network
The new scenario • SEs make possible • aggregation of interests • unlimited selection (Amazon, Netflix, ...) • Incentives for specialized niche players • The biggest money is in the smallest sales!!
Two new approaches • Sponsored search: ads driven by the search keywords (and the profile of the user issuing them) → AdWords • Context match: ads driven by the content of a web page (and the profile of the user reaching that page) → AdSense
How does it work? [diagram: intersection of Econ and IR] • Match Ads to the query or page content • Order the Ads • Pricing on a click-through
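A minimal sketch of one common realization of these three steps (an assumption for illustration, not stated on the slide): order ads by bid × estimated click-through rate, and charge only on a click:

```python
# Rank ads by expected revenue bid * CTR; the advertiser pays per click.
# Bids and CTR estimates below are made-up numbers.
ads = [
    {"ad": "A", "bid": 0.50, "ctr": 0.10},
    {"ad": "B", "bid": 0.80, "ctr": 0.04},
    {"ad": "C", "bid": 0.30, "ctr": 0.20},
]

ranked = sorted(ads, key=lambda a: a["bid"] * a["ctr"], reverse=True)
for a in ranked:
    print(a["ad"], round(a["bid"] * a["ctr"], 3))
# C 0.06, A 0.05, B 0.032 -- C wins despite the lowest bid;
# payment happens only when the user actually clicks the ad.
```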
Web usage data!!! • Visited pages • Clicked banners • Web searches • Clicks on search results
A new game • Similar to web searching, but: the Ad-DB is smaller, Ad items are small pages, and ranking depends on clicks • For advertisers: • What words to buy, how much to pay • SPAM is an economic activity • For search engine owners: • How to price the words • Find the right Ad • Keyword suggestion, geo-coding, business control, language restriction, proper Ad display