CSM06 Information Retrieval

CSM06 Information Retrieval Lecture 5: Web IR part 2 Dr Andrew Salway a.salway@surrey.ac.uk

Recap of Lecture 4 • Various techniques that search engines can use to index and rank web pages • In particular, techniques that exploit hypertext structure: • Use of anchor text • Link analysis → PageRank • Plus, techniques that analyse the words in webpages

Recap of Lecture 4 (*ADDED) • The ‘Random Surfer’ explanation of PageRank: • A web surfer follows links at random: at a page with no outlinks they ‘teleport’ at random to another page… “the PageRank value of a web page is the long run probability that the surfer will visit that page” (Levene 2005, page 95)
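
The random-surfer idea can be illustrated with a minimal sketch. The three-page graph, the damping factor of 0.85 and the fixed number of iterations are assumptions made for illustration, not values from the lecture. At a page with no outlinks the surfer teleports uniformly at random, as described above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Every link target is assumed to also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Share of probability that arrives by random teleport.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # No outlinks: the surfer teleports to a random page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A links to B and C, B links to C, C has no outlinks.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": []}
ranks = pagerank(toy_web)
```

The values in `ranks` sum to 1 and approximate the long-run visit probabilities; in this toy graph C, which has the most inbound links, ends up ranked highest.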

Recap of Lecture 4 (*ADDED) • “Whether PageRank Leakage exists or not, is a question of semantics. The PageRank for a given page is solely determined by the inbound links. However, an outgoing link can drain the entire site for PageRank”. http://www.pagerank.dk/Pagerank-leak/Pagerank-leakage.htm

Past Exams • Previous exams and solutions for CSM06 are available from: www.computing.surrey.ac.uk/personal/pg/A.Salway/csm06/csm06_exam_info_2005.html IMPORTANT • 1) The content of the module is updated and revised each year so some of the past questions refer to topics that were not part of the module in 2005. • 2) There were some changes to the structure of the exam in 2004, e.g. each question is worth 50 marks. This will be the same in 2005. Also, in 2004 more emphasis was put on current research and development of information retrieval systems (cf. some of the research papers given as Set Reading). As in 2004, the 2005 exam will include questions that ask you to write about some specified research. You do NOT need to go beyond the lecture content and the Set Reading to answer these questions. • 3) The solutions that are provided are written in a style to help with the marking of the exams – this does not necessarily reflect how you would be expected to write your answers, e.g. solutions are sometimes given in note form, whereas you would normally be expected to write full sentences.

Lecture 5: OVERVIEW • Retrieving ‘similar pages’ based on link analysis: companion and cocitation algorithms (Dean and Henzinger 1999) • Transforming questions into queries: TRITUS system (Agichtein, Lawrence and Gravano 2001) • Evaluating web search engines

“Finding Related Pages in the World Wide Web” (Dean and Henzinger 1999) • Use a webpage (URL) as a query – may be an easier way for a user to express their information need • The user is saying “I want more pages like this one” – maybe easier than thinking of good query words? • e.g. the URL www.nytimes.com (New York Times newspaper) returns URLs for other newspapers and news organisations • Aim is for high precision with fast execution using minimal information Two algorithms to find pages related to the query page using only connectivity information, i.e. link analysis (nothing about webpage content or usage): • Companion Algorithm • Cocitation Algorithm

What does ‘related’ mean? “A related web page is one that addresses the same topic as the original page, but is not necessarily semantically identical”

Companion Algorithm • Based on Kleinberg’s HITS algorithm – mutually reinforcing authorities and hubs 1. Build a vicinity graph for u 2. Contract duplicates and near-duplicates 3. Compute edge weights (i.e. links) 4. Compute hub and authority scores for each node (URL) in the graph → return highest ranked authorities as result set

Companion Algorithm 1. Build a vicinity graph for u The graph is made up of the following nodes and edges between them: • u • Up to B parents of u, and for each parent up to BF of its children – if u has > B parents then choose randomly; if a parent has > BF children, then choose children closest to u • Up to F children of u, and for each child up to FB of its parents NB. Use of a ‘stop list’ of URLs with very high indegree

Companion Algorithm 2. Contract duplicates and near-duplicates: if two nodes each have > 10 links and > 95% are in common then make them into one node whose links are the union of the two 3. Compute edge weights (i.e. links) • Edges between nodes on the same host are weighted 0 • Scaling to reduce the influence from any single host: “If there are k edges from documents on a first host to a single document on a second host then each edge has authority weight 1/k” “If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l”
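
The edge-weighting rule in step 3 can be sketched as follows. This is an illustrative reading of the two quoted rules, not the authors' code: the function name `edge_weights`, the example URLs and the use of the URL's network location as the "host" are all assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    # Assumption: the "host" is the network location of the URL.
    return urlparse(url).netloc


def edge_weights(edges):
    """edges: iterable of (source_url, target_url) pairs.

    Returns {edge: (authority_weight, hub_weight)}.
    """
    edges = list(edges)
    cross = [(s, t) for s, t in edges if host(s) != host(t)]
    # k: number of edges from one source host into a single target document.
    k = Counter((host(s), t) for s, t in cross)
    # l: number of edges from a single source document into one target host.
    l = Counter((s, host(t)) for s, t in cross)
    weights = {}
    for s, t in edges:
        if host(s) == host(t):
            weights[(s, t)] = (0.0, 0.0)  # same-host edges are weighted 0
        else:
            weights[(s, t)] = (1.0 / k[(host(s), t)], 1.0 / l[(s, host(t))])
    return weights


# Hypothetical example: two pages on a.com both point to b.com/x.
example_edges = [
    ("http://a.com/1", "http://b.com/x"),
    ("http://a.com/2", "http://b.com/x"),
    ("http://a.com/1", "http://b.com/y"),
    ("http://a.com/1", "http://a.com/2"),
]
weights = edge_weights(example_edges)
```

Here the two edges from a.com into b.com/x each receive authority weight 1/2, so a single host cannot inflate the authority score of a page simply by linking to it many times.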

Companion Algorithm 4. Compute hub and authority scores for each node (URL) in the graph → return highest ranked authorities as result set “a document that points to many others is a good hub, and a document that many documents point to is a good authority”

Companion Algorithm 4. continued… H = hub vector with one element for the Hub value of each node A = authority vector with one element for the Authority value of each node Initially all values set to 1

Companion Algorithm 4. continued…
Until H and A converge:
  For all nodes n in the graph N: A[n] = Σ H[n´] * authority_weight(n´,n), summed over all edges (n´,n) into n
  For all nodes n in the graph N: H[n] = Σ A[n´] * hub_weight(n,n´), summed over all edges (n,n´) out of n
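
The iteration can be sketched in Python as below. The input format (a dictionary from edges to a pair of weights), the convergence tolerance and the toy graph are assumptions for illustration. The vectors are normalised after each pass, as in standard HITS, so that the values stay bounded; the slide itself does not show that step.

```python
import math


def normalise(vector):
    # Scale so that the sum of squares is 1 (standard HITS normalisation).
    norm = math.sqrt(sum(x * x for x in vector.values()))
    return {k: (x / norm if norm else 0.0) for k, x in vector.items()}


def hits(nodes, weighted_edges, max_iterations=100, tolerance=1e-9):
    """weighted_edges: {(source, target): (authority_weight, hub_weight)}."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iterations):
        # A[n] = sum of H[parent] * authority_weight(parent, n)
        new_auth = {n: 0.0 for n in nodes}
        for (source, target), (auth_w, _) in weighted_edges.items():
            new_auth[target] += hub[source] * auth_w
        # H[n] = sum of A[child] * hub_weight(n, child), using the updated A
        new_hub = {n: 0.0 for n in nodes}
        for (source, target), (_, hub_w) in weighted_edges.items():
            new_hub[source] += new_auth[target] * hub_w
        new_auth, new_hub = normalise(new_auth), normalise(new_hub)
        change = max(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        hub, auth = new_hub, new_auth
        if change < tolerance:
            break
    return hub, auth


# Hypothetical toy graph: h1 points to a1 and a2, h2 points to a1 only.
toy_nodes = ["h1", "h2", "a1", "a2"]
toy_edges = {
    ("h1", "a1"): (1.0, 1.0),
    ("h1", "a2"): (1.0, 1.0),
    ("h2", "a1"): (1.0, 1.0),
}
hub_scores, authority_scores = hits(toy_nodes, toy_edges)
```

In the toy graph a1 receives the higher authority score because two hubs point to it, and h1 is the better hub because it points to both authorities.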

Cocitation Algorithm • Finds pages that are frequently cocited with the query web page u – “it finds other pages that are pointed to by many other pages that also point to u” • Two nodes are co-cited if they have a common parent: the number of common parents is their degree of co-citation

Cocitation Algorithm • Select up to B parents of u • For each parent add up to BF of its children to the set of u’s siblings S • Return nodes in S with highest degrees of cocitation with u NB. If < 15 nodes in S that are cocited with u at least twice then restart using u’s URL with one path element removed, e.g. aaa.com/X/Y/Z → aaa.com/X/Y
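
A minimal sketch of these steps follows. The default values for B and BF, the random sampling of parents, and the choice of simply taking the first BF children of each parent are placeholders for illustration; the restart with a shortened URL is omitted.

```python
import random
from collections import Counter


def cocitation(u, parents_of, children_of, B=50, BF=8, top_n=10, seed=0):
    """Return up to top_n (sibling, degree_of_cocitation) pairs for page u.

    parents_of / children_of map a page to a list of pages (hypothetical
    link data supplied by the caller).
    """
    parents = list(parents_of.get(u, []))
    if len(parents) > B:
        # Assumption: surplus parents are dropped by random sampling.
        parents = random.Random(seed).sample(parents, B)
    degree = Counter()
    for parent in parents:
        # Simplification: take the first BF children other than u itself.
        siblings = [c for c in children_of.get(parent, []) if c != u][:BF]
        degree.update(set(siblings))  # each common parent counts once
    return degree.most_common(top_n)


# Hypothetical link data: three pages all link to u and to x.
toy_children = {
    "p1": ["u", "x", "y"],
    "p2": ["u", "x"],
    "p3": ["u", "x", "z"],
}
toy_parents = {"u": ["p1", "p2", "p3"]}
related = cocitation("u", toy_parents, toy_children)
```

Page x shares three parents with u, so it is returned first with a degree of cocitation of 3.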

Evaluation of companion and cocitation algorithms • 59 input URLs chosen by 18 volunteers (mainly computing professionals) • The volunteers were shown results for each URL they chose and had to judge each result ‘1’ for valuable or ‘0’ for not valuable → Various calculations of precision, e.g. ‘precision at 10’ for the intersection group (those query URLs that all 3 algorithms returned results for)
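
As a small illustration of ‘precision at 10’, the sketch below computes it from a ranked list of 1/0 judgements. The judgement list is invented for the example, and dividing by k even when fewer than k results are returned is an assumption of this sketch.

```python
def precision_at_k(judgements, k=10):
    """judgements: 1/0 relevance judgements in ranked order."""
    top = judgements[:k]
    # Assumption: missing results (fewer than k returned) count as misses.
    return sum(top) / k


# Hypothetical judgements for the top 10 results of one query URL.
example_judgements = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
p_at_10 = precision_at_k(example_judgements)  # 6 valuable results out of 10
```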

Evaluation of companion and cocitation algorithms • Authors suggest that their algorithms perform better than an algorithm (Netscape’s) that incorporates content and usage information, as well as connectivity information – “This is surprising” – IS IT?? • Perhaps it is because they had more connectivity information??

Transforming Questions into Queries… • Users of IR systems might prefer to express their information needs directly as questions, rather than as keywords, e.g. • “What is a hard disk?” – rather than the query “hard disk” • What the user wants is a specific answer to their question, rather than web-pages selling hard disks, or web-pages reviewing different kinds of hard disks • But, web search engines may treat the query as a ‘bag of words’ and not recognise questions as such; documents are returned that are similar to the ‘bag of words’

Transforming Questions into Queries… • The challenge then is to automatically transform the question into a suitable query for which search engines will return more pages that do answer the user’s question • Here we consider the work of Agichtein, Lawrence and Gravano (2001) who developed the Tritus system to try to solve this problem… • Cf. AskJeeves (www.ask.com)

Tritus: premise • A good answer to the question “What is a hard disk?” might be “magnetic rotating disk used to store data” • So maybe the query “What is a hard disk?” should be transformed into the query – “hard disk” NEAR “used to”
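
The premise can be made concrete with a toy rewrite function. This is only a sketch of the idea, not the Tritus implementation: the function name, the hand-supplied question phrase and transformation, and the use of a NEAR operator in the output string are assumptions (Tritus learns its transformations, and operator support depends on the search engine).

```python
def transform_question(question, question_phrase, transformation):
    """Rewrite a question as a query string using one known transformation."""
    text = question.strip().rstrip("?")
    if not text.lower().startswith(question_phrase.lower()):
        return text  # question phrase not recognised: leave the query as is
    topic = text[len(question_phrase):].strip()
    return f'"{topic}" NEAR "{transformation}"'


query = transform_question("What is a hard disk?", "what is a", "used to")
# query is now: "hard disk" NEAR "used to"
```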

Tritus: aim • To automatically learn how to transform natural language questions into queries that contain terms and phrases which are expected to appear in documents containing answers to these questions.

Tritus: learning algorithm Step 1 Select question phrases from a set of questions by extracting frequent n-grams that don’t contain domain specific nouns, e.g. “who was”, “what is a”, “how do I”
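
Step 1 can be sketched as counting n-grams over a set of questions. Two simplifications are assumed here: n-grams are anchored at the start of each question, and the filter on domain-specific nouns is approximated by a frequency threshold alone. The example questions and parameter values are invented.

```python
from collections import Counter


def question_phrases(questions, min_len=2, max_len=4, min_count=2):
    """Return frequent question-initial n-grams as (phrase, count) pairs."""
    counts = Counter()
    for question in questions:
        tokens = question.lower().rstrip("?").split()
        for n in range(min_len, min(max_len, len(tokens)) + 1):
            counts[" ".join(tokens[:n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


# Hypothetical training questions.
example_questions = [
    "What is a hard disk?",
    "What is a modem?",
    "Who was Alan Turing?",
    "What is a search engine?",
    "Who was Ada Lovelace?",
]
phrases = question_phrases(example_questions)
```

With these questions, only “what is”, “what is a” and “who was” occur often enough to be kept; n-grams that include topic words such as “hard” or “modem” fall below the threshold.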

Tritus: learning algorithm Step 2 For each question type, select candidate transformations from set of good answers for each question, e.g. “what is a” {“is used to”, “is a”, “used”}

Tritus: learning algorithm Step 3 Weight and re-rank transformation using results from web search engines

Tritus: in use • Trained to learn the best query transformations for specific web search engines, e.g. Google and AltaVista • Evaluation conducted to compare the effect of query transforms, and to compare with AskJeeves

Evaluation of Web Search Engines • Precision may be applicable to evaluate a web search engine, but it may be the precision in the first page of results that is most important • Recall, as traditionally defined, may not be applicable because it is difficult or impossible to identify all the relevant web-pages for a given query

Four strategies for evaluation of web search engines • Use precision and recall in the traditional way for a very tightly defined topic: only applicable if all relevant web pages are known in advance • Use ‘relative recall’ – estimate total number of relevant documents by doing a number of searches and adding the total number of relevant documents returned • Statistically sample the web in order to estimate number of relevant pages • Avoid recall altogether SEE: Oppenheim, Morris and McKnight (2000), p. 194
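
The ‘relative recall’ strategy can be sketched as pooling the relevant documents found by several searches and measuring each engine against that pool. The engine names and document identifiers are invented, and counting each relevant document once (a set union) is an assumption about how the totals are combined.

```python
def relative_recall(relevant_by_engine):
    """relevant_by_engine: {engine_name: set of relevant documents it returned}."""
    # The pool of all relevant documents found stands in for the unknown total.
    pool = set().union(*relevant_by_engine.values())
    return {
        engine: (len(found) / len(pool) if pool else 0.0)
        for engine, found in relevant_by_engine.items()
    }


# Hypothetical judged results from two engines for the same query.
judged = {
    "engine_a": {"d1", "d2", "d3"},
    "engine_b": {"d2", "d4"},
}
recall = relative_recall(judged)
```

Four distinct relevant documents were found overall, so engine_a has a relative recall of 0.75 and engine_b of 0.5. The estimate is only as good as the pool: relevant pages that no engine returned are invisible to it.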

Alternative Evaluation Criteria • Number of web-pages covered, and coverage: Is covering more pages better? It may be more important that certain domains are included in the coverage. • Freshness / broken links: Web-page content is frequently updated so the index also needs to be updated; broken links frustrate users. Should be relatively straightforward to quantify.

continued… • Search Syntax: More experienced users may like the option of ‘advanced searches’, e.g. phrases, Boolean operators, and field searching. • Human Factors and Interface Issues: Evaluation from a user’s perspective is a more subjective criterion, however it is an important one – it can be argued that an intuitive interface for formulating queries and interpreting results helps a user to get better results from the system.

continued… • Quality of Abstracts: related to interface issues are the ‘abstracts’ of web-pages that a web search engine displays – if good then these help a user to quickly identify more promising pages

Set Reading for Lecture 5
Dean and Henzinger (1999), ‘Finding Related Pages in the World Wide Web’. Pages 1-10. http://citeseer.ist.psu.edu/dean99finding.html
Agichtein, Lawrence and Gravano (2001), ‘Learning Search Engine Specific Query Transformations for Question Answering’, Procs. 10th International WWW Conference. Section 1 and Section 3. www.cs.columbia.edu/~eugene/papers/www10.pdf
Oppenheim, Morris and McKnight (2000), ‘The Evaluation of WWW Search Engines’, Journal of Documentation, 56(2), pp. 190-211. Pages 194-205. In Library Article Collection.

Exercise: Google’s ‘Similar Pages’ • It is suggested that Google’s ‘Similar Pages’ feature is based in part on the work of Dean and Henzinger. • By making a variety of queries to Google and choosing ‘Similar Pages’ see what you can find out about how this works.

Exercise: web search engine evaluation • Compare three web-search engines by making the same queries to each. How do they compare in terms of: • Advanced query options? • Coverage? • Quality of highest ranked results? • Ease of querying and understanding results? • Ranking factors that they appear to be using?

Further Reading • The other parts of the papers given for Set Reading

Lecture 5: LEARNING OUTCOMES • For both (Dean and Henzinger 1999) and (Agichtein, Lawrence and Gravano 2001), you should be able to • Explain how they were trying to make web search better for users • Outline their proposed solution • Discuss their evaluation of their solution and make your own comments • You should be able to explain and apply various techniques to compare and evaluate web search engines

Reading ahead for LECTURE 6 If you want to prepare for next week’s lecture then take a look at… The visual interface of the KartOO search engine: http://www.kartoo.com/ Use and read about the clustering of web pages done by Vivisimo: http://vivisimo.com/ Recent developments in Google Labs, especially Google Sets and Google Suggest: http://labs.google.com/
