  1. CSM06 Information Retrieval Lecture 4: Web IR part 1 Dr Andrew Salway a.salway@surrey.ac.uk

  2. Lecture 4: OVERVIEW • Previously we looked at IR techniques that indexed a document based on the words that occur in the document • Some of these techniques are applied in web search engines (but VSM may not be appropriate). However, web IR can also exploit a distinctive feature of information on the web – hypertext link structure • Use of anchor text for indexing web pages • The PageRank algorithm based on link structure analysis • Other techniques for ranking web pages

  3. Challenges for IR on the Web • High volume of information • Heterogeneous information (multimedia and multilingual) • Diverse users - hence diverse information needs, and many inexperienced users • Average query length 2-5 words • Poorly structured and low quality information

  4. Scale • Projection of worldwide Internet population in 2005 = 1.07 billion users, www.clickz.com/stats/web_worldwide/ • Early in 2005 Google claimed to index over 8 billion web pages, Yahoo recently claimed 19 billion, now Google claims to index 3 times more than nearest competitor http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482 • Given the low overlap in search engine results for a given query, it is likely that the total number of webpages is much greater than that indexed by any single web search engine

  5. Requirements of Web Search Engine Users? • Fast response time • Some relevant results in first page; maybe less concern with getting all relevant results • Good coverage of web, at least of ‘important sites’ • Up-to-date links • Simple and intuitive to use – making queries and understanding results NB. Some of these requirements contrast with those of expert researchers using specialist information retrieval systems

  6. User Goals (Information Needs) • Queries are used to express a user’s goal (or information need), but note that the same query might be used for quite different goals (Rose and Levinson 2004)

  7. User Goals: Rose and Levinson’s classification (2004) • Navigational – wanting a specific known website • Informational – “my goal is to learn something by reading or viewing web pages” – e.g. closed and open-ended questions, advice • Resource – “my goal is to obtain a resource (not information) available on web pages” – e.g. download music, interact with online shopping service NOTE: prior to the web, most IR was concerned only with Informational queries

  8. User Goals: Rose and Levinson’s classification (2004) • The more a search engine understands about a user’s goal, the better the results it can provide • User goals may be deduced not only from the query, but also from • The results returned by the search engine • Results clicked on by the user • Further searches / actions by the user

  9. Opportunity… • Web search engines can exploit the fact that information on the web is in the form of hypertext…

  10. Hypertext • The web is, in some senses at least, hypertextual, i.e. it can be viewed as a network of nodes (e.g. pages) and links (between pages)

  11. Hypertext • Links suggest – relatedness of topic / perhaps also a recommendation • Topological information about the hypertext graph gained by link structure analysis can be exploited for ranking

  12. Use of Anchor Text (Brin and Page 1998) • Words in the anchor text can be used to index the webpage being linked to – the text in an anchor may give a good description of the page it points to, e.g. <a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a> • The words in the anchor text might be a better indicator of what the webpage is about than the words in the webpage itself • Anchor text is also useful for resources, like images, whose content cannot be analysed as keywords
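
A minimal sketch, in Python, of how anchor text might be used to index the page being linked to rather than the page containing the link. The regular expression, the index structure and the example reuse of the URL above are simplifying assumptions for illustration, not how any particular search engine implements this.

    import re
    from collections import defaultdict

    anchor_index = defaultdict(set)   # word in anchor text -> URLs that anchor points to

    def index_anchors(source_html):
        # find every <a href="...">anchor text</a> pair in the source page
        pattern = r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>'
        for url, text in re.findall(pattern, source_html, re.IGNORECASE | re.DOTALL):
            for word in re.findall(r"[a-z]+", text.lower()):
                anchor_index[word].add(url)   # credit the linked-to page, not the source page

    index_anchors('<a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a>')
    print(anchor_index["beckham"])   # {'www.bio.com/beckhambio.html'}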

  13. PageRank (Brin and Page 1998) • “Google makes use of both link structure and anchor text” • “The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines” • PageRank is “an objective measure of [a web page’s] citation importance that corresponds well with people’s subjective idea of importance”

  14. Calculating PageRank • PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • PR(A) = PageRank of webpage A • C(A) = the number of links out of webpage A • T1…Tn = the webpages that point to webpage A • d = a damping factor set between 0 and 1 • In reality, the calculation of PageRank is iterative
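
A minimal sketch of the iterative calculation, assuming a toy link graph and the commonly cited damping factor of 0.85 (both are assumptions for illustration; Brin and Page’s production system differs in scale and detail):

    def pagerank(links, d=0.85, iterations=20):
        """links maps each page to the list of pages it links out to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        pr = {p: 1.0 for p in pages}              # initial PageRank for every page
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                # sum PR(T)/C(T) over every page T that links to this page
                incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    # Example: a hypothetical three-page graph
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))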

  15. Web-adjacency Analysis (a similar idea to PageRank) • Kleinberg and colleagues proposed a method for identifying authoritative web-pages • Identify set of relevant pages (as normal) • Identify those with a large in-degree, i.e. lots of pages point to them (cf. ‘impact’) • Ensure that the authorities selected are referred to by a number of the same hubs, i.e. those with a large out-degree

  16. Web-adjacency Analysis • “Hubs and authorities exhibit what could be called a mutually reinforcing relationship” (Kleinberg 1998) • Computing authority and hub values for web-pages is an iterative process over a graph, where each node is a web-page • Two weights are given to each node relating to in-degree and out-degree: total in-degree weights and total out-degree weights are kept constant • Weights are modified each iteration depending on weights of connected nodes
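
A minimal sketch of the hub/authority iteration on a toy graph, normalising by the sum of weights so that the total hub and authority weights stay constant across iterations. Kleinberg’s paper normalises slightly differently; this illustrates the mutual-reinforcement idea, not a faithful reimplementation.

    def hits(links, iterations=20):
        """links maps each page to the list of pages it links out to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # a page's authority weight is the sum of the hub weights of pages linking to it
            auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
            # a page's hub weight is the sum of the authority weights of pages it links to
            hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
            # normalise so the total authority and hub weights are kept constant
            auth_total = sum(auth.values()) or 1.0
            hub_total = sum(hub.values()) or 1.0
            auth = {p: v / auth_total for p, v in auth.items()}
            hub = {p: v / hub_total for p, v in hub.items()}
        return auth, hub

    print(hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))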

  17. Some other Factors used to rank Web Pages (Hock 2001) • Popularity of the Page: measured either by how many other web-pages link to it, or by how many people have clicked on it when they had the same query • Frequency of search terms: need to consider the length of the document, and web-page authors’ attempts to affect ranking by deliberate repetition • Number of query terms matched: but remember many queries are only one or two words

  18. Other Factors (continued…) • Rarity of terms: rank pages containing rare search terms more highly (cf. TF-IDF) • Weighting by Field: give high ranking to pages including search terms in important fields, e.g. Title • Proximity of Terms: rank pages more highly if search terms occur near one another • Order of Query Terms: give priority to pages containing the search term entered first
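
A minimal sketch combining some of the factors from the last two slides: length-normalised term frequency, an IDF-style rarity weight, and a boost for terms found in the Title field. The two-page collection and the particular weights are illustrative assumptions, not any engine’s real scoring formula.

    import math

    docs = {
        "page1": {"title": "david beckham biography", "body": "beckham plays football for england"},
        "page2": {"title": "football news", "body": "news about football and more football results"},
    }

    def score(query_terms, doc_id, title_boost=2.0):
        title = docs[doc_id]["title"].split()
        body = docs[doc_id]["body"].split()
        total = 0.0
        for term in query_terms:
            # rarity: terms appearing in fewer documents get a higher weight
            df = sum(1 for d in docs.values() if term in (d["title"] + " " + d["body"]).split())
            if df == 0:
                continue
            idf = math.log(len(docs) / df) + 1.0
            # frequency: occurrences in the body, normalised by document length
            tf = body.count(term) / len(body)
            # weighting by field: boost terms that also appear in the title
            field = title_boost if term in title else 1.0
            total += tf * idf * field
        return total

    print(sorted(docs, key=lambda d: score(["beckham", "football"], d), reverse=True))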

  19. Set Reading for Lecture 4 • Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. SECTIONS 1 and 2. Explains Google’s use of anchor text and PageRank. www-db.stanford.edu/~backrub/google.html • Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. Gives an overview of some factors used by web search engines to rank webpages. AVAILABLE in Main Library collection and in Library Article Collection.

  20. Exercise • Explore the idea of PageRank using an online PageRank calculator, e.g. www.markhorrell.com/seo/pagerank.shtml OR www.webworkshop.net/pagerank_calculator.php3

  21. Further Reading Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th International WWW Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking: Bringing Order to the Web.” http://dbpubs.stanford.edu:8090/pub/1999-66 Belew (2000), Finding Out About, pages 195-199, for an overview of Kleinberg’s work on web-adjacency analysis and authorities and hubs. Kleinberg (1998), “Authoritative Sources in a Hyperlinked Environment”, Journal of the ACM. http://citeseer.nj.nec.com/87928.html Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY / ARTICLE COLLECTION. **This comprehensive article reviews many of the ideas covered so far in this module and discusses them in the context of Web IR. NOTE, it is already a little out of date in places because of the rapid evolution of the Web.

  22. Lecture 4: LEARNING OUTCOMES After this lecture you should be able to: • Explain how the challenges of web IR differ from those facing the developers of traditional IR systems • Explain how web search engines can exploit the hypertext structure of the web to index and rank web pages, e.g. using Anchor Text and PageRank • Explain how PageRank is calculated • Discuss and critique a range of factors used by web search engines to rank web pages

  23. Reading ahead for LECTURE 5 If you want to read about next week’s lecture topics, see: Dean and Henzinger (1999), ‘Finding Related Pages in the World Wide Web’. Pages 1-10. http://citeseer.ist.psu.edu/dean99finding.html Agichtein, Lawrence and Gravano (2001), ‘Learning Search Engine Specific Query Transformations for Question Answering’, Procs. 10th International WWW Conference. **Section 1 and Section 3** www.cs.columbia.edu/~eugene/papers/www10.pdf Oppenheim, Morris and McKnight (2000), ‘The Evaluation of WWW Search Engines’, Journal of Documentation, 56(2). Pages 194-205. In Library Article Collection.
