
Indexing the Web: Authors, Ranking, PageRank, Anchor Text, Spamming, Scaling

This discussion class covers topics in indexing the web: identifying the paper's authors, critiquing conventional ranking methods, defining precision, the PageRank algorithm, anchor text, spamming, and the scalability challenges of web systems.

mccasland


Presentation Transcript


  1. Discussion Class 7: Google 1

  2. Discussion Classes. Format: Question. Ask a member of the class to answer. Provide opportunity for others to comment. When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear. Suggestions: Do not be shy about presenting partial answers. Differing viewpoints are welcome.

  3. Question 1: Indexing the Web (a) Who are the authors of this paper? (b) The authors criticize conventional ranking methods based on vector similarity. What are their criticisms? Do you agree with them? (c) Why not use standard full-text indexing with tf.idf weighting?

  4. Question 2: Ranking The authors of the paper state that their objective is to maximize precision. (a) What do they mean by "precision"? (b) What assumptions does this imply about users and their wishes? How does their view of relevance differ from the conventional view? How well would you expect Google to perform in the TREC ad hoc track?

  5. Question 3: PageRank Algorithm • Traditional text search engines rank hits by the similarity of each document to a query. How does PageRank rank the hits returned by a query? • What is the concept behind PageRank? • What other ranking methods does Google use?

  6. Question 3 (continued) Page A has pages T1...Tn which point to it. The parameter d is a damping factor which can be set between 0 and 1. C(A) is defined as the number of links going out of page A. The PageRank of a page A is: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) "... d damping factor is the probability at each page the 'random surfer' will get bored and request another random page."

  7. Question 4: Anchor Text What is anchor text? How does Google use anchor text to index a web page? What are the computational challenges in this approach?

  8. Question 5: Spamming "There are even numerous companies which specialize in manipulating search engines for profit." (a) Explain this statement. (b) How did Google overcome this problem at the time of the paper? Why are the authors unenthusiastic about using metadata for indexing the web? Discuss the statement, "we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems."

  9. Question 6: Scaling Much of the article is about scalability. (a) How many pages were they indexing when they wrote the article? How many today? How many queries does the system handle every day? (b) What is their strategy for scalability? Where do you think the limitations lie? (c) How did they manage to implement such a large-scale (and ever changing) system with a small technical staff?
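
As an aid to the PageRank formula quoted in Question 3 (continued), here is a minimal sketch that applies PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) repeatedly to a small link graph. The function name `pagerank`, the toy graph, the starting value of 1.0 per page, the default d = 0.85, and the fixed iteration count are all assumptions made for illustration. This is not a description of how Google implements the computation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the slide's PageRank formula on a small link graph.

    links maps each page to the list of pages it links to.
    d and iterations are illustrative defaults (assumptions).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(page): number of distinct links going out of the page.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # inbound[p]: the pages T1...Tn that point to p.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)

    # Assumed starting point: every page begins with a PageRank of 1.0.
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


if __name__ == "__main__":
    # Hypothetical three-page web: A links to B and C, B to C, C to A.
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 3))
```

In the toy graph, page C is pointed to by both A and B, so it ends up ranked above B, which is pointed to only by A. A page with no outgoing links contributes nothing to any other page in this sketch; the formula on the slide does not say how such pages should be treated.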
