The PageRank Citation Ranking: Bringing Order to the Web
Larry Page, Sergey Brin, Rajeev Motwani, Terry Winograd
January 29, 1998
Speaker: AMAN BAKSHI, University of Southern California
"The Initiative's focus is to dramatically advance the means to collect, store, and organize information in digital forms, and make it available for searching, retrieval, and processing via communication networks -- all in user-friendly ways" --- quote from the DLII website
Behind the Wheels: Google Search
• When you search for one or more keywords, you are not searching the web itself.
• Instead, you search Google's index of the web.
• The index is built by spiders that traverse hundreds of thousands of pages on the web.
• Google then uses PageRank to order the matching pages and display the top ones.
Introduction and Motivation
• The WWW is very large and heterogeneous.
• Web pages are extremely diverse in content, quality, and structure.
• This makes information retrieval on the WWW challenging.
• Most web pages also link to other web pages.
• So, take advantage of the link structure of the web to produce a ranking of every web page, known as PageRank.
The Mechanics
• A Google bot visits periodically to check two things:
  1. The authority of your site
  2. The relevance of your site
• For relevance, it looks at the following:
  1. On-page factors: it searches for keywords on your page
     • so have them in the title, head, or body
     • keep the content fresh
  2. Off-page factors: who is linking to your site
     • The value is not linear. It is logarithmic.
     • Relevance is important. For example, a site such as "Baby Food" pointing to "Fish Fly" makes no sense. So get links from relevant sites that are ranked high.
Theory and Analogy Behind
• We can relate it directly to the way a painter paints on a canvas. To get a specific color, the painter mixes different colors. The amount and intensity of each color in the mix ultimately governs the color of the final mixture, NOT the number of colors!
• Say a certain backlink came from Yahoo! and another came from an obscure home page.
• Think of the importance of the Yahoo! page as opposed to the importance of the obscure home page.
• Backlinks (inedges): links that point to a certain page.
• Forward links (outedges): links that emanate from that page.
• We can never be sure we have found all the backlinks of a page, but once we have downloaded it we know all of its forward links.
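The Formula
• For any web page u, let $F_u$ be the set of pages u points to (forward links), $B_u$ the set of pages that point to u (backlinks), and $N_u = |F_u|$ the number of forward links from u.
• $R(u)$ = rank of page u; $c$ = normalization constant

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

• Note: $c < 1$ to cover for pages with no outgoing links.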
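The formula can be tried out on a toy graph. The following is a minimal sketch, not code from the paper: the function name `simplified_rank_step` and the three-page web are assumptions made for illustration. It applies the update once to every page, using each page's forward links to find the backlinks.

```python
def simplified_rank_step(forward_links, rank, c=1.0):
    """Apply R(u) = c * sum over v in B_u of R(v) / N_v once to every page."""
    new_rank = {u: 0.0 for u in forward_links}
    for v, targets in forward_links.items():
        n_v = len(targets)              # N_v = |F_v|
        for u in targets:               # v is a backlink of u
            new_rank[u] += c * rank[v] / n_v
    return new_rank


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a
forward_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 0.4, "b": 0.2, "c": 0.4}
print(simplified_rank_step(forward_links, rank))
```

With c = 1, the ranks 0.4, 0.2, 0.4 are reproduced by the update, so they are a steady state for this particular graph.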
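Representation
• A is a square matrix whose rows and columns correspond to web pages u and v, with $A_{u,v} = 1/N_u$ if page u links to page v and $A_{u,v} = 0$ otherwise.
• Treating R as a vector over web pages, the rank formula reads $R = c\,A^T R$, so R is an eigenvector of $A^T$.
• $A^T$ = [matrix figure]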
Computing PageRank given a Directed Graph
• The transition matrix A = [matrix figure]
• We get the eigenvalue λ = 1
• Calculating the eigenvector: [figure]
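Since the slide's graph and matrix are not available, here is a hedged sketch of the same computation on a hypothetical three-page graph. It assumes the convention A[u][v] = 1/N_u when page u links to page v, and finds the eigenvector of the transposed matrix for eigenvalue 1 by plain power iteration. The function names are illustrative only.

```python
def transition_matrix(pages, forward_links):
    """A[u][v] = 1/N_u if page u links to page v, else 0 (assumed convention)."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for u, targets in forward_links.items():
        for v in targets:
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A


def power_iteration(A, iterations=100):
    """Repeatedly apply R <- A^T R, starting from the uniform vector."""
    n = len(A)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(A[v][u] * R[v] for v in range(n)) for u in range(n)]
    return R


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
pages = ["a", "b", "c"]
A = transition_matrix(pages, {"a": ["b", "c"], "b": ["c"], "c": ["a"]})
R = power_iteration(A)
print([round(x, 3) for x in R])  # [0.4, 0.2, 0.4]
```

Every page in this graph has at least one outgoing link, so each iteration preserves the total rank and the vector settles on the eigenvector with eigenvalue 1.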
Problems
Problem 1: Dangling Links
• Dangling links are links that point to a page with no outgoing links, or to a page that has not been downloaded yet.
• Problem: it is not clear how their weight should be distributed.
• Solution: they are removed from the system until all the PageRanks are calculated. Afterwards, they are added back in without affecting things significantly.
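A minimal sketch of the removal step, under the assumption that the crawl is stored as a dictionary from each downloaded page to its forward links (the name `remove_dangling` and the sample pages are hypothetical). Removing one dangling page can leave another page without outgoing links, so the sketch repeats until nothing changes.

```python
def remove_dangling(forward_links):
    """Drop links to unknown pages and pages with no outgoing links, repeatedly."""
    graph = {u: set(vs) for u, vs in forward_links.items()}
    while True:
        # Keep only links whose target is still a known page.
        graph = {u: {v for v in vs if v in graph} for u, vs in graph.items()}
        dangling = {u for u, vs in graph.items() if not vs}
        if not dangling:
            return graph
        graph = {u: vs for u, vs in graph.items() if u not in dangling}


# Hypothetical crawl: "x" was never downloaded, "d" has no outgoing links,
# and "e" only links to "d", so it becomes dangling once "d" is removed.
crawl = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d", "x"], "d": [], "e": ["d"]}
print(remove_dangling(crawl))
```

The pages that survive form the graph on which the ranks are computed. Adding the removed pages back afterwards is not shown here.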
Problems (contd.)
Problem 2: Rank Sink
• Problem: some pages form a loop that accumulates rank but never distributes it to other pages (a rank sink).
• Solution: Random Surfer Model. Jump to a random page based on some distribution E (the rank source).
PageRank Expression
Let E(u) be some vector over the web pages that corresponds to a source of rank. Then the PageRank of a set of web pages is an assignment, R', to the web pages which satisfies

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

such that c is maximized and $\|R'\|_1 = 1$ (where $\|R'\|_1$ denotes the $L_1$ norm of R').
• $R'(u)$: PageRank of document u
• $c$: normalization factor
• $R'(v)$: PageRank of document v that links to u
• $N_v$: number of outlinks from document v
• $E(u)$: vector of web pages that the surfer randomly jumps to
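One way to get a feel for the expression is to iterate it and renormalize after every step so that the L1 norm stays at 1. The sketch below makes several assumptions that are not on the slide: the function name, a uniform E with total weight 0.15, a fixed iteration count, and a hypothetical graph in which pages b and c form a rank sink. Every page is assumed to have at least one outgoing link.

```python
def pagerank_with_source(forward_links, E, iterations=200):
    """Iterate R'(u) = c * (sum over v in B_u of R'(v)/N_v + E(u)) with ||R'||_1 = 1."""
    pages = list(forward_links)
    R = {u: 1.0 / len(pages) for u in pages}
    c = 1.0
    for _ in range(iterations):
        x = {u: E[u] for u in pages}            # start from the rank source E(u)
        for v, targets in forward_links.items():
            for u in targets:                   # v passes R(v)/N_v to each target
                x[u] += R[v] / len(targets)
        c = 1.0 / sum(x.values())               # normalization factor
        R = {u: c * x[u] for u in pages}
    return R, c


# Hypothetical graph: a -> b, then b and c only link to each other (rank sink).
links = {"a": ["b"], "b": ["c"], "c": ["b"]}
E = {u: 0.05 for u in links}                    # assumed uniform rank source
R, c = pagerank_with_source(links, E)
print({u: round(r, 4) for u, r in R.items()}, round(c, 4))
```

Without the rank source, page a would end up with rank 0 because nothing links to it. With E, it keeps a small positive rank while b and c still rank higher.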
Searching with PageRank
• Two search engines:
  • Title-based search engine
  • Full-text search engine
• Title-based search engine
  • Searches only the titles
  • Finds all the web pages whose titles contain all the query words
  • Sorts the results by PageRank
  • Very simple and cheap to implement
  • Title match ensures high precision, and PageRank ensures high quality
• Full-text search engine
  • Called Google
  • Examines all the words in every stored document and also uses PageRank (rank merging)
  • More precise but more complicated
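The title-based engine is simple enough to sketch in a few lines. The code below is an illustration under assumptions: the function name `title_search`, the example URLs, titles, and rank values are all made up, and matching is done on lowercase whitespace-separated words.

```python
def title_search(query, titles, pagerank):
    """Return pages whose title contains every query word, highest PageRank first."""
    words = query.lower().split()
    hits = [
        url
        for url, title in titles.items()
        if all(w in title.lower().split() for w in words)
    ]
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)


# Hypothetical index of titles and precomputed PageRank values.
titles = {
    "http://example.com/a": "Stanford University Home Page",
    "http://example.com/b": "University of Southern California",
    "http://example.com/c": "Fly Fishing Home",
}
pagerank = {
    "http://example.com/a": 0.5,
    "http://example.com/b": 0.3,
    "http://example.com/c": 0.2,
}
print(title_search("university", titles, pagerank))
print(title_search("home page", titles, pagerank))
```

The title filter decides which pages are returned, and PageRank only decides their order.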
Adaptive Measures for Computation of PageRank
A separate paper on adaptive methods presents two contributions:
• First, it shows that most pages on the web converge to their true PageRank quickly, while relatively few pages take much longer to converge. Further, slow-converging pages generally have high PageRank, and pages that converge quickly generally have low PageRank.
• Second, the authors develop two algorithms, called Adaptive PageRank and Modified Adaptive PageRank, that exploit this observation to speed up the computation of PageRank by 18% and 28%, respectively.
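The idea of not recomputing pages that have already converged can be illustrated with a simplified sketch. This is not the Adaptive PageRank algorithm as published. It assumes the commonly used damped update R(u) = (1 - d)/n + d * sum of R(v)/N_v with d = 0.85, a relative tolerance `tol`, and a rule (chosen here for robustness) that a page is frozen only after two consecutive tiny changes. All names are hypothetical.

```python
def adaptive_pagerank(forward_links, d=0.85, tol=1e-8, max_iter=200):
    """Damped PageRank that stops updating pages once they look converged."""
    pages = list(forward_links)
    n = len(pages)
    backlinks = {u: [] for u in pages}
    for v, targets in forward_links.items():
        for u in targets:
            backlinks[u].append(v)

    R = {u: 1.0 / n for u in pages}
    calm = {u: 0 for u in pages}     # consecutive iterations with a tiny change
    active = set(pages)              # pages still being recomputed
    iterations = 0
    while active and iterations < max_iter:
        iterations += 1
        new_R = dict(R)              # frozen pages simply keep their value
        for u in active:
            incoming = sum(R[v] / len(forward_links[v]) for v in backlinks[u])
            new_R[u] = (1 - d) / n + d * incoming
            small = abs(new_R[u] - R[u]) <= tol * R[u]
            calm[u] = calm[u] + 1 if small else 0
        active = {u for u in active if calm[u] < 2}
        R = new_R
    return R, iterations


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
R, iterations = adaptive_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print({u: round(r, 4) for u, r in R.items()}, iterations)
```

Requiring two calm iterations in a row guards against a page whose rank happens to stand still for a single step while it is still far from its final value. On a three-page graph the saving is negligible. The benefit the slide describes comes from large graphs where most pages settle early.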
Observations: Unwanted Uses of PageRank
• bmw.de was banned from Google in early 2006 because of its doorway pages. A doorway page is a page stuffed full of the keywords that the site wants to be optimized for. Blog: http://blog.outer-court.com/archive/2006-02-04-n60.html
• "Google Bomb": http://searchengineland.com/070125-230048.php
  • Create lots of links to one certain destination.
  • Label all of them with the same remarkable terms.
  • Query Google for those terms.
  • You will get the linked page.
Applications
• Estimating Web Traffic: on analyzing usage statistics, it was found that some sites have very high usage but low PageRank, e.g. sites with links to pirated software.
• PageRank as Backlink Predictor: the goal is to crawl pages in as close to the optimal order as possible, i.e., in the order of their rank according to an evaluation function. PageRank is a better predictor than citation counting.
• User Navigation: The PageRank Proxy. The user receives some information about a link before clicking on it. This proxy can help users decide which links are more likely to be interesting.
• "If an SEO creates deceptive or misleading content on your behalf, such as doorway pages or 'throwaway' domains, your site could be removed entirely from Google's index." ---- unknown at Google
• PageRank applies ONLY to an individual page. There is no such thing as a domain rank.