Accelerating Ranking-System Using WebGraph

Accelerating Ranking-System Using WebGraph • Project Report by Padmaja Adipudi

Outline of My Talk • Needle Search Engine/Ranking-System • Ranking-System Issue/Resolution • Accelerating Ranking-System using WebGraph • Ranking Algorithms Overview • Google’s PageRank, ClusterRank, SourceRank & Truncated PageRank • Experimental Results • Efficiency Measure • Quality Measure • Conclusion • Which algorithm is better in terms of Efficiency & Quality

Search Engine • The Web is a terrific place to find information on any topic. • A search engine is a useful application for information retrieval on the WWW. • A search engine has five basic components: a Crawler, a Parser, a Ranking-System, a Repository, and a Front-End.

Ranking-System • Determines the importance of a Web page. • Google's PageRank algorithm is the best-known Ranking-System and is based on the URL link structure. • In Google’s PageRank, the importance of a Web page is based on the importance of its parent Web pages (the pages that link to it).

Needle Search Engine • A Search Engine developed by former students at UCCS. • The ClusterRank algorithm is implemented as its Ranking-System. • Former student Yi Zhang developed this cluster ranking system, which takes an average of 3 hours to rank 300,000 URLs.

Ranking-System Issue • The major issue with the current ranking system is its long update time: 3 hours for 300K URLs. • As the number of pages increases, this becomes a severe problem.

Project Goal • Accelerate the existing Ranking-System of the Needle Search Engine at UCCS using a package called “WebGraph”. • Upgrade the Needle Search Engine from 50K crawled Web pages to 1 Million crawled Web pages.

Steps to reach Goal • Use the WebGraph package to represent the graph efficiently using compression techniques. • Compute the Page-Rank using three algorithms: ClusterRank, SourceRank and Truncated PageRank. • Compare the time and quality results of ClusterRank with those of SourceRank and Truncated PageRank, and choose the best algorithm for the Needle Search Engine.

Work Flow • [Diagram: Compressed Graph → ClusterRank / SourceRank / Truncated PageRank → Page Rank Results]

Why Truncated & Source Algorithms • These are the latest papers available in the Page Ranking area. • The authors used the WebGraph package for their experiments while developing the algorithms.

Node Graph • A node graph is used in the ranking system. • The node graph consists of nodes and directed links from node to node. • URLs are represented by nodes, and hyperlinks are represented as directed links between nodes. • Compression techniques are used to represent the node graph efficiently.

Google’s PageRank • Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd from Stanford University, 1999. • The importance of a page is based on the number of incoming links and on how important the pages behind those links are. • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • PR(Tn): Each page has a notion of its own self-importance: PR(T1) for the first page that links to A, all the way up to PR(Tn) for the last one. • C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The number of outgoing links of page T1 is C(T1), and so on up to C(Tn) for page Tn. • PR(Tn)/C(Tn): If page A has a back link from page Tn, the share of the vote page A gets is PR(Tn)/C(Tn). • d: All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by the factor d (0.85).
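
The formula above can be sketched as a simple iteration. This is a minimal illustration only, assuming a hypothetical three-page graph, a start value of 1.0 for every page, and a fixed iteration count; none of these choices come from the Needle system.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A."""
    pages = list(out_links)
    pr = {page: 1.0 for page in pages}  # assumed start value for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each parent T passes on PR(T) split evenly over its C(T) out-links.
            votes = sum(pr[t] / len(out_links[t])
                        for t in pages if page in out_links[t])
            new_pr[page] = (1 - d) + d * votes
        pr = new_pr
    return pr


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph page C collects votes from both A and B, so it ends up with the highest rank, while B, which receives only half of A's vote, ends up lowest.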

ClusterRank • Yi Zhang, a student at UCCS, is the author, 2006. • The algorithm is based on Google’s PageRank. • Designed to speed up the PageRank calculation and to group similar Web pages together into clusters. • The original PageRank algorithm is applied to the clusters. • The rank is then distributed to the members of each cluster by weighted average.

ClusterRank (Cont’d) • Group all pages into clusters. • Perform first-level clustering for dynamically generated pages. • URLs are grouped on the part that precedes the “?” or “#” symbol. • Example: All URLs below will be grouped into one cluster • http://www.uccs.edu/057/cs_sub.shtml • http://www.uccs.edu/057/cs_sub.shtml#news • http://www.uccs.edu/057/cs_sub.shtml#dates • http://www.uccs.edu/057/cs_sub.shtml#spotlight
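
The first-level grouping can be sketched as follows. The function names are hypothetical, and the sketch assumes that the cluster key is simply the URL cut at the first “?” or “#”; the actual ClusterRank implementation may differ.

```python
from collections import defaultdict


def first_level_key(url):
    """Cut the URL at the first '?' or '#', whichever comes first."""
    for separator in ("?", "#"):
        url = url.split(separator, 1)[0]
    return url


def first_level_clusters(urls):
    """Group URLs that share the same key into one cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[first_level_key(url)].append(url)
    return dict(clusters)


urls = [
    "http://www.uccs.edu/057/cs_sub.shtml",
    "http://www.uccs.edu/057/cs_sub.shtml#news",
    "http://www.uccs.edu/057/cs_sub.shtml#dates",
    "http://www.uccs.edu/057/cs_sub.shtml#spotlight",
]
clusters = first_level_clusters(urls)
```

All four example URLs share the key http://www.uccs.edu/057/cs_sub.shtml and therefore land in a single cluster.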

ClusterRank (Cont’d) • Perform second level clustering on virtual directory and graph density. • URLs are grouped based on the last “/” symbol of the URL. • Density is calculated for the proposed clusters. • Approve the cluster based on the pre-set threshold value.
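
A rough sketch of the second-level step is given below. The slides do not state how density is computed, so the density formula (internal links divided by the number of possible ordered member pairs), the threshold of 0.5, the rule that a cluster is approved when its density reaches the threshold, and the example.edu URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def directory_key(url):
    """Everything up to and including the last '/' of the URL."""
    return url[: url.rfind("/") + 1]


def density(members, links):
    """Assumed density: internal links / possible ordered pairs of members."""
    n = len(members)
    if n < 2:
        return 0.0
    member_set = set(members)
    internal = sum(1 for src, dst in links
                   if src in member_set and dst in member_set and src != dst)
    return internal / (n * (n - 1))


def second_level_clusters(urls, links, threshold=0.5):
    """Propose one cluster per virtual directory, keep those dense enough."""
    proposed = defaultdict(list)
    for url in urls:
        proposed[directory_key(url)].append(url)
    return {key: members for key, members in proposed.items()
            if density(members, links) >= threshold}


a = "http://example.edu/cs/a.html"
b = "http://example.edu/cs/b.html"
c = "http://example.edu/math/c.html"
links = {(a, b), (b, a), (a, c)}  # set of unique (source, target) pairs
approved = second_level_clusters([a, b, c], links)
```

Here the two pages under /cs/ link to each other and are approved as a cluster, while the single page under /math/ is not.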

ClusterRank (Cont’d) • Calculate the rank for each cluster using the original PageRank algorithm. • Distribute the rank number to its members by weighted average by using: • PR = CR * Pi/Ci. • The notations here are: • PR: The rank of a member page • CR: The cluster rank from previous stage • Pi: The incoming links of this page • Ci: Total incoming links of this cluster.
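
The distribution step PR = CR * Pi/Ci can be sketched as below. The sketch assumes that Ci equals the sum of the incoming links of the cluster's members, and the even split for a cluster with no incoming links is an added assumption; the page names and numbers are made up.

```python
def distribute_cluster_rank(cluster_rank, incoming_links):
    """Apply PR = CR * Pi / Ci to every member page of one cluster."""
    total_incoming = sum(incoming_links.values())  # Ci (assumed: sum over members)
    if total_incoming == 0:
        # Assumption: with no incoming links at all, split the rank evenly.
        share = cluster_rank / len(incoming_links)
        return {page: share for page in incoming_links}
    return {page: cluster_rank * p_i / total_incoming
            for page, p_i in incoming_links.items()}


# Hypothetical cluster with rank 2.0 and three member pages
ranks = distribute_cluster_rank(2.0, {"page1": 6, "page2": 3, "page3": 1})
```

With these numbers page1 holds 6 of the 10 incoming links and therefore receives 60% of the cluster rank.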

SourceRank • James Caverlee, Ling Liu, and S. Webb from Georgia Institute of Technology, 2007. • The Web graph is represented as Sources. • A Source is a logical collection of Web pages. • Assigns a score to each page based on the overall quality of the Source that the page belongs to, through a random walk over Web Sources.

SourceRank (Cont’d) • Group all pages into Sources based on “Domain”. • URLs are grouped on the part up to the first “/” symbol after “http://”, i.e. the host name. • Example: All URLs below will be grouped into one Source • http://office.microsoft.com/en-us/default.aspx • http://office.microsoft.com/en-us/assistance/default.aspx • http://office.microsoft.com/en-us/assistance/CH790018071033.aspx
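
Grouping by domain can be sketched with the Python standard library. The function names are hypothetical, and treating the lower-cased host name as the Source key is an assumption about what “Domain” means here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def source_of(url):
    """Use the host name (the part before the first '/' after the scheme) as the Source."""
    return urlparse(url).netloc.lower()


def group_into_sources(urls):
    """Collect all URLs that share a host name into one Source."""
    sources = defaultdict(list)
    for url in urls:
        sources[source_of(url)].append(url)
    return dict(sources)


urls = [
    "http://office.microsoft.com/en-us/default.aspx",
    "http://office.microsoft.com/en-us/assistance/default.aspx",
    "http://office.microsoft.com/en-us/assistance/CH790018071033.aspx",
]
sources = group_into_sources(urls)
```

All three example URLs map to the single Source office.microsoft.com.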

SourceRank (Cont’d) • Calculate the rank for each Source with the original PageRank algorithm • Distribute the rank number to its members by weighted average by using: • PR = SR * Si • The notations here are: • PR: The rank of a member page • SR: The source rank from previous stage • Si: Total incoming unique links of this source

Truncated PageRank • L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates from Italy, 2006. • In PageRank, a Web page can gain a high Page-Rank score from supporters (in-links) that are topologically “close” to the target node. • Spammers can afford to influence only a few levels. • Truncated PageRank is similar to PageRank, except that the supporters that are too “close” to a target node do not contribute towards its ranking.

Truncated PageRank (Cont’d) • $PR(p) = \sum_{t} C\alpha^{t} \cdot M^{t} = \sum_{t} \mathrm{damping}(t) \cdot M^{t}$ • The notations here are: • $C$: Normalization constant • $\alpha$: The damping factor
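
The idea can be sketched as a sum over path lengths. The slide text does not spell out the damping function, so this sketch assumes damping(t) = 0 for t ≤ T and C·α^t for t > T, with C chosen so that the weights sum to 1. The toy graph, the parameter names, the fixed number of steps, and the requirement that every node has at least one out-link are also assumptions.

```python
def truncated_pagerank(out_links, alpha=0.85, truncation=1, max_steps=100):
    """Sum damping(t) * (uniform start pushed t steps along links), skipping t <= truncation."""
    nodes = list(out_links)
    n = len(nodes)
    # Normalisation constant C: makes the sum of C * alpha**t over t > T equal to 1.
    c = (1 - alpha) / alpha ** (truncation + 1)
    walk = {node: 1.0 / n for node in nodes}   # distribution after 0 steps
    score = {node: 0.0 for node in nodes}
    for t in range(max_steps + 1):
        if t > truncation:                     # supporters closer than this are ignored
            weight = c * alpha ** t
            for node in nodes:
                score[node] += weight * walk[node]
        # Advance the walk by one link (assumes every node has out-links).
        step = {node: 0.0 for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                step[dst] += walk[src] / len(out_links[src])
        walk = step
    return score


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = truncated_pagerank(graph, truncation=1)
```

Under these assumptions, setting truncation to -1 keeps every path length and gives a normalised form of ordinary PageRank, which makes the effect of the truncation easy to compare.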

WebGraph Package • Paolo Boldi and Sebastiano Vigna from Italy, 2004. • Represents the node graph efficiently using a differential compression technique. • Differential compression allows applications to compactly encode a new version of data with respect to a previous or reference version of the same data. • WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link. • WebBase is a repository of crawled Web pages from Stanford University.

WebGraph Package (Cont’d) • Node graph initial representation: • Node graph with Reference compression:

WebGraph Package (Cont’d) • Node graph with Differential compression: • Differential compression allows a link to be coded in less than one bit (not possible with plain Reference compression).
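
The idea of coding one successor list relative to a reference list can be sketched as follows. This is a simplified illustration, not WebGraph's actual on-disk format: it keeps one copy bit per reference successor and stores the remaining successors as gaps, and it omits the further steps (such as run-length coding of the copy bits and variable-length integer codes) that real compression relies on. The function names and the example lists are illustrative.

```python
def encode_with_reference(successors, reference):
    """Encode a sorted successor list relative to a reference successor list."""
    successor_set = set(successors)
    reference_set = set(reference)
    # One bit per reference successor: 1 if it is also a successor of this node.
    copy_bits = [1 if node in successor_set else 0 for node in reference]
    # Successors not found in the reference are stored as gaps (differences).
    extras = [node for node in successors if node not in reference_set]
    gaps, previous = [], 0
    for node in extras:
        gaps.append(node - previous)
        previous = node
    return copy_bits, gaps


def decode_with_reference(copy_bits, gaps, reference):
    """Rebuild the sorted successor list from copy bits, gaps and the reference list."""
    copied = [node for bit, node in zip(copy_bits, reference) if bit]
    extras, current = [], 0
    for gap in gaps:
        current += gap
        extras.append(current)
    return sorted(copied + extras)


reference = [13, 15, 16, 17, 18, 19, 23, 24, 203]
successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
copy_bits, gaps = encode_with_reference(successors, reference)
restored = decode_with_reference(copy_bits, gaps, reference)
```

Because neighbouring pages often share many links, most of a list is covered by the copy bits, and the leftover gaps tend to be small numbers that need few bits.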

WebGraph Package (Cont’d) • [Diagram: Link Structure From DB → Graph in ASCII format → Graph in BV format → PageRank Module]

BVGraph Details • BVGraph: Boldi Vigna Graph • A BVGraph is generated from a graph that is represented in ASCII format. • The first line contains the number of nodes ‘n’. Then ‘n’ lines follow, the i-th line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0 to n-1). The successors are separated by a single space.

BVGraph Details (Cont’d) • For example, consider a graph of three vertices, a, b, and c (a:0, b:1, c:2), consisting of the following edges: (a, b) (a, c) (b, c) (b, a). • This graph is expressed as four lines: “3”, “1 2”, “0 2”, and an empty line, since c has no successors.
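
The ASCII format described above can be produced with a few lines of code. This is a minimal sketch with a hypothetical helper name; it follows the stated rules (node count first, then one line of increasing successors per node) and uses the three-vertex example with a:0, b:1, c:2.

```python
def to_ascii_graph(num_nodes, edges):
    """Render a graph as text: node count, then one line of sorted successors per node."""
    successors = [set() for _ in range(num_nodes)]
    for src, dst in edges:
        successors[src].add(dst)
    lines = [str(num_nodes)]
    # A node without successors yields an empty line.
    lines += [" ".join(str(s) for s in sorted(succ)) for succ in successors]
    return "\n".join(lines) + "\n"


ids = {"a": 0, "b": 1, "c": 2}
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "a")]
text = to_ascii_graph(3, [(ids[s], ids[d]) for s, d in edges])
```

For the listed edges the output has the line “3”, then “1 2” for a, “0 2” for b, and an empty line for c.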

BVGraph – Current Implementation • The URLLinkStructure table in the database holds the linking information. • An ASCII graph is generated from the data in the URLLinkStructure table, and the BVGraph is then generated from the ASCII graph. • The ASCII graph is stored as basename.graph-txt. • The BVGraph is generated using the command: java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename

BVGraph – Current Implementation (Cont’d) • The graph can be generated for incoming links as well as outgoing links. • BVnode-in, BVnode-out, and BVSource-in graphs are generated. • A BVGraph can be loaded using two loading methods, load and loadOffline. • The load method is used for small graphs. • The loadOffline method is used for large graphs.

ClusterRank Using BVGraph

ClusterRank Using BVGraph (Cont’d) • Time gain using WebGraph for 300K URLs

Time Measure for Algorithms (in Seconds)

Time Measure for Algorithms (Cont’d)

Node In-Link Distribution across Nodes (4M URLs)

Cluster In-Link Distribution across Clusters (4M URLs)

Source In-Link Distribution across Sources (4M URLs)

Quality Measure for Algorithms • A survey on the quality of the ranking algorithms was performed by a group of people, using 25 search keywords. • The keywords were obtained from Google’s Keyword tool at: https://adwords.google.com/select/KeywordToolExternal • Listed below are the keywords identified.

Quality Measure for Algorithms (Cont’d) • The survey was performed to identify the following from each keyword search: • First page accuracy • Second page accuracy • Result order on the first page • Result order on the second page • Overall, are the important pages showing up early? • Overall, what percentage of the result hits is relevant?

Quality Measure For Algorithms (Cont’d)

Conclusion • The ClusterRank computation can be accelerated using WebGraph. • The SourceRank algorithm takes less time for the Page-Rank calculation than ClusterRank and is close to Truncated PageRank for the existing 4M URLs. • SourceRank has the best quality points of the three algorithms. • Considering both Efficiency and Quality, SourceRank is the best of the three for the existing data, based on the experiments performed.

Success Criteria • Identified the efficiency of each Page-Rank computation algorithm using the time measures generated by experiments. • Identified the quality of each algorithm using manual survey results. • Implemented the efficient algorithm for the Needle Search Engine at UCCS. • Upgraded the existing Needle Search Engine to 1 Million pages (crawled, actual URLs are 4 Million) from the current 50K URLs (crawled, actual URLs are 300K).

References • [1] Paolo Boldi, Sebastiano Vigna. The WebGraph Framework I: Compression Techniques. http://www2004.org/proceedings/docs/1p595.pdf • [2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing PageRank. http://cis.poly.edu/suel/papers/pagerank.pdf • [3] Taher H. Haveliwala. Efficient Computation of PageRank.

References (Cont’d) • [4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank Algorithm. • [5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide Web. • [6] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://www.cs.huji.ac.il/~csip/1999-66.pdf

References (Cont’d) • [7] Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms. http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdf • [8] Gonzalo Navarro. Compressing Web Graphs like Texts. • [9] The Spider's Apprentice. http://www.monash.com/spidap1.html

References (Cont’d) • [10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient Web Ranking via Influence Throttling. http://www-static.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf • [11] G. Jeh, J. Widom. SimRank: A Measure of Structural-Context Similarity. http://www-cs-students.stanford.edu/~glenj/simrank.pdf • [12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. Technical report, 2006.
