Random Walking on the World Wide Web Project Presentation

Team members: Levin Boris, Laserson Itamar
Instructor: Gurevich Maxim

Introduction
• Statistics about web pages are very important
• Use a random sample of web pages to approximate:
  • search engine coverage
  • domain name distribution (.com, .org, .edu)
  • average number of links in a page
  • average page length
• The goal: develop a cheap method to sample uniformly from the Web

Random Walker
• A random walk on a graph provides a sample of nodes
• If the graph is undirected and regular, the sample is uniform
• Problem: the Web is neither undirected nor regular
• Incrementally create an undirected regular graph with the same nodes as the Web
• Perform the walk on this graph
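As a reminder of why regularity matters (a standard fact about random walks, not taken from the slides): on a connected undirected graph with node set V and edge set E, the stationary distribution of the simple random walk is proportional to node degree, so it becomes uniform when every node has the same degree d.

```latex
\pi(v) = \frac{\deg(v)}{2|E|}
\qquad\Longrightarrow\qquad
\deg(v) = d \ \text{for all } v \in V
\ \text{gives}\ 
\pi(v) = \frac{d}{d\,|V|} = \frac{1}{|V|}.
```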

WebWalker
• Follow a random out-link or a random in-link at each step
• Use weighted self loops to even out pages' degrees: w(v) = degmax - deg(v)
[Figure: example web graph with nodes such as amazon.com and netscape.com, annotated with numeric weights]

WebWalker
• A random walk on a connected, undirected, regular graph converges to a uniform stationary distribution.
• Pseudo code:
  Webwalker(v):
  - Spend expected degmax/deg(v) steps at v
  - Pick a random link incident to v (either v → u or u → v)
  - Webwalker(u)
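The pseudo code above can be sketched in Python on a small in-memory graph. This is only an illustration under stated assumptions: the names `TOY_GRAPH`, `DEG_MAX`, `webwalker_step`, and `stationary_distribution` are hypothetical, the graph is already undirected and fully known, and `DEG_MAX` is simply the largest degree in the toy graph. A real walker would discover links while crawling and would use a fixed upper bound on page degrees.

```python
import random

# Hypothetical toy graph, stored as undirected adjacency lists.
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}
DEG_MAX = 3  # assumed upper bound on any node's degree


def webwalker_step(graph, v, deg_max, rng):
    """One step of the walk: stay on a self loop or follow a random link."""
    neighbours = graph[v]
    # The self-loop weight is deg_max - deg(v), so the walk leaves v with
    # probability deg(v) / deg_max and spends deg_max / deg(v) steps at v
    # in expectation.
    if rng.random() < len(neighbours) / deg_max:
        return rng.choice(neighbours)
    return v


def stationary_distribution(graph, deg_max, iterations=2000):
    """Approximate the walk's limiting distribution by power iteration."""
    nodes = sorted(graph)
    dist = {v: (1.0 if v == nodes[0] else 0.0) for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            deg = len(graph[v])
            nxt[v] += dist[v] * (deg_max - deg) / deg_max  # self loops
            for u in graph[v]:
                nxt[u] += dist[v] / deg_max  # one unit of weight per link
        dist = nxt
    return dist


if __name__ == "__main__":
    rng = random.Random(0)
    v = "a"
    for _ in range(10):
        v = webwalker_step(TOY_GRAPH, v, DEG_MAX, rng)
    print("node after 10 steps:", v)
    print(stationary_distribution(TOY_GRAPH, DEG_MAX))
```

On this toy graph the power iteration settles on equal probability for every node, which is the behaviour the self loops are meant to produce.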

MD and MH Algorithms
Maximum-Degree
• The algorithm adds self loops to nodes.
• The self loops cause the random walk to stay at these web pages (nodes).
• This corrects the bias in the trial distribution.
Metropolis-Hastings
• The algorithm gives preference to smaller documents by reducing the probability of stepping to large documents.
• This corrects the bias caused by large documents that contain a large number of phrases.
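To make the two corrections concrete, here is a minimal sketch of the transition probabilities each rule produces when the target distribution is uniform. It assumes a small undirected graph held in memory, with node degree standing in for document size. The names `TOY_GRAPH`, `md_transition_probs`, and `mh_transition_probs` are hypothetical, and the acceptance rule min(1, deg(x)/deg(y)) is the textbook Metropolis-Hastings rule for a uniform target, not necessarily the exact variant used in the project.

```python
# Hypothetical toy graph, stored as undirected adjacency lists
# (no self edges, no duplicate neighbours).
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}


def md_transition_probs(graph, x, deg_max):
    """Maximum-Degree: each link gets probability 1/deg_max, the rest is a self loop."""
    probs = {y: 1.0 / deg_max for y in graph[x]}
    probs[x] = 1.0 - sum(probs.values())
    return probs


def mh_transition_probs(graph, x):
    """Metropolis-Hastings: propose a uniform neighbour, accept with min(1, deg(x)/deg(y))."""
    deg_x = len(graph[x])
    probs = {}
    for y in graph[x]:
        accept = min(1.0, deg_x / len(graph[y]))  # steps to larger nodes are rejected more often
        probs[y] = accept / deg_x
    probs[x] = 1.0 - sum(probs.values())  # rejected proposals keep the walk at x
    return probs


if __name__ == "__main__":
    for node in sorted(TOY_GRAPH):
        print(node, "MD:", md_transition_probs(TOY_GRAPH, node, deg_max=3))
        print(node, "MH:", mh_transition_probs(TOY_GRAPH, node))
```

Under both rules the probability of stepping from x to y equals the probability of stepping from y to x, which is why the uniform distribution is stationary. The practical difference is that Maximum-Degree needs a degree bound in advance, while Metropolis-Hastings only needs the degrees of the current and proposed nodes.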

Project description
• Implement the WebWalker algorithm
• Design a simulation framework
• Compare the results to the search-based random walks from our previous project
• Analyze and display the results

Software – Class Diagram (utility)

Software – Class Diagram (1)

Software – Class Diagram (2)

Software – Class Diagram (Result Analyzer)

Designing the Simulation Framework
• Planning a series of simulations that test different parameters of the algorithms
• Considering "bottlenecks" such as the Yahoo daily query limit and hard-disk space
• Measuring the effect of each parameter on the algorithm
• Running the simulations in the software lab on several computers at a time

Analysis Criteria
• Similarity
• Unique hosts visited
• Final similarity
• Convergence

Results – Similarity

Results – Unique hosts visited

Results – Convergence

Results – SE vs. WW
