This lesson delves into the concept of the Web Graph (W), a directed graph where vertices represent web pages and edges represent links between them. We explore the structure of the web, its growth metrics such as the number of web hosts and pages, dynamic vs. static content, and insights from recent studies. Key findings reveal that the web's size is constantly changing, with billions of pages, many of which remain elusive to search engines. The lesson further discusses power laws observed in web networks and the average distances within the graph, highlighting the interconnected nature of online content.
Web Data Mining • Lecture 2 • The Web Graph • Academic Year 2006/2007
The Web Graph • The linkage structure of web pages forms a graph. • The Web Graph (hereinafter called W) is a directed graph W = (V,E). • V is the vertex set; each vertex represents a page in the Web. • E is the edge set; a directed edge (u,v) exists whenever the page represented by u contains a link to the page represented by v.
A Toy Example of W • [Figure: four pages, numbered 1 to 4, whose links (Link11, Link12, Link21, Link22, Link31, Link41) form the directed edges of W] • V = {1,2,3,4} • E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}
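As a minimal sketch, the toy graph can be stored either as adjacency lists or as an adjacency matrix. The function names `adjacency_lists` and `adjacency_matrix` are illustrative choices, not part of the lecture.

```python
# Toy web graph from the slide: pages 1..4 and their directed links.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]

def adjacency_lists(vertices, edges):
    """Map each page to the list of pages it links to."""
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    return out_links

def adjacency_matrix(vertices, edges):
    """Return a 0/1 matrix A with A[i][j] = 1 iff page i links to page j."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

out_links = adjacency_lists(V, E)
A = adjacency_matrix(V, E)
print(out_links)  # {1: [2, 4], 2: [3, 4], 3: [1], 4: [3]}
for row in A:
    print(row)
```

Adjacency lists are usually the more practical choice for a graph as sparse as the Web; the matrix form is shown only because it makes the toy example easy to inspect.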
The size of W • What is being measured? • Number of hosts • Number of (static) HTML pages • Volume of data • Number of hosts: Netcraft survey • http://news.netcraft.com/archives/web_server_survey.html • Monthly report on how many web hosts and servers are out there • Number of pages: numerous estimates • Recently Yahoo announced an index with 20B pages.
The “real” size of W • The web is really infinite • Dynamic content, e.g. calendars, online organizers, etc. • http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html • Static web contains syntactic duplication, mostly due to mirroring (~ 20-30%) • Some servers are seldom connected.
Recent Measurement of W • [Gulli & Signorini, 2005]. Total web > 11.5B pages. • 2.3B pages are unknown to popular search engines. • 35-120B pages are within the hidden web. • The index intersection between the largest available search engines (Google, Yahoo!, MSN, AskJeeves) is estimated to be 28.8%.
Evolution of W • All of these numbers keep changing. • Relatively few scientific studies of the evolution of the web [Fetterly & al., 2003] • http://research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf • Sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001] • http://www.vldb.org/conf/2001/P069.pdf
Rate of change • There are a number of different studies analyzing the rate of change of pages in V. • [Cho & al., 2000] 720K pages from 270 popular sites sampled daily from Feb 17 to Jun 14, 1999 • Any changes: 40% weekly, 23% daily • [Fetterly & al., 2003] Massive study: 151M pages checked over a few months • Significantly changed: 7% weekly • Slightly changed: 25% weekly • [Ntoulas & al., 2004] 154 large sites re-crawled from scratch weekly • 8% new pages / week • 8% die • 5% new content • 25% new links / week
The Power of Power Laws • A power law relationship between two scalar quantities $x$ and $y$ is one where the relationship can be written as $y = a x^k$, where $a$ (the constant of proportionality) and $k$ (the exponent of the power law) are constants. • Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics. • Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.
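Taking logarithms of a power law gives log y = log a + k log x, a straight line in log-log space, so a and k can be estimated by a linear fit. The sketch below assumes noise-free data; the values a = 3 and k = -2 are arbitrary illustrations, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns estimates (a, k)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    k = sxy / sxx                       # slope of the line = exponent
    a = math.exp(mean_y - k * mean_x)   # intercept = log of the constant
    return a, k

# Synthetic, noise-free data with a = 3 and k = -2 (illustrative values only).
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0 * x ** -2.0 for x in xs]
a_hat, k_hat = fit_power_law(xs, ys)
print(a_hat, k_hat)
```

On real, noisy data a plain log-log regression can be biased, so it serves here only to illustrate the linear relationship.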
Power Law Probability Distributions • Sometimes called heavy-tail or long-tail distributions. • Examples of power law probability distributions: • The Pareto distribution, for example, the distribution of wealth in capitalist economies • Zipf's law, for example, the frequency of unique words in large texts http://wordcount.org/main.php • Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web) • Frequency of events or effects of varying size in self-organized critical systems, e.g. Gutenberg-Richter Law of earthquake magnitudes and Horton's laws describing river systems
The in/out-degree Power law trend:
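To check for such a trend on a crawl, one first needs the in-degree and out-degree histograms. A minimal sketch on the toy edge set from the earlier slide follows; `degree_distributions` is an illustrative name, and pages that appear in no edge are not counted.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_hist, out_hist): how many nodes have each in/out-degree."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    # Only nodes that appear in at least one edge are considered here.
    nodes = {v for edge in edges for v in edge}
    # A Counter returns 0 for nodes with no incoming (or outgoing) links.
    in_hist = Counter(in_deg[v] for v in nodes)
    out_hist = Counter(out_deg[v] for v in nodes)
    return in_hist, out_hist

E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]
in_hist, out_hist = degree_distributions(E)
print(dict(in_hist))   # number of pages per in-degree value
print(dict(out_hist))  # number of pages per out-degree value
```

Four nodes are far too few to show a power law; on a large graph the histogram would be plotted on log-log axes, where a power law appears as a straight line.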
Random Graphs • Random graphs (RGs) are structures introduced by Paul Erdős and Alfréd Rényi. • There are several models of RGs. We are concerned with the model $G_{n,p}$. • A graph $G = (V,E) \in G_{n,p}$ is such that $|V| = n$ and each possible edge $(u,v)$ is included in $E$ independently at random with probability $p$.
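A minimal sketch of sampling from this model, assuming the classical undirected variant in which each of the n(n-1)/2 possible edges is an independent coin flip. The name `sample_gnp` and the `seed` parameter are illustrative.

```python
import itertools
import random

def sample_gnp(n, p, seed=None):
    """Sample an undirected graph on vertices 0..n-1 with edge probability p."""
    rng = random.Random(seed)
    vertices = list(range(n))
    edges = [
        (u, v)
        for u, v in itertools.combinations(vertices, 2)
        if rng.random() < p  # keep each possible edge with probability p
    ]
    return vertices, edges

vertices, edges = sample_gnp(n=10, p=0.3, seed=42)
print(len(vertices), len(edges))
```

This direct approach takes time quadratic in n, which is fine for experiments but not for graphs of Web scale.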
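W cannot be a RG • Let $X_k$ be the number of nodes having degree equal to $k$. • In $G_{n,p}$ the degree of each node is binomially distributed, so the expected value of $X_k$ is $E(X_k) = n \binom{n-1}{k} p^k (1-p)^{n-1-k}$. • $X_k$ is asymptotically distributed as a Poisson variable with mean $\lambda_k = E(X_k)$. • Degrees in $G_{n,p}$ therefore concentrate around $(n-1)p$ with a rapidly decaying tail, which is incompatible with the power-law degree distribution observed in W.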
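The expected number of nodes of degree k in G(n,p) follows from the binomial degree distribution of a single node, n · C(n-1, k) · p^k · (1-p)^(n-1-k). The sketch below evaluates this formula to show how quickly the counts vanish for large k; the values n = 200 and p = 0.025 are arbitrary, and `expected_degree_count` is an illustrative name.

```python
import math

def expected_degree_count(n, p, k):
    """Expected number of nodes with degree exactly k in G(n, p)."""
    return n * math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

n, p = 200, 0.025  # illustrative values; mean degree (n - 1) * p is about 5
for k in (0, 5, 10, 20, 50):
    print(k, expected_degree_count(n, p, k))

# The expected counts over all possible degrees add up to n.
total = sum(expected_degree_count(n, p, k) for k in range(n))
print(total)
```

Under a power law the count of high-degree nodes would shrink only polynomially in k, whereas here it collapses far faster, so hubs are effectively absent.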
The avg distance of a graph G • Let $u, v \in V$ be two nodes of G. • Let $d(u,v)$ be the distance from $u$ to $v$, expressed as the length of the shortest path connecting $u$ to $v$. If $u$ and $v$ are not connected then the distance is set to $\infty$. • Define $L(G) = \frac{1}{|S|} \sum_{(u,v) \in S} d(u,v)$, where $S$ is the set of pairs of distinct nodes $u, v$ of G with the property that $d(u,v)$ is finite.
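The average distance can be computed with one breadth-first search per node, averaging only over pairs joined by a finite path. The sketch below assumes directed distances and uses the toy graph from the earlier slide; `bfs_distances` and `average_distance` are illustrative names.

```python
from collections import deque

def bfs_distances(out_links, source):
    """Shortest-path lengths from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in out_links.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(vertices, edges):
    """Mean of d(u, v) over ordered pairs of distinct nodes with finite d."""
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    total, pairs = 0, 0
    for u in vertices:
        for v, d in bfs_distances(out_links, u).items():
            if v != u:  # unreachable pairs never appear, so they are skipped
                total += d
                pairs += 1
    return total / pairs if pairs else float("inf")

V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]
print(average_distance(V, E))  # 19/12, about 1.583
```

Running a full BFS from every node is quadratic in the worst case, so on graphs of Web scale the average is usually estimated from a sample of source nodes.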
The avg distance of W • A small world graph is a graph whose avg distance is much smaller than the order of the graph. • For instance $L(G) \in O(\log |V(G)|)$. • $L(W)$ is about 7. • $L_d(W)$ is about 18.
What's the best model for W? • A graph model for the web should have (at least) the following features: • On-line property. The number of nodes and edges changes with time. • Power law degree distribution. The degree distribution follows a power law, with an exponent > 2. • Small world property. The average distance is much smaller than the order of the graph. • Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges. • It is still an open problem to find a web graph model that produces graphs which provably have all four properties.
W Models proposed so far. • [Bollobas & al., 2001]. Linearized Chord Diagram (LCD). • [Aiello & al., 2001]. ACL. • [Chung & al., 2003]. CL. • [Kumar & al., 1999]. Copying model. • [Chung & al., 2004]. CL-del growth-deletion model. • [Cooper & al., 2004]. CFV.
References • [Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903. • [Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678. • [Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins: Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).
References • [Cho & al., 2000]. Junghoo Cho, Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209. • [Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12. • [Bollobas & al., 2001]. Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol 18, 2001, 279-290.
References • [Aiello & al., 2001]. William Aiello, Fan R. K. Chung, Linyuan Lu. Random Evolution in Massive Graphs. FOCS 2001: 510-519. • [Chung & al., 2003]. Fan R. K. Chung, L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics. 1(2003): 91-114. • [Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41st FOCS. 2000, pp. 57-65.
References • [Chung & al., 2004]. F. Chung, L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics. Vol 1 (2003). 409-461. • [Cooper & al., 2004]. C. Cooper, A. Frieze, J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics. Vol 1 (2003). 463 - 483.