1 / 26

Mining di dati web

Mining di dati web. Lezione n° 2 Il grafo del Web A.A 2006/2007. The Web Graph. The linkage structure of Web Pages forms a graph structure. The Web Graph (hereinafter called W ) is a directed graph W = (V,E) V is the vertex set and each vertex represents a page in the Web.

pearl
Télécharger la présentation

Mining di dati web

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining di dati web Lezione n° 2 Il grafo del Web A.A 2006/2007

  2. The Web Graph • The linkage structure of Web Pages forms a graph structure. • The Web Graph (hereinafter called W) is a directed graph W = (V,E) • V is the vertex set and each vertex represents a page in the Web. • E is the edge set and each directed edge (e1,e2) exists whenever a link appears in the page represented by e1 to the page represented by e2.

  3. 2 1 Link21 Link11 Link22 Link12 4 Link31 3 Link41 1 2 3 4 1 1 2,4 2,2 = 1 0 1 1 2 2 2 3,1 3,4 0 = 1 1 3 3 3 1 1 1 0 = = 4 4 4 3 3 0 0 1 = A Toy Example of W V= {1,2,3,4} E= {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}

  4. A More Realistic W

  5. The size of W • What is being measured? • Number of hosts • Number of (static) html pages • Volume of data • Number of hosts - netcraft survey • http://news.netcraft.com/archives/web_server_survey.html • Monthly report on how many web hosts & servers are out there! • Number of pages - numerous estimates • Recently Yahoo announced an index with 20B pages.

  6. The “real” size of W • The web is really infinite • Dynamic content, e.g. calendars, online organizers, etc. • http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html • Static web contains syntactic duplication, mostly due to mirroring (~ 20-30%) • Some servers are seldom connected.

  7. Recent Measurement of W • [Gulli & Signorini, 2005]. Total web > 11.5B. • 2.3B the pages unknown to popular Search Engines. • 35-120B of pages are within the hidden web. • The index intersection between the largest available search engines -- namely Google, Yahoo!, MSN, AskJeeves -- is estimated to be 28.8%.

  8. Evolution of W • All of these numbers keep changing. • Relatively few scientific studies of the evolution of the web [Fetterly & al., 2003] • http://research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf • Sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001] • http://www.vldb.org/conf/2001/P069.pdf

  9. Rate of change • There a number of different studies analyzing the rate of changes of pages in V. • [Cho & al., 2000] 720K pages from 270 popular sites sampled daily from Feb 17 - Jun 14, 1999 • Any changes: 40% weekly, 23% daily • [Fetterly & al., 2003] Massive study 151M pages checked over few months • Significant changed -- 7% weekly • Slightly changed -- 25% weekly • [Ntoulas & al., 2004] 154 large sites re-crawled from scratch weekly • 8% new pages / week • 8% die • 5% new content • 25% new links/week

  10. Rate of change [Fetterly & al., 2003]

  11. Rate of change [Ntoulas & al., 2004]

  12. The Bow-Tie Structure

  13. The Power of Power Laws • A power law relationship between two scalar quantities x and y is one where the relationship can be written as y= axk where a (the constant of proportionality) and k (the exponent of the power law) are constants. • Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics. • Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.

  14. Power Law Probability Distributions • Sometimes called heavy-tail or long-tail distributions. • Examples of power law probability distributions: • The Pareto distribution, for example, the distribution of wealth in capitalist economies • Zipf's law, for example, the frequency of unique words in large texts http://wordcount.org/main.php • Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web) • Frequency of events or effects of varying size in self-organized critical systems, e.g. Gutenberg-Richter Law of earthquake magnitudes and Horton's laws describing river systems

  15. The in/out-degree Power law trend:

  16. Random Graphs • RGs are structures introduced by Paul Erdos and Alfred Reny. • There are several models of RGs. We are concerned with the model Gn,p. • A graph G = (V,E)  Gn,p is such that |V|=n and an edge (u,v)  E is selected uniformly at random with probability p.

  17. W cannot be a RG • Let Xk be a discrete value indicating the number of nodes having degree equal to k. • Obviously in Gn,p the expected value of XpE(Xp) is . • Xk is asintotically distributed as a Poisson variable with mean k.

  18. The avg distance of a graph G • Let u, vV be two nodes of G. • Let d(u,v) be the distance from u to v expressed as the length of the shortest path connecting u to v. If u and v are not connected then the distance is set to . • Definewhere S is the set of pairs of distinct nodes u, v of W with the property that d(u,v) is finite.

  19. The avg distance of W • A small world graph is a graph whose avg distance is much smaller that the order of the graph. • For instance L(G)  O(log(|V(G)|)). • L(W) is about 7. • Ld(W) is about 18

  20. It is still an open problem to find a web graph model that produces graphs which provably has all four properties. What’s the best model for W? • A graph model for the web should have (at least) the following features: • On-line property. The number of nodes and edges changes with time. • Power law degree distribution. The degree distribution follows a power law, with an exponent >2. • Small world property. The average distance is much smaller that the order of the graph. • Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.

  21. W Models proposed so far. • [Bollobas & al., 2001]. Linearized Chord Diagram (LCD). • [Aiello & al., 2001]. ACL. • [Chung & al., 2003]. CL. • [Kumar & al., 1999]. Copying model. • [Chung & al., 2004]. CL-del growth-deletion model. • [Cooper & al., 2004]. CFV.

  22. General Characteristics

  23. References • [Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903. • [Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678. • [Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins: Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).

  24. References • [Cho & al., 2000]. Junghoo Cho, Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209. • [Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12. • [Bollobas & al., 2001]. Bela Bollobas, Oliver Riordan, G. Tusnary and Joel Spencer. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol 18, 2001, 279-290.

  25. References • [Aiello & al., 2001]. William Aiello, Fan R. K. Chung, Linyuan Lul. Random Evolution in Massive Graphs. FOCS 2001: 510-519. • [Chung & al., 2003]. Fan R. K. Chung, L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics. 1(2003): 91-114. • [Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41th FOCS. 2000, pp. 57-65.

  26. References • [Chung & al., 2004]. F. Chung, L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics. Vol 1 (2003). 409-461. • [Cooper & al., 2004]. C. Cooper, A. Frieze, J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics. Vol 1 (2003). 463 - 483.

More Related