
# Search Algorithms Winter Semester 2004/2005 20 Dec 2004 10th Lecture

##### Presentation Transcript

1. Search Algorithms, Winter Semester 2004/2005, 20 Dec 2004, 10th Lecture. Christian Schindelhauer, schindel@upb.de

2. Chapter III: Searching the Web, 20 Dec 2004

3. Searching the Web • Introduction • The Anatomy of a Search Engine • Google’s Pagerank algorithm • The Simple Algorithm • Periodicity and convergence • Kleinberg’s HITS algorithm • The algorithm • Convergence • The Structure of the Web • Pareto distributions • Search in Pareto-distributed graphs

4. The Webgraph • $G_{WWW}$: • Static HTML pages are nodes • Links are directed edges • Outdegree of a node: number of links on a web page • Indegree of a node: number of links pointing to a web page • Directed path from node u to v: • a series of web pages in which one follows links from page u to page v • Undirected path $(u = w_0, w_1, \dots, w_{m-1}, w_m = v)$ from page u to page v: • for all i, there is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$ • Strongly (weakly) connected subgraph: • the minimal node set that includes all nodes with a directed (undirected) path from and to a reference node
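The definitions on this slide can be made concrete on a toy graph. The following is a minimal sketch, not part of the lecture: the graph `TOY_WEB` and all function names are made up for illustration, and the graph is stored as a dictionary that maps every page to the list of pages it links to.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to (every page appears as a key).
TOY_WEB = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["d"]}

def out_degrees(graph):
    """Number of links on each page."""
    return {u: len(vs) for u, vs in graph.items()}

def in_degrees(graph):
    """Number of links pointing to each page."""
    deg = {u: 0 for u in graph}
    for targets in graph.values():
        for v in targets:
            deg[v] += 1
    return deg

def reverse(graph):
    """Same nodes, every edge flipped."""
    rev = {u: [] for u in graph}
    for u, targets in graph.items():
        for v in targets:
            rev[v].append(u)
    return rev

def reachable(graph, start):
    """All nodes that can be reached from start by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strong_component(graph, ref):
    """Nodes with a directed path both from and to ref."""
    return reachable(graph, ref) & reachable(reverse(graph), ref)

def weak_component(graph, ref):
    """Nodes connected to ref when edge directions are ignored."""
    rev = reverse(graph)
    undirected = {u: graph[u] + rev[u] for u in graph}
    return reachable(undirected, ref)
```

In this toy graph, pages a, b and c link in a cycle, so they form the strongly connected component of a, while d and e belong only to its weakly connected component.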

5. The Web-Graph (1999)

6. Distributions of indegree/outdegree • In- and out-degree obey a power law • i.e. in- and out-degree i appear with probability ~ $1/i^{\alpha}$ • According to experiments of • Kumar et al. 97: 40 million web pages • Barabási et al. 99: domain *.nd.edu plus web pages at distance 3 • Broder et al. 00: 204 million web pages (scans in May and Oct 1999)

7. Is the Web-Graph a Random Graph? No! • Random graph $G_{n,p}$: • n nodes • Every directed edge occurs with probability p • Is the Web-graph a random graph $G_{n,p}$? • In $G_{n,p}$ the probability of high degrees decreases exponentially • In a random graph the degrees are distributed according to a Poisson distribution • Therefore: the degree distribution of a random graph does not obey a power law
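To see how differently the two tails behave, one can compare a Poisson degree distribution with a power-law one numerically. This is a small sketch under assumptions not taken from the slides: the mean degree 5.0, the exponent 2.1, the truncation of the normalizing sum after `terms` summands, and the function names are all chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P[degree = k] for a Poisson distribution with mean lam (log-space to avoid overflow)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def power_law_pmf(k, alpha, terms=100_000):
    """P[degree = k] proportional to k^(-alpha); the zeta sum is truncated after `terms` summands."""
    norm = sum(i ** -alpha for i in range(1, terms + 1))
    return k ** -alpha / norm

if __name__ == "__main__":
    # Print both probabilities side by side for a few degrees.
    for k in (1, 10, 100):
        print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.1))
```

For degree 100 the Poisson probability is many orders of magnitude smaller than the power-law probability, which is the point of the slide: very high degrees are practically absent in $G_{n,p}$ but not in the web graph.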

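8. Pareto Distribution • Discrete Pareto (power law) distribution for $x \in \{1,2,3,\dots\}$: $P[X = x] = \frac{1}{\zeta(\alpha)\,x^{\alpha}}$ • with constant factor $\zeta(\alpha) = \sum_{i=1}^{\infty} i^{-\alpha}$ (also known as the Riemann zeta function) • Heavy tail property • not all moments $E[X^k]$ are defined • Expected value exists if and only if α > 2 • Variance and $E[X^2]$ exist if and only if α > 3 • $E[X^k]$ is defined if and only if α > k+1 • Density function of the continuous distribution for $x > x_0$: $f(x) = \frac{\alpha-1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha}$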
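The moment conditions can be checked numerically by watching how the partial sums of $\sum_x x^k x^{-\alpha}$ grow. The sketch below is only an illustration: the function names and the two cut-offs 1,000 and 100,000 are arbitrary assumptions, and the normalization factor is left out because it does not affect convergence.

```python
def partial_moment_sum(alpha, k, n):
    """Unnormalized partial sum of E[X^k]: sum of x^k * x^(-alpha) for x = 1..n."""
    return sum(x ** (k - alpha) for x in range(1, n + 1))

def growth(alpha, k, n_small=1_000, n_large=100_000):
    """Ratio of two partial sums; close to 1 when the series has essentially converged."""
    return partial_moment_sum(alpha, k, n_large) / partial_moment_sum(alpha, k, n_small)

if __name__ == "__main__":
    print("alpha=2.5, k=1:", growth(2.5, 1))  # alpha > k+1: the sum settles
    print("alpha=1.5, k=1:", growth(1.5, 1))  # alpha < k+1: the sum keeps growing
```

A ratio near 1 indicates that the moment exists, while a ratio that keeps growing with the cut-off indicates divergence. At the boundary α = k+1 the sum diverges only logarithmically, so this crude test is not conclusive there.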

9. Special Case: Zipf Distribution • George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that $f(n) \cdot n = c$ • Zipf probability distribution for $x \in \{1,2,3,\dots,n\}$: $P[X = x] = c/x$ • with constant factor $c = 1/\sum_{i=1}^{n} \frac{1}{i}$, only defined for finite sets, since $\sum_{i=1}^{n} \frac{1}{i}$ tends to infinity for growing n • Zipf distributions refer to ranks • The Zipf exponent can be larger than 1, i.e. $f(n) = c/n^{\alpha}$ • Pareto distributions refer to absolute size • e.g. number of inhabitants

10. Pareto Distribution (I) • Examples of power laws (= Pareto distributions) • Pareto 1897: Wealth/income in population • Yule 1944: Word frequency in languages • Zipf 1949: Size of towns • Length of molecule chains • File length of UNIX files • … • Access density of web pages • Access density of a web surfer at a particular web page • …

11. City Size Distribution. Scaling Laws and Urban Distributions, Denise Pumain, 2003. Zipf distribution

12. Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002. Pareto distribution

13. Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002

14. Zipf’s Law and the Internet, Lada A. Adamic, Bernardo A. Huberman, 2002

15. Heavy-Tailed Probability Distributions in the World Wide Web, Mark Crovella, Murad Taqqu, Azer Bestavros, 1996

16. Size of connected components • The sizes of strongly and weakly connected components obey a power law • A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309–320. Amsterdam: Elsevier Science, 2000. • The largest weakly connected component contains 91% of all web pages • The largest strongly connected component contains 28% of all web pages • Diameter ≥ 28

17. Searching in Power Law Networks • Task: • Given a network with undirected edges • Degrees follow a power law • Starting from a source node, find a target node • Features: • Keep it simple: no markers • Visit one node at a time • Every node knows its neighbors (and their degrees) • From Adamic, Lukose, Puniyani, Huberman, “Search in power-law networks”, Physical Review E, Vol. 64, 046135 • Three approaches: • Neighbors of random nodes • Neighbors of a random walk: start with a random node and continue with a random neighbor • Neighbors of high-degree seeking: start with a random node and prefer neighbors with larger degree

18. Power Law Networks • Undirected graph of n nodes • The probability that a node has k neighbors is $p_k$, where $p_k = c\,k^{-\tau}$ for a normalization factor c • For search in power law networks: • Consider the largest connected component and an exponent τ with 2 < τ < 3 • Theorem: For large enough power law graphs with exponent τ: • For τ < 1 the graph is almost surely connected • For 1 < τ < 2 there is a giant connected component of size Θ(n) • For 2 < τ < 3.4785 there is a giant component and all smaller components are of size O(log n) • For τ > 3.4785 the graph almost surely has no giant component, i.e. all components have size o(n) • For τ > 4 the sizes of all connected components follow a power law • (William Aiello, Fan Chung, Linyuan Lu, “A Random Graph Model for Massive Graphs”, Symposium on Theory of Computing (STOC), 2000)

19. Random Walk • Random walk: • Start with a random node as node u • while no neighbor of u is the target do • u ← random neighbor of u • od • Theorem: In undirected connected graphs, every node is visited by a random walk with probability proportional to its degree (in the long run). • Conclusion: High-degree nodes are preferred • Possible improvements: • Avoid going back • Avoid visiting already visited nodes • Also scan second-degree neighbors for the target node • Figure: RW, random walk in a power law graph with exponent 2.1, avoiding going back, with second-degree scanning
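The random-walk search on this slide might be sketched as follows. This is an illustrative version, not the lecture's implementation: the adjacency-list format, the step limit `max_steps`, the passed-in random generator, and the toy graph `STAR` are assumptions. It includes the "avoid going back" improvement and assumes a connected graph with at least two nodes.

```python
import random

# Hypothetical toy graph: node 0 is a hub connected to five leaves.
STAR = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

def random_walk_search(adj, start, target, rng, max_steps=10_000):
    """Walk randomly until the target is the current node or one of its neighbors.

    Returns the number of steps taken, or None if max_steps is exceeded.
    """
    u, prev = start, None
    for steps in range(max_steps + 1):
        if u == target or target in adj[u]:
            return steps
        # Avoid going straight back, unless the previous node is the only neighbor.
        choices = [v for v in adj[u] if v != prev] or list(adj[u])
        prev, u = u, rng.choice(choices)
    return None

if __name__ == "__main__":
    print(random_walk_search(STAR, 1, 3, random.Random(0)))
```

Passing the generator explicitly (here `random.Random(0)`) keeps runs reproducible, which is convenient when comparing search strategies on the same graph.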

20. Degree Seeking • Degree seeking: • Start with a random node as node u • while no neighbor of u is the target do • u ← neighbor of u with the highest degree that has not been visited so far • od • Improvement: Also scan second-degree neighbors for the target • Observation: The search in power law networks is considerably faster • Why? • Figure: RW, random walk in a power law graph with exponent 2.1; DS, degree seeking in the same graph, avoiding already visited neighbors, with second-degree scanning
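Degree seeking can be sketched in the same style. The slide does not say what happens when all neighbors of the current node have already been visited; the version below assumes that the search then steps back along its own path, which is a design choice made here and not something taken from the lecture. The graph `HUBS` and the function name are likewise hypothetical.

```python
# Hypothetical toy graph with two hubs (0 and 3) and a target hidden behind node 7.
HUBS = {
    0: [1, 2, 3], 1: [0], 2: [0],
    3: [0, 4, 5, 6, 7], 4: [3], 5: [3], 6: [3],
    7: [3, 8], 8: [7],
}

def degree_seeking_search(adj, start, target):
    """Always move to the unvisited neighbor with the highest degree.

    Returns the number of moves until the target is the current node or one of
    its neighbors, or None if the target is not in the start node's component.
    """
    path = [start]          # current walk, used for stepping back (assumed fallback)
    visited = {start}
    steps = 0
    while path:
        u = path[-1]
        if u == target or target in adj[u]:
            return steps
        unvisited = [v for v in adj[u] if v not in visited]
        if unvisited:
            nxt = max(unvisited, key=lambda v: len(adj[v]))
            visited.add(nxt)
            path.append(nxt)
        else:
            path.pop()      # dead end: go back one node
        steps += 1
    return None

if __name__ == "__main__":
    print(degree_seeking_search(HUBS, 1, 8))
```

Starting from leaf 1, the search moves to hub 0, then to the larger hub 3, then to node 7, whose neighbor is the target, without ever visiting the low-degree leaves.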

21. Comparison Random Walk and Degree Seeking

22. Probability Generating Functions • For a discrete probability distribution X over {0,1,2,3,…} let $p_k$ be the probability of the event $k \in \{0,1,2,3,\dots\}$ • Then the generating function for the probability distribution is $G(x) = \sum_{k=0}^{\infty} p_k x^k$ • Probability values: $p_k = \frac{G^{(k)}(0)}{k!}$ • where $G^{(k)}$ is the k-th derivative of G • For probability distributions X and Y and their generating functions $G_X$, $G_Y$ we have: $G_X = G_Y$ if and only if X and Y have the same distribution

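23. Probability Generating Functions: Properties • Sum of probabilities: $G(1) = \sum_k p_k = 1$ • Expectation: $E[X] = \sum_k k\,p_k = G'(1)$ • If $X_i$ are independent discrete random variables and $G_{X_i}$ their generating functions, then for $S = \sum_{i=1}^{n} X_i$ • the generating function is $G_S(x) = \prod_{i=1}^{n} G_{X_i}(x)$ • This implies for $S = X_1 - X_2$, where $X_1$ and $X_2$ are independent: $G_S(x) = G_{X_1}(x)\,G_{X_2}(1/x)$ • Let N be an independent random variable. Let $X_1, X_2, \dots$ be independent and identically distributed random variables. Then for the random variable $X_N$, the sum of the first N of the $X_i$, the generating function is given by $G_{X_N}(x) = G_N(G_{X_1}(x))$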
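These properties can be verified numerically for small finite distributions. The sketch below is an illustration under its own assumptions: a distribution is stored as a list `p` with `p[k]` the probability of the value k, and all function names are made up here.

```python
def pgf(p, x):
    """G(x) = sum_k p[k] * x^k."""
    return sum(pk * x ** k for k, pk in enumerate(p))

def pgf_derivative(p, x):
    """G'(x) = sum_k k * p[k] * x^(k-1)."""
    return sum(k * pk * x ** (k - 1) for k, pk in enumerate(p) if k > 0)

def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def random_sum(p_n, p_x):
    """Distribution of X_1 + ... + X_N with N ~ p_n and i.i.d. X_i ~ p_x."""
    result = [0.0]
    power = [1.0]  # distribution of the empty sum: always 0
    for weight in p_n:
        if len(power) > len(result):
            result.extend([0.0] * (len(power) - len(result)))
        for k, v in enumerate(power):
            result[k] += weight * v
        power = convolve(power, p_x)  # distribution of one more summand
    return result

if __name__ == "__main__":
    die = [0.0] + [1 / 6] * 6
    print(pgf(die, 1.0), pgf_derivative(die, 1.0))
```

With a fair die as input, `pgf(die, 1.0)` is the sum of probabilities and `pgf_derivative(die, 1.0)` is the expectation. The product rule and the composition rule can be checked by comparing `pgf(convolve(p, q), x)` with `pgf(p, x) * pgf(q, x)` and `pgf(random_sum(p_n, p_x), x)` with `pgf(p_n, pgf(p_x, x))`.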

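24. Probability Generating Functions: Examples • Remember that • Example: • Consider the random variable • then the generating function is • Poisson probability distribution with $p_k = e^{-\lambda}\,\frac{\lambda^k}{k!}$ • Generating function: $G(x) = e^{\lambda(x-1)}$ • Pareto (power law) probability distribution: $p_k = c\,k^{-\tau}$ with generating function $G(x) = c \sum_{k \ge 1} k^{-\tau} x^k$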

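25. Analyzing Power Law Graphs • Consider the generating function for the degree • Let $p_k = 0$ for $k = 0$ and for all $k > m = n^{1/\tau}$ • Hence, the generating function is $G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$ • Choose the normalization factor c such that $G_0(1) = c \sum_{k=1}^{m} k^{-\tau} = 1$ • Then, the average degree is given by $G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$ • Reason for the cut-off: if $k > n^{1/\tau}$ then $p_k < n^{-1}$ • This means that fewer than one node of degree k exists in expectation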
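The truncated degree distribution and its average degree are easy to tabulate. This is a minimal sketch assuming the cut-off $m = \lfloor n^{1/\tau} \rfloor$ (rounding down is a choice made here); the function names are hypothetical.

```python
def truncated_power_law(n, tau):
    """Return a list p with p[k] = c * k^(-tau) for 1 <= k <= m and p[0] = 0.

    The cut-off m = floor(n^(1/tau)) and c is chosen so that the p[k] sum to 1.
    """
    m = max(1, int(n ** (1.0 / tau)))
    weights = [k ** -tau for k in range(1, m + 1)]
    c = 1.0 / sum(weights)
    return [0.0] + [c * w for w in weights]

def average_degree(p):
    """G_0'(1) = sum_k k * p[k]."""
    return sum(k * pk for k, pk in enumerate(p))

if __name__ == "__main__":
    p = truncated_power_law(10 ** 6, 2.1)
    print("cut-off m =", len(p) - 1)
    print("average degree =", average_degree(p))
```

For an exponent between 2 and 3 the average degree stays bounded as n grows, because $\sum_k k^{1-\tau}$ converges; only the higher moments depend strongly on the cut-off.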

26. The Average Degree • Average degree of a node: $\sum_k k\,p_k = G_0'(1)$ • A random edge reaches high-degree nodes with higher probability: • if a node has k edges, then the probability of reaching it increases (for large networks) by a factor of k • i.e. probability $p'(k) \sim k\,p_k$ • the corresponding normalized probability generating function is $\frac{\sum_k k\,p_k\,x^k}{\sum_k k\,p_k} = x\,\frac{G_0'(x)}{G_0'(1)}$ • The generating function for the remaining edges of a node reached after one random-walk step is this function shifted by one place, i.e. $G_1(x) = \frac{G_0'(x)}{G_0'(1)}$

27. The Neighbor’s Degree • Assume that • a node “knows” the degree of all its neighbors • the probability that any second neighbor is connected to more than one first neighbor can be neglected • Then, the degrees of the first neighbors and second neighbors are independent • Second neighbors are the neighbors in the next step • Let $z_{2a}$ denote the average number of second neighbors starting from a random node • Choose N according to $G_0$ • Choose $X_i$ according to $G_1$ • Consider $X_N$, the sum of N such variables, and its generating function $G_0(G_1(x))$ • Then $z_{2a} = \left.\frac{d}{dx} G_0(G_1(x))\right|_{x=1} = G_0'(1)\,G_1'(1)$ • Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge • Choose N according to $G_1$ • Choose $X_i$ according to $G_1$ • Consider $X_N$ and its generating function $G_1(G_1(x))$ • Then $z_{2b} = \left.\frac{d}{dx} G_1(G_1(x))\right|_{x=1} = [G_1'(1)]^2$
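Both averages can be computed directly from a degree distribution, using $G_0'(1) = \sum_k k\,p_k$ and $G_1'(1) = G_0''(1)/G_0'(1)$ with $G_0''(1) = \sum_k k(k-1)\,p_k$. The sketch below is illustrative only: the helper names, the Poisson mean 3.0, and the power-law parameters are assumptions chosen for the comparison.

```python
import math

def second_neighbors(p):
    """Return (z2a, z2b) for a degree distribution p, where p[k] = P[degree = k]."""
    g0_prime = sum(k * pk for k, pk in enumerate(p))               # G_0'(1)
    g0_second = sum(k * (k - 1) * pk for k, pk in enumerate(p))    # G_0''(1)
    g1_prime = g0_second / g0_prime                                # G_1'(1)
    return g0_prime * g1_prime, g1_prime ** 2

def poisson(lam, k_max=60):
    """Poisson degree distribution, truncated at k_max."""
    return [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
            for k in range(k_max + 1)]

def truncated_power_law(n, tau):
    """p[k] = c * k^(-tau) for 1 <= k <= floor(n^(1/tau)), p[0] = 0."""
    m = max(1, int(n ** (1.0 / tau)))
    weights = [k ** -tau for k in range(1, m + 1)]
    c = 1.0 / sum(weights)
    return [0.0] + [c * w for w in weights]

if __name__ == "__main__":
    print("Poisson:  ", second_neighbors(poisson(3.0)))
    print("power law:", second_neighbors(truncated_power_law(10 ** 6, 2.1)))
```

For the Poisson distribution the two values coincide, since $G_0'(1) = G_1'(1) = \lambda$, whereas for the power-law distribution the walk-based value $z_{2b}$ is much larger than $z_{2a}$.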

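28. Random Walks Outperform Random Nodes • Let $z_{2a}$ denote the average number of second neighbors starting from a random node • The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$ • For 2 < τ < 3 one can obtain $G_0'(1) = \Theta(1)$ and $G_1'(1) = \Theta(m^{3-\tau})$ • Hence, $z_{2a} = G_0'(1)\,G_1'(1) = \Theta(n^{(3-\tau)/\tau})$ • Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge • The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$ • For 2 < τ < 3 one can obtain $G_1'(1) = \Theta(m^{3-\tau})$ • Hence, $z_{2b} = [G_1'(1)]^2 = \Theta(n^{2(3-\tau)/\tau})$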
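The growth of both quantities with n can be estimated numerically by evaluating them for two network sizes and fitting an exponent. This is a rough sketch under assumptions made here: the cut-off is taken as $\lfloor n^{1/\tau} \rfloor$, the sizes $10^4$ and $10^8$ and the exponent 2.5 are arbitrary, and the function names are hypothetical.

```python
import math

def truncated_power_law(n, tau):
    """p[k] = c * k^(-tau) for 1 <= k <= floor(n^(1/tau)), p[0] = 0."""
    m = max(1, int(n ** (1.0 / tau)))
    weights = [k ** -tau for k in range(1, m + 1)]
    c = 1.0 / sum(weights)
    return [0.0] + [c * w for w in weights]

def second_neighbors(p):
    """Return (z2a, z2b) = (G_0'(1) * G_1'(1), G_1'(1)^2)."""
    g0_prime = sum(k * pk for k, pk in enumerate(p))
    g1_prime = sum(k * (k - 1) * pk for k, pk in enumerate(p)) / g0_prime
    return g0_prime * g1_prime, g1_prime ** 2

def fitted_exponents(tau, n_small=10 ** 4, n_large=10 ** 8):
    """Estimate e such that z grows like n^e, separately for z2a and z2b."""
    small = second_neighbors(truncated_power_law(n_small, tau))
    large = second_neighbors(truncated_power_law(n_large, tau))
    scale = math.log(n_large / n_small)
    return tuple(math.log(b / a) / scale for a, b in zip(small, large))

if __name__ == "__main__":
    ea, eb = fitted_exponents(2.5)
    print("fitted exponent for z2a:", ea, "asymptotic:", (3 - 2.5) / 2.5)
    print("fitted exponent for z2b:", eb, "asymptotic:", 2 * (3 - 2.5) / 2.5)
```

At these finite sizes the fitted exponents only approximate the asymptotic values, because lower-order terms still contribute, but the exponent for $z_{2b}$ comes out at roughly twice the exponent for $z_{2a}$.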

29. Conclusions • The number of nodes in the neighborhood of the nodes visited by a random walk is approximately the square of the number of nodes neighboring random points of the network • This effect can be increased if we prefer the neighbor with the highest degree • This improves the search in power law networks • because more neighbors are within reach • In random graphs (Poisson graphs) this technique does not help as much • since the degree distribution is sharply concentrated around the expectation.

30. Thanks for your attention. End of 10th lecture. Happy X-mas and a happy new year. Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116. Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316