## Search Algorithms, Winter Semester 2004/2005: 10th Lecture, 20 Dec 2004


Christian Schindelhauer, schindel@upb.de

**Chapter III: Searching the Web**

**Searching the Web**

- Introduction
- The Anatomy of a Search Engine
- Google's PageRank algorithm
  - The Simple Algorithm
  - Periodicity and convergence
- Kleinberg's HITS algorithm
  - The algorithm
  - Convergence
- The Structure of the Web
  - Pareto distributions
  - Search in Pareto-distributed graphs

**The Webgraph**

- $G_{WWW}$:
  - Static HTML pages are nodes
  - Links are directed edges
- Outdegree of a node: number of links on a web page
- Indegree of a node: number of links to a web page
- Directed path from node $u$ to $v$
  - A series of web pages, where one follows links from page $u$ to page $v$
- Undirected path $(u = w_0, w_1, \dots, w_{m-1}, v = w_m)$ from page $u$ to page $v$
  - For all $i$: there is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$
- Strongly (weakly) connected subgraph
  - Minimal node set including all nodes which have a directed (undirected) path from and to a reference node

**Distributions of indegree/outdegree**

- In- and out-degree obey a power law
  - i.e. in- and out-degree $i$ appear with probability $\sim 1/i^{\alpha}$
- According to experiments of
  - Kumar et al. 97: 40 million web pages
  - Barabási et al. 99: domain *.nd.edu + web pages within distance 3
  - Broder et al. 00: 204 million web pages (scans in May and Oct 1999)

**Is the Web-Graph a Random Graph? No!**

- Random graph $G_{n,p}$:
  - $n$ nodes
  - Every directed edge occurs with probability $p$
- Is the web graph a random graph $G_{n,p}$?
- The probability of high degrees decreases exponentially
  - In a random graph, degrees are distributed according to a Poisson distribution
- Therefore: the degree distribution of a random graph does not obey a power law

**Pareto Distribution**

- Discrete Pareto (power law) distribution for $x \in \{1, 2, 3, \dots\}$:
  $P[X = x] = \dfrac{1}{\zeta(\alpha)\, x^{\alpha}}$
  with constant factor $\zeta(\alpha) = \sum_{i=1}^{\infty} i^{-\alpha}$ (also known as the Riemann zeta function)
- Heavy tail property
  - Not all moments $E[X^k]$ are defined
  - The expected value exists if and only if $\alpha > 2$
  - The variance and $E[X^2]$ exist if and only if $\alpha > 3$
  - $E[X^k]$ is defined if and only if $\alpha > k + 1$
- Density function of the continuous distribution for $x > x_0$:
  $f(x) = \dfrac{\alpha - 1}{x_0} \left(\dfrac{x_0}{x}\right)^{\alpha}$

**Special Case: Zipf Distribution**

- George Kingsley Zipf claimed that the $n$-th most frequent word occurs with frequency $f(n)$ such that $f(n) \cdot n = c$
- Zipf probability distribution for $x \in \{1, 2, 3, \dots\}$: $P[X = x] = c/x$
  - with constant factor $c$
  - only defined for finite sets, since $\sum_{i=1}^{n} 1/i$ tends to infinity for growing $n$
- Zipf distributions refer to ranks
  - The Zipf exponent can be larger than 1, i.e. $f(n) = c/n^{\alpha}$
- Pareto distributions refer to absolute size
  - e.g. number of inhabitants

**Pareto Distribution (I)**

- Examples of power laws (= Pareto distributions)
  - Pareto 1897: wealth/income in a population
  - Yule 1944: word frequency in languages
  - Zipf 1949: size of towns
  - Length of molecule chains
  - File length of UNIX files
  - ….
  - Access density of web pages
  - Access density of a web surfer at a particular web page
  - …

**City Size Distribution**

[Figure: Zipf distribution of city sizes. Source: "Scaling Laws and Urban Distributions", Denise Pumain, 2003]

**Zipf's Law and the Internet**

[Figures, including a Pareto distribution plot. Source: "Zipf's Law and the Internet", Lada A. Adamic, Bernardo A. Huberman, 2002]

**Heavy-Tailed Probability Distributions in the World Wide Web**

[Figure. Source: Mark Crovella, Murad Taqqu, Azer Bestavros, 1996]

**Size of connected components**

- Strongly and weakly connected components obey a power law
  - A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. "Graph Structure in the Web: Experiments and Models." In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000.
- Large weakly connected component with 91% of all web pages
- Largest strongly connected component has size 28%
- Diameter ≥ 28

**Searching in Power Law Networks**

- Task
  - Given a network with undirected edges
  - Degrees follow a power law
  - From a source node, find a target node
- Features
  - Keep it simple: no markers
  - Visit one node at a time
  - Every node knows its neighbors (and their degrees)
- From Adamic, Lukose, Puniyani, Huberman, "Search in power-law networks", Physical Review E, Vol. 64, 046135
- Three approaches
  - Neighbors of random nodes
  - Neighbors of a random walk: start with a random neighbor and continue
  - Neighbors of high degree seeking: start with a random node, prefer neighbors with larger degree

**Power Law Networks**

- Undirected graph of $n$ nodes
- The probability that a node has $k$ neighbors is $p_k$, where $p_k = c\, k^{-\tau}$ for a normalization factor $c$
- For search in power law networks
  - Consider the largest connected component and exponent $\tau$ with $2 < \tau < 3$
- Theorem: for large enough power law graphs with exponent $\tau$
  - For $\tau < 1$ the graph is almost surely connected
  - For $1 < \tau < 2$ there is a giant connected component of size $\Theta(n)$
  - For $2 < \tau < 3.4785$ there is a giant component and all smaller components are of size $O(\log n)$
  - For $\tau > 3.4785$ the graph has almost surely no giant component, i.e. all components have size $o(n)$
  - For $\tau > 4$ all connected components follow a power law
  - (William Aiello, Fan Chung, Linyuan Lu, "A Random Graph Model for Massive Graphs", Symposium on Theory of Computing (STOC) 2000)

**Random Walk**

- Random walk:
  - Start with a random node as node $u$
  - while no neighbor of $u$ is the target do
    - $u \leftarrow$ random neighbor of $u$
  - od
- Theorem: in undirected connected graphs, every node is visited by a random walk with probability proportional to its degree (in the long run).
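The theorem on random walks can be illustrated with a short simulation. The following is a minimal sketch, not part of the lecture: the function names and the small example graph `EXAMPLE_GRAPH` are hypothetical, chosen only for illustration. It compares the empirical visit frequencies of a long random walk with the predicted values $\deg(v) / \sum_w \deg(w)$.

```python
import random
from collections import Counter

# Hypothetical example graph (adjacency lists, undirected and connected):
# node 0 is a hub; the extra edge 1-2 closes a triangle.
EXAMPLE_GRAPH = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

def random_walk_visit_frequencies(adj, steps, seed=0):
    """Run a simple random walk and return the fraction of steps spent at each node."""
    rng = random.Random(seed)
    u = rng.choice(sorted(adj))      # start with a random node
    visits = Counter()
    for _ in range(steps):
        u = rng.choice(adj[u])       # u <- random neighbor of u
        visits[u] += 1
    return {v: visits[v] / steps for v in adj}

def degree_stationary_distribution(adj):
    """Long-run visit probability predicted by the theorem: deg(v) / sum of all degrees."""
    total_degree = sum(len(neighbors) for neighbors in adj.values())
    return {v: len(neighbors) / total_degree for v, neighbors in adj.items()}

if __name__ == "__main__":
    empirical = random_walk_visit_frequencies(EXAMPLE_GRAPH, steps=100_000)
    predicted = degree_stationary_distribution(EXAMPLE_GRAPH)
    for v in sorted(EXAMPLE_GRAPH):
        print(v, round(empirical[v], 3), round(predicted[v], 3))
```

In this sketch the hub (node 0, degree 4) should be visited about four times as often as the leaves (nodes 3 and 4, degree 1), which is the sense in which a random walk favors high-degree nodes.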
- Conclusion: high-degree nodes are preferred
- Possible improvements
  - Avoid going back
  - Avoid visiting already visited nodes
  - Also scan second-degree neighbors for the target node

[Figure: RW, random walk in a power law graph with exponent 2.1, avoiding going back, with second-degree scanning]

**Degree Seeking**

- Degree seeking:
  - Start with a random node as node $u$
  - while no neighbor of $u$ is the target do
    - $u \leftarrow$ neighbor of $u$ with the highest degree that was not visited so far
  - od
- Improvement: also scan second-degree neighbors for the target
- Observation: the search in power law networks is considerably faster
  - Why?

[Figure: RW, random walk in a power law graph with exponent 2.1; DS, degree seeking in the same graph, avoiding already visited neighbors, with second-degree scanning]

**Probability Generating Functions**

- For a discrete probability distribution $X$ over $\{0, 1, 2, 3, \dots\}$ let $p_k$ be the probability of event $k \in \{0, 1, 2, 3, \dots\}$
- Then the generating function for the probability distribution is
  $G(x) = \sum_{k=0}^{\infty} p_k x^k$
- Probability values:
  $p_k = \dfrac{G^{(k)}(0)}{k!}$
  where $G^{(k)}$ is the $k$-th derivative of $G$
- For probability distributions $X$ and $Y$ and their generating functions $G_X$, $G_Y$ we have [formula missing in source]

**Probability Generating Functions: Properties**

- Sum of probabilities: $G(1) = \sum_{k} p_k = 1$
- Expectation: $E[X] = \sum_{k} k\, p_k = G'(1)$
- If the $X_i$ are independent discrete random variables with generating functions $G_{X_i}$, then for $S = \sum_i X_i$ the generating function is
  $G_S(x) = \prod_i G_{X_i}(x)$
- This implies for $S = X_1 - X_2$, where $X_1$ and $X_2$ are independent: [formula missing in source]
- Let $N$ be an independent random variable. Let $X_1, X_2, \dots$ be independent and identically distributed random variables.
- Then for the random sum $X_N := \sum_{i=1}^{N} X_i$ the generating function is given by
  $G_{X_N}(x) = G_N(G_X(x))$
  where $G_X$ is the common generating function of the $X_i$

**Probability Generating Functions: Examples**

- Remember that [formula missing in source]
- Example:
  - Consider the random variable [definition missing in source]
  - then the generating function is [formula missing in source]
- Poisson probability distribution with $p_k = e^{-\lambda} \lambda^k / k!$
  - Generating function: $G(x) = e^{\lambda (x - 1)}$
- Pareto (power law) probability distribution with $p_k = c\, k^{-\tau}$
  - Generating function: $G(x) = c \sum_{k=1}^{\infty} k^{-\tau} x^k$

**Analyzing Power Law Graphs**

- Consider the generating function for the degree
- Let $p_k = 0$ for all $k > m = n^{1/\tau}$ and for $k = 0$
- Hence, the generating function is
  $G_0(x) = c \sum_{k=1}^{m} k^{-\tau} x^k$
- Choose the normalization factor $c$ such that $G_0(1) = 1$, i.e. $c = 1 / \sum_{k=1}^{m} k^{-\tau}$
- Then, the average degree is given by
  $z_1 = G_0'(1) = c \sum_{k=1}^{m} k^{1-\tau}$
- If $m > n^{1/\tau}$ then $p_m < n^{-1}$
  - This means that fewer than one node of degree $m$ exists in expectation

**The Average Degree**

- Average degree of a node: $z_1 = \sum_k k\, p_k = G_0'(1)$
- A random edge chooses high-degree nodes with higher probability
  - If a node has $k$ edges, then the probability increases (for large networks) by a factor of $k$
  - i.e. probability $p'(k) \propto k\, p_k$
  - The corresponding normalized generating function is
    $\dfrac{\sum_k k\, p_k x^k}{\sum_k k\, p_k} = \dfrac{x\, G_0'(x)}{G_0'(1)}$
- The generating function for the remaining degree of a node reached after one random walk step is this function shifted by one place, i.e.
  $G_1(x) = \dfrac{G_0'(x)}{G_0'(1)}$

**The Neighbor's Degree**

- Assume that
  - a node "knows" the degree of all its neighbors
  - the probability that any second neighbor is connected to more than one first neighbor can be neglected
- Then the degrees of the first neighbors and second neighbors are independent
  - Second neighbors are the neighbors in the next step
- Let $z_{2a}$ denote the average number of second neighbors starting from a random node
  - Choose $N$ according to $G_0$
  - Choose $X_i$ according to $G_1$
  - Consider $X_N$ and the generating function $G_0(G_1(x))$
  - Then $z_{2a} = \left.\dfrac{d}{dx} G_0(G_1(x))\right|_{x=1} = G_0'(1)\, G_1'(1) = G_0''(1)$
- Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge
  - Choose $N$ according to $G_1$
  - Choose $X_i$ according to $G_1$
  - Consider $X_N$ and the generating function $G_1(G_1(x))$
  - Then $z_{2b} = \left.\dfrac{d}{dx} G_1(G_1(x))\right|_{x=1} = \left[G_1'(1)\right]^2$

**Random Walks outperform Random Nodes**

- Let $z_{2a}$ denote the average number of second neighbors starting from a random node
  - The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$
  - For $2 < \tau < 3$ one can obtain $G_0'(1) = \Theta(1)$ and $G_0''(1) = \Theta(m^{3-\tau})$
  - Hence, $z_{2a} = \Theta\!\left(n^{3/\tau - 1}\right)$
- Let $z_{2b}$ denote the average number of second neighbors starting from a node chosen by a random edge
  - The degree depends on the cut-off value $m = \Theta(n^{1/\tau})$
  - For $2 < \tau < 3$ one can obtain $G_1'(1) = G_0''(1) / G_0'(1) = \Theta(m^{3-\tau})$
  - Hence, $z_{2b} = \Theta\!\left(n^{2(3/\tau - 1)}\right)$

**Conclusions**

- The number of nodes in the neighborhood of the nodes of a random walk is approximately the square of the number of nodes neighboring random points of the network
- This effect can be increased if we prefer the neighbor with the highest degree
- This improves the search in power law networks
  - because more neighbors are in reach
- In random graphs (Poisson graphs) this technique does not help as much
  - since the degree distribution is sharply concentrated around the expectation

**Thanks for your attention**

- End of 10th lecture
- Happy X-mas and a happy new year
- Next lecture: Mo 10 Jan 2005, 11.15 am, FU 116
- Next exercise class: Mo 20 Dec 2004, 1.15 pm, F0.530 or We 22 Dec 2004, 1.00 pm, E2.316
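As a numerical companion to the comparison of $z_{2a}$ and $z_{2b}$, the following minimal sketch evaluates both quantities for a truncated power law and for a Poisson degree distribution. It is not part of the lecture. It assumes the standard relations $G_1(x) = G_0'(x)/G_0'(1)$, $z_1 = G_0'(1)$, $z_{2a} = G_0'(1)\,G_1'(1)$ and $z_{2b} = [G_1'(1)]^2$. The function names and the parameter values ($n = 10^6$, $\tau = 2.1$) are hypothetical choices for illustration.

```python
import math

def truncated_power_law(n, tau):
    """Degree distribution p_k = c * k**(-tau) for 1 <= k <= m, cut off at m = n**(1/tau)."""
    m = max(1, round(n ** (1.0 / tau)))
    weights = {k: k ** (-tau) for k in range(1, m + 1)}
    c = 1.0 / sum(weights.values())          # normalization so that G0(1) = 1
    return {k: c * w for k, w in weights.items()}

def poisson(lam, kmax=100):
    """Poisson degree distribution, truncated at kmax (the tail is negligible for small lam)."""
    p, pk = {}, math.exp(-lam)
    for k in range(kmax + 1):
        p[k] = pk
        pk *= lam / (k + 1)                   # p_{k+1} = p_k * lam / (k + 1)
    return p

def neighbor_counts(p):
    """Return (z1, z2a, z2b) for a degree distribution given as {k: p_k}."""
    g0_first = sum(k * pk for k, pk in p.items())             # G0'(1)
    g0_second = sum(k * (k - 1) * pk for k, pk in p.items())  # G0''(1)
    g1_first = g0_second / g0_first                           # G1'(1)
    z1 = g0_first
    z2a = g0_first * g1_first    # second neighbors of a random node
    z2b = g1_first ** 2          # second neighbors of a node reached via a random edge
    return z1, z2a, z2b

if __name__ == "__main__":
    power_law = truncated_power_law(n=10**6, tau=2.1)
    z1, _, _ = neighbor_counts(power_law)
    for name, dist in [("power law", power_law), ("Poisson", poisson(z1))]:
        print(name, [round(z, 2) for z in neighbor_counts(dist)])
```

Under these assumptions $z_{2b} = (z_{2a}/z_1)^2$, so for the power law the node reached via a random edge sees far more second neighbors than a random node does. For the Poisson distribution $G_1 = G_0$, so both counts equal $\lambda^2$ and following edges brings no gain.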