Search Algorithms, Winter Semester 2004/2005, 9th Lecture, 13 Dec 2004. Christian Schindelhauer, schindel@upb.de. Chapter III: Searching the Web.
Searching the Web • Introduction • The Anatomy of a Search Engine • Google’s Pagerank algorithm • The Simple Algorithm • Periodicity and convergence • Kleinberg’s HITS algorithm • The algorithm • Convergence • The Structure of the Web • Pareto distributions • Search in Pareto-distributed graphs
Kleinberg’s HITS Algorithm (Hyperlink-Induced Topic Search) • Jon Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, Journal of the ACM 46(5): 604-632 (1999) • Idea of the algorithm • Pages can serve as • Authorities (as in PageRank) or • Hubs • Hub pages collect links to authorities, i.e. to relevant pages • E.g. railway fans collect links to railway companies • Authorities are the targets of hub pages • Mutually reinforcing relationship between authorities and hubs
Constructing a Focused Subgraph • For a search pattern σ choose a set of pages S such that • S is relatively small, • S is rich in relevant pages, • S contains most (or many) of the strongest authorities. • Start with the output of a standard text-based search engine • Extend this set of pages by their predecessors and successors (w.r.t. links)
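The construction above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Kleinberg's implementation: the function name `build_base_set`, the edge-list input format, and the cap `max_predecessors` are hypothetical, and the root set is assumed to have been returned already by some text-based search engine.

```python
def build_base_set(root_set, links, max_predecessors=50):
    """Expand a root set of pages into a focused set S.

    root_set: pages returned by a text-based search engine (assumed given).
    links: iterable of (source, target) pairs describing hyperlinks.
    max_predecessors: hypothetical cap on the number of predecessors that
        each root page may contribute, so that S stays relatively small.
    """
    successors = {}
    predecessors = {}
    for src, dst in links:
        successors.setdefault(src, []).append(dst)
        predecessors.setdefault(dst, []).append(src)

    base = set(root_set)
    for page in root_set:
        # Add every page the root page links to ...
        base.update(successors.get(page, []))
        # ... and a bounded number of pages that link to the root page.
        base.update(predecessors.get(page, [])[:max_predecessors])
    return base


# Toy link structure; page names are made up for illustration.
links = [("hub1", "a"), ("hub1", "b"), ("a", "c"), ("x", "y")]
base = build_base_set({"a"}, links)
```

The cap on predecessors is one possible way to keep S small when a root page has very many incoming links.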
Edge Selection • Goal: offset the effect of links that serve a purely navigational function • Types of links • A link is transverse if it connects pages with different domain names • A link is intrinsic if it connects pages with the same domain name • Intrinsic links very often exist purely for navigation • They give much less information than transverse links about the authority of the pages they point to • Therefore delete all intrinsic links from the focused subgraph • Other simple heuristics • Suppose a large number of pages from a single domain all point to a single page p • This often corresponds to mass advertisement • For example, the phrase “This site designed by . . .” with a corresponding link at the bottom of each page in a given domain • To eliminate this phenomenon, allow only a maximum number of links from any one domain to point to a page
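Both heuristics can be combined in one filtering pass. The following is only a sketch: the function names, the use of the URL host name as the "domain name", and the default limit `max_per_domain=4` are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def domain(url):
    """Host part of a URL, used here as the page's domain name (assumption)."""
    return urlparse(url).netloc.lower()


def select_edges(links, max_per_domain=4):
    """Drop intrinsic links and cap the links from one domain to one page."""
    kept = []
    count = {}  # (source domain, target page) -> number of links kept so far
    for src, dst in links:
        if domain(src) == domain(dst):
            continue  # intrinsic link: same domain, assumed navigational
        key = (domain(src), dst)
        if count.get(key, 0) >= max_per_domain:
            continue  # too many links from this domain to this page
        count[key] = count.get(key, 0) + 1
        kept.append((src, dst))
    return kept


# Made-up URLs for illustration.
links = [
    ("http://a.example/1", "http://a.example/2"),  # intrinsic
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),
    ("http://a.example/3", "http://b.example/"),   # exceeds the cap of 2
]
kept = select_edges(links, max_per_domain=2)
```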
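Mutually Reinforcing Relationship • Weights • Authority weight of a web page i: $x_i$ • Hub weight of a web page i: $y_i$ • Authority indicated by hub pages (I-operation): $x_i \leftarrow c_1 \sum_{j : (j,i) \in E} y_j$ • Hub pages indicated by authority pages (O-operation): $y_i \leftarrow c_2 \sum_{j : (i,j) \in E} x_j$ • $c_1$, $c_2$ are normalization factors w.r.t. the $L_2$ norm, i.e. $\sum_i x_i^2 = 1$ and $\sum_i y_i^2 = 1$ • Here $E$ denotes the set of links of the focused subgraph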
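A minimal sketch of the two operations follows. It assumes the usual formulation in which a page's authority weight is the sum of the hub weights of the pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to, each followed by L2 normalization. The function names, the fixed iteration count, and the toy graph are illustrative assumptions.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit L2 norm (unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(pages, links, iterations=50):
    """Alternate the I- and O-operation for a fixed number of rounds."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(iterations):
        # I-operation: authority of a page = sum of hub weights pointing to it
        new_x = {p: 0.0 for p in pages}
        for src, dst in links:
            new_x[dst] += y[src]
        x = normalize(new_x)
        # O-operation: hub weight of a page = sum of authority weights it points to
        new_y = {p: 0.0 for p in pages}
        for src, dst in links:
            new_y[src] += x[dst]
        y = normalize(new_y)
    return x, y


# Toy graph: two hubs, two authorities (names made up).
pages = ["h1", "h2", "a1", "a2"]
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
authority, hub = hits(pages, links)
```

In this toy graph `a1` is linked from both hubs and ends up with the larger authority weight, while `h1` links to both authorities and ends up with the larger hub weight.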
Computing the Output • Does the algorithm converge? • How good is the output?
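Matrix Representation • Adjacency matrix A: $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise • Authorities: $x \leftarrow c_1 A^T y$ • Hub weights: $y \leftarrow c_2 A x$ • After t iterations, starting from the all-ones vector $z$: $x^{(t)} = c\,(A^T A)^{t-1} A^T z$ and $y^{(t)} = c'\,(A A^T)^{t} z$, with normalization factors $c$, $c'$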
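The matrix form can be tried out with plain Python lists. This is a sketch under assumptions: it takes `A[i][j] = 1` to mean that page i links to page j, updates authorities with the transposed matrix and hubs with the matrix itself, starts from the all-ones vector, and assumes the graph has at least one link (otherwise the normalization would divide by zero). All helper names are illustrative.

```python
import math


def mat_vec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def unit(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def hits_matrix(A, t):
    """Run t rounds of x <- A^T y, y <- A x with L2 normalization."""
    At = transpose(A)
    y = [1.0] * len(A)
    x = y
    for _ in range(t):
        x = unit(mat_vec(At, y))  # authority update
        y = unit(mat_vec(A, x))   # hub update
    return x, y


# Pages 0 and 1 are hubs, pages 2 and 3 are authorities.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
x, y = hits_matrix(A, 30)
```

After enough rounds the hub vector `y` is (numerically) unchanged by one more multiplication with A·Aᵀ followed by normalization, which is what one expects from an eigenvector of that matrix.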
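When does HITS converge? • $M = A A^T$ is a symmetric matrix • For all symmetric matrices • all eigenvalues are real • the eigenvectors can be chosen orthonormal • There exists a representation $M = S \Lambda S^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and orthogonal $S$ • such that for the columns $S_i$: $M S_i = \lambda_i S_i$ • All eigenvalues of $A A^T$ are non-negative • If the largest eigenvalue $\lambda_1$ is strictly larger than $\lambda_2$, the second largest eigenvalue, then HITS converges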
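The convergence claim can be made explicit with a short derivation. It is a sketch under assumptions: $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0$ denote the eigenvalues of $M$, $S_1, \dots, S_n$ the corresponding orthonormal eigenvectors, and $z$ the start vector, which is assumed to satisfy $S_1^T z \neq 0$.

```latex
\[
\begin{aligned}
M^t z &= S \Lambda^t S^T z
       = \sum_{i=1}^{n} \lambda_i^t \,(S_i^T z)\, S_i \\
      &= \lambda_1^t \left[ (S_1^T z)\, S_1
         + \sum_{i=2}^{n} \left(\frac{\lambda_i}{\lambda_1}\right)^{t} (S_i^T z)\, S_i \right]
\end{aligned}
\]
% Since 0 <= lambda_i / lambda_1 < 1 for all i >= 2, every term of the
% remaining sum vanishes as t grows, so the normalized vector
% M^t z / ||M^t z|| converges to S_1 (up to sign).
```

If $\lambda_1 = \lambda_2$, the limit depends on the start vector, which is why the strict gap is required.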
The Web Graph • $G_{WWW}$: • Static HTML pages are nodes • Links are directed edges • Outdegree of a node: number of links on a web page • Indegree of a node: number of links to a web page • Directed path from node u to node v • a series of web pages in which one follows links from page u to page v • Undirected path $(u = w_0, w_1, \ldots, w_{m-1}, v = w_m)$ from page u to page v • For all i: there is a link from $w_i$ to $w_{i+1}$ or from $w_{i+1}$ to $w_i$ • Strongly (weakly) connected subgraph • the set of all nodes that have a directed (undirected) path from and to a reference node
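The two notions of connectivity can be illustrated with breadth-first search from a reference node. This is a minimal sketch; the function names and the edge-list representation are assumptions made for illustration.

```python
from collections import deque


def reachable(start, neighbours):
    """All nodes reachable from start by following the given neighbour map."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def components_of(ref, links):
    """Strongly and weakly connected component containing the node ref."""
    out_nb, in_nb, und_nb = {}, {}, {}
    for u, v in links:
        out_nb.setdefault(u, set()).add(v)
        in_nb.setdefault(v, set()).add(u)
        und_nb.setdefault(u, set()).add(v)
        und_nb.setdefault(v, set()).add(u)
    # Strong: directed path from ref to the node AND from the node to ref.
    strong = reachable(ref, out_nb) & reachable(ref, in_nb)
    # Weak: path that may use links in either direction.
    weak = reachable(ref, und_nb)
    return strong, weak


# Toy web graph (page names made up).
links = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "c")]
strong, weak = components_of("a", links)
```

Here `c` can be reached from `a` but does not link back, so it belongs only to the weakly connected component.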
Distributions of Indegree/Outdegree • In- and outdegree obey a power law • i.e. in- and outdegree i appear with probability $\sim 1/i^{\alpha}$ • According to experiments of • Kumar et al. 97: 40 million web pages • Barabasi et al. 99: domain *.nd.edu + web pages within distance 3 • Broder et al. 00: 204 million web pages (scans in May and Oct 1999)
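Given measured degree frequencies, the exponent α can be read off as the negative slope of a straight line in a log-log plot. The sketch below fits that slope by ordinary least squares; this is a crude estimator chosen only for illustration, not the method used in the cited experiments, and the function name and synthetic data are assumptions.

```python
import math


def estimate_exponent(freq):
    """Estimate alpha from a map degree i -> relative frequency of degree i.

    Fits log(frequency) = const - alpha * log(i) by least squares.
    Entries with i < 1 or frequency <= 0 are ignored.
    """
    pts = [(math.log(i), math.log(f)) for i, f in freq.items() if i >= 1 and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return -sxy / sxx


# Synthetic frequencies that follow an exact power law with alpha = 2.1.
freq = {i: i ** -2.1 for i in range(1, 101)}
alpha_hat = estimate_exponent(freq)
```

On real data the tail is noisy, so such a fit should be treated as a rough indication only.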
Is the Web Graph a Random Graph? • Random graph $G_{n,p}$: • n nodes • Every directed edge occurs independently with probability p • Is the web graph a random graph $G_{n,p}$? • Expected in/outdegree in $G_{n,p}$ = (n-1)p • The average degree of $G_{WWW}$ is constant, so choose $p = c/(n-1)$ for a constant c • Consider a web page w • Let X be the number of links pointing from w, i.e. $X = \sum_i X_i$ • Let $X_i = 1$ if link (w,i) exists, and $X_i = 0$ otherwise • Then $P[X_i = 1] = p$ and $P[X_i = 0] = 1-p$
The in/outdegree distribution of the random graph • What is the probability that at least k links appear? • Markov’s inequality: $P[X \ge k] \le E[X]/k$ • This implies $P[X \ge k] \le (n-1)p/k$
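The in/outdegree distribution of the random graph • What is the probability that at least k links appear? • Chebyshev’s inequality: $P[|X - E[X]| \ge k] \le V[X]/k^2$ • Since the $X_i$ are independent: $V[X] = \sum_i V[X_i] = (n-1)\,p\,(1-p)$ • This implies $P[X \ge (n-1)p + k] \le (n-1)\,p\,(1-p)/k^2$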
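The in/outdegree distribution of the random graph • Chernoff bound • For independent Bernoulli variables $X_i$ and $X = \sum_i X_i$ with $\mu = E[X]$: $P[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$ for all $\delta > 0$ • This implies for $k \ge 6\mu$: $P[X \ge k] \le 2^{-k}$ • So the probability decreases exponentially in k • Therefore: the degree distribution of a random graph does not obey a power law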
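To see how differently the tails behave, one can compare the exact binomial tail of the outdegree with the three bounds and with a power-law tail. The sketch assumes the standard forms of the inequalities: Markov as E[X]/k, Chebyshev as V[X]/(k - E[X])² for k > E[X], and the multiplicative Chernoff bound (e^δ/(1+δ)^(1+δ))^μ with δ = k/μ - 1. The parameter values, the exponent 2.1, and the truncation `cutoff` of the power-law normalization are arbitrary choices for illustration.

```python
import math


def binomial_tail(n, p, k):
    """Exact P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def markov_bound(mean, k):
    return mean / k


def chebyshev_bound(mean, variance, k):
    """Bound on P[X >= k], valid for k > mean."""
    return variance / (k - mean) ** 2


def chernoff_bound(mean, k):
    """Multiplicative Chernoff bound on P[X >= k], valid for k > mean."""
    delta = k / mean - 1.0
    return (math.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** mean


def power_law_tail(alpha, k, cutoff=10**5):
    """P[X >= k] for P[X = i] proportional to i^(-alpha), truncated at cutoff."""
    norm = sum(i ** -alpha for i in range(1, cutoff + 1))
    tail = sum(i ** -alpha for i in range(k, cutoff + 1))
    return tail / norm


# 1000 possible link targets, expected outdegree 5, threshold k = 20.
n, p, k = 1000, 0.005, 20
mean = n * p
variance = n * p * (1 - p)
exact = binomial_tail(n, p, k)
```

For these values the exact tail lies below the Chernoff bound, which lies below the Chebyshev bound, which lies below the Markov bound, while a power-law tail with the same threshold is larger than the exact binomial tail by several orders of magnitude.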
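Pareto Distribution • Discrete Pareto (power law) distribution for $x \in \{1,2,3,\ldots\}$: $P[X = x] = \frac{1}{\zeta(\alpha)\, x^{\alpha}}$ with constant factor $\zeta(\alpha) = \sum_{i=1}^{\infty} 1/i^{\alpha}$ (also known as the Riemann zeta function) • Heavy tail property • not all moments $E[X^k]$ are defined • The expected value exists if and only if α > 2 • Variance and $E[X^2]$ exist if and only if α > 3 • $E[X^k]$ is defined if and only if α > k+1 • Density function of the continuous distribution for $x > x_0$: $f(x) = \frac{(\alpha-1)\, x_0^{\alpha-1}}{x^{\alpha}}$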
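The moment conditions follow from a one-line computation. The derivation assumes the discrete distribution has the form $P[X = x] = 1/(\zeta(\alpha)\, x^{\alpha})$, where $\zeta$ is the Riemann zeta function acting as the normalizing constant.

```latex
\[
E[X^k] \;=\; \sum_{x=1}^{\infty} x^k \cdot \frac{1}{\zeta(\alpha)\, x^{\alpha}}
       \;=\; \frac{1}{\zeta(\alpha)} \sum_{x=1}^{\infty} \frac{1}{x^{\alpha-k}}
       \;=\; \frac{\zeta(\alpha-k)}{\zeta(\alpha)}
\]
% The series \sum_x x^{-(\alpha-k)} converges if and only if \alpha - k > 1,
% i.e. \alpha > k + 1. For k = 1 this gives \alpha > 2 (expected value),
% for k = 2 it gives \alpha > 3 (second moment and variance).
```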
Special Case: Zipf Distribution • George Kingsley Zipf claimed that the n-th most frequent word occurs with frequency f(n) such that $f(n) \cdot n = c$ • Zipf probability distribution for $x \in \{1,2,3,\ldots,n\}$: $P[X = x] = c/x$ with constant factor $c = 1/\sum_{i=1}^{n} 1/i$ • only defined for finite sets, since $\sum_{i=1}^{n} 1/i$ tends to infinity for growing n • Zipf distributions refer to ranks • The Zipf exponent can be larger than 1, i.e. $f(n) = c/n^{\alpha}$ • Pareto distributions refer to absolute size • e.g. number of inhabitants
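A finite Zipf distribution can be built by normalizing the weights 1/rank. This is a minimal sketch: the function name and the optional `exponent` parameter (for the generalized form with weights 1/rank^exponent) are illustrative assumptions.

```python
def zipf_distribution(n, exponent=1.0):
    """Probabilities for ranks 1..n with P[rank] proportional to 1/rank**exponent."""
    weights = [1.0 / rank**exponent for rank in range(1, n + 1)]
    total = sum(weights)  # for exponent 1 this is the n-th harmonic number
    return [w / total for w in weights]


def harmonic(n):
    """n-th harmonic number, the normalization constant for exponent 1."""
    return sum(1.0 / i for i in range(1, n + 1))


probs = zipf_distribution(1000)
```

With exponent 1 the product of rank and probability is the same for every rank, matching f(n)·n = c. The normalization constant `harmonic(n)` keeps growing with n, which is why the distribution is only defined for finite sets.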
Pareto Distribution (I) • Examples of power laws (= Pareto distributions) • Pareto 1897: wealth/income in a population • Yule 1944: word frequency in languages • Zipf 1949: size of towns • Length of molecule chains • File length of UNIX files • … • Access density of web pages • Access density of a web surfer at a particular web page • …
City Size Distribution • Denise Pumain, “Scaling Laws and Urban Distributions”, 2003 • [Figure: Zipf distribution]
Zipf’s Law and the Internet • Lada A. Adamic, Bernardo A. Huberman, 2002 • [Figure: Pareto distribution]
Heavy-Tailed Probability Distributions in the World Wide Web • Mark Crovella, Murad Taqqu, Azer Bestavros, 1996
Size of Connected Components • The sizes of strongly and weakly connected components obey a power law • A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. “Graph Structure in the Web: Experiments and Models.” In Proc. of the 9th World Wide Web Conference, pp. 309-320. Amsterdam: Elsevier Science, 2000. • Large weakly connected component with 91% of all web pages • The largest strongly connected component contains 28% of all web pages • Diameter ≥ 28
Thanks for your attention • End of 9th lecture • Next lecture: Mo 20 Dec 2004, 11.15 am, FU 116 • Results and solutions of exam: Mo 13 Dec 2004, 1.15 pm, F0.530 or We 16 Dec 2004, 1.00 pm, E2.316