Google Search and Internet Information Retrieval

Zhi-Ming Ma (马志明), May 16, 2008
Email: mazm@amt.ac.cn
http://www.amt.ac.cn/member/mazhiming/index.html

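About 626,000 results match the query 中国科学院数学与系统科学研究院 (Academy of Mathematics and Systems Science, Chinese Academy of Sciences); results 1-100 are shown below. (Search took 0.45 seconds.) How can Google rank 626,000 pages in 0.45 seconds?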

A main task of Internet (Web) Information Retrieval = Design and Analysis of Search Engine (SE) Algorithms, which involves plenty of mathematics

HITS (1998) • Jon Kleinberg • Cornell University
PageRank • Sergey Brin and Larry Page • Stanford University

Nevanlinna Prize (2006): Jon Kleinberg
One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web. Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure. Kleinberg introduced the idea of "authorities" and "hubs": an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.
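The hub and authority idea can be illustrated with a minimal sketch of the mutual-reinforcement iteration usually associated with HITS. The function name `hits`, the toy graph, the fixed iteration count and the Euclidean normalisation are assumptions made for illustration; they are not taken from the slides.

```python
def hits(links, iterations=50):
    """Hub and authority scores by repeated mutual reinforcement.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the total hub score of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the total authority of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so that the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: h1 and h2 are link collections, a1 and a2 are content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_links)
```

In this toy graph, a1 is linked from both collections and so ends up with the larger authority score, while h1 links to both content pages and ends up with the larger hub score.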

PageRank, the ranking system used by the Google search engine.
• Query independent
• Content independent
• Uses only the web graph structure
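A minimal sketch of how a ranking can be computed from the link graph alone is the usual power iteration for PageRank. The damping factor 0.85, the uniform treatment of dangling pages, the name `pagerank` and the toy graph are assumptions for illustration, not details given in the slides.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """PageRank by power iteration; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        # Every other page shares its rank equally among its out-links.
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new[q] += alpha * rank[p] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical toy web: a cycle a -> b -> c -> a, plus a page d that links to a.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(toy_web)
```

Neither the query nor the page content enters the computation; only the link structure in `toy_web` does.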


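WWW 2005 paper: PageRank as a Function of the Damping Factor
Paolo Boldi, Massimo Santini, Sebastiano Vigna, DSI, Università degli Studi di Milano
3 General Behaviour
3.1 Choosing the damping factor
3.2 Getting close to 1
Write $r(\alpha)$ for the PageRank vector with damping factor $\alpha$, and $r^* = \lim_{\alpha \to 1^-} r(\alpha)$ for its limit as the damping factor tends to 1.
• Can we somehow characterise the properties of $r^*$?
• What makes $r^*$ different from the other (infinitely many, if P is reducible) limit distributions of P?

Conjecture 1: $r^*$ is the limit distribution of P when the starting distribution is uniform, that is, $r^* = \lim_{k \to \infty} \frac{1}{n} e^{T} P^{k}$, where $n$ is the number of pages and $e$ is the $n$-dimensional column vector of all ones.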

Websites provide plenty of information:
• Pages in the same website may share the same IP, run on the same web server and database server, and be authored or maintained by the same person or organization.
• There might be high correlations between pages in the same website, in terms of content, page layout and hyperlinks.
• Websites contain a higher density of hyperlinks inside them (about 75%) and a lower density of edges between them.

HostGraph loses much transition information. Can a surfer jump from page 5 of site 1 to a page in site 2?

From: s06-pc-chairs-email@u.washington.edu
Sent: April 4, 2006, 8:36
To: Tie-Yan Liu; wangying@amss.ac.cn; fengg03@mails.thu.edu.cn; ybao@amss.ac.cn; mazm@amt.ac.cn
Subject: [SIGIR2006] Your Paper #191
Title: AggregateRank: Bring Order to Web Sites
Congratulations!!
29th Annual International Conference on Research & Development on Information Retrieval (SIGIR'06, August 6–11, 2006, Seattle, Washington, USA).

Ranking Websites, a Probabilistic View
Internet Mathematics, Volume 3 (2007), Issue 3
Ying Bao, Gang Feng, Tie-Yan Liu, Zhi-Ming Ma, and Ying Wang

• We suggest evaluating the importance of a website by the mean frequency with which it is visited by the Markov chain on the Internet graph that describes random surfing.
• We show that this mean frequency is equal to the sum of the PageRanks of all the web pages in that website (hence it is referred to as PageRankSum).

• We propose a novel algorithm (the AggregateRank Algorithm), based on the theory of stochastic complement, to calculate the rank of a website.
• The AggregateRank Algorithm can approximate the PageRankSum accurately, while its computational complexity is much lower than that of PageRankSum.

• By constructing return-time Markov chains restricted to each website, we also describe the probabilistic relation between PageRank and AggregateRank.
• The complexity and the error bound of the AggregateRank Algorithm, with experiments on real data, are discussed at the end of the paper.

n web pages in N websites.

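The stationary distribution, known as the PageRank vector, is given by $\pi(\alpha) P(\alpha) = \pi(\alpha)$ with $\pi(\alpha) e = 1$, where $P(\alpha)$ is the transition matrix of the random surfer with damping factor $\alpha$. We may rewrite the stationary distribution as $\pi(\alpha) = (\pi_1(\alpha), \pi_2(\alpha), \dots, \pi_N(\alpha))$, with $\pi_i(\alpha)$ as a row vector of length $n_i$, the number of pages in website $S_i$.

We define the one-step transition probability from the website $S_i$ to the website $S_j$ by $c_{ij}(\alpha) = \frac{\pi_i(\alpha)}{\|\pi_i(\alpha)\|_1} P_{ij}(\alpha) e$, where $P_{ij}(\alpha)$ is the block of $P(\alpha)$ holding the transitions from pages of $S_i$ to pages of $S_j$, and $e$ is an $n_j$-dimensional column vector of all ones.

The $N \times N$ matrix $C(\alpha) = (c_{ij}(\alpha))$ is referred to as the coupling matrix, whose elements represent the transition probabilities between websites. It can be proved that $C(\alpha)$ is an irreducible stochastic matrix, so that it possesses a unique stationary probability vector. We use $\xi(\alpha)$ to denote this stationary probability, which can be obtained from $\xi(\alpha) C(\alpha) = \xi(\alpha)$ with $\xi(\alpha) e = 1$.

Since $\pi(\alpha) P(\alpha) = \pi(\alpha)$, one can easily check that $\xi(\alpha) = (\|\pi_1(\alpha)\|_1, \|\pi_2(\alpha)\|_1, \dots, \|\pi_N(\alpha)\|_1)$ is the unique solution to $\xi(\alpha) C(\alpha) = \xi(\alpha)$ with $\xi(\alpha) e = 1$. We shall refer to $\xi(\alpha)$ as the AggregateRank.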

That is, the probability of visiting a website is equal to the sum of the PageRanks of all the pages in that website. This conclusion is consistent with our intuition.

The transition probability from Si to Sj actually summarizes all the cases in which the random surfer jumps from any page in Si to any page in Sj within a one-step transition. Therefore, the transition in this new HostGraph is in accordance with the real behavior of Web surfers. In this regard, the rank calculated from the coupling matrix C(α) will be more reasonable than those of previous works.
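The claim that the site-level chain has the summed PageRanks as its stationary distribution can be checked numerically on a small example. The sketch below assumes the stochastic-complement form of the coupling matrix, c_ij = (π_i / ‖π_i‖₁) P_ij e, a damping factor of 0.85 and a uniform jump from dangling pages; the helper names and the five-page toy web are hypothetical.

```python
def google_matrix(adj, alpha=0.85):
    """Row-stochastic P(alpha): follow a link w.p. alpha, else jump uniformly."""
    n = len(adj)
    rows = []
    for row in adj:
        out = sum(row)
        if out == 0:
            rows.append([1.0 / n] * n)  # dangling page: jump uniformly
        else:
            rows.append([alpha * x / out + (1.0 - alpha) / n for x in row])
    return rows


def stationary(matrix, iterations=2000):
    """Left stationary vector of a positive row-stochastic matrix."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return v


def coupling_matrix(p_alpha, pi, sites):
    """c_ij = (pi_i / ||pi_i||_1) P_ij e, with sites given as lists of page indices."""
    c = []
    for s_i in sites:
        mass = sum(pi[k] for k in s_i)  # ||pi_i||_1
        c.append([
            sum(pi[k] / mass * sum(p_alpha[k][l] for l in s_j) for k in s_i)
            for s_j in sites
        ])
    return c


# Hypothetical toy web: pages 0-2 form site 0, pages 3-4 form site 1.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
]
sites = [[0, 1, 2], [3, 4]]

p_alpha = google_matrix(adj)
pi = stationary(p_alpha)
c = coupling_matrix(p_alpha, pi, sites)
page_rank_sum = [sum(pi[k] for k in s) for s in sites]
# Apply C to the PageRankSum vector; stationarity means the result is unchanged.
xi_c = [sum(page_rank_sum[i] * c[i][j] for i in range(len(sites)))
        for j in range(len(sites))]
```

Comparing `xi_c` with `page_rank_sum` entry by entry shows that the vector of per-site PageRank sums is left unchanged by the coupling matrix, up to floating-point error.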

We have Let denote the number of visiting the website during the n times , that is

We define Assume a starting state in website A, i.e. and inductively It is clear that all the variables are stopping times for X.

Similarly, we have Let denote the transition matrix of the return-time Markov chain for site

Suppose that AggregateRank, i.e. the stationary distribution of is Since Therefore

Based on the above discussions, the direct approach to computing the AggregateRank ξ(α) is to accumulate PageRank values (denoted by PageRankSum).
• However, this approach is infeasible because computing PageRank is not a trivial task when the number of web pages is as large as several billion. Therefore, efficient computation becomes a significant problem.

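AggregateRank Algorithm
1. Divide the n × n matrix $P(\alpha)$ into N × N blocks according to the N sites.
2. Construct the stochastic matrix $P^*_{ii}(\alpha)$ for $P_{ii}(\alpha)$ by changing the diagonal elements of $P_{ii}(\alpha)$ to make each row sum up to 1.
3. Determine $u_i(\alpha)$ from $u_i(\alpha) P^*_{ii}(\alpha) = u_i(\alpha)$ with $u_i(\alpha) e = 1$.
4. Form an approximation $C^*(\alpha)$ to the coupling matrix $C(\alpha)$ by evaluating $(C^*(\alpha))_{ij} = u_i(\alpha) P_{ij}(\alpha) e$.
5. Determine the stationary distribution of $C^*(\alpha)$ and denote it $\xi^*(\alpha)$, i.e., $\xi^*(\alpha) C^*(\alpha) = \xi^*(\alpha)$ with $\xi^*(\alpha) e = 1$.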
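The five steps can be sketched on a small dense matrix as follows. This is an illustrative reading of the algorithm, not the authors' implementation: it assumes that "changing the diagonal elements" means adding each row's missing mass to its diagonal entry, uses plain power iteration for every stationary distribution, and relies on hypothetical helper names and a hypothetical five-page toy web.

```python
def google_matrix(adj, alpha=0.85):
    """Row-stochastic P(alpha): follow a link w.p. alpha, else jump uniformly."""
    n = len(adj)
    rows = []
    for row in adj:
        out = sum(row)
        if out == 0:
            rows.append([1.0 / n] * n)  # dangling page: jump uniformly
        else:
            rows.append([alpha * x / out + (1.0 - alpha) / n for x in row])
    return rows


def stationary(matrix, iterations=2000):
    """Left stationary vector of a positive row-stochastic matrix."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return v


def aggregate_rank(p_alpha, sites):
    """Approximate site ranks; sites are lists of page indices (step 1)."""
    u = []
    for s in sites:
        # Step 2: take the diagonal block P_ii and push the missing row mass
        # onto the diagonal so that each row sums to 1.
        block = [[p_alpha[k][l] for l in s] for k in s]
        for r, row in enumerate(block):
            row[r] += 1.0 - sum(row)
        # Step 3: stationary distribution u_i of the adjusted block.
        u.append(stationary(block))
    # Step 4: approximate coupling matrix, c*_ij = u_i P_ij e.
    c_star = [
        [sum(u_i[a] * sum(p_alpha[k][l] for l in s_j) for a, k in enumerate(s_i))
         for s_j in sites]
        for u_i, s_i in zip(u, sites)
    ]
    # Step 5: stationary distribution of the approximate coupling matrix.
    return stationary(c_star)


# Hypothetical toy web: pages 0-2 form site 0, pages 3-4 form site 1.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
]
sites = [[0, 1, 2], [3, 4]]

p_alpha = google_matrix(adj)
xi_star = aggregate_rank(p_alpha, sites)
# Reference value: PageRankSum, obtained from the full PageRank vector.
pi = stationary(p_alpha)
page_rank_sum = [sum(pi[k] for k in s) for s in sites]
```

The point of the design is that steps 2 and 3 only ever work on one site's block at a time, and step 5 works on an N × N matrix, so the full n × n stationary problem is never solved; the `page_rank_sum` line is included only as a reference for comparison on the toy data.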

Experiments
• In our experiments, the data corpus is the benchmark data for the Web track of TREC 2003 and 2004, which was crawled from the .gov domain in 2002.
• It contains 1,247,753 web pages in total.

We get 731 sites in the .gov dataset. The largest website contains 137,103 web pages while the smallest one contains only 1 page.

Performance Evaluation of Ranking Algorithms based on Kendall's distance
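As a rough illustration of how two site rankings can be compared, the sketch below computes a Kendall-style distance: the fraction of item pairs that the two rankings place in opposite order. The normalisation by the number of pairs, the function name and the example rankings are assumptions; the slides do not state which variant was used.

```python
from itertools import combinations


def kendall_distance(rank_a, rank_b):
    """Fraction of item pairs ordered differently by two rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    # A pair is discordant when the two rankings disagree on its relative order.
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


# Hypothetical rankings of three sites, best first.
reference = ["site1", "site2", "site3"]
candidate = ["site1", "site3", "site2"]
distance = kendall_distance(reference, candidate)
```

A distance of 0 means the two rankings agree on every pair, and a distance of 1 means one ranking is the exact reverse of the other.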
