Extracting knowledge from the World Wide Web

Presentation Transcript


  1. Extracting knowledge from the World Wide Web Monika Henzinger and Steve Lawrence Google Inc. Presented by Murat Şensoy

  2. Objective • The World Wide Web provides an exceptional opportunity to automatically analyze a large sample of interests and activity in the world. But how can we extract knowledge from the web? The challenge: the distributed and heterogeneous nature of the web makes large-scale analysis difficult.

  3. Objective • The paper provides an overview of recent methods for: • Sampling the Web • Analyzing and Modeling Web Growth

  4. Sampling the Web • Due to the sheer size of the Web, even simple statistics about it are unknown. • The ability to sample web pages or web servers uniformly at random is very useful for determining such statistics. • The question is: how can we sample the Web uniformly?

  5. Sampling the Web Two well-known sampling methods for the Web are: • Random Walk • IP Address Sampling

  6. Sampling the Web with Random Walk Main idea: visit pages with probability proportional to their PageRank values, then sample the visited pages with probability inversely proportional to their PageRank values. Thus, the probability that a page is sampled is a constant, independent of the page.
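
One way to make the second step concrete is a thinning (rejection) pass over the visited pages. This is only a sketch: the visited-page list, the PageRank estimates, and the constant c are assumed inputs, not part of the original method's description.

```python
import random

def subsample_uniformly(visited, pagerank):
    """Keep each visited page with probability c / PageRank(page), where c is a
    constant small enough that every acceptance probability is at most 1.
    Because the visit probability is roughly proportional to PageRank, the two
    factors cancel and each page is kept with roughly the same probability.

    visited:  list of visited page identifiers (hypothetical input)
    pagerank: dict page -> estimated PageRank value (hypothetical input)
    """
    c = min(pagerank[p] for p in visited)   # ensures c / pagerank[p] <= 1
    return [p for p in visited if random.random() < c / pagerank[p]]
```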

  7. PageRank PageRank has several definitions. Google's creators Brin and Page published the definition of PageRank as used in Google: Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.

  8. PageRank PageRank has another definition based on a random walk. - The initial page is chosen at random from all pages. - Suppose the walk is at page p at a given time step. - With probability d, follow an out-link of p. - With probability 1-d, jump to a page selected at random from all pages. The PageRank of a page p is the fraction of steps that the walk spends at p in the limit.
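
As an illustration of this random-walk definition, the following sketch simulates the walk on a toy in-memory graph (the graph, damping value, and step count are arbitrary illustration choices) and reports the fraction of steps spent at each page.

```python
import random
from collections import Counter

def random_surfer_pagerank(graph, d=0.85, steps=100_000):
    """Estimate PageRank as the fraction of steps the walk spends at each page.

    graph: dict mapping page -> list of out-linked pages
    """
    pages = list(graph)
    current = random.choice(pages)          # initial page chosen at random
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        if random.random() < d and graph[current]:
            current = random.choice(graph[current])   # follow an out-link of p
        else:
            current = random.choice(pages)            # jump to a random page
    return {p: visits[p] / steps for p in pages}

# Toy example (hypothetical three-page graph)
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(random_surfer_pagerank(toy))
```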

  9. PageRank Two problems arise in the implementation: • The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve. • Many hosts on the web have a large number of links within the same host and very few leaving it.

  10. PageRank Henzinger et al. proposed and implemented a modified random walk: - Given a set of initial pages, choose the start page at random from the initial pages. - Suppose the walk is at page p at a given time step. - With probability d, follow an out-link of p. - With probability 1-d, select a random host among the visited hosts, then jump to a page selected at random from all pages visited on that host so far. - All pages in the initial set are also considered to be visited.
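
A rough sketch of this modified walk, under the simplifying assumption that the crawl runs over an in-memory graph and that a hypothetical host() helper maps a page to its host:

```python
import random
from collections import defaultdict

def modified_random_walk(graph, host, seeds, d=0.85, steps=10_000):
    """Henzinger-style walk: with probability d follow an out-link, otherwise
    jump to a random already-visited page on a random already-visited host.

    graph: dict page -> list of out-linked pages
    host:  function page -> host name (hypothetical helper)
    seeds: initial set of pages, all treated as already visited
    """
    visited_by_host = defaultdict(set)
    for p in seeds:                           # seed pages count as visited
        visited_by_host[host(p)].add(p)
    current = random.choice(list(seeds))
    visited_order = []
    for _ in range(steps):
        visited_order.append(current)
        visited_by_host[host(current)].add(current)
        if random.random() < d and graph.get(current):
            current = random.choice(graph[current])            # follow an out-link
        else:
            h = random.choice(list(visited_by_host))           # random visited host
            current = random.choice(list(visited_by_host[h]))  # random visited page on it
    return visited_order
```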

  11. Sampling the Web with Random Walk The modified random walk visits a page with probability approximately proportional to its PageRank value. Afterward, the visited pages are sampled with probability inversely proportional to their PageRank value. Thus, the probability that a page is sampled is a constant independent of the page.

  12. Sampling the Web with Random Walk An example of statistics generated using this approach:

  13. Sampling the Web with IP Address Sampling • IPv4 addresses: 4 bytes • IPv6 addresses: 16 bytes There are about 4.3 billion possible IPv4 addresses. IP address sampling randomly samples IP addresses and tests for a web server at the standard port (http: 80 or https: 443). This approach works only for IPv4; the IPv6 address space, with 2^128 addresses, is too large to explore.
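
A minimal sketch of the probing step, assuming that a plain TCP connection attempt on port 80 is enough to count an address as a web server (a real study would also issue an HTTP request, retry failures, and skip reserved address ranges):

```python
import random
import socket

def sample_ipv4(n, port=80, timeout=1.0):
    """Probe n random IPv4 addresses; return those accepting a connection on `port`."""
    responders = []
    for _ in range(n):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                responders.append(ip)
        except OSError:
            pass  # no reachable web server at this address
    return responders
```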

  14. Sampling the Web with IP Address Sampling Solution: Check Multiple Times

  15. Sampling the Web with IP Address Sampling This method finds many web servers that would not normally be considered part of the publicly indexable web: • Servers with authorization requirements • Servers with no content • Hardware that provides a web interface

  16. Sampling the Web with IP Address Sampling A number of issues lead to minor biases: • An IP address may host several web sites • Multiple IP addresses may serve identical content • Some web servers may not use the standard port There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content. Solution: use the domain name system.

  17. Sampling the Web with IP Address Sampling Analyses from the same study: only 34.2% of servers contained the common "keyword" or "description" meta-tags on their homepage. Such low usage of a simple HTML metadata standard suggests that acceptance of more complex standards, such as XML, will be very slow. The distribution of server types was found by sampling 3.6 million IP addresses in February 1999. Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109.

  18. Discussion On Sampling the Web Current techniques exhibit biases and do not achieve a uniform random sample. • Any implementation of the random walk is limited to a finite-length walk. • For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.

  19. Analyzing and Modeling Web Growth We can also extract valuable information by analyzing and modeling the growth of pages and links on the web. The Web's degree distribution follows a power law, both for the in-link and for the out-link distribution.

  20. Analyzing and Modeling Web Growth This observation led to the design of various models for the Web. • Preferential Attachment of Barabasi et al. • Mixed Model of Pennock et al. • Copy Model of Kleinberg et al. • The Hostgraph Model

  21. Preferential Attachment As the network grows, the probability that a given node receives an edge is proportional to that node's current connectivity: 'rich get richer'. The probability that a new node is connected to node u is k_u / Σ_j k_j, where k_u is the current degree of u.
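
A small sketch of this growth process (m and the final size n are arbitrary illustration values): each new node attaches m edges, and targets are drawn with probability proportional to their current degree by sampling from a degree-weighted pool.

```python
import random

def preferential_attachment(n, m=2):
    """Grow an undirected graph where each new node attaches m edges,
    choosing targets with probability proportional to their current degree."""
    degree = {i: m for i in range(m + 1)}                 # small fully connected seed
    pool = [i for i in range(m + 1) for _ in range(m)]    # node repeated once per degree unit
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(random.choice(pool))               # degree-weighted draw
        degree[new] = m
        for t in chosen:
            degree[t] += 1
            pool.extend([new, t])                         # keep the pool in sync with degrees
    return degree
```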

  22. Preferential Attachment The model suggests that for a node u created at time t_u, the expected degree is m(t/t_u)^0.5, so older pages should get rich faster than newer pages; there is no evidence of this on the real web. The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1. In reality, different link distributions are observed among web pages of the same category.

  23. Winners don't take all The early models fail to account for significant deviations from power-law scaling common in almost all studied networks. For example, among web pages of the same category, link distributions can diverge strongly from power-law scaling, exhibiting a roughly log-normal distribution. Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.

  24. Winners don’t take all NEC researchers (Pennock et al.) discovered that the degree of "rich get richer" or "winners take all" behavior varies in different categories and may be significantly less than previously thought.

  25. Winners don't take all Pennock et al. introduced a new model of network growth, mixing uniform and preferential attachment, that accurately accounts for the true connectivity distributions found in web categories, the web as a whole, and other social and biological networks.
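
A sketch in the spirit of such a mixed model (the mixing parameter alpha is an arbitrary illustration value, not one fitted to web data): each new edge picks its target uniformly at random with probability alpha and proportionally to degree otherwise.

```python
import random

def mixed_attachment(n, m=2, alpha=0.5):
    """Grow a graph where each new node adds m edges; every edge picks its
    target uniformly at random with probability alpha, and with probability
    1 - alpha proportionally to current degree (preferentially)."""
    degree = {i: m for i in range(m + 1)}                 # small fully connected seed
    pool = [i for i in range(m + 1) for _ in range(m)]    # degree-weighted pool
    for new in range(m + 1, n):
        targets = []
        for _ in range(m):
            if random.random() < alpha:
                targets.append(random.choice(list(degree)))   # uniform over existing nodes
            else:
                targets.append(random.choice(pool))           # proportional to degree
        degree[new] = m
        for t in targets:
            degree[t] += 1
            pool.extend([new, t])
        # Parallel edges are possible; that is acceptable for a sketch of the distribution.
    return degree
```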

  26. Winners don’t take all The numbers represent the degree to which link growth is preferential (new links are created to already popular sites).

  27. Copy Model Kleinberg et al. explained the power-law in-link distribution with a copy model that constructs a directed graph. At each step, a new node u is added with d out-links, and an existing node v (the prototype) is chosen uniformly at random. For the j-th out-link of u: with probability α, copy the destination of the j-th out-link of v; with probability 1-α, choose the destination uniformly at random among existing nodes. This model is also a mixture of uniform and preferential influences on network growth.
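
A sketch of the copy process described above (d and alpha are illustration values; the graph is kept as out-link lists):

```python
import random

def copy_model(n, d=3, alpha=0.8):
    """Directed copy model: each new node u gets d out-links. A prototype v is
    picked uniformly at random; the j-th link of u copies the j-th link of v
    with probability alpha, and otherwise points to a uniformly random node."""
    # Seed: d+1 nodes, each linking to all the others (so each has d out-links).
    out = {i: [j for j in range(d + 1) if j != i] for i in range(d + 1)}
    for u in range(d + 1, n):
        v = random.randrange(u)                    # prototype chosen uniformly at random
        links = []
        for j in range(d):
            if random.random() < alpha:
                links.append(out[v][j])            # copy v's j-th destination
            else:
                links.append(random.randrange(u))  # uniform over existing nodes
        out[u] = links
    return out
```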

  28. The Hostgraph Model • Models the Web on the host or domain level. • Each node represents a host. • Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.

  29. The Hostgraph Model Bharat et al. show that the weighted in-link and the weighted out-link distributions in the host graph follow a power law with exponents 1.62 and 1.67, respectively. However, the number of hosts with small degree is considerably smaller than predicted by the model; there is a "flattening" of the curve for low-degree hosts.

  30. The Hostgraph Model Bharat et al. made a modification to the copy model, called the re-link model, to explain this flattening. With probability β, a new node u with d out-links is added; with probability 1-β, no new node is added and instead an existing node is selected and given d additional out-links. In either case, an existing node v is chosen uniformly at random and d of its links are selected uniformly at random. Each new link copies the corresponding selected link of v with probability α, and with probability 1-α its destination is chosen uniformly at random among existing nodes. Because with probability 1-β no new node is added, the number of low-degree nodes is reduced.
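
A sketch of the re-link variant under the same simplifying assumptions as the copy-model sketch (alpha and beta are illustration placeholders for the copy and new-node probabilities):

```python
import random

def relink_model(n_steps, d=3, alpha=0.8, beta=0.75):
    """Re-link variant of the copy model: with probability beta a new node is
    added with d out-links; with probability 1 - beta an existing node gets d
    additional out-links. Each link copies one of a random prototype's links
    with probability alpha, and otherwise picks a uniformly random node."""
    # Seed: d+1 nodes, each linking to all the others.
    out = {i: [j for j in range(d + 1) if j != i] for i in range(d + 1)}
    for _ in range(n_steps):
        nodes = list(out)
        if random.random() < beta:
            u = len(out)                           # add a brand-new node
            out[u] = []
        else:
            u = random.choice(nodes)               # re-link an existing node
        v = random.choice(nodes)                   # prototype chosen uniformly at random
        copy_from = random.sample(out[v], min(d, len(out[v])))  # d of v's links
        for j in range(d):
            if j < len(copy_from) and random.random() < alpha:
                out[u].append(copy_from[j])        # copy one of v's destinations
            else:
                out[u].append(random.choice(nodes))  # uniform destination
    return out
```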

  31. The Hostgraph Model

  32. Communities on the Web • Identification of communities on the web is valuable. Practical applications include: • Automatic web portals • Focused search engines • Content filtering • Complementing text-based searches Community identification also allows for analysis of the entire web and the objective study of relationships within and between communities.

  33. Communities on the Web Flake et al. define a web community as: a collection of web pages such that each member page has more hyperlinks within the community than outside of the community. Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
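
A small sketch that merely checks this definition for a given candidate set of pages (the graph and the candidate set are hypothetical inputs; actually finding such sets is the hard part, which Flake et al. address with a maximum-flow formulation):

```python
def is_community(graph, members):
    """Return True if every member page has more links to pages inside
    `members` than to pages outside it.

    graph:   dict page -> set of linked pages (treated as undirected here)
    members: set of candidate community pages
    """
    members = set(members)
    for page in members:
        neighbors = graph.get(page, set())
        inside = len(neighbors & members)
        outside = len(neighbors - members)
        if inside <= outside:        # fails "more links within than outside"
            return False
    return True
```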

  34. Communities on the Web

  35. Communities on the Web • There are alternative ways to identify web communities: • Kumar et al. consider dense bipartite subgraphs as indications of communities. • Other approaches: • Bibliometric methods such as co-citation and bibliographic coupling • The PageRank algorithm • The HITS algorithm • Bipartite subgraph identification • Spreading activation energy

  36. Conclusion There are still many open problems: • The problem of uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases? • Web growth models approximate the true nature of how the web grows: how can the current models be refined to improve accuracy, while keeping the models relatively simple and easy to understand and analyze? • Finally, community identification remains an open area: how can the accuracy of community identification be improved, and how can communities be best structured or presented to account for differences of opinion in what is considered a community?

  37. Thanks For Your Patience. Questions?

  38. Appendix

  39. Google's PageRank We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
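
A sketch of the simple iterative computation of this formula on a toy graph (the graph, damping factor, and iteration count are illustrative; this is the non-normalized variant shown above, so the values sum to the number of pages rather than to 1):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over all
    pages T that link to A, starting every page at PR = 1.

    links: dict mapping each page to the list of pages it links to
    """
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {a: (1 - d) + d * sum(pr[t] / out_count[t]
                                   for t in pages
                                   if a in links[t] and out_count[t] > 0)
              for a in pages}
    return pr

# Toy example (hypothetical three-page graph)
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```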

  40. Google's PageRank • Example: d = 0.85, PageRank(C) = 0.15 + 0.85( 1.49/2 + 0.78/1 ) = 1.45. PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation (Brin & Page 98).
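
The arithmetic in the example can be checked directly (the two incoming PageRank values and out-link counts are taken from the example above):

```python
# PR(C) with d = 0.85, given two pages pointing to C: one with PR 1.49 and
# 2 out-links, the other with PR 0.78 and 1 out-link.
pr_c = 0.15 + 0.85 * (1.49 / 2 + 0.78 / 1)
print(round(pr_c, 2))   # prints 1.45
```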

  41. The Hostgraph Model
