Navigating the Web: Notes on Virtual Robots and Relevancy

The Fragmented Web. Notes on Chapter 12, for In765. Judith Molka-Danielsen

1. Virtual robots
• Virtual robots read and index web pages.
• The Web would be hard to navigate without them.
• But some pages are never mapped.
• Simple search engines can return too many hits.
• Meta-search engines select hits across several engines.
• www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html

Steve Lawrence and C. Lee Giles attempted to measure the size of the Web in 1999: http://www.neci.nj.nec.com/homepages/lawrence/websize.html

2. Relevancy
• Finding the “best” page is more important than finding the “most” pages.
• Notes on Searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm

Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example
• According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated using this formula:
• PR(A) = (1 - d) + d * SUM(PR(I->A)/C(I))
• Where:
• PR(A) is the PageRank of your page A.
• d is the damping factor, usually set to 0.85.
• PR(I->A) is the PageRank of page I containing a link to page A.
• C(I) is the number of links off page I.
• PR(I->A)/C(I) is the PR-value page A receives from page I.
• SUM(PR(I->A)/C(I)) is the sum of all PR-values page A receives from pages with links to page A.
• In other words: the PR of page A is determined by the PR of every page I that has a link to page A. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85. Finally, 0.15 is added to this result, and this number represents the PR of page A.
• What is your PageRank? http://www.klid.dk/pagerank.php?url=
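The formula above defines each page's PR in terms of the PR of other pages, so one common way to evaluate it is to start from a guess and re-apply the formula until the values settle. The following is a minimal sketch of that idea, not Google's actual implementation. The function name `pagerank`, the `links` dictionary, the starting value of 1.0, the fixed number of iterations, and the three-page example graph are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * SUM(PR(I)/C(I)).

    `links` (hypothetical input) maps each page to the list of pages it
    links to; each target is assumed to be listed at most once.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumed starting guess: every page begins with PR = 1.0.
    pr = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(I)/C(I) over every page I that links to `page`.
            received = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * received
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this small example page C ends up with the highest PR, because it receives all of B's vote and half of A's, while B receives only half of A's vote.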

by Greg R. Notess.

Older Reports with Largest Three at that Time

Freshness

Billions of Textual Documents Indexed, December 1995-September 2003: http://searchenginewatch.com/reports/article.php/2156481

3. URLs are directed links. Andrei Broder (2000)
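Because links are directed, being able to browse from one page to another does not mean you can browse back. A minimal sketch of this, assuming a hypothetical four-page link graph (the `links` dictionary and the `reachable` function are illustrative names, not taken from Broder's study):

```python
from collections import deque


def reachable(links, start):
    """Return the set of pages reachable from `start` by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical graph: D -> A -> B <-> C. Nothing links back to A or D.
links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["A"]}

print("C" in reachable(links, "A"))  # can a surfer get from A to C?
print("A" in reachable(links, "C"))  # can a surfer get from C back to A?
```

Here a surfer starting at A can reach C, but a surfer starting at C can never reach A, and no page at all leads to D.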

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
Static HTML • Database-driven/on-demand

4. Defining Web-based communities
• 15% of web pages have links to opposing views.
• 60% of web pages have links to like views.
• Social segmentation is self-reinforcing.
• Beliefs and affiliations have become public information, represented in links and visits.
• Web-based communities are hard to identify.
• No boundaries; different sizes; differently organized.
• A set of pages with more internal links than outside links may be identified as a community. But there is no efficient algorithm for finding such sets.
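Checking whether a given set of pages meets the "more internal links than outside links" rule is straightforward; the hard part is finding such sets among all possible groupings. The sketch below shows only the check, under one reading of the rule: every page in the candidate set must link to more pages inside the set than outside it. The function `is_community`, the `links` dictionary, and the example pages are hypothetical, and counting only outgoing links is an assumption.

```python
def is_community(candidate, links):
    """Return True if every page in `candidate` has more links to pages
    inside the set than to pages outside it (outgoing links only)."""
    members = set(candidate)
    for page in members:
        targets = links.get(page, [])
        inside = sum(1 for target in targets if target in members)
        outside = len(targets) - inside
        if inside <= outside:
            return False
    return True


# Hypothetical link graph: a, b and c link mostly to each other.
links = {
    "a": ["b", "c"],
    "b": ["a", "c", "x"],
    "c": ["a", "b"],
    "x": ["y"],
    "y": [],
}

print(is_community({"a", "b", "c"}, links))
print(is_community({"b", "x"}, links))
```

Testing one candidate is cheap, but the number of candidate sets grows exponentially with the number of pages, which is consistent with the note that no efficient algorithm is available.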

Other points…
• 5. Technology can allow more control over individuals: it can identify them and track them.
• Web topology (the architecture created by self-selecting where to link) limits our actions (browsing; some pages are invisible) more than the code does (attempts at control, laws).
• 6. The Internet Archive has been maintained since 1996 by Brewster Kahle. Some data will never go away.
• http://www.archive.org/ (Try the Wayback Machine.)
• 7. The Web is complex and self-organized. The book started by looking at the macrostructure. The last chapters will look at the smaller groupings.
