The Fragmented Web
Notes on Chapter 12
For In765
Judith Molka-Danielsen
1. Virtual robots
• Virtual robots read and index web pages.
• The Web would be hard to navigate without them.
• But some pages are never mapped.
• Simple search engines can return too many hits.
• Meta-search engines select hits across several engines.
• www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
Steve Lawrence and C. Lee Giles: an attempt to measure the Web in 1999.
http://www.neci.nj.nec.com/homepages/lawrence/websize.html
2. Relevancy
• Finding the “best” page is more important than finding the “most” pages.
• Notes on Searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm
Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example
• According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated with this formula:
• PR(A) = (1 - d) + d * SUM(PR(I->A)/C(I))
• Where:
• PR(A) is the PageRank of your page A.
• d is the damping factor, usually set to 0.85.
• PR(I->A) is the PageRank of a page I containing a link to page A.
• C(I) is the number of links off page I.
• PR(I->A)/C(I) is the PR value page A receives from page I.
• SUM(PR(I->A)/C(I)) is the sum of all PR values page A receives from pages that link to page A.
• In other words: the PR of page A is determined by the PR of every page I that has a link to page A. For every such page I, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85 (the damping factor d); finally 0.15 (that is, 1 - d) is added, and the result is the PR of page A.
• What is your PageRank? http://www.klid.dk/pagerank.php?url=
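The formula above can be computed iteratively: start every page at the same value and recompute until the numbers settle. A minimal Python sketch follows, using exactly the slide's formula PR(A) = (1 - d) + d * SUM(PR(I)/C(I)); the three-page link graph is an invented example, not from the chapter.

```python
# Iterative PageRank sketch based on the slide's formula.
# links: dict mapping each page to the list of pages it links to.

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # a common starting value
    for _ in range(iterations):
        new = {}
        for a in pages:
            # SUM(PR(I)/C(I)) over every page I that links to A,
            # where C(I) is the number of links off page I
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if a in links[i])
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr

# Invented example graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
```

With this (non-normalized) variant of the formula, the PR values sum to the number of pages at convergence; C ends up highest here because it collects links from both A and B.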
Billions of Textual Documents Indexed, December 1995 - September 2003
http://searchenginewatch.com/reports/article.php/2156481
3. URLs are directed links. Andrei Broder (2000)
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
• Static HTML vs. database-driven/on-demand pages
4. Defining Web-based communities
• 15% of web pages have links to opposing views.
• 60% of web pages have links to like views.
• Social segmentation is self-reinforcing.
• Beliefs and affiliations have become public information, represented in links and visits.
• Web-based communities are hard to identify: they have no boundaries, come in different sizes, and are organized differently.
• Pages with more internal links than outside links may be identified as a community. But there is no efficient algorithm.
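The last bullet's heuristic is easy to sketch: given a candidate set of pages, count links that stay inside the set versus links that leave it. A naive Python illustration follows; the page names and link graph are invented, and (as the slide notes) this is only a heuristic with no efficient algorithm behind it.

```python
# Naive "more internal than external links" community check.
# candidate: a set of page names; links: dict page -> list of linked pages.

def looks_like_community(candidate, links):
    internal = external = 0
    for page in candidate:
        for target in links.get(page, []):
            if target in candidate:
                internal += 1   # link stays inside the candidate set
            else:
                external += 1   # link points outside the candidate set
    return internal > external

# Invented example: three fan pages linking mostly to each other.
links = {
    "fanpage1": ["fanpage2", "fanpage3"],
    "fanpage2": ["fanpage1"],
    "fanpage3": ["fanpage1", "news"],
    "news": ["weather"],
}
print(looks_like_community({"fanpage1", "fanpage2", "fanpage3"}, links))
```

Here the three fan pages have four internal links and only one outgoing link, so the check succeeds; the real difficulty, per the chapter, is searching over all possible candidate sets efficiently.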
Other points…
• 5. Technology can allow more control over individuals: identifying them, tracking them.
• Web topology (an architecture built by self-selected linking) limits our actions (browsing; some pages are invisible) more than the code does (attempts at control, laws).
• 6. The Internet Archive has been maintained since 1996 by Brewster Kahle. Some data will never go away.
• http://www.archive.org/ (Try the Wayback Machine.)
• 7. The Web is complex and self-organized. The book started by looking at the macrostructure; the last chapters will look at smaller groupings.