
Web Characterization



Presentation Transcript


  1. Web Characterization Week 9 LBSC 690 Information Technology

  2. Outline • What is the Web? • What’s on the Web? • What is the nature of the Web? • Preserving the Web

  3. Defining the Web • HTTP, HTML, or URL? • Static, dynamic or streaming? • Public, protected, or internal?

  4. Economics of the Web in 1995 • Affordable storage: 300,000 words/$ • Adequate backbone capacity: 25,000 simultaneous transfers • Adequate “last mile” bandwidth: 1 second/screen • Display capability: 10% of US population • Effective search capabilities: Lycos (a role since taken over by Google), Yahoo

  5. Nature of the Web • Over one billion pages by 1999 • Growing at 25% per month! • Google indexed about 3 billion pages in 2003 • Unstable: changing at 1% per week • Redundant: 30-40% (near) duplicates, e.g., Unix man page trees
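
To put the 25% per month figure in perspective, the sketch below works through the compounding arithmetic. It assumes the rate holds steady and compounds monthly, which the slide does not claim; the function names are illustrative only.

```python
import math


def doubling_time(rate_per_period: float) -> float:
    """Periods needed to double at a constant compound growth rate."""
    return math.log(2) / math.log(1 + rate_per_period)


def growth_factor(rate_per_period: float, periods: int) -> float:
    """Total multiplication after the given number of periods."""
    return (1 + rate_per_period) ** periods


if __name__ == "__main__":
    # Assumption: 25% per month, sustained and compounding.
    print(f"doubling time: {doubling_time(0.25):.1f} months")
    print(f"growth over 12 months: {growth_factor(0.25, 12):.1f}x")
```

Under that assumption the Web would double roughly every 3.1 months and grow about 14.6-fold in a year, which suggests why such a rate could not have persisted for long.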

  6. Source: Michael Lesk, How Much Information is there in the World?

  7. Number of Web Sites

  8. Web Sites by Country, 2002

  9. What’s a Web “Site”? • OCLC counts any server at port 80 • Misses many servers at other ports • Some servers host unrelated content (e.g., Geocities) • Some content requires specialized servers (e.g., RTSP)

  10. World Trade in 2001 Source: World Trade Organization

  11. Global Internet User Population, 2000 and 2005 [Figure: charts for 2000 and 2005; surviving labels: English, Chinese] Source: Global Reach

  12. Widely Spoken Languages Source: http://www.g11n.com/faq.html

  13. Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

  14. Web Page Languages Source: Jack Xu, Excite@Home, 1999

  15. European Web Size: Exponential Growth Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

  16. European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

  17. Live Streams • Almost 2,000 Internet-accessible radio and television stations Source: www.real.com, Feb 2000

  18. Streaming Media • SingingFish indexes 35 million streams • 60% of queries are for music, followed by movies, then sports, then news

  19. Crawling the Web

  20. Web Crawl Challenges • Temporary server interruptions • Discovering “islands” and “peninsulas” • Duplicate and near-duplicate content • Dynamic content • Link rot • Server and network loads • Have I seen this page before?
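
The last challenge, "Have I seen this page before?", is often handled at the URL level first. The following is a minimal sketch, assuming that normalizing case, default ports, and fragments is enough to catch trivially different spellings of one address. `normalize` and `SeenSet` are hypothetical names, and the example.com addresses are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the address.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


class SeenSet:
    """Remembers which normalized URLs a crawler has already visited."""

    def __init__(self):
        self._seen = set()

    def first_visit(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    seen = SeenSet()
    print(seen.first_visit("HTTP://Example.COM:80/index.html#top"))
    print(seen.first_visit("http://example.com/index.html"))
```

This only recognizes the same address written differently. Recognizing the same content at different addresses needs content-based duplicate detection.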

  21. Duplicate Detection • Structural: identical directory structure (e.g., mirrors, aliases) • Syntactic: identical bytes; identical markup (HTML, XML, …) • Semantic: identical content; similar content (e.g., with a different banner ad); related content (e.g., translated)
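
Two of the cases above lend themselves to a short illustration. The sketch below checks "identical bytes" with a hash and estimates "similar content" with word shingles and Jaccard similarity. Shingling is one common technique chosen here for illustration; the slides do not name a method. The function names and sample sentences are made up.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Identical bytes always yield the same digest."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences found in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over the lazy cat"
    # Different bytes, so the exact-match test fails ...
    print(fingerprint(page_a.encode()) == fingerprint(page_b.encode()))
    # ... but most three-word shingles are shared.
    print(jaccard(shingles(page_a), shingles(page_b)))
```

A crawler would compare the similarity score against a threshold that it has to choose for itself. Translated or otherwise related content is not caught by either test.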

  22. Robots Exclusion Protocol • Based on voluntary compliance by crawlers • Exclusion by site: create a robots.txt file at the server’s top level and indicate which directories not to crawl • Exclusion by document (in HTML head): <meta name="robots" content="noindex,nofollow"> (not implemented by all crawlers)
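
As an illustration of exclusion by site, the sketch below shows how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name "ExampleBot", and the example.com addresses are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might appear at a server's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        allowed = parser.can_fetch("ExampleBot", url)
        print(url, "allowed" if allowed else "excluded")
```

Nothing enforces the answer: the check only matters if the crawler chooses to honor it, which is the voluntary compliance the slide refers to.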

  23. Link Structure of the Web

  24. The Deep Web • Dynamic pages, generated from databases • Not easily discovered using crawling • Perhaps 400-500 times larger than surface Web • Fastest growing source of new information

  25. Content of the Deep Web

  26. Deep Web • 60 Deep Sites Exceed Surface Web by 40 Times

      Name | Type | URL | Web Size (GBs)
      National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
      NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
      National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
      Alexa | Public (partial) | http://www.alexa.com/ | 15,860
      Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
      MP3.com | Public | http://www.mp3.com/ |

  27. Hands on: The Wayback Machine • Internet Archive • Stored Alexa.com Web crawls since 1997 • http://archive.org • Check out Maryland’s Web site in 1997 • Check out the history of your favorite site
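
For the hands-on exercise, archived pages can also be reached by address instead of through the search box. The helper below is a sketch that assumes the Wayback Machine's commonly seen address pattern (a timestamp followed by the original URL, served from web.archive.org), which is not stated on the slide. `wayback_url` is a hypothetical name and the example.com address is a placeholder for whichever site is being looked up.

```python
def wayback_url(original_url: str, timestamp: str) -> str:
    """Build an address for an archived copy of a page near a given time.

    The timestamp is digits in YYYYMMDDhhmmss order. A prefix such as
    "1997" is accepted here on the assumption that the archive picks a
    capture close to that date.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits, e.g. '1997'")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Placeholder address; substitute the site of interest.
    print(wayback_url("http://www.example.com/", "1997"))
```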

  28. Discussion Point • Can we save everything? • Should we? • Do people have a right to remove things?
