The Invisible Web Chris Sherman Editor, SearchDay SearchEngineWatch.com Information Online 2003 Sydney, Australia January 23, 2003
Overview • How Search Engines Work • What is the Invisible Web? • Tactics for Searching the Invisible Web • Future Trends
The Parts of a Search Engine • Three main parts of every search engine: • The Crawler (aka spider) • The Indexer • The Search Engine Database
How Search Engines Work • Diagram labels: The Web; Crawler; URL1, URL2, URL3, URL4; Indexer; Search Engine Database; Your Browser • Sample page: “All About Eggs” by S. I. Am • Query “Eggs?” returns ranked matches: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%
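As a rough illustration of how an indexer and a search engine database can produce ranked matches like those on this slide, the Python sketch below builds a tiny inverted index and scores each page by the fraction of query terms it contains. The page texts, the `build_index` and `search` names, and the scoring rule are all assumptions made for illustration; real engines rank with far more elaborate signals.

```python
import re
from collections import defaultdict

# Hypothetical page texts standing in for crawled documents.
PAGES = {
    "URL1": "All about eggs: how to boil eggs and fry eggs",
    "URL2": "Eggo waffles for breakfast",
    "URL3": "Green eggs and ham",
    "URL4": "The ego and the id",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return (url, score) pairs, best first.

    The score is the fraction of distinct query terms found on the page,
    an assumed toy ranking rule.
    """
    terms = set(tokenize(query))
    if not terms:
        return []
    hits = defaultdict(int)
    for term in terms:
        for url in index.get(term, ()):
            hits[url] += 1
    ranked = [(url, count / len(terms)) for url, count in hits.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

INDEX = build_index(PAGES)
```

Calling `search(INDEX, "green eggs")` ranks URL3 ahead of URL1, because URL3 contains both query terms and URL1 only one.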
How Crawlers Work • Crawlers are like hyper-caffeinated browsers • Seeded with a set of URLs • Download Web pages, then: • Extract all links on every page for further crawling • Hand the page off to the indexer
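The crawl loop described on this slide can be sketched in a few lines of Python. To stay self-contained, the sketch crawls a hypothetical in-memory `WEB` dictionary instead of the network; `crawl`, `LinkExtractor`, and the sample URLs are illustrative assumptions, not any real engine's design.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the Web: URL -> HTML source.
WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
    "http://example.com/orphan.html": "Nobody links here.",
}

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, index_page):
    """Breadth-first crawl: fetch a page, queue its links, hand it to the indexer."""
    queue = deque(seeds)
    seen = set(seeds)
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:              # page could not be downloaded
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url, html)          # hand the page off to the indexer
    return seen

# Seed with one URL; the "indexer" here just records what it was given.
indexed = []
crawl(["http://example.com/"], WEB.get, lambda url, html: indexed.append(url))
```

In this toy run `orphan.html` is never indexed, because no crawled page links to it and it was not in the seed set.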
The Bow Tie Model • 30% in the core • 24% origination pages • 24% termination pages • 22% disconnected pages -- these are effectively invisible to search engines • Source: IBM
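One way to make the four categories concrete is to classify the pages of a small link graph by reachability. The sketch below is a simplification under stated assumptions: the graph, the page names, and the `classify` helper are hypothetical; the core is taken to be the set of pages mutually reachable with a chosen core page; and the slide's percentages are not reproduced.

```python
def reachable(graph, start):
    """Return every node reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Build the graph with every link direction flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev

def classify(graph, core_page):
    """Label each page relative to the core that contains core_page."""
    forward = reachable(graph, core_page)             # pages the core links out to
    backward = reachable(reverse(graph), core_page)   # pages that link into the core
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    labels = {}
    for node in nodes:
        if node in forward and node in backward:
            labels[node] = "core"
        elif node in backward:
            labels[node] = "origination"
        elif node in forward:
            labels[node] = "termination"
        else:
            labels[node] = "disconnected"
    return labels

# Hypothetical five-page web: page -> pages it links to.
GRAPH = {
    "hub1": ["hub2"],
    "hub2": ["hub1", "sink"],
    "source": ["hub1"],
    "sink": [],
    "island": [],
}
LABELS = classify(GRAPH, "hub1")
```

A crawler seeded inside the core would reach the termination page by following links, but would find neither the origination page nor the disconnected one unless they were supplied as seeds.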
What is the Invisible Web? • “Stuff” that search engine crawlers (spiders) cannot, or will not, add to their databases • Estimated at 2 to 50 times larger than the visible Web • Resources often of much higher quality than those on the visible Web
What is the Invisible Web? • Certain file formats (PDF, Flash, Office files, streaming media) • Why? They aren’t HTML text • Most real-time data (stock quotes, weather, airline flight info) • Why? Ephemeral & storage intensive
What is the Invisible Web? • Dynamically generated pages (CGI, JavaScript, ASP, or most pages with “?” in the URL) • Why? Spider traps • Web-accessible databases • Why? Spiders can’t type
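The “?” rule of thumb is easy to express in code. The sketch below flags URLs that carry a query string or a script-like extension; the `looks_dynamic` name and the extension list are illustrative assumptions, and a flagged URL only suggests, rather than proves, that a crawler would skip the page.

```python
from urllib.parse import urlsplit

# Assumed list of extensions that typically indicate server-side scripts.
SCRIPT_EXTENSIONS = (".cgi", ".asp", ".pl")

def looks_dynamic(url):
    """Heuristic: True if the URL has a query string or a script-like path."""
    parts = urlsplit(url)
    if parts.query:                      # anything after "?"
        return True
    return parts.path.lower().endswith(SCRIPT_EXTENSIONS)
```

For example, `looks_dynamic("http://example.com/search.asp?q=eggs")` is true, while a plain `http://example.com/about.html` is not flagged.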
The Opaque Web • Visible pages “hidden” behind dynamic navigation codes • Mostly graphic, non-text pages • “Disconnected” pages
The URL Test
Invisible Web Searching: Core Tactics • The first step in determining the best approach for searching the Invisible Web is to have a clear idea of what you’re seeking. • Limit your search to tools appropriate for the type of information you’re looking for.
Use Invisible Web Pathfinders • Intelliseek • http://www.invisibleweb.com • Invisible-web.net • http://www.invisible-web.net/ • Librarians’ Index to the Internet • http://www.lii.org
Finding Non-HTML File Formats • Google & AlltheWeb: use the filetype operator • filetype:pdf • filetype:doc • Use specialized engines • searchpdf.adobe.com • ResearchIndex
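A query using the filetype operator is just the search terms plus a `filetype:` restriction. The helper below, a hypothetical `filetype_query` function, assembles such a string, and `as_query_string` URL-encodes it as a `q` parameter; the choice of `q` as the parameter name is an assumption about the target engine's search form.

```python
from urllib.parse import urlencode

def filetype_query(terms, extension):
    """Build a query string such as 'invisible web filetype:pdf'."""
    extension = extension.lstrip(".").lower()
    return f"{terms.strip()} filetype:{extension}"

def as_query_string(terms, extension):
    """URL-encode the query as a 'q' parameter (parameter name is an assumption)."""
    return urlencode({"q": filetype_query(terms, extension)})
```

For instance, `filetype_query("invisible web", "pdf")` yields `invisible web filetype:pdf`, ready to paste into a search box.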
Finding Real-Time Information • Weather Underground • Google News Search • Yahoo Finance • J-Track Spacecraft Tracker
Finding Images • Google/FAST/AltaVista Image Search • Google Catalogs • Visoo • Webseek @ Columbia
Finding Streaming Media Files • Speechbot • Singingfish • MSN Music • British Pathe • WindowsMedia.com v.9 player
Future Trends: The Invisible Web Revealed • More “difficult” content indexed • Flash, dynamic content • “Data-centric” search engines • ResearchIndex • Agent-brokered database search • Form crawlers
Conclusion • Searching the Invisible Web isn’t hard. It just takes a different mindset. • It’s crucial to develop your own personal collection. • Expect the unexpected: the boundary between visible and invisible is changing as we speak.
More Info • CyberAge Books, ISBN 0-910965-51-X • http://www.invisible-web.net
More Ranting • SearchDay Newsletter • http://searchenginewatch.com/searchday/ • Searchwise • http://www.searchwise.net • csherman@searchwise.net