Search Engine Roundup!!! Michael Hunter Reference Librarian Hobart and William Smith Colleges For Rochester Regional Library Council Member Libraries’ Staff Sponsored by the Rochester Regional Library Council Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2003
For Today . . . • Search Engine Affiliations: What am I actually searching, anyway? • The Major Services: an Overview • A Look in Detail: AlltheWeb, Teoma, WiseNut and Gigablast • Hands-on Session • Search Tips and Techniques • A Few Good Metas: Vivisimo and Ixquick • New (and newly-redesigned) Services
The Internet Search Industry: A Volatile World • Information as commodity • Overt actions: Mergers, Acquisitions • Covert actions: Database sharing • Total • Partial • Paid Listings only • NOTE: Data accurate as of Oct. 6, 2003
The Shrinking Search Industry: Editorial control of search is shared among few • Yahoo owns • AlltheWeb, Altavista, Inktomi, Overture (paid listings) • Google • MSN • AskJeeves owns Teoma • LookSmart owns Wisenut • Gigablast • NOTE: Ownership is different from database affiliation
Search Engine Database “Affiliates” or “What am I searching, anyway?” • Who crawls the Web? • Google • Alltheweb • Teoma • Inktomi • AltaVista • Wisenut • Gigablast
No Affiliates (for now!) • Altavista • Wisenut • Gigablast
Paid Listings Suppliers: “Sponsored Links” Often First in Results
Overture (NOTE: Purchased AlltheWeb & Altavista in Spring of 2003; Yahoo purchased Overture in Sept. of 2003)
Looking Over the Major Players • Database Size • Database Freshness • Popularity
Database Freshness http://www.searchengineshowdown.com/stats/freshness.shtml • Based on a series of 6 current topic searches • Pages that are updated daily • AND report that date on the page • Queries submitted May 17, 2003
Database Freshness http://www.searchengineshowdown.com/stats/freshness.shtml • Most have some results indexed in the last few days • The bulk of most of the databases is about 1 month old • Some pages may not have been re-indexed for much longer
Searches per day: self-reported data, as of 2/28/03 http://searchenginewatch.com/reports/article.php/2156461
Four Internet Search Engines: What’s Under the Hood? AlltheWeb, Teoma, Wisenut, Gigablast
AlltheWeb • Developed by FAST of Norway • Launched May, 1999 • Now owned by Overture • One of the best!
AlltheWeb: Databases • Indexed Web pages including PDF, Flash, and other file types • News (from 3,000+ international news sources) • Images • Videos • MP3 files • FTP files • Ads from Overture listed as "Sponsored Results"
AlltheWeb: Search Features • Boolean capabilities in Basic Search: + (plus) for AND, - (minus) for NOT, ( ) for OR, e.g. (jazz swing blues) = jazz or swing or blues • Boolean capabilities in Advanced Search • Via search boxes and drop-down menus • Use of rank boosts the importance of records containing the specified term(s)
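The Basic Search operators above can be sketched as a small query builder. This is only an illustration: the function name `build_basic_query` and its parameters are hypothetical, and it assumes the NOT operator is a leading minus sign and that terms are separated by spaces.

```python
def build_basic_query(required=(), excluded=(), any_of=()):
    """Assemble a Basic Search string: +term (AND), -term (NOT), (a b c) (OR).

    Hypothetical helper for illustration; not part of any AlltheWeb API.
    """
    parts = [f"+{term}" for term in required]    # terms that must appear
    parts += [f"-{term}" for term in excluded]   # terms that must not appear (assumed syntax)
    if any_of:
        # Parentheses group alternatives: any one of these terms may match.
        parts.append("(" + " ".join(any_of) + ")")
    return " ".join(parts)


print(build_basic_query(required=["music"],
                        excluded=["rock"],
                        any_of=["jazz", "swing", "blues"]))
# prints: +music -rock (jazz swing blues)
```

The resulting string would be typed into the Basic Search box as-is.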
AlltheWeb: Search Features • Results clustered by topic (“Folders”) • Both HTML and Multimedia given, when available • NOTE: Located at the BOTTOM of each results screen
AlltheWeb Field Searching: Command Line and Drop Down Options • In the text • In the URL • In the link to URL • Retrieves pages that link TO the specified URL • In the Title • In the host name (anywhere)
AlltheWeb Advanced Search: Additional Filters and Limits • 49 Languages (select up to 8 per search using the Customize Option) • IP Address and/or range • Domain (TLD, country or region or entire website) • Date • Document size (UNIQUE!!!) • File formats (9) • Embedded Content (Media Type) • Offensive Content
Date • Date Range from Jan. 1, 1980 - present (based on last update, where available) • last month • last 3 months • last 6 months • last 9 months • last year
Document Size (!!) • Limit by bytes, kilobytes or megabytes
Additional File Formats: Undocumented in HELP, but they work as of 10/5/03 • filetype:rtf • filetype:powerpoint • filetype:excel • filetype:postscript • filetype:wordperfect • filetype:staroffice (Sun’s Office Suite, running on Linux)
Embedded Content • Images: All image types (the <img> tag) • Audio: Audio files (midi, wav, au etc.) • Video: Video files (Quicktime, AVI, etc.) • RealVideo & RealAudio: Streaming RealVideo and RealAudio • Macromedia Flash: Macromedia Flash animations • Java applets: Java applets (the <applet> tag) • JavaScript: JavaScript and ECMAScript • VBScript: Microsoft VBScript
Website Evaluation Feature: Type a URL in the Basic Search Box
Teoma • Launched in 2001 • Bought by AskJeeves in 2002 • Database • Indexed Web pages (no Images or other Media) • Paid listings from Google • Results displayed in 3 groupings: Results, Refine and Resources • Fourth in database size, after Google, AlltheWeb and Inktomi
Teoma Advanced Search Features • Boolean available in Basic and Advanced Search modes • Field searches: full text, title or URL • Limit by language (8 European) • Most limits also operative as commands: site: inurl: intitle: lang: • Certain limits cannot be combined; see Advanced Search HELP
Results Features: 3 Results Groupings • Results • Ranked database results, with “Related Pages” • Refine • Clustering of your results and other related sites based on term relationships and web community linkages derived from your original results • Resources • “Link Collections from experts and enthusiasts” (Subject metasites)
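As a rough illustration of the command form of these limits, the sketch below appends site:, inurl:, intitle: and lang: prefixes to a set of search terms. The helper name `teoma_query` and the space-separated layout are assumptions made for the example. Because the slide does not say which limits cannot be combined, the sketch performs no such check.

```python
def teoma_query(terms, site=None, inurl=None, intitle=None, lang=None):
    """Build a query string using the command prefixes named on the slide.

    Hypothetical helper; it does not validate which limits may be combined.
    """
    parts = list(terms)
    # Each limit becomes a "command:value" token, added only when supplied.
    for command, value in (("site", site), ("inurl", inurl),
                           ("intitle", intitle), ("lang", lang)):
        if value:
            parts.append(f"{command}:{value}")
    return " ".join(parts)


print(teoma_query(["library", "consortium"], site="example.org"))
# prints: library consortium site:example.org
```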
Teoma’s Ranking • Includes a site’s relationship to other sites with similar content • How many links (incoming and outgoing) exist between this site and others on the same subject? • To what degree are those other sites inter-linked to the larger web “community” of high quality, similar-subject sites? (Requires some human examination)
Teoma • Plus: • Identifies metasites (“Resources”) • Offers linkage-based web communities (“Refine”) • Minus: • Smaller database • No free URL submission • No cached copies • No subject directory
WiseNut • Launched July 2001 • Purchased by LookSmart in 2002 • Single crawler-created database, refreshed often • Claims database of 1.5 billion pages • Sample search for pope canterbury on 10/4/03: Google 83,200 results, WiseNut 31,451 • One partner site, Korea WiseNut
WiseNut Search Features • Full Boolean in Basic and “WiseSearch” • Results clustered by content (“WiseGuides”) • “Search This” allows inclusion of WiseGuide folder titles in a search • Limit by language (25) • Adult content filtering (“WiseWatch”) • “Sneak-a-Peek” opens a result in a new window
Gigablast • Launched April, 2002 • Smaller database than others • Over 200 million pages on 10/4/03 • Sample search for pope canterbury: Google 83,200 results, Gigablast 24,919 • Created and maintained by Matt Wells (alone) • Only search engine “continuously updated with index refreshed in real time” (Site submissions are immediately searchable) • Ranking depends less on linkage than Google’s ranking, to avoid penalizing newer pages • No advertising (to date)
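To put the pope canterbury counts quoted for WiseNut and Gigablast in proportion, the short calculation below expresses each as a share of Google's count for the same search. The figures come from the slides; the function name `share_of` is just a label for the example, and a single query is of course only a rough indicator of relative database size.

```python
def share_of(count, baseline):
    """Return `count` as a fraction of `baseline`."""
    return count / baseline


GOOGLE = 83_200  # Google's result count for the sample search
counts = {"WiseNut": 31_451, "Gigablast": 24_919}

for engine, count in counts.items():
    print(f"{engine}: {share_of(count, GOOGLE):.1%} of Google's count")
# prints:
# WiseNut: 37.8% of Google's count
# Gigablast: 30.0% of Google's count
```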
Gigablast Search Features • Basic Search: Full Boolean • Advanced Search: Full Boolean and 2 (!) phrase boxes • Limit by site • Limit by domain (URL) • Links to a page available
Gigablast Search Features • Field searches include title, IP address and non-HTML filetypes: • PDF, Word, Excel, PPT, PostScript, ASCII Text • Results from one site clustered • Cached version available • Results include date indexed and last modified (!!) • Linking to Gigablast improves ranking there