Searching the Web, or “If there’s so much out there, why can’t I find it?” Presented by: Allen Brown, IS/SE. Date: 2003-05-12. Outline: Information Cartography; Visible and Invisible Web Information; Information Finding Strategies.
 
                
Searching the Web
Or “If there’s so much out there, why can’t I find it?”
Presented by: Allen Brown, IS/SE
Date: 2003-05-12
Outline - Searching the Web
• Information Cartography
• Visible and Invisible Web Information
• Information Finding Strategies
• Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
• Information Search Strategies
• Information Evaluation Strategies
• Information Finding Summary
• Search Engines and their Characteristics
Information Cartography
Imagine a physical map of an ocean basin:
• identifiable areas of the sea floor
• large abyssal plain
• many undulating hills above the plain
• occasional higher elevations or plateaus
• sparse atolls and seamounts
Imagine the Web:
• some information content identifiable by subject
• vast amounts of very low value information
• some good stuff distributed across many sites
• occasional high quality site with quality and quantity
• sparse stunningly useful sites (to die for)
Information Cartography - 2
Information issues: quality, completeness + location!
In searching for information we need to adjust the:
• breadth of search, to find all that is relevant in an “ocean” of information
• quality level, to find only “atolls” of information quality
The aim is to find everything that is important and useful.
Visible and Invisible Information
• Visible = indexed by a search engine
• Invisible = not indexed but accessible
[Diagram: an information space containing sites (3, 5, 7), databases (1, 2, 4, 6), and search engines (1 to 4).]
Search Engines Won’t Do It All!
According to a recent study reported in Nature (1), no search engine indexes more than 16% of the Web. Even though search engine databases are enormous, they cover very little of what's actually available on the Web.
1) Steve Lawrence and C. Lee Giles. (July 8, 1999). Accessibility of Information on the Web. Nature, 400, 107-109
Information Finding Strategies
Identify starting points based on your question:
• What type of information do you need? Facts, statistics, government document, scholarly articles, popular opinion, music, picture, multimedia, news, …
• What form do you want the information in? Dictionary definition, encyclopedia entry, journal article, elementary school project, video file, audio file, …
• What type of site would offer this information? Academic, commercial, government, non-government organization
• How much information do you need? Introduction, in-depth, references, …
Information Finding
Reference Materials (often invisible)
• dictionaries, thesauri, encyclopedias, newspapers
Information Pathfinders (sometimes invisible) / Portals / Vortals
• subject specific, highly relevant, sometimes bizarre
• usually high quality
• managed by dedicated enthusiasts, possibly amateur
• e.g., Web design, Perl, micro cars, Curta calculators, …
Specialized Information Repositories (often invisible) / Portals
• institution-based, sometimes obscure
• usually high quality
• managed by information professionals
• e.g., government documents, archives, …
Information Finding - 2
Subject Indices (often invisible, but this is changing)
• subject-based
• e.g., Yahoo
Search Engines and Search Brokers (visible Web)
• e.g., Google, Alta Vista, Hot Bot, Lycos, Vivisimo, dogpile
Reference Tools - Dictionaries: http://www.yourdictionary.com/
Reference Tools - Thesauri: http://www.visualthesaurus.com/index.jsp
Pathfinders A pathfinder site provides an information map of what is available within a fairly narrow area of interest; usually compiled by domain experts. These sites are often called “vortals” (vertical portals). 
Specialized Information Repositories - National Library of Canada A specialized information repository often collects and catalogues relatively specific information; usually compiled by information experts. Some are considered to be vortals. 
Subject Directories: www.yahoo.com
Subject directories are lists compiled by people. They are organized in a hierarchy where each subject includes a list of sub-topics. These sites are often called “portals”: a one-site starting location for general information seeking.
Subject Directories
• Subject lists are usually evaluated, but sites are not presented in order of relevancy. In other words, the best sites on a topic are not necessarily listed first. Sites are compiled through submission of URLs by site creators, followed by human evaluation and selection.
• One advantage of subject directories is their browsability, although this feature is only suitable for fairly general topics. A disadvantage is their relatively small size.
• Other examples of subject directories:
• Infomine: http://infomine.ucr.edu
• Scout Report Signpost: http://www.signpost.org/signpost
Invisible Web Directories Look at http://www.invisible-web.net/ 
Search Engines
Search engines use computer programs called "spiders" or "robots" that automatically collect web pages. The pages are indexed and stored in an index database. To query a search engine, type topic keywords and Boolean connectors into a search "box." The search engine scans its index and returns links to web sites containing the specified keyword relationships.
Size matters: an advantage of using search engines is their coverage (though size is relative), but this can also be a disadvantage if relevance ranking is poor.
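The index lookup described on this slide can be sketched as a small inverted index. This is only an illustration under stated assumptions: the page names and texts in `PAGES` are hypothetical, the functions `build_index` and `search_all` are made up for this example, and real engines store far more than a word-to-page map.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for collected pages (name -> page text).
PAGES = {
    "site1": "curta calculator repair and cleaning",
    "site2": "perl web design tutorial",
    "site3": "mechanical calculator history and curta models",
}


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search_all(index, query):
    """Return the pages containing every query keyword (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(PAGES)
print(sorted(search_all(INDEX, "curta calculator")))  # ['site1', 'site3']
```

The query never touches the pages themselves, only the index, which is why a page that was never collected cannot be found.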
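Search Engines: Operational Concepts
[Diagram: the User sends a query to the Search Engine and receives query results. The Search Engine performs query parsing, index lookup, and results ranking and management against its index data base, which is built by crawling the World Wide Web and extracting and indexing page contents.]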
Size
If you are looking for unusual or hard-to-find information, you should try one or more of the search engines with a large index to check more web content. This improves the likelihood of finding what you seek. However, for general searches or when looking for information about popular topics, a large index does not necessarily mean better results. Also, large indexes may have longer re-visit intervals.
Search Engines: Search Scoping and Ranking / Results Management
It is essential to learn and apply each engine's specialized search formats to narrow results, filter them, and push the most relevant pages to the top of the results list. Use Boolean operators, proximity connectors, stems, wild cards, sounds-like, media-type and metadata filters.
Result relevancy ranking also depends on the size of the search index and how the search engine interprets and uses your query. Each engine determines result relevancy ranking in its own way. Consult the help file of each engine to learn about these.
Some engines offer search refinement and conceptual clustering for better focus (tighter “hit cluster”) or greater accuracy / validity (centred on the “right stuff”).
Search Engines - Search Scoping
(+ expands the scope, - reduces the scope)
• Exact phrase (-): quotes, e.g., “We hold these truths to be self-evident”
• Boolean operators: and (-, default), or (+, caution!), not (-, extreme caution!), e.g., large male dog; large or male or dog; not cat
• Proximity connectors: near (-, depends on engine), e.g., spring near flower
• Stemming and wildcards (+): e.g., swim* → swim, swimming, swimmer, swimmers, swimmingly, …
• Sounds-like (+): e.g., table → cable, able, fable, …
• Media type (-): e.g., image, audio file, …
• Concept-based (+): e.g., synonym → thesaurus, antonym, homonym, …
• Metadata-based (-): in some systems
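To see how these operators widen or narrow a result set, here is a minimal sketch over four hypothetical documents. The `DOCS` contents and the `match_*` helper names are assumptions made for illustration, and the matching is plain whitespace tokenising, not any engine's actual query syntax.

```python
import fnmatch

# Hypothetical documents (id -> text).
DOCS = {
    1: "large male dog",
    2: "large cat",
    3: "small dog swimming",
    4: "the swimmer and the large dog",
}


def tokens(text):
    return text.lower().split()


def match_and(terms):
    """Documents containing every term (narrows the scope)."""
    return {d for d, t in DOCS.items() if all(w in tokens(t) for w in terms)}


def match_or(terms):
    """Documents containing at least one term (expands the scope)."""
    return {d for d, t in DOCS.items() if any(w in tokens(t) for w in terms)}


def match_not(hits, term):
    """Drop documents containing the term (narrows the scope)."""
    return {d for d in hits if term not in tokens(DOCS[d])}


def match_phrase(phrase):
    """Documents containing the words in exactly this order (narrows)."""
    target = tokens(phrase)
    n = len(target)
    hits = set()
    for doc_id, text in DOCS.items():
        toks = tokens(text)
        if any(toks[i:i + n] == target for i in range(len(toks) - n + 1)):
            hits.add(doc_id)
    return hits


def match_wildcard(pattern):
    """Documents with any word matching a pattern such as swim* (expands)."""
    return {d for d, t in DOCS.items()
            if any(fnmatch.fnmatchcase(w, pattern) for w in tokens(t))}


print(match_and(["large", "dog"]))                    # {1, 4}
print(match_or(["large", "dog"]))                     # {1, 2, 3, 4}
print(match_not(match_or(["large", "dog"]), "cat"))   # {1, 3, 4}
print(match_phrase("large dog"))                      # {4}
print(match_wildcard("swim*"))                        # {3, 4}
```

Note that the phrase search is the strictest of the five: document 1 contains both words but not next to each other, so it drops out.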
Search Engines - Ranking Result relevancy ranking (=“usefulness”) can be done according to two techniques (or some combination): • Conventional - using intra-page information • Relative - using extra-page information 
Search Engines - Conventional Ranking
Conventional (intra-page):
• frequency of words (number and density)
• phrases (exact word sequences)
• hierarchy (e.g., closer to the top of the document)
• adjacency (proximity of words)
• metadata (keywords provided by content owners)
• font size and style (relative intra-page)
[Example page shown on the slide: a question-and-answer page about Jack Christensen's Curta calculator cleaning and repair service.]
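A rough sketch of intra-page scoring follows, combining three of the signals listed for conventional ranking: word density, position in the document, and adjacency. The weights (0.5 each), the function name `conventional_score`, and the two sample pages are assumptions chosen only for illustration; they are not taken from any real engine, and punctuation handling is ignored.

```python
def conventional_score(text, query_terms):
    """Score a page using only information found inside the page."""
    tokens = text.lower().split()
    terms = [t.lower() for t in query_terms]
    positions = [i for i, tok in enumerate(tokens) if tok in terms]
    if not positions:
        return 0.0

    # Frequency: share of the page's words that are query terms.
    density = len(positions) / len(tokens)

    # Hierarchy: a first hit near the top of the page scores higher.
    earliness = 1.0 - positions[0] / len(tokens)

    # Adjacency: two different query terms appear side by side.
    adjacent = any(
        tokens[i] in terms and tokens[i + 1] in terms
        and tokens[i] != tokens[i + 1]
        for i in range(len(tokens) - 1)
    )

    # Hypothetical weights; a real engine would tune these.
    return density + 0.5 * earliness + (0.5 if adjacent else 0.0)


PAGE_A = "curta repair guide with curta repair prices"
PAGE_B = ("a long page about calculators that mentions "
          "repair once and curta later")
QUERY = ["curta", "repair"]

SCORE_A = conventional_score(PAGE_A, QUERY)
SCORE_B = conventional_score(PAGE_B, QUERY)
print(SCORE_A > SCORE_B)  # True
```

Because every signal comes from the page itself, a page author can inflate the score simply by repeating keywords near the top, which is one motivation for ranking on information from outside the page.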
Search Engines - Relative Ranking
Relative (extra-page):
• popularity (page visits - from the search engine)
• citation (links pointing to the item)
• relevance of the pages containing the links pointing to the item (!)
[Diagram labels: Yahoo, Web Pages]
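The last two bullets (links pointing to an item, weighted by the standing of the linking pages) are the idea behind PageRank. The sketch below is a simplified power iteration over a hypothetical four-page link graph; the graph `LINKS`, the function name `link_rank`, the damping value 0.85, and the iteration count are assumptions, and production ranking involves much more than this.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank: a page ranks higher when well-ranked pages link to it."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

RANKS = link_rank(LINKS)
print(max(RANKS, key=RANKS.get))  # c
```

Page "a" has only one incoming link, but it comes from the top-ranked page "c", so "a" still outranks "b" and "d", which nobody links to.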
Search Engines: Keys to Success
(searching the World Wide Web)
• Size: large index and / or several engines
• Scoped query: “wide net” but appropriate “sieve”, carefully constructed for your needs
• Ranked and manageable results: query construction and search engine features
Meta Search Engines
“Meta” search tools search the index databases of multiple engines “simultaneously” via a single interface. They don’t really search metadata; they are simply brokers that reformulate a query, hand it off to a set of search engines, then combine the results.
“Meta” engines are very fast, but they do not offer the same level of control over the relationship between keywords as individual search engines do. Meta search engines may also produce poor ranking of the combined results.
Search Engines
Examples of popular search engines include:
• Google: http://www.google.com
• Alta Vista: http://www.altavista.com
• All the Web: http://www.alltheweb.com
• Northern Light: http://www.northernlight.com
Also see the KartOO clustering visual engine: http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/
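Combining the results is the hard part of brokering, because each engine ranks by its own rules. One simple way to merge ranked lists is reciprocal rank fusion, sketched below. This technique is swapped in here purely as an illustration and is not claimed to be what any particular meta engine uses; the engine result lists, the function name `merge_results`, and the constant `k=60` are assumptions.

```python
def merge_results(result_lists, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per URL."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical ranked results returned by three engines for one query.
ENGINE_A = ["x", "y", "z"]
ENGINE_B = ["y", "w", "x"]
ENGINE_C = ["y", "x"]

MERGED = merge_results([ENGINE_A, ENGINE_B, ENGINE_C])
print(MERGED)  # ['y', 'x', 'w', 'z']
```

The merge sees only positions, not the engines' underlying relevance scores, which is one reason a combined ranking can be poorer than any single engine's.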
Information Search Strategies • Think hard about what you are looking for! • Use a Reference Tool, if appropriate • Use a Pathfinder, if you know one • Use a Specialized Information Repository, if appropriate • Use Subject Indexes, if it is a common topic • Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work • Use keywords - be narrow, and specific (and technical) • Use phrases - try synonyms or related concepts • Use Boolean connectors - but find out if / how the engine uses them • Use stemming and wildcards - but find out if / how the engine uses them • Use media-type filters or metadata, if appropriate 
Information Search Tools - Use
[Diagram: search tools placed in an information space with axes "depth" and "breadth". Labels, in the order extracted from the slide: Pathfinder; Search Engines and Meta-engines; easy to use, focused content, pre-selected by domain experts; obscure or academic, caveat emptor!; Subject Indexes; popular or common, pre-selected by interested people; Specialized Information Repository; hard to use well; generic, simple lookup, created by professionals, contains “invisible” content; related or themed, pre-selected by professionals, contains “invisible” content; Reference Tool.]
Information Evaluation Strategies: CARS
CARS checklist: http://library.queensu.ca./inforef/guides/evalchart.htm
• Credibility: author credentials stated with email contact; evidence of quality control (site location)
• Accuracy: timeliness; comprehensiveness; audience & purpose
• Reasonableness: fairness; objectivity; consistency; world view
• Support: source documentation or bibliography
Summary
• There is much information on the Web, but it’s not:
- all there
- all good (or all bad)
- always easy to locate
• Use an information search strategy that:
- matches the information sought
- uses the appropriate tools
- uses them in the correct ways
• Use an information evaluation strategy, e.g., CARS methodology.
• Choose and use search engines wisely, knowing their strengths, features, and limitations.
How Do Search Engines Work?
Three activities occur:
1. Crawling
• fetch pages
• compile URL list (a db)
• re-visit pages
2. Page harvesting
• parse page
• add to index db and establish ranking
3. Responding to search requests
• parse query
• apply to index
• present and rank results
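The three activities can be sketched end to end over a tiny in-memory "web". Everything here is hypothetical: the `WEB` dictionary, its example.org URLs, and the function names `crawl`, `harvest`, and `respond` are assumptions for illustration. No network access takes place, re-visiting is not modelled, and ranking is reduced to counting matched keywords.

```python
from collections import deque

# Hypothetical web: URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": (
        "search engines index the web",
        ["http://example.org/a", "http://example.org/b"],
    ),
    "http://example.org/a": (
        "crawlers fetch pages and follow links",
        ["http://example.org/b"],
    ),
    "http://example.org/b": (
        "the index maps words to pages",
        ["http://example.org/"],
    ),
    "http://example.org/orphan": (
        "no page links here so crawlers never find it",
        [],
    ),
}


def crawl(seed):
    """Activity 1: fetch pages breadth-first, compiling the URL list."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url not in WEB:  # simulated dead link
            continue
        text, links = WEB[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def harvest(fetched):
    """Activity 2: parse each fetched page and add its words to the index."""
    index = {}
    for url, text in fetched.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def respond(index, query):
    """Activity 3: parse the query, apply it to the index, rank the hits."""
    hits = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            hits[url] = hits.get(url, 0) + 1
    # Pages matching more keywords come first; ties sorted by URL.
    return sorted(hits, key=lambda u: (-hits[u], u))


FETCHED = crawl("http://example.org/")
INDEX = harvest(FETCHED)
print(respond(INDEX, "index pages")[0])  # http://example.org/b
print(respond(INDEX, "never"))           # []
```

The orphan page exists and is accessible, but since no crawled page links to it, it never enters the index: a small-scale picture of the invisible Web.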
Search Engines: Operation
[Diagram: components are the World Wide Web, the Crawler Robot (fetch, re-visit), the URL data base, the Harvester Robot (fetch page contents), the Index data base, the Query Processor (query, query results), and the User. Annotations on the diagram: "Really clever stuff in here" and "Fairly clever stuff in here".]
Search Engine - Hardware (not really …) 
How Do Search Engines Work? • See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” at http://www-db.stanford.edu/~backrub/google.html 
References
• Information Search Strategies: <http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
• Information Evaluation Strategies: <http://www.vuw.ac.nz/~agsmith/evaln/evaln.htm>
• Search Engines: <http://www.library.arizona.edu/search.htm>, <http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>, <http://www.searchenginewatch.com/>
• Susan Maze, David Moxley, Donna Smith: Authoritative Guide to Web Search Engines, Neal Schuman Pub, 1997, ISBN 1555703054