
Midnight in the Garden of Good and Evil Search Engines

Midnight in the Garden of Good and Evil Search Engines. Presentation by Richard Wiggins Technical Advisor, NEM Online, Michigan State University www.msu.edu/staff/rww wiggins@msu.edu Columnist, “Internet Buzz,” webreference.com www.webreference.com/outlook wiggins@internet.com




Presentation Transcript


  1. Midnight in the Garden of Good and Evil Search Engines • Presentation by Richard Wiggins • Technical Advisor, NEM Online, Michigan State University • www.msu.edu/staff/rww • wiggins@msu.edu • Columnist, “Internet Buzz,” webreference.com • www.webreference.com/outlook • wiggins@internet.com • Co-host, Nothing But Net television program (produced by Media One)

  2. A Parable: The Encounter Between the USS Nimitz and a Canadian Vessel...

  3. A Frequency Analysis of the Appearance of a Critical Search Term Among Major Search Engines...

  4. Frequency of the Search Term “Slavko” Among Major Search Indexes • AltaVista 5477 • Excite 1160 • Infoseek 1452 • Hotbot 4226

  5. Come Join Our Tour of SearchVannah ...a place millions want to visit... …where a cast of characters stands ready to help you find exactly what you’re looking for...

  6. SearchVannah’s Tour Guides • …a relatively new town • …only existed since 1993 • With so many visitors, lots of tour guides have set up shop • They tend to have funny names • They compete fiercely • They’re all trying to make money helping visitors find their way

  7. The Tour Guides • AltaVista • Fast, lots of memory, knows a lot • But people complain sometimes results are inconsistent • InfoSeek • Claims answers are more relevant • MetaCrawler • Doesn’t know anything at all! Just asks the other tour guides!

  8. HotBot • HotBot: This tour guide wears the ugliest clothes!

  9. The Tour Guides... • Inktomi: other tour guides hire Inktomi to answer their questions • One guide knows a LOT less than all the others… • But it’s the most popular by far! • The smarter tour guides think of it as just a dumb Yahoo… • But maybe tourists want to know where the B&B is, not a list of all the towels and dishes

  10. Definitions • Crawler: automated tool to discover new and changed pages, feeds data to… • Indexer: builds and maintains an index, concordance-style • Search engine: the actual tool end-users employ when searching • …but in popular usage, all together = “search engine”
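
The three roles defined on this slide can be illustrated with a minimal sketch. It assumes the crawler has already run, so the `PAGES` dictionary (with made-up URLs and text) stands in for fetched pages; `build_index` and `search` are hypothetical names, and the AND-only matching is a simplification rather than how any particular engine works.

```python
from collections import defaultdict

# Hypothetical stand-in for pages a crawler has already fetched (URL -> text).
PAGES = {
    "http://example.edu/a": "climate change and beam theory",
    "http://example.edu/b": "application for graduation",
    "http://example.edu/c": "climate data for graduation speakers",
}

def build_index(pages):
    """Indexer: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Search engine: return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

INDEX = build_index(PAGES)
print(search(INDEX, "climate"))
print(search(INDEX, "climate graduation"))
```

The concordance-style index is what makes lookups fast: a query touches only the entries for its own words instead of rescanning every page.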

  11. Leveraging 30 Years of Information Retrieval (IR) • Most new ideas we see in Web engines were thought of long ago... • Stemming • Controlled vocabulary • Text analytics • Knowledge Bases • Personalization (by observing user usage patterns) • Natural language

  12. How Do People Search? SearchVannah “Honestly, tourists are the dumbest people” -- anonymous Tour Guide

  13. What Do People Search For? • Major search services say people look for... • Sex sites • One’s own name • Friends, colleagues’ Web sites (also by name) • Items in the news • Company / product information • Etc.

  14. Metaspy: Window into Real User Queries

  15. One user view of search.msu.edu: Academics • application for graduation • overseas study • ordering catalog • School of Music • Computer Science • human ecology department • psychology 101

  16. Another user view of search.msu.edu: Virtual Library • DNA sequencing • climate change • beam theory • feline brain tumor • PRL and sequencing

  17. Another user view of search.msu.edu: Extension • livestock pavilion • wildlife fisheries • bathtub removal and installation • Round Bale Storage

  18. Another user view of search.msu.edu: Conversational • I would like to know if you offer a workshop on “International Law”

  19. What Do People Search For? Matt Koll’s Formulation • “finding a needle in a haystack” • a known needle in a known haystack • a known needle in an unknown haystack, • to any needle in a haystack • Where are the haystacks? • GenX rendition: Needles? Haystacks? Whatever!

  20. Typical User Search Strategy • Type in a one-word search term • Maybe two words • Seldom exploit advanced options • Capitalization • Quoting phrases (e.g. “climate change”) • Date restrictions • Host:, URL: parameters • Seldom use iterative refinement

  21. Users Make “Wrong” Choices • Picking the right database is confusing • Reference librarians, experienced users learn brand names • Inexperienced users do not • Lycos example: “Small” versus “Large” catalog • “Small” catalog was faster, more precise • Virtually no one used it, thinking “Large” meant “better”

  22. A Route 128 Story • Engineering firm on Route 128 • Engineers new products • Has constant need for specialized information • Uses traditional sources, and the Web • “Joe down the hall” does the Internet searches • Joe is a reference librarian with an engineering degree (and no training in online searching!)

  23. Prospects for Training are Dismal! • We don’t know the users, so we can’t hope to train them • Users won’t read documentation or help notes • If engine doesn’t deliver, users react viscerally • “This engine is useless” or • “The Internet has nothing useful” • “The Internet has too much information!”

  24. How Well Do Today’s Engines Meet Real Users’ Needs? • Most engines cannot yield high precision, high recall hit list with only one search term • But most users don’t compose or refine their searches carefully • Boolean operators virtually unused • Therefore most users probably fail to get desired results • Many sample searches from MSU example would not yield desired information
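
Precision and recall, as used on this slide, can be made concrete with a small calculation. The sketch below uses the standard IR definitions; the document IDs are invented for illustration and do not come from the MSU query logs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a hit list against the relevant set.

    precision = relevant hits / everything retrieved
    recall    = relevant hits / everything relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical one-word search: ten hits come back, only two are useful,
# and two other useful documents were never returned.
retrieved = [f"d{i}" for i in range(1, 11)]   # d1 .. d10
relevant = ["d1", "d2", "d11", "d12"]
print(precision_recall(retrieved, relevant))
```

In this made-up case the user wades through eight irrelevant hits (precision 0.2) and still misses half of what exists (recall 0.5), which is the failure mode the slide attributes to one-word searches.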

  25. AltaVista “Intelligent” Case Matching Example • Looking for information on “TREC” search engine testing at NIST

  26. Scale Issues SearchVannah “This town is growing so fast, and there’s too many tourists!” -- a 3rd generation resident

  27. The Problem of Scale • No one knows exact size of Web • Databases, intranets complicate issue • “Dark matter” -- Vint Cerf • Probably 250 to 500 million pages publicly accessible • Recent Science article claims most spider coverage is incomplete • AltaVista claims 140 million pages in index

  28. 1 Billion URLs -- and Beyond • (Chart: number of URLs by year, 1996 to 1998; values labeled 30M, 140M, 1000M)

  29. Problem of Scale: Transaction Load • AltaVista handles 30 million searches per day • Inktomi is “back-end” for numerous sites • HotBot, N2H2 (Japan), Australian news service • Soon, the “find a Web site” function in Windows 98 • No popular service has melted down yet

  30. Inktomi’s “Network of Workstations” Model • Eric Brewer, CEO, claims centralized high-speed servers cannot scale • Developed new clustering scheme: dozens or hundreds of low-cost servers on high-speed network • But centralized engines have not broken down yet • 64-bit processors @ 300-450 MHz, gigabytes of RAM, fast paths to disk
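
The general idea of spreading an index over many inexpensive machines can be sketched as follows. This is an illustration of generic partition-and-merge searching, not a description of Inktomi's actual design, which the slides do not detail; the `Node` and `Cluster` classes, the CRC-based partitioning, and the sample URLs are all assumptions.

```python
import zlib

class Node:
    """One low-cost server holding a slice of the document collection."""
    def __init__(self):
        self.docs = {}

    def add(self, url, text):
        self.docs[url] = text.lower().split()

    def search(self, word):
        word = word.lower()
        return [url for url, words in self.docs.items() if word in words]

class Cluster:
    """Routes each document to one node and sends each query to all nodes."""
    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def add(self, url, text):
        # Assumed partitioning rule: a stable checksum of the URL picks the node.
        shard = zlib.crc32(url.encode("utf-8")) % len(self.nodes)
        self.nodes[shard].add(url, text)

    def search(self, word):
        hits = []
        for node in self.nodes:   # a real cluster would query nodes in parallel
            hits.extend(node.search(word))
        return sorted(hits)

cluster = Cluster(node_count=4)
cluster.add("http://example.gov/trec", "TREC search engine testing at NIST")
cluster.add("http://example.edu/ir", "thirty years of information retrieval")
cluster.add("http://example.com/news", "search services compete fiercely")
print(cluster.search("search"))
```

Under this scheme, growth is handled by adding nodes rather than by buying a larger central server, since each node searches only its own slice.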

  31. Trends SearchVannah “We have a forward-looking sense of fashion!” -- one of the tour guides

  32. Trends Among Search Engines • Observations of Dr. Susan Feldman, Cornell: • More professional look, feel than a couple years ago • Common syntax evolving: • Plus sign prefix for required term, minus for excluded term • Quotes signify phrases, caps signify case significant • Unique “personalities” evolving
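
The common syntax noted above (plus for required, minus for excluded, quotes for phrases) is simple enough to parse in a few lines. The following is a minimal sketch; `parse_query` and its output layout are hypothetical, no engine's real parser is being reproduced, and the capitalization rule is not modeled.

```python
import re

# An optional +/- sign followed by either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(term)
    return parsed

print(parse_query('+"climate change" -politics Michigan'))
```

A quoted phrase is kept as a single term, so an engine can match it as a unit instead of as independent words.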

  33. The Role of Meta-Crawlers • Experts agree that spider coverage varies across services • No two services cover the same sites for a given search • Therefore searching across multiple indexes yields more results • Therefore metacrawlers can be useful
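
The reasoning above, that combining several partial indexes yields more results, can be sketched as a merge step. This assumes each engine has already returned a ranked list of URLs; the round-robin interleaving and the `merge_results` name are illustrative choices, not the method of MetaCrawler or any other service.

```python
from itertools import zip_longest

def merge_results(result_lists):
    """Interleave ranked lists from several engines, dropping duplicate URLs."""
    merged, seen = [], set()
    # Take every engine's rank-1 hit, then every rank-2 hit, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical hit lists from three engines with partly overlapping coverage.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
engine_c = ["u5"]
print(merge_results([engine_a, engine_b, engine_c]))
```

The merged list contains five distinct URLs where the best single engine returned three, which is the coverage argument in miniature.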

  34. Targeted Spiders • Train the spider to crawl only sites that fit a certain subject domain • InfoSeek News Index • Death of a Princess example • Internet.com’s “vertical” index • LawCrawler • NEM Online • Research project at Michigan State University • Harnessing information of use to manufacturers

  35. “death of Princess Diana” Search on Infoseek, 8/31/97 1:00 pm

  36. AskJeeves: Question-oriented Knowledge Base

  37. A Better AskJeeves Question

  38. Northern Light

  39. Traditional Model: First, Pick a Database, Then Do Your Search

  40. Why Northern Light is a Breakthrough • Delivering quality sources alongside Web resources • As Web becomes more cluttered, advantage grows • Database search paradigm inverted: First do your search, then pick your source • Automatic categorization yields manageable hit lists • Advantage also grows as Web grows

  41. Real Name System

  42. Specialized Engines: Serving Specific Geographic Areas

  43. Search for “Intel” on Excite

  44. Alexa: Group Experience

  45. Beyond Text: Still Images, Digitized Speech, Video • We tend to think of search engines as limited to text • But increasingly we will face digital content • Thanks to scanners, digital cameras, digital sound cards, digital video cameras • These digital collections will be corporate assets • But to use, and re-purpose, these assets, we will need search engines

  46. IBM Almaden’s Image Search Software • Able to index a large collection of still images • Able to find similar images • User selects image, asks for similar shapes • User draws shapes • User filters by color, textual metadata • Samples available online: • Searchable digital postage stamp archive • www.qbic.almaden.ibm.com/cgi-bin/stamps-demo • Searchable archive of trademarks (logos)
