
Introduction to Digital Libraries Distributed Searching


eve-pearson

Presentation Transcript


  1. Introduction to Digital Libraries: Distributed Searching

  2. Web Search • Distributed Data: Documents spread over millions of different web servers. • Volatile Data: Many documents change or disappear rapidly (e.g. dead links). • Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. • Quality of Data: No editorial control, false information, poor quality writing, typos, etc.

  3. Dead End Searches • Sometimes pages become either temporarily or permanently inactive. • Internet Explorer inactive page. • Netscape inactive page.

  4. 43 million web servers • 167 Terabytes of data

  5. The Web (Corpus) by the Numbers (2) • 1 Kilobyte = a very short story (“Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his crown and Jill came tumbling after.”) • 1 Megabyte = a short book • 1 Gigabyte = 20 meters of shelved books • 1 Terabyte = an academic research library • 35 Terabytes of text on surface Web? = 35 academic research libraries (with some 20,000 meters of shelved books each!)

  6. Search engines

  7. Choose a Search Tool

  8. about.com

  9. looksmart.com

  10. askjeeves.com

  11. AltaVista Uses a Spider Search

  12. Lycos Searches Selected Words

  13. Overlap Among 3 Major Search Engines • http://missingpieces.dogpile.com/whitepaper.pdf • http://comparesearchengines.dogpile.com/OverlapAnalysis.pdf

  14. Metasearch or …… • parallel search • federated search • broadcast search • cross-database search The Hidden Web of content databases is estimated to be thousands of times larger than the Open Web.

  15. The (non) Metasearch • Search 1: Find search engine - Logon - Compose search - Run search - Study results - Refine results - Find document - Get document • Search 2: Find search engine - Logon - Compose search - Run search - Study results - Refine results - Find document - Get document • Search 3: . . .

  16. The Metasearch • Search 1: Find MetaSearch Engine - Logon - Compose search - Select Sources - Run search (Metasearch engine runs Searches 1a, 1b, 1c) - Study results - Refine results - Find document - Get document
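The "Run search" step, where the metasearch engine runs Searches 1a, 1b and 1c on the user's behalf, might look roughly like the sketch below. The three source functions are hypothetical stand-ins for remote search engines, not real services, and the thread pool is only one possible way to broadcast a query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three remote search engines (Searches 1a, 1b, 1c).
def search_a(query):
    return [f"a:{query}:doc1", f"a:{query}:doc2"]

def search_b(query):
    return [f"b:{query}:doc1"]

def search_c(query):
    return []  # a source may return nothing

def metasearch(query, sources):
    """Send one query to every selected source in parallel; collect results per source."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = metasearch("libraries", {"1a": search_a, "1b": search_b, "1c": search_c})
```

The user composes one search; the per-source result lists still have to be merged before they can be studied as a single list.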

  17. Spiders (Robots/Bots/Crawlers) • A program that automatically fetches Web pages. • Spiders are used to feed pages to search engines. • It's called a spider because it crawls over the Web. Another term for these programs is webcrawler.

  18. Spiders (Robots/Bots/Crawlers) • Start with a comprehensive set of root URLs from which to start the search. • Follow all links on these pages recursively to find additional pages. • Index/process all newly found pages in an inverted index as they are encountered. • May allow users to directly submit pages to be indexed (and crawled from).
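A minimal sketch of these steps, assuming a toy in-memory "web" instead of real HTTP fetches. The URLs, page texts and the `crawl` function are made up for illustration, and the recursion is written as a loop over a queue of pending URLs.

```python
from collections import defaultdict, deque

# Toy in-memory "web": URL -> (page text, outgoing links). All entries are invented.
WEB = {
    "http://a.example/": ("digital libraries index", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("distributed search index", ["http://a.example/"]),
    "http://c.example/": ("web crawler", ["http://dead.example/"]),
}

def crawl(root_urls, fetch):
    """Follow links from the root URLs and index each newly found page once."""
    index = defaultdict(set)          # inverted index: term -> set of URLs
    seen = set(root_urls)
    frontier = deque(root_urls)
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:              # dead link: nothing to index
            continue
        text, links = page
        for term in text.split():
            index[term].add(url)
        for link in links:
            if link not in seen:      # only novel pages are queued
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a.example/"], WEB.get)
```

A user-submitted page would simply be appended to `root_urls` in this sketch.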

  19. Parts of the Internet Search Tool • The Spider searches the WWW (World Wide Web) and adds sites to the Database. • The Database is kept filled and updated by the Spider. • The Interface views the selected Database items.

  20. Search Strategies Breadth-first Search

  21. Search Strategies (cont) Depth-first Search
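The two strategies differ only in which pending page is visited next. The sketch below uses a small hypothetical link graph (`LINKS`) to show the visiting order each one produces.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [], "D": [], "E": [],
}

def breadth_first(start):
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()        # FIFO: oldest pending page first
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def depth_first(start):
    order, seen, stack = [], set(), [start]
    while stack:
        page = stack.pop()            # LIFO: newest pending page first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is explored first.
        stack.extend(reversed(LINKS[page]))
    return order

bfs_order = breadth_first("root")   # level by level
dfs_order = depth_first("root")     # one branch to the end before the next
```

Breadth-first visits all pages one link away before any page two links away; depth-first follows a single chain of links as far as it goes.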

  22. Search pollution Search for Subcategories

  23. Metasearching • 3 functions of a metasearcher: • choosing the sources to query (the source-metadata problem) • dispatching the query to those sources (the query-language problem) • merging the query results (the rank-merging problem)

  24. Source-Metadata Problem • How do you choose which sources to query? • manual: applicable only for small numbers of sources • automatic: how does the metasearcher “know” the nature of the various sources? What if there are 1000s of different sources? • approaches: extract enough of their publicly available information and guess, or have the source explicitly export a description of itself
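One way to picture the "source exports a description of itself" approach: each source publishes a few descriptive terms, and the metasearcher picks the sources whose descriptions overlap most with the query. The source names, descriptions and scoring rule below are assumptions for illustration only.

```python
# Hypothetical exported descriptions: source name -> terms describing its content.
SOURCE_DESCRIPTIONS = {
    "medline_like": {"medicine", "clinical", "disease", "drug"},
    "cs_papers": {"algorithm", "retrieval", "network", "software"},
    "news": {"politics", "sports", "weather"},
}

def choose_sources(query, descriptions, max_sources=2):
    """Rank sources by how many query terms appear in their description."""
    terms = set(query.lower().split())
    scored = [(len(terms & desc), name) for name, desc in descriptions.items()]
    # Keep only sources with at least one matching term, best match first.
    matching = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in matching[:max_sources]]

chosen = choose_sources("retrieval algorithm for drug names", SOURCE_DESCRIPTIONS)
```

With thousands of sources the same idea applies, but the descriptions would have to be collected and kept current automatically.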

  25. Query-Language Problem • Different remote sources use different search engines with different syntaxes • boolean vs. vector • even if all support boolean, still could have different syntax • different field names for fielded searching • stemming vs. no stemming • different stop lists
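A small sketch of the translation step, assuming two invented sources that both support Boolean AND but use different field names and operator syntax. Neither syntax is taken from a real search engine.

```python
# An abstract query: (field, term) pairs that must all match.
query = [("title", "digital"), ("author", "smith")]

# Hypothetical per-source settings: field names and Boolean syntax differ.
SOURCES = {
    "source_a": {"fields": {"title": "ti", "author": "au"},
                 "template": "{field}:{term}", "and": " AND "},
    "source_b": {"fields": {"title": "TITLE", "author": "CREATOR"},
                 "template": "{field}={term}", "and": " & "},
}

def translate(query, source):
    """Rewrite the abstract query in the syntax of one source."""
    cfg = SOURCES[source]
    parts = [cfg["template"].format(field=cfg["fields"][f], term=t) for f, t in query]
    return cfg["and"].join(parts)

query_a = translate(query, "source_a")
query_b = translate(query, "source_b")
```

Differences in stemming and stop lists are harder to paper over, since they change which documents match rather than how the query is written.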

  26. Rank-Merging Problem • Each source ranks its own results, and that ranking is valid only locally • there is no global scale to rank against, so ranked results cannot be “shuffled” together meaningfully • Issues: • proprietary ranking algorithms (AltaVista, Infoseek, etc.) • even if the algorithms are known or even homogeneous, the collection that the document comes from impacts its ranking
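Because scores from different sources are not comparable, one simple workaround (not taken from these slides) is to ignore the scores and interleave the lists by rank position. The sketch below shows this round-robin merge; the document ids are invented.

```python
from itertools import zip_longest

def round_robin_merge(result_lists):
    """Interleave ranked lists by rank position, skipping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):   # tier = all results at the same rank
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

source_1 = ["d1", "d2", "d3"]   # ranked by source 1's own algorithm
source_2 = ["d4", "d1"]         # ranked by source 2's own algorithm
merged = round_robin_merge([source_1, source_2])
```

This treats every source as equally trustworthy, which is itself an assumption; it sidesteps the missing global scale rather than solving it.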

  27. Document Classification • Classes: (AI) ML, Planning; (Programming) Semantics, Garb.Coll.; (HCI) Multimedia, GUI • Training Data: ML: learning intelligence algorithm reinforcement network... • Planning: planning temporal reasoning plan language... • Semantics: programming semantics language proof... • Garb.Coll.: garbage collection memory optimization region... • ... • Testing Data: “planning language proof intelligence” • (Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)
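To make the problem concrete, the sketch below scores the test document against each class by counting shared words. Pairing each training word list with a class (ML, Planning, Semantics, Garb.Coll.) is an assumption read off the slide layout, and word overlap is only the crudest possible classifier.

```python
# Training words per class; the class-to-words pairing is assumed from the slide.
TRAINING = {
    "ML": "learning intelligence algorithm reinforcement network",
    "Planning": "planning temporal reasoning plan language",
    "Semantics": "programming semantics language proof",
    "Garb.Coll.": "garbage collection memory optimization region",
}

def overlap_scores(document, training):
    """Count how many words of the document occur in each class's training words."""
    words = set(document.split())
    return {label: len(words & set(text.split())) for label, text in training.items()}

scores = overlap_scores("planning language proof intelligence", TRAINING)
```

On this data the test document ties between Planning and Semantics and also touches ML, which illustrates why classification needs more than word matching.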

  28. Clustering of Text • Cluster documents on basis of terms they contain • Cluster documents on basis of co-occurring citations • Cluster terms on basis of documents they occur in
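The first option, clustering documents by the terms they contain, can be sketched with Jaccard similarity and a single pass over the documents. The documents, the 0.3 threshold and the "compare with the first member" rule are all illustrative assumptions.

```python
def jaccard(a, b):
    """Shared terms divided by all terms; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose first member
    is similar enough, otherwise start a new cluster."""
    clusters = []                      # each cluster is a list of document ids
    for doc_id, text in docs.items():
        terms = set(text.split())
        for members in clusters:
            representative = set(docs[members[0]].split())
            if jaccard(terms, representative) >= threshold:
                members.append(doc_id)
                break
        else:                          # no cluster was similar enough
            clusters.append([doc_id])
    return clusters

DOCS = {
    "d1": "web search engine crawler",
    "d2": "search engine ranking web",
    "d3": "arabic morphology root stem",
    "d4": "arabic root pattern morphology",
}
clusters = cluster(DOCS)
```

Clustering terms by the documents they occur in is the same computation with the roles of terms and documents swapped.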

  29. Methods (1) • Manual classification • Used by Yahoo!, Looksmart, about.com, ODP, Medline • very accurate when job is done by experts • consistent when the problem size and team is small • difficult and expensive to scale • Automatic document classification • Hand-coded rule-based systems • Used by spam filter, Reuters, CIA, Verity, … • E.g., assign category if document contains a given Boolean combination of words • Commercial systems have complex query languages (everything query languages + accumulators)
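A hand-coded rule of the kind described, "assign category if document contains a given Boolean combination of words", might look like the sketch below. The categories and rules are invented examples, not rules from any of the systems named on the slide.

```python
# Hypothetical hand-written rules: category -> Boolean test over the document's words.
RULES = {
    "spam": lambda w: "free" in w and ("winner" in w or "prize" in w),
    "finance": lambda w: ("stock" in w or "bond" in w) and "recipe" not in w,
}

def classify(text):
    """Return every category whose rule the document satisfies."""
    words = set(text.lower().split())
    return sorted(cat for cat, rule in RULES.items() if rule(words))

labels_1 = classify("You are a WINNER claim your free prize")
labels_2 = classify("stock and bond markets fell")
```

Each rule is easy to read and audit, but someone has to write and maintain one for every category.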

  30. Cross Language Web Search

  31. Cross-Language IR • Accepting questions in one language, retrieving information in a variety of other languages • “questions”: web-style, or full-text narratives • “information”: web-sites, news, articles, speech, … • Why is it useful? • infeasible to translate collection into every language • feasible to translate selected docs from the ranked list • bilingual users prefer to see the original language • convenient to retrieve in both languages with a single query • CL-IR provides useful insights for conventional IR

  32. The General Problem Find documents written in any language • Using queries expressed in a single language

  33. Top Ten Languages on the Web internetworldstats.com, March, 2011

  34. Supply Side: Internet Hosts Guess – What will be the most widely used language on the Web in 2010? Source: Network Wizards Jan 99 Internet Domain Survey

  35. Top Spoken Languages

  36. Web Pages Global Internet Users

  37. Search Technology (diagram) • Components: Language Identification, Chinese Feature Assignment, English Feature Assignment, Chinese Query • Monolingual Chinese Matching: 1: 0.72, 2: 0.48 • Cross-Language Matching: 3: 0.91, 4: 0.57, 5: 0.36

  38. Language Identification • Can be specified using metadata • Included in HTTP and HTML • Can be determined using word-scale features • Which dictionary gets the most hits? • Can be determined using subword features
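The word-scale approach ("which dictionary gets the most hits?") can be sketched as follows. The three word lists are tiny hypothetical samples; a real system would use full dictionaries or subword features.

```python
# Tiny hypothetical word lists standing in for full dictionaries.
DICTIONARIES = {
    "english": {"the", "and", "of", "is", "book"},
    "french": {"le", "la", "et", "est", "livre"},
    "german": {"der", "die", "und", "ist", "buch"},
}

def identify_language(text):
    """Return the language whose dictionary matches the most words, or None."""
    words = text.lower().split()
    hits = {lang: sum(w in vocab for w in words) for lang, vocab in DICTIONARIES.items()}
    best = max(hits, key=hits.get)     # ties go to the first language listed
    return best if hits[best] > 0 else None

lang_1 = identify_language("le livre est sur la table")
lang_2 = identify_language("das buch ist gut und neu")
```

Short texts and texts full of names give few dictionary hits, which is one reason subword features are also used.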

  39. Dealing with Morphology • Consider Arabic: Root + patterns + suffixes + prefixes = word • ktb + CiCaC = kitab • All verbs and nouns derived from fewer than 2000 roots • Roots too abstract for information retrieval • ktb → kitab (a book), kitabi (my book), alkitab (the book), kitabuki (your book, f), kitabuka (your book, m), kitabuhu (his book), kataba (to write), maktab (office), maktaba (library, bookstore), ... • Want stem = root + pattern + derivational affixes? • No standard stemmers available, only morphological (root) analyzers
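The root-plus-pattern step, ktb + CiCaC = kitab, can be illustrated on the transliterated forms. In this sketch each "C" in a pattern is a slot for the next root consonant; the patterns CaCaCa and maCCaC are assumptions chosen to reproduce the words kataba and maktab from the slide, and the number of slots is assumed to match the number of root consonants.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next consonant of the root."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

book = apply_pattern("ktb", "CiCaC")      # pattern given on the slide
wrote = apply_pattern("ktb", "CaCaCa")    # assumed pattern for "to write"
office = apply_pattern("ktb", "maCCaC")   # assumed pattern for "office"
```

Going the other way, from a surface word back to a useful stem, is the hard part that the slide says standard stemmers did not cover.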

  40. Citation Indexing

  41. Citation Indexing • Premise: authors already use citations… use these to navigate the corpus • previously, subject indexes (and later, title indexes) were hand crafted to organize the literature • using citations, the literature organizes itself • each paper contains ~ 15 citations • some more (e.g., biochemistry), some less (e.g. mathematics)
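In data terms, a citation index is the reference lists turned around: instead of asking what a paper cites, you ask which papers cite it. A minimal sketch with made-up paper ids:

```python
from collections import defaultdict

# Hypothetical reference lists: paper -> papers it cites.
REFERENCES = {
    "paper_a": ["paper_c", "paper_d"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}

def build_citation_index(references):
    """Invert reference lists into a cited-by index."""
    cited_by = defaultdict(list)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].append(paper)
    return dict(cited_by)

citation_index = build_citation_index(REFERENCES)
```

No indexer has to assign subjects by hand; the navigation structure comes entirely from what authors already wrote in their reference lists.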

  42. full text reference linking

  43. Who is Who extended services
