Introduction to Digital Libraries Distributed Searching


Web Search
• Distributed Data: Documents spread over millions of different web servers.
• Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
• Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.
• Quality of Data: No editorial control, false information, poor quality writing, typos, etc.

Dead End Searches
• Sometimes pages become either temporarily or permanently inactive.
• Internet Explorer inactive page.
• Netscape inactive page.

43 million web servers • 167 Terabytes of data

The Web (Corpus) by the Numbers (2)
• 1 Kilobyte = a very short story: “Jack and Jill went up the hill to fetch a pail of water. Jack fell down and broke his crown and Jill came tumbling after.”
• 1 Megabyte = a short book
• 1 Gigabyte = 20 meters of shelved books
• 1 Terabyte = an academic research library
• 35 Terabytes of text on surface Web? That is 35 academic research libraries (with some 20,000 meters of shelved books each!)
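The shelf-length comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 Terabyte = 1000 Gigabytes), and the constant and function names are illustrative.

```python
METERS_PER_GIGABYTE = 20        # from the slide: 1 Gigabyte = 20 meters of shelved books
GIGABYTES_PER_TERABYTE = 1000   # assumption: decimal units

def shelf_meters(terabytes):
    """Convert a data volume in Terabytes to meters of shelved books."""
    return terabytes * GIGABYTES_PER_TERABYTE * METERS_PER_GIGABYTE

print(shelf_meters(1))    # one academic research library
print(shelf_meters(35))   # the estimated text on the surface Web
```

One Terabyte comes out at 20,000 meters, which matches the "some 20,000 meters of shelved books each" figure for a research library.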

Search engines

Choose a Search Tool

about.com

looksmart.com

askjeeves.com

AltaVista Uses a Spider Search

Lycos Searches Selected Words

Overlap Among 3 Major Search Engines
http://missingpieces.dogpile.com/whitepaper.pdf
http://comparesearchengines.dogpile.com/OverlapAnalysis.pdf

Metasearch or ……
• parallel search
• federated search
• broadcast search
• cross-database search
The Hidden Web of content databases is estimated to be thousands of times larger than the Open Web.

The (non) Metasearch
Search 1
- Find search engine
- Logon
- Compose search
- Run search
- Study results
- Refine results
- Find document
- Get document
Search 2
- Find search engine
- Logon
- Compose search
- Run search
- Study results
- Refine results
- Find document
- Get document
Search 3
. . .

The Metasearch
Search 1
- Find MetaSearch engine
- Logon
- Compose search
- Select sources
- Run search (Metasearch engine runs Searches 1a, 1b, 1c)
- Study results
- Refine results
- Find document
- Get document
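The "Run search" step, where one user query fans out into Searches 1a, 1b and 1c, might be sketched as follows. The three sources are hypothetical in-memory stand-ins, not real search engines, and `metasearch` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three remote search engines (Searches 1a, 1b, 1c).
# A real metasearcher would send network requests here instead.
SOURCES = {
    "1a": lambda query: [f"1a hit for '{query}'"],
    "1b": lambda query: [f"1b hit for '{query}'"],
    "1c": lambda query: [f"1c hit for '{query}'"],
}

def metasearch(query, selected_sources):
    """Run one query against every selected source in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(SOURCES[name], query) for name in selected_sources}
        return {name: future.result() for name, future in futures.items()}

results = metasearch("digital libraries", ["1a", "1b", "1c"])
for source, hits in results.items():
    print(source, hits)
```

The user composes and runs a single search; the per-source result lists still have to be merged before they can be studied as one list.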

Spiders (Robots/Bots/Crawlers)
• A program that automatically fetches Web pages.
• Spiders are used to feed pages to search engines.
• It's called a spider because it crawls over the Web.
• Another term for these programs is webcrawler.

Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URLs from which to start the search.
• Follow all links on these pages recursively to find additional pages.
• Index/process all newly found pages in an inverted index as they are encountered.
• May allow users to directly submit pages to be indexed (and crawled from).
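A minimal sketch of these steps, assuming a tiny in-memory "web" instead of real HTTP fetching. The URLs, page texts, and the `crawl` function are made up for illustration, and the links are followed with an explicit queue rather than literal recursion.

```python
from collections import defaultdict, deque

# Hypothetical miniature web: URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("digital libraries home", ["a.example/search", "b.example/"]),
    "a.example/search": ("distributed search", ["a.example/"]),
    "b.example/": ("web search engines", ["a.example/search", "c.example/missing"]),
}

def crawl(root_urls):
    """Follow links from the root URLs and build an inverted index (term -> URLs)."""
    index = defaultdict(set)
    seen = set(root_urls)
    frontier = deque(root_urls)
    while frontier:
        url = frontier.popleft()
        page = WEB.get(url)
        if page is None:          # dead link: nothing to fetch or index
            continue
        text, links = page
        for term in text.split():
            index[term].add(url)  # index each newly found page as it is encountered
        for link in links:
            if link not in seen:  # only queue pages not visited or queued before
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["a.example/"])
print(sorted(index["search"]))
```

A user-submitted page would simply be one more entry in `root_urls`.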

Parts of the Internet Search Tool
• The Spider searches the WWW (World Wide Web) and adds sites to the Database.
• The Database is kept filled and updated by the Spider.
• The Interface views the selected Database items.

Search Strategies Breadth-first Search

Search Strategies (cont) Depth-first Search
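The two crawl strategies differ only in the order in which discovered links are visited. The sketch below runs both on a small hypothetical link graph; the page names and function names are assumptions for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [], "D": [], "E": [],
}

def breadth_first(start):
    """Visit all pages one link away, then two links away, and so on."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def depth_first(start, seen=None):
    """Follow the first link as deep as possible before backing up."""
    seen = set() if seen is None else seen
    seen.add(start)
    order = [start]
    for link in LINKS[start]:
        if link not in seen:
            order.extend(depth_first(link, seen))
    return order

print(breadth_first("root"))  # ['root', 'A', 'B', 'C', 'D', 'E']
print(depth_first("root"))    # ['root', 'A', 'C', 'D', 'B', 'E']
```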

Search pollution Search for Subcategories

Metasearching
• 3 functions of a metasearcher
  • choosing the sources to query
    • the source-metadata problem
  • dispatching the query to those sources
    • the query language problem
  • merging the query results
    • the rank-merging problem

Source-Metadata Problem
• How do you choose which sources to query?
  • manual
    • applicable only for small #’s of sources
  • automatic
    • how does the metasearcher “know” the nature of the various sources?
    • what if there are 1000s of different sources?
• approaches
  • extract enough of their publicly available information and guess
  • have the source explicitly export a description of itself
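One way to picture the second approach: each source exports a set of descriptive terms, and the metasearcher picks the sources whose description overlaps the query most. The source names, their descriptions, and the overlap scoring are all assumptions made for this sketch, not a method taken from the slides.

```python
# Hypothetical self-descriptions exported by three sources.
SOURCE_DESCRIPTIONS = {
    "medical_db": {"medicine", "clinical", "drug", "disease"},
    "cs_papers": {"algorithm", "software", "retrieval", "network"},
    "news_archive": {"politics", "economy", "sports", "disease"},
}

def choose_sources(query, max_sources=2):
    """Rank sources by how many query terms appear in their description."""
    terms = set(query.lower().split())
    scored = [(len(terms & description), name)
              for name, description in SOURCE_DESCRIPTIONS.items()]
    # Keep only sources with at least one matching term; break ties by name.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:max_sources]]

print(choose_sources("clinical disease retrieval"))  # ['medical_db', 'cs_papers']
```

With thousands of sources the same idea still applies, but the descriptions themselves would need to be indexed rather than scanned one by one.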

Query-Language Problem
• Different remote sources use different search engines with different syntaxes
  • boolean vs. vector
  • even if all support boolean, still could have different syntax
  • different field names for fielded searching
  • stemming vs. no stemming
  • different stop lists
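To illustrate the syntax and field-name differences, the sketch below translates one list of title terms into two made-up query dialects. Neither dialect belongs to a real search engine; `source_a`, `source_b`, and their operators are hypothetical.

```python
# Hypothetical per-source dialects: each has its own AND operator and
# its own way of naming the title field.
DIALECTS = {
    "source_a": {"and": " AND ", "field": "title:{term}"},
    "source_b": {"and": " & ", "field": "ti={term}"},
}

def translate(terms, source):
    """Render a conjunction of title terms in the syntax of one source."""
    dialect = DIALECTS[source]
    parts = [dialect["field"].format(term=term) for term in terms]
    return dialect["and"].join(parts)

print(translate(["digital", "libraries"], "source_a"))  # title:digital AND title:libraries
print(translate(["digital", "libraries"], "source_b"))  # ti=digital & ti=libraries
```

Differences in stemming and stop lists are harder to paper over, because they change which documents match rather than how the query is written.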

Rank-Merging Problem
• Each source ranks its own results, and that ranking is valid only locally
  • there is no global scale to rank against, so ranked results cannot be “shuffled” together meaningfully
• Issues:
  • proprietary ranking algorithms (AltaVista, Infoseek, etc.)
  • even if the algorithms are known or even homogeneous, the collection that the document comes from impacts its ranking
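Because scores from different sources are not comparable, a simple baseline is to ignore the scores and interleave the lists by rank position (round-robin). This is a swapped-in illustration rather than a technique named on the slide, and it does not solve the problem: it just treats every source as equally trustworthy.

```python
from itertools import zip_longest

def round_robin_merge(*ranked_lists):
    """Interleave ranked result lists by position, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):   # all rank-1 results, then rank-2, ...
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

print(round_robin_merge(["a1", "a2", "a3"], ["b1", "a2"]))  # ['a1', 'b1', 'a2', 'a3']
```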

Document Classification
• Class groups: (AI), (Programming), (HCI)
• Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI
• Training Data:
  • ML: learning intelligence algorithm reinforcement network...
  • Planning: planning temporal reasoning plan language...
  • Semantics: programming semantics language proof...
  • Garb.Coll.: garbage collection memory optimization region...
  • ...
• Testing Data: “planning language proof intelligence”
(Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)
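A toy version of this problem: score the test document against each class by counting shared words. The word-overlap scoring is an assumption chosen for brevity, not the classifier the slides have in mind, and the pairing of each training line with a class is a reading of the flattened figure.

```python
# Training data from the slide, paired with classes (pairing is assumed).
TRAINING = {
    "ML": "learning intelligence algorithm reinforcement network",
    "Planning": "planning temporal reasoning plan language",
    "Semantics": "programming semantics language proof",
    "Garb.Coll.": "garbage collection memory optimization region",
}

def overlap_scores(document):
    """Count how many distinct words the document shares with each class."""
    words = set(document.split())
    return {label: len(words & set(text.split())) for label, text in TRAINING.items()}

print(overlap_scores("planning language proof intelligence"))
# {'ML': 1, 'Planning': 2, 'Semantics': 2, 'Garb.Coll.': 0}
```

The test document ties between Planning and Semantics, which shows why plain word overlap is too crude and why real classifiers weight terms.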

Clustering of Text
• Cluster documents on basis of terms they contain
• Cluster documents on basis of co-occurring citations
• Cluster terms on basis of documents they occur in

Methods (1)
• Manual classification
  • Used by Yahoo!, Looksmart, about.com, ODP, Medline
  • very accurate when job is done by experts
  • consistent when the problem size and team are small
  • difficult and expensive to scale
• Automatic document classification
  • Hand-coded rule-based systems
    • Used by spam filters, Reuters, CIA, Verity, …
    • E.g., assign category if document contains a given Boolean combination of words
    • Commercial systems have complex query languages (everything query languages + accumulators)
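A hand-coded rule of the "Boolean combination of words" kind might look like the sketch below. The categories and rules are invented for illustration and are far simpler than a commercial rule language.

```python
# Hypothetical hand-written rules: category -> Boolean test over the document's words.
RULES = {
    "wheat": lambda w: "wheat" in w and ("farm" in w or "commodity" in w) and "soft" not in w,
    "spam": lambda w: "free" in w and "prize" in w,
}

def classify(text):
    """Return every category whose rule matches the document."""
    words = set(text.lower().split())
    return [category for category, rule in RULES.items() if rule(words)]

print(classify("Wheat farm prices rise"))   # ['wheat']
print(classify("Claim your free prize"))    # ['spam']
print(classify("soft wheat farm report"))   # []
```

Each rule is easy to read and audit, but someone has to write and maintain one for every category.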

Cross Language Web Search

Cross-Language IR
• Accepting questions in one language, retrieving information in a variety of other languages
  • “questions”: web-style, or full-text narratives
  • “information”: web-sites, news, articles, speech, …
• Why is it useful?
  • infeasible to translate collection into every language
  • feasible to translate selected docs from the ranked list
  • bilingual users prefer to see the original language
  • convenient to retrieve in both languages with a single query
  • CL-IR provides useful insights for conventional IR

The General Problem
• Find documents written in any language
• Using queries expressed in a single language

Top Ten Languages on the Web internetworldstats.com, March, 2011

Supply Side: Internet Hosts
Guess – What will be the most widely used language on the Web in 2010?
Source: Network Wizards Jan 99 Internet Domain Survey

Top Spoken Languages

Web Pages Global Internet Users

Search Technology
[Figure: pipeline with the components Language Identification, English Feature Assignment, Chinese Feature Assignment (shown twice), Chinese Query, Monolingual Chinese Matching (results 1: 0.72, 2: 0.48), and Cross-Language Matching (results 3: 0.91, 4: 0.57, 5: 0.36).]

Language Identification
• Can be specified using metadata
  • Included in HTTP and HTML
• Can be determined using word-scale features
  • Which dictionary gets the most hits?
• Can be determined using subword features
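The word-scale approach ("Which dictionary gets the most hits?") can be sketched with a few tiny word lists. The dictionaries here are illustrative samples only; a usable system would need much larger vocabularies.

```python
# Hypothetical mini-dictionaries of common words per language.
DICTIONARIES = {
    "english": {"the", "and", "of", "is", "book"},
    "german": {"der", "und", "das", "ist", "buch"},
    "french": {"le", "et", "la", "est", "livre"},
}

def identify_language(text):
    """Return the language whose dictionary matches the most words, or None."""
    words = text.lower().split()
    hits = {language: sum(word in vocabulary for word in words)
            for language, vocabulary in DICTIONARIES.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

print(identify_language("Das Buch ist gut und neu"))  # german
print(identify_language("the book is old"))           # english
```

Subword features such as character n-grams follow the same counting idea but do not depend on whole words being in a dictionary.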

Dealing with Morphology
• Consider Arabic: root + patterns + suffixes + prefixes = word
  • ktb + CiCaC = kitab
• All verbs and nouns derived from fewer than 2000 roots
• Roots too abstract for information retrieval
• ktb →
  • kitab: a book
  • kitabi: my book
  • alkitab: the book
  • kitabuki: your book (f)
  • kitabuka: your book (m)
  • kitabuhu: his book
  • kataba: to write
  • maktab: office
  • maktaba: library, bookstore
  • ...
• Want stem = root + pattern + derivational affixes?
• No standard stemmers available, only morphological (root) analyzers
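The root-plus-pattern step (ktb + CiCaC = kitab) can be illustrated mechanically: each "C" in the pattern is replaced by the next root consonant. Only CiCaC comes from the slide; the other pattern strings are assumptions written to reproduce the listed transliterations, and the sketch ignores prefixes, suffixes, and real Arabic orthography.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next consonant of the root."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one 'C' slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

print(apply_pattern("ktb", "CiCaC"))    # kitab
print(apply_pattern("ktb", "maCCaC"))   # maktab
print(apply_pattern("ktb", "CaCaCa"))   # kataba
```

Going the other way, from a surface word back to a useful stem, is the hard part the slide points at.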

Citation Indexing

Citation Indexing
• Premise: authors already use citations… use these to navigate the corpus
  • previously, subject indexes (and later, title indexes) were hand crafted to organize the literature
  • using citations, the literature organizes itself
• each paper contains ~ 15 citations
  • some more (e.g., biochemistry), some less (e.g. mathematics)
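At its core, a citation index inverts each paper's reference list so that one can ask "who cites this paper?". A minimal sketch with made-up paper identifiers:

```python
from collections import defaultdict

# Hypothetical reference lists: citing paper -> papers it cites.
REFERENCES = {
    "paper_A": ["paper_C", "paper_D"],
    "paper_B": ["paper_C"],
    "paper_C": ["paper_D"],
}

def build_citation_index(references):
    """Invert reference lists into a map: cited paper -> papers that cite it."""
    cited_by = defaultdict(list)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].append(citing)
    return dict(cited_by)

index = build_citation_index(REFERENCES)
print(index["paper_C"])  # ['paper_A', 'paper_B']
print(index["paper_D"])  # ['paper_A', 'paper_C']
```

No hand-built subject index is involved: the structure comes entirely from the citations the authors already wrote.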

full text reference linking

Who is Who extended services
