Distributed Information Discovery

Lecture 14. Distributed Information Discovery. CS 430, Carl Lagoze, 2001-03-08.


Presentation Transcript


  1. Lecture 14 Distributed Information Discovery CS 430 Carl Lagoze 2001-03-08

  2. Goals and Motivation • Lesson from the Web: relevant and valuable information is “everywhere” • Rethinking the “library” in the digital age: not as a collector of information, but rather as an access point to distributed information • Perfect scenario: uniform access to all information with rich functionality

  3. Problems with the Perfect Scenario • Heterogeneity – what is the structure of the information we wish to discover? • Reliability – machines, networks, and organizations are sometimes (often) flaky • Complexity – cost vs. functionality tradeoff

  4. Function versus cost of acceptance [Figure: plot with axes "Function" and "Cost of acceptance", positioning Metadata Harvesting, SDLIP, and Z39.50]

  5. Z39.50 http://www.loc.gov/z3950/agency/

  6. Aims of Z39.50 • Permits one computer, the client, to search and retrieve information on another, the database server • Important both technically and for its wide use in library systems • Most development has concentrated on bibliographic data • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records

  7. Technical history • Z39.50 • Developed for X.25 networks (connection-oriented); support for running over TCP was fitted later • Original concept dates from about 1980, when repeating a search was computationally expensive • WAIS is a stateless derivative of an early version of Z39.50

  8. Z39.50 principles • Abstract view of database searching. • Server stores a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction.

  9. State • Z39.50 • The server carries out the search and builds a result set • The server saves the result set • Subsequent messages from the client can reference the result set • Thus the client can refine a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again

  10. Z39.50 services • init -- client connects to the server and exchanges initial information, e.g., preferred message size • explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options • search -- client presents a query to a database • choice of syntaxes for specifying searches • only Boolean queries are widely implemented • one or more records may be returned to the client

  11. Z39.50 services • manipulation of result sets -- e.g., sort or delete • present -- requests the server to send specified records from the result set to the client in a specified format • options -- for controlling content and formats, and for managing large records or large result sets
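
The session and result-set ideas on the preceding slides can be illustrated with a toy, in-memory server. This is only a sketch of the stateful pattern (search once, then sort and present from the saved set); the class `ToyServer`, its method signatures, and the 4096 size cap are assumptions made for illustration and are not the Z39.50 wire protocol or any real library's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical in-memory server that remembers result sets per session."""
    databases: dict                                  # database name -> list of record dicts
    result_sets: dict = field(default_factory=dict)  # result set name -> saved hits

    def init(self, preferred_message_size=1024):
        # Start a session: clear old state and agree on a message size.
        # The cap of 4096 is an arbitrary illustrative value.
        self.result_sets.clear()
        return {"message_size": min(preferred_message_size, 4096)}

    def search(self, database, predicate, result_set_name="default"):
        # Run the search once and save the result set on the server.
        hits = [r for r in self.databases[database] if predicate(r)]
        self.result_sets[result_set_name] = hits
        return len(hits)

    def sort(self, result_set_name, key):
        # Manipulate a saved result set without searching again.
        self.result_sets[result_set_name].sort(key=lambda r: r[key])

    def present(self, result_set_name, start, count):
        # Return `count` records beginning at 1-based position `start`.
        hits = self.result_sets[result_set_name]
        return hits[start - 1:start - 1 + count]

    def delete(self, result_set_name):
        # Discard a saved result set.
        del self.result_sets[result_set_name]


# Toy data for illustration only.
server = ToyServer({"Books": [
    {"title": "The Song of Hiawatha", "author": "Longfellow"},
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "Walden", "author": "Thoreau"},
]})
server.init()
count = server.search("Books", lambda r: r["author"] == "Longfellow", "rs1")
server.sort("rs1", "title")
first = server.present("rs1", 1, 1)
```

The point of the design is that `sort` and `present` only touch the saved result set, so the database is searched a single time per query.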

  12. Sample query • In the database named "Books" find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow." • Z39.50 defines a rich variety of search access points that can be extended by implementers
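
As a rough illustration of the sample query, a Boolean query can be modeled as a nested tuple and evaluated against records. The tuple encoding, the `matches` function, and the toy records below are assumptions for illustration; they are not a Z39.50 query syntax.

```python
def matches(query, record):
    """Evaluate a nested Boolean query tuple against one record dict."""
    op = query[0]
    if op == "term":
        # ("term", access_point, value): case-insensitive "contains" test.
        _, access_point, value = query
        return value.lower() in record.get(access_point, "").lower()
    if op == "and":
        return all(matches(q, record) for q in query[1:])
    if op == "or":
        return any(matches(q, record) for q in query[1:])
    if op == "not":
        return not matches(query[1], record)
    raise ValueError(f"unknown operator: {op}")


# Toy stand-in for the "Books" database; the second record is made up.
books = [
    {"title": "Evangeline: A Tale of Acadie", "author": "Longfellow, Henry Wadsworth"},
    {"title": "Evangeline", "author": "Someone Else"},
    {"title": "Paul Revere's Ride", "author": "Longfellow, Henry Wadsworth"},
]

# title contains "evangeline" AND author contains "longfellow"
query = ("and", ("term", "title", "evangeline"), ("term", "author", "longfellow"))
hits = [b["title"] for b in books if matches(query, b)]
```

Only the first record satisfies both terms, so `hits` holds a single title.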

  13. Simple Digital Library Interoperability Protocol http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

  14. SDLIP • Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical of ad hoc search interface design on the web • Developed jointly by Stanford, Berkeley, and UC Santa Barbara • Heavily influenced by DASL from IETF

  15. SDLIP – search middleware

  16. Managing complexity through separate interfaces

  17. SDLIP Interfaces • Search Interface – defines a simple query language; the protocol can then include other languages • Result Interface – parking meter metaphor supports varying notions of result sets • Source Metadata Interface – provides an extension mechanism through discovery of server capabilities

  18. Open Archives Initiative Metadata Harvesting Protocol http://www.openarchives.org

  19. OAI Metadata Harvesting Protocol • Low-barrier framework for repository interoperability • Minimal burden for data providers • Plug-in concept to allow community and service specialization

  20. Metadata Harvesting [Figure: metadata harvested from multiple e-print archives]

  21. Metadata Harvesting [Figure: metadata harvested from multiple e-print archives; metadata fields shown: Author, Title, Abstract, Identifier]

  22. OAI core concepts • low-barrier interoperability • data-provider & service-provider model • metadata harvesting model • shared metadata format and parallel, community-specific metadata formats • OAI 1.0 protocol: HTTP based • Reply: XML Schema, self contained • Metadata formats: Dublin Core, community specific
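
A minimal sketch of what a harvester (service provider) might do: build an HTTP request for records in the shared Dublin Core format, then pull metadata fields out of an XML reply. The base URL is hypothetical, and `SAMPLE_REPLY` is a deliberately simplified, made-up document that does not follow the actual OAI reply schema; only the `ListRecords` verb, the `oai_dc` metadata prefix, and the Dublin Core namespace are taken from the protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a harvesting request URL for a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Simplified, made-up reply used only to show the parsing step.
SAMPLE_REPLY = """<reply xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:identifier>example:0001</dc:identifier>
    <dc:title>A sample e-print</dc:title>
    <dc:creator>A. Author</dc:creator>
  </record>
</reply>"""


def extract_records(xml_text):
    """Return one dict of Dublin Core fields per <record> element."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("record"):
        fields = {}
        for child in rec:
            # Strip the "{namespace}" prefix ElementTree puts on tag names.
            fields[child.tag.replace("{" + DC_NS + "}", "")] = child.text
        records.append(fields)
    return records


url = list_records_url("http://archive.example.org/oai")  # hypothetical base URL
records = extract_records(SAMPLE_REPLY)
```

The data provider only has to answer simple HTTP requests with XML, which is where the low barrier comes from; indexing and search are left to the service provider.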

  23. Some thoughts • There is not (and never will be) one right solution (technical vs. cost vs. complexity vs. ??) • Distributed technical solutions have organizational ramifications • Distributed resource discovery (as with any distributed computing solution) entails various tradeoffs

  24. Distributed Searching Issues: Global Distribution

  25. Broadcast Distributed Search

  26. Backup Index Server • replicates all query servers • used when primary is down [Figure: backup index]

  27. Deploying a Collection Globally • Internet connectivity varies considerably • Good connectivity between nodes often does not correspond to geographic proximity • Connectivity Region: a group of network nodes that have good connectivity among themselves, relative to their connectivity with nodes outside the region

  28. Connectivity Regions • When possible, route queries within the region • In case of failure, use an alternate either within the region or in a “nearby” region
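
The routing rule for connectivity regions can be sketched as a simple fallback search: try servers in the client's own region first, then servers in nearby regions. The region names, the `nearby` ordering, and the `is_up` health check below are hypothetical; the slides do not say how regions or failures are detected.

```python
def pick_server(client_region, servers, is_up, nearby):
    """Return the first live server, preferring the client's own region.

    servers: region name -> list of server names
    is_up:   callable(server) -> bool (hypothetical health check)
    nearby:  region name -> list of neighbouring regions, closest first
    """
    for region in [client_region] + nearby.get(client_region, []):
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None  # no reachable server in the region or its neighbours


# Hypothetical deployment used for illustration.
servers = {"us-east": ["e1", "e2"], "europe": ["eu1"]}
nearby = {"us-east": ["europe"]}

down = {"e1"}
same_region_alternate = pick_server("us-east", servers, lambda s: s not in down, nearby)

down = {"e1", "e2"}
nearby_region_alternate = pick_server("us-east", servers, lambda s: s not in down, nearby)
```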

  29. Distributed Searching Issues: Query Routing

  30. Routing Problem: Disjoint Indexes • Query: author=Hopcroft? • Index I1: Hopcroft doc8; Tarjan doc9 • Index I2: Tarjan doc6; Wilensky doc7 • Index I3: Hopcroft doc1, doc2; Hartmanis doc3, doc4 • Content Summary: Hopcroft I1, I3; Hartmanis I3; Tarjan I1, I2; Wilensky I2 • The query is routed to I1 and I3, which return doc8 and doc1, doc2
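
Using the index contents and content summary from the slide above, routing with disjoint indexes can be sketched as follows: consult the summary to find which indexes mention the author, query only those, and merge the results. The function names and dictionary layout are assumptions for illustration.

```python
# Index contents as given on the slide.
indexes = {
    "I1": {"Hopcroft": ["doc8"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc1", "doc2"], "Hartmanis": ["doc3", "doc4"]},
}


def build_content_summary(indexes):
    """Map each author to the list of indexes that hold entries for them."""
    summary = {}
    for name, index in indexes.items():
        for author in index:
            summary.setdefault(author, []).append(name)
    return summary


def route_and_search(author, indexes, summary):
    """Query only the indexes the summary points to and merge their results."""
    targets = summary.get(author, [])
    docs = []
    for name in targets:
        docs.extend(indexes[name][author])
    return targets, sorted(docs)


summary = build_content_summary(indexes)
targets, docs = route_and_search("Hopcroft", indexes, summary)
```

For author=Hopcroft the summary selects I1 and I3, so I2 is never contacted, in contrast to a broadcast search that would query every index.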

  31. Routing Problem: Replicated Distributed Indexes • Query: author=Hopcroft? • Two replicas of the index holding Hopcroft doc8; Tarjan doc9 • Two replicas of the index holding Tarjan doc6; Wilensky doc7

  32. Routing Issues • Choice of primary?, secondary?, etc. • Fault-tolerance • Routing factors: performance-based, freshness-based, cost-based, or a weighted mix based on user preference
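
One way to read the "weighted mix" of routing factors is as a weighted score per replica, with the ranking giving the primary, secondary, and so on. This is a sketch under that assumption; the factor names, units, and sample values are hypothetical, and in practice the factors would need to be normalized to comparable scales before weighting.

```python
def rank_replicas(replicas, weights):
    """Order replica names from best to worst by weighted score.

    Every factor is expressed so that lower is better (slower, staler,
    and costlier replicas score higher and therefore rank later).
    """
    def score(replica):
        return sum(weight * replica[factor] for factor, weight in weights.items())

    return [r["name"] for r in sorted(replicas, key=score)]


# Hypothetical measurements: response time in seconds, staleness in hours, cost in arbitrary units.
replicas = [
    {"name": "A", "response_time": 0.2, "staleness": 24.0, "cost": 1.0},
    {"name": "B", "response_time": 0.8, "staleness": 1.0, "cost": 1.0},
]

performance_first = rank_replicas(replicas, {"response_time": 1.0, "staleness": 0.0, "cost": 0.0})
freshness_first = rank_replicas(replicas, {"response_time": 0.0, "staleness": 1.0, "cost": 0.0})
```

The first name in each ranking would serve as the primary and the next as the fallback if the primary fails, which covers the fault-tolerance bullet in the same pass.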

  33. Components of Replicated Routing Problem • Metadata Issue: metadata made available by indexer to aid in routing • Metadata Distribution Issue: topology of metadata repositories • Decision Issue: routing decision algorithms • Fault-tolerance: use of backup indexers

  34. Distributed Metadata for Query Routing [Figure: central metadata store]

  35. Performance-based Routing - present • Timed low pass filter (parameter T) • Average response time • Predicted response time • New = low pass filter(T, actual response time, old)
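
The slide does not define the filter, so the following is only one plausible reading: an exponentially weighted average in which T acts as a time constant and the weight given to a new measurement grows with the time elapsed since the last update. The `elapsed` parameter and the exponential form are assumptions, not taken from the lecture.

```python
import math


def low_pass_filter(T, actual, old, elapsed=1.0):
    """Blend a newly measured response time into the running prediction.

    T       -- time constant (assumed meaning of the slide's T); larger is smoother
    actual  -- most recently observed response time
    old     -- previous predicted response time
    elapsed -- time since the previous update, in the same units as T (assumption)
    """
    alpha = 1.0 - math.exp(-elapsed / T)  # weight of the new measurement, between 0 and 1
    return old + alpha * (actual - old)


# Feed a series of observed response times (made-up values) through the filter.
predicted = 2.0
for observed in [2.0, 4.0, 4.0, 4.0]:
    predicted = low_pass_filter(10.0, observed, predicted)
```

With a large T a single slow response barely moves the prediction, while a sustained change in response time gradually pulls it toward the new level.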
