This document provides a retrospective on the Digital Libraries Initiative (DLI) at the University of Illinois, highlighting the development and challenges of federating repositories to enhance information retrieval in digital libraries. It details the evolution of information retrieval technologies, collaborative projects with major scientific publishers, and the integration of SGML-based searches. The insights gained from user evaluations and testbed implementations underscore the importance of scalability and semantic interoperability in creating efficient digital library services.
Federating Repositories of Scientific Literature • www.canis.uiuc.edu • The Interspace Prototype (1997-2000) • Digital Libraries Initiative (1994-1998) • Worm Community System (1990-1993) • Telesophy System (1984-1989)
Federating Repositories of Scientific Literature: The University of Illinois Digital Libraries Initiative (DLI), Project Status & Retrospective. Bruce R. Schatz, dli@uiuc.edu, http://dli.grainger.uiuc.edu. AAAS-98, Digital Libraries Session, Philadelphia, February 1998
Evolution of Information Retrieval across the Net • [Figure: timeline from 1960 to 2010 with stages labeled Grand Visions, Text Search, Document Search, Concept Search and levels labeled Syntax, Structure, Semantics] • from: Bruce R. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net”, cover article in Science, vol 275, Jan 17, 1997, special issue on Bioinformatics
Illinois DLI Status • Production Testbed based in a Real Library • Document Search based on Structure • SGML Publisher Stream deployed at U of Illinois • Technology Research for Scalable Federation • Concept Search based on Semantics • Statistical Indexes across subjects and media
Production Testbed Status • Based in major Engineering Library • Production Stream - in testbed before on shelves • Full-text SGML -- Federated Structure Search • 5 publishers, 55 journals, 40,000 articles • Web version campus rollout October 1997 • integrated within library information services
Production Testbed Evaluation • 700 users, steadily increasing to max 1500 • used in intro Computer Science classes • developers and evaluators work closely • needs assessment and usability studies • careful multi-modal usage evaluation • session observations and transaction logs
Primary Partners • journal/magazine Publishers: • American Institute of Physics (AIP) • American Physical Society (APS) • American Astronomical Society (AAS) • American Society of Civil Engineers (ASCE) • American Society of Mechanical Engineers (ASME) • American Society of Agricultural Engineers (ASAE) • American Institute of Aeronautics & Astronautics (AIAA) • Institute of Electrical and Electronics Engineers (IEEE) • Institution of Electrical Engineers (IEE) • IEEE Computer Society (IEEE-CS) • testbed: SoftQuad, OpenText • infrastructure: Hewlett-Packard, Microsoft
Testbed Difficulties • Original plan was to modify Mosaic for search • Web became commercial -- we lost control of developers • Plan to use standard BRS as full-text backend • needed to use SGML-specific OpenText search engine • good-quality SGML simply not available • we had to train every publisher; nothing was ready • SGML interactive display not journal quality • physics requires equations -- hard to display well • Custom software hard to deploy widely • Web widespread but too low-end for professional search
Testbed Successes • Willing to build custom encoding procedures • so succeeded with SGML where Elsevier and OCLC failed • Canonical encoding for structure tags • so can federate across publishers and journals • Willing to build custom software for Search • so able to do multiple views, not a single stream like the Web • Production repositories for real Publishers • became R&D arm of major scientific publishers • Changing the nature of libraries with research • research prototype becomes standard service
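The point about canonical encoding can be illustrated with a minimal sketch. It assumes each publisher uses its own tag names for the same structural elements; the publisher keys, tag names, and sample records below are hypothetical and are not taken from the testbed. Once every record is renamed to one canonical tag set, a single structure query runs across all publishers.

```python
# Hypothetical publisher-specific tag names mapped to one canonical tag set.
CANONICAL_TAGS = {
    "publisher_a": {"ti": "title", "au": "author", "abs": "abstract"},
    "publisher_b": {"atl": "title", "aut": "author", "summary": "abstract"},
}


def canonicalize(publisher, record):
    """Rename a record's publisher-specific fields to the canonical tags."""
    mapping = CANONICAL_TAGS[publisher]
    return {mapping[tag]: text for tag, text in record.items() if tag in mapping}


def structure_search(records, tag, term):
    """Return canonical records whose given structural field contains term."""
    term = term.lower()
    return [r for r in records if term in r.get(tag, "").lower()]


# Illustrative records only; real articles would be parsed from SGML.
raw = [
    ("publisher_a", {"ti": "Quantum wells", "au": "Smith", "abs": "Optical gain"}),
    ("publisher_b", {"atl": "Bridge loads", "aut": "Smith", "summary": "Steel fatigue"}),
]
federated = [canonicalize(pub, rec) for pub, rec in raw]

# One author query now spans both publishers.
hits = structure_search(federated, "author", "smith")
print([h["title"] for h in hits])
```

The design choice shown is to normalize at encoding time rather than at query time, so the search code never needs to know which publisher a record came from.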
Technology Transfer • Illinois DLI considered R&D arm of publishers • broad spectrum of major publishers in scientific literature • successful annual partners' workshop plus high-level visits • Technology transferred to Publisher partners • contract with AIP to clone testbed software & processing • arrangements with ASCE for a second cloning • Testbed Continuance by University Library • industrial partners program between Library & Publishers • company formed to provide software and service
Technology Research • Scalable Semantics becoming feasible • statistical clustering proves useful interactively • concept spaces and category maps • Semantic indexes for large collections • 400K Inspec (1995) • 4M Compendex (1996) • Simulation of Community Repositories • 1000 collections across all of engineering • testbed for vocabulary switching (federation)
Vocabulary Switching • Grand Challenge of Digital Libraries • semantic interoperability across subject domains • vocabulary switching to suggest across domains • Generating 1000 community repositories • 600 categories across engineering (38 top-level) • 150 categories across EE, CS, physics • 3M raw abstracts, about 10M in community spaces • large-scale supercomputer simulation • 7 days of dedicated computation (10 days overall) • have space navigation; need space intersection
Multimedia Federation • Semantic Indexing within Media • Text, Image, Number • Semantic Interoperability across Media • Spatial Data (GIS) dataset intersection • Multi-site DLI Collaboration • U Illinois: systems and supercomputers • U Arizona: algorithms and experiments • UC Santa Barbara: collections and metadata
Semantic Analysis of Multimedia • Collections of Objects containing Units • Text: community repository (topic proximity); document abstracts containing noun phrases • Image: aerial photograph (spatial proximity); feature regions containing texture tiles • Units are media-dependent (statistical parsers) • Text: phrase segmentation (nouns on word parts of speech) • Image: texture segmentation (orientation on pixel densities) • Indexes are media-independent (statistical clusters) • Concept: co-occurrence similarity of units within objects • Category: self-organizing maps of objects within collections
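The media-independent concept index described above (co-occurrence similarity of units within objects) can be sketched as follows. This is an illustration under stated assumptions, not the project's algorithm: the slide does not give the weighting formula, so plain Jaccard overlap of the object sets is used as a stand-in, and the function names and sample phrases are hypothetical. The same code applies whether the units are noun phrases in abstracts or texture tiles in image regions.

```python
from collections import defaultdict
from itertools import combinations


def concept_space(objects):
    """Relate units by how often they co-occur within the same objects.

    objects: list of sets of units (e.g. noun phrases per abstract).
    The score is the Jaccard overlap of the sets of objects in which
    each unit occurs (an assumed stand-in for the real weighting).
    """
    occurs = defaultdict(set)
    for obj_id, units in enumerate(objects):
        for unit in units:
            occurs[unit].add(obj_id)

    space = defaultdict(dict)
    for a, b in combinations(sorted(occurs), 2):
        shared = occurs[a] & occurs[b]
        if shared:
            score = len(shared) / len(occurs[a] | occurs[b])
            space[a][b] = score
            space[b][a] = score
    return space


def suggest(space, unit, k=3):
    """Return up to k units most similar to the given unit."""
    ranked = sorted(space.get(unit, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [u for u, _ in ranked[:k]]


# Illustrative "abstracts", each reduced to its set of noun phrases.
abstracts = [
    {"digital library", "information retrieval", "sgml"},
    {"digital library", "information retrieval"},
    {"sgml", "structure search"},
]
space = concept_space(abstracts)
print(suggest(space, "digital library"))
```

Because the index only sees sets of unit identifiers, swapping the phrase segmenter for a texture segmenter changes the input, not the clustering code.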
Media Interoperability Experiment • Feature regions containing texture tiles in aerial photos • 1M regions in 5K photos around southern California (GIS) • text concept space and category map in geoscience • 10M phrases in 500K abstracts from Georef and Petroleum Abstracts • image concept space and category map in aerial photos • tile similarity space and visual thesaurus maps (10M tiles) • numeric satellite sensor data • 1M NASA AVHRR temperature records, 2M GNIS feature names • spatial gazetteer as bridge image<=>text<=>number • images are labeled by GNIS gazetteer (feature names for text search)
Federated Search • Multiple Indexes in Distributed Repositories • text search: SGML for full-text articles (Testbed); bibliographic abstracts for full coverage (INSPEC) • term suggestion: thesaurus for taxonomy (INSPEC); concept spaces for term coverage (SGML) • Multiple View User Interface Client • uniform displays for multiple indexes • drag-and-drop between display views to mix-and-match • uniform search across multiple repositories • Multiple Protocol Stateful Gateway • single query stream analogous to single user interface • will handle distributed repositories for federation, e.g. AAS • OpenText (socket), term-suggest (SQL), Ovid/DRA (Z39.50)
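The gateway idea above, one query stream fanned out to repositories that speak different protocols, can be sketched roughly as below. The class names, method names, and sample documents are hypothetical; the adapters are in-memory stubs and do not implement socket, SQL, or Z39.50 communication. Only the shape of the design is shown: a uniform `search` call per backend and session state held in the gateway.

```python
class Backend:
    """Hypothetical adapter: one repository reached through one protocol."""

    def __init__(self, name, protocol, documents):
        self.name = name
        self.protocol = protocol
        self.documents = documents

    def search(self, term):
        # A real adapter would speak its protocol here; this stub scans memory.
        return [d for d in self.documents if term.lower() in d.lower()]


class Gateway:
    """Holds per-session state and sends one query to every backend."""

    def __init__(self, backends):
        self.backends = backends
        self.history = []  # stateful: queries issued during this session

    def search(self, term):
        self.history.append(term)
        results = []
        for backend in self.backends:
            for doc in backend.search(term):
                results.append(
                    {
                        "repository": backend.name,
                        "protocol": backend.protocol,
                        "document": doc,
                    }
                )
        return results


# Protocol labels follow the slide; the documents are made up.
gateway = Gateway(
    [
        Backend("full-text", "socket", ["Laser cooling of atoms", "Beam optics"]),
        Backend("term-suggest", "SQL", ["laser", "maser"]),
        Backend("abstracts", "Z39.50", ["Laser diode reliability"]),
    ]
)
results = gateway.search("laser")
print([r["repository"] for r in results])
```

Keeping the protocol differences inside the adapters is what lets the client present uniform displays over multiple indexes.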
Building a new Community: starting the field of Digital Libraries • IEEE Computer DLI special issue May 1996 • Computer DLI retrospective planned for 1999 • Allerton workshops on DL Sociology • edited book planned on DL Evaluation • DLI National Coordination effort • Illinois DLI retrospective conference (Mar 98)
The 21st Century: Analysis • Beyond Search to Analysis • Cross-Correlating Information from many sources across the Net • The Net solves problems • Every community has its own special library • Every community and every person does indexing !! • The Internet evolves into the Interspace