Emerging Information Technologies: The Role of XML, DOIs, OpenURL, and Federated Search. William H. Mischo w-mischo@uiuc.edu Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign 2002 International Conference on Digital Archive Technologies (ICDAT2002)
Emerging Information Technologies: The Role of XML, DOIs, OpenURL, and Federated Search William H. Mischo w-mischo@uiuc.edu Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign 2002 International Conference on Digital Archive Technologies (ICDAT2002) December 19, 2002
Outline • Digital Libraries and the Distributed Information Environment. • Document Representation and Full-Text • Digital Library Tools • Illinois Projects. • XML Technologies. • Metadata Technologies. • DOIs, Linking, Local Resolver • Portals, Simultaneous Search, Linking • Grainger Search Aid • Issues & Trends.
The Digital Library • ‘Digital’, ‘Virtual’, ‘Electronic’ Library as network-based library without regard to place and time. • Tendency to apply term to collections and resources. • Digital Collections vs. Digital Library. • Emphasis on the integration of collections and services (e.g. NSDL grant). • Application of standards and protocols is important.
Scholarly Communication Overview • E-Resources are Web-based and publisher-centric. • Growth of Heterogeneous Distributed Repositories. • Value-added services and ‘branding’ of journals. • Prestige of Journals and Publishers • Reciprocal linking relationships between publishers. • Cooperation on linking standards (DOI, CrossRef). • Alternative publishing models - Academia, Preprint Servers, disintermediation.
Distributed Information Environment • We live in a world of multiple, heterogeneous information repositories, resources, portals, and IR systems. • OPACs – local, regional, national shared bibliographic databases. • Local and remote A & I Services. • Discrete publisher and vendor repositories (full-text). • Web search engines, vertical portals, custom portals (NSDL, ARL Portal). • Local metadata, digital objects, GIS, finding aids. • Preprint servers and institutional repositories (DSpace). • Instructional (course) management systems (WebCT, Blackboard). • Harvestable (OAI) sites and services.
Distributed Repository - Issues • Integration of discrete, heterogeneous information resources. • Role of federated and broadcast searching of distributed resources. • Integration of collections with reference, instructional and navigation services -TOC, remote reference assistance. • Integration of Library, institutional, vendor, publisher, and government portals and information services. • Linking technologies. • Metadata harvesting, archiving.
Distributed Environment Action Plan • Pressing need for document representation, retrieval, transmission, and linking middleware tools and standards. • Metadata standards, DOIs, OpenURL. • Factor: changing landscape of Scholarly Communication and disintermediation of publishers and libraries. • Federated search and simultaneous search with reference linking as mechanism to integrate DL landscape.
[Diagram: Web client and portal architecture] Portal Functions: • Authorization. • Linking mechanisms between and among resources. • Simultaneous search. • Navigation. Linking: • Between full-text using DOI, CrossRef, Appropriate Copy. • Between A & I and full-text. • Between OPAC and full-text. Diagram labels: Web Client; Portal Presentation Level; Local Link Server, Local Value-Added; CrossRef Metadata; DOI Server; A & I Services (Local and Remote); Full-Text Resources; Local Databases and OAI Resources via DBMS; Web Resources & Knowledge Environments; OPAC; E-Resource Registry; Aggregator (Ebsco, OCLC); Publisher Portal (Elsevier).
Document Representation • Continuum of Web-Enabled technologies -- all presently being utilized. • Evolving technologies and standards. • Role and history of markup. • XML: its role and importance. • The Smart Document.
Digital Library Tools • We have at our disposal the tools to create integrated digital libraries from the distributed digital resources environment in which we operate: • Standard retrieval environment (Web) and interface/client (Web Browser); • Standard transport mechanisms to connect heterogeneous content (HTTP, OAI, SOAP); • Standard metalanguages and tools for describing and transforming content and metadata (XML, DTDs & Schemas, XSLT, DC/DCQ, RDF, METS); • Standardized search/retrieval mechanisms (HTTP Post/Get, SQL, Z39.50, Object Oriented Databases); • Standard linking tools and infrastructure (DOI, OpenURL, CrossRef). • Candidate set of ‘best practices’ for IR.
Work by Illinois DLI Group • We are attempting to address many of these issues within the Digital Library Initiatives group. • Headquartered at Grainger Engineering Library Information Center at UIUC. • Grant Work: • Digital Library Initiative I (NSF, others), 1994-1998. • Corporation for National Research Initiatives (CNRI) D-Lib Test Suite, 1998-2001. • Collaborating Partners Program, 1998--. • Andrew W. Mellon Foundation OAI Harvesting grant, 2001-2002. • NSF NSDL (National Science, Engineering, Technology, and Mathematics Digital Library) Program, 2002-2004. • Institute of Museum and Library Services (IMLS) Registry and Integration grant, 2002-2005.
Illinois Testbed Project • Funded under DLI-I by NSF, DARPA, and NASA, 1994--1998. Awards made to 6 universities. • Large-scale Testbed, Distributed Repository models, evaluation, Web software. • Funded under CNRI D-Lib Test Suite Program, 1998—2001. • Collaborating Partners Program. AIP, APS, ASCE, IEE, NRL, ASM, ACM, NTT Learning Systems, Elsevier. • All XML Journal -- AIP, APS, ACM.
Illinois Full-Text Testbed • American Institute of Physics--APL, JAP, RSI • 19,000+ articles, 1995--. • American Physical Society--PRL • 15,000+ articles, 1995--, weekly updates. • ASCE Journals (25 titles) • 11,000+ articles, 1995--. • IEE Proceedings and Electronics Letters • 9,500+ articles, 1993--. • IEEE Computer Society. • ASM International (formerly the American Society for Metals) Handbook. • ACM (Association for Computing Machinery) Transactions. • Elsevier Science.
Accomplishments • Process & retrieve from multiple publishers & heterogeneous DTDs. • SGML to XML Conversion. • Development of a metadata specification that uses RDF, Dublin Core (DCQ and XML), XML Schemas, and a local Namespace. • Cross-repository searching (Testbed & D-Lib Test Suite). Full-Text and Metadata. • XSLT and CSS for transformation & rendering, including Mathematics.
Accomplishments (2) • Introduction of numerous technologies now deployed within publisher repositories: • Forward and Backward links in bibliographies -- within Testbed/Repository, from/to A & I Services. • Use of XSLT for transforming XML to HTML. • Rich extended abstracts. • Conversion of ISO 12083 math markup to MathML. CSS/DHTML mathematics rendering. Use of plug-ins. • Enhanced Web retrieval mechanisms: Author Word Wheels, Co-Occurrence Matrices. • Local Link Server for DOIs, Context-Sensitive linking.
XML (eXtensible Markup Language) • Like SGML, a Data Description Metalanguage. • XML is a subset/version of SGML. • Document representation and interchange Standard. • Allows fine-granularity markup of content and structure. Authors can define their own elements (extensible). • Tags define the structure of the document, not its presentation format. • Validated vs. “well-formed” - separation of authoring process from representation & presentation. • A document is either validated against a DTD/Schema or simply well-formed. • Integrated with relational DBs.
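The distinction between well-formed and validated documents can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser, which checks well-formedness only; validation against a DTD or Schema would need a separate validating tool. The sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML.

    This checks syntax only (matched tags, a single root element);
    it does NOT validate against a DTD or Schema.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, not taken from the Testbed.
good = "<article><title>Quantum-well lasers</title></article>"
bad = "<article><title>Quantum-well lasers</article>"  # <title> never closed

print(is_well_formed(good), is_well_formed(bad))
```

A well-formed document may still be invalid for a given DTD, for example by using an element the DTD does not declare.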
XML Features • The milestones in document description and transmission: ASCII, TCP/IP, HTTP and HTML, XML. Web Programmability. • A DTD is not required with XML, but is needed if the document declares internal entities. • Use of Document Object Model (DOM). • Technology approach from Web developer’s standpoint: XML data, CSS presentation layer, XSLT to transform the structure (‘view’) of the data/document.
XML in Information Technologies • Used in Open Archives Initiative (OAI), NSDL. • Compatible with MS SQL Server, Tamino (Software AG), Oracle, DLXS/XPAT (University of Michigan/OpenText), others. • Integral to Web Services (WSDL) and SOAP – Google Web Service. • Used in Library of Congress MODS and METS metadata technologies. • Baked into XyVision and publishing packages.
XML, XSLT, and CSS • Use XML full-text articles as ordered hierarchy of content objects. • Generate item-level metadata in XML, using RDF and Dublin Core syntax and semantics. • XSLT and CSS used to present metadata and articles in either XML or HTML format depending on Browser. • Mathematics rendering using MathML tools (conversion from ISO 12083 to MathML). • Real-time transformation between XML and HTML using XSLT.
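As a rough illustration of item-level metadata expressed with RDF and Dublin Core, the sketch below builds one record with Python's standard library. The namespace URIs are the published RDF and Dublin Core element namespaces; the function name `dc_record`, the identifier, and the field values are hypothetical, and the Illinois metadata specification itself (with its qualifiers and local namespace) is not reproduced here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dc_record(about: str, fields: dict) -> str:
    """Serialize one rdf:Description carrying simple Dublin Core elements."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    for name, value in fields.items():
        # Each key becomes a dc:<name> child element.
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical article metadata for illustration only.
record = dc_record(
    "urn:example:article-1",
    {"title": "Sample article title", "creator": "A. Author", "date": "2002"},
)
print(record)
```

A record in this form can then be rendered for the browser by a stylesheet, which is the role XSLT and CSS play in the slide above.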
Schemas vs. DTDs • Both are systems of representing a data model that defines the data’s elements and attributes, and the relationships among elements. • Schema addresses limitations of DTDs and the increasingly data-oriented role of XML. • W3C XML Schema Working Group produced two documents: Structures and Datatypes.
Schema Justification • Description of a document type’s structure should be in an XML document instead of written in a special syntax (DTD). • Schemas are written in XML: easier to edit and process using standard XML DOM manipulation tools. • DTD notation doesn’t give schema designers the power to impose strong data typing -- for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices.
Metadata and Linking Standards • Digital Object Identifier (DOI) and Persistent Object Identifiers. • OpenURL and Value-Added Service Components (SFX). • Open Archives Initiative (OAI), Dublin Core and Qualifiers, RDF. • Local Resolver Servers.
Open Archives Initiative (OAI) • Released version 1.0 of metadata harvesting protocols. Frozen through second quarter 2001. • Mechanism for data providers to expose their metadata through an HTTP protocol and a mechanism for harvesting records containing metadata from repositories. • Roots in e-print archives. • Lightweight, low-barrier. Easy to implement Web server to handle OAI protocol requests; need to develop procedures to access and extract your metadata.
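The harvesting exchange described above can be sketched as an HTTP GET request plus a little XML parsing. This is a minimal sketch under stated assumptions: the base URL is hypothetical, `ListRecords` and the `oai_dc` metadata prefix are the protocol's standard verb and Dublin Core format, and the sample response is a heavily abbreviated stand-in, not a complete protocol response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the GET URL a harvester would send to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def harvested_titles(response_xml: str) -> list:
    """Pull every dc:title out of a harvested response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(f"{{{DC_NS}}}title")]


# Abbreviated, made-up response body; a real response carries protocol
# envelope elements, record headers, and datestamps that are omitted here.
SAMPLE_RESPONSE = """<ListRecords xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><metadata><dc:title>First harvested record</dc:title></metadata></record>
  <record><metadata><dc:title>Second harvested record</dc:title></metadata></record>
</ListRecords>"""

url = list_records_url("http://repository.example.org/oai")  # hypothetical host
titles = harvested_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

The provider side is the mirror image: a Web server script that answers such requests from the local metadata store.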
Ongoing Investigations • Relationship between interoperability models for search and discovery: federated searching (OAI harvested) and broadcast, simultaneous searching of distributed repositories. Not mutually exclusive. • OAI Provider and Harvesting software. Encoded Archival Description (EAD). OAI Engineering/CS/Physics site. • Role of HTTP harvesting, Spider technology. • Reference Linking integration built on OpenURL and DOI. • Reference Assistant software with simultaneous search, point-of-contact assistance, and remote reference capability.
Portals and Gateways • Role is to bring together and integrate disparate e-resources. • Provide a systematic ‘view’ of the information landscape, particularly full-text. • Two primary foci: robust search/navigation and the ability to link everywhere from anywhere in the environment of OPACs, A & I Services, full-text. • Central to this implementation is federated and simultaneous search and reference linking technologies.
Digital Object Identifier (DOI) • DOI is both a unique identifier of a piece of digital content AND a system to access that content digitally. Persistent object identifier. • ‘The ISBN for the 21st Century’ -- Norman Paskin. • DOI system has two main parts (the identifier and a directory system) and a third logical component, a database. • Developed by AAP (Association of American Publishers), now managed by International DOI Foundation.
DOI Construction • First real open standard for content identification. • DOI is a number that identifies a digital object: • 10.1063/S000369519903216 • 10 Registration Agency Prefix • 1063 Publisher Prefix • S000369519903216 Suffix (Publisher-assigned ID) • Suffix can be SICI or PII. • The DOI, together with the URL pointing to the digital object, is registered with the International DOI Foundation, e.g.: • 10.1063/333 | http://www.pubsite.org/apr99/artl1.pdf
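The construction above can be made concrete by splitting a DOI into the three parts the slide names. This is a minimal sketch: the `DOI` tuple and `parse_doi` function are hypothetical names, and the field names simply follow the slide's labels.

```python
from typing import NamedTuple


class DOI(NamedTuple):
    agency_prefix: str     # "10" in the slide's example
    publisher_prefix: str  # "1063" in the slide's example
    suffix: str            # publisher-assigned ID


def parse_doi(doi: str) -> DOI:
    """Split 'agency.publisher/suffix' into its three parts.

    Only the first '/' separates prefix from suffix, so a suffix
    that itself contains '/' is kept whole.
    """
    prefix, slash, suffix = doi.partition("/")
    agency, dot, publisher = prefix.partition(".")
    if not (slash and dot and agency and publisher and suffix):
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    return DOI(agency, publisher, suffix)


parsed = parse_doi("10.1063/S000369519903216")
print(parsed)
```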
Using a DOI • DOIs are resolved using the Handle System technology from CNRI (Corporation for National Research Initiatives). • Retrieval of an object is a two-step process: the link is sent to a central directory where the current Web address is stored, and the location is sent back to the browser with a special message to redirect to that address, e.g.: • dx.doi.org/10.1063/333 redirects to www.pubsite.org/apr99/artl1.pdf
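The two-step retrieval can be simulated in a few lines. This sketch stands in for the central directory with an in-memory dictionary holding the slide's example entry; the `resolve` function is hypothetical, and the status codes (302 for the redirect message, 404 for an unknown DOI) are assumptions chosen for illustration, not a description of the Handle System's actual behavior.

```python
# Stand-in for the central directory: DOI -> currently registered URL.
DIRECTORY = {"10.1063/333": "http://www.pubsite.org/apr99/artl1.pdf"}


def resolve(link: str, directory: dict):
    """Simulate DOI resolution.

    Step 1: extract the DOI from the link and look it up in the directory.
    Step 2: answer with a redirect to the address stored there.
    Returns (status, location).
    """
    _, found, doi = link.partition("dx.doi.org/")
    if not found or doi not in directory:
        return 404, None
    return 302, directory[doi]


status, location = resolve("dx.doi.org/10.1063/333", DIRECTORY)
print(status, location)
```

Persistence follows from the indirection: if the publisher moves the article, only the directory entry changes, and every published link of the form dx.doi.org/10.1063/333 keeps working.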
Reference Linking • CrossRef Publisher system: major Sci-Tech professional societies and commercial publishers. • System design calls for one URL for each DOI, although the underlying technology can handle multiple URLs. • Issue: Directing users to locally held or licensed version of Digital Object (locally loaded or from Aggregator). Appropriate Copy problem.
[Diagram: DOI resolution through the Illinois Local Link Server] Diagram labels: DOI Proxy; OpenURL; Client (Web Browser); AIP; Handle Server; dx.doi.org/10.1063/1234; IEE; Nosfx=y; Cookie on client; Aware; Elsevier; Local AIP, IEE; OpenURL; Local Value Added; Illinois Local Link Server; DOI; CrossRef Metadata Database; Metadata; UIUC Metadata Registry.
Simultaneous Search Implementations • DialIndex from Dialog. • Ex Libris MetaLib service. • Endeavor EnCompass. • Innovative Interfaces MetaFind. • Ovid Multiple Search and reference De-Duping. • ISI Web of Knowledge. • Gale Corporation InfoTrac Total Access. • WebFeat. • California Digital Library SearchLight system. • Los Alamos FlashPoint system. • Fretwell-Downing partnering with ARL Portal and Monash University.
Grainger Search Aid • Assist users in the selection of appropriate databases . • Normalize user search arguments and display search results from candidate databases. • Cross-database asynchronous concurrent searching. • Article level and e-journal Web site access to publisher full-text repositories. • Utilize OpenURL, CrossRef metadata database and DOI for reference linking at the article level. • Proxying of vendor systems and capability of ‘taking over’ the search in vendor native mode.
Reference Assistant Project • Utilize Search Aid simultaneous search and link capabilities. • Opportunity to explore interface and navigation issues. • Mimics the behavior of reference librarian. • Allows the application of ‘best match’ and ‘quorum searching’ algorithms.
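Quorum searching, mentioned above, can be sketched as follows: a document is kept if it contains at least a chosen number of the query terms, and hits are ranked by how many terms they match. This is a generic textbook-style sketch, not the Reference Assistant's own algorithm, which the slides do not describe; the function name and sample documents are hypothetical.

```python
def quorum_search(query: str, documents: dict, quorum: int) -> list:
    """Return ids of documents matching at least `quorum` query terms.

    Results are ordered by number of matched terms (best first),
    with ties broken by document id.
    """
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in documents.items():
        matched = terms & set(text.lower().split())
        if len(matched) >= quorum:
            hits.append((len(matched), doc_id))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


# Made-up documents for illustration.
docs = {
    "a": "temperature analysis of quantum well lasers",
    "b": "quantum computing overview",
    "c": "library portals and linking",
}
print(quorum_search("quantum well lasers", docs, quorum=2))
print(quorum_search("quantum well lasers", docs, quorum=1))
```

Lowering the quorum trades precision for recall, which is one way to mimic how a reference librarian relaxes a search that returns too little.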
Simultaneous Search Implementations • Shared Blackboard approach employing Independent Searchbots dedicated to searching information resources and passing results to Web clients. • Event-Driven, Asynchronous HTTP Queries from within a Single Script returning results to Web browser.
Event-Driven, Asynchronous Queries • Single, event-driven web server process, asynchronously querying multiple resources. • Uses WinHTTP from ASP and VBScript. • Simpler, not as flexible. Search algorithms and processing coded in scripts. • This is the approach we currently use for our service. • Implementation of multi-step login and session variable pass-through being investigated.
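The single-process, asynchronous pattern can be sketched as below. The service described above uses WinHTTP from ASP and VBScript; this sketch swaps in Python's `asyncio` to show the same idea, and replaces the real HTTP round trips with timed sleeps. The resource names and delays are invented.

```python
import asyncio


async def search_resource(name: str, query: str, delay: float):
    """Stand-in for one HTTP query; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return name, f"results for {query!r} from {name}"


async def simultaneous_search(query: str, resources: dict) -> list:
    """Query every resource concurrently from a single process.

    Results are collected in completion order, so a fast resource
    is not held up by a slow one.
    """
    tasks = [
        asyncio.create_task(search_resource(name, query, delay))
        for name, delay in resources.items()
    ]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


# Hypothetical resources with simulated response times in seconds.
RESOURCES = {"opac": 0.2, "a_and_i": 0.01, "full_text": 0.1}
results = asyncio.run(simultaneous_search("quantum-well lasers", RESOURCES))
for name, summary in results:
    print(name, "->", summary)
```

Total wall-clock time is close to the slowest single resource rather than the sum of all of them, which is the point of issuing the queries asynchronously.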
OpenURL-Based Services • Standard for expressing and transmitting metadata. • Promise of standardized, normalized search results. • Provides value-added links to the Ovid search results. • Using CrossRef metadata database to look up DOIs.
CiteParse.dll • An ActiveX DLL which can parse various Ovid citations and turn them into OpenURLs: • Tansu N. Chang YL. Takeuchi T. Bour DP. Corzine SW. Tan MRT. Mawst LJ. Temperature analysis … quantum-well lasers. [Article] IEEE Journal of Quantum Electronics. 38(6):640-651, 2002 Jun. • http://…/resolver.asp?genre=article&aulast=Tansu&auinit1=N&atitle=Temperature+analysis+…+quantum-well+lasers&title=IEEE+Journal+of+Quantum+Electronics&volume=38&issue=6&spage=640&epage=651&pages=640-651&date=2002-06
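The citation-to-OpenURL step can be sketched in Python. CiteParse.dll's actual parsing rules are not given in the slides, so the regular expressions below are assumptions that fit only the citation layout shown above; the function name is hypothetical, and the resolver address must be supplied by the caller because the original address is elided.

```python
import re
from urllib.parse import urlencode

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

# One or more "Lastname INITIALS. " groups at the start of the citation.
AUTHORS = re.compile(r"^(?:[^\s.]+ [A-Z]{1,4}\. )+")
# "Journal. volume(issue):spage-epage, year Mon."
SOURCE = re.compile(
    r"^(?P<title>.+?)\. (?P<volume>\d+)\((?P<issue>[^)]+)\):"
    r"(?P<spage>\d+)-(?P<epage>\d+), (?P<year>\d{4})"
    r"(?: (?P<month>[A-Z][a-z]{2}))?\.?$"
)


def citation_to_openurl(citation: str, resolver_url: str) -> str:
    """Turn one citation in the layout above into an OpenURL query string."""
    parts = re.split(r"\s*\[([^\]]+)\]\s*", citation.strip(), maxsplit=1)
    if len(parts) != 3:
        raise ValueError("no [Genre] tag found")
    head, genre, tail = parts
    authors, source = AUTHORS.match(head), SOURCE.match(tail)
    if not authors or not source:
        raise ValueError("unrecognized citation layout")
    aulast, initials = authors.group(0).split(". ")[0].split(" ")
    date = source["year"]
    if source["month"] in MONTHS:
        date += f"-{MONTHS[source['month']]:02d}"
    params = {
        "genre": genre.lower(),
        "aulast": aulast,
        "auinit1": initials[0],
        "atitle": head[authors.end():].rstrip("."),
        "title": source["title"],
        "volume": source["volume"],
        "issue": source["issue"],
        "spage": source["spage"],
        "epage": source["epage"],
        "pages": f"{source['spage']}-{source['epage']}",
        "date": date,
    }
    return f"{resolver_url}?{urlencode(params)}"


citation = (
    "Tansu N. Chang YL. Takeuchi T. Bour DP. Corzine SW. Tan MRT. Mawst LJ. "
    "Temperature analysis … quantum-well lasers. [Article] "
    "IEEE Journal of Quantum Electronics. 38(6):640-651, 2002 Jun."
)
# Hypothetical resolver address, used only for this illustration.
openurl = citation_to_openurl(citation, "http://resolver.example.org/resolver.asp")
print(openurl)
```

Note that `urlencode` percent-encodes the ellipsis character, so the generated string differs from the slide's example in that one respect.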
Conclusions • User reactions very positive. • The one-stop-shopping approach has been successful. • Users consider ability to link to full-text from citations in A & I Services and from references on publisher portals very helpful. • Technically, best approach appears to be a hybrid of asynchronous client interface with Web Services querying databases. Moves database middleware to Web Services and eliminates extensive custom script code for search and database query.
Publishing Trends • Publishers will continue to add value to online journal articles. • Digital version will become version of record. • Virtual journals (both publisher-based and cross-publisher) will become common. • Next-generation knowledge environments will evolve. Multimedia, data exposed, live equations with in-place calculations.
Publishing Trends (Continued) • Personalized services will be available -- agent technology, alerting services. • Different economic and subscription models will be introduced. • Deconstruction of Journal (Bob Kelly, APS); article at a time publishing. • Journal branding or perhaps publisher branding. • Academia issues: publishing, tenure.
Continuing Issues • Role of Authors, Academic Institutions, Libraries, Publishers, Abstracting & Indexing Services. • Disintermediation may affect both Libraries and Publishers. • Information as Function not Place. • Provide a ‘Digital Library’ out of digital collections. • Role of XML technology. • Service mechanisms: processing & archiving, search and discovery, presentation, linking.