SEARCH ABILITIES IN DIGITAL LIBRARIES WITH GENERIC DATABASES

Kunal Bansal kbansal@cs.odu.edu

Overview
• ‘Information search’ has grown into an industry of its own with the advent of the WWW.
• Search pioneers such as Google, MSN and Yahoo! Search pose serious challenges to both traditional and digital libraries.
• Cataloging information such as electronic journals, e-books and reference papers in legacy databases can become cumbersome as complexity and the number of titles increase.
Intelligent Internet Databases

Overview (Continued)
• Vast amounts of information of different content and type, such as preprint and e-print servers, digital repositories and media archives, are all integrated in databases.
• Estimates suggest there were ~1 billion ‘visible’ web pages and up to ~550 billion ‘deep’ pages in 200,000 sites as of 2001.
• Google’s search index was recorded at 4.2 billion ‘visible’ pages as of May 2004, compared to 3.3 billion in 2003.

Vision (Identifying Possibilities)
• Pages of scientific interest are mainly identified through their domains (~22 billion).
• Searches will not be influenced by the advertising industry, but will instead focus on content search related to quality.
• Future searches will be based on a major shared data source that allows individual customization for integration with a local host.
• The new service would have robustness and reliability comparable to Google, but with quality, ‘proven’ content provided by networked libraries.

Drawbacks in Current Systems (Challenges)
• The main focus is on metadata such as bibliographic content, references, keywords and abstracts, often restricted to HTML and TXT.
• Available data tends to be copyrighted, and free academic content is skipped in the process.
• Sequentially incoming responses are presented as a joint result, which increases dependence on the target (source) databases, decreases performance and limits scalability.

Drawbacks in Current Systems (Search Comfort and Page Rank)
• Traditional Boolean searching reduces ease of search for the user.
• Search engines that incorporate linguistic analysis and semantic dictionaries allow greater tolerance, but can still return irrelevant content due to factors such as page rank.
• Results are ranked on several unrelated factors, such as payments made to search index providers by the owners of those pages.

Preliminary Requirements (For Academic Content)
• Index resources that are factually and intellectually sound, thus enforcing a certain standard.
• Handle data heterogeneity through intelligent markup of certain searches for routing and search enhancement.
• Results should overcome page rank and be filterable using further parameters from the user.
• Metadata should be generated automatically, on the fly.
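
The requirement that results be filterable by further user parameters can be illustrated with a minimal sketch. Everything here is hypothetical: the `Record` fields, the `refine` function and the sample data are invented for illustration and do not come from any system named in the slides.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical result record; the fields are assumptions for this sketch."""
    title: str
    year: int
    doc_type: str
    peer_reviewed: bool

def refine(results, **criteria):
    """Keep only the records whose attributes equal every supplied criterion."""
    return [
        r for r in results
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]

results = [
    Record("Deep Web Indexing", 2004, "article", True),
    Record("Library Portals", 2003, "preprint", False),
    Record("Federated Search", 2004, "article", False),
]

# The user narrows an existing result set without issuing a new search.
narrowed = refine(results, year=2004, peer_reviewed=True)
```

Filtering on explicit attributes such as document type or review status is independent of any ranking score, which is the point of the requirement.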

Federated Academic Network (Indexi)
• Interoperability amongst heterogeneous digital libraries through approaches such as GlOSS and STARTS
• Searchable Database Markup Language (Search DB-ML), based on XML and independent of the DB (though not widely used)
• A central approach in which digital libraries are linked to a central service through an XML description of their capabilities

DL Definition Language (Digital Library Language)
• Extends search capabilities by describing the APIs of a large number of libraries.
• Improvements to the DLDL tags are driven by the following factors:
  • Information already included in the library
  • Access methods
  • Information to be retrieved

Mapping Queries (Federated Approach with XML)
• The XML specification contains the mapping information.
• Generic specifications, together with an included digital library’s behavior, are used to generate that digital library’s XML specification.
• The resulting user interface is simple enough to accommodate future developments and modifications.
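
The idea of keeping the mapping information in an XML specification can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DLDL or Search DB-ML format: the element and attribute names (`library`, `field`, `generic`, `native`), the base URL and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical specification for one digital library. The element and
# attribute names are invented for this sketch, not taken from DLDL.
SPEC = """
<library name="example-dl" base="http://dl.example.org/search">
  <field generic="title" native="ti"/>
  <field generic="author" native="au"/>
  <field generic="year" native="py"/>
</library>
"""

def load_mapping(spec_xml):
    """Return (base_url, {generic_field: native_field}) read from the spec."""
    root = ET.fromstring(spec_xml.strip())
    mapping = {f.get("generic"): f.get("native") for f in root.findall("field")}
    return root.get("base"), mapping

def map_query(spec_xml, generic_query):
    """Translate a generic query dict into a library-specific search URL.

    Fields the library does not declare are dropped rather than guessed.
    """
    base, mapping = load_mapping(spec_xml)
    native = {mapping[k]: v for k, v in generic_query.items() if k in mapping}
    return base + "?" + urlencode(native)

# "isbn" is not declared in the spec, so it is silently left out.
url = map_query(SPEC, {"title": "digital libraries", "author": "Smith", "isbn": "x"})
```

Because the mapping lives in data rather than in code, adding another library in this sketch would mean writing another specification, not another translator.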

Integration of Data Sources (Dataflow)
Screenshot 1: Database Flow

Search Engine Solutions (Advanced Search)
Screenshot 2: Advanced Search Capabilities

Search Engine Solutions (Results)
Screenshot 3: Advanced Search Results

Geo-coding and Geo-parsing (Mapping)
• A distinct form of search for data tied to geographical coordinates such as latitude and longitude
• Ingested documents in digital media libraries are processed
• Place references in a document are matched against a gazetteer, which ties them to existing latitudes and longitudes
• Currently marketed by MetaCarta, which makes its search technology available through an XML web service for deeper integration with existing applications
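
The gazetteer lookup described above can be sketched in a few lines. This is a toy illustration, not MetaCarta's method: the gazetteer entries, the approximate coordinates and the function names `geoparse` and `within_box` are assumptions made for the example, and real geo-parsing also has to disambiguate places that share a name.

```python
import re

# Tiny hypothetical gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "norfolk": (36.85, -76.29),
    "boston": (42.36, -71.06),
    "new york": (40.71, -74.01),
}

def geoparse(text, gazetteer=GAZETTEER):
    """Return (place, lat, lon) for each gazetteer name found in the text.

    Matching is case-insensitive and restricted to whole words.
    """
    lowered = text.lower()
    hits = []
    for name, (lat, lon) in gazetteer.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((name, lat, lon))
    return hits

def within_box(hits, south, west, north, east):
    """Keep only the hits whose coordinates fall inside a bounding box."""
    return [h for h in hits if south <= h[1] <= north and west <= h[2] <= east]

hits = geoparse("Samples were collected near Norfolk and shipped to Boston.")
nearby = within_box(hits, south=36.0, west=-77.0, north=37.0, east=-76.0)
```

Once documents carry coordinates, a search by latitude and longitude reduces to a bounding-box filter like `within_box`.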

Geo-coding and Geo-parsing (MetaCarta)
Screenshot 4: Search results from MetaCarta for USEPA

Worldwide Initiatives (Distributed Content Gateways)
• German ‘Vascoda’ portal: www.vascoda.de/ & www.vascoda.com
• Deutsche Forschungsgemeinschaft (DFG): www.dfg.de/ & www.dfg.de/en/
• Association of Research Libraries (ARL) ‘Scholars Portal’: www.arl.org/arl/pr/scholars_portal.html
• British Resource Discovery Network (RDN): www.rdn.ac.uk/partners/
• European RENARDUS: www.renardus.org
• North American SCOUT project: www.scout.wisc.edu

Future Developments (Additional Needs)
• Introduce template technology to add search boxes for user-supplied parameters.
• Develop the search interface based on the APIs used for the search.
• Automate and configure the gathering and pre-processing of items of interest.
• Ultimate goal: enable a user to search multiple independent, discretely mounted data sources or databases through one query (in the case of federated systems).
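
The ultimate goal, one query fanned out to several independent sources, can be sketched as below. This is a minimal illustration under stated assumptions: the two in-memory catalogs stand in for remote libraries, and `search_a`, `search_b` and `federated_search` are hypothetical names, not part of any system mentioned in the slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Two stand-in data sources. In a real system each function would call a
# remote library's search API; here they filter small in-memory lists.
CATALOG_A = ["Digital Libraries in Practice", "XML for Search"]
CATALOG_B = ["Federated Search Basics", "Digital Preservation"]

def search_a(term):
    return [("A", t) for t in CATALOG_A if term.lower() in t.lower()]

def search_b(term):
    return [("B", t) for t in CATALOG_B if term.lower() in t.lower()]

def federated_search(term, sources):
    """Send one query to every source concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # pool.map returns results in the order of `sources`.
        per_source = list(pool.map(lambda source: source(term), sources))
    merged = [hit for hits in per_source for hit in hits]
    # Sort by title so the merged list does not depend on source order.
    return sorted(merged, key=lambda hit: hit[1])

results = federated_search("digital", [search_a, search_b])
```

Querying the sources concurrently rather than one after another means the total wait is bounded by the slowest source, not by the sum of all of them.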

Conclusions
• Search abilities can take off only when content providers make a concerted effort to enhance information.
• Localized infrastructure for searches needs to be given priority in order to advance existing indexes.
• More investment is needed in technologies such as federated search, improvements in DB-XML, search APIs and SOAP.
• Ease of use (so-called search comfort) needs to be far superior.


Questions & Comments
Anybody?
