
Ian H. Witten

Ian H. Witten, New Zealand Digital Library Project, Computer Science Department, Waikato University, New Zealand. http://greenstone.org. Browsing around a digital library. Greenstone: Open source system for creating and delivering digital library collections.


Presentation Transcript


  1. Ian H. Witten, New Zealand Digital Library Project, Computer Science Department, Waikato University, New Zealand. http://greenstone.org. Browsing around a digital library. Greenstone: Open source system for creating and delivering digital library collections

  2. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • (Using the Collector) • The power of open source

  3. What we wanted Greenstone turns a ragtag menagerie of documents in various formats into an easy-to-use collection that can run on a standalone laptop in a Ugandan village’s information center ALA 2002

  4. What we wanted • “Collections” of digital material • Individualized, depending on metadata etc • Up to several Gb of text … • … + associated images, movies, whatever • Fully searchable • Served on WWW, or published on CD-ROM • Multi-platform (Unix + all Windows) • Multi-format documents • Multi-lingual: documents and interfaces • Multimedia • Metadata: standard and non-standard

  5. Collections: on the Web nzdl.org (demo, not service)

  6. Greenstone collections: on CD-ROM UN and NGOs, e.g. • UNESCO • Global Help Project • United Nations University • World Health Organization • Pan American Health Organization

  7. Kataayi Multipurpose Cooperative, Rural Uganda (20 km from Masaka)

  8. Humanity Development Library • Example for sustainable development and basic human needs • 160,000 pages • 30,000 images • 1230 books • 340 kg • US$20,000 • CD-ROM US$6 • Win3.1x(!)/95/98/NT • Stand-alone and intranet server • Web browser user interface • Global Help Project, Antwerp (+ UN agencies)

  9. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source

  10. Collection of pictures (pictures of text) Alexander Turnbull Library, NZ

  11. Voice (and pictures) Hamilton Public Library

  12. Music

  13. Chinese documents (pictures of text) + Chinese interface Peking University Library

  14. Chinese (Chinese & English interfaces) Classic Chinese literature

  15. Arabic (Arabic & English interfaces) Famous mosques

  16. French UNESCO, Paris

  17. Spanish PAHO, WHO

  18. Turkish

  19. Russian collection from Mari El Republic http://gov.mari.ru/gsdl

  20. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source

  21. Hierarchical document model • Metadata specified at any level • Title metadata
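The hierarchical model on this slide can be pictured as a tree of sections, each carrying its own metadata. The following is only a rough sketch, not Greenstone's internal representation: the `Section` class, its `lookup` method, and the sample titles are all invented for illustration, and the fallback to an ancestor's metadata is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    """One node of a hierarchical document (hypothetical structure)."""
    metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Section"] = field(default_factory=list)
    # Back-pointer to the enclosing section; excluded from repr/compare
    # so that printing or comparing nodes does not recurse through the tree.
    parent: Optional["Section"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Section") -> "Section":
        """Attach a subsection and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def lookup(self, key: str) -> Optional[str]:
        """Return metadata set on this node, else on the nearest ancestor.

        The ancestor fallback is an assumption made for this sketch.
        """
        node: Optional[Section] = self
        while node is not None:
            if key in node.metadata:
                return node.metadata[key]
            node = node.parent
        return None


# Invented sample data: a book with one chapter.
book = Section(metadata={"Title": "Sample book", "Subject": "Agriculture"})
chapter = book.add(Section(metadata={"Title": "Sample chapter"}, text="..."))
```

Here `chapter.lookup("Title")` returns the chapter's own title, while `chapter.lookup("Subject")` falls back to the value set on the book.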

  22. Searching and browsing • Searching • Metadata-based browsing Subject Title Publisher “HowTo” Dublin Core ad hoc

  23. Multiple search indexes text metadata

  24. Collection-dependent metadata

  25. Multilingual searching

  26. Browsing using classifiers AZList classifier (Title metadata)

  27. DateList classifier (Date metadata)

  28. Hierarchy classifier (Subject metadata)
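To make the idea of a metadata-driven classifier concrete, here is a small sketch of the kind of grouping an A-Z list performs. It is not the AZList implementation: the function name `az_list`, the "0-9" bucket for non-alphabetic values, and the sample records are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def az_list(docs: Iterable[Dict[str, str]], metadata: str = "Title") -> Dict[str, List[str]]:
    """Group documents into alphabetic buckets by the first letter of one metadata field."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        value = doc.get(metadata)
        if not value:
            continue  # documents without this metadata are left out of the browser
        first = value[0].upper()
        # Assumption: anything not starting with a letter goes into one shared bucket.
        key = first if first.isalpha() else "0-9"
        buckets[key].append(value)
    # Sort the buckets, and the entries inside each bucket, case-insensitively.
    return {key: sorted(values, key=str.lower) for key, values in sorted(buckets.items())}


# Invented sample records.
sample_docs = [
    {"Title": "Water supply"},
    {"Title": "animal traction"},
    {"Title": "Aquaculture"},
    {"Creator": "no title here"},
]
browser = az_list(sample_docs)
```

A date-based or subject-hierarchy browser could follow the same pattern with a different bucketing key, which is why a classifier only needs to be told which metadata field to use.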

  29. Metadata extraction plugins Acronym extraction plugin
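As an illustration of what acronym extraction involves, the sketch below looks for an upper-case acronym in parentheses and checks it against the initials of the words just before it. This is a deliberately simple heuristic chosen for the example, not the algorithm used by the Greenstone plugin; the function name `extract_acronyms` is hypothetical.

```python
import re
from typing import Dict


def extract_acronyms(text: str) -> Dict[str, str]:
    """Map each parenthesised acronym to the phrase whose initials spell it."""
    found: Dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        # Accept only if there are enough words and every initial matches.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found


sample_text = (
    "The World Health Organization (WHO) and the "
    "Pan American Health Organization (PAHO) publish guides."
)
acronyms = extract_acronyms(sample_text)
```

A heuristic this strict misses acronyms that skip small words such as "of" or "and", so a practical extractor would need looser matching.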

  30. Language identification plugin

  31. Email plugin

  32. Phrase hierarchy extraction + thesaurus browsing

  33. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source

  34. Collection configuration file
      creator sjboddie@cs.waikato.ac.nz
      maintainer sjboddie@cs.waikato.ac.nz
      public true
      beta true
      indexes section:text section:Title document:text
      defaultindex section:text
      plugin GAPlug
      plugin ArcPlug
      plugin RecPlug
      classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
      classify HDLList metadata=Title
      classify Hierarchy hfile=org.txt metadata=Organization sort=Title
      classify List metadata=Howto
      format SearchVList "<td valign=top>[link][icon][/link]</td> <td>{If}{[parent(All': '):Title],[parent(All': '):Title]: } [link][Title][/link]</td>"
      format CL4VList "<br>[link][Howto][/link]"
      format DocumentImages true
      format DocumentText "<h3>[Title]</h3>\\n\\n<p>[Text]"
      collectionmeta collectionname "greenstone demo"
      collectionmeta collectionextra "This is a demonstration collection for the Greenstone digital library software.\nIt contains a small subset (11 books) of the Humanity Development Library"
      collectionmeta iconcollectionsmall "/gsdl/collect/demo/images/demosm.gif"
      collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
      collectionmeta .section:Title "section titles"
      collectionmeta .document:text "entire books"
      collectionmeta .section:text "chapters"
      • name, icon, etc • description • email of creator • search indexes • plugins • classifiers
      how to format: • documents • query results • classifiers
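Because each line of the configuration file starts with a directive name followed by its arguments, the file is easy to inspect programmatically. The sketch below groups lines by directive; it is an illustrative reader, not Greenstone's own parser. It assumes one directive per line and treats lines starting with "#" as comments, and the names `parse_config` and `SAMPLE_CONFIG` are invented here. The sample reuses a few lines from the slide.

```python
from collections import defaultdict
from typing import Dict, List

# A few directive lines copied from the slide above.
SAMPLE_CONFIG = """\
creator sjboddie@cs.waikato.ac.nz
public true
indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify List metadata=Howto
"""


def parse_config(text: str) -> Dict[str, List[str]]:
    """Group configuration lines by their leading directive name."""
    config: Dict[str, List[str]] = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments carry no settings
        directive, _, rest = line.partition(" ")
        config[directive].append(rest.strip())  # keep arguments as written
    return dict(config)


config = parse_config(SAMPLE_CONFIG)
index_names = config["indexes"][0].split()
```

Keeping the arguments as an unparsed string avoids guessing at quoting rules for the `format` lines, whose values contain quotes, brackets, and backslashes.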

  35. Alter configuration
      • Add full-text index of titles (additional indexes line): indexes document:Title
      • ... or authors (… need author metadata): indexes document:Creator
      • Add alphabetic author browser (add classifier line): classify AZList -metadata Creator
      • Include Word documents (add plugin line): plugin WordPlug
      • Include PDF documents (same): plugin PDFPlug
      • Separate index for each language (add languages line): languages en fr es
      • Extract acronyms and add list (plugin option): plugin PDFPlug -extract_acronyms
      • Import OAI metadata (add plugin line): plugin OAIPlug
      • Extract phrase hierarchy and add browser (add classifier line): classify phind
      • Alter the format of any of the above (add format string): format …
      • Restrict collection’s interface langs (add format string): format PreferenceLangs en|fr|es
      • Change default interface language (edit site config file): cgiarg shortname=1 argdefault=fr
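Most of the changes on this slide amount to adding or extending a single line of the configuration file. As a rough sketch of that idea, the helpers below edit configuration text in memory. The function names `add_index` and `add_line` are hypothetical, and the sketch assumes a single `indexes` line with space-separated entries, as in the example file.

```python
def add_index(config_text: str, index: str) -> str:
    """Append an index name to the existing 'indexes' line, or create the line."""
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        parts = line.split()
        if parts and parts[0] == "indexes":
            if index not in parts[1:]:
                lines[i] = line.rstrip() + " " + index
            break
    else:
        # No 'indexes' line was found, so add one.
        lines.append("indexes " + index)
    return "\n".join(lines) + "\n"


def add_line(config_text: str, new_line: str) -> str:
    """Append a directive line (plugin, classify, ...) unless it is already present."""
    lines = config_text.splitlines()
    if new_line not in (line.strip() for line in lines):
        lines.append(new_line)
    return "\n".join(lines) + "\n"


original = "indexes section:text\nplugin GAPlug\n"
updated = add_index(original, "document:Title")
updated = add_line(updated, "plugin PDFPlug")
updated = add_line(updated, "classify AZList -metadata Creator")
```

Both helpers are idempotent, so applying the same change twice leaves the text unchanged.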

  36. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source

  37. The pen is mightier than the sword! Building and distributing collections carries responsibilities … legal … social … ethical … Be aware of the power of information and use it wisely Collector = software “wizard” for building new collections

  38. Status updated every 5 secs

  39. Agenda • Context • Documents and interfaces • Different document types • … and interface languages • Searching and browsing • Different search indexes • … and browsing functionality • Collection configuration • Using the Collector • The power of open source

  40. The power of open source: Greenstone uses …
      • Ghostscript: Interpreter for Adobe Postscript documents (Postscript plugin)
      • Kea: Keyphrase extraction program (to generate metadata)
      • pdftohtml: Converter for PDF documents (PDF plugin)
      • rtftohtml: Converter for RTF documents (RTF plugin)
      • TextCat: Detects languages and document encodings
      • wvWare: Converter for Word documents (Word plugin)
      • Xlhtml: Converter for Excel/Powerpoint documents (plugins)
      • XML::Parser: Parses XML documents, used to read and write Greenstone’s internal XML document format

  41. and …
      • MG: Creates compressed full-text indexes and performs searches
      • GDBM: Database used for metadata etc
      • wget: Downloading pages from the Web when creating collections
      • YAZ: Client and server implementation of Z39.50
      • Stemmer: English language stemmer
      • GCC: C/C++ compiler
      • CVS: Version control system
      • Perl: Used for plugins etc
      • Apache: Web server used by many Greenstone installations

  42. Greenstone DL software
      Access
      • Accessible via any Web browser
      • Server runs on Windows and Unix
      • Collections can be published on CD-ROM
      Searching/browsing
      • Full-text and fielded search
      • Flexible browsing facilities
      • Metadata-based (Dublin Core)
      • Collection-specific
      • Hierarchical phrase browsing supported
      • Creates all access structures automatically
      Extensible
      • Plugins: new document, metadata formats
      • Classifiers: new metadata browsers
      Multilingual
      • Documents and interfaces
      • Chinese, Arabic, Maori, Russian etc (+ European)
      • Multimedia: video, audio collections exist
      Distributed
      • CORBA protocol allows remote access
      • Z39.50 server/client for backwards compatibility
      What you see, you can get!
      • Open-source software: free, extensible
