Learn about the LOCKSS project, a cooperative, decentralized archiving system that ensures long-term access to e-journal content. Discover why LOCKSS is needed and how it works. Presented at the Open Access Conference, Pretoria, July 2004 by Wouter Klapwijk from the University of Stellenbosch.
The LOCKSS Project: an overview • Open Access Conference, Pretoria, July 2004 • Wouter Klapwijk, Univ. of Stellenbosch
Overview • What is LOCKSS ? • Why use LOCKSS ? • How does LOCKSS work ? • Stellenbosch-SA LOCKSS project
Background • LOCKSS program initiated by Stanford University Libraries. • Software under development since 1999. • Development funded by Mellon Foundation and NSF. • First beta version released in 2001 - Stellenbosch beta testing site. • Production version released in April 2004 as Open Source (http://www.sourceforge.com)
Technological view: LOCKSS defined as a persistent cache • LOCKSS creates low-cost persistent digital “caches” of e-journal content at institutions that (a) subscribe to that content and (b) actively choose to preserve it. • Enables institutions to locally collect, store, preserve and archive the (authorized) content. • Unlike normal caches, pages in a LOCKSS cache are never flushed. • The LOCKSS system loads itself with newly published content before the first local user seeks it.
Accessing cached content • Preserved content remains accessible at the original publisher’s URL. • Links, bookmarks and searches through I+A databases resolve either to the publisher’s site or to the locally cached content. • Techniques used to access content at the publisher also work to find the preserved content.
So, let’s clarify what it’s all about by asking ourselves: “WHY USE LOCKSS?”
Paper Library System • For centuries libraries and publishers had stable roles: publishers produced information and libraries kept it safe for reader access. • Librarians’ defence against irreplaceable loss has always rested on redundancy. • “One library burns but only one of many copies of a work is destroyed” • A cooperative, affordable, decentralized ‘archive system’ with LOTS OF COPIES
Going electronic from a library perspective • Libraries are continuing with paper subscriptions due to the absence of sustainable digital archiving solutions. • As a condition to moving towards electronic content, publishers must guarantee long-term access, but only some large publishers can. • Library acquisition funds are insufficient to purchase both formats. • Libraries only pay for access, not ownership.
Going electronic from a publisher perspective • Publishers do not currently guarantee perpetual access to their materials. • Publishers are reluctant to put their publishing platforms at risk. • Publishers might regard archiving as a responsibility of the librarian.
Still going electronic (from an accessibility perspective) • A unilateral change of policy by the publisher may cause access to a title to cease completely. • Failure to renew a subscription can remove a library’s electronic access to past material with no recourse. • Governmental policies? • Internet unavailable.
LOCKSS • The LOCKSS model capitalizes on the traditional roles of libraries and publishers. • Libraries should retain the custodial role of preserving scholarly information. • Publishers participate by permitting libraries to collect material as published for preservation (a so-called “Publisher manifest”) • Effected by utilizing LOCKSS as a persistent access preservation system. • A cooperative, affordable, decentralized ‘archive system’ with LOTS OF COPIES
“Publisher manifest” • A Publisher manifest is a web page that lists a title’s top level URLs / volume and grants LOCKSS permission to collect and preserve the content. • Each volume of a title needs a publisher manifest. • Publishers permit libraries to use material preserved in caches consistent with original license terms. • Caches provide content only to the original authorized and authenticated subscriber base. • For paid e-journals, a library must participate at point of subscription or renewal to benefit from the system.
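As a rough illustration of the manifest idea above, the sketch below reads a hypothetical manifest page, checks that it carries a permission statement, and lists the volume's starting URLs. The example HTML, the exact permission wording, the journal.example.org addresses and the names `ManifestParser` and `read_manifest` are all assumptions made for illustration; they are not taken from the LOCKSS software.

```python
from html.parser import HTMLParser

# Assumed wording; a real manifest's permission statement may differ.
PERMISSION_PHRASE = "LOCKSS system has permission to collect, preserve, and serve"


class ManifestParser(HTMLParser):
    """Collects link targets and visible text from a manifest page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def read_manifest(html):
    """Return (permission_granted, start_urls) for one volume's manifest."""
    parser = ManifestParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the page do not hide the phrase.
    text = " ".join(" ".join(parser.text_parts).split())
    return PERMISSION_PHRASE in text, parser.links


# Hypothetical manifest for one volume of an example journal.
EXAMPLE_MANIFEST = """
<html><body>
<h1>Example Journal, Volume 12</h1>
<a href="http://journal.example.org/v12/issue1/">Issue 1</a>
<a href="http://journal.example.org/v12/issue2/">Issue 2</a>
<p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit</p>
</body></html>
"""

if __name__ == "__main__":
    granted, urls = read_manifest(EXAMPLE_MANIFEST)
    print(granted, urls)
```

In this sketch a cache would only start collecting from the listed URLs when the first returned value is true, which mirrors the slide's point that collection depends on the publisher's explicit permission.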
What to Collect and Preserve? • E-Journals • Titles you’ve paid for and are leasing • Freely available titles • Other genres • Newspapers, Gov Docs • http delivered - serial - stable URLs – authoritative version
6 free-access publishers • Absinthe Literary Review • Be (Berkeley Electronic) Press • Cultural Logic • Early Modern Literary Studies • Open Journal System • Other Voices
2 subscription-based publishers • Project MUSE • HighWire Press
LOCKSS Caches • Collect HTTP delivered content • “Crawls” publisher sites in the same way a search engine does. • All formats (PDF, HTML, JPEG, TIF, Audio, Video) • Preserve content integrity • Independent collection • Cooperate to audit and repair damage by means of polling and voting (“reputation based system”) • Provide access (i.e. serve content) • Via web browser • Utilizing EZproxy configurations or a PAC file.
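The "audit and repair by polling and voting" step can be pictured with a much-simplified majority vote over content hashes. This is only a sketch under stated assumptions: the slides do not describe the actual polling protocol or its reputation mechanism, and the function names, the use of SHA-256 and the strict-majority rule are choices made here for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Hash one cache's copy of a piece of content."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(caches: dict) -> list:
    """Compare copies held by several caches and repair the dissenters.

    `caches` maps a cache name to the bytes it holds for one URL.
    Copies that disagree with a strict majority are overwritten in place.
    Returns the names of the caches that were repaired.
    """
    votes = Counter(digest(copy) for copy in caches.values())
    winning_hash, count = votes.most_common(1)[0]
    if count <= len(caches) / 2:
        return []  # no strict majority: this sketch repairs nothing

    good_copy = next(c for c in caches.values() if digest(c) == winning_hash)
    repaired = []
    for name, copy in list(caches.items()):
        if digest(copy) != winning_hash:
            caches[name] = good_copy  # fetch a repair from an agreeing peer
            repaired.append(name)
    return repaired


if __name__ == "__main__":
    # Three hypothetical library caches; one copy has been damaged.
    copies = {"lib_a": b"article text", "lib_b": b"article text", "lib_c": b"artXcle text"}
    print(audit_and_repair(copies))
```

The point of the design is the one made under "Paper Library System": damage to one copy is survivable because the other independently collected copies outvote it.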
Approximate Data Flows • [Diagram: LOCKSS machines (proxy servers)] • Prevent the publisher from revoking access rights to back content
Look and Feel to Readers • Configure LOCKSS as a web proxy • Example: • PNAS Online table of contents page • from web (9/11/02) • from LOCKSS cache
Distributed Repository Model Technology • Uses many “unreliable repositories” (PCs) • Robustness through redundancy • Inexpensive consumer hardware • Low sys admin overhead (less than 1 hour/month) • Leverages web technology • HTTP delivered and displayed content, all formats • No need to replicate publisher’s system • No single point of failure
Collection Access: LOCKSS and Local Networks (publisher is available) • [Diagram: PAC file or proxy, publisher (PUB), LOCKSS]
Collection Access: LOCKSS and Local Networks (publisher is unavailable) • [Diagram: PAC file or proxy, publisher (PUB), LOCKSS]
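The two access cases above (publisher available, publisher unavailable) can be sketched as a small fallback routine. This is a hedged illustration, not the LOCKSS proxy's real logic: the function `serve`, the injected `fetch_from_publisher` callable and the dictionary standing in for the cache are all hypothetical, and the sketch uses no real network calls.

```python
def serve(url, fetch_from_publisher, cache):
    """Answer a reader's request for `url`.

    Tries the publisher first; if the publisher cannot be reached,
    falls back to the locally preserved copy when one exists.
    Returns a (source, content) pair.
    """
    try:
        return "publisher", fetch_from_publisher(url)
    except OSError:
        # Network-level failures (ConnectionError, timeouts) are OSError subclasses.
        if url in cache:
            return "lockss", cache[url]
        raise  # nothing preserved locally, so the failure reaches the reader


if __name__ == "__main__":
    url = "http://journal.example.org/v1/article1"  # hypothetical address
    preserved = {url: b"preserved copy"}

    def publisher_up(u):
        return b"live copy"

    def publisher_down(u):
        raise ConnectionError("publisher unreachable")

    print(serve(url, publisher_up, preserved))
    print(serve(url, publisher_down, preserved))
```

Because the reader asks for the same publisher URL in both cases, bookmarks and links keep working whichever source ends up answering.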
Storage disc space? Terabytes of E-Journals • Median e-journal size is less than 0.5 GB/year • 1 Terabyte (1000 GB) = 2000 journal years
Year | J-yr storage | TB/PC | J-yrs/PC
2004 | $0.35 | 1.44 | 2,880
2005 | $0.28 | 2.88 | 5,760
2006 | $0.14 | 5.76 | 11,520
2007 | $0.07 | 11.52 | 23,000
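The journal-years-per-PC column follows directly from the two figures on this slide (0.5 GB per journal year, 1000 GB per terabyte). A minimal sketch of that arithmetic, with constant and function names chosen here for illustration:

```python
GB_PER_TB = 1000           # the slide's definition of a terabyte
GB_PER_JOURNAL_YEAR = 0.5  # the slide's upper bound on median e-journal size


def journal_years_per_pc(tb_per_pc):
    """Journal years that fit on one PC with the given disk capacity."""
    return round(tb_per_pc * GB_PER_TB / GB_PER_JOURNAL_YEAR)


if __name__ == "__main__":
    for year, tb in [(2004, 1.44), (2005, 2.88), (2006, 5.76), (2007, 11.52)]:
        print(year, journal_years_per_pc(tb))
```

For 2007 the calculation gives 23,040, which the slide rounds to 23,000. Since 0.5 GB is an upper bound on the median size, these capacities are conservative estimates.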
South African Goal • LOCKSS Project runs from 1 August 2004 - 31 December 2005 with OSI Foundation funding. • PHASE 1 intended to collaboratively move towards setting up caches and developing Plug-ins (i.e. maintain momentum). • PHASE 2 focused on bandwidth savings related to the initial crawl for titles (i.e. “designated cache”). • Focus on previously disadvantaged institutions.
Thank you • Long Lived: slow, determined, indestructible