UC3 Curation Micro-Services Simplified Repository Ingest. UC Curation Center California Digital Library May 20, 2010. Agenda. Introduction Welcome and review of objectives UC3 and digital curation Landscape, assumptions, and imperatives Curation micro-services The Merritt project
UC3 Curation Micro-Services: Simplified Repository Ingest UC Curation Center California Digital Library May 20, 2010
Agenda Introduction • Welcome and review of objectives • UC3 and digital curation • Landscape, assumptions, and imperatives Curation micro-services • The Merritt project • Design goals • The future of the DPR Simplified repository ingest • Concepts • Implementation • Demonstration Discussion
Objectives By the end of this discussion we hope that you will understand • Digital curation and the UC3 mission • The emergent, micro-services approach to curation infrastructure • The Merritt curation environment and the future of the DPR • The Merritt Ingest service and its interactions with the Identity, Storage, and Inventory services • How to incorporate the Ingest service into your workflows
University of California Curation Center (UC3) We’ve changed our name, but not our commitment • Ensuring that the information resources supporting, and resulting from, the University’s research, teaching, and learning mission remain authentic, available, and usable UC3 is a Center of Excellence • A creative partnership bringing together the expertise and resources of the CDL, the ten UC campuses, and the broader international curation community
Digital curation The set of policies and practices focused on managing and adding value to a body of trusted digital content • Preservation ensures access over time • Access depends upon preservation up to a point in time It can also be seen as facilitating the alignment of the scholarly and information lifecycles
Landscape Ever increasing number, size, and diversity of content • More stuff, fewer resources Ever increasing diversity of partners, stakeholders, and expectations • Producers / consumers → prosumers / conducers Inevitability of disruptive change • Technology • User expectation • Institutional mission and resources Problem or opportunity? [Chart; axis labels: Work, $, Time]
Assumptions Curated content gains • Safety through redundancy “Lots of copies keeps stuff safe” • Meaning through context “Lots of description keeps stuff meaningful” • Utility through service “Lots of services keeps stuff useful” • Value through use “Lots of uses keeps stuff valuable” Curation is an outcome, not a place • Decentralized curation can be as effective as centralized Curation stewardship is a relay
Imperatives Provide innovative, effective, and efficient services Plan for change • Focus on content, not the systems in which that content is managed • Systems come and go (but not our system ;-) • Occam’s Razor and Murphy’s Law suggest • Favor the small and simple over the large and complex • Favor the minimally sufficient over the feature laden • Favor the configurable over the prescribed • Favor the proven over the (merely) novel Enable curation at the point of use Do more with less
Curation micro-services Devolve curation function into a granular set of independent, but interoperable micro-services • Since each is small and self-contained, they are collectively easier to develop, maintain, and enhance • Since the level of investment in, and therefore commitment to, any given service is small, they are easier to replace when they have outlived their usefulness • The scope of each service is limited, but complex behavior emerges from the strategic composition of individual atomistic services
Merritt curation micro-services Ingest Inventory Storage Identity
What is the future of the DPR? The DPR will continue to be operated as a core UC3 service • However, the components of the underlying system will be gradually replaced with their new Merritt-based equivalents • All content currently managed in the DPR will be automatically migrated to the new environment Micro-services also can be used to deploy locally-hosted repositories to meet specialized local needs
What is the future of the DPR? Continuing stewardship commitment by UC3 regarding managed content • Safety, persistence, efficiency, economy Streamlined workflows for submission, access, and collection management • Easy in, easy out Accept any content Great flexibility in deploying customized repository solutions
Design goals Policy neutral, protocol and platform independent • We know we can’t foresee all of the contexts in which these services can be usefully deployed Principle of least surprise • Extensive options, but meaningful default behavior Linked data • All entities exist within a web of semantic relations http://linkeddata.org/ The file system is the database • All content and metadata are expressed in the file system • Some subset of this information may be replicated in databases as an optimization for fast query
Design goals Code to interfaces • Underlying implementations should and will evolve over time without invalidating the public interface “contract” Exploit agile methods • Early prototyping, frequent refactoring • Stakeholder engagement The appropriate benchmark for submission user experience is Flickr
Storage concepts Node • A sub-domain of the Storage service established to meet specific policy, administrative, or technical needs Object • Encapsulation in digital form of an abstract intellectual or aesthetic work Version • A set of files representing a discrete state of the object • Any change to object state constitutes a new version File • A formatted bit stream
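The object, version, and file concepts above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class and method names (`StoredObject`, `add_version`, `get_version`) are hypothetical and are not taken from the Merritt specifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoredObject:
    """Hypothetical in-memory stand-in for an object held in a storage node."""

    object_id: str
    # Each version maps a file name to its content (a formatted bit stream).
    versions: List[Dict[str, bytes]] = field(default_factory=list)

    def add_version(self, files: Dict[str, bytes]) -> int:
        """Record a new discrete state of the object; return its version number."""
        # Copy the mapping so later changes by the caller cannot alter a stored version.
        self.versions.append(dict(files))
        return len(self.versions)  # versions are numbered from 1

    def get_version(self, number: int) -> Dict[str, bytes]:
        """Return the files that make up the given version."""
        return self.versions[number - 1]


obj = StoredObject("1234")
v1 = obj.add_version({"report.txt": b"draft"})
# Any change to the object's state is recorded as a new version.
v2 = obj.add_version({"report.txt": b"final", "data.csv": b"a,b\n1,2\n"})
```

Earlier versions stay retrievable after later ones are added, which is the property the "any change constitutes a new version" rule is meant to guarantee.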
Storage concepts Stable reference • All objects (and their versions, and their files) managed in the Storage service have stable URLs that can be used to retrieve entities or metadata about entities, subject to appropriate access control http://example-store.edu/content/abc/1234 http://example-store.edu/content/abc/1234/3 http://example-store.edu/state/abc/1234/3/xyz • URL components, in order: Storage service / request type / storage node / object / version / file
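A minimal sketch of how such stable URLs might be composed, assuming the path segments follow the order suggested by the slide's example URLs (request type, node, object, version, file). The function name `entity_url` and its parameters are hypothetical.

```python
from typing import Optional
from urllib.parse import quote


def entity_url(base: str, request: str, node: str, obj: str,
               version: Optional[int] = None,
               file: Optional[str] = None) -> str:
    """Compose a URL for an object, one of its versions, or one of its files."""
    if file is not None and version is None:
        # A file only exists within a specific version of an object.
        raise ValueError("a file URL needs a version")
    parts = [request, node, obj]
    if version is not None:
        parts.append(str(version))
    if file is not None:
        parts.append(file)
    # Percent-encode each segment so unusual identifiers stay valid in a URL.
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


object_url = entity_url("http://example-store.edu", "content", "abc", "1234")
version_url = entity_url("http://example-store.edu", "content", "abc", "1234", 3)
file_state_url = entity_url("http://example-store.edu", "state", "abc", "1234", 3, "xyz")
```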
Ingest concepts Queue • Asynchronous processing of submitted material Batch • A set of digital objects submitted together • The unit of notification and reporting Job • The processing of a single digital object Handler • A specific processing stage
Ingest concepts Profile • A user-specific set of processing choices • Negotiated as part of the submission agreement Notification • At the time of ingest submission and completion • Our stewardship obligation begins at the time of ingest completion Submit by-value (a file) or by-reference (a URL)
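The batch, job, handler, and profile concepts can be sketched as follows. This is an assumption-laden illustration, not the Ingest service's design: the two handlers, the `Profile` fields, and the report format are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """The processing of a single digital object."""

    name: str
    payload: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


# A handler is one processing stage applied to a job.
Handler = Callable[[Job], None]


def checksum_handler(job: Job) -> None:
    """Hypothetical stage: record a fixity value for the payload."""
    job.metadata["sha256"] = hashlib.sha256(job.payload).hexdigest()


def size_handler(job: Job) -> None:
    """Hypothetical stage: record the payload size in bytes."""
    job.metadata["size"] = str(len(job.payload))


@dataclass
class Profile:
    """A submitter-specific set of processing choices."""

    owner: str
    handlers: List[Handler]


def ingest_batch(batch: List[Job], profile: Profile) -> str:
    """Run every job through the profile's handlers; report once per batch."""
    for job in batch:
        for handler in profile.handlers:
            handler(job)
    # The batch, not the individual job, is the unit of notification.
    return f"{profile.owner}: {len(batch)} object(s) ingested"


batch = [Job("a.txt", b"hello"), Job("b.txt", b"")]
profile = Profile(owner="example-library", handlers=[checksum_handler, size_handler])
report = ingest_batch(batch, profile)
```

Keeping handlers as plain callables means a profile can reorder, add, or drop stages without changing the loop that drives them.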
Ingest process flow [Diagram] Participants: Submitting library, Identity, Ingest node, Storage node, Inventory node • Messages: Submit, Create identifier, Identifier, Add version, Get version metadata, Version metadata, Notification
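One possible reading of this flow, sketched with in-memory stand-ins for the Identity, Storage, and Inventory services. The ordering of calls is an interpretation of the diagram labels, simplified so that Ingest passes the version metadata on to Inventory directly; every class and method name here is hypothetical.

```python
import itertools
from typing import Dict, List


class Identity:
    """Stand-in for the Identity service: mints identifiers."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def create_identifier(self) -> str:
        return f"obj-{next(self._counter)}"


class Storage:
    """Stand-in for a storage node: keeps versions, reports version metadata."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Dict[str, bytes]]] = {}

    def add_version(self, object_id: str, files: Dict[str, bytes]) -> dict:
        versions = self._versions.setdefault(object_id, [])
        versions.append(dict(files))
        return self.get_version_metadata(object_id, len(versions))

    def get_version_metadata(self, object_id: str, version: int) -> dict:
        files = self._versions[object_id][version - 1]
        return {"object": object_id, "version": version, "files": sorted(files)}


class Inventory:
    """Stand-in for an inventory node: records version metadata."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def add_version(self, metadata: dict) -> None:
        self.records.append(metadata)


def ingest(files: Dict[str, bytes], identity: Identity,
           storage: Storage, inventory: Inventory) -> dict:
    """Compose the three services into one ingest operation."""
    object_id = identity.create_identifier()          # Create identifier
    metadata = storage.add_version(object_id, files)  # Add version
    inventory.add_version(metadata)                   # Version metadata to Inventory
    return metadata                                   # basis for the notification


identity, storage, inventory = Identity(), Storage(), Inventory()
first = ingest({"a.txt": b"x"}, identity, storage, inventory)
second = ingest({"b.txt": b"y", "c.txt": b"z"}, identity, storage, inventory)
```

Each stand-in knows nothing about the others; only `ingest` composes them, which mirrors the micro-services idea that complex behavior comes from composition.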
Ingest implementation [Diagram] Components: Submitter, Queue, Consumer, Ingester, Storage • Component annotations, in the order extracted: servlet (implicitly multi-threaded); ZooKeeper dæmon; dæmon (explicitly multi-threaded); servlet (implicitly multi-threaded) • Data flows: HTML form, batch or single object, job metadata, job payload, submission notification, ingest notification • External party: Submitting library
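The queue-and-consumer arrangement can be illustrated with Python's standard library. This sketch substitutes an in-process `queue.Queue` for the ZooKeeper-backed queue named on the slide, and the function name `run_consumers` and the worker count are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_consumers(jobs: Iterable[str], process: Callable[[str], str],
                  workers: int = 3) -> List[str]:
    """Drain a job queue with several explicitly created consumer threads."""
    pending: "queue.Queue[str]" = queue.Queue()
    for job in jobs:
        pending.put(job)  # the submitter side: enqueue and return immediately

    results: List[str] = []
    lock = threading.Lock()

    def consume() -> None:
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # all jobs were queued up front, so empty means done
            outcome = process(job)
            with lock:  # protect the shared result list
                results.append(outcome)

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


processed = run_consumers(["a", "b", "c", "d"], str.upper)
```

Completion order is not deterministic across threads, so callers should not rely on the order of `processed`.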
Demonstration A few caveats… • Still a work in progress! • The final interface style sheets are not yet applied • Inventory and authentication/authorization services still under development • Full error reporting is not complete
Early community reaction Collaborative development and integration projects with UC3 partners Independent implementation of key Merritt specifications Presentation/BOF at Open Repositories 2010 Digital curation group and Barcamp http://groups.google.com/group/digital-curation http://groups.google.com/group/digital-curation/web/curation-technology-sig
Discussion Will existing workflows continue to work? • Yes, we have a crosswalk from the existing METS-based feeder submission What are the minimal requirements for an acceptable digital object? • A per-object METS file is no longer required • The DPR will accept any content in any form • However, the long-term curation service level may vary depending on the object’s formal characteristics, the presence (or absence) of accompanying metadata, the general state of curation understanding, and the availability of appropriate tools
Discussion How do I include metadata in my submission? • The Ingest submission form provides an opportunity to specify descriptive Dublin Kernel metadata • Administrative metadata is implied by the user’s profile • Name, affiliation, contact information, collection, … • Technical (and, potentially, descriptive) metadata is automatically extracted by the characterization handler • Additional metadata can be expressed in recognized schemas and stored in files with well-known names mrt-dublin-core.txt mrt-mods.xml mrt-creative-commons.rdf …
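A minimal sketch of how a submission directory might be split into recognized metadata files and ordinary content, using only the three file names listed on the slide. The function name `split_submission` and the flat directory layout are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Well-known metadata file names listed on the slide; the real set may be longer.
WELL_KNOWN = {"mrt-dublin-core.txt", "mrt-mods.xml", "mrt-creative-commons.rdf"}


def split_submission(directory: str) -> Tuple[List[str], List[str]]:
    """Return (metadata file names, content file names) found in a directory."""
    metadata: List[str] = []
    content: List[str] = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # this sketch ignores sub-directories
        (metadata if path.name in WELL_KNOWN else content).append(path.name)
    return metadata, content


# Usage: build a throwaway submission and classify its files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mrt-mods.xml", "thesis.pdf", "data.csv"):
        (Path(tmp) / name).write_text("placeholder")
    metadata_files, content_files = split_submission(tmp)
```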
Discussion Isn’t an enterprise storage solution or RDBMS (e.g. Oracle) better than just relying on the file system? • No, we believe that there are a number of important advantages to directly exploiting the file system • No vendor lock-in; proprietary systems are difficult to debug • Modern file systems have excellent scaling characteristics • The ability to re-instantiate the system by walking the file system is significant
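Re-instantiating an index by walking the file system can be sketched as below. The `node/object/version/file` directory layout is an assumption made for illustration, not the layout Merritt uses, and `rebuild_inventory` is a hypothetical name.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def rebuild_inventory(root: str) -> Dict[Tuple[str, str, int], List[str]]:
    """Rebuild a (node, object, version) -> file names index from disk alone."""
    inventory: Dict[Tuple[str, str, int], List[str]] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 3:
            continue  # only node/object/version directories hold files here
        node, obj, version = parts
        if not version.isdigit():
            continue  # skip directories that do not look like version numbers
        inventory[(node, obj, int(version))] = sorted(filenames)
    return inventory


# Usage: lay out two versions of one object, then recover the index.
with tempfile.TemporaryDirectory() as root:
    for version, names in ((1, ["a.txt"]), (2, ["a.txt", "b.txt"])):
        version_dir = os.path.join(root, "abc", "1234", str(version))
        os.makedirs(version_dir)
        for name in names:
            with open(os.path.join(version_dir, name), "w") as handle:
                handle.write("x")
    inventory = rebuild_inventory(root)
```

Because the index is derived entirely from what is on disk, a lost or corrupted database copy can be regenerated rather than restored.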
Discussion Why is there a separate Ingest service? Why can’t I just submit directly to the Storage service? • Merritt embraces the “separation of concerns” principle http://en.wikipedia.org/wiki/Separation_of_concerns • The Storage service only “knows” about storage and has strict requirements for the allowable form of submissions • The Ingest service was explicitly designed for user-facing operation and imposes minimal constraints on submission forms
Discussion (questions for you) What constitutes a “collection”? • Does it have hierarchically-arranged sub-components? What tools do you need to manage your collections effectively? How do you expect to retrieve content from the repository? • Following a saved link? • Search query? If so, what would be the query terms?
Discussion (questions for you) What level of access control is necessary? • Bright vs. dark policy • Embargo periods • Redaction Who are the subject populations? • UC affiliates • Non-UC How fine-grained must this control be? • Collection or object • Campus, research group, user
Discussion (questions for you) Are there other repository tools or protocols that we should investigate? Please respond to the DPR survey at http://vovici.com/wsb.dll/s/aaeg44ec2
For more information UC Curation Center http://www.cdlib.org/services/uc3 Curation micro-services https://confluence.ucop.edu/display/Curation DPR survey http://vovici.com/wsb.dll/s/aaeg44ec2 Digital curation group and Barcamp http://groups.google.com/group/digital-curation http://groups.google.com/group/digital-curation/web/curation-technology-sig UC3 Stephen Abrams Erik Hetzner Margaret Low Mark Reyes Perry Willet Patricia Cruse Greg Janée John Kunze Tracy Seneca Scott Fisher David Loy Isaac Rabinovitch Marisa Strong