
Digital Preservation


Presentation Transcript


  1. Digital Preservation David Giaretta (APA) First PRELIDA Workshop, Tirrenia, June 25th–27th, 2013

  2. Outline • Fundamental demands • Fundamental concepts • Trust • OAIS and Linked Data

  3. Fundamental demands

  4. Preservation and value • Who pays? • Why? • What to preserve? • What value?

  5. Examples • Books • Web • Science data • What are the differences?

  6. Value Riding the Wave

  7. Vision 2030 (2) Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted. • Create a robust, reliable, flexible, green, evolvable data framework with appropriate governance and long-term funding schemes to key services such as Persistent Identification and registries of metadata. • Propose a directive demanding that data descriptions and provenance are associated with public (and other) data. • Create a directive to set up a unified authentication and authorisation system. • Set Grand Challenges to aggregate domains. • Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition. • IMPACT IF ACHIEVED • Dramatic progress in the efficiency of the scientific process, and rapid advances in our understanding of our complex world, enabling the best brains to thrive wherever they are.

  8. Vision 2030 (3) Producers of data benefit from opening it to broad access and prefer to deposit their data with confidence in reliable repositories. A framework of repositories works to international standards, to ensure they are trustworthy. • Propose reliable metrics to assess the quality and impact of datasets. All agencies should recognise high quality data publication in career advancement. • Create instruments so long-term (rolling) EU and national funding is available for the maintenance and curation of significant datasets. • Help create and support international audit and certification processes. • Link funding of repositories at EU and national level to their evaluation. • Create the discipline of data scientist, to ensure curation and quality in all aspects of the system. • IMPACT IF ACHIEVED • Data-rich society with information that can be used for new and unexpected purposes. • Trustworthy information is usable now and for future generations.

  9. Vision 2030 (4) Public funding rises, because funding bodies have confidence that their investments in research are paying back extra dividends to society, through increased use and re-use of publicly generated data. • EU and national agencies mandate that data management plans be created. • IMPACT IF ACHIEVED • Funders have a strategic view of the value of data produced.

  10. Vision 2030 (6) The public has access and can make creative use of the huge amount of data available; it can also contribute to the data store and enrich it. All can be adequately educated and prepared to benefit from this abundance of information. • Create non-specialist as well as specialist data access, visualisation, mining and research environments. • Create annotation services to collect views and derived results. • Create data recommender systems. • Embed data science in all training and academic qualifications. • Integrate into gaming and social networks • IMPACT IF ACHIEVED • Citizens get a better awareness of and confidence in sciences, and can play an active role in evidence based decision making and can question statements made in the media.

  11. Vision 2030 (7) Policy makers can make decisions based on solid evidence, and can monitor the impacts of these decisions. Government becomes more trustworthy. • IMPACT IF ACHIEVED • Policy decisions are evidence-based to bridge the gap between society and decision-making, and increase public confidence in political decisions.

  12. Fundamental concepts OAIS

  13. Digital Preservation… • Easy to do… • …as long as you can provide money forever • Easy to test claims about repositories… • …as long as you live a long time

  14. Preservation techniques For each technique • look for evidence – what evidence? • must at least make sure we consider different types of data • rendered vs non-rendered • composite vs simple • dynamic vs static • active vs passive • must look at all types of threats

  15. Threats • Things change… • Hardware • Software • Environment • Tacit knowledge • Things become unfamiliar

  16. Problems when preserving data • Preserve? • Preserve what? • For how long? • How to test? • Which people? • Which organisations? • How well? • Metadata? – What kind? How much?

  17. Requirements – rising tide of data… “A fundamental characteristic of our age is the rising tide of data – global, diverse, valuable and complex. In the realm of science, this is both an opportunity and a challenge.” Report of the High-Level Group on Scientific Data, October 2010, “Riding the Wave: how Europe can gain from the rising tide of scientific data” Who pays? Why?

  18. Rising tide of data…

  19. Opportunities

  20. Data contains numbers etc – need meaning

  21. ...to be combined and processed to get this [Diagram: data being processed and combined, moving from Level 0 through Level 1 to Level 2 products]

  22. Preserving digitally encoded information • Ensure that digitally encoded information is understandable and usable over the long term • Long term could start at just a few years • Chain of preservation • Need to do something because things become “unfamiliar” over time • But the same techniques enable use of data which is “unfamiliar” right now

  23. Lots of useful terminology [Diagram: OAIS Functional Model – the PRODUCER submits SIPs to Ingest; Archival Storage holds AIPs; Data Management holds Descriptive Information and answers queries; Access serves query responses, orders and DIPs to the CONSUMER; Administration, Preservation Planning and MANAGEMENT oversee the whole]
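
As a rough illustration of the package flow in the diagram above, the sketch below models a toy archive in Python: a SIP is ingested into an AIP held in Archival Storage, and a Consumer query returns DIPs. The class and method names are illustrative only; OAIS (ISO 14721) defines concepts, not an API.

```python
# Minimal sketch of the OAIS package flow (SIP -> AIP -> DIP); names are illustrative.
from dataclasses import dataclass, field

@dataclass
class InformationPackage:
    content: bytes                             # the digital object being preserved
    representation_info: list[str]             # pointers to Representation Information
    descriptive_info: dict = field(default_factory=dict)

class Archive:
    """Toy repository holding AIPs keyed by identifier."""
    def __init__(self):
        self.storage = {}       # stands in for Archival Storage
        self.catalogue = {}     # stands in for Data Management (Descriptive Information)

    def ingest(self, identifier: str, sip: InformationPackage) -> None:
        # Ingest: accept a SIP, generate an AIP and its Descriptive Information
        aip = InformationPackage(sip.content,
                                 list(sip.representation_info),
                                 dict(sip.descriptive_info))
        self.storage[identifier] = aip
        self.catalogue[identifier] = aip.descriptive_info

    def access(self, query: str) -> list[InformationPackage]:
        # Access: answer a Consumer query and return DIPs derived from matching AIPs
        hits = [i for i, d in self.catalogue.items() if query.lower() in str(d).lower()]
        return [self.storage[i] for i in hits]
```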

  24. Key OAIS Concepts • Claiming “This is being preserved” is untestable • Essentially meaningless • Except “BIT PRESERVATION” • How can we make it testable? • Claim to be able to continue to “do something” with it • Understand/use • Need Representation Information • Still meaningless… • Things are too interrelated • Representation Information potentially unlimited • Need to define a Designated Community – those we guarantee can understand – so we can test

  25. OAIS Information model: Representation Information • The Information Model is key • Recursion ends at the KNOWLEDGEBASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region) • Does not demand that ALL Representation Information be collected at once • A process which can be tested
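
The recursion described above can be made concrete: keep collecting Representation Information until every dependency is already inside the Designated Community's knowledge base. The sketch below is a minimal Python illustration; the data structures and the tiny example network (borrowed from the GOCE slide that follows) are hypothetical.

```python
# Minimal sketch: gather the Representation Information needed for a given
# Designated Community; the recursion stops at the community's knowledge base.
def repinfo_closure(data_object: str,
                    repinfo_network: dict[str, list[str]],
                    dc_knowledge: set[str]) -> set[str]:
    """Return the RepInfo that must be archived for this Designated Community."""
    needed, to_visit = set(), [data_object]
    while to_visit:
        item = to_visit.pop()
        for dependency in repinfo_network.get(item, []):
            if dependency in dc_knowledge or dependency in needed:
                continue                 # recursion ends at the DC knowledge base
            needed.add(dependency)
            to_visit.append(dependency)
    return needed

# A community that already understands PDF and XML needs less RepInfo archived.
network = {"GOCE N1 file": ["GOCE N1 file description", "GOCE N1 file Dictionary"],
           "GOCE N1 file description": ["PDF standard"],
           "GOCE N1 file Dictionary": ["XML"]}
print(repinfo_closure("GOCE N1 file", network, dc_knowledge={"PDF standard", "XML"}))
```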

  26. Representation Network [Diagram: Representation Network for GOCE data – GOCE Level 0 and GOCE Level 1 (N1 file format) depend on the GOCE Level 0 Processor Algorithm and on the GOCE N1 file standard, the GOCE N1 file description and the GOCE N1 file Dictionary, which in turn rely on the PDF standard, PDF software, the Dictionary specification and XML]

  27. [Diagram: OAIS Archival Information Package – the AIP is delimited by Packaging Information and described by a Package Description derived from it; it contains Content Information further described by Preservation Description Information (Reference, Provenance, Context, Fixity and Access Rights Information); the Content Information is a Data Object (Physical or Digital, ultimately Bits) interpreted using Representation Information, which adds meaning to it and is itself made up of Structure Information, Semantic Information and Other Representation Information]
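
To make the package structure easier to follow, here is a minimal data-structure sketch of the model in the diagram; the field names follow the OAIS terms, but the classes themselves are hypothetical, not a standard serialisation.

```python
# Minimal sketch of the OAIS AIP information model; classes are illustrative only.
from dataclasses import dataclass

@dataclass
class ContentInformation:
    data_object: bytes            # the Digital Object (or a reference to a Physical Object)
    structure_repinfo: str        # e.g. "GOCE N1 file standard"
    semantic_repinfo: str         # e.g. the data dictionary giving the meaning of fields

@dataclass
class PreservationDescriptionInformation:
    reference: str                # persistent identifier (Reference Information)
    provenance: list[str]         # processing and custody history
    context: str                  # relationship to other holdings
    fixity: str                   # e.g. SHA-256 digest of the Data Object
    access_rights: str            # licence / access policy

@dataclass
class ArchivalInformationPackage:
    content_information: ContentInformation
    pdi: PreservationDescriptionInformation
    packaging_information: str    # how the pieces are bound together, e.g. a tar layout
    package_description: dict     # Descriptive Information used for discovery
```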

  28. [Diagram: Representation Information has Provenance, and Provenance in turn has Representation Information]

  29. When things change • We need to: • Know something has changed • Identify the implications of that change • Decide on the best course of action for preservation • What RepInfo we need to fill the gaps • Created by someone else, or create a new one • If transformed: how to maintain data authenticity • Alternatively: hand it over to another repository • Make sure data continues to be usable

  30. Transformation • Change the format, e.g. • Word → PDF/A • PDF/A does not support macros • GIF → JPEG2000 • Resolution / colour depth… • Excel table → FITS file • NB FITS does not support formulae • Old EO or proprietary format → HDF • Certainly need to change STRUCTURE RepInfo • May need to change SEMANTIC RepInfo • Transformational Information Properties
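
Transformational Information Properties are the properties that must survive such a migration. A minimal sketch of how a repository might check them is given below; the property extractors are hypothetical stubs, not real format parsers.

```python
# Minimal sketch: check which declared Transformational Information Properties
# survive a format migration (e.g. GIF -> JPEG 2000). Extractors are stubs.
from typing import Callable

def check_transformation(original: str, transformed: str,
                         properties: dict[str, Callable[[str], object]]) -> dict[str, bool]:
    """For each named property, report whether original and transformed files agree."""
    return {name: extract(original) == extract(transformed)
            for name, extract in properties.items()}

# Made-up extractors: pixel dimensions are preserved, colour depth is not.
properties = {
    "pixel_dimensions": lambda path: (640, 480),                          # stub
    "colour_depth":     lambda path: 8 if path.endswith(".gif") else 24,  # stub
}
print(check_transformation("map.gif", "map.jp2", properties))
# -> {'pixel_dimensions': True, 'colour_depth': False}
```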

  31. Hand-over • Preservation requires funding • Funding for a dataset (or a repository) may stop • Need to be ready to hand over everything needed for preservation • OAIS (ISO 14721) defines the “Archival Information Package” (AIP) • Issues: • Storage naming conventions • Representation Information • Provenance • Identifiers • ….

  32. • RepInfo toolkit, Packager and Registry – to create and store Representation Information; the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate • Registry and Orchestration Manager – to exchange information about the obsolescence of hardware and software, amongst other changes; the Representation Information will include such things as software source code and emulators • Authenticity toolkit – to capture evidence from many sources which may be used to judge Authenticity • Packaging toolkit – to package access rights policy into the AIP • Persistent Identifier system – allows objects to be located over time • Orchestration Manager – amongst other things, allows the exchange of information about datasets which need to be passed from one curator to another • Certification toolkit – to help the repository manager capture evidence for ISO 16363 Audit and Certification
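
As one concrete example of what a Persistent Identifier system has to guarantee, the sketch below checks whether an identifier still dereferences to a live object. It assumes a DOI resolved through the public https://doi.org/ HTTP proxy; any Handle-style resolver with an HTTP interface could be substituted.

```python
# Minimal sketch of a persistent-identifier health check via an HTTP resolver.
import urllib.request

def pid_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Follow the resolver's redirects and report whether a live page is reached."""
    url = "https://doi.org/" + doi
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

print(pid_resolves("10.1000/182"))   # example DOI; replace with your own identifiers
```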

  33. Infrastructure support • SCIDIP-ES • Converting CASPAR prototypes into robust services

  34. Trust

  35. Vision 2030 (2) Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted. • Create a robust, reliable, flexible, green, evolvable data framework with appropriate governance and long-term funding schemes to key services such as Persistent Identification and registries of metadata. • Propose a directive demanding that data descriptions and provenance are associated with public (and other) data. • Create a directive to set up a unified authentication and authorisation system. • Set Grand Challenges to aggregate domains. • Provide “forums” to define strategies at disciplinary and cross-disciplinary levels for metadata definition. • IMPACT IF ACHIEVED • Dramatic progress in the efficiency of the scientific process, and rapid advances in our understanding of our complex world, enabling the best brains to thrive wherever they are.

  36. Trust

  37. Reality check From Riding the Wave

  38. Trust issues • Has it been preserved properly? • Is it of high quality? • Has it been changed in some way? • Does the pointer get me to the right object?

  39. Has it been preserved properly? • Can the repository be trusted? • Certification of various kinds • ISO 16363 certification should be available soon • Judged on the basis of evidence collected and examined

  40. Is it of good quality? More than one in ten scientists and doctors claim to have witnessed colleagues deliberately fabricating data in order to get their research published, a new poll has revealed. The survey of almost 2,800 experts in Britain also found six per cent knew of possible research misconduct at their own institution that has not been properly investigated. The poll was conducted for the British Medical Journal (BMJ). http://www.dailymail.co.uk/sciencetech/article-2085814/Scientists-falsify-data-research-published-whistleblowers-bullied-keeping-quiet-claim-colleagues.html

  41. Dirk Smeesters had spent several years of his career as a social psychologist at Erasmus University in Rotterdam studying how consumers behaved in different situations. Did colour have an effect on what they bought? How did death-related stories in the media affect how people picked products? And was it better to use supermodels in cosmetics adverts than average-looking women? The questions are certainly intriguing, but unfortunately for anyone wanting truthful answers, some of Smeesters' work turned out to be fraudulent. The psychologist, who admitted "massaging" the data in some of his papers, resigned from his position in June after being investigated by his university, which had been tipped off by Uri Simonsohn from the University of Pennsylvania in Philadelphia. Simonsohn carried out an independent analysis of the data and was suspicious of how perfect many of Smeesters' results seemed when, statistically speaking, there should have been more variation in his measurements. The Dutch psychologist Diederik Stapel was found to have fabricated data for years and published it in at least 30 peer-reviewed papers, including a report in the journal Science about how untidy environments may encourage discrimination. http://www.guardian.co.uk/science/2012/sep/13/scientific-research-fraud-bad-practice

  42. Peer review of data • ….is difficult

  43. Lessons from APARSEN • Data Quality • Cost Models for preservation • Preservation tools • Preservation services

  44. Has it been changed in some way? • OAIS defines Authenticity as: The degree to which a person (or system) regards an object as what it is purported to be. Authenticity is judged on the basis of evidence. • Need to capture evidence – what evidence?
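
One common form of such evidence is a fixity record. The sketch below captures a checksum plus a timestamp for a local file (the file name is just an example); re-running it later and comparing digests provides evidence about whether the object has been changed.

```python
# Minimal sketch of capturing fixity evidence for an object in the archive.
import datetime
import hashlib
import json

def fixity_record(path: str, algorithm: str = "sha256") -> dict:
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):   # stream the file in chunks
            digest.update(chunk)
    return {"object": path,
            "algorithm": algorithm,
            "digest": digest.hexdigest(),
            "captured": datetime.datetime.now(datetime.timezone.utc).isoformat()}

# A later run with the same algorithm should give the same digest;
# a mismatch is evidence that the object has been changed in some way.
print(json.dumps(fixity_record("dataset.fits"), indent=2))
```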

  45. Authenticity evidence • Authenticity Model • Provenance capture • How to deal with combinations of data • How to deal with changes • Security and tampering with logs

  46. OAIS and Linked Data

  47. Linked Open Data: Issues • Links – just another dataset? • Or do we have to view as part of a huge “cloud” • is that cloud just another dataset? • Is it just like archiving snapshots of the Web? • Snapshots? But at different times across the cloud • HTTP URIs – how persistent? • HTTP – how persistent? • RDF – how persistent? • What do the links mean?
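
A minimal sketch of what “snapshotting” a small piece of the Linked Data cloud might involve is shown below: fetch the RDF behind one HTTP URI, store a dated serialisation, and record which of its outgoing HTTP URIs still dereference at that moment. It assumes the rdflib library is available; the scope, timing and consistency questions raised above are exactly what make real crawls hard.

```python
# Minimal sketch: dated snapshot of one Linked Data resource plus a link-liveness record.
import datetime
import urllib.request
import rdflib   # third-party library, assumed to be installed

def snapshot(resource_uri: str, out_path: str) -> dict:
    graph = rdflib.Graph()
    graph.parse(resource_uri)                              # fetch and parse the RDF
    graph.serialize(destination=out_path, format="nt")     # N-Triples snapshot on disk

    # Record which linked HTTP URIs still resolve at snapshot time.
    uris = {str(o) for o in graph.objects() if str(o).startswith("http")}
    live = {}
    for uri in sorted(uris)[:20]:                          # small sample, to stay polite
        try:
            with urllib.request.urlopen(uri, timeout=10) as r:
                live[uri] = r.status == 200
        except Exception:
            live[uri] = False
    return {"snapshot_of": resource_uri,
            "taken": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "triples": len(graph),
            "links_checked": live}
```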

  48. OAIS-related issues • Designated community • Representation Information • Provenance • Rights • Authenticity • Trustability • Is it easier to “poison” the system?

  49. OAIS / Linked Data questions • Can OAIS concepts be applied to the preservation of Linked Data? • Do existing concepts apply? • Are new concepts needed? • What new terminology is needed?

  50. END QUESTIONS?
