
The Conversion Software Registry

Michal Ondrejcek, Kenton McHenry, Rob Kooper, Luigi Marini, and Peter Bajcsy


Presentation Transcript


  1. The Conversion Software Registry
     Michal Ondrejcek, Kenton McHenry, Rob Kooper, Luigi Marini, and Peter Bajcsy

  2. Overview
     With an increasing number of file formats in use each year, preservation of electronic records has become one of the major challenges for the National Archives and Records Administration (NARA).
     Source: The Strategic Plan of the National Archives and Records Administration (NARA) 2006-2016, Preserving the Past to Protect the Future, 2006, URL http://www.archives.gov/about/plans-reports/strategic-plan/
     • Why?
       • Will there be software to load the file in the future?
       • If not, will the specification for the format still exist?
       • What would be the best file format conversion in terms of information preservation?
       • In the case of closed/proprietary formats, was the specification ever available to begin with?
     This research is partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.
     2010 MS eScience

  3. Conversions
     • Convert files to an open, standardized format and store the result with the original
       • How, and to which format?
       • Conversions often lose some information; which format would lose the least?
       • If we had a “universal” converter, we could test conversions and compare the files before and after to estimate information loss
     • How do we convert?
       • MANY file formats!
       • MANY closed/proprietary formats!
       • MANY with large/complex specifications

  4. Available 3D File Formats
     • Many applications create, view, and save 3D content
     • MANY of them introduce a new file format for that content!

  5. Most applications support a handful of imports and exports
     • They perform differently depending not only on the algorithm used but also on the purpose of the format's domain
       • emphasis on texture in 3D
       • morphology vs. color histogram in 2D

  6. NCSA file conversion technologies

  7. Input/Output Graphs
     • Software import and export options are visualized as a graph.
     • The I/O Graph search chooses the shortest conversion path, using the minimum number of applications.

  8. Input/Output Graphs
     Shortest conversion path
     [Figure: I/O graph. Software shown: 3DS Max, Adobe 3D Reviewer, AutoCAD, Blender, Cinema 4D, K-3D, LightWave 3D, Maya, Wings 3D]
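
The path search on an I/O graph can be sketched as a breadth-first search. This is a minimal illustration, not the authors' implementation: formats are nodes, each (software, input, output) triple is a directed edge, and the application names `AppA`, `AppB`, `AppC` and their capabilities are hypothetical.

```python
from collections import deque

# Hypothetical (software, input format, output format) triples; these are
# placeholders and do not describe any real application.
CONVERSIONS = [
    ("AppA", "dae", "obj"),
    ("AppA", "obj", "stl"),
    ("AppB", "dae", "ply"),
    ("AppB", "ply", "stl"),
    ("AppC", "stl", "wrl"),
    ("AppC", "dae", "stl"),
]


def build_io_graph(conversions):
    """Map each input format to its outgoing (output format, software) edges."""
    graph = {}
    for software, src, dst in conversions:
        graph.setdefault(src, []).append((dst, software))
    return graph


def shortest_conversion_path(graph, source, target):
    """Breadth-first search for the path with the fewest conversion steps.

    Returns a list of (software, input, output) steps, an empty list when
    source equals target, or None when the target cannot be reached.
    """
    if source == target:
        return []
    parent = {source: None}  # format -> (previous format, software)
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        for nxt, software in graph.get(fmt, []):
            if nxt in parent:
                continue  # already reached by a path that is at least as short
            parent[nxt] = (fmt, software)
            if nxt == target:
                path = []
                while parent[nxt] is not None:
                    prev, sw = parent[nxt]
                    path.append((sw, prev, nxt))
                    nxt = prev
                return path[::-1]
            queue.append(nxt)
    return None


graph = build_io_graph(CONVERSIONS)
print(shortest_conversion_path(graph, "dae", "wrl"))
# [('AppC', 'dae', 'stl'), ('AppC', 'stl', 'wrl')]
```

Breadth-first search minimizes the number of conversion steps. Minimizing the number of distinct applications along the path would need an extra tie-breaking rule that this sketch does not include.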

  9. Software Reuse Layer
     Making Use of 3rd Party Software: we define this as the wrapping of 3rd party software, utilizing whatever interfaces the software vendors have made available, in order to re-introduce an API-like interface to embedded functionality.
     Exists for the sole purpose of providing an API to functionality in 3rd party software
     • Controls software via wrapper scripts
       • AutoHotkey, AppleScript, various shell scripts
       • Vision-based scripts
     • Hides away details of using 3rd party software
     • Attempts to recover from errors, can throw exceptions
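
The idea of wrapping a script behind an API that retries on failure and raises an exception can be sketched as follows. This is an assumption-laden illustration, not the project's code: `run_wrapper` and `ConversionError` are hypothetical names, and the Python interpreter stands in for a real AutoHotkey, AppleScript, or shell wrapper script.

```python
import subprocess
import sys


class ConversionError(Exception):
    """Raised when the wrapped software still fails after all retries."""


def run_wrapper(command, retries=2, timeout=60):
    """Run a wrapper-script command, retrying when it fails or hangs."""
    last_error = None
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            last_error = f"attempt {attempt}: timed out after {timeout}s"
            continue
        if result.returncode == 0:
            return result.stdout
        last_error = f"attempt {attempt}: exit code {result.returncode}"
    raise ConversionError(f"{command[0]} failed: {last_error}")


# Stand-in for a real wrapper script: the interpreter printing a message.
print(run_wrapper([sys.executable, "-c", "print('converted')"]).strip())
```

The caller sees only a function call and an exception type, which is the sense in which the details of driving the 3rd party software are hidden.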

  10. Software Reuse Layer
      • Exists as a service on the machine where the 3rd party software is installed
      • Clients provide the Java API
      • Many servers can exist on many machines of different platforms

  11. Polyglot
      The sole purpose of this layer is conversions.
      • Uses multiple software reuse servers
      • Merges available script operations into an I/O-Graph
      • Searches the I/O-Graph for conversion paths between an input format and a desired output format
      • Has no knowledge of the underlying 3rd party software
      • Can use redundancy in software reuse servers to improve performance and work around faults
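
Merging the operations advertised by several servers into one graph, and falling back to a redundant server when one fails, might look like the sketch below. The server names, capability lists, and the `fake_execute` stand-in for a remote call are all hypothetical.

```python
# Hypothetical capability lists advertised by software reuse servers:
# server name -> list of (software, input format, output format).
SERVER_CAPABILITIES = {
    "server1": [("AppA", "doc", "pdf"), ("AppA", "doc", "txt")],
    "server2": [("AppA", "doc", "pdf"), ("AppB", "pdf", "tiff")],
}


def merge_io_graph(capabilities):
    """Merge all servers' operations into a single graph.

    Returns (input, output) -> list of (server, software) able to do the step.
    """
    graph = {}
    for server, operations in capabilities.items():
        for software, src, dst in operations:
            graph.setdefault((src, dst), []).append((server, software))
    return graph


def run_step(graph, src, dst, data, execute):
    """Try each redundant provider of one conversion step until one succeeds."""
    errors = []
    for server, software in graph.get((src, dst), []):
        try:
            return execute(server, software, src, dst, data)
        except RuntimeError as exc:  # this provider failed; try the next one
            errors.append(f"{server}/{software}: {exc}")
    raise RuntimeError(f"no provider could convert {src} -> {dst}: {errors}")


def fake_execute(server, software, src, dst, data):
    """Stand-in for a remote call; server1 is simulated as being down."""
    if server == "server1":
        raise RuntimeError("server unavailable")
    return f"{data}->{dst}@{server}"


graph = merge_io_graph(SERVER_CAPABILITIES)
print(graph[("doc", "pdf")])  # [('server1', 'AppA'), ('server2', 'AppA')]
print(run_step(graph, "doc", "pdf", "report", fake_execute))  # report->pdf@server2
```

Because the graph is keyed by format pairs only, the search layer never needs to know which application performs a step.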

  12. Comparison Layer
      The sole purpose of this layer is to compare files.
      • Versus: a framework for pair-wise digital object comparisons. The library extracts the same features from both objects and computes their similarity based on the chosen measure.
      • Uses the Polyglot layer to convert many test files across many of the possible paths A -> B -> A’
      • Compares files before and after conversion
      • I/O Graph Weights Tool: converts a set of files across many paths using Polyglot and scripts, then adds the information losses obtained from Versus as edge weights to the I/O Graph.
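
The round trip A -> B -> A’ and the resulting edge weights can be illustrated with toy data. Everything here is an assumption made for the sketch: "files" are lists of numbers, `similarity` is a stand-in for a Versus measure, and the two converters are invented to contrast a lossless path with a lossy one.

```python
from statistics import mean


def similarity(a, b):
    """Toy stand-in for a comparison measure: fraction of equal samples."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def round_trip_loss(files, forward, backward):
    """Average information loss over test files for the path A -> B -> A'."""
    return mean(1.0 - similarity(f, backward(forward(f))) for f in files)


# Hypothetical converters: (forward, backward) pairs per edge.
# A -> B keeps every sample; A -> C rounds samples to integers.
CONVERTERS = {
    ("A", "B"): (lambda f: list(f), lambda f: list(f)),
    ("A", "C"): (
        lambda f: [round(x) for x in f],
        lambda f: [float(x) for x in f],
    ),
}

test_files = [[1.0, 2.5, 3.0, 4.25], [0.0, 1.0, 2.0, 3.5]]

# Loss per edge, usable as an edge weight in the I/O graph.
edge_weights = {
    edge: round_trip_loss(test_files, fwd, back)
    for edge, (fwd, back) in CONVERTERS.items()
}
print(edge_weights)  # {('A', 'B'): 0.0, ('A', 'C'): 0.375}
```

Averaging over several test files keeps a single unusual file from dominating the weight of an edge.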

  13. Conversion Software Registry (CSR)

  14. http://isda.ncsa.illinois.edu/NARA/CSR
      • Complementary to format registries such as PRONOM and GDFR
      • No similar service that we are aware of
      • Community contributions encouraged
      • A database focused on:
        • Conversion software!
        • Finding subsets of software for specific conversion needs
        • Finding conversion paths between pairs of formats

  15. The CSR pseudo-tables block design
      Parts: 1) Conversions, 2) Software, 3) Formats and Files, 4) Scripts, 5) User login and history

  16. Adding Conversions

  17. Adding Conversions - scripts
      Script headers are standardized, with up to four lines giving the software name and version, the software domain (image, 3D, document, etc.), and the input and output formats.
      • Script types present:
        • Convert - full conversion
        • Monitor - monitoring software behavior
        • Kill - terminating the software
        • Open/Save/Import/Export
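
A parser for such a header might look like the following sketch. The slide lists what the header contains but not its exact syntax, so the layout assumed here (comment lines starting with `;`, a `Name (version)` first line, and comma-separated format lists) is hypothetical, as are `ScriptHeader` and `parse_header`.

```python
from dataclasses import dataclass


@dataclass
class ScriptHeader:
    software: str
    version: str
    domain: str
    inputs: list
    outputs: list


def _split_formats(text):
    """Split a comma-separated list of formats, dropping empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_header(script_text, comment_prefix=";"):
    """Parse up to four leading comment lines of a wrapper script.

    Assumed layout: software and version, domain, input formats, output formats.
    """
    lines = []
    for line in script_text.splitlines():
        if not line.startswith(comment_prefix) or len(lines) == 4:
            break
        lines.append(line[len(comment_prefix):].strip())
    if len(lines) < 2:
        raise ValueError("header needs at least a software line and a domain line")
    name, _, version = lines[0].partition(" (")
    return ScriptHeader(
        software=name.strip(),
        version=version.rstrip(")").strip(),
        domain=lines[1],
        inputs=_split_formats(lines[2]) if len(lines) > 2 else [],
        outputs=_split_formats(lines[3]) if len(lines) > 3 else [],
    )


# Hypothetical script text; "ExampleApp" is not a real registry entry.
example = """;ExampleApp (v1.0)
;image
;bmp, gif
;png, tiff
Run, example.exe
"""
print(parse_header(example))
```

A standardized header of this kind is what would let a registry derive conversion edges from uploaded scripts without running them.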

  18. Editing Pane
      • Software
      • Vendors
      • Software platforms
      • Interfaces
      • Formats
      • Equivalent extensions
      • Sample files

  19. File format identifiers and extensions
      CSR relies on identifiers. Canonical and derived identifiers:
      • Common usage: ‘TIFF’
      • MIME: ‘image/tiff’
      • UTI: ‘public.tiff’
      • PRONOM PUID: ‘fmt/10’
      PUIDs distinguish different versions of a format. For example, the tiff extension is represented as PUID ‘fmt/10’ for version 6.0 and ‘fmt/155’ for GeoTIFF.
      CSR can be searched by extension, MIME type, or PUID.
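
Searching by extension, MIME type, or PUID can be sketched with a small in-memory index. The identifier values come from the slide; the record layout and the functions `build_index` and `search` are assumptions, not the CSR schema.

```python
# Identifier records using only the values given on the slide. The slide does
# not list MIME or UTI identifiers for GeoTIFF, so those are left out.
FORMATS = [
    {"name": "TIFF", "version": "6.0", "extension": "tiff",
     "mime": "image/tiff", "uti": "public.tiff", "puid": "fmt/10"},
    {"name": "GeoTIFF", "extension": "tiff", "puid": "fmt/155"},
]


def build_index(formats, keys=("extension", "mime", "puid")):
    """Index records by (identifier type, lower-cased identifier value)."""
    index = {}
    for record in formats:
        for key in keys:
            value = record.get(key)
            if value is not None:
                index.setdefault((key, value.lower()), []).append(record)
    return index


def search(index, key, value):
    """Return every record matching the identifier, ignoring case."""
    return index.get((key, value.lower()), [])


index = build_index(FORMATS)
print([r["puid"] for r in search(index, "extension", "TIFF")])  # ['fmt/10', 'fmt/155']
print([r["name"] for r in search(index, "puid", "fmt/155")])    # ['GeoTIFF']
```

The extension lookup returns two records, which shows why an extension alone is ambiguous and a PUID is needed to pin down a version.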

  20. Test files
      Any file that can be used to test conversion accuracy and validate software.
      Uploaded files are verified with the UNIX file command and checked against the file extension entry in the CSR database.
      Additional file validation has been performed semi-automatically by NARA using the GTRI (Georgia Tech Research Institute) File Type Identifier.
      W. Underwood, “Extensions of the UNIX file command and magic file for file type identification”, Technical report ITTL/CSITD 09-02, Georgia Tech Research Institute, 2009, URL: http://perpos.gtri.gatech.edu/publications/index.htm
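
A check of an uploaded file against its extension could be sketched as below. This is not the CSR's verification code: the `EXPECTED_MIME` table is a hypothetical excerpt of registry entries, and comparing MIME types is one assumed way to match the output of `file` to an extension.

```python
import subprocess
from pathlib import Path

# Hypothetical excerpt of the registry's extension -> MIME type entries.
EXPECTED_MIME = {"tiff": "image/tiff", "tif": "image/tiff", "png": "image/png"}


def detect_mime(path):
    """Ask the UNIX file command for the MIME type of a file."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def verify_upload(path, detect=detect_mime, expected=EXPECTED_MIME):
    """Return (ok, detected type, type expected from the extension)."""
    extension = Path(path).suffix.lstrip(".").lower()
    wanted = expected.get(extension)
    detected = detect(path)
    return (wanted is not None and detected == wanted, detected, wanted)


# A stub detector keeps the example runnable where `file` is not installed.
print(verify_upload("scan.TIFF", detect=lambda p: "image/tiff"))
# (True, 'image/tiff', 'image/tiff')
```

Calling `verify_upload("scan.tiff")` without the stub runs the real `file` command on that path.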

  21-23. Searching for Software
         Find a conversion path for converting file format A to file format B.

  24. Shortest path from file A to B
      • Dijkstra's algorithm: finds the lowest-cost path (i.e. the shortest path) between one vertex/node and every other vertex, with edge costs defined by some measure
      • Subjective measure: a user's ranking of a software package propagates to all of its conversions
      • Quantitative measures within the domain (images, 3D, etc.)
        • Images: normalized cross-correlation measure, histogram distance measure
        • 3D: surface area, statistics, spin images, light fields
        • Document (PDF)
        • Audio
      • User-specified measures, for example a linear combination of measures
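
Dijkstra's algorithm over an I/O graph, with edge costs formed as a linear combination of measures, can be sketched as follows. The edges, the measure names `ncc_loss` and `histogram_distance`, their values, and the weights are all hypothetical; only the algorithm itself is standard.

```python
import heapq


def combined_cost(measures, weights):
    """User-specified linear combination of per-edge loss measures."""
    return sum(weights[name] * value for name, value in measures.items())


def dijkstra(graph, source):
    """Lowest cost and predecessor for every format reachable from source.

    graph: format -> list of (next format, software, non-negative cost).
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, fmt = heapq.heappop(heap)
        if d > dist.get(fmt, float("inf")):
            continue  # stale queue entry
        for nxt, software, cost in graph.get(fmt, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = (fmt, software)
                heapq.heappush(heap, (nd, nxt))
    return dist, prev


def path_to(prev, source, target):
    """Rebuild the (software, input, output) steps from source to target."""
    if target != source and target not in prev:
        return None
    steps, node = [], target
    while node != source:
        before, software = prev[node]
        steps.append((software, before, node))
        node = before
    return steps[::-1]


# Hypothetical edges with made-up loss measures in [0, 1].
EDGES = [
    ("AppA", "bmp", "jpg", {"ncc_loss": 0.5, "histogram_distance": 0.5}),
    ("AppB", "bmp", "png", {"ncc_loss": 0.0, "histogram_distance": 0.0}),
    ("AppB", "png", "jpg", {"ncc_loss": 0.25, "histogram_distance": 0.25}),
]
WEIGHTS = {"ncc_loss": 0.5, "histogram_distance": 0.5}

graph = {}
for software, src, dst, measures in EDGES:
    graph.setdefault(src, []).append((dst, software, combined_cost(measures, WEIGHTS)))

dist, prev = dijkstra(graph, "bmp")
print(dist["jpg"])                   # 0.25
print(path_to(prev, "bmp", "jpg"))   # [('AppB', 'bmp', 'png'), ('AppB', 'png', 'jpg')]
```

In this toy graph the two-step path costs less than the direct conversion, so a search weighted by information loss can prefer a longer path than one that only counts steps.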

  25-30. Searching for Conversion Paths

  31. Future Directions
      • Compiling known “good” data of various formats
      • Systematically measuring information loss across software and formats
      • Possibly distributing the task among a community
      • Ranking software based on performance
      • Integration of CSR and Polyglot

  32. Summary
      • Currently contains:
        • 2,006 software packages
        • 1,682 format extensions
        • 233,810 conversions
      • No similar service that we are aware of
      • Complementary to format registries such as PRONOM and GDFR
      • Free
      • Community contributions encouraged
      http://isda.ncsa.illinois.edu/NARA/CSR
