The XCL Languages. Digital Preservation – The Planets Way. Dresden, April 23rd 2010. Manfred Thaller, Universität zu Köln

Presentation Transcript


  1. The XCL Languages. Digital Preservation – The Planets Way. Dresden, April 23rd 2010. Manfred Thaller, Universität zu Köln

  2. 1. The vision

  3. Observation ► Photoshop ► ► Photoshop ► Works only if you are examining the actual image data …

  4. Vision stage 1. Diagram: png → Format conversion → tiff. An Extractor produces image info 1 and image info 2; a Comparator asks: the same?

  5. Vision stage 2. Diagram: png → Format conversion → tiff. An Extractor, driven by png rules and tiff rules, produces image info 1 and image info 2; a Comparator asks: the same?

  6. Vision stage 3. Diagram: Obj 1 → Format conversion → Obj 2. An Extractor, driven by rule set 1 and rule set 2, produces object info 1 and object info 2; a Comparator asks: the same?

  7. Vision stage 4. Diagram: Obj 1 → Format conversion → Obj 2. An Extractor, driven by XCEL 1 and XCEL 2, produces XCDL 1 and XCDL 2; a Comparator asks: the same? • XCDL: abstract description of file content, the „eXtensible Characterisation Definition Language“, able to describe the content of digital objects (= 1 + n more files), processable by a software tool for further analysis. • XCEL: machine-readable form of a file format specification, the „eXtensible Characterisation Extraction Language“, able to describe any machine-readable format in a formal language, processable by a software tool for extraction of content as XCDL. • coco: specification of the „similarity“ to be used, the „comparator comparison [Language]“. • copra: specification of the „similarity“ observed, the „comparator results [Language]“.

  8. 2. Examples I

  9. XCL by Example Image width: 277 Image length: 339

  10. XCEL representation
      <!-- Tag 256: ImageWidth (XCL: imageWidth) -->
      <item xsi:type="structuringItem" identifier="IFDE_256" optional="true">
        <symbol interpretation="uint16" length="2" value="256"/>
        <item xsi:type="structuringItem" order="choice">
          <item xsi:type="structuringItem" order="sequence">
            <!-- Data type (value '3' means uint16) -->
            <symbol interpretation="uint16" length="2" value="3"/>
            <!-- number of values (N) -->
            <symbol interpretation="uint32" length="4" value="1"/>
            <!-- the value and name of property -->
            <symbol interpretation="uint16" length="2" name="imageWidth"/>
            <!-- wasted space -->
            <symbol interpretation="uint16" length="2"/>
            […]
          </item>
        </item>
      </item>
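The byte layout that this XCEL item describes can be walked through by hand. The following is a minimal Python sketch, not the Planets extractor: it assumes a little-endian TIFF unless told otherwise, and `read_ifd_entry`, `TAG_NAMES` and the sample bytes are hypothetical names and data introduced only for illustration.

```python
import struct

# Hypothetical lookup from TIFF tag number to XCL property name.
# Only the pair 256 -> imageWidth is taken from the slide.
TAG_NAMES = {256: "imageWidth"}


def read_ifd_entry(entry: bytes, byte_order: str = "<"):
    """Decode one 12-byte TIFF IFD entry that holds a single uint16 value."""
    if len(entry) != 12:
        raise ValueError("a TIFF IFD entry is exactly 12 bytes long")
    # Same sequence as the XCEL symbols: tag (uint16), data type (uint16),
    # number of values (uint32), the value (uint16), unused space (uint16).
    tag, data_type, count, value, _unused = struct.unpack(
        byte_order + "HHIHH", entry
    )
    if data_type != 3 or count != 1:
        raise ValueError("this sketch only handles one uint16 value")
    return TAG_NAMES.get(tag, f"tag_{tag}"), value


# Illustrative entry: tag 256, type 3, one value, width 277, two padding bytes.
sample = struct.pack("<HHIHH", 256, 3, 1, 277, 0)
print(read_ifd_entry(sample))
```

The point of XCEL is that this layout lives in a declarative description rather than in code; the sketch only shows what an interpreter of that description has to do for this one entry.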

  13. XCDL representation
      …
      <property id="p5">
        <name id="id30">imageWidth</name>
        <valueSet id="i_i1_s4">
          <labValue>
            <val>277</val>
            <type>int</type>
          </labValue>
        </valueSet>
      </property>
      ...
      XCEL entry:
      <!-- the value and name of property -->
      <symbol interpretation="uint16" length="2" name="imageWidth"/>

  14. XCDL representation
      …
      <property id="p5">
        <name id="id30">imageWidth</name>
        <valueSet id="i_i1_s4">
          <labValue>
            <val>277</val>
            <type>int</type>
          </labValue>
        </valueSet>
      </property>
      ...
      XCEL entry:
      <!-- Data type (value '3' means uint16) -->
      <symbol interpretation="uint16" length="2" value="3"/>

  15. XCDL representations can now be compared…
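As a rough illustration of such a comparison, the Python sketch below reads the property names and values out of two XCDL-like fragments and reports, per property, whether the values agree. It is not the Planets comparator: the `<xcdl>` wrapper element, the absence of XML namespaces, and the helper names `fragment`, `read_properties` and `compare` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def fragment(width: int) -> str:
    """Build a tiny XCDL-like document around the property shown on the slide."""
    return (
        "<xcdl><property id='p5'><name id='id30'>imageWidth</name>"
        "<valueSet id='i_i1_s4'><labValue>"
        f"<val>{width}</val><type>int</type>"
        "</labValue></valueSet></property></xcdl>"
    )


def read_properties(xcdl_text: str) -> dict:
    """Map each property name to the text of its first labelled value."""
    root = ET.fromstring(xcdl_text)
    return {
        prop.findtext("name"): prop.findtext("valueSet/labValue/val")
        for prop in root.iter("property")
    }


def compare(xcdl_a: str, xcdl_b: str) -> dict:
    """Report, per property name, whether both descriptions hold the same value."""
    a, b = read_properties(xcdl_a), read_properties(xcdl_b)
    return {name: a.get(name) == b.get(name) for name in sorted(set(a) | set(b))}


print(compare(fragment(277), fragment(277)))
print(compare(fragment(277), fragment(278)))
```

A property that is present in only one of the two descriptions is reported as not equal, which is one possible policy among several.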

  16. 3. Syntactical aspects of XCL processing

  17. The XCEL tree The XCEL tree describes a format.

  18. The result tree Parsing a file produces a result tree.

  19. XCDL: models • All file content is understood as instances of “higher-order data types”: • image • text • [ sound ] • [[ vector graphics ]]

  20. XCDL: text model • A text (= <object>) is composed of • data (= <normData>) plus • Interpretations / properties of data according to the underlying format specification (= <property>).

  21. Representing a text in XCDL: This is a text
      <refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
      …
      <property>
        <name>fontsize</name>
        <rawVal>
          <val>48</val>
          <type>unsignedInt8</type>
        </rawVal>
        <dataRef>
          <!-- property refers to discrete part of reference data -->
          <ref id="1" start="0" end="3"/>
          <ref id="1" start="10" end="12"/>
        </dataRef>
      </property>
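To see how the reference data and the `<ref>` ranges relate, the Python sketch below decodes the hexadecimal bytes and cuts out the referenced parts. The slide does not say whether `end` is inclusive; the sketch assumes that it is and makes the choice a parameter. `resolve` is a hypothetical helper, not part of any XCL tool.

```python
# Bytes of the <refData> element, written as on the slide.
ref_data = bytes.fromhex("54 68 69 73 20 69 73 20 61 20 74 65 78 74")


def resolve(data: bytes, refs, end_inclusive: bool = True):
    """Return the text covered by each (start, end) reference range."""
    parts = []
    for start, end in refs:
        # Assumption: 'end' names the last byte of the range, not one past it.
        stop = end + 1 if end_inclusive else end
        parts.append(data[start:stop].decode("ascii"))
    return parts


print(ref_data.decode("ascii"))
# The two <ref> elements of the fontsize property.
print(resolve(ref_data, [(0, 3), (10, 12)]))
```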

  22. XCDL: recursiveness XCDL is fully recursive An arbitrarily complex image can be a property of a textual position. Aka: Illustrations in a text file

  23. XCDL: recursiveness XCDL is fully recursive An arbitrarily complex text can be a property of a textual position. Aka: footnotes

  24. XCDL: recursiveness XCDL is fully recursive An arbitrarily complex text can be a property of an image segment. Aka: embedded image descriptions

  25. 3. Semantic aspects of processing

  26. How do Humans do it? Are the following two items equal: VIII ⇔ 8

  27. How do Humans do it? eight eight VIII ⇔ 8

  28. How do Humans do it? otto eight eight VIII ⇔ 8 otto

  29. How do Humans do it? otto acht eight eight VIII ⇔ 8 acht otto

  30. How do Humans do it? 8.0 otto acht eight eight VIII ⇔ 8 acht otto
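The idea on these slides, that two differently written items are compared by first mapping both to a shared concept, can be sketched in a few lines of Python. The word list and the helper names `roman_to_int`, `to_number` and `same_meaning` are illustrative assumptions covering only the tokens shown on the slides.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Number words from the slides (English, German, Italian).
WORDS = {"eight": 8, "acht": 8, "otto": 8}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral, subtracting a digit that precedes a larger one."""
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN[ch]
        if i + 1 < len(text) and ROMAN[text[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def to_number(token: str) -> float:
    """Map a surface form to the number it denotes."""
    token = token.strip()
    if token.lower() in WORDS:
        return float(WORDS[token.lower()])
    if token and all(ch in ROMAN for ch in token.upper()):
        return float(roman_to_int(token.upper()))
    return float(token)  # plain decimal notation such as "8" or "8.0"


def same_meaning(a: str, b: str) -> bool:
    """Two tokens are 'the same' if they denote the same number."""
    return to_number(a) == to_number(b)


print(same_meaning("VIII", "8"))
```

The comparison never looks at the two spellings side by side; each is interpreted on its own and only the interpretations are compared.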

  31. Replicating the approach in a machine: Information model: „an image“ / „a text“ otto acht eight eight VIII ⇔ 8 acht otto

  32. Replicating the approach in a machine: Information model: „an image“ / „a text“ Format ontology: „what terms are used in formats to describe image / textual properties“. VIII ⇔ 8

  33. Replicating the approach in a machine: Information model: „an image“ / „a text“ Format ontology: „what terms are used in formats to describe image / textual properties“. Extraction language: “how to get the terms describing an image / a text out of a file encoded in a specific format”.
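The role of a format ontology can be illustrated with a small lookup that maps format-specific terms to shared property names. This Python sketch is only an illustration: `imageWidth` is the XCL name used earlier in the talk, whereas `imageHeight` and the helper `normalise` are assumed names, and the dictionary is not the Planets ontology.

```python
# (format, term used by that format) -> shared property name.
FORMAT_ONTOLOGY = {
    ("tiff", "ImageWidth"): "imageWidth",
    ("png", "width"): "imageWidth",
    # 'imageHeight' is an assumed name chosen for this example.
    ("tiff", "ImageLength"): "imageHeight",
    ("png", "height"): "imageHeight",
}


def normalise(fmt: str, extracted: dict) -> dict:
    """Rename format-specific terms to shared names; keep unknown terms as-is."""
    return {FORMAT_ONTOLOGY.get((fmt, term), term): value
            for term, value in extracted.items()}


tiff_info = normalise("tiff", {"ImageWidth": 277, "ImageLength": 339})
png_info = normalise("png", {"width": 277, "height": 339})
print(tiff_info == png_info)
```

Once both descriptions use the same vocabulary, a comparison no longer needs to know which format each value came from.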

  34. The Planets XCL Approach – The Ontology

  35. 4. Conceptual aspects of processing

  36. Assumption I: Data which represent stored information do so in two forms: • As a set of tokens, which describe atomic items of information. • By a set of independent parameters, which describe, in a formalized way, the semantic interpretation of these items of information.

  37. Assumption II: Most algorithms today are based on “data types” that reflect hardware characteristics (char, int, float ...). “Objects” constructed from these data types are transient concepts, meaningful only within a specific implementation / environment. What we need are considerably higher-order objects, which are persistent by themselves and independent of a specific implementation / environment.

  38. Assumption III The need formulated as assumption II can be fulfilled using assumption I.

  39. Generalisation of Langefors' “Infological Equation”: $I = i(D, S, t)$; $I_2 = i(I_1, S_2, t)$; $I_x = i(I_{x-1}, S_x, t)$; $S_x = s(I_{x-1}, t)$; $I_x = i(I_{x-\alpha}, S_{x-\beta}, t)$; $I_x = i(I_{x-\alpha}, s(I_{x-\beta}, t), t)$, where $I$ = information, $i(\ldots)$ = interpretative process, $D$ = data, $S$ = previous knowledge, $t$ = time.

  40. 5. Inclusion of rendering results

  41. Observation: A file in Word 2003

  42. Observation: A file in Word 2007

  43. Observation: A file in Open Office

  44. Observation: A file in Acrobat

  45. Proposal to measure layout • Cut out the page from the rendering surface. • Scale to common dimensions: 371 ± 1 × 521 ± 1. • Measure: (a) the leftmost and lowest completely black pixel in the letter “A” starting the first line of the main text; (b) the leftmost and highest completely black pixel in the letter “E” starting the first line of the text in the footnote; (c) the geometrical centre of the period at the end of the main sentence; (d) the geometrical centre of the period at the end of the footnote text.

  46. Proposal to measure layout

  47. Could (will?) be done algorithmically, by the way.
      <significantPoints>
        <point name="i" x="45" y="134"/>
        <point name="ii" x="57" y="470"/>
        <point name="iii" x="215" y="322"/>
        <point name="iv" x="254" y="483"/>
      </significantPoints>

  48. Measuring Word 2003 • i = 45 / 134; • ii = 57 / 470; • iii = 215 / 322; • iv = 254 / 483

  49. Measuring Word 2007 • i = 45 / 134; • ii = 57 / 470; • iii = 215 / 322; • iv = 254 / 483

  50. Measuring Open Office • i = 45 / 134; • ii = 52 / 470; • iii = 215 / 322; • iv = 247 / 483
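The measurements above can be compared mechanically. The Python sketch below assumes that the four values listed per rendering correspond, in order, to the points i to iv, and it borrows the ± 1 pixel from the scaling step as a tolerance; both are assumptions, and `deviations` is a hypothetical helper.

```python
# Significant points (x, y) per rendering, copied from the slides.
RENDERINGS = {
    "Word 2003": {"i": (45, 134), "ii": (57, 470), "iii": (215, 322), "iv": (254, 483)},
    "Word 2007": {"i": (45, 134), "ii": (57, 470), "iii": (215, 322), "iv": (254, 483)},
    "Open Office": {"i": (45, 134), "ii": (52, 470), "iii": (215, 322), "iv": (247, 483)},
}


def deviations(reference: dict, other: dict, tolerance: int = 1) -> dict:
    """Return (dx, dy) for every point that moved by more than the tolerance."""
    moved = {}
    for name, (x, y) in reference.items():
        ox, oy = other[name]
        dx, dy = ox - x, oy - y
        if abs(dx) > tolerance or abs(dy) > tolerance:
            moved[name] = (dx, dy)
    return moved


reference = RENDERINGS["Word 2003"]
for name, points in RENDERINGS.items():
    print(name, deviations(reference, points))
```

With Word 2003 as the reference, only the Open Office rendering shows points that shifted, and only horizontally.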
