DDI the Movie #1: Architecture for a Modular Distributed Metadata XML

By I-Lin Kuo

Table of Contents
• 1. Modularity and Physical Instances
• 2. Modules, Visibility, and Versioning
• 3. Modules and Lists
• 4. The Known Intended Functions of DDI
• 5. DDI Formats
• 6. Modularity and Grouping: Composition vs. Inheritance
• 7. Modules and the Study Space
• 8. Links and Referencing

Chapter 1: Modularity and Physical Instances

Standalone instance (~DDI 2.0)
This is a standalone instance, like DDI 2.0’s main use case. Questions and variables are all contained within the same physical instance. A standalone instance is complete in and of itself, like a “codebook”.
[Diagram: one physical instance containing Questions and Variables]

Instances Sharing Modules Use Case
Exporting a module: we might want to pull the questions out into their own instance so that they can be shared by several studies or datasets. To do so, parts of the standalone instance must be modularized. Two questions must be resolved:
• Q1.1: How do we indicate this [sharing] relationship between documents?
• Q1.2: How do we actually reference between the documents in a robust way?
[Diagram: Study 1 and Study 2 each contain their own Variables and share a single Questions module]

Translation Use Case
A similar case is adding translations to an existing DDI instance.
• We would like to alter the existing DDI instance as little as possible, so the translations should be in separate documents.
• The existing DDI instance should be able to function without the translation.
• It should be possible to add multiple translations later with no further changes to the original.
• The same translation should remain usable if the original instance is updated.
[Diagram: German and Finnish translation modules attached to an instance containing Questions and Variables]

Q1.1: Relationship Indication
• Q1.1: How do we indicate this relationship between documents?
• A: We borrow a concept from Eclipse plugins. Eclipse plugins have either an extends relationship or a dependency relationship with other plugins. These relationships are expressed explicitly in the plugin’s manifest files.
• We will express these relationships explicitly in the DDI’s header section, using a tag style similar to Eclipse’s.

Instances sharing modules
We may refactor a standalone instance so that it uses a shared Questions module. The core instance is incomplete without the Questions module, so the relationship is a dependency.
• A dependency relation is expressed by <requires>. The original instance indicates this dependency by an appropriate <requires> element in its header.
• The required module simply provides a name/identifier in its header (the uniqueness and format of this identifier are to be resolved later).
• The module does not indicate what it is used by, since it may be used by several different instances.
[Diagram: Study 1 and Study 2 each contain Variables and declare <requires module="Questions"/>; the shared Questions module declares <module name="Questions"/>]
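The dependency declaration described above can be sketched in a few lines of Python. This is only an illustration: the <header> wrapper element and the function names are assumptions made for the example, while <requires> and <module> follow the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical headers. The <header> wrapper is an assumption for this sketch;
# only <requires module="..."/> and <module name="..."/> come from the slides.
STUDY_1 = '<header><requires module="Questions"/></header>'
STUDY_2 = '<header><requires module="Questions"/></header>'
QUESTIONS = '<header><module name="Questions"/></header>'


def required_modules(header_xml):
    """Names of the modules this instance declares it depends on."""
    root = ET.fromstring(header_xml)
    return [r.get("module") for r in root.iter("requires")]


def provided_modules(header_xml):
    """Names of the modules this instance makes available to others."""
    root = ET.fromstring(header_xml)
    return [m.get("name") for m in root.iter("module")]


def missing_modules(instance_xml, available_xmls):
    """Required modules that none of the available instances provide."""
    provided = {name for x in available_xmls for name in provided_modules(x)}
    return [m for m in required_modules(instance_xml) if m not in provided]
```

Note that the Questions header never lists Study 1 or Study 2: the lookup only runs from the requiring instance to the required module, which is what lets several studies share one module.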

Translation
Since the original instance is complete without the translation, the relationship is an extension. An <extension-point> indicates that some indeterminate module may be plugged in. It is important to differentiate between dependencies and extensions.
• Q1.3: Why is it important?
• Q1.4: What happens when we have multiple extensions/translations?
[Diagram: the core instance (Questions, Variables) declares <extension-point module="translation"/>; the German and Finnish modules each declare <extension name="translation"/>]

Q1.4 in more detail
• Q1.4: What happens when we have multiple extensions/translations?
• Note that the Instances Sharing Modules use case is an example of many-sharing-one, while the Translation use case is an example of one-using-many. In the one-using-many case, we often have to use only one of the many, but which one should we use?
• Which one to use is known to the application at runtime, not at markup time. The actual choice may depend on application context or user choice. For example, if the user had previously selected the language German, then the German translation should be used.
• More precisely, extensions are late-bound and dependencies are early-bound.

Identically structured datasets
• Consider a study with multiple datasets, all identically structured.
• This might be U.S. Census 2000, or …
• … a simple study provided in multiple physical data formats: SAS, SPSS, STATA.
• As much as possible, we would want the same DDI instance to be used for all three data formats.
• This is another one-using-many example. Structurally, it is identical to the Translation use case.
[Diagram: the core instance declares <extension-point module="physical"/>; the SAS, STATA, and SPSS modules each declare <extension name="physical"/>]

Q1.4 answered
• Q1.4: What happens when we have multiple extensions/translations?
• A: The application must select the appropriate one based on context.
• Therefore, the markup must specify the context-type to be used when selecting. At runtime, the user or application will provide the actual context.

Selectors
• The Eclipse model of linking between plugins needs to be enriched by adding a selector or context-type concept to capture the conditional relationship between modular DDI instances.
• Selectors allow the decision about which of the many is actually connected to the one to be made at runtime.

Q1.4 answered II: Language selector example
The context of the selection is “language”, or xml:lang.
[Diagram: the core instance (Questions, Variables) declares <extension-point module="translation" selector="xml:lang"/>; the German module declares <extension name="translation" xml:lang="german"/> and the Finnish module declares <extension name="translation" xml:lang="finnish"/>]

Q1.4 answered III: Statistical format selector example
The context of the selector is “stat-format”.
[Diagram: the core instance declares <extension-point module="physical" selector="stat-format"/>; the three physical modules declare <extension name="physical" stat-format="STATA"/>, <extension name="physical" stat-format="SAS"/>, and <extension name="physical" stat-format="SPSS"/>]
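Late binding through a selector might look like the following minimal sketch. The markup is taken from the slide; the select_extension function and the shape of the runtime context (a plain dictionary keyed by selector name) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Markup as shown on the slide.
EXTENSION_POINT = '<extension-point module="physical" selector="stat-format"/>'
EXTENSIONS = [
    '<extension name="physical" stat-format="STATA"/>',
    '<extension name="physical" stat-format="SAS"/>',
    '<extension name="physical" stat-format="SPSS"/>',
]


def select_extension(point_xml, extension_xmls, context):
    """Pick the extension whose selector attribute matches the runtime context.

    `context` is a hypothetical dict supplied by the application or user at
    runtime, e.g. {"stat-format": "SAS"}. Returns None if nothing matches.
    """
    point = ET.fromstring(point_xml)
    selector = point.get("selector")
    wanted = context.get(selector)
    if wanted is None:
        return None  # the application has not supplied this context yet
    for xml_text in extension_xmls:
        ext = ET.fromstring(xml_text)
        if ext.get("name") == point.get("module") and ext.get(selector) == wanted:
            return ext
    return None
```

The markup names only the kind of context (stat-format); the value (SAS, SPSS, STATA) arrives at runtime, which is the sense in which extensions are late-bound.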

Q1.3
• Q1.3: Why is it important to distinguish between dependency and extension?
• A: For example, contrast a multiple-language survey with a single-language survey that has been translated.
• The documentation for a multi-language survey is incomplete without its language components. The relationship is requires.
• The documentation for the single-language survey is complete without its translations. The relationship is extends.
• Both have a language selector, however.

Q1.3
• After rethinking, it may not actually be important to distinguish between dependency and extension. I’ll have to rethink this for the October meeting.
• In any case, a <requires> element is a kind of <extension-point>.

Exporting Modules
In the first example, we exported Questions into its own separate instance. To do this, we had to add <requires …> to the header.
[Diagram: the original instance (Variables) declares <requires module="Questions"/>; the exported Questions instance declares <module name="Questions"/>]

Exporting Modules
Let’s say we wanted to export Variables instead of Questions. Should we just place Variables in its own physical instance and add <requires …> to the header, as we did before?
[Diagram: the original instance (Questions) declares <requires …/>; the exported Variables instance declares <module name="Variables"/>. Is this OK?]

Exporting Modules
NO!! Variables depends on Questions, so there is a circular dependency. Circular dependencies must be avoided!
• Q5: Why must circular dependencies be avoided?
• Q6: How do we avoid circular dependencies?
[Diagram: the original instance (Questions) declares <requires module="Variables"/>; the exported Variables instance declares <module name="Variables"> and <requires module="Questions"/>]

Why circular dependencies must be avoided
In this example, because there is a circular dependency, Study 2 indirectly depends on Study 1: Study 2 requires Variables, and Variables still requires the Questions that live inside Study 1. This is very, very bad.
[Diagram: Study 1 and Study 2 each contain Questions and declare <requires module="Variables"/>; the shared Variables instance declares <module name="Variables"/>]
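A tool could detect this situation before publishing. The sketch below is a standard depth-first search for a cycle over the "requires" relation between physical instances. The dictionary representation and the find_cycle name are assumptions, not part of the proposal.

```python
def find_cycle(requires):
    """Return one dependency cycle as a list of names, or None if there is none.

    `requires` maps each physical instance to the instances it requires.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in requires.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(requires):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# The bad export: Variables was pulled out of Study 1 but still needs the
# Questions held inside Study 1.
BAD = {"Study 1": ["Variables"], "Variables": ["Study 1"], "Study 2": ["Variables"]}

# After also exporting Questions, no instance depends back on a study.
GOOD = {
    "Study 1": ["Variables", "Questions"],
    "Study 2": ["Variables"],
    "Variables": ["Questions"],
    "Questions": [],
}
```

In the BAD graph, Study 2 reaches Study 1 through Variables, which is the indirect dependency the slide warns about.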

Export Heuristic
• Q6: How do we avoid circular dependencies?
• HEURISTIC: If we export a module, then we must either export or copy all of its dependencies.
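Read literally, the heuristic asks for the transitive closure of a module's dependencies. A minimal sketch, assuming module-level dependencies are available as a dictionary (the export_set name and the data shape are hypothetical):

```python
def export_set(module, requires):
    """Modules that must be exported (or copied) together with `module`.

    `requires` maps a module name to the modules it depends on. The result
    starts with `module` itself, followed by everything reachable from it.
    """
    to_export = []
    seen = set()
    stack = [module]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        to_export.append(current)
        stack.extend(requires.get(current, []))
    return to_export
```

For the running example, exporting Variables pulls in Questions, while exporting Questions alone pulls in nothing else, which is why the first export in this chapter was unproblematic.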

Solution 1: Export dependencies
In this picture, Questions has been exported to its own physical instance. If other modules within the original DDI depend on Questions, then there would also be a <requires module="Questions"/> within the original’s header. Check that the “Study 2 depends on Study 1” scenario cannot occur.
[Diagram: the original instance declares <requires module="Variables"/>; the Variables instance declares <module name="Variables"> and <requires module="Questions"/>; Questions is a third, separate instance]

Solution 2: Copy Dependencies
In this case, there is duplication but no circular dependency. This is acceptable but not ideal.
[Diagram: the original instance (Questions) declares <requires module="Variables"/>; the Variables instance declares <module name="Variables"> and contains its own copy of Questions]

Solution 3: Export Related Dependencies
In this case, the Questions module is exported inside the Variables module, so there are not three physical instances as in Solution 1, and there is no duplication as in Solution 2. Note also that Solution 3 can morph into Solution 1 by exporting the Questions module from Variables. This is done without further changes to the original core DDI instance.
[Diagram: the original instance declares <requires module="Variables"/>; the Variables instance declares <module name="Variables"> and contains Questions]

What is a module?
• The preceding discussion has implications for module design. In particular, the avoidance of circular dependencies impacts module design. If our modules have circular dependencies, then we should not consider them to be modules.
• Our current design decomposes into functional modules. This kind of decomposition doesn’t necessarily avoid circular dependencies, so I think our current modular decomposition must be revised.
• See Chapter 3: Modules and Lists

Discussion on Preservation
• The relation between the core DDI and its modules advanced in this chapter is like that of the hub and spokes of a wheel. METS also has this kind of structure. However, the DDI modular architecture differs from METS in two significant ways:
  • METS is static while DDI modules are dynamic. In other words, in METS, what is at the end of the spoke may not change, but with DDI, what is at the end of the spoke may be switched out by the application.
  • The METS “spoke” is a loosely coupled, top-level reference. The DDI “spoke” is a tightly coupled multi-reference.
• The dynamic nature of DDI modular “spokes” has consequences from the preservation point of view, as it is unclear what it means to preserve a dynamic entity. (Not covered by the OAIS model?) Preserve a snapshot?

Discussion on Preservation II
• The motivation for a dynamic modular DDI architecture involving swappable modules comes from the following use cases:
  • Translations
  • Continuous wave
  • Enhancement by end-users, archives, and harvesters
  • Extended data lifecycle
• The OAIS reference model deals only with the archival phase and thus does not encounter processing issues. It seems that processing concerns and preservation concerns are at odds.

Summary: Important concepts to remember
• Exporting a module
• Dependency relationship
• Extension relationship
• Selectors
• Export heuristic and circular dependencies

Chapter 2: Modules, Visibility, and Versioning

Note: the completion of this chapter predates the versioning document at http://www.pop.umn.edu/~wlt/arofan/versioning.doc and will need to be updated accordingly to incorporate the ideas of versioning.doc

Visibility
• The header or wrapper for a physical instance declares the availability of its modules to the world.
• Undeclared modules are not visible, i.e. they cannot be referenced from external instances. These are called internal modules. There are reasons why some modules should not be visible externally.
• Modules must also declare their version in their header.

Q2.1 Why do we care about versioning?
• Historicity/Provenance
• Interoperability
• The two usages should be separated, as provenance requires tracking at a much finer level of granularity than interoperability, and is technically more demanding.

Historicity
• If a changing resource such as DDI is cited, then it is important to identify the specific version of the resource.
• For reproducibility of analysis, it is important that actionable metadata be versioned as well as data. (Non-actionable metadata need not be versioned.)
• The debate over whether or not DDI ought to be versioned usually boils down to whether or not DDI is regarded as actionable metadata.

Historicity II
• It is inaccurate to change the @author of a module every time the module is modified throughout the lifecycle. Thus, depending on the requirements, it may be necessary to label individual elements with the versions in which they were last changed.
• It may also be necessary to label elements with multiple @author values to track the sources of the changes.
• Placing the above information in the XML metadata itself is verbose and error-prone. It is the opinion of this author that there are far better mechanisms (a la CVS) to accomplish this purpose.

Interoperability
• If metadata is to be machine-actionable, then it needs to be versioned.
• Applications will need to know what versions of the metadata are compatible with each other.

Suggested versioning scheme
• DDI versioning is not mandatory. However, it is recommended that applications which do version DDI instances use the following scheme:
• DDI instance versions should be identified by a 4-part version number, as well as a publication timestamp. The version number should consist of digits only.
• An optional non-digit identifier may be placed in @edition.
• For both document-based and dynamic RDBMS-based archives, this allows retrieval by either version number or timestamp, but not both.

4-Part Version Number
Example: 1.3.7.2
• Left side (1.3): driven by data changes. If the data itself is versioned, then this should match the data version as closely as possible.
  • 1.3 -> 1.4 might involve minor changes related to data cleaning, or reissue in a different physical format.
  • 1.3 -> 2.0 would involve significant data changes, such as adding newly recoded variables.
• Right side (7.2): driven by metadata changes.
  • 7.2 -> 7.3 should involve minor metadata changes with no anticipated incompatibilities, such as typo corrections, adding question text, adding related non-data materials, etc.
  • 7.2 -> 8.0 should involve major metadata changes, such as modularization of DDI, adding comparison linkages, or adding ISO 11179 markup.
?? I’m uncertain as to whether 2-2 is good enough. Perhaps 3-2 ??

Examples
• 3.11 is not acceptable
  • Only has two parts. Recommended: 3.11.0.0
• 4.11a.1.2 is not acceptable
  • Has non-digits. Recommended: 4.11.1.2 edition="a"
• 2.a.5.2 is not acceptable
  • Has non-digits. Recommended: 2.0.5.2 edition="a"
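The acceptance rules above are easy to check mechanically. The following sketch assumes the 2-2 split between data and metadata parts (which the author still marks as uncertain); the function names are illustrative only.

```python
import re

# Four dot-separated groups of ASCII digits, nothing else.
VERSION_RE = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def is_valid_version(text):
    """True if `text` is a 4-part, digits-only version number."""
    return VERSION_RE.fullmatch(text) is not None


def split_version(text):
    """Split a version into (data part, metadata part), assuming a 2-2 split."""
    if not is_valid_version(text):
        raise ValueError(f"not a 4-part numeric version: {text!r}")
    a, b, c, d = (int(p) for p in text.split("."))
    return (a, b), (c, d)
```

Parsing the parts as integers rather than comparing strings keeps the ordering numeric, so 1.10.0.0 sorts after 1.9.0.0.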

Versioning Authority
• If the data producer provides a data version number, then that should be used for the left side, if possible. The data producer is the data versioning authority.
• The metadata versioning authority is the organization which houses/disseminates the metadata.
• If there is no version number for the data, or if there are multiple pieces of data with different version numbers, then the metadata versioning authority also acts as the data versioning authority and assigns numbers to the left side.
• If a harvester such as VDC or Nesstar does not change the metadata, then the original source of the DDI is the metadata versioning authority. If the harvester enhances the harvested metadata (and redistributes the DDI), then the harvester becomes a metadata versioning authority and should include a reference to the original. If the harvester does not redistribute the DDI, then no change of metadata versioning authority is necessary.

Version Derivation
• If an organization or application receives a DDI from another organization or application and further enhances it, then the second DDI is derived from the first.
• Version derivation information should be recorded.
• The data version numbers should agree, if possible, between the original and the derived.
• The metadata version numbers need not agree, and indeed the derived metadata version number may be lower than the original metadata version number. This is because the keeper of the metadata is the metadata versioning authority.

Version Dependencies
• Dependencies should indicate a version for the sake of interoperability. Currently, we have two types of dependency, <requires> and <extension>, located in the header:
  • <requires requires-version="1.2.3.0" edition="a">
  • <extension extends-version="0.9.11.0">
• Modules which lack a version number cannot be used externally.
• However, dependencies should specify a range of versions rather than a single version, since metadata can change.

Version Range Examples
• 1.2.0.* (does not include 1.2.1.0)
• 1.*
• 1.2.0.3+ (does include 1.2.1.0)
• 1.2.0.3 – 1.2.0.5
• 1.2.0.3 – 1.2.0.*
• 1.2.0.3+ – 1.2.0.* (same as 1.2.0.3 – 1.2.0.*)
• It is not anticipated that an edition range is necessary. However, a module may extend more than one extension-point via multiple <extension> elements. This mechanism may effectively allow for an “edition range”.
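One possible reading of the range notation above is sketched here. The semantics are inferred from the examples (a trailing * matches any value in the remaining positions, a trailing + means "this version or later", and a dash gives inclusive bounds); the matches function is hypothetical and no official grammar is implied.

```python
def _prefix(text):
    """Numeric parts of a bound, ignoring a trailing '+' or a final '*'."""
    parts = text.strip().rstrip("+").split(".")
    if parts[-1] == "*":
        parts = parts[:-1]
    return tuple(int(p) for p in parts)


def matches(version, spec):
    """True if the 4-part `version` falls inside the range `spec`."""
    v = _prefix(version)
    spec = spec.replace("\u2013", "-")  # accept an en dash as the separator
    if "-" in spec:
        low, high = spec.split("-", 1)
        lo, hi = _prefix(low), _prefix(high)
        # Compare only as many parts as each bound specifies.
        return v[:len(lo)] >= lo and v[:len(hi)] <= hi
    p = _prefix(spec)
    if spec.strip().endswith("+"):
        return v >= p
    return v[:len(p)] == p  # wildcard prefix, or exact match for 4 parts
```

Under this reading, a wildcard upper bound such as 1.2.0.* caps the range at the end of the 1.2.0 series, which is why 1.2.0.3+ – 1.2.0.* and 1.2.0.3 – 1.2.0.* select the same versions.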

Summary
• Only modules which are declared in the header are visible externally.
• The declaration must include a @version number and a @publish-date, with an optional @edition if applicable. A @version-authority is required, with possibly a @version-derivation if necessary.
• The recommended versioning scheme (TBD) is a 4-part numeric version number.
• Dependencies must declare the version or version range upon which they depend.
• Versioning and publishing are intertwined. Only published modules may be versioned, and the publishing authority is the same as the versioning authority.

Chapter 3: Modules and Lists

Q3.1 Why Lists?
• In DDI 2.0, we refer to producer, author, researcher, etc. multiple times when they are the same entity. We would like to not have to repeat this information every time we use it.
• In DDI 3.0, multiple variables may be derived from the same question. We would like to not have to repeat the question text and associated information.
• Other examples abound…
• So, for example, we may choose to gather all the producers, authors, etc. in an <institutionsAndPersonsList> and simply refer to the items in the list.

Module = Collection of Lists
• The simplest module is a single list:
  • <QuestionsList>
  • <institutionsAndPersonsList>
• The most complicated module is a collection of lists.
• More precisely, a module is a collection of unordered lists.
• Q3.1: Why unordered lists?

Lists map to OO and RDBMS
• This conception of a module allows a relatively straightforward mapping to OO and RDBMS implementations:
  • Modules map to OO packages or RDBMS schemas.
  • Unordered lists map to OO collections or RDBMS tables.
  • List items map to OO objects or RDBMS rows.
  • 1:n or n:1 relationships map to foreign key refs in the usual way. n:n relationships map to linking tables.
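The OO side of this mapping can be sketched with a few Python dataclasses. The class and field names below are assumptions chosen for illustration; only the idea (a module is a set of unordered lists whose items are referenced by identifier) comes from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    id: str
    text: str


@dataclass
class Variable:
    id: str
    question_id: str  # n:1 reference to a Question, like a foreign key


@dataclass
class Module:
    """A module as a collection of unordered lists, each keyed by item id."""
    name: str
    lists: dict = field(default_factory=dict)  # list name -> {item id -> item}

    def add(self, list_name, item):
        self.lists.setdefault(list_name, {})[item.id] = item

    def get(self, list_name, item_id):
        return self.lists[list_name][item_id]


questions = Module("Questions")
questions.add("QuestionsList", Question("q1", "What is your age?"))

variables = Module("Variables")
variables.add("VariablesList", Variable("age", question_id="q1"))
variables.add("VariablesList", Variable("age_group", question_id="q1"))
```

Both variables point at the same question by identifier, so the question text is stored once, which is the motivation given in Q3.1. Keying each list by item id rather than by position reflects the lists being unordered.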

List/Module Management: Fundamental Operations
• Publish/Unpublish
• Export
• Import
• Rename/Move/Copy
• Extract (from inline to List)/Inline
• Filter
• Concatenate
• Merge-common/Consolidate? Is this necessary?
• Merge-all
  • Resolve by publication date or version number or derivation
  • Resolve conflicts manually if necessary
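As one illustration of these operations, a Merge-all that resolves by version number and reports the remaining conflicts for manual resolution might look like this. The data shape (item id mapped to a version tuple and a value) and the merge_all name are assumptions; resolution by publication date or derivation is not shown.

```python
def merge_all(left, right):
    """Merge two lists keyed by item id, keeping the higher-versioned item.

    Each list maps item id -> (version tuple, value). Returns the merged list
    and the ids that need manual resolution (same version, different value).
    """
    merged = dict(left)
    conflicts = []
    for item_id, (version, value) in right.items():
        if item_id not in merged:
            merged[item_id] = (version, value)
            continue
        kept_version, kept_value = merged[item_id]
        if version > kept_version:
            merged[item_id] = (version, value)
        elif version == kept_version and value != kept_value:
            conflicts.append(item_id)  # left value is kept until resolved
    return merged, conflicts
```

Items that differ at the same version cannot be ordered automatically, so they are returned separately rather than silently overwritten.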

If you understand the module operations, and you also understand where you may want to use them in the DDI data lifecycle, then you understand DDI modules.
