Explore the importance of metadata in the NIR context, including automatic, semi-automatic, and manual tasks. Learn about objective versus subjective data, editorial and authorial interventions, design challenges, and more.
Metadata in NIR • Fabio Vitali, University of Bologna • Maria Guercio, University of Urbino
Introduction • Metadata support has always been present in NIR • Recently (June/July 2004) in-depth (and heated) discussions have taken place within the WG about identifying a full set of metadata information • This presentation reports the current state of that discussion.
Some terminology • Automatic: any task that can be completely left to the machine to be performed • All kinds of data format conversion • E.g. XML->HTML or NIR XML -> NIR RDF. • Semi-automatic: any task that can, with a certain degree of precision, be performed by the machine, but that still requires a human for final verification and approval. • Identification of structures • E.g. partitioning of documents, identification and interpretation of citations • Manual: any task that needs to be decided upon and performed by a thinking human, even though the machine can provide the support to help him/her and ease the task itself.
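As an illustration of an "automatic" task in the sense above, the following is a minimal sketch of an XML-to-HTML conversion that needs no human decision. The fragment, the HTML layout and the function name `article_to_html` are assumptions made for this example; only the element names (`articolo`, `comma`, `num`, `corpo`) are borrowed from the NIR examples in these slides.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical NIR-like fragment, invented for this sketch.
SOURCE = """<articolo id="art1">
  <num>Art. 1</num>
  <comma id="art1-com1"><num>1</num><corpo>First clause.</corpo></comma>
  <comma id="art1-com2"><num>2</num><corpo>Second clause.</corpo></comma>
</articolo>"""


def article_to_html(xml_text: str) -> str:
    """Convert one <articolo> into an HTML <div>; purely mechanical."""
    art = ET.fromstring(xml_text)
    parts = [f'<div id="{escape(art.get("id", ""))}">']
    parts.append(f"<h3>{escape(art.findtext('num', ''))}</h3>")
    for comma in art.findall("comma"):
        num = escape(comma.findtext("num", ""))
        body = escape(comma.findtext("corpo", ""))
        # Keep the id so that references into the text survive the conversion.
        parts.append(f'<p id="{escape(comma.get("id", ""))}">{num}. {body}</p>')
    parts.append("</div>")
    return "\n".join(parts)


html = article_to_html(SOURCE)
```

The ids are carried over to the HTML output so that references to substructures keep working after conversion.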
Some terminology (2) • Objective • An objective datum is something for which no reasonable discussion can exist as to its value. • E.g. the title of article 15, the publication date • Subjective • A subjective datum is something that requires an active interpretation from a human that may be wrong, or for which different opinions exist • E.g., resolution of implicit citations, classification of provisions • Explicit • A datum that is actually written somewhere in the text • Implicit • A datum that needs to be deduced from external context, or through the application of specific reasoning
Some terminology (3) • Low competence • the kind of competence one may expect from a non-specialized employee, such as a secretary, armed with just common sense and some topical experience • E.g.: where does article 1 end and article 2 start • High competence • The kind of competence one may expect from overspecialized jurists that come to some results after careful and painful reasoning • e.g.: dates and times in norms. • Editorial intervention • by the publisher of a document • Authorial intervention • by the author of a document
Design issues for NIR (1) • Data structure rather than application • Norme In Rete knows about applications, but is not dependent on any use of the data and is not specifically targeted towards any specific application (except presentation) • The same text should be marked in the same way by different editors (at least in the most fundamental structures)
Design issues for NIR (2) • Rigorous distinction of roles • The author of a norm is the legislator, the provider of the actual XML document is the editor. • The legislator is GOD (his decisions cannot be discussed), but He only speaks through the text of the norms. • The editor can add a large quantity of information, but it has no official status • The very act of adding tags is an editorial operation, subjective and open to discussion. • In fact, any addition coming from editors (structure identification, notes, comments, interpretation) happens outside of the document content (in markup structures or in special metadata sections)
Design issues for NIR (3) • Complexity of the access to texts • Many editors, many publishing systems, many copies in different stages of evolution • There is no authoritative source of XML documents (only of printed documents). • One web site could forget about updating a law to the latest version • The use of URNs makes it possible to refer to the text of a law without identifying a single existing authoritative source.
Design issues for NIR (4) • Support for description and prescription • Tagging of existing texts can only be descriptive (supporting any possible mess that the legislator may have put in) • Support for legal drafting can be provided, suggesting or enforcing legal drafting rules in the writing.
Design issues for NIR (5) • Everything has a reliable name • Every legal structure needs to be referenced and accessible. • References need to be unambiguous, universal, definitive. • URN for whole documents, • id attributes for substructures and spans • XPointers for even smaller entities.
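The naming scheme above (a URN for the whole document, an id for a substructure) can be sketched as a small reference parser. This is only an illustration: the `urn:example:` strings are placeholders and not real NIR URNs, and joining the URN and the id with `#` is an assumption of this sketch; XPointers are not handled.

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    document: str            # URN naming the whole document
    fragment: Optional[str]  # id of a substructure, or None for the whole document


def parse_reference(ref: str) -> Reference:
    """Split 'urn#id' into its document and fragment parts."""
    document, sep, fragment = ref.partition("#")
    return Reference(document, fragment if sep and fragment else None)


# Placeholder URNs, invented for the example.
whole = parse_reference("urn:example:law-675")
part = parse_reference("urn:example:law-675#art1-com1")
```

Keeping the document name and the fragment id separate means the same fragment reference stays valid whichever copy of the document is eventually retrieved.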
Design issues for NIR (6) • Clean separation between objective properties and interpretation • Objective properties can be marked by low-level editors, while interpretation requires experts and high-level editors. • Objective (manifest) properties include identification of boundaries (articles, clauses, etc.) and official facts about texts (publication dates, etc.) • Interpretation includes identification of troublesome dates (dies coactu, dies valens), identification of the normative content of the text's provisions, application of modifications.
Design issues for NIR (7) • Specific support for multiple interpretations • “Disposizioni” (law provisions) can be identified and specified on the text. • Multiple different interpretations of the same text must be allowed • So they can be placed outside of the main document.
Basic structures (1) • Containers • Documents, parts, subparts, articles, etc. • All numbered and titled • Text containers • Clauses (comma), list elements, etc. • Inline elements • Presentation oriented (bold, italics, etc.): discouraged, we rely on HTML elements and CSS styles • Legal oriented (references, modifications, specification of dates, organizations, roles, places, etc.): we rely on specific NIR elements.
Basic structures (2) • Metadata • Publication information and other data supplied by editors (publication notes, document evolution, etc.) • Law provisions for the interpretation of the semantics of the content • Support for irregular texts (those that do not comply with standard legal drafting rules) is available through relaxed syntax in some cases (documentoNIR)
The Schemas for NIR documents • 3 different DTDs • Strict rules (prescriptive) • Loose rules (descriptive) • Light rules (support for most common cases) • They are intercompatible • The vocabulary is exactly the same • All light documents are also loose • All strict documents are also loose
The need for metadata • Metadata are the only place to record information that was not explicitly written by the legislator. • All possible types of additional information beyond those provided in the text need to find a place here. • Uses: archival, analysis, annotations, automatic processing (consolidation), etc.
Official classification of metadata • A starting point is provided by NISO (US National Information Standards Organization) in the guide “Understanding metadata” (2004): • descriptive metadata to describe a resource “for purposes such as discovery and identification” • structural metadata to indicate “how compound objects are put together” • administrative metadata to provide information “to help manage a resource”, articulated (only) as rights management metadata and preservation metadata (“information needed to archive and preserve a resource”)
But… • The distinction between descriptive, structural and administrative metadata has no concrete basis in real practice: • All the communities involved in the preservation of documents have developed and used information about structure identification as a subset of their descriptive systems. They never consider structural data as an independent component. • The ambiguity of administrative metadata is even more evident, particularly in digital systems, where technological components are less and less relevant to long-term preservation: they serve for the physical retrieval of a resource in a digital repository, but are considered part of the descriptive system in the case of web resources.
Metadata in the NIR DTD text • Any kind of information that is provided by the editor rather than by the author. • In a way even tagging text is metadata • Deriving new versions out of an original and a few modification documents is also adding metadata. • But adding proper metadata means providing additional information to a version of a document that can be used to better search, contextualize and understand a document. • [Figure: an XML text document and several XML change documents, with a meta section]
Proper metadata in the NIR DTD • Can be specified • In an external document (in RDF - still underspecified) • In an internal section at the beginning of the document (meta) in a NIR vocabulary • In many internal sections near the parts of the text they refer to, in a NIR vocabulary • Conversion back and forth is always possible and automatic. • Deals with description, structure, administration, as well as: • Interpretation of content • Relationships with other documents • Comments and notes
Seven types of proper metadata • Reflective information • Things the document knows about itself • Positioning information • Things the document knows about the norms it expresses and the legal system it belongs to • Lifecycle information • Special moments in the history of the document and of its norms, and the list of other documents that justify them • Editorial notes • Things the editor wants to attach to specific parts of the document but cannot, since the DTD does not allow editorial intervention on content • Iter-connected texts • The history of the document before its approval • Proprietary extensions • Provisions (disposizioni)
Reflection info (descrittori) • Refers to the document, not its content • Publication date. Re-publications. Errata. Official clarifications. • URN(s), aliases • Objective data, easy to find even with low competence • Storing freshness information? • A document does not usually know whether it is up-to-date. We may deal with stale documents, dead web sites, CD-ROMs • The best we can do is to provide them with a last-updated date • The normative system will confirm whether this is the last interesting date, or whether more recent versions of the same document exist
Positioning info (inquadramento) • Refers to the norms contained in the doc • Missing parts • Rank, function, nature and proposers of the law • Keywords and taxonomies they belong to • Objective data (mostly), but requiring high competence to write down.
Lifecycle (altriatti) - 1 • Over time, documents undergo changes (in content, efficacy, power and so on) • These changes happen at specific points in time and depend on specific documents (modification documents). • Usually modification documents specify several changes on the same modified document, and may specify multiple modification dates. • Therefore it makes sense to create a secondary structure where all relevant moments and documents can be matched
Lifecycle (altriatti) - 2 • [Figure: timeline of a document's lifecycle. Events t01 to t05 (labels: original, modified, suspended, resumed, repealed), versions v01 and v02, dates 1/1/1996, 1/3/1997, 12/6/1998, 24/9/1999, 1/1/2001]
Lifecycle (altriatti) - 3 • The lifecycle section only provides information about the relation to the document that causes the modifications • This information is objective and can be provided with low competence • Information about each actual modification is optional and placed in the provision section. • That information is sometimes subjective and can be provided only with significant competence
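The "secondary structure" that matches moments with the documents causing them can be pictured as a plain list of events. The sketch below is not the NIR vocabulary: the `Event` fields, the `urn:example:` names and the pairing of ids, dates and effects are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    id: str      # id that other metadata sections can point to
    when: date   # moment at which the change applies
    source: str  # document that causes the change (placeholder name)
    effect: str  # coarse label for the kind of change


# Hypothetical lifecycle; one modification document causes two events.
LIFECYCLE = [
    Event("t01", date(1996, 1, 1), "urn:example:original", "original"),
    Event("t02", date(1997, 3, 1), "urn:example:mod-a", "modified"),
    Event("t03", date(1998, 6, 12), "urn:example:mod-b", "suspended"),
    Event("t04", date(1999, 9, 24), "urn:example:mod-b", "resumed"),
]


def events_up_to(events, day):
    """Events that have already happened on the given day, oldest first."""
    return sorted((e for e in events if e.when <= day), key=lambda e: e.when)


def events_by_source(events):
    """Group event ids by the document that causes them."""
    grouped = {}
    for e in events:
        grouped.setdefault(e.source, []).append(e.id)
    return grouped
```

Recording only the event, its date and its source document keeps this section objective; what each modification actually does to the text is left to the provisions.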
Other sections • Editorial notes (redazionale) • Footnotes, comments, and any text the editor feels like adding. It can point to specific places in the text through <ndr> elements • Iter-connected data (lavoripreparatori) • An official blurb detailing the iter for the approval of the act, with presentation dates, discussion dates, etc. Plain text. • Proprietary • An open-ended section where editors can add their own metadata with freedom.
Provisions • Provisions describe the meaning of each meaningful fragment of the text according to a predefined (and hopefully complete) taxonomy (ontology???) • Divided in three main sections plus a residual category: • Justifications • Analytical provisions • Modifications • Other
Justifications • Some norms (e.g., decrees) introduce before the actual text a foreword providing a number of justifications: • Considered… • Consulted… • Based on a proposal by • Considering… • Etc.
Analytical provisions • Describe properties and meaning of fragments of the actual text. • A full taxonomy exists, including concepts like definition, obligation, right, etc. • Carlo will be speaking about them
Modifications • In a modifying law, each modification can be described in detail with a provision. • The provision describes in detail what kind of modification, the document it is applied to, where inside it, and when. • Possible modifications are: abrogation, substitution, insertion, renumbering, change of terms, prorogation, repetition, suspension, retro-activity, ultra-activity, etc. (a total of 24 different types). • Currently there is no way to express the normal case (dies coactu = dies valens = 15 days after publication for the whole act), but a way will be found soon.
Arguments for provisions • All provisions have some specific arguments, plus some shared arguments • E.g.: <motivazioni> <regole> <obbligo> <pos href="#art12com5"/> <destinatario>sindaco</destinatario> <controparte>ufficio tributi</controparte> <termine da="r01" a="r02"/> </obbligo> … </regole> • Important shared arguments are positions and terms
Positions • All provisions point to a position inside the document where the text of the provision is placed. <articolo id="art1"> <num>1.</num> <comma id="art1-com1"> <num>1</num> <corpo>blah blah</corpo>… <obbligo> <pos href="#art1-com1"/> <destinatario>xxx</destinatario> <controparte>y1</controparte> </obbligo> • The pos element points to the id, or XPointer, or the text content, of the part of the document that contains the provision.
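A minimal sketch of how a `pos` element might be resolved, assuming its `href` carries `#` followed by the id of the target element. The `<doc>` wrapper and the function name `resolve_pos` are hypothetical; XPointer and text-content targets are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: text and provision side by side under a made-up root.
DOC = """<doc>
  <articolo id="art1">
    <num>1.</num>
    <comma id="art1-com1"><num>1</num><corpo>blah blah</corpo></comma>
  </articolo>
  <obbligo>
    <pos href="#art1-com1"/>
    <destinatario>xxx</destinatario>
    <controparte>y1</controparte>
  </obbligo>
</doc>"""


def resolve_pos(root, pos):
    """Return the element whose id matches the pos href, or None."""
    href = pos.get("href", "")
    if not href.startswith("#"):
        return None  # other kinds of target are not handled in this sketch
    target_id = href[1:]
    for element in root.iter():
        if element.get("id") == target_id:
            return element
    return None


root = ET.fromstring(DOC)
pos = root.find("./obbligo/pos")
target = resolve_pos(root, pos)
```

Here `target` is the `comma` element that holds the text of the obligation, so the provision and the text it describes stay linked without the provision being embedded in the content.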
Terms • Specify conditions, and specific efficacy (dies coactu) and validity (dies valens) intervals. • No formal language exists yet for specifying conditions • E.g.: “after the approval of the corresponding regulation” • Dates are specified by referring to the id of the relevant date as placed in the lifecycle section.
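Resolving a term against the lifecycle section amounts to a lookup by id. The sketch below assumes a `termine` element with `da` (from) and `a` (to) attributes, as in the provision example earlier in these slides; the date table, the inclusive interval and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical ids and dates standing in for the lifecycle section.
LIFECYCLE_DATES = {
    "r01": date(2001, 1, 1),
    "r02": date(2001, 12, 31),
}


def lookup(ref, dates):
    """Map a date id to a date; None means the bound is left open."""
    if ref is None:
        return None
    return dates[ref]  # raises KeyError on a dangling reference


def resolve_term(termine, dates):
    """Return the (start, end) dates named by the da/a attributes."""
    return lookup(termine.get("da"), dates), lookup(termine.get("a"), dates)


def in_interval(termine, dates, day):
    """True if day falls within the interval (bounds assumed inclusive)."""
    start, end = resolve_term(termine, dates)
    return (start is None or start <= day) and (end is None or day <= end)


termine = ET.fromstring('<termine da="r01" a="r02"/>')
interval = resolve_term(termine, LIFECYCLE_DATES)
```

Because terms store only ids, a corrected date in the lifecycle section is picked up by every provision that refers to it.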
Conclusions • Metadata are still under heavy evolution within the NIR WG. • In the last 4 months, major work has started on a systematic analysis of the desired metadata information for NIR documents. • I haven’t even mentioned namespaces • Some details are still shaky (required elements, repeatable elements, conditions, default values), but the structure should be reasonably stable. • These are not in the published version: it is still way too early.