Structured Management of Digital Content and Licenses
Electronic Publishing, Digital Archiving and Licensing workshop, Frankfurt, October 20 2005
Norman Paskin, International DOI Foundation, n.paskin@doi.org
Structured Management of Digital Content and Licenses
Outline:
• Define terms in the title
• Two principles: identification and description
• Identification: resolution, persistence, interoperability
• Internet identifiers: URI, URN, is DNS enough?
• What do we need to identify?
• Description: what is it we are identifying?
• Metadata: taxonomies, ontologies, folksonomies
• Summary of key issues
Structured Management of Digital Content and Licenses
Management:
• Know what it is you are managing: label it
• Requires a unique label for each entity involved in a DRM transaction
• An identifier string, which can do something
Digital Content and Licenses:
• Entities in transactions: stuff, people, deals (= content, users, licences)
• indecs: “people make stuff, people do deals about stuff; stuff is used by people”
• Same system for all these entities, using internet standards
Structured:
• Objective: capable of being used in distributed systems
• Someone else can come along at another time/place, and may need to link to another system, etc.
• So must be persistent and interoperable (which means: description)
Two principles for persistent identification
1. Obvious: IDENTIFICATION (assign an ID to a resource)
  • Once assigned, the identifier must keep identifying the same resource
  • Beyond the lifetime of the resource, or of the assigner
2. Less obvious: DESCRIPTION (assign a resource to an ID)
  • The resource must be described
  • If the resource is not always securely and exclusively bound to the ID, then describe the resource “content” [with precision]
  • Failure to do this will ultimately break interoperability
• How far do we go in each? Depends on what is “good enough”
• Technologists have focussed on (1) [and on “bags of bits/data structures”]
• The content/rights world has focussed on (2) [and on “intellectual content”]: ISBN etc.
• Both viewpoints are valid
• (2) is now becoming more relevant, because systems are more open and distributed
Identifiers do something
• Identifier: a unique label for an entity involved in a transaction
• Note the ambiguity of the word “identifier”:
  • Label (e.g. ISBN)
  • Specification (e.g. URN): a scheme for making a label actionable
  • Implemented system (e.g. DOI, bar code): label + specification = “actionable identifier”
• But pure versus actionable identifier is not a clear distinction: any pure identifier may become actionable in the future through new specifications being applied
• Resolution: the process in which an identifier is the input (a request) to a network service to receive in return a specific output
• Both concepts are in principle neutral as to technology implementation
• Abstract concepts, but implementations are typically at least “internet” TCP/IP (the more general the better, e.g. not just “Web”)
Technical and social infrastructure issues
Persistence
• “It is intended that the lifetime of a [persistent identifier] be permanent. That is, the [persistent identifier] will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name.”
• [Persistent identifier] = URN in IETF RFC 1737: Functional Requirements for Uniform Resource Names (http://www.ietf.org/rfc/rfc1737.txt)
Interoperability
• Persistence can be seen as just one aspect of this wider concept
• “persistence is interoperability with the future”
• We know what we mean, but others may not.
• Identifiers assigned in one context may be encountered, and may be re-used, in another place or time [= persistence], without consulting the assigner. You can’t assume that the assumptions you made on assignment will be known to someone else.
• Interoperability = the possibility of use in services outside the direct control of the issuing assigner
• This will be key for publishing, archiving and licensing: all assume distributed access
Persistent identifiers on the Internet: DNS
• Domain Name System (DNS)
• Designed primarily as a level of indirection for IP addresses: 132.157.24.3 is a machine. If you move server.acme.com to another machine, you don't have to tell everyone; you just change your DNS records so that the name now points to 132.157.24.6 instead.
• A number of assumptions that were valid at that time now pose problems:
  • All the data is public: difficult for use in applications like voice over IP.
  • The data can be implicitly trusted: but you need some way to trust that you are talking to who you think you are talking to.
  • The names can all be in ASCII: but Chinese etc. is important after all.
  • Administration will be done by sys admins sitting at consoles: no need for an administrative protocol. Ownership is then naturally at the level of whoever owns the servers and pays the sys admins.
  • Control of the naming authority will not be a problem: but control of ICANN and the root zone file is now the subject of a very active UN row (WSIS).
• DNS was designed for servers:
  • When Tim Berners-Lee came out with a plan for linking documents, it seemed natural to build on DNS: tack file paths onto the end of the server names in order to identify the business ends of the links, giving URLs (now URIs).
  • But now the documents are identified starting with the names of the organizations that own the servers they sit on. A problem.
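The indirection described on this slide, and the name/place problem it leaves behind, can be sketched in a few lines of Python. This is a toy illustration only: the `dns` dictionary stands in for real DNS records, the file path and the `archive.example.org` host are hypothetical, and only the server name and the two IP addresses come from the slide.

```python
from urllib.parse import urlsplit

# Toy stand-in for DNS records, using the name and addresses from the slide.
dns = {"server.acme.com": "132.157.24.3"}

# Hypothetical document URL: the server owner's name is part of the identifier.
url = "http://server.acme.com/reports/paper.pdf"
host = urlsplit(url).hostname

ip_before = dns[host]

# The server moves to another machine: one DNS record changes, the URL does not.
dns["server.acme.com"] = "132.157.24.6"
ip_after = dns[host]

# The document moves to another organisation (hypothetical host):
# now the identifier itself has to change, and old links break.
new_url = url.replace("server.acme.com", "archive.example.org")
```

DNS absorbs the first kind of change (a new machine) but not the second (a new owner), which is the name/place confusion the later slides return to.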
Persistent identifiers on the Internet: Handle
• DNS is not essential to the underlying TCP/IP network, but just to the current use of that network.
• One proposed solution to the DNS problems: the Handle System (1995+)
  • Identifies objects, not servers.
  • Objects can be anything identified: accounts, names, ids, phone #s, content…
  • Explicit improvements for identifying very large numbers of digital objects.
  • Not all the data is public: individual values within a handle can be private.
  • All transactions can be certified.
  • Any Unicode character set can be used.
  • Separation between who owns and controls the handle and who happens to run the servers (distributed administration, ownership at the handle level).
  • Gets rid of semantics in the identifier: makes it easy to move ownership across organizations without your objects carrying someone else's name.
  • Freely available to be used as the engine underneath other named identifiers. Does not need DNS, but can work with DNS.
• Basis of the DOI system: advantages as above, proven for publishers. Used in Grid computing, US govt applications, DOI, etc., though most DOIs are used in translated http proxy form.
• “The governance of the DNS will not completely encompass future Internet addressing and navigation…The system…is not static but a technology capable of evolving into a better form. As such, the current system should not be treated as sacrosanct, but amenable to innovation”. Kenneth Neil Cukier (Technology Correspondent, The Economist)
• However, most identifier methodologies still use the DNS basis: URI, URN
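Two of the points above, typed values that may be public or private and the “translated http proxy form” of a DOI, can be illustrated with a small Python sketch. It is not the Handle System API: the `HandleValue` class, the `records` table, the `10.9999/abc123` identifier and both helper functions are hypothetical, and the default proxy address is an assumption about how such a proxy is typically addressed.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class HandleValue:
    type: str            # kind of value, e.g. "URL" or "EMAIL"
    data: str
    public: bool = True  # individual values within a handle can be private


# Hypothetical record: the identifier carries no server or organisation name.
records = {
    "10.9999/abc123": [
        HandleValue("URL", "https://publisher.example/item/abc123"),
        HandleValue("EMAIL", "rights@publisher.example", public=False),
    ]
}


def public_values(handle: str) -> list:
    """Return only the values an anonymous client is allowed to see."""
    return [v for v in records.get(handle, []) if v.public]


def proxy_url(doi: str, proxy: str = "https://doi.org/") -> str:
    """Express a DOI in HTTP proxy form (proxy address is an assumption)."""
    return proxy + quote(doi, safe="/")
```

For example, `proxy_url("10.9999/abc123")` gives a URL a browser can follow, while the identifier itself stays independent of whichever server currently holds the content.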
URI: observations
• Web based (W3C led). Still much wider uptake than DOI etc. Takes DNS as basis.
• Problems:
  • URLs, as currently understood, are demonstrably not persistent: calling them URIs doesn’t fix that
  • Inherits DNS problems (see the DNS slide), especially the name/place confusion
  • Many important recent developments are not based on URIs in any way, e.g. VoIP (Skype), peer-to-peer
  • Some are URI based but with different registration requirements (MPEG-21)
  • The Web is not the end point of evolution: grid computing, mobile computing
  • The IETF RFC consensus process, and the separate existence of W3C, lead to ongoing debate and standards with a vague existence (cf. ISO standards; the W3C web site on naming and addressing is “incomplete”)
• That persistence is a matter of organisation is now becoming recognised, and technical solutions should follow
  • e.g. a “commitment statement” in archiving is seen as important (ARK)
  • e.g. IDF has established rules for social network support of DOIs
  • Importance of social infrastructure
• The URN mechanism (>10 years old) was meant to be the solution:
  • But still not implemented; recent renewed interest may help
URN: observations
• URN (Uniform Resource Name): using DNS to add names to locations
• Part of the mid-90s IETF design concept: URL/URN/URC
• Still inherits problems of DNS, but better than URL
• But not widely used
• A single point of redirection to URLs using an http: proxy server
• Any existing identifier can add the URN spec:
  • isbn:123456789 as a URN = urn:isbn:123456789
• Assumes a DNS-based Resolution Discovery Service (RDS)
  • No such widely deployed RDS schemes currently exist: browsers cannot action URN strings without some additional programming “plug-in”.
• Some have been built for individual communities
  • Example: Life Science Identifier (LSID)
  • Fine, but also needs a social infrastructure
  • Functionally gives nothing beyond what is achieved by coherent management of the corresponding URLs,
  • but they work for that community, by adding that coherent management.
• URN code or plug-in promised for CENDI (US government users). Some movement to “re-define URN”. If that happens and is taken up, it could be significant.
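The point that any existing identifier can take on URN syntax is purely a matter of string form, as the following Python sketch shows. The function names `to_urn` and `parse_urn` are illustrative, and the sketch only handles the basic `urn:<namespace>:<value>` shape; it does not resolve anything, which is exactly the gap the slide describes.

```python
def to_urn(namespace: str, value: str) -> str:
    """Wrap an existing identifier in URN syntax: urn:<namespace>:<value>."""
    return f"urn:{namespace.lower()}:{value}"


def parse_urn(urn: str) -> tuple:
    """Split a URN into (namespace, value); raise ValueError if malformed."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return parts[1].lower(), parts[2]


isbn_urn = to_urn("isbn", "123456789")   # the example string from the slide
namespace, value = parse_urn(isbn_urn)
```

Producing or parsing the string is trivial; turning it into an action still requires a resolution service that knows what to do with the `isbn` namespace.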
Identifier systems
• Each community tends to arrive at its own “good enough for us” solution
• Less focus now on “what is a persistent identifier?”, more on “how do we build a system…”
• Whatever the mechanism, resolvable identifiers must provide:
  • Agreed numbering syntax
  • Resolution mechanism
  • Data model to define “what it is we are identifying”
  • Technical and social infrastructure to implement
• (compare physical world bar codes, etc.)
• Could be assembled ad hoc, or offered as a packaged system (e.g. DOI)
Identifying entities of all types
• Resources: most commonly content (stuff)
• Licences: some music industry applications are now looking at this (deals)
• Parties (see the earlier InterParty project), including institutions (people):
  • e.g. an exploratory stakeholders' meeting took place in Washington DC on October 7 to examine the feasibility of an Institution Registry
  • Problem: libraries deliver contact names and numbers, IP address ranges, etc. to publishers
  • Publishers manage this in their access and subscription systems in order to be able to authenticate library users
  • This exchange of information is usually done individually between publishers and libraries: much duplication of effort, no possibility of synergy
  • An Institution Registry could at minimum provide a central space to hold this information once only.
Resolution and “What are we identifying?”
• Resolution: the process in which an identifier is the input (a request) to a network service to receive in return a specific output
• An identifier identifies an entity.
• “What I point to” (resolve to and get) is not always “what is identified”
• We can identify, but not “get” directly, things that are intangible (works), fugitive (performances) or changing (“Today’s NY Times”), or people and concepts…
• Pointing and clicking can return different things in different contexts, or give multiple options
• Entities can be physical, abstract, tangible, intangible, things, people, concepts, colours…
• Resolution provides a mechanism to describe the resource “content” through a service which delivers a description
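The difference between “what is identified” and “what I get” can be made concrete with a small Python sketch. Everything here is hypothetical: the `RECORD` structure, the `work:example-1` identifier, the URLs and the `resolve` function are illustrative stand-ins, not any real resolver's data model.

```python
# Hypothetical record for an abstract work. The work itself cannot be "got",
# but a description of it and several manifestations can be returned.
RECORD = {
    "id": "work:example-1",
    "description": {"type": "abstract work", "title": "Example Report"},
    "manifestations": {
        "html": "https://publisher.example/report.html",
        "pdf": "https://publisher.example/report.pdf",
    },
}


def resolve(record: dict, want: str = "options"):
    """Return a different output for the same identifier depending on context."""
    if want == "description":
        return record["description"]            # a description of the entity
    if want in record["manifestations"]:
        return record["manifestations"][want]   # one particular manifestation
    return sorted(record["manifestations"])     # otherwise: list the options


as_pdf = resolve(RECORD, "pdf")
as_description = resolve(RECORD, "description")
as_options = resolve(RECORD)
```

The identifier stays attached to the work; the request context decides whether a file, a description or a menu of choices comes back.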
What are we identifying?
“What I point to” (resolve to and get) is not always obvious. For a document on screen:
• Abstract work?
• Manifestation of abstract work?
• Version?
• This HTML file?
• All/some of these?
Describing what we are managing
• What precisely are we identifying by this identifier?
• How are these things related to other things?
Common approaches:
• Taxonomies
• Ontologies
• Folksonomies
Taxonomy
• (Greek) taxis, arrangement; + -nomie, method
• Division into ordered groups or categories
• Hierarchical, parent/child relationships
• Defined area of interest
• Gives a good way of being unambiguous within a controlled, defined area
• Best example is the Linnean taxonomy of life: the classification of organisms in an ordered system that indicates natural relationships
• And that illustrates a key point…
Taxonomy
• “It’s a Robin”
• Id = Robin
• …and we all know what a Robin looks like…
• “we know what we mean but others may not”
Chordata | Aves | Passeriformes | Turdidae | Erithacus | Rubecula: European Robin
Chordata | Aves | Passeriformes | Turdidae | Turdus | Migratorius: American Robin (different genus)
Chordata | Aves | Passeriformes | Eopsaltridae | Petroica | Multicolor: Scarlet Robin (Australasia) (different family)
? | ? | ? | ? | ? | ? Robin (red) (and Batman)
? | ? | ? | ? | ? | ? Robin Reliant (red)
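The Robin slides can be restated as a short Python sketch: the bare label “Robin” matches several entries, while the full classification path separates them. The classification data is taken from the slides (with the order written as Passeriformes); the rank names in `RANKS` and the two helper functions are added here for illustration.

```python
# Rank names are an addition for illustration; the paths come from the slides.
RANKS = ("phylum", "class", "order", "family", "genus", "species")

TAXA = {
    "European Robin": ("Chordata", "Aves", "Passeriformes", "Turdidae", "Erithacus", "rubecula"),
    "American Robin": ("Chordata", "Aves", "Passeriformes", "Turdidae", "Turdus", "migratorius"),
    "Scarlet Robin": ("Chordata", "Aves", "Passeriformes", "Eopsaltridae", "Petroica", "multicolor"),
}


def candidates(label: str) -> list:
    """All entries whose common name contains the label."""
    return sorted(name for name in TAXA if label.lower() in name.lower())


def first_difference(a: str, b: str):
    """Name of the highest rank at which two entries diverge, or None."""
    for rank, x, y in zip(RANKS, TAXA[a], TAXA[b]):
        if x != y:
            return rank
    return None


matches = candidates("Robin")
```

Inside the taxonomy the ambiguity is resolvable; the sidekick and the three-wheeled car have no path at all, which is where a taxonomy confined to one defined area stops helping.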
Ontologies
• Differ from the taxonomic approach:
  • Not just “stamp collecting” but extensible
  • Do not follow a rigid parent/child hierarchical structure: terms may inherit meaning from more than one parent
  • A more complex relationship is maintained.
• Can build on, and are more complex than, taxonomies
  • Show how taxonomies map to each other
  • May add inference engines etc.
• The proposed third (missing) component of the semantic web:
  • XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean.
  • RDF enables expression of meaning (sets of triples, each triple being rather like the subject, verb and object of a sentence)
  • Ontologies “will enable machines to comprehend semantic documents and data”
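Two ideas from this slide, statements as subject/verb/object triples and terms inheriting from more than one parent, can be sketched in plain Python. The triples below are hypothetical examples and the `subClassOf` predicate name is borrowed loosely from RDF Schema vocabulary; no RDF library is used, and the `ancestors` function is a very small stand-in for an inference engine.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = {
    ("EBook", "subClassOf", "Book"),
    ("EBook", "subClassOf", "DigitalFile"),   # a second parent: not a strict tree
    ("Book", "subClassOf", "Manifestation"),
    ("DigitalFile", "subClassOf", "Resource"),
}


def ancestors(term: str) -> set:
    """Everything the term inherits from, following subClassOf transitively."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for subject, predicate, obj in TRIPLES:
            if subject == current and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


ebook_ancestors = ancestors("EBook")
```

A taxonomy would force `EBook` under a single parent; here it inherits from both branches, and the transitive lookup shows the kind of conclusion a machine can draw from the stated triples alone.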
Ontologies
• Use an underlying data model, a “context model”, to express an events-based structure
  • the accepted ontology approach [context based = events and states]
• We often think of metadata as “about” things, people, etc.
  • static views, e.g. about “person A”; “creation B”
• Events link things (e.g. to describe rights activities) by relating things and people in the context which generated/used them
  • dynamic views, e.g. “A created B”
• Events description is the key to “rights metadata”
  • all such transactions are contextual (events)
  • describing the event in context, using formal dictionary terms, enables semantic interoperability
• The common methodology with most uptake and promise is the <indecs> one
  • developed in more detail by CONTECS and by Rightscom
  • MPEG21 RDD is the first result of the extended methodology
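The contrast between static “about” views and dynamic event views can be sketched as follows. This is a loose illustration of the idea only, not the indecs or MPEG21 RDD data model: the `Event` fields, the sample events and the `about` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    agent: str      # who
    verb: str       # a term that would come from a formal dictionary
    resource: str   # what
    time: str       # context: when
    place: str      # context: where


# Hypothetical events: "A created B", then a deal about B.
events = [
    Event("Person A", "created", "Creation B", "time-1", "place-1"),
    Event("Publisher C", "licensed", "Creation B", "time-2", "place-1"),
]


def about(entity: str) -> list:
    """Derive the static 'about' view of an entity from the events it took part in."""
    return [
        f"{e.agent} {e.verb} {e.resource}"
        for e in events
        if entity in (e.agent, e.resource)
    ]


creation_view = about("Creation B")
person_view = about("Person A")
```

The events are recorded once, with their context; the familiar static metadata about a person or a creation then falls out as a view over them.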
1998-2005: Defining what is identified through metadata
[Figure: Development of indecs 1998-2005. Labels recoverable from the diagram: indecs (2000); CONTECS (2001+); ISO MPEG21 RDD; IDF + ONIX; indecsDD; OntologyX; Mi3p etc.; EU project -> indecs Framework Ltd; IFPI/RIAA, MPA, IDF, DentsuMMG, Rightscom; Int DOI Foundation (2005). Legend: black = what, red = who.]
Folksonomies
• Current hot web topic: individuals assign their own keywords to content
• Examples:
  • www.flickr.com (photo-sharing)
  • http://del.icio.us/ (social bookmarking)
Folksonomies
• Rough and ready alternative to traditional information organisation
• Most people use tags first and foremost to organise their own information in a way that makes sense to them
• Sharing this creates a side-effect of “vast democratically structured frameworks of organisation”
• Not much good for managed structured searching/management:
  • e.g. “recipe” “cooking” “barbecue”
  • the Robin problem
• But don’t write them off:
  • cf. Wikipedia (people said it would never work…)
  • imagine some automated organisation/rules/dictionary being added in certain communities
  • imagine links to Autonomy-type searching
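Why free tagging is weak for structured searching, and what “some dictionary being added” might buy, can be sketched in Python. The bookmarks, the tag sets and the `search` function are hypothetical; the three cooking tags are the ones quoted on the slide.

```python
from collections import Counter

# Hypothetical bookmarks, each tagged by its owner in whatever way suited them.
bookmarks = {
    "page-1": {"recipe", "barbecue"},
    "page-2": {"cooking"},
    "page-3": {"robin"},   # bird, sidekick or car? The tag alone does not say.
}


def search(tag: str, dictionary=None) -> list:
    """Find items carrying the tag, or any term the dictionary maps it to."""
    wanted = {tag} | set((dictionary or {}).get(tag, ()))
    return sorted(item for item, tags in bookmarks.items() if tags & wanted)


# The "democratic" side-effect: counts of how often each tag is used.
tag_counts = Counter(tag for tags in bookmarks.values() for tag in tags)

plain = search("cooking")
with_dictionary = search("cooking", {"cooking": {"recipe", "barbecue"}})
```

Without the dictionary, a search for “cooking” misses the barbecue recipe; with a small community-maintained mapping it is found, which is the kind of added organisation the last bullets imagine.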
Summary: key issues
• What are we identifying? [content not just bits]
• What are we resolving to from this identifier?
• What, if any, explicit metadata are we making available?
• How will the social infrastructure be provided?
The mechanisms must allow:
• Identification of entities of all forms
• Use in a variety of contexts
• Appropriate use of metadata at the appropriate level
• Development of ontology tools to describe entity relationships
The logic chain: Identification Persistent Interoperable Automation Precision Logic