Citation, Digital Object Identifiers, Persistence, Correction and Metadata
CEDA now has a mechanism for minting Digital Object Identifiers (DOIs). This means we need to finalise some decisions about rules of behaviour, which in turn raises some interesting issues.
Traditional Journal View
Let's start by considering how DOIs actually work for traditional journal articles:
User dereferences a DOI via http://dx.doi.org/
dx.doi.org redirects to a landing page URL. (This is the entire job of the DOI handle system done!)
The publisher owns the landing page, and prominent on that page is some metadata about the paper of interest, J' (author, abstract, correct citation etc.), and a link to the actual object of interest (J, generally a pdf). NB: There may be a paywall between the landing page and the object of interest.
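The three steps above can be sketched as a toy model. This is only an illustration, not how the handle system is implemented: the DOI `10.1234/example`, the landing page URL, and the names `HANDLE_TABLE`, `resolver_url` and `resolve` are all hypothetical, and the dictionary lookup stands in for the redirect that dx.doi.org performs.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"

# Hypothetical handle record: DOI -> landing page URL (owned by the publisher).
HANDLE_TABLE = {
    "10.1234/example": "http://publisher.example.org/articles/example",
}


def resolver_url(doi: str) -> str:
    """Build the URL a user dereferences for a given DOI."""
    # Keep the prefix/suffix slash; percent-encode anything URL-unsafe.
    return RESOLVER + quote(doi, safe="/")


def resolve(doi: str) -> str:
    """Return the landing page URL. This lookup is all the resolver does."""
    return HANDLE_TABLE[doi]


print(resolver_url("10.1234/example"))
print(resolve("10.1234/example"))
```

Everything after the redirect (the metadata J', the link to J, any paywall) belongs to the publisher's landing page, not to the resolver.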
Publishers can change the landing page any time they like, but conventionally you had better be able to get to your digital object from there. There are some other de facto rules too:
If there is a new version of the paper, a new DOI is needed.
Landing pages can have query-based links to other things (papers which cite this one, etc.).
The metadata (J') shouldn't change (since it is indelibly linked to the paper). I suppose if the metadata had inadvertently left out an author, a publisher might update it and hope no one noticed ... but there is an important thing to understand about this metadata:
It describes the digital object and represents it faithfully. It ought not to change, since any change to it ought to reflect a change to the digital object (and that should trigger a new DOI) ...
and the original landing page can indicate that a newer version of J exists, but it should still point to the older version!
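One way to picture these de facto rules is as an immutable record, where the only way to "change" the metadata is to create a new version under a new DOI. This is a minimal sketch under my own assumptions: the class `PaperRecord`, the helper `new_version` and the DOIs are hypothetical and do not come from any publisher's system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PaperRecord:
    """The metadata J' for one version of a paper (frozen, so it cannot be edited)."""
    doi: str
    title: str
    authors: tuple
    version: int = 1


def new_version(old: PaperRecord, new_doi: str, **changes) -> PaperRecord:
    """Create the next version; a changed object must carry a new DOI."""
    if new_doi == old.doi:
        raise ValueError("a new version needs a new DOI")
    return replace(old, doi=new_doi, version=old.version + 1, **changes)


v1 = PaperRecord("10.1234/paper.v1", "A paper", ("A. Author",))
v2 = new_version(v1, "10.1234/paper.v2", authors=("A. Author", "B. Author"))

# v1 is untouched: its DOI still describes the original object.
print(v1)
print(v2)
```

Trying to assign to a field of `v1` raises an error, which mirrors the expectation that J' does not change under an existing DOI.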
So the DOI system is effectively used as follows:
A DOI resolves to a representation of a record that describes a digital object which is retrievable in its own right, and that representation can carry lots of other extraneous stuff including material built from interesting queries related to the object of interest.
(In most cases the primary representation of the landing page is html, but other representations, including rdf, might exist.)
OK, now consider the data publication version of this story.
Consider a real world observation (O), which used a process (P) to produce a result (R). (Aficionados will recognise a cut down version of O&M.) Expect that we have described these things with metadata O', P' and R', which describe the observational event, the process used, and the result data (syntax mainly). You might think of it like this:
So what does data publication actually mean in this context? Sure we can assign a DOI to each of the O', P', R' entities, but why would we do that, what value would that have over using a URL?
We believe (pdf) that the concept of Publication is an important one, which is distinct from publication (making something available on the web). Publication (with the capital) carries connotations of both persistence and some sort of process to decide on fitness for purpose (peer review in the case of academic Publication).
So how do we peer review data? In our opinion you can't Publish data without Publishing adequate descriptions of at least the observation event, the process used, and the resulting data. That is, we want to see a form of peer review of the (O',P',R') triumvirate. (O&M aficionados: for the purpose of this discussion I've collapsed the phenomenon, sampling feature, and feature of interest into the result description; don't get hung up on that.)
Strictly, in the O&M view of the world, the process description can be reused, so we might assign DOIs to each process (P').
Clearly we can assign a DOI to R', but we would argue (Lawrence et al 2009) that without O' and P', the data is effectively not fit for wide use (and hence Publication). So, better would be to assign a DOI to O' and ensure that O' points to R' and P' (as it must do, since that's how it's defined).
(Not having a DOI for R' is consistent with the O&M use of composition for R as part of O, but actually I think that's a bit broken - in O&M - for the case where we want to have observations which are effectively collections of related observations, but that's a story for another day.)
The html representation of O' itself (not to be confused with the landing page for O') could link to R' and P', or it could directly compose their content onto the page.
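The composition described above, with the DOI on O' and O' pointing to P' and R', might be modelled like this. It is a sketch only: the class and field names, the DOI and the data URL are hypothetical, and real O&M records carry far more than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessDescription:
    """P': describes the process P. May have its own DOI, since it can be reused."""
    summary: str
    doi: Optional[str] = None


@dataclass(frozen=True)
class ResultDescription:
    """R': describes the result data R (syntax mainly). No DOI of its own here."""
    data_url: str
    data_format: str


@dataclass(frozen=True)
class ObservationDescription:
    """O': the citable record. It must point to both P' and R'."""
    doi: str
    event: str
    process: ProcessDescription
    result: ResultDescription


p = ProcessDescription("hypothetical instrument and processing chain")
r = ResultDescription("http://data.example.org/r.nc", "netCDF")
o = ObservationDescription("10.1234/obs.1", "hypothetical field campaign", p, r)

# Following the one DOI reaches all three descriptions.
print(o.doi, "->", o.process.summary, "|", o.result.data_url)
```

Whether the html representation of O' links to P' and R' or inlines their content is a rendering choice; the record structure is the same either way.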
Now we have some interesting issues. Can we imagine R' being updated because of incompleteness or inaccuracy? Well, yes, but hopefully it's about as likely as journal metadata changes. From a practical point of view, a cheeky change in R' wouldn't affect the conclusions that one might make about how and why someone used R itself. So, we might get away with fixing R' in place, but in principle one shouldn't.
What about changes to P', our description of P? If P' were changed in place, someone following a citation to O' might get entirely the wrong idea of why someone used R (or misunderstand the usage). Hence, the right thing to do would be to:
create a new description of P (call it P+),
create a new description of O (call it O+), and ensure it composes P+,
review these new descriptions, then
mint a new DOI for O+ (and one for P+ as well, if desired),
create new landing pages accordingly, and
update the landing page for O' (and the landing page for P' if it exists separately) to indicate that it has been superseded by a newer version.
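The steps above can be sketched as a small function operating on a registry of landing pages. This is an assumption-laden illustration, not CEDA's implementation: `LandingPage`, `supersede`, the `reviewed` flag and the DOIs are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LandingPage:
    doi: str
    target: str                          # the record this page describes, e.g. "O'"
    superseded_by: Optional[str] = None  # DOI of the newer version, if any


def supersede(pages: Dict[str, LandingPage], old_doi: str,
              new_doi: str, new_target: str, reviewed: bool) -> LandingPage:
    """Mint a landing page for the new description and flag the old one."""
    if not reviewed:
        raise ValueError("new descriptions must be reviewed before a DOI is minted")
    if new_doi in pages:
        raise ValueError("a new version needs a new DOI")
    new_page = LandingPage(new_doi, new_target)
    pages[new_doi] = new_page
    # The old page keeps pointing at the old record; it only gains a notice.
    pages[old_doi].superseded_by = new_doi
    return new_page


pages = {"10.1234/obs.1": LandingPage("10.1234/obs.1", "O'")}
supersede(pages, "10.1234/obs.1", "10.1234/obs.2", "O+", reviewed=True)

for page in pages.values():
    print(page)
```

Note that the old DOI still resolves to a page describing O'; nothing about the original record is rewritten.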
Note that none of this discussion depends on the detail of the underlying representation(s) of each of O', P', R', P+, or O+; these could utilise any technology (including OAI/ORE etc.). However, for each object, one of the representations should represent the (apparently) immutable digital object linked via the landing page and the DOI. If that particular representation became unavailable, the replacement representation would become effectively a new edition of the digital object, and the landing page would be modified accordingly. It might be acceptable, and probably has to be for practical reasons (since such changes are likely to be associated with software changes), that the original DOIs resolve to landing pages that no longer point to the old primary representation, but they had better make the evolution clear.
(Note that one would not countenance the data object R itself being allowed to change format without resulting in a new DOI.)
In terms of the physical infrastructure, one could imagine the landing pages being dynamically generated from the metadata records they are linking to, along with queries making other associations etc. One could even imagine multiple representations of the landing page itself (e.g. XHTML, multiple dialects of XML, RDF etc.). Such different representations could be available by content negotiation, but to reiterate once again: even if multiple representations are available for the digital objects of interest, only one is the object for which the DOI was assigned.
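As an illustration of the content negotiation idea, here is a deliberately simplified sketch of choosing a landing page representation from an HTTP Accept header. The names `REPRESENTATIONS`, `DEFAULT`, `parse_accept` and `negotiate` and the file names are hypothetical, and the parsing only handles exact media types, `q` values and `*/*`, not partial wildcards such as `text/*`.

```python
# Hypothetical representations of one landing page, keyed by media type.
REPRESENTATIONS = {
    "application/xhtml+xml": "landing.xhtml",
    "application/rdf+xml": "landing.rdf",
}
DEFAULT = "application/xhtml+xml"


def parse_accept(header: str) -> list:
    """Return the media types in an Accept header, highest q value first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when not given
        for f in fields[1:]:
            if f.startswith("q="):
                try:
                    q = float(f[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((q, media_type))
    # sorted() is stable, so equal q values keep their header order.
    prefs = sorted(prefs, key=lambda p: p[0], reverse=True)
    return [m for q, m in prefs if q > 0]


def negotiate(accept: str) -> str:
    """Pick the media type to serve, falling back to the default."""
    for media_type in parse_accept(accept):
        if media_type in REPRESENTATIONS:
            return media_type
        if media_type == "*/*":
            return DEFAULT
    return DEFAULT


print(negotiate("application/rdf+xml;q=0.9, text/plain"))
print(negotiate("*/*"))
```

Falling back to the default instead of refusing the request is a design choice made for this sketch. Whichever representation is served, it is still a view of the landing page, not the object the DOI was assigned to.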