Citation, Hosting and Publication
My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):
Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn badc.nerc.ac.uk/data/mst/v3/upd15032006, feature 200409031205 [http://featuretype.registry/verticalProfile] [downloaded Sep 21 2006, available from http://badc.nerc.ac.uk/data/mst/v3/]
which I could also write like this to give some hint of the semantics:
The tags are made up, but hopefully identify the important semantic content of the citation. As I said last time, some of the information there looks redundant, but maybe it is not: there is no guarantee that the Identifier and the AvailableAt carry the same semantic content.
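To make the idea concrete, here is a minimal sketch of such a tagged citation, built with Python's standard `xml.etree.ElementTree`. The field values come from the citation above. Only Publisher, Title, Identifier, AvailableAt and Feature are tag names this post refers to; the `Citation` root and every other tag name are assumptions for illustration, not a real schema.

```python
import xml.etree.ElementTree as ET

# Values are copied from the plain-text citation above.
# Tag names other than Publisher, Title, Identifier, AvailableAt and Feature
# are hypothetical stand-ins chosen for this sketch.
FIELDS = [
    ("Author", "Natural Environment Research Council"),
    ("Title", "Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth"),
    ("Medium", "Internet"),
    ("Publisher", "British Atmospheric Data Centre (BADC)"),
    ("PublicationDate", "1990-"),
    ("Identifier", "badc.nerc.ac.uk/data/mst/v3/upd15032006"),
    ("Feature", "200409031205"),
    ("FeatureType", "http://featuretype.registry/verticalProfile"),
    ("AccessDate", "Sep 21 2006"),
    ("AvailableAt", "http://badc.nerc.ac.uk/data/mst/v3/"),
]


def build_citation(fields):
    """Return a <Citation> element with one child element per (tag, text) pair."""
    root = ET.Element("Citation")
    for tag, text in fields:
        ET.SubElement(root, tag).text = text
    return root


citation = build_citation(FIELDS)
if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
    ET.indent(citation)
print(ET.tostring(citation, encoding="unicode"))
```

Keeping Identifier and AvailableAt as separate elements reflects the point above: the two may look alike without meaning the same thing.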
Inherent in that example, and my meaning, was a concept of publication, and I introduced that distinction by comparing the MST and our ASHOE dataset (which is really "published" elsewhere). In the library world, there is a concept of "Version of Record", which isn't exactly analogous, but I would argue BADC holds the dataset equivalent of the version of record for the MST, and NASA AMES the equivalent for the ASHOE dataset.
Generally, in scholarly publication, in the past one distinguished between the refereed literature, the published literature and the grey literature, where the latter might not have been allowed as a valid citation. The situation has become more complicated with the urge to cite digital material, but one of the reasons for the old rules was about attempting to ensure permanence and access - something that is obviously becoming a problem again. Thus, we should explore the concepts of publication and version of record a bit further, before we create new problems. Cathy Jones, working on the CLADDIER project, has made the point in email that a publisher does something to the original that adds value, and I think in the case of digital data, that something should include at least:
provision of catalogue metadata
some commitment to maintenance of the resource at the AvailableAt URL
some commitment to the resource being conformant to the description of the Feature
some commitment to the maintenance of the mapping between the identifier and the resource.
And so, in a reputable article (whatever that means), or in the metadata of a published dataset, I wouldn't allow the citation of a dataset that didn't meet at least those criteria, but once we have met those criteria, then that first version should be the version of record, and copies held elsewhere should most definitely distinguish between the publisher and the availability URI.
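The four criteria above could be checked mechanically before a dataset citation is accepted. The following is only a sketch under stated assumptions: the class, field and function names are hypothetical, and how each commitment would actually be verified is left open.

```python
from dataclasses import dataclass


@dataclass
class PublisherCommitments:
    """The four minimum publisher commitments listed above (names are hypothetical)."""
    has_catalogue_metadata: bool
    maintains_available_at_url: bool
    guarantees_feature_conformance: bool
    maintains_identifier_mapping: bool


# Human-readable label for each commitment, in the order given in the post.
LABELS = {
    "has_catalogue_metadata": "catalogue metadata",
    "maintains_available_at_url": "maintenance of the resource at the AvailableAt URL",
    "guarantees_feature_conformance": "conformance to the Feature description",
    "maintains_identifier_mapping": "maintenance of the identifier-to-resource mapping",
}


def unmet_criteria(commitments):
    """Return the labels of the commitments that are not met."""
    return [label for name, label in LABELS.items() if not getattr(commitments, name)]


def is_citable(commitments):
    """A dataset is citable only if every one of the four commitments is met."""
    return not unmet_criteria(commitments)


# Hypothetical example: a copy held elsewhere with no identifier commitment.
mirror_copy = PublisherCommitments(True, True, True, False)
print(is_citable(mirror_copy))
print(unmet_criteria(mirror_copy))
```

Returning the list of unmet criteria, rather than just a yes or no, tells an author or editor what the holder would need to commit to before the citation could be allowed.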
Arguably the 2nd and 4th of these criteria could be collapsed down to the use of a DOI. While that's true, I think the use of both helps the citation user (just as I think it best to do a journal citation with all of the volume, page number and DOI). However, if the publisher does choose to use a DOI, it would help if the holders of other copies did not! Whether or not it's true, the use of a DOI does imply some higher level of ownership than simply making a copy available.
Implicit in my discussion of the metadata of a published dataset is the idea that, just as in the document world, we could introduce the concept of some sort of kite-mark or refereeing of datasets. A refereed dataset would be
available at a persistent location
accompanied by more comprehensive metadata (which might include calibration information, algorithm descriptions, the algorithm codes themselves, etc.)
quality controlled, with adequate error and/or uncertainty information
and it would have been
assessed as to its adherence to such standards.
There might or might not be a special graphical interface to the data, and other well-known interfaces (e.g. WCS) ought probably to be provided.
Datasets published after going through such a procedure would essentially have come from a "Data Journal", and so in my example above, the <Publisher> would become the name of the organisation responsible for the procedure, and the <Title> might well become the title of the "Data Journal".
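As a rough illustration of that substitution, the sketch below treats a citation as a plain dictionary keyed by the made-up tag names and swaps in the two "Data Journal" values. The function name, the journal title and the reviewing organisation are all hypothetical placeholders.

```python
def as_data_journal_citation(citation, journal_title, reviewing_body):
    """Return a copy of the citation with Publisher and Title replaced.

    Publisher becomes the organisation responsible for the refereeing
    procedure, and Title becomes the title of the "Data Journal".
    The input dictionary is left unchanged.
    """
    updated = dict(citation)
    updated["Publisher"] = reviewing_body
    updated["Title"] = journal_title
    return updated


# Values taken from the MST citation earlier in the post.
mst = {
    "Title": "Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth",
    "Publisher": "British Atmospheric Data Centre (BADC)",
    "Identifier": "badc.nerc.ac.uk/data/mst/v3/upd15032006",
    "AvailableAt": "http://badc.nerc.ac.uk/data/mst/v3/",
}

# "Example Data Journal" and "Example Review Body" are invented placeholders.
refereed = as_data_journal_citation(mst, "Example Data Journal", "Example Review Body")
print(refereed["Publisher"], "/", refereed["Title"])
```

The Identifier and AvailableAt entries carry over untouched, so the citation still points at the same resource. The dataset's own name would presumably need a home elsewhere in the citation, which this sketch does not attempt to settle.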
Persistence (from "Bryan's Blog" on Monday 23 October, 2006)
Citing data with ISO19139 (from "Bryan's Blog" on Wednesday 25 October, 2006)
Citation and Claddier (from "Bryan's Blog" on Monday 21 May, 2007)