Bryan's Blog 2005/12/22
Life in blog world
The ups: William Connelly has found my blog - it's always nice when someone notices what you've been writing and likes it.
The downs: some of my blog entries have suffered comment spam ...
Dammit. This means I have to spend some time upgrading leonardo to try and avoid comment spam. I was planning on doing some useful things, like more thinking about a decent search provider in the backend, or version control. Oh well. That's life.
by Bryan Lawrence : 2005/12/22 (permalink)
One of the aims of CLADDIER is to establish methodologies for data citation. There are a lot of issues, some of which have been addressed before by other projects (for example, a German pilot experiment). The way I see it, the things we have to address include:
How do we deal with dataset transience? My first reaction to this question was to say that one has to make dataset publication final - if one wants to change the dataset, one should publish a new version. However, it doesn't take long to realise that while that might make sense for datasets that are in some way canonical, it isn't a general solution. Many scientific datasets are subject to more or less continual revision, either as more data is collected (for example, covering a wider spatial or temporal domain) or as data is revised when quality control improves. In the case of most of our data, the latter situation should result in revised datasets - which was the situation I had envisaged - but in a gene database, for example, that makes much less sense if one plans to cite the database itself, which one expects to change as new sequences are added and old ones revised. This leads us to:
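One way to make "publish a new version" concrete is to mint a fresh identifier per revision while keeping old identifiers resolvable. A minimal sketch - all names (DatasetVersion, publish_revision, the identifier scheme) are mine, purely illustrative:

```python
# Sketch: each revision of a dataset gets a new, permanent identifier;
# earlier versions remain citable but record that they are superseded.

class DatasetVersion:
    def __init__(self, dataset_id, version):
        self.identifier = f"{dataset_id}/v{version}"  # e.g. "badc.dataset-42/v1"
        self.version = version
        self.superseded_by = None  # filled in when a later version appears

def publish_revision(history, dataset_id):
    """Publish a new version and mark the previous one as superseded."""
    new = DatasetVersion(dataset_id, len(history) + 1)
    if history:
        history[-1].superseded_by = new.identifier
    history.append(new)
    return new

history = []
v1 = publish_revision(history, "badc.dataset-42")
v2 = publish_revision(history, "badc.dataset-42")
print(v1.superseded_by)  # -> badc.dataset-42/v2
```

The point of the design is that a citation of v1 never breaks: the reader following it simply learns that a later version exists.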
How do we cite something within a database/dataset? Here I think the answer is that database contents should be thought of as consisting of features in the OGC/ISO sense; that is, of objects which are instances of some named class which is well defined (with the class definition being well described and lodged in some registry). In that case, all such instance objects should be described by a unique identifier and thus citeable in some way. However, it is important that as such instances are revised/improved/replaced, not only do the new instances carry new identifiers, but the old instances must carry information that they are obsolete.
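The feature-instance idea above could be represented as records carrying both a permanent unique identifier and obsolescence information. A sketch under my own assumptions - the class and field names are invented, not from any OGC specification:

```python
import uuid

class FeatureInstance:
    """A citable feature: an instance of a registered feature class."""
    def __init__(self, feature_type):
        self.identifier = str(uuid.uuid4())  # permanent, never reused
        self.feature_type = feature_type     # name of the registered class
        self.obsolete = False
        self.replaced_by = None              # identifier of the revised instance

    def revise(self):
        """Create a revised instance; the old one records its obsolescence."""
        new = FeatureInstance(self.feature_type)
        self.obsolete = True
        self.replaced_by = new.identifier
        return new

old = FeatureInstance("GeneSequence")
new = old.revise()
# A citation of `old` still resolves, but now signals the revision.
```

This satisfies both requirements in the text: new instances get new identifiers, and old instances know they are obsolete.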
What metadata is needed to publish a dataset (as opposed to, say, a book)? In the ndg project, we have been investigating metadata, and we think there are at least four categories one needs to deal with, which we briefly summarise as:
A (for Archive): what you need to understand the format and direct content (i.e. what quantities are actually stored).
B (for Browse): what you need to understand the context of the data, and to allow you to choose between otherwise similar datasets (hence Browse).
C (for Character or Citation): what you need to support annotation of the data.
D (for Discovery): what you need to find the datasets - broadly similar to the catalogue records in a library, but enhanced because we're dealing with data.
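The four categories could be captured as a per-dataset record. A sketch - the keys follow the A/B/C/D taxonomy above, but all field names and contents are invented examples, not real BADC metadata:

```python
# Sketch: the A/B/C/D metadata taxonomy as a per-dataset record.
metadata = {
    "A": {  # Archive: format and direct content
        "format": "NetCDF",
        "variables": ["air_temperature", "pressure"],
    },
    "B": {  # Browse: context, to choose between otherwise similar datasets
        "instrument": "radiosonde",
        "processing_history": "v2 quality control applied",
    },
    "C": {  # Character/Citation: supports annotation of the data
        "annotations": ["suspect values near 1998-05-03"],
    },
    "D": {  # Discovery: catalogue-style records for finding the dataset
        "title": "Upper-air soundings, station X",
        "keywords": ["atmosphere", "temperature"],
    },
}
```

Keeping the categories distinct matters later: the authorship question below turns on who supplies which part of such a record.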
If substantial amounts of metadata are needed, how do we deal with the definition of authorship for datasets? In the four categories above, one might expect the data originator to supply A and some of the B metadata, but in reality it is rare for a dataset collector to provide enough B metadata ... they have too much institutional wisdom which they fail to encode. Hence, to my mind, dataset publication will involve significant effort from third parties (not the collector, nor necessarily the publisher), and that effort will take the form of material which, were it in a standalone document, would merit coauthorship ... I think, however, that for data we need recognised categories of authorship which clearly delineate between the collector/compiler and the metadata creators, but which reflect the academic aspirations of both.
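A citation record that distinguishes roles, rather than listing a flat set of authors, is one way to express this delineation. A sketch - the role names and contributor names are entirely invented for illustration:

```python
# Sketch: a citation record with role-qualified contributors, so the
# collector and the metadata creators each get distinct, explicit credit.
citation = {
    "dataset": "badc.dataset-42/v2",
    "contributors": [
        {"name": "A. Collector", "role": "data collection"},
        {"name": "B. Curator",   "role": "B-metadata authorship"},
        {"name": "C. Archive",   "role": "publication"},
    ],
}
roles = {c["role"] for c in citation["contributors"]}
```

The design choice is that roles are first-class data: a citation index could then count "data collection" and "metadata authorship" contributions separately.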
If we want reputable data citation systems, how do we deal with refereeing? Clearly we (the BADC) could publish datasets ourselves - and indeed we intend to do so - but what guarantee of quality will our users get? Obviously we can do internal quality control, but that can only address compliance of the data with some schema: it has the right metadata, the numbers are within bounds, and so on. What we can't do, and what we need refereeing for, is to comment authoritatively on, for example:
the suitability of using a particular instrument to make a specific measurement, or
the suitability of a particular algorithm for combining data.
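Internal quality control of the kind just described (schema compliance, bounds checking) is mechanical and can be automated; the two bullet points above cannot be. A minimal sketch, with invented field names and invented bounds:

```python
def internal_qc(record):
    """Schema-and-bounds check of the kind an archive can do itself.
    It says nothing about whether the instrument or the combining
    algorithm was appropriate - that is what referees are for."""
    problems = []
    for field in ("identifier", "units", "values"):    # required metadata
        if field not in record:
            problems.append(f"missing field: {field}")
    for v in record.get("values", []):
        if not (-100.0 <= v <= 60.0):                  # invented plausible bounds
            problems.append(f"value out of bounds: {v}")
    return problems

record = {"identifier": "obs-1", "units": "degC", "values": [12.3, 250.0]}
print(internal_qc(record))  # -> ['value out of bounds: 250.0']
```

Passing such a check is necessary but not sufficient: a record can be schema-perfect and still be the wrong measurement made with the wrong instrument.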
How do we deal with persistence of the digital objects that we cite? This is a whole topic in itself, but all objects which we expect to be citable have to have identifiers which remain live for as long as the objects do, and which in particular are independent of the storage mechanism and the interfaces to it.
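Independence from storage and interfaces usually means a level of indirection: citations carry a permanent identifier, and a resolver maps it to the current location, so the location can change while citations do not. A sketch - the resolver, identifier scheme, and URLs are all hypothetical:

```python
# Sketch: identifier indirection. Citations use the permanent identifier;
# only the resolver table changes when the data moves.

resolver = {}

def register(pid, location):
    resolver[pid] = location

def migrate(pid, new_location):
    """Move the object to new storage; the citable identifier is unchanged."""
    resolver[pid] = new_location

def resolve(pid):
    return resolver[pid]

register("badc:dataset-42", "http://old-server.example/data/42")
migrate("badc:dataset-42", "http://new-archive.example/ds/42")
print(resolve("badc:dataset-42"))  # -> http://new-archive.example/ds/42
```

This is the same indirection that handle-style identifier systems provide; the hard part in practice is the institutional commitment to keep the resolver table maintained for as long as the objects live.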