Bryan's Blog 2009/03/04
The initial thrust of the day was the presentation of a plan to deliver the first steps of a national research data service, followed by some stakeholder perspectives, and then funder perspectives. Sadly, I thought the proposal was seriously flawed, being based on a naive expectation of what could and should be achieved. I found little dissent from those I spoke with in the margins (or indeed, from those who spoke from the floor during the day). There were one or two individuals who saw this as an opportunity that needed to be seized, despite the flaws in the proposal, but mostly I heard practical objections, followed by rational arguments about what might be done - and total agreement that something should be done. But what?
Firstly, scope. What's in a name? Are we trying to build a national research data infrastructure, or a research data service? Is the concept of curation relevant? In my mind, having a clear view on the distinction between these concepts is crucial to working out what can be achieved. To help with what I mean here, I'm thinking
of an infrastructure as being composed of many services and facilities. So, if we need a national infrastructure, what "services" need to be national? What facilities deliver which services?
of a distinction between helping with storage, management and use of data in the near term (data facilitation), and into the medium and longer term future (data curation).
(It certainly seemed like some of the folks at this meeting were thinking curation while talking about facilitation and others vice versa, something that became particularly clear during the video presentation about the Australian National Data Service - of which another day, if I find the energy and time!)
Who are the stakeholders and players in these activities? I might summarise them as:
some of whom are in universities, some of whom are not, who sometimes appear as producers, and sometimes as consumers,
some of which are universities, some of which are not, who might teach, or run libraries and other facilities which may or may not have a role to play,
existing data facilities which have trans-institutional (and often trans-national) mandates,
advisors (of good practice),
assessors (of good practice),
Of those, I think of both the research and non-research consumers as users, and all the rest are players in the research infrastructure. In particular, the funders are as much players as those who deliver storage facilities! Unless the funders are engaged to the point of mandating and rewarding appropriate levels of engagement by the other players, some of the goals one might aspire to are simply not feasible. Similarly, the institutions that employ and educate the producers of data are just as much players.
For all their flaws, I think the UKRDS proposers understood the stakeholder and player relationships to a degree. Where it all went wrong, I think, was in their understanding of a) the distinction between data and documents, b) the nature, discipline dependence, and fundamental importance of data management plans, and c) how one would need to balance the desire to facilitate data reuse against the desire to collect new data and to carry out new research.
Taking these in turn:
data are not documents. There are no common understandings of how to format data, how to store data, and how to document them, and neither should there be1. The consequences of this heterogeneity are profound, not least because such heterogeneity means that in any given institution, there can be many more formats and facilitation requirements than there can be experts in those formats and requirements.
data management plans are discipline and data dependent, reflecting the varying importance of varying maturities of data, and the presence or absence of overarching agreements (e.g. collaboration agreements). Again, while any given institution can have expertise in the concepts of data management plans, the construction and relevance of the plans are likely to be trans-institutional more often than not.
the choice about the relative importance of facilitation, curation, and new research is discipline dependent. Yes, it might well be that if we don't look after old data it will be lost, but the decision as to whether that matters in the context of the available funds can only really be made within the scope of the research priorities and expectations.
I appreciate that teaching institutions may have other reasons for curating data, but those reasons need to be assessed against other possible ways of using the effort.
All these issues boil down to the fact that, on the national scale, data facilitation and curation are not overheads, that is, activities that can occur by right. Within NERC, we have multiple data centres, precisely because different components of NERC make the judgments in different ways. It's important to note, however, that this is not an argument that NO component of the infrastructure can be delivered via the mechanism of overheads on research (which might be regularly reviewed), but it is an argument that the entire procedure should not (particularly if some parts of the community who already fund data management see an overhead charged on them to do data management in other communities - something that is a big risk with the UKRDS proposal). So, whatever one does for a national infrastructure has to deal with heterogeneity on all fronts: from funding mechanisms, to facilities and objectives!
If we return to the NERC example, we have an overarching policy and strategy2, but the implementation is sub-discipline dependent. Can we scale any of the lessons out from the NERC experience nationally? Yes, I think so3.
The lessons are simple:
We do need to get to the data producers while they are still producing the data. But doing so is incredibly time and human intensive. So we only want to get to some data producers some of the time. The key role of the data management plan is to identify who and what gets attention and when.
Distinguish between generic data management plans, which might essentially say, no we aren't going to do various levels of data management, and specific plans that will need to engage professionals.
By all means preserve digital data that you haven't invested in ingesting professionally, but don't kid yourselves that you'll be able to use it years later. If you don't invest in professional ingestion (i.e. time consuming metadata production - whether manual or via software systems - and quality control), then the data may or may not be recoverable down the track, but if it is, it'll be mightily expensive to do so.
But it's a balance: preserving everything to a recoverable standard is mighty expensive too. One of the roles of DMPs is to take an educated punt on what we can risk not doing!
Be very clear about the goals, what is desirable, and what is not, and allow the balance to vary between disciplines and within institutions.
Reward data management, and by this I don't mean the professionals alone, I mean the academics who do their bit. That's a key role for both institutions and funders!
Also a key role for publishers and data journals! I can't stress enough that I think the only way that digital curation will ever really take off is when it's realised that the act of curating data (including documentation) is as much a part of scientific discourse as journal articles.
But do assess the quality of the professionals too, and invest in their training and skills, and facilitate interdisciplinary data management knowledge transfer.
So what would a national infrastructure that did that look like? Well, we'll have to save that for another day ...