Bryan's Blog 2009/03
Reading 2009, 5: Pillars need shortening
Well, it's some time since I caught up with my reading reports. Many weeks ago I waded through Ken Follett's The Pillars of the Earth. I say wade in both a positive and negative sense: it's a long shallow book, with lots of good bits and a fair bit of meandering, rather like wading through a river via warm pools interspersed with cold.
I first read this book a long long time ago, and liked it enough then to keep a copy. However, the copy I read this time was yet another Christmas present, and this read was probably enough to convince me that my groaning shelves can do without either copy. Don't get me wrong, this is a pretty good book, but it's not a monumental book, and the issue for me now is not the quality, but the length: I don't have the patience and time for 1000 page books that I once had. That said, when I finished it, I was far happier to have done so than I was with the previous book. If you're reading this to decide whether to read it yourself, then I suspect that if you're looking to read "literature", you'd be better served by a book with fewer characters and a smaller canvas. But if you want medieval entertainment (without ever quite being enthralled), night after night, then go for it.
The publication ecosystem
Cameron Neylon drew my attention to an absolutely fascinating set of figures that appeared in a Research Information Network paper last year (pdf). You should read Cameron's post; it's interesting in its own right for anyone interested in the place of peer review in the firmament. However, what I wanted to do here was abstract some rather interesting numbers from the report. They all appear here:
This figure requires a bit of explanation, which you're not going to get here, aside to say that the numbers are estimates of the totals spent globally on the entire process of scholarly communication and research. Note that the cost of "reading" is split out from the cost of "doing research"; if you really wanted to know the cost of research itself you might add those two figures together. Amongst the interesting conclusions we can draw is that the cost of the publication machinery, from submission to consumption via the eyeballs, is roughly 1/7 of the total cost of the global research ecosystem - and half of that is the cost of searching for and printing out the resulting documents. Deep in the model, we find an assumption that it takes 12.5 minutes of search time per article, and that dominates the search/print cost. I'm not quite sure where that number comes from, but it's very plausible to me. Another interesting number buried in the publishing cost is 2 billion on peer review, i.e. peer review is of the order of 1-2% of the total cost of research.
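The back-of-envelope ratios above are easy to sanity-check. The sketch below assumes an illustrative total of 175 billion for the whole research ecosystem - a number picked only because it makes the quoted figures mutually consistent, not one taken from the report itself:

```python
# Rough consistency check of the figures quoted above.
# All monetary values are in billions; the total is an ILLUSTRATIVE assumption.

total_research = 175.0                 # assumed total cost of the global research ecosystem
publication = total_research / 7       # "roughly 1/7 of the total" on publication machinery
search_and_print = publication / 2     # "half of that" on searching and printing
peer_review = 2.0                      # "2 billion on peer review", as quoted

print(f"publication machinery: {publication:.1f}bn")
print(f"search and print:      {search_and_print:.2f}bn")
print(f"peer review share:     {peer_review / total_research:.1%}")
```

With that assumed total, peer review comes out at about 1.1% of the whole, squarely inside the "1-2%" band quoted above, which suggests the individual figures in the report hang together.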
It's not a great leap to wonder if those large figures on discovery and reading are the consequences of "publish or perish" coupled with the "smallest publishable unit", leading to far too much crap literature to wade through in many disciplines. (However, to be fair, as one who always used to argue to my graduate students that one hasn't "done anything" until one has published it, simply bashing the amount of material out there isn't the whole story.)
Anyway, this piece of work prompted me to wonder if anyone has actually quantified what proportion of research time/effort/budget is spent dealing with data handling? I've heard lots of anecdotes, and I've created a few guesstimates myself, but I wonder if anything half as authoritative as this report has been done?1 If so, then the obvious question to ask would be whether those numbers would support more or less of a professional data handling infrastructure?
The initial thrust of the day was the presentation of a plan to deliver the first steps of a national research data service, followed by some stakeholder perspectives, and then funder perspectives. Sadly, I thought the proposal was seriously flawed, being based on a naive expectation of what could and should be achieved. I found little dissent from those I spoke with in the margins (or indeed, from those who spoke from the floor during the day). There were one or two individuals who saw this as an opportunity that needed to be seized, despite the flaws in the proposal, but mostly I heard practical objections, followed by rational arguments about what might be done - and total agreement that something should be done. But what?
Firstly, scope. What's in a name? Are we trying to build a national research data infrastructure, or a research data service? Is the concept of curation relevant? In my mind, having a clear view on the distinction between these concepts is crucial to working out what can be achieved. To help with what I mean here, I'm thinking
of an infrastructure as being composed of many services and facilities. So, if we need a national infrastructure, what "services" need to be national? What facilities deliver which services?
of a distinction between helping with storage, management and use of data in the near term (data facilitation), and into the medium and longer term future (data curation).
(It certainly seemed like some of the folks at this meeting were thinking curation while talking about facilitation, and others vice versa - something that became particularly clear during the video presentation about the Australian National Data Service, of which more another day, if I find the energy and time!)
Who are the stakeholders and players in these activities? I might summarise them as:
researchers, some of whom are in universities, some of whom are not, who sometimes appear as producers, and sometimes as consumers,
institutions, some of which are universities, some of which are not, which might teach, or run libraries and other facilities that may or may not have a role to play,
existing data facilities which have trans-institutional (and often trans-national) mandates,
funders,
advisors (of good practice),
assessors (of good practice),
Of those, I think of both the research and non-research consumers as users; all the rest are players in the research infrastructure. In particular, the funders are as much players as those who deliver storage facilities! Without the funders being engaged to the point of mandating and rewarding appropriate levels of engagement by the other players, some of the goals one might aspire to are simply not feasible. Similarly, the institutions that employ and educate the producers of data are just as much players.
For all their flaws, I think the UKRDS proposers understood the stakeholder and player relationships to a degree. Where it all went wrong, I think, was their understanding of a) the distinction between data and documents, b) the nature, discipline dependence, and fundamental importance of data management plans, and c) how one would need to balance the desire to facilitate data reuse against the desire to collect new data and carry out new research.
Taking these in turn:
data are not documents. There are no common understandings of how to format data, how to store data, and how to document them, and neither should there be1. The consequences of this heterogeneity are profound, not least because such heterogeneity means that in any given institution, there can be many more formats and facilitation requirements than there can be experts in those formats and requirements.
data management plans are discipline and data dependent, reflecting the varying importance of data at varying maturities, and the presence or absence of overarching agreements (e.g. collaboration agreements). Again, while any given institution can have expertise in the concepts of data management plans, the construction and relevance of the plans are likely to be trans-institutional more often than not.
the choice about the relative importance of facilitation, curation, and new research is discipline dependent. Yes, it might well be that if we don't look after old data it will be lost, but the decision as to whether that matters in the context of the available funds can only really be made within the scope of the research priorities and expectations.
I appreciate that teaching institutions may have other reasons for curating data, but those reasons need to be assessed against other possible ways of using the effort.
All these issues boil down to the fact that, on the national scale, data facilitation and curation are not overheads, that is, activities that can occur by right. Within NERC, we have multiple data centres precisely because different components of NERC make these judgments in different ways. It's important to note, however, that this is not an argument that NO component of the infrastructure can be delivered via the mechanism of overheads on research (which might be regularly reviewed), but it is an argument that the entire procedure should not be (particularly if some parts of the community who already fund data management see an overhead charged on them to do data management in other communities - something that is a big risk with the UKRDS proposal). So, whatever one does for a national infrastructure has to deal with heterogeneity on all fronts: from funding mechanisms, to facilities and objectives!
If we return to the NERC example, we have an overarching policy and strategy2, but the implementation is sub-discipline dependent. Can we scale any of the lessons from the NERC experience out nationally? Yes, I think so3.
The lessons are simple:
We do need to get to the data producers while they are still producing the data. But doing so is incredibly time and human intensive. So we only want to get to some data producers some of the time. The key role of the data management plan is to identify who and what gets attention and when.
Distinguish between generic data management plans, which might essentially say, no we aren't going to do various levels of data management, and specific plans that will need to engage professionals.
By all means preserve digital data that you haven't invested in ingesting professionally, but don't kid yourselves that you'll be able to use it years later. If you don't invest in professional ingestion (i.e. time-consuming metadata production - whether manual or via software systems - and quality control), then the data may or may not be recoverable down the track, but if it is, it'll be mightily expensive to do so.
But it's a balance, preserving everything to a recoverable standard is mighty expensive too. One of the roles of DMPs is to take an educated punt on what we can risk not doing!
Be very clear about the goals, what is desirable, and what is not, and allow the balance to vary between disciplines and within institutions.
Reward data management, and by this I don't mean the professionals alone, I mean the academics who do their bit. That's a key role for both institutions and funders!
Also a key role for publishers and data journals! I can't stress enough that I think the only way digital curation will ever really take off is when it's realised that the act of curating data (including documentation) is as much a part of scientific discourse as journal articles are.
But do assess the quality of the professionals too, and invest in their training and skills, and facilitate interdisciplinary data management knowledge transfer.
So what would a national infrastructure that did that look like? Well, we'll have to save that for another day ...
I don't know what it says, because I'm not about to pay $18 to read it, but I assume I'd like what he wrote ...