Bryan's Blog 2006/04
More on Data Citation
In talking to folk, it became clear that it helps to draw a distinction between those who interpret data and those who produce data. Often these are the same people, and there is a clear one-to-one link between some data, the data creator(s) and a resulting paper which describes the data in some detail. However, in this situation, the key reason why the paper is published will normally be the interpretation that is presented, not the data description per se (although there are some disciplines in which the paper may simply describe the data, this is not the normal case).
So what CLADDIER is about is the situation where the data creator(s) has (have) only been involved in the creation of the data, not the resulting interpretation. Again, in some disciplines the data creators may end up as authors on relevant papers (and may never even see or comment on the text). However, in most disciplines this either doesn't happen, or is severely frowned upon, and in these situations the data creators are somehow second class academic citizens (despite being an essential part of the academic food chain). What we want to be able to do is have recognition for these folk via citation of the dataset itself, not a journal paper ... that way important datasets will be measurably important.
There are a lot of complications, some of which I've addressed before. Some of the additional things we discussed yesterday included:
How to handle small contributions (such as annotations). My take on this is that small contributions are visible via authorship within the cited entity, but probably ought not be visible outside of it (although in the semantic web world one ought to be able to count the number of these things any individual does). At some point folk have to decide on the definition of small though (probably a discipline dependent decision) ...
The situation is rather more complicated with meta-analyses. Arguably we're in the same situation as an anthology or book of papers ... in either case we would expect the contributions to be citable as is the aggregated work.
One new concept for me was that the taxonomy world might want to count (in some way) the number of times a specific taxon is mentioned, and use that as a metric of the work done categorising the species. It struck me that this might not be too clever in their world - surely the description of some rare species ought to be pretty important, but it might not get cited as frequently as some pretty routine work on, for example, some agriculturally important species. (In the journal paper world this is the same reason why citation impact alone never used to be the whole story in the UK Research Assessment Exercise.)
NDG Status Talk
As yesterday's last blog intimated, I'm in the middle of a two-day meeting of the NERC e-science community. Yesterday I gave a talk on the status of the NERC DataGrid project:
NERC DataGrid Status
In this presentation, I present some of the motivation for the NERC DataGrid development (the key points being that we want semantic access to distributed data with no centralised user management), link it to the ISO TC211 standards work, and take the listener through a tour of some of the NDG products as they are now. There is a slightly more detailed look at the Climate Sciences Modelling Language, and I conclude with an overview of the NDG roadmap.
Presentation (5 MB ppt)
Another Missed Anniversary
As of last Tuesday I have been running the BADC for six years! I can't believe it's been that long! My first tour of duty in the UK (1990-1996) was just under six years, and my return to Godzone (1996-2000) was only four years. Yet the last six years have gone by incredibly quickly, and seem to have been much faster (and in some ways less full) than those other two chunks of time. (Mind you, my wife and I were talking this over, and when we listed what we had done in the last six years it became obvious that the key word in that last sentence was seems.)
When I was interviewed for the post they asked me what I thought I would be doing in five years. I said I didn't know, but I was sure I would have moved on to new challenges. I guess I got that mostly right, I have moved on to new challenges, but I've moved on by staying physically in the same place and growing the job (and taking on a daughter) ...
... and I have no plans right now of jumping back down under, despite having done another long tour of duty :-)
by Bryan Lawrence : 2006/04/27 (permalink)
Treasury via the Office of Science and Innovation1 is putting a good deal of pressure on the research councils2 to contribute to UK wealth by more effective knowledge transfer. The key document behind all of this is the Baker Report. It's been hanging around since 1999, having more and more influence in the way the research councils behave, but from my perspective it's finally really beginning to bite, so I decided I'd read the damn thing, so people would stop blindsiding me with quotes.
The first, and most obvious, thing to note is that it really is about commercialisation, and it's driven by the government policy objective of improving the contribution of publicly funded science to wealth creation. But right up front (section 1.9) Baker makes the point that the free dissemination of research outputs can be an effective means of knowledge transfer, with the economic benefits accruing to industry as a whole, rather than to individual players. Thus the Baker report is about knowledge transfer in all its forms.
The second obvious point is that with all the will in the world, the research councils can't push knowledge into a vacuum: along with push via knowledge transfer initiatives, there needs to be an industry and/or a market with a will to pull knowledge out! Where such industry is weak or nonexistent there is the strongest case to make research outputs freely available as a methodology for knowledge transfer.
Some of my colleagues will also be glad to know that the presumption is that the first priority of the research councils should be to deliver their science objectives:
Nothing I advocate in this report is intended to undermine the capacity of (the Research Councils) to deliver their primary outputs.
Baker actually defines what he calls knowledge transfer:
collaboration with industry to solve problems (often in the context of contract research for industry)
the free dissemination of information, normally by way of publication
licencing of technology to industry users
provision of paid consultancy advice
the sale of data
the creation and sale of software
the formation of spin out companies
joint ventures with industry
the interchange of staff between the public and private sector.
Given that Baker recognises the importance of free dissemination of information, it's disappointing that he implies that data and software are not candidates for free dissemination. Of course, he was writing in 1999, when the world of open source software was on the horizon but not really visible to the likes of Baker. I would argue that had he been writing today, the creation of open source software by the public sector research establishment would not only fit squarely within these definitions, he would probably have been required to include it explicitly. In terms of free dissemination of data, most folk will know I'm working towards the formal publication of data, so that fits in this definition too.
I was also pleased to see (contrary to what others have said to me) that Baker explicitly (3.17 and 3.18) makes the point that knowledge transfer is a global activity, and the benefit to the UK economy will flow whether knowledge is transferred directly into UK entities or via global demand. The key point seems to be that the knowledge is transferred, not where it is transferred to (although he sensibly makes the point that where possible direct UK benefit should be engendered).
Where it starts to go wrong - or at least, where the reader can get carried away - is in the report's emphasis on protecting and exploiting Intellectual Property. At one point he puts it like this:
the management of intellectual property is a complex task that can be broken down into three steps: identification of ideas with commercial potential; the protection and defence of these ideas; and their exploitation.
There is a clear frame of thinking here that protecting and defending leads to exploitation, and this way of thinking can very easily lead one astray. It certainly doesn't fit naturally with all the methods of knowledge transfer that he lists! It can also cause no end of problems for those of us with legislative requirements to provide data at no more than cost to those who request it (i.e. particularly the environmental information regulations, e.g. see the DEFRA guidance - although note that the EIR don't allow you to take data from a public body and sell it or distribute it without an appropriate licence, so the conflict isn't unresolvable).
Baker does realise some of this, of course; he makes the point that:
There is little benefit in protecting research outputs where there is no possibility of deriving revenues from the work streams either now or in the future.
I was amused to get to the point where he recognises that modest additional funding would reap considerable reward, but of course that money hasn't materialised (as far as I can see, though I may not be aware of it). As usual with this government, base funding has had to stump up for new policy activities. (This may be no bad thing, but it would be more honest to admit it - the government is spending core science money on trying to boost wealth creation. Fine, and indeed we have been doing knowledge transfer, and will continue to do so, from our core budget, but the policy is demanding more.)
The final thing to remember is that the Baker report is about the public sector research establishment itself; my reading of it definitely didn't support the top-slicing of funds from the grant budgets that go to universities to support knowledge transfer, but that's what is happening. Again, perhaps no bad thing, but I don't see Baker asking for it (although there is obvious ambiguity, since the report covers the research councils, but when a research council issues a grant, the grant-holding body gets to exploit the intellectual property).
So the Baker report was written in 1999, but government policy is being driven by rather more recent things too. Over the next couple of months, I'll be blogging about those as well (if I have time). One key point to make in advance is that knowledge transfer can, and now does, include the concept of science leading to policy (which is of course a key justification for the NERC activities).
I'm still naive
I've just read the real climate post on how not to write a press release. I was staggered to read the actual press release that caused all the fuss (predictions of 11C climate sensitivity etc). The bottom line is that had I read that press release without any prior knowledge I too might have believed that an 11 degree increase in global mean temperature was what they had predicted (which is not what they said in the paper). I can't help putting some of the blame back on the ClimatePrediction.net team - the press release didn't properly reflect the message of their results, and they shouldn't have let that happen. I'm still naive enough to believe it's incumbent on us as scientists to at least make sure the release is accurate, even if we can't affect the resulting reporting.
Having said that, I thought Carl Cristenson made a fair point in the comments to the RC post (number 13): to a great extent, the fuss is all a bit much, we should concentrate on the big picture - the London Metro isn't the most significant organ of the press :-)
(I stopped reading the comments around number 20 ... does anyone have time to read 200 plus comments?)
Amusingly I read the RC post (and the final point: "not all publicity is good publicity") less than four hours after I listened to Nick Faull (the new project manager for cp.net) ruefully review the press coverage (in a presentation at a nerc e-science meeting). He finished up with "but at least they say no publicity is bad publicity" ... one can find arguments to support both points of view!
As an aside, we spent some time at the NERC meeting discussing the fact that the current climateprediction.net experiment is completely branded as a BBC activity, even though NERC is still providing significant funding (and seeded the project in the first place as part of the COAPEC programme) ... this at a time when NERC needs to make its knowledge transfer activities visible.
Data Storage Strategy
Recently I was asked to come up with a vision of where the UK research sector needed to be in terms of handling large datasets in ten years time. This is being fed into the deliberations of various committees who will come up with a bid for infrastructure in the next government comprehensive spending plan. Leaving aside that the exam question is unanswerable, this is what we1 came up with:
The e-Infrastructure for Research in the next 10 years (Data Storage Issues)
The Status Quo
In many fields data production is doubling every two years or so, but there are a number of fields where in the near future, new generations of instruments are likely to introduce major step changes in data production. Such instruments (e.g. the Large Hadron Collider and the Diamond Light Source) will produce data on a scale never before managed by their communities.
Thus far disk storage capacity increases have met the storage demands, and both tape and disk storage capacities are likely to continue to increase, although there are always concerns about being on the edge of technological limits. New holographic storage systems are beginning to hit the market, but are not yet scalable or fast enough to compete with older technologies.
While storage capacities are likely to meet demand, major problems are anticipated in the ability both to retrieve and to manage the data stored. Although storage capacities and network bandwidth have been roughly following Moore's Law (doubling every two years), neither the speed of I/O subsystems nor the storage capacities of major repositories have kept pace. Early tests of data movement within the academic community have found in many cases that storage systems, not the network, have been the limiting factor in moving data. While many groups are relying on commodity storage solutions (e.g. massive arrays of cheap disks), as the volumes of data stored have gone up, random bit errors have begun to accumulate, causing reliability problems. Such problems are compounded by the fact that many communities are relying on ad hoc storage management software, and few expect their solutions to scale with the oncoming demand. As volumes go up, finding and locating specific data depends more and more on sophisticated indexing and cataloguing. Existing High Performance Computing facilities do not adequately provide for users whose main codes produce huge volumes of data. There is little exploitation of external off-site backup for large research datasets, and international links are limited to major funded international projects.
In ten years, the UK community should expect to have available both the tools and the infrastructure to adequately exploit the deluge of data expected. The infrastructure is likely to consist of:
One or more national storage facilities that provide reliable storage for very high-volumes of data (greater than tens of PB each).
A number of discipline specific data storage facilities which have optimised their storage to support the access paradigms required by their communities, and exploit off site backups, possibly at the national storage facilities.
These storage facilities will all have links with international peers that enable international collaborations to exploit distributed storage paradigms without explicitly bidding for linkage funding. Such links will include both adequate network provision and bilateral policies on the storage of data for international collaborators. The UK high performance computing centres will have large and efficient I/O subsystems that can support data intensive high performance computing, and will be linked by high performance networks (possibly including dedicated bandwidth networks) to national and international data facilities.
The UK research community will have invested both in the development of software tools to efficiently manage large archives on commodity hardware, and in the development of methodologies to improve the bandwidth to storage subsystems. At the same time, investment in technologies to exploit both distributed searching (across archives) and server-side data selection (to minimise downloading) will have continued, as the demand for access to storage will have continued to outstrip the bandwidth.
Many communities will have developed automatic techniques to capture metadata, and both the data and metadata will be automatically mirrored to national databases, even as they are exploited by individual research groups.
As communities will have become dependent on massive openly accessible databases, stable financial support will become vital. Both the UK and the EU will have developed funding mechanisms that reflect the real costs and importance of strategic data repositories (to at least match efforts in the U.S., which have almost twenty years of heritage). Such mechanisms will reflect the real staffing costs associated with the data repositories.
In all communities, data volumes of PB/year are expected within the next decade:
In the environmental sciences, new satellite instruments and more use of ensembles of high resolution models will lead to a number of multi-PB archives in the next decade (within the U.K. the Met Office and the European Centre for Medium-Range Weather Forecasts already have archives which exceed a PB each). Such archives will need to be connected to the research community with high-speed networks.
In the astronomy community, there are now of the order of 100 experiments delivering 1 or more TB/year with the largest at about 20 TB/year, but in the near future the largest will be providing 100 TB/yr and by early in the next decade PB/year instruments will be deployed.
In the biological sciences, new microscopic and imaging techniques, along with new sensor arrays and the exploitation of new instruments (including the Diamond Light Source), are leading, and will continue to lead, to an explosion in data production.
Within the particle physics community, the need to exploit new instruments at CERN and elsewhere is leading to the development of new storage paradigms, but they are continually on the bleeding edge, both in terms of software archival tools, and the hardware which is exploited.
Legal requirements to keep research data as evidential backup will become more prevalent. Communities are recognising the benefits of meta-analyses and inter-disciplinary analyses across data repositories. Both legality issues and co-analysis issues will lead to data maintenance periods becoming mandated by funding providers.
With the plethora of instruments and simulation codes within each community, each capable of producing different forms of data, heterogeneity of data types coupled with differing indexing strategies will become a significant problem for cross-platform analyses (and the concomitant data retrieval). The problem will be exacerbated for interdisciplinary analyses.
Government Policy on Open Source Software
All this strategy stuff is making my head hurt. But meanwhile, I think I'll have to start a series of blog entries to provide me with relevant notes. To start with, here is a verbatim quote from the UK government policy on open source software (pdf, October 2004):
The key decisions of this policy are as follows:
UK Government will consider OSS solutions alongside proprietary ones in IT procurements. Contracts will be awarded on a value for money basis.
UK Government will only use products for interoperability that support open standards and specifications in all future IT developments.
UK Government will seek to avoid lock-in to proprietary IT products and services.
UK Government will consider obtaining full rights to bespoke software code or customisations of COTS (Commercial Off The Shelf) software it procures wherever this achieves best value for money.
Publicly funded R&D projects which aim to produce software outputs shall specify a proposed software exploitation route at the start of the project. At the completion of the project, the software shall be exploited either commercially or within an academic community or as Open Source Software
(although the last point has some exemptions, the key one - from my perspective - being trading funds like the Met Office)
Well, I may not have done much blogging this month, but I've achieved a bunch of other stuff:
Firstly, and most importantly, Elizabeth turned one, and we had a big family and friends party over Easter to celebrate. She contributed in her normal way by smiling and laughing and eating in roughly equal proportions. At the same time, we built a deck outside our back door so we can get out without negotiating a temporary step that I put there nearly two years ago ...
Secondly, I'm suffering a barrage of strategy meetings, starting with a NERC information strategy meeting a couple of weeks ago, then a knowledge strategy meeting on Monday, and a one hour telecon today on technology strategy. I'm all strategied out.
Thirdly, I've probably written more code this month than in any month for a long time - an analysis of who was doing what on the way to the alpha release of our ndg code and services made it clear that there were a couple of gaps, and, believe it or not, my toying with the code for this blog made me the person best fitted to fill them.
Fourthly, I enjoyed a bit of time reading and making some contributions to some scientific papers.
All this while working four days a week. Since the beginning of March I've been looking after Elizabeth on Fridays, and fitting my nominal 40 hours into the previous four days and nights (I say nominal, because I'd like to be doing only 40 hours). It's been an interesting exercise, because I think I'm actually being more productive doing fewer, longer days. However, for all that, I've still got too many balls in the air, so if you're reading this and wondering why I haven't done something I should have, sorry ... hopefully it's still in the queue!
by Bryan Lawrence : 2006/04/25 (permalink)
I keep having to look up the ISO document for this list of allowable extensions to ISO19115:
adding a new metadata section;
creating a new metadata codelist to replace the domain of an existing metadata element that has free text listed as its domain value;
creating new metadata codelist elements (expanding a codelist);
adding a new metadata element;
adding a new metadata entity;
imposing a more stringent obligation on an existing metadata element;
imposing a more restrictive domain on an existing metadata element.
More Meteorological Time
My first post describing the time issues for meteorological metadata led to some confusion, so I'm trying again. I think it helps to consider a diagram:
The diagram shows a number of different datasets that can be constructed from daily forecast runs (shown, for an arbitrary month from the 12th till the 15th, as forecasts 1 through 4). If we consider forecast 2, we are running it to give a forecast from the analysis time (day 13.0) forward past day 14.5 ... but you can see that the simulation began earlier (in this case 1.0 days earlier), at a simulation time of 12.0 (T0). In this example:
we've allowed the initial condition to be day 12.0 from forecast 1.
we've imagined the analysis was produced at the end of a data assimilation period (let's call this time Ta).
the last time for which data was used (the datum time, td) corresponds to the analysis time.
(Here I'm using some nomenclature defined earlier as well as using some new terms).
Anyway, the point here was to introduce a couple of new concepts. These forecast datasets can be stored and queried relatively simply ... we would have a sequence of datasets, one for each forecast, and the queries would simply then be on finding the forecasts (using discovery metadata, e.g. ISO19139) and then on extracting and using the data itself (using archive aka content aka usage metadata, e.g. an application schema of GML such as CSML).
What's more interesting is how we provide, document and query the synthesized datasets (e.g. the analysis and T+24 datasets). Firstly, if we look at the analysis dataset, we could extract the Ta data points and have a new dataset, but often we need the interim times as well, and you can see that we have two choices of how to construct them - we can use the earlier time from the later forecast (b), or the later time from the earlier forecast (a). Normally we choose the latter, because the diabatic and non-observed variables are usually more consistent outside the assimilation period, when they have had longer to spin up. Anyway, either way we have to document what is done. This is a job for a new package we plan to get into the WMO core profile of ISO19139 as an extension - NumSim.
From a storage point of view, as I implied above, we can extract and store the new datasets, or we can try to treat each as a virtual dataset, described in CSML and extractable by CSML-tools. We don't yet know how to do this, but there is obvious utility, in terms of saving storage, in doing so.
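To make the two construction choices concrete, here's a minimal Python sketch of synthesizing a dataset from overlapping forecast runs. The function name and data structures are mine, purely for illustration - nothing here is CSML or NumSim, and a real implementation would of course work with file-backed fields rather than in-memory dictionaries.

```python
def build_analysis_series(runs, times, choice="a"):
    """Synthesize a time series from overlapping daily forecast runs.

    runs   -- list of (ta, data) pairs, where ta is the analysis time of a
              run and data maps valid times to field values; each run covers
              its assimilation window (before ta) and its forecast (after ta).
    times  -- the valid times wanted in the synthesized dataset.
    choice -- "a": take later times from the earlier forecast (the usual
              choice, since diabatic and non-observed variables are better
              spun up outside the assimilation period);
              "b": take earlier times from the later forecast (i.e. values
              from within that run's assimilation window).
    Times not covered by any suitable run are simply omitted.
    """
    series = {}
    for t in times:
        covering = [(ta, data) for ta, data in runs if t in data]
        if choice == "a":
            # most recent run initialised at or before t
            usable = [(ta, d) for ta, d in covering if ta <= t]
            pick = max(usable, key=lambda rd: rd[0], default=None)
        else:
            # earliest run whose analysis time is at or after t
            usable = [(ta, d) for ta, d in covering if ta >= t]
            pick = min(usable, key=lambda rd: rd[0], default=None)
        if pick is not None:
            series[t] = pick[1][t]
    return series
```

With two runs analysed at days 13.0 and 14.0, a request for day 13.5 returns the T+12 value from the first run under choice (a), and the assimilation-window value from the second run under choice (b).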
Curating metadata aka XML documents
One of the things that has worried me for a long time about relational systems for storing metadata is that it is non-trivial to reliably maintain metadata provenance. I don't want relational integrity, I want historical integrity (and don't give me rollback as an answer). One of the reasons why I like XML as a way of dealing with our scientific metadata is that it is much more intuitively obvious how one deals with provenance (one keeps old records), but differencing XML documents is tedious, and in the long run keeping multiple copies of large documents with minor changes is expensive.
I'll be interested to see if the delta web gains any traction.
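For what it's worth, here's a minimal Python sketch of the sort of "keep old records" approach I mean: every revision of a metadata document is stored verbatim (that's the historical integrity), diffs are computed on demand purely for human inspection, and a simple hash chain makes silent edits to old revisions detectable. The class and method names are mine, purely illustrative; this deliberately ignores the storage-cost problem noted above.

```python
import difflib
import hashlib
from datetime import datetime, timezone

class RecordHistory:
    """Append-only revision history for one XML metadata record."""

    def __init__(self):
        # each entry: (UTC timestamp, chained sha256 digest, full text)
        self.revisions = []

    def commit(self, xml_text):
        """Store a new revision verbatim; chain its hash to the previous
        revision's hash so tampering with history is detectable."""
        prev_digest = self.revisions[-1][1] if self.revisions else ""
        digest = hashlib.sha256((prev_digest + xml_text).encode()).hexdigest()
        stamp = datetime.now(timezone.utc).isoformat()
        self.revisions.append((stamp, digest, xml_text))
        return digest

    def diff(self, i, j):
        """Unified diff between revisions i and j, for provenance display
        only; the full text of every revision is what is actually kept."""
        a = self.revisions[i][2].splitlines()
        b = self.revisions[j][2].splitlines()
        return "\n".join(difflib.unified_diff(a, b, lineterm=""))
```

The design point is that the diff is a view, not the storage format, so a corrupted or lost diff can never cost you a historical version.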
sponsoring the development of Geospatial Standards For Free Software. These standards will provide a way to share geospatial data between oepn (sic) source applications in a vendor-neutral, XML based format.
Sound like déjà vu? Well, these folks recognise this up front:
Many GIS users will be critical of setting up an additional set of open standards for geospatial technology, when the OGC already maintains a set of similar standards. While we recognize the benefits of a unified approach to standards design, we believe there are some serious and fundamental flaws in the approach OGC takes to standards development.
Well, I'm not a GIS user (yet), but I'm certainly critical of this. What we don't need is even more nugatory effort reinventing wheels. Anyway, they list their reasons why they don't like the OGC, so let's see if we can make some sense of them:
The OGC is about geospatial standards, but it is not about free and open source software.
True, but as the SurveyOS guys admit: OGC promotes the development and use of consensus-derived publicly available and open specifications that enable different geospatial systems (commercial or public domain or open source) to interoperate. So this might be a good case for folk to build open source implementations of OGC standards, not a good case for creating new standards.
Membership to the OGC is expensive and exclusive.
True. But it's not hard to get involved, even without investing money. The issue would appear to be that the surveyOS folks want to influence the standards without paying to sit at the table. I understand the motivation. We've paid the minimum we can to sit close, and we don't have a vote, but I've seen no evidence that the OGC community ignore good technical advice.
The OGC focus a great deal of its efforts on GIS for the web. Some of us still use GIS on the desktop. GIS and the internet are a powerful combination, but they are not the solution to every problem. The GSFS will also focus its efforts on GIS for the desktop.
This might be valid, but frankly what sits under the hood of a gui can have the same engine as something that sits under the hood of a browser. But even if we believe this point, it's an argument to develop a complementary set of protocols not a competing set.
OGC specifications are difficult to read and understand.
True. And there are a lot of them as well. But why not get involved in documenting what exists, rather than developing new things? The SurveyOS web page says: We will give explanations, go into details, provide examples, and explain technical jargon. Our specifications will be designed to read like a book or tutorial, not like a specification. So why not write a book or a tutorial for the OGC specs?
The OGC Provides A Standard But No Implementation. There are a few open source projects that implement the OGC standards ... Every standard adopted as a part of GSFS will be developed in conjuction with an open source implementation that can be used as an example for other developers.
This is the worst of the lot. Let's rephrase this to tell it like it is: Because some folk have spent years developing some technically excellent standards and have done the initial implementations in the commercial world, and not built us an open source implementation, we will take a few months (years?) to develop our own standards, and build our own implementations which won't interoperate with the others. How does that help anyone? There is nothing to stop the surveyOS guys from building their own open source implementations of any OGC spec.
They conclude with this statement:
The SurveyOS will make an effort to learn from and adopt portions of the OGC standards, but believes the problems listed above require a separate effort at creating geospatial standards.
What a waste of effort. I agree that the OGC standards are daunting in their complexity and volume, and agree that they depend heavily on the ISO standards, which are expensive to obtain, but in my experience they have covered such a lot of ground that repeating the effort will be really nugatory.
I'm a big fan of open source developments. Indeed the NDG is open source, and building on OGC protocols. I recognise the advantages of competition in developing implementations, but if one wants to repeat the standards work, one should base the argument for doing so around where the standards are deficient, not whether or not there are open source implementations of the standards!
Is the bottom line that these people are too lazy to read the specs properly?
Food for thought in the software industry, or any industry actually!
Today I'm moving office - only about 50 yards along a corridor, but it's the usual trauma: do I need to keep this? Can I throw it away?
Ideally most of the paper would become electronic ... what I would give to be able to search my photocopied papers and grey material and books! But even the academic papers are beyond help ... I have a thousand plus of the things, all nicely entered in a bibtex file with a box name associated with them ... but they'll never become digital, I'm sad to say. What will be interesting is when I'll feel comfortable enough with finding digital copies (for free) on the web to biff them. I wonder how long that will be?
by Bryan Lawrence : 2006/04/03 (permalink)