... personal wiki, blog and notes
Bryan's Blog 2006/04/26
Treasury, via the Office of Science and Innovation, is putting a good deal of pressure on the research councils to contribute to UK wealth by more effective knowledge transfer. The key document behind all of this is the Baker Report. It's been hanging around since 1999, having more and more influence on the way the research councils behave, but from my perspective it's finally really beginning to bite, so I decided I'd read the damn thing, so people would stop blindsiding me with quotes.
The first, and most obvious, thing to note is that it really is about commercialisation, and it's driven by the government policy objective of improving the contribution of publicly funded science to wealth creation. But right up front (section 1.9) Baker makes the point that the free dissemination of research outputs can be an effective means of knowledge transfer, with the economic benefits accruing to industry as a whole, rather than to individual players. Thus the Baker report is about knowledge transfer in all its forms.
The second obvious point is that with all the will in the world, the research councils can't push knowledge into a vacuum: along with push via knowledge transfer initiatives, there needs to be an industry and/or a market with a will to pull knowledge out! Where such industry is weak or nonexistent there is the strongest case to make research outputs freely available as a methodology for knowledge transfer.
Some of my colleagues will also be glad to know that the presumption is that the first priority of the research councils should be to deliver their science objectives:
Nothing I advocate in this report is intended to undermine the capacity of (the Research Councils) to deliver their primary outputs.
Baker actually defines what he calls knowledge transfer:
collaboration with industry to solve problems (often in the context of contract research for industry)
the free dissemination of information, normally by way of publication
licencing of technology to industry users
provision of paid consultancy advice
the sale of data
the creation and sale of software
the formation of spin out companies
joint ventures with industry
the interchange of staff between the public and private sector.
Given that Baker recognises the importance of free dissemination of information, it's disappointing that he implies that data and software are not candidates for free dissemination. Of course, he was writing in 1999, when the world of open source software was on the horizon, but not really visible to the likes of Baker. So I would argue that, had he been writing today, the creation of open source software by the public sector research establishment would not only fit squarely within these definitions, but he might well have explicitly included it (indeed he probably would have been required to). In terms of free dissemination of data, most folk will know I'm working towards the formal publication of data, so that fits in this definition too.
I was also pleased to see (contrary to what others have said to me) that Baker explicitly (3.17 and 3.18) makes the point that knowledge transfer is a global activity, and the benefit to the UK economy will flow whether knowledge is transferred directly into UK entities or via global demand. The key point seems to be that the knowledge is transferred, not where it is transferred to (although he sensibly makes the point that where possible direct UK benefit should be engendered).
Where it starts to go wrong, or at least where the reader can get carried away, is the emphasis in the report on protecting and exploiting Intellectual Property. At one point he puts it like this:
the management of intellectual property is a complex task that can be broken down into three steps; identification of ideas with commercial potential; the protection and defence of these ideas and their exploitation.
There is a clear frame of thinking that protecting and defending leads to exploitation, and this way of thinking can very easily lead one astray. It certainly doesn't fit naturally with all the methods of knowledge transfer that he lists! It can also cause no end of problems for those of us with legislative requirements to provide data at no more than cost to those who request it (in particular the environmental information regulations; see, for example, the DEFRA guidance. Note, though, that EIR don't allow you to take data from a public body and sell it or distribute it without an appropriate license, so the conflict isn't unresolvable).
Baker does realise some of this, of course; he makes the point that:
There is little benefit in protecting research outputs where there is no possibility of deriving revenues from the work streams either now or in the future.
I was amused to get to the point where he recognises that modest additional funding would reap considerable reward, but of course that money hasn't materialised (as far as I can see, but I may not be aware of it). As usual with this government, base funding has had to stump up for new policy activities. (This may be no bad thing, but it's more honest to admit it: the government is spending core science money on trying to boost wealth creation. Fine, and indeed we have been doing knowledge transfer, and will continue to do so, from our core budget, but the policy is demanding more.)
The final thing to remember is that the Baker report is about the public sector research establishment itself. My reading of it definitely didn't support the top-slicing of funds from the grant budgets that go to universities to support knowledge transfer, but that's what is happening. Again, perhaps no bad thing, but I don't see Baker asking for it (although there is obvious ambiguity, since it covers the research councils, but when a research council issues a grant, the grant-holding body gets to exploit the intellectual property).
So the Baker report was written in 1999, but government policy is being driven by rather more recent things too. Over the next couple of months, I'll be blogging about those as well (if I have time). One key point to make in advance is that knowledge transfer can and now does include the concept of science leading to policy (which is of course a key justification for the NERC activities).
I'm still naive
I've just read the RealClimate post on how not to write a press release. I was staggered to read the actual press release that caused all the fuss (predictions of 11C climate sensitivity etc). The bottom line is that had I read that press release without any prior knowledge I too might have believed that an 11 degree increase in global mean temperature was what they had predicted (which is not what they said in the paper). I can't help putting some of the blame back on the ClimatePrediction.net team: the press release didn't properly reflect the message of their results, and they shouldn't have let that happen. I'm still naive enough to believe it's incumbent on us as scientists to at least make sure the release is accurate, even if we can't affect the resulting reporting.
Having said that, I thought Carl Christensen made a fair point in the comments to the RC post (number 13): to a great extent, the fuss is all a bit much, and we should concentrate on the big picture. The London Metro isn't the most significant organ of the press :-)
(I stopped reading the comments around number 20 ... does anyone have time to read 200 plus comments?)
Amusingly, I read the RC post (and the final point: "not all publicity is good publicity") less than four hours after I listened to Nick Faull (the new project manager for cp.net) ruefully review the press coverage (in a presentation at a NERC e-science meeting). He finished up with "but at least they say no publicity is bad publicity" ... one can find arguments to support both points of view!
As an aside, we spent some time at the NERC meeting discussing the fact that the current climateprediction experiment is completely branded as a BBC activity, even though NERC is still providing significant funding (and seeded the project in the first place as part of the COAPEC programme) ... this at a time when NERC needs to get its knowledge transfer activities visible.
Data Storage Strategy
Recently I was asked to come up with a vision of where the UK research sector needed to be in terms of handling large datasets in ten years' time. This is being fed into the deliberations of various committees who will come up with a bid for infrastructure in the next government comprehensive spending plan. Leaving aside that the exam question is unanswerable, this is what we came up with:
The e-Infrastructure for Research in the next 10 years (Data Storage Issues)
The Status Quo
In many fields data production is doubling every two years or so, but there are a number of fields where in the near future, new generations of instruments are likely to introduce major step changes in data production. Such instruments (e.g. the Large Hadron Collider and the Diamond Light Source) will produce data on a scale never before managed by their communities.
Thus far disk storage capacity increases have met the storage demands, and both tape and disk storage capacities are likely to continue to increase, although there are always concerns about being on the edge of technological limits. New holographic storage systems are beginning to hit the market, but so far they are neither scalable nor fast enough to compare with older technologies.
While storage capacities are likely to meet demand, major problems are anticipated in the ability both to retrieve and to manage the data stored. Although storage capacities and network bandwidth have been roughly following Moore's Law (doubling every two years), neither the speed of I/O subsystems nor the storage capacities of major repositories have kept pace. Early tests of data movement within the academic community have found in many cases that the storage systems, not the network, have been the limiting factor in moving data. While many groups are relying on commodity storage solutions (e.g. massive arrays of cheap disks), as the volumes of data stored have gone up, random bit errors are beginning to accumulate, causing reliability problems. Such problems are compounded because many communities are relying on ad hoc storage management software, and few expect their solutions to scale with the oncoming demand. As volumes go up, locating specific data depends more and more on sophisticated indexing and cataloguing. Existing High Performance Computing facilities do not adequately provide for users whose main codes produce huge volumes of data. There is little exploitation of external off-site backup for large research datasets, and international links are limited to major funded international projects.
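As a rough illustration of what "doubling every two years" implies over a ten-year horizon, here is a minimal Python sketch. The 1 PB starting volume is a made-up figure chosen only to make the arithmetic easy to read; it is not a number taken from any of the communities discussed, and `projected_volume` is a hypothetical helper name.

```python
def projected_volume(start, years, doubling_period=2.0):
    """Volume after `years` if it doubles once every `doubling_period` years."""
    return start * 2 ** (years / doubling_period)


# Hypothetical starting archive of 1 PB (an assumed figure, for illustration only).
start_pb = 1.0

# Print the projected volume every two years over a ten-year horizon.
for year in range(0, 11, 2):
    print(f"year {year:2d}: {projected_volume(start_pb, year):5.1f} PB")
```

Five doublings in ten years is a factor of 2^5 = 32, so anything that fails to keep up with that rate (I/O subsystems, for instance) falls behind quickly.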
In ten years, the UK community should expect to have available both the tools and the infrastructure to adequately exploit the deluge of data expected. The infrastructure is likely to consist of:
One or more national storage facilities that provide reliable storage for very high volumes of data (greater than tens of PB each).
A number of discipline-specific data storage facilities which have optimised their storage to support the access paradigms required by their communities, and exploit off-site backups, possibly at the national storage facilities.
These storage facilities will all have links with international peers that enable international collaborations to exploit distributed storage paradigms without explicitly bidding for linkage funding. Such links will include both adequate network provision and bilateral policies on the storage of data for international collaborators. The UK high performance computing centres will have large and efficient I/O subsystems that can support data intensive high performance computing, and will be linked by high performance networks (possibly including dedicated bandwidth networks) to national and international data facilities.
The UK research community will have invested both in the development of software tools to efficiently manage large archives on commodity hardware, and in the development of methodologies to improve the bandwidth to storage subsystems. At the same time, investment in technologies to exploit both distributed searching (across archives) and server-side data selection (to minimise downloading) will have continued, as the demand for access to storage will have continued to outstrip the bandwidth.
Many communities will have developed automatic techniques to capture metadata, and both the data and metadata will be automatically mirrored to national databases, even as they are exploited by individual research groups.
As communities will have become dependent on massive openly accessible databases, stable financial support will become vital. Both the UK and the EU will have developed funding mechanisms that reflect the real costs and importance of strategic data repositories (to at least match efforts in the U.S. which have almost twenty years of heritage). Such mechanisms will reflect the real staffing costs associated with the data repositories.
In all communities, data volumes of PB/year are expected within the next decade:
In the environmental sciences, new satellite instruments and more use of ensembles of high resolution models will lead to a number of multi-PB archives in the next decade (within the U.K., the Met Office and the European Centre for Medium-Range Weather Forecasts already have archives which exceed a PB each). Such archives will need to be connected to the research community with high-speed networks.
In the astronomy community, there are now of the order of 100 experiments delivering 1 or more TB/year with the largest at about 20 TB/year, but in the near future the largest will be providing 100 TB/year and by early in the next decade PB/year instruments will be deployed.
In the biological sciences, new microscopic and imaging techniques, along with new sensor arrays and the exploitation of new instruments (including the Diamond Light Source) are leading to, and will continue to lead to, an explosion in data production.
Within the particle physics community, the need to exploit new instruments at CERN and elsewhere is leading to the development of new storage paradigms, but they are continually on the bleeding edge, both in terms of software archival tools, and the hardware which is exploited.
Legal requirements to keep research data as evidential backup will become more prevalent. Communities are recognising the benefits of meta-analyses and inter-disciplinary analyses across data repositories. Both legality issues and co-analysis issues will lead to data maintenance periods becoming mandated by funding providers.
With the plethora of instruments and simulation codes within each community, each capable of producing different forms of data, heterogeneity of data types coupled with differing indexing strategies will become a significant problem for cross-platform analyses (and the concomitant data retrieval). The problem will be exacerbated for interdisciplinary analyses.
Government Policy on Open Source Software
All this strategy stuff is making my head hurt. But meanwhile, I think I'll have to start a series of blog entries to provide me with relevant notes. To start with, here is a verbatim quote from the UK government policy on open source software (pdf, October 2004):
The key decisions of this policy are as follows:
UK Government will consider OSS solutions alongside proprietary ones in IT procurements. Contracts will be awarded on a value for money basis.
UK Government will only use products for interoperability that support open standards and specifications in all future IT developments.
UK Government will seek to avoid lock-in to proprietary IT products and services.
UK Government will consider obtaining full rights to bespoke software code or customisations of COTS (Commercial Off The Shelf) software it procures wherever this achieves best value for money.
Publicly funded R&D projects which aim to produce software outputs shall specify a proposed software exploitation route at the start of the project. At the completion of the project, the software shall be exploited either commercially or within an academic community or as Open Source Software
(although the last point has some exemptions, the key one, from my perspective, being trading funds like the Met Office).