Bryan's Blog 2010/09
cyberinfrastructure for data
Today I'm sitting in a "data task force" meeting organised by NSF to address future cyberinfrastructure requirements. We're in a Microsoft building in Redmond: comedy moment as it seems that of the three dozen or so folks here, half of them are using Macs ... (and I suspect a few more are like me, using Linux). That said, I'm obviously drinking from the kool aid, because there is a feeling afoot that Microsoft are doing the right thing in a number of fields nowadays (unlike Oracle who are trending darkwards).
Anyway, I was asked to present a couple of slides on challenges for open access data repositories ... but clearly I don't believe that such challenges are independent of the science, so I produced a few more slides to give context. I also wanted to make a lot of points, not just a couple ... so I did that, but I do get to some big points eventually.
My slides are here (pdf). The main points I wanted to make are:
(slide two): Climate science is a global problem, as indicated by CMIP5, which involves dozens of organisations producing petabytes of data organised into millions of datasets. The solution to handling this sort of thing has to be global. (I could have led with something from the surface temperature meeting a couple of weeks ago, the same imperative for a global solution would emerge).
(slide three): For all that, the global solutions have to be supported by national solutions. In the UK we might consider, for large scale earth simulation, developing a national cache - for simulations to be analysed "centrally" (to avoid a many-to-many copying problem of large volumes of data). To do that, we need high bandwidth, reliable network links to data producers internal to the UK, and further afield. One also wants to allow folks to analyse the data on their own terms: hence the provision of something with the attributes of a private cloud.
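A rough count shows why a central cache avoids the many-to-many copying problem. The sketch below assumes a hypothetical set of sites that each hold one dataset and each want everyone else's data; the function names and site counts are illustrative, not taken from the slides.

```python
def many_to_many_copies(n_sites: int) -> int:
    """Copies needed when every site pulls every other site's dataset."""
    return n_sites * (n_sites - 1)


def central_cache_copies(n_sites: int) -> int:
    """Copies needed when every site ships its dataset once to a single cache."""
    return n_sites


if __name__ == "__main__":
    # Print: number of sites, many-to-many copies, central-cache copies.
    for n in (5, 20, 50):
        print(n, many_to_many_copies(n), central_cache_copies(n))
```

The first count grows with the square of the number of sites and the second only linearly, which is the case for analysing the data next to the cache rather than copying it everywhere.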
(slide four): Just to make the point that it's not just simulation: earth observation data is growing massively too, and it is also globally distributed. This is just ESA.
(slide five): And to make the point that sensors are proliferating (from lamp posts, ships and planes, to long term observatories).
(slide six): and in passing, to note that the amount of available storage can't keep up with the data being produced.
(slide seven): Getting to the meat of it all: All these data sources are globally distributed, heterogeneous, and voluminous - and increasing on all those axes (more distributed, more heterogeneous, more voluminous). All of these data products need to be integrated, manipulated, and understood - and preserved (most of these data cannot be recaptured). These lead to major challenges in designing, building and delivering reliable:
Global data systems, with global data movement, and global caches!
National data systems, connected to major data sources internally and externally - with virtualised computing.
Metadata systems to drive everything: they need to generate as much provenance as possible automatically, but intuitive human tools are needed. They need to be really reliable and quite prevalent. They need to be smart, and exploit as much machine understanding as possible via ontologies etc.
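As a minimal sketch of what "generate provenance automatically" might mean in practice, the Python below wraps a processing step so that each call records what ran, fingerprints of its inputs and output, and when. Everything here (the `with_provenance` decorator, the `PROVENANCE_LOG` store, the `monthly_mean` step) is a hypothetical illustration, not a description of any existing system.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist these records.
PROVENANCE_LOG = []


def _digest(obj) -> str:
    """Short, stable fingerprint of a JSON-serialisable value."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def with_provenance(func):
    """Record which step ran, on what inputs, producing what output, and when."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "inputs": _digest([args, kwargs]),
            "output": _digest(result),
            "when": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@with_provenance
def monthly_mean(daily_values):
    """Example processing step: average a list of daily values."""
    return sum(daily_values) / len(daily_values)
```

Calling `monthly_mean([10.0, 12.0, 14.0])` returns the mean and appends one record to `PROVENANCE_LOG` without the scientist doing anything extra; human-supplied notes could then be attached to the same record.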
(slide eight): An attempt to depict all the things one needs to consider if one wants to avoid a WORN ("Write Once Read Never") archive. I tried to produce a stack diagram based on things ranging from the technology driven (the computer architectures) through to the science driven (the services, portals and visualisation needed). Key points that I noted include:
Choosing the storage system is becoming more and more a balancing act between the energy cost (in Joules used and produced) and the opportunity cost (immediate access via high bandwidth or retrieval from offline or nearline media).
Nothing will scale without metadata (all sorts), and interoperability (both here and now between software and human and over time) will require major investments in semantic tooling ... significantly beyond just marking things up with unconstrained RDF ... and dealing with the heterogeneity will require exploiting model driven architectures to build information systems which understand the metadata and data structures. This will depend on usage conventions for both data and metadata formats and schemas. It will also depend on data modelling paradigms and tooling that are significantly better than those available now - including data specific metamodels which are richer and more comprehensive. We will need automatic tools which capture provenance, but we will also need to continue to supplement such information with human input.
Applications will be decomposed into server side activities and client side activities, decoupled by networks that are not really able to cope with high volume data transfers - particularly over long distances (often the problems may be in the first or last mile, but only exposed with high volume long distance transfers). (This point is not about the bandwidth, which may be there, but not utilisable because of component and/or software incompatibilities.)
Server side calculations will become more and more important - and these will require both "free form" APIs (essentially allowing the equivalent of machine and/or script virtualisations - because point and click doesn't scale) and fixed APIs (allowing the construction of complex portals for those who can and do want to explore pre-canned functionality).
It's important that the systems have appropriate security policies along with authentication and authorisation tooling - because even if the data is intellectually open, it's useless if the systems are overloaded and/or compromised.
Two overarching points from this slide:
The existing cyberinfrastructure at pretty much every level is far too flakey. Research councils around the world have invested in bleeding edge infrastructure research, but have not necessarily invested in reliable infrastructures (and here I'm talking about all the levels of infrastructure - although strangely they often invest heavily in networks that have high bandwidth potential).
Much of the heterogeneity can be addressed by improved standardisation, along with better tooling for specialisation and extension, so that the gamut of heterogeneity can be reduced to a manageable number of activities.
(slide nine): My final slide was making the point that all of this is predicated on a number of social and cultural challenges:
Rewards: Everything is predicated on metadata? That means there need to be rewards for metadata, since metadata creation is a commons issue: you yourself don't get the direct benefit, but you benefit from the efforts of others.
Curation: Over time, the information needs to be migrated through the requirements of new/changed user communities. That needs people with appropriate careers to manage the evolution of the relevant ontologies (they can't be expected to migrate the data itself without automatic tooling - there will be too much data for it to be a manual process, and even too much data to imagine manual appraisal/disposal could suffice instead).
Citation: Back to the reward! With citation, effort is rewarded, and the edifice of the scientific method is maintained.
Licenses and IPR: all of this will depend on clear and unambiguous licensing and IPR rules.
Trust & Reliance: A global, national and institutional infrastructure will need to have interdependent components that can be relied upon, now and into the future. Folks will need to trust that URIs issued by one institution will be de-referencable in the future.
Plans: None of this can be done without plans that have appropriate timescales, that are understood by individuals, communities, and funders alike, and that are updated as appropriate.
In the discussion I also talked about the importance of policies that mandate that publications should be associated with either the data themselves or information on where the data can be obtained.
Under pressure to pull out a couple of specific recommendations:
NSF should invest in reliable infrastructure for data management over the long term, and in better and more reliable tooling to support that.
It should continue to invest in the computing techniques to deliver model driven architectures for data and metadata services - and follow it up with funding to develop reliable implementations, and
Ensure it has appropriate data management policies in place, backed with mechanisms to implement/police them.
Of course, that's just me; this has time to run. We'll see what the task force eventually recommends: hopefully something a good deal more specific and encompassing than those final recommendations (and I guess they'll come after a second, open, workshop).
I have yet to read the paper (pdf sub needed), but via Joe Romm I see that MIT have just had a new study appear, from which they apparently (it's not in the paper itself) produced the following figure:
I don't think it much matters whether the details of these roulette wheels are right or not, because I think the general message is:
if no policy is enacted we face a climate future from a bunch of really bad outcomes, and
if some sensible policies are enacted, we might just have a chance of having a climate future for which our children might be able to forgive us, and
we don't get to choose within these wheels, only between them!
Public Opinion and Climate Change
Public opinion as to the reality of climate change waxes and wanes, but like most things, there are three categories of folks: the non-believers, the unconvinced, and the convinced.
(Don't bother telling me that the same arguments apply in reverse, of course they do, but a significant proportion of those in the convinced camp are scientists, for whom a change in the evidence base would immediately require a reassessment of the conclusions - there are few folks for whom that seems to be true in the first camp.)
Anyway, as has been well described in Aschwanden's article (the second link above), what is needed for the unconvinced, is a convincing narrative.
This year is a record breaking hot year in many ways. Any one year, in and of itself, does NOT contribute to the evidence base (one year doesn't prove a trend, even if for an "unbeliever", it seems to be allowed to prove the absence of a trend). However, one year does contribute to the narrative (and in that sense, the up years contribute to the "convinced" narrative, and the "down" years contribute to the unbelieving narrative).
Anyway, the point of all this is to show how the latest NOAA data contributes to what I would consider to be the correct narrative: that this year is panning out exactly as we expect in a warming world.
For the denizens of the UK, and the west of the U.S., we can see that their personal experience of heat/extremes is pretty minimal - beyond what they see on their televisions. For them, and apparently, especially the blokes, the narrative needs more work. For many, the narrative ought to be getting pretty convincing.
Surface Temperature Workshop
I've just spent three days in Exeter at a Met Office initiated exercise on issues associated with constructing a new surface temperature working programme.
I'm never going to write it up in any detail, but for a flavour, here are my tweets using the #climateobs hashtag (plus two that add some value):
Steve Worley on ICOADS: entire dataset 90GB, issues of data publication, citation, format, software & project sustainability #climateobs (Tue)
#climateobs Jay Lawrimore experience at NCDC (world data centre for climate) on gathering and providing access to historical & current data (Tue)
#climateobs ... surprised to discover that daily precip and max/min temp are not required by WMO ... and therefore difficult to obtain (Tue)
#climateobs Albert Klein Tank: eg why daily extremes to be recorded (and extracted historically) Moscow July 2010, 31 days > 25C, cf av 9.5 (Tue)
#climateobs Will some aims of this project founder on national met services restricting access to data to protect commercial exploitation? (Tue)
#climateobs Now getting short presentations on the white papers (at the website where we introduced the hashtag) ... (Tue)
#climateobs As usual I'm struck with the antiquated approaches in the met/climate data community wrt data modelling and data curation. (Tue)
#climateobs ... but fossils have their role, and at least they don't change (much). Fossils are much much better than missing links. (Tue)
In breakout group on provenance and version control: too busy engaging to tweet.#climateobs (Tue)
When you have a hammer everything is a nail. My hammer is RM-ODP http://bit.ly/azsSsY #climateobs could do with decomposing accordingly (Tue)
kevingashley @bnlawrence Are the antiquated approaches because they've been doing it longer (so practice ossifies) ?
@kevingashley Existing practice certainly a factor. But part is that key influencers don't have the available attention span to look out (Tue)
#climateobs Tension between extensible data formats, and defined formats so that folks know what is wanted. This is such a solved problem! (Tue)
#climateobs Oh no, someone is seriously proposing redefining a new METAR format and constraining it to transmit via BUFR. BUFR? Dinosaurs! (Tue)
#climateobs So I commented from the floor about BUFR. Apparently 180 countries are in the process of moving to BUFR and fear more change. (Tue)
#climateobs ... and I rest my case about dinosaurs. How in the 21st C did we get to a lossy binary format with external tables that change? (Tue)
#climateobs day 2: concentrating on existing and required efforts to homogenise data to produce consistent records (Wed)
#climateobs importance of metadata: some stations moved from city centres to airports around ww2 (or did they)? Impact on trend analysis! (Wed)
#climateobs Matt Menne: Climate Hippocratic Oath: Do not flag good data as bad. Do not make bias adjustments where none are warranted. (Wed)
#climateobs Scary: correlation is all statistics and not physics? Radiosonde analyses without assimilation? Network not really dense enough? (Wed)
#climateobs (I should say the last tweet is my take, no one else has said that.) (Wed)
#climateobs Interesting initiative, about to be rejuvenated: An Informed Guide to Climate Data Sets; http://bit.ly/9yZKyP (Wed)
#climateobs Need a distinction between scientific assessment criteria & testing strategies (including using held back and/or synthetic data) (Wed)
#climateobs day3 Christy: our science is now in court ... that changes everything. Beyond provenance: rules of admissable evidence matter! (Thu)
#climateobs what worked for scientific uptake of datasets: update regularly, easy web access, clear explanations of changes (Thu)
#climateobs google.org relevance? Can they help? Only if they are in it for the long term - worrisome history (google science data) (Thu)
#climateobs Google tools good! But worry about aim to host data in cloud; what happens when they move on & institutional resources lost? (Thu)
#climtateobs Important standards: open provenance model, provenance markup language, iso19156 observations and measurements, netcdf/cf etc (Thu)
#climateobs Yet another community who need to use a lot of words rather than use the word metadata: far too ambiguous (http://is.gd/f2fWm) (Thu)
#climateobs Should data products be allowed "in" (whatever that means) that use data inputs that are not publicly available? (Thu)
CameronNeylon: @bnlawrence As long as license agreements don't contaminate downstream products or access to other data I don't see a problem #climateobs (Thu)
#climateobs Breakout groups back in plenary, my take: governance & ongoing impetus is going to be difficult (like all commons projects) (Thu)
#climateobs: Future ECMWF reanalyses will provide feedback timeseries compatible with the input station net on web in tbd databank format (Thu)
#climateobs Many institutes depend on data sales. If WMO res40 was renegotiated (how)? It would cause extra financial pressure at a bad time (Thu)
#climateobs And the room bifurcates: Should we use all the data or only the open data? Does traceability and openness trump correctness? (Thu)
#climateobs And now I have to leave Exeter ... discussion continues, but you wont hear about it from me. (Thu)
#climateobs Postscript: Feedback about my comments re BUFR. More than a dozen folk came up over next two days to agree: BUFR is a disaster. (Fri)
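For readers wondering what the BUFR complaint in the tweets above amounts to, here is a toy illustration of a table-driven binary encoding. It is not real BUFR: the field name, table entries and table versions are made up. It shows only the two properties objected to: values are scaled to integers, so precision is lost, and the stored number cannot be interpreted without the right version of an external table.

```python
# Hypothetical external tables: field -> (scale, reference).
# Stored integer = round(value * 10**scale) - reference.
TABLE_V1 = {"temperature": (1, 0)}  # tenths of a kelvin
TABLE_V2 = {"temperature": (2, 0)}  # hundredths of a kelvin


def encode(value: float, field: str, table: dict) -> int:
    """Scale a physical value to the integer that would be packed into the message."""
    scale, reference = table[field]
    return round(value * 10 ** scale) - reference


def decode(raw: int, field: str, table: dict) -> float:
    """Recover a physical value; the answer depends entirely on the table supplied."""
    scale, reference = table[field]
    return (raw + reference) / 10 ** scale


if __name__ == "__main__":
    raw = encode(288.147, "temperature", TABLE_V1)
    print(raw)                                    # stored integer
    print(decode(raw, "temperature", TABLE_V1))   # precision lost in encoding
    print(decode(raw, "temperature", TABLE_V2))   # wrong table, wrong value
```

A self-describing format carries the scale and units with the data, so a change of table version cannot silently change the meaning of an old file.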