Bryan's Blog 2011

Reviewing Parsons and Fox on the data publication metaphor

Mark Parsons and Peter Fox have a draft paper out for public review and are inviting comments there ...

... but I found myself hamstrung by the "comment" paradigm, so I'm responding here.

Firstly, the paper itself: Like Chris Rusbridge, commenting on the blog, I found myself in equal parts agreeing and disagreeing ... but the bottom line is that it is a good question to ask, so I hope a version of this paper appears! The more this gets discussed the better: either I'm right (and data Publication - as I define it - should be a dominant paradigm), or I'm wrong - but if I'm right, it'll never happen until folks discuss and buy into it, and if I'm wrong, well, best I find out :-)

Ok, some specific comments:

  1. There is an underlying assumption in this paper, which bubbles up to plain view in spots, and on the blog, that Publication and sharing are in conflict. I flat out disagree with this ... and don't know of any evidence that supports such an assertion.

    • The Publication metaphor already explicitly supports both via preprints (sharing or publication) and the formal paper of record (Publication).

  2. Metaphors are obviously useful, but as the authors agree, flat out dangerous as well. I've blogged on the importance of "verifiable statements" before (and folks should look at the Tauber link from that!). So do I agree with these particular metaphorical paradigms?

    • Well, no not really: we at CEDA would claim we were playing in all four paradigms as they are defined, and I don't think we're that unusual. It is true that the examples they have chosen can be roughly characterised by these paradigms, but I think there's a bit of selection bias, sadly. Partly, I think that's engendered by using these activities to try and stratify data management as opposed to data activity. For example, we are a big archive (Big Iron1) who have some datasets which we wish to publish, and we are involved in map making and linked data - but I would argue that the latter two are activities that depend on data management, they are not data management per se! Other groups will have very different data management standards and methods, but still be involved in map making and linked data. I think this is recognised to some extent in the paper (via the table where the focus line makes these key points, the qualification that they're not mutually exclusive, and the first paragraph of section 4).

    • So, because these paradigms mix management and activity, they don't contribute easily to the question about publication, and I don't find them helpful in this context.

    • All that said, the discussion of how misunderstood definitions limit understanding is totally on the money (up to the point where there is an implicit assertion that "traditional peer review" is homogeneous enough to be distinguished from what one could do with data Publication). I think PLoS is an interesting analogy here ...

  3. Why do the authors think concepts like "registration, persistence" etc are not relevant (see the reference to Penev et al)? The answer appears to come from the next sentence where there is a conflation between Publication and some (closed) implementations thereof.

  4. I think the dynamic, annotation-infested ( :-) ) world of Publications that I foresee is not inconsistent with open access and clear provenance, yet the authors assert that worrying about provenance and definition is undue? (I defy you to find a URL to a data object that is useful without some implicit or explicit knowledge about the medium that will be returned when you dereference it - obviously definition matters in the data world, and just as obviously some level of containerisation is necessary - even if it's only to say "stream of type X starts here, terminated by this binary string").

    • Of course annotation, federation, transformation are important, but if you want to use objects from that complex world, in a way that is scientifically useful, you need to know whether you can repeat or replicate your workflow!

    • In particular, one often doesn't need to see all that "ecosystem" in the workflow. The annotations are logically distinct from the data (and may themselves be different publications). If they do need to be together, then the anthology is a useful metaphor - publish poems on their own, and in collections!

    • What is linked data, but simply linked data? ... the objects that are linked, may, or may not be Published ... fair enough ... no conflict there?

    • So I simply don't buy that Publication is in conflict with unlocking the deep web! It's orthogonal in some sense!

  5. All those picky disagreements aside, I find myself agreeing: the various paradigms available are all lacking in some way - but everything is a compromise. The analysis of their paradigms is fair enough ... in general ... but the conclusion lacks some of the rigour that they might be able to take from our paper.

  6. I really liked the discussion of infrastructure and ecosystems ... particularly when it got to recognising the different roles that exist in Publication and how they can be decoupled. (Again, our paper helps in this regard, although I think it could be usefully extended using the arguments that Parsons and Fox have espoused).

    • When we get to releases, we start to see the concept of versions (editions anyone?) of data ... right on!

    • I think there is real scope to consider how the infrastructure and ecosystem around data Publication will be (should be) very different from those around literary publication. I'd really like that section to be expanded ...

  7. ... it is important not to be hidebound by ... any one metaphor. Yes, Yes, Yes.

All of which is me saying my definition of Data Publication is clearly different from that of Parsons and Fox. Which is utterly fine ... neither of us is right or wrong, we just need to define what we are talking about!

Ok, now to consider the blog traffic :-) (which at the time of writing was up to comment 10).

The one point that I think is explicitly different from any of the points above is the importance of social science and philosophy in establishing both our expectations and our practice ... absolutely true!!!

The other point (to reiterate) is that data Publication is not in conflict with sharing, but there is a time for sharing and a time for not sharing - and whatever system(s) we come up with have to leave the decision as to how much sharing is wanted or desired to individuals (or their funders) - the "infrastructure" cannot make that decision (positively or negatively). It needs to facilitate it!

(In particular, it is just fine for scientists to hold some data close to their virtual chests ... consider, if nothing else, health records, or the geographical location of the last members of particular species etc).

Finally, I too am a big advocate of citation, but it is not the answer, though it is certainly an answer! Any metaphor is both useful and limiting, and even the best solutions are only useful 80% of the time. That means this data "publication" paradigm will involve multiple solutions. Bring on the separation of concerns :-)

1: or not? Like Chris in the comments on their blog, we're petascale, but all commodity hardware ... today!

by Bryan Lawrence : 2011/12/16 : 0 trackbacks : 2 comments (permalink)

Communicating Science (to the media)

Interesting seminar here this morning on the subject.

Some memes that came out of this (and a previous staff meeting):

  1. Most academic staff don't see public/media communication as their job, and that's a combination of being over-worked in the first place, not thinking we're particularly good at it (we tend to use over-long, over-qualified sentences :-), and not wanting to get into a fight with some nutter in honour of some spurious notion of balance.

  2. We (academics) need to remember the reality of the news cycle: most stories have a four-hour life expectancy, and most of them are sprung from an event, and prepared in a matter of hours, if not minutes. For television and radio particularly, waiting to do a properly researched story would, in many cases, mean the story would never appear - bad quick information wins over good slow information!

  3. It's important to remember that "show business" matters; an accurate story which is boring won't be listened to (or read)! So, like it or not, if you want to communicate, you have to play by the "make it brief, make it interesting, and do it with panache" rules. (Oh no, see the second caveat in 1. above :-).

  4. (However, in the 24-hour news cycle, you might be able to negotiate a "serialisation" of your message into byte (pun!) size chunks - giving a reporter several stories - but you'd have to be lucky :-)

  5. Some survey of stories in the "quality US newspapers" (editor: no comment :-) suggested that 85% of the stories were founded on only one source, and often that source was a press release! (Gosh!)

  6. Prepare for interviews, but be available. Saying "Not now, I'm available in an hour" might mean you're not "used" ... but if you're going to be interviewed, spend at least ten minutes trying to work out how to get your message across. If it's a soundbite interview, your message had better be in 45 words (or less if you have to deflect a spurious question "on-message"). If it's a three-minute "radio-4-type" interview, it'd better be 200-300 words max (and be prepared to deflect several spurious questions "on-message").

  7. Often a story/interview will go better if you take five minutes to send someone specific facts/figures in a few bullets in an email - it might well get them "on-message" and asking about the right things!

And the bottom line, for us climate scientists: As Edmund Burke said

All that's necessary for the forces of evil to win in the world is for enough good men to do nothing.

And we are, most of us, funded by the public (one way or another), to do the right thing for the public ... a bit like open data really I suppose!

(Update: I should have noted the seminar speaker was Jon Barton of Clarify Communications.)

by Bryan Lawrence : 2011/12/14 : Categories media : 0 trackbacks : 0 comments (permalink)

data scientists are anally retentive too!

Well, that title ought to make this blog article inaccessible to those of us with anally retentive firewalls, but because it's only a minor misquote of an article from the Times Higher Education, it is legitimate!

The original quote comes from Professor Nunberg, who said:

"like most high-tech companies, Google puts a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work librarians are used to."

Well yes: but I think Professor Nunberg said the right thing for the wrong reasons. Firstly, the context: Google Books has a lot of metadata, and a lot of it is wrong, and Nunberg was effectively saying that librarians do metadata better.

Well no: librarians clearly have procedures in place for checking metadata, but they don't necessarily have expertise, and even if they do, that might not be enough. IMHO, in the long run, crowd-sourced metadata (where one has potentially multiple experts on tap) will vastly exceed (individual) expert-sourced metadata. Because when metadata is more than just (title, author, year), metadata really is just another sort of content, and when it comes to descriptive content, Wikipedia v Britannica (e.g. here and here) shows us the direction of travel. Crowd sourcing wins, in the long run.

(Caveat: generating information is still an individual or at best a team game, you can't crowd source information ab initio, even if you can crowd source the evaluation and description of it. Caveat-squared: you can of course crowd source data - crowds can generate data. Anyway, enough of this digression.)

However, where librarians do matter, and data scientists too - for whom the same argument applies - is in setting the initial conditions! Since, after all, we're never "in the long run", we're always getting there, and we get there much faster if we know where we started from. Attention to detail is a great quality for librarians and data scientists alike (actually all scientists); such anal retention is an integral requirement for many of us ...

Taking us back to the THE article, clearly if the index doesn't point you to the vicinity of the thing you want, it will take a long time to get there. Similarly, if you don't have good scientific metadata, it takes a long time to choose the tree from the wood, and having done so, you might not have an initial evaluation for how good that tree was ...

What Nunberg might have said (who knows, maybe he did say it, I only had the THE article to go on) was: why on earth didn't Google start with an algorithm to utilise the professionally curated metadata rather than their own? And why not put a proper crowd-sourcing front end on it, so it can be improved?

All of which is rather ho hum, but what really got me thinking was the parallel between Google, and the attitude of a great many of the scientific community to managing their data, in which case I'd say:

Like Google, most scientists put a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work that is necessary for society to get the best out of the data it has paid to collect!

Which is to say, it's not just being anally retentive that matters, it's what you're anally retentive about.

OK, well that's the last time this subject (a-r) comes up on this blog ... even if it was enough to provoke me out of my workload-induced-blog silence (although in truth, what was really needed was the sudden cancellation of a scheduled meeting).

by Bryan Lawrence : 2011/12/08 : Categories metadata curation : 0 trackbacks : 1 comment (permalink)

European Exascale Software Initiative

Last week I spent two days at the lovely Domaine de Tremblay (an hour from central Paris):

Image: static/2011/07/02/eesi.tremblay.jpg

This was the venue of a workshop organised by the European Exascale Software Initiative (EESI). The project has an interesting blend of activities; four domain working groups

  • industrial and engineering applications,

  • weather, climatology and earth sciences,

  • fundamental sciences,

  • life sciences and health,

cut across four technical working groups

  • hardware roadmaps (and vendor links),

  • software ecosystem,

  • numerical libraries, solvers and algorithms, and

  • scientific software engineering,

to try and work out what Europe will need to do to exploit future exascale computing systems (which is of course a software problem).

I didn't keep proper notes, but I did tweet a bit, here's my tweet stream from the meeting:

  • Sitting in a long narrow room ready for two days on the European Exascale Software Initiative. Not expecting a hashtag here #eesi

  • Some real ambition in this room. Complete simulation of aircraft and airflow, would require zetaflop computing! #eesi

  • I so do NOT believe in codesign: if the s/w life cycle is much longer than the h/w life cycle, it's a non starter! #eesi

  • Other domains (genomics, seismology etc) as well as climate stress importance of exaSCALE cf exaFLOP! Need more balanced HPC! #eesi

  • Hardware review: interesting to note that the only thing all vendors are actively researching is hw fault tolerance! Frightened? #eesi

  • Worries about parallel file system futures: scalability and reliability, file number, metadata performance. Few vendors in the game! #eesi

  • Programming models needed which better allow incremental parallelisation, more decoupling between h/w and s/w, and easier portability #eesi

  • Not all faults expected to be masked by h/w, or tolerated easily by application (h/w correction will generate noise, global restart) #eesi

  • Quadrant diagram: divide tasks into ((production, testing) versus (slower, faster)). Helps decide between capability versus capacity #eesi

  • Not acceptable to try and control 3 levels of parallelisation by hand: system, nodes, cores. Improve abstraction in programming model #eesi

  • Future algorithms might have to start with single precision, because SP uses less power, and then only use double precision later #eesi

  • Crickey. Someone asked about libraries for symplectic integration. I can remember the noun Hamiltonian. Can't remember what it means! #eesi

  • Definition of co-design: h/w perspective, we'll work with s/w community so you can change your apps to use our (slightly changed) h/w. #eesi

  • Definition of co-design: s/w perspective, we'll work with h/w community so you can change your systems. Will they ever meet? #eesi.

  • exascale hybrid core future: we might need to spend half the money we used to spend on computers on s/w engineers so we can use them! #eesi

  • Bloke in audience: why does data management matter for exascale? Everyone: because the flops are useless if you can't feed & use them! #eesi

  • Speaker guess: 3 h/w generations between us & exascale? Me sotto voce: If they get more heterogeneous each time we'll never get there! #eesi

  • One breakout grp arguing for s/w isolation layer to hide h/w. Next asking for co-design between h/w and s/w. Draw your own conclusion #eesi

  • Programmability: Petascale systems are not easy to programme. Do we really believe it will get easier at exascale? Yet it must! #eesi

  • Programmability: An assertion: Parallelising serial codes will almost certainly fail. Decompose algorithm from 1st principles instead. #eesi

  • "If you want an exascale initiative of relevance: you will need to invest big time in REdeveloping applications (new algorithms)" #eesi

  • ... "you will also need to invest heavily in training, and a proper career path for scientific computing/computation ..." #eesi

I also chaired and reported on a breakout group on cross-cutting data issues. The breakout report is available as a pdf. Although the turnout to this breakout session was relatively low, a couple of folks came up to me afterwards, and bemoaned having to be in two places at once, and having had to choose programmability as a (slightly) higher priority. This meeting confirmed to me, once and for all, that nearly every scientific domain is struggling with big data handling, particularly adjacent to HPC!

There was also considerable discussion in our breakout group about software sustainability, a theme that was picked up the next day in the EPSRC strategic advisory team meeting on research infrastructure.

(Yes, I did get home from a lovely venue in Isle de France and go straight to Polaris House in Swindon the next day, you can imagine how the comparison felt.)

Nearly every meeting I go to makes the following point: Given that software is such a big and crucial part of research infrastructure nowadays the system is going to have to change:

  • More professional software development and maintenance in academia, coupled with

  • Better careers for those in software support (and software leadership) roles, and

  • Sustained funding from research funders for codes which underpin the science community.

by Bryan Lawrence : 2011/07/02 : Categories exascale hpc : 0 trackbacks : 2 comments (permalink)

Thought provoking talk on climate policy

Today at the ICTP workshop on e-infrastructure for climate science, George Philander gave a really thought provoking talk on global warming and aspects of policy consequences - particularly in an African context.

His basic meme ran along the following line of argument:

  • We know that in Pliocene times the earth appeared to be in a permanent state of El Nino (no evidence for cold water intrusions in equatorial regions).

  • No current climate model can replicate that state, so

  • We can only trust models for small perturbations from the current state

    • He alluded to a paper by Carl Wunsch pointing out that models can't be tested in paleo times ... and to a paper by Paul Valdes. I'm going to chase up both.

  • So, he worries about global warming, not because we know what is going to happen, but because we don't, which implies that

  • this is a time for circumspection. He likens us (the planet) to a ship in fog. It's time to slow down and sound our way!

He also had a lot to say about the policy paralysis which, amongst most serious analysts, comes down to an argument about the discount rate based on our current predictions for the future. Should we act now or later? But this is a rich person's argument! For those in poverty there is no argument, the only choice is later! However, the argument over discount rate is predicated on (relatively) small perturbations from the current state, and if our models don't do serious perturbations, that's a matter of concern. As he puts it, we will know these things, but we don't yet, so let's put the argument in terms of the planet being a special place (habitable) at a special time (warm interglacial). Let's treat that planet with some circumspection until the future is clearer!

What exactly to do, if your perspective is that of trying to lift millions from poverty, even as you want to do it without burning coal, is a moot question. I don't think any of us know the answer - Philander certainly didn't claim an answer - but it's a question that I hadn't really given any thought to before today.

Most of these ideas appear in a paper called "Where are you from? Why are you here? An African perspective on global warming." which appears in the Annual Review of Earth and Planetary Sciences, pdf. It's well worth a read, since amongst other reasons, you won't often see a reproduction of a Titian in the same article as a figure showing millennial temperature changes etc. There is also a call to arms for an African centre to study earth sciences. I hope it succeeds!

by Bryan Lawrence : 2011/05/18 : 0 trackbacks : 0 comments (permalink)

hpc futures - part two

What is this exascale wall that I've been tweeting about?

The last decade of advances in climate computing have been underpinned by the same advances in computing that every one of us saw on our laptop and desktop until a couple of years ago: chips got faster and smaller. The next decade started a couple of years ago: processors are not getting faster, but they are still getting smaller, so we get to have more processors - there are even dual-core smartphones out there now! What's not so obvious though is that the memory per core is generally falling, and while the power consumption is falling, it's not falling fast enough ... pack a lot of these (cheap) cores together, and they use a ludicrous amount of power. The upshot of all this is great for smartphones, but not so good for climate computing: the era of easy incremental performance (in terms of faster time to solution, or more complexity/resolution at the same time to solution) in our climate codes is over. Future performance increase is going to have to come from exploiting massive parallelisation (tens of millions to a billion threads) with very little memory per thread - and it'll come with energy cost as a big deal. I first wrote about exascale on my blog back in August last year. (I promised to talk about data then, and I will here ...)

What all this means is that our current generation of climate model codes probably have at best a few years of evolution in their current incarnation, before some sort of radical reshaping of the algorithms and infrastructures becomes necessary. At the US meeting I'm at now, two different breakout groups on this subject came to the same conclusion: if we want to have competitive models in five years and ten years time, we need to a) continue to evolve our current codes, but b) right now, start work on a completely new generation of codes to deploy for use in the next but one generation of supercomputers. That's a big ask for a community starved of folks with real hard-core computing skills. Clearly there are a lot of clever climate modellers, but the breakout group on workforce summarised the reality of the issue thus: the climate modelling community consists of three kinds of folks: diagnosticians, perturbers (who tinker with codes), and developers. Universities mainly turn out the diagnosticians, some perturbers, but very very few developers with the skillset and interest to do climate code work.

That's a big problem, but the data side of things is a pretty big problem too. Yes, the exascale future with machines with tens to hundreds of millions of cores is a big problem, but even now we can come up with some scientifically sensible, and computationally feasible, methods of filling such a machine. Colin Jones from the SMHI has proposed a sensible grand ensemble based on a tractable extension of how EC-Earth is being used now (running, in an extreme experimental mode, at an effective 1.25 degrees resolution). An extrapolation of that model to 0.1 degrees resolution (roughly 10km) would probably effectively use 5000 cores or so. If one ran an ensemble of 50 members for a given start date, at the same time, it could use 250,000 cores. Ideally one would have a few different variants of this or similar models available, capturing some element of model uncertainty, let's say 4. Now we can use 1 million cores. To really understand this modelling "system", we might want to run 25 year simulations, but sample a range of initial states, let's say 40. Now we can use 40 million cores. This is an utterly realistic application. If a 40 million core machine was available, and we could use it all, this would be an excellent use of it (there are other uses too, and for those we need the new codes discussed above). But let's just consider a little further.
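
To make the arithmetic explicit, here's a back-of-the-envelope sketch in Python; the per-instance core count and the ensemble dimensions are just the rough figures quoted above, not measured numbers.

  # Back-of-the-envelope core count for the grand ensemble described above.
  cores_per_instance = 5000   # one 0.1-degree model instance (rough figure from the text)
  ensemble_members = 50       # members per start date
  model_variants = 4          # model variants capturing some structural uncertainty
  start_dates = 40            # sampled initial states

  total_cores = cores_per_instance * ensemble_members * model_variants * start_dates
  print(total_cores)          # 40,000,000 cores in flight at once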

Colin tells me that the 1.25 degree (actually T159L62) model produces roughly 9 GB of data per simulation month writing out onto an N80 reduced Gaussian grid (which means you can double the following numbers if one wanted "full resolution"). Scaling up to the 0.1 degree version would result in 1.4 TB/month, and the grand ensemble described above would result in a total output of around 3 exabytes! For a reasonable time to solution (150 hours or a week all told, that is 6 hrs/model_instance_year), it would require a sustained I/O from the machine to storage of around 50 Tbit/s.
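
To keep the scaling honest, here's the same sort of sketch for the data; the 9 GB/month and 150 hour figures are the ones quoted above, everything else is simple scaling, so treat the outputs as order-of-magnitude only.

  # Back-of-the-envelope data volume and I/O for the grand ensemble.
  gb_per_month = 9.0                      # T159 (~1.25 degrees), N80 output, per instance
  scale = (1.25 / 0.1) ** 2               # horizontal grid-point scaling to 0.1 degrees
  tb_per_month = gb_per_month * scale / 1000.0
  print(round(tb_per_month, 1))           # ~1.4 TB/month per instance

  instances = 50 * 4 * 40                 # members x variants x start dates
  months = 25 * 12                        # 25-year simulations
  total_bytes = tb_per_month * 1e12 * instances * months
  print(round(total_bytes / 1e18, 1))     # ~3.4 EB in total

  seconds = 150 * 3600                    # 150-hour time to solution
  print(round(total_bytes * 8 / seconds / 1e12))   # ~50 Tbit/s sustained to storage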

Archive and analysis may be a problem! Remember this 3 exabytes of data could be produced in one week!

At this point it's probably worth considering an exascale computer not as a "computer" but as a "data source" ... it's a bit of a paradigm shift isn't it? Even without heaps of work on exascale software, our exascale computer can produce an outrageous data problem. We need to start thinking about our computer in terms of its analysis and archive capability first, then we can think about its computational ability, and indeed, how to get our codes to address other important problems (such as faster time to solution so we can have better high-res paleo runs etc). This ought to be affecting our purchasing decisions.

Hang on a moment though. The obvious immediate rejoinder is "we don't need to write all this data out". So what can we do to reduce the data output? We can calculate ensemble statistics in the machine, and we can choose only to write out some representative ensemble members. That might gain us a factor of 10 or so. We could simply say we'll only write out certain statistics of interest, and not all the output, and that's certainly feasible for a large class of experiments where one is pretty sure the data will get no re-use, because the models are being deliberately put in some mode which is not suitable for projection analysis or extensive comparison with obs - but many of these ensemble experiments are very likely to produce re-usable data. Should it be re-used?

Well, consider that our 40 million core machine will probably cost around 50 million USD when we get hold of one. If we depreciate that over five years say, then it's about 10 million per year (in capital cost alone) or about 200,000 USD per week to use. Double that figure for power costs and round up to 500,000 USD for that grand ensemble run. I have no idea what storage will cost when we can do this run, but my guess is that the storage costs will be of the order of 3 million USD. To be generous, let's imagine the storage costs will exceed the run time costs by a factor of 10.

(Where did that number come from? Well, today's tier-1 machine might have 50,000 cores, and it costs of order 1 million USD per PB. To go to 50 million cores, we scale CPU by 1000; let's imagine we scale data costs accordingly. So when I have a 50 million core machine, I'll be able to get an EB of storage for the same price as today's PB.)
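
Laying the cost arithmetic out explicitly (all the inputs are the guesses above, so the outputs are order-of-magnitude at best):

  # Back-of-the-envelope costs for the grand ensemble run.
  machine_cost = 50e6                        # USD, guess for a ~40-50 million core machine
  weekly_capital = machine_cost / 5 / 52     # five-year depreciation
  run_cost = 2 * weekly_capital              # double for power; the run takes about a week
  print(round(weekly_capital), round(run_cost))   # ~192,000 and ~385,000 -> "round up to 500,000"

  usd_per_eb = 1e6                           # today's ~1M USD/PB, scaled by the same factor of 1000 as the cores
  storage_cost = 3.4 * usd_per_eb            # ~3 EB of output
  print(round(storage_cost))                 # ~3.4 million USD: several times the run cost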

So the big question then is, how many application groups are likely to be able to exploit my run, and can I anticipate their specific data output needs? Well, my guess is that for an ensemble experiment as discussed above, there would be value in that data for considerably more than ten different groups - be they parameterisation developers, observational analysts, or impacts advisors!

So, we probably should store as much of it as we feasibly can! We can talk about the analysis effort and the network issues at a later date!

(Actually, when I run the numbers for other possible permutations of output and model configuration, the 9 GB/month we started with seems small; it's entirely feasible to suggest that a realistic IPCC-style output requirement, when scaled out, would result in around 50 times more output - but for an IPCC experiment, hundreds of applications are feasible.)

by Bryan Lawrence : 2011/04/28 : Categories exascale hpc badc : 0 trackbacks : 1 comment (permalink)

Pushing Water Analogies too far

Recently I heard a talk by Kevin Trenberth, who showed a slide with the following cartoon:

Image: static/2011/02/09/firehose.jpg

It was perfect in context, but I also found it amusing in a serendipitous sort of a way: just before he started speaking, I had started (in a coffee break, I hope) googling for images of reservoirs, hoses and sprinklers ... because I had a similar idea in mind, and I wanted an easy way of communicating it.

However, my idea was and is rather different from Kevin's. So why is it different? Firstly, mine was borne out of Ian Jackson's repeated mantra (on the NERC information strategy group) that a water-pipe system is no good without water, and my repeated response that without a pipe system, the water just gets wasted. We were both right of course: a water delivery system is useless without water, and water gets wasted without a delivery system.

To some extent Kevin's image is redolent of wasted water, even though there are clearly components of a system ... but I'd put it slightly differently:

  • I don't think the sensor is analogous to the hydrant, I think it should be analogous to the rain pouring water into a reservoir via runoff, rivers, whatever ...

    • There are plenty of folk who will tell you it's raining data.

    • My point being that there are a plethora of sensors (water delivery mechanisms), each delivering data (water) into a bunch of archives (lakes, reservoirs, whatever).

    • The hydrant is simply a defined interface to which I can couple a hose! It depends on being connected to a local reservoir, which itself may be coupled by a canal or a pipe to another reservoir, or lake or whatever. So the lessons I draw are that:

      • In the case of data, we need local caches, to which we can connect delivery systems which understand a standard interface.

      • But they depend on the existence of connected (and large) managed bodies of water. In my world view, that might be a European archive (distributed or otherwise), and a national archive (which may be independent of, or part of, the European system).

  • I think the idea of the scientist cowering under a deluge of data is correct, but

    • more and more we see folks building complicated data delivery systems so that the data can be more easily consumed.

      • I see that as analogous to the design and construction of a range of sprinklers on the end of the hose, targeted at particular problems.

So here is my graphic:

Image: static/2011/02/09/water.analogy.sm.png

And the context?

We are being encouraged to put more and more effort into the front end - the portal, the visualisation system - to the detriment of the backend (the managed archive, well connected, with standard interfaces).

It clearly won't end well

  • if we have a lovely sprinkler system, but the reservoir looks like this:

    Image: static/2011/02/09/dryreservoir.jpg

Nor will it end well

  • if we ignore the importance of metadata, and the fact that the pipes can carry more than one type of fluid:

    Image: static/2011/02/09/industrial.png

(I'm well aware I need some picture credits on this post, since I made some serious use of other people's pictures ... please let me know if your picture is here and you would like a credit - or for me to remove it.)

by Bryan Lawrence : 2011/02/09 : Categories metadata badc neodc : 0 trackbacks : 1 comment (permalink)

Weather Forecasting's Bad Name

Sometimes weather forecasting deserves its bad name!

I'm off to Geneva tonight, for a couple of days. It looks like I'll have a fifteen minute walk from hotel to meeting venue. Do I need a) a raincoat, or b) a warm jacket?

  • Weather Underground (and google):

    • Tuesday: -8 to 5C, clear (sunny)

    • Wednesday: -2 to 7C, clear (sunny)

  • BBC Weather:

    • Tuesday: -3 to 0C, grey cloud

    • Wednesday: -4 to 6C, light snow

  • Swiss Met Office:

    • Tuesday: -2 to 0C, grey cloud

    • Wednesday: -3 to 0C, grey cloud, windy (50% probability)

Of course my money is on the Swiss, so I'm taking a rain coat (which is also wind proof).

The problem of course is: how does the punter know which to believe? Of course for Switzerland, one might imagine the Swiss do best ... but beyond that the issue of course is that the Weather Underground folks are presumably just interpolating out of global NWP. Who knows what resolution? Global models near mountains ... not a good idea!

I'll report!

by Bryan Lawrence : 2011/01/31 : 0 trackbacks : 2 comments (permalink)

Numerical Observation Time

Over four years ago I had a series of conversations with colleagues, that led to my blog posts on meteorological time. Jeremy Tandy has subsequently refined our thinking of all those years ago in a cogent proposal intended for both OGC and WMO which casts the discussion of "forecast time" in an observations and measurements framework.

Jeremy's proposal is detailed and well thought through, but I don't think it fully covers all the edge cases, so this post is by way of revising my post from 2006 in the context of Jeremy's proposal. In doing so, I guess I'm suggesting some minor changes to Jeremy, but I'm also writing this for the benefit of both myself and metafor.

By way of context then, here is my 2006 figure, redrafted to remove some implicit assumptions in what I wrote then:

Image: static/2011/01/06/ForecastDataSets4.png

The diagram shows a number of different datasets that can be constructed from daily forecast runs (shown for an arbitrary month from the 12th till the 15th as forecasts 1 through 4). If we consider forecast 2, we are running it to give me a forecast from T0 (day 13.0) forward past day 14.5 ... but you can see that the simulation began (in this case 1.0 days earlier), at a simulation time of 12.0.

(NB: all symbols as defined in this post, not any previous one. Note that all times shown are times in the forecast reference frame, even though the diagram shows a progression of such times marked as real time ... the diagram doesn't show when the runs were actually completed.)

The key concepts (see the small sketch after this list) are that:

  1. Forecasts begin with an assimilation period delineated by an initialisation time and a time of last possible observational data input (also known as the datum time, Td).

    • In the diagram, the initialisation time is 24 hours before T0, but it could be any period - including no period of assimilation at all.

    • Some forecasts simply continue as model runs with no new data acquired after Td, in this case T0 is the same as Td.

    • (Thus far these times are all in the forecast reference frame. Jeremy's presentation is hot on these distinctions!)

  2. Analyses are generally produced for some time in the observation window (aka assimilation window), sometimes, but often not, at the end (see for example ECMWF 4D Var).

    • Analyses are used for many things, but one of the most important is as the initialisation data for a future forecast, as shown here in the case of using them to initialise the next day's run. Another possibility is that an analysis from within the window is used to start a current forecast with a different resolution or version of the model. (As a consequence, we often find the assimilation portion of the run is not archived with the forecast portion.)

  3. Sometimes analysis datasets are produced by combining the actual analyses with higher frequency data taken from the forecasts in a consistent way.

    • Two examples are shown in the diagram for T0, where we have assumed for convenience that the analysis time is the same as the datum time.

      • In case a, data from the last forecast is used to provide the interim data points. In this example this data is from the free running model after assimilation. This is often a good thing to do for physical variables which are not being directly assimilated - these are often more physically in balance after the assimilation window.

      • In case b, data from the next forecast is used to provide the interim points.
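
By way of a concrete (if simplistic) illustration, here is how the times for "forecast 2" in the diagram might be captured; the attribute names here are mine alone, not anyone's standard.

  from dataclasses import dataclass

  @dataclass
  class ForecastTimes:
      """Key times for one forecast run, all in the forecast (simulation) reference frame."""
      t_init: float    # initialisation time (start of the assimilation window)
      t_datum: float   # datum time Td, the last possible observational input
      t0: float        # nominal forecast start T0 (often, but not always, equal to Td)
      t_end: float     # end of the forecast period

  # Forecast 2 from the diagram: assimilation from day 12.0, T0 = Td = day 13.0, forecast past day 14.5.
  forecast2 = ForecastTimes(t_init=12.0, t_datum=13.0, t0=13.0, t_end=14.5)
  assert forecast2.t_init <= forecast2.t_datum <= forecast2.t0 < forecast2.t_end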

OK: with that language in mind, we can turn to Jeremy's OGC proposal.

Jeremy makes the key distinction between times of the observation event (which belong on the observation entity), and times of the result of the observation. (In this context the Observation is the Simulation).

Looking at Jeremy's times of interest, we have:

  1. The result times which appear in the result coverage. These are straightforwardly the times that the simulation thinks it is valid for (we'll come back to validity).

    • The bounding box of these times, the duration of the run, should appear in the MD_Metadata associated with the observation!

    • It should also appear as the phenomenonTime in OM_Observation.

    • We could have multiple datasets extracted from the forecasts above which had exactly the same result times and result bounding boxes (e.g. day 13 to day 14 from forecasts 1, 2 and 3 - noting the latter would include data from the assimilation period - or a composite analysis of days 13 to 14 via routes a or b).

  2. Jeremy rightly states "It is the metadata about the simulation event that enables us to distinguish between these results"

    • (From a metafor point of view, that clearly tells us our "simulation" class needs to be a specialisation of OM_Observation.)

    • We sometimes talk about a reference time which is often associated with the analysis time (T0 in our case), but it could be the initialisation time Tinit, or even the datum time (Td is not always equal to T0).

      • These are not standard O&M concepts, and would need to go in as named parameters, and this is where I differ from Jeremy: because the reference time is somewhat ambiguous (is it T0 or Tinit?), I would explicitly distinguish between them (even though often they are the same, since the assimilation and forecast runs are separate). To distinguish, I would have a referenceTime which was explicitly defined as Tinit, and add an optional datum time (its absence would indicate no assimilation); there's a rough sketch of these time attributes after this list.

    • OM_Observation also has validTime, which could be confused with the phenomenonTime in the case of forecasts - but it isn't intended to be the same as validity time in the meteorological sense. If we use it, it should indicate a period in the real time reference frame (not forecast time reference frame) for which the forecast is intended to be used.

    • OM_Observation also has resultTime, which is the time at which the event "happened". Jeremy unpicks this well in his document. The bottom line is that it should correspond to when the result became available ... in the real time reference frame.

  3. Jeremy suggests that observation mashups (like the horizontal dataset constructed using T0 output plus either a or b data) would not retain any semantics. In this I disagree, since these are "effective observations". The process will of course describe the detail of the mashup methodology, but I think the observation has to give some hint via its time attributes how it was done. The interesting thing there is that the output dataset consists of a series of {analysis field followed by 1..n forecast field} sets. We need a notation for that. Which is more than I'm going to do today ...
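
To pull those threads together, here's a rough sketch (mine, not Jeremy's, and certainly not normative O&M) of the time attributes a "Simulation" specialisation of OM_Observation might carry; everything here, names included, is illustrative only.

  from dataclasses import dataclass, field
  from datetime import datetime
  from typing import Dict, Optional, Tuple

  @dataclass
  class SimulationObservation:
      """A simulation treated as an O&M observation; attribute names are illustrative only."""
      phenomenon_time: Tuple[datetime, datetime]   # bounds of the result times (forecast reference frame)
      result_time: datetime                        # when the result became available (real time)
      valid_time: Optional[Tuple[datetime, datetime]] = None   # real-time period the forecast is intended for
      # The named parameters suggested above, rather than standard O&M properties:
      parameters: Dict[str, datetime] = field(default_factory=dict)

  run = SimulationObservation(
      phenomenon_time=(datetime(2011, 1, 12), datetime(2011, 1, 14, 12)),
      result_time=datetime(2011, 1, 13, 6),
      parameters={"referenceTime": datetime(2011, 1, 12),    # Tinit
                  "datumTime": datetime(2011, 1, 13)},       # Td; omit entirely if there was no assimilation
  )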

By way of summary, this blog post summarises some things from Jeremy's proposal, and makes some minor quibbles. Overall I think it's an excellent proposal.

by Bryan Lawrence : 2011/01/07 : Categories metafor metadata : 0 trackbacks : 2 comments (permalink)

Citation, Digital Object Identifiers, Persistence, Correction and Metadata

CEDA now has a mechanism for minting Digital Object Identifiers (DOIs). This means we need to finalise some decisions about rules of behaviour, which means we have some interesting issues to address.

Traditional Journal View

Let's start by considering how DOIs actually work for traditional journal articles:

  1. User dereferences a DOI via http://dx.doi.org/

  2. dx.doi.org redirects to a landing page URL. (This is the entire job of the doi handle system done!)

  3. Publisher owns the landing page, and prominent on that page is some metadata about the paper of interest, J' (author, abstract, correct citation etc), and a link to the actual object of interest (J, generally a pdf). NB: There may be a paywall between the landing page and the object of interest.

Publishers can change the landing page any time they like, but conventionally you had better be able to get to your digital object from there. There are some other de facto rules too:

  • If there is a new version of the paper, a new DOI is needed.

  • Landing pages can have query based links to other things (papers which cite this one) etc ...

  • The metadata (J') shouldn't change (since it is indelibly linked to the paper). I suppose if the metadata had inadvertently left out an author, a publisher might update it and hope no one noticed ... but there is an important thing to understand about this metadata:

    • It describes the digital object and represents it faithfully. It ought not change, since any change to it ought to reflect a change to the digital object (and that should trigger a new DOI) ...

    • and the original landing page can indicate that a newer version of J exists, but it should still point to the older version!

So the DOI system is effectively used as follows:

A DOI resolves to a representation of a record that describes a digital object which is retrievable in its own right, and that representation can carry lots of other extraneous stuff including material built from interesting queries related to the object of interest.

(In most cases the primary representation of the landing page is html, but other representations, including rdf, might exist.)
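
To make steps 1 and 2 concrete: dereferencing a DOI is just an HTTP redirect chase. A minimal sketch, assuming the Python requests library is available and using a made-up DOI:

  import requests

  doi = "10.1234/example"    # hypothetical DOI, purely for illustration
  resp = requests.head("http://dx.doi.org/" + doi, allow_redirects=True)

  for hop in resp.history:                       # the redirect chain issued by the handle system
      print(hop.status_code, hop.headers.get("Location"))
  print("Landing page:", resp.url)               # the publisher-owned landing page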

Data Publication

OK, now consider the data publication version of this story.

Consider a real world observation (O), which used a process (P) to produce a result (R). (Aficionados will recognise a cut down version of O&M.) Suppose that we have described these things with metadata O', P' and R', which describe the observational event, the process used, and the result data (syntax mainly). You might think of it like this:

Image: static/2011/01/07/PackagingMetadata.png

So what does data publication actually mean in this context? Sure we can assign a DOI to each of the O', P', R' entities, but why would we do that, what value would that have over using a URL?

We believe (pdf) that the concept of Publication is an important one, which is distinct from publication (making something available on the web). Publication (with the capital) carries connotations of both persistence and some sort of process to decide on fitness of purpose (peer review in the case of academic Publication).

So how do we peer review data? In our opinion you can't Publish data without Publishing adequate descriptions of at least the observation event, the process used, and the resulting data. That is, we want to see a form of peer review of the (O', P', R') triumvirate. (O&M aficionados: for the purpose of this discussion I've collapsed the phenomenon, sampling feature, and feature of interest into the result description, don't get hung up on that.)

Strictly, in the O&M view of the world, the process description can be reused, so we might assign DOIs to each process (P').

Clearly we can assign a DOI to R', but we would argue (Lawrence et al 2009) that without O' and P', the data is effectively not fit for wide use (and hence Publication). So, better would be to assign a DOI to O' and ensure that O' points to R' and P' (as it must do, since that's how it's defined).

(Not having a DOI for R' is consistent with the O&M use of composition for R as part of O, but actually I think that's a bit broken - in O&M - for the case where we want to have observations which are effectively collections of related observations, but that's a story for another day.)

The html representation of O' itself (not to be confused with the landing page for O') could link to R' and P', or it could directly compose their content onto the page.

Now we have some interesting issues. Can we imagine R' being updated because of incompleteness or inaccuracy? Well, yes, but hopefully it's about as likely as journal metadata changes. From a practical point of view, a cheeky change in R' wouldn't affect the conclusions that one might make about how and why someone used R itself. So, we might get away with fixing R' in place, but in principle one shouldn't.

What about changes to our description of P (that is, P')? Someone following a citation to O' might get entirely the wrong idea of why someone used R (or misunderstand the usage). Hence, the right thing to do (sketched in code after this list) would be to:

  • create a new description of P (call it P+),

  • create a new description of O (call it O+), and ensure it composes P+.

  • review these new descriptions, then

  • mint a new DOI for O+ (and one for P+ if desired as well),

  • create new landing pages accordingly, and

  • update the landing page for O' (and the landing page for P' if it exists separately) to indicate that a newer, superseding version exists.
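
Here's a rough sketch of that workflow, with deliberately minimal record types; none of this reflects an actual CEDA implementation, it's just the logic of the list above, with hypothetical DOIs and URLs.

  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class Record:
      """A published description (O' or P') with its DOI and landing page."""
      doi: str
      landing_page: str
      composes: List["Record"] = field(default_factory=list)   # e.g. O' composes P' (and points at R')
      superseded_by: Optional["Record"] = None                  # set on the *old* record only

  def supersede(old: Record, new_doi: str, new_landing: str,
                new_composes: List[Record]) -> Record:
      """Mint a new record (O+ or P+) and mark the old one as superseded.

      Assumes the new description has already been reviewed; the DOI and landing
      page arguments stand in for whatever service actually mints and hosts them.
      """
      new = Record(doi=new_doi, landing_page=new_landing, composes=new_composes)
      old.superseded_by = new       # the old landing page now points forward to the new version
      return new

  # Example: P' is revised to P+, so O' is superseded by an O+ that composes P+.
  p_old = Record("10.1234/p-prime", "https://example.org/p-prime")
  o_old = Record("10.1234/o-prime", "https://example.org/o-prime", composes=[p_old])
  p_new = supersede(p_old, "10.1234/p-plus", "https://example.org/p-plus", [])
  o_new = supersede(o_old, "10.1234/o-plus", "https://example.org/o-plus", [p_new])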

Note that none of this discussion depends on the detail of the underlying representation(s) of each of O', P', R', P+, or O+; these could utilise any technology (including OAI/ORE etc). However, for each object, one of the representations should represent the (apparently) immutable digital object linked via the landing page and the DOI. If that particular representation became unavailable, the replacement representation would effectively become a new edition of the digital object, and the landing page should be modified accordingly. It might be acceptable, and probably has to be for practical reasons (since such changes are likely to be associated with software changes), that the original DOIs resolve to landing pages that no longer point to the old primary representation - but they had better make the evolution clear.

(Note that one would not countenance the data object R itself being allowed to change format without resulting in a new DOI.)

In terms of the physical infrastructure, one could imagine the landing pages being dynamically generated from the metadata records they are linking to, along with queries making other associations etc. One could even imagine multiple representations of the landing page itself (e.g. XHTML, multiple dialects of XML, RDF etc). Such different representations could be available by content negotiation (as sketched below) - but to reiterate once again - even if multiple representations are available for the digital objects of interest, only one is the object for which the DOI was assigned.
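
For what it's worth, content negotiation on such a landing page is just an Accept header; a sketch only, with a hypothetical URL and again assuming the Python requests library:

  import requests

  landing = "https://example.org/landing/o-prime"    # hypothetical landing page URL

  html = requests.get(landing, headers={"Accept": "text/html"})
  rdf = requests.get(landing, headers={"Accept": "application/rdf+xml"})
  print(html.headers.get("Content-Type"), rdf.headers.get("Content-Type"))
  # Same landing page, two representations - but only one representation of each underlying
  # object is the thing the DOI was actually assigned to.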

by Bryan Lawrence : 2011/01/07 : Categories metadata claddier badc curation : 0 trackbacks : 6 comments (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.