## AGU Fall Meeting 2012

So clearly I'm in San Francisco, at the biggest conference in the geophysical year. I have a talk, you can download it from the talks page linked here, but that's not what this post is about.

This is a big conference, stupidly big. While I'm enjoying it more than I thought I would, there is more than a tinge of guilt in being here. Because this conference is so big, you miss most of the talks you are interested in - because inevitably the sessions clash. So, in comparison to not coming, in which case I miss all the talks, coming means I miss most of the talks. Of course others get to hear me, but is that what it's all about?

(The AGU Fall meeting talk attendance has a lot in common with winning the lottery - buy one ticket, or even many tickets, and you've got a very small non-zero possibility of winning, but mostly you lose. If you come to AGU, you have a small, non-zero possibility of seeing everything you want, but mostly you lose.)

So apart from the ego boost of an invitation, I didn't really come to hear the talks, I came for the conversations in and around the meeting. But even they clash horrendously when all the tribes are in one place.

For all my problems with clashes, I did hear some outstanding talks, and I have learned (and will yet learn) from listening. So there is real value ... but ...

Then there are the posters ... which are a horrendously inefficient way of communicating information, but a great way of starting conversations. I can't help thinking we can do that better using social media now ... do we really have to fly thousands of people thousands of miles for a few short conversations in front of something that could be better presented in a different way, exploiting a combination of social media and remote attendance?

I really wonder whether this sort of conference should become a dinosaur of the past. I totally respect the opportunities which physical co-location brings, both for serendipitous science, and for interdisciplinarity, but surely we can come up with a way of doing this sort of thing that requires fewer carbon miles? Perhaps we should have the big interdisciplinary overview meetings a bit less frequently, and keep the domain specific stuff where it belongs - in smaller more focused meetings? I appreciate that approach might cause problems for folks nearer the start of their careers, so I don't have all the answers, but it can't be beyond the wit of our community to resolve these sorts of issues ...

I guess I won't get another invitation ...

(p.s. EGU the same applies to you :-)

## A tale of two ethics

A couple of years ago I took part in a lengthy discussion on scientific ethics. The context of that discussion was "the ethical risks" of any given piece of research, but I well remember the bottom line: in matters of ethics, there is no real black and white, "it all depends". It's possible to find a standpoint from which most "thinking people" (whatever that means) will believe you behaved ethically, and a standpoint from which most will not; often for the same action, albeit generally in different contexts!

So I find the sense of outrage about Peter Gleick's actions over the Heartland documents a bit much. That said:

• I find much to commend in Dan Moutal's analysis, and

• I'm pretty relaxed about James Garvey's article. (Unlike Richard Betts I didn't read it as a call to misbehave - indeed as Garvey says "It depends on how this plays out". Perhaps Richard would have been less offended if the subtitle "...perhaps more climate scientists should play dirty" had an explicit rather than an implicit question mark?)

• While I rather think Bob Ward has got it right (as reported by Scott Mandia here, I can't find the original tweet):

"Peter was wrong but no comparison to Heartland's tactics. Not even close. Perspective in order."

• ... I also have a lot of sympathy for Steve Easterbrook's position:

" ...If AGW kills 300,000 p/y, exposing Heartland is ethical!"

However, I don't think it's about whether he was right or wrong, or even ethical (particularly the latter). What it's really about is what he might have done to trust. Some of my colleagues think we (scientists) have to somehow have the moral high ground to keep trust, and are worried that we're off down a slippery slope.

(We'll come back to the moral high ground.)

Misrepresenting yourself is not in and of itself unethical, I do it every Christmas, pretending to be Santa. Do you trust what I say scientifically any less? (If you do, then frankly you're beyond help.) So now we've established that misrepresentation is "allowed" it comes down to the context.

If I were to pretend some knowledge which I don't have, then you should absolutely stop trusting me. If I pretend to be someone I'm not (perhaps I dress up in women's clothing ... or maybe I don't) it shouldn't necessarily affect your level of trust in my science. Context matters!

OK, now if I pretend to be a conservative voter long enough to get my MP to pay attention to my moans about the NHS? What now? If I get him to inadvertently send me some document admitting the conservative party has never believed in being the greenest party ever? What lines will you let me cross? Why those particular lines? Context?

And so we come to Peter Gleick - who I've never met. He's not happy about what he did. Very few people are! But as far as I can see he's done a bit of mild misrepresentation. Not good, but not terrible either. It's who he is, and who he did it to that provides the context - and eventually, the results!

What are the consequences of his actions? The context?

1. We know a lot more about Heartland!

2. Those in an echo chamber (any chamber) won't change their minds on anything.

3. Those who have minds that listen, will continue to weigh the evidence.

4. Scientists will continue to respect his refereed work on science - as they did before, nothing has changed.

And thus we come to the nub of it. Even in science, especially in science, we trust no one - without evidence, and peer review. So we're happy to keep trusting his work in his domain of expertise (because we don't have to trust him). However we don't trust others (whoever they might be) to exercise trust in evidence in quite the same disinterested (?) and objective (?) way! Unlike "us" (being superhuman of course [1] :-) ).

Of course, it isn't easy to weigh up evidence. Particularly now. In a completely different context, we have this: Weinberger's equivalent of Newton's third law: "For every fact on the internet, there is an equal and opposite fact."

So, to make decisions, we need a bit of expertise, but most of all we need provenance for our "facts" ... but is our moral high ground the only important part of our provenance?! Hell no, it's the method! You shouldn't trust someone's science just because they have never lied before; you trust their science because of the method, and peer review (warts and all).

So I don't think what he's done will matter a jot in terms of influencing opinion in general, given my enumeration above, but what I'm really scared of is that it gives an excuse, to a certain kind of (cowardly) politician, for sticking his or her head back in the sand. Now that really is a pity - but it has nothing to do with ethics, or right or wrong, it has to do with a certain spinelessness of the political class to make tough decisions! And that really is something to get wound up about. Not Peter Gleick.

1: For American readers: please look up the dictionary definition of "irony" at this point.

## Massive Spam Problem

I've always had problems with spam on this blog, not quite sure why, culminating in over thirty thousand spam comments over the recent Christmas period - all on old articles!

As a consequence, I've gone back through all my posts (I hope), removed all the spurious comments (and hopefully only the spurious ones) ... and made comments readonly throughout. If I've accidentally deleted your (real) comments, then I'm sorry, but for obvious reasons I had to do this in a semi-automated way ...

Until I move over to a new system, comments will only remain open on new posts for a few days ...

## Reviewing Parsons and Fox on the data publication metaphor

Mark Parsons and Peter Fox have a draft paper out for public review and are inviting comments there ...

... but I found myself hamstrung by the "comment" paradigm, so I'm responding here.

Firstly, the paper itself: Like Chris Rusbridge, commenting on the blog, I found myself in equal parts agreeing and disagreeing ... but the bottom line is that it is a good question to ask, so I hope a version of this paper appears! The more this gets discussed the better, either I'm right (and data Publication - as I define it - should be a dominant paradigm), or I'm wrong - but if right, it'll never happen until folks discuss and buy into it, and if I'm wrong, well best I find out :-)

1. There is an underlying assumption in this paper, which bubbles up to plain view in spots, and on the blog, that Publication and sharing are in conflict. I flat out disagree with this ... and don't know of any evidence that supports such an assertion.

• The Publication metaphor already explicitly supports both via preprints (sharing or publication) and the formal paper of record (Publication).

2. Metaphors are obviously useful, but as the authors agree, flat out dangerous as well. I've blogged on the importance of "verifiable statements" before (and folks should look at the Tauber link from that!). So do I agree with these particular metaphorical paradigms?

• Well, no not really, we at CEDA would claim we were playing in all four paradigms as they are defined, and I don't think we're that unusual. It is true that the examples they have chosen can be roughly characterised by these paradigms, but I think there's a bit of selection bias, sadly. Partly, I think that's engendered by using these activities to try and stratify data management as opposed to data activity. For example, we are a big archive (Big Iron [1]) who have some datasets which we wish to publish, and we are involved in map making and linked data - but I would argue that the second two are activities that depend on data management, they are not data management per se! Other groups will have very different data management standards and methods, but still be involved in map making and linked data. I think this is recognised to some extent in the paper (via the table where the focus line makes these key points, the qualification that they're not mutually exclusive, and the first paragraph of section 4).

• So, because I think these paradigms do mix management and activity they don't contribute easily to the question about publication, so I don't find them helpful in this context.

• All that said, the discussion of how misunderstood definitions limit understanding is totally on the money (up to the point where there is an implicit assertion that "traditional peer review" is homogeneous enough to be differentiated from what one could do with data Publication). I think PLoS is an interesting analogy here ...

3. Why do the authors think concepts like "registration, persistence" etc are not relevant (see the reference to Penev et al)? The answer appears to come from the next sentence where there is a conflation between Publication and some (closed) implementations thereof.

4. I think the dynamic, annotation infested ( :-) ), world of Publications that I foresee is not inconsistent with open access, clear provenance, yet the authors assert that worrying about provenance and definition is undue? (I defy you to find a URL to a data object that is useful without you having some implicit or explicit knowledge about the medium that will be returned if you dereference a URL - obviously definition matters in the data world, and just as obviously some level of containerisation is necessary - even if it's only to say "stream of type X starts here, terminated by this binary string").

• Of course annotation, federation, transformation are important, but if you want to use objects from that complex world, in a way that is scientifically useful, you need to know whether you can repeat or replicate your workflow!

• Particularly, often one doesn't need to see all that "ecosystem" in the workflow. The annotations are logically distinct from the data (and may themselves be different publications). If they do need to be together, then the anthology is a useful metaphor - publish poems on their own, and in collections!

• What is linked data, but simply linked data? ... the objects that are linked, may, or may not be Published ... fair enough ... no conflict there?

• So I simply don't buy that Publication is in conflict with unlocking the deep web! It's orthogonal in some sense!

5. All those picky disagreements aside, I find myself agreeing, the various paradigms available all lack in some way - but everything is a compromise. The analysis of their paradigms is fair enough ... in general ... but the conclusion lacks some rigour that they might be able to take from our paper.

6. I really liked the discussion of infrastructure and ecosystems ... particularly when it got to recognising the different roles that exist in Publication and how they can be decoupled. (Again, our paper helps in this regard, although I think it could be usefully extended using the arguments that Parsons and Fox have espoused).

• When we get to releases, we start to see the concept of versions (editions anyone?) of data ... right on!

• I think there is real scope to consider how the infrastructure and ecosystem around data Publication will be (should be) very different from those around literary publication. I'd really like that section to be expanded ...

7. ... it is important not to be hidebound by ... any one metaphor. Yes, Yes, Yes.

All of which is me saying my definition of Data Publication is clearly different from Parsons and Fox. Which is utterly fine ... neither of us is right or wrong, we just need to define what we are talking about!

Ok, now to consider the blog traffic :-) (which at the time of writing was up to comment 10).

The one point that I think is explicitly different from any of the points above is the importance of social science and philosophy in establishing both our expectations and our practice ... absolutely true!!!

The other point (to reiterate) is that data Publication is not in conflict with sharing, but there is a time for sharing and a time for not sharing - and whatever system(s) we come up with have to leave the decision as to how much sharing is wanted or desired to individuals (or their funders) - the "infrastructure" cannot make that decision (positively or negatively). It needs to facilitate it!

(In particular, it is just fine for scientists to hold some data close to their virtual chests ... consider, if nothing else, health records, or the geographical location of the last members of particular species etc).

Finally, I too am a big advocate of citation: it is not the answer, but it's certainly an answer! Any metaphor is both useful and limiting, and even the best solutions are only useful 80% of the time. That means this data "publication" paradigm will involve multiple solutions. Bring on the separation of concerns :-)

1: or not? Like Chris in the comments on their blog, we're petascale, but all commodity hardware ... today!

## Communicating Science (to the media)

Interesting seminar here this morning on the subject.

Some memes that came out of this (and a previous staff meeting):

1. Most academic staff don't see public/media communication as their job, and that's a combination of being over-worked in the first place, not thinking we're particularly good at it (we tend to use over long, over-qualified sentences :-), and not wanting to get into a fight with some nutter in honour of some spurious notion of balance.

2. We (academics) need to remember the reality of the news cycle: most stories have a four-hour life expectancy, and most of them are sprung from events, and prepared in a matter of hours, if not minutes. For television and radio particularly, waiting to do a properly researched story would, in many cases, mean the story would never appear - bad quick information wins over good slow information!

3. It's important to remember that "show business" matters; an accurate story which is boring won't be listened to (or read)! So, like it or not, if you want to communicate, you have to play by the "make it brief, make it interesting, and do it with panache" rules. (Oh no, see the second caveat in 1. above :-).

4. (However, in the 24-hour news cycle, you might be able to negotiate a "serialisation" of your message into byte (pun!) size chunks - giving a reporter several stories - but you'd have to be lucky :-)

5. Some survey of stories in the "quality US newspapers" (editor: no comment :-), suggested that 85% of the stories were founded on only one source, and often that source was a press release! (Gosh!)

6. Prepare for interviews, but be available. Saying "Not now, I'm available in an hour" might mean you're not "used" ... but if you're going to be interviewed, spend at least ten minutes trying to work out how to get your message across. If it's a soundbite interview, your message had better be in 45 words (or fewer if you have to deflect a spurious question "on-message"). If it's a three-minute "radio-4-type" interview, it'd better be 200-300 words max (and be prepared to deflect several spurious questions "on-message").

7. Often a story/interview will go better if you take five minutes to send someone specific facts/figures in a few bullets in an email - it might well get them "on-message" and asking about the right things!

And the bottom line, for us climate scientists: As Edmund Burke said

All that's necessary for the forces of evil to win in the world is for enough good men to do nothing.

And we are, most of us, funded by the public (one way or another), to do the right thing for the public ... a bit like open data really I suppose!

(Update: I should have noted the seminar speaker was Jon Barton of Clarify Communications.)

## data scientists are anally retentive too!

Well, that title ought to make this blog article inaccessible to those of us with anally retentive firewalls, but because it's only a minor misquote of an article from the Times Higher Education, it is legitimate!

The original quote is: Professor Nunberg said

"like most high-tech companies, Google puts a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work librarians are used to."

Well yes: but I think Professor Nunberg said the right thing for the wrong reasons. Firstly, the context: Google Books has a lot of metadata, and a lot of it's wrong, and Nunberg was effectively saying that Librarians do metadata better.

Well no: they clearly have procedures in place for checking metadata, but they don't necessarily have expertise, and even if they do, that might not be enough. IMHO, in the long run, crowd sourced metadata (where one has potentially multiple experts on tap) will vastly exceed (individual) expert sourced metadata. Because when metadata is more than just (title, author, year), metadata really is just another sort of content, and when content matters for descriptive matter, Wikipedia v Britannica (e.g. here and here) shows us the direction of travel. Crowd sourcing wins, in the long run.

(Caveat: generating information is still an individual or at best a team game, you can't crowd source information ab initio, even if you can crowd source the evaluation and description of it. Caveat-squared: you can of course crowd source data - crowds can generate data. Anyway, enough of this digression.)

However, where librarians do matter, and data scientists too - for whom the same argument applies - is in setting the initial conditions! Since, after all, we're never "in the long run", we're always getting there, and we get there much faster if we know where we started from. Attention to detail is a great quality for librarians and data scientists alike (actually all scientists); such anal retention is an integral requirement for many of us ...

Taking us back to the THE article, clearly if the index doesn't point you to the vicinity of the thing you want, it will take a long time to get there. Similarly, if you don't have good scientific metadata, it takes a long time to choose the tree from the wood, and having done so, you might not have an initial evaluation for how good that tree was ...

What Nunberg might have said (who knows, maybe he did say, I only had the THE article to go on) was: why on earth didn't Google start with an algorithm to utilise the professionally curated metadata rather than their own? And why not put a proper crowd source front end, so it can be improved?

All of which is rather ho hum, but what really got me thinking was the parallel between Google, and the attitude of a great many of the scientific community to managing their data, in which case I'd say:

Like Google, most scientists put a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work that is necessary for society to get the best out of the data it has paid to collect!

Which is to say, it's not just being anally retentive that matters, it's what you're anally retentive about.

OK, well that's the last time this subject (a-r) comes up on this blog ... even if it was enough to provoke me out of my workload-induced-blog silence (although in truth, what was really needed was the sudden cancellation of a scheduled meeting).

## European Exascale Software Initiative

Last week I spent two days at the lovely Domaine de Tremblay (an hour from central Paris).

This was the venue of a workshop organised by the European Exascale Software Initiative (EESI). The project has an interesting blend of activities; four domain working groups

• industrial and engineering applications,

• weather, climatology and earth sciences,

• fundamental sciences,

• life sciences and health,

cut across four technical working groups

• software ecosystem,

• numerical libraries, solvers and algorithms, and

• scientific software engineering.

to try and work out what Europe will need to do to exploit future exascale computing systems (which is of course a software problem).

I didn't keep proper notes, but I did tweet a bit, here's my tweet stream from the meeting:

• Sitting in a long narrow room ready for two days on the European Exascale Software Initiative. Not expecting a hashtag here #eesi

• Some real ambition in this room. Complete simulation of aircraft and airflow, would require zetaflop computing! #eesi

• I so do NOT believe in codesign: if the s/w life cycle is much longer than the h/w life cycle, it's a non starter! #eesi

• Other domains (genomics, seismology etc) as well as climate stress importance of exaSCALE cf exaFLOP! Need more balanced HPC! #eesi

• Hardware review: interesting to note that the only thing all vendors are actively researching is hw fault tolerance! Frightened? #eesi

• Worries about parallel file system futures: scalability and reliability, file number, metadata performance. Few vendors in the game! #eesi

• Programming models needed which better allow incremental parallelisation, more decoupling between h/w and s/w, and easier portability #eesi

• Not all faults expected to be masked by h/w, or tolerated easily by application (h/w correction will generate noise, global restart) #eesi

• Quadrant diagram: divide tasks into ((production, testing) versus (slower, faster)). Helps decide between capability versus capacity #eesi

• Not acceptable to try and control 3 levels of parallelisation by hand: system, nodes, cores. Improve abstraction in programming model #eesi

• Future algorithms might have to start with single precision, because SP uses less power, and then only use double precision later #eesi

• Crickey. Someone asked about libraries for symplectic integration. I can remember the noun Hamiltonian. Can't remember what it means! #eesi

• Definition of co-design: h/w perspective, we'll work with s/w community so you can change your apps to use our (slightly changed) h/w. #eesi

• Definition of co-design: s/w perspective, we'll work with h/w community so you can change your systems. Will they ever meet? #eesi.

• exascale hybrid core future: we might need to spend half the money we used to spend on computers on s/w engineers so we can use them! #eesi

• Bloke in audience: why does data management matter for exascale? Everyone: because the flops are useless if you can't feed & use them! #eesi

• Speaker guess: 3 h/w generations between us & exascale? Me sotto voce: If they get more heterogeneous each time we'll never get there! #eesi

• One breakout grp arguing for s/w isolation layer to hide h/w. Next asking for co-design between h/w and s/w. Draw your own conclusion #eesi

• Programmability: Petascale systems are not easy to programme. Do we really believe it will get easier at exascale? Yet it must! #eesi

• Programmability: An assertion: Parallelising serial codes will almost certainly fail. Decompose algorithm from 1st principles instead. #eesi

• "If you want an exascale initiative of relevance: you will need to invest big time in REdeveloping applications (new algorithms)" #eesi

• ... "you will also need to invest heavily in training, and a proper career path for scientific computing/computation ..." #eesi

I also chaired and reported on a breakout group on cross-cutting data issues. The breakout report is available as a pdf. Although the turnout to this breakout session was relatively low, a couple of folks came up to me afterwards, and bemoaned having to be in two places at once, and having to choose programmability as a (slightly) higher priority. This meeting confirmed to me, once and for all, that nearly every scientific domain is struggling with big data handling, particularly adjacent to HPC!

There was also considerable discussion in our breakout group about software sustainability, a theme that was picked up the next day in the EPSRC strategic advisory team meeting on research infrastructure.

(Yes, I did get home from a lovely venue in Île-de-France and go straight to Polaris House in Swindon the next day, you can imagine how the comparison felt.)

Nearly every meeting I go to makes the following point: Given that software is such a big and crucial part of research infrastructure nowadays the system is going to have to change:

• More professional software development and maintenance in academia, coupled with

• Better careers for those in software support (and software leadership) roles, and

• Sustained funding from research funders for codes which underpin the science community.

## Thought provoking talk on climate policy

Today at the ICTP workshop on e-infrastructure for climate science, George Philander gave a really thought provoking talk on global warming and aspects of policy consequences - particularly in an African context.

His basic meme followed the following line of argument:

• We know that in Pliocene times the earth appeared to be in a permanent state of El Niño (no evidence for cold water intrusions in equatorial regions).

• No current climate model can replicate that state, so

• We can only trust models for small perturbations from the current state

• He alluded to a paper by Carl Wunsch pointing out that models can't be tested in paleo times ... and to a paper by Paul Valdes. I'm going to chase up both.

• So, he worries about global warming, not because we know what is going to happen, but because we don't, which implies that

• this is a time for circumspection. He likens us (the planet) to a ship in fog. It's time to slow down and sound our way!

He also had a lot to say about the policy paralysis which amongst most serious analysts comes down to an argument about the discount rate based on our current predictions for the future. Should we act now or later? But this is a rich person's argument! From those in poverty there is no argument, the only choice is later! However, the argument over discount rate is predicated on (relatively) small perturbations from the current state, but if our models don't do serious perturbations, that's a matter of concern. As he puts it, we will know these things, but we don't yet, so, let's put the argument in terms of the planet being a special place (habitable) at a special time (warm interglacial). Let's treat that planet with some circumspection until the future is clearer!

What exactly to do, if your perspective is that of trying to lift millions from poverty, even as you want to do it without burning coal, is a moot question. I don't think any of us know the answer - Philander certainly didn't claim an answer - but it's a question that I hadn't really given any thought to before today.

Most of these ideas appear in a paper called "Where are you from? Why are you here? An African perspective on global warming." which appears in the Annual Review of Earth and Planetary Sciences, pdf. It's well worth a read, since amongst other reasons, you won't often see a reproduction of a Titian in the same article as a figure showing millennial temperature changes etc. There is also a call to arms for an African centre to study earth sciences. I hope it succeeds!

## hpc futures - part two

What is this exascale wall that I've been tweeting about?

The last decade of advances in climate computing has been underpinned by the same advances in computing that every one of us saw on our laptops and desktops until a couple of years ago: chips got faster and smaller. The next decade started a couple of years ago: processors are not getting faster, but they are still getting smaller, so we get to have more processors - there are even dual-core smartphones out there now! What's not so obvious though is that the memory per core is generally falling, and while the power consumption is falling, it's not falling fast enough ... pack a lot of these (cheap) cores together, and they use a ludicrous amount of power. The upshot of all this is great for smartphones, but not so good for climate computing: the era of easy incremental performance (in terms of faster time to solution, or more complexity/resolution at the same time to solution) in our climate codes is over. Future performance increases are going to have to come from exploiting massive parallelisation (tens of millions to a billion threads) with very little memory per thread - and it'll come with energy cost as a big deal. I first started writing about exascale on my blog back in August last year. (I promised to talk about data then, and I will here ...)

What all this means is that our current generation of climate model codes probably have at best a few years of evolution in their current incarnation, before some sort of radical reshaping of the algorithms and infrastructures becomes necessary. At the US meeting I'm at now, two different breakout groups on this subject came to the same conclusion: if we want to have competitive models in five and ten years' time, we need to a) continue to evolve our current codes, but b) right now, start work on a completely new generation of codes to deploy for use in the next but one generation of supercomputers. That's a big ask for a community starved of folks with real hard-core computing skills. Clearly there are a lot of clever climate modellers, but the breakout group on workforce summarised the reality of the issue as being that the climate modelling community consists of three kinds of folks: diagnosticians, perturbers (who tinker with codes), and developers. Universities mainly turn out the diagnosticians, some perturbers, but very very few developers with the skillset and interest to do climate code work.

That's a big problem, but the data side of things is a pretty big problem too. Yes, the exascale future with machines with tens to hundreds of millions of cores is a big problem, but even now we can come up with some scientifically sensible, and computationally feasible, methods of filling such a machine. Colin Jones from the SMHI has proposed a sensible grand ensemble based on a tractable extension of how EC-Earth is being used now (running, in an extreme experimental mode, at an effective 1.25 degrees resolution). An extrapolation of that model to 0.1 degrees resolution (roughly 10km) would probably effectively use 5000 cores or so. If one ran an ensemble of 50 members for a given start date, at the same time, it could use 250,000 cores. Ideally one would have a few different variants of this or similar models available, capturing some element of model uncertainty, let's say 4. Now we can use 1 million cores. To really understand this modelling "system", we might want to run 25 year simulations, but sample a range of initial states, let's say 40. Now we can use 40 million cores. This is an utterly realistic application. If a 40 million core machine was available, and we could use it all, this would be an excellent use of it (there are other uses too, and for those we need the new codes discussed above). But let's just consider a little further.
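For anyone who wants to play with the ensemble design, the core count is just a product of the round numbers quoted above. This is only a back-of-envelope sketch; the variable names are illustrative and are not part of the proposal.

```python
# Round numbers from the grand ensemble described above; names are illustrative.
cores_per_instance = 5_000   # one 0.1 degree model instance
ensemble_members = 50        # members run together for a given start date
model_variants = 4           # different models or variants (model uncertainty)
initial_states = 40          # sampled start dates

# Every member of every variant for every start date runs at the same time.
instances = ensemble_members * model_variants * initial_states
total_cores = cores_per_instance * instances

print(f"{instances} concurrent model instances")
print(f"{total_cores / 1e6:.0f} million cores")
```

Changing any one factor (say, fewer start dates) scales the core count linearly, which is what makes the design easy to fit to whatever machine is actually available.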

Colin tells me that the 1.25 degree (actually T159L62) model produces roughly 9 GB of data per simulation month writing out onto an N80 reduced Gaussian grid (which means you can double the following numbers if one wanted "full resolution"). Scaling up to the 0.1 degree version would result in 1.4 TB/month, and the grand ensemble described above would result in a total output of around 3 exabytes! For a reasonable time to solution (150 hours or a week all told, that is 6 hrs/model_instance_year), it would require a sustained I/O from the machine to storage of around 50 Tbit/s.
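Those data volumes can be reproduced in a few lines. This is a sketch under one assumption of mine: that output scales with the square of the horizontal resolution ratio, which is what turns 9 GB into roughly 1.4 TB. Everything else is taken from the figures above.

```python
# Decimal units throughout.
GB, TB, EB = 1e9, 1e12, 1e18

low_res_bytes_per_month = 9 * GB           # T159L62 output on the N80 grid
resolution_ratio = 1.25 / 0.1              # 12.5 times finer in each horizontal direction
# Assumption: output volume grows with the square of the resolution ratio.
high_res_bytes_per_month = low_res_bytes_per_month * resolution_ratio ** 2

instances = 50 * 4 * 40                    # members x variants x start dates
simulated_months = 25 * 12                 # 25 year simulations
total_bytes = high_res_bytes_per_month * instances * simulated_months

wall_clock_hours = 25 * 6                  # 6 hrs per model instance year
bits_per_second = total_bytes * 8 / (wall_clock_hours * 3600)

print(f"{high_res_bytes_per_month / TB:.1f} TB per simulated month")
print(f"{total_bytes / EB:.1f} EB in total")
print(f"{bits_per_second / 1e12:.0f} Tbit/s sustained to storage")
```

The total comes out a little over 3 EB, and the sustained rate assumes all instances run concurrently for the full 150 hours.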

Archive and analysis may be a problem! Remember this 3 exabytes of data could be produced in one week!

At this point it's probably worth considering an exascale computer not as a "computer" but as a "data source" ... it's a bit of a paradigm shift isn't it? Even without heaps of work on exascale software, our exascale computer can produce an outrageous data problem. We need to start thinking about our computer in terms of its analysis and archive capability first, then we can think about its computational ability, and indeed, how to get our codes to address other important problems (such as faster time to solution so we can have better high-res paleo runs etc). This ought to be affecting our purchasing decisions.

Hang on a moment though. The obvious immediate rejoinder is "we don't need to write all this data out". So what can we do to reduce the data output? We can calculate ensemble statistics in the machine, and we can choose only to write out some representative ensemble members. That might gain us a factor of 10 or so. We could simply say, we'll only write out certain statistics of interest, and not all the output, and that's certainly feasible for a large class of experiments where one is pretty sure the data will get no re-use, because the models are being deliberately put in some mode which is not suitable for projection analysis or extensive comparison with obs - but many of these ensemble experiments are very likely to produce re-usable data. Should it be re-used?

Well, consider that our 40 million core machine will probably cost around 50 million USD when we get hold of one. If we depreciate that over five years say, then it's about 10 million per year (in capital cost alone) or about 20,000 USD per week to use. Double that figure for power costs and round up to 50,000 USD for that grand ensemble run. I have no idea what storage will cost when we can do this run, but my guess is that the storage costs will be of the order of 300,000 USD. To be generous, let's imagine the storage costs will exceed the run time costs by a factor of 10.

(Where did that number come from? Well, today's tier-1 machine might have 50,000 cores, and it costs o(1) million USD per PB. To go to 50 million cores, we scale CPU by 1000, let's imagine we scale data costs accordingly. So when I have a 50 million core machine, I'll be able to get an EB of storage for the same price as today's PB.)

So the big question then is, how many application groups are likely to be able to exploit my run, and can I anticipate their specific data output needs? Well, my guess is that for an ensemble experiment as discussed above, there would be value in that data for considerably more than ten different groups - be they parameterisation developers, observational analysts, or impacts advisors!

So, we probably should store as much of it as we feasibly can! We can talk about the analysis effort and the network issues at a later date!

(Actually, when I run the numbers for other possible permutations of output, and model configuration, the 9 GB/month we started with seems small, it's entirely feasible to suggest a realistic IPCC style output requirement when scaled out, would result in around 50 times more output, but for an IPCC experiment, hundreds of applications are feasible).

## Pushing Water Analogies too far

Recently I heard a talk by Kevin Trenberth, who showed a slide with the following cartoon:

It was perfect in context, but I also found it amusing in a serendipitous sort of a way: just before he started speaking, I had started (in a coffee break I hope) googling for images of reservoirs, hoses and sprinklers ... because I had a similar idea in mind, and I wanted an easy way of communicating it.

However, my idea was and is rather different from Kevin's. So why is it different? Firstly, mine was born out of Ian Jackson's repeated mantra (on the NERC information strategy group) that a water-pipe system is no good without water, and my repeated response that without a pipe system, the water just gets wasted. We were both right of course: a water delivery system is useless without water, and water gets wasted without a delivery system.

To some extent Kevin's image is redolent of wasted water, even though there are clearly components of a system ... but I'd put it slightly differently:

• I don't think the sensor is analogous to the hydrant, I think it should be analogous to the rain pouring water into a reservoir via runoff, rivers, whatever ...

• There are plenty of folk who will tell you it's raining data.

• My point being that there are a plethora of sensors (water delivery mechanisms), each delivering data (water) into a bunch of archives (lakes, reservoirs, whatever).

• The hydrant is simply a defined interface to which I can couple a hose! It depends on being connected to a local reservoir, which itself may be coupled by a canal or a pipe to another reservoir, or lake or whatever. So the lessons I draw are that:

• In the case of data, we need local caches, to which we can connect delivery systems which understand a standard interface.

• But they depend on the existence of connected (and large) managed bodies of water. In my world view, that might be a European archive (distributed or otherwise), and a national archive (which may be independent of, or part of, the European system).

• I think the idea of the scientist cowering under a deluge of data is correct, but

• more and more we see folks building complicated data delivery systems so that the data can be more easily consumed.

• I see that as analogous to the design and construction of a range of sprinklers on the end of the hose, targeted at particular problems.

So here is my graphic:

And the context?

We are being encouraged to put more and more effort into the front end - the portal, the visualisation system - to the detriment of the backend (the managed archive, well connected, with standard interfaces).

It clearly won't end well

• if we have a lovely sprinkler system, but the reservoir looks like this:

Nor will it end well

• if we ignore the importance of metadata, and the fact that the pipes can carry more than one type of fluid:

(I'm well aware I need some picture credits on this post, since I made some serious use of other people's pictures ... please let me know if your picture is here and you would like a credit - or for me to remove it.)

## Sometimes weather forecasting deserves its bad name!

I'm off to Geneva tonight, for a couple of days. It looks like I'll have a fifteen minute walk from hotel to meeting venue. Do I need a) a raincoat, or b) a warm jacket?

• Weather Underground:

• Tuesday: -8 to 5C, clear (sunny)

• Wednesday: -2 to 7C, clear (sunny)

• BBC Weather:

• Tuesday: -3 to 0C, grey cloud

• Wednesday: -4 to 6C, light snow

• Swiss Met Office:

• Tuesday: -2 to 0C, grey cloud

• Wednesday: -3 to 0C, grey cloud, windy (50% probability)

Of course my money is on the Swiss, so I'm taking a rain coat (which is also wind proof).

The problem of course is: how does the punter know which to believe? For Switzerland, one might imagine the Swiss do best ... but beyond that the issue is that the Weather Underground folks are presumably just interpolating out of global NWP. Who knows what resolution? Global models near mountains ... not a good idea!

I'll report!

## Numerical Observation Time

Over four years ago I had a series of conversations with colleagues, that led to my blog posts on meteorological time. Jeremy Tandy has subsequently refined our thinking of all those years ago in a cogent proposal intended for both OGC and WMO which casts the discussion of "forecast time" in an observations and measurements framework.

Jeremy's proposal is detailed and well thought through, but I don't think it fully covers all the edge cases, so this post is by way of revising my post from 2006 in the context of Jeremy's proposal. In doing so, I guess I'm suggesting some minor changes to Jeremy, but I'm also writing this for the benefit of both myself and metafor.

By way of context then, here is my 2006 figure, redrafted to remove some implicit assumptions in what I wrote then:

The diagram shows a number of different datasets that can be constructed from daily forecast runs (shown for an arbitrary month from the 12th til the 15th as forecasts 1 through 4). If we consider forecast 2, we are running it to give me a forecast from T0 (day 13.0) forward past day 14.5 ... but you can see that the simulation began earlier (in this case 1.0 days earlier), at a simulation time of 12.0.

(NB: all symbols as defined in this post, not any previous one. Note that all times shown are times in the forecast reference frame, even though the diagram shows a progression of such times marked as real time ... the diagram doesn't show when the runs were actually completed.)

The key concepts are that:

1. Forecasts begin with an assimilation period delineated by an initialisation time and a time of last possible observational data input (also known as the datum time, Td).

• In the diagram, the initialisation time is 24 hours before T0, but it could be any period - including no period of assimilation at all.

• Some forecasts simply continue as model runs with no new data acquired after Td, in this case T0 is the same as Td.

• (Thus far these times are all in the forecast reference frame. Jeremy's presentation is hot on these distinctions!)

2. Analyses are generally produced for some time in the observation window (aka assimilation window), sometimes, but often not, at the end (see for example ECMWF 4D Var).

• Analyses are used for many things, but one of the most important is as the initialisation data for a future forecast, as shown here in the case of using them to initialise the next day's run. Another possibility is that an analysis from within the window is used to start a current forecast with a different resolution or version of the model. (As a consequence, we often find the assimilation portion of the run is not archived with the forecast portion.)

3. Sometimes Analysis Datasets are produced by using the actual analyses with higher frequency data provided by taking data from the forecasts in a consistent way.

• Two examples are shown in the diagram for T0, where we have assumed for convenience that the analysis time is the same as the datum time.

• In case a, data from the last forecast is used to provide the interim data points. In this example this data is from the free running model after assimilation. This is often a good thing to do for physical variables which are not being directly assimilated - these are often more physically in balance after the assimilation window.

• In case b, data from the next forecast is used to provide the interim points.

OK: with that language in mind, we can turn to Jeremy's OGC proposal.

Jeremy makes the key distinction between times of the observation event (which belong on the observation entity), and times of the result of the observation. (In this context the Observation is the Simulation).

Looking at Jeremy's times of interest, we have:

1. The result times which appear in the result coverage. These are straightforwardly the times that the simulation thinks it is valid for (we'll come back to validity).

• The bounding box of these times, the duration of the run, should appear in the MD_Metadata associated with the observation!

• It should also appear as the phenomenonTime in OM_Observation.

• We could have multiple datasets extracted from the forecasts above which had exactly the same result times and result bounding boxes (e.g. day 13 to day 14 from forecasts 1, 2 and 3 - noting the latter would include data from the assimilation period - or a composite analysis of days 13 to 14 via routes a or b).

2. Jeremy rightly states: "It is the metadata about the simulation event that enables us to distinguish between these results".

• (From a metafor point of view, that clearly tells us our "simulation" class needs to be a specialisation of OM_Observation.)

• We sometimes talk about a reference time which is often associated with the analysis time (T0 in our case), but it could be the initialisation time Tinit, or even the datum time (Td is not always equal to T0).

• These are not standard O&M concepts, and would need to go in as named parameters, and this is where I differ from Jeremy: because the reference time is somewhat ambiguous (is it T0 or Tinit?), I would explicitly distinguish between them (even though often they are the same since the assimilation and forecast runs are separate). To distinguish, I would have a referenceTime which was explicitly defined as Tinit, and add an optional datum time (and its absence would indicate no assimilation).

• OM_Observation also has validTime, which could be confused with the phenomenonTime in the case of forecasts - but it isn't intended to be the same as validity time in the meteorological sense. If we use it, it should indicate a period in the real time reference frame (not forecast time reference frame) for which the forecast is intended to be used.

• OM_Observation also has resultTime, which is the time at which the event "happened". Jeremy unpicks this well in his document. The bottom line is that it should correspond to when the result became available ... in the real time reference frame.

3. Jeremy suggests that observation mashups (like the horizontal dataset constructed using T0 output plus either a or b data) would not retain any semantics. In this I disagree, since these are "effective observations". The process will of course describe the detail of the mashup methodology, but I think the observation has to give some hint via its time attributes as to how it was done. The interesting thing there is that the output dataset consists of a series of {analysis field followed by 1..n forecast fields} sets. We need a notation for that. Which is more than I'm going to do today ...
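To make the bookkeeping concrete, the time attributes discussed in the list above could be gathered into a single record along the following lines. This is a hypothetical sketch, not the O&M schema or Jeremy's encoding: the class name, the field names and the January 2010 dates are all made up for illustration, and the only thing taken from the discussion is which times live in which reference frame.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class SimulationObservation:
    """Hypothetical record for a forecast run treated as an observation."""
    # Times in the forecast reference frame
    phenomenon_time: Tuple[datetime, datetime]   # bounding box of the result times
    reference_time: datetime                     # explicitly Tinit
    datum_time: Optional[datetime] = None        # Td; None means no assimilation
    # Times in the real time reference frame
    result_time: Optional[datetime] = None       # when the result became available
    valid_time: Optional[Tuple[datetime, datetime]] = None  # intended period of use

    @property
    def assimilates(self) -> bool:
        return self.datum_time is not None

    @property
    def assimilation_window(self) -> Optional[timedelta]:
        if self.datum_time is None:
            return None
        return self.datum_time - self.reference_time


def day(d: float) -> datetime:
    """Day-of-month as used in the figure, in an arbitrary month (January 2010)."""
    return datetime(2010, 1, 1) + timedelta(days=d - 1)


# Forecast 2 from the figure: starts at 12.0, datum time 13.0, runs past 14.5.
forecast_2 = SimulationObservation(
    phenomenon_time=(day(12.0), day(14.5)),
    reference_time=day(12.0),
    datum_time=day(13.0),
    result_time=datetime(2010, 1, 13, 6, 0),     # made-up completion time
)
print(forecast_2.assimilates, forecast_2.assimilation_window)
```

Making the datum time optional is the design choice that matters here: a run with no assimilation simply omits it, rather than repeating the reference time.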

By way of summary, this blog post summarises some things from Jeremy's proposal, and makes some minor quibbles. Overall I think it's an excellent proposal.

## Citation, Digital Object Identifiers, Persistence, Correction and Metadata

CEDA now has a mechanism for minting Digital Object Identifiers (DOIs). This means we need to finalise some decisions of rules of behaviour, which means we have some interesting issues to address.

Let's start by considering how DOIs actually work for traditional journal articles:

1. User dereferences a DOI via http://dx.doi.org/

2. dx.doi.org redirects to a landing page URL. (This is the entire job of the doi handle system done!)

3. Publisher owns the landing page, and prominent on that page is some metadata about the paper of interest, J', (author, abstract, correct citation etc), and a link to the actual object of interest (J, generally a pdf). NB: There may be a paywall between the landing page and the object of interest.

Publishers can change the landing page any time they like, but conventionally you had better be able to get to your digital object from there. There are some other de facto rules too:

• If there is a new version of the paper, a new DOI is needed.

• Landing pages can have query based links to other things (papers which cite this one) etc ...

• It describes the digital object and represents it faithfully. It ought not change, since any change to it, ought to reflect a change to the digital object (and that should trigger a new DOI) ...

• and the original landing page can indicate that a newer version of J exists, but it should still point to the older version!

So the DOI system is effectively used as follows:

A DOI resolves to a representation of a record that describes a digital object which is retrievable in its own right, and that representation can carry lots of other extraneous stuff including material built from interesting queries related to the object of interest.
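A toy model may help fix the roles. In the sketch below the handle table stands in for dx.doi.org and the page table stands in for the publisher's site; the DOI prefix, the URLs and all the names are made up for illustration, and nothing here talks to the real DOI system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LandingPage:
    """What the publisher serves: J', a link to J, and optional extras."""
    metadata: Dict[str, str]                           # J': author, citation, ...
    object_url: str                                    # where J itself lives
    related: List[str] = field(default_factory=list)   # e.g. query-built "cited by" links


# The handle system does one thing: map a DOI to a landing page URL.
HANDLES: Dict[str, str] = {
    "10.9999/demo.1": "https://publisher.example.org/articles/1",
}

# The publisher owns these pages and may restyle them at will, provided
# the digital object stays reachable from them.
PAGES: Dict[str, LandingPage] = {
    "https://publisher.example.org/articles/1": LandingPage(
        metadata={"author": "A. N. Author", "citation": "Demo J. 1, 1-10"},
        object_url="https://publisher.example.org/articles/1.pdf",
    ),
}


def dereference(doi: str) -> LandingPage:
    """Follow DOI -> landing page; the reader then follows object_url to J."""
    landing_url = HANDLES[doi]      # resolve and redirect
    return PAGES[landing_url]       # the publisher's landing page


page = dereference("10.9999/demo.1")
print(page.metadata["citation"], "->", page.object_url)
```

Note that the DOI never points at the pdf directly: the indirection through the landing page is what leaves room for the extra material.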

(In most cases the primary representation of the landing page is html, but other representations, including rdf, might exist.)

#### Data Publication

OK, now consider the data publication version of this story.

Consider a real world observation (O), which used a process (P) to produce a result (R). (Aficionados will recognise a cut down version of O&M.) Expect that we have described these things with metadata O', P' and R', which describe the observational event, the process used, and the result data (syntax mainly). You might think of it like this:

So what does data publication actually mean in this context? Sure we can assign a DOI to each of the O', P', R' entities, but why would we do that, what value would that have over using a URL?

We believe (pdf) that the concept of Publication is an important one, which is distinct from publication (making something available on the web). Publication (with the capital) denotes connotations of both persistence and some sort of process to decide on fitness of purpose (peer review in the case of academic Publication).

So how do we peer review data? In our opinion you can't Publish data without Publishing adequate descriptions of at least the observation event, the process used, and the resulting data. That is, we want to see a form of peer review of the (O',P',R') triumvirate. (O&M aficionados: for the purpose of this discussion I've collapsed the phenomenon, sampling feature, and feature of interest into the result description, don't get hung up on that.)

Strictly, in the O&M view of the world, the process description can be reused, so we might assign DOIs to each process (P').

Clearly we can assign a DOI to R', but we would argue (Lawrence et al 2009) that without O' and P', the data is effectively not fit for wide use (and hence Publication). So, better would be to assign a DOI to O' and ensure that O' points to R' and P' (as it must do, since that's how it's defined).

(Not having a DOI for R' is consistent with the O&M use of composition for R as part of O, but actually I think that's a bit broken - in O&M - for the case where we want to have observations which are effectively collections of related observations, but that's a story for another day.)

The html representation of O' itself (not to be confused with the landing page for O') could link to R' and P', or it could directly compose their content onto the page.

Now we have some interesting issues. Can we imagine R' being updated because of incompleteness or inaccuracy? Well, yes, but hopefully it's about as likely as journal metadata changes. From a practical point of view, a cheeky change in R' wouldn't affect the conclusions that one might make about how and why someone used R itself. So, we might get away with fixing R' in place, but in principle one shouldn't.

What about changes to P', our description of the process? Someone following a citation to O' might get entirely the wrong idea of why someone used R (or misunderstand the usage). Hence, the right thing to do would be to:

• create a new description of P (call it P+),

• create a new description of O (call it O+), and ensure it composes P+.

• review these new descriptions, then

• mint a new DOI for O+ (and one for P+ if desired as well),

• create new landing pages accordingly, and

• update the landing page for O' (and the landing page for P' if it exists separately) to indicate that a superseding version exists.
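The steps above can be sketched with an in-memory registry. This is an illustration of the bookkeeping only, under my own assumptions: the registry, the made-up DOI prefix and the function names are hypothetical, and the review step is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Description:
    """An immutable metadata record such as O', P' or R'."""
    name: str
    text: str
    composes: Tuple["Description", ...] = ()


@dataclass
class LandingPage:
    describes: Description
    superseded_by: Optional[str] = None   # DOI of the newer version, if any


REGISTRY: Dict[str, LandingPage] = {}


def mint(description: Description) -> str:
    """Mint a (made-up) DOI and create its landing page."""
    doi = f"10.9999/demo.{len(REGISTRY) + 1}"
    REGISTRY[doi] = LandingPage(describes=description)
    return doi


def supersede(old_doi: str, new_description: Description) -> str:
    """Mint a DOI for the replacement and flag the old landing page."""
    new_doi = mint(new_description)
    REGISTRY[old_doi].superseded_by = new_doi   # old page still describes the old record
    return new_doi


# Original Publication: O' composes P'.
p_prime = Description("P'", "process description, first version")
o_prime = Description("O'", "observation description", composes=(p_prime,))
doi_p = mint(p_prime)
doi_o = mint(o_prime)

# P' turns out to be wrong: write P+ and O+ (review happens here), then
# mint new DOIs and update the old landing pages.
p_plus = Description("P+", "process description, corrected")
o_plus = Description("O+", "observation description", composes=(p_plus,))
doi_p_plus = supersede(doi_p, p_plus)
doi_o_plus = supersede(doi_o, o_plus)

print(doi_o, "superseded by", REGISTRY[doi_o].superseded_by)
```

The old DOI keeps resolving to a page that describes the old record; the only thing that changes on it is the pointer to the newer version.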

Note that none of this discussion depends on the detail of the underlying representation(s) of each of O',P',R', P+, or O+, these could utilise any technology (including OAI/ORE etc). However, for each object, one of the representations should represent the (apparently) immutable digital object linked via the landing page and the DOI. If that particular representation became unavailable, the replacement representation would become effectively a new edition of the digital object, and the landing page modified accordingly. It might be acceptable, and probably has to be for practical reasons (since such changes are likely to be associated with software changes), that the original DOIs resolve to landing pages that no longer point to the old primary representation - but they had better make the evolution clear.

(Note that one would not countenance the data object R itself being allowed to change format without resulting in a new DOI.)

In terms of the physical infrastructure, one could imagine the landing pages being dynamically generated from the metadata records they are linking to, along with queries making other associations etc. One could even imagine multiple representations of the landing page itself (e.g. XHTML, multiple dialects of XML, RDF etc). Such different representations could be available by content negotiation - but to reiterate once again - even if multiple representations are available for the digital objects of interest, only one is the object for which the DOI was assigned.
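For what it's worth, content negotiation for a landing page need not be complicated. The following is a deliberately simplified sketch: the function name, file names and media types on offer are my own illustrative choices, it honours q-values in the HTTP Accept header but ignores wildcards, and a real service would lean on its web framework instead.

```python
from typing import Dict


def negotiate(accept: str, available: Dict[str, str], default: str) -> str:
    """Pick the media type the client prefers among those we can serve.

    Simplified on purpose: q-values are honoured, wildcards such as */* are not.
    """
    preferences = []
    for item in accept.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type, quality = parts[0], 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type:
            preferences.append((quality, media_type))
    # sorted() is stable, so equal q-values keep the client's own ordering
    for quality, media_type in sorted(preferences, key=lambda pref: -pref[0]):
        if quality > 0 and media_type in available:
            return media_type
    return default


# Hypothetical representations of one landing page
representations = {
    "text/html": "landing.html",
    "application/xhtml+xml": "landing.xhtml",
    "application/rdf+xml": "landing.rdf",
}

print(negotiate("application/rdf+xml, text/html;q=0.8", representations, "text/html"))
print(negotiate("image/png", representations, "text/html"))
```

Whatever the negotiation returns, it is a representation of the landing page only; the object the DOI was assigned to stays fixed.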

This is going to be a difficult blog entry, because I'm going to have to say some nice things about Australia (something most kiwis find difficult). So, on with it.

I'm typing this on a Qantas flight home after a whirlwind speaking tour in Australia:

• a keynote for a one day workshop in Canberra organised by the Australian Government, CSIRO, and ANDS to address what needs to be done to develop interoperating information networks,

• a kickoff talk for another one day workshop on scientific metadata, and then

• one of five invited keynote talks for the Australasian e-Research meeting (programme, twitter archive)

(the latter two on the Gold Coast in Queensland). Slides for all three are on my talks page, this is by way of trying to summarise some of the things I learned.

I guess the first and most obvious thing hit me at Heathrow before I left: the Australian dollar has got huge in the last five years - roughly $1.50 a pound (and parity with the US dollar)! Apparently it's now the fifth most traded currency in the world since it's seen as a proxy for the Chinese yuan (basically such a lot of Chinese money is spent on stuff Australia digs out of the ground). As a consequence, everything in Australia seemed expensive (and even expensive at the old "right" exchange rate of about $2.20 a pound). But this isn't a travel blog, so let's get to the reason why I went.

Early on I tweeted a frantic bleat ... "About to go into 1st day of 6 days of "informatics" mtgs. Can already feel climate science lack (what was that about "you made your bed")?" So why did I put myself through it? Not just because a bunch of people who I like and respect had invited me (though that helped), but because I knew Australia has a lot to tell us about interoperability - facilitated by a range of technologies. Australians have always been very active in the "standards" community, and my theory on this (which I picked up from someone else, but I can't remember who), is that they have a federal system with the "right number" of states: not too few, not too many (he types, desperately trying to avoid sounding like a nursery rhyme). That brings a necessity for interoperability and couples it with the ability to achieve it (not too many parties to achieve consensus on how to do it). So all good, and in Simon Cox and Rob Atkinson they have two of the leading luminaries on coupling standards based information systems and model driven architectures (and they're about to get another one of their world experts back as Andrew Woolf goes home to Australia early next year, having been working here at STFC). And, in the last few years, they've taken great strides in building real information systems exploiting both their ideas and a raft of OGC web services.

Additionally, even more relevant for me right now, is that they are busy grappling with a bunch of "integrative" science problems that they just have to solve: the most important of which (IMHO) is how to get their water data together, so they can manage their (slim) water resources, and have a hope of adapting to the inevitable climate changes ahead. Probably perceived as more important by many in Australia is getting their geoscience (mining etc) research information aligned so that they can carry on digging stuff up and exporting it for the next few centuries. (I told everyone I could that they need to leave their coal in the ground, but I don't expect too many were listening.) As a consequence, they have some real experience of doing data things that we're only just trying to do. Oh for sure, we're doing many things they're not doing, but the converse is true too, they're doing things we're not. So, I mostly went to listen, and for the corridor conversations, and to set up some ongoing collaborations. I think on all three fronts I did well (much listening, chatting, and a couple of things teed up).

So, what's the bottom line?

Well, I think folks would get a lot out of looking through the first few presentations of both workshops! The first is online already, the second will be soon. I'll put a link here as soon as I hear about it. I particularly recommend anything by the aforementioned Cox and Atkinson, but also by Wyborne, Lemon and Box ... there is some advice that the INSPIRE community would do well to heed in the talks by Lemon and Box (which overlapped in material some, but each with slightly different perspectives). The bottom line is that Discover, Display, Download isn't a paradigm that hits the sweet spot for a large community of users (even if it does for a goodly proportion of science use cases) - given INSPIRE isn't primarily to support science, they'd do well to heed this message. That said, I think the sort of thinking, and activity, that INSPIRE is working through is a necessary step along the road to more sophisticated interoperability - somewhat like crawling, necessary for most (but not all) children before walking. (I can almost picture Sean Gillies choking on his beer, railing against me supporting OGC services, but all I can say is that not all roads get to Rome the same way, and I think for some organisations, a jump straight to RESTful services based on more sophisticated information models, just isn't going to happen.)

I think I also convinced myself of the strengths and weaknesses of "pure" RDF/linkeddata approaches to information management. I spent a lot of time in my second talk addressing what needs to happen to exploit both an RDF and XML view of the world, and on managing our view(s) of that world. To that end, the Atkinson talks on the importance of managing our information models themselves as first-class artefacts in our information systems strike me as pretty important. The trouble is that if you want to consume a sophisticated information artefact describing something about the world in a piece of software, and you want to do more than navigate around it (a la linkeddata), then you have to know a priori what the structure is. But of course the structure is changing ... managing all that requires a form of model driven architecture.

## Strongly Defend Weakly Held Positions

I've just been at a NERC data management workshop. I may well blog some more about it, but one thing I spent a lot of time repeating to lots of people was one of Bob Sutton's mantras: "Strongly Defend Weakly Held Positions".

There were (at least) two strands at the meeting where this mantra is particularly important, both in the context that we (the data management community, like everyone else) are trying to find our way forward in uncertain technical times - we know the science drivers are demanding interoperability, but doing so without strong constraints on what that actually means:

1. As Chris Rusbridge pointed out, integrated science (a loosely defined phrase but which at least points to why interoperability matters) means that scientists will be using unfamiliar data, therefore someone (data curators and managers) must make data available for unfamiliar users. This means we're often struggling to work out what needs to be communicated, how it can be communicated, and how to get data into differing toolsets, which brings us to

2. The technical options available to support interoperability are evolving rapidly, funding is limited, and while potential usage is effectively infinite, actual usage will drive us forward.

Both of these lead to the necessity for a strategy which simultaneously makes progress while yielding ground when progress via a particular route is overtaken by progress via another route. However, we need to actually make progress, and it's incumbent on those of us who have an overview of the options (scientific, social and technical) to provide leadership in terms of directions - and strongly defend those choices. But it's just as incumbent on us to yield quickly and gracefully when we (inevitably) make some wrong calls. It's also incumbent on us, individually and collectively, to recognise that sometimes it's the mistakes we make - even the expensive and time consuming ones - that tell us how we should have done things, and sometimes doing something wrong is the only way to find out how to do it right!

As an aside, just in case I don't follow up with more about the conference (and let's face it, this blog is a bit stop and start), Chris's point about unfamiliar users led him to introduce to us the "fourth Rumsfeld": the unknown knowns. I rather like to think that leads to part of the job description for data management:

• We exist to mitigate against the unknown knowns associated with the collection (or production) of data and its usage becoming known unknowns!

## obscurity

For some reason both bloglines and technorati seem to have turned up their noses at my blog. I wouldn't mind, but I kind of rely on them to find if there are any inbound links to my blog (and thus anything worth pursuing). There ain't much point asking the lazy web a question if I can't find any links pointing to the question ...

• Technorati is stuck claiming my latest post is from the 6th of October last year (despite the little preview graphic actually being current and a ping from yesterday).

• Bloglines has a post from November the 14th, but amusingly finds a citation to a january post when I look for posts, but doesn't find that post when I look for citations.

• Google blog reader finds my blog ok, and google web search finds stuff in my blog, but not google blog search. What to make of that?

Oh well. I guess I shouldn't be surprised, there's never been much overlap in what they found as citations, so the fact my posts themselves have disappeared into obscurity shouldn't surprise me.

## NDG papers appear in Phil Trans

I'm really glad to see that our papers on

have appeared in the NERC e-science special edition of the Philosophical Transactions of the Royal Society.

## Reading in 2009, 4: Engleby not for me

I have a t-shirt that reads "too many books, too little time".

Every now and then I read a book and think, "with all the books in the world, why did I bother with this one?" Despite the amount of trash I read, this happens relatively rarely: if a book is entertaining, or interesting, or both, then I'm usually a happy bunny, and I don't set the bar high.

And so to Sebastian Faulks' Engleby. Half way through I seriously considered not finishing it. It starts well enough, but it doesn't take long to suss the plot, and despite the odd passage of quality prose, mostly the mood it builds up is tedium, and a feeling of "when is this thing going to end" instead of any of the feelings I would rather have had, like "how is this going to end" (frankly I didn't care) or "give me more" (please no more) ... or "how time has flown while I've been reading" (nope). So, sadly, this was an "I wish I'd never started it" book. Having started a book, it has to be really bad for me not to finish it, and this wasn't really bad ...

If you ever make the mistake of reading it (all the way through), you'll realise that tedium may well have been exactly what Faulks wanted you to feel, but being true to your plot while boring your readers is something you can only get away with when you have a big reputation.

So, I'm sure some folk will like it (a quick look at Amazon seems to imply that lots of people liked it!). But not me. I wish I had spent the time reading something else.

(I hope that when I grow up I'll learn to not finish books I'm not enjoying!)

## when the paparazzi are ok

I stumbled across this a week or two ago, and have had it sitting tabbed waiting for a response since, because it really got my goat!

The basic thesis of the writer is that it's a bad idea for someone to wander round a poster session at a (scientific) conference, snapping away at the posters using a camera. Leaving aside the issue that I thought I was the first person to think of doing this with a camera phone (obviously not), like I say, it got my goat!

His justification of his position comes down to the fact that he sees "taking" (information) without "giving" (feedback) as not keeping up with the taker's part of a two-way process. He's also worried about what he calls "espionage", and data getting discussed before it's peer reviewed.

Firstly, as to the taking without giving: In some communities, presenting is the price of attendance, the feedback is incidental. In all communities only a tiny percentage of attendees ever give feedback. Does not giving feedback mean I can't/shouldn't listen (to a talk)? Can't read (a poster)? Given how much it costs (in time, money, and emissions) to go to a conference, shouldn't we make damn sure we get as much as possible? I could never engage with (or often even read) all the posters at many conferences.

As to the "discussion" before peer review. What's the point of putting an idea into the community if you don't want it to be discussed? (Risk of the data being analysed by someone else I hear him respond? They can't publish it without credible provenance, so what's the issue, the idea was out the moment you took it to a conference?)

Finally, in my opinion, the best conferences make sure the posters (and the presentations if you're really lucky) are on a memory stick and/or in a repository, so I can have access later. If they don't do that, how is it not better for me to at least take a copy so I can read it later?

Science is about communication. Anything that hinders communication hinders science. Attribution is important, but a camera copy of a paper doesn't make attribution any less likely than publication, in fact, if we extrapolate from the open access experience, it's likely to make attribution more likely.

Bah! Humbug!

## bbc goes to rdf

And not only that, they built their domain model first then built an RDF ontology:

We set about converting our programmes domain model into an RDF ontology which we've since published under a Creative Commons License (www.bbc.co.uk/ontologies/programmes/). Which took one person about a week. The trick here isn't the RDF mapping - it's having a well thought through and well expressed domain model. And if you're serious about building web sites that's something you need anyway.

Someone once said to me that RDF wasn't big out there. Well I knew it was, and maybe he will believe me now!

## Reading in 2009, 2: Water Supply

And so to "When the Rivers Run Dry" by Fred Pearce. Which is about what it says on the tin ...

Another apocalyptic read (I'm not in an apocalyptic mood, it just happened that I got two birthday presents last year in the same vein). This book reads well, but it's another one that could get you breaking out the whiskey before the sun gets over the yardarm. It's absolutely not a book about global warming! Although global warming gets a few mentions, it's a book primarily about good intentions going bad coupled with bungled engineering and short term thinking. It is scary precisely because it would appear we're stuffed on the water front before we even get to the implications of warming ...

There are some really fascinating bits in this book. Take the state of the Aral Sea, for example: I guess I vaguely knew what had been going on, but the detail presented in this book is scary, not just because of what has happened, but because it (the drying up) was planned that way (and despite all the planning, the resulting water for "use" is being frittered away).

Here are a few bits that I noted (for my own nefarious purposes, not because they were necessarily the most important or most interesting ...). All the numbers (except where stated) are from the book, I don't know what the original sources might have been.

#### Not enough water in the first place

A back of the envelope calculation (p33-35) of water availability goes something like this: we will run out of water unless we only use the water that falls as rain (somewhere), that is from the "fast water cycle". In practice we only care about that which falls on land (60K cubic km per annum). If we neglect that which evaporates, and that which is transpired (hmm, I'll get back to that), that leaves about 40K cubic km of runoff for "consumption". Of that, hydrologists reckon it's practical to "capture" 14K (why?). Take out the runoff in inaccessible places (like Siberia), and we're left with 9K, or about 1400 cubic m per annum per person. But earlier on (p22) he's calculated that he himself consumes around 1500-2000 cubic m per annum in terms of water needed to feed and clothe him (as well as that directly consumed, which is far less). So the bottom line is that if everyone wants to live like him, then there's a problem.

• But we neglected the transpiration earlier on, and surely that's part of the water consumed to feed him? So I'm not so sure about the budget. However, whether or not he's got the budget details right, the actual efficiencies (or lack thereof) of actual hydrological systems that he discusses throughout the book make it clear that we have a major problem, and we're eating into water from the "slow water cycle" (deep aquifers etc, which are slowly, but surely, being drained).
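The arithmetic in that budget is easy to replay. This is only a sketch using the figures quoted from the book: the helper name is hypothetical, and the implied head count is backed out from the 9K and 1400 figures, not a number taken from the book.

```python
# Replay of the back-of-the-envelope water budget, using the figures quoted above.
# people_supported is a hypothetical helper, not anything from the book.

M3_PER_KM3 = 1e9  # one cubic kilometre is 10^9 cubic metres


def people_supported(runoff_km3, use_m3_per_person):
    """People a yearly runoff volume could supply at a given per-person use."""
    return runoff_km3 * M3_PER_KM3 / use_m3_per_person


accessible_km3 = 9_000  # accessible, capturable runoff per annum

# Head count implied by "about 1400 cubic m per annum per person"
implied_population = people_supported(accessible_km3, 1_400)

# Head count supportable if everyone used 1500-2000 cubic m per annum
lifestyle_low = people_supported(accessible_km3, 2_000)
lifestyle_high = people_supported(accessible_km3, 1_500)

print(f"implied population: {implied_population / 1e9:.1f} billion")
print(f"living like the author: {lifestyle_low / 1e9:.1f} to {lifestyle_high / 1e9:.1f} billion")
```

On these numbers the 9K cubic km stretches to somewhere between 4.5 and 6 billion people living like the author, against the roughly 6.4 billion implied by the 1400 figure.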

#### We can feed them, but can we water them?

(p38) The UN FAO says that globally we now grow twice as much food as we did a generation ago, but we abstract three times as much water from rivers and aquifers to do so.

#### Dam them all

As a kiwi I both appreciate(d) the benefits of hydro power and mourn(ed) the losses from flooding ... but I always thought of dams as being a Good Thing (TM). However, it appears that it's not always that way: A World Commission on Dams (appointed by the World Bank) made some interesting observations in 2000 (p157-159):

• Two thirds of all dams built globally for water supply to cities deliver less than planned (a quarter deliver less than half)

• A quarter of dams built to irrigate fields irrigated less than a third of the land intended.

• Half produced significantly less power than advertised

• an interesting number is the number of kW per flooded hectare: ranging from 0.2 to 5 for some examples he gives. I thought about that a bit: an interesting comparison is that a tenth of the area could provide between one and four times the same power for the best of these (at 5-20 W/sq m - this last number from the synopsis - pdf - of a new book I want to read).

• Even dams built to protect against flooding have increased vulnerabilities (because they're generally kept full and "emergency releases" are floods in their own right).

• Dams have resulted in at least 80 million rural folks losing homes, lands and livelihoods!

• Many have been poorly sited, often on the basis of faulty estimates of climatic flow (even in wealthy countries like the States: consider the poor future for the Colorado, and Lake Powell in particular - p223 and more recently).

• and that's without considering silting and wetland removal etc
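The kW per flooded hectare comparison in the list above can be checked with a unit conversion. A minimal sketch, assuming the 5-20 W/sq m figure applies uniformly to whatever is built on a tenth of the flooded area; the function names are hypothetical and the numbers are the ones quoted above.

```python
# Convert kW per flooded hectare into W per square metre, then compare the best
# dam with an alternative delivering 5-20 W/sq m on a tenth of the flooded area.
# Function names are hypothetical; the numbers are those quoted in the post.

M2_PER_HECTARE = 10_000


def kw_per_hectare_to_w_per_m2(kw_per_hectare):
    """1 kW/ha = 1000 W spread over 10,000 sq m = 0.1 W/sq m."""
    return kw_per_hectare * 1_000 / M2_PER_HECTARE


def alternative_ratio(alt_w_per_m2, dam_kw_per_hectare, area_fraction=0.1):
    """Power from the alternative on a fraction of the area, relative to the dam."""
    dam_w_per_m2 = kw_per_hectare_to_w_per_m2(dam_kw_per_hectare)
    return alt_w_per_m2 * area_fraction / dam_w_per_m2


worst_dam = kw_per_hectare_to_w_per_m2(0.2)  # W/sq m for the weakest example
best_dam = kw_per_hectare_to_w_per_m2(5)     # W/sq m for the best example

ratio_low = alternative_ratio(5, 5)    # alternative at 5 W/sq m versus best dam
ratio_high = alternative_ratio(20, 5)  # alternative at 20 W/sq m versus best dam

print(f"dams: {worst_dam:.2f} to {best_dam:.2f} W/sq m")
print(f"a tenth of the area gives {ratio_low:.0f}x to {ratio_high:.0f}x the best dam")
```

So the best example works out at 0.5 W/sq m of reservoir, which is where the "one to four times" comes from.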

#### Water from thin air

On the positive side! Chapter 31 discusses technologies for "generating" water.

The discussion of water budgets above was about precipitated water. Of course at any given time a lot of water is sitting in the atmosphere as water vapour - roughly 98% of the 13K cubic km in the atmosphere at any one time (about six times the amount in the world's rivers - again, at any one time).

There is a discussion of dew ponds, and artificial dew producers (using cold ocean water to cause condensation in the desert), and fog capture. Inspiring stuff. (He also talks about desalination and cloud seeding, both of which are rather less inspiring! Even if the former is widely deployed and/or necessary in some places, it's too energy intensive to be a "solution" to the global water issue.)

#### The bottom line

Actually, just like climate change, the water problem is not just a supply side problem, it's a demand problem too. In the final analysis, we need to drop demand, as well as address changing modes of supply. As far as the latter is concerned, there will be no one solution.

If I got one take home thread from this book it would be that there is a dire need for rational politicians and (more) sensible water management practices, coupled with geographically realistic assessments of crop suitability. (And on the demand side, less cotton production - and consumption.)

## Reading in 2009, 3: Miss Smilla

We've just spent a week on holiday in Cornwall, where, apart from the IPCC tomes (mostly unopened) and Obama souvenir edition newspapers (mostly discarded unread), my holiday reading was a reread of Peter Hoeg's Miss Smilla's Feeling for Snow.

I first read this about a decade ago, and had dug it out for someone else to read because we'd been talking about excellent translations (and translators). There is an interesting story (pdf) about the translation itself - there being substantial differences between the US and UK editions. Anyway, the book came back, and languished on the floor of the car until a couple of weeks ago, when I was stuck for half an hour in the car, and so I started to read it again ...

Of course a decade was plenty long enough for me to have forgotten the entire plot so it was effectively a fresh read.

The first thing to say about this book is that despite the Danish origins, even in English, the prose is just fabulous! Mostly I don't care for "fabulous" prose ... when I'm reading a novel I just want a direct connect from text to my brain that doesn't have me realising that I'm actually reading at all ... fancy language gets in the way of that (for me). But this book is different. I can't pull a sentence, or paragraph for you, because I somehow managed to read it in my normal way (ie without being conscious of actually reading), but I still have a sense of joy from the process of reading it. It was clearly wonderful in the original Danish, so all kudos to the translator(s).

The story itself is a pretty good thriller, with an engaging, resourceful lead character (Miss Smilla) who manages to segue in a nearly believable way from one scarcely survivable event to another. I enjoyed the first half more, as the believability/survivability function fell significantly in the second half, but for all that, it was a good read as a thriller. A majorly unbelievable bit was the reason for it all, but since that only became clear right at the denouement (which, as wikipedia puts it, is "unresolved"), it mattered not.

There is more to it than the thriller and the prose: the glimpses of Denmark and Inuit culture and society and their relationships, with each other and Smilla, weave throughout and give it real character.

## SD cards to the rescue

Next generation storage (press release pdf):

The next-generation SDXC (eXtended Capacity) memory card specification, pending release in Q1 2009, dramatically improves consumers' digital lifestyles by increasing storage capacity from 32 GB up to 2 TB and increasing SD interface read/write speeds up to 104 MB per second in 2009 with a road map to 300 MB per second. SDXC will provide more portable storage and speed, which are often required to support new features in consumer electronic devices and mobile phones.

Never mind the electronic devices and mobile phones, my data centre will scale to petabytes without issues associated with air conditioning, power consumption and physical volume!

It also removes another worry for me. In 2009 we expect to add between 500 TB and 1 PB of new physical storage (on spinning disk). This is a rather large perturbation to our normal growth, and I had been worried about how we would replace it in four years' time. If consumer electronics does what it normally does, then in 2012-2013 we'll be replacing a room full of spinning disk with a rack full of SDXC cards ...

The faster bus speeds in the SDXC specification also will benefit SDHC, Embedded SD and SDIO specifications.

and scientific data analysis!
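For a feel of the numbers, here is a rough sketch of how many cards that replacement would take, and how long one card takes to read end to end. It assumes decimal units (1 PB = 1000 TB), maximum-capacity 2 TB cards, and sustained reads at the full 104 MB per second interface speed, none of which is guaranteed in practice; the function names are hypothetical.

```python
import math

# Assumptions (not from the press release): decimal units, full-capacity cards,
# and sustained sequential reads at the quoted interface speed.
TB = 10**12
MB = 10**6

CARD_CAPACITY = 2 * TB  # maximum SDXC capacity in the specification
CARD_SPEED = 104 * MB   # bytes per second, the quoted 2009 interface speed


def cards_needed(total_bytes, card_bytes=CARD_CAPACITY):
    """Smallest number of cards that holds total_bytes."""
    return math.ceil(total_bytes / card_bytes)


def hours_to_read_card(card_bytes=CARD_CAPACITY, speed=CARD_SPEED):
    """Time to stream one full card at the given speed, in hours."""
    return card_bytes / speed / 3600


cards_low = cards_needed(500 * TB)    # for 500 TB of new storage
cards_high = cards_needed(1000 * TB)  # for 1 PB of new storage
read_hours = hours_to_read_card()

print(f"{cards_low} to {cards_high} cards, about {read_hours:.1f} hours to read each one")
```

A few hundred cards is indeed rack-sized, although at that interface speed a full pass over a single card still takes several hours.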

Hat tip (of all places) the online photographer!

## Heat not Drought

Just as my night time reading is all about drought (I'll tell you about that another day), I find this fascinating paper in this week's Science:

Battisti, David S. and Rosamond L. Naylor, Historical Warnings of Future Food Insecurity with Unprecedented Seasonal Heat, Science (2009)

The bottom line is that heat waves may be more important than droughts for some food production. They give the example of a major perturbation on wheat production and consequential world wheat prices arising from the hot, dry 1972 summer in the Ukraine and Russia. They then point out that while that summer ranked in the top ten percent of temperature anomalies between 1900 and 2006 (with temperatures 2-4C above the long term mean), one third of the summers in the observation period were drier!

They then use observational data and simulations from 23 global climate models to show a high probability (>90%) that growing season temperatures in the tropics and subtropics by the end of the 21st century will exceed the most extreme seasonal temperatures recorded from 1900 to 2006. The consequences for food production are extreme!

A couple of choice quotes:

... regional disruptions can easily become global in character. Countries often respond to production and price volatility by restricting trade or pursuing large grain purchases in international markets - both of which can have destabilizing effects on world prices and global food security. In the future, heat stress on crops and livestock will occur in an environment of steadily rising demand for food and animal feed worldwide, making markets more vulnerable to sharp price swings.

... with growing season temperatures in excess of the hottest years on record for many countries, the stress on crops and livestock will become global in character. It will be extremely difficult to balance food deficits in one part of the world with food surpluses in another, unless major adaptation investments are made soon to develop crop varieties that are tolerant to heat and heat-induced water stress and irrigation systems suitable for diverse agroecosystems. The genetics, genomics, breeding, management, and engineering capacity for such adaptation can be developed globally but will be costly and will require political prioritization ...

## EGU 2009

Well, I haven't been to a major conference for a while, and I received a raft of invitations to give talks at EGU this year.

So, with colleagues, we have a raft of abstracts submitted:

## Two degrees of Warming

William asked what ML thought would happen with two degrees. I suspect the reason he asked that is that most of us believe that two degrees is in the pipeline, and pretty much inescapable now. Indeed, I reckon we'll see it (wrt 1960) within a few decades (wrt now).

Ideally folk should go read the book, but this is the gist of the one and two degree chapters - via section titles (and my parenthetic summary):

• One degree

• America's Slumbering Desert (droughts, soil loss etc)

• (An aside on the fact that the Day After Tomorrow hasn't and isn't likely to happen)

• Africa's Shining Mountain (farewell glaciers on Kilimanjaro, implications for water)

• Ghost Rivers of the Sahara (greening the Sahara, yes, no, maybe, floods and droughts).

• The Arctic Meltdown Begins (tipping points for ice, permafrost melt, drying)

• Danger in the Alps (mountains and villages at risk of destruction as permafrost melts)

• Queensland's Frogs Boil (dramatic biodiversity loss, in rainforests and reefs, in Queensland and elsewhere)

• Hurricane Warnings in the South Atlantic (are hurricane characteristics changing?)

• Sinking Atolls (bye bye Tuvalu, Kiribati etc)

• Two degrees

• China's Thirsty Cities (water shortages)

• Acidic Oceans (real problems for phytoplankton, and thus everything)

• Mercury Rise in Europe (i.e. more heat waves)

• Mediterranean Sunburn (fires and drought)

• The coral and the icecap (sea level rise beyond the IPCC predictions)

• Last stand of the polar bear (arctic melting)

• Indian summer (food production decline, water issues)

• Peru's melting point (glacier melt leading to water shortage)

• Sun and Snow in California (water crisis)

• Feeding the Eight Billion (ups and downs in food production, net down)

• Silent Summer (climate change too quick for ecosystems, mass extinctions)

There is clearly much more. Please read the book, even if you want to disagree with a few of the details! Obviously ML is cherry picking the literature, but there is much more out there, and I don't think it's unrepresentative!

## curation and specification

Chris Rusbridge has a couple of interesting posts (original and followup) about specification and curation. The gist is that he's reporting some conversations, which I might baldly summarise as saying something like:

1. Any application stores data and renders it for consumption (by a human or another layer of software). In the best of possible worlds, a specification for the data structure AND the application functionality should be enough to ensure that a third party could render the data for consumption at the next level without reference to the application code itself. However, certain real world experience suggests that the specifications are not enough, you need the code as well, because real implementors break their specifications.

2. There was some discussion about the ability of OOXML and ODT to preserve semantic content preferentially over latex and PDF ... (primarily I think because some of the key semantic content would have been in figures which could be parsed for logical content, and both latex and pdf would have turned those figures into images).

3. As a consequence, Chris gets to this position:

So ... running code is better than specs as representation information, and Open Source running code is better than proprietary running code. And, even if you migrate on ingest, keep BOTH the rich format and a desiccated format (like PDF/A). It won't cost you much and may win you some silent thanks from your eventual users!

You'll not be surprised to find I have some opinions on this ...

1. I think in nearly all the cases where the specification is not enough, it's because the specification was a) not designed for interoperability, and b) was not the definition of the format and functionality. In these cases we find the spec is an a posteriori attempt to document the format (almost certainly the problem with the Microsoft and Postscript examples discussed in the links). In particular, in those cases where we're dealing with preserving information from a format and specification from one vendor, we find both a) and b) violated, nearly all the time. What that says to me is that we should avoid trying to curate information which is in vendor-specific formats in favour of those where there are multiple (preferably open-source) implementations.

2. Running code will become non-running code in time, and not much time at that. What I hope Chris means, is keep the source code which ran the application. Even then, every software engineer knows that the code is not documentation, and that with sufficiently complex code, NO ONE will understand it. So, code without specification is a candidate for obsolescence and eventual residence in the canon of "not translated, not understood, not useful" write only read never (WORN) archives.

What do we do at BADC? (In principle!)

1. Preserve input data. Copy on ingest if we have to, but we prefer (for the data itself) to demand that the supplier reformat into a format which does conform to a) designed for interoperability, and b) where there is a specification which preserves enough of the information content. (Duplicates of TB of data are not viable).

2. Preserve input documentation. Preserve specifications. Demand PDF (for now). Yes, the images are an issue, but if the images are data, then they ought to be actively preserved as data in their own right.

3. Ban MS documentation. History suggests that MS documents become WORN in about 6-8 years. Those who know no history are doomed to repeat it ...

So, I would argue that if you are doing curation, you have to address workflow before you get to the point of curation. If you know you want to preserve it, then think about that from the off. If you know you don't care about the future (shame on you), then yeah, ok, use your cool vendor tool ... but don't give the data to someone to curate, because curation is, in the end, about format conversion. If not now, sometime in the future INEVITABLY. If the documentation doesn't exist to do it, it's not curating. Don't kid yourselves.

All that said, much of the initial conversation was in the context of document curation, not data curation. IMHO the reason for much of what I perceive as confusion in their discussion, is not recognising the distinction! In the final analysis, I think that

• if your object is to curate documents (i.e. what I would call the library functionality), then preserving PDF/A, latex etc, is perfectly fine - after all, with the spec, you're preserving with the same fidelity that documents have always been preserved.

• if your object is to preserve the data, then it's a different ballgame, and folk need to confront the fact that curating data requires changes to the original workflow!

## european summer drying

OK, I confess, I'm clearly reading my abstract summaries this morning ...

Briffa, van der Schrier and Jones: Wet and dry summers in Europe since 1750: evidence of increasing drought (International Journal of Climatology, 2009):

Moisture availability across Europe is calculated based on 22 stations that have long instrumental records for precipitation and temperature. The metric used is the self-calibrating Palmer Drought Severity Index which is based on soil moisture content. This quantity is calculated using a simplified water budget model, forced by historic records of precipitation and temperature data, where the latter are used in a simple parameterization for potential evaporation.

The Kew record shows a significant clustering of dry summers in the most recent decade. When all the records are considered together, recent widespread drying is clearly apparent and highly significant in this long-term context. By substituting the 1961-1990 climatological monthly mean temperatures for the actual monthly means in the parameterization for potential evaporation, an estimate is made of the direct effect of temperature on drought. This analysis shows that a major influence on the trend toward drier summer conditions is the observed increase in temperatures. This effect is particularly strong in central Europe.

## Reading in 2009, 1: Six Degrees

So clearly in the last few weeks I've not been working and I've had half as many kids to look after ... so in between tears I've been dealing with fears .... the sort that abound if you make it much past the first chapter of Mark Lynas' book: Six Degrees: Our Future on a Hotter Planet.

Inspired by Aaron Swartz, albeit recognising reality (although heavily modified by recent events), I've decided to try and blog my year's reading. Don't expect it to be too erudite, when I've got time I read pretty eclectically, and often choose mindless crap just to eat up time with as few brain cells as possible involved.

Anyway, back to the book of the day. The basic thesis is that there are six chapters describing the likely outcomes should our planet heat by between one and six degrees as a result of anthropogenic CO2 climate change.

It's a pretty well written book, with what looks to me like a reasonable coverage of the apocalyptic end of the literature. Clearly it's got a journalistic tone, with a fair dose of hyperbole, but he does temper it with some qualification from time to time. While we all hope the six degree end is pretty unlikely, the possible consequences of even (!) the 2-3 degree changes make scary reading.

It'd be pretty easy, I think, to find the "ifs buts and maybes" in the original literature, and not much of it was "news" to me, but the thing about reading it all in one place was that it brought home to me that if even some of the predictions come home to roost, the world (both geographically and socially) is going to be a pretty different place in just a few decades, let alone a few centuries. Again, maybe that wasn't news, but there's something about having it rammed home all in one volume ...

So it's inspired me in two ways: I'm going to get back to the entire IPCC report and read the bits I don't normally (i.e. WG2 and WG3 stuff), and I'm going to try much harder to avoid business travel (you may well ask about personal travel, but we'll save the answer for another day). As regular readers will know, I've been avoiding business travel this last year anyway, in favour of virtual conferencing. You now know why, Evan having been pretty sick for a long time, and while looking after Evan is no longer an excuse for not travelling, I think given my profession, and given what we now believe about the future, it'd be wrong not to continue to try. Which brings me to my New Year's resolution: to try and convince my colleagues, especially the senior ones, to try harder to avoid physical meetings - particularly where the meetings are part of a regular sequence.

## blimey! solar wind and tropical cyclones

Here are a couple of papers that I'm going to have to find time to read properly:

• Prikryl, P., Rušin, V., and Rybanský, M.: The influence of solar wind on extratropical cyclones - Part 1: Wilcox effect revisited, Ann. Geophys., 27, 1-30, 2009, and

• Prikryl, P., Muldrew, D. B., and Sofko, G. J.: The influence of solar wind on extratropical cyclones - Part 2: A link mediated by auroral atmospheric gravity waves?, Ann. Geophys., 27, 31-57, 2009.

Some choice excerpts from the abstracts:

A sun-weather correlation, namely the link between solar magnetic sector boundary passage (SBP) by the Earth and upper-level tropospheric vorticity area index (VAI), that was found by Wilcox et al. (1974) and shown to be statistically significant by Hines and Halevy (1977) is revisited. A minimum in the VAI one day after SBP followed by an increase a few days later was observed. Using the ECMWF ERA-40 re-analysis dataset for the original period from 1963 to 1973 and extending it to 2002, we have verified what has become known as the "Wilcox effect" for the Northern as well as the Southern Hemisphere winters.

Cases of mesoscale cloud bands in extratropical cyclones are observed a few hours after atmospheric gravity waves (AGWs) are launched from the auroral ionosphere. It is suggested that the solar-wind-generated auroral AGWs contribute to processes that release instabilities and initiate slantwise convection thus leading to cloud bands and growth of extratropical cyclones.

It is also observed that severe extratropical storms, explosive cyclogenesis and significant sea level pressure deepenings of extratropical storms tend to occur within a few days of the arrival of high-speed solar wind.

Do I believe in this?

Well, I haven't read the papers, but I'm on record as believing that upper boundary effects can reach the troposphere, so it's feasible, particularly in that the basic thesis seems to revolve around small scale waves driving systems across instability boundaries, a non-linear effect that is more than feasible.

## Kia Kaha My Boy

Evan Lawrence, Born 5 July 2007, Died 18 December 2008. Forever Young:

## exist memory and stability

Within ceda we are producing more and more xml documents, and the obvious tool for most of them is an xml database. At the moment there appears to be only one candidate in the open source world that has something approaching the necessary reliability and scalability: eXist. Colleagues who have used or are using Berkeley xml database and xindice have been pretty scathing about their experiences, and I'm not aware of other options (although something interesting may be happening in the mysql and postgres worlds). We're not overly impressed with the reliability of eXist either, which is what I want to document here, but that said, we still believe it's the way forward (e.g. in 2007 it was compared with MySQL circa 1997, which implies reasonable prospects).

We initially installed eXist in a tomcat container some years ago, and we've upgraded eXist and tomcat a few times along the way. It's only been this last year that we've started to get upwards of thousands of documents in eXist, and it's only in the last year that we've started to have major stability problems, with many tomcat and eXist restarts necessary, and several restores from backup. We have observed problems with consistency when large document insertions via xquery over xmlrpc have been interrupted, resulting in the necessity to rebuild collections. We have also seen the problems with collection indexes reported in email to the eXist list.

A couple of weeks ago, we decided to reinstall eXist in the jetty container, and see if we could identify what the problems were. Doing so has been a bit hairy. eXist is pretty well documented, but even so, the multitude of ways that eXist can be deployed (standalone database server, embedded database, or in the servlet engine of a Web application), leads to a considerable amount of ambiguity in documentation, and most importantly, significant difficulties in working out which files are associated with the various memory configuration options (because which files control the memory depends on the deployment option). We're pretty confident that at least some of our problems are memory related, and maybe we're not the only folk in the same boat.

Anyway, herewith is what we think we need to do to control eXist when deployed as a service via the tools/wrapper/bin/exist.sh wrapper. Firstly, there are at least two places we think that the memory available to the processes can be configured:

• tools/wrapper/conf/wrapper.conf where we (now) find the process memory configuration

# Initial Java Heap Size (in MB)
wrapper.java.initmemory=64
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=512


and

• conf.xml, where we find the db-connection configuration

<db-connection cacheSize="256M" collectionCache="124M" database="native"
files="/disks/databases/exist" pageSize="4096">


where we are warned that

• the cacheSize should not be more than half the size of the JVM heap, which in this case is not set (directly) by the JVM -Xmx parameter but by the wrapper. Just that one misconception alone took us ages to track down, and;

• wrt collectionCache, ...if our collections are very different in size, it might be possible that the actual amount of memory used exceeds the specified limit. You should thus be careful with this setting. Huh? So how do I be careful? What should I look for?

We think the setting we have should be ok because we have seen the eXist developers saying this should be ok in other email. But ... we're not happy with not knowing ... is this a failure mode or a performance problem setting?
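For later reference, here is a minimal sketch of how the "cacheSize no more than half the heap" rule could be checked mechanically. It is not an eXist tool: the two fragments are copied from above (with the db-connection tag self-closed only so that it parses on its own), and the helper names and the assumption that sizes always carry an "M" suffix are mine.

```python
import re
import xml.etree.ElementTree as ET

# Fragments copied from the post. The db-connection tag is self-closed here
# only so that the snippet parses standalone.
WRAPPER_CONF = """\
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=64
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=512
"""

CONF_XML = """\
<db-connection cacheSize="256M" collectionCache="124M" database="native"
    files="/disks/databases/exist" pageSize="4096"/>
"""


def wrapper_max_heap_mb(text):
    """Return the value of wrapper.java.maxmemory (in MB) from wrapper.conf text."""
    match = re.search(r"^wrapper\.java\.maxmemory\s*=\s*(\d+)\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError("wrapper.java.maxmemory not found")
    return int(match.group(1))


def size_mb(value):
    """Convert a size such as '256M' to megabytes (hypothetical helper, 'M' suffix only)."""
    if not value.endswith("M"):
        raise ValueError(f"unsupported size: {value!r}")
    return int(value[:-1])


def check_cache(wrapper_text, conf_text):
    """Compare the page cache size with half of the wrapper's maximum heap."""
    heap = wrapper_max_heap_mb(wrapper_text)
    db = ET.fromstring(conf_text)
    cache = size_mb(db.get("cacheSize"))
    return {
        "heap_mb": heap,
        "cache_mb": cache,
        "collection_cache_mb": size_mb(db.get("collectionCache")),
        # The documented rule of thumb: cacheSize <= half of the JVM heap.
        "cache_ok": cache * 2 <= heap,
    }


report = check_cache(WRAPPER_CONF, CONF_XML)
print(report)
```

With our current values the page cache sits exactly at the limit. Note that this says nothing about collectionCache or about the memory the query libraries need, which is precisely the part we are unsure about.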

Then, and worryingly, because some of our problems seem to occur in processing xqueries, we find in (old) email from an eXist developer (in an interesting thread) that

cacheSize just limits the size of the page cache, i.e. the number of data file pages cached in memory. This does not include the base memory needed by eXist (and the libraries it uses) for XML processing and querying.

which is probably why they recommend not having it larger than half. But what happens when those libraries blow out their memory requirements?

We haven't yet resorted to turning off full text indexing, because we want that, but we could try the upgrade mentioned in the eXist developer email. We're currently on 1.2.4-rev:8072-20080802.

At this point we haven't yet set a java expert on our problems, but I guess we might have to (following the sort of thing done here). Meanwhile, this post is by way of notes for later (and maybe a brief for such an expert).

A viral book meme is apparently propagating (thanks Sean):

1. Grab the nearest book.

2. Open it to page 56.

3. Find the fifth sentence.

4. Post the text of the sentence in your journal along with these instructions.

5. Don't dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

Well, I'm going to break the rules slightly, only because I kept thinking I had the closest book, and then finding a closer one (under a pile of paper), and the sequence of three pretty much parallels my career, so it's both scary and interesting in that way :-)

Anyway, from the real closest, to the one I initially thought was the closest.

1. "Since interfaces and abstract classes carry with them contracts for behaviour, the interface or abstract class can be used to represent an arbitrary implementation" (no I'm not sure I know what that means either). From Dan Pilone (2003), UML Pocket Reference.

2. "To jump to the next iteration of a loop (skipping the remainder of the loop body), use the continue statement" From David Beazley (2001), Python Essential Reference (2nd Ed).

3. "For a finite change of length from Li to Lf,

$$W = \int_{L_i}^{L_f} F \, dL,$$

where F indicates the instantaneous value of the force at any moment during the process." From Mark Zemansky and Richard Dittman (1981), Heat and Thermodynamics.

Of course, being a physicist, I'm going to argue that any of the three could be the closest, because the distance metric is not described accurately enough; closest in what coordinate system, physical distance or accessibility? (With number of pages of paper on top being part of the functional description of the latter.)

by Bryan Lawrence : 2008/11/14 : 0 trackbacks : 1 comment (permalink)

## CMIP5 - The Federation

Next week there is a meeting which I hope will finalise the data requirements to participate in CMIP5. I can't go, but there are a number of issues on the table which I care about. I'll try and write about each in its own blog post, which you'll be able to find by looking at my (new) CMIP5 category page.

PCMDI is leading the development of a global data archive federation to support CMIP5. It needs to be global: conservative estimates of the volumes of data to be produced for CMIP5 are that there will be PB produced in the many modelling centres involved in producing conforming simulations. Within those PB of data, certain variables, periods, and temporal resolutions of output are going to be defined to create a core archive. We hope that it will be about 500 TB in size.

The global federation being put together will federate

• three copies of that core (one at PCMDI, one here at BADC, and one at the World Data Centre for climate at DKRZ), and

• as much of the data held in the individual data providers as is possible.

Schematically we see something like this

where

• each colour is meant to represent the data from one modelling centre,

• the outer ring represents the federation

• the inner ring represents the core

• there are some modelling groups who have data in the federation and in the core, and

• some groups who are not part of the federation, yet have data in the core.

(the schematic has nine modelling groups, six in the federation, but in reality there will be many more and the distribution between the camps is not yet known, in particular, while the diagram shows an equal distribution between modelling centres and cores, we expect PCMDI to ingest most of the data directly as was done for CMIP3). (Clarification added 11/12/08.)

BADC expects to have three roles in this activity, we will be

1. representing the Met Office Hadley Centre (holding as much data as possible, whether in the core or not),

2. holding simulation data on behalf of the NERC community as a federation partner, again, whether in the core or not, and

3. as a core federation partner, holding a core copy.

I hasten to add, all the core partners will be deploying services to help prospective users by producing simple diagnostics and subsetting the archive in many different ways - so as to avoid crippling download volumes!

## tales from the sleep deprived

Milk chocolate is like low-alcohol beer. Pointless!

## lack of service

We've had a hell of a day today.

Somehow, for reasons unknown (but following a planned power cycle, and I don't really believe in coincidence), the router which supports the CEDA (BADC+NEODC etc) network decided yesterday to hide CEDA from the world (and the world from CEDA). At the same time it did the same thing for a few dozen offices as well (including mine).

This despite all sorts of redundancy that is supposed to stop this sort of thing happening (it might be that it was faulty redundancy that caused it). Frankly we still don't know what the problem was, and we've not got it back to normal. However, we've got most things back up and working (but I'm sitting typing this in my secretary's office, she's had a working connection all day, but me in the office next door, nothing). What is working serverwise is by dint of an extra cable or two connecting one part of our machine room to another (one entire link to a backbone router refuses to function, despite being physically ok).

The post mortems from this will run and run, but meanwhile I feel almost physically sick. One of our most important attributes is a reliable fast network, and for (hopefully) one day, it's been neither! It'd probably be ok, but it seems every time we have power trouble - whether scheduled or not - we have problems, either immediately or in the weeks that follow, and every time it's another "class" of problem ... (because of course we do fix each failure mode as we discover it).

If anyone from FAAM is reading this. Sorry. We're not picking on you ...

Should we be better at anticipating problems? Yes. Can we do better? Yes. Will we never have problems like this again? No, we're simply not resourced for truly high availability.

## peak everything

My blogging has almost dried up. It will be back, but maybe not yet.

Meanwhile, I was reading Michael Tobis's excellent presentation on ethics, when I found buried in the comments a link to another fascinating and scary presentation (3 MB pdf) at brave new climate.

Do look at it, and especially the very last slide, but if you're lazy, the bottom line is:

• there isn't enough coal and oil for CO2 to exceed 460 ppm,

• there may not be enough phosphorus for western agriculture to survive either ...

Clearly I don't know anything about the reliability of the material, but when I get my head above water (one day ...), I'll go looking ...

(p.s: well done America, maybe this time they really have elected someone who could become a leader of the free world)

I've just spent three weeks without my laptop. That's a story in itself. (IBM worldwide warranty for Lenovo might well work, but glacially, and their ability to communicate sucks beyond belief ... but I'll tell that story another day, if I can be bothered).

So, while I was hamstrung, I was working on borrowed computers and laptops, and began to make a bit of use of delicious.com via the firefox plugin. Anyway, I'm back on my laptop, so I wanted konqueror to work with delicious. It's much easier than you think, but no prizes to delicious for telling you that you can pretty nearly copy the Safari instructions, i.e., all you need to do is drag the relevant links into the right places:

1. First, open up the "tools/Minitools/edit Minitools" window, and create a new bookmarklet called, say "post to delicious". Then drag the "Bookmarks on Delicious" javascript from the Safari section into the Location box.

• You'll need to enable the extra toolbar if you haven't already, and click on the yellow star and unclick on your new menu item.

2. Then just drag the "My Delicious" link onto your bookmark toolbar.

## Virtual Conferencing

For various reasons I'm unable to travel much at the moment, yet the last few weeks have seen two of the most important events of the year for me: one in Seattle and one in Toulouse. I was there, but not there, for both.

For the first, GO-ESSP, the conference venue was Seattle Public Library, the local hosts had selected yugma professional as the sharing technology, and I was mostly at home (1 Mb/s broadband with a following wind and then the various bits of string in the public Internet). For the second, the conference was in a room at cerfacs, the technology was Acrobat Connect Pro (ACP), and much of the time I was at work (1 Gb/s LAN connected via SuperJanet to GEANT ... you get the picture).

For the first, there was a bigger audience, and I sense, a bigger room. For the second, it seemed smaller and more intimate. Some were at both, maybe someone can tell me the relative sizes of the rooms! I did get video in to the second, but not the first ... the second got video of me, for what it was worth :-).

In truth, the biggest part of the difference came down to reliability of the internet, and reliability of the audio (we chose to use normal teleconferencing as well as the adobe connect for the latter, mainly because of echo cancelling issues which we didn't quite nail down in the pre-event trial).

As far as audio goes, my home long distance carrier was onetel. It isn't any more ... I spent the first day of the Seattle meeting frustrated because I couldn't hear what was going on, on the second, I called in briefly by mobile, and realised that onetel was much of the problem. With BT, I could hear the speaker fine. In general however, I couldn't hear the audience questions, or the replies (I guess the speakers were perambulating around the front away from a fixed mike). In Toulouse (I think) they only had one mike, but it seemed to be much better at getting everyone (albeit with a few requests for repeats from me) ... hence my comment about venue size. The take home message I think is: to support external audio, in a larger room, use a mike strapped onto the speaker, and position folk with microphones to pass to the audience for questions (or get the speaker to repeat them - a bit of that happened in Seattle, but often when the speaker had been let too far out on her/his tether :-). In a smaller room you can get by with one or two mikes.

Both technologies allow virtual desktop sharing, and in the case of ACP, I was even able to control the remote desktop in Toulouse and move the mouse around to highlight stuff. Of course, in practice, the network was the big discriminator: I don't know much about Seattle public library, but I'm guessing their network wasn't provisioned for a couple of dozen simultaneous VPN downloads of large software packages and datasets. (What would you do if the speaker just advertised some cool thing? Yes, that's right, download it right then and there, and play with it for the rest of the presentation.) So yugma cut out quite a lot, and one or two presentations were basically lost at sea.

What do I think about the two technologies? Well, there's no doubt that at least in terms of the two clients I used, ACP is far more full featured (even though it is hamstrung on linux, unlike yugma which seemed O/S agnostic). I think for smaller meetings, ACP would be significantly better in that it can take over your camera and microphone and do the multiple windows thing naturally, and for bigger meetings it might not make much difference - with one reservation. With yugma it appears you are reliant on their servers ... with ACP, we can (and did) use our own server (in Germany), which implies that if local networks are not the pinch point, we can avoid pinch points at the server by deploying our own. I don't know if one can do that with yugma. (I should say that I don't know what the cost was to deploy either solution!) However, regardless of which is the winner, I was damned glad both were available. I felt like I got a good deal of benefit out of my late nights "in" Seattle, even without the questions, and I felt like I was really there in Toulouse (and in the latter case everyone else felt like I was there too)!

One last point: for those of you in Seattle, who had the delight of hearing my three-year old daughter interject during a talk: that's what happens when your laptop fan dies, and you have to resort to the desktop in the dining room, and when you fail to correctly implement the mute ... :-) Had I known the audio from my end was so good, well maybe I'd have been asking some questions too!

This lack of time for reading is not unrelated to a lack of time for blogging! So the relative quiet on this blog is a combination of both a lack of free time out of work, and too many balls in the air at work. For all of those waiting for metamodels part two, I have a talk which should be grist for one or more posts, but I haven't had time to get there ...

## why global warming is too slow

Another video to add to my collection: This one (by Dan Gilbert) is on why global warming hasn't resulted in (much) action: it turns out it's not PAINful (Personal, Abrupt, Immoral, Now).

(Thanks to Real Climate).

## More Comment Spam

Some sad git has coded around my comment captcha ... I wouldn't have thought it worth his/her time ... but perhaps it's a generic problem. In any case, as a consequence, I've coded around them ... and removed some (but definitely not all) of the recent comment spam. I may have inadvertently removed something which wasn't spam, so if you spot that I've removed something that might actually be of interest to readers, let me know.

I've always thought "comments policies" were a bit redundant. If I don't like what folk post, I'll remove it ... (in practice, it has to be off topic, and wildly so, or using offensive language). I'll continue to do that ...

by Bryan Lawrence : 2008/08/29 (permalink)

## Defining a Metamodel - Part One

I've just introduced the concept of a metamodel as being a key component of the conceptual formalism required to come up with a conceptual model of the world expressed in a conceptual schema.

But that's not enough, or at least, it wasn't enough for me. To move forwards we need to understand the relationship between metamodels, vocabularies and ontologies, and when we've done that we can get to grips with the basic entities that would have to exist in our metamodel.

There is a pretty good stab at defining these things in an article by Woody Pidcock at metamodel.com.

My summary of the key aspects of the key concepts is:

| Concept | Summary |
|---|---|
| Controlled Vocabulary | A list of terms that have been enumerated explicitly, which have unambiguous definitions, and which is governed by a process |
| Taxonomy | A collection of controlled vocabulary terms organised into a hierarchical structure, with each term in one-or-more parent-child relationships |
| Ontology | A formal representation of a set of concepts within a UoD and the relationships between those concepts. In principle the list of concepts and relationships itself forms a controlled vocabulary. |
| Meta-Model | An explicit model of the constructs and rules needed to build models within a specific UoD. A valid meta-model is an ontology, but not all ontologies are modelled explicitly as meta-models. |

(UoD: Universe of Discourse)

In particular, Pidcock points out that a meta-model can be viewed from at least two perspectives: as a set of building blocks and rules to build models (which is the sense I have used it thus far), and as a model of a domain of interest (as might happen if one produced a hierarchy of conceptual models). Clearly, an ontology constructed as a conceptual model describing a particular universe of discourse is much more likely to be the latter than the former, although one might well construct one of the former on the way ...

Which brings me to the next step.

For metafor, we're starting with quite a bit of prior art, including, but not limited to our own NumSim, the Numerical Model Metadata, and most importantly, the Earth System Curator ontology. But we're endeavouring to avoid the mistakes of the past, and going back to some fundamentals I listed a long time ago, and in particular "reusing blocks from widely adopted standards". One obvious set of standards are the ISO19000 series of geographic information standards.

These last include a set of metamodels (19103, 19109 etc.), as well as the more widely known content and syntactical standards such as 19115/19139 and 19136 (GML). My next step is to consider the information infrastructure we're likely to have in place (e.g. controlled vocabularies, ontologies) and the basic types we're likely to need in our metamodel, in the context of serialising both to XML and RDF, each of which may play different roles in producing and consuming content.

## Balancing Harmonisation

Of course all this nattering about formalising information model development is precisely the process that both GEOSS and INSPIRE are going through. The difference for us is that we're a small domain, a lot of what we want to model (in the information sense) is not geospatial, and our user community is both global and not EO in the GEOSS sense.

Still, the INSPIRE implementing rules have a lot of relevance (albeit at 151 pages, like the ISO documents, their relevance is diluted by verbiage. It's hard to get excited if the key message has to take that long to deliver).

In particular, given that we have a number of communities to deal with, a key requirement is to get the level of "harmonisation" right:

(figure 15 from the INSPIRE Methodology for the development of data specifications.)

Some interesting numbers:

| (million barrels per day) | 2005 | 2007 |
|---|---|---|
| World Oil Production | 84.63 | 84.6 |
| Chinese Oil Demand | 6.84 | 7.6 |

The original article (hat tip John Fleck) makes the point that a natural inference from these numbers is that while Chinese oil demand has been rising, elsewhere in the world, demand has been falling. Now remember these figures predate the recent price eruptions.
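To make that inference explicit, here is the arithmetic as a small sketch. It assumes (as the inference itself does) that world production can stand in for total world demand; the variable names are mine.

```python
# Figures quoted above, in million barrels per day.
world_production = {2005: 84.63, 2007: 84.6}
chinese_demand = {2005: 6.84, 2007: 7.6}

# Assumption: production is used as a proxy for total world demand,
# so whatever China did not consume was consumed elsewhere.
rest_of_world = {
    year: world_production[year] - chinese_demand[year]
    for year in world_production
}

china_change = chinese_demand[2007] - chinese_demand[2005]
rest_change = rest_of_world[2007] - rest_of_world[2005]

print(f"China:         {china_change:+.2f} million barrels per day")
print(f"Rest of world: {rest_change:+.2f} million barrels per day")
```

On these figures Chinese demand rose by about 0.76 million barrels per day while demand everywhere else fell by about 0.79.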

I find an interesting subtext in these numbers. As far as I'm aware it would be hard to argue that the fall in demand, which has probably come from Europe, Japan and the U.S., has been driven by consumer or governmental drives to reduce oil consumption because of concerns about global warming. So that implies that if we could actually devise an effective policy to drive down oil consumption in the west, superimposed on what might be a natural downward trend, then it would probably be possible to allow Chinese demand to continue growing and have a net decrease in consumption globally. And if that's so, it removes one of those planks from the "do nothing" brigade who argue that there's no point because Chinese (and Indian and Brazilian etc) consumption will continue to increase.

We can make a difference!

(I wonder what the coal figures look like?)

(Update: I shouldn't have started this, but now Jeff has a link to the latest EIA figures, and I haven't a clue what the real state of supply and consumption is ... which echoes the major point of his first post I guess.)

## Formalising Information Model Development

Our metafor project is trying to establish a formal methodology for constructing UML, working with that UML in a team, and building multiple implementations using combinations of RDF and XML-Schema.

There are some interesting problems to overcome:

1. Establishing UML conventions that will allow code generation.

2. Dealing with version control of the UML. (And dealing with different versions of XMI).

3. Dealing with code generation itself.

(In this context, code generation is simply the generation of XML-Schema and/or RDF which contain appropriate data models).

I'll try and address these in a series of posts. (I was once told that it's a bad idea when being interviewed live to say something like "there are three points I want to make" because you're bound to forget the third under the pressure of a live interview. I suspect I have form for listing things I will post about, and never getting around to doing so, which must be the blogging equivalent, and thus something to be avoided. We'll see.)

But before we do this, there are some vocabulary issues that need expansion. Metafor is about building a "Common Information Model" (or CIM) for climate (and other numerical) models. Such models describe the earth system using mathematical equations encoded as computer programmes, and are used as part of activities to produce data (often starting with various data inputs). I've described some of this before. Like most folk entering this sort of activity I've typically blundered straight into defining things without thinking about the process clearly. Perhaps I've learnt a few things along the way about why that's not so clever ...

The first step before we construct any conventions is to consider the structure we're actually talking about. Much of this is really well documented in section 7.4 of ISO19101 and in the entirety of ISO19103 (Geographic information - Conceptual schema language) but since those documents are not easily accessible, we'll try and summarise.

(Note that we're using model in two different ways: in the context of "information" and in the context of "simulations"; hopefully the context will make clear which is which.)

Firstly, from wikipedia we have :

... a model is an abstraction of phenomena in the real world; a metamodel is yet another abstraction, highlighting properties of the model itself. A model conforms to its metamodel in the way that a computer program conforms to the grammar of the programming language in which it is written.

So, to construct our CIM, we need to construct a metamodel, which provides some rules about how to construct it. Clearly our metamodel includes the exploitation of UML, but it needs to be more than that, otherwise we'll not be able to build something we can serialise into RDF and XML without a lot of human interaction. It's that "more" that we need to consider in a later post ... but meanwhile, we also need to relate the metamodel, and our own information model with their actual application in schema, and we also need to remember we live in a world where others are doing similar things.

I've tried to summarise all this in a customised schematic version of figure 4 in ISO19101:

## Model Intercomparison, Resolution, Ensembles

We've talked about the trade off between these things, but the reality is whichever way we go we get more data, and more data means more problems. This is what's on my mind:

Fortunately there is another wall (unfortunately it's full too), and scope to replace these with higher capacity units, but we'll need more power into the room too ... (and more rooms) ...

## The Rising Storm

I've been on holiday ... more of that anon. When I catch up on email and administrivia enough to return to things of interest to others, blogging will return too ...

Meanwhile, I know that some of my readers fall into the camp of "can't quite believe this climate stuff" but "don't believe the nutters either".

So for you: two videos and something to provoke some thinking, most of which agrees with my thinking too, but I haven't the eloquence or the strength to follow through to write things like that myself ...

• So, ten minutes from Hansen. Listen and Learn.

• Nearly twenty minutes from Seita Emori. The model he's describing was the highest resolution model in the AR4 archive. That doesn't make it right, and he probably ought to caveat more the results beyond temperature, but it's all very plausible. The fact that it is even plausible should cause concern!

• And finally, Michael Tobis: "My view in a nutshell".

## anatomy of a mip - part2

My recent description of the key components of model intercomparison projects was done both as input to metafor deliberations and as preparation for a visit by Simon Cox. We spent a bit of that visit time discussing the UML describing such projects (which appeared in the previous post). In doing so, we managed a few simplifications and fixes to my UML ...

The key points to notice are

• fixing the association to be in keeping with usual use of UML (in particular, noting that a composition association implies that if the parent instance is deleted, the child instances should also disappear).

• making more clear the association between RunTime and Experiment by adding the explicit conformsTo association.

• moving the ModelCode to be an adjunct to the RunTime so that the RunTime directly produces the Output.

• (update, woops missed the important one): using the view stereotype to indicate the classes which we believe will form launch points for discovery.
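The composition point in the first bullet is easiest to see in code. This is only an illustrative sketch: the `Registry`, `Composite` and `Aggregate` classes and the instance names are hypothetical, and are not classes from our UML.

```python
class Registry:
    """Tracks which named instances currently exist."""

    def __init__(self):
        self.live = set()

    def add(self, name):
        self.live.add(name)
        return name

    def remove(self, name):
        self.live.discard(name)


class Composite:
    """Composition: the parent owns its parts, so deleting it deletes them."""

    def __init__(self, registry, name, part_names):
        self.registry = registry
        self.name = registry.add(name)
        self.parts = [registry.add(part) for part in part_names]

    def delete(self):
        for part in self.parts:
            self.registry.remove(part)
        self.registry.remove(self.name)


class Aggregate:
    """Plain aggregation: the parent only refers to members that exist anyway."""

    def __init__(self, registry, name, member_names):
        self.registry = registry
        self.name = registry.add(name)
        self.members = list(member_names)

    def delete(self):
        # Members are left alone; only the grouping disappears.
        self.registry.remove(self.name)


registry = Registry()
owner = Composite(registry, "parent", ["child-a", "child-b"])
registry.add("shared-x")
group = Aggregate(registry, "group", ["shared-x"])

owner.delete()
group.delete()
print(sorted(registry.live))
```

After both deletions only the shared member survives, which is the distinction we want the diagram's association types to carry.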

## exposing mips in moles

In my previous posting, I should have pointed out that the MIP of interest is the RAPID thermohaline circulation model intercomparison project (THCMIP).

In this post I want to enumerate the membership of some of those classes I introduced last time, and think through their relationships with MOLES and some of the practicalities for implementation. Doing so exposes a few problems with the current MOLES draft.

I start by considering how the MIP entities map into the existing MOLES framework (by the way, I will come back to observational data and do the same thing for an observational data example or two).

There are five model codes involved: CHIME, FAMOUS, FORTE, HADCM3 and GENIE. There are four experiments involved HOSING, CONTROL, TRH20 and CRH20. There are at least four time averaging periods involved: daily, five-day, monthly and yearly.

Some groups have done seasonal, some have done different experiments, but we'll ignore those for now.

So in practice, using MOLES vocabulary, we have at least 5x4x4=80 different granules of data to load into our archive, and there are 5 data production tools (models) + 4 experiments (where do we put these?) + 1 activity which need comprehensive descriptions, i.e. 10 new information entities. We might argue there are 20 primary data entities (being the model x experiment x observation station combinations, remembering that the model runs might have been carried out on different architectures).

Of course we ought to support multiple views on this data, but we ought not have to load any more information to support those views. Views like:

• A data entity which corresponds to data from each experiment (4 of these),

• A data entity which corresponds to data from each model (5 of these),

• A data entity which consists of the granules from all models with specific time averaging for one experiment (there are 16 of these), etc

The data entities themselves should not need any specific properties; these they ought to inherit from the combinations of other entities (models, runs etc). This situation is tailor made for RDF to support views which arise from facetted browsing, but a legitimate question is what views should be offered up for discovery (that is, positions from which one can start browsing)?
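The counting above, and the idea that views need no extra information, can be sketched in a few lines. This is only an illustration with plain tuples standing in for granules; it is not MOLES code, and the `view` helper is hypothetical.

```python
from collections import defaultdict
from itertools import product

MODELS = ["CHIME", "FAMOUS", "FORTE", "HADCM3", "GENIE"]
EXPERIMENTS = ["HOSING", "CONTROL", "TRH20", "CRH20"]
AVERAGING = ["daily", "five-day", "monthly", "yearly"]

# One granule per (model, experiment, time averaging) combination.
granules = list(product(MODELS, EXPERIMENTS, AVERAGING))


def view(key):
    """Group the existing granules by a facet; nothing new is loaded."""
    groups = defaultdict(list)
    for granule in granules:
        groups[key(granule)].append(granule)
    return dict(groups)


runs = view(lambda g: (g[0], g[1]))                         # model x experiment
by_experiment = view(lambda g: g[1])
by_model = view(lambda g: g[0])
by_experiment_and_averaging = view(lambda g: (g[1], g[2]))  # all models

print(len(granules), len(runs), len(by_experiment),
      len(by_model), len(by_experiment_and_averaging))
```

Each view is just a grouping function over the same 80 granules, which is why the data entities should be able to inherit everything they need.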

In any case, we start with ingesting 80 model runs, and generating their A-archive metadata (which gives us the temporal and spatial coverage along with the parameters stored). We'll assume that process was perfect (i.e. all parameters were CF compliant and all data was present and correct. Of course in real life that's never the case - all parameters are never CF compliant, all data is never present, and often it's not correct, that's what the process of ingestion has to deal with).

Each of those granules has the additional properties of temporal resolution and parent run. (We probably ought to allow an optional spatial resolution in case the output resolution was different from the model resolution, and potentially even different from a required resolution in an experiment description.)

(These next three paragraphs updated 17/06/08): The run itself is an entity, which corresponds to a grab bag of attributes inherited from the other entities which we want to propagate to each of the constituent granules. If that were all it was, it could be an abstract entity, which might correspond exactly to the MOLES data entity (which isn't abstract). However, somewhere we need to put the runtime information (configuration, actual initial conditions etc).

Currently we have the concept of a deployment which links a data entity to one each of activity, data production tool and observation station. This has a number of problems: some of the views described above produce data entities associated with multiple data production tools etc. I've been arguing for some time that deployments are only associations and not entities in their own right, in which case a deployment really is an aggregation of associations, but it doesn't need to exist .... even as an abstract entity. The proof of this is in the pudding: the current MOLES browse never shows deployments; it simply shows the links which are the associations. Sam has argued that deployments could have some time bounds attributes, but when we explored why, what he was defining was in fact a sub-activity. We did think the runtime might go here, but it could also go into the provenance of the data entity, or we could have more subtle structures in the experiment.

So let's think about the attributes of a data entity itself. The data entity will have an aggregated list of granule bounds, parameters, resolution etc. It will also have associations with multiple other moles entities. If I hand you that data entity, what metadata do you want to come with it? Before we get carried away, remember that we've started with 20 data entities which we might naturally document, but we have identified many more (e.g. above) which we would rather like to be auto-generated. These wouldn't necessarily have common runtimes.

I've said elsewhere that I want all the moles entities to be serialised as atom entities, so we have some attributes we must have. The question is: where will they come from? Clearly we need some rules combining other attributes. These rules probably should also encompass which entities are discoverable.

Food for thought. (You might have thought we had done this thinking: well we had done some, so we know what is wrong with our thinking :-)

## The anatomy of a mip

We're in the process of documenting a specific model intercomparison project (MIP) for the purposes of the Rapid programme. It's the same issue we have for CCMVAL1, and for metafor in general.

The issue is what metadata objects need to exist: never mind the schema, what does our world view look like? It looks something like this:

Beyond the classes and their labels, one of the key features of this diagram is the colour of the classes, which is meant to depict governance domains of the information (sadly, at the moment, the governance of the metadata itself is all down to us).

• The brown classes (project, experiment, scientific description) essentially come from the project aims.

• The blue classes (the model and its description) are generally the integrated product of development and descriptions over a number of years by a modelling group.

• The yellow classes are the data inputs and outputs. In principle we can auto describe the data itself from the contents of data files, but there is context metadata needed. In the case of the output Time Average Views which are the moles:granule objects which we store in our archive, the context metadata is all the stuff depicted here. In the case of the input Ancillary2 Data, we hope the context metadata already exists.

• The interesting stuff, which is dreadfully hard to gather and which may consume the most time, is the khaki stuff - in comparison to the blue material which we hope is relatively slowly changing, the khaki material is different for every run.

Now some folk go on and on about repeatability of model runs. Sadly, very few model runs are bitwise repeatable: not only is there not enough metadata kept, but it is generally not even possible to re-establish the combination of model code, parameter settings, ancillary data, and computer system (with compiler).

Personally, I think that's ok (it's just not practicable for sufficiently complex model/computer/compiler combinations if the elapsed time is anything much above months), but what's not ok is for a model run to be unrepeatable in principle: that is, someone ought to be able to add a new run which conforms to the experiment and for which it is possible to compare all the elements depicted here. We ought to have kept the data and the metadata. We really ought to, and we ought to understand how to construct a difference.

Hence metafor, but right now, we just want to make sure we describe our Rapid data ...

1: that link doesn't exist at the time of writing, but it will :-) (ret).
2: with apologies to the Met Office, this is a slightly different use of the word Ancillary. (ret).

## Having our granule cake and eating it too

Over on import cartography, Sean is convinced he's not sabotaging the fight for a sustainable climate. He's right.

His post was inspired by a sequence of emails (most of which I haven't read), which culminated in this one.

I must say I'm depressed with how often folk manage to get themselves in one of two states: my hammer is suitable for all problems (including screws), or if you dis my hammer I will never be able to strike another nail.

To be fair, I sometimes find myself in these camps too (I told you it was depressing!)

Anyway, the thread Sean highlighted is particularly frustrating, because I don't think the two camps need to be far away. There is a place for clean, easy to use mashable technologies, and for complex, powerful, complete descriptions of geospatial objects. And guess what? Sometimes that place can coincide!

Just today I mentioned that we were moving MOLES towards using ATOM as a container technology. So, not yet fully thought out, but just to make that point, here is something approximating UML for a new MOLES granule:

(and here1 is an example XML instance).

The point to note is that the inoffensive little object in the bottom left corner includes a GML feature collection in CSML, using the entire crufty(but useful)ness of GML, and it's pointed to as atom content.

We do intend to have our cake and eat it too! We should be able to both mash up and consume our complex binary objects. I don't see why we can't see complex GML descriptions as cargo of georss flavoured atom, just the same as a photo might be!

For the record, some things to note in this are

1. We need to flesh out a when object, because the atom time stamps are about the atom record, not the time of the enclosed data objects.

2. We need a logo element attribute of the dgEntity because an Atom entry doesn't have a logo (even though a feed does).

3. We have some specific elements to ensure we could produce an ISO record with mandatory components.

4. We have made the decision that the atom author and contributor should refer to the author and contributors of the enclosed CSML content, not the metadata writer (who produces the summary text, who appears as the metadata maintainer).
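To make the "GML as atom cargo" idea concrete, here is a minimal sketch using only the Python standard library. It is not the MOLES granule schema: the function name, the example identifiers and URLs are hypothetical, and the media type used for the CSML content is an assumption. It simply shows an Atom entry whose content element points at an external CSML document, with the author being the author of the enclosed data.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def granule_entry(entry_id, title, updated, author, csml_url):
    """Build a minimal Atom entry whose content points at a CSML (GML) document."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # atom:updated describes the Atom record, not the time span of the data.
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    person = ET.SubElement(entry, f"{{{ATOM}}}author")
    ET.SubElement(person, f"{{{ATOM}}}name").text = author
    # Out-of-line content: the GML stays in its own file and is only referenced.
    # The media type below is an assumption made for this sketch.
    ET.SubElement(entry, f"{{{ATOM}}}content",
                  {"type": "application/gml+xml", "src": csml_url})
    return entry


entry = granule_entry(
    "http://example.org/granule/xyz123",      # hypothetical identifier
    "Example granule",
    "2008-06-10T00:00:00Z",
    "A. Data Author",                         # author of the data, not the metadata
    "http://example.org/csml/xyz123.xml",     # hypothetical CSML location
)
xml_text = ET.tostring(entry, encoding="unicode")
```

A separate "when" element for the time coverage of the data, and the logo and ISO-specific elements from the list above, would need extension elements in their own namespace; they are left out here.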

Of course now we might be able to build services which are easier to exploit than the OWS stack though ... (but meanwhile we are building one of them anyway).

1: Updated on the 10/06/2008, the old one is here (ret).

## moles basic concepts

Another interim version of MOLES, the main changes here are to

• make clear that dgEntity and dgBase are abstract (never instantiated),

• for the moment at least, not all moles entities are discoverable

• explicitly include attributes necessary for the mandatory ISO elements.

I expect the next version will move to specialising from Atom entities rather than GML feature types (although we'll explicitly allow GML feature types to be encapsulated in the atom content).

Anyway:

## Simple Thinking

I'm quite surprised this hasn't appeared on more of the blogs I read! (Simple thinking about the choices and risks about taking action on climate change!)

## Identifiers, Persistence and Citation

Identifiers are pretty important things when you care about curation and citation, and they'll be pretty important for metafor too.

At the beginning of the NDG project we came up with an identifier schema, which said that all NDG identifiers would look like this: idowner__schema__localID, e.g. badc.nerc.ac.uk__DIF__xyz123, where the idowner governed the uniqueness of the localID. If we unpack that a bit, we are asserting that there is an object in the real world which the badc has labeled with local identifier xyz123 and we're describing it with a DIF format object. In practice, the object is always an aggregation, so in ORE terms, there is a resource map which has the above ID.

Sam would argue that we blundered by doing that: he thinks the "real identifier" of the underlying object could be thought of as badc.nerc.ac.uk__xyz123, and we might be better using a restful convention, so that we wrote it as badc.nerc.ac.uk__xyz123.DIF. One of the reasons for his argument is that over time (curation), the format of our description of xyz123 will disappear: for example, we know that we are going to replace DIF with ISO (note that we have lots of other format descriptions of objects as well). This matters, because we now have to consider how we are persisting things, and how we are citing them. I would argue that the semantics are the same even if the syntax has changed, but I concede the semantics are more obvious in his version, so it's not too late to change. In particular, I suspect it's more obvious that badc.nerc.ac.uk__xyz123.atom and badc.nerc.ac.uk__xyz123.DIF are two views (resource maps) of the same object (aggregation).
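The claim that the two syntaxes carry the same semantics can be checked with a small Python sketch. The class and function names are made up for illustration, and the parsing assumes that neither the owner nor the local identifier contains a double underscore, and that in the restful form the view is whatever follows the last dot.

```python
from typing import NamedTuple


class NDGId(NamedTuple):
    owner: str     # authority that guarantees local_id is unique
    local_id: str  # identifier of the underlying object (the aggregation)
    view: str      # description format, i.e. which resource map this is


def parse_original(identifier):
    """Parse the original form: idowner__schema__localID."""
    owner, view, local_id = identifier.split("__")
    return NDGId(owner, local_id, view)


def parse_restful(identifier):
    """Parse the restful form: idowner__localID.schema."""
    owner, rest = identifier.split("__")
    local_id, view = rest.rsplit(".", 1)
    return NDGId(owner, local_id, view)


a = parse_original("badc.nerc.ac.uk__DIF__xyz123")
b = parse_restful("badc.nerc.ac.uk__xyz123.DIF")
c = parse_restful("badc.nerc.ac.uk__xyz123.atom")
```

Here `a` and `b` come out identical, while `b` and `c` share owner and local identifier and differ only in the view, which is the sense in which they are two resource maps of one aggregation.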

Either way, if we take the persistence issue firstly: once we bed this identifier scheme down, we're stuck with it, so what do we do when we retire the DIF format view of the resource? Well, I would argue that we should issue an HTTP 301 (Moved Permanently) redirect to the ISO view. This means we can never retire the IRI which includes the DIF, but we can retire the actual document. Note that this is independent of the two versions of the syntax above.
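A minimal sketch of that redirect rule, in Python, might look like the following. The table of retired views and the `resolve` function are hypothetical; a real service would sit behind a web server, but the decision logic is the interesting bit. It uses the original identifier syntax, though the same rule applies to either form.

```python
# Assumption for this sketch: the DIF view has been retired in favour of ISO.
RETIRED_VIEWS = {"DIF": "ISO"}


def resolve(identifier):
    """Return (status, location) for an idowner__schema__localID identifier."""
    owner, view, local_id = identifier.split("__")
    if view in RETIRED_VIEWS:
        # 301 Moved Permanently: the old IRI keeps working, the old document can go.
        return 301, "__".join((owner, RETIRED_VIEWS[view], local_id))
    return 200, identifier


old = resolve("badc.nerc.ac.uk__DIF__xyz123")
new = resolve("badc.nerc.ac.uk__ISO__xyz123")
```

The old identifier is never forgotten, it just stops having a document of its own.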

Secondly, what about citation? Well that's an interesting issue (obviously). You might recall (yeah, unlikely I know), that we came up with a citation syntax that looks like this:

and we made the point we could use a DOI at the end of that instead. That citation identifier looks a lot like Sam's form (funny that, Sam was sneaking his ideas in while I wasn't looking). A legitimate question is what is the point of the URN in that citation? If the data hasn't moved from http://badc.nerc.ac.uk/data/mst/v3/ the URN is redundant, and if it has, how do I actually use the URN? Well, I suspect we need to make the URN dereferenceable, and it should be a splash page to all the current views - which is essentially the same functionality that you get from a DOI. The only question then is what handle system one trusts enough: a DOI, an HDL, or do we trust that even if the badc changes our name, we can keep our address and use redirects? In which case we have

with or without the Available from which I reckon is redundant, and noting that http://badc.nerc.ac.uk/somepath is semantically the same as doi:. Sam's form URI wins (it doesn't necessarily have to, but it's more economical, in which case the argument for using a handler for citation economy alone is given some weight).

Incidentally, James Reid pointed out in comments the Universal Numerical Fingerprint, which looks like a pretty interesting concept that might also be relevant. I'll get back to that one day.

## EA and Subversion, Resolved

The good folks at CodeWeavers have resolved my problems with the subversion client under Wine (which I needed to get working for use from within Enterprise Architect). All kudos to Jeremy White!

I'd got to the point of recommending in a support request at codeweavers that a work around might be to try and replace the call to subversion with a windows bat file that invoked linux subversion rather than trying to get windows native subversion working properly.

Jeremy was far smarter than that. Yes, we've ended up invoking linux subversion, but via a different route.

The first step we took was to replace the native subversion.exe call with a simple linux script (I had no idea that one could even do that, having assumed that from a windows cmd.exe one had to call windows stuff ... but note that the trick was to make sure the script had no filename extension, and point to it in the EA configuration as if it were an executable). Having done that, we could see what EA was up to, and we found a few wrinkles.

Jeremy then came up with a winelib application (svngate) which handles all the issues with windows paths, and also a bug in the way EA uses subversion config-dir (a bug which doesn't seem to cause problems under windows, even though it ought to). In passing, Jeremy also fixed a wee bugette in the wine cmd.exe which was also necessary to make things work. All the code is on the crossover wiki.

So I'm a happy codeweavers client. I'm less happy with how Sparx dealt (commercially) with their end of this, but that's a story for another day. (Update 06/06/08: I'm probably being unfair, their technical support are now taking this and running with it; their linux product will be svngate aware and is getting linux specific bug fixes.)

Update 02/06/08. There was another wrinkle I discovered after a while ... the old cr/lf unix/windows problem. This can be relatively easily fixed, as Jeremy had seen this coming. I created my own version of subversion (/home/user/bsvn) with

    #!/bin/bash
    svn "$*" | flip -m -

and set SVNGATE_BIN to /home/usr/bsvn!

Update 02/06/09. Actually that previous script doesn't quite work in all cases (i.e. where the svn content has blanks and hyphens in filenames). Better seems to be:

    #!/bin/bash
    svn "$@" | flip -m -


## Introducing Metafor

The EU has recently seen fit to fund a new project called METAFOR.

The main objective of metafor is:

to develop a "Common Information Model (CIM)" to describe climate data and the models that produce it in a standard way, and to ensure the wide adoption of the CIM. Metafor will address the fragmentation and gaps in availability of metadata (data describing data) as well as the duplication of information collection and problems of identifying, access and using climate data that are currently found in existing repositories.

Our main role is in deploying services to connect CIM descriptions of climate and data across European (and hopefully wider) repositories, but right now the team is concentrating on an initial prototype CIM - building on previous work done by many projects (including Curator).

A number of my recent activities have already been aimed at metafor ... in particular the standards review that is currently underway will inform both metafor and MOLES.

For some reason my fellow project participants want to do much of their cogitation on private email lists and fora. While I don't think that's the best way forward, I have to respect the joint position. However, I will blog about METAFOR as much as I can, and I'll obviously be keen to take any feedback to the team.

I keep on harping on about how metadata management is time intensive, and the importance of standards.

Users keep on harping on about wanting access to all our data, and our funders keep wanting to cut our funding because "surely we can automate all that stuff now".

I've written elsewhere about information requirements, this is more a set of generic thoughts.

So here are some rules of thumb, even with a great information infrastructure, about information projects:

• If you need to code up something to handle a datastream, you might be able to handle o(10) instances per year.

• If you have to do something o(hundreds of times), it's possible provided each instance can be done quickly (at most a few hours).

• o(thousands) are feasible with a modicum of automation (and each instance of automation falls into the first category), and

• o(tens of thousands and bigger) are highly unlikely without both automation and no requirement for human involvement.

What about processes (as opposed to projects):

In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.

So a job that takes a few hours per item can only be done a few hundred times a year. A case in point: in the last year 260 standard names were added to the CF conventions. One person (Alison Pamment) read every definition, checked sources, sent emails for revision of name and definition etc. Alison works half time, so she only had 750'ish hours for this job, so I reckon she had a pretty good throughput; averaging roughly three hours per standard name.
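The arithmetic behind those numbers is simple enough to write down. The sketch below just restates the figures from the text in Python; the helper `items_per_year` is made up for illustration.

```python
working_days = 220 - 20                 # standard UK year less courses, meetings etc
hours_per_year = working_days * 7.5     # full-time working year in hours
half_time_hours = hours_per_year / 2    # a half-time post
hours_per_name = half_time_hours / 260  # 260 standard names added in the year


def items_per_year(hours_per_item, hours_available=hours_per_year):
    """How many items fit in a working year at a given cost per item?"""
    return int(hours_available // hours_per_item)


full_time_capacity = items_per_year(3)  # a job taking three hours per item
```

That gives 1500 hours for a full-time year, just under three hours per standard name for the half-time case, and about 500 three-hour items for one full-time person, which is where "a few hundred times a year" comes from.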

Now that job is carried out by NCAS/BADC for the global community, and the reasonable expectation is that names that are proposed have been through some sort of design, definition, and internal/community vetting process before even getting to CF.

So, from a BADC point of view, we have to go through every datastream, identify all the parameters, compare with the CF definitions, and propose new names etc as necessary. If all our data had standard names, we'd be able to exploit that effort to produce lovely interfaces where people could find all the data we hold without much effort, if they only cared about the parameter that was measured. But they don't. Unfortunately for us (workload), and fortunately for science, people do care about how things are measured (and/or predicted). So the standard names are the thin end of the wedge. We also have to worry about all the rest of the metadata: the instruments, activities, data production tools, observation stations etc (the MOLES entities).

As a backwards looking project: Last year, we think we might have ingested about 20 million files. Give or take. Not all of which are marked up with CF standard names, and nearly none (statistically) were associated with MOLES entities. Truthfully we don't know how much data we have for which our metadata is inadequate (chickens and eggs). As always, we were under-resourced for getting all that information at the time.

My rule of thumb says our only hope of working it out, and doing our job properly by getting the information, is to identify a few dozen datastreams (at most), and then automate some way of finding out what the appropriate entities and parameters were. If it's manual we're stuffed. Sam has another rule of thumb: if something has to be done (as a backwards project, rather than as a forward process) more than a thousand times, unless it's trivial it won't get done, even with unlimited time, because it's untenable for one human to do such a thing, and we don't have enough staff to share it out.

Fortunately, for a domain expert, some of these mappings are trivial. But some won't be, and even distinguishing between them is an issue ...

Still, I genuinely believe we can get this right going forward, and do it right for some of our data going backwards. Do I believe non-domain experts could do this at all? No I don't. So where does that leave the UKRDS which, at least on the face of it, has grandiose aims for all research data? (As the intellectual inheritor of the UKDA, I'm all in favour of it; as a project for all research data, forget it!)

## Cost Models

Sam and I had a good chat about cost models today - he's in the process of honing the model we use for charging NERC for programme data management support.

Most folk think the major cost in data management is for storage, and yes, for a PB-scale repository that might be true, but even then it might not, it all depends on the amount of diversity in the data holdings. If you're in it for the long term, then the information management costs trump the data storage costs.

Some folk also think that we should spend a lot of time on retention policies, and actively review and discard data when it's no longer relevant. I'm on record that from the point of view of the data storage cost, the data that we hold is a marginal cost on the cost of storing the data we expect. So I asserted to Sam that it's a waste of time to carry out retention assessment, since the cost of doing so (in person time) outweighs the benefits of removing the data from storage. I then rapidly had to caveat that when we do information migration (from, for example, one metadata system to another), there may be a significant cost in doing so, so it is appropriate to assess datasets for retention at that point. (But again, this is not about storage costs, it's about information migration costs).

Sam called me on that too! He pointed out that not looking at something is the same as throwing it out, it just takes longer. His point was that if the designated community associated with the data is itself changing, then their requirements of the dataset may be changing (perhaps the storage format is obsolete from a consumer point of view even if it is ok from a bit-storage point of view, perhaps the information doesn't include key parameters which define some aspect of the context of the data production etc). In that case, the information value of the data holding is degrading, and at some point the data become worthless.

I nearly argued that the designated communities don't change faster than our information systems, but while it might be true now for us, it's almost certainly not true of colleagues in other data centres with more traditional big-iron-databases as both their persistence and information stores ... and I hope it won't remain true of us (our current obsession with MOLES changes needs to change to an obsession with populating a relatively static information type landscape).

However, the main cost of data management remains in the ingestion phase, gathering and storing the contextual information and (where the data is high volume) putting the first copy into our archive. Sam had one other trenchant point to make about this: the gathering information phase cost is roughly proportional to the size of the programme generating the data: if it includes lots of people, then finding out what they are doing, and what the data management issues are will be a significant cost, nearly independent of the actual amount of data that needs to be stored: human communication takes time and costs money!

## In a maze of twisty little standards, all alike

I'm in the process of revisiting MOLES and putting it into an atom context, with a touch of GML and ORE on the side. I thought I'd take five minutes to add a proper external specification for people and organisations.

Five minutes! Hmmm !! When will I get back to atmospheric related work? Can I get someone else to go down this hole?

Train of thought follows:

So, there's FOAF.

Do any folk other than semantic web geeks actually use FOAF? (I say geeks advisedly, it's hard to take seriously a specification that explicitly includes a geekcode.) Even from a semantic web perspective wouldn't it be better to use a more fully featured specification and extract RDF from that?

So then there is OASIS CIQ (thanks Simon), which includes

• Extensible Name and Address Language (xNAL)

• Extensible Name Language (xNL) to define a Party's name (person/company)

• Extensible Party Information Language (xPIL) for defining a Party's unique information (tel, e-mail, account, url, identification cards, etc. in addition to name and address)

• Extensible Party Relationships Language (xPRL) to define party relationships namely, person(s) to person(s), person(s) to organisation(s) and organisation(s) to organisation(s) relationships.

Well that feels overhyped, and beyond what is needed. Perhaps xNAL would be OK, but xPIL probably won't gain me much and I think the xPRL is heading into RDF territory in an over-constrained way.

What about reverting to the ISO19115 specification? Arguably ISO19115 shouldn't be defining party information (in the same way I was complaining about it not using Dublin Core), but it does. What would that give us? CI_ResponsibleParty. Well, I'm comfortable about this?

What about Atom! Atom settles down to a very simple person specification: name, email address and an IRI. Well that's pretty good, because I have an IRI which doesn't even need to exist (and if it does could point to any of the above). What it does is unambiguously identify a person, unlike name which can differ according to the phase of the moon (Bryan N Lawrence, Bryan Lawrence, B.N. Lawrence, B. Lawrence etc). But I do want some of the other stuff, role and so on, so I could use CI_ResponsibleParty, in the knowledge that I can extract an atomPersonConstruct from that for serialisation into Atom (I think I'd have to stuff the IRI into the id attribute of CI_ResponsibleParty).
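The extraction step is small enough to sketch in Python. The `ResponsibleParty` class below is only loosely modelled on CI_ResponsibleParty (it carries just the fields used here, with made-up attribute names), and the example IRI is hypothetical. The output is an Atom person construct with name, uri and email.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


@dataclass
class ResponsibleParty:
    # Loosely modelled on CI_ResponsibleParty; only the fields used in this sketch.
    id: str                      # the IRI, stuffed into the id attribute
    individual_name: str
    role: str
    email: Optional[str] = None


def atom_person(party, tag="author"):
    """Extract an Atom person construct (name, uri, email) from a party."""
    person = ET.Element(f"{{{ATOM}}}{tag}")
    ET.SubElement(person, f"{{{ATOM}}}name").text = party.individual_name
    ET.SubElement(person, f"{{{ATOM}}}uri").text = party.id
    if party.email:
        ET.SubElement(person, f"{{{ATOM}}}email").text = party.email
    return person


party = ResponsibleParty(
    id="http://example.org/people/bnl",   # hypothetical IRI
    individual_name="Bryan Lawrence",
    role="author",
)
author = atom_person(party)
```

The role is dropped on the way out, since the Atom person construct has nowhere to put it; that information stays in the richer record.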

OK, I'm going to stop there, it seems that with an IRI to avoid ambiguity, and CI_ResponsibleParty, I can do what I need to do.

But I could have spent a couple of hours in a far more pleasant way.

(Update 09/06/08: I note that KML 2.2 allows an <xal:AddressDetails> structured address ... but it doesn't use it ... yet.)

## Beginning to get a grip on ORE

This Friday afternoon I was trying to get to the bottom of ORE. ORE is pretty much defined in RDF and lots of accompanying text. I've been trying to find a way of boiling down the essence of it. UML (at least as I use it) doesn't quite do the job, so this is the best I could do:

The next step is to look through the atom implementation documentation.

## Ignorance is xhtml bliss

Wow, I was mucking with some validation for this site (in passing), and I thought "while I'm here, I might as well change this site to deliver application/xhtml+xml rather than text/html". What a blunder.

I'll fix it for you poor IE folk ... soon.

## From ORE to DublinCore

Standards really are like buses, there's another along every minute, exactly which one should you choose? I'm deep in a little "standards review" as part of our MOLES upgrade. I plan to muse on the role of standards another day, this post is really about Dublin Core!

You've seen me investigate atom. You know I've been delving in ISO19115. You know I'm deep into the OGC framework of GML and application schema and all that. You know I think that Observations and Measurements is a good thing.

Today's task was to investigate ORE a little more, and the first thing I did was try and chase down the ORE vocabulary, which surprisingly, isn't in the data model per se, it lives in its own document. Anyway, in doing so, I discovered something that I must have known once, and forgotten: Dublin Core is itself an ISO standard (ISO15836:2003). Of course no one refers to DC via its ISO roots, because they're toll barred (i.e. the ISO version costs money), whereas the public Dublin Core site stands proud.

What amazes me of course is that Dublin Core and ISO19115 use different vocabularies for the same things, even though Dublin Core preceded ISO19115. What was TC211 thinking? Of course ISO19115 covers a lot more, but why wasn't ISO15836 explicitly in the core of ISO19115? The situation is stupid beyond belief: someone even had to convene a working group to address mapping between them. I've extracted the key mapping here.

Mind you, Dublin Core is evolving, unlike ISO15836 which by definition is static. We might come back to that issue. Anyway, the current DublinCore fifteen which describe a Resource look like this:

| term | what it is | type |
|---|---|---|
| contributor | a contributor's name | A |
| coverage | spatial and or temporal jurisdiction, range, or topic | B |
| creator | the primary author's name | A |
| date | of an event applicable to the resource | C |
| description | of the Resource | D or E |
| format | format, physical medium or dimensions (!) | F |
| identifier | reference to the resource | G |
| language | a language of the resource | B (best is RFC4646) |
| publisher | name of an entity making the resource available | A |
| relation | a related resource | B |
| rights | rights information | D (G) |
| source | a related resource from which the described resource is derived | G |
| subject | describes the resource with keywords | B |
| title | the name of the resource | D |
| type | nature or genre of the resource | H |

We can see the "types" of the Dublin Core elements have some semantics which reduce to

| type | meaning |
|---|---|
| A | free text (names) |
| B | free text (best to use arbitrary controlled vocab) |
| C | free text (dates) |
| D | really free text |
| E | graphical representations |
| F | free text (best to use MIME types) |
| G | free text (best to use a URI) |
| H | free text, B, but best to use dcmi-types |

The last vocabulary consists of Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text. (Note that StillImage differs from Image in that the former includes digital objects as well as "artifacts").
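Since nearly everything is free text, about the only mechanical checks available are "is this one of the fifteen elements?" and "is the type drawn from the dcmi-types list?". A minimal Python sketch of those two checks follows; the `check_record` function and the flat dictionary representation are assumptions for illustration, and the type check is advisory only, because the element is still free text.

```python
# The dcmi-types vocabulary as listed above.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# The fifteen Dublin Core elements.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description", "format",
    "identifier", "language", "publisher", "relation", "rights", "source",
    "subject", "title", "type",
}


def check_record(record):
    """Return a list of problems found in a flat Dublin Core record (a dict)."""
    problems = [f"unknown element: {k}" for k in record if k not in DC_ELEMENTS]
    dc_type = record.get("type")
    if dc_type is not None and dc_type not in DCMI_TYPES:
        problems.append(f"type not in dcmi-types: {dc_type}")
    return problems


good = check_record({"title": "Example dataset", "type": "Dataset"})
bad = check_record({"title": "Example dataset", "type": "Data", "colour": "red"})
```

The first record passes cleanly; the second is flagged both for the unknown element and for a type outside the recommended vocabulary.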

## RDF software stacks.

So we want an RDF triple store with all the trimmings!

We're running postgres as our preferred RDB. We've got some experience with Tomcat as a java service container. We prefer python in a pylons stack and from scripts as our interface layers (ideally we don't want to programme our applications in Java 1).

There appear to be four2 candidate technologies to consider as part (or all of) our stack: Jena, Sesame, RDFLib, and RDFAlchemy. The former two provide java interfaces to persistence stores, and both support postgres in the backend. RDFLib provides a python interface to a persistence store, but might not support postgres. RDFAlchemy provides a python interface to support RDFLib, Sesame via both the http interface and a SPARQL endpoint, and Jena via the same SPARQL endpoint (and underlying Joseki implementation).

Would using postgres as our backend database perform well enough? Our good friend Katie Portwin (and colleague) think so.

There appear to be three different persistence formats insofar as RDFLib, Jena and Sesame lay out their RDF content in a different way. Even within Java there is no consistent API:

Currently, Jena and Sesame are the two most popular implementations for RDF store. Because there is no RDF API specification accepted by the Java community, Programmers use either Jena API or Sesame API to publish, inquire, and reason over RDF triples. Thus the resulted RDF application source code is tightly coupled with either Jena or Sesame API.

Are there any (recent) data which compare the performance of the three persistence formats and their API service stacks? It doesn't look like it, but I think we can conclude that either Jena or Sesame will perform OK, and I suspect RDFLib will too. Which of these provide the most flexibility into the future? Well, there are solutions to the interface problem on the Java side: Weijian Fang's Jena Sesame Model which provides access to a Sesame repository through the Jena API, and the Sesame-Jena Adaptor; and clearly from a python perspective RDFAlchemy is designed to hide all the API and persistence variability from the interface developer. I think if we went down the RDFLib route we'd either be stuck with python all the way down (not normally a problem), or we'd have to use its SPARQL interface.

I have slight reservations about RDFAlchemy in that the relevant google group only has 14 members (including me), and appears to be in a phase of punctuated equilibrium as development revolves around one bloke.

Conclusions: if we went down postgres -> tomcat(sesame) -> RDFAlchemy we'd be able to upgrade our interface layers if RDFAlchemy died by plugging in something based on pysesame and/or some bespoke python sparql implementation (it's been done, so we could use it, or build it. Others have built their own pylons thin layers to sesame too). We'd obviously be able to change our backends rather easily too in this situation. (Meanwhile, I intend to play with RDFLib in the interest of learning about manipulating RDFa.)

Link of the day: State of the Semantic Web, March 2008.

1: this isn't about language wars, this is about who we have available to do work (ret).

## atom for moles

As we progress with our MOLES updating, the issue of how best to serialise the MOLES content becomes rather crucial, as it impacts storage, presentation, and yes, semantic content: some buckets are better than other buckets!

Atom (rfc4287) is all the rage right now, which means there will be tooling for creating Atom and parsing etc, and Atom is extensible. It's also simple. Just how simple? Well, the meat of it boils down to one big UML diagram or three smaller diagrams which address:

1. The basic entities (feeds and entries),

2. The basic content (note that the xhtml could include RDFa!)

3. and links (note that while atom has its own link class for "special links", xhtml content can also contain "regular" html links).

These three diagrams encapsulate what I think I need to know to move on with bringing MOLES, observations and measurements, and Atom together.

## Big Java

You know, those of us out there in the Ruby/Python/Erlang fringes might think we're building the Next Big Thing, and we might be right too, but make no mistake about it: as of today, Java is the Big Leagues, the Show, where you find the most Big Money and Big Iron and Big Projects. You don't have to love it, but you'd be moronic to ignore it.

... and the programmers ask for Big Money, write Big Code, and we can't afford the Big Money to pay for them, or the Big Time to read the code ... let alone maintain it.

(Which is not to say I have any problems with someone else giving/selling me an application in Java which solves one of my problems - provided they maintain it, I'm not that moronic :-)

## chaiten

There's nothing like a big volcano to remind one of our precarious hold on planet earth. Thanks to James for drawing my attention to Chaiten, and the fabulous pictures, via: Alan Sullivan, nuestroclima and the NASA earth observatory.

Along with genetics1, volcanism was the other thing that could have kept me from physics ... neither quite made it :-).

Anyway, I'm not quite sure what sort of volcano it is, nor of the real import of the explosions thus far, but as a caldera type volcano it could be more impressive yet ... if it's even vaguely similar to Taupo. When I was a kid we grew up with stories of the 7m deep pyroclastic flow in Napier (a little over 100 km away).

1: you can read that phrase how you like :-) (ret).

by Bryan Lawrence : 2008/05/07 : Categories environment : 0 trackbacks : 1 comment (permalink)

## Implementing my RDFa wiki code

I claimed it would be straightforward to add the RDFa syntax to my wiki.

Actually, most of it was. The hardest part was putting the attributes into the (many different sorts of) links that my wiki supports. So I took the opportunity to clean up the link handling code.

This is my RDFa wiki syntax1 unit test code. The attentive reader will note that I use both the format and statelessformat methods of my wiki formatter (the statelessformat method is called by the format method in normal use). This exposes the fact that it turned out to be easiest to do the RDFa handling in two passes. The first pass was in the normal statelessformat (which does links, simple bold, italic, greek etc). In doing that it also now marks up inline RDFa and flags to a second pass the block and page level RDFa. These get handled right at the end of all the other wiki handling (which handles list and table state, preformatting etc) - block level RDFa gets tacked onto the previous paragraph entity, and page level RDFa gets put into a DIV that encloses everything else.

The point of what I have done is to try and develop a syntax that can be tacked onto (most existing) wiki syntaxes without much grief. It seemed to work. So now I have code that can create RDFa. The next step will be to plumb it into Leonardo (shouldn't be hard), and then try and play with some real RDFa creation. That may have to wait a week or two on a) my day job, and b) my life ...

by Bryan Lawrence : 2008/04/25 : Categories metadata ndg python metafor : 0 trackbacks : 1 comment (permalink)

## creating RDFa

Let's just assume for a moment that I want to create RDFa in XHTML. Just how should I go about it? It appears that there are no html authoring tools that are "RDFa aware"!

I want to play with RDFa but I'm not going to author it in raw HTML. It's just not worth it. Can I programme my way out of it? If I wanted to modify my wiki software to support (some) RDFa assertions what might I consider as important?

#### Basic Syntax

There are six XHTML attributes:

1. @rel: provides forward predicates (subject is the enclosing entity, object is pointed to by a scoped URI aka CURIE)

2. @rev: provides inverse predicates (subject is pointed to, object is enclosing entity)

3. @name: provides a predicate as a plain literal.

4. @content: a string for supplying a machine readable version of an object rather than the enclosing element content (displayed to humans).

5. @href: a partner resource object (eg remote html link),

6. @src: a partner resource object which is embeddable (eg an image).

(Just what I mean by enclosing entity needs definition too.)

There are five RDFa specific attributes:

1. @about: can provide a URI which gives a subject.

2. @property: can provide URI(s) used for expressing relationships between the subject and some literal text (predicate)

3. @resource: a resource object which is not clickable ...

4. @datatype: a URI defining a datatype associated with

5. @instanceof

And we need to declare namespaces.

So, it's all about applying attributes to html entities. For the moment we could reduce the problem to using the RDFa attributes verbatim, but introducing a notation for how to do that in (my) wiki syntax.

#### Methodology?

I might want to be able to make namespace definitions with scope for a document and scope within specific blocks.

• An expedient way forward would be to only allow namespace definitions once, at the beginning of a document. This would result in conformant documents and solve the 80-20 for this issue.

I want to add RDFa assertions to entities? How? We can start with the syntax examples ...

Example one was adding page level metadata. Assuming I want to do that at the blog entry level, and assuming for the moment I can't use atom as the container format, then the container needs to be an html div which already takes us into a version of example three. At this point, we could introduce a page level syntax which looked something like:

@@
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
property="dc:creator" content="Bryan lawrence"
@@
page content


which should result in

<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
property="dc:creator" content="Bryan lawrence">
page content
</div>


To repeat example three, taking into account the limitation to namespaces only at the page level, we could enhance the wiki link with attributes:

This document is licensed under a


to clearly instantiate the predicate relationship of the link to the calling page.

Similarly, we could write the fourth example:

I'm holding @@property=cal:summary@ one last summer Barbecue@@


to achieve the

I'm holding <span property="cal:summary"> one last summer Barbecue</span> ...


example. Here we're using @@ to introduce the enclosing entity (span since it appears inside a block level entity) and @@ to close it, with @ used to terminate a specific attribute, which allows multiple attributes as required (@@blah=blah@blah2=blah2@ content @@) etc.

• We could add these as attributes to a block level entity by simply beginning the block level entity with the @@ syntax (i.e. paragraphs starting on new lines, list items, quotes) and terminating the block level attributes (rather than the block) with the @@. See the example a bit further below.

The next example is to have what is displayed different from the machine readable substrate, so we have

... on @@property=cal:dtstart@content=20070916T1600-0500@datatype=xsd:datetime@
September
16th at 4pm@@.


which would give:

... on <span property="cal:dtstart" content="20070916T1600-0500"
datatype="xsd:datetime"> September 16th at 4pm </span>.


(Note that if we needed an @ within a property, we would escape it with a backslash to have \@ ).
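A minimal sketch of the inline case, assuming attributes always take the form name=value@ and the content runs up to the closing @@. The function name and regular expressions are illustrative only (this is not the wiki's actual formatter), and block level and page level uses are deliberately not handled here.

```python
import re
from html import escape

# one attribute: name=value@, where value may contain an escaped \@
ATTR = re.compile(r"([A-Za-z][\w:-]*)=((?:\\@|[^@])*)@")
# @@ + one or more attributes + content + closing @@
INLINE = re.compile(
    r"@@((?:[A-Za-z][\w:-]*=(?:\\@|[^@])*@)+)(.*?)@@", re.DOTALL
)


def _to_span(match):
    parts = []
    for name, value in ATTR.findall(match.group(1)):
        value = value.replace("\\@", "@")  # undo the backslash escape
        parts.append(f'{name}="{escape(value, quote=True)}"')
    return f"<span {' '.join(parts)}>{match.group(2)}</span>"


def inline_rdfa(text):
    """Rewrite inline @@attr=value@ content@@ runs as RDFa spans."""
    return INLINE.sub(_to_span, text)


html_out = inline_rdfa(
    "I'm holding @@property=cal:summary@ one last summer Barbecue@@"
)
```

Because the content is whatever follows the last attribute, leading whitespace ends up inside the span, just as in the hand-written examples.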

The next example is to introduce the instance of attribute so that all the properties of a block level object are associated with it. That's straightforward:

...previous paragraph.

@@instanceof=cal:Vevent@@ I'm holding @@property=cal:summary@ one last summer barbeque@@
on @@property=cal:dtstart@content=20070916T1600-0500@datatype=xsd:datetime@ September
16th at 4pm @@.

Next paragraph ...


Note the first @@ has no content after the attribute and before the closing @@ so it applies to the block level entity.

The final example in that section has the ability to make a remote document the subject of the association. We could do that using the same contact notation:

I think @@about=urn:ISBN:0091808189@instanceof=biblio:book@ White's book 'Canteen Cuisine'@@
is well worth ...


#### Summary

So, in terms of new wiki syntax it would appear that if we don't try and shorten the attributes themselves (which is worth considering, but not in this discussion), we can get away with using @@ in three different ways:

1. When @@ appears on its own, it is intended to start a <div> which will carry on to the end of the current document and provide a page level scope for RDFa attributes.

2. When @@ begins a block level entity (paragraph, list item, description) it is used for block level attributes, unless it encloses text after the last attribute, in which case it is being used for inline attributes ...

3. which is the last type, which creates a span around some content to enclose the attributes.

In addition, we allow @ inside the standard square bracket link syntax.

Now I reckon that lot would be straightforward to implement!
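To back that up a little, here is a hedged sketch of how a formatter might tell the three uses apart, one line at a time. The function name and the regular expression are hypothetical, and a real implementation would need to track state across lines (for example, to pair the opening and closing page level @@).

```python
import re

# one or more name=value@ attributes, allowing an escaped \@ inside a value
_ATTRS = r"(?:[A-Za-z][\w:-]*=(?:\\@|[^@])*@)+"
# block level use: the attributes are closed immediately by @@, with no content
_BLOCK = re.compile(r"@@" + _ATTRS + r"@")


def classify(line):
    """Return 'page', 'block', 'inline' or 'plain' for one line of wiki text."""
    stripped = line.strip()
    if stripped == "@@":
        return "page"      # @@ on its own opens (or closes) the page level div
    if _BLOCK.match(stripped):
        return "block"     # attributes for the enclosing block level entity
    if "@@" in stripped:
        return "inline"    # a span around some content
    return "plain"
```

The order of the checks matters: a line that begins with @@ is only treated as block level when nothing sits between the last attribute and the closing @@.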

Why hasn't anyone else introduced a wiki syntax? (Have they and I'm just not aware?)

## First time grant holder experience

Buried in a Wall Street Journal article (thanks Michael) about the lack of involvement of the US presidential candidates in science issues is an interesting statistic:

At the National Institutes of Health, the average age of a first grant is 42 for a Ph.D. and 44 for an M.D.

It'd be interesting to know about NSF too.

Is it that much better here? I don't know! It would be interesting to see statistics for average ages of grant holders, and first time grant holders for NERC, EPSRC and STFC ...

That said, it may well be illegal to hold that information here, and I'm not convinced NERC know how old I am, let alone anyone else. But it would still be interesting ... although years since doctoral graduation is probably a better metric which might not be illegal to hold. The latter could only be obtained by survey. Now if only I could convince someone to undertake such a survey ...

## other folks data models

Kill the cat!

I had just written a long posting which summarised some of the recent discussion on Cameron Neylon's blog, and in particular the material on data models which appears in his semantic category.

Anyway, I had saved it all to a file ... and just before I posted it, I typed cat filename ... and blew it all away. A good hour of diligent reading and summarising wasted. This was all in the context of drawing conclusions relevant to MOLES. I guess they'll appear in later posts, so you won't miss it all ...

So, no summary: you'll have to read all his posts. But meanwhile, some key links and quotes from them:

1. "No data model survives contact with reality". Pretty sure Roy Lowry said that a long time earlier than he did ...

2. FuGE (Amongst other things, UML model of genomics experiments). Interesting.

3. Minimal Information about a Neuroscience Investigation "standard" at Nature Precedings. Underwhelming.

4. Distinguishing "recording" from "reporting". Totally agree.

## More on the windows subversion client under wine.

I reported my problems earlier.

This post is by way of notes on the https problems.

Firstly, the detail of the error (with thanks to Rafael Coninck Teigão):

Running just the client from the command line gives me this:
/tmp/svn-win32-1.4.6/bin$ export WINEDEBUG='-all,+winsock'
/tmp/svn-win32-1.4.6/bin$ wine svn.exe update ~/development/e-dj/
trace:winsock:DllMain 0x7eda0000 0x8 (nil)
trace:winsock:DllMain 0x7eda0000 0x1 0x1
trace:winsock:WSAStartup verReq=2
trace:winsock:WSAStartup succeeded
trace:winsock:WSAStartup verReq=202
trace:winsock:WSAStartup succeeded
svn: Network socket initialization failed
trace:winsock:DllMain 0x7eda0000 0x0 0x1

The excerpt of code within SVN that's returning this error is at session.c:
/* Can we initialize network? */
if (ne_sock_init() != 0)
return svn_error_create(SVN_ERR_RA_DAV_SOCK_INIT, NULL,
_("Network socket initialization failed"));


and the following notes are from the SVN redbook:

The libsvn_ra_dav library is designed for use by clients that are being run on different machines than the servers with which they are communicating, specifically servers reached using URLs that contain the http: or https: protocol portions. To understand how this module works, we should first mention a couple of other key components ... the Neon HTTP/WebDAV client library.

Subversion uses HTTP and WebDAV (with DeltaV) to communicate with an Apache server. ... When communicating with a repository over HTTP, the RA loader library chooses libsvn_ra_dav as the proper access module. The Subversion client makes calls into the generic RA interface, and libsvn_ra_dav maps those calls (which embody rather large-scale Subversion actions) to a set of HTTP/WebDAV requests. Using the Neon library, libsvn_ra_dav transmits those requests to the Apache server.

Hmmm, Neon Library and WebDav under Wine ... Hmmm.

(Update 17-04-2008: NB, this is a problem with both http and https, and it really is that subversion can't initialise the network library and I've not found any useful advice on the net. My guess is that I'd need to build my own binary of subversion with neon linked in different way than the default, or that there's some subtle wine configuration I don't know about. Either way I'm stymied!)

(Update 21-04-2008: I've raised this as a ticket with the Cross Over Office Folks since I need them to run EA anyway.)

Being a collection of links and notes.

You've seen me witter on about online resources in molesOnline Reference. Now we need to do some more thinking. Still mainly in the context of service binding (but where the services might just be view services too!).

Put it all in rdf they said.

#### RDF

What do I know?

Well, I understand about subject -> predicate -> object. I can bandy about the phrase "triplestore" with the best of them.

So if I have text -> typeOfLink -> target, and I'm thinking about this in terms of my metadata environment, then I can readily imagine at least three contexts for such links: embedded within document text, embedded within documents as lists of links, and as associations between documents at some metalevel (e.g. the relatedTo association in MOLES).

If I'm thinking about infrastructure, then I want to store:

• those triples

• the documents

• the associations and crucially, the context in which they all live. It would appear that context matters (this is all related to the context of reification, that is, that each triple is itself a resource describable in RDF).
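To make "triples plus context" concrete, here is a toy in-memory store that keeps a fourth value alongside each triple. It is a sketch only: the class, the URIs and the context labels are all invented for illustration, and a real system would use a proper triplestore.

```python
from collections import namedtuple

# a triple plus the context in which it was asserted
Quad = namedtuple("Quad", "subject predicate obj context")


class TinyStore:
    """A toy store of (subject, predicate, object, context) statements."""

    def __init__(self):
        self._quads = set()

    def add(self, subject, predicate, obj, context):
        self._quads.add(Quad(subject, predicate, obj, context))

    def match(self, subject=None, predicate=None, obj=None, context=None):
        """Return the quads matching every field that is not None."""
        pattern = (subject, predicate, obj, context)
        return sorted(
            quad for quad in self._quads
            if all(want is None or want == got for want, got in zip(pattern, quad))
        )


store = TinyStore()
# hypothetical URIs and context labels
store.add("urn:example:doc1", "moles:relatedTo", "urn:example:doc2", "association")
store.add("urn:example:doc1", "ex:cites", "http://example.org/paper", "inline-text")
related = store.match(predicate="moles:relatedTo")
```

The three contexts mentioned above (inline text, lists of links, metalevel associations) would simply be three values of the fourth field.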

RDF itself says nothing about what typeOfLink should be. For that we need an ontology for links ... ("oh no we don't" they scream from offstage, "oh yes we do", they scream from onstage. Yes, I fear this is going to descend to pantomime).

#### RDFa

Well, RDFa allows me to embed all sorts of information about a link in the text preceding and following it. In practice I might think about that ability as allowing me to have an object which has some attributes, and, in the context document, to expose some of those attributes. That probably needs an example.

Let's imagine I have an objectType called person defined in a schema called, say, moles. For the purpose of this example, Person has attributes of name, email address, and web page. I might say that the URI of any person instance is the URI of their webpage.

In my metadata document, I might want to have some text in a paragraph which has a link to a person's web page, something like:

<p xmlns:moles="http://ndg.nerc.ac.uk/schemas/molesv2"
about="...">
<!-- about is RDFa syntax and sets a current URI for this paragraph -->
<span property="moles:personName" content="Bryan Lawrence">
Bryan Lawrence</span> at
<a href="...">his email</a>.
</p>


which could get hoovered up into raw RDF triples.

My point here is that this particular link can be serialised via RDFa into a bunch of triples, and in the calling document the primary link might not even appear as a "simple link" (the primary link would not be visible in an html rendering of this link which looks like an email link). The semantics of the link are controlled by the semantics we have in the parent namespace.
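A deliberately simplified sketch of that "hoovering up", using only the standard library. It is not a conforming RDFa parser: it only looks at @about, @property and @content, it does not expand CURIEs, and it needs well-formed XHTML. The subject URI and email address in the sample are made up.

```python
import xml.etree.ElementTree as ET


def simple_triples(xhtml):
    """Collect (subject, predicate, literal) triples from @about/@property markup."""
    triples = []

    def walk(element, subject):
        subject = element.get("about", subject)  # @about resets the subject
        prop = element.get("property")
        if prop is not None:
            value = element.get("content")
            if value is None:
                # no @content, so fall back to the human readable text
                value = "".join(element.itertext()).strip()
            triples.append((subject, prop, value))
        for child in element:
            walk(child, subject)

    walk(ET.fromstring(xhtml), None)
    return triples


sample = (
    '<p xmlns:moles="http://ndg.nerc.ac.uk/schemas/molesv2" '
    'about="http://example.org/people/bnl">'  # hypothetical subject URI
    '<span property="moles:personName" content="Bryan Lawrence">'
    'Bryan Lawrence</span> at '
    '<a href="mailto:someone@example.org">his email</a>.</p>'
)
triples = simple_triples(sample)
```

Note that the subject comes from the enclosing paragraph, not from the visible link, which is the point being made above.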

(Issue to consider: how do we create all these documents with RDFa that conforms to MOLES or whatever other schemas are relevant?)

#### Atom

Let's start with the atom:content element, which can contain either content or links to the content of an atom:entry. This is pretty useful because it explicitly does a little bit of what MOLES version one tried to do. Best summarised by directly quoting the specification:

atomInlineTextContent =
element atom:content {
atomCommonAttributes,
attribute type { "text" | "html" }?,
(text)*
}

atomInlineXHTMLContent =
element atom:content {
atomCommonAttributes,
attribute type { "xhtml" },
xhtmlDiv
}
atomInlineOtherContent =
element atom:content {
atomCommonAttributes,
attribute type { atomMediaType }?,
(text|anyElement)*
}
atomOutOfLineContent =
element atom:content {
atomCommonAttributes,
attribute type { atomMediaType }?,
attribute src { atomUri },
empty
}
atomContent = atomInlineTextContent
| atomInlineXHTMLContent
| atomInlineOtherContent
| atomOutOfLineContent


Even more importantly: the atom:link element defines a reference from an entry or feed to a Web resource. The specification assigns no meaning to the content (if any) of that element.

atomLink =
element atom:link {
atomCommonAttributes,
attribute href { atomUri },
attribute rel { atomNCName | atomUri }?,
attribute type { atomMediaType }?,
attribute hreflang { atomLanguageTag }?,
attribute title { text }?,
attribute length { text }?,
undefinedContent
}


Now we have something that extends xlink, and has a controlled vocab called the Registry of Link Relations (cue "oh no") for the link relation types, which currently has five members:

1. The value "alternate" signifies that the IRI in the value of the href attribute identifies an alternate version of the resource described by the containing element.

2. The value "related" signifies that the IRI in the value of the href attribute identifies a resource related to the resource described by the containing element. For example, the feed for a site that discusses the performance of the search engine at "http://search.example.com" might contain, as a child of atom:feed:

    <link rel="related" href="http://search.example.com/"/>


An identical link might appear as a child of any atom:entry whose content contains a discussion of that same search engine.

3. The value "self" signifies that the IRI in the value of the href attribute identifies a resource equivalent to the containing element.

4. The value "enclosure" signifies that the IRI in the value of the href attribute identifies a related resource that is potentially large in size and might require special handling. For atom:link elements with rel="enclosure", the length attribute SHOULD be provided.

5. The value "via" signifies that the IRI in the value of the href attribute identifies a resource that is the source of the information provided in the containing element.
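As an illustration of working with those five values, this sketch groups the links in an entry by their rel and reports any rel that is not in the registry. The sample entry and its URLs are hypothetical; treating a missing rel as "alternate" follows the Atom specification.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
REGISTERED = {"alternate", "related", "self", "enclosure", "via"}


def links_by_rel(entry_xml):
    """Map each rel value to the list of hrefs that carry it."""
    grouped = {}
    for link in ET.fromstring(entry_xml).iter(f"{{{ATOM_NS}}}link"):
        rel = link.get("rel", "alternate")  # a missing rel means "alternate"
        grouped.setdefault(rel, []).append(link.get("href"))
    return grouped


def unregistered(grouped):
    """Return the rel values that are not in the registry (e.g. extension URIs)."""
    return sorted(rel for rel in grouped if rel not in REGISTERED)


sample = (
    f'<entry xmlns="{ATOM_NS}">'
    '<title>Some KML</title>'
    '<link href="http://example.org/data.kml"/>'
    '<link rel="related" href="http://search.example.com/"/>'
    '<link rel="http://example.org/rel/service" href="http://example.org/wms"/>'
    '</entry>'
)
grouped = links_by_rel(sample)
extras = unregistered(grouped)
```

A rel that is a full URI is allowed by the atomLink grammar, so "unregistered" here does not mean "invalid".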

The mechanism for updating these relationships is quite onerous. To quote the specification: "New assignments are subject to IESG Approval, as outlined in RFC2434. Requests should be made by email to IANA, which will then forward the request to the IESG, requesting approval. The request should use the following template:"

• Attribute Value: (A value for the "rel" attribute that conforms to the syntax rule given above).

• Description:

• Expected display characteristics:

• Security considerations:

And now let's use some of the examples floating around in various documents where folk are talking about RESTful services. Sean's one that I used recently was

<entry>
<title>Some KML</title>
<link
rel="alternate"
href="..."
/>
...
</entry>


Now, we could build a compound thing, and get a WADL document. There is a convention that I first saw in a Pat Cappelaere presentation, which is essentially of the form:

If I have a resource /x then I can get the metadata for it at /x.metadata and the atom representation at /x.atom, and maybe, if it's a service type thing, a service description at /x.wadl.

In other words, a convention (I think) which imposes a vocabulary (is it controlled, and if so by whom?) on a URI. So the link type is now in the URI scheme in a RESTful atom world ...
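Written down as code, the convention is tiny. This is a sketch assuming only the three suffixes mentioned above; the function name and the example resource URI are hypothetical.

```python
# the suffix vocabulary described above; anything beyond these three is unknown
CONVENTION = {"metadata": ".metadata", "atom": ".atom", "wadl": ".wadl"}


def representation_uris(resource_uri):
    """Derive the conventional representation URIs for a resource URI."""
    base = resource_uri.rstrip("/")
    return {name: base + suffix for name, suffix in CONVENTION.items()}


uris = representation_uris("http://example.org/x")
```

Which makes the question above rather pointed: the dictionary of suffixes is the vocabulary, and someone has to own it.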

Andrew Turner gave some examples in a presentation I found via the geoweb rest group which also does the same sort of thing, now explicitly as an atom link:

<link
rel="alternate"
type="application/vnd.ogc.wms_xml; charset=utf-8"
href="http://mapufacture.com/feeds/1621.wms" />


He gave a template example too in that presentation that I have yet to understand ...

Now, let's jump to a discussion in Google Groups, where I learn (from Sean yet again) that "what a service must do is specified using an app:accepts element".

But that works fine provided what I'm uploading is a media type. What if my service needs to know the feature type? There is some discussion that went down this route, where Sean proposed something like:

text/xml;subtype=om/#,


which presumably is a syntax for pointing to a specific schema.

Jeremy Cothran suggested including a reference to an associated content schema, but didn't (I think) suggest exactly how that could be done.

Right. That's enough for now. More later.

by Bryan Lawrence : 2008/04/10 : Categories ndg metadata metafor : 0 trackbacks : 1 comment (permalink)

## Semis here we come

At last some sporting good news. Semi-Finals here we come.

Mind you, I suspect that's as far as we'll go ...

## Notes on brief service descriptions and bindings

I'm coming late1 to a discussion about representing services in feeds and catalogues. (See Sean and Jeroen, and especially Doug's comment on Sean's post.) I found these while trying to go back to an old post of Sean's (which I have yet to find) about REST and WPS, because I wanted to incorporate some of the ideas into a forthcoming presentation. Anyway, a bit of retrospective googling led me here.

Now as it happens, in my own way, I was asking the same question just a few days ago. The fact that I was asking it in the context of ISO19115 obscured the semantics of the question, as indeed did most of the discussion linked above. (Actually asking it in the context of ISO19115 ought not to have done so, because ISO19115 is a content standard not an implementation standard, but that's a by-the-way).

Anyway returning to the question, this is in fact one of the major things behind my much longer rant about service descriptions. So much so that I fancy writing down my current train of thought. Sorry. You really don't have to read it ...

The problem is that catalogues (don't care what they are done in, Atom, ISOXXX or whatever), have lists of datasets (entries or whatever if you have a different application). We want users to be able to select datasets and pass them off to a service client (e.g. Google Earth, a WMS based visualisation client or whatever). What we need to do is be able to flag for each dataset that it can support specific services ... which will vary on a dataset by dataset basis.

We have a bunch of choices of how to do this. For each dataset we can load in a bunch of endpoint addresses that specific services can exploit. This is what we currently do (where we potentially have some KMZ endpoints, some WMC endpoints, and a bunch of others). This scales horribly in many ways from a management point of view (add a new service, edit each and every metadata record ... and we have lots!). We do this in the context of a DIF related_URL. If you looked at the XML you get to see stuff like this:

<Related_URL>
<!-- Web Map Context URL is required by NERC Portals -->
<URL_Content_Type>GET DATA > WEB MAP CONTEXT</URL_Content_Type>
<URL>http://ourhost/ukcip_wms?request=GetContext</URL>
</Related_URL>


To add a new service, we need to create one of these for each dataset on which it can operate2, which will depend on lots of things, including the feature types of the data. Ideally of course we know the feature types of all the data, and so we can script the metadata mods ... but if we know that, then why don't we do late binding? (And automate the process by having service descriptions and data descriptions, and let the associations themselves be discoverable, and renderable by the catalogue dynamically as users find things of interest.)

Well of course we're only crawling (in a software sense) towards adequate feature type descriptions, but the nirvana we want is real enough. But to do that we need some service descriptions ...

... which don't exist, so we have to solve the problem outlined in the first paragraph for the foreseeable future, in a wide variety of contexts (including ISO19115 based catalogues).
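For what it's worth, the late binding idea is easy to sketch once feature types are known on both sides. Everything here is hypothetical (the dataset ids, the feature type names, the service names); the point is only that the associations are computed when asked for, not stored in each metadata record.

```python
def bind_services(datasets, services):
    """Work out, at lookup time, which services can operate on which datasets.

    datasets: {dataset_id: feature_type}
    services: {service_name: set of feature types the service accepts}
    """
    bindings = {}
    for dataset_id, feature_type in datasets.items():
        bindings[dataset_id] = sorted(
            name for name, accepted in services.items() if feature_type in accepted
        )
    return bindings


# hypothetical data descriptions and service descriptions
datasets = {"ukcip": "GridSeries", "station-obs": "PointSeries"}
services = {"wms": {"GridSeries"}, "download": {"GridSeries", "PointSeries"}}
before = bind_services(datasets, services)

# adding a service touches one service description and no dataset records
services["kml"] = {"PointSeries"}
after = bind_services(datasets, services)
```

Compare that last step with the current situation, where a new service means editing each and every metadata record.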

Now, back to the main point. For many reasons ORE is becoming very interesting to us, and so it was great to see Sean making the connection (that I should have made). I like it so much I'm going to repeat his two examples here:

(Sean stuff follows):

<entry>
<title>Some KML</title>
<link
rel="alternate"
href="..."
/>
...
</entry>


Otherwise, you use rdf:type (ORE specifies that non-atom elements refer to the linked resource, not to the entry):

<entry>
<title>Some Journal Article</title>
<link
rel="alternate"
type="text/xml"
href="..."
/>
<rdf:type>http://purl.org/eprint/type/JournalArticle</rdf:type>
...
</entry>


(Sean stuff concludes)

The point being that where standard resource type names (mime-types) exist, one can use them, and where they don't, one uses an RDF relationship. (Incidentally, I don't really see how this differs in semantic content from Stefan Tilkov's recent proposals for decentralised media types, which got a less than enthusiastic reception - even to the point of my not understanding why he thinks his proposal is semantically different from using RDF.)

In the context of MOLES and CSML we've been thus far using xlink to try and make these typed links (which answers yet another of Sean's old questions: yes, us). We've got so far with it that Andrew Woolf has an OGC best practice recommendation on how to use xlink in this context. I strongly believe that typed links are crucial to our data and metadata systems, but I'm less convinced by xlink ... (yeah, we can build tools to navigate using it, but will our best practice paper disappear into the ether?)

So let's step away from it all and ask questions about the semantics. And when we do that, I expect to come back to both ORE-like concepts and RESTful ones ...

If I start with a dataset with a uri, I can consider a few things:

• I can have a resource available which provides a description of the dataset in a variety of different syntaxes (e.g. the raw XML for GML:CSML and/or MOLES, and html versions of either or both),

• I can have a bunch of bound-up service+dataset uris which show "summary" visualisations of the data (logos, quicklook aka thumbnail images etc),

• I can have a service+dataset uri which exposes a human navigable tool to subset/download/visualise or whatever the dataset (i.e. a service which does something to the dataset, but in the first instance what you get back is html which is NOT a representation of the dataset itself), and

• I can have a naked service endpoint which when I go there, can expose something about the dataset (but maybe exposes multiple other datasets as well).

In both of the last cases, it might be possible to automatically construct a uri to a resource which did represent some property of the dataset given the starting endpoint URI. Ideally that's what you get when you use standardised services ... In this latter case it ought to be possible to construct a RESTful URI in the end, even if the service itself wasn't particularly RESTful (e.g. WMS and WCS).
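As a small example of constructing such a URI from a naked endpoint, this sketch builds a WMS GetCapabilities request with the standard library. SERVICE, REQUEST and VERSION are standard WMS parameters; the default version and the example host are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def wms_capabilities_uri(endpoint, version="1.3.0"):
    """Build a GetCapabilities URI from a bare WMS endpoint URI."""
    parts = urlsplit(endpoint)
    query = dict(parse_qsl(parts.query))  # keep any parameters already present
    query.update({"SERVICE": "WMS", "REQUEST": "GetCapabilities", "VERSION": version})
    return urlunsplit(parts._replace(query=urlencode(query)))


capabilities = wms_capabilities_uri("http://example.org/ukcip_wms")
```

The same trick works for any standardised service where the request vocabulary is fixed and only the endpoint varies.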

Now all of the above are really aggregations which brings me back to ORE.

I had better stop now. Conclusions can wait.

1: I lost contact with the world for most of March. One day I'll explain why here. Some will know. (ret).
2: This example also shows the situation we're in of having to have a different WMS endpoint for each dataset ... the way WMS works with getCapabilities doesn't scale to us having one WMS endpoint for all our data. (ret).

## more MOLES version two thinking

We've made some more progress with our thinking since last week ... but there is more to come.

Meanwhile:

Note the new relatedTo association class. That's in there to allow us to have arbitrary RDF associations between MOLES entities explicitly modelled. No idea if this is the right way to do it ...

Note also that the services are placed more sensibly, and the concept supports both services intended for human consumption (that are "of the web") and which are for data consumption ...

## Subversion under WINE

Today's task. Trying to get Enterprise Architect working with subversion.

Enterprise Architect runs under CrossOver Office on linux (Wine) and getting it to work with subversion is not as straightforward as one might like (but the folks at sparxsystems are very helpful!)

The first step is to verify that we can use subversion within crossover office.

I installed the standard windows 32 bit binary into my EA bottle.

The two repositories I am interested in working with are available via https:// and svn+ssh://, and neither worked straightforwardly (I think the sparxsystems folk have only worked with regular svn:// repositories thus far).

To test access, I created simple batch files of the form:

"C:\Program Files\Subversion\bin\svn.exe" checkout "https://url-stuff" > svnlog.log 2> svnlog-err.log


and

"C:\Program Files\Subversion\bin\svn.exe" checkout "svn+ssh://url-stuff" > svnlog.log 2> svnlog-err.log


which I put in my /home/user/.cxoffice/ea/drive_c/ directory with names something like batfile.bat. (In both cases, the repositories already exist and work fine with linux subversion).

We can then run those windows bat files with a unix script like this one:

/home/user/cxoffice/bin/wine --bottle "ea" --untrusted --workdir "/home/user/.cxoffice/ea""/drive_c" -- "/home/user/.cxoffice/ea/drive_c/batfile.bat"
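If you want to drive that from a script rather than typing it, the same command line can be assembled in Python. This is a sketch only: the helper name is made up, the paths follow the layout used above, and nothing is actually executed here.

```python
def wine_command(home, bottle, batfile):
    """Build the argument list for running a Windows batch file in a CrossOver bottle."""
    drive_c = f"{home}/.cxoffice/{bottle}/drive_c"
    return [
        f"{home}/cxoffice/bin/wine",
        "--bottle", bottle,
        "--untrusted",
        "--workdir", drive_c,
        "--",
        f"{drive_c}/{batfile}",
    ]


command = wine_command("/home/user", "ea", "batfile.bat")
# on a machine with CrossOver installed, subprocess.run(command) would launch it
```

Passing the arguments as a list avoids the quoting gymnastics of the one-line shell version.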


Neither worked. The two error messages I got were:

• (https:) svn: Network socket initialization failed

I haven't made much progress with the former. Googling yielded this query to the wine list. The author got no response and currently has no solution (I checked with him personally).

However, I made real progress with the latter. The best advice I found was this, but there were enough things to think about for the wine environment that it's worth documenting here.

1. I spent some fruitless time trying to install tortoiseSVN into my EA bottle under crossover. (It seems to work, but under a win2000 bottle, and EA needs a win98 bottle). I then tried Putty ... which installed fine. You'll need the lot (make sure you download the installer not the simple .exe).

2. You will need a public key and private key set up so that the batch files run by EA (and you in testing) can work, so we get that done first.

3. On my laptop, I used

ssh-keygen -t rsa


to get a public and private key pair (thanks). For the moment I'm using no passphrase (the laptop disk itself is encrypted).

4. Now I had a pair of keys sitting in my laptop ~/.ssh/ directory.

5. The repository I want to get to lives on server, which I log into over ssh, so the public key needs to be there. I changed directory into my .ssh directory there, and

1. created an empty file authorized_keys (just "touch authorized_keys")

2. opened that file up, and copied in the public key from my laptop directory.

3. Then I tested that key pair from the command line (under linux, ssh you@server). Fine.

6. Then putty has to be encouraged to use that private key.

1. Putty doesn't understand the openssh private key file we've just generated, so we need to convert it.

1. A copy visible to windows is needed (i.e. copied away from the hidden .ssh directory).

2. Run puttygen from the crossover office run windows command (browse to the executable in the Putty Program Files directory).

3. Now click on File > Load Private Key and load the file, then save the private key out as a .ppk file. Delete the copy of the private key, and move the .ppk file somewhere special on your drive_c in your windows partition.

7. Now you need to make sure that ssh is configured to use your putty. So, that means putting

ssh = "C:\\Program Files\\PuTTY\\plink.exe" -i "C:\\yourdir\\yourfile.ppk"


in the [tunnels] section of your subversion config file, which lives in

.cxoffice/yourEABottle/drive_c/Windows/Application Data/Subversion


And now at least I can checkout a working copy ...
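For the record, step 5 above (getting the public key into authorized_keys on the server) can be sketched in a few lines of Python. This is only a sketch under my own assumptions: the function name `install_public_key` is made up, the key in the demo is a placeholder, and the permission bits are the conventional ones rather than anything I checked against this particular server.

```python
import os
import tempfile
from pathlib import Path


def install_public_key(public_key: str, ssh_dir: Path) -> bool:
    """Append public_key to ssh_dir/authorized_keys unless it is already there.

    Returns True if the key was added, False if it was already present.
    (Hypothetical helper, not part of ssh or PuTTY.)
    """
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    auth_file = ssh_dir / "authorized_keys"
    auth_file.touch(mode=0o600, exist_ok=True)  # same idea as "touch authorized_keys"

    key = public_key.strip()
    existing = [line.strip() for line in auth_file.read_text().splitlines()]
    if key in existing:
        return False

    with auth_file.open("a") as handle:
        handle.write(key + "\n")
    os.chmod(auth_file, 0o600)  # keep the file private to the owner
    return True


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than a real ~/.ssh.
    with tempfile.TemporaryDirectory() as tmp:
        demo_dir = Path(tmp) / ".ssh"
        fake_key = "ssh-rsa AAAAB3placeholder you@laptop"  # placeholder, not a real key
        print(install_public_key(fake_key, demo_dir))  # first call adds the key
        print(install_public_key(fake_key, demo_dir))  # second call finds it already there
```

Doing it this way rather than by hand mostly buys you idempotence: running it twice doesn't leave two copies of the key in the file.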

I'll try and integrate that with EA another day, and the https problem will have to wait too. That lot took more than twice the time I allocated for it ...

## early MOLES version two thinking

Today's task was going through MOLES V1 and trying to capture the key semantics in UML.

MOLES V1 was implemented (and defined) in XML schema, and had some regrettable flaws which we want to sort out in V2 ... and like I said yesterday, we want to take a step towards O&M support (but realistically we won't get there in this one step).

So this is where we are: the key overarching diagram looks like this:

I don't have time to explain much of it today, but suffice to say it has all the good bits from V1 but cleaner lines (yes really) of specialisation and association. It should be much easier to implement and support.

It turns out that with a cleaner structure a whole lot of baggage disappeared, and rather less semantic content was left to differentiate between those key entities (productionTool, observationStation and activity) than I thought ... (you can't see that in this diagram, you'll have to wait for a fuller discussion, but suffice to say it mostly comes down to codeLists, albeit that we might support a movingPlatform specialisation of an observationStation).

You will also see that there are the beginnings of some support for late binding of services. That might be a bit clearer with this:

Yesterday I was also wittering on about online references. My thinking isn't complete on this yet, but this is what I have in dgOnlineReference:

Note the use of applicationProfile to hold the semantics of an information resource, which ought to be a featureType. Note also I'd rather like to get the mimeType in here too ... rather like an xlink actually. Some more thinking to go on here (indeed the whole thing is still in a state of flux ...)

## The Scope of ISO19115

We're taking the first steps towards refactoring our Metadata Objects for Linking Environmental Sciences (MOLES) schema to be more easily understood and implementable and to support (if not conform with) the new Observations and Measurements OGC specification. In doing so it became obvious to me that I need to think about the relationship between MOLES entities and ISO discovery metadata.

ISO19115 specifies that

...a dataset (DS_DataSet) must have one or more related Metadata entity sets (MD_Metadata). Metadata may optionally relate to a Feature, Feature Attribute, Feature Type, Feature Property Type (a Metaclass instantiated by Feature association role, Feature attribute type, and Feature operation), and aggregations of datasets (DS_Aggregate). Dataset aggregations may be specified (subclassed) as a general association (DS_OtherAggregate), a dataset series (DS_Series), or a special activity (DS_Initiative). MD_Metadata also applies to other classes of information and services not shown in this diagram (see MD_ScopeCode, B.5.25).

Let's have a look at the MD_ScopeCode, which is the value of the MD_Metadata attribute hierarchyLevel:

| Name | Code | Definition |
|---|---|---|
| MD_ScopeCode | CodeS | class of information to which the referencing entity applies |
| attribute | 001 | information applies to the attribute class |
| attributeType | 002 | information applies to the characteristic of a feature |
| collectionHardware | 003 | information applies to the collection hardware class |
| collectionSession | 004 | information applies to the collection session |
| dataset | 005 | information applies to the dataset |
| series | 006 | information applies to the series |
| nonGeographicDataset | 007 | information applies to non-geographic data |
| dimensionGroup | 008 | information applies to a dimension group |
| feature | 009 | information applies to a feature |
| featureType | 010 | information applies to a feature type |
| propertyType | 011 | information applies to a property type |
| fieldSession | 012 | information applies to a field session |
| software | 013 | information applies to a computer program or routine |
| service | 014 | information applies to a ... service ... |
| model | 015 | information applies to a copy or imitation of an existing or hypothetical object |
| tile | 016 | information applies to a tile, a spatial subset of geographic data |

I'm not convinced I understand all of those, particularly the model type, but also the collectionHardware, collectionSession and dimensionGroup types. Anyone who can shed some light on those would be welcome to comment below or email me ...

Obviously metadata should also apply to other entities. In particular, within the NDG we consider that the observation station (we had difficulty finding an appropriate noun inclusive of simulation hardware, but covering a ship, physical location, or a field trip etc), the data production tool (DPT, aka instrument, but inclusive of simulation software also known as models), and activities are also first class citizens of metadata. Perhaps collectionHardware and collectionSession might be relevant for the DPT and some activities. I don't know.

We also consider the deployment to be an important entity: a deployment links one or more data entities to a (DPT, activity, observation station) triplet, and may itself have properties. It's worth noting that in the observations and measurements framework, the concept of Observation binds a value of a property determined by a procedure to a feature. In the NDG world, data features live within data entities, and some part of what O&M calls a "procedure" is an attribute of a data production tool, but most of a "procedure" is in my mind synonymous with a deployment [1]. Values and properties live within data entities too (a data entity is described by an application schema of GML which can include from the O&M namespace).
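To make the shape of a deployment a bit more concrete, here is a minimal sketch in Python. It is not the MOLES data model (that is still unresolved): the class names `Entity` and `Deployment`, the field names and the identifiers in the usage example are all my own placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    """Minimal stand-in for any first class entity (hypothetical)."""
    identifier: str
    title: str


@dataclass
class Deployment:
    """Links one or more data entities to a (DPT, activity, station) triplet."""
    production_tool: Entity
    activity: Entity
    observation_station: Entity
    data_entities: List[Entity]
    # A deployment may itself have properties (e.g. switches, initial conditions).
    properties: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.data_entities:
            raise ValueError("a deployment must link at least one data entity")


if __name__ == "__main__":
    deployment = Deployment(
        production_tool=Entity("dpt-001", "some instrument or model"),
        activity=Entity("act-001", "some campaign"),
        observation_station=Entity("obs-001", "some ship or site"),
        data_entities=[Entity("de-001", "some dataset")],
        properties={"note": "illustrative only"},
    )
    print(deployment.production_tool.identifier, len(deployment.data_entities))
```

The only rule enforced here is the "one or more data entities" multiplicity; everything else is left open on purpose.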

Leaving aside a resolution of a formal data model for MOLES, the first class entities will need to support metadata, and so there needs to be a scope code for the appropriate first class entities.

John Hockaday on the metadata list, initially suggested extending the scope codes to cover:

| Suggested code | Rationale |
|---|---|
| profile | there are many community profiles being developed |
| document | a general "grab bag" type for documents |
| repository | ... suitable for something like a RDBMS |
| codeList | there are many codeLists in ISO 19115, ISO 19119 and ISO 19139. These codeLists are extensible. |
| modelRun or modelSession | to distinguish from model (but see below) |
| applicationSchema | information about GML application schema themselves |
| portrayalCatalogue | for finding OGC Symbology Encoding or Styled Layer Descriptors for OGC Web Services |

Eventually he implemented some of those in a new codelist

| Code | Definition |
|---|---|
| modelSession | information applies to a model session or model run for a particular model |
| document | information applies to a document such as a publication, report, record etc. |
| profile | information applies to a profile of an ISO TC 211 standard or specification |
| dataRepository | information applies to a data repository such as a Catalogue Service, Relational Database, WebRegistry |
| codeList | information applies to a code list according to the CT_CodelistCatalogue format |
| project | information applies to a project or programme |

Actually, even with this definition of modelSession to augment model (which he thought might be used for things like metadata about UML descriptions), I still have problems. Within NDG and NumSim, we have the concept of model code bases and experiments, and I think these need to be kept separate but linked.

Personally I don't like the dataRepository one ... but I can live with it.

Project is ok, but we would prefer activity, because we decided that activities should be able to contain other activities, and the parents may well be projects ... but not always ... (e.g. campaigns within a formal project may themselves have sub-campaigns etc).

At this point I might consider a slightly different extension set (which is of course the point of having extensible codelists). Given I'm not sure about these collection thingies, and given a tilt towards O&M, I might want to have

| Code | Definition |
|---|---|
| document | as above |
| profile | as above |
| codeList | as above |
| dataRepository | as above |
| activity | information applying to a project, programme or other activity |
| productionTool | information about an instrument or algorithm |
| observationStation | information about the characteristics, location and/or platform which carried out, or is capable of, an observation or simulation |
| deployment | information linking a data entity, activity, productionTool and platform in a procedure |

Now I can have an algorithm (computer model) described in a productionTool metadata document and the particular data entity it produces is a data entity (of course), with the particular switches, initial conditions etc, described in a deployment (although I suspect there should and will be ambiguity as to whether the attributes of a productionTool could inherit most if not all of the characteristics of a deployment).
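Since the whole point is that the codelists are extensible, here is a small sketch of what extending one might look like if the codes were held as a plain mapping. Only a handful of the ISO codes are included, the NDG entries are the ones floated in this post, and the helper name `extend_codelist` is my own invention rather than anything from ISO19115 or ISO19139.

```python
from typing import Dict

# A few of the ISO19115 MD_ScopeCode values quoted earlier (not the full list).
ISO_SCOPE_CODES: Dict[str, str] = {
    "dataset": "information applies to the dataset",
    "series": "information applies to the series",
    "software": "information applies to a computer program or routine",
    "model": "information applies to a copy or imitation of an existing "
             "or hypothetical object",
}

# The extension entries considered in this post.
NDG_EXTENSION: Dict[str, str] = {
    "activity": "information applying to a project, programme or other activity",
    "productionTool": "information about an instrument or algorithm",
    "observationStation": "information about the characteristics, location and/or "
                          "platform which carried out, or is capable of, an "
                          "observation or simulation",
    "deployment": "information linking a data entity, activity, productionTool "
                  "and platform in a procedure",
}


def extend_codelist(base: Dict[str, str], extension: Dict[str, str]) -> Dict[str, str]:
    """Return base plus extension, refusing to redefine an existing code."""
    clashes = sorted(set(base) & set(extension))
    if clashes:
        raise ValueError("extension redefines existing codes: " + ", ".join(clashes))
    merged = dict(base)
    merged.update(extension)
    return merged


if __name__ == "__main__":
    scope_codes = extend_codelist(ISO_SCOPE_CODES, NDG_EXTENSION)
    print(sorted(scope_codes))
```

Refusing clashes is a design choice: an extension that silently changed what "model" means would defeat the purpose of sharing a codelist in the first place.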

A deployment most closely corresponds to an O&M observation in that we deploy a tool in or at a particular station for an activity to make a measurement, and I'd love a better (compound) noun than ObservationStation ...

1: Actually the overlap between an observation and a procedure is significant, something that is pointed out in the O&M spec itself.

## Online Resources in ISO19115 and MOLES

ISO19115 is a brave attempt to classify metadata. I think some aspects of it are fatally flawed in an RDF world, but some are not, and we're likely to support it for some time to come. But, let's face it, ISO19115 isn't all inclusive, so how do we reference external information? This posting is all questions. Hopefully tomorrow I'll provide answers!

The NASA DIF does this via the related_url keyword, which has three attributes: a textual description, the url itself, and a content_type from a controlled vocabulary. The related_url simply exists as a first order child of a DIF document.

In ISO19115 we need to use the CI_OnlineResource (Figure A.16). We find it has attributes of linkage (url), protocol, applicationProfile, name, description (all character strings), and function, which is drawn from an enumeration CI_OnLineFunctionCode. The latter has download, information, offlineAccess, order and search as values.
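As a note to self, the structure just described can be written down as a small Python sketch. The attribute and code names follow the description above; the Python class names, the snake_case spelling and the choice to make everything except linkage optional are my assumptions, and the URL in the example is a placeholder.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OnLineFunctionCode(Enum):
    """The five CI_OnLineFunctionCode values listed above."""
    DOWNLOAD = "download"
    INFORMATION = "information"
    OFFLINE_ACCESS = "offlineAccess"
    ORDER = "order"
    SEARCH = "search"


@dataclass
class OnlineResource:
    """Sketch of CI_OnlineResource: character strings plus one coded value."""
    linkage: str                                  # the url
    protocol: Optional[str] = None
    application_profile: Optional[str] = None     # applicationProfile
    name: Optional[str] = None
    description: Optional[str] = None
    function: Optional[OnLineFunctionCode] = None


if __name__ == "__main__":
    resource = OnlineResource(
        linkage="http://example.org/data/some-dataset",  # placeholder url
        protocol="http",
        name="Some dataset",
        function=OnLineFunctionCode.DOWNLOAD,
    )
    print(resource.linkage, resource.function.value)
```

Written like this, the extension question becomes obvious: the only place with controlled semantics is the function code, and everything else is free text.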

Clearly this codeList can and should be extended. However questions remain:

1. Is this the right way to do it, and

2. How should we use CI_OnlineResource anyway?

#### 1. Extending the CodeLists

Option one is to extend the CI_OnLineFunctionCode codeList. Ted Haberman on the metadata list [1] suggested extending it with the GCMD controlled vocabulary [2], which is of course what I initially thought.

John Hockaday replied with a suggestion about using the protocol attribute, and inserting a codeList there (and suggested that GCMD should have their own codeLists available - but given their aversion to public version control, we wouldn't rely on that). Also, I'm not so sure that simply using the GCMD codelist in either place is right, because the GCMD valids for the related_url mix the semantics of services and urls, even if they do have a sensible list, and the protocol is probably more about things like ftp and http than what we layer over either. I'm also concerned that we include the semantics to support an external online bibliographic citation.

#### 2. How do we actually use CI_OnlineResource anyway?

There are multiple places:

1. The online resource which is the best place to download the data itself. That's via MD_Metadata>MD_Distribution>MD_DigitalTransferOptions which has an attribute of onLine which can have 0..* multiplicity. Arguably we can add services there too [3].

2. For bibliographic citations from the data to documentation in scientific papers, we could use something in LI_Lineage.

3. For citations from papers to the datasets it's not so obvious.

The latter is what brought me to this. In MOLES we want to have links between the various entities and between MOLES documents and other documents (particularly NumSim and SensorML documents, i.e. documents which describe productionTools). (In truth, the relationships we want to encode would be far more sensibly serialised in RDF than XML Schema.) We also want to have entities we call data granules which are part of data entities and we expect those data granules to each individually have different services available [4].

#### 3. Services

Our problem is that we don't want the data granules independently discoverable for fear of overwhelming search engines (including our own), but we might want to expose in an overarching metadata document for the data entity that services of a particular class exist. The problem is that we might want this in two different syntaxes: a browser should be able to produce something consumable by a human (VIEW EXTENDED METADATA) and a service client might want simply to obtain a list of granules with names, service types, and service addresses. Of course we could handle this by having the latter accessible via cunning use of microformats or tags, but there are both performance and maintenance reasons why that won't wash.
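To pin down what "two different syntaxes" means, here is a sketch of the same granule service list being rendered once for a service client and once for a human. Everything in it is hypothetical: the class `GranuleService`, the two function names, the choice of JSON for the machine view, and the example.org addresses are mine, not anything MOLES defines.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GranuleService:
    """One service available on one data granule (hypothetical structure)."""
    granule_name: str
    service_type: str
    service_address: str


def as_service_list(services: List[GranuleService]) -> str:
    """Machine view: just names, service types and service addresses."""
    return json.dumps([asdict(s) for s in services], indent=2)


def as_text_summary(entity_title: str, services: List[GranuleService]) -> str:
    """Human view: say which classes of service exist, without listing granules."""
    types = sorted({s.service_type for s in services})
    return "{}: {} granule service(s) of type {}".format(
        entity_title, len(services), ", ".join(types)
    )


if __name__ == "__main__":
    services = [
        GranuleService("granule-a", "WMS", "http://example.org/wms/a"),
        GranuleService("granule-b", "download", "http://example.org/data/b"),
    ]
    print(as_text_summary("Example data entity", services))
    print(as_service_list(services))
```

Both views come from the same list, so the granules themselves never need to be independently discoverable; only the data entity document does.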

1: for some reason not all messages, including ones relevant to this posting, appear in the metadata list archive.
2: which is currently lurking at here - but note that GCMD and version control don't get on, we prefer our version controlled lists, e.g. this one.
3: Although obviously we believe in late binding and separating the service and data metadata!
4: Yes, I know, late binding again, but it's not here now!

## More quotations - Static Maps, Public Trusts, and Bad Processes

I'm so short of time at the moment that I'm limited to quoting, but the three snippets I've got today resonate as much as the last ones ...

Firstly, Sean is out there leading us to a brave new world beyond OGC protocols, and he was the first one to bring Google static maps to my attention. Interesting.

Secondly, I used to never get around to reading All Points because it seemed to be high traffic, and I was low bandwidth. Well that's all true, but I have been making an effort lately to at least page through my feeds regularly and the signal seems to be getting pretty interesting. Joe Francica posted two things to remember today. The second was a quote from Jim Geringer, former governor of Wyoming and ESRI's director of policy and public sector:

I want to remember that one in the same context as finding is not enough.

The other thing Joe posted was a good deal longer, but poses a really interesting question: Is Google Earth a Public Trust? in the sense that public bodies are starting to rely on privately funded base map and imagery database for mission critical applications in emergency management. He makes the point that they already rely on the computing infrastructure, but that external bodies managing public information is quite a step further. It's certainly something to think about.

by Bryan Lawrence : 2008/02/22 : Categories ndg environment computing

## authkit and pylons don't quite fit

Background - I'm using genshi as my templating engine in pylons 0.9.6.1 and I want authkit to do access control and authentication. This is in the context of pyleo.

I'm following the guidance in the draft pylons book.

Problem of the day: wrapping the signin template in the site look-n-feel. This is slightly less than trivial because the signin template is produced by authkit, but it doesn't have easy direct access to the pylons templating system because pylons is yet to be instantiated (in the middleware stack).

The recommended way of doing it is to create a file (in pyleo's lib directory, called "template") which loads what is needed to control the signin template (in a function called "make_template"), and point to that using

authkit.form.template.obj = pyleo.lib.template:make_template


in the development.ini file so that authkit can render a nice sign-in page.

There are a few problems in the current version of the guidance:

1. The current version of the doc wrongly has "template" instead of "make_template" after the colon in the development.ini config file.

2. For genshi, we don't want to call a template called "/signin.mako"; we want to call "signin".

3. If your site banner wants to look at the c or g variable, you have to do rather better than what is done with the State variable pretending to be c in the example template file. At the very least you need to add a __getitem__ method so that calls to c.something in your site templating code won't break, even if they don't work ... You will probably also need access to the pylons globals ...

At this stage, my template.py which provides the render function at the authkit level looks like this:

import pylons
from pylons.templating import Buffet
from pylons import config
import pyleo.lib.helpers as h
from pyleo.lib.app_globals import Globals

class MyBuffet(Buffet):
    def _update_names(self, ns):
        # hand the namespace back unchanged
        return ns

# the first configured engine is the default one
def_eng = config['buffet.template_engines'][0]

buffet = MyBuffet(
    def_eng['engine'],
    template_root=def_eng['template_root'],
    **def_eng['template_options']
)

# register any further engines under their aliases
for e in config['buffet.template_engines'][1:]:
    buffet.prepare(
        e['engine'],
        template_root=e['template_root'],
        alias=e['alias'],
        **e['template_options']
    )

class State:
    # stand-in for the pylons c object: every item lookup gives an empty string
    def __getitem__(self, v):
        return ''
c = State()

g = Globals()

def make_template():
    ''' In the following call, namespace is a dictionary of stuff for the templating
    engine ... which is why c is a (nearly) empty class, and h is the normal helper '''
    return buffet.render(
        template_name="signin",
        namespace=dict(h=h, g=g, c=State())
    ).replace("%", "%%").replace("FORM_ACTION", "%s")


Now it nearly works properly, but the pyleo site template currently uses the pylons c variable to produce a menu which is data dependent and obviously that doesn't work properly. We need to work out some way to get at that from "outside" pylons (which is where authkit lives). While that's a problem that can wait, it's a problem that needs solving ...
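Two bits of the above are easier to see outside pylons, so here is a standalone sketch. The first part shows what the replace chain at the end of make_template does, on the assumption (my reading of the code, not a statement about authkit internals) that the returned string is later completed with Python % formatting. The second is a more forgiving placeholder for c that answers attribute access as well as item access. The function name `prepare_for_authkit` and the sample HTML are made up for the illustration.

```python
class State:
    """Placeholder for the pylons ``c`` object: any lookup yields ''."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so c.anything gives ''.
        return ""

    def __getitem__(self, key):
        # Item access, as in the State class above, also gives ''.
        return ""


def prepare_for_authkit(rendered_html: str) -> str:
    """Escape literal % signs, then turn the FORM_ACTION marker into %s."""
    return rendered_html.replace("%", "%%").replace("FORM_ACTION", "%s")


if __name__ == "__main__":
    page = '<div style="width: 100%"><form action="FORM_ACTION"></form></div>'
    template = prepare_for_authkit(page)
    # Filling in the action restores the literal % and substitutes the url.
    print(template % "/signin")

    c = State()
    print(repr(c.menu), repr(c["menu"]))
```

The order of the two replaces matters: escaping the % signs first means the %s introduced afterwards is the only live placeholder left in the page.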

## Finding is not enough

It seems to be my day for wanting to quote folk:

On why finding something is not enough (via Bill de hOra) Samuel Johnson (1753):

I saw that one enquiry only gave occasion to another, that book referred to book, that to search was not always to find, and to find was not always to be informed.

This is useful for me in the context of, yet again, having to describe search behaviour to yet another audience.

## on global warming urgency

Steven Sherwood (in a Science letter), with respect to the urgency for action in response to global warming:

Greater urgency comes from the rapid growth rate (especially in the developing world) of the very infrastructure that is so problematic. Mitigating climate change is often compared to turning the Titanic away from an iceberg. But this "Titanic" is getting bigger and less maneuverable as we wait--and that causes prospects to deteriorate nonlinearly, and on a time scale potentially much shorter than the time scale on which the system itself responds.

## Data Publication

As part of the claddier project, we have been working on a number of issues associated with how one publishes data (as opposed to journal articles which may or may not describe data).

We're about to submit this paper to a reputable journal. Some of you will recognise concepts (and maybe even some text) from various blog postings over the years. Any constructive comments would be gratefully received!

This paper presents a discussion of the issues associated with formally publishing data in academia. We begin by motivating the reasons why formal publication of data is necessary, which range from simple fact that it is possible, to the changing disciplinary requirements. We then discuss the meaning of publication and peer review in the context of data, provide a detailed description of the activities one expects to see in the peer review of data, and present a simple taxonomy of data publication methodologies. Finally, we introduce the issues of dataset granularity, transience and semantics in the context of a discussion of how to cite datasets, before presenting recommended citation syntax.

## Moving Modelling Forward ... in small steps

I'm in the midst of a series of "interesting" meetings about technology, modelling, computing, and collaboration ... Confucian times indeed.

Last week, we had a meeting to try and elaborate on the short and medium-term NERC strategy for informatics and data. For some reason, NERC uses the phrase "informatics" to mean "model development" (it ought to be more inclusive of other activities, and perhaps it is, but it's not obvious that all involved think that way). As it happens, we didn't spend much time discussing data, in part because from the point of view of the research programme in technology, the main issue at the moment is to improve the national capability in that area (i.e. through improvements and extensions to the NERC DataGrid and other similar programmes).

Anyway, in terms of "informatics" strategy we came up with three goals:

• In terms of general informatics, to avoid losing the impetus given to environmental informatics by the e-Science programme,

• To try and increase the number of smart folk in our community who are capable of both leading and carrying out "numerically-rich" research programmes (i.e. more people who can carry our model development forward). We thought an initial approach of more graduate students in this area followed by a targeted programme might make a big difference.

• To try and identify some criteria by which we could evaluate improvement in model codes (in particular, if we want adaptive meshes etc, which ones, and how should we decide?). (Michael you ought to like that one :-)

This was in the context of trying to ensure that NERC improves the flexibility and agility (and performance) of its modelling framework so it can start to answer interesting questions about regional climate change. Doing so will undoubtedly stretch our existing modelling paradigms, particularly as we try and take advantage of new computer hardware.

During the meeting we all had our list of issues contributing to the discussion. This was my list of things to concentrate on:

• Improving our high resolution modelling (learning from and exploiting HIGEM).

• Improving our (the UK research community outside the Met Office) ability to contribute to AR5 simulations.

• Improving our ability to work with international projects like Earth System Grid (data handling) and PRISM (model coupling). (We - the UK - are involved with both, but not enough).

• Data handling for irregular grids.

• Model metadata (a la NumSim, PRISM, METAFOR).

• Future Computing Issues in general, but in particular:

• Massive parallelism on chip ... where we might expect memory issues: "Shared memory systems simply won't survive the exponential rise in core counts." (steve dekorte via Patrick Logan.)

• Better dynamic cores

• Better use of cluster grids and university supercomputing (not just the national services, will require much more portable code than we have now, and not a little validation of the models on each and every new architecture).

• i.e. better coding standards ...

• Better ensemble management and error reporting (Michael's bad experience is not dissimilar to folk here with the Unified Model).

• Learning the lessons of the GENIE project(s).

• Handling massive increases in data volumes.

• With consequential issues for transport and archival

• and the requirement to better exploit server-side data services

• Much better model componentisation and coupler(s).

And like everyone else, I wanted to know where are the smart folk to do all this?

Then today, we had an initial discussion about procuring a new computing resource with the Met Office (which, by the way, doesn't preclude our involvement in other national computing services, far from it). There isn't much I can say about this discussion, as much of it was in confidence, but suffice to say, it was all about how we can exploit a shared system on which we would be running the met office models for joint programmes ... of course it's that very same model which most certainly needs a technology refresh :-)

On Friday, we'll be discussing the new NERC-Met Office joint climate research programme ... (which will be one of the programmes exploiting the new system).

## Using more computer power, revisited.

In the comments to my post on why climate modelling is so hard, Michael Tobis made a few points that need a more elaborate response (in time and text) than was appropriate for the comments section, so this is my attempt to deal with them. But before I do, let me reiterate that I don't disagree that there are substantial things that could and should be done to improve the way we do climate modelling. The contention may lie in our expectations of what improvements we might reasonably expect, and hence perhaps in our differing definitions of what might be impressive further progress.

Before I get into the details of my response, I'm going to ask you to read an old post of mine. Way back in January 2005, I tried to summarise the issues associated with where best to put the effort on improving models: into resolution, ensembles or physics?

Ok, now you've read that, three years on, it's worth asking whether I would update that blog entry or not? Well, I don't think so. I don't think changing the modelling paradigm (coding methods etc), would change the fundamentals of the time taken to do the integrations although it might well change our ability to assess changes and improve them, but I've already said I think that's a few percent advantage. So, in practice, we can change the paradigm, but then the questions still remain: ensembles, resolution or physics? Where to put the effort?

Ok, now to Michael's points:

Do you think existing codes are validated? In what sense and by what method?

In the models with which I am familiar I would expect every code module that can be tested physically against inputs and outputs to have been so tested for a reasonable range of inputs. That is to say, someone has used some test cases (not a complete set: in some cases the complete set of inputs may be a large proportion of the entire domain of all possible model states, i.e. it can't be formally validated!), and tested the output for physical consistency and maybe even conservation of some relevant properties. There is no doubt in my mind that this procedure can be improved by better use of unit testing (Why is that if statement there? What do you expect it to do? Can we produce a unit test?), but in the final analysis, most code modules are physically validated, not computationally or mathematically validated. In most physical parameterisations, I suspect that's simply going to remain the case ...
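To illustrate what I mean by testing for physical consistency and conservation, here is a toy in Python. It is emphatically not code from any real model: `mix_column` is a made-up mixing scheme that swaps a fraction of the difference between adjacent levels, and `check_physical_consistency` is the sort of unit test one could wrap around it.

```python
from typing import List


def mix_column(profile: List[float], k: float) -> List[float]:
    """Toy vertical mixing: exchange a fraction k of the difference between
    adjacent levels (hypothetical scheme, for illustration only)."""
    if not 0.0 <= k <= 0.5:
        raise ValueError("k must lie in [0, 0.5] for this toy scheme")
    new = list(profile)
    for i in range(len(profile) - 1):
        flux = k * (profile[i] - profile[i + 1])
        new[i] -= flux       # what leaves one level ...
        new[i + 1] += flux   # ... arrives in its neighbour
    return new


def check_physical_consistency(profile: List[float], k: float,
                               tol: float = 1e-12) -> bool:
    """Column total is conserved and no level goes negative."""
    mixed = mix_column(profile, k)
    total_before = sum(profile)
    conserved = abs(sum(mixed) - total_before) <= tol * max(1.0, abs(total_before))
    non_negative = all(value >= 0.0 for value in mixed)
    return conserved and non_negative


if __name__ == "__main__":
    # A handful of test cases, nowhere near the full space of possible inputs.
    for case in ([4.0, 1.0, 0.0, 2.0], [0.0, 0.0, 5.0], [1.0]):
        print(case, check_physical_consistency(case, 0.25))
```

Note what this does and doesn't show: the test passes for the inputs tried, which is physical validation over a sample, not a proof that the scheme is right everywhere.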

Then, the parameterisation has been tested against real cases. Ideally in the same parameter space in which it should have been used. For an example of how I think this should be done, you can see Dean et al, 2007, where we have nudged a climate model to follow real events so we can test a new parameterisation. This example shows the good and bad: the right thing to do, and the limits of how well the parameterisation performed. It's obviously better, but not yet good enough ... there is much opportunity for Kaizen available in climate models, and this sort of procedure is where hard yards need to be won ... (but it clearly isn't a formal validation, and we will find cases where it's broken and needs fixing, but we'll only find those when the model explores that parameter space for us ... we'll come back to that).

(For the record, I think this sort of nudging is really important, which is why I recently had a doctoral student at Oxford working on this. With more time, I'd return to it).

It might be possible to write terser code (maybe by two orders of magnitude, i.e 10K lines of code instead of 1M lines of code).

While I think this is desirable, I think the parameterisation development and evaluation wouldn't have been much improved (although there is no doubt it would have helped Jonathan, the doctoral student, if the nudging code could have gone into a tidier model).

The value of generalisation and abstraction is unappreciated, and the potential value of systematic explorations of model space is somehow almost invisible, or occasionally pursued in a naive and unsophisticated way.

I don't think that the value is unappreciated. There are two classes of problem: exploring the (input and knob-type) parameters within a parameterisation, and exploring the interaction of the parameterisations (and those knobs). The former we do as well as is practicable and I certainly don't think the latter is invisible (e.g. Stainforth et al, 2004 from ClimatePrediction.net and Murphy et al, 2004 from the Met Office Hadley Centre QUMP project). You might argue that one or both of those are naive and unsophisticated. I would ask for a concrete example of how else we would do this. Leaving aside the issue of code per se, we are stuck with core plus parameterisations - plural - aren't we?
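For what it's worth, the brute-force version of exploring interacting knobs looks something like the sketch below. It is not how ClimatePrediction.net or QUMP actually sampled their parameter spaces; the knob names, their values and `toy_model` are all invented, and a real "model" call would be a full integration rather than one line of arithmetic.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple


def knob_combinations(knobs: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield every combination of knob settings (a full factorial sweep)."""
    names = sorted(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))


def run_ensemble(model: Callable[..., float],
                 knobs: Dict[str, List[float]]) -> List[Tuple[Dict[str, float], float]]:
    """Run the model once per combination and keep settings with results."""
    return [(settings, model(**settings)) for settings in knob_combinations(knobs)]


def toy_model(entrainment: float, ice_fall_speed: float) -> float:
    """Hypothetical stand-in for a model run; returns a single diagnostic."""
    return 2.0 * entrainment + ice_fall_speed


if __name__ == "__main__":
    knobs = {"entrainment": [0.5, 1.0, 2.0], "ice_fall_speed": [0.5, 1.0]}
    results = run_ensemble(toy_model, knobs)
    print(len(results), "runs")
```

The number of runs is the product of the number of values per knob, which is exactly why the ensembles, resolution or physics trade-off doesn't go away however tidy the code is.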

(if) there is no way forward that is substantially better than what we have ... I think the earth system modeling enterprise has reached a point of diminishing returns where further progress is likely to be unimpressive and expensive ...

I'm not convinced that what we have is so bad. We need to cast the question in terms of what goals are we going to miss, that another approach will allow us to hit?

Which brings us to your point

... If regional predictions cannot be improved, global projections will remain messy,

True.

... time to fold up the tent and move on to doing something else... the existing software base can be cleaned up and better documented, and then the climate modeling enterprise should then be shut down in favor of more productive pursuits.

I think we're a long way from having to do this! There is much that can and will be done from where we are now.

I have very serious doubts about the utility of ESMs built on the principles of CGCMs. We are looking at platforms five or six orders of magnitude more powerful than todays in the foreseeable future. If we simply throw a mess of code that wastes those orders of magnitude on unconstrained degrees of freedom, we will have nothing but a waste of electricity to show for our efforts.

I don't think anyone is planning on wasting the extra computational power, and I think my original blog entry shows at least one community was thinking, and I know (since I'm off to yet another procurement meeting next week) continues to think, very seriously about how to exploit improving computer power.

On what grounds do you think improving the models, and their coupling, will not result in utility?

## Whither service descriptions

(Warning, this is long ...)

Last week I submitted an abstract to the EGU meeting in April in The Service Oriented Architecture approach for Earth and Space Sciences (ESSI10) session. I'd been asked to submit something, but I fear I may be a bit of a cuckoo in the SOA nest ... (if by SOA, we take a traditional definition of SOA=SOAP+WS-*).

The abstract can be summarised even more briefly in two sentences:

• there is a long tail of activities for which the ability of web services to open up interoperability is being hindered by the difficulty in both service and data description, so

• there is a requirement for both more sophisticated and more easily understandable data and service descriptions.

Some might argue that the latter is a heresy. Amusingly, I wrote that before I opened up my feed aggregator this week to find another raft of postings about service description languages. Probably the best, and easily most relevant, from my point of view, was Mark Nottingham, who wrote by far the most sage stuff. I'll quote some of it below. It looks like there was an ongoing discussion that hit the big time when Ryan Tomayko wrote an amusing summary which was picked up by Sam Ruby and Tim Bray.

It's hard to provide a summary that is more succinct than the material in those links and adds value (to me, remember these are my notes, I don't care about you [1]) so I won't! But I will write this, because I don't particularly want to wade through them all again.

#### On Resources

The first point to note is that most of the proponents of service description languages (particularly those from a RESTful heritage) are finally realising that it's not just about the verbs, the nouns matter too! It's fine to argue that you don't need a service description language because we should all use REST, but the resources themselves can be far more complicated beasts than standard mime-types, and so they need description too.

Mark Nottingham said it best in his summary:

Coming from the other direction, another RESTful constraint is to have a limited set of media types. This worked really well for the browser Web, and perhaps in time we'll come up with a few document formats that describe the bulk of the data on the planet, as well as what you can do with it.

However, I don't mean "XML" or even "RDF" by "format" in that sentence; those are the easy parts, because they're just meta-formats. The hard part is agreeing upon the semantics of their contents, and judging by the amount of effort it's taken for things like UBL, I'd say we're in for a long wait.

I've wittered on about the importance of this before, again and again. However, there are fundamental problems with using XML to describe resources. I've alluded to this issue too, but along with the summary by James Clark, I liked the way that it was put here:

One of the big problems with XML is that it is a horrid match with modern data structures. You see, it is not that it isn't trivial to figure a way to serialize your data to XML; it is just that left to their own devices, everyone would end up doing it slightly differently. There is no one-true-serialization. So, eventually, you end up having to write code to build your data structures from the XML directly. The problem there is that virtually all XML APIs are horrible for this kind of code. They are all designed from the perspective of the XML perspective, not from the data serialization perspective.

It gets worse. XML is one of those things that looks really easy, but is actually full of nasty surprises that don't show up until either the week before you ship (or worse.., a few weeks after). Things like character encoding issues, XML Namespaces, XSD Wildcards. It is really hard for your average developer (who makes no pretenses at XML guru-hood) to write good XML serialization/hydration code. Everything is stacked against him: XML APIs, XML -Lang itself, XSD.

At one time, I think I understood what it meant "Share schema not type", but now I don't ...
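The "no one-true-serialization" complaint quoted above is easy to demonstrate. The following is a minimal sketch (in present-day Python 3, using only the standard library) in which the record, its field names and the two layouts are all made up for illustration: the same data is written to XML in two equally defensible ways, and a reader written for one layout cannot rebuild the data from the other.

```python
import xml.etree.ElementTree as ET

# A hypothetical record; the field names are invented for this example.
record = {"name": "air_temperature", "units": "K", "levels": [1000, 850, 500]}

def as_attributes(rec):
    """Serialise scalars as attributes and the list as one space-separated string."""
    elem = ET.Element("variable", name=rec["name"], units=rec["units"])
    elem.set("levels", " ".join(str(v) for v in rec["levels"]))
    return ET.tostring(elem, encoding="unicode")

def as_elements(rec):
    """Serialise every field as a child element, with one <level> per list item."""
    elem = ET.Element("variable")
    ET.SubElement(elem, "name").text = rec["name"]
    ET.SubElement(elem, "units").text = rec["units"]
    levels = ET.SubElement(elem, "levels")
    for v in rec["levels"]:
        ET.SubElement(levels, "level").text = str(v)
    return ET.tostring(elem, encoding="unicode")

def from_elements(xml_text):
    """Rebuild the record, assuming the element-style layout and nothing else."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "units": root.findtext("units"),
        "levels": [int(e.text) for e in root.find("levels")],
    }

doc_a = as_attributes(record)  # attribute-style document
doc_b = as_elements(record)    # element-style document
```

Both documents are well-formed XML carrying the same information, yet `from_elements(doc_b)` gives the record back while `from_elements(doc_a)` fails. The syntax is shared; the agreement about what goes where is not, which is the part that needs describing.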

#### On the Service Description Language itself

Well, I've tried to review this sort of thing before. Since then, WADL has hit the big time. From a semantic point of view, I can't say I understand the big differences between WSDL and WADL, although I can appreciate that the WADL syntax is much simpler (and so it's a good thing).

Some folk, sadly including Joe Gregorio (whose work I mostly admire), have made a big deal out of the fact that there is no point generating code from WSDL (or WADL or any service description language), because if you do, when the service changes the WSDL (should) change, and so your code will need regeneration or otherwise your client will break. I think that's tosh (it's true, but still tosh, best put by Robert Sayre in a comment):

HTML forms are service declarations.

No one is arguing that we don't need HTML forms! The fact is that clients will break when services change! Sure, some changes won't break the clients, but some will! The issue really comes down to "How well coupled are your services and clients?" and, if they are strongly coupled, "Will you know when the service changes, and can you fix your client if it does?" From experience of using WSDL and SOAP (yuck), I know I'd MUCH rather simply get the new WSDL, and regenerate the interface types ... than muck around at a deep level. (That said, I'm not arguing in favour of SOAP per se! Today's war story about SOAP and WSDL is one set of new discovery client developers complaining about our "inconsistent use of camelcase" in our WSDL ... it seems that they're hand crafting to the WSDL, and they want us to break all the other clients to fit their coding standards).

Of course, me wanting to use a service description language presupposes I've used my human ability to read the documentation (if it exists, or the WSDL if I really have to), to decide whether such a solution is the "right thing to do".
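Robert Sayre's remark that HTML forms are service declarations can be made concrete. The sketch below (Python 3, standard library only) reads a form as a machine-readable description of a service and checks a client's expectations against it. The form, its URL and its field names are invented for illustration, and the `missing_fields` helper is a hypothetical name, not part of any library.

```python
from html.parser import HTMLParser

class FormDescription(HTMLParser):
    """Collect the method, target and field names of each form in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action", ""),
                "method": attrs.get("method", "get").upper(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self.forms and "name" in attrs:
            # Unnamed controls (e.g. a bare submit button) send nothing, so skip them.
            self.forms[-1]["fields"].append(attrs["name"])

# A made-up search form; the URL and field names are illustrative only.
PAGE = """
<form action="/discovery/search" method="post">
  <input type="text" name="term">
  <input type="text" name="startDate">
  <input type="submit" value="Search">
</form>
"""

def describe(page):
    """Return the list of form descriptions found in an HTML page."""
    parser = FormDescription()
    parser.feed(page)
    return parser.forms

def missing_fields(client_fields, page):
    """Fields a client expects that the first form on the page does not offer."""
    offered = set(describe(page)[0]["fields"])
    return sorted(set(client_fields) - offered)
```

A client that checks itself against the description finds out about a change (or a camelcase disagreement such as `startdate` versus `startDate`) before it sends a request, which is all the "regenerate from the new description" argument really asks for.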

#### What does this mean to me?

At the moment we use WSDL and SOAP in our discovery service. I'd much rather we didn't (see above). It could be RESTful POX, which is how we've implemented our vocabulary service (but inconsistent camel case would still break things). It probably will change one day. More importantly, for the data handling services, we're currently using OGC services, where the "service description language" is the Get Capabilities document. This much I know (and this is where I violently disagree with Joe G): it would be much easier to use a generic service description language than the hodgepodge of get capabilities documents we deal with. I think OGC Get Capabilities is an existence proof that a generic service description language would be a Good Thing (TM)! In the final analysis, that's probably what I'll say in April (as well as "SOAP sucks" and "You need both GML and data modelling").

1: I do really :-)

by Bryan Lawrence : 2008/01/22 : Categories ndg computing xml : 1 trackback : 6 comments (permalink)

## from every direction it's mitigation and acknowledgement

Most mornings now I get between half an hour and an hour of time to myself: between feeding my baby boy who wakes up around 5.30 to 6 am and getting my daughter up around 7 to 7.30 am ... I mostly spend the time reading, coding (for pleasure ... I have made some significant progress on the new leonardo), and just cogitating.

This morning it was reading; and it seemed like from every direction we have climate change adaptation and mitigation issues:

• Paul Ramsey on climate change, peak oil and the deep ocean (I read Paul for his commentary on GIS and Postgres ...)

• New Scientist noting that we may be near peak coal (full article at author's website) (who needs to explain why they read the new scientist?)

• The Observer reporting that the Severn Tidal power scheme takes another step towards actuality ... (when in the UK you do have to read a Sunday newspaper, they are the best in the world ...).

• Joe Gregorio on batteries (Joe writes sage stuff on python, web services and much else).

Now the thing is, I found all of them in my thirty minutes this morning. It's an eclectic bunch of sources, but that's my point! (... and yes, it is unusual for me to read things from the NS, the Observer and my akregator all within thirty minutes ... and none of the readings from the latter were from my environmental folder).

## Why is climate modelling stuck?

Why is climate modelling stuck? Well, I would argue it's not stuck, so a better question might be: "Why is climate modelling so hard?". Michael Tobis is arguing that a modern programming language and new tools will make a big difference. Me, I'm not so sure. I'm with Gavin. So here is my perspective on why it's hard. It is of necessity a bit of an abstract argument ...

1. We need to start with the modelling process itself. We have a physical system with components within it. Each physical component needs to be developed independently, checked independently ... This is a scientific, then a computational, then a diagnostic problem.

2. Each component needs to talk to other components, so there needs to be a communication infrastructure which couples components. Michael has criticised ESMF (and by implication PRISM and OASIS etc), but regardless of how you do it, you need a coupling framework. This is a computational problem. I think it's harder than Michael thinks it is. Those ESMF and PRISM folks are not stupid ...

3. All those independently checked components may behave in different ways when coupled to other components (their interactions are nonlinear). Understanding those interactions takes time. This is a scientific and diagnostic problem.

4. We need a dynamical core. It needs to be fast, efficient, mass preserving, and stable in a computational sense. Stability is a big problem, given that the various parameterisations will perturb it in ways that are quite instability inducing. This is both a mathematical and a computational problem.

5. We need to worry about memory. We need to worry a lot about memory actually. If in our discussion we're going to get excited about scalability in multi-core environments, then yes, I can have 80 (pick a number) cores on my chip, but can I have enough memory and memory bandwidth to exploit them? How do we distribute our memory around our cores?

6. What about I/O bandwidth? Without great care, the really big memory hungry climate models can often be slowed down, spinning empty CPU cycles while waiting for I/O. This is a computational problem.

Every time we add a new process, we require more memory. The pinch points change and are very architecture dependent. Every time we change the resolution, nearly every component needs to be re-evaluated. This takes time.

At this point, we've not really talked about code per se. All that said, the concepts of software engineering do map onto much of what is (or should be) going on. Yes, scientists should build unit tests for their parameterisations. Yes, there should be system/model wide tests. Yes, task tracking and code control would help. But, every time we change some code there may be ramifications we don't understand, not only in terms of logical (accessible in computer science terms) consequences, but from a scientific point of view, there might be some non-linear (and inherently unpredictable) consequences. Distinguishing the two takes time, and I totally agree that better use of code maintenance tools would improve things, but sadly I think it would be a few percent improvement ... since most of the things I've listed above are not about code per se, they're about the science and the systems.
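As an illustration of what a unit test for a parameterisation might look like in miniature, here is a hedged sketch in Python 3. The `toy_condensation` scheme is entirely made up for the example (it simply rains out any moisture above a saturation value) and stands in for a real parameterisation; the point is the kind of property being tested (conservation, bounds, doing nothing when nothing should happen), not the physics.

```python
import unittest

def toy_condensation(q, q_sat):
    """Toy parameterisation: rain out any moisture above saturation.

    q and q_sat are per-level lists in the same (arbitrary) units. Returns the
    adjusted profile and the total amount removed as precipitation.
    """
    adjusted = [min(qi, qs) for qi, qs in zip(q, q_sat)]
    precip = sum(qi - ai for qi, ai in zip(q, adjusted))
    return adjusted, precip

class TestToyCondensation(unittest.TestCase):

    def test_conserves_water(self):
        # Whatever leaves the profile must turn up as precipitation.
        q = [0.012, 0.009, 0.004]
        adjusted, precip = toy_condensation(q, [0.010, 0.010, 0.003])
        self.assertAlmostEqual(sum(adjusted) + precip, sum(q))

    def test_unsaturated_profile_is_untouched(self):
        q = [0.002, 0.001]
        adjusted, precip = toy_condensation(q, [0.010, 0.010])
        self.assertEqual(adjusted, q)
        self.assertEqual(precip, 0.0)

    def test_never_supersaturated_or_negative(self):
        q_sat = [0.01, 0.01]
        adjusted, _ = toy_condensation([0.02, 0.0], q_sat)
        for a, qs in zip(adjusted, q_sat):
            self.assertTrue(0.0 <= a <= qs)

def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestToyCondensation)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these catch the logical consequences of a code change. They say nothing about how the scheme behaves once it is coupled to everything else, which is the part that takes the time.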

So, personally I don't think it's the time taken to write lines of code that makes modelling so hard. Good programmers are productive in anything. I suspect changing to python wouldn't make a huge difference to the model development cycle. That said, anyone who writes diagnostic code in Fortran, really ought to go on a time management course: yes learning a high level language (python) takes time, but it'll save you more ... but the reason for that is we write diagnostic code over and over. Core model code isn't written over and over ... even if it's agonised over and over :-)

Someone in one of the threads on this subject mentioned XML. Given that there might be a climate modeller or two reading this, let me assure you: XML solves nothing in this space. XML provides a syntax for encoding something; the hard part of this problem is deciding what to encode. That is, the hard part of the problem is the semantic description of whatever it is you want to encode (and developing an XML language to encapsulate your model of the model: remember XML is only a toolkit, it's not a solution). If you want to use XML in the coupler, what do you need to describe to couple two (arbitrary) components? If it's the code itself, and you plan to write a code generator, then what is it you want to describe? Is it really that much easier to write a parameterisation for gravity wave drag in a new code generation language? What would you get from having done so?

So what is the way forward? Kaizen: small continuous improvements. Taking small steps we can go a long way ... Better coupling strategies. Better diagnostic systems. Yes: Better coding standards. Yes: more use of code maintenance tools. Yes: Better understanding of software engineering, but even more importantly: better understanding of the science (more good people)! Yes: Couple code changes to task/bug trackers. Yes: formal unit tests. No: Let's not try the cathedral approach. The bazaar has got us a long way ...

(Disclosure: I was an excellent Fortran programmer, and a climate modeller. I guess I'm a more than competent python programmer, and I'm sadly expert with XML too. I hope to be a modeller again one day).

## Walking the Leonardo File System in Pylons

Now that we have access to the filesystem, the next step to porting is to get a pylons controller set up that can walk the filesystem ...

Start by installing pylons (in this case, 0.9.6):

    easy_install Pylons


(watch that capitalisation: don't waste time with easy_install pylons ...)!

Now create the application ... in my case in a special sandbox directory called pylons ... and get a simple controller up ...

    cd ~/sandboxes/pylons
    paster create --template=pylons pyleo template_engine=genshi
    cd pyleo
    paster controller main


At this point it's worth fixing something that will cause you grief later: we've decided on genshi, so we need an __init__.py in the pyleo/templates directory. Create one now. It can be empty. We won't need it til later ... but best get it done now!

At this point if we start the paster server up with

    paster serve --reload development.ini


We can see a hello world on http://localhost:5000/main and a pylons default page on http://localhost:5000

The next step is to get rid of the pylons default page, and make our main controller handle pretty much everything (we wouldn't normally do this, but we're porting leonardo not starting something new). We do this by replacing the lines after CUSTOM ROUTES HERE in config/routing.py with:

    map.connect('',controller='main')
    map.connect('*url',controller='main')


and removing public/index.html.

Now http://localhost:5000/anything gives us 'Hello World'.

The next step is to get hold of the path and echo it instead of 'Hello World'. We do that by accessing the pylons request object in our main controller, which we have available since in main.py we inherit from the base controller.

Instead of

    return 'Hello World'

we have

    path=request.environ['PATH_INFO']
    return path
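The `PATH_INFO` lookup is plain WSGI rather than anything Pylons-specific: `request.environ` is the WSGI environment dictionary. As a rough sketch of what is going on underneath, here is the same "echo the path" behaviour written as a bare WSGI application in Python 3 with only the standard library. The names `echo_path` and `call_app` are invented for this example, and `call_app` just fakes a request so the application can be exercised without starting a server.

```python
from wsgiref.util import setup_testing_defaults

def echo_path(environ, start_response):
    """Plain WSGI application that returns the requested path as text."""
    path = environ.get('PATH_INFO', '')
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [path.encode('utf-8')]

def call_app(app, path):
    """Invoke a WSGI app directly with a fake environ; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the keys WSGI requires
    environ['PATH_INFO'] = path
    captured = {}

    def start_response(status, headers):
        captured['status'] = status

    body = b''.join(app(environ, start_response))
    return captured['status'], body.decode('utf-8')
```

Calling `call_app(echo_path, '/anything/at/all')` hands the application the same kind of dictionary that the controller sees as `request.environ`.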


And the next step is to pass it to a simple genshi template to echo it. We do this by

• making a simple template. Here is one (path.html which lives in the pyleo/templates directory):

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:py="http://genshi.edgewall.org/"
          xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
    <body>
    <div class="mainPage">
    <h4> $c.path </h4>
    </div>
    </body>
    </html>

• and calling it from main.py after assigning the path value to something in the c object (itself visible to the template). Replace those two lines in main.py that we just replaced before, with:

    c.path=request.environ['PATH_INFO']
    return render('path')

And now we're using Genshi to show us our path.

The next step is to bring the leonardo file system into play, so we put filesystem.py into the model directory. (As an aside: Pylons is Model-View-Controller. If that doesn't ring any bells, see this). Note that the MVC stuff is inside a directory embedded one more level than one might expect. That's why we have this weird structure: sandboxes/pylons/pyleo is the source for a new (distributable) egg, and sandboxes/pylons/pyleo/pyleo is where our pylons application (with MVC) lives.

Realistically, next time I do this, I won't be repeating every identical step, so I'm bound to stuff up. I might need to get some decent debugging. Set this by editing the development.ini file so that [logger_root] has level set to DEBUG rather than INFO. Be warned; it results in verbiage on the console!

Right, back to our thread ... In this first stab, we'll use filesystem.py as our model, and we'll simply put in place a view which walks the content and then displays the text for the moment (without a wiki formatter). Nice and straightforward. We simply modify our existing template, path.html:

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:py="http://genshi.edgewall.org/"
          xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
    <!-- Simple genshi template for walking the leonardo file system -->
    <!-- We'll be using the javascript helpers later, so let's make sure we have them -->
    ${Markup(h.javascript_include_tag(builtins=True))}
    <body>
    <div class="mainPage">
    <h4> $c.path </h4>
    <ol py:if="c.files!=[]">
    <li py:for="f in c.files">Page:<a href="${f['relpath']}">${f['title']}</a></li>
    </ol>
    <ol py:if="c.dirs!=[]">
    <li py:for="d in c.dirs">Directory:<a href="${d['relpath']}">${d['title']}</a></li>
    </ol>
    </div>
    </body>
    </html>

and our main controller. It's all in the main controller for now. We'll change that later. Meanwhile, our main controller now looks like this:

    import logging
    import os
    from pyleo.lib.base import *
    from pyleo.model.filesystem import *

    log = logging.getLogger(__name__)

    class MainController(BaseController):

        def index(self):
            ''' Essentially we're bypassing all the Routes goodness and using this
            main controller to handle most of the Leonardo functionality '''
            c.path=request.environ['PATH_INFO']
            #Later we'll move this elsewhere so it doesn't get called every time ...
            self.lfs=LeonardoFileSystem('/home/bnl/sandboxes/pyleo/data/lfs/')
            #ok, what have we got?
            dirs,files=self.lfs.get_children(c.path.strip('/'))
            if (dirs,files)==([],[]): return self.getPage()
            c.dirs=[]
            c.files=[]
            for d in dirs:
                x={}
                x['relpath']=os.path.join(c.path,d)
                x['title']=d
                c.dirs.append(x)
            for f in files:
                x={}
                x['relpath']=os.path.join(c.path,f)
                #look the file up by its full path, not just its name
                leof=self.lfs.get(x['relpath'])
                #print leof.get_properties()
                x['title']=(leof.get_property('page_title') or f)
                c.files.append(x)
            return render('path')

        def getPage(self):
            ''' Return an actual leonardo page '''
            leof=self.lfs.get(c.path)
            c.content=leof.get_content()
            if c.content is None:
                response.status_code=404
                return
            else:
                #This is a leo file instance
                c.content_type=leof.get_content_type()
                #for now let's just count these ...
                comments=leof.enclosures('comment')+leof.enclosures('trackback')
                c.ncomments=len(comments)
                if c.content_type.startswith('wiki'):
                    # for now, just return the text ... without formatting etc ...
                    return ''.join([c.content,'\n%s comments'%c.ncomments]+[com.get_content() for com in comments])
                else:
                    #map the leonardo content type onto a mime type
                    t={'png':'image/png','jpg':'image/jpeg','jpeg':'image/jpeg',
                       'pdf':'application/pdf','xml':'application/xml',
                       'css':'text/css','txt':'text/plain',
                       'htm':'text/html','html':'text/html'}
                    if c.content_type in t:
                        response.headers['Content-Type']=t[c.content_type]
                    else:
                        response.headers['Content-Type']='text/plain'
                    return c.content

and believe it or not, that's all we need to walk our filesystem, and return the contents ... it won't be a big step to add the formatter to get html, and to start thinking about how we want the layout to look in template terms. (We'll need to make sure we use a config file to locate our lfs as well). The steps after that will be to add the login, upload, posting, trackback, feed generation etc ... none of which should be a big deal ... but I don't expect to do them quickly :-)

Update: I'm pleased to report that while this is a fork from the leonardo trunk, as is the django version, all three code stacks are now jointly hosted on google. The pylons code is in the pyleo_trunk and this is rev 464 (I haven't worked out the subversion revision syntax for the web interface).

by Bryan Lawrence : 2008/01/03 : Categories python pyleo : 0 trackbacks : 0 comments (permalink)

## the leonardo file system

The most difficult thing about porting leonardo is interfacing with the leonardo file system (lfs). The lfs was designed to allow multiple backends through a relatively simple interface ... of course it's not properly documented anywhere, so remembering how it works was a bit difficult.
The following piece of code shows the general principle:

    from filesystem import LeonardoFileSystem
    import sys,os.path

    def WalkAndReport(leodir,inipath='/'):
        ''' Walks a leonardo filesystem and reports the contents in the same way
        as doing ls -R would do '''
        def walk(lfs,path):
            directories,files=lfs.get_children(path)
            for f in files:
                leof=lfs.get(os.path.join(path,f))
                #The following is the actual content at the path ... if it exists.
                #It's what you would feed to a presentation layer ...
                content=leof.get_content()
                print '%s (%s)'%(f,leof.get_content_type())
                for p in leof.get_properties(): print '---',p,leof.get_property(p)
                #check for comments and trackbacks ... is there any other sort?
                comments=leof.enclosures('comment')+leof.enclosures('trackback')
                #comments and trackbacks are leo files ...
                for c in comments:
                    for p in c.get_properties(): print '------',p,c.get_property(p)
            for d in directories:
                leod=os.path.join(path,d)
                print '*** %s *** (%s)'%(d,leod)
                walk(lfs,leod)
        lfs=LeonardoFileSystem(leodir)
        walk(lfs,inipath)

    if __name__=="__main__":
        lfsroot=sys.argv[1]
        if len(sys.argv)==3:
            inipath=sys.argv[2]
        else:
            inipath='/'
        WalkAndReport(lfsroot,inipath)

While I'm at it, I'd better document a small bug in the leonardo file system itself that manifested itself on this blog (python 2.4.3 on Suse 10) but nowhere else ... the comments came back in the wrong order. The following diff on filesystem.py fixed that:

     def enclosures(self, enctype):
    +    #BNL: modified to reorder by creation date, since we can't
    +    #rely on the name or operating system.
         enc_list = []
         for d in os.listdir(self.get_directory_()):
             match = re.match("__(\w+)__(\d+)", d)
             if match and enctype == match.group(1):
                 index = match.group(2)
    -            enc_list.append(self.enclosure(enctype, index))
    -    return enc_list
    +            e=self.enclosure(enctype, index)
    +            sort_key=e.get_property('creation_time')
    +            enc_list.append((sort_key,e))
    +    enc_list.sort()
    +    return [i[1] for i in enc_list]

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 0 comments (permalink)

## Playing with pylons and leonardo

I've suddenly been granted a couple of hours I didn't expect, so I thought I'd take the first steps towards forking leonardo (sorry James), so that we have a pylons version. I know James has a Django version, but I want pylons for a number of reasons:

• I've done a lot of work with pylons. I understand it.

• I have a number of extensions already to James' codebase (including full trackback inbound and outbound). (James' subversion got broken: we never resolved it, so they never got committed back).

• I want to build an egg, which allows external templating (i.e. you can completely control the look and feel via a genshi template or use the default within the egg).

• I want to do all of this so I have a nice small job which exercises and documents all the skills I've built up building the NDG portal (in my spare time) over the last few months.

• I want to cache documents more efficiently (trivial in pylons).

• I want to be able to produce archiveable versions for previous years (not so trivial). Tim Bray reminded us all that this is important!

I expect it will take months to do what I expect to be a few hours coding :-( I wonder how well this will compare with previous announcements!

I guess we'll need a new name. pyleo will do, in this case for pylons leonardo.

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 2 comments (permalink)

## virtualenv

One more thing to remember.
I'm going to be building pyleo using pylons 0.9.6.1, but the ndg stuff (also on my laptop) is using pylons 0.9.5. Library incompatibility is scary. Fortunately, we have virtualenv to the rescue. Using virtualenv, I can build a python instance that is independent of the stuff built into my main python (which is a virtual-python for historical reasons). It's better than virtual-python because I get the benefits of things in my system site-packages that I've installed since I installed my virtual python.

What to remember? (I really will be forgetting things like this when there are weeks between activity ... in particular, this way I'll know which python to use!)

Well, I built my new virtualenv instance by typing

    python virtualenv.py pyleo

and I can change into it any time I like with

    source ~/pyleo/bin/activate

I expect I'll be able to ensure I use this python in my (test) webserver when I get to it. It looks like I need to adjust the path to the libraries inside the outer script with

    import site
    site.addsitedir('/home/bnl/pyleo/lib/python2.5/site-packages')

I used virtualenv 0.9.2. It looks like I can make ipython respect this by copying /usr/bin/ipython into my ~/leo/bin and editing it to use ~/leo/bin/python ...

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 0 comments (permalink)

## discovery and google

Knol is a case of not so much of changing the Google mission, but the battlefield. Google now realise it's easier to organise well marked up content and have figured out a way to deal with what is known as metacrap by centralising where content is manufactured.

Folk keep asking me why we bother building data discovery engines, and I keep saying that when Google has figured out how to automatically add semantic value to plain text we can stop bothering ... (and yes, I do know about these folk and their ilk).

1: Bill, should you ever see this, sorry, unicode is broken in this version of Leonardo ... I really need to fix it ...
but I suspect rewriting Leonardo might be my method and that won't be any day soon.

by Bryan Lawrence : 2007/12/19 : Categories ndg : 5 trackbacks : 2 comments (permalink)

## raw and refereed data

I'm still chasing up peer review of data. My, it's hard to find anything useful in the peer reviewed literature on the subject, or indeed on the wider Internet ...

Meanwhile, I came across this:

... Of course if the papers that mislead us had links to the raw data of their experiments we would have been able to spot these errors much more quickly. This is a huge advantage of doing open source science. With the proliferation of new forms of scholarship, it is no longer effective to teach our students that there is such a thing as reliable literature. There is uncertainty in every information source ...

It would be helpful if even (or maybe ESPECIALLY) the raw data came with information that would help assess its reliability and uncertainty ...

by Bryan Lawrence : 2007/12/17 : Categories claddier curation (permalink)

## A definition of intelligence

This quote is apparently from Piaget, but I got it from Calvin via a comment on Jeff Fleck's blog:

intelligence is what you use when you don't know what to do, when neither innateness nor learning has prepared you for the particular situation.

Memorable! The context was pretty interesting too:

William H. Calvin, Pumping Up Intelligence: Abrupt Climate Jumps and the Evolution of Higher Intellectual Functions during the Ice Ages, in The Evolution of Intelligence, edited by R. J. Sternberg (Erlbaum, 2001), pp. 97-115

by Bryan Lawrence : 2007/12/04 : Categories environment (permalink)

## google good

I've been less than positive about Google in the past, but this, via Joe Gregorio, is enough to impress me back onto their side of the fence ...

Google (NASDAQ: GOOG) today announced a new strategic initiative to develop electricity from renewable energy sources that will be cheaper than electricity produced from coal.
The newly created initiative, known as RE<C, will focus initially on advanced solar thermal power, wind power technologies, enhanced geothermal systems and other potential breakthrough technologies. RE<C is hiring engineers and energy experts to lead its research and development work, which will begin with a significant effort on solar thermal technology, and will also investigate enhanced geothermal systems and other areas. In 2008, Google expects to spend tens of millions on research and development and related investments in renewable energy. As part of its capital planning process, the company also anticipates investing hundreds of millions of dollars in breakthrough renewable energy projects which generate positive returns.

...

"With talented technologists, great partners and significant investments, we hope to rapidly push forward. Our goal is to produce one gigawatt of renewable energy capacity that is cheaper than coal. We are optimistic this can be done in years, not decades." (One gigawatt can power a city the size of San Francisco.)

by Bryan Lawrence : 2007/12/03 : Categories environment (permalink)

## Have ubuntu gone mad?

Apparently Gutsy Gibbon is great. Great, that is, as long as you don't have an ATI graphics card on a laptop and want to use suspend and/or hibernate.

What is the point of having beta releases if you ignore what the punters say? Is an obscure kernel change that makes a few percent performance difference more important than basic functionality for a large class of users? Apparently.

I think this sort of wild disregard for users from techno geeks is what puts folk off linux in general and distros specifically.

Not that it will make one iota of difference to Mark Shuttleworth, but I'm really pissed off. This was an avoidable mistake! Maybe ATI will fix this in their next driver drop, but that didn't stop me wasting a few of my precious evening hours wondering why I couldn't get gutsy to behave. It's crap!
(Update: 27/11/07: ATI have released a new driver, but it probably doesn't support my card: a Mobility FireGL V5200 in a Lenovo T60p. I say "probably" because the ATI website doesn't seem to believe this card exists - it doesn't appear on any drop down lists, and the V5000 which does, seems only to be supported by a much older driver, which won't work on the new kernel!)

(Update: 21/12/07: Credit where credit is due. It looks like ATI have responded to my bleating on their support page. The new driver explicitly lists the FireGL V5200 - if not the mobility version, but I'll take that risk. Hopefully I'll find time over Christmas to upgrade.).

by Bryan Lawrence : 2007/11/21 : Categories ubuntu kubuntu (permalink)

## On peer review (of grants)

Peer review comes in for its fair share of brickbats, but not too many academics are seriously against it! Indeed we rely on it. Peer review comes in many guises, and is applied in many ways, in many places, for many tasks. However, for all the experience we all have of peer review, there are few instances of "best practice", examples of which we might all agree: "that's how it should be done"!

Over on Michael Tobis's blog, a few of us had a bit of a diatribe about the way peer review is generally applied by research funding bodies: we argue that such bodies end up making decisions without all the information they could (relatively) easily obtain (albeit, probably, because that way they are protecting their reviewers from fatigue).

As it happens, right now we're considering how peer review should be applied to the assessment of datasets for publication. In the process of working up our procedures, I have been delving a little into the published literature about peer review, and some of what I found has relevance to the previous discussions ...

The UK Boden Report from 1990 to the then Advisory Board for the Research Councils (2.8 MB pdf) was a pretty interesting read (yes, seriously :-).
Although it's nearly twenty years old, and predates the Internet, the key findings are pretty robust even now, especially the one that essentially states "there is no other game in town!" It looks like most of the recommendations have been dealt with, but apropos the previous discussions, I was struck with this:

4.41 There is a view in the academic community that one needs to be "visible" rather than merely "good" to get support from the Councils. The general question here is: does one review the applicant or the application? Roughly half the responses to the Group concluded that undue weight on track record disadvantages the young [1]. However, the other half put the case that more attention should be put to track record since, research evidence suggests, evaluation of past performance is a good indicator of future results. They also made the point that backing "known" winners could lead to considerable savings in peer review cost.

and

4.42 ...the onus should be on Councils, and their committees to gain adequate knowledge of such track record as there is and to weight this [2] with due reference to the experience of the applicant.

I can't say I've direct experience of such conversations (re weighting track records) happening in the (few) moderating panels I've been involved with ... but that may be statistics of small numbers. Perhaps it happens! There are certainly new investigator schemes ...

Which brings us to this:

4.47: ... We are not unsympathetic to the idea of using interviews as part of peer review practices, since they do allow a full examination of the project and an opportunity for the applicants to respond to peer's queries. Our doubts rest on the high cost of this strategy ...

And that's the nub of the problem. Funding bodies are always trying to drive down the frictional costs of running research, and keep as much as possible of the research budget for doing research (good).
That said, I think there is plenty of scope for using new technology (video conferencing, instant messaging etc) to improve the information available to peer review panels.

What we also need to do is find ways of improving the external peer review itself. That there are problems is not doubted. Amongst the vast literature on this subject, the following is as succinct as it gets:

There is little data defining the accuracy or reproducibility of peer review

(Ernst & Resch, 1994 [3]), who went on to find, in an experiment on reviewing consistency in clinical medicine, that there was a significant impact of reviewer bias on referee judgement, and room for improvements in fairness and consistency in peer review.

But even where the right people are doing it, and being as even-handed as possible, they simply don't have time. We need ways of making it more acceptable to spend time doing it. I hadn't heard of peer miles before, but ideas like that can't hurt. NERC pay folk to be on the peer review college, but the pay is derisory for the extra workload involved: quite clearly NERC is exploiting co-funding (and lots of it) from the referees' employers. It would be nice to see mechanisms invoked to reward good quality work in that time. Some journals give out awards for "good quality reviewing"; perhaps NERC could do that too [4], but universities would have to find ways of rating such awards in their departmental brownie points systems (or however they do internal load balancing).

Again, the Boden report has been here first:

5.4 The time spent by applicants and peers is not a cost to the Councils: it is a cost to the employing institutions [5]

5.14 ... Spending around 1-1.5% of total expenditure on peer review seems a reasonable use of resources.

My initial reaction to this was to consider that this statement could be bolstered by some metric of the success of the process in terms of quality of research delivered.
Then I realised that it's mighty hard to measure the success of the research that might have been delivered had it made it through! Maybe there is a role for Bayesian statistics there ... (some have tried it for journals, e.g. Neff and Olden, 2006 [6]), a challenge for James?

I wonder whether the extra costs of making the process a bit more interactive could be quantified? If they could be, and they were around 0.5%, I'd reckon that money well spent ...

1: but this point is dealt with elsewhere, where they conclude the evidence for this is inconclusive.

2: my italics.

3: J. Laboratory and Clinical Medicine, 124, 178-182.

4: but they'd have to be careful that they didn't measure good quality by "number of grants reviewed"; given my experience of the NERC peer review college, that's a high risk!

5: As I said, even when they pay for college membership, this is the case!

6: Bioscience, 56, 333-340.

by Bryan Lawrence : 2007/11/20 : Categories curation (permalink)

## Advice from Boden for FAB

Boden (see my last post) also had some things to say about some issues bedeviling my time:

4.55 Problems for interdisciplinary proposals are exacerbated if they lie across Research Council boundaries ...

Times haven't changed. I was at a meeting yesterday where we were considering how NERC could improve its support for the technology underpinning environmental science. I was assured that not only has NERC improved its procedures so that the response mode peer review system will look more favourably on technology activities, but that NERC would be working with EPSRC on technology bids. I look forward to the reality. History says both of these are easy to say ...

In the same context (technology) Boden had even more advice for us:

Recommendation vi): ... Streamlined peer review procedures for small investments are appropriate; Councils might look for opportunities to devolve work to institutions ...
Well, that's exactly what our little group will recommend: For some classes of prototyping and feasibility studies, using (specific) existing NERC institutions to manage these would be better than letting Swindon do it ... it remains to be seen whether NERC will buy it!

by Bryan Lawrence : 2007/11/20 (permalink)

## the code is just the code

I've just spent a bit of time working through how we document the models which produce the climate predictions we hold at BADC. Then this lunch time I read this:

I came to understand that the way the word "model" is used in the climate sciences is confusing. An executable software package (a "program") is often called a "model" but this overvalues the code and undervalues the model. The code is an attempted embodiment of the model. The model is the science. The realization of the model ("running the code") is the prediction. The code itself is just an instrument. It's hopeless to demand that we stop calling it a "model". It's just too ingrained. We should be aware, though, that this is sloppy thinking. The code is just code.

How true! I guess I had some of this in mind as we've developed our NumSim schema for describing simulations (which I may well describe here anon), but other similar activities are predicated on "self-documenting" codes. We're about to start a major (EU funded) project which will, amongst other things, improve how we document simulations for future model intercomparisons, so it's quite apposite to read this now!

I think it's important not to get hung up about documenting the code per se, although we need to understand the distinction between codes. Many of my colleagues think these model description activities are about "reproducibility"; me, I think it's about "understanding" (I think reproducibility is a hard ask).
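To make the model/code/prediction distinction above concrete, here is a minimal sketch of how the three might be recorded separately. It is purely illustrative: the class names, field names and example values are hypothetical and are not taken from the NumSim schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """The science: what is represented, independent of any program."""
    name: str
    description: str


@dataclass(frozen=True)
class Code:
    """One attempted embodiment of a model (there may be several)."""
    model: Model
    version: str


@dataclass(frozen=True)
class Simulation:
    """One realisation: a particular code run for a particular experiment."""
    code: Code
    experiment: str


def describe(sim: Simulation) -> str:
    """Summarise a simulation by walking back from the run to the science."""
    code = sim.code
    return (f"simulation '{sim.experiment}' ran code version {code.version}, "
            f"an embodiment of model '{code.model.name}'")


# Hypothetical example values, for illustration only.
example_model = Model(name="example-model", description="placeholder description")
example_code = Code(model=example_model, version="1.0")
example_sim = Simulation(code=example_code, experiment="control-run")
summary = describe(example_sim)
```

Keeping the three records separate means two code versions can point at the same model description, which is one way of documenting for "understanding" without documenting the code itself.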
by Bryan Lawrence : 2007/10/31 : Categories climate badc : 0 trackbacks (permalink)

## All for one and one for all

On Wednesday and Thursday I attended the first ever NERC-wide data management workshop [1], where we tried to bring together everyone in the NERC designated data centre [2] extended family. The event was a great success, and I hope it will be repeated! There is great scope for improving the delivery of our individual programmes by learning about how others do things, and there is great scope for enabling new interactions!

Years ago (2001), NERC had a review of its data centres, and the review concluded that the designated data centres should be combined into one distributed data centre. NERC didn't follow that advice, and for good reasons: each of the data centres is embedded in its community, and their strength depends on their disciplinarity. Nonetheless, that strength is also a point of weakness: the existing funding fault lines aren't good at enabling information transfer across discipline boundaries, yet it's just such information transfer that is crucial in dealing with the big interdisciplinary problems we are facing today [3]!

The upcoming NERC strategy (due in a few weeks) addresses the interdisciplinary problem head on by recomposing NERC's attention in a number of specific cross-cutting themes, rather than down discipline specific lines, but, and this is a key but, the delivery plan still expects this to be delivered by discipline specific institutes (whether NERC owned or not). I think this is the right approach: by definition, inter-disciplinary research depends on there being foundations of discipline specific expertise, and funding methodologies that encourage both that discipline specific expertise and activities on the margins!

It's nice that the NERC data centre community is ahead of the game: we're trying hard to improve our interdisciplinarity with meetings like this one, and projects like the NDG.
1: The workshop webpage is unfortunately not public.

2: NERC has seven designated data centres plus some other major activities.

3: Yes, I know, we've always been facing them, but now these problems are being actively confronted ...

by Bryan Lawrence : 2007/10/12 : 0 trackbacks : 0 comments (permalink)

## Why not dinosaurs?

One of the talks at the data conference was given by Lee-Anne Coleman, the head of Science, Technology and Medicine at the British Library. In her talk she was positing a role for the British Library as a host to digital data archives. Given that I'm on record as thinking that institutional repositories and data are a bad idea, what did I think about this?

Well, my first question was: Why not dinosaurs?

You (the reader) are not alone in finding that cryptic. So did everyone else in the hall! But what I meant was, we don't put our dinosaur bones in libraries, we put them in museums, and we do that for a purpose! So, why not digital data? Well, if we take digital away from that sentence, and we (legitimately) consider bones as data, and we know we don't put them in libraries, then the conclusion a priori ought to be there isn't a role for (implicitly all) digital data and libraries.

So the question is better posed as "Is there a role for some digital data and libraries?" Of course there is: books are now digital data, and as she said, so are recordings and videos etc [1]. So, there is a definite role for some digital data, but all types of data, definitely not!

She went on to admit no intention to curate primary and discipline specific digital data, but a desire to "link to curated data" and to "identify and hold" reference datasets. Well, I was still in the dark, because (as I explained in a question from the floor), the problem is that while the library uses woolly phrases like "digital data", they will get resistance from those of us who understand the complexities of the problem.
I pleaded for the library to start being explicit about the types of data (and by type I mean explicitly rather more and less than the format of the data). I mean type in the same sense as I did when I discussed interoperability, something I can name which has real world meaning. Types such as "recording" or "mp3" [2] carry enough information that I can assess the ability of the BL to hold data conforming to those types. What I want is a countable list of types of data that the library are interested in, and why, and how they are going to maintain them. When I get such a list, I'd be happy to weigh in with my opinion, but until then: why not dinosaurs?

There are many institutions who want to become bit buckets, and while that's an important role, the contents of those bit buckets are useless without a custodian community. If libraries can incorporate such custodian expertise (or rely on it being so pervasive in society that when format and semantic migration are necessary the expertise will be available and affordable), then absolutely, get involved! But if not, by actively claiming to hold the data, they're providing a false impression of information persistence!

All that said, I do believe in the idea of the BL holding reference datasets, but they have to make sure they understand what they've got and how they're going to persist them. Books and documents are easy [3]!

I believe in the proposed role for improving search and navigation (better resource discovery). I've invested enough of my life in this area already to know that there is much to do, and that libraries already have much to give!

I also think there's a potential role for libraries in ontology and standards governance (not, you will notice, in constructing or devising these things). Too many standards bodies have business models that get in the way of their function (ISO: are you listening?), whereas libraries, and in particular the BL, understand their duty to society!
I was particularly heartened that Coleman introduced this idea herself in the context of a role for naming the unique associations between authors and their products in the "dataspace" (a role that they already play for books). Doing this of course would have them taking a role in URI maintenance!

Finally, I'm in two minds about using libraries as "bit buckets of last resort", that is, "the place to give your data when you're about to be closed down". This idea was floated from the floor, twice! If, in this situation, one can't find an organisation with the knowledge and capacity to take the data, and it's in any way "specialised", then the reality is that it's probably dead already, and giving it to the library may only be buying time - fine if there is a white knight on the horizon, but otherwise a cost with little prospect of reward (this is not the same as preserving a document or anything else with commonly understood semantics and syntax)! As a tax payer I'm not convinced that my taxes should pay for the storage of impenetrable bits!

1: The list could be much longer; I in no way mean to imply a limited list here.

2: Although it's a format, we know that they are recordings, so in this case the format name carries the feature type semantics.

3: Or are they? Amusingly, the powerpoint of the BL presentation online on the conference website has fonts that I don't have, and I can't easily read it - ironic really, given the topic and venue, but I've been there too. Roll-on pervasive use of odf for "powerpoint".

by Bryan Lawrence : 2007/10/12 : Categories curation (permalink)

## Bugger

(That's not a profanity down under!)

Gutted! I'm too depressed to write anything in my own words, but while I don't want to be called a bad sport, it's hard to argue with this:

...Barnes was a hugely influential figure in this match.
His sinbinning of Luke McAlister, his miss of the forward pass that led to a French try and the questionable French hands in a ruck at the very end of the match have been well-documented ...

So, the refereeing mistakes were involved in 17 of the French points, and cost us points... but:

... Now before you go off about this really starting to sound like sour grapes, I'll be the first to admit that Barnes wasn't the lone cause of the All Blacks' shocking exit. There was plenty of blame to be placed on the New Zealand and playing staff. Things like silly passes, spilled ball, poor option taking, the lack of a dropgoal attempt at the end are a few worth mentioning. Then you could go on about questionable selections, rotational policies that saw most players spending more time on the sidelines than the playing field ...

However, it was probably best said here:

There was an aimlessness about their play in the second half that was excruciating in the extreme. Individually you couldn't say anyone had a particular shocker. But collectively they were off the mark - and that is the very point. Wayne Barnes and the French must also cop it. The referee made some dubious decisions that cost the New Zealanders dearly, particularly the sinbinning of Luke McAlister. Atrocious call. But one they had to be good enough to take in stride. And those French? Tres magnifique in their passion and commitment. The scoundrels.

Bugger! Four more years ...

by Bryan Lawrence : 2007/10/08 (permalink)

## playing with grape

One of my major scientific interests is the grape project. We hold the data here. Today, I wanted a quick look at the data, so I thought I'd use our archive as a naive user (it's sadly easy to be naive nowadays :-( ). Anyway, what I found wasn't a great advertisement for the BADC ...

The data is documented as being in hdf format. Which version? (It matters!) I happened to know it was hdf5, so that's ok for me but not for Joe Average.

Tools to read it?
Well, straight away I went to pytables. Installing it and getting it to work wasn't as straightforward as I hoped - mainly because my virtual python seemed to stuff up the paths to the hdf5 library. In the end, I resorted to virtualenv, and it worked.

Documentation of the datafile? No!

Self documenting? Well, it's not CF netcdf, but this is what we got:

    ~hdf/bin/python
    import tables
    h2=tables.openFile('grape-atsr2_15835_199805010718_l2v02.hdf','r')
    /usr/lib/python2.5/site-packages/tables/File.py:227: UserWarning: file grape-atsr2_15835_199805010718_l2v02.hdf exists and it is an HDF5 file, but it does not have a PyTables format; I will try to do my best to guess what's there using HDF5 metadata
      METADATA_CACHE_SIZE, nodeCacheSize)
    >>> print h2
    grape-atsr2_15835_199805010718_l2v02.hdf (File) ''
    Last modif.: 'Tue Oct 2 14:56:21 2007'
    Object Tree:
    / (RootGroup) ''
    /GEO (Group) ''
    /GEO/LatLon (Array(2L, 4608L, 191L)) ''
    /ORACinput (Group) ''
    /ORACinput/angles (CArray(4L, 4608L, 191L), zlib(6)) ''
    /ORACinput/jultime (CArray(4608L, 191L), zlib(6)) ''
    /ORACinput/land (CArray(4608L, 191L), zlib(6)) ''
    /ORACinput/rad error (CArray(7L, 4608L, 191L), zlib(6)) ''
    /ORACinput/radiances (CArray(7L, 4608L, 191L), zlib(6)) ''
    /ORACoutput (Group) ''
    /ORACoutput/aerosol data (CArray(3L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/aerosol error (CArray(3L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/albedo (CArray(3L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/cloud data (CArray(8L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/cloud error (CArray(8L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/cost (CArray(2L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/flag (CArray(4608L, 191L), zlib(6)) ''
    /ORACoutput/iteration (CArray(4608L, 191L), zlib(6)) ''
    /ORACoutput/phase (CArray(4608L, 191L), zlib(6)) ''
    /ORACoutput/residuals (CArray(7L, 4608L, 191L), zlib(6)) ''
    /ORACoutput/retrv quality (CArray(4608L, 191L), zlib(6)) ''
    /ORACoutput/snow flag (CArray(4608L, 191L), zlib(6)) ''

So, there are some arrays with some variables in
them? Well, I know we documented these, but I don't know where ... (more black marks for BADC).

I've documented this here, both for my own benefit, and to admit publicly we've got some datasets in a poor state of health. This one at least I can and will fix, and I think I'll also institute some better BADC dataset reviewing.

by Bryan Lawrence : 2007/10/02 : Categories climate (permalink)

## On SQL and XML

I'm going to write about CouchDB sometime; there seems to be enough hype for me to record something for my own benefit (no, not you dear readers, some of this blog writing is mainly for me :-) But before I do, some of the current furore is interesting in its own right, not just because of CouchDB per se. For example, Assaf Arkin:

Relational databases have failed the software industry in much the same way XML, Java and client-server failed the software industry. In other words, no failure to see here, move along. Those are all excellent technologies for solving a wide range of problems. Just that there are some problems they're particularly poor at solving.

Take home message? There is no one solution to all problems!

Of more relevance to the subject of this post:

SQL is great when you have highly structured data. The problem is much of the data we generate day to day isn't easily extractable into carefully planned schemas and are challenging to represent and query in a SQL databases. That means lots of useful data that could be stored and queried ends up unused or lost because we don't have the time and resources to build schemas to store them.

and

I'll tell you the one thing XML is good for (and I could be wrong because I really don't know many alternatives), it's good for marking-up textual documents. For anything else, ESPECIALLY PROGRAMMATIC INTERFACES, it's a goddamn nightmare. I finally saw the light. JSON also has warts, but it has been an absolute dream in comparison.
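The complaint in that last quote about programmatic interfaces can be illustrated with a small sketch using only the Python standard library. The record and its field names are hypothetical, chosen only to show the shape of the problem, and the XML layout is just one of many possible mappings.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: the field names are illustrative only.
record = {"name": "example dataset", "years": [1998, 1999], "format": "hdf5"}

# JSON: lists and dictionaries map directly, and types survive the round trip.
as_json = json.dumps(record)
from_json = json.loads(as_json)


def to_xml(rec):
    """Hand-written mapping from the record to one possible XML layout."""
    root = ET.Element("dataset")
    ET.SubElement(root, "name").text = rec["name"]
    ET.SubElement(root, "format").text = rec["format"]
    years = ET.SubElement(root, "years")
    for year in rec["years"]:
        ET.SubElement(years, "year").text = str(year)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """The reverse mapping: everything arrives as text, so we convert by hand."""
    root = ET.fromstring(text)
    return {
        "name": root.findtext("name"),
        "years": [int(y.text) for y in root.find("years")],
        "format": root.findtext("format"),
    }


as_xml = to_xml(record)
from_xml_record = from_xml(as_xml)
```

Both round trips recover the record, but the XML path needs a mapping to be designed, written and maintained in each direction, and that effort grows with the complexity of the data model.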
Both of these quoted points I had understood vaguely, but now absolutely understand after slaving away on the NDG over the last couple of months. I wish I had known these things before; it may have saved us significant grief.

Now, do I believe that XML is the right tool for data models?

by Bryan Lawrence : 2007/09/18 : Categories ndg computing metadata : 1 trackback (permalink)

## On Measuring Research Outputs

Our department (Space Science and Technology) within the STFC is undergoing an exercise aimed at measuring our research quality. The mechanism involves bringing together a small panel of folk who go over some metrics, listen to some presentations, and make some recommendations. It's been hard for those of us in Earth Observation and Atmospheric Science (and in particular, within the Centre for Environmental Data Archival): very little of our income is "for research", and most of what we do get is very applied research money, but that's a story for another day.

One of the things the panel did was ask for our "paper" outputs since 2002, and they did some citation metrics based on that. They did this for all the divisions within the department. The panel does contain representatives from a range of the disciplines within the department, but doesn't run the entire gamut, so the panel will of necessity have to draw some conclusions based on metrics, rather than an assessment of what they heard ...

The difficulty with this of course is that they then have to compare apples with oranges, and by limiting themselves to the last five years they've made that task very hard. I suspect the limit was by analogy with the Research Assessment Exercise, but it's important there to remember that in RAE exercises academics have declared their most important works (explicitly NOT restricted to papers), and expert panels rank the works' impact; they don't rely on metrics alone!
As an exercise in the misuse that one can make of a single metric in comparing apples and oranges, I went down to the library and took the first ten papers from the last issue of 2006 in each of one journal from atmospheric science and one from astronomy. For each paper, I simply went through the cited papers and built up a histogram of the year each was published. The raw results are in the two histograms on the left hand column of the following figure:

(Note that the papers for years prior to 1990 are lumped into 1990.) The right hand column is the data after removing the two papers with the lowest and highest number of publications since 1991, in a crude attempt to remove "single-paper" bias in my very small sample.

There is one obvious conclusion one can be tempted to draw from these data: Atmospheric science tends to cite older papers!

• In part this is driven by the fact that the time from submission to appearance has traditionally been quite long (longer than other disciplines, and while this is getting better, it still appears slower ...). (Other factors include the topics involved: if you're writing a paper about an event that happened in 2002 you might not have as much older material to cite etc etc).
• This introduces a "citation" latency in comparisons: the median year of citations since 1991 is 2001 in astronomy and 1999 in atmospheric science.
• As a consequence, papers since 2002 have a lower impact (in terms of papers since 1991, 38% of papers cited in atmospheric science compared with 53% in astronomy).
• If this is a proxy for any individual's impact as well ... this means the "citation" efficiency of an atmospheric scientist is "lower" than an astronomer's.
• Thus, if one restricts the entire analysis to papers since 2002, then the impact of any individual must be lower (in this metric) in atmospheric science than astronomy.
• This is obviously compounded by the number of papers actually published per person, which is much higher in astronomy (I'll not comment here on why that might be, beyond the obvious statement that different disciplines have different implementation workloads from experiment/observation/simulation conception to results being obtained, even without any analysis time workload issues).

#### Disclaimer

This analysis isn't supposed to reflect on the actual worth of either community, simply to say that if one started with the hypothesis that this citation metric measured an individual's worth, then the numbers will almost force one to conclude that the average astronomer is a better researcher than the average atmospheric scientist. I think you will know that I don't believe that (in fact I don't believe that the statement could be proved, one way or the other!). Further, I know the panel didn't have this explicit comparison in mind; this isn't meant to be an apologia - this little analysis is simply to show how metrics can take you places you didn't want to go ...

by Bryan Lawrence : 2007/09/07 (permalink)

## Crystal Ball Gazing

There are lots of things happening at the moment that I would like to have time to dabble with enough to get a feeling for the best strategic path for my group. They fall into a number of categories (in no particular order):

1. Geospatial tools including KML etc
2. Databases and Filesystems
3. Access Control Technologies
4. Metadata thinking and tooling
5. Scientific Packages
6. Data Delivery Services

Over the next few weeks I want to produce a wee blog posting for each of these areas, summarising what I'm (trying to) keep an eye on (and perhaps who/where provides the best info I can find on these topics). I need to do this to focus my mind and time ...

But first, I plan to go on holiday next week.
I'm inspired by Sean to want to take my (two year old) daughter camping, but have to face the fact that this year doing my bit with the wee boy is probably the higher priority. Next year, though, we're going camping, by hook or by crook. Call me on it if I haven't reported on a camping trip by the end of August 2008.

by Bryan Lawrence : 2007/09/07 (permalink)

## the missing two months.

Two months to the day since I got "that phone call" and rushed off home from work, I've had my first "lunch time feed read". Of course I've been watching stuff go by, but not paying as much attention as I'd like.

So what have I been doing? Well, regulars will know what has been chewing up my non-work time. During my work time over the last couple of weeks I've been coding with the rest of the ndg team on the pylons stack which is the outward looking piece of the ndg (I say the outward looking because most of the NDG deliverables are in underlying technology). Pylons has been an interesting experience! I have to say that I've been amazed at what can be achieved using Pylons, at the same time as being frustrated by the raw edges (in particular, the software and the documentation are out of step, and the pylons folk don't keep the older documentation versions in obvious places).

As I've watched the feeds and news go by, a number of issues nearly got me to write something, but I resisted at the time. Those moments have gone, so I don't plan to write anything significant now, but I can't resist some things in passing ...

### On REST

The RESTful crusade goes on! Sean and Charlie are banging the drum for REST. I don't have a problem with most of it, but I always seem to find myself playing a role that isn't quite a luddite, but I fear might come across with a whiff of sour old man to it ...
in any case, I totally get the advantages of a simple API (get, put, delete etc), but what seems to be forgotten is that the hard yards are not only in the API but in the definition of the resource itself ... in particular, the examples Charlie gives completely ignore the difficulty of defining the resource. What the distributed object approach forces one to do is define your resources very carefully, and yes of course, that means if you change your resources your applications break ... but I think your restful service will break if I put an image where you expect an atom document!

Where REST is winning big in mashups and the like is in the juxtaposition of a relatively small number of well known media types (it's no coincidence that REST was designed for hypermedia). Once the resources become more complex, the API is less important than the resource definition, and the argument is really still about distributed objects, although the effort should be expended more on defining attributes than methods in many cases. (But there are exceptions: I've wittered on about affordances before, and the concept is surprisingly useful.)

All that said, in general I'm sold that, provided we define our resources properly, REST provides an excellent idiom which we can use to interact with them. In practice, for example, I expect that the OGC web services will evolve to become more and more restful (but they'll still have getcapabilities documents that conform to some sort of spec .... which I sincerely hope migrates from what it is now).

So for the moment, on this subject, all I want to do is ask Sean: "OK, so you don't want WPS and you can do it all with POST and 202 ... but what's the protocol to define a service and the things on which it operates?" I suspect you'll have to write something down ... and in my mind that's what defines the necessity for a WPS (but maybe not the current WPS).

...
which brings me to

### On the future of XML

Two articles stuck their head up above the parapet. The first was Eliot Kimber preaching to the converted (i.e. me) about the difference between data models and how we serialise them (XML in the first instance).

But in a more complex use case, where the data structures serialized are more complicated ... with non-trivial data types and complex composite object structures and whatnot, I can definitely see a purpose built language having real value, primarily in the ease with which programmers doing the serialization/deserialization can both design and understand the mapping from the objects to the serialized form.

I spent some time working with the STEP standard (ISO 10303), a standard for generic representation of complex data structures, ... I was involved in the subgroup that was trying to define the XML interchange representation of STEP models. This turned out to be a really hard problem precisely because of the mismatch between XML data structures and data types (String at the time) and the sophisticated STEP models. It confirmed what I already knew, which was that mapping abstract data structures to efficient and complete XML representations is hard and naive approaches based on simple samples will not work.

Which is the heart of the problem with GML and ISO19139 and MOLES and all the other places where my group and I are struggling with implementation(s). Eliot again:

One thing that XML has done is embedded a number of key concepts and practices into the general programming world, such as making a clearer distinction between syntax and abstraction, which sets the base for realizing that once you have the abstraction, the original syntax doesn't matter, which means you can have multiple useful syntaxes for the same abstraction.
It has made the general notion of serialization to and from abstract data structures via a transparent, human-readable syntax a fundamental aspect of data processing and communication infrastructures.

And this speaks to my point about REST, which I addressed somewhat in a defence of GML... for me, and the problems I deal with in the geospatial world, the API is not the issue - i.e. whether it's a SOA or RESTful isn't the first order problem - it's the data model and the serialisation that cause us grief.

Is JSON the answer? I suspect not ... at least not on its own. I suspect that there are (at least) two parts of the abstraction that need to be well handled: the schema description (programmatically; UML is fine for the humans), and the instance serialisation. We've traditionally done these with XML-schema and XML, but perhaps the answer is to split the machine readable data description from the data instances.

This isn't my idea of course; it came up in the other post I've singled out on the future of XML: Tim Bray on the tenth birthday of XML. Tim says:

XML is the first successful instance of a data packaging system that is simultaneously (human) language-independent and (computer) system-independent. It's the existence proof that such a thing can be built and be useful. Is it the best choice for every application? Is it the most efficient possible way to package up data? Is it the last packaging system we'll ever need? Silly questions: no, no, and no. JSON is already a better choice for packaging up arrays and hashes and tuples. RNC is a better choice for writing schema languages.

What price OGC web services that deliver JSON binary packages (where appropriate) and where the capabilities documents are in XML and described by RNC ... ?

There was more, but this is already too long ...

by Bryan Lawrence : 2007/09/05 : Categories ndg : 3 comments (permalink)

## one plus one equals more than two

Well, it's been more than a few weeks of near silence on my blog ...
however at this point our newborn is no longer newborn.

I've had an interesting eight weeks: the first two on paternity leave, then another week of annual leave, then I used up a couple more weeks of annual leave over a month so I could work half time. Last week was nearly full time, and this week I'm back to it full time.

I must say I really enjoyed leaving for work at 9.30 am and being home soon after lunch to play with the toddler and cuddle the baby (and cook and clean and all that good stuff)! It'd be really nice to have the sort of job that was both lucrative enough and enjoyable enough and doable in twenty odd hours a week. However, I don't have that job :-), so in the longer run I'll probably find I'm more sane working full time than half time. I suspect I'd find the stress of not having time to complete things is just as bad as outright tiredness. But my eight weeks didn't have that stress because I spent the last few months making sure that I had minimal meetings scheduled for July and August, and I had set myself some "programming" tasks rather than "management" tasks (and I find it's the latter that generate stress).

While normal service at work has resumed this week, I suspect that normal service on the blog will take a little longer to reestablish. It seems that children are nonlinear: the second kid doesn't make you twice as busy at home - it's somewhat more intense than that ...

by Bryan Lawrence : 2007/08/28 : 1 comment (permalink)

## bt broadband technical support sucks

No surprises there ...

I've had problems with my broadband over the last few days; the line simply doesn't stay up, the line count is 38 since rebooting yesterday, and I'm continually finding that I have no connectivity and having to manually connect to the router and reconnect to broadband. Often I find the router sitting trying to handshake, or training, and have to come back a few minutes later.

So, I call broadband "support".
Yesterday, their advice: reboot the router with a two minute down time. Yep, well that worked for an hour or so (did I say this was a random intermittent problem? Well now you know, so it didn't work at all did it?)

Today, the line happens to have been up and down like a yo-yo, but while I was talking to them, it remained up, so they wanted remote access to my router, and wanted to do that through some software they wanted to download onto my computer (which has to be windows-XP apparently ... so regular readers will not be surprised to know that I was unable to oblige). At which point they asked me to get hold of a windows computer, tomorrow, and they'd call me back ... well that's sort of helpful I suppose.

But not really: I explained that my voyager router can be configured to allow direct remote access, and I was happy to do that right now. Did they want that? No. Apparently their technical expertise doesn't extend to diverting from their script (which seems to consist of download their software, then they can connect to my computer and desktop share, then they can open up IE and talk to the router. Just opening up IE on their own computer and talking direct to the router seems to be beyond them).

Now what? I have to hope it stays working I suppose. It's a bit sick given I get weekly calls from them trying to sell me their 8 Mb/s product, this despite the fact that we know 768 is as fast as it could go here (7 km from the exchange).

(make that 39 times since yesterday, it dropped out while I was typing this)

Update 23/07/07: Ostensibly nothing has happened since the blog entry, although BT again offered a call back. However, the line is now stable (uptime well over a day)! I'm left with two hypotheses:

• Given the thunder about over those few days, perhaps our 7km line was behaving like an aerial and garnering enough noise to confuse the router?

• However, I'm not convinced, given that the SNR reported by the router didn't drop below 10 dB (but it was fluctuating a bit).
• More likely, BT have done something to make the ADSL line itself more stable at the exchange ...

by Bryan Lawrence : 2007/07/17 : Categories broadband : 8 comments (permalink)

## Book Buying Frustration

My two year old daughter can't get enough of the adventures of Kapai the kiwi, and I've promised to get her "orange Kapai" (she names them by the colours on the front).

... the Kapai books, by Uncle Anzac, are published by Random House NZ, and are available at Whitcoulls in NZ (as well as elsewhere no doubt) .. but the ISBN number of "Orange Kapai" is not matched by any internet bookstore that I can get to, except whitcoulls, and guess what: they only allow people with NZ and Australian addresses to register!

This saga amazes me for a number of reasons:

• Why don't amazon and their ilk have every publisher's booklist in their catalogue?

• ... or maybe they do, but Random House is so crap they haven't connected up their international catalogue properly ... the UK website for Random House doesn't match the ISBN number, whereas Random House NZ does ...

• Why are Whitcoulls so blinkered as to not try and exploit their kiwi booklist on the global market?

Bah. Humbug. I'm going to bed. Tomorrow I'll ring up some relatives in NZ and get them to buy a few Kapai books for me (I don't want to go through this again) ... but I shouldn't have to do that ...

Update: 16/08/07: ... and I didn't. In the end I wrote to Random House NZ, who were very helpful and pointed me in the direction of realgroovy who were able to send me the books (although they did rip me off for more than twice the P&P their website said it would cost). Still, regardless of cost (well, nearly regardless), it was really good to see her face when she realised the "aeroplane had delivered orange kapai"!
by Bryan Lawrence : 2007/07/16 (permalink)

## Yet More Digital Silence

I'm afraid this blog will be quiet for a few weeks now, unlike my house:

by Bryan Lawrence : 2007/07/10 : 1 trackback : 4 comments (permalink)

## Top 100 Supercomputer meets the Canterbury Crusaders

(i.e. a top 100 supercomputer in the same town as the best rugby team in the world).

This shows how poor I've been at keeping up with what my former colleagues have been doing. Andy Sturman, who seems to have been behind this, is to be congratulated!

In some ways I'm really happy for the UC team, and in some ways, it's bittersweet. One of the (many) reasons I left NZ to come back to the UK was because I didn't want to continue to fight to get bits of money to do climate modelling ... I guess that fight has now been won at Canterbury! (Although, knowing UC politics, and the politics of supercomputing in general, I suspect the next fight - how to stay competitive - has just begun).

I suspect that they'll find that they don't have enough fast disk pretty quickly ... 36TB is nothing for a machine that fast ....

by Bryan Lawrence : 2007/07/03 (permalink)

## A standard Vocabulary is more than just a name.

One of the reasons why CF is so important to me is that it provides the methodology for ensuring that our data buckets are consumable by software and humans alike. A key part of CF is the standard name, which identifies the physical quantity associated with a variable.

One of the problems with using CF is that one really needs to sometimes consider a combination of the CF cell method (e.g. maximum over time) and the cell bounds (monthly data) and the standard name (temperature) to fully digest what the meaning is of a CF variable (monthly mean maximum temperature). This can be a problem both for humans, and for software which needs to map between variables (e.g. ontologies or time aggregation software).
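To make the software side of that concrete, here is a minimal Python sketch of why a mapping tool cannot key on the standard name alone. Everything in it is an assumption for illustration: the attribute strings, the `interval` field (standing in for whatever the cell bounds imply) and the lookup table are invented, not taken from any CF document.

```python
from typing import NamedTuple, Optional


class CFVariable(NamedTuple):
    """The three pieces of CF metadata discussed above (values are illustrative)."""
    standard_name: str  # the physical quantity
    cell_methods: str   # how values were derived within each cell
    interval: str       # what the cell bounds span, e.g. "1 month"


# Hypothetical lookup: only the full combination identifies the concept.
MEANINGS = {
    ("air_temperature", "time: maximum", "1 month"): "monthly maximum temperature",
    ("air_temperature", "time: mean", "1 month"): "monthly mean temperature",
    ("air_temperature", "time: maximum", "1 day"): "daily maximum temperature",
}


def meaning_of(var: CFVariable) -> Optional[str]:
    """Return the human-level concept for a variable, or None if unknown."""
    return MEANINGS.get((var.standard_name, var.cell_methods, var.interval))


def same_quantity(a: CFVariable, b: CFVariable) -> bool:
    """Two variables are only interchangeable if all three attributes agree."""
    return a == b
```

Two variables that share a standard name but differ in cell method or interval come out as different concepts, which is exactly what ontology or time aggregation software has to cope with.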
It had been proposed that the standard names be extended to include combinations like those above, however, last week at the Paris GO-ESSP meeting we had a CF day, where one of the resolutions was that a better way to do that would be to develop a new list (to be managed by CF), which would match a standard vocabulary with explicit combinations of CF attributes.

Thus, we might find an entry in this table which explicitly made a match like the example above (other examples result from asking questions like: How do I encode "High Cloud Amount" or "Ice Days", both of which are standard vocabulary concepts, but if encoded in a CF standard name alone would lead to logical inconsistencies). In practice this would mean that a variable could have an attribute from the CF standard vocabulary, and the CF checker would need to ensure that the cell_methods, cell_bounds and standard name all matched the definition in the standard vocabulary.

Although this was a meeting resolution, CF has a process for change, and so now this proposal will need to be formally proposed to the mailing list via the trac, and then voted on by the conventions committee before it becomes part of a future version of CF. I'll put a link here to the proposal ticket when the process has started.

These concepts propagate into our GML world too: where a phenomenon had better have all of these attributes preserved ...

by Bryan Lawrence : 2007/06/19 : Categories cf ndg curation (permalink)

## US policy on climate change. What the left hand gives, the right takes.

Just when it seems like the US (federal government, don't blame the rest of them) is making progress (progress in the sense that the first steps are to admit you have a problem and need to do something about it) ...

... we find that the juggernaut of federal crankiness is grinding exceedingly small: the US is continuing to cut back on climate observations from space (via Eli). And what sort of observations will be most at risk?

...
data that can be collected only from satellites about ice caps and sheets, surface levels of seas and lakes, sizes of glaciers, surface radiation, water vapor, snow cover and atmospheric carbon dioxide.

So when we have a few years of cooler weather ... as we most definitely will (the climatological trend is just that, a trend, natural variability still has a part to play) the US political process will have just turned off observations that make the identification unambiguous!

by Bryan Lawrence : 2007/06/08 : Categories environment : 0 trackbacks : 1 comment (permalink)

## Another feisty gotcha - java

I use oxygen, and so I need java. OK, I think,

    sudo apt-get install sun-java6-bin

... and then we get a license prompt (in konsole), but nothing seems to allow me to accept it. Kill the terminal. Now apt is broken ...

    do you want to continue [Y/n]? Y
    debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process:...

Aarrgh. Google gives this, only the last hint of which helps:

    $ sudo fuser /var/cache/debconf/config.dat
    /var/cache/debconf/config.dat: 18888
    $ sudo kill -9 18888


Try again. Realise I have to <tab> to the <ok> prompt.

How obvious was that?

## building python on feisty

So now I have to build myself a new python on feisty kubuntu since /usr/local isn't safe.

Things to note:

• It can't be done without installing libc6-dev

• apt-get install libc6-dev

• If you want the python command line to be functional you need readline

• Update 1st June: At this point, despite the fact that it appeared to find the system zlib, an attempt to install ez_setup.py gives:

from setuptools.command.easy_install import main
zipimport.ZipImportError: can't decompress data; zlib not available


I'm now considering a virtual python ...
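Before going down that road, a quick sanity check on the freshly built interpreter can save some head scratching. The sketch below is not part of the original build notes; `missing_modules` is a helper name made up for this example, and it simply tries to import the optional extension modules that a source build quietly skips when the development headers are absent.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Running `missing_modules(["zlib", "readline"])` with the new python should return an empty list; anything it does return was left out of the build, whatever configure appeared to find.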

## I still believe in Fortran

I've never believed in religious wars over programming languages, but the latest O'Reilly survey on the state of programming languages makes interesting reading, if only for the assumption that book purchase measures the health of a programming language.

Mostly I don't care, but I couldn't really believe fortran is so irrelevant (0 book sales in the first quarter of 1997!). I know it's commercially irrelevant ... so maybe all the relevant material is available online now? Well, a quick look at Google hits, compared with the O'Reilly classification and the TIOBE index (which is a more sophisticated ranking based on hits), gives:

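| Language | Hits | O'Reilly Classification | TIOBE rank |
|----------|------|-------------------------|------------|
| java | 306M | Major | 1 |
| perl | 107M | Mid-Major | 6 |
| python | 87M | Mid-Major | 7 |
| delphi | 59M | irrelevant | 11 |
| latex | 50M | Minor | n/a |
| tcl | 29M | Minor | 26 |
| fortran | 17M | irrelevant | 19 |
| haskell | 12M | Minor | 39 |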

which is more consistent with what I'd have guessed. Maybe the truth is that most Fortran programmers are (relatively) old folk like me (although I haven't written a line of fortran for five years), who don't need new books. Further, looking at the position of delphi (which I've never even looked at) it seems Fortran isn't the only exception: the case for irrelevance is far from proved.

Perhaps O'Reilly need a new category name. Irrelevant these languages are not!

For some time now, we've been narrowing down how best to do scientific data citations. Last week we had a workshop where we concentrated on a number of issues associated with data publication.

I introduced the workshop with some philosophical bumf (0.75 MB ppt), and I'll probably say a lot more about it later, but meanwhile, here I want to concentrate on one aspect: We got some real feedback for my proposals for data citation (explanation and ISO19139 version).

The key criticisms were:

• Too many things that looked like URLs.

• Too much stuff

• The order of material could be reconsidered.

• (and the particular example I've been using could be easier to comprehend if we used examples that weren't quite so pathological).

Recall that we had something that looked like this (no longer a "real" dataset, but simpler as an example):

where this was essentially

Author, title, [Internet], Publisher, Date, URN, feature ID, [Feature Type (from a controlled vocabulary)], [downloaded date, available from Distributor website].

(It's important to remember that we believe the feature ID (anotherID) is important because we accept that with data we do expect folk to cite into them on a regular basis e.g. a record in a database etc.)

After some mucking around, the breakout group working on this came to something like this:

The URN could be a DOI, and in some cases it could be simplified to:

Lawrence, B.N. (1990): My Radar Data, [http://featuretype.registry/verticalProfile anotherID]. British Atmospheric Data Centre DOI:doiaddress.
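As a sanity check that the simplified form really is just a fixed ordering of a handful of fields, here is a small Python sketch that assembles it. The class and field names are my own invention for illustration, not part of any agreed schema, and the values are the placeholder ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Fields of the simplified (DOI) citation form; names are illustrative."""
    author: str
    year: int
    title: str
    feature_type: str  # URI from a controlled vocabulary
    feature_id: str    # identifies the cited feature within the dataset
    publisher: str
    doi: str

    def short_form(self) -> str:
        """Render: Author (Year): Title, [featureType featureID]. Publisher DOI:doi."""
        return (
            f"{self.author} ({self.year}): {self.title}, "
            f"[{self.feature_type} {self.feature_id}]. "
            f"{self.publisher} DOI:{self.doi}."
        )
```

Calling `short_form()` on a record holding the example values reproduces the citation string shown above.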

We have made the following assumptions and simplifications from the previous version:

• We lost [Internet] because we thought it was redundant once the citation has a URL or DOI.

• In this case we are dealing with "formally published" data, and so there is an expectation that the data won't change, so the download date is redundant.

• We thought that a formally published data set should not be allowed to grow; later "editions" could provide snapshots. We appreciate that this has implications for numbers of publications etc, but the importance of citing something as it was is preeminent.

• This is not to preclude folk referencing material on the internet which is changing, but if it is going to be "published" data, then we think we can and should handle it differently.

• We expect the target of the URL or DOI to be a metadata document, it should not be a binary target. There is human readable content there which provides more context and the URLs of the actual data. We have left the feature type in there though (as well as the feature ID), because it provides the human parser of a reference list a key hint about the target type1.

• There could be a mismatch between a dataset which could have a DOI, and the URI of the feature (we don't expect all features to have DOIs), and so having two forms of the citation does make sense: the first form above allows a URL which points directly to the feature to be shown (not that it does in this example); even if the URL isn't persistent, the URN will be, and the data object should always be accessible via the publisher, URN, featureID combination. I suppose the DOI version is cleaner2 provided the feature can be easily obtained from the target of the DOI.

1: I'll probably come back to the "manifest" concept that is coming out of the work of Raj Bose and Guy McGarva in Edinburgh another time, but suffice to say "manifest" could itself be a member of the controlled vocabulary: a featurecollection is itself a feature! (ret).
2: despite the fact I hate the way most folk create the things: anyone done the stats for how many DOIs are opaque unmemorable strings that are used in the literature and result in mistranscribed versions which point nowhere or to the wrong place? (ret).

## debian python and easy_install aren't a perfect match

It turns out that on a debian system, if you

1. create your own python in /usr/local, and

2. use easy_install (python eggs)

you'll get into trouble. Phillip Eby has a solution, but it's not very tidy.

I can't think of any good reason why Debian has done this, and of course it affects ubuntu badly as well.

by Bryan Lawrence : 2007/05/18 : Categories ubuntu python : 0 trackbacks : 1 comment (permalink)

## overheard on the email

Overheard on the email lately (I think both Andrew and Stefano will forgive me for publicising excerpts from recent emails especially since they're relevant to the whole what place does OGC and GML have in the world chat):

Andrew Woolf:

Too many people forget the case of a feature with (possibly multiple) coverage-valued properties. When scientific people complain that this ISO/OGC stuff "is just GIS" I robustly respond that actually the concepts are as revolutionary to traditional GIS as to us scientific users. Let's please leave behind these old notions of "raster" vs "vector", and realise that actually we can model the world in whatever complex way is necessary.

Stefano Nativi:

We should avoid "mental barriers" like Raster Vs Vector, as well as Coverage Vs Feature (the new version of the same contraposition).

In an interoperable Geospatial Information framework, Observation&Measurement, Feature and Coverage are different ways to see the same stuff (i.e. they are views). Different use cases may need to present users with a Feature view and access and process data using a Coverage view, or vice versa.

It helps to have the same toolkit to describe these views :-)

## It's not how big the tool is, it's what you do with it

I really ought not get involved in long discussions when I don't have time to finish what I start ... but anyway. Charlie didn't start the conversation by writing this, I did that by responding :-), but now it is a conversation :-). So this is an open letter to Charlie.

Taking things one point at a time:

1. In his (my) view, GML is not meant for data exchange but instead provides a common language that various communities can use to develop interoperable solutions .

• Umm, not quite. I think GML is meant for data exchange, but using it requires some understanding by the client, GML alone is not a solution (Below we'll define data in this context).

2. If I want to share data with my business partners, I can invent and implement a proprietary exchange format in less time then using GML. Or take an even simpler route - just exchange shape files and be done with it.

• Each time you invent your proprietary exchange format, you probably can do it faster ... but remember both ends of that piece of wire have to be involved in the conversation. With GML (and a UML description as documentation), I've got a fighting chance of interpreting your data without you, and then writing my client or service.

• As for shape files: care to explain how I can use a shape file to exchange a trajectory of dropsondes from an aircraft?

3. Second, with this approach you end up with thousands of separate communities that cannot exchange data between them. Whether this is good or bad depends on your goals - if I want to exchange data with a few other like minded organizations then this is ok. But if your goals are loftier, to create a world-wide geoweb, then this is bad..

• Umm. Nope, they can exchange data, but they have to do some work to do so, and absolutely I want a world-wide geoweb, but I don't want that to be proprietary or limited by the commercial imperatives of the GIS vendors. Actually, I think we both agree on this point, we're just trying to get there via different routes.

4. Third, if the goal really was to create a common language to describe geographic information, as opposed to exchanging it, why not reuse UML (Universal Modeling Language) by creating a UML profile?

• Agreed, so we start with our UML profile, which describes something, now what do we do? We have to find a way of implementing our description ... more of this below.

5. (wrt GML) ... Is it meant to make it easier to implement thousands of one-off data integrations, or is it meant to enable geographic data exchange on a world-wide scale? ... If its the former, then I think there are simpler, faster approaches than GML. And if its the latter, then GML fails to provide a simple format that every system can use in the same way that Atom does.

• Seriously: enabling thousands of one-off implementations a different way each time would be easier than GML?

• Atom allows you and me to exchange a document that we can each interpret in terms of some simple concepts. Add GeoRSS to it, and you can tell me the location that this document applies to. If I want to do anything more complicated then we're into a GML extension anyway ...

John Caron, in the comments to my missive, also picked on OGC specs in general:

It could be that the process of creating OGC specs itself is flawed, perhaps because implementations come after the fact, perhaps because industry consortia are simply the wrong "governance" structure to produce clean technical specs with just the right level of abstraction.

I don't think all OGC specs are flawed, but do agree that GML is not at the right level of abstraction. It makes too much use of arcane XML technology (for example, in a slightly different context, who really cares about substitution groups?), and certainly could be much leaner. However, that's actually a problem with all standards that try to be all things to all people. There is no doubt that lean mean standards like Atom are easier to construct well (maybe the Atom team won't call what they did "easier", but all things are relative :-). The question then is could a profile of GML do an 80-20 job better than the whole thing? Could Atom (alone) do it all?

The clear answers are yes and no. The first is what Application Schemas of GML are all about: step 2 in my diagram is about using GML to build an application schema, not GML in and of itself. That's the bit that communities should build to be lean and mean. Then only extend (or unify) a little as you add a new community.

To be fair, even building a profile is a pain, and that's because we all agree that XML schema is difficult to handle, and (again, in a different context) difficult to constrain.

Actually, perhaps part of the issue is what we mean by data. I think GML allows me to describe lots of attributes of my data in a way that's quite easy for someone else to consume: you can read my axes, understand the parameters (what dictionary did I use to describe them) etc. The final mile, "the real data", is going to be hard for anyone else to consume without knowing exactly what that data object is. But GML also allows me to define the coverages (albeit in a restrictive way, roll on a full implementation of ISO19123), and that's the take home point for my trajectory of dropsondes example above.

If I give you a coverage, conforming to a GML application schema of my trajectory of dropsondes, you've got a fighting chance, with a GML parser1, of writing some code to grab the trajectory and put it on a map, or create a contour map of the height versus trajectory ... and all that without knowing about NetCDF itself. Yes, you might need software that can read NetCDF, but you don't need to interpret the NetCDF yourself, you only need to interpret the GML description (so I've just saved you the netcdf manual, and maybe, in the CSML case, the HDF manual, the NASA Ames manual, and the PP manual, and yes, they add up to a lot more pages than GML alone). You can take my libraries though for reading the data and "just use them", but you can "just use them" in your application only if you parse the GML ... and do the thinking about what my data objects mean to you in your application (if anything :-).
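To give a flavour of what "writing some code to grab the trajectory" might look like, here is a hedged Python sketch. The `DropsondeTrajectory` wrapper element and the sample coordinates are invented for illustration and do not come from CSML or any real application schema; only `gml:LineString` and `gml:posList` are standard GML elements.

```python
import xml.etree.ElementTree as ET

GML_NS = {"gml": "http://www.opengis.net/gml"}

# Hypothetical instance document: a made-up feature wrapping a standard GML geometry.
SAMPLE = """
<DropsondeTrajectory xmlns:gml="http://www.opengis.net/gml">
  <gml:LineString srsDimension="3">
    <gml:posList>51.0 -1.0 8000 51.2 -1.3 7900 51.4 -1.6 7800</gml:posList>
  </gml:LineString>
</DropsondeTrajectory>
"""


def trajectory_points(xml_text, dimension=3):
    """Pull the first gml:posList out of a document as a list of coordinate tuples."""
    root = ET.fromstring(xml_text)
    pos_list = root.find(".//gml:posList", GML_NS)
    if pos_list is None or not pos_list.text:
        return []
    values = [float(v) for v in pos_list.text.split()]
    # Group the flat list of numbers into tuples of `dimension` values each.
    return [tuple(values[i:i + dimension]) for i in range(0, len(values), dimension)]
```

Note that nothing here knows anything about NetCDF: the client works entirely from the GML description, and only goes near the underlying files (via somebody's library) once it has decided what the object means to it.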

Could Atom (alone) do that? No! Atom would allow my software to consume your document easily, but to make sense of the content in my application, the "standard" has to describe the semantics of use to me. Of course Atom is relatively easy to use, and that's because it has limited semantics. Great, let's use it for what it is, but not pretend I can give you an arbitrary geospatial object in Atom and you can consume it in any meaningful way without having a conversation with me about what it is, and then writing some code :-)

Both ways we need new code. Your point is that it would be easier to use Atom to give me the object, and have a conversation with me every time to write the code. My point is that you can give me the object anyway we like, but it helps if we have a toolkit that helps us describe the object without having to write (every time), yet another calendar handling tool, and dictionary handling tool, and ...

... and by way of conclusion: This isn't really an open letter to Charlie, it's really to my subconscious (so Charlie: thanks for the excuse: who knows, I too may change my mind; there's a lot more thinking and detail sorting out before this conversation is finished - whether or not anyone else bothers to join in :-)2

1: Oh yes, I know exactly what that means :-), no one has a complete GML parser, more's the pity. (ret).
2: Sorry this last paragraph didn't get submitted with the original, I somehow managed to submit an earlier version. (ret).

## feisty kubuntu

On Friday I upgraded from dapper ubuntu to feisty kubuntu on my laptop. I needed to do it because:

1. I got sick of Evolution hanging with some image email, and requiring a restart a couple of times a day.

2. I wanted beagle to index my .doc files properly. (Actually, I needed this, the amount of time I spend trying to find files is unbelievable).

3. I needed to deal with OpenOffice misbehaving with some spreadsheet inserts on a document (for once I was working with someone who wanted odt rather than doc ... roll on the revolution). An upgrade was required, this was the clincher as to why I did it then and there.

4. Ever since I got my laptop it would never reliably find a physical ethernet at boot time, I often had to do an ifup eth0 afterwards (I think this was a bit of misconfigured networking by Emperor Linux, but I never got to the bottom of it).

5. konqueror was crashing on Eli's website. Always.

6. I was hoping that akregator might behave better with atom feeds.

All but the last of these got fixed. I'm pretty happy, but there were some wrinkles:

1. Evolution email didn't get indexed by beagle. I've fixed this by importing my mail back to kmail. By the way, that isn't as trivial as it ought to be. Evolution mail directories include .cmeta and .ev-summary.data etc files which the kmail importer hangs on. I had to follow advice on how to fix that. Basically:

• You need to run this code. Then do the import. But beware, it imports them all as unread, so if you did have some genuinely unread stuff, then you won't be able to identify it afterwards from kmail (although it's still there in Evolution).

2. I have my own python in /usr/local, and it interacts badly with the system python when /usr/local is mounted. See this bug report.

3. I'm not really convinced beagle is getting everything, especially in the mail.

4. The hotkeys on my Lenovo Thinkpad T60P used to work (thanks to Emperor Linux plus a wee piece of my own bespoke python). I'll have to get around to fixing them.

I'm very impressed with

• The new network manager

• But, it really ought to come with the PPTP code by default, and it ought to work without getting both network-manager-pptp and network-manager-gnome, and when you do have the pptp stuff it ought to be integrated with the kde wallet not the gnome keyring.

• The hibernate and suspend work much better with my laptop, and it looks like the power management does too, with around half an hour longer battery life I think ...

The upgrade did take an hour or two after the install to get things nearly back the way I want them, so that was lost time, but I suspect I will make it up this week on file finding alone.

As usual I did it on a separate partition, so I can always go back ... I only wish I could work out how to backup from one partition to another, and upgrade that other partition, rather than install on it.

## Interoperability is just over the horizon ... always

I'm not a GIS person, yet I've invested quite a lot of my own time, and quite a lot of public money, into building tools based around the Geography Markup Language (GML). GML is essentially a toolkit designed to improve interoperability, but it's getting a bit of bad press right now, both in blogs (e.g.) and mailing lists (e.g. this thread).

I probably wouldn't care, but Sean Gillies who seems to have quite a few clues (and who provided me with the pointer to the blog link above), seems to agree. So I want to engage in this discussion, but before I do so, I want to digress.

I spend a lot of time arguing about how difficult it is to do data citation and publication. One of the key points that one needs to keep on restating is that we have a shared understanding of what books, articles, chapters and pages are, and what those terms mean. We have no such understanding for data. Data is about the real world.

Ok: back to the main point. Interoperability, in the GIS sense, is actually the same problem. Actually, it's the same point in every sense, not just the GIS sense, but we'll stay focussed here :-). One of the best diagrams I have seen to make this point is in ISO19109 (Geographic information - Rules for Application Schema), and it's encapsulated in one figure, which looks something like this:

The key point is that everyone, doing any coding, does something like that. We start off by modelling (on paper, in head, in UML ... whatever) some of the things about the real world that we care about.

We then move from that abstract model of the world to building descriptions of the key features of the real world in some "descriptive language". We give those descriptions names, like "schema" or "standards" (or even "RFCs") and we use a variety of technologies to do that.

Then we take data and we populate instances of those "schema".

Really good architects/programmers find really simple ways of doing the process, so the entire effort is streamlined, and the resulting objects and instances are easily understood by the "Community of Interest".

Now let time pass.

Two communities want to talk to each other, and exchange data. All that simplification is lost. They have to work their way back up the tree (probably to the real world level), and come back down until they share the same "descriptive language", and then data objects described using the descriptive language can be shared.

On the way we can write a new descriptive language every time (for every pair of new communities), or we can try and design an abstract descriptive language that allows one to avoid that step every time someone wants to interoperate. Doing the latter introduces considerable complexity, and that makes the job of solving problems for one's own little community harder ... every time. But, and here is the big BUT, unless you know you will never want to share your data, then you've just moved that complexity until later, you haven't undone it. If you are in a business, that's just fine, this year's profit is all that matters, but if you have longer time horizons, then solving the tiny problem isn't necessarily optimal.
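A back-of-envelope count shows why the pairwise route gets out of hand. This is only a sketch under two simplifying assumptions of mine: every pair of communities that wants to talk needs its own agreed description, whereas with a shared abstract language each community needs just one mapping (its own application schema).

```python
def pairwise_languages(communities: int) -> int:
    """Descriptions needed if every pair of communities agrees its own language."""
    return communities * (communities - 1) // 2


def shared_language_mappings(communities: int) -> int:
    """Mappings needed if each community maps once onto a shared language."""
    return communities
```

For three communities the two approaches cost the same (three each), which is why the pairwise route feels fine at first; by ten communities it is 45 against 10, and the gap keeps growing.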

Which brings me back to my data citation example. We can't share anything, until we have a shared understanding of what it is (1), and a shared way of describing it (2). Right now, flaws and all, UML and GML are the best thing going for that. They're absolutely not the best thing going for solving any specific problem, and not even close to the best thing for most of the use cases the GeoRSS community want to address, but, if you want to do interoperability, it's not only today's problem you need to think about: what's over the horizon matters!

So, with that in mind, let's go back to Charlie Savage's argument. There's a lot of good stuff there, but I think he draws the wrong conclusions.

Thus the real problem GML tries to solve is how can your computer system and my computer system exchange data about the world in a meaningful way? In my opinion that's an unsolvable problem, because the way your database models the world is different than mine.

So I agree except for the unsolvable bit ... which we'll come back to.

He ends up with

In my view, the fundamental premise of GML is wrong. The ability to create custom data models is an anti-feature that makes integration between different computer systems impossible because it assumes that those systems can actually understand the data. Computer systems have no such intelligence - they only understand what someone has programmed them to understand.

Which is so nearly right, except for the impossible and anti bits.

I think the heart of the problem is in the expectation of what GML gives you. What it absolutely doesn't give you automatically is code to manipulate someone else's data objects. What it does give you is a descriptive language you can both use to describe them, and you absolutely have to spend real programming time exploiting the fact you have a common language (so now it's solvable and possible!) No, my WFS client may not understand your Feature Types ... yet ... but I could make it do so, and I can do that without inventing or learning a new paradigm. That's interoperability, but it's a "strong-typing/loose-coupling" sort of interoperability.
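To make the "strong-typing/loose-coupling" idea a bit more concrete, here is a minimal Python sketch. Nothing in it is GML or WFS: the `FeatureDescription` class, the `Client` registry and the `StationObservation` feature type are hypothetical names invented purely for illustration. The only point is that a shared description lets a client be taught a new type without a new paradigm, and that it understands nothing until someone does that programming.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureDescription:
    """Hypothetical stand-in for a feature type written in a shared descriptive language."""
    name: str
    properties: tuple  # (property name, python type) pairs


@dataclass
class Client:
    """A client that only understands what someone has programmed it to understand."""
    handlers: dict = field(default_factory=dict)

    def register(self, description, handler):
        # The "real programming time": teach the client one more feature type.
        self.handlers[description.name] = (description, handler)

    def understands(self, type_name):
        return type_name in self.handlers

    def consume(self, type_name, instance):
        if not self.understands(type_name):
            raise LookupError(f"no handler for feature type {type_name!r} ... yet")
        description, handler = self.handlers[type_name]
        # Strong typing: check the instance against the shared description.
        for prop, expected in description.properties:
            if not isinstance(instance[prop], expected):
                raise TypeError(f"{prop} should be {expected.__name__}")
        return handler(instance)


# A made-up feature type, described once and shared by both communities.
station = FeatureDescription(
    "StationObservation", (("station_id", str), ("temperature_k", float))
)

client = Client()
understood_before = client.understands("StationObservation")
client.register(station, lambda obs: round(obs["temperature_k"] - 273.15, 2))
celsius = client.consume(
    "StationObservation", {"station_id": "X1", "temperature_k": 283.15}
)
```

The coupling is loose because the client and the data provider share only the description, not code; the typing is strong because an instance that does not match the description is rejected rather than guessed at.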

Of course it's not the only sort of interoperability that matters. Web 2.0 and REST and GeoRSS and all that stuff is good, it's really good, but it's not the whole story. I wish folk wouldn't keep on arguing that just because some technology doesn't solve their use case it's flawed!

Of course I can give you chapter and verse on why GML sucks, but that's another story, it sucks less than some other options :-)

## Spring

No photos can do justice to this time of the year in the Chilterns, but we try:

## NDG Access Control

I've wittered on about access control here for a while. Despite being frantic with various funding proposals (hence the silence), I've found time (with help) to knock out a description of NDG security. It won't make fun reading for those who like simplicity, but that's life, it's as simple as we can make it! James Snell only got one thing wrong in this statement:

Auth is and will continue be the most significant issue with APP interoperability.

The thing he got wrong? He didn't need APP in the sentence! Actually, he got one other thing very right in that post: the importance of profiles, but that's another story in another context.

Why now? The 2007 e-Science AHM.

#### Practical Access Control using NDG-security

Access control in the NERC DataGrid (NDG) is accomplished using a combination of WS-Security to ensure message level integrity, X509 proxy certificates to assert identity, and bespoke XML tokens to handle authorization. Access control decisions are handled by Gatekeepers and mediated by Attribute Authorities. The design of the NDG-security reflects the reality of building a deployable access control system which respects pre-existing user databases of thousands of individuals who could not be asked to reregister using a new system, and pre-existing services that need to be modified to take advantage of the new security tooling. NDG-security has been built in such a way that it should be able to evolve towards the use of community standards (such as SAML and Shibboleth) as they become more prevalent and best practice becomes clearer. This paper describes NDG-security in some detail, and provides details of experiences deploying NDG-security both in the e-Science funded NDG and the DTI funded Delivering Environmental Web Services (DEWS) projects. Issues to do with securing large data transfers are discussed. Plans for the future of NDG-security are outlined; both in terms of application modification and the evolution of NDG-security itself.

(Full paper: pdf)

## go-essp 2007

The call for abstracts for the GO-ESSP 2007 meeting is out (and has been for a while). This year we're limiting numbers to 60, so there is an abstract winnowing phase. I've just submitted mine:

#### Technical and social requirements for a putative AR5 distributed database of simulation and other data

It is highly unlikely that future large multi-model intercomparison projects involving multiple initial-condition and/or parameter ensembles from multiple institutions will be solved by centralised database solutions. Such centralised databases would need to have very high bandwidth to all possible data consumers, and require significant resources which would not be easy to obtain within existing national budgets. Fortunately solutions to the problem of intercomparison which involve distributed data holdings with common metadata structures and interfaces are possible. There are already a number of possible components of such a solution deployed in a variety of institutions. However, there are a number of issues that would need to be addressed before such solutions could be joined together to provide seamless access to data on an international scale. The issues range from agreeing on which technical solutions (the plural is important) should be used, to establishing trust relationships which could be supported not only by the scientists in the institutes involved, but by their network and computer security administrators. Given that technical developments will most likely continue to be driven by individual funding programs, not by an overall project with internationally generated and agreed requirements, it will be important to understand that success will most likely depend on agreeing common interfaces and information models, not on deploying the same technology throughout.

## US National Weather Service is ahead of the game

Over two years ago, I was pleased to note that the US National Weather Service provided forecast data in XML via a SOAP interface. In the intervening period they've moved on considerably: they now have a WSDL interface to their SOAP service, and now a new WFS interface (hat tip: John Caron).

If I ever find time I might have to build some toys to interface with their data!

by Bryan Lawrence : 2007/04/06 : Categories ndg badc xml climate : 0 trackbacks : 1 comment (permalink)

## Channel Four Shame

I didn't see the C4 programme, and hadn't planned on commenting on it here, but last night while I was watching the kiwis take another step towards the World Cup, my wife was on the phone to a teaching colleague: Apparently this colleague had seen the programme and had found it "pretty convincing", and worse from my point of view it was a common position! She had discussed it with other colleagues who had also found it convincing - science teachers all! So I spent further into the sleep bank and wrote most of this last night.

I don't blame them, one doesn't expect the mainstream "believable" media to be that poor: while TV is not refereed literature, in the UK at least one expects a level of integrity in "factual" or "discussion" pieces [1]. But I digress, this post is not supposed to be a diatribe, I wanted to write something that could be accessible to a teacher, with some links to folk who had seen the programme and had made some cogent responses.

So, remembering that I didn't see this programme, let me start with a comment about the contenders: on the one side we have the Intergovernmental Panel on Climate Change (representing thousands of active scientists, most of whom are not directly funded by any government, but are simply assembled by governments), and on the other a handful of disaffected, often out of touch, or simply misled and misquoted, individuals orchestrated by an individual with form on misleading the public:

Martin Durkin, for his part, achieved notoriety when his previous series on the environment for the channel, called Against Nature , was roundly condemned by the Independent Television Commission for misleading contributors on the purpose of the programmes, and for editing four interviewees in a way that "distorted or mispresented their known views". Channel 4 was forced to issue a humiliating apology. But it seems to have forgiven Mr Durkin and sees no need to make special checks on the accuracy of the programme.

Now this isn't about weight of numbers alone, the point is that there really is no significant argument amongst active scientists about this: climate change is real, happening now, and rather more rapidly than we hitherto expected (although that's not to say the potential impacts aren't overstated here and there), but C4 felt the need to make a "there's my Johnny, the only one in the entire army marching in step" kind of programme. This is a classic situation where a few amateur voices who can't get anything peer reviewed have a take on things that is purported to be as valid as that which results from peer review! What nonsense!

OK, so the two key links you want addressing the programme itself are

And this is a blog entry I link to above (for the benefit of anyone who prints this out), on amateurism and peer review:

OK, and here's my summary and interpretation of their key points, along with some other bits and bobs.

• Met Office: The bottom line is that temperature and CO2 are linked.

• Bryan adds, blimey, even Arrhenius a hundred years ago understood that. (You may be interested in what the American Institute of Physics thinks, it's only their government who is out of step!)

• Real Climate implies they made a big deal out of CO2 not matching the temperature record over the 20th C. Apparently the graph they showed had been doctored (see below), and the very good explanation (sulphate aerosol) for the discrepancy is well known, so the programme makers were lying by omission.

• C4 was also upset that temperature leads CO2 by 800 years in the ice cores. Which, as Real Climate put it, is basically correct, but irrelevant since the reason for that is understood.

• Met Office: The bottom line is that observations are now consistent with increased warming through the troposphere.

• The troposphere should warm faster than the surface, say the models and basic theory. And Real Climate implies they claimed the data didn't agree with that. But it does! Unless you want to use data with known errors that have since been fixed (and the folk on this programme knew that perfectly well).

• Apparently they tried blaming cosmic rays as well, in passing, so I'll address that in passing too: See my blog entry on that house of cards. The Met Office again: The bottom line is, even if cosmic rays have a detectable effect on climate (and this remains unproven), measured solar activity over the last few decades has not significantly changed and cannot explain the continued warming trend. In contrast, increases in CO2 are well measured and its warming effect is well quantified. It offers the most plausible explanation of most of the recent warming and future increases ... changes in solar activity do affect global temperatures. However, what research also shows is that increased greenhouse gas concentrations have a much greater effect than changes in the Sun's energy over the last 50 years.

So let's wrap up with two more quotes:

• The bottom line is that current models enable us to attribute the causes of past climate change and predict the main features of the future climate with a high degree of confidence. We now need to provide more regional detail and more complete analysis of extreme events.

• ... it means they have "touched up" pretty well all the graphs they've used (the solar one omitted the recent data; the 400y solar "filled in" some missing data that was missing for a good reason). Swindle indeed!

(Update, 2 May 2007: See also this list of issues and their open letter and make sure you check out the signatories.)

1: I for one will never bother watching channel four for "news" again, entertainment yes, but facts no!

## Data Journals or whither the Earth System Atlas

I attended day three of the QUEST open science meeting yesterday, and listened to the presentation from the Earth System Atlas (ESA) folk.

Aside from the assumptions that they were the first to think that scientific data should be preserved (wrong), or first to think of a proper citation based data publishing effort (wrong), or even that they would be the first to deliver such a thing (already done, and not even by us), many of their arguments are good. But I would say that, since I've been making many of the same arguments.

However, I had some substantive issues with their presentation that I didn't get a chance to discuss then, so I'm recording them here (I'm aiming on a meeting with their technical director next week, but it may not happen).

1. I wonder how hard they've thought about what it means to be a data journal, and all that entails: journals need to be persistent, and while that's relatively easy [1] for something that has a paper copy, it gets more difficult for something that is digital only, and much more difficult for something which uses formats and conventions which are not in the commercial mainstream. Essentially, a data journal has to be a fully operational data centre first, and a peer reviewed entity second. What I mean by that is: If you can't be sure that the data is properly managed, then will the citation be persistent? It's not for nothing that the AGU only allows data references to "proper" data centres.

2. Further, they've asserted that their ESA will provide "a centralised model" and that's a good thing, and I would contend that's downright wrong. If they're halfway successful, then supporting data access will require a distributed model; and the more distributed the better, ideally distributed across continents!

3. I didn't hear a clarion-call of standards compliance. The reason why electronic journals work is that everyone can read them, and that depends on the use of standard output formats (and frankly, the persistence thing depends on standard input formats, and both depend on lossless translation and the work of copy-editors :-). In the case of a data journal, the output formats will include figures (easy) and downloadable data (harder). They declared that "they'd start" with NetCDF, but without a convention for how to use NetCDF that's not enough. Of course they're aware they need metadata, but I fear they've only scratched the surface. I didn't hear ISO, I didn't hear OGC, I didn't hear CF-compliance (well I did hear all those things, but that was in my comments from the floor). Mind you, the QUEST science meeting may have been the wrong place for them to deliver acronym soup ... we avoided it in our NDG presentation to the same meeting and ended up with gross (inaccurate) generalisations as a consequence. So it may be that this is all in hand, and anyway they are still very much in the spin up phase.

4. I certainly didn't hear any discussion of what they will actually cite, beyond "we'll use DOI's". Regular readers will know that citing data is not going to be trivial.

5. All that aside, they're going to get some of that metadata structure for free: the QUEST part of the ESA will have to be data that is compliant with the NERC data policy and hence conform to BADC metadata requirements, and we will be holding duplicate copies of the data.

6. I note the existing ESA site has a UAH copyright, and that's not consistent with what they said about the data access being open. Further, they'll need to step into the data licensing cesspit (it's not enough to say it's free and open, they will have to license it, if for no other reason to avoid liability).

Reasons 1 and 2 above are why, in our efforts to develop a data journal, we're going for an overlay journal (of which more another time: we've only just received funding for the second step following on from CLADDIER, but it will involve a pilot project with the Royal Meteorological Society, RMS). The data journal part of it (which boils down to the specific metadata and documents needed to elevate the dataset to "peer-reviewed", above the metadata and documents needed to archive the data) will exploit existing reliable archives. That way, we can rely on persistence via professional data archives, and point to multiple duplicate copies, if we have confidence they are the same thing, to get performance. We could even encourage different archives to hold copies as mirrors if necessary, following the Lots of Copies Keeps Stuff Safe (LOCKSS) mentality.
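The "confidence they are the same thing" condition is the part that lends itself to automation. A minimal sketch, assuming each archive can hand back the bytes of its copy: the function names and archive labels below are hypothetical, and this is an illustration of the idea rather than a description of how LOCKSS or any particular archive actually does it.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one archive's copy of a dataset."""
    return hashlib.sha256(data).hexdigest()


def matching_copies(copies: dict) -> list:
    """Return the labels of the archives whose copy matches the reference copy.

    `copies` maps an archive label to the bytes it holds. The first entry is
    treated as the reference copy, which is an assumption of this sketch.
    """
    labels = list(copies)
    reference = digest(copies[labels[0]])
    return [label for label in labels if digest(copies[label]) == reference]


# Made-up data: two faithful mirrors and one copy that has silently drifted.
copies = {
    "archive-a": b"time,temp\n0,283.1\n",
    "archive-b": b"time,temp\n0,283.1\n",
    "archive-c": b"time,temp\n0,283.2\n",
}
matching = matching_copies(copies)
```

Only the copies in `matching` would be safe to offer as interchangeable download locations for a single citation.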

As I said at the meeting, there is room for a spectrum of data journals in academia, so I don't see this as a "them or us" situation, far from it, I suspect our different approaches may appeal to different segments of the community.

I do notice that they've spent a lot of effort building up an editorial board and contacting the great and good, and we've spent no effort on that (although for the editorial board we'll exploit the RMS). It remains to be seen whether that has been a blunder for us.

1: no archivist flame wars please, the key word is relatively.

## planes, trains, and automobiles.

I'm sitting in a hotel in Paris, communicating by virtue of the hotel next door who has an open wireless network. Well done France, none of the British effort to make money from absolutely everything ...

Anyway, I felt like waxing lyrical about the Eurostar experience. Why does anyone fly from London to Paris? Oh yes, I know it's more expensive coming by train, but it's much much more pleasant. I've done this trip nearly entirely by train ... the first 20 minutes from our country estate (not everything on this blog is entirely accurate) was by car, but then

• train into London (on time)

• underground across London (quick and easy)

• train to Paris

• train to the last mile (ok, 100m, also quick and easy)

• à pied.

Total time, equivalent to flying, total cost not that different (given what it costs to get to Heathrow!), queueing time: Minimal. Wasted time hanging around: Minimal.

Now if only I could get a fast train to other places in Europe direct from home ... (and if only my mates who live in Paris weren't on the other side of the world this week).

## Management Technique - Or Lack of It

Over the last six months I've been pretty poor at keeping track of my blogroll - my akregator tells me I have 4020 unread articles. What I tend to do is simply ignore entire blogs for long periods of time, and then have a purge. On the train back from Manchester yesterday I finally got to having a look at Esther Derby's blog: Insights You Can Use. As I read through nearly all the articles I realised two things:

1. her blog is a must read blog for anyone who manages anyone (let alone manages people who build software, in which case it really ought to be a must MUST read blog). (Obviously I knew that, sort of, as I have her feed, but I've just elevated it back up to my "read nearly every day" category).

2. I've been a pretty poor manager lately.

I have to hang my head in shame about how I've been treating people, again for two reasons:

1. I haven't been as patient as I should have been, and

2. the reason I have been getting frustrated with folk is in many cases down to me!

There is so much in her blog that it's hard to pick specific things that I want to share with you, but anyway, here's a couple (not necessarily the best things, but just fairly typical: insights I can use)!

#### The Prime Directive

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

That being so, why didn't they do what you wanted them to do? Part One, Part Two

#### On change

1. people who aren't following my ideas are "resisting". Why's that then?

1. they don't know how

2. they don't feel they have time

3. they think their way is better

4. they don't think the new way will work

5. they don't like/respect the person requesting the change

6. the new suggestion is counter intuitive given people's existing mental models (or what they've been taught)

7. the new suggestion runs counter to existing reward structures or other organizational systems

8. the new suggestion doesn't make sense to them

9. they have no experience that tells them the new way will work, or how it will work.

2. Maybe we need to be careful about how we introduce new ideas? Our wonderful ideas of how to make things better for other people may not be greeted with enthusiasm, because:

1. Other people may value things about the old way that we don't see or don't appreciate.

2. Things that we don't like about the old way may be valued by other people.

3. It takes time for most sentient beings to adjust to a new way.

4. People will accept the new way to retain something they value (ok, so help them with that!)

5. Even when the new way is accepted, people may look back fondly on the old way (deal with that).

#### Quick Code Reviews

Well, ok, she didn't write this one but she did link to it, so I wouldn't otherwise know about it. The bottom line is that rather than have formal code reviews, a good way to create good software and promulgate good knowledge about that software through a team is to

1. unintrusively request a small piece of time (max 5-10 minutes) from a colleague to discuss code whenever a new and small (less than 10 minutes remember) piece has been written to fix a bug or satisfy a new unit test (or both I suppose, since any bug in the best of all possible worlds should probably create a new unit test).

2. go through it line by line, and

3. get the benefit that over time no code is understood by only one person!

#### Context Switching

1. Up-to-date technical skills : 14% (9 votes)

2. The ability to juggle multiple, constantly reprioritized tasks: 38% (25 votes)

3. Skin thick enough to take constant end-user abuse: 8% (5 votes)

4. The ability to say "no" without making people angry: 15% (10 votes)

5. A sense of humor: 26% (17 votes)

As she says:

The respondants to this survey apparently want employees who have out-of-date technical skills and are thin-skinned yes-men and yes-women. And they'll take those employees and subject them to multitasking on constanly shifting priorities.

I'm thinking they won't be producing much working software.

Oh s**t, well I want my team to be up-to-date, but I also want them to have those other characteristics, so that implies I may be one of those who lead teams that don't deliver as much as they might (we do deliver working software, but I'm always moaning about how slowly, perhaps it's me ...)

#### And so

After all that, can I change?

## Review of the ESA HMA project

Recently, the British National Space Centre asked me to undertake a review of the European Space Agency's Heterogeneous Mission Accessibility project. The full review is here, the key result is that:

From all perspectives, the technical opportunities for involvement in HMA in the future are good: the underlying technology is being developed in a public manner, a testbed and service validation system is planned, there is considerable scope for expansion (both by adding services and data products to the DAIL layer), and an investment in HMA technology is likely to have payoffs in the wider deployment of geospatial services (including commercial deployment).

Technical summary points are:

1. The HMA project is being developed with methodologies based on the ISO and OGC specifications. To understand the HMA, data providers and data consumers will need to be familiar with those specifications.

2. Not only are the baseline specifications in the public domain, but many of the HMA architectural specifications are in the public domain in the form of OGC documents, so the only barrier to uptake on HMA technology is appropriate funding.

3. The project is on target to deliver new functionality based on the existing SSE toolkit and a Data Access and Interoperability Layer (DAIL).

4. The functionality that will be delivered by the DAIL will be limited by design decisions that have been made for pragmatic reasons in a changing landscape of what should be reliable interoperable web-service technologies.

5. As the underlying technologies change, and as the requirements of the HMA are driven by the wider GMES project, it is inevitable that changes in the DAIL (and associated toolkits) will be required. This is recognized by the establishment of a HMA project Architectural Board (HAB).

6. The membership of HAB may need to be reviewed to ensure it is forward looking and not limited to just the existing HMA partners (it may not be enough to have mechanisms for adding new members as new missions are added). HAB deliberations should be public (although obviously individual mission implementation timescales should remain confidential if desired).

7. There is considerable prospect for expansion of the HMA into other ESA activities.

8. There are some issues associated with identity management technologies which may slow progress with moving from prototypes to implementation. These are compounded by a potential lack of trust by data providers in (1) the ability of the DAIL to protect information about data/service use by individual users, and (2) the protections that their IPR has within the SOA. (The latter being unfounded in our opinion).

9. While the project has made good use of OGC specs for metadata management, service description and control, there has not been any significant data modelling, and that will limit the use that can be made of OGC web services for data consumption, either within DAIL services, or by DAIL consumers.

10. The current development is based around layers; instruments that provide atmospheric profiles will not be well supported in the initial phases. (This is a consequence of the lack of data modelling and consequential lack of feature-type definition beyond the implicit assumption that the data consists of layers).

11. The permanent testbed to be created as part of the HMA-T project should ease development of HMA compatible services (both those which consume services via the DAIL and those which expose services via the DAIL).

12. The proposed OGC pilot project should expose the HMA technologies for wider constructive critique, and this will be of significant benefit both to GMES and the wider community.

## wsgi, unicode and paste

I woke up this morning with a sore head ... no not from the daemon drink, but because I had a unicode problem. Every time I have a unicode problem I get a sore head. Nearly every time the solution is obvious, but usually I can't frame the question well enough ...

Anyway, this time I was getting this error in a paste application:

    ERROR:root:Traceback (most recent call last):
      File "build/bdist.linux-i686/egg/wsgiutils/wsgiServer.py", line 131, in runWSGIApp
        self.wsgiWriteData (data)
      File "build/bdist.linux-i686/egg/wsgiutils/wsgiServer.py", line 177, in wsgiWriteData
        self.wfile.write (data)
      File "/usr/local/lib/python2.5/socket.py", line 254, in write
        data = str(data) # XXX Should really reject non-string non-buffers
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xad' in position 5710: ordinal not in range(128)


It was fairly obvious that my wsgi application was returning a unicode string where a vanilla string was required. But why was this a problem and what should one do about it?

It turns out that the wsgi spec requires vanilla strings, which means the programmer (me) needs to handle this explicitly. Thanks to Ian Bicking on the paste mailing list the solution is obvious (well it is now):

    def __call__(self, environ, start_response):
        ''' This is an example wsgi application '''
        # go do some real work and return some (possibly) unicode string
        r = somefunction(environ)
        # charset is a parameter of Content-Type, not a header in its own right
        start_response('200 OK', [('Content-Type', 'text/html; charset=utf-8')])
        # encode explicitly, so the server only ever sees a vanilla string
        return [r.encode('utf-8')]

(Update: a couple of useful links on unicode etc: pylons page on internationalisation and the python unicode tutorial)
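The same rule can be checked without a server. The sketch below is written for Python 3 (not the Python 2.5 of the traceback above), where the text/bytes split is enforced by the language. `somefunction` is a placeholder for the real work, and the `start_response` stub only records what it is given; both are assumptions for illustration.

```python
from wsgiref.util import setup_testing_defaults


def somefunction(environ):
    """Placeholder for the real work; returns text with a non-ASCII character."""
    return "<p>soft\u00adhyphen at %s</p>" % environ["PATH_INFO"]


def application(environ, start_response):
    # Encode explicitly: the response body must be bytes, never text.
    body = somefunction(environ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Call the application directly, much as a server would.
environ = {}
setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
captured = {}


def start_response(status, headers, exc_info=None):
    # Stub that records the status line and headers for inspection.
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = application(environ, start_response)
```

Every item in `chunks` is a `bytes` object, so nothing downstream has to guess at an encoding.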

## Roundtripping openoffice and msword - bullets

I'm gradually moving to using openoffice more and more (yes, I think I'd admitted to using msoffice before, but I'm finding it more and more unreliable - especially since my combination of crossover office and ubuntu has stuffed up the font support so pdf output from msoffice is broken). However, to share with some colleagues, I do need to use .doc format ...

And I find that second level bullets get munged in translating from odt to and from doc in such a way that I can't seem to undo it. This problem is apparently well known, but there doesn't seem to be an ubuntu solution (or indeed much comment from the ubuntu space).

If anyone knows one for dapper, I'd be keen to know ...

## python soap library proliferation

Implementations of python soap libraries appear to be like buses. There are none along for a while, and then suddenly there are two in a row. In the beginning there was soapy and zsi, and they became one. Soapy is old and no longer being supported, and the soapy and zsi communities are coalescing on ZSI, under the name Python Web Services.

Recently I found out about tgwebservices, and as I said then, it's enough to make me want to revisit turbogears.

But then today, I discovered soaplib which seems to provide a nice easy interface to soap and can even be deployed in a wsgi stack (I like that, even cleaner than tgwebservices!).

Both of the "new" python-soap stacks make use of decorators, which make the code a good deal cleaner than ZSI (starting later has its advantages), but I'm sad that yet again the python community is fragmenting around a key piece of infrastructure.

I can understand why the proponents of the new python libraries didn't like ZSI, and I should be the last person to criticise someone else for thinking green fields are easier to build on than brown fields (I have a history of doing it myself), but yet ...

While competition (and consequential survival of the fittest) is a good thing, I think our community isn't really big enough to support three fully functional stacks. It sounds like the two new ones might themselves coalesce, which would leave two python soap implementations, and maybe that's supportable, and even good for our community.

Meanwhile, neither of the new ones has a working wsdl2py, which is a key part of service consumption (starting later has its disadvantages), and it's not yet clear how the two companies involved will handle open-source communities trying to build around their babies: it's been hard enough for the ZSI community to avoid forking. So, for the time being, we'll stick with ZSI, and we'll probably contribute our ws-security implementation back to ZSI, but I suppose we'll have to evaluate that decision properly now ... I hate shifting sands.

## Another Nail in the Cosmic Ray Conspiracy Coffin

A number of folk (Nexus via Rabett Run) have picked up on the recent paper by Evan et al. (2007, GRL, 34, L04701) on the inappropriate use of ISCCP data for long term trend analysis.

I liked this paper because of the conclusion, and because they cited us as the location from which the data is available. I wish more folk would do that. We need to be able to demonstrate this sort of thing to those who fund us.

Anyway, the conclusion is pretty unambiguous, and like those who pointed me at the paper, I can't resist quoting the abstract:

The International Satellite Cloud Climatology Project (ISCCP) multi-decadal record of cloudiness exhibits a well known global decrease in cloud amounts. This downward trend has recently been used to suggest widespread increases in surface solar heating, decreases in planetary albedo, and deficiencies in global climate models. Here we show that trends observed in the ISCCP data are satellite viewing geometry artifacts and are not related to physical changes in the atmosphere. Our results suggest that in its current form, the ISCCP data may not be appropriate for certain long-term global studies, especially those focused on trends.

## service orchestration needs data models

Service description languages need to address three classes of identifiers:

• the identifiers (or handlers) of the objects

• the identifiers (or handlers) of the services

• the identifiers of the object descriptions (we have to know what type of objects they are).

I'm sitting here in Frascati in the ESA Heterogeneous Mission Accessibility (HMA)1 workshop, and the last of those identifiers is the elephant in the room. Lots of talk about web service consumption, and lots of talk about application schema of GML, but little or none about data models.

(Never mind service choreography)

1: the link is hma.eoportal.org but it's currently broken.

## I don't care about the flops, I care about the PB

It's good to see the UK has commissioned a new supercomputer: HECToR. The press release is all excited about how fast it will go (theoretically, initially 60 Tflop/s, going to 250 Tflop/s in 2009, with another upgrade in 2011). You have to go to the new hector site itself, to discover the important detail from my point of view:

The systems will be connected by a common infrastructure (Rainer) which will also support 576 Tb of directly attached storage, rising to 1 Pb at the end of the phase.

(it's not clear what "the phase" is, given the multitude of timescales involved, but I'll find out eventually).

This is good news for us, as we're currently providing support for atmospheric modellers using HPCx since the disk configuration there was simply inadequate!

## My personal event horizon is receding too quickly

I feel obliged to know about the various technical things that could impact both on our services and our service developments, which means I live within a little black hole into which I want to aggregate information. (It's a black hole because I don't have enough time to communicate much back out again).

The thing about black holes is that as they grow, the space enveloped by the event horizon grows quickly too. At the risk of pushing this analogy too far, the problem is that as my personal information entropy increases, the surface area of that event horizon grows faster than I can keep track of. There are just too many things I want to know about. I've been here before.

Anyway, today's feeling of knowledge impotence is associated with it being unlikely for me to have time in the near future to follow up on two things that today's blog trails1 have led to:

#### tgwebservices

This via Stephen Pascoe!

Tgwebservices looks like a fascinating toolkit for developing interfaces to python code which support (simultaneously) REST, SOAP and JSON clients. The readme gives this example:

    class InnerService(WebServicesRoot):
        @wsexpose(int)
        def times4(self, num):
            return num * 4

    class ServiceRoot(WebServicesRoot):
        inner = InnerService()

        @wsexpose(int)
        def times2(self, num):
            return num * 2

Those decorators give you for free the following: Assume that ServiceRoot is instantiated with a baseURL of "http://foo.bar.baz/". Here are URLs that are available:

| URL | exposes |
| --- | --- |
| http://foo.bar.baz/ | nothing there... |
| http://foo.bar.baz/times2 | HTTP access to the times2 method |
| http://foo.bar.baz/inner/times4 | HTTP access to the times4 method on InnerService |
| http://foo.bar.baz/soap/ | URL to POST SOAP requests to |
| http://foo.bar.baz/soap/api.wsdl | URL to get the WSDL file from |

This alone is enough for me to revisit turbogears (and maybe cherrypy)!
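To make the URL-to-method mapping concrete, here is a toy sketch in plain Python. It is emphatically not how tgwebservices is implemented: `expose` and `resolve` are hypothetical names invented for illustration, and the sketch ignores SOAP, JSON and type conversion entirely. It only shows the idea that a path such as `/inner/times4` can be resolved by walking attributes from the root object and refusing anything that was not explicitly marked.

```python
def expose(func):
    """Hypothetical marker decorator, standing in for the idea behind wsexpose."""
    func.exposed = True
    return func


class InnerService:
    @expose
    def times4(self, num):
        return num * 4


class ServiceRoot:
    inner = InnerService()

    @expose
    def times2(self, num):
        return num * 2


def resolve(root, path):
    """Walk a path like '/inner/times4' segment by segment to an exposed method."""
    target = root
    for segment in path.strip("/").split("/"):
        # Reject empty paths and private names outright.
        if not segment or segment.startswith("_"):
            raise LookupError(path)
        target = getattr(target, segment, None)
        if target is None:
            raise LookupError(path)
    # Only methods that were explicitly marked may be called from outside.
    if not getattr(target, "exposed", False):
        raise LookupError(path)
    return target


root = ServiceRoot()
print(resolve(root, "/times2")(21))       # 42
print(resolve(root, "/inner/times4")(3))  # 12
```

In this toy version, asking for `/` or `/inner` raises `LookupError`, which corresponds to the "nothing there..." row.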

#### HTTPsec

This via Sean Gillies who led me to David Smith to Stefan Tilkov at InfoQ2 who took me to a rather underwhelming position paper by Mark Baker.

He started with an argument I can agree with; that web-services are actually more tightly coupled than the web paradigm requires, but draws different conclusions than I might. Anyway, in a very thin argument that "the web" (as opposed to "web services") supported security well (I say thin, but some of his points about message level versus transport level are very fair), he mentioned HTTPsec. And that's what got me interested.

HTTPSec:

HTTPsec is a strong authentication scheme for HTTP transactions. It defines an HTTP extension for mutual authentication and message origin authentication, via the integrity protection of a defined set of HTTP message headers. It offers message sequence integrity, forward secrecy, and optionally content integrity and content ciphering.

HTTPsec can authenticate any web traffic between any identities or peers that can provide certificates or RSA public keys. HTTPsec is designed for scenarios where credential-based schemes are inappropriate for architectural reasons or are simply considered too weak. It is also appropriate where message-layer security requirements are not otherwise satisfied by transport-layer or network-layer security protocols ...

That looks rather like a "normal web" WS-security to me. If HTTPsec is a drop-in replacement for WS-security, then I start to be pretty interested ... as WS-security is one very strong reason for being interested in SOAP (because I can pass messages, encrypted or otherwise, through various recipients and still know they've not been tampered with).

(Incidentally, I fail to believe that HTTPsec could be considered a RESTful paradigm, as the whole concept of signed messages blows away one of the pillars on which REST rests: that you want to be able to use transparent caching for multiple clients. That doesn't mean it's not desirable, it just means this is not ammunition for another one of those REST versus SOAP arguments).

#### ... and together?

Of course it doesn't look like there is a python implementation of httpsec, indeed, there is only just (November 2006) a java one. And either way (httpsec or ws-security) if tgwebservices is to be really useful, it'd have to be able to roll with a message level security paradigm.

## Patent Stupidity

Regular readers will know that I haven't much time for software patents. The stupidity of them is all too clear to see, and we've another example this last week. IBM have apparently provided a blanket IPR disclosure on their involvement in the development of the Atom specs. Well done them! Atom is important and that's a good thing!

But the sad part of this is that they felt they had to do this, even though they have:

NO KNOWN patents or applications for patents that read on the Atom specs.

James Snell goes on to say:

However, as is well known, IBM has a massive IPR portfolio and it would take a very long time and cost a lot of money for us to dig through 'em all to know for certain. Rather than spend the time and energy doing that, IBM has agreed to a blanket commitment to Royalty Free terms for any IPR that reads on the Standards Track specifications produced by the atompub Working Group.

So, IBM have so many patents that they don't know if any of them are relevant! How the hell is anyone supposed to develop software that doesn't violate patents if even the patent-holders can't (afford to) search them in any given context?

Don't get me wrong. I'm not whinging at IBM, I'm whinging at the stupidity of the situation wrt software patents!

## Contemplating a move from Leonardo at home

You might think this blog has been quiet lately, but it's nothing compared to my home blog, which has been in stasis for a year or so: mainly lack of time, plus problems running Leonardo in a simple hosting environment (too hard to configure to be the way I want it, cruftless URLs etc). So:

• I could do some major surgery on Leonardo, but given this is for my home blog, it'd have to be done in my time, and there isn't much of that. I've done all the surgery I need for Leonardo at work, and don't even have time to do the surgery I want for Leonardo at work ... besides, although I liked (and like) Leonardo, I think there are better frameworks nowadays ...

• I've considered moving to one of the "major" blog providers or software packages, but I want my own material to be on my own site (call me a Luddite), and I want to be able to tinker.

• I've considered writing my own blogging software (Joe Gregorio has pointed out how easy it is to get started), but I'm often reminded how easy it is to produce a prototype and how hard it is to write a production system, and for home use it needs to be "production-mode" (so although I want to tinker, I don't want to be fiddling incessantly with it, as I have little time). So I figure starting from scratch isn't going to be the way to go.

• The same Joe Gregorio ditched his old software1 and moved to 1812, a significantly improved version of his "throw-away" Robaccia, and a few folk have even started playing with it. What I like most about it is the nice simple backend store, and the use of an Atom Publishing Protocol client - I started work on something similar for Leonardo, but of course it was bespoke, and I never got around to finishing it, but always missed it (I would like to be able to both write blog entries and have an archive of my posts on my laptop for the many times when I'm not connected, e.g. trains). If I wanted 1812 for work, to be comfortable with it, I'd want to add support for trackback, openid, and implicit versioning (all previous versions available and editable). I'd probably have to put the latter in both the client and the server, and obviously it'd need my wiki format, and I'd want to make sure the server could support direct editing (for those times when I have access to someone's browser but can't get my laptop online). So lots of work then. But fortunately, I don't want it for my work blog, Leonardo is safe here for the foreseeable future.

So, I plan to potter along with 1812 for home. I've made this intention public today ... but I wouldn't enter a sweepstake on when/if I get my new home blogsite out if I was you ...

1: Amusing that he did this, given his insistence that folk should work on existing frameworks, but I totally buy the knowledge acquisition argument, I absolutely have to "keep my hand-in", and lightweight mucking around with blog software is one way of doing so. Some of my other attempts at keeping current have gotten into the critical path of activities, and that's not good for someone whose main commitment has to be management. Does this argument contradict my thoughts about "production-mode" at home? Maybe! Do I care? No.

## ipcc-data.org

The BADC has recently won a contract from Defra to deliver some services in support of distributing climate data to the UK and global community. One of those services is the IPCC data distribution centre. The contract was formally signed for a start date of February the 1st, and the eagle eyed in the community will have noticed that a new website appeared on the 2nd of February (the day the IPCC working group one released their summary statement) to replace the previous site at the Climate Research Unit in East Anglia.

Quite clearly we started ahead of the contract signing date, and we kept most of the previous site, but a lot of work was done on cleaning it up, removing broken links and improving performance. We've got a lot more planned too ... but anyway, I want to publicly acknowledge the team who did the hard work, particularly Charlotte Pascoe, Stephen Pascoe and Martin Juckes!

The IPCC data distribution centre is actually three websites, one in Germany, one in the U.S., and ours, so we're looking forward to further integration, and learning from our new colleagues.

## Desktop Search

If only Beagle, or Google Desktop search or any of them, could find a document on my desktop ... no, I mean the wooden thing on which my keyboard resides ...

## Software Design, or, why Chandler is going nowhere

Joel has an interesting review of a book about the Chandler design and philosophy (I came across this via Joe Gregorio). The whole review is well worth a read, but I can't help wanting to quote these bits (sorry Joel, I know it's more than one paragraph, but I want to shout about these three over and over again, and I promise to attribute every time).

... you fell for that old overconfidence trick of your mind. "Oh, yeah, we totally know how to do this! It's all totally clear to us. No need to spec it out. Just write the code."

... you hired programmers before you designed the thing. Because the only thing harder than trying to design software is trying to design software as a team.

I can't tell you how many times I've been in a meeting with even one or two other programmers, trying to figure out how something should work, and we're just not getting anywhere. So I go off in my office and take out a piece of paper and figure it out. The very act of interacting with a second person was keeping me from concentrating enough to design the dang feature. What kills me is the teams who get into the bad habit of holding meetings every time they need to figure out how something is going to work. Did you ever try to write poetry in a committee meeting?

Mind you, I'm not convinced that diving off into an office to solve every problem is possible: despite the Mythical Man Month, sometimes you have to share tasks, and you just have to talk about them. What you do need at every technical meeting is some sort of straw man though!

(And I'm really sorry that Chandler appears to be going nowhere. I want a good mail/calendar/tasks application, and Outlook just doesn't deliver ... and nor does anything else I've tried, albeit all of the ones I've tried may have been hamstrung by having to interoperate with Exchange.)

## Quiet Times buying an LCD TV

This is heading to be my quietest blogging month since I started ... mainly because of a hectic workload, and the fact that we need a new TV, and I can't bring myself to pay the going rate for an LCD TV ...

Some of the time that I might have spent blogging I've spent catching up on email from being out of the office, the rest has been spent combing the web to decide which TV/price combination we like and can afford (and fit in our living room). I can't believe how much LCD TVs cost, nor how many variants just one manufacturer can put on the market ...

Reading reviews is little or no help: "which online" put you off anything except their best buys (and they're not that enthusiastic about them), and half of the reviews for the models I'm interested in are not in English ...

Anyway, I've had enough. I've settled on one, and I've settled on a supplier, and now the thing is out of stock. I'll wait a few days, and hopefully they'll get a supply, and I'll do it, else expect more quiet times: I've got to buy a TV shortly (the NZ rugby season starts soon, and our old one has various fundamental problems) - and I've got even more time out of the office next week, so will be catching up on even more email at night ...

## The Big Storm

Last week's big wind took its toll too: not only did I nearly get stuck in Liverpool (all trains stopped, all motorways bar the one to Manchester closed), my back fence blew over, as did a bunch of trees. So I've got fence building to do ... more wasted time ...

I didn't really see any big winds myself, but as I say, I felt the effects. I missed the fence and trees falling over (asleep at the time), and I missed most of the drama nationally as getting back from Liverpool was a bit of a saga: the most expensive taxi ride of my life (from Liverpool to Birmingham: fortunately a shared taxi pricewise; unfortunately a shared taxi bumwise - ever sat in a black cab for five hours? Don't!) ... followed by a two hour wait in Birmingham for the train crew to materialise (can't really blame the train company for that), and then a slow train home (speed restrictions because of the wind). After all that, I didn't fancy watching the news (on my soon to be replaced TV).

by Bryan Lawrence : 2007/01/23 (permalink)

## ubuntu sound

Yes, I'm still here, but it's just been hectic since the holidays ...

Over the holidays, I did manage to upgrade our home computer to ubuntu edgy (from OpenSuse 10.0). This was in every sense an upgrade, but it had one catch: I lost my sound. It's taken me from New Year til now to find time to do anything about it.

IEC is an input to certain high end sound cards ... Ubuntu turned on those options by default. To turn off those options, run "alsamixer" (without quote) in gnome terminal. You can move around each option with your arrow key (right and left key.) Move to every IEC related options, turn off all of those options (use keyboard "m" to turn off.) After turning off hit "Esc" key to save, and type "sudo alsactl store" so that it is saved permanently.

## Access Control

I've said it before, and I'll say it again. If you have high volume or high value real resources on the web, you need access control!

Back in August I introduced the simple "gatekeeper" methodology we have planned within the DEWS project. This simple idea could be used to "protect" any resource, and in particular, along with the WCS, we could use it for OPenDAP.

We could do it the following way:

• Deploy pydap.org.

• Introduce a layer of wsgi middleware that calls ndg security and provides gatekeeping functionality directly within that application.
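A minimal sketch of what that middleware layer could look like, using nothing but the WSGI calling convention. Everything named here is an assumption for illustration: `Gatekeeper`, `data_app`, `token_in_query` and `call` are hypothetical, `data_app` merely stands in for the real data server, and the placeholder policy is where a call out to ndg security would go.

```python
class Gatekeeper:
    """WSGI middleware that consults a policy callable before passing a request on."""

    def __init__(self, app, is_authorised):
        self.app = app                      # the wrapped, unmodified application
        self.is_authorised = is_authorised  # hypothetical hook into the security layer

    def __call__(self, environ, start_response):
        if self.is_authorised(environ):
            return self.app(environ, start_response)
        body = b"Access denied\n"
        start_response("403 Forbidden", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]


def data_app(environ, start_response):
    """Stand-in for the data server being protected."""
    body = b"some data\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def token_in_query(environ):
    """Placeholder policy: a real one would verify credentials, not match a string."""
    return "token=letmein" in environ.get("QUERY_STRING", "")


protected_app = Gatekeeper(data_app, token_in_query)


def call(app, query_string):
    """Drive a WSGI app directly (no web server) and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"QUERY_STRING": query_string}, start_response))
    return captured["status"], body


print(call(protected_app, "token=letmein"))
print(call(protected_app, ""))
```

The point of the design is that the wrapped application never knows the gatekeeper exists: all the access decisions live in the one callable handed to the middleware.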

We'd never have to modify pydap. But we would need to think about how the clients interacted with the pydap server. Realistically no one uses pydap with a browser: people either use bespoke OPenDAP clients, or they use an application linked to a client library (the most popular of which would appear to be versions of the netcdf bindings).

In either case, it's unlikely that folk would be rebuilding their applications to take advantage of ndg security, so we'd need to deal with an out of band establishment of the security context.

ndg security requires the gatekeeper to have access to a proxy certificate (to identify the data requestor) and an ndg attribute certificate (to assert what roles the data requestor has). If a browser were to contact a gatekeeper, the normal sequence of events would be to redirect to a login service, which would instantiate a session manager instance, which would get the credentials and load them into an ndg wallet instance before redirecting back to the gatekeeper with url arguments which would then be used to populate a client-side cookie with the session id and the session manager address (for future requests).

In the case of OpenDAP (or any other non-ndg specific applications which we might wish to secure with ndg security) we need to 1) obtain the credentials independently of the data request, and 2) we need to communicate those credentials to the gatekeeper with every request without the benefit of a cookie.

We could do the first of these with an equivalent command to the grid_proxy_init concept with the globus toolkit (although our version would probably need the URI of the data object in case specific attribute certificates were required). That command, say, ndg_securitycontext_init could populate a local file based ndg wallet, but there'd still initially be no way of getting the certificates through a client application to our gatekeeper server except via the URL. So we'd not be able to use the contents of the local wallet directly, although it could be queried to provide a token as an argument in the URL which could be intercepted by the gatekeeper.

Ideally the token would be the proxy and attribute certificate themselves, but that would make already cumbersome OPenDAP URLs horrendous. It'd probably be as easy to utilise a remote session manager instantiated by the ndg_securitycontext_init command, and simply provide the address of that wallet in the URL, although we'd have to do this with each request (no cookie remember :-).

Obviously that token would be worth intercepting by the bad guys, because if you can get it, you can pretend to be the data requestor until the token contents time out. We could minimise the risk of this by insisting on using https to communicate with our gatekeeper (and pydap) server, but this would be an overhead on the data transport, which might be a problem for high volume transmissions.

Alternatively, we want to somehow make the token useless if intercepted. A lightweight way of doing this which would stop some lightweight threats would be to ensure the token includes the IP address of the data requestor.
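One way to picture such a token, as a sketch only: sign the session id, an expiry time and the requestor's IP address with a secret held by the gatekeeper, and put only the session id, expiry and signature in the URL. None of this is ndg security itself; `SECRET`, `make_token` and `check_token` are hypothetical names, HMAC-SHA256 is my choice of signing scheme, and the session id is assumed not to contain a colon.

```python
import hashlib
import hmac
import time

SECRET = b"gatekeeper-secret"  # assumption: known only to the gatekeeper side


def make_token(session_id, client_ip, lifetime=3600, now=None):
    """Return 'session_id:expiry:signature'; the IP is covered by the signature only."""
    issued = now if now is not None else time.time()
    expiry = int(issued) + lifetime
    message = f"{session_id}:{expiry}:{client_ip}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return f"{session_id}:{expiry}:{signature}"


def check_token(token, client_ip, now=None):
    """Accept the token only if it is unexpired and was issued for this IP address."""
    try:
        session_id, expiry, signature = token.split(":")
        expiry = int(expiry)
    except ValueError:
        return False  # malformed token
    current = now if now is not None else time.time()
    if current > expiry:
        return False  # token contents have timed out
    message = f"{session_id}:{expiry}:{client_ip}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    # Constant-time comparison, so the check itself leaks nothing useful.
    return hmac.compare_digest(signature.encode(), expected.encode())


token = make_token("abc123", "192.0.2.10")
print(check_token(token, "192.0.2.10"))  # True
print(check_token(token, "192.0.2.99"))  # False
```

A token lifted off the wire then fails when presented from a different address, though it does nothing against someone who can spoof the address or who shares the requestor's machine.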

We could also make sure that the data request itself (the URL) was signed and encrypted by the proxy certificate in the wallet, making it only possible to do replay attacks (the token wouldn't be independently available). We'd end up with a URL something like:

http://ourgateeper.address/encrypted_data_request


but it'd be a nightmare for the user: imagine doing a sequence like the following for every transaction in a matlab or idl session (assuming we've already instantiated our ndg_security_context):

1. identify the data uri

2. run ndg_encrypt_uri programme (which we'd have to provide)

3. copy the new cumbersome url to the application

4. get an error

5. wonder whether we got our copy and paste right

6. realise we had the original data uri wrong (for example)

7. ...

So, that's going to be too hard for real people. We might have to live with a one time token produced by the ndg_securitycontext_init command, which can be appended to "normal" opendap urls ...

I'm sure the real security folks wouldn't like this as much as full ndg security, as it would be possible to spoof the originating ip, and bad guys can exist on the same machine as good guys. However, in practice I think this would be a sufficient level of security for all our datasets, but it'd still be ugly!

Ideally of course the opendap folks would modify their libraries to have an access control call out which would allow any security infrastructure to be bolted in. There is a rumour that the Australian Bureau of Meteorology has contracted them to do something like that, but I've yet to see any details. (And of course ideally we wouldn't have to have our own bespoke ndg security too, but until the mainstream meets our access control needs, we'll continue to have to roll our own!)

Our role in the web ecosystem is to make it easy for scientists to do science, which means making it easy for them to find data, to manipulate it, and to do new things. Making it easy to find things, means giving them as many tools as possible!

The Google SOAP API was my initial motivation for our data discovery deployment. However, along with our SOAP API (which will lie at the backend), we'll have a RESTful vanilla API and probably a suitable version of OpenSearch too. So it's a crying shame (via Tim Bray) that Google is backing away from their leadership position in this area. With every such step they lose a little bit of their magic, and a little bit of the good will that the community had for them - maybe I will try that other search engine after all!

## Service Descriptions

A little over six months ago, I introduced the thorny problem of service binding to my blog. Of course it hasn't gone away. Last week I gave a talk (see the SEEGRID talk on my talks page) about "Grid-OGC collision" in which I made some specific statements, amongst which were:

At the moment WSDL2 is where we are investing our thinking time (ISO19119 is a meta-model for services rather than a SDL)

GRID has more sophisticated service binding, access control and authentication, workflow! The OGC community should not reinvent tooling!

I also made a throw away comment about Service Description Languages along the lines of "there's another one along every week", by which I meant that folk appear to keep getting pissed off with what's available and building their own cut down simpler SDLs.

This was all in the context of knowing that many in the OGC community see the ebXML registry information model (ebRIM, pdf) as a key part of the way forward for solving this sort of problem. (The reason why ebRIM is so attractive is that it allows independent description of services and metadata, with the associations in the ebRIM catalogue being used to produce late binding of what can be done to specific datasets. Of course using ebRIM only postpones the question of what is used to describe the services, and what associations are put in the system.)

Josh Lieberman called me on those assertions, and asked me specifically to say exactly what the Grid community had that the OGC community should use! Of course I dissembled: one of the problems with being in my position is I know just enough to make assertions (and decisions) based on what I've read and heard (via the UK e-science community), but I make little attempt to remember the detail.

Before I get going, I have to confess to some lazy use of the word grid in the context of the talk and this note. Frankly, what I really wanted to do in the talk I gave was to encourage the OGC community to avoid reinventing wheels, and when I said Grid, I really meant: "everyone who is not in the OGC community who are dealing with tightly coupled strongly typed systems and the semantics of their interoperation", that is I meant it in the cyberinfrastructure/e-science/research sense. I should also confess that not being part of the OWS4 initiative also means that there are probably sections of the OGC community who would both agree with my position, and even better, have done something about it!

It turns out that Josh already knows about everything below, but I've persisted in writing it here, because a) it's a useful collection of links for me, and b) it might provoke some useful discussion, either now or at the inception of a future NDG project (if it happens) ... Flawed or not, this is where I am:

1. Ideally we have to put some sort of online resource service endpoint descriptions in our dataset metadata. In this context we can't rely on the RIM alone because the ISO19139 documents may be harvested and used elsewhere where the RIM association content may not be available.

2. From various corridor gossip I've heard that populating and interacting with ebRIM systems may be fiendishly hard.

3. Ideally we want service descriptions themselves to be independent of the dataset endpoint descriptions.

4. If we believe in Service Orientated Architectures at all, then one of the reasons for doing it is to allow orchestration.

5. Orchestration won't spring up ab initio, we need well-described services and things that humans can work with first!

6. The hardest part of service description is in producing a machine readable (and understandable) description of what the service does and what its inputs and outputs are.

7. The OGC community probably lead the world in strong-typing of complex data features (based on the concepts of ISO19101 and ISO19109). However, while the domain modelling in UML is quite mature (e.g. via the HollowWorld), and the serialisation of data objects into GML is mature, the serialisation of feature interfaces has been forgotten.

8. There have been many attempts at building web service description languages (too many to list), and even the odd (incomplete) attempt to evaluate some of them.

9. There are two types of coupling that we care about (tight/loose), and two types of typing (strong/weak) in the services we care about. I've introduced these before. Ideally we should be using something which can support services built around all four combinations.

10. In the case of tight coupling and strong typing, we can build sophisticated orchestration frameworks with real workflows. In this situation a human may be able to build complex workflows without inspecting all the service details themselves simply by relying on tools. However, we need to remember that because the resulting service interfaces are likely to be quite brittle, changing one service interface will mean having to consider rebuilding all the client service interfaces!

11. In the case of loose coupling and weak typing, orchestration is more likely to be done by well informed humans since there is unlikely to be enough machine understandable information in the web service descriptions (however presented) to allow complex automatic detection of what is possible and allowed. (i.e. mash-ups are great, but depend on humans to work out what is possible).

12. Many of us will be building services that belong in both camps, so being able to use the same tooling for both jobs would be helpful!

13. I can't find a description of the OGC GetCapabilities document which would allow one to construct any generic tooling which is not tightly bound to the service it describes (i.e. as far as I can tell the GetCapabilities document can only be consumed by something which already understands the service-type it is looking at). Nonetheless, GetCapabilities does at least expose what data exists under a service (if it is a service which exposes data at all).

14. Even the OGC invented GetCapabilities expecting it would be replaced (page 10 of 04-016r3):

A specific OWS Implementation Specification or implementation can provide additional operation(s) returning service metadata for a server. Such operations can return service metadata using different data structures and/or formats, such as WSDL or ebRIM. When such operation(s) have been sufficiently specified and shown more useful, the OGC may decide to require those operation(s) instead of the current GetCapabilities operation.

15. WSDL2 is a vast improvement on WSDL1.1 in terms of support for the semantic description of data types. In particular, of relevance to the OGC community, the data types can be described by reference to external xml-schema ... i.e. we can use our GML application schema to constrain service types. (Incidentally, the WSDL2 primer is one of the clearest documents I've found in terms of exposing what some of these things do!).

16. The grid community have been building clear concepts of how to build semantic descriptions of web services (e.g. OWL-S - see especially the description of the relationship between OWL-S and other web service description languages.)

17. While there are a number of different ways of annotating wsdl2 to produce semantic descriptions of the services exposed (e.g. SAWSDL and WSDL-S), there are already tools being developed to create and exploit them.

18. People have said really good things to me about SCUFL (Simple Conceptual Unified Flow Language) and Taverna. I understand from the folks at Newcastle that they might not be applicable to the OGC services processes, but the key point is that folk find these sorts of tools useful in real workflows. Even if SCUFL/Taverna aren't appropriate, there is also some work being done with BPEL at OMII which may be more relevant. It sounds like the SAW-GEO project will be looking at this stuff in great detail, and I'm looking forward to their outputs.

19. We've built the concept of affordance into our next version of CSML1. However, we'll not be able to use those in service descriptions within the NDG2 project.

So, what's next?

Of course not all web services are OGC web services, and that's especially true of us in the NDG. The best we will be able to do in the next few months is to provide human readable service descriptions within metadata documents. However, if we ever get to build an NDG3, choosing the right service description methodology will be crucial. We would expect to use WSDL2 and ebRIM, but are hopeful that simpler interfaces to ebRIM will be available by then.

1: which isn't visible at that link yet, but it will be in a month or so; you can see pre-release stuff at http://proj.badc.rl.ac.uk/ndg.

## Swivel Data

I've just found out about swivel.com (via Savas).

The idea is that apparently one can "upload data, any data" and "display it to others visually". It's a nice idea, but it's also grossly false advertising. One cannot upload "any" data and "visualise" it. Maps? Binary data? That said, I can still see utility for some.

It's also not a particularly new idea. Looking through the comments on a techcrunch article about swivel shows a number of other places trying to do a similar thing (http://www.data360.org and http://www.1010data.com). There's also a pointer to a now defunct company with a similar model.

While I wish them well, it's frustrating for those of us with complicated datasets and/or complex long term data management problems. It leads to unrealistic expectations of what can be done with generic data archives.

Issues that they will eventually need to consider:

• Exactly what is the underlying data model? In what format should data be submitted to conform with it? (No, I'm not going to register to find out whether answers to these questions exist behind the login screens. That sort of stuff needs to be upfront!)

• One of the reasons we demand our data providers don't give us spreadsheet data is that spreadsheets inevitably come with inadequate metadata, and it's damned hard to do any automatic cataloguing [1] of data contents etc. So I suspect they may struggle with their cataloguing once they have a lot of data, and the users may struggle with actually using the data for anything real - particularly if they last a year or two and their metadata links start breaking!

• Just like YouTube, they're going to have to worry about licensing! Folk are quite precious about their data IPR!

1: The IR folk are always claiming this isn't a problem, any old text search mechanism will work fine: and that's true provided the text exists and the data model is bloody simple, which it might well be in the case of swivel.

## Service Orientated Architecture - Two Years On!

I've been blogging for slightly over two years. After I wrote my SOA article earlier today, I realised that my second ever blog post was on SOA. Then, as now, I had been reading Savas's blog.

Unfortunately, in the intervening time two things have happened:

• I have spent more and more time down the SOA hole, to the detriment of my Atmospheric Science, and

• I forgot this:

Share schema and contract, not type (avoid the baggage of explicit type, allow the things on either end to implement them how they like).

This was in a discussion of distributed objects versus service orientated architectures. I should have reread that entry and the links within (especially this one) before wading into deep waters! It would have helped focus my argument.

Going back then we have from the original article that inspired my second blog:

The motivation behind service-orientation is to reduce the coupling between applications.

Well, I'm not so sure about that, I think it's really about what Lesley Wyborn in her SEEGRID III presentation called "decomposition" - and the reason why we want decomposition is so that we can have "orchestration". So it's actually about reducing the coupling within applications so the components of application(s) can be reassembled to do new and interesting things - potentially across system and/or organisational boundaries. Having done so, we might still want strong typing (although this paints an interesting perspective on typing and the use of xml schema).

Anyway, I need to get back to thinking about other activities here for a while and forget about SOA details. Fortunately Simon Cox sent me a photo which puts SOA in a proper perspective, and he's kindly allowed me to put it up here:

## who pays the watcher?

The observer/guardian reports that the Met Office budget for climate change "has been slashed".

Or more accurately, that's what the headline says. The main story is a bit more confusing: the budget cuts appear to be over the whole Met Office, not just for climate change, and the slash is a proposed cut of 3% on the main forecasting activity. The story points out that Defra, who pay for most of the climate change work, have not yet indicated whether they are planning a cut. The story quotes an unnamed Met Office spokesperson.

Before I say any more let me firstly state that 1) I haven't spoken to anyone at the Met Office about this, 2) I work with the Met Office, and 3) I am trying to get funding from Defra as well (for supporting the communication of climate change results and storing the data ... I'll blog more of that another day) so I'm not a completely disinterested bystander.

This sort of story is incredibly annoying on a number of levels. The reporters have conflated climate change research with forecasting. The climate change work depends on the supercomputing, but the Met Office has a complex internal market, so it's not obvious that there is much, if any, risk to the climate change work in this cut. Maybe there is a genuine issue there, but I can't tell from the article [1]. I'll have to ask someone (but may not be in a position to blog the answer).

It also provides plenty of ammunition for the septics ... the reason why we should worry about this cut is because this is the hottest year on record? Should we not worry about cuts in a cold year?

There's obviously a problem here. Most of us in the business of climate change research (and support I suppose ... I can hardly claim to be doing climate science per se any more), would argue that more funding is needed to minimise uncertainty in prediction - especially at the regional scale. However, the septics always claim that we preach gloom and doom in order to increase the funding. Catch-22. But articles like this one that barely make grammatical sense, conflate lots of issues (there are others mixed in there too), don't help anyone!

1: Although obviously climate science is a bit of an ecosystem activity: cut the weather research in the wrong place and you'd definitely influence the climate science, but there are bound to be some parts of weather research at the Met Office that don't affect climate science! (Not that I'm arguing that cutting weather research is a good idea either, just asking for clarity in what's actually at risk here.)

## No Silver Bullet Exists

Another bout of web services "religious war" has broken out. We've been here before! This time it's based on one funny and accurate diatribe about SOAP. The resulting frenzy in the blogosphere has yielded some quality comments, and even some declarations of victory by those who think REST is the one true way.

Amongst the furore, there were four key comments:

1. Nelson Minar, whose opinion on SOAP web services has to be respected, states:

The deeper problem with SOAP is strong typing. WSDL accomplishes its magic via XML Schema and strongly typed messages. But strong typing is a bad choice for loosely coupled distributed systems. The moment you need to change anything, the type signature changes and all the clients that were built to your earlier protocol spec break. And I don't just mean major semantic changes break things, but cosmetic things like accepting a 64 bit int where you used to only accept 32 bit ints, or making a parameter optional. SOAP, in practice, is incredibly brittle. If you're building a web service for the world to use, you need to make it flexible and loose and a bit sloppy. Strong typing is the wrong choice.

2. This is backed up by Joe Gregorio:

... if you don't have control of both ends of the wire then loosely typed documents beat strongly typed data-structure serializations.

3. Sam Ruby pointed out that it's just much easier to handle problems when developing RESTful applications. He also makes the following throwaway remark:

In addition to all the architectural benefits of REST, as well as all the pragmatic experience the web has built up over time with caching and intermediaries - benefits and experience that WS-* forsakes ...

4. Gunnar Peterson makes the point that it's not just about unsecured applications (via Sam Ruby):

... But if you are going to say that REST is so much simpler than SOAP then you should compare REST with HMAC, et. al. to the sorts of encryption and signature services WS-Security gives you and then see how much simpler is.
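To make that comparison a little more concrete, here is a minimal sketch of the "REST with HMAC" side in Python. Everything specific in it is an assumption for illustration: the canonical message layout (method, path, hash of the body), the shared secret, and the `/datasets/42` path are made up, and this is not any real service's signing scheme.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical form of the request.

    The canonical form used here (method, path, SHA-256 of the body, joined
    by newlines) is a hypothetical layout chosen for this sketch.
    """
    body_digest = hashlib.sha256(body).hexdigest()
    message = "\n".join([method.upper(), path, body_digest]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   signature: str) -> bool:
    """Recompute the signature on the server side and compare in constant time."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical shared secret and request, for illustration only.
secret = b"not-a-real-secret"
sig = sign_request(secret, "PUT", "/datasets/42", b"payload")

print(verify_request(secret, "PUT", "/datasets/42", b"payload", sig))
print(verify_request(secret, "PUT", "/datasets/42", b"tampered", sig))
```

Note what this does and doesn't give you: message integrity and authenticity between two parties who already share a secret, and nothing else. There is no encryption and no key distribution story here, which is rather the point of the comparison with WS-Security.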

Which brings me to why I wanted to say something. It's just not as simple as some might say, and even Roy Fielding didn't claim that REST solved all the problems in the world! I wonder how many of the vociferous RESTful advocates have actually read his thesis?

Some choice quotes:

Some architectural styles are often portrayed as "silver bullet" solutions for all forms of software. However, a good designer should select a style that matches the needs of the particular problem being solved. Choosing the right architectural style for a network-based application requires an understanding of the problem domain and thereby the communication needs of the application, an awareness of the variety of architectural styles and the particular concerns they address, and the ability to anticipate the sensitivity of each interaction style to the characteristics of network-based communication.

The REST interface is designed to be efficient for large-grain hypermedia data transfer, optimizing for the common case of the Web, but resulting in an interface that is not optimal for other forms of architectural interaction.

REST is not intended to capture all possible uses of the Web protocol standards. There are applications of HTTP and URI that do not match the application model of a distributed hypermedia system.

So where do I stand on all this? In practice, some of the applications of this stuff are simply not "distributed hypermedia applications" (which is what REST was designed for). Some of them really are distributed object activities (the OGC things), and some of the assumptions of REST are violated in the grid world - for example, I'm not terribly interested in automatic caching when my data objects are huge (10 GB+), write performance is as important as read performance, and latency doesn't matter because machines are doing the work. (I'm happy to provide quotes from Fielding's thesis where he lists these things as reasons for REST!)

But we don't just do big science data moving at the BADC. In trying to get my thoughts together on this, I came up with the following classification of communities based on what they are trying to do:

| | Strong Typing | Weak Typing |
|---|---|---|
| Tight Coupling | Grid (Implicit Typing, note 1) | "Secured" Web Applications (note 2) |
| Loose Coupling | OGC (Explicit Typing, note 3) | "Web 2" & Mash-ups |

where

1. Implicit Typing: Nearly all grid applications are file based at the "usage" point or use OGSA/DAI. Either way the applications are tightly coupled to an implicit knowledge of exactly what the contents of the data resources are - the semantic content of the data resources is known to the (human) builder of workflows. I'm not aware of any real attempt to do late binding of services based on the semantic content [1] of the data resources.

2. Secured: Here I'm implying more than just the use of https as a transport mechanism. There is usage of sophisticated AAA mechanisms which include role-based access control - but in the final analysis the actual processes or transactions are relatively loosely coupled.

3. Explicit: The OpenGeospatial Community has built very sophisticated mechanisms for building detailed descriptions of computational objects (features) which match onto the features of the real world. These objects have detailed structures and multiple attributes, may be decomposable, and have implicit interfaces which afford behaviour.

(As an aside, note that in general the tightly coupled systems have strong security. Note also that recent work within the OGC community has been building SOAP bindings and strong security into the OGC web service paradigm. It's unfortunate that that work is being done in the context of a project known as GeoDRM ... which has all sorts of bad connotations ... mostly it's about GeoAAA [2], not DRM!).

All the really good arguments for REST lie in the bottom right corner of the table ... and to be fair, it's also where the majority of web usage should lie too!

The reality is that the availability of tooling plus the type of task makes decisions for me! We're building things that are a mash-up of web-services (which we secure using WS-security) and plain old XML type services (which we secure using gatekeepers that exploit WS-*). We do do some REST things. We're not using much "pure" grid tooling, because the python tooling isn't mature enough yet. But we will.

So there is no silver bullet right now! For me, any sort of fundamentalism sucks. The final word belongs to Nelson Minar:

Truly, none of this protocol fiddling matters. Just do something that works.

(Update, 4th Dec. Aaargh: the material in italics above somehow got lost presumably because I missed the last seconds of my wireless time in the great southern blackspot.)

1: Where I'm meaning semantic content at the level of detail exposed by, for example, GML Application Schema.
2: Authentication, Authorisation, Access.

## A proposal for profiling ISO19139

I've been flagging issues with profiling ISO19139 for some time - see Oct 19 and Aug 15 and especially the comments in the latter.

Over this week a group of us (Rob Atkinson, Simon Cox, Clemens Portele and myself) have been reviewing the reality of problems with profiling ISO19139 and conforming with appendix F of ISO19115.

The following is my summary of what I think we agreed.

#### extensions

We believe the situation is relatively straightforward for extensions to ISO19139:

1. A given community decides on a profile

2. The extensions are defined and documented in UML, so that an extended element/class is a specialised class which

   1. lives in a specific profile package

   2. has a different name, and

   3. has an attribute which documents which iso type it is intended to replace.

3. This new package is then serialised into a new schema,

   • where the new class retains the different name from the UML definition and maintains the gco:isoType attribute to identify the parent (extended iso) element.

4. and instances are validated against that schema using standard mechanisms.

5. Interoperability requires that when these instances are made available outside of the community in which the profile is understood, the producer should transform the document back to vanilla ISO19139 (easily done, since all extended elements can be renamed using the gco:isoType attribute, along with removal of material from the new namespace).
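The "transform back to vanilla" step in the last item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profile namespace `http://example.org/profile` and the `EX_DataIdentification` and `campaign` names are hypothetical, and I'm assuming the gco:isoType value is a prefixed name (such as `gmd:MD_DataIdentification`) whose target lives in the gmd namespace.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
PROFILE = "http://example.org/profile"  # hypothetical community namespace
ISO_TYPE = f"{{{GCO}}}isoType"


def to_vanilla(elem):
    """Rewrite a profile instance in place so it uses only ISO element names.

    Children from the profile namespace that do not declare an ISO parent are
    dropped; elements carrying gco:isoType are renamed to that parent type.
    """
    for child in list(elem):
        is_profile_only = (child.tag.startswith(f"{{{PROFILE}}}")
                           and ISO_TYPE not in child.attrib)
        if is_profile_only:
            elem.remove(child)
        else:
            to_vanilla(child)
    iso_type = elem.attrib.pop(ISO_TYPE, None)
    if iso_type is not None:
        # Assumption: the value looks like "gmd:MD_Something".
        local_name = iso_type.split(":")[-1]
        elem.tag = f"{{{GMD}}}{local_name}"
    return elem


# A toy instance, not a complete or valid ISO19139 record.
doc = f"""
<ex:EX_DataIdentification xmlns:ex="{PROFILE}" xmlns:gmd="{GMD}"
    xmlns:gco="{GCO}" gco:isoType="gmd:MD_DataIdentification">
  <gmd:abstract><gco:CharacterString>Rainfall</gco:CharacterString></gmd:abstract>
  <ex:campaign>extra community material</ex:campaign>
</ex:EX_DataIdentification>
""".strip()

root = to_vanilla(ET.fromstring(doc))
print(root.tag)
print([child.tag for child in root])
```

A real transformation would more likely be an XSLT maintained alongside the profile, but the logic is the same: rename by isoType, and discard whatever only exists in the community namespace.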

#### restrictions

We believe that most communities will be profiling ISO19139 by restriction.

Typical restrictions will include (but not be limited to):

• limiting the cardinality of attributes (e.g. making an optional attribute mandatory),

• restricting the type of some attributes (e.g. forcing something to be a date where currently a date or a string is allowed).

• replacing string options with codelists

In all these cases, the resulting instance documents ought to conform directly to the parent ISO19139 schema since these restrictions have not introduced new semantics to any element/class - they are simply constraints.

Accordingly, where a community profile wishes to restrict a given element class we recommend that:

1. The restrictions are defined and documented in UML, so that a restricted element/class is a specialised class which

   1. lives in a specific profile package

   2. has a different name, and

   3. the constraints are modelled using the Object Constraint Language (OCL), and

   4. the specialisation is stereotyped as a <<restriction>> to guide serialisation [1][2].

2. These constraints are then serialized as schematron [3] commands, which are maintained separately from the profile schema (if it exists - a purely restrictive profile would not require a new schema).

3. Instances are then validated using the standard XML techniques (which should ensure that they are valid ISO19139) and by a schematron processor.

4. Interoperability of these instances is trivial given they directly conform to ISO19139 (there is no requirement by the consumer of such a profile to see the constraint serialisation).

For example:

Right now, the schematron serialisation requires manual (human) interpretation of the OCL to construct the serialisation. It would appear that a direct serialisation of the full power of OCL would be non-trivial, so we are recommending that

1. communities attempt to use the most simple OCL commensurate with their restricting requirements, and

2. publicise their best practice, so that eventually

3. it would be possible to codify the best practice into a "recommended OCL for profile restrictions" document, which would then be amenable to

4. the development of an automatic parser (that, for example, could be built into a future version of ShapeChange).

Thus far the only significant criticism of this schematron serialisation approach is that it might not be possible to trivially build a metadata editor which conforms to any arbitrary profile, since such tooling would need to be able to parse both the schema and the schematron.

The only other possible approach would be the "clone-and-modify" approach at serialisation. In this case, at serialisation time the schema name changes and the element definitions are directly restricted in the new schema. This new schema then looks like, and behaves like ISO19139, but isn't ISO19139: we believe that inevitable governance issues would arise in the maintenance of the serialisation. Further, like the extension case, instances would need transformation when shared outside the community.

However, it would appear that deploying schematron constraints may not be that difficult in some tools. For example, tools exist that can use schematron with XForms.

Further, it is our belief that it will be easier to deal with hierarchical (and even multiple-inheritance) profiling using the OCL approach as well. For example, an organisation generating metadata may belong to multiple governance domains (e.g. the BADC is a British institution, producing atmospheric data: one might expect our ISO19139 profiles to conform to both the British and WMO standard profiles). It would be easy to test this for restrictions: we simply validate using both schematrons independently!
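The "validate against each profile independently" idea is easy to sketch. What follows is a stand-in only: a real deployment would hand the instance to a schematron processor, whereas here each profile is just a dictionary of named Python checks. The two rule sets and the toy record are hypothetical and are not the actual British or WMO profiles.

```python
import xml.etree.ElementTree as ET

NS = {"gmd": "http://www.isotc211.org/2005/gmd"}


def require(path, minimum=1):
    """Build a rule that passes when at least `minimum` elements match `path`."""
    def rule(root):
        return len(root.findall(path, NS)) >= minimum
    return rule


# Hypothetical restriction rules, standing in for two separate schematron files.
BRITISH_RULES = {"abstract is mandatory": require(".//gmd:abstract")}
WMO_RULES = {"at least one topic category": require(".//gmd:topicCategory")}
PROFILES = {"british": BRITISH_RULES, "wmo": WMO_RULES}


def validate(root, rules):
    """Return the names of the rules which the instance fails."""
    return [name for name, rule in rules.items() if not rule(root)]


# A toy record (not a structurally complete ISO19139 document).
record = ET.fromstring(
    '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">'
    "<gmd:abstract>Radiosonde ascents</gmd:abstract>"
    "</gmd:MD_Metadata>"
)

# Each profile is checked on its own; neither needs to know about the other.
failures = {name: validate(record, rules) for name, rules in PROFILES.items()}
print(failures)
```

The design point is that the rule sets never need merging: conformance to several governance domains is just the conjunction of independent checks on the same (still plain ISO19139) instance.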

1: The use of a stereotype for serialisation guidance is the methodology followed in ISO19139 itself. We've chosen restriction here. Although we're not totally comfortable with the particular choice of word, none of us could come up with a better one.
2: It was suggested that we could do without the stereotype since the serialisation code could simply identify targets for serialisation into schematron alone simply by the fact that the specialisation consisted of OCL alone. However, it is possible that profile maintainers might choose to create real specialisations in an extension (for example, by categorising a particular class into a number of different restricted specialisations, each of which would need to be named). Although that's an extension case, the parser needs to be agnostic as to whether it's an extension or restriction as it processes each element from the UML.
3: It may be there are other ways of implementing OCL, but schematron seems to be the tool of choice at the moment.

## Wireless Internet Blackspot - Australia

I've spent most of the week in Canberra, Australia, attending three different events - a standards workshop, AUKEGGS and the SEEGRID III conference (programme pdf).

It's been quite a frustrating experience for communication:

• The hotel I'm staying in has no high-speed internet (despite claiming to do so on its website!). I would have moved but there appear to be no other hotel rooms available in this town for love nor money (one of the conference attendees has been sleeping on local sofas!).

• None of the three venues had wireless that could be configured for public access, nor people available to deal with registration onto available networks.

• The public hotspots (like the one I'm sitting in now) are filthy expensive ($26 Australian for two hours connectivity!), and provide poor connectivity (I can't get my VPN to get up and stay up).

I can get email (at this hotspot), but I have to do it through the braindead outlook web interface ... which in practice means I'm cherry picking a handful of things that a) I spot amongst the deluge, and b) I can deal with. The rest is just building up ... ready to be a millstone around my neck next week.

In a wired and wireless world, connectivity matters. I can't afford to be this out of touch. Do I want to come back to Australia under these circumstances? Nope!

## Common search terms at badc

Here are the top twenty search terms recently requested on the badc site.

1. rainfall
2. wind
3. temperature
4. sst
5. wind speed
6. hadisst
7. ecmwf
8. rain
9. hadcm3
10. cet
11. radiosonde
12. precipitation
13. ozone
14. midas
15. soil temperature
16. solar radiation
17. humidity
18. faam
19. sunshine
20. solar

Proof positive, I think, that we need to call the ontology server in our search backend, so that search terms actually match the right datasets (currently rainfall matches 5 datasets, rain 13, and precipitation 3).
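What I have in mind is nothing cleverer than expanding each search term into its equivalents before hitting the index. A minimal sketch in Python follows; the equivalence groups, the dataset identifiers and the in-memory index are all made up for illustration (the real thing would ask the ontology server rather than a hard-coded list).

```python
# Hypothetical equivalence groups, standing in for a call to an ontology server.
EQUIVALENT_TERMS = [
    {"rainfall", "rain", "precipitation"},
    {"sst", "sea surface temperature"},
]

# Hypothetical keyword index: search term -> dataset identifiers.
INDEX = {
    "rainfall": {"dataset-a", "dataset-b"},
    "rain": {"dataset-b", "dataset-c"},
    "precipitation": {"dataset-d"},
}


def expand(term):
    """Return the term together with any terms considered equivalent to it."""
    term = term.strip().lower()
    for group in EQUIVALENT_TERMS:
        if term in group:
            return set(group)
    return {term}


def search(term):
    """Look up every equivalent term and return the union of the matches."""
    hits = set()
    for candidate in expand(term):
        hits |= INDEX.get(candidate, set())
    return sorted(hits)


print(search("rain"))
print(search("Precipitation"))
```

With expansion in place, the three spellings of "wet stuff falling from the sky" return the same set of datasets, which is the behaviour the raw keyword match currently lacks.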

## status of OGC specs

At a recent internal meeting we found the following little table helpful:

| name | commonly implemented | current spec | draft spec |
|---|---|---|---|
| WCS | 1.0 | 1.1 | |
| WFS | 1.0 | 1.1 | 1.2? |
| WMS | 1.1.1 | 1.3 | |
| WPS | | | 0.4 |

What's important to note are

• that implementers lag behind specs, and that's ok

• specs sometimes are not backward compatible, and that's ok (if there is a good reason),

• the newer specs are much better for our community than the old ones, but there is much to do.

Update, December 7th: I'm told that even where implementations claim to adhere to the older specs, many are not even complete implementations of those specs.

## browser crap

As is fairly obvious from the range of dribble that appears on this website, I dabble in many of the technologies my team work with. One of the things I've been doing is developing the interface to the new NDG discovery service (this is the sort of thing I find I can do on the train home from meetings when my brain is too addled to think properly as it's kind of mechanical: make mistake, fix it, move slowly on - it beats sudoku and patience and the other stuff I see tired folk do at the end of the day).

Anyway, this is about browser crap. The following simple piece of css is designed to support identical buttons (whether "real" buttons, or pseudo buttons which are really hyperlinks). I developed it on firefox, and then looked at how it looked under IE and konqueror.

Oh how glad I am that I don't do interfaces for a living.

Here's the CSS:

a.button {border: 1px solid black; background-color: #F4F4F4;
text-decoration:none; color:black;}

button.button {border: 2px solid black; background-color: #F4FFFF;
text-decoration:none; color:black;}


And this is what we get:

• Firefox:

• Konqueror:

• Internet Explorer:

Note the vertical alignment of the "real" button, and the font differences under IE. Yuck. As it happens I've decided I don't really want them to look alike, but I felt like bleating about how crap browsers are at consistent CSS support (and this for a really simple piece of CSS usage).

Once upon a time I mentioned that we now have a Megane estate car. I raved about its fuel economy.

Today I was confronted with two light bulb failures to fix: the reversing light, and the front left indicator. No problem I thought. Well, thus far I've failed to fix the reversing light (but it looks doable), and I failed to replace the indicator light too. Fortunately I didn't have to, because at the point where I'd just about given up trying to fit my hand up an impossibly small hole, while turning the air blue about the quality of French engineering, I resorted to the Internet. Luckily (thanks Google too) I found this, which I'm repeating in its entirety because I trust me more than yahoo to keep this for the lifetime of my car:

... lock your steering out left or right depending on which side is blown you will see a removable cover which you might need a screwdriver to pop off..now this is tricky and make sure you take off your watch or your arm will get stuck put your hand up and feel around for a flat plastic leg about an inch long this is the holder that the bulb sits into then screws into the lamp most of the time the bulb isn't blown its just this holder comes loose so turn on your indicators and twist the holder to tighten it and they more than likely will start working again as this closes the contacts between the bulb and the lamp...if the bulb is blown twist the other way and keep pulling back while twisting so it will pop out when the legs get into position you will have to fiddle around the bulb/holder to get it down and out the gap in the panel as there isn't much room usually u have to take your hand out to give room for the bulb to come down then change the bulb and you should grease the rubber seal on the holder just makes twisting it back easier..fitment is the exact opposite of removal but it is tricky and will have to bend your hand over and back and twist around and in and out but if you have the patients you should have no problem...but say a prayer you just have to tighten it not change the bulb!

## taking OGSA DAI seriously again

Early on in the evolution of the NERC DataGrid we investigated OGSA/DAI, which is a "data access and integration" component of the Globus stable. We rejected it for a number of reasons, chief of which were that the software was immature, and it didn't seem to offer much more than what has recently been termed WS-JDBC (albeit perhaps with a dash of WS-XML:DB API).

Of course the years roll on, and maybe we should have revisited it, but it has always seemed like the rest of the globus toolkit - a good idea, a bit immature, maturity just over the horizon - maybe useable next year ... with the same fatal flaw: for a group with relatively little spare engineering time, it was always over the horizon, every year!

It still looks like for us it'll be next year (technically "next year" now means "in some successor project to NDG2"), despite the fact that even within the Met community it's gaining some traction. However if it's going to be next year, we have to know some more about it sooner than that, so it seemed good to see that Ian Foster had chased up on the criticism by getting Malcolm Atkinson to say a few words.

The trouble is, even after reading the article, I'm not convinced we should investigate further. Malcolm's points were essentially:

1. it supports multiple backend formats (OK, but this actually just means you still have to know how to query and understand the schema which defines the backend content) ...

2. it is extensible ("OGSA-DAI has three popular extensibility points, the data resource adapters, the activities and the client libraries") ... hmmm in what way is JDBC not extensible in the same sense? (OK, I know the answer to that, it's obvious that OGSA-DAI would allow a consistent framework for access to multiple backends, but the other two are surely the same? Still, score a point for OGSA/DAI).

3. "... OGSA-DAI contains a variety of multiple-data source functions, such as DQP (still in prototype), and multi-site query facilities which deal with partial availability." Hmmm, well I understand the words "Distributed Query Processing" but don't really understand how that is much use unless the backend resources are pretty homogeneous (never the case for me). The rest is a mystery to me.

So, I'm still not enlightened. Which is a source of frustration (it always seems like Globus nearly offers us what we want), however, I can see one minor place where OGSA/DAI might have helped us in NDG2, but I can't help wondering whether the investment in effort would have been commensurate with the reward ... given that my problem still remains that for all the problems which I find interesting, we worry about the nature of the things we store (their "feature-type"). For us, describing those features in a way that is queryable and can be interpreted by client software is the domain of the OGC webservices. It's not clear to me what role OGSA/DAI could play for us in that context (although there is a "delivery" component of OGSA/DAI which is nagging at me in the context of asynchronous delivery of data, which could be important for WPS or WCS with big data objects).

In the UK we have a couple of projects running under the banner of "Grid-OGC collision", at least one of which I think seems to be aiming to confront OGSA/DAI with OGC WFS/WCS. As I say, I can't see the mileage in it myself, but I've been wrong before, will be wrong again, and am glad someone else is doing the investigating. If there is an NDG3, we'll be looking to those projects to guide us as to whether there really is a role for OGSA/DAI in our activities which is beyond a putative "WS-JDBC".

## climate change speed

Climate change is one of those (many) phrases that every reader/listener interprets in their own way. James Annan has interpreted it in one way in a discussion which I'll summarise as "the a priori [1] position that climate change would be detrimental has no scientific/logical underpinning". (Well, actually I think he's really expressing the converse: "that no change is the best possible outcome, is not necessarily wrong"; but it amounts to the same thing.)

Put in yet another way: Given that it's unlikely that some cosmic conspiracy has put us at a maximum of some sort of climate-ecosystem-human-society efficiency, it's hard to argue a priori that change would be a bad thing, as it might take us somewhere better. (The metric of exactly what is better is irrelevant to the argument, but I was disappointed that James' arguments were all pretty western hemisphere-acious.)

However, I think there's a fatal flaw in this argument. I think one can argue a priori that the speed of climate change could be detrimental, even if the place one ended up was somehow "better" (always assuming we reached an equilibrium before economic/ecosystem meltdown). It seems quite clear that neither existing human societies, nor natural ecosystems (if such still exist) can respond to a rapidly changing environment without detrimental impact.

So, my point is that climate change is a problem both in terms of magnitude and rate of change. Some of us are as much worried about the speed of change as we are about the eventual magnitude. (James does allude to the issue of rapid climate change in passing, but I think misses its importance in discussing whether the status quo may or may not be better than the result of climate change.)

Given that the speed of change is at least as important as (if not, at some point, more important than) the magnitude of the change, arguments about our ability to adapt need to be taken with a grain of salt. The first question should not be whether we (or our ecosystems) can adapt enough, but whether they can adapt as quickly as things are going to be changing.

I think we now know that the speed of change is going to be a problem, but I would argue that actually the most likely a priori position would have been to expect just that: Anthropogenic climate change is most likely to be detrimental - if only because of the speed! And unlike James, I do think this is a logical position!

1: Here my definition (and I think that of James) is that a priori means "before we do/did any calculations that actually show the expected climate change would be detrimental".

## Goodbye Korea

I'm sitting typing this in Incheon airport (which is unbelievably quiet, nearly everything is closed, and while it's now 10 pm, stuff was closed when I got here around 8.30 pm).

Sure enough I didn't get out of the hotel ... not enough time at lunch time and complete exhaustion last night. But I had a fabulous view out of the hotel window, and all my colleagues were raving about how nice a city Seoul is. I must come back (and I must remember not to even consider driving ... the traffic is a nightmare).

The conference itself was a bizarre mixture of overview talks with (from my perspective) little content, and some detailed descriptions of testbeds and exemplar systems (like our NERC DataGrid). I think nearly everyone in the audience must have spent half their time frustrated: the technical folks bored with the overview talks, and the WMO representatives trying to find out about the planned WMO Information System, buried with acronyms and unfamiliar concepts in the technical talks. Still, most folk seemed to be getting something out of it.

From an NDG perspective, what was interesting is how much everyone is converging on the same technologies. OAI is everywhere. Everyone is struggling with ISO19139 (some more aware of the ramifications than others). OGSA/DAI is making a come-back. The WIS will interact with academia far more successfully than earlier WMO systems (while WMO Resolution 40 rears its ugly head as far as inter-country interoperability is concerned, at least it protects the doing of science).

Now for the ludicrously long flight back to the UK (via Dubai - don't ask!)

## Sanity from Hulme

Today's posts may be giving some the wrong impression about what I think about climate change. I've been decrying attacks on foundations, not supporting what Mike Hulme terms "The Discourse of Catastrophe", that is, the whipping up of a "State Of Fear" (ouch, could it be that Crichton at least got that bit right?).

Of course what we really need is reasoned discussion about adaptation and mitigation, and the development of a proper understanding of what climate change might actually be.

Just to set the record straight (thanks for pointing this out James), I thoroughly agree with Mike's article, which concludes with:

I believe climate change is real, must be faced and action taken. But the discourse of catastrophe is in danger of tipping society onto a negative, depressive and reactionary trajectory.

(And still no sleep ... I go back to bed to try between posts ... honest).

by Bryan Lawrence : 2006/11/05 : Categories climate environment crichton : 0 trackbacks : 0 comments (permalink)

## Jetlagged thoughts on amateurism

I knew that I might come into some criticism for using words like "Pretentious" and "dribble" and "drivel" about Monckton's "article" in the Sunday Telegraph. It has started already (see the comments). One of the problems about my using this sort of language is that it opens the door to others using the same sort of language about me ...

Fair enough. I suppose I should have tried to be more temperate. I will try (for example, I've severely censored my original version of what follows). But it makes me very VERY annoyed that newspapers give this sort of stuff oxygen.

Why?

Last week I was driving from somewhere to somewhere, and en route I listened to an interesting piece on Radio Four about Poincaré. One of the things they said resonated: Poincaré was working at a time (the last time!) when it was possible for an outstanding mathematician to understand in detail the entire breadth of mathematics.

I don't know when the last time was that it was possible for someone to understand in detail the entire breadth of atmospheric science, but it sure isn't now! An observational atmospheric scientist will not understand numerical modelling in detail (heck, even a modeller is unlikely to understand in detail all of the GCM she/he is using, one has to rely on experts in other fields, for example, a GCM has an atmosphere and an ocean, Q.E.D.)

So how do we cope? Peer Review is how we cope. We rely on a mechanism which ensures (as far as possible, it's not perfect) that the pieces we put together are validated by peers - that is, by people who do understand in detail those pieces - and the joining together of the pieces is validated by people who understand the joining together, and then the interpretation is validated by people who understand the methodology of interpretation. And all the while we try and include quantified uncertainty, and probability estimates, and caveats, caveats and more caveats. And sometimes we find fault, we find errors in what has been done and published, and so we redo, we improve, and we move forward.

Then someone comes along, drives a truck and trailer through this, simply cannot have done due diligence, and hacks away at some poorly understood detail. The whole process is damned in a few words. The work of hundreds of scientists is attributed to "the UN" as if some civil servant somewhere had produced a briefing paper, rather than the IPCC process being about the best synthesis of knowledge about climate it is possible to create (I'm not saying it's perfect, but I am saying there is currently nothing better). I'm sorry, but the day of the educated amateur has pretty much gone (and Richard Lindzen, if you ever read this, the day of the MIT professor knowing all there is to know about everything has gone too!)

With no peer review, just a dose of editorial whimsy, Monckton's thoughts get read by more people than any of the thoughts of the hundreds of individuals who have spent years contributing to understanding the climate.

Don't anyone mention "balance" either. If the ST published reviews of the hundreds of papers that have been peer reviewed and gone into this work, then fair enough, they could publish the other stuff too - but this isn't balance, it's the equivalent of putting Monckton's gnat on a giant seesaw with a few battleships on the other end, and claiming his end would sit on the ground.

So, should I waste my time reading his article through properly, and damning the arguments? No I shouldn't. I blog for many reasons, sometimes for fun, sometimes to contribute my pieces to public understanding of science, sometimes for catharsis - like now - and sometimes professionally to provide notes to myself and my colleagues. Responding might fit in the Public Understanding category, but I think my time would be better spent elsewhere. And when he gets his thoughts published in a serious journal, I'll take them seriously. Frankly, it's this sort of nonsense that stops some of my colleagues blogging. They simply don't want to open themselves up to the tedium of responding. The nice thing about peer-review is that rubbish only has to be rejected by two or three reviewers. We don't all have to waste our time.

I'm sorry if that sounds elitist. It's not a cult of the elite. I'm more than happy for anyone to try and distil the state of knowledge, and in my own little way, I sometimes try and contribute to the public understanding of science, but damn it, enough's enough ...

Yes, it's five a.m., and I'm jetlagged, and can't sleep, and this hasn't helped.

by Bryan Lawrence : 2006/11/05 : Categories environment : 1 trackback : 1 comment (permalink)

## Another thing I would do if I had time ...

... would be to take this load of pretentious dribble apart. I can't actually bring myself to read it through. It's introduced by an equally drivellacious article in the Sunday Telegraph. That article included this gem:

Dick Lindzen emailed me last week to say that constant repetition of wrong numbers doesn't make them right.

What Monckton needs to consider is that the constant repetition of discredited arguments doesn't make them credible. It's a delicious irony that he can't see that ...

What's not amusing is that some hardworking folk are going to have to take more time to discredit it (and it'll be easy). I just wish I had the time to add my voice, but I suspect it's better for me to be working on making sure that the raw material (credible numbers) are discoverable, documented and easily manipulated.

I came across this stuff following up from William (especially the comments) which led me to Tim Worstall. I was interested in trying to understand the arguments about the economic scenarios. I have to confess that this issue is my Achilles heel: I've never taken the time to really understand the economic plausibility or otherwise of the various scenarios. I plan to change that over the next six months or so, in particular as the fourth assessment report is released (I've promised myself to read every word).

## Not getting to see Seoul

I'm in Seoul for two days, and will have spent nearly as long getting here and going home as I've got here. When I was younger I used to want a jetset lifestyle, it sounded so glamorous. The truth is far more mundane: long distance air travel is uncomfortable, of dubious morality1, and oftentimes one doesn't see anything at all. This trip I've arrived in the dark, will leave in the dark, and have two full days work to do. I might, just might, leave the hotel and see a local restaurant tomorrow night (but I may collapse from jetlag and do room service, as I've done tonight) ... Sorry Korea, you deserve more time (I have the interest, but this is not the time!)

I'm here to give a talk on the data discovery work we are doing in the NERC DataGrid, as part of the World Meteorological Organisation's Technical Conference on WMO Information Systems (TECO-WIS). The talk I'm giving is this:

The Natural Environment Research Council's NERC DataGrid (NDG) brings together a range of data archives in institutions responsible for research in atmospheric and oceanographic science. This activity, part of the UK national e-science programme, has both delivered an operational data discovery service, and built and deployed a number of new metadata components based on the ISO TC211 standards aimed at allowing the construction of standards compliant data services. In addition, because each of the partners has existing large user databases which cannot be shared because of privacy law, the NDG has developed a completely decentralized access control structure based on simple web services and standard security tools.

One of the applications has been the redeployment of an earlier data discovery portal, the NERC Metadata Gateway, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI/PMH). In this presentation we concentrate on the practicalities of building that discovery portal, report on early experiences involved with interacting with, and harvesting, ISO19139 documents, and discuss issues associated with deploying services from within that portal.

1: I only do it in those situations when I think "being there" will make a difference (ret).

## The Future of Physics and Science

• A few days ago the BBC carried a news item: science students have to work (much) harder (in terms of time) than arts students at university.

• Student fees are starting to bite.

• Not many jobs in the paper have "Physicist" in the title, even though many employers may well hire physicists in preference to other graduates to fill posts.

• Yesterday I discovered that the University of Reading is closing its physics department - not enough students enrolling.

These things are not unconnected! Deja vu. As a former physics academic, I've seen it all before ... (in another country :-).

But it's not just physics that suffers: who's going to do all the hard environmental science then?

## The Stern Way Forward

Yesterday I waded through the first two thirds of the Executive Summary ... I thought it best to finish it today, otherwise I would be at risk of not only not reading the whole thing, but not making it through the executive summary :-)

The way forward he proposes depends on three pillars:

• Establishing a price for Carbon,

• Technology policy,

• The removal of barriers to behavioral change.

In terms of pricing Stern recommends use of one or more of tax, trading or regulation, with the mix depending on choices within specific jurisdictions. I have to say, without reading the main text, I don't understand how he can argue that different jurisdictions can achieve carbon prices in different ways (surely it would lead to some form of carousel fraud) ... but I'm no economist, so ok ...

Policy incentives include

... technology policy, covering the full spectrum from research and development, to demonstration and early stage deployment ... but closer collaboration between government and industry will further stimulate the development of a broad portfolio of low carbon technologies and reduce costs.

I find it (amusing, sad, worrying) that the report suggests that existing policy incentives to support the market should only increase by two to five times. This at a time when the UK incentives run out half way through the year. That suggests to me that an order of magnitude increase is necessary (after all, the take up is relatively low, and even with incentives one has to be pretty wealthy to get into home generation).

In terms of behavioural change, he makes the point that:

Even where measures to reduce emissions are cost-effective, there may be barriers preventing action. These include a lack of reliable information, transaction costs, and behavioural and organisational inertia. ... Regulatory measures can play a powerful role in cutting through these complexities, and providing clarity and certainty. Minimum standards for buildings and appliances have proved a cost-effective way to improve performance, where price signals alone may be too muted to have a significant impact.

The clear message throughout is that the market can't do this alone! From the obvious point that carbon costs are an "externality" (the producer of carbon dioxide does not themselves pay the costs), through to the reality that regulation and taxation are going to be necessary to begin to change minds - I reckon hearts will follow (if they're not already there!)

Another obvious (to me) point is that we need to start thinking and planning about adaptation now! We have some decades of climate change ahead of us, regardless of what we can achieve in changing emissions!

We hear a lot about how there is no point in the UK doing anything because it contributes only 2% of the global emissions. However, it was good to read

... China's goals to reduce energy used for each unit of GDP by 20% from 2006-2010 and to promote the use of renewable energy. India has created an Integrated Energy Policy for the same period that includes measures to expand access to cleaner energy for poor people and to increase energy efficiency.

It would be nice to hear concrete proposals from the U.S. and Australia!

I'll leave the last word to Stern:

Above all, reducing the risks of climate change requires collective action. It requires co-operation between countries, through international frameworks that support the achievement of shared goals. It requires a partnership between the public and private sector, working with civil society and with individuals. It is still possible to avoid the worst impacts of climate change; but it requires strong and urgent collective action. Delay would be costly and dangerous.

by Bryan Lawrence : 2006/10/31 : Categories environment (permalink)

## Stern Facts

Like William Connolley I doubt I'll ever read the whole thing, but it's intriguing to wade through the 27 page executive summary at least.

Under a BAU scenario, the stock of greenhouse gases could more than treble by the end of the century, giving at least a 50% risk of exceeding 5°C global average temperature change during the following decades. This would take humans into unknown territory. An illustration of the scale of such an increase is that we are now only around 5°C warmer than in the last ice age.

Well, that 5 degree figure is a bit hard to fathom (although I think he uses that with respect to the period 2100-2200), but the comparison with the scale of the change since the last ice age is rather a good one. Even if the real number might be 2 to 3 degrees C, it puts things in perspective somewhat - even those of us who may claim to be professionals still have to get a grip on the emotional reaction that a few degrees C isn't that much really. Put like that, it obviously is!

The disaster list is pretty awesome:

• Water supply problems (Melting glaciers: initially a flood risk, lead to a fall in water availability, not to mention changes in water availability associated with changing weather patterns) ...

• Declining crop yields (particularly in the higher range of predictions)

• Death rates from malnutrition, heat stress and vector borne diseases (malaria, dengue fever, etc.) increase ...

• Rising sea levels ... threatening the homes of 1 in 20 people!

• Ecosystem melt down (15-40% of species for only a 2C increase!) (Plus ocean acidification with unquantifiable impact on fish stocks)

Then:

Impacts on this scale could spill over national borders, exacerbating the damage further. Rising sea levels and other climate-driven changes could drive millions of people to migrate ...rise in sea levels, which is a possibility by the end of the century... Climate-related shocks have sparked violent conflict in the past, and conflict is a serious risk in areas such as West Africa, the Nile Basin and Central Asia.

But maybe this will get more attention in the City:

At higher temperatures, developed economies face a growing risk of large-scale shocks - for example, the rising costs of extreme weather events could affect global financial markets through higher and more volatile costs of insurance.

I find all the arguments about changes in GDP difficult to follow, possibly because one never really knows where the baseline is (unless one is an economist), but this seems a pretty straight forward statement:

In summary, analyses that take into account the full ranges of both impacts and possible outcomes - that is, that employ the basic economics of risk - suggest that BAU climate change will reduce welfare by an amount equivalent to a reduction in consumption per head of between 5 and 20%. Taking account of the increasing scientific evidence of greater risks, of aversion to the possibilities of catastrophe, and of a broader approach to the consequences than implied by narrow output measures, the appropriate estimate is likely to be in the upper part of this range.

I've blogged before (Jan 2005a, Jan 2005b, Jul 2005, and Aug 2005) about the future of the oil economy, but maybe I've been on the wrong tack:

The shift to a low-carbon global economy will take place against the background of an abundant supply of fossil fuels. That is to say, the stocks of hydrocarbons that are profitable to extract (under current policies) are more than enough to take the world to levels of greenhouse-gas concentrations well beyond 750ppm CO2e, with very dangerous consequences. Indeed, under BAU, energy users are likely to switch towards more carbon-intensive coal and oil shales, increasing rates of emissions growth.

The economic analysis makes it clear that there is a high price to delay. As he says:

Delay in taking action on climate change would make it necessary to accept both more climate change and, eventually, higher mitigation costs. Weak action in the next 10-20 years would put stabilisation even at 550ppm CO2e beyond reach - and this level is already associated with significant risks.

#### The Good News

He thinks there is a way out:

Yet despite the historical pattern and the BAU projections, the world does not need to choose between averting climate change and promoting growth and development. Changes in energy technologies and the structure of economies have reduced the responsiveness of emissions to income growth, particularly in some of the richest countries. With strong, deliberate policy choices, it is possible to 'decarbonise' both developed and developing economies on the scale required for climate stabilisation, while maintaining economic growth in both.

I think the Aussie and the American governments understand the last part of this (opportunities), but they want to somehow avoid the first part ... (costs):

Reversing the historical trend in emissions growth, and achieving cuts of 25% or more against today's levels is a major challenge. Costs will be incurred as the world shifts from a high-carbon to a low-carbon trajectory. But there will also be business opportunities as the markets for low-carbon, high-efficiency goods and services expand.

For those of us in paranoid Europe, worried about Russian control of our gas supplies:

National objectives for energy security can also be pursued alongside climate change objectives. Energy efficiency and diversification of energy sources and supplies support energy security, as do clear long-term policy frameworks for investors in power generation.

the social cost of carbon will also rise steadily over time ... This does not mean that consumers will always face rising prices for the goods and services that they currently enjoy, as innovation driven by strong policy will ultimately reduce the carbon intensity of our economies, and consumers will then see reductions in the prices that they pay as low-carbon technologies mature.

I've got that in the good news section on the grounds that the clear message is that the cost of doing something about this isn't going to rise and rise, but it isn't good news for our next thirty years. I fear for the tourism and export agriculture of countries a long way from anywhere else (e.g. New Zealand!)

At this point I'm on page seventeen, and I'm tired ... more soon!

Update, 31 Oct: See James Annan for a critique of the science part ..

## Subtle Discipline Drift

Recently Oxford University advertised for a Lectureship in Atmospheric Physics, and Imperial College is currently advertising for a raft of positions, from lectureships to professorships. In some ways I was and am tempted (despite having supped at the font of "proper" academia before, and having been burned ... a lectureship, in NZ at least, being far far more stressful than what I do now :-). I do love atmospheric science.

However, while I'm tempted, I'm also realistic. In the past five years, I've drifted (and sometimes been pushed) towards what has recently in the UK been called e-science (that name is now deprecated, who knows what we'll call it next year!). I no longer read the atmospheric science journals, not even the title pages, so I haven't a clue what's in the literature. One has only to look at what I blog about nowadays to realise that whatever I'm doing now, it's not atmospheric science per se - although everything I do is predicated towards making the doing of atmospheric science easier.

I keep telling my staff that one should always be appraising one's career options, assessing what doors are closing and opening as time goes by, and carpe diem etc. So, here's my assessment of one of my options right now. Given I'm so out of touch, I'm not sure I'll even feel comfortable supervising atmospheric science students, which is really scary. I think that's the sound of my atmospheric science door closing, if not permanently, pretty tightly anyway - it'd take some effort and time to prise it open again! But I like working with students, so if you're an academic in a computer science department in my part of the world, and fancy getting them working on some atmospheric science related problems, get in touch, we could talk about co-supervision. Hopefully that's another door opening ...

## More Stupid Patent Litigation

I'm with Tim Bray on this. Why isn't the internet in an uproar? IBM is litigating Amazon on patent violations, it's all pretty incredible, but the two most silly are:

If Amazon is found guilty of this, then the entire edifice of data distribution in science will be violating these patents too. In fact, pretty much all e-commerce is covered in these patents (and the others they're claiming are violated).

IBM should be ashamed, even if it's only to over-turn that ludicrous one-click patent ...

## Exploring Web Server Backends - installing fastcgi and lighttpd

A few months ago, I was investigating web server options (one, two, three). I finished that series saying I needed to investigate wsgi. Well that time has come, there are a number of reasons why wsgi and fastcgi (or scgi) may be important to us. However I'm a little bit wary about Apache and fastcgi after getting the impression that lighttpd may be the way to go for fastcgi. So, this note is a list of my experiences getting a wsgi hello world going on my dapper laptop. (As usual, my interest in doing this for myself is to understand the major issues, not because I'm personally going to be working on this).

(I'm doing this using my own /usr/local/bin/python2.5 rather than the system default python.)

Got Lighttpd:

sudo apt-get install lighttpd


(This started a process running under www-data, and put scripts in /etc/init.d, so I may well have this automatically starting when I boot, which isn't really what I want on a laptop ... I'll investigate that later).

Got flup, noting that no 2.5 egg existed, I had to get a tar ball, and setup install it (nb: remember using my local python):

wget http://www.saddi.com/software/flup/dist/flup-r2030.tar.gz
tar xzvf flup-r2030.tar.gz
cd flup-r2030
sudo python setup.py install
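With two Pythons on the machine, it is easy to install flup into one and then run the script with the other. A quick check of what a given interpreter can actually see might look like the sketch below. It assumes a Python 3 interpreter (not the 2.5 used in this note) and only the standard library; the helper name `module_visible` is my own, purely for illustration.

```python
import importlib.util
import sys


def module_visible(name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is answering, then report on the packages of interest.
    print("interpreter:", sys.executable)
    for name in ("flup", "scgi"):
        print(name, "visible" if module_visible(name) else "missing")
```

Running it with each interpreter in turn shows straight away which one the install actually landed in.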


Following cleverdevil (Jonathan Lacour) I grabbed scgi while I was at it.

sudo easy_install scgi


but for my first steps, I'm planning on getting vanilla fastcgi working. I may play with scgi later. Meanwhile for fastcgi, I'm basically following cleverdevil again, adjusted for my ubuntu apt-installed lighttpd.

I modified the file 10-fastcgi.conf in /etc/lighttpd/conf-available to be

## FastCGI programs have the same functionality as CGI programs,
## but are considerably faster through lower interpreter startup
## time and socketed communication
##
## Documentation: /usr/share/doc/lighttpd-doc/fastcgi.txt.gz
##                http://www.lighttpd.net/documentation/fastcgi.html

server.modules   += ( "mod_fastcgi" )

## Start a FastCGI server for python test example
fastcgi.debug = 1
fastcgi.server    = ( ".fcgi" =>
    ( "localhost" =>
        (
            "socket" => "/tmp/fcgi.sock",
            "min-procs" => 2
        )
    )
)


and put a sym link to this file into my /etc/lighttpd/conf-enabled directory. (Update 27 Oct: Oops, I had a non-working version of 10-fastcgi.conf here until today. The one above is the one I have working ... today).

I put the test file in my /var/www directory as test.fcgi:

#!/usr/local/bin/python
from flup.server.fcgi import WSGIServer

def myapp(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello World!\n']

# Serve the app on the unix domain socket named in 10-fastcgi.conf
WSGIServer(myapp, bindAddress='/tmp/fcgi.sock').run()


And I ran it:

python test.fcgi


and it sits there running.
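Before involving lighttpd at all, it can be worth confirming that the WSGI callable itself behaves, since that removes one suspect when the web server later returns an error. The following is a minimal sketch, assuming a Python 3 interpreter (so the body is bytes, unlike the 2.5 version above) and only the standard library's `wsgiref`; the helper `call_app` is a name I've made up for illustration.

```python
from wsgiref.util import setup_testing_defaults


def myapp(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World!\n']


def call_app(app, path='/test.fcgi'):
    """Call a WSGI app once with a fake request and return (status, headers, body)."""
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(app(environ, start_response))
    return captured['status'], captured['headers'], body


status, headers, body = call_app(myapp)
print(status, headers, body)
```

If this prints a 200 status and the hello world body, anything that goes wrong afterwards is in the plumbing between the web server and the script, not in the application.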

Now, trying to access it on http://localhost.localdomain/test.fcgi results in a 500 Internal Server Error. A check in the lighttpd error log showed many instances of this (associated with much head scratching and time wasting):

2006-10-25 21:54:06: (mod_fastcgi.c.2669) fcgi-server re-enabled: unix:/tmp/fcgi.sock
2006-10-26 08:09:30: (mod_fastcgi.c.1739) connect failed: Permission denied on unix:/tmp/fcgi.sock
2006-10-26 08:09:30: (mod_fastcgi.c.2851) backend died, ...


Eventually the penny dropped. The server is running as www-data which has no access permissions to the unix domain socket (/tmp/fcgi.sock) created by the user (whether me or root) running the python test.fcgi server code ...

So, I changed the permissions on /var/www to allow www-data access, and reran the python command:

sudo su www-data
python test.fcgi


And lo and behold, I get a "Hello World" on http://localhost.localdomain/test.fcgi.
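In hindsight, a quick look at who owns the socket and what its mode bits are would have saved me the head scratching. Here is a small sketch of that check, assuming Python 3 on Linux (where, as far as I understand it, connecting to a unix domain socket needs write permission on the socket file); `describe_access` and `SOCK` are illustrative names of my own.

```python
import os
import stat

SOCK = '/tmp/fcgi.sock'  # the socket path used in 10-fastcgi.conf


def describe_access(path):
    """Summarise ownership and permission bits for a filesystem path."""
    st = os.stat(path)
    return {
        'is_socket': stat.S_ISSOCK(st.st_mode),
        'owner_uid': st.st_uid,
        'group_gid': st.st_gid,
        'mode': stat.filemode(st.st_mode),
        # If this is False, a user who is neither the owner nor in the
        # group (www-data, in my case) cannot write to the path.
        'other_can_write': bool(st.st_mode & stat.S_IWOTH),
    }


if os.path.exists(SOCK):
    print(describe_access(SOCK))
else:
    print(SOCK, 'does not exist (is the fcgi script running?)')
```

Running the script as the web server user, as I ended up doing, sidesteps the problem because the socket is then owned by www-data.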

## Weird unicodeness

For some reason my blog has suddenly developed some sort of unicode problem, which is making a large number of pages core dump. I don't know why. I'm investigating ... meanwhile, I'm trapping the error, but you may assume strange things will happen today!

Update (10-15am): At the moment, I'm trapping the page content in the wiki formatter, forcing it to utf-8, doing my wiki formatting (in wikiBNL), and then forcing it back to ascii before returning the content to the leonardo page provider. I need to do the last step, because there is some error higher up which is breaking with a strict ascii code conversion. What is utterly weird is that this was fine yesterday! Since then I have changed my embedhandler (to support namespaces in xml pretty printing, something I'll blog about sometime), but I fail to understand how that would lead to this problem ...

Update (10-35am): Well, it's not the new embedhandler. I commented out the new xml stuff, and the problem still exists ... (I never thought it was, but I think it's the only thing I've touched ...)

Update (11am): I don't understand this, and haven't time to fix it right now. This means that some pages with non-ascii may have some spurious ? until I can fix this ...
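For the record, the workaround amounts to something like the sketch below, which also shows where the spurious ? comes from. It assumes Python 3 string handling rather than the Python 2 of my actual blog code, and `round_trip` and its `formatter` argument are illustrative stand-ins, not the real wikiBNL interface.

```python
def round_trip(raw_bytes, formatter=lambda text: text):
    """Decode as utf-8, apply the formatter, then force the result back to ascii."""
    text = raw_bytes.decode('utf-8', errors='replace')
    formatted = formatter(text)
    # Any character outside ascii cannot be represented, so it becomes '?'
    return formatted.encode('ascii', errors='replace')


print(round_trip('Poincaré'.encode('utf-8')))  # b'Poincar?'
```

The lossy step is the final encode: it stops the strict conversion higher up from failing, at the price of replacing every non-ascii character.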

## Citing data with ISO19139

I thought I might try and work out exactly what tags I might use for my previous citation example, if I was using ISO19139 (i.e. in the metadata of another dataset).

The appropriate piece of ISO19139/19115 is the CI_Citation element, which defines the metadata describing authoritative reference information ... which in my mind should also include other datasets!

Some of it is "straight forward" (I don't plan to admit how long it took to work this out :-) :

<CI_Citation xmlns="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.isotc211.org/2005/gmd http://www.isotc211.org/2005/gmd/citation.xsd">
  <title>
  </title>
  <alternateTitle>
    <gco:CharacterString>MST </gco:CharacterString>
  </alternateTitle>
  <date>
    <CI_Date>
      <date>
        <gco:Date>2006 </gco:Date>
      </date>
      <dateType>
        <CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication"> </CI_DateTypeCode>
      </dateType>
    </CI_Date>
  </date>
  <identifier>
    <MD_Identifier>
      <code>
      </code>
    </MD_Identifier>
  </identifier>
  <citedResponsibleParty>
    <CI_ResponsibleParty>
      <organisationName>
        <gco:CharacterString>Natural Environment Research Council </gco:CharacterString>
      </organisationName>
      <role>
        <CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="Author"> </CI_RoleCode>
      </role>
    </CI_ResponsibleParty>
  </citedResponsibleParty>
  <citedResponsibleParty>
    <CI_ResponsibleParty>
      <organisationName>
        <gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString>
      </organisationName>
      <role>
        <CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="Publisher"> </CI_RoleCode>
      </role>
    </CI_ResponsibleParty>
  </citedResponsibleParty>
  <citedResponsibleParty>
    <CI_ResponsibleParty>
      <organisationName>
        <gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString>
      </organisationName>
      <contactInfo>
        <CI_Contact>
          <onlineResource>
            <CI_OnlineResource>
              <function>
              </function>
            </CI_OnlineResource>
          </onlineResource>
        </CI_Contact>
      </contactInfo>
      <role>
        <CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="custodian"></CI_RoleCode>
      </role>
    </CI_ResponsibleParty>
  </citedResponsibleParty>
  <presentationForm>
    <CI_PresentationFormCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_PresentationFormCode" codeListValue="profileDigital"> </CI_PresentationFormCode>
  </presentationForm>
</CI_Citation>

OK, it's pretty nasty in terms of verbiage, but as (some) folk keep saying, this is for computers not humans - never mind that a human has to write some code to handle it - but it's not as bad as I feared!
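As for the code a human has to write, pulling the responsible parties and their roles back out is not too painful once the namespaces are declared. The sketch below assumes Python 3 and the standard library's `xml.etree.ElementTree`; `SAMPLE` is a cut-down copy of the citation above, and `responsible_parties` is an illustrative helper of my own, not part of any NDG code.

```python
import xml.etree.ElementTree as ET

NS = {
    'gmd': 'http://www.isotc211.org/2005/gmd',
    'gco': 'http://www.isotc211.org/2005/gco',
}

# A trimmed-down version of the CI_Citation above, with one responsible party.
SAMPLE = """\
<CI_Citation xmlns="http://www.isotc211.org/2005/gmd"
             xmlns:gco="http://www.isotc211.org/2005/gco">
  <citedResponsibleParty>
    <CI_ResponsibleParty>
      <organisationName>
        <gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString>
      </organisationName>
      <role>
        <CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="custodian"></CI_RoleCode>
      </role>
    </CI_ResponsibleParty>
  </citedResponsibleParty>
</CI_Citation>
"""


def responsible_parties(xml_text):
    """Return a list of (organisation name, role code) pairs from a CI_Citation."""
    root = ET.fromstring(xml_text)
    parties = []
    for party in root.findall('gmd:citedResponsibleParty/gmd:CI_ResponsibleParty', NS):
        name = party.findtext(
            'gmd:organisationName/gco:CharacterString', default='', namespaces=NS
        ).strip()
        role = party.find('gmd:role/gmd:CI_RoleCode', NS)
        parties.append((name, role.get('codeListValue') if role is not None else None))
    return parties


print(responsible_parties(SAMPLE))
```

The role lives in an attribute rather than in element text, which is the sort of detail that makes the format easier for computers than for the people writing the parser.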

In getting that far, we see that I've nearly managed to get the same information content, but there are some pretty important omissions (I think, caveat emptor, I'd be glad to be wrong about this):

1. I don't see any way to indicate that the dataset is being updated (the "ongoing" tag in my previous example); ideally this would require a spot for an MD_MaintenanceFrequencyCode in the citation.

2. I don't see any way of indicating a particular part of a dataset (that is, having separate identifiers both for the dataset and for particular features within it).

3. Despite support for feature-type descriptions within an ISO19139 document proper (in the MD_FeatureTypeDescription tag), one can't identify which features are in a cited dataset. We're reduced to using CI_PresentationFormCode, which strikes me as a completely ugly compromise between feature descriptions and a text element. The one I've chosen here (profile) is partly right, but doesn't get across that this dataset consists of timeseries of vertical profiles!

4. One can't, as far as I can see, identify when the dataset was accessed (or the date of the citations validity) and I think this is rather crucial for citation of online material.

I guess those are the minimum extensions we'd need to support citeable datasets! (By the way, I've ignored the option of using otherCitationDetails as one is only allowed one of those in the citation!)

Update: Note that BADC appears as both a publisher and a custodian. Actually, following my discussion of the distinction, I think at the moment one would want to remove the publisher role and leave only the custodian role (in the ISO19139; the text citation form can't distinguish between these roles).
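To make the role point concrete, here is a minimal Python sketch (standard library only) that pulls the organisation names out of a citation like the one above and groups them by role code. The cut-down sample document and the function name `parties_by_role` are mine, and I've assumed the usual ISO 19139 namespace URIs for gmd and gco, which the fragment above doesn't show.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (the fragment in the post shows no declarations).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Cut-down stand-in for the citation above; codeList attributes omitted for brevity.
SAMPLE = """<CI_Citation xmlns="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <citedResponsibleParty><CI_ResponsibleParty>
    <organisationName><gco:CharacterString>Natural Environment Research Council </gco:CharacterString></organisationName>
    <role><CI_RoleCode codeListValue="Author"> </CI_RoleCode></role>
  </CI_ResponsibleParty></citedResponsibleParty>
  <citedResponsibleParty><CI_ResponsibleParty>
    <organisationName><gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString></organisationName>
    <role><CI_RoleCode codeListValue="Publisher"> </CI_RoleCode></role>
  </CI_ResponsibleParty></citedResponsibleParty>
  <citedResponsibleParty><CI_ResponsibleParty>
    <organisationName><gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString></organisationName>
    <role><CI_RoleCode codeListValue="custodian"></CI_RoleCode></role>
  </CI_ResponsibleParty></citedResponsibleParty>
</CI_Citation>"""


def parties_by_role(citation_xml):
    """Map each (lower-cased) role code to the organisations holding it."""
    root = ET.fromstring(citation_xml)
    result = {}
    for party in root.iterfind(".//gmd:CI_ResponsibleParty", NS):
        name = party.findtext(
            "gmd:organisationName/gco:CharacterString", default="", namespaces=NS
        ).strip()
        code = party.find("gmd:role/gmd:CI_RoleCode", NS)
        role = code.get("codeListValue", "") if code is not None else ""
        result.setdefault(role.lower(), []).append(name)
    return result


roles = parties_by_role(SAMPLE)
print(roles)
```

With the sample as written, BADC turns up under both "publisher" and "custodian", which is exactly the duplication discussed in the update.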

## Persistence

Just after I wrote my last post on data citation, I found Joseph Reagle's blog entry on bibliography and citation. He's making a number of points, one of which was about transience. In the comments to his post, and in Joseph's comment on my post, two solutions to deal with internet transience are mentioned: the wayback machine and webcite.

I've looked at the wayback machine in the past, but there is no way that it represents any realistic full sample of the internet (for example, as of today, it has exactly one impression of home.badc.rl.ac.uk/lawrence - from 2004!) ... but how could it? It's an unrealistic task. What I do see it as is a (potentially very useful) set of time capsules ... that is samples!

By contrast, webcite allows the creator of content to submit URLs for archival, thus ensuring that when one writes an academic document, the material will be archived, and the citation will be persistent. This is a downright excellent idea, provided you believe in the persistence of the webcitation consortium (and I have no reason not to). The subtext, however, is that the citation is a document. It won't help us with data, and not just because data may be large: the other issue is that the webcitation folk would have to take on support for data access tools, and I think the same argument applies to them as applies to libraries in this regard!

This brings me back to my point about data citation: we had better only allow it when we believe in the persistence of the organisation making the data available, and that will consist of rather more than just having the bits and bytes available for an http GET!

## Citation, Hosting and Publication

Returning to my series on citation (parts one, two, and three).

My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):

which I could also write like this to give some hint of the semantics:

<citation>
<Author> Natural Environment Research Council </Author>
<Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
<Medium> Internet </Medium>
<Publisher> British Atmospheric Data Centre (BADC) </Publisher>
<PublicationDate status="ongoing"> 1990 </PublicationDate>
<Feature>
<FeatureType>http://featuretype.registry/verticalProfile </FeatureType>
<LocalID>200409031205 </LocalID>
</Feature>
<AccessDate> Sep 21 2006 </AccessDate>
<AvailableAt>
</AvailableAt>
</citation>

The tags are made up, but hopefully identify the important semantic content of the citation. As I said last time, there is some redundant information there, but maybe not (there is no guarantee that the Identifier and the AvailableAt carry the same semantic content).
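As a rough illustration of how the made-up tags could be turned back into a text citation, here is a small Python sketch using only the standard library. The output layout (including the "obtained on" wording and the "-ongoing" suffix) is my assumption, not a recommended format, and `render` is a hypothetical helper name. The empty AvailableAt is simply skipped.

```python
import xml.etree.ElementTree as ET

# The made-up citation from above, with AvailableAt left empty as in the post.
CITATION_XML = """<citation>
  <Author> Natural Environment Research Council </Author>
  <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
  <Medium> Internet </Medium>
  <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
  <PublicationDate status="ongoing"> 1990 </PublicationDate>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile </FeatureType>
    <LocalID>200409031205 </LocalID>
  </Feature>
  <AccessDate> Sep 21 2006 </AccessDate>
  <AvailableAt>
  </AvailableAt>
</citation>"""


def text_of(root, path):
    """Stripped text of the first element matching path, or '' if absent."""
    return (root.findtext(path) or "").strip()


def render(xml_text):
    """Render the made-up citation XML as a one-line text citation."""
    root = ET.fromstring(xml_text)
    date = root.find("PublicationDate")
    year = (date.text or "").strip() if date is not None else ""
    if date is not None and date.get("status") == "ongoing":
        year += "-ongoing"
    parts = [
        text_of(root, "Author"),
        f'{text_of(root, "Title")} [{text_of(root, "Medium")}]',
        text_of(root, "Publisher"),
        year,
    ]
    local_id = text_of(root, "Feature/LocalID")
    if local_id:
        parts.append(f'feature {local_id} ({text_of(root, "Feature/FeatureType")})')
    tail = []
    where = text_of(root, "AvailableAt")
    if where:
        tail.append(f"Available from {where}")
    accessed = text_of(root, "AccessDate")
    if accessed:
        tail.append(f"obtained on {accessed}")
    text = ", ".join(p for p in parts if p)
    if tail:
        text += " [" + ", ".join(tail) + "]"
    return text


print(render(CITATION_XML))
```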

Inherent in that example, and my meaning, was a concept of publication, and I introduced that distinction by comparing the MST and our ASHOE dataset (which is really "published" elsewhere). In the library world, there is a concept of "Version of Record", which isn't exactly analogous, but I would argue BADC holds the dataset equivalent of the version of record for the MST, and NASA AMES the equivalent for the ASHOE dataset.

Generally, in scholarly publication, in the past one distinguished between the refereed literature, the published literature and the grey literature1, where the latter might not have been allowed as a valid citation. The situation has become more complicated with the urge to cite digital material, but one of the reasons for the old rules was about attempting to ensure permanence and access - something that is obviously becoming a problem again. Thus, we should explore the concepts of publication and version of record a bit further, before we create new problems. Cathy Jones, working on the CLADDIER project, has made the point in email that a publisher does something to the original that adds value, and I think in the case of digital data, that something should include at least:

• some commitment to maintenance of the resource at the AvailableAt url

• some commitment to the resource being conformant to the description of the Feature

• some commitment to the maintenance of the mapping between the identifier and the resource.

And so, in a reputable article (whatever that means), or in the metadata of a published dataset, I wouldn't allow the citation of a dataset that didn't meet at least those criteria, but once we have met those criteria, then that first version should be the version of record, and copies held elsewhere should most definitely distinguish between the publisher and the availability URI.

Arguably the 2nd and 4th of these criteria could be collapsed down to the use of a DOI. While that's true, I think the use of both helps the citation user (just as I think it best to do a journal citation with all of the volume, page number and DOI). However, if the publisher does choose to use a DOI, it would help if the holders of other copies did not! Whether or not it's true, the use of a DOI does imply some higher level of ownership than simply making a copy available.

Implicit in my discussion of the metadata of a published dataset, is the idea that just as in the document world, we could introduce the concept of some sort of kite-mark or refereeing of datasets. A refereed dataset would be

• available at a persistent location

• accompanied by more comprehensive metadata (which might include calibration information, algorithm descriptions, the algorithm codes themselves etc)

• quality controlled, with adequate error and/or uncertainty information

and it would have been

• assessed as to its adherence to such standards.

There might or might not be a special graphical interface to the data and other well known interfaces (e.g. WCS etc) ought probably be provided.

Datasets published after going through such a procedure would essentially have come from a "Data Journal", and so in my example above, the <Publisher> would become the name of the organisation responsible for the procedure, and the <Title> might well become the title of the "Data Journal".

1: Grey Literature: i.e. documents, bound or otherwise, produced by individuals and/or institutions, but which were not commercially available, and therefore, by implication, not very accessible. (ret).

## The Economist Goes Green

I'm a bit slow to find out about this (which is down to not enough time reading my feeds): Anyway, The Economist is arguing for action on greenhouse gas emissions: Editorial (7 Sep).

Is this a tipping point of sorts?

(Thanks to Andrew Dessler).

by Bryan Lawrence : 2006/10/20 : Categories climate environment (permalink)

## On substitution groups and ISO19139

I have bleated already about the difficulties of using ISO19139 with restrictions which introduce new tag names.

Now the official way to do this is probably to exploit substitution groups in the new xml schema associated with your restriction. So, if one wanted to restrict, for example gmd:MD_Metadata one might start in your new schema with something like

(See w3schools, the xml schema primer, or Walmsley, 2001, to explain the syntax). Then in the instance document, one would have

At this point one could use the xml schema validation machinery to ensure one had a nice valid instance of the new restricted schema.

My beef is how we use this. The gurus will tell you there is no problem, and maybe there isn't if one wants to invest an enormous amount of time in complex handlers (even so, maybe it's not that straightforward for data binding, and perhaps the tools aren't really that mature - or weren't in 2004).

So, if I'm writing code to handle ISO19139 documents, I'm going to be writing xslt or using xpath or xquery to get at particular content or I'm going to have to invest in brute force if I want to handle things in a high level language like python (as far as I know there are no pythonic tools that get close to this sort of requirement internally).

Let's just explore the brute force method, and a simple use case: I have harvested ISO19139 profiles (I'm starting to think "variants" - complete with quotes - would be a better term :-) from a number of places, and want to deliver the titles to a web page ... so I need to find the titles. I can't assume I can use a simple xpath expression (which is supported in python) to find all the titles. I have to parse all the relevant schemas, and do something complex to find the new title elements. In practice, I have to support each profile as a completely different schema, they might as well not share the ISO19139 heritage - even though there are advantages in the ISO19115 content heritage. Yuck.
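To give a feel for the brute force route, here is a minimal Python sketch with the standard library. Everything about the profile is hypothetical: the `http://example.org/myprofile` namespace, the `MY_Citation` element, and the helper names. It also cheats by comparing the `substitutionGroup` value as a literal prefixed string, so it assumes the profile schema binds the prefix `gmd` to the ISO namespace; a real handler would have to resolve prefixes properly and follow chains of substitutions.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
GMD = "http://www.isotc211.org/2005/gmd"
NS = {"gmd": GMD, "gco": "http://www.isotc211.org/2005/gco"}

# Hypothetical profile schema: MY_Citation may stand in for gmd:CI_Citation.
PROFILE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:my="http://example.org/myprofile"
    targetNamespace="http://example.org/myprofile">
  <xs:element name="MY_Citation" type="my:MY_Citation_Type"
              substitutionGroup="gmd:CI_Citation"/>
</xs:schema>"""

# Hypothetical harvested instance that uses the substituted element.
INSTANCE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:my="http://example.org/myprofile">
  <gmd:identificationInfo><gmd:MD_DataIdentification><gmd:citation>
    <my:MY_Citation>
      <gmd:title><gco:CharacterString>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth</gco:CharacterString></gmd:title>
    </my:MY_Citation>
  </gmd:citation></gmd:MD_DataIdentification></gmd:identificationInfo>
</gmd:MD_Metadata>"""


def substitutes_for(xsd_text, head_qname):
    """Names (in {namespace}local form) of top-level elements substituting for head_qname."""
    root = ET.fromstring(xsd_text)
    target = root.get("targetNamespace", "")
    return [
        f"{{{target}}}{decl.get('name')}"
        for decl in root.findall(f"{{{XS}}}element")
        if decl.get("substitutionGroup") == head_qname  # literal prefix match (simplification)
    ]


def citation_tags(profile_schemas):
    """The ISO citation tag plus every substitute declared by the given profile schemas."""
    tags = {f"{{{GMD}}}CI_Citation"}
    for xsd in profile_schemas:
        tags.update(substitutes_for(xsd, "gmd:CI_Citation"))
    return tags


def find_titles(instance_xml, tags):
    """Walk the whole tree and read gmd:title under anything that counts as a citation."""
    found = []
    for elem in ET.fromstring(instance_xml).iter():
        if elem.tag in tags:
            text = elem.findtext(
                "gmd:title/gco:CharacterString", default="", namespaces=NS
            ).strip()
            if text:
                found.append(text)
    return found


print(find_titles(INSTANCE, citation_tags([PROFILE_XSD])))
```

Note that the plain path `.//gmd:CI_Citation/gmd:title` finds nothing in this instance, which is the whole complaint: every profile's schema has to be read before the titles can be located.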

OK, now suppose I hand this off to an xquery engine. How easy is that? Let's assume it's not buggy ... This is essentially the use case described as 1.9.4.7 Q7 in the June 2006 use case document. I'm not that familiar with xquery, but it appears that

schema-element(gmd:MD_Metadata)


should then match any element which is linked to it via a substitution group declaration like that above. If it really is that simple, then this is much easier than the brute force method, and a good reason for passing my problems to a real xquery engine.

However, this may well work fine for handling document reshaping type tasks, but returning to the use case, I could well have tens of thousands of harvested documents, if not millions, and so I may well be considering indexing. I don't know, but it would appear that one has to rewrite all substitution group elements when producing an index - does our eXist native xml database technology do this automatically for me? I don't know, and that's the point.

All this marvellous xml technology is bloody complicated, and all to handle the case that a community wants to restrict the usage of some tags or lists! Why make all this grief? Wouldn't it be far easier to give community guidance, but accept perfectly valid ISO19139 documents which fall outside that guidance, because we could all simply follow the simple rules in David Orchard's article, and especially the one I've highlighted before:

Document consumers must ignore any XML attributes or elements in a valid XML document that they do not recognise.

We could rewrite that as:

Communities should give guidance on those ISO19139 attributes or elements that need populating for usage within the community (and which might need to be handled by community tools).

Job done. No complicated machinery. More tools available, easier indexing, and much easier human parsing of everything ... (from the schemas, to the instances, and all the code that handles them).
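For contrast with the substitution group machinery, a "must ignore" consumer can be sketched in a few lines of Python (standard library only). The `WANTED` table stands in for whatever a community's guidance says should be populated; the two paths, the sample document and the name `harvest` are my assumptions for illustration. Anything not listed, such as the edition and presentationForm elements in the sample, is simply never looked at.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical community guidance: the only content this tool cares about.
WANTED = {
    "title": ".//gmd:CI_Citation/gmd:title/gco:CharacterString",
    "date": ".//gmd:CI_Citation/gmd:date/gmd:CI_Date/gmd:date/gco:Date",
}

# Sample document carrying elements the tool does not recognise.
DOC = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo><gmd:MD_DataIdentification><gmd:citation>
    <gmd:CI_Citation>
      <gmd:title><gco:CharacterString>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth</gco:CharacterString></gmd:title>
      <gmd:date><gmd:CI_Date><gmd:date><gco:Date>2006 </gco:Date></gmd:date></gmd:CI_Date></gmd:date>
      <gmd:edition><gco:CharacterString>v3</gco:CharacterString></gmd:edition>
      <gmd:presentationForm><gmd:CI_PresentationFormCode codeListValue="profileDigital"/></gmd:presentationForm>
    </gmd:CI_Citation>
  </gmd:citation></gmd:MD_DataIdentification></gmd:identificationInfo>
</gmd:MD_Metadata>"""


def harvest(xml_text):
    """Collect only the recognised fields; everything else is ignored, not rejected."""
    root = ET.fromstring(xml_text)
    record = {}
    for key, path in WANTED.items():
        value = root.findtext(path, default=None, namespaces=NS)
        if value is not None and value.strip():
            record[key] = value.strip()
    return record


print(harvest(DOC))
```

The design choice is that unknown content never causes a failure: a document from another community yields whatever subset of fields it happens to share.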

## Two completely unconnected things

I've just had a burst of rather intensive work over the last couple of weeks (hence the silence), and this lunch time I've rather ground to a halt. So, by way of light entertainment, I clicked on my akregator and started reading from the enormous number of unread things from the various feeds I think/thought I need/want to follow ...

Herewith are two links which apart from how I found them, are completely unconnected. I'm drawing to your attention because for very different reasons I valued reading them.

1) The first, is from John Fleck's blog, and is actually a higher profile restatement of a comment on an earlier entry by Nick Brooks discussing cultural responses to climate change. Go read the entire thing, but this is a taster:

... even if we accept that environmental crises led to the emergence of civilisation, we are still looking at collapse - the collapse of the societies that preceded these new cultures. In the Sahara we know that lifestyles based on mobile cattle herding collapsed, as did hunting and gathering. It's the survivors who adapt, after the event.

It's that last sentence that got me thinking ...

2) And now from the April the first collection at the bmj, where quite clearly there is some sort of calendar problem, we have a series of fabulous articles and responses, but John Fleck (again), pointed out this one on the half-life of teaspoons ...

I wonder if they would get the same results with plastic teaspoons!

One serious. One frivolous. Viva Blogging! How else would one find this sort of stuff?

## Programmers should be miserable

Steve Yegge (via Tim Bray):

... I don't like Java, but I do use it. Liking and using are mostly orthogonal dimensions, and if you like the language you're using even a little bit, you're lucky. That, or you just haven't gotten broad enough exposure to know how miserable you ought to be.

## More on citation - part three, delving

Following on ...

In my last example, I considered a (made up) uri which was supposed to be of meaning to the BADC that would allow cleaner citing of our MST radar dataset. The URI looked like this: badc.nerc.ac.uk/data/mst/v3/upd15032006

There are two obvious components to this uri, the first is a domain address, badc.nerc.ac.uk, the second is the string /data/mst/v3/upd15032006. This latter uri was constructed with a number of local implicit subcomponents. I've included

• "data" as a local identifier to identify the schema for my uri construction.

• "mst" identifies a dataset

• "v3" is a (made-up) version

• "upd15032006" is a (made-up) last modification date. (For some datasets updating through the day, we probably ought to have a modification time in the URI.)

Now the format of this uri is rather unimportant. What I'm asserting here is that all these components are really useful for citing data. Rather like the concepts of volume and page number and issue date for a journal. Now I don't really care too much how different sites do their uri schemes, but I do care a lot that they contain these concepts.
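Since the components matter more than the format, here is a short Python sketch that splits the made-up URI back into those concepts. The field names, the `BaseSource` type and the reading of "upd15032006" as day-month-year are my assumptions; another site's scheme would need its own parser.

```python
from datetime import date, datetime
from typing import NamedTuple


class BaseSource(NamedTuple):
    authority: str   # who minted the identifier, e.g. badc.nerc.ac.uk
    scheme: str      # local marker for the uri construction scheme, e.g. "data"
    dataset: str     # dataset identifier, e.g. "mst"
    version: str     # (made-up) version, e.g. "v3"
    updated: date    # (made-up) last modification date


def parse_base_source(uri):
    """Split a uri of the form authority/scheme/dataset/version/updDDMMYYYY."""
    # Raises ValueError if there are not exactly five components.
    authority, scheme, dataset, version, update = uri.strip("/").split("/")
    if not update.startswith("upd"):
        raise ValueError(f"no update marker in {uri!r}")
    # Assumption: the digits are day, month, year.
    updated = datetime.strptime(update[3:], "%d%m%Y").date()
    return BaseSource(authority, scheme, dataset, version, updated)


source = parse_base_source("badc.nerc.ac.uk/data/mst/v3/upd15032006")
print(source.dataset, source.version, source.updated.isoformat())
```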

This uri is an example of what I think of as a "base source". That is, I expect I might well consider citing something within this "database". We might also worry about what I mean by "downloaded" in that citation ...

I've suggested elsewhere that a priori one needs a concept of a feature-type (ISO19109) before there is much point citing within a database. In this case what I mean is there is no point citing a "thingie" within a database, unless there is a (generally or otherwise) understood concept of what I mean when I say "thingie". It could be a gene-sequence, it could be a radar-profile, it could be a chemical equation ... but if it's any of those things, somewhere there is a definition of what that means, and a description of the format of the result I'll get when I go get one of those things. That somewhere is (or should be) a feature-type registry (ISO19110) in our world, but it doesn't matter if it's not, the point is that the concepts and some realisation of them must exist for citation within a database to have meaning.

This whole discussion is necessary because we have no a priori concept like page or chapter ... with documents we know what we are going to get back ... a document ... but if we cite something that is data we don't. This is a problem. For example, take opendap, which is rather popular. If I give you an opendap url, e.g.

http://www.cdc.noaa.gov/cgi-bin/nph-nc/Datasets/reynolds_sst/sst.mnmean.nc

As the opendap pages say:

The simplest thing you can do with this URL is to download the data it points to. You could feed it to a DODS-enabled data analysis package like Ferret, or you could append .asc, and feed the URL to a regular web browser like Netscape. This will work, but you don't really want to do it because in binary form, there are about 28 megabytes of data at that URL.

So clearly one doesn't want to go pointing at arbitrary urls without the right software and a bit of knowledge about what might exist at the url ... so in practice it's much better to cite a document which tells you about the data that was cited ... but now that document itself had better be as permanent as the data.

That being so, even better might be to have some way in the citation of doing this in a meaningful manner. If we push our MST example a bit further, what do we get?

where 0200409031205 is yet another made up identifier, which points to an object of type "verticalProfile" in registry http://featuretype.registry ... while I know the identifier aficionados don't like semantics in the identifier, I think there should be some, to help with the meaning of citations, and so in this case it might be the timing of the radar profile.

It looks like we have some redundant information, because I could give the exact url of the data object (which could incorporate the uri), but remember there is no guarantee that the BADC uri and the download url are the same! Also, we may need the feature type uri to be different from either the base-source uri and the download url. (Haven't thought this bit through yet, but it seems conceptually possible).

## Hiding the Heat - Solar Dimming.

First we had global dimming. Now we have Solar Dimming. I've blogged before about the relationship between sunspot numbers and global mean temperature.

New Scientist has an article entitled Saved by the Sun 1, where they are reporting that

Some astronomers are predicting that the sun is about to enter another quiet period.

which would lead to a bit of cooling. There are a large number of quotes, but no actual references (of course), so I'm curious as to the state of these predictions. Some of the quotes came from people I know and respect (especially Jo Haigh, but she isn't in the solar prediction game). A quick google yielded more "news stories" (e.g. this russian one, click only if you like flashing adverts). (It also yielded this crap ... don't waste your time clicking on this one unless you want to throw up).

Eventually, I came to this in Space Weather News, which has a real reference. I haven't read that, but the story suggests the next sunspot cycle will be comparable with 1906. Well, that doesn't sound like it will buy us much time to think (which is the basic thread of the New Scientist article). What about predictions beyond the next cycle? Well, I couldn't find any of those ...

So, so much for being able to put off hard choices!

1: New Scientist, 16 Sep 2006, page 32 (ret).

I know I sort of predicted this, but New Scientist1 claims that:

Formal scientific papers are now even beginning to cite blogs as references

I would like to see some examples of this!

(As an aside, it is good to see that Peter Murray Rust is blogging, and even more so on the importance of blogging in scientific communication and slagging off PDF too!)

1: Amanda Gefter in New Scientist, 16 Sep 2006, page 48 (ret).

## More on citation - part two, MST

Yesterday I started talking about how I think one should cite data we hold on behalf of someone else. That discussion isn't yet finished, issues of citing within datasets, and other "standards" for citation still need to be discussed (including how we parse and store them electronically), and I still haven't gotten to addressing any of my points from the original blog entry. Before we go there, it's helpful to discuss our mst radar data set.

The mst radar dataset consists of data from 1990 to the present. Over time the data format and methodology of data collection (i.e. how the raw radar returns are converted to, for example, wind) have changed. Nonetheless, one can imagine someone wanting to cite a timeseries extending back to the beginning or a particular day's data, or perhaps the whole thing.

Do we publish the data? We will hold the data in perpetuity, and make it available for scientific use, so in some senses yes 1 (In the same sense as we publish the ashoe data). I'm going to get back to this issue of what I think publication should be for data.

Meanwhile, how should one cite this? Still using the U.S. National Library for Medicine recommendations (pdf), we should probably consider this as an online database.

• Our first issue is: "Who is the author?". I think this is a case where this is a NERC facility, so it is NERC.

• What is the title? Well at the BADC we call it "The NERC Mesosphere-Stratosphere-Troposphere Radar Facility at Aberystwyth", but actually I think we should consider renaming the dataset to something which isn't about a facility, or a funder. Better would be "The Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth".

• What is the urn, what is the update date, what is the actual url of the data? Actually, despite being a data centre, the answer to none of these three questions is especially obvious. It should be ... but that's part of what claddier is about, to expose these sorts of issues.

The whole thing would then be, for example:

Well, that's obviously horrible. Let's just for a moment imagine our retrieval system was a bit more citation friendly.

• We might, for example, have a clean urn which helps identify the version and length of the data. Ideally it might look something like: badc.nerc.ac.uk/data/mst/v3/upd15032006. In which case we don't need the updated phrase.

• We might have a cleaner url to the location of the data, which hides the method of download

In which case the following would be legitimate:

(In all this, today and yesterday, note that I have completely removed any reference to the physical location of the database (e.g. London, or in our case, Chilton). The physical location is so irrelevant. It made sense to include it when one could physically go to an archive and get a copy of a document, but you can't do that with our data. Coming here will get you nothing. Further, we could move the badc from Chilton to Oxford tomorrow, and it would make NO difference to the accuracy of the citation. Why include it then? I think the location convention has to die when applied to electronic retrievals.)

Well, that's enough for now. Next we'll consider citing into that archive, and some of the other issues ...

1: Wikipedia thinks so: "Publishing is the activity of putting information into the public arena. ..." (ret).

## More on citation - part one, ashoe

We've just been having an interesting conversation about the citation of datasets in the context of claddier. I think I've got the use of we and I correct in what follows to indicate what I think as opposed to what we discussed and agreed ...

I've wittered on about this before, where I considered six issues. Today we discussed two things: how we might cite some specific BADC datasets and we came up with some examples, which led to; how we distinguish between datasets we hold on someone else's behalf, rather like a library, and those which we publish ourselves.

So, we were considering two of our datasets: one, the MST radar data, and the ASHOE mission data (which I was a participant in, hence my choosing it as an example).

The latter dataset is essentially an online copy of a CD which we obtained from NASA, and so while we host it online for the UK academic community, we are certainly not the publishers. The former is a dataset which we hold as the primary dataset for NERC, who require us to make it available. In neither case have we done anything ourselves to allow me to feel comfortable with the grandiose phrase that "we publish it" (indeed, in the former case I feel quite uncomfortable with that concept).

So how do we expect folk to cite these datasets? Here, I'm going to discuss the ashoe dataset alone. The ashoe CD is essentially a compilation of data from a bunch of principal investigators. The compilation was produced by Gaines and Hipskind, so I think this should be dealt with by treating it like an anthology. So that's the authors (or creators) dealt with.

Following the U.S. National Library for Medicine Recommended Formats for Bibliographic Citation1(pdf), we have:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], Nasa Ames Research Centre, Earth Science Project Archives, c1994 [cited September 21, 2006] Available from http://badc.nerc.ac.uk/data/ashoe/

Now the real data is actually at the Earth Science Project Office Archive but note that this is an updated version of the data (I think, the file dates seem to indicate this) from that on the cdrom, which is described online. (I think this means we should add the new data to our archive as well ... but that's an issue for another day).

Anyway this shows several issues

• The date of the data is actually unknown (at least in the public information I found, without rooting around too hard). One can't rely on the filesystem time ...

• Why should folk indicate that they found it at our site, when strictly the ESPO themselves "published" it online? I would argue there are three possible reasons why this is a good idea, which might or might not apply:

• One knows the original version is no longer online (as appears to be the case here).

• One knows that the original online version is somewhat more ephemeral and the place where you are using the citation cares about this issue (e.g. the American Geophysical Union at one time only allowed citation of datasets in registered repositories2).

• With data, I think there is always some risk that the version of data you downloaded from one place will be different from that downloaded from somewhere else (it has certainly happened that we have hosted corrupted versions of data that were non-trivial to identify 3). Now this might argue for always going back to the original source before conducting the analysis which results in the citation, and if possible that's the best idea, but again, with data, that may not be possible for reasons of volume, bandwidth, or server capacity or whatever. I think it best then to indicate which version you used by indicating where it came from, and the download date.

Having said all that, there is a way forward. Going back to the actual citation, I don't much like it, nor do I like much else I've seen. So here's what I think we should do: I think the "cited" should be "obtained on" for data, and it should follow the available at, so better would be:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], Nasa Ames Research Centre, Earth Science Project Archives, c1994 [Available from http://badc.nerc.ac.uk/data/ashoe/, obtained on September 21, 2006]

In our discussions today we pushed on to thinking about the ESPO version as the data equivalent of a NASA Tech Note document, and in our discussion of versioning decided it would be best if an identifier which had meaning to the publisher (ESPO) was also included (to deal with, for example, versioning). Now arguably that might be the original http uri but it could be anything with meaning to the publisher. In this case we only have the http url, so we could have:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], Nasa Ames Research Centre, Earth Science Project Archives, c1994, urn http://cloud1.arc.nasa.gov/ashoe_maesa/project/cdrom.html [Available from http://badc.nerc.ac.uk/data/ashoe/, obtained on September 21, 2006]

Note that I'm implying that I want to use a urn which has meaning to the original publisher if I can, which means ideally data publishers should publish the base source citation urn for their datasets to help get this right. If the base urn is in fact a uri or url, then the original site attribution has magically appeared if that's what the publisher wants.
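The pattern I've ended up with is regular enough to write down as a function. This is only a sketch of the layout used in the examples above, with the optional publisher urn slotted in before the bracketed availability clause; the function and parameter names are mine.

```python
def cite_dataset(authors, title, medium, publisher, date,
                 available_from, obtained_on, urn=None):
    """Format a dataset citation in the 'available from ..., obtained on ...' style."""
    text = f"{authors}, {title} [{medium}], {publisher}, {date}"
    if urn:
        # Identifier with meaning to the original publisher, if one is known.
        text += f", urn {urn}"
    text += f" [Available from {available_from}, obtained on {obtained_on}]"
    return text


ashoe = cite_dataset(
    authors="Gaines and Hipskind",
    title=("The Airborne Southern Hemisphere Ozone Experiment; and Measurements "
           "for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM"),
    medium="Internet",
    publisher="Nasa Ames Research Centre, Earth Science Project Archives",
    date="c1994",
    urn="http://cloud1.arc.nasa.gov/ashoe_maesa/project/cdrom.html",
    available_from="http://badc.nerc.ac.uk/data/ashoe/",
    obtained_on="September 21, 2006",
)
print(ashoe)
```

Leaving out `urn` gives the shorter form without the publisher's identifier.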

What I mean by base source, and all the other hanging threads will have to wait for another time.

Update (26/10/07): The NLM's Recommended Formats for Bibliographic Citation has been updated, and the new revision is available here. I have yet to check whether the revision has any implications for our citation format!

1: Thanks to Chris in comments to my original blog entry. (ret).
2: An early version of their data policy appears here, Update, 22nd Sep, the current version is here ... NB: BADC has been approved by AGU. (ret).
3: And regrettably you can't always tell this from our metadata, we really have to do better! (ret).

## Diesel v Petrol

Over the summer we upgraded our petrol engine renault megane hatchback and replaced it with a diesel megane estate (the dCi-100 version). We've gone from 40 miles per gallon (sorry about the imperial units) to between 55 and 60 miles per gallon. (Our first full tank did 59.5 mpg.)

One of the things I like about the new1 car is the computer mpg readout ... we both tend to drive with that showing, and both of us have started driving slower and more carefully so that the mpg figure will go up. What a simple innovation ... knowing when you're being a gas hog makes one do it less!

I've noted that driving at 55-60 mph is the optimum, and 70-ish mph results in mpg figures in the high forties ...

All of this has me wondering why, as an easy and effective way of dropping the greenhouse gas emissions we in Britain don't

• Make the motorway speed limit 60 mph (yes, I'll hate it too from a driving point of view, but from a conscience point of view I'll feel better), and

• Make diesel cheaper than petrol, to encourage more folk to move over ...

As far as I understand it, here in the UK the price difference is down to taxation. The only explanation I've heard is that the greater price is a hangover from the days when diesel engines delivered bad particulate pollution. It would seem to me given modern engine technologies, there is a prima facie case for putting the taxation boot on the other foot, making petrol more expensive per litre (yes, we really do measure our petrol prices per litre and our performance in miles per gallon).
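Given the mix of litres and gallons, a few lines of Python make the arithmetic behind the mpg figures explicit. This is just unit conversion under the assumption that the quoted figures are miles per imperial gallon; it counts volume of fuel only and says nothing about emissions per litre.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344


def litres_per_100km(mpg):
    """Convert miles per imperial gallon to litres per 100 km."""
    return 100 * LITRES_PER_IMPERIAL_GALLON / (mpg * KM_PER_MILE)


def fuel_saving(old_mpg, new_mpg):
    """Fraction of fuel volume saved over the same distance."""
    return 1 - old_mpg / new_mpg


for mpg in (40, 55, 59.5):
    print(f"{mpg} mpg is {litres_per_100km(mpg):.2f} l/100km")
print(f"40 -> 59.5 mpg saves {fuel_saving(40, 59.5):.0%} of the fuel volume")
```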

Are there any reasons why diesel production is more expensive economically or in greenhouse gas terms to make my argument wrong?

1: Strictly, it's a newer car, since we didn't buy it new (ret).

## Final Version of the CF Paper

After much delay1, most of it occasioned by my workload, we've produced a final version of the CF paper: Maintaining and Advancing the CF Standard for Earth System Science Community Data, Lawrence, B.N., R. Drach, B.E. Eaton, J. M. Gregory, S. C. Hankin, R.K. Lowry, R.K. Rew, and K. E. Taylor.

The Climate and Forecast (CF) conventions governing metadata appearing in netCDF files are becoming ever more important to earth system science communities. This paper outlines proposals for the future of CF, based on discussions at an international meeting held at the British Atmospheric Data Centre in 2005. The proposal presented here is aimed at maintaining the scientific integrity of the CF conventions, while transitioning to a community governance structure (from the current situation where CF is maintained informally by the original authors).

1: The first version was in November last year.

by Bryan Lawrence : 2006/09/12 : Categories cf (permalink)

## Normal Service Will Be Resumed

Yes, I'm here, had a wonderful two weeks in Wales (well, wonderful except for the weather), and arrived back to work to a major panic trying to get a proposal ready. Expect that to be over next week, and me to restart blogging. I have much to talk about (including my new toucan T60P from emperor linux).

by Bryan Lawrence : 2006/09/09 (permalink)

## to extend or not to extend ...

A few months ago I wrote a few words about practicalities with ISO19139. The key reason for using ISO19139 is that it is a standard for metadata interoperability. A key a priori assumption is that one has a community who have agreed to interoperate ... because in practice interoperating will require a profile of ISO19139 which defines for a community what interoperability means to them.

Well, that's clear then. Communities will build profiles. This is a good thing, because it means ISO19139 will have direct value to those communities, providing an infrastructure where their concepts can either be specialised by limiting broad concepts in the parent schema or by extending concepts in the parent schema to add new attributes etc or both. Two communities doing this for real that I care about are the IOC and the WMO.

The problem with profiles, though, is that profiles probably won't be easily consumed by other communities (i.e. will the WMO be able to consume the IOC instances?), but there may be things that one can do when building a profile that will help consumability (I'm trying desperately hard to avoid food puns ...) ... and I think it's going to be really important that the designers of these profiles think about this, because nearly all the interesting problems we have to solve in the world are on the boundaries between communities.

These boundaries are particularly important for catalogue builders, because catalogues are for finding things that you don't already know much about. For most of us, by definition, those are the things furthest from our core experiences (and so the least likely to have been built into our core profiles).

So, building for interoperability outside our communities should be just as important as building for interoperability within our communities (modulo the reality that the core communities always pay for these things).

Well, philosophy aside, what's on the table?

Last time I made the point that if you extend a schema, for interoperability it will help if you can export your records in vanilla ISO19139 as well as your custom-ISO19139.

John Hockaday was a bit more explicit in an email to the metadata mailing list. If the header to an ISO19139 profile instance includes the link to the parent profile schema (as it must), then:

If the profiles are extensions, i.e. add extra elements to the ISO19115 metadata standard, then they will have to be translated using XSL to the ISO19139 format for other people around the world to use. The translation will remove the extra elements because ISO19139 will not recognise them as valid elements.

If profiles are restrictive, i.e., don't allow some ISO19115 metadata records, then they should also validate against the ISO19139 XSDs without translation using an XSL except that the header will have to be changed to point to the ISO19139 namespaces or the parser is told to use the ISO19139 XSDs.

He goes on to say (as I did) that

I expect the common format for exchange of metadata is the ISO19139 format. All profiles should provide and XSL for translation of their profile XML to the ISO19139 XML format. This will allow exchange of any metadata to be easily achieved.

BUT

I wonder whether we were right. I think we all understand that metadata translation is (generally) a lossy activity.

What do we want our interoperable records to achieve? Generally, I think we want discovery between communities. So the most important thing to do is find information.

In practice, we can exchange any derivative of the ISO19139 profiles using OAI and store them in xml databases - we don't necessarily need to parse them into a relational schema or into "our" xml schema. The question should really be: "How do I consume instances from someone else's schema?" As I've said before, we should ignore what we don't understand, and look for what we do! That means my discovery tool may want to look for familiar tags, use those, but when the discovery client wants to consume the record (perhaps having navigated using the tools I've provided), they want the original record (without any losses introduced by translation).

So, what we need to do in practice is index using common tags. Ideally the developers of profiles and standards should extend/contract all they like but, if possible, do so by declaring new (specialised) elements in new namespaces, with the same tag names!
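To make the "ignore what we don't understand, and look for what we do" idea concrete, here is a minimal sketch in Python. The record, its namespaces and its element names are all invented for illustration (they are not real ISO19139 or profile elements), and a real indexer would need to be rather more careful about structure than this.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def index_known_tags(xml_text, known_tags):
    """Collect text for elements whose local name we recognise; ignore the rest."""
    index = {}
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        text = (element.text or "").strip()
        if name in known_tags and text:
            index.setdefault(name, []).append(text)
    return index


# Hypothetical record: namespaces and tag names are made up for this sketch.
RECORD = """<p:record xmlns:p="http://example.org/profile"
                      xmlns:b="http://example.org/base">
  <p:title>Sea surface temperature</p:title>
  <b:abstract>Monthly means.</b:abstract>
  <p:cruiseCode>X123</p:cruiseCode>
</p:record>"""

if __name__ == "__main__":
    # The indexer only knows about 'title' and 'abstract'; 'cruiseCode' is skipped.
    print(index_known_tags(RECORD, {"title", "abstract"}))
```

The point of the design is that the index is built from tag names alone, so a specialised element in a new namespace still gets found, while the original record is kept untouched for whoever wants to consume it later.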

But the advice to the IOC includes this:

Some of the metadata packages have also restrictions. EG. dataQuality. We would have expected that there should have been a specialisation of the DQ_DataQuality class called MP_DataQuality that shows these constraints.

And every time one does this, it results in an invisible tag that won't work in a portable environment (so we will ignore it in code). What this says to me is that specialisation by restriction and name change leads to a lack of interoperability ... exactly where it should be easiest ... after all, the instance document should conform exactly to the parent schema too... and I had thought we would only have to change the header to make our indexes and validations work ... (as John suggested). But instead this advice results in an instance that does need to be transformed by XSLT. It would be nice to avoid that where possible. Wouldn't it be better in this situation to NOT change the tag name in the new schema? Could it be that the IOC has got bad advice, or is there some subtlety in the rules for restriction that I (and presumably John) haven't spotted?

I need to get back into this for the NumSim project and for some collaborations with our WMO partners ...

## On Access Control

As part of the dews project, we need to deliver access control for OGC Web Services. In particular, we're planning on limiting access to resources delivered by geoserver. The current concept for dealing with this is displayed in some simple UML:

The bottom line is that a normal request will first involve a redirection to establish a security context, followed by a re-request using it, and then calling the application itself. More details are on the ndg trac site.
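As a rough illustration of that request flow (and nothing more: the function names, the token format and the dictionary-based "requests" are all assumptions of this sketch, not the NDG or geoserver design), the redirect, re-request and call steps might look like this in Python:

```python
def protected_service(request):
    """Stand-in for the application (e.g. a map service) behind access control."""
    return {"status": 200, "body": "data for " + request["path"]}


def establish_security_context(credentials):
    """Hypothetical login step: returns a context token, or None if refused."""
    user = credentials.get("user")
    return "ctx-" + user if user else None


def gatekeeper(request, valid_contexts):
    """Redirect requests lacking a known security context; otherwise call the app."""
    if request.get("context") not in valid_contexts:
        return {"status": 302, "location": "/login?return_to=" + request["path"]}
    return protected_service(request)


if __name__ == "__main__":
    valid = set()
    request = {"path": "/wms"}
    first = gatekeeper(request, valid)  # no context yet, so we are redirected
    token = establish_security_context({"user": "alice"})
    valid.add(token)
    second = gatekeeper(dict(request, context=token), valid)  # re-request with context
    print(first["status"], second["status"])
```

The application itself never sees an unauthenticated request, which is the property the flow above is after.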

## On printing this blog

I had a read of this and implemented a simple print.css so it should be nicer to print things out from this blog now ...

by Bryan Lawrence : 2006/08/15 : Categories python (permalink)

## Granule Concepts

We've been struggling with a few concepts in mapping how we want variables and datasets to be related. The struggle, as in most technical discussions, is that one needs to be very exact about what one is saying. To try and simplify our discussions (and maybe yours), I've tried to produce some UML which relates concepts like datasets and files to the things we actually deal with.

The key relationships:

• datasets can be composed of other datasets

• datasets have discovery records

• datasets can be composed of granules

• granules are not independently discoverable

• granules are composed of phenomena

• granules are associated with (potentially multiple) files

• files can be associated with (potentially multiple) granules

• granules are associated with variables

• a phenomenon is associated with (potentially multiple) variables

• a variable has exactly one CF standard name and is associated with exactly one member of the BODC vocabulary

• a CF variable may be associated with CF coordinate variables, and if so,

• the relationship between the variable and the coordinate may have a CF cell method associated with it.

• a CF coordinate variable is simply a special case of a CF variable

• a MOLES dgDataEntity is an implementation of the concept of a dataset

• a MOLES dgGranule is an implementation of the concept of a granule.
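For those who read code more easily than UML, the relationships listed above can be sketched as a handful of Python dataclasses. This is my own illustrative rendering: the class and attribute names are assumptions, and it is not the MOLES schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variable:
    cf_standard_name: str  # exactly one CF standard name
    bodc_term: str  # exactly one member of the BODC vocabulary
    # Coordinate variables are just variables reached through a link.
    coordinates: List["CoordinateLink"] = field(default_factory=list)


@dataclass
class CoordinateLink:
    coordinate: Variable
    cell_method: Optional[str] = None  # the relationship may carry a CF cell method


@dataclass
class Phenomenon:
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Granule:
    name: str  # not independently discoverable: no discovery record here
    phenomena: List[Phenomenon] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    files: List[str] = field(default_factory=list)  # a file may appear in many granules


@dataclass
class Dataset:
    name: str
    discovery_record: str
    children: List["Dataset"] = field(default_factory=list)
    granules: List[Granule] = field(default_factory=list)


def all_files(dataset):
    """Every file reachable from a dataset via its granules and child datasets."""
    found = {f for g in dataset.granules for f in g.files}
    for child in dataset.children:
        found |= all_files(child)
    return found
```

A quick usage example: two granules sharing one file, inside a nested dataset, still yield each file once from `all_files`, which reflects the many-to-many relationship between granules and files.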

## New Plans for Leonardo

One of the things that came out of today's meeting about campaign support within BADC, was a requirement to provide a "campaign diary", which would

• be persistent beyond the duration of a campaign,

• allow the upload of files for sharing (providing a directory interface),

• provide a timeline of entries, activities and links,

• support multiple authors,

• allow annotation

• support mathematics

Most of this would be supported by any blogging package, but Leonardo has a couple of significant advantages, not least of which is my own personal involvement in the development. Hopefully we'll test drive leonardo in this context for a real campaign later this year.

To do that, we'll need to:

• vastly improve the documentation

• provide a print.css

• support version control

• provide atom feeds of the categories

• allow a directory view of the url space, including icons for non xhtml mime types.

Is it worth doing the engineering? Should we find something that already does all of this? My gut feeling is that it is still worth the effort. Apart from my dilettante-like efforts:

• Leonardo is standalone (that's important for getting folk to deploy their own instances),

• Leonardo does support maths,

• Leonardo is easy to modify ...

Something to think about when I come back from hols I think .. not for now!

## dilettante

A long time ago I was given some good advice which I have consistently ignored: "to get a reputation, you need to be an expert in something, and the last thing you want to be is a dilettante".

I think it's good advice, but sometimes you can get away with it, and mostly I have ... but the last eight working days have stretched my belief that I have got away with it!

I've been taking advantage of having five day weeks (no child care days during school holidays when Anne is at home to look after Elizabeth), to have a succession of day long "special issue meetings" as well as a bunch of other shorter meetings. The biggish meetings have been:

• processing affordance (in the context of GML and NDG),

• security paradigms (in the context of NDG),

• content management systems (for NCAS and BADC),

• python package management (for NDG).

The smaller meetings have covered

• Campaign Management by the BADC for the UK atmospheric science community,

• Parallel processing plans for climate diagnostics,

• Data Assimilation Theory,

• Ocean/Atmosphere Coupling

• Metadata objects for environmental science,

• The CCLRC environmental strategy.

... and I've had the usual raft of scheduled meetings ...

If that's not evidence of a dilettante's workload, I don't know what is. I just about feel like I'm keeping my head above water, but I know what I haven't done ... meanwhile I sure know that

1. I need a holiday, and

2. I need some sustained time on some single issues.

Fortunately the holidays are nearly upon me ... I'll be spending the last two weeks of August in Wales, a week in the mountains, and a week on the beach. No laptop. No internet. Will it be bliss?

I don't know what I'm going to do about the single issue thing ...

by Bryan Lawrence : 2006/08/09 (permalink)

## what is bt up to this time?

For the last three nights (including tonight), konqueror has been misbehaving on both my home linux systems (one running Suse, one kubuntu). Neither O/S has been modified for a long time ...

... yet all of a sudden, nearly all web pages return "Unexpected end of data, some information may be lost."

Mozilla seems fine with the same sites. My laptop, which is the ubuntu system, is fine when I vpn into work ... to the same sites ...

Conclusion: BT has been up to something with caching or some sort of nasty middleware, but I can't think of any way of getting this fault description through their microsoftacious technical support route ...

Has anyone else seen this? Or anything like it?

(Update: it looks like the css files are not being downloaded. I can't even view a css file from konqueror via bt, even if it is visible in firefox ... now I'm really confused .. nor can I view page source, even though the dom is inspectable and visible).

(Update, 11th, and now it's all fine again ...)

## Affording Interfaces

... and I don't mean how much it costs :-)

Yesterday I tried to get across the concept of processing affordance, a construct which it appears we have to invent because of limitations in GML.

The big deal we have left to thrash around with is how to describe these affordances and link them to interfaces ... Well, I think they're different ideas, as I've tried to express in this diagram:

It has been suggested that we use a registry formalism to describe the association between operations and features (affordance), but that was when we (I) hadn't made a distinction between operations and interfaces. If we use a registry how do we deal with the choices that need to be made about inheritance of affordance? (Update: Andrew tells me that I should look at WSDL2, so I will ... and note that the diagram has also been altered to avoid an incorrect UML use of interface stereotype that Andrew pointed out to me).

Obviously the computational science world has wrestled with this sort of problem for ages, so there has to be a clean solution, we just need to work out how to deploy it in our "model-driven" architecture environment.

by Bryan Lawrence : 2006/07/28 : Categories ndg : 1 trackback (permalink)

## What does zero emission mean?

physorg is reporting that Texas and Illinois will compete for the world's first near-zero-emissions coal plant.

Fabulous, I thought, until I asked myself what zero-emission meant, and chased it up. Actually it means "near zero emission into the atmosphere right now". It would seem the project is about building plants that do carbon sequestration. That's not necessarily a bad thing. Frankly, I have no idea whether it will work long term or not. But what is a bad thing is to be duplicitous while you're doing it: these things are not actually zero emissions, they're still polluters. What this actually means is the pollution will be buried! Again, nothing necessarily wrong with that, humanity has been doing it since civilization was invented, but burying pollution is not the same as not polluting, and it must eventually be a problem. If we're going to talk zero-emission, then we ought to be explicit about what it means.

(It must be the humidity, I'm feeling pedantic today).

by Bryan Lawrence : 2006/07/27 : Categories environment (permalink)

## On Processing Affordance

When we produced the Exeter Communique, we spent a lot of time talking about something that Simon Cox has termed "processing affordance". A processing affordance is a property of a feature1 type which expresses "what can be done to or with it".

Some of us (ok, maybe just me), find it useful to distinguish between intrinsic affordances and extrinsic affordances, by which we (me) mean things that depend on the properties of the feature that were anticipated by the feature type creator (intrinsic), and things that may be independent of the properties of the feature, and are certainly things that are not described and maintained by the person or organisation that governs the description of a specific feature type (extrinsic).

This blog entry is effectively the record of a conversation that Andrew Woolf, Jeremy Tandy and I had, where we were trying to tie down (mainly for my benefit I fear), exactly why we need this affordance concept. (Of course, it's my record, so it may be nonsense ...)

If we concentrate only on intrinsic affordances, then they are certainly something we describe in our UML domain modelling. A simple example would be that if we had a gridseries feature, we clearly anticipate an operation of subsetting which allows one to extract a series ... (we anticipate a lot of other things too, but that's fairly obvious). Suppressing extraneous detail (and arguments about whether a geometry like "grid" should be a feature), in UML, we might have our simple domain model looking like:

So far so good, but when we want to serialise it, and if we want to use GML, then we hit a serious snag. The design and history of GML as a data modelling activity means that the GML coverage type has already ditched the operations in its version of the parent CV_Coverage class, and the GML schema has no inherent mechanism for describing operations. (I spent a lot of time being confused by this, as folk had insisted on saying xml-schema has no mechanism for describing the operations, and that's obviously daft ... what they had meant was that the GML-xml-schema has no mechanism).

Because we don't believe in multiple inheritance (the ISO mechanism explicitly says we should only inherit from one class, which amongst other reasons, makes the software easier to automate), we're stuck. In practice we have to serialise via GML schema so we have to come up with an independent method of serialising the operations (which when they are inherently tied to a feature become "affordances"), and then creating the link. Something like the following

The question we are left with then is how exactly to implement these relationships between operations and their parent features in a way that is likely to be consistent with the OGC methodology and one that others will buy into. That will be the topic of another note ...
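To show the shape of the problem in code rather than UML, here is a purely illustrative Python sketch (it is not the diagram from this post, and every name in it is an assumption). The feature type carries data only, as a GML-style serialisation would, and the operations live in a separate registry keyed by feature type, which is one way of reading "affordance".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GridSeries:
    """Data-only feature: no operations are attached to the type itself."""
    times: List[int]
    grids: List[List[float]]  # one flattened grid per time step


# Hypothetical registry: feature type name -> {operation name -> implementation}.
AFFORDANCES: Dict[str, Dict[str, Callable]] = {}


def affords(feature_type, operation_name):
    """Decorator recording that a feature type affords the given operation."""
    def register(func):
        AFFORDANCES.setdefault(feature_type, {})[operation_name] = func
        return func
    return register


@affords("GridSeries", "extractSeries")
def extract_series(feature, cell):
    """Subset a grid series to the time series at one grid cell."""
    return [grid[cell] for grid in feature.grids]


def invoke(feature, operation_name, *args):
    """Look up an operation afforded by the feature's type and apply it."""
    return AFFORDANCES[type(feature).__name__][operation_name](feature, *args)


if __name__ == "__main__":
    series = GridSeries(times=[0, 1], grids=[[1.0, 2.0], [3.0, 4.0]])
    print(invoke(series, "extractSeries", 1))
```

Note that a flat registry like this says nothing about how affordances are inherited by specialised feature types, which is exactly the part that remains open.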

1: where we use the word feature in the ISO19123 sense.

by Bryan Lawrence : 2006/07/27 : Categories ndg : 1 trackback (permalink)

## NASA earth exploration

A couple of days ago, I reported that NASA was killing off earth observation. Chris, in comments to that post, pointed out that despite the overall change in emphasis the current strategic plan (pdf) in section 3a listed eight new missions. Given I'm supposed to know a bit about this stuff, I thought I'd make sure I was up with the play with NASA's plans. So for the record the NASA planned missions are:

• the "National Polar Orbiting Operational Environmental Satellite System Preparatory Project" (phew) ... NPP ... which is a tri-agency thing, aiming to "eliminate the financial redundancy of acquiring and operating polar-orbiting environmental satellite systems, while continuing to satisfy U.S. operational requirement for data ...". A new polar orbiter will be launched in 2009 with a five year design life.

• Cloudsat and CALIPSO - a pretty important mission from a climate perspective (and being a cloud guy nowadays of significant interest to me), and it launched earlier this year.

• The Glory Mission - which I have to admit to not having heard of before a bit of research today, is an on and off-again type mission. It looks like it's still scheduled for a 2008 launch, and mission aims are (from one of the glory web sites):

1. The determination of the global distribution, microphysical properties, and chemical composition of natural and anthropogenic aerosols and clouds with accuracy and coverage sufficient for a reliable quantification of the aerosol direct and indirect effects on climate;

2. The continued measurement of the total solar irradiance to determine the Sun's direct and indirect effect on the Earth's climate.

• The Global Precipitation Measurement Mission (GPM). As their web site states, precipitation is a key part of the climate system, and this is a pretty important joint activity with the Japanese Aerospace Exploration Agency (JAXA) and other international agencies to develop a constellation of satellites to measure precipitation on a global scale. (It appears that Bush's influence on this activity has already been hindering, and it's been further delayed this year! - 2013+?).

• The Ocean Surface Topography Mission (OSTM)- this is about operational altimetry (i.e. measuring sea surface height with enough regularity to enable the use of the data directly in models via assimilation). It's another massive joint effort involving EUMETSAT, NOAA, as well as NASA and France's CNES. Due for launch in 2008!

• Aquarius, to be launched in 2009, will be a joint mission with Argentina to measure global sea surface salinity. Again, this is pretty important, because this will help quantify the physical processes and exchanges between the atmosphere, land surface and ocean (i.e. runoff, sea ice freezing and melting, evaporation and precipitation over the oceans).

• The Orbiting Carbon Observatory (OCO, scheduled for launch in late 2008) will be producing vertical atmospheric profiles of aerosol content, temperature, CO2 and water vapor. Retrieved data will also include scalar measures of albedo, surface pressure and the column averaged dry air CO2. With a planned resolution of gridded data products of one degree monthly averages, and sources and sinks mapped at a slightly lower resolution, it will still provide enough information to start evaluating the real greenhouse gas polluters. The data may well be politically embarrassing!

The document also reports a number of other key NASA activities which are directly related to earth observation (and climate change):

• Significant work on advancing radar, laser etc technologies to enable new space instruments.

• They are planning to work with the US Geological Survey to secure long-term data continuity of Landsat type observations,

• They are making significant progress assimilating EO data into models

• They are working on policy and management decision support tools, and

• Working on interagency cooperation to mature their research instruments into operational systems (and to utilise operational systems for research)

So there is no doubt that NASA is doing and will do fabulous earth science that will make huge contributions to understanding anthropogenic (and other) effects on the earth system.

However, what is noticeable is that all these missions (with the exception of GPM) are due for launch in the next three years (and GPM has been heavily delayed, so it's an "older" mission already). One might argue that the baleful influence of politicians is already visible in the lack of new earth observation missions for the next period (or maybe I just don't know about them and they're not yet visible in the strategic plan because they're at an earlier state of development ... I promise to ask some of my colleagues who ought to know more, and if there are things in the pipeline I'll post about them).

Clearly now the EO words have gone from the mission statement, one can anticipate that it will be even harder to get new missions ... and even harder to justify (with NASA funding) working up the data. However, one can hope that the planned decadal survey by the US National Research Council, referred to at the end of section 3a in the NASA strategic plan, might have strong enough words in it that NASA just has to do the work!

(One might argue that as a kiwi working in Britain I have no right to be bleating about what NASA is and isn't doing, but the fact of the matter is that the climate and environment are global problems, requiring global solutions, and we all need to pull together and use whatever instruments and data we can to make progress, so we all need to know what the key players are up to!)

## Oreskes responds

Very early in my blogging career, I reported the clear consensus in the climate community over the realism and causes of global warming. Recently a smart famous man refuted the Science paper I quoted in a crappy article in a crappy paper (well, any paper that let such a poor piece get published is a crappy paper - sorry Wall Street Journal).

Anyway, the original author (Naomi Oreskes) has written a response piece in the Los Angeles Times (not one of my normal reads, so thanks to John Fleck). Anyway, some choice bits:

My study demonstrated that there is no significant disagreement within the scientific community that the Earth is warming and that human activities are the principal cause.

To be sure, there are a handful of scientists ... who disagree with the rest of the scientific community ... this is not surprising. In any scientific community, there are always some individuals who simply refuse to accept new ideas and evidence. This is especially true when the new evidence strikes at their core beliefs and values ... Scientific communities include tortoises and hares, mavericks and mules.

I've conflated a couple of her paragraphs to produce the following, but I think it's a fair summary of the situation:

... the panel (the IPCC) has issued three assessments (1990, 1995, 2001), representing the combined expertise of 2,000 scientists from more than 100 countries, and a fourth report is due out shortly. Its conclusions - global warming is occurring, humans have a major role in it - have been ratified by scientists around the world in published scientific papers, in statements issued by professional scientific societies and ... Yet some climate-change deniers insist that the observed changes might be natural, perhaps caused by variations in solar irradiance or other forces we don't yet understand. Perhaps there are other explanations for the receding glaciers. But "perhaps" is not evidence.

by Bryan Lawrence : 2006/07/24 : Categories environment climate (permalink)

## America sliding towards Australia

More evidence of convergence can be found in the fact that the U.S. is trying to kill off earth observation ... yet again.

The press release quotes Jim Hansen:

"They're making it clear that they ... prefer that NASA work on something that's not causing them a problem."

by Bryan Lawrence : 2006/07/24 : Categories climate environment : 1 comment (permalink)

## lurking in Exeter ... in the cool

It's my month for lengthy train journeys: last week to Plymouth for a couple of days, and this week to Exeter for a couple of days. So far (three out of the four I will have completed by tomorrow) the train journeys have been pretty much on time, and good value (in terms of time spent working and minimising my CO2 production). I mention the timeliness to see if Murphy reads blogs! (Update: Murphy not a blogreader: four out of four timely trains!)

I feel guilty. I just spoke to Anne back in Oxfordshire where the digital thermometer inside the house is reading 32C ... and this at a time when Benson (EGUB) is reporting 33C ... so obviously it has been hotter (generally inside our house doesn't get nearly as hot as it is outside) ... according to the BBC today is the hottest July day - 36.3C somewhere - since records began (beating a record that has stood since 1911). I'm guilty because it's only 24C here in Exeter, and I can imagine how uncomfortable it is at home ...

by Bryan Lawrence : 2006/07/19 (permalink)

## Laptop Purchase

My laptop is nearing the end of its life. Switches are breaking. Applications run slow and the fan seems to be perpetually on, what with beagle and all the other stuff I seem to depend on (expensive memory hogs like the java versions of freemind and the oxygen xml editor along with a cross over office instantiation of Enterprise Architect1).

So, I've been considering how to get what I want, which is a pre-installed linux on hardware which meets my criteria. I want

• Core Duo 2.0 GHz up ...

• 80 GB Hard disk 7200 rpm

• kubuntu

• 1 GB memory

• 1.8 to 2.3 kg in weight (4 to 5 pounds in old money, sorry American readers, your units are behind the times)

• A docking station

• Decent battery performance.

• Support for external monitor at 1900x1200 or a display projector (beamer) without hassle.

• Software suspend and hibernate.

• Higher resolution display than 1024x768 ... preferably in both dimensions!

I've had a look at the Dell 620, but the docking station options on the dell site seem to imply I can't plug my monitor cable in ... what's the point of that? Also, can't find anyone who will preinstall linux for me on one of those (yes, I could do it myself, but this thread suggests that I might have more work than I want ... or have time for).

A Lenovo Thinkpad T60 looks like a starter, and emperorlinux appear to offer what I need (albeit with an American keyboard). But EL are very slow at replying to email ... not sure about giving them money! (Update 20th of July: slow they may be, but they can do what I want and they have got back to me with a compelling quote - now all I have to do is get it through our local purchasing) (EL have a Sony that fits the specs too, and while I currently have a Sony they don't have anything approaching a sensible global warranty).

The only company I could find in the UK which preinstalled linux had too small a set of offerings to deliver what I need. I've heard rumours of HP and lenovo preinstalling linux but HP doesn't have any suitable hardware (whatever the O/S) and Lenovo don't actually offer it to individuals (yet?) as far as I can tell.

Anyone able to give me any advice?2 ... Certainly seems like there is a commercial opportunity here somewhere ...

1: Given that only one of those four key components is free, I think no one can argue that I only ever espouse free and/or open software.
2: JP, don't suggest a mac again, we're not that wealthy at the BADC :-)

by Bryan Lawrence : 2006/07/18 : Categories computing (permalink)

## openlayers

I think the advent of openlayers might potentially be more important than Google Earth ... it's certainly something we'll have to implement for ndg.

by Bryan Lawrence : 2006/07/18 : Categories ndg computing (permalink)

## Whither Our Web Servers - Part III

In which we return to performance ... having been discussing web server technology from first principles to proxy servers.

So, we have two things for which we need to improve performance:

• our web servers per se, and

• the data delivery services they expose,

and we need to do this in a way that preserves efficient software development and maintenance (we need to avoid the curse of premature optimisation1, at the same time as improving services).

So the first thing we should do is ask what we mean by "improving performance". Well, in this context I do mean improving the user experience but I don't mean ensuring accessibility and standards compliance. One quick win will be to use more web servers (i.e. buy more hardware), and simply spread the load across more systems. That's underway. On the slightly longer timescale (weeks to months), we're investigating ultra monkey, which will allow us to address serving data more efficiently. We're also looking at replacing NFS in our storage systems as much as possible with more efficient protocols. On the even longer timescale, we need to work out how to migrate some of our python cgi stuff to faster server environments, which is really what this series of blog notes is about. While the server environment is not really a pinch point for many of our data delivery services (which are generally constrained by NFS I/O, processing code, and the FTP protocol in sequence), it is the point at which user frustration can begin ... while a slow website annoys folks more than is warranted, it is still the perception that counts!

Ok, so back to the server, starting with python server frameworks. A framework is something that helps build a web application, minimally including something to resolve the URLs, deal with HTTP headers etc, give access to context variables and provide session management. Most frameworks provide some sort of database access system ... we'll come back to that.

Given that we need to be able to go from demonstration to production quickly, it makes sense that we deploy frameworks that allow fast development and good efficiency. Obviously given our environment (and budget), we're looking at open source solutions (with the caveat that I'm nervous about open source frameworks that don't have big communities).

But there are a zillion python frameworks, and that's the biggest problem python has ... too much choice (and fragmented development and support). However, python also has the Web Server Gateway Interface (WSGI), which provides a protocol to allow frameworks to communicate with applications. In principle one can develop an application that doesn't know too much about the framework in which it is embedded ... but it probably still needs to talk to a database. The database access could be built into middleware (which could lie between a parent framework and your application, all communicating with WSGI) but more usually the database interaction is integral to the code, is framework specific, and needs to be in the application. That breaks some of the framework integration possibilities of WSGI. (Incidentally database integration is the big problem with most of the frameworks we've investigated: they work fine if you are developing a new application, but may be less flexible in deploying an application which needs to talk to an existing database).

Nonetheless, it appears that WSGI still allows some significant flexibility if you're trying to work up from CGI to something which performs better. Apparently frameworks have generally had their own way of talking to the server: minimally using CGI, and then via generic (Apache) modules like mod_python, mod_fastcgi and mod_scgi (of which more in a later post). However, most of them now offer WSGI, which means one can develop code in any given framework that can be embedded in most others, and talk to servers either via such middleware frameworks, or directly via a WSGI server.
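To make the WSGI idea concrete, here is a minimal sketch (modern Python 3, standard library only, not any of our actual code). The application is a plain callable, the middleware wraps it without knowing what it does, and the choice between CGI and a persistent server is made separately from both. The names `add_header_middleware`, `call_directly` and the `X-Served-By` header are made up for illustration.

```python
from wsgiref.handlers import CGIHandler          # referenced in the deployment notes below
from wsgiref.simple_server import make_server    # referenced in the deployment notes below
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A framework-neutral WSGI application: one callable, no server details."""
    body = ("Hello from %s\n" % environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def add_header_middleware(app, name, value):
    """Hypothetical middleware: wraps any WSGI app and adds one response header."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            return start_response(status, headers + [(name, value)], exc_info)
        return app(environ, patched_start_response)
    return wrapped


def call_directly(app, path="/"):
    """Call a WSGI app with no server at all; returns (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"], captured["headers"] = status, headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


wrapped_app = add_header_middleware(application, "X-Served-By", "wsgi-sketch")

# Deployment is a separate decision from the application code, e.g.:
#   CGIHandler().run(wrapped_app)                                # one process per request
#   make_server("localhost", 8000, wrapped_app).serve_forever()  # persistent process
```

The point of the sketch is only that the same `wrapped_app` object can be handed to either deployment route unchanged; a database layer would be the part that is harder to keep framework-neutral.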

Once again, it's time to stop, and we still haven't gotten to performance. I might as well be writing a book. That wasn't my intention :-(. Anyway, next time we'll talk about deploying WSGI etc.

1: I used Cook Computing as a link for this, because I like the fact he tends to go back and get the quotes right, e.g. Postel's Principle.

## the scope of blogging

Allan Doyle pointed me to a provoking piece asking why the GIS community don't have blog conversations? Well, I'm not really a GIS person but as is obvious from the direction that ndg is taking, I'm heading in that direction. Dave Bouwman's piece essentially posits two explanations:

• It's a numbers thing ... the community is too small.

• It's a technical savvy thing. The GIS community isn't up with the play (clearly clever enough, but maybe it's just one tool - albeit an easy one - too far?)

The links in Allan's post point to some rebuttal of these points, and add some more explanations, including teasing out "the newness of it all" as a (the?) fundamental explanation.

I think these are all fair explanations, but I have another (which is a variation of the size thing). From what I can see there are two classes of (technical) blog conversation going on:

• what I might term open source discussions where the work and conversation is not the core activity of the folk involved, and

• corporate communication blogging - both managed and unmanaged (by which I mean some of which is planned and controlled by management, and some of which is actively encouraged, but not guided).

These two classes then represent

• enthusiastic folk talking about what they like doing beyond their jobs, and

• enthusiastic folk talking about what they do because either the company makes it happen or it accepts that it's a good thing (TM) for the company.

These enthusiastic folk are generally good communicators who get a lot out of conversations, virtual or otherwise. There are another two classes of folks:

• folks who plain don't communicate well. Never mind the medium. So let's forget them, they'll never blog effectively, if at all.

• folks who could and would be enthusiastic bloggers, but are not in the position to be publicly so ... in order to protect IPR for later commercial (or academic) exploitation. Generally, but not always, there will be an employer proscription (either real or imagined), which is behind these folk not blogging (even if they were savvy enough and the community big enough).

I think this last reason is directly coupled to the newness thing, and the maturity of any given company's understanding of the open revolution; that is, that progress, in some markets, will be faster and more financially successful, by working in public than in private. This sort of thing is espoused by, amongst others, Simon Phipps of Sun [1].

Note that this is not just about open source, it's about open working; my guess is that effective bloggers are more effective workers and deliver the goods faster ... by exploiting the new learning/feedback medium. However, that exploitation depends on critical mass, and both GIS and climate science are still missing that critical mass, and that's one reason why GIS workers aren't blogging: there aren't yet enough open source GIS blogs creating a critical mass so that the more commercially orientated folks become even better at meeting their commercial goals by blogging ...

So, my variation of the size thing ends up needing a bigger pool of bloggers, and so Dave's redux post is exactly right (for all communities). Those of us who blog should:

• make an effort to bring more people into blog / rss / aggregator awareness,

• suggest that people start their own blogs so we have wider diversity, and

• compose posts such that they invite conversation.

I only wish I had the time to write something shorter, and thus more inviting of a reply ...

1: via Tim Bray who also adds a bit of sanity on the reality of open source development, although he neglects the academic development of open source things that seeds some really successful projects and businesses.

by Bryan Lawrence : 2006/07/18 (permalink)

## Trac Macro Hacking

I make a lot of use of freemind, and because ndg uses trac, I wanted to be able to put my notes straight up on the wiki.

I had a look around for relevant macros, and found this one, but while it seems to make nice flash diagrams, I just wanted to be able to simply list my freemind... so I delved into trac and wrote a simple wiki macro.

The code is here. To use it, simply add the mindmap as an attachment to a trac wiki page and then put

 [[SimpleFreeMind(attachmentFileName)]]


on your wiki page. It won't preview, but it will be there for real ... It's a bit ropey and untested, so I'm not putting it up on the trac-hacks site until it has had some significant use.
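For anyone curious what "simply listing" a mindmap involves, the sketch below is not the macro itself, just a stand-alone illustration (modern Python 3) of the core step: walking the nested `node` elements of a FreeMind `.mm` file and emitting an indented bullet list. The function name `freemind_to_wiki_list` and the example map are made up, and the leading-space list indentation is my assumption about what the wiki formatter wants.

```python
import xml.etree.ElementTree as ET


def freemind_to_wiki_list(mindmap_xml):
    """Turn FreeMind XML into an indented bullet list, one line per node."""
    root = ET.fromstring(mindmap_xml)
    lines = []

    def walk(node, depth):
        # FreeMind stores the label of a plain node in its TEXT attribute.
        text = node.get("TEXT", "")
        lines.append("%s* %s" % (" " * (2 * depth + 1), text))
        for child in node.findall("node"):
            walk(child, depth + 1)

    for top in root.findall("node"):
        walk(top, 0)
    return "\n".join(lines)


example_map = """<map version="0.8.0">
  <node TEXT="NDG">
    <node TEXT="Discovery"/>
    <node TEXT="Security">
      <node TEXT="pyxmlsec"/>
    </node>
  </node>
</map>"""

print(freemind_to_wiki_list(example_map))
```

Inside a real macro the attachment would have to be located and read first, and the resulting text handed to the wiki formatter; both steps are left out here.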

by Bryan Lawrence : 2006/07/17 : Categories python ndg : 2 comments (permalink)

## Curious as to how this forecast goes.

The met office via the BBC is forecasting another hot week:

it'll be interesting to see how this one goes ...

Update (Tuesday night): A pretty good forecast thus far, at least as far as the maxima go. It hit at least 31 yesterday, and at least 32 today. I'm not so sure that the minimum temperature forecast is that good though! Anyway, this is the next five days forecast:

One day I might dig out some real data and look into whether or not this really is an uncommonly hot summer ...

by Bryan Lawrence : 2006/07/16 : Categories environment (permalink)

## hapless bt strike again

Some of you may remember my bt broadband saga ... Last week my voyager 2091 modem died ... all the lights flash properly but there was no one home inside ... and it no longer reacted to the reset button (or anything actually, dhcp from ethernet and wireless got a great big nothing back) ...

So, I rang BT broadband technical support, and they (after an hour on the phone) agreed with me that it was buggered ... agreed that it was still under warranty, and supposedly ordered me one (to arrive within three to four working days).

You'll not be surprised to know that it didn't turn up, and so I phoned up last night, got fobbed off til today, and then eventually I discovered that they weren't sending me one because there was no record of my having been sent one on this telephone line ... (so why did they think it was under warranty?)

How can they

• have no record of having sent it to me? (Maybe some crapola between the home-highway team and the broadband team?)

• possibly not ring me up and tell me the product that they promised isn't going to come? (Shades of engineer visits that don't happen).

So today, after two phone calls to two different numbers talking to two people who had no clues ... I finally got transferred to someone who sounded sensible, and she's ordered me one (for free!), to arrive Monday ... we'll see.

How can such a big company survive with such crap customer-management systems?

by Bryan Lawrence : 2006/07/13 : Categories broadband : 1 comment (permalink)

## Whither Our Web Servers - Part II

Last week I introduced our problem (how to move from python-cgi to python-"something faster", but in a framework with other stuff), and some things I've learned. This week's episode covers mod_proxy and a bit more ...

OK, so we left it with mod_xxx (insert your language) not recommended as a solution, and perhaps a taste of an implicit recommendation for lighttpd and/or scgi (if you're python based). But before we go there, we have some more things to consider.

Apparently what Zope and Java servers actually do is create their own http servers which do their own:

URL parsing/generation for you to make sticking them behind a HTTP proxy (like Apache's mod_proxy, or Squid, or whatever) at an arbitrary point in the URI a piece of cake. Typically you redirect some URL's traffic (a virtual host, subdirectory, etc.) off to the dedicated app server the same way a proxy server sits between your web browser and the web server. It works just like directing requests off to a Handler in Apache, except the request is actually sent off to another HTTP server instead of handed off to a module or CGI script. And of course the reply comes back as a HTTP object that's sent back to the originator.

Mark seems to argue that expanding on this approach, which I would summarise as Using a proxy server to farm requests off to dedicated servers deploying http interfaces, ought to be the road down which most should travel. However he also predicts that what he calls the technically less impressive options of FastCGI and/or SCGI will get improved implementations in Apache instead and become more important. Pedro Melo summarises this choice as:

should we continue with (FS)CGI and friends or use basic HTTP between the front-end web server and the back-end application servers?

and goes on to conclude that if we use HTTP between the front and back end servers, we end up having to write, test and debug the http stack for each back end server, instead of using tried and tested implementations (e.g. Apache, lighttpd etc) to do the http server, and something specific for the application, joined by a relatively simple protocol.

Ian Bicking in a typically well argued piece makes a strong argument for the (X)CGI school:

What FastCGI and SCGI provide that HTTP doesn't is a clear separation of the original request and the delegated request. REMOTE_ADDR is the IP address of original request, not the frontend server, and HTTP_HOST is also the original host. SCRIPT_NAME and PATH_INFO are separated out, giving you some idea of context.

Ok, at this point I'm sold on the argument that we use a proxy server to farm things out to a backend service (or server) which might be an (X)CGI module running on the initial server, or it might be running on a different server (still with Apache or lighttpd or somesuch interfacing it) allowing more sophisticated access control and/or load distribution.
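Ian's point about the request context is easier to see in code. The sketch below (modern Python 3, standard library only) simply reads the four variables he names from a CGI-style environment. The request values and the function name `describe_request` are hypothetical; the environment is built by hand here rather than coming from a real FastCGI or SCGI front end.

```python
from wsgiref.util import setup_testing_defaults


def describe_request(environ):
    """Collect the CGI-style variables that describe the original request."""
    return {
        "client": environ.get("REMOTE_ADDR"),           # the original requester
        "host": environ.get("HTTP_HOST"),               # the host the client asked for
        "mount_point": environ.get("SCRIPT_NAME", ""),  # where the app is mounted
        "path_in_app": environ.get("PATH_INFO", ""),    # the rest of the URL
    }


# A made-up request, as a front-end server might hand it to the backend.
environ = {}
setup_testing_defaults(environ)
environ.update({
    "REMOTE_ADDR": "192.0.2.10",
    "HTTP_HOST": "www.example.org",
    "SCRIPT_NAME": "/browse",
    "PATH_INFO": "/dataset/42",
})

for key, value in sorted(describe_request(environ).items()):
    print("%s = %s" % (key, value))
```

With plain HTTP proxying the backend would instead see the proxy as the client unless the front end passes the original address along in an extra header (X-Forwarded-For is the usual convention), and the split between mount point and path has to be agreed by configuration.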

That's probably a good place to stop today's missive. Next time we'll see if we can get any evidence on the efficiency of these backend servers and compare them with the jsp backend servers. I have a friend who has offered to eat his hat if java on the backend server isn't the most efficient option ... I don't particularly want him to eat his hat, but I'd rather folk didn't offer up such risky eating options!

## Microsoft goes open document

Finally, Microsoft have come out and admitted they're being forced to play in the open document sandpit. Tim Bray has a roundup of relevant links. Bob Sutor has a set of history links that define how we got here. I've been quiet on this issue for a long time, from pressure of work. But you can see that I care (and why) if you check out my msxml category page.

The bottom line is that Microsoft will initially deliver an import routine so Office can read ODF documents ... it isn't fully featured yet, and it's not really clear that it will be, but it is a good step. From my point of view however, I will only be happy when ms-documents are saved natively in ODF with no loss of information! Having seen an example of the two formats, there is no way I want an archive full of the MS openXML product ...

by Bryan Lawrence : 2006/07/10 : Categories msxml (permalink)

## What does a forecast icon mean?

Well, the forecast is beginning to go pear-shaped, although not through lack of heat ... or perhaps not. Here are four MetOffice via BBC weather forecast snapshots, but I'm not really sure what they mean ...

Firstly, here is a 24 hour forecast issued late Monday, followed by a five day forecast (today is Wednesday):

What's interesting about this is that the 24 hour forecast makes it clear that the weather may be foul at some point during Tuesday ... but the 5 day forecast average symbol (actually, I think the word on the website is "predominant") ... suggests sunshine. Me, I think I would want to know in a five day forecast that rain and thunderstorms might occur. Mind you, the website states:

This is calculated based on a weighting of different types of weather, so if a day is forecast to be sunny with the possibility of a brief shower, then we will see a sunny or partly cloudy symbol rather than a rain cloud.

Well ok, at least they did explain the methodology, but compare this with today's forecast:

Now, the average symbol tells me what I want to know - risk of rain, but the 24 hour forecast makes it clear that it's not expected to be a big part of the day. Now that does contradict what the website says, so there is an element of inconsistency here (unless we believe that there are thunderstorms between the sunny spells, since the hourly values appear to be instantaneous).

Leaving aside the fact that a look out the window suggests the average is more correct, one is left wondering exactly what the algorithm is for constructing these things. (Clearly it's autogenerated, because you can get these on a postcode basis).
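I don't know the real algorithm, but a toy version of "a weighting of different types of weather" shows how the same scheme can give either answer. Everything in this sketch is invented for illustration: the symbol names, the weights, and the function `predominant_symbol` are my assumptions, not anything published by the Met Office or the BBC.

```python
from collections import Counter

# Hypothetical weights: how much one hour of each symbol counts towards the day.
SYMBOL_WEIGHTS = {
    "sunny": 1.0,
    "partly cloudy": 1.0,
    "light rain": 1.5,
    "thunderstorm": 3.0,
}


def predominant_symbol(hourly_symbols, weights=SYMBOL_WEIGHTS):
    """Pick the day's symbol as the one with the largest weighted hour count."""
    totals = Counter()
    for symbol in hourly_symbols:
        totals[symbol] += weights.get(symbol, 1.0)
    # Sorting first makes ties resolve the same way every time.
    return max(sorted(totals), key=totals.get)


mostly_sunny = ["sunny"] * 10 + ["thunderstorm"] * 2   # weighted 10 vs 6
stormier = ["sunny"] * 5 + ["thunderstorm"] * 2        # weighted 5 vs 6

print(predominant_symbol(mostly_sunny))
print(predominant_symbol(stormier))
```

With these made-up weights, two hours of thunderstorm are hidden behind a sunny symbol on a long sunny day but dominate a shorter one, which is the sort of behaviour that would explain the apparent inconsistency.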

We've spotted this sort of thing in the past, one day I must chase this up ... because I'm getting interested in what folk expect out of climate forecasts.

by Bryan Lawrence : 2006/07/05 (permalink)

## Whither Our Web Servers - Part I

BADC currently deploys a bunch of web services (in the loosest definition), most of which are cgi's, and nearly all of which are served up by apache (except some specific ZSI web services which for the moment run inside the firewall on their own ports). We also have some tomcat (for sure) and mod_perl (I think).

Nearly all our development is being done in python in the cgi environment, but now some of our development is being associated with real services, so it's time for us to look into server-side options to make them run fast.

I'm pig ignorant about this sort of thing, CGI has been enough for me thus far, and while I expect to get educated by my team, I wanted to get the vocab in my head, so I did a bit of googling. Not surprisingly there's a lot of stuff out there, so I needed to (and still need to) make some notes which I'll put here over the next couple of weeks so I can get corrected if I get the wrong end of the stick.

Firstly, I found a really good article by Mark Mayo [1] and two useful responses by Pedro Melo and Ian Bicking. What did I learn from these guys? Lots, but the things worth noting for now include:

Mark Mayo on why not CGI?:

... you have to load the whole shebang from scratch ON EVERY REQUEST. Performance will be shit, trust me, and so for anything but development with these "megaframeworks" CGI is completely unfeasable. You need a way of getting all that code loaded into memory and have it stay persistent across requests. That's what you get with an Apache module, and that's what you get with FastCGI... now you have these compelling competing frameworks, written in competing languages, needing persistance on the web server and they're not going to get it with an Apache module.

Huh? You get it with an Apache module and then you don't. Hmm. We might have to come back to that. Meanwhile, Mark Mayo on alternatives:

FastCGI: What's wrong with FastCGI in Apache? ... The UNIX Domain Sockets are unreliable, for unknown reasons. Switch to TCP runners and they sometimes hang. Unexplicably. ... Matters aren't helped by the fact that the FastCGI C code itself is crufty ... Finally, FastCGI in Apache just isn't as flexible as we've come to expect things in Apache to be. ...SCGI came about as a simpler FastCGI replacement in the Python world ... But the FastCGI technology itself is clearly quite a bit better than most people's experience with it under Apache would lead them to believe. It's been rock solid in Zeus for many years. Recently lighttpd has also proven that FastCGI can be quite robust and quick in an open source web server, to the point that it's superior FastCGI implementation propelled lighttpd into the limelight from nowhere as the way to run the new wave of web frameworks.

OK, so maybe we need to look into lighttpd. But Mark also pointed out that Apache may well improve their fastcgi support.

Back to the why not Apache modules? Mark again:

Apache modules are not a viable path forward. I think most experienced sysadmins already know this. Building and maintaining Apache gets exponentially more complex as you add modules, and that's reason enough to avoid it in my books without even considering the memory consumption issue.

What memory consumption issue? Actually, that seems to be best described in a post Mark Mayo made in comments (number 24) to the one about apache and fastcgi in response to someone else recommending mod_whatever(python, perl etc):

... the problem of getting two different versions of the Ruby interpreter into memory as Apache modules at the same time. Smart coding isn't going to magically eliminate the fact that each Apache process (4 per connection on average) carries the interpreter (via a shared object, no less!) and since requests are round-robin'ed across available processes each process will end up tickling all the memory pages required to run your code. Which means each process, even if it's just sending out a little .css file, is going to be taking up dozens of megabytes of memory.

(but at least it will be persistent, yeah?)
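The trade-off running through Mark's quotes (pay the start-up cost on every request, or keep the code resident in memory) can be sketched without any web server at all. This is a toy in modern Python 3: `load_framework` is a hypothetical stand-in for importing a big framework and only counts how often it is called.

```python
_load_count = 0


def load_framework():
    """Stand-in for an expensive framework import; here it just counts calls."""
    global _load_count
    _load_count += 1
    return {"name": "pretend-megaframework"}


class PersistentApp:
    """Persistent model: load once, then serve many requests from memory."""

    def __init__(self):
        self.framework = load_framework()
        self.requests_served = 0

    def handle(self, path):
        self.requests_served += 1
        return "request %d for %s" % (self.requests_served, path)


def cgi_style_handle(path):
    """CGI model: a fresh start per request, so the load cost is paid every time."""
    return PersistentApp().handle(path)


# Three requests the CGI way: three loads, and no state survives between them.
cgi_replies = [cgi_style_handle(p) for p in ("/a", "/b", "/c")]
loads_after_cgi = _load_count

# Three requests against one long-lived app: a single extra load.
app = PersistentApp()
persistent_replies = [app.handle(p) for p in ("/a", "/b", "/c")]
loads_after_persistent = _load_count
```

The memory issue Mark describes is the flip side of the same coin: each long-lived process keeps its own copy of whatever `load_framework` stands for.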

Mark's post goes on some, but this is where I have to stop for now. Look out for my next thrilling installment ... covering the rest of Mark's post, Ian and Pedro's replies, and moving forward ...

1: Yes I did read the whole thing, though I can't claim to have taken it all in.

## The Great British Summer

The great British summer gets a pretty bad press from folks who come from my part of the world, but in my experience it's pretty unjustified. I think British weather gets its crap reputation (deservedly) from the other seasons, when it's not too unusual to go weeks without seeing even a glimpse of the sun. Having said that, summer does occasionally go awol here, but those years are getting fewer and further between ...

Meanwhile, what's to complain about in this met office (via bbc) forecast?:

by Bryan Lawrence : 2006/07/03 : 1 comment (permalink)

## Useful TCDL papers

I've just discovered the bulletin of the IEEE Technical Committee on Digital Libraries. The current issue has a number of interesting papers from my perspective:

• Egger discusses shortcomings of the OAIS (Open Archive Information System) model in terms of what it means to be "conformant". His analysis essentially identifies the mixing of technical and management concepts as being a key problem, along with a strange admixture of abstraction. One of his references is to a java API for a content repository for java technology (jsr170, 2004).

• Golub discusses using controlled vocabularies to automate classification of textual documents (web pages) in the context of browsing. I'm interested in that because we have a wee problem in rationalising the (hopefully eventually automated) metadata that we will get from systems like the Numerical Model Metadata with the more textual material we would get if we asked a scientist to write a document covering some of the same ground, but with a very loose set of constraints (NumSim). This of course is just an example of a more generic problem. At some point or another we will want to rationally browse from human generated material to similar automatically generated material and vice versa ... that means we'll need to classify the former automatically. This is of direct scientific importance because interdisciplinary science depends on being able to make connections (find data and information) from areas where one is less familiar with the ideas, concepts and nomenclature - exactly what browsing by classification is all about. Golub has some references on this point, and I've been on about this for ages.

• There are also a couple of papers on searching that looked interesting too. I haven't yet read them, but I'm bookmarking them here: Ramirez and Kriewel. I probably ought to read Frommholz's paper on annotation too.

(I liked the concept of this issue too: getting a group of doctoral students together for a consortium meeting and creating a special issue out of the discussion and papers presented ... these sorts of things are good for both the readers and the creators. Of course every time I read this sort of thing, I wish that atmospheric science papers were as accessible ... roll on open access!)

by Bryan Lawrence : 2006/06/28 (permalink)

## Good Standards Development - Whither web services

Well, I'm back from the go-essp meeting at Livermore (where the temperatures hovered around 40C) ... and hopefully back to some blogging. At the meeting, Russ Rew drew my attention to Michi Henning's article on the rise and fall of CORBA. I didn't get time to read it then, but Tim Bray also linked to it, and jogged my memory (and provided the other links for this post). While I'm not really interested in CORBA, since I've never had to know anything about it (and it seems dead), the advice on standards seems right on, and so I'm repeating it here:

• Standards consortia need iron-clad rules to ensure that they standardize existing best practice. There is no room for innovation in standards. Throwing in "just that extra little feature" inevitably causes unforeseen technical problems, despite the best intentions.

• No standard should be approved without a reference implementation. This provides a first-line sanity check of what is being standardized.

• No standard should be approved without having been used to implement a few projects of realistic complexity.

• To create quality software, the ability to say "no" is usually far more important than the ability to say "yes."

Tim also links to Bruce Eckel, who compares corba with WS-*:

Of course, the fundamental idea of web services is a good one. We really do want to be able to extract useful information from a remote site. That's happening, and it will continue to happen. But it must be simple, because even if programmers are smart and can figure things out, the CORBA experience proves that we don't want everything to be as complex as it can possibly be.

Then Tim also points out Steve Loughran who summarises where we are now with:

Out there in the intra-organisation land, you either have REST, relatively simple SOAP (no WSRF, WS-Eventing, etc), or something custom ...

and

One issue is whether distributed objects are the right metaphor for distributed computing. CORBA .. et al (including WSRF) ... all implement the idea that you are talking to instances of stateful things at the end, with operations to read/write state, and verbs/actions/method calls to tell the remote thing to do something. These protocols invariably come up against the object lifetime problem, the challenge of cleaning up instances when callers go away... They also have the dealing with change problem ... either end of the system may be suddenly replaced by an upgraded version, possibly with a changed interface, possibly with changed semantics ...

He goes on to say that REST handles this by freezing things with a low number of verbs ... and claims that this is the only thing that has been shown to work.

The thing is that this is an incomplete analysis, because he's essentially saying that the only thing that has been shown to work is a custom solution - because the RESTful stuff doesn't actually give you enough to build anything that relies on anything sophisticated and discoverable at the other end ... without external semantics bound up in some sort of registry. So does this mean building autonomous discoverable grid services is well-nigh impossible?

Well, I'd argue not. Within NDG we're building a custom solution built on a mixture of SOAP, XML-RPC and REST ... and a bunch of specs for "useful" services (from the OGC). The key to the interoperable-ness of our solution will be our reliance on the semantics built into those service descriptions and the quality of our registries. Roll on ebrim!

## Atmospheric Science will return

I'm conscious that most of my blog entries lately have been about computing/metadata type issues ... which I think reflects the huge effort I and my ndg team have been putting into trying to reach our alpha milestone in time for next week's Global Organisation for Earth System Science Portals meeting.

I hope that once that's out of the way I'll be able to get back to spending half my time on "real" science ...

by Bryan Lawrence : 2006/06/15 (permalink)

## ructions in the ogc periphery

There have been some interesting issues around georss lately (via Allan Doyle). The wrap ups are here and here.

What was interesting about this was the perception that OGC doesn't want to hear from the little guy in standards development. I've blogged about this issue before, but then I was rather negative about the problem (i.e. I think it's not a big deal). Although it appears that the georss issue was a sin of carelessness by OGC, there are obviously some ongoing issues, the most significant of which in my mind are that the OGC needs to be careful about

1. licensing (given what OGC wants to contribute, how much better it would be if they did use creative commons), and

2. how individuals can contribute and be involved in OGC activities.

With regard to the latter, I don't necessarily think the issue is about voting, it's about whether or not sensible technical contributions can come from anyone (which requires rather more openness than OGC produces by default - although at least it is better than ISO). At the end of the day, interoperability is about more than just the big companies. I do agree with this from Howard Butler:

Interoperating down in the trenches ... is still a rough, unforgiving, and frequently frustrating experience.

(where I have slightly broadened the applicability of what he actually said).

## Australia drifts even closer to America

Being a kiwi, who has lived a long time in the UK, with lots of close ties with (continental) Europeans, Americans, and Australians, I believe I can say something about the "closeness" of the relative cultures.

Most folk make the assumption that Kiwis and Australians are close to each other, and closer to poms than they are to Americans. Not so. Sportingly Australia and NZ are close (and that's a big part of culture, but it's not all). In my mind, if we plotted points on a plane for each "culture", we'd probably have a line from the UK to the US, with Australia closer to the US end, and NZ closer to the UK end (and with Germany and the rest of Europe off the line, but a similar sort of distance away from NZ to the UK ... much closer to NZ than the US)!

Anyway, more evidence for my hypothesis about the Australians being closer to the US is their behaviour on climate issues. I discovered over the weekend that the Australian CSIRO (www.csiro.au) had binned most if not all of its world famous climate group. Unbelievable. Howard appoints a bunch of coal and oil blokes to the csiro board, and the climate group is canned. Not hard to make that connection. What amazes me is that either I've been even more myopic than usual, or the Australian environmental community has failed to make nearly enough noise about this (or both).

So it's all evidence that Australian federal political culture is moving down that line closer to the American federal political culture (and thus the cultures themselves are converging ...). However, one has to make a strong distinction ... at least the American climate community is strong and fighting back!

by Bryan Lawrence : 2006/06/12 : Categories environment (permalink)

## Back from Greece

I'm getting withdrawal symptoms from blogging - both reading feeds, and writing something of my own - but I guess that's the result of trying to keep too many balls in the air. It's also the result of having a wonderful week on holiday in Greece - Antiparos to be precise. We (wife and thirteen-month-old daughter) had only five full days on the island, but they were great. I can't imagine a much better place to introduce a child to the delights of swallowing salt water and eating sand ...

Anyway, I'm back, and depressed with the mountain of work ahead. The expectation of the locals (and most of the tourists) on Antiparos was that one should spend at least a fortnight on the islands, and they were right ... I don't think the work mountain could have been discernibly bigger from my perspective looking up at it today, and I would have had more time getting slower and slower (on day four I said to Anne that I was just getting the hang of the pace of life ...).

I'll post some pictures when we've got the film processed (unfortunately the shutter on my digital camera jammed shut on the ferry to the island, so we're reliant on old technology).

by Bryan Lawrence : 2006/06/06 (permalink)

## Suffering with unicode

Our discovery portal needs to handle discovery documents which may not include xml with the correct encoding declarations.

To see the sort of problem this introduces, consider the following python code:

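# Python 2: build a unicode string containing non-ASCII characters ...
import ElementTree as ET
a = unicode('André', 'latin-1')
b = '<test>%s</test>' % a
# ... and hand it straight to the parser
c = ET.fromstring(b)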


which produces

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "ElementTree.py", line 960, in XML
    parser.feed(text)
  File "ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)


(amusingly I couldn't use my embed handler to format the python code for this wiki entry because of the same problem)

Now, I can fix this, sort of, using:

c=ET.fromstring(b.encode('utf-8','replace'))


I say sort of, because for arbitrary content (in this example content a), it still breaks ... because it would seem that the resulting string b can still fail to import into ElementTree ...

For example, in one document we have 1.3 <= tau < 3.6 in some encoding ... and even after using an encode('ascii','replace') option we get this error:

 self = <ElementTree.XMLTreeBuilder instance>, self._parser = <pyexpat.xmlparser object>,
self._parser.Parse = <built-in method Parse of pyexpat.xmlparser object>,
ExpatError: not well-formed (invalid token): line 1, column 11389
args = ('not well-formed (invalid token): line 1, column 11389',)
code = 4
lineno = 1
offset = 11389


(and the string above includes column 11389) which at least is no longer a unicode error.

Well, I'm not going to solve this now, because I'm off on holiday for a week, but if anyone has solved it by the time I get back, I'll be very happy!

Update: well I haven't even gone on holiday yet, but Dieter Maurer on the xml-sig mailing list has pointed out that the < sign needs to be escaped in XML ... and I'm pretty certain that it isn't ... so now I can go on holiday knowing that it's a relatively simple thing to fix!

So the bottom line was that I had two sets of problems with my input docs: genuine unicode problems and unescaped content ... and I thought it was just one class of problem which is why I struggled with it ...

## ndg security and ubuntu

The current version of NDG security has a number of python dependencies, including m2crypto and pyxmlsec ... the good news is that it's relatively easy to get the things in place under ubuntu, the bad news is that it's a pain working out what packages you need, hence this list (which is mainly for my benefit). You've probably already got libxml2 installed, but you also need:

• libxml2-dev

• xmlsec1

• libxmlsec1-dev

• m2crypto

all of which can be installed using apt-get. You will also need to get a tar file of pyxmlsec which you can install with

sudo python setup.py build install


We know this will be a pain in some circumstances, and we plan next year to try and significantly streamline our package dependencies ...

by Bryan Lawrence : 2006/05/22 : Categories ndg (permalink)

## Service Binding

One of the things we have grappled with rather unsatisfactorily in the NDG is how to declare in discovery and browse metadata

• that specific services are available to manipulate the described data entities

• and, for a given service, what the binding between the service and data id is to invoke a service instance.

This is pretty important, as has been pointed out numerous times before, but probably most eloquently in an FGDC Geospatial Interoperability Reference Model discussion:

For distributed computing, the service and information viewpoints are crucial and intertwined. For instance, information content isn't useful without services to transmit and use it. Conversely, invoking a service effectively requires that its underlying information be available and its meaning clear. However, the two viewpoints are also separable: one may define how to represent information regardless of what services carry it; or how to invoke a service regardless of how it packages its information.

Thus far in NDG, where in discovery we have been using the NASA GCMD DIF, we have been pretty limited in what we can do, so we extended the DIF schema to support a hack in the related URL ...

Basically what we did is add into the related URL the following:

<Related_URL>
<URL_Content_Type>NDG_A_SERVICE </URL_Content_Type>
<URL>http://dmgdev1.esc.rl.ac.uk/cgi-bin/ndgDataAccess%3FdatasetSource=dmgdev1.esc.rl.ac.uk%26datasetID= </URL>
<Description>The NDG service delivering data via NDG A metadata. </Description>
</Related_URL>

Leaving aside the fact that we've embedded a naughty character (&) in what should be XML, we then create a binding for a user in the GUI between that service and the dataset id ... it's clumsy, ugly, and of no use to anyone else who might obtain our record via OAI.
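To make the clumsiness concrete, here is a rough sketch of the kind of string handling a client has to do to turn that related URL into something clickable. It is a guess for illustration, not the NDG GUI code: the function name `bind_service` and the dataset id `example_dataset_01` are hypothetical, and it assumes the URL is percent-encoded as shown above with the dataset id simply appended to the end.

```python
from urllib.parse import quote, unquote


def bind_service(url_template, dataset_id):
    """Combine a percent-encoded service URL with a dataset id.

    Hypothetical helper: assumes the template ends in "datasetID=".
    """
    # Undo the percent-encoding used to smuggle "?" and "&" into the record.
    base = unquote(url_template.strip())
    # Encode the dataset id so it is safe as a query parameter value.
    return base + quote(dataset_id, safe="")


template = ("http://dmgdev1.esc.rl.ac.uk/cgi-bin/ndgDataAccess"
            "%3FdatasetSource=dmgdev1.esc.rl.ac.uk%26datasetID= ")
print(bind_service(template, "example_dataset_01"))
```

Nothing in the record itself tells a third party that this is the rule to apply, which is exactly the problem.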

Ideally of course the metadata needs to be useful both to a human and to automatic service discovery and binding tools. In the example above, we (NDG) know how to construct the binding between the service and the dataset id to make a human usable (clickable from a GUI) URL, but no one else would. Likewise, there is no possibility of interoperability based on automatic tools. Such tools would be likely to use something like WSDL, or ISO19119, or both, or more ... (neither provides much information about the semantics of the operations provided; for that one needs a data access query model, DAQM, which we've termed "Q" in our metadata taxonomy).

However, if we step back from the standards and ask ourselves what we need, I think it's something like the following:

<OnlineResource>
<Service>
<ServiceName>NDG_DataExtractor </ServiceName>
<ServiceBinding>
<ServiceLocation>http://DeployedServiceURL </ServiceLocation>
<HumanInterfaceURL>http://DeployedServiceURL/dataid </HumanInterfaceURL>
</ServiceBinding>
</Service>
<Service>
<ServiceName> NDG_DataExtractorWS </ServiceName>
<ServiceDescription>
<Description> Provides a web service interface to the
<a href="http://ndg.nerc.ac.uk">csml features provided in the dataset, allowing an application (for example the NDG DataExtractor GUI, but others as well), to subset and extract specific features from the dataset. </a>
</Description>
</ServiceDescription>
<ServiceBinding>
<ServiceLocation>http://DeployedServiceURL </ServiceLocation>
</ServiceBinding>
</Service>
</OnlineResource>

where I've made up the tags to get across some semantic meaning (yes, I know, I should have done it in UML).
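As a sanity check on whether that structure carries enough information, here is a minimal sketch of how a client might pull the bindings out of such a record. It assumes the made-up tags above (trimmed of the description block) and uses Python's standard `xml.etree.ElementTree`; the function name `list_bindings` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the made-up record structure above.
RECORD = """
<OnlineResource>
  <Service>
    <ServiceName>NDG_DataExtractor</ServiceName>
    <ServiceBinding>
      <ServiceLocation>http://DeployedServiceURL</ServiceLocation>
      <HumanInterfaceURL>http://DeployedServiceURL/dataid</HumanInterfaceURL>
    </ServiceBinding>
  </Service>
  <Service>
    <ServiceName>NDG_DataExtractorWS</ServiceName>
    <ServiceBinding>
      <ServiceLocation>http://DeployedServiceURL</ServiceLocation>
    </ServiceBinding>
  </Service>
</OnlineResource>
"""


def list_bindings(record_xml):
    """Return one dict per Service element in the record."""
    root = ET.fromstring(record_xml.strip())
    bindings = []
    for service in root.findall("Service"):
        bindings.append({
            "name": service.findtext("ServiceName", default="").strip(),
            "location": service.findtext(
                "ServiceBinding/ServiceLocation", default="").strip(),
            # None when the service has no human-facing URL.
            "human_url": service.findtext("ServiceBinding/HumanInterfaceURL"),
        })
    return bindings


for binding in list_bindings(RECORD):
    print(binding["name"], binding["location"], binding["human_url"])
```

A service with a HumanInterfaceURL can be offered as a link; one without is only of use to software that already understands the service.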

OK, I think I know what we need, now how does this work in the standards world, what have I forgotten? What do we need to do to make it interoperable, and what are the steps along the way?

Well those are rhetorical questions, I know some of the first things I need to do: starting with a chat to my mate Jeremy Tandy at the Met Office who is wrestling with the same questions for the SIMDAT project, and then I think I'll be off reading the standards documents again. I suspect I'll have to find out more about OWL-S as I'm pretty sure there will be more tools in that area (given that ISO19139 is only just arriving for ISO19115 and there is no matching equivalent that I'm aware of for ISO19119).

## Evaluating Climate Cloud

Jonathan Flowerdew at the University of Oxford has been working with me for the last two and a half years on methods of evaluating clouds in climate models. We've recently submitted a paper on this work to Climate Dynamics. If you want a copy of a preprint, contact him (or contact me to get his contact details).

The use of nudging and feature tracking to evaluate climate model cloud. J.P. Flowerdew, B.N. Lawrence, D.G. Andrews

A feature tracking technique has been used to study the large-scale structure of North Atlantic low-pressure systems. Composite anomaly patterns from ERA-40 reanalyses and the International Satellite Cloud Climatology Project (ISCCP) match theoretical expectations, although the ISCCP low cloud field requires careful interpretation. The same technique has been applied to the HadAM3 version of the UK Met Office Unified Model. The major observed features are qualitatively reproduced, but statistical analysis of mean feature strengths reveals some discrepancies. To study model behaviour under more controlled conditions, a simple nudging scheme has been developed where wind, temperature and humidity fields are relaxed towards ERA-40 reanalyses. This reduces the aliasing of interannual variability in comparisons between model and observations, or between different versions of the model. Nudging also permits a separation of errors in model circulation from those in diagnostic calculations based on that circulation.
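For readers who haven't met nudging before, the generic idea (Newtonian relaxation) can be sketched in a few lines. This is not the scheme from the paper: the function name, the timestep, the relaxation timescale and the temperatures are all made-up illustrative values, and a real model would apply the term alongside its own tendencies at every grid point.

```python
def nudge(model_value, reference_value, dt, tau):
    """One explicit relaxation step of dX/dt = -(X - X_ref) / tau.

    dt and tau are in the same time units; values here are illustrative only.
    """
    return model_value - (dt / tau) * (model_value - reference_value)


temperature = 280.0   # hypothetical model temperature (K)
reference = 275.0     # hypothetical reanalysis temperature (K)

# Four half-hour steps with a six-hour relaxation timescale.
for _ in range(4):
    temperature = nudge(temperature, reference, dt=1800.0, tau=21600.0)

print(temperature)
```

Each step closes a fixed fraction (dt/tau) of the gap to the reference, so the model state is pulled towards the reanalysis without being overwritten by it.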

by Bryan Lawrence : 2006/05/16 : Categories climate (permalink)

## ISO 21127 aka CIDOC CRM - more metadata dejavu

Most of my colleagues in the environmental sciences won't have come across ISO 21127 (to be fair, it may not yet exist, but heck, most of my colleagues in environmental science haven't come across any ISO standard ...). I was introduced to the concepts behind it by my colleague Matthew Stiff, from the NERC Centre for Ecology and Hydrology (CEH), and I've just been nosying through a powerpoint tutorial which introduces the CIDOC Conceptual Reference Model (CRM) (ppt), which would appear to be the heart of it. Maybe the topic: Information and documentation -- A reference ontology for the interchange of cultural heritage information isn't going to engage too many of my colleagues, but maybe it should because the key concept is that:

Semantic interoperability in culture can be achieved by an "extensible ontology of relationships" and explicit event modeling, that provides shared explanation rather than prescription of a common data structure.

That sounds familiar, if we change "events" to "observations", and replace "in culture" with "in environmental science" we'd all be on the same page ... although maybe some of my hard science mates wouldn't like the word relationships ...

Reading on we find that the CIDOC CRM aims to approximate a conceptualisation of real world phenomena ... sounds like the feature type model approximating the universe of discourse to me ... One key difference from the GML world though is the early acceptance of objects (features in my language) having multiple inheritance (which is hard to do in XML schema, hence a problem for GML).

I'm not the first to make the link between the ISO/OGC world and the CIDOC world of course, Martin Doerr who is one of the CIDOC authors made the connection explicit in a comparison of the CIDOC CRM with an early version of what became GML (pdf). Regrettably his conclusions are a bit dated now (five years is a long time in our business). It'd be interesting if someone did a comparison of the OGC Observations and Measurements spec (or the new draft) with the CIDOC CRM ... meanwhile, when I can get my hands on the standard itself, it may make interesting reading to help inform our semantic web developments.

## kubuntu dapper beta broken

I really wanted beagle, and the breezy beagle was broken (or to be precise it seemed to be progressively corrupting my kde directories), so I upgraded to the new dapper beta ...

It all seemed fine when working from home, and the new beagle (with kerry-beagle) is great, but I came into work this morning and stuck the laptop into the docking station, and two significant problems occurred:

• cups only sees my local printer, and I can't make it browse for the network printers (despite browsing being on in my cups config file).

• and nothing I can do seems to change anything. I can't change anything via kcontrol or the web interface to cups.

• These problems don't seem new! It's pretty poor if a major release like kubuntu dapper (even in beta) can't get printing right!

• Update: OK, some of my problems were to do with some incompatibility between my (pre-existing) cupsd.conf file and the new release. Using the new release file means I can now edit the file, but browsing is still broken ...

• Xorg has broken ... I used to be able to slap the laptop down, and get my 1920x1200 big screen working by restarting the x-server ... but now nothing I can do is making this screen anything but a clone of my laptop 1024x768. I've just spent rather more time than I should have trying to fix it ...

• Note to self: the radeon driver has a man page ...

by Bryan Lawrence : 2006/05/02 (permalink)

## More on Data Citation

In talking to folk, it became clear that it helps to draw a distinction between those who interpret data and those who produce data. Often these are the same people, and there is a clear one-to-one link between some data, the data creator(s) and a resulting paper which describes the data in some detail. However, in this situation, the key reason why the paper is published will normally be the interpretation that is presented, not the data description per se (although there are some disciplines in which the paper may simply describe the data, this is not the normal case).

So what CLADDIER is about is the situation where the data creator(s) has (have) only been involved in the creation of the data, not the resulting interpretation. Again, in some disciplines the data creators may end up as authors on relevant papers (and may never even see or comment on the text). However, in most disciplines this either doesn't happen, or is severely frowned upon, and in these situations the data creators are somehow second class academic citizens (despite being an essential part of the academic food chain). What we want to be able to do is have recognition for these folk via citation of the dataset itself, not a journal paper ... that way important datasets will be measurably important.

There are a lot of complications, some of which I've addressed before. Some of the additional things we discussed yesterday included

• How to handle small contributions (such as annotations). My take on this is that small contributions are visible via authorship within the cited entity, but probably ought not be visible outside of it (although in the semantic web world one ought to be able to count the number of these things any individual does). At some point folk have to decide on the definition of small though (probably a discipline dependent decision) ...

• The situation is rather more complicated with meta-analyses. Arguably we're in the same situation as an anthology or book of papers ... in either case we would expect the contributions to be citable as is the aggregated work.

One new concept for me was that the taxonomy world might want to count (in some way) the number of times a specific taxon is mentioned, and use that as a metric of the work done categorising the species. It struck me that this might not be too clever in their world - surely the description of some rare species ought to be pretty important, but it might not get cited as frequently as some pretty routine work on, for example, some agriculturally important species. (In the journal paper world this is the same reason why citation impact alone never used to be the whole story in the UK Research Assessment Exercise.)

## NDG Status Talk

As yesterday's last blog intimated, I'm in the middle of a two-day meeting of the NERC e-science community. Yesterday I gave a talk on the status of the NERC DataGrid project:

#### NERC DataGrid Status

In this presentation, I present some of the motivation for the NERC DataGrid development (the key points being that we want semantic access to distributed data with no centralised user management), link it to the ISO TC211 standards work, and take the listener through a tour of some of the NDG products as they are now. There is a slightly more detailed look at the Climate Sciences Modelling Language, and I conclude with an overview of the NDG roadmap.

by Bryan Lawrence : 2006/04/27 : Categories ndg (permalink)

## Another Missed Anniversary

As of last Tuesday I have been running the BADC for six years! I can't believe it's been that long! My first tour of duty in the UK (1990-1996) was just under six years, and my return to Godzone (1996-2000) was only four years. Yet the last six years have gone by incredibly quickly, and seem to have been much faster (and in some ways less full) than those other two chunks of time. (Mind you, my wife and I were talking this over, and when we listed what we had done in the last six years it became obvious that the key word in that last sentence was seems.)

When I was interviewed for the post they asked me what I thought I would be doing in five years. I said I didn't know, but I was sure I would have moved on to new challenges. I guess I got that mostly right, I have moved on to new challenges, but I've moved on by staying physically in the same place and growing the job (and taking on a daughter) ...

... and I have no plans right now of jumping back down under, despite having done another long tour of duty :-)

by Bryan Lawrence : 2006/04/27 (permalink)

## Baker Report

Treasury via the Office of Science and Innovation [1] is putting a good deal of pressure on the research councils [2] to contribute to UK wealth by more effective knowledge transfer. The key document behind all of this is the Baker Report. It's been hanging around since 1999, having more and more influence in the way the research councils behave, but from my perspective it's finally really beginning to bite, so I decided I'd read the damn thing, so people would stop blindsiding me with quotes.

The first, and most obvious thing to note, is that it really is about commercialisation, and it's driven by the government policy objective of improving the contribution of publicly funded science to wealth creation. But right up front (section 1.9) Baker makes the point that the free dissemination of research outputs can be an effective means of knowledge transfer, with the economic benefits accruing to industry as a whole, rather than to individual players. Thus the Baker report is about knowledge transfer in all its forms.

The second obvious point is that with all the will in the world, the research councils can't push knowledge into a vacuum: along with push via knowledge transfer initiatives, there needs to be an industry and/or a market with a will to pull knowledge out! Where such industry is weak or nonexistent there is the strongest case to make research outputs freely available as a methodology for knowledge transfer.

Some of my colleagues will also be glad to know that the presumption is that the first priority of the research councils should be to deliver their science objectives:

Nothing I advocate in this report is intended to undermine the capacity of (the Research Councils) to deliver their primary outputs.

Baker actually defines what he calls knowledge transfer:

• collaboration with industry to solve problems (often in the context of contract research for industry)

• the free dissemination of information, normally by way of publication

• licencing of technology to industry users

• provision of paid consultancy advice

• the sale of data

• the creation and sale of software

• the formation of spin out companies

• joint ventures with industry

• the interchange of staff between the public and private sector.

Given that Baker recognises the importance of free dissemination of information it's disappointing that he implies that data and software are not candidates for free dissemination. Of course, he was writing in 1999, when the world of open source software was on the horizon, but not really visible to the likes of Baker, so I would argue that the creation of open source software by the public sector research establishment would not only fit squarely within these definitions had he been writing today, but he might have explicitly included it (indeed he probably would have been required to). In terms of free dissemination of data, most folk will know I'm working towards the formal publication of data, so that fits in this definition too.

I was also pleased to see (contrary to what others have said to me), that Baker explicitly (3.17 and 3.18) makes the point that knowledge transfer is a global activity, and the benefit to the UK economy will flow whether or not knowledge is transferred directly into UK entities or via global demand. The key point seems to be that the knowledge is transferred, not where it is transferred to (although he sensibly makes the point that where possible direct UK benefit should be engendered).

Where it starts to go wrong, or at least, the reader can get carried away, is the emphasis in the report on protecting and exploiting Intellectual Property. At one point he puts it like this:

the management of intellectual property is a complex task that can be broken down into three steps; identification of ideas with commercial potential; the protection and defence of these ideas and their exploitation.

There is a clear frame of thinking that protecting and defending leads to exploitation, and this way of thinking can very easily lead one astray. It certainly doesn't fit naturally with all the methods of knowledge transfer that he lists! It can also cause no end of problems for those of us with legislative requirements to provide data at no more than cost to those who request it (i.e. particularly the environmental information regulations, e.g. see the DEFRA guidance - although note that EIR don't allow you to take data from a public body and sell it or distribute it without an appropriate license so the conflict isn't unresolvable).

Baker does realise some of this of course, he makes the point that:

There is little benefit in protecting research outputs where there is no possibility of deriving revenues from the work streams either now or in the future.

I was amused to get to the point where he recognises that modest additional funding would reap considerable reward, but of course that money hasn't transpired (as far as I can see, but I may not be aware of it). As usual with this government, base funding has had to stump up for new policy activities. (This may be no bad thing, but it's more honest to admit it - the government is spending core science money on trying to boost wealth creation. Fine, and indeed we have been doing knowledge transfer, and will continue to do so, from our core budget, but the policy is demanding more).

The final thing to remember is that the Baker report is about the public sector research establishment itself; my reading of it definitely didn't support the top-slicing of funds from the grant budgets that go to universities to support knowledge transfer, but that's what is happening. Again, perhaps no bad thing, but I don't see Baker asking for it (although there is obvious ambiguity, since it covers the research councils, but when a research council issues a grant, the grant-holding body gets to exploit the intellectual property).

So the Baker report was written in 1999, but government policy is being driven by rather more recent things too. Over the next couple of months, I'll be blogging about those as well (if I have time). One key point to make in advance is that knowledge transfer can and now does include the concept of science leading to policy (which is of course a key justification for the NERC activities).

1: that's the new name for the Office of Science and Technology - such a new name that as of today their website hasn't caught up.
2: Actually it applies to the entire public sector research establishment, so it includes all the directly funded science activities of the government departments as well as the research councils.

by Bryan Lawrence : 2006/04/26 : Categories strategy (permalink)

## I'm still naive

I've just read the real climate post on how not to write a press release. I was staggered to read the actual