Bryan Lawrence : Bryan's Blog 2008

# Bryan Lawrence

## Kia Kaha My Boy

Evan Lawrence, Born 5 July 2007, Died 18 December 2008. Forever Young:

by Bryan Lawrence : 2008/12/24 : 0 trackbacks : 1 comment (permalink)

## eXist memory and stability

Within ceda we are producing more and more xml documents, and the obvious tool for most of them is an xml database. At the moment there appears to be only one candidate in the opensource world with something approaching the necessary reliability and scalability: eXist. Colleagues who have used or are using the Berkeley xml database and xindice have been pretty scathing about their experiences, and I'm not aware of other options (although something interesting may be happening in the mysql and postgres worlds). We're not overly impressed with the reliability of eXist either, which is what I want to document here, but that said, we still believe it's the way forward (e.g. in 2007 it was compared with MySQL circa 1997, which implies reasonable prospects).

We initially installed eXist in a tomcat container some years ago, and we've upgraded eXist and tomcat a few times along the way. It's only been this last year that we've started to get upwards of thousands of documents in eXist, and it's only in the last year that we've started to have major stability problems, with many tomcat and eXist restarts necessary, and several restores from backup. We have observed problems with consistency when large document insertions via xquery over xmlrpc have been interrupted, resulting in the necessity to rebuild collections. We have also seen the problems with collection indexes reported in email to the eXist list.

A couple of weeks ago, we decided to reinstall eXist in the jetty container, and see if we could identify what the problems were. Doing so has been a bit hairy. eXist is pretty well documented, but even so, the multitude of ways that eXist can be deployed (standalone database server, embedded database, or in the servlet engine of a Web application) leads to a considerable amount of ambiguity in the documentation, and most importantly, significant difficulties in working out which files are associated with the various memory configuration options (because which files control the memory depends on the deployment option). We're pretty confident that at least some of our problems are memory related, and maybe we're not the only folk in the same boat.

Anyway, herewith what we think we need to do to control eXist when deployed as a service via the tools/wrapper/bin/exist.sh wrapper. Firstly, there are at least two places where we think the memory available to the processes can be configured:

• tools/wrapper/conf/wrapper.conf, where we (now) find the process memory configuration

# Initial Java Heap Size (in MB)
wrapper.java.initmemory=64
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=512


and

• conf.xml, where we find the db-connection configuration

<db-connection cacheSize="256M" collectionCache="124M" database="native"
files="/disks/databases/exist" pageSize="4096">


where we are warned that

• the cacheSize should not be more than half the size of the JVM heap, which is not, in this case, set (directly) by the JVM -Xmx parameter but by the wrapper. Just that one misconception alone took us ages to track down, and;

• wrt collectionCache, ...if our collections are very different in size, it might be possible that the actual amount of memory used exceeds the specified limit. You should thus be careful with this setting. Huh? So how do I be careful? What should I look for?

We think our setting should be ok, because we have seen the eXist developers say as much in other email. But ... we're not happy with not knowing ... is this a failure mode or a performance problem setting?
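For what it's worth, the half-heap guideline can be encoded as a trivial sanity check. This is just a sketch: the value formats ('512' in MB, '256M') are taken from the wrapper.conf and conf.xml excerpts above, and the helper names are my own invention, not part of eXist.

```python
import re

# Sketch of a sanity check for the "cacheSize should not be more than
# half the JVM heap" guideline. Value formats ('512', '256M') follow
# the wrapper.conf and conf.xml excerpts above; the function names are
# hypothetical, not part of eXist.

def parse_mb(value):
    """Turn a string like '512' or '256M' into an integer number of MB."""
    return int(re.match(r"(\d+)", value.strip()).group(1))

def cache_fits_heap(max_heap, cache_size):
    """True if the page cache is no more than half the maximum heap."""
    return parse_mb(cache_size) * 2 <= parse_mb(max_heap)

# With wrapper.java.maxmemory=512 and cacheSize="256M" we are right on
# the limit:
print(cache_fits_heap("512", "256M"))  # True
```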

Then, and worryingly, because some of our problems seem to occur in processing xqueries, we find in (old) email from an eXist developer (in an interesting thread) that

cacheSize just limits the size of the page cache, i.e. the number of data file pages cached in memory. This does not include the base memory needed by eXist (and the libraries it uses) for XML processing and querying.

which is probably why they recommend not having it larger than half. But what happens when those libraries blow out their memory requirements?

We haven't yet resorted to turning off full text indexing, because we want that, but we could try the upgrade mentioned in the eXist developer email. We're currently on 1.2.4-rev:8072-20080802.

At this point we haven't yet set a java expert on our problems, but I guess we might have to (following the sort of thing done here). Meanwhile, this post is by way of notes for later (and maybe a brief for such an expert).

## Nearest Book

A viral book meme is apparently propagating (thanks Sean):

1. Grab the nearest book.

2. Open it to page 56.

3. Find the fifth sentence.

4. Post the text of the sentence in your journal along with these instructions.

5. Don't dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

Well, I'm going to break the rules slightly, only because I kept thinking I had the closest book, and then finding a closer one (under a pile of paper), and the sequence of three pretty much parallels my career, so it's both scary and interesting in that way :-)

Anyway, from the real closest, to the one I initially thought was the closest.

1. "Since interfaces and abstract classes carry with them contracts for behaviour, the interface or abstract class can be used to represent an arbitrary implementation" (no I'm not sure I know what that means either). From Dan Pilone (2003), UML Pocket Reference.

2. "To jump to the next iteration of a loop (skipping the remainder of the loop body), use the continue statement" From David Beazley (2001), Python Essential Reference (2nd Ed).

3. "For a finite change of length from Li to Lf,

 W = \int_{L_i}^{L_f} F \, dL,

where F indicates the instantaneous value of the force at any moment during the process." From Mark Zemansky and Richard Dittman (1981), Heat and Thermodynamics.

Of course, being a physicist, I'm going to argue that any of the three could be the closest, because the distance metric is not described accurately enough; closest in what coordinate system, physical distance or accessibility? (With number of pages of paper on top being part of the functional description of the latter.)

(For once I'm taking time to eat lunch and play, had I read Sean's blog entry any other time I probably wouldn't have caught the virus).

by Bryan Lawrence : 2008/11/14 : 0 trackbacks : 1 comment (permalink)

## CMIP5 - The Federation

Next week there is a meeting which I hope will finalise the data requirements to participate in CMIP5. I can't go, but there are a number of issues on the table which I care about; I'll try and write about each in its own blog post, which you'll be able to find by looking at my (new) CMIP5 category page.

PCMDI is leading the development of a global data archive federation to support CMIP5. It needs to be global: conservative estimates of the volumes of data to be produced for CMIP5 are that there will be PB produced in the many modelling centres involved in producing conforming simulations. Within those PB of data, certain variables, periods, and temporal resolutions of output are going to be defined to create a core archive. We hope that it will be about 500 TB in size.

The global federation being put together will federate

• three copies of that core (one at PCMDI, one here at BADC, and one at the World Data Centre for climate at DKRZ), and

• as much of the data held in the individual data providers as is possible.

Schematically we see something like this

where

• each colour is meant to represent the data from one modelling centre,

• the outer ring represents the federation

• the inner ring represents the core

• there are some modelling groups who have data in the federation and in the core, and

• some groups who are not part of the federation, yet have data in the core.

(the schematic has nine modelling groups, six in the federation, but in reality there will be many more and the distribution between the camps is not yet known, in particular, while the diagram shows an equal distribution between modelling centres and cores, we expect PCMDI to ingest most of the data directly as was done for CMIP3). (Clarification added 11/12/08.)

BADC expects to have three roles in this activity, we will be

1. representing the Met Office Hadley Centre (holding as much data as possible, whether in the core or not),

2. holding simulation data on behalf of the NERC community as a federation partner, again, whether in the core or not, and

3. as a core federation partner, holding a core copy.

I hasten to add, all the core partners will be deploying services to help prospective users by producing simple diagnostics and subsetting the archive in many different ways - so as to avoid crippling download volumes!

## tales from the sleep deprived

Milk chocolate is like low-alcohol beer. Pointless!

Dark chocolate is like fuel. Essential!

by Bryan Lawrence : 2008/11/13 (permalink)

## lack of service

We've had a hell of a day today.

Somehow, for reasons unknown (but following a planned power cycle, and I don't really believe in coincidence), the router which supports the CEDA (BADC+NEODC etc) network decided yesterday to hide CEDA from the world (and the world from CEDA). At the same time it did the same thing for a few dozen offices as well (including mine).

This despite all sorts of redundancy that is supposed to stop this sort of thing happening (it might be that it was faulty redundancy that caused it). Frankly we still don't know what the problem was, and we've not got it back to normal. However, we've got most things back up and working (but I'm sitting typing this in my secretary's office, she's had a working connection all day, but me in the office next door, nothing). What is working serverwise is by dint of an extra cable or two connecting one part of our machine room to another (one entire link to a backbone router refuses to function, despite being physically ok).

The post mortems from this will run and run, but meanwhile I feel almost physically sick. One of our most important attributes is a reliable fast network, and for (hopefully) one day, it's been neither! It'd probably be ok, but it seems every time we have power trouble - whether scheduled or not - we have problems, either immediately or in the weeks that follow, and every time it's another "class" of problem ... (because of course we do fix each failure mode as we discover it).

If anyone from FAAM is reading this. Sorry. We're not picking on you ...

Should we be better at anticipating problems? Yes. Can we do better? Yes. Will we never have problems like this again? No, we're simply not resourced for truly high availability.

## peak everything

My blogging has almost dried up. It will be back, but maybe not yet.

Meanwhile, I was reading Michael Tobis's excellent presentation on ethics, when I found buried in the comments a link to another fascinating and scary presentation (3 MB pdf) at brave new climate.

Do look at it, and especially the very last slide, but if you're lazy, the bottom line is:

• there isn't enough coal and oil for CO2 to exceed 460 ppm,

• there may not be enough phosphorus for western agriculture to survive either ...

Clearly I don't know anything about the reliability of the material, but when I get my head above water (one day ...), I'll go looking ...

(p.s: well done America, maybe this time they really have elected someone who could become a leader of the free world)

## delicious on konqueror

I've just spent three weeks without my laptop. That's a story in itself. (IBM worldwide warranty for Lenovo might well work, but glacially, and their ability to communicate sucks beyond belief ... but I'll tell that story another day, if I can be bothered).

So, while I was hamstrung, I was working on borrowed computers and laptops, and began to make a bit of use of delicious.com via the firefox plugin. Anyway, I'm back on my laptop, so I wanted konqueror to work with delicious. It's much easier than you think, but no prizes to delicious for telling you that you can pretty nearly copy the Safari instructions, i.e., all you need to do is drag the relevant links into the right places:

1. First, open up the "tools/Minitools/edit Minitools" window, and create a new bookmarklet called, say "post to delicious". Then drag the "Bookmarks on Delicious" javascript from the Safari section into the Location box.

• You'll need to enable the extra toolbar if you haven't already, and click on the yellow star and unclick on your new menu item.

2. Then just drag the "My Delicious" link onto your bookmark toolbar.

Job done.

## Virtual Conferencing

For various reasons I'm unable to travel much at the moment, yet the last few weeks has seen two of the most important events of the year for me: one in Seattle and one in Toulouse. I was there, but not there, for both.

For the first, GO-ESSP, the conference venue was the Seattle Public Library, the local hosts had selected yugma professional as sharing technology, and I was mostly at home (1 Mb/s broadband with a following wind, and then the various bits of string in the public Internet). For the second, the conference was in a room at cerfacs, the technology was Acrobat Connect Pro (ACP), and much of the time I was at work (1 Gb/s LAN connected via SuperJanet to GEANT ... you get the picture).

For the first, there was a bigger audience, and I sense, a bigger room. For the second, it seemed smaller and more intimate. Some were at both, maybe someone can tell me the relative sizes of the rooms! I did get video in to the second, but not the first ... the second got video of me, for what it was worth :-).

In truth, the biggest part of the difference came down to reliability of the internet, and reliability of the audio (we chose to use normal teleconferencing as well as the adobe connect for the latter, mainly because of echo cancelling issues which we didn't quite nail down in the pre-event trial).

As far as audio goes, my home long distance carrier was onetel. It isn't any more ... I spent the first day of the Seattle meeting frustrated because I couldn't hear what was going on, on the second, I called in briefly by mobile, and realised that onetel was much of the problem. With BT, I could hear the speaker fine. In general however, I couldn't hear the audience questions, or the replies (I guess the speakers were perambulating around the front away from a fixed mike). In Toulouse (I think) they only had one mike, but it seemed to be much better at getting everyone (albeit with a few requests for repeats from me) ... hence my comment about venue size. The take home message I think is: to support external audio, in a larger room, use a mike strapped onto the speaker, and position folk with microphones to pass to the audience for questions (or get the speaker to repeat them - a bit of that happened in Seattle, but often when the speaker had been let too far out on her/his tether :-). In a smaller room you can get by with one or two mikes.

Both technologies allow virtual desktop sharing, and in the case of ACP, I was even able to control the remote desktop in Toulouse and move the mouse around to highlight stuff. Of course, in practice though, the network was the big discriminator: I don't know much about Seattle public library, but I'm guessing their network wasn't provisioned for a couple of dozen simultaneous VPN downloads of large software packages and datasets. (What would you do if the speaker just advertised some cool thing? Yes, that's right, download it right then and there, and play with it for the rest of the presentation.) So yugma cut out quite a lot, and one or two presentations were basically lost at sea.

What do I think about the two technologies? Well, there's no doubt that at least in terms of the two clients I used, ACP is far more full featured (even though it is hamstrung on linux, unlike yugma which seemed O/S agnostic). I think for smaller meetings, ACP would be significantly better in that it can take over your camera and microphone and do the multiple windows thing naturally, and for bigger meetings it might not make much difference - with one reservation. With yugma it appears you are reliant on their servers ... with ACP, we can (and did) use our own server (in Germany), which implies that if local networks are not the pinch point, we can avoid pinch points at the server by deploying our own. I don't know if one can do that with yugma. (I should say that I don't know what the cost was to deploy either solution!) However, regardless of which is the winner, I was damned glad both were available. I felt like I got a good deal of benefit out of my late nights "in" Seattle, even without the questions, and I felt like I was really there in Toulouse (and in the latter case everyone else felt like I was there too)!

One last point: for those of you in Seattle, who had the delight of hearing my three-year-old daughter interject during a talk: that's what happens when your laptop fan dies, and you have to resort to the desktop in the dining room, and when you fail to correctly implement the mute ... :-) Had I known the audio from my end was so good, well maybe I'd have been asking some questions too!

When I was a lad, I thought nothing of reading a book a day. Now a book takes literally weeks to read, that's a combination of a lack of time to read (when I was a lad, hours of reading per day was fine, now it's minutes), and I suspect I read slower as well now too. Maybe I take more in now (though I'm not convinced, I can still read a book many times over and enjoy it nearly as much each time).

This lack of time for reading is not unrelated to a lack of time for blogging! So the relative quiet on this blog is a combination of both a lack of free time out of work, and too many balls in the air at work. For all of those waiting for metamodels part two, I have a talk which should be grist for one or more posts, but I haven't had time to get there ...

## why global warming is too slow

Another video to add to my collection: This one (by Dan Gilbert) is on why global warming hasn't resulted in (much) action: it turns out it's not PAINful (Personal, Abrupt, Immoral, Now).

(Thanks to Real Climate).

## More Comment Spam

Some sad git has coded around my comment captcha ... I wouldn't have thought it worth his/her time ... but perhaps it's a generic problem. In any case, as a consequence, I've coded around them ... and removed some (but definitely not all) of the recent comment spam. I may have inadvertently removed something which wasn't spam, so if you spot that I've removed something that might actually be of interest to readers, let me know.

I've always thought "comments policies" were a bit redundant. If I don't like what folk post, I'll remove it ... (in practice, it has to be off topic, and wildly so, or using offensive language). I'll continue to do that ...

by Bryan Lawrence : 2008/08/29 (permalink)

## Defining a Metamodel - Part One

I've just introduced the concept of a metamodel as being a key component of the conceptual formalism required to come up with a conceptual model of the world expressed in a conceptual schema.

But that's not enough, or at least, it wasn't enough for me. To move forwards we need to understand the relationship between metamodels, vocabularies and ontologies, and when we've done that we can get to grips with the basic entities that would have to exist in our metamodel.

There is a pretty good stab at defining these things in an article by Woody Pidcock at metamodel.com.

My summary of the key aspects of the key concepts is:

• Controlled Vocabulary: A list of terms that have been enumerated explicitly, which have unambiguous definitions, and which is governed by a process.

• Taxonomy: A collection of controlled vocabulary terms organised into a hierarchical structure, with each term in one-or-more parent-child relationships.

• Ontology: A formal representation of a set of concepts within a UoD and the relationships between those concepts. In principle the list of concepts and relationships itself forms a controlled vocabulary.

• Meta-Model: An explicit model of the constructs and rules needed to build models within a specific UoD. A valid meta-model is an ontology, but not all ontologies are modelled explicitly as meta-models.

(UoD: Universe of Discourse)

In particular, Pidcock points out that a meta-model can be viewed from at least two perspectives: as a set of building blocks and rules to build models (which is the sense I have used it thus far), and as a model of a domain of interest (as might happen if one produced a hierarchy of conceptual models). Clearly, an ontology constructed as a conceptual model describing a particular universe of discourse is much more likely to be the latter than the former, although one might well construct one of the former on the way ...

Which brings me to the next step.

For metafor, we're starting with quite a bit of prior art, including, but not limited to our own NumSim, the Numerical Model Metadata, and most importantly, the Earth System Curator ontology. But we're endeavouring to avoid the mistakes of the past, and going back to some fundamentals I listed a long time ago, and in particular "reusing blocks from widely adopted standards". One obvious set of standards are the ISO19000 series of geographic information standards.

These last include a set of metamodels (19103, 19109 etc ), as well as the more widely known content and syntactical standards such as 19115/19139 and 19136 (GML). My next step is to consider the information infrastructure we're likely to have in place (e.g. controlled vocabularies, ontologies) and the basic types we're likely to need in our metamodel, in the context of serialising both to XML and RDF, each of which may play different roles in producing and consuming content.

## Balancing Harmonisation

Of course all this nattering about formalising information model development is precisely the process that both GEOSS and INSPIRE are going through. The difference for us is that we're a small domain, a lot of what we want to model (in the information sense) is not geospatial, and our user community is both global and not EO in the GEOSS sense.

Still, the INSPIRE implementing rules have a lot of relevance (albeit at 151 pages, like the ISO documents, their relevance is diluted by verbiage. It's hard to get excited if the key message has to take that long to deliver).

In particular, given that we have a number of communities to deal with, a key requirement is to get the level of "harmonisation" right:

(figure 15 from the INSPIRE Methodology for the development of data specifications.)

## Oil Demand

Some interesting numbers:

| (million barrels per day) | 2005 | 2007 |
| --- | --- | --- |
| World Oil Production | 84.63 | 84.6 |
| Chinese Oil Demand | 6.84 | 7.6 |

The original article (hat tip John Fleck) makes the point that a natural inference from these numbers is that while Chinese oil demand has been rising, elsewhere in the world, demand has been falling. Now remember these figures predate the recent price eruptions.

I find an interesting subtext in these numbers. As far as I'm aware, it would be hard to argue that, while the fall in demand has probably come from Europe, Japan and the U.S., it has been driven by consumer or governmental drives to reduce oil consumption because of concerns about global warming. So that implies that if we could actually devise an effective policy to drive down oil consumption in the west, superimposed on what might be a natural downward trend1, then it would probably be possible to allow Chinese demand to continue growing and have a net decrease in consumption globally. And if that's so, it removes one of those planks from the "do nothing" brigade who argue that there's no point because Chinese (and Indian and Brazilian etc) consumption will continue to increase.
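Just to make the arithmetic behind that inference explicit (nothing here beyond the two rows of the table above):

```python
# With world production roughly flat while Chinese demand rose, demand
# elsewhere must have fallen by about the same amount. Figures are the
# ones quoted above, in million barrels per day.
world = {2005: 84.63, 2007: 84.6}
china = {2005: 6.84, 2007: 7.6}

china_rise = china[2007] - china[2005]     # +0.76
world_change = world[2007] - world[2005]   # about -0.03
rest_of_world = world_change - china_rise  # implied change elsewhere
print(round(rest_of_world, 2))  # -0.79
```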

We can make a difference!

(I wonder what the coal figures look like?)

Update: I shouldn't have started this, but now Jeff has a link to the latest EIA figures, and I haven't a clue what the real state of supply and consumption is ... (which echoes the major point of his first post, I guess).

1: yes, I know, two points don't make a trend ... and I suppose there is an argument about exporting carbon consumption to consider ... and further, the original article does point out that if this all remains linear then there would be no oil consumption in the US in 17 years, but this argument is not about decades it's about the here and now! (ret).

## Formalising Information Model Development

Our metafor project is trying to establish a formal methodology for constructing UML, working with that UML in a team, and building multiple implementations using combinations of RDF and XML-Schema.

There are some interesting problems to overcome:

1. Establishing UML conventions that will allow code generation.

2. Dealing with version control of the UML. (And dealing with different versions of XMI).

3. Dealing with code generation itself.

(In this context, code generation is simply the generation of XML-Schema and/or RDF which contain appropriate data models).

I'll try and address these in a series of posts. (I was once told that it's a bad idea when being interviewed live to say something like "there are three points I want to make" because you're bound to forget the third under the pressure of a live interview. I suspect I have form for listing things I will post about, and never getting around to doing so, which must be the blogging equivalent, and thus something to be avoided. We'll see.)

But before we do this, there are some vocabulary issues that need expansion. Metafor is about building a "Common Information Model" (or CIM) for climate (and other numerical) models. Such models describe the earth system using mathematical equations encoded as computer programmes, and are used as part of activities to produce data (often starting with various data inputs). I've described some of this before. Like most folk entering this sort of activity I've typically blundered straight into defining things without thinking about the process clearly. Perhaps I've learnt a few things along the way about why that's not so clever ...

The first step before we construct any conventions is to consider the structure we're actually talking about. Much of this is really well documented in section 7.4 of ISO19101 and in the entirety of ISO19103 (Geographic information - Conceptual schema language) but since those documents are not easily accessible, we'll try and summarise.

(Note that we're using model in two different ways: in the context of "information" and in the context of "simulations"; hopefully the context will help make clear which is which).

Firstly, from wikipedia we have :

... a model is an abstraction of phenomena in the real world; a metamodel is yet another abstraction, highlighting properties of the model itself. A model conforms to its metamodel in the way that a computer program conforms to the grammar of the programming language in which it is written.

So, to construct our CIM, we need to construct a metamodel, which provides some rules about how to construct it. Clearly our metamodel includes the exploitation of UML, but it needs to be more than that, otherwise we'll not be able to build something we can serialise into RDF and XML without a lot of human interaction. It's that "more" that we need to consider in a later post ... but meanwhile, we also need to relate the metamodel, and our own information model with their actual application in schema, and we also need to remember we live in a world where others are doing similar things.

I've tried to summarise all this in a customised schematic version of figure 4 in ISO19101:

## Model Intercomparison, Resolution, Ensembles

We've talked about the trade off between these things, but the reality is whichever way we go we get more data, and more data means more problems. This is what's on my mind:

Fortunately there is another wall (unfortunately it's full too), and scope to replace these with higher capacity units, but we'll need more power into the room too ... (and more rooms) ...

## The Rising Storm

I've been on holiday ... more of that anon. When I catch up on email and administrivia enough to return to things of interest to others, blogging will return too ...

Meanwhile, I know that some of my readers fall into the camp of "can't quite believe this climate stuff" but "don't believe the nutters either".

So for you: two videos and something to provoke some thinking, most of which agrees with my thinking too, but I haven't the eloquence or the strength to follow through to write things like that myself ...

• So, ten minutes from Hansen. Listen and Learn.

• Nearly twenty minutes from Seita Emori. The model he's describing was the highest resolution model in the AR4 archive. That doesn't make it right, and he probably ought to caveat more the results beyond temperature, but it's all very plausible. The fact that it is even plausible should cause concern!

• And finally, Michael Tobis: "My view in a nutshell".

by Bryan Lawrence : 2008/07/18 : Categories environment climate : 0 trackbacks : 1 comment (permalink)

## anatomy of a mip - part2

My recent description of the key components of model intercomparison projects was done both as input to metafor deliberations and as preparation for a visit by Simon Cox. We spent a bit of that visit time discussing the UML describing such projects (which appeared in the previous post). In doing so, we managed a few simplifications and fixes to my UML ...

The key points to notice are

• fixing the association to be in keeping with usual use of UML (in particular, noting that a composition association implies that if the parent instance is deleted, the child instances should also disappear).

• making more clear the association between RunTime and Experiment by adding the explicit conformsTo association.

• moving the ModelCode to be an adjunct to the RunTime so that the RunTime directly produces the Output.

• (update, woops missed the important one): using the view stereotype to indicate the classes which we believe will form launch points for discovery.

## exposing mips in moles

In my previous posting, I should have pointed out that the MIP of interest is the RAPID thermohaline circulation model intercomparison project (THCMIP).

In this post I want to enumerate the membership of some of those classes I introduced last time, and think through their relationships with MOLES and some of the practicalities for implementation. Doing so exposes a few problems with the current MOLES draft.

I start by considering how the MIP entities map into the existing MOLES framework (by the way, I will come back to observational data and do the same thing for an observational data example or two).

There are five model codes involved: CHIME, FAMOUS, FORTE, HADCM3 and GENIE. There are four experiments involved: HOSING, CONTROL, TRH20 and CRH20. There are at least four time-averaging periods involved: daily, five-day, monthly and yearly.

Some groups have done seasonal, some have done different experiments, but we'll ignore those for now.

So in practice, using MOLES vocabulary, we have at least 5x4x4=80 different granules of data to load into our archive, and there are 5 data production tools (models) + 4 experiments (where do we put these?) + 1 activity which need comprehensive descriptions, i.e. 10 new information entities. We might argue there are 20 primary data entities (being the model x experiment x observation station combinations, and remembering that the model runs might have been carried out on different architectures).
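The granule combinatorics can be sketched in a couple of lines (names taken from this post; nothing here is MOLES-specific):

```python
from itertools import product

# Enumerate the data granules for the THCMIP example: one granule per
# (model, experiment, time-averaging) combination, as described above.
models = ["CHIME", "FAMOUS", "FORTE", "HADCM3", "GENIE"]
experiments = ["HOSING", "CONTROL", "TRH20", "CRH20"]
averaging = ["daily", "five-day", "monthly", "yearly"]

granules = list(product(models, experiments, averaging))
print(len(granules))  # 5 x 4 x 4 = 80
```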

Of course we ought to support multiple views on this data, but we ought not have to load any more information to support those views. Views like:

• A data entity which corresponds to data from each experiment (4 of these),

• A data entity which corresponds to data from each model (5 of these),

• A data entity which consists of the granules from all models with specific time averaging for one experiment (there are 16 of these), etc

The data entities themselves should not need any specific properties; these ought to be inherited from the combinations of other entities (models, runs etc). This situation is tailor made for RDF to support views which arise from faceted browsing, but a legitimate question is which views should be offered up for discovery (that is, positions from which one can start browsing)?
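The combinatorics above, and the idea that views are just selections over the same set of granules, can be sketched directly. This is illustrative only: the names below are not MOLES vocabulary, just the MIP space described in this post.

```python
from itertools import product

# Hypothetical enumeration of the MIP space described above.
models = ["CHIME", "FAMOUS", "FORTE", "HADCM3", "GENIE"]
experiments = ["HOSING", "CONTROL", "TRH20", "CRH20"]
averagings = ["daily", "five-day", "monthly", "yearly"]

# Each granule is one (model, experiment, averaging) combination: 5x4x4 = 80.
granules = list(product(models, experiments, averagings))

# Views are just selections over the same granules -- no new metadata needed.
def view_by_experiment(exp):
    """All granules from one experiment (4 such views)."""
    return [g for g in granules if g[1] == exp]

def view_by_model(model):
    """All granules from one model (5 such views)."""
    return [g for g in granules if g[0] == model]

def view_by_experiment_and_averaging(exp, avg):
    """All models for one experiment and one averaging (4x4 = 16 such views)."""
    return [g for g in granules if g[1] == exp and g[2] == avg]
```

Because every view is derivable from the granule properties, none of them needs to be loaded as extra information; only the choice of which views to expose for discovery remains.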

In any case, we start with ingesting 80 model runs, and generating their A-archive metadata (which gives us the temporal and spatial coverage along with the parameters stored). We'll assume that process was perfect, i.e. all parameters were CF compliant and all data was present and correct. (Of course in real life that's never the case: all parameters are never CF compliant, all data is never present, and often it's not correct; that's what the process of ingestion has to deal with.)

Each of those granules has the additional properties of temporal resolution and parent run. We probably ought to allow an optional spatial resolution, in case the output resolution was different from the model resolution (and potentially even different from a required resolution in an experiment description).

(These next three paragraphs updated 17/06/08): The run itself is an entity, which corresponds to a grab bag of attributes inherited from the other entities which we want to propagate to each of the constituent granules. If that were all it was, it could be an abstract entity, which might correspond exactly to the MOLES data entity (which isn't abstract). However, somewhere we need to put the runtime information (configuration, actual initial conditions etc).

Currently we have the concept of a deployment which links a data entity to one each of activity, data production tool and observation station. This has a number of problems: some of the views described above produce data entities associated with multiple data production tools etc. I've been arguing for some time that deployments are only associations and not entities in their own right, in which case a deployment really is an aggregation of associations, but it doesn't need to exist .... even as an abstract entity. The proof of this is in the pudding: the current MOLES browse never shows deployments; it simply shows the links which are the associations. Sam has argued that deployments could have some time bounds attributes, but when we explored why, what he was defining was in fact a sub-activity. We did think the runtime might go here, but it could also go into the provenance of the data entity, or we could have more subtle structures in the experiment.

So let's think about the attributes of a data entity itself. The data entity will have an aggregated list of granule bounds, parameters, resolution etc. It will also have associations with multiple other moles entities. If I hand you that data entity, what metadata do you want to come with it? Before we get carried away, remember that we've started with 20 data entities which we might naturally document, but we have identified many more (e.g. above) which we would rather like to be auto-generated. These wouldn't necessarily have common runtimes.

I've said elsewhere that I want all the moles entities to be serialised as atom entities, so there are some attributes we must have. The question is where they will come from. Clearly we need some rules combining other attributes. These rules probably should also encompass which entities are discoverable.

Food for thought. (You might have thought we had done this thinking: well we had done some, so we know what is wrong with our thinking :-)

## The anatomy of a mip

We're in the process of documenting a specific model intercomparison project (MIP) for the purposes of the Rapid programme. It's the same issue we have for CCMVAL1, and for metafor in general.

The issue is what metadata objects need to exist: Never mind the schema, what's our world view look like? It looks something like this:

Beyond the classes and their labels, one of the key features of this diagram is the colour of the classes, which is meant to depict governance domains of the information (sadly, at the moment, the governance of the metadata itself is all down to us).

• The brown classes (project, experiment, scientific description) essentially come from the project aims.

• The blue classes (the model and its description) are generally the integrated product of development and descriptions over a number of years by a modelling group.

• The yellow classes are the data inputs and outputs. In principle we can auto describe the data itself from the contents of data files, but there is context metadata needed. In the case of the output Time Average Views which are the moles:granule objects which we store in our archive, the context metadata is all the stuff depicted here. In the case of the input Ancillary2 Data, we hope the context metadata already exists.

• The interesting stuff, which is dreadfully hard to gather and which may consume the most time, is the khaki stuff - in comparison to the blue material, which we hope is slowly changing, the khaki material is different for every run.

Now some folk go on and on about repeatability of model runs. Sadly, very few model runs are bitwise repeatable: not only is there not enough metadata kept, but the combination of model code, parameter settings, ancillary data, and computer system (with compiler) is generally not even possible to re-establish.

Personally, I think that's ok (it's just not practicable for sufficiently complex model/computer/compiler combinations if the elapsed time is anything much above months), but what's not ok is for a model run to be unrepeatable in principle: that is someone ought to be able to add a new run which conforms to the experiment and for which it is possible to compare all the elements depicted here. We ought to have kept the data and the metadata. We really ought to, and we ought to understand how to construct a difference.

Hence metafor, but right now, we just want to make sure we describe our Rapid data ...

1: that link doesn't exist at the time of writing, but it will :-) (ret).
2: with apologies to the Met Office, this is a slightly different use of the word Ancillary. (ret).

## Having our granule cake and eating it too

Over on import cartography, Sean is convinced he's not sabotaging the fight for a sustainable climate. He's right.

His post was inspired by a sequence of emails (most of which I haven't read), which culminated in this one.

I must say I'm depressed with how often folk manage to get themselves in one of two states: my hammer is suitable for all problems (including screws), or if you dis my hammer I will never be able to strike another nail.

To be fair, I sometimes find myself in these camps too (I told you it was depressing!)

Anyway, the thread Sean highlighted is particularly frustrating, because I don't think the two camps need to be far apart. There is a place for clean, easy to use mashable technologies, and, complex, powerful, complete descriptions of geospatial objects. And guess what? Sometimes that place can coincide!

Just today I mentioned that we were moving MOLES towards using ATOM as a container technology. So, not yet fully thought out, but just to make that point, here is something approximating UML for a new MOLES granule:

(and here1 is an example XML instance).

The point to note is that the inoffensive little object in the bottom left corner includes a GML feature collection in CSML, using the entire crufty(but useful)ness of GML, and it's pointed to as atom content.

We do intend to have our cake and eat it too! We should be able to both mash up and consume our complex binary objects. I don't see why we can't see complex GML descriptions as cargo of georss flavoured atom, just the same as a photo might be!

For the record, some things to note in this are

1. We need to flesh out a when object, because the atom time stamps are about the atom record, not the time of the enclosed data objects.

2. We need a logo element attribute of the dgEntity because an Atom entry doesn't have a logo (even though a feed does).

3. We have some specific elements to ensure we could produce an ISO record with mandatory components.

4. We have made the decision that the atom author and contributor should refer to the author and contributors of the enclosed CSML content, not the metadata writer (who produces the summary text, who appears as the metadata maintainer).

Of course now we might be able to build services which are easier to exploit than the OWS stack ... (but meanwhile we are building one of them anyway).

1: Updated on the 10/06/2008, the old one is here (ret).

## moles basic concepts

Another interim version of MOLES, the main changes here are to

• make clear that dgEntity and dgBase are abstract (never instantiated),

• for the moment at least, not all moles entities are discoverable

• explicitly include attributes necessary for the mandatory ISO elements.

I expect the next version will move to specialising from Atom entities rather than GML feature types (although we'll explicitly allow GML feature types to be encapsulated in the atom content).

Anyway:

## Simple Thinking

I'm quite surprised this hasn't appeared on more of the blogs I read! (Simple thinking about the choices and risks about taking action on climate change!)

by Bryan Lawrence : 2008/06/03 : Categories environment climate (permalink)

## Identifiers, Persistence and Citation

Identifiers are pretty important things when you care about curation and citation, and they'll be pretty important for metafor too.

At the beginning of the NDG project we came up with an identifier schema, which said that all NDG identifiers would look like this idowner__schema__localID e.g. badc.nerc.ac.uk__DIF__xyz123 where the idowner governed the uniqueness of the localID. If we unpack that a bit, we are asserting that there is an object in the real world which the badc has labeled with local identifier xyz123 and we're describing it with a DIF format object. In practice, the object is always an aggregation, so in ORE terms, there is a resource map which has the above ID.

Sam would argue that we blundered by doing that: he thinks the "real identifier" of the underlying object could be thought of as badc.nerc.ac.uk__xyz123, and we might be better using a restful convention, so that we wrote it as badc.nerc.ac.uk__xyz123.DIF. One of the reasons for his argument is that over time (curation), the format of our description of xyz123 will disappear: for example, we know that we are going to replace DIF with ISO (note that we have lots of other format descriptions of objects as well). This matters, because we now have to consider how we are persisting things, and how we are citing them. I would argue that the semantics are the same even if the syntax has changed, but I concede the semantics are more obvious in his version, so it's not too late to change; in particular, I suspect it's more obvious that badc.nerc.ac.uk__xyz123.atom and badc.nerc.ac.uk__xyz123.DIF are two views (resource maps) of the same object (aggregation).
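Both syntaxes carry the same three pieces of information, which a sketch makes obvious. The helper below is hypothetical (it is not NDG code), but it unpacks either form to the same (idowner, schema, localID) triple:

```python
def parse_ndg_id(ndg_id):
    """Split an NDG-style identifier into (idowner, schema, localID).

    Handles both the original form (badc.nerc.ac.uk__DIF__xyz123) and
    the restful form Sam prefers (badc.nerc.ac.uk__xyz123.DIF).
    """
    parts = ndg_id.split("__")
    if len(parts) == 3:
        # original form: owner__schema__localID
        owner, schema, local = parts
    elif len(parts) == 2 and "." in parts[1]:
        # restful form: owner__localID.schema
        owner = parts[0]
        local, schema = parts[1].rsplit(".", 1)
    else:
        raise ValueError("not an NDG identifier: %r" % ndg_id)
    return owner, schema, local
```

Either way the idowner still governs the uniqueness of the localID; only where the schema (format) label sits changes.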

Either way, if we take the persistence issue firstly: once we bed this identifier scheme down, we're stuck with it, so what do we do when we retire the DIF format view of the resource? Well, I would argue that we should issue an HTTP 301 (Moved Permanently) redirect to the ISO view. This means we can never retire the IRI which includes the DIF, but we can retire the actual document. Note that this is independent of the two versions of the syntax above.
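That 301 behaviour might look like the following minimal WSGI sketch. The paths and the target suffix are made up for illustration; the point is only that the DIF IRI stays resolvable while the DIF document itself is retired:

```python
def retired_view_redirect(environ, start_response):
    """WSGI sketch: permanently redirect a retired .DIF view to the ISO view.

    Illustrative only -- the path layout and the ".ISO19115" suffix are
    assumptions, not the real BADC arrangement.
    """
    path = environ.get("PATH_INFO", "")
    if path.endswith(".DIF"):
        # The IRI including .DIF lives forever, but the document is gone:
        # 301 Moved Permanently sends clients to the current (ISO) view.
        target = path[: -len(".DIF")] + ".ISO19115"
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"current view"]
```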

Secondly, what about citation? Well that's an interesting issue (obviously). You might recall (yeah, unlikely I know), that we came up with a citation syntax that looks like this:

and we made the point we could use a DOI at the end of that instead. That citation identifier looks a lot like Sam's form (funny that, Sam was sneaking his ideas in while I wasn't looking). A legitimate question is what is the point of the URN in that citation? If the data hasn't moved from http://badc.nerc.ac.uk/data/mst/v3/ the URN is redundant, and if it has, how do I actually use the URN? Well, I suspect we need to make the URN dereferenceable, and it should be a splash page to all the current views - which is essentially the same functionality that you get from a DOI. The only question then is what handle system one trusts enough: a DOI, an HDL, or do we trust that even if the badc changes our name, we can keep our address and use redirects? In which case we have

with or without the "Available from", which I reckon is redundant, and noting that http://badc.nerc.ac.uk/somepath is semantically the same as a doi:. Sam's form of URI wins (it doesn't necessarily have to, but it's more economical, in which case the argument for using a handle system for citation economy alone is given some weight).

Incidentally, James Reid pointed out in comments the Universal Numerical Fingerprint, which looks like a pretty interesting concept that might also be relevant. I'll get back to that one day.

## EA and Subversion, Resolved

The good folks at CodeWeavers have resolved my problems with the subversion client under Wine (which I needed to get working for use from within Enterprise Architect). All kudos to Jeremy White!

I'd got to the point of recommending, in a support request at codeweavers, that a workaround might be to replace the call to subversion with a windows bat file that invoked linux subversion, rather than trying to get windows native subversion working properly.

Jeremy was far smarter than that. Yes, we've ended up invoking linux subversion, but via a different route.

The first step we took was to replace the native subversion.exe call with a simple linux script (I had no idea that one could even do that, having assumed that from a windows cmd.exe one had to call windows stuff ... but note that the trick was to make sure the script had no filename extension, and point to it in the EA configuration as if it were an executable). Having done that, we could see what EA was up to, and we found a few wrinkles.

Jeremy then came up with a winelib application (svngate) which handles all the issues with windows paths, and also a bug in the way EA uses subversion config-dir (a bug which doesn't seem to cause problems under windows, even though it ought to). In passing, Jeremy also fixed a wee bugette in the wine cmd.exe which was also necessary to make things work. All the code is on the crossover wiki.

So I'm a happy codeweavers client. I'm less happy with how Sparx dealt (commercially) with their end of this, but that's a story for another day. (Update 06/06/08: I'm probably being unfair, their technical support are now taking this and running with it; their linux product will be svngate aware and is getting linux specific bug fixes.)

Update 02/06/08. There was another wrinkle I discovered after a while ... the old cr/lf unix/windows problem. This can be relatively easily fixed, as Jeremy had seen this coming. I created my own version of subversion (/home/user/bsvn) with

```bash
#!/bin/bash
svn "$*" | flip -m -
```

and set SVNGATE_BIN to /home/user/bsvn! Update 02/06/09. Actually that previous script doesn't quite work in all cases (i.e. where the svn content has blanks and hyphens in filenames), because "$*" collapses all the arguments into a single word, whereas "$@" preserves each argument separately. Better seems to be:

```bash
#!/bin/bash
svn "$@" | flip -m -
```


## Introducing Metafor

The EU has recently seen fit to fund a new project called METAFOR.

The main objective of metafor is:

to develop a "Common Information Model (CIM)" to describe climate data and the models that produce it in a standard way, and to ensure the wide adoption of the CIM. Metafor will address the fragmentation and gaps in availability of metadata (data describing data) as well as the duplication of information collection and problems of identifying, accessing and using climate data that are currently found in existing repositories.

Our main role is in deploying services to connect CIM descriptions of climate and data across European (and hopefully wider) repositories, but right now the team is concentrating on an initial prototype CIM - building on previous work done by many projects (including Curator).

A number of my recent activities have already been aimed at metafor ... in particular the standards review that is currently underway will inform both metafor and MOLES.

For some reason my fellow project participants want to do much of their cogitation on private email lists and fora. While I don't think that's the best way forward, I have to respect the joint position. However, I will blog about METAFOR as much as I can, and I'll obviously be keen to take any feedback to the team.

I keep on harping on about how metadata management is time intensive, and the importance of standards.

Users keep on harping on about wanting access to all our data, and our funders keep wanting to cut our funding because "surely we can automate all that stuff now".

I've written elsewhere about information requirements, this is more a set of generic thoughts.

So here are some rules of thumb (even with a great information infrastructure) about information projects:

• If you need to code up something to handle a datastream, you might be able to handle o(10) instances per year.

• If you have to do something o(hundreds of times), it's possible provided each instance can be done quickly (at most a few hours).

• o(thousands) are feasible with a modicum of automation (and each instance of automation falls into the first category), and

• o(tens of thousands and bigger) are highly unlikely without full automation requiring no human involvement per instance.

What about processes (as opposed to projects):

In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.

So a job that takes a few hours per item can only be done a few hundred times a year. A case in point: in the last year 260 standard names were added to the CF conventions. One person (Alison Pamment) read every definition, checked sources, sent emails for revision of name and definition etc. Alison works half time, so she only had 750'ish hours for this job, so I reckon she had a pretty good throughput; averaging roughly three hours per standard name.
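The arithmetic behind that estimate can be checked quickly, using just the numbers quoted above:

```python
# Rough check of the throughput estimate in the text.
working_days = 220 - 20            # UK standard, minus courses, meetings etc
hours_per_day = 7.5
full_time_hours = working_days * hours_per_day   # the ~1500 hour working year
half_time_hours = full_time_hours / 2            # Alison works half time

names_added = 260                  # standard names added to CF in the year
hours_per_name = half_time_hours / names_added   # roughly three hours each
```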

Now that job is carried out by NCAS/BADC for the global community, and the reasonable expectation is that names that are proposed have been through some sort of design, definition, and internal/community vetting process before even getting to CF.

So, from a BADC point of view, we have to go through every datastream, identify all the parameters, compare with the CF definitions, and propose new names etc as necessary. If all our data had standard names, we'd be able to exploit that effort to produce lovely interfaces where people could find all the data we hold without much effort, if they only cared about the parameter that was measured. But they don't. Unfortunately for us (workload), and fortunately for science, people do care about how things are measured (and/or predicted). So the standard names are the thin end of the wedge. We also have to worry about all the rest of the metadata: the instruments, activities, data production tools, observation stations etc (the MOLES entities).

As a backwards looking project: Last year, we think we might have ingested about 20 million files. Give or take. Not all of which are marked up with CF standard names, and nearly none (statistically) were associated with MOLES entities. Truthfully we don't know how much data we have for which our metadata is inadequate (chickens and eggs). As always, we were under-resourced for getting all that information at the time.

My rule of thumb says our only hope of working it out, and doing our job properly by getting the information, is to identify a few dozen datastreams (at most), and then automate some way of finding out what the appropriate entities and parameters were. If it's manual we're stuffed. Sam has another rule of thumb: if something has to be done (as a backwards project, rather than as a forward process) more than a thousand times, unless it's trivial it won't get done, even with unlimited time, because it's untenable for one human to do such a thing, and we don't have enough staff to share it out.

Fortunately, for a domain expert, some of these mappings are trivial. But some won't be, and even distinguishing between them is an issue ...

Still, I genuinely believe we can get this right going forward, and do it right for some of our data going backwards. Do I believe non-domain experts could do this at all? No I don't. So where does that leave the UKRDS which, at least on the face of it, has grandiose aims for all research data? (As the intellectual inheritor of the UKDA, I'm all in favour of it; as a project for all research data, forget it!)

## Cost Models

Sam and I had a good chat about cost models today - he's in the process of honing the model we use for charging NERC for programme data management support.

Most folk think the major cost in data management is for storage, and yes, for a PB-scale repository that might be true, but even then it might not; it all depends on the amount of diversity in the data holdings. If you're in it for the long term, then the information management costs trump the data storage costs.

Some folk also think that we should spend a lot of time on retention policies, and actively review and discard data when it's no longer relevant. I'm on record that from the point of view of the data storage cost, the data that we hold is a marginal cost on the cost of storing the data we expect. So I asserted to Sam that it's a waste of time to carry out retention assessment, since the cost of doing so (in person time) outweighs the benefits of removing the data from storage. I then rapidly had to caveat that when we do information migration (from, for example, one metadata system to another), there may be a significant cost in doing so, so it is appropriate to assess datasets for retention at that point. (But again, this is not about storage costs, it's about information migration costs.)

Sam called me on that too! He pointed out that not looking at something is the same as throwing it out, it just takes longer. His point was that if the designated community associated with the data is itself changing, then their requirements of the dataset may be changing (perhaps the storage format is obsolete from a consumer point of view even if it is ok from a bit-storage point of view, perhaps the information doesn't include key parameters which define some aspect of the context of the data production etc). In that case, the information value of the data holding is degrading, and at some point the data become worthless.

I nearly argued that the designated communities don't change faster than our information systems, but while it might be true now for us, it's almost certainly not true of colleagues in other data centres with more traditional big-iron-databases as both their persistence and information stores ... and I hope it won't remain true of us (our current obsession with MOLES changes needs to change to an obsession with populating a relatively static information type landscape).

However, the main cost of data management remains in the ingestion phase, gathering and storing the contextual information and (where the data is high volume) putting the first copy into our archive. Sam had one other trenchant point to make about this: the information gathering phase cost is roughly proportional to the size of the programme generating the data: if it includes lots of people, then finding out what they are doing, and what the data management issues are, will be a significant cost, nearly independent of the actual amount of data that needs to be stored: human communication takes time and costs money!

## In a maze of twisty little standards, all alike

I'm in the process of revisiting MOLES and putting it into an atom context, with a touch of GML and ORE on the side. I thought I'd take five minutes to add a proper external specification for people and organisations.

Five minutes! Hmmm !! When will I get back to atmospheric related work? Can I get someone else to go down this hole?

Train of thought follows:

So, there's FOAF.

Do any folk other than semantic web geeks actually use FOAF? (I say geeks advisedly, it's hard to take seriously a specification that explicitly includes a geekcode.) Even from a semantic web perspective wouldn't it be better to use a more fully featured specification and extract RDF from that?

So then there is OASIS CIQ (thanks Simon), which includes

• Extensible Name and Address Language (xNAL)

• Extensible Name Language (xNL) to define a Party's name (person/company)

• Extensible Party Information Language (xPIL) for defining a Party's unique information (tel, e-mail, account, url, identification cards, etc. in addition to name and address)

• Extensible Party Relationships Language (xPRL) to define party relationships namely, person(s) to person(s), person(s) to organisation(s) and organisation(s) to organisation(s) relationships.

Well that feels overhyped, and beyond what is needed. Perhaps xNAL would be OK, but xPIL probably won't gain me much, and I think the xPRL is heading into RDF territory in an over-constrained way.

What about reverting to the ISO19115 specification? Arguably ISO19115 shouldn't be defining party information (in the same way I was complaining about it not using Dublin Core), but it does. What would that give us? CI_ResponsibleParty. Well, am I comfortable with this?

What about Atom! Atom settles down to a very simple person specification: name, email address and an IRI. Well that's pretty good, because I have an IRI which doesn't even need to exist (and if it does could point to any of the above). What it does is unambiguously identify a person, unlike name which can differ according to the phase of the moon (Bryan N Lawrence, Bryan Lawrence, B.N. Lawrence, B. Lawrence etc). But I do want some of the other stuff, role and so on, so I could use CI_ResponsibleParty, in the knowledge that I can extract an atomPersonConstruct from that for serialisation into Atom (I think I'd have to stuff the IRI into the id attribute of CI_ResponsibleParty).

OK, I'm going to stop there, it seems that with an IRI to avoid ambiguity, and CI_ResponsibleParty, I can do what I need to do.
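That extraction is trivial enough to sketch. The helper below is hypothetical (it is not code from any of these specifications), and it assumes a dict-shaped CI_ResponsibleParty; the field names follow the ISO19115 and Atom element names but the shape is an assumption:

```python
def atom_person_from_responsible_party(party):
    """Extract the three atomPersonConstruct fields (name, email, uri)
    from a dict-shaped CI_ResponsibleParty.

    Hypothetical sketch: the dict layout here is assumed, not mandated
    by ISO19115. The IRI rides in the id attribute, as suggested above.
    """
    return {
        "name": party.get("individualName"),
        "email": party.get("contactInfo", {}).get("electronicMailAddress"),
        "uri": party.get("id"),  # the IRI that disambiguates the person
    }

# Illustrative values only:
person = atom_person_from_responsible_party({
    "individualName": "Bryan Lawrence",
    "contactInfo": {"electronicMailAddress": "b.lawrence@example.org"},
    "id": "http://example.org/people/blawrence",
})
```

Note that whatever name variant appears in individualName, the uri stays stable, which is the whole point of carrying the IRI.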

But I could have spent a couple of hours in a far more pleasant way.

(Update 09/06/08: I note that KML 2.2 allows an <xal:AddressDetails> structured address ... but it doesn't use it ... yet.)

## Beginning to get a grip on ORE

This Friday afternoon I was trying to get to the bottom of ORE. ORE is pretty much defined in RDF and lots of accompanying text. I've been trying to find a way of boiling down the essence of it. UML (at least as I use it) doesn't quite do the job, so this is the best I could do:

The next step is to look through the atom implementation documentation.

## Ignorance is xhtml bliss

Wow, I was mucking with some validation for this site (in passing), and I thought "while I'm here, I might as well change this site to deliver application/xhtml+xml rather than text/html". What a blunder.

I'll fix it for you poor IE folk ... soon.

## From ORE to DublinCore

Standards really are like buses, there's another along every minute, exactly which one should you choose? I'm deep in a little "standards review" as part of our MOLES upgrade. I plan to muse on the role of standards another day, this post is really about Dublin Core!

You've seen me investigate atom. You know I've been delving in ISO19115. You know I'm deep into the OGC framework of GML and application schema and all that. You know I think that Observations and Measurements is a good thing.

Today's task was to investigate ORE a little more, and the first thing I did was try and chase down the ORE vocabulary, which surprisingly, isn't in the data model per se; it lives in its own document. Anyway, in doing so, I discovered something that I must have known once, and forgotten: Dublin Core is itself an ISO standard (ISO15836:2003). Of course no one refers to DC via its ISO roots, because they're toll barred (i.e. the ISO version costs money), whereas the public Dublin Core site stands proud.

What amazes me of course is that Dublin Core and ISO19115 use different vocabularies for the same things, even though Dublin Core preceded ISO19115. What was TC211 thinking? Of course ISO19115 covers a lot more, but why wasn't ISO15836 explicitly in the core of ISO19115? The situation is stupid beyond belief: someone even had to convene a working group to address mapping between them. I've extracted the key mapping here.

Mind you, Dublin Core is evolving, unlike ISO15836 which by definition is static. We might come back to that issue. Anyway, the current DublinCore fifteen which describe a Resource look like this:

| term | what it is | type |
|------|------------|------|
| contributor | a contributor's name | A |
| coverage | spatial and/or temporal jurisdiction, range, or topic | B |
| creator | the primary author's name | A |
| date | of an event applicable to the resource | C |
| description | of the resource | D or E |
| format | format, physical medium or dimensions (!) | F |
| identifier | reference to the resource | G |
| language | a language of the resource | B (best is RFC4646) |
| publisher | name of an entity making the resource available | A |
| relation | a related resource | B |
| rights | rights information | D (G) |
| source | a related resource from which the described resource is derived | G |
| subject | describes the resource with keywords | B |
| title | the name of the resource | D |
| type | nature or genre of the resource | H |

We can see the "types" of the Dublin Core elements have some semantics which reduce to

| type | semantics |
|------|-----------|
| A | free text (names) |
| B | free text (best to use an arbitrary controlled vocab) |
| C | free text (dates) |
| D | really free text |
| E | graphical representations |
| F | free text (best to use MIME types) |
| G | free text (best to use a URI) |
| H | free text, B, but best to use dcmi-types |

The last vocabulary consists of Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text. (Note that StillImage differs from Image in that the former includes digital objects as well as "artifacts").
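For concreteness, here is one way a record built from these elements might be serialised, as a minimal sketch using Python's standard library (the element values and the plain `metadata` wrapper are illustrative, not any particular profile):

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def simple_dc(metadata):
    """Serialise a dict of Dublin Core terms as namespaced XML.

    A sketch only: no validation that keys are among the fifteen
    elements, and no repetition of elements.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for term, value in metadata.items():
        el = ET.SubElement(root, "{%s}%s" % (DC_NS, term))
        el.text = value
    return ET.tostring(root, encoding="unicode")

# Illustrative values:
record = simple_dc({
    "title": "MST radar dataset",
    "creator": "BADC",
    "type": "Dataset",   # from the dcmi-types vocabulary
})
```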

by Bryan Lawrence : 2008/05/09 : Categories metadata ndg iso19115 metafor : 0 trackbacks : 1 comment (permalink)

## RDF software stacks.

So we want an RDF triple store with all the trimmings!

We're running postgres as our preferred RDB. We've got some experience with Tomcat as a java service container. We prefer python in a pylons stack and from scripts as our interface layers (ideally we don't want to programme our applications in Java 1).

There appear to be four2 candidate technologies to consider as part (or all) of our stack: Jena, Sesame, RDFLib, and RDFAlchemy. The first two provide java interfaces to persistence stores, and both support postgres in the backend. RDFLib provides a python interface to a persistence store, but might not support postgres. RDFAlchemy provides a python interface to support RDFLib, Sesame via both the http interface and a SPARQL endpoint, and Jena via the same SPARQL endpoint (and underlying Joseki implementation).

Would using postgres as our backend database perform well enough? Our good friend Katie Portwin (and colleague) think so.

There appear to be three different persistence formats insofar as RDFLib, Jena and Sesame lay out their RDF content in different ways. Even within Java there is no consistent API:

Currently, Jena and Sesame are the two most popular implementations for RDF store. Because there is no RDF API specification accepted by the Java community, Programmers use either Jena API or Sesame API to publish, inquire, and reason over RDF triples. Thus the resulted RDF application source code is tightly coupled with either Jena or Sesame API.

Are there any (recent) data which compare the performance of the three persistence formats and their API service stacks? It doesn't look like it, but I think we can conclude that either Jena or Sesame will perform OK, and I suspect RDFLib will too. Which of these provides the most flexibility into the future? Well, there are solutions to the interface problem on the Java side: Weijian Fang's Jena Sesame Model which provides access to a Sesame repository through the Jena API, and the Sesame-Jena Adaptor; and clearly from a python perspective RDFAlchemy is designed to hide all the API and persistence variability from the interface developer. I think if we went down the RDFLib route we'd either be stuck with python all the way down (not normally a problem), or we'd have to use its SPARQL interface.

I have slight reservations about RDFAlchemy in that the relevant google group only has 14 members (including me), and appears to be in a phase of punctuated equilibrium as development revolves around one bloke.

Conclusions: if we went down postgres -> tomcat (sesame) -> RDFAlchemy we'd be able to upgrade our interface layers if RDFAlchemy died by plugging in something based on pysesame and/or some bespoke python sparql implementation (it's been done, so we could use it, or build it; others have built their own pylons thin layers to sesame too). We'd obviously be able to change our backends rather easily too in this situation. (Meanwhile, I intend to play with RDFLib in the interest of learning about manipulating RDFa.)
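As a warm-up for that playing, the (s, p, o) matching interface that all of these stores provide in some form can be sketched in a few lines of pure Python (this is a toy for illustration, not any particular store's API):

```python
class TripleStore:
    """Minimal in-memory triple store sketching the (s, p, o) matching
    interface that Jena, Sesame and RDFLib each provide in some form."""
    def __init__(self):
        self.triples = []

    def add(self, s, p, o):
        self.triples.append((s, p, o))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard, much as in rdflib's graph.triples((s, None, None))
        return [(ts, tp, to) for (ts, tp, to) in self.triples
                if s in (None, ts) and p in (None, tp) and o in (None, to)]

g = TripleStore()
g.add("http://example.org/ds1", "dc:creator", "Bryan Lawrence")
g.add("http://example.org/ds1", "dc:title", "Some dataset")
creators = g.match(p="dc:creator")
```

The point of RDFAlchemy (or a SPARQL endpoint) is precisely to hide which real implementation sits behind an interface like this.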

Link of the day: State of the Semantic Web, March 2008.

1: this isn't about language wars, this is about who we have available to do work (ret).
2: obviously there are others, but these standout given our criteria (ret).

## atom for moles

As we progress with our MOLES updating, the issue of how best to serialise the MOLES content becomes rather crucial, as it impacts storage, presentation, and yes, semantic content: some buckets are better than other buckets!

Atom (rfc4287) is all the rage right now, which means there will be tooling for creating Atom and parsing etc, and Atom is extensible. It's also simple. Just how simple? Well, the meat of it boils down to one big UML diagram or three smaller diagrams which address:

1. The basic entities (feeds and entries),

2. The basic content (note that the xhtml could include RDFa!)

3. and links (note that while atom has its own link class for "special links", xhtml content can also contain "regular" html links).

These three diagrams encapsulate what I think I need to know to move on with bringing MOLES, observations and measurements, and Atom together.
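As an illustration of just how simple the basics are, here's a sketch of building an entry fragment with Python's standard library (atom:id and atom:updated are omitted for brevity, so this is a fragment rather than a complete valid entry; the values are illustrative):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def make_entry(title, link_href, rel="alternate"):
    # Minimal atom:entry with a title and one atom:link.
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    ET.SubElement(entry, "{%s}link" % ATOM,
                  {"href": link_href, "rel": rel})
    return entry

e = make_entry("Some KML", "http://example.org/data.kml")
serialized = ET.tostring(e).decode()
```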

## Big Java

You know, those of us out there in the Ruby/Python/Erlang fringes might think we're building the Next Big Thing, and we might be right too, but make no mistake about it: as of today, Java is the Big Leagues, the Show, where you find the most Big Money and Big Iron and Big Projects. You don't have to love it, but you'd be moronic to ignore it.

... and the programmers ask for Big Money, write Big Code, and we can't afford the Big Money to pay for them, or the Big Time to read the code ... let alone maintain it.

(Which is not to say I have any problems with someone else giving/selling me an application in Java which solves one of my problems - provided they maintain it, I'm not that moronic :-)

## chaiten

There's nothing like a big volcano to remind one of our precarious hold on planet earth. Thanks to James for drawing my attention to Chaiten, and the fabulous pictures, via: Alan Sullivan, nuestroclima and the NASA earth observatory.

Along with genetics1, volcanism was the other thing that could have kept me from physics ... neither quite made it :-).

Anyway, I'm not quite sure what sort of volcano it is, nor of the real import of the explosions thus far, but as a caldera type volcano it could be more impressive yet ... if it's even vaguely similar to Taupo. When I was a kid we grew up with stories of the 7m deep pyroclastic flow in Napier (a little over 100 km away).

1: you can read that phrase how you like :-) (ret).

by Bryan Lawrence : 2008/05/07 : Categories environment : 0 trackbacks : 1 comment (permalink)

## Implementing my RDFa wiki code

I claimed it would be straightforward to add the RDFa syntax to my wiki.

Actually, most of it was. The hardest part was putting the attributes into the (many different sorts of) links that my wiki supports. So I took the opportunity to clean up the link handling code.

This is my RDFa wiki syntax1 unit test code. The attentive reader will note that I use both the format and statelessformat methods of my wiki formatter (the statelessformat method is called by the format method in normal use). This exposes the fact that it turned out to be easiest to do the RDFa handling in two passes. The first pass was in the normal statelessformat (which does links, simple bold, italic, greek etc). In doing that it also now marks up inline RDFa and flags to a second pass the block and page level RDFa. These get handled right at the end of all the other wiki handling (which handles list and table state, preformatting etc) - block level RDFa gets tacked onto the previous paragraph entity, and page level RDFa gets put into a DIV that encloses everything else.

The point about what I have done is to try and develop a syntax that can be tacked onto (most existing) wiki syntaxes without much grief. It seemed to work. So now I have code that can create RDFa. The next step will be to plumb it into Leonardo (shouldn't be hard), and then try and play with some real RDFa creation. That may have to wait a week or two on a) my day job, and b) my life ...

1: In the code, note the slight syntax change from the previous exposition: there are no quotes around any attributes in the wiki text, although they appear in the output. Last time I left the quotes in for the page level stuff. (ret).

by Bryan Lawrence : 2008/04/25 : Categories metadata ndg python metafor : 0 trackbacks : 1 comment (permalink)

## creating RDFa

Let's just assume for a moment that I want to create RDFa in XHTML. Just how should I go about it? It appears that there are no html authoring tools that are "RDFa aware"!

I want to play with RDFa but I'm not going to author it in raw HTML. It's just not worth it. Can I programme my way out of it? If I wanted to modify my wiki software to support (some) RDFa assertions what might I consider as important?

#### Basic Syntax

There are six XHTML attributes:

1. @rel: provides forward predicates (subject is the enclosing entity, object is pointed to by a scoped URI aka CURIE)

2. @rev: provides inverse predicates (subject is pointed to, object is enclosing entity)

3. @name: provides a predicate as a plain literal.

4. @content: a string for supplying a machine readable version of an object rather than the enclosing element content (displayed to humans).

5. @href: a partner resource object (eg remote html link),

6. @src: a partner resource object which is embeddable (eg an image).

(Just what I mean by enclosing entity needs definition too.)

There are five RDFa specific attributes:

1. @about: can provide a URI which gives a subject.

2. @property: can provide URI(s) used for expressing relationships between the subject and some literal text (predicate)

3. @resource: a resource object which is not clickable ...

4. @datatype: a URI defining a datatype associated with the content.

5. @instanceof: associates a type with the enclosing entity.

And we need to declare namespaces.

So, it's all about applying attributes to html entities. For the moment we could reduce the problem to using the RDFa attributes verbatim, but introducing a notation for how to do that in (my) wiki syntax.

#### Methodology?

I might want to be able to make namespace definitions with scope for a document and scope within specific blocks.

• An expedient way forward would be to only allow namespace definitions once, at the beginning of a document. This would result in conformant documents and solve the 80-20 for this issue.

I want to add RDFa assertions to entities? How? We can start with the syntax examples ...

Example one was adding page level metadata. Assuming I want to do that at the blog entry level, and assuming for the moment I can't use atom as the container format, then the container needs to be an html div which already takes us into a version of example three. At this point, we could introduce a page level syntax which looked something like:

@@
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
property="dc:creator" content="Bryan Lawrence"
@@
page content


which should result in

<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
property="dc:creator" content="Bryan Lawrence">
page content
</div>


To repeat example three, taking in account the limitation to namespaces only at the page level, we could enhance the wiki link with attributes:

This document is licensed under a


to clearly instantiate the predicate relationship of the link to the calling page.

Similarly, we could write the fourth example:

I'm holding @@property=cal:summary@ one last summer Barbecue@@


to achieve the

I'm holding <span property="cal:summary"> one last summer Barbecue</span> ...


example. Here we're using @@ to introduce the enclosing entity (span since it appears inside a block level entity) and @@ to close it, with @ used to terminate a specific attribute, allowing us to have multiple attributes as required (@@blah=blah@blah2=blah2@ content @@) etc.

• We could add these as attributes to a block level entity by simply beginning the block level entity with the @@ syntax (i.e. paragraphs starting on new lines, list items, quotes) and terminating the block level attributes (rather than the block) with the @@. See the example a bit further below.

The next example is to have what is displayed different from the machine readable substrate, so we have

... on @@property=cal:dtstart@content=20070916T1600-0500@datatype=xsd:datetime@
September
16th at 4pm@@.


which would give:

... on <span property="cal:dtstart" content="20070916T1600-0500"
datatype="xsd:datetime"> September 16th at 4pm </span>.


(Note that if we needed an @ within a property, we would escape it with a backslash to have \@ ).

The next example is to introduce the instance of attribute so that all the properties of a block level object are associated with it. That's straightforward:

...previous paragraph.

@@instanceof=cal:Vevent@@ I'm holding @@property=cal:summary@ one last summer barbeque@@
on @@property=cal:dtstart@content=20070916T1600-0500@datatype=xsd:datetime@ September
16th at 4pm @@.

Next paragraph ...


Note the first @@ has no content after the attribute and before the closing @@ so it applies to the block level entity.

The final example in that section has the ability to make a remote document the subject of the association. We could do that using the same compact notation:

I think @@about=urn:ISBN:0091808189@instanceof=biblio:book@ White's book 'Canteen Cuisine'@@
is well worth ...


#### Summary

So, in terms of new wiki syntax it would appear that if we don't try and shorten the attributes themselves (which is worth considering, but not in this discussion), we can get away with using @@ in three different ways:

1. When @@ appears on its own, it is intended to start a <div> which will carry on to the end of the current document and provide a page level scope for RDFa attributes.

2. When @@ begins a block level entity (paragraph, list item, description) it is used for block level attributes, unless it encloses text after the last attribute, in which case it is being used for inline attributes ...

3. which is the last type, which creates a span around some content to enclose the attributes.

In addition, we allow @ inside the standard square bracket link syntax.

Now I reckon that lot would be straightforward to implement!
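For the inline case at least, it really is straightforward; here is a sketch (my actual wiki code differs, and this handles only the inline form, with no support for the \@ escape or for block and page level attributes):

```python
import re

# @@attr=val@attr2=val2@ content @@  ->  <span attr="val" attr2="val2"> content </span>
INLINE = re.compile(r"@@((?:[\w:]+=[^@]*@)+)([^@]+)@@")

def expand_inline_rdfa(text):
    """Expand inline @@...@@ runs into <span> elements carrying RDFa attributes."""
    def repl(m):
        attrs = m.group(1).rstrip("@").split("@")
        rendered = " ".join('%s="%s"' % tuple(a.split("=", 1)) for a in attrs)
        return "<span %s>%s</span>" % (rendered, m.group(2))
    return INLINE.sub(repl, text)

html = expand_inline_rdfa(
    "I'm holding @@property=cal:summary@ one last summer barbeque@@ ...")
```

A limitation of this sketch: content containing a literal @ would need the escape handling mentioned earlier.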

Why hasn't anyone else introduced a wiki syntax? (Have they and I'm just not aware?)

## First time grant holder experience

Buried in a Wall Street Journal article (thanks Michael) about the lack of involvement of the US presidential candidates in science issues is an interesting statistic:

At the National Institutes of Health, the average age of a first grant is 42 for a Ph.D. and 44 for an M.D.

It'd be interesting to know about NSF too.

Is it that much better here? I don't know! It would be interesting to see statistics for average ages of grant holders, and first time grant holders for NERC, EPSRC and STFC ...

That said, it may well be illegal to hold that information here, and I'm not convinced NERC know how old I am, let alone anyone else. But it would still be interesting ... although years since doctoral graduation is probably a better metric which might not be illegal to hold. The latter could only be obtained by survey. Now if only I could convince someone to undertake such a survey ...

## other folks data models

Kill the cat!

I had just written a long posting which summarised some of the recent discussion on Cameron Neylon's blog, and in particular the material on data models which appears in his semantic category.

Anyway, I had saved it all to a file ... and just before I posted it, I typed cat filename ... and blew it all away. A good hour of diligent reading and summarising wasted. This was all in the context of drawing conclusions relevant to MOLES. I guess they'll appear in later posts, so you won't miss it all ...

So, no summary: you'll have to read all his posts. But meanwhile, some key links and quotes from them:

1. "No data model survives contact with reality". Pretty sure Roy Lowry said that a long time earlier than he did ...

2. FuGE (Amongst other things, UML model of genomics experiments). Interesting.

3. Minimal Information about a Neuroscience Investigation "standard" at Nature Precedings. Underwhelming.

4. Distinguishing "recording" from "reporting". Totally agree.

## More on the windows subversion client under wine.

I reported my problems earlier.

This post is by way of notes wrt to the https problems.

Firstly, the detail of the error (with thanks to Rafael Coninck Teigão):

Running just the client from the command line gives me this:
/tmp/svn-win32-1.4.6/bin$ export WINEDEBUG='-all,+winsock'
/tmp/svn-win32-1.4.6/bin$ wine svn.exe update ~/development/e-dj/
trace:winsock:DllMain 0x7eda0000 0x8 (nil)
trace:winsock:DllMain 0x7eda0000 0x1 0x1
trace:winsock:WSAStartup verReq=2
trace:winsock:WSAStartup succeeded
trace:winsock:WSAStartup verReq=202
trace:winsock:WSAStartup succeeded
svn: Network socket initialization failed
trace:winsock:DllMain 0x7eda0000 0x0 0x1

The excerpt of code within SVN that's returning this error is at session.c:
/* Can we initialize network? */
if (ne_sock_init() != 0)
  return svn_error_create(SVN_ERR_RA_DAV_SOCK_INIT, NULL,
                          _("Network socket initialization failed"));


and the following notes are from the SVN redbook:

The libsvn_ra_dav library is designed for use by clients that are being run on different machines than the servers with which they are communicating, specifically servers reached using URLs that contain the http: or https: protocol portions. To understand how this module works, we should first mention a couple of other key components ... the Neon HTTP/WebDAV client library.

Subversion uses HTTP and WebDAV (with DeltaV) to communicate with an Apache server. ... When communicating with a repository over HTTP, the RA loader library chooses libsvn_ra_dav as the proper access module. The Subversion client makes calls into the generic RA interface, and libsvn_ra_dav maps those calls (which embody rather large-scale Subversion actions) to a set of HTTP/WebDAV requests. Using the Neon library, libsvn_ra_dav transmits those requests to the Apache server.

Hmmm, Neon Library and WebDav under Wine ... Hmmm.

(Update 17-04-2008: NB, this is a problem with both http and https, and it really is that subversion can't initialise the network library, and I've not found any useful advice on the net. My guess is that I'd need to build my own binary of subversion with neon linked in a different way than the default, or that there's some subtle wine configuration I don't know about. Either way I'm stymied!)

(Update 21-04-2008: I've raised this as a ticket with the Cross Over Office Folks since I need them to run EA anyway.)

Being a collection of links and notes.

You've seen me witter on about online resources in molesOnline Reference. Now we need to do some more thinking. Still mainly in the context of service binding (but where the services might just be view services too!).

Put it all in rdf they said.

#### RDF

What do I know?

Well, I understand about subject -> predicate -> object. I can bandy about the phrase "triplestore" with the best of them.

So if I have text -> typeOfLink -> target, and I'm thinking about this in terms of my metadata environment, then I can readily imagine at least three contexts for such links: embedded within document text, embedded within documents as lists of links, and as associations between documents at some metalevel (e.g. the relatedTo association in MOLES).

If I'm thinking about infrastructure, then I want to store:

• those triples

• the documents

• the associations

and crucially, the context in which they all live. It would appear that context matters (this is all related to the context of reification, that is, that each triple is itself a resource describable in RDF).
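A toy sketch of what "storing the context too" implies: triples become quads, which is roughly how named graphs / contexts work in stores like Sesame (the class and identifiers here are purely illustrative):

```python
class QuadStore:
    """Toy store keeping (subject, predicate, object) triples per context,
    to illustrate why context matters."""
    def __init__(self):
        self.quads = set()

    def add(self, s, p, o, context):
        self.quads.add((s, p, o, context))

    def triples(self, context=None):
        # With context=None we see everything; otherwise only that context.
        return {(s, p, o) for (s, p, o, c) in self.quads
                if context is None or c == context}

store = QuadStore()
store.add("text", "typeOfLink", "target", context="doc1")
store.add("doc1", "relatedTo", "doc2", context="moles")
```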

RDF itself says nothing about what typeOfLink should be. For that we need an ontology for links ... ("oh no we don't" they scream from offstage, "oh yes we do", they scream from onstage. Yes, I fear this is going to descend to pantomime).

#### RDFa

Well, RDFa allows me to embed all sorts of information about a link in the text preceding and following it. In practice I might think about that ability as allowing me to have an object which has some attributes, and, in the context document, to expose some of those attributes. That probably needs an example.

Let's imagine I have an objectType called person defined in a schema called, say, moles. For the purpose of this example, Person has attributes of name, email address, and web page. I might say that the URI of any person instance is the URI of their webpage.

In my metadata document, I might want to have some text in a paragraph which has a link to a person's web page, something like:

<p xmlns:moles="http://ndg.nerc.ac.uk/schemas/molesv2"
about="...">
<!-- about is RDFa syntax and sets a current URI for this paragraph -->
<span property="moles:personName" content="Bryan Lawrence">
Bryan Lawrence</span> at
<a rel="..." href="...">his email</a>.
</p>


which could get hoovered up into raw RDF triples.

My point here is that this particular link can be serialised via RDFa into a bunch of triples, and in the calling document the primary link might not even appear as a "simple link" (the primary link would not be visible in an html rendering of this link which looks like an email link). The semantics of the link are controlled by the semantics we have in the parent namespace.

(Issue to consider: how do we create all these documents with RDFa that conforms to MOLES or whatever other schemas are relevant?)

#### Atom

Let's start with the atom:content element, which can contain either content or links to the content of an atom:entry. This is pretty useful because it explicitly does a little bit of what MOLES version one tried to do. Best summarised by directly quoting the specification:

atomInlineTextContent =
element atom:content {
atomCommonAttributes,
attribute type { "text" | "html" }?,
(text)*
}

atomInlineXHTMLContent =
element atom:content {
atomCommonAttributes,
attribute type { "xhtml" },
xhtmlDiv
}
atomInlineOtherContent =
element atom:content {
atomCommonAttributes,
attribute type { atomMediaType }?,
(text|anyElement)*
}
atomOutOfLineContent =
element atom:content {
atomCommonAttributes,
attribute type { atomMediaType }?,
attribute src { atomUri },
empty
}
atomContent = atomInlineTextContent
| atomInlineXHTMLContent
| atomInlineOtherContent
| atomOutOfLineContent


Even more importantly: the atom:link element defines a reference from an entry or feed to a Web resource. The specification assigns no meaning to the content (if any) of that element.

atomLink =
element atom:link {
atomCommonAttributes,
attribute href { atomUri },
attribute rel { atomNCName | atomUri }?,
attribute type { atomMediaType }?,
attribute hreflang { atomLanguageTag }?,
attribute title { text }?,
attribute length { text }?,
undefinedContent
}


Now we have something that extends xlink, and has a controlled vocab called the Registry of Link Relations (cue "oh no") for the link relation types, which currently has five members:

1. The value "alternate" signifies that the IRI in the value of the href attribute identifies an alternate version of the resource described by the containing element.

2. The value "related" signifies that the IRI in the value of the href attribute identifies a resource related to the resource described by the containing element. For example, the feed for a site that discusses the performance of the search engine at "http://search.example.com" might contain, as a child of atom:feed:

    <link rel="related" href="http://search.example.com/"/>


An identical link might appear as a child of any atom:entry whose content contains a discussion of that same search engine.

3. The value "self" signifies that the IRI in the value of the href attribute identifies a resource equivalent to the containing element.

4. The value "enclosure" signifies that the IRI in the value of the href attribute identifies a related resource that is potentially large in size and might require special handling. For atom:link elements with rel="enclosure", the length attribute SHOULD be provided.

5. The value "via" signifies that the IRI in the value of the href attribute identifies a resource that is the source of the information provided in the containing element.
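A small sketch of checking a link against that registry (the rule that rel may instead be a full URI, and the SHOULD on providing a length for enclosures, are both from RFC4287; the helper function itself is invented):

```python
# The five registered relation values as of RFC4287
REGISTERED_RELS = {"alternate", "related", "self", "enclosure", "via"}

def check_link(rel, href, length=None):
    """Rough validity check for an atom:link rel value."""
    if rel not in REGISTERED_RELS and ":" not in rel:
        # Short names must come from the registry; anything else must be a URI
        raise ValueError("unregistered short rel: %r" % rel)
    if rel == "enclosure" and length is None:
        # RFC4287: the length attribute SHOULD be provided for enclosures
        return "warning: enclosure without length"
    return "ok"
```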

The mechanism for updating these relationships is quite onerous: new assignments are subject to IESG Approval, as outlined in RFC2434. Requests should be made by email to IANA, which will then forward the request to the IESG, requesting approval. The request should use the following template:

• Attribute Value: (A value for the "rel" attribute that conforms to the syntax rule given above).

• Description:

• Expected display characteristics:

• Security considerations:

And now let's use some of the examples floating around in various documents where folk are talking about RESTful services. Sean's one that I used recently was

<entry>
<title>Some KML</title>
<link
rel="alternate"
href="..."
/>
...
</entry>


Now, we could build a compound thing, and get a WADL document. There is a convention syntax that I first saw in a Pat Cappelaere presentation, which is essentially of the form:

If I have a resource /x then I can get the metadata for it at /x.metadata and the atom representation at /x.atom, and maybe, if it's a service type thing, a service description at /x.wadl.

In other words, a convention (I think) which imposes a vocabulary (is it controlled, and if so by whom?) on a URI. So the link type is now in the URI scheme in a RESTful atom world ...
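The convention reduces to a trivial mapping from resource URI to representation URIs; sketched below (suffix vocabulary as above, function name and example URI invented):

```python
def representations(resource_uri):
    # Apply the suffix convention: /x -> /x.metadata, /x.atom, /x.wadl
    return {suffix: resource_uri + "." + suffix
            for suffix in ("metadata", "atom", "wadl")}

reps = representations("http://example.org/x")
```

The interesting question is exactly the one above: who controls that suffix vocabulary?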

Andrew Turner gave some examples in a presentation I found via the geoweb rest group which also does the same sort of thing, now explicitly as an atom link:

<link
rel="alternate"
type="application/vnd.ogc.wms_xml; charset=utf-8"
href="http://mapufacture.com/feeds/1621.wms" />


He gave a template example too in that presentation that I have yet to understand ...

Now, let's jump to a discussion in Google Groups, where I learn (from Sean yet again) that "what a service must do is specified using an app:accepts element".

But that works fine provided what I'm uploading is a media type. What if my service needs to know the feature type? There is some discussion that went down this route, where Sean proposed something like:

text/xml;subtype=om/#,


which presumably is a syntax for pointing to a specific schema.

Jeremy Cothran suggested including a reference to an associated content schema, but didn't (I think) suggest exactly how that could be done.

Right. That's enough for now. More later.

by Bryan Lawrence : 2008/04/10 : Categories ndg metadata metafor : 0 trackbacks : 1 comment (permalink)

## Semis here we come

At last some sporting good news. Semi-Finals here we come.

Mind you, I suspect that's as far as we'll go ...

## Notes on brief service descriptions and bindings

I'm coming late1 to a discussion about representing services in feeds and catalogues. (See Sean and Jeroen, and especially Doug's comment on Sean's post.) I found these while trying to go back to an old post of Sean's (which I have yet to find) about REST and WPS, because I wanted to incorporate some of the ideas into a forthcoming presentation. Anyway, a bit of retrospective googling led me here.

Now as it happens, in my own way, I was asking the same question just a few days ago. The fact that I was asking it in the context of ISO19115 obscured the semantics of the question, as indeed did most of the discussion linked above. (Actually asking it in the context of ISO19115 ought not to have done so, because ISO19115 is a content standard not an implementation standard, but that's a by-the-way.)

Anyway returning to the question, this is in fact one of the major things behind my much longer rant about service descriptions. So much so that I fancy writing down my current train of thought. Sorry. You really don't have to read it ...

The problem is that catalogues (don't care what they are done in, Atom, ISOXXX or whatever), have lists of datasets (entries or whatever if you have a different application). We want users to be able to select datasets and pass them off to a service client (e.g. Google Earth, a WMS based visualisation client or whatever). What we need to do is be able to flag for each dataset that it can support specific services ... which will vary on a dataset by dataset basis.

We have a bunch of choices of how to do this. For each dataset we can load in a bunch of endpoint addresses that specific services can exploit. This is what we currently do (where we potentially have some KMZ endpoints, some WMC endpoints, and a bunch of others). This scales horribly in many ways from a management point of view (add a new service, edit each and every metadata record ... and we have lots!). We do this in the context of a DIF related_URL. If you looked at the XML you get to see stuff like this:

<!-- Web Map Context URL is required by NERC Portals -->
<Related_URL>
<URL_Content_Type>GET DATA > WEB MAP CONTEXT</URL_Content_Type>
<URL>http://ourhost/ukcip_wms?request=GetContext</URL>
</Related_URL>


To add a new service, we need to create one of these for each dataset it can operate on2 , which will depend on lots of things, including the feature types of the data. Ideally of course we know the feature types of all the data, and so we can script the metadata mods ... but if we know that, then why don't we do late binding (and automate the process by having service descriptions and data descriptions, and let the associations themselves be discoverable and renderable by the catalogue dynamically as users find things of interest)?
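Scripting the mods would amount to something like this sketch (the dataset ids and the endpoint pattern are invented; the fragment shape follows the example above):

```python
def related_url(content_type, url):
    # Emit a DIF Related_URL fragment of the shape shown above
    return ("<Related_URL>\n"
            "<URL_Content_Type>%s</URL_Content_Type>\n"
            "<URL>%s</URL>\n"
            "</Related_URL>" % (content_type, url))

datasets = ["ukcip"]  # hypothetical dataset ids
blocks = [related_url("GET DATA > WEB MAP CONTEXT",
                      "http://ourhost/%s_wms?request=GetContext" % d)
          for d in datasets]
```

And of course the "lots!" of records is exactly why we'd rather not do this at all.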

Well of course we're only crawling (in a software sense) towards adequate feature type descriptions, but the nirvana we want is real enough. But to do that we need some service descriptions ...

... which don't exist, so we have to solve the problem outlined in the first paragraph for the foreseeable future, in a wide variety of contexts (including ISO19115 based catalogues).

Now, back to the main point. For many reasons ORE is becoming very interesting to us, and so it was great to see Sean making the connection (that I should have made). I like it so much I'm going to repeat his two examples here:

(Sean stuff follows):

<entry>
<title>Some KML</title>
<link
rel="alternate"
href="..."
/>
...
</entry>


Otherwise, you use rdf:type (ORE specifies that non-atom elements refer to the linked resource, not to the entry):

<entry>
<title>Some Journal Article</title>
<link
rel="alternate"
type="text/xml"
href="..."
/>
<rdf:type>http://purl.org/eprint/type/JournalArticle</rdf:type>
...
</entry>


(Sean stuff concludes)

The point being that where standard resource type names (mime-types) exist, one can use them, and where they don't, one uses an RDF relationship. (Incidentally, I don't really see how this differs in semantic content from Stefan Tilkov's recent proposals for decentralised media types, which got a less than enthusiastic reception - even to the point of my not understanding why he thinks his proposal is semantically different from using RDF.)

In the context of MOLES and CSML we've been thus far using xlink to try and make these typed links (which answers yet another of Sean's old questions: yes, us). We've got so far with it that Andrew Woolf has an OGC best practice recommendation on how to use xlink in this context. I strongly believe that typed links are crucial to our data and metadata systems, but I'm less convinced by xlink ... (yeah, we can build tools to navigate using it, but will our best practice paper disappear into the ether?)

So let's step away from it all and ask questions about the semantics. And when we do that, I expect to come back to both ORE-like concepts and RESTful ones ...

If I start with a dataset with a uri, I can consider a few things:

• I can have a resource available which provides a description of the dataset in a variety of different syntaxes (e.g. the raw XML for GML:CSML and/or MOLES, and html versions of either or both),

• I can have a bunch of bound-up service+dataset uris which show "summary" visualisations of the data (logos, quicklook aka thumbnail images etc),

• I can have a service+dataset uri which exposes a human navigable tool to subset/download/visualise or whatever the dataset (i.e. a service which does something to the dataset, but in the first instance what you get back is html which is NOT a representation of the dataset itself, and

• I can have a naked service endpoint which when I go there, can expose something about the dataset (but maybe exposes multiple other datasets as well).

In both of the last cases, it might be possible to automatically construct a uri to a resource which did represent some property of the dataset given the starting endpoint URI. Ideally that's what you get when you use standardised services ... In this latter case it ought to be possible to construct a RESTful URI in the end, even if the service itself wasn't particularly RESTful (e.g. WMS and WCS).
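For WMS, for example, one can always assemble a representation URI from the endpoint plus the standard KVP parameters; a sketch (endpoint, layer and bbox values here are illustrative):

```python
from urllib.parse import urlencode

def wms_getmap_uri(endpoint, layer, bbox, width=600, height=400):
    # Standard WMS 1.1.1 GetMap parameters; srs/format fixed for the sketch
    params = {"service": "WMS", "version": "1.1.1", "request": "GetMap",
              "layers": layer, "bbox": ",".join(str(v) for v in bbox),
              "width": width, "height": height,
              "srs": "EPSG:4326", "format": "image/png"}
    return endpoint + "?" + urlencode(params)

uri = wms_getmap_uri("http://ourhost/ukcip_wms", "tmean", (-180, -90, 180, 90))
```

Which is the sense in which a non-RESTful service can still yield a constructible URI for a representation of (part of) the dataset.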

Now all of the above are really aggregations which brings me back to ORE.

I had better stop now. Conclusions can wait.

1: I lost contact from the world for most of March. One day I'll explain why here. Some will know. (ret).
2: This example also shows the situation we're in of having to have a different WMS endpoint for each dataset ... the way WMS works with getCapabilities doesn't scale to us having one WMS endpoint for all our data. (ret).

## more MOLES version two thinking

We've made some more progress with our thinking since last week ... but there is more to come.

Meanwhile:

Note the new relatedTo association class. That's in there to allow us to have arbitrary RDF associations between MOLES entities explicitly modelled. No idea if this is the right way to do it ...

Note also that the services are placed more sensibly, and the concept supports both services intended for human consumption (that are "of the web") and which are for data consumption ...

## Subversion under WINE

Today's task. Trying to get Enterprise Architect working with subversion.

Enterprise Architect runs under CrossOver Office on linux (Wine) and getting it to work with subversion is not as straightforward as one might like (but the folks at sparxsystems are very helpful!)

The first step is to verify that we can use subversion within crossover office.

I installed the standard windows 32 bit binary into my EA bottle.

The two repositories I am interested in working with are available via https:// and svn+ssh://, and neither worked straightforwardly (I think the sparxsystems folk have only worked with regular svn:// repositories thus far).

To test access, I created simple batch files of the form:

"C:\Program Files\Subversion\bin\svn.exe" checkout "https://url-stuff" > svnlog.log &2> svnlog-err.log


and

"C:\Program Files\Subversion\bin\svn.exe" checkout "svn+ssh://url-stuff" > svnlog.log &2> svnlog-err.log


which I put in my /home/user/.cxoffice/ea/drive_c/ directory with names something like batfile.bat. (In both cases, the repositories already exist and work fine with linux subversion).

We can then run those windows bat files with a unix script like this one:

/home/user/cxoffice/bin/wine --bottle "ea" --untrusted --workdir "/home/user/.cxoffice/ea/drive_c" -- "/home/user/.cxoffice/ea/drive_c/batfile.bat"


Neither worked. The two error messages I got were:

• (https:) svn: Network socket initialization failed

I haven't made much progress with the former. Googling yielded this query to the wine list. The author got no response and currently has no solution (I checked with him personally).

However, I made real progress with the latter. The best advice I found was this, but there were enough things to think about for the wine environment that it's worth documenting here.

1. I spent some fruitless time trying to install TortoiseSVN into my EA bottle under crossover. (It seems to work, but only under a win2000 bottle, and EA needs a win98 bottle). I then tried PuTTY ... which installed fine. You'll need the lot (make sure you download the installer, not the simple .exe).

2. You will need a public key and private key set up so that the batch files run by EA (and you in testing) can work, so we get that done first.

3. On my laptop, I used

ssh-keygen -t rsa


to get a public and private key pair (thanks). For the moment I'm using no passphrase (the laptop disk itself is encrypted).

4. Now I had a pair of keys sitting in my laptop ~/.ssh/ directory.

5. The repository I want to get to lives on server, so the public key needs to be there. I logged into server, changed directory into my .ssh directory there, and

1. created an empty file authorized_keys (just "touch authorized_keys")

2. opened that file up, and copied in the public key from my laptop directory.

3. Then I tested that key pair from the command line (under linux, ssh you@server). Fine.

6. Then putty has to be encouraged to use that private key.

1. Putty doesn't understand the openssh private key file we've just generated, so we need to convert it.

1. A copy visible to windows is needed (i.e. copied away from the hidden .ssh directory).

2. Run puttygen from the crossover office run windows command (browse to the executable in the Putty Program Files directory).

3. Now click on File> Load Private Key and load the file, then save the private key out as .ppk file. Delete the copy of the private key, and move the .ppk file somewhere special on your drive_c in your windows partition.

7. Now you need to make sure that ssh is configured to use your putty. So, that means putting

ssh = "C:\Program Files\PuTTY\plink.exe" -i "C:\\yourdir\\yourfile.ppk"


in your subversion config file (which lives in

.cxoffice/yourEABottle/drive_c/Windows/Application Data/Subversion
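For clarity, the ssh override goes in the [tunnels] section of that config file, so the stanza ends up looking something like this (the .ppk path is yours, of course):

```ini
[tunnels]
ssh = "C:\Program Files\PuTTY\plink.exe" -i "C:\yourdir\yourfile.ppk"
```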


And now at least I can checkout a working copy ...

I'll try and integrate that with EA another day, and the https problem will have to wait too. That lot took more than twice the time I allocated for it ...

## early MOLES version two thinking

Today's task was going through MOLES V1 and trying to capture the key semantics in UML.

MOLES V1 was implemented (and defined) in XML schema, and had some regrettable flaws which we want to sort out in V2 ... and like I said yesterday, we want to take a step towards O&M support (but realistically we won't get there in this one step).

So this is where we are: the key overarching diagram looks like this:

I don't have time to explain much of it today, but suffice to say it has all the good bits from V1 but cleaner lines (yes really) of specialisation and association. It should be much easier to implement and support.

It turns out with a cleaner structure a whole lot of baggage disappeared, and rather less semantic content was left to differentiate between those key entities: productionTool, observationStation and activity than I thought ... (you can't see that in this diagram, you'll have to wait for a fuller discussion, but suffice to say it mostly comes down to codeLists, albeit that we might support a movingPlatform specialisation of an observationStation).

You will also see that there are the beginnings of some support for late binding of services. That might be a bit clearer with this:

Yesterday I was also wittering on about online references. My thinking isn't complete on this yet, but this is what I have in dgOnlineReference:

Note the use of applicationProfile to hold the semantics of an information resource, which ought to be a featureType. Note also I'd rather like to get the mimeType in here too ... rather like an xlink actually. Some more thinking to go on here (indeed the whole thing is still in a state of flux ...)

## The Scope of ISO19115

We're taking the first steps towards refactoring our Metadata Objects for Linking Environmental Sciences (MOLES) schema to be more easily understood and implementable and to support (if not conform with) the new Observations and Measurements OGC specification. In doing so it became obvious to me that I need to think about the relationship between MOLES entities and ISO discovery metadata.

ISO19115 specifies that

...a dataset (DS_DataSet) must have one or more related Metadata entity sets (MD_Metadata). Metadata may optionally relate to a Feature, Feature Attribute, Feature Type, Feature Property Type (a Metaclass instantiated by Feature association role, Feature attribute type, and Feature operation), and aggregations of datasets (DS_Aggregate). Dataset aggregations may be specified (subclassed) as a general association (DS_OtherAggregate), a dataset series (DS_Series), or a special activity (DS_Initiative). MD_Metadata also applies to other classes of information and services not shown in this diagram (see MD_ScopeCode, B.5.25).

Let's have a look at the MD_ScopeCode, which is the value of the MD_Metadata attribute hierarchyLevel:

MD_ScopeCode (definition: class of information to which the referencing entity applies):

• attribute (001): information applies to the attribute class

• attributeType (002): information applies to the characteristic of a feature

• collectionHardware (003): information applies to the collection hardware class

• collectionSession (004): information applies to the collection session

• dataset (005): information applies to the dataset

• series (006): information applies to the series

• nonGeographicDataset (007): information applies to non-geographic data

• dimensionGroup (008): information applies to a dimension group

• feature (009): information applies to a feature

• featureType (010): information applies to a feature type

• propertyType (011): information applies to a property type

• fieldSession (012): information applies to a field session

• software (013): information applies to a computer program or routine

• service (014): information applies to a ... service ...

• model (015): information applies to a copy or imitation of an existing or hypothetical object

• tile (016): information applies to a tile, a spatial subset of geographic data

I'm not convinced I understand all of those, particularly the model type, but also the collectionHardware, collectionSession and dimensionGroup types. Anyone who can shed some light on those would be welcome to comment below or email me ...

Obviously metadata should also apply to other entities. In particular, within the NDG we consider that the observation station (we had difficulty finding an appropriate noun inclusive of simulation hardware, but covering a ship, physical location, or a field trip etc), the data production tool (DPT, aka instrument, but inclusive of simulation software also known as models), and activities are also first class citizens of metadata. Perhaps collectionHardware and collectionSession might be relevant for the DPT and some activities. I don't know.

We also consider the deployment to be an important entity: a deployment links one or more data entities to a (DPT, activity, observation station) triplet, and may itself have properties. It's worth noting that in the observations and measurements framework, the concept of Observation binds a value of a property determined by a procedure to a feature. In the NDG world, data features live within data entities, and some part of what O&M calls a "procedure" is an attribute of a data production tool, but most of a "procedure" is in my mind synonymous with a deployment1. Values and properties live within data entities too (a data entity is described by an application schema of GML which can include from the O&M namespace).
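To make those relationships concrete, here is a minimal Python sketch of a deployment as described above. The class and attribute names are illustrative only, not the actual MOLES schema:

```python
from dataclasses import dataclass, field

@dataclass
class Deployment:
    """Links data entities to a (tool, activity, station) triplet.

    Roughly analogous to an O&M Observation: a procedure (tool plus
    deployment settings) applied at a station, for an activity,
    yielding data. All names here are hypothetical.
    """
    data_entities: list          # one or more data entity identifiers
    production_tool: str         # aka instrument, or model code base
    activity: str                # project, campaign, ...
    observation_station: str     # ship, site, simulation hardware, ...
    properties: dict = field(default_factory=dict)  # deployment's own metadata

d = Deployment(["badc:dataset1"], "sometool", "somecampaign", "somestation",
               properties={"initial_conditions": "run42"})
```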

Leaving aside a resolution of a formal data model for MOLES, the first class entities will need to support metadata, and so there needs to be a scope code for the appropriate first class entities.

John Hockaday on the metadata list, initially suggested extending the scope codes to cover:

• profile: there are many community profiles being developed

• document: a general "grab bag" type for documents.

• repository: ... suitable for something like a RDBMS

• codeList: there are many codeLists in ISO 19115, ISO 19119 and ISO 19139. These codeLists are extensible.

• modelRun or modelSession: to distinguish from model (but see below)

• applicationSchema: information about GML application schema themselves

• portrayalCatalogue: for finding OGC Symbology Encoding or Styled Layer Descriptors for OGC Web Services.

Eventually he implemented some of those in a new codelist

• modelSession: information applies to a model session or model run for a particular model

• document: information applies to a document such as a publication, report, record etc.

• profile: information applies to a profile of an ISO TC 211 standard or specification

• dataRepository: information applies to a data repository such as a Catalogue Service, Relational Database, WebRegistry

• codeList: information applies to a code list according to the CT_CodelistCatalogue format

• project: information applies to a project or programme

Actually, even with this definition of modelSession to augment model (which he thought might be used for things like metadata about UML descriptions), I still have problems. Within NDG and NumSim, we have the concept of model code bases and experiments, and I think these need to be kept separate but linked.

Personally I don't like the dataRepository one ... but I can live with it.

Project is ok, but we would prefer activity, because we decided that activities should be able to include other activities, and the parents may well be projects ... but not always ... (e.g. campaigns within a formal project may themselves have sub-campaigns etc).

At this point I might consider a slightly different extension set (which is of course the point of having extensible codelists). Given I'm not sure about these collection thingies, and given a tilt towards O&M, I might want to have

• document: as above

• profile: as above

• codeList: as above

• dataRepository: as above

• activity: information applying to a project, programme or other activity

• productionTool: information about an instrument or algorithm

• observationStation: information about the characteristics, location and/or platform which carried out, or is capable of, an observation or simulation.

• deployment: information linking a data entity, activity, productionTool and platform in a procedure

Now I can have an algorithm (computer model) described in a productionTool metadata document and the particular data entity it produces is a data entity (of course), with the particular switches, initial conditions etc, described in a deployment (although I suspect there should and will be ambiguity as to whether the attributes of a productionTool could inherit most if not all of the characteristics of a deployment).

A deployment most closely corresponds to an O&M observation in that we deploy a tool at (or on) a particular station for an activity to make a measurement, and I'd love a better (compound) noun than ObservationStation ...

1: Actually the overlap between an observation and a procedure is significant, something that is pointed out in the O&M spec itself (ret).

## Online Resources in ISO19115 and MOLES

ISO19115 is a brave attempt to classify metadata. I think some aspects of it are fatally flawed in an RDF world, but some are not, and we're likely to support it for some time to come. But, let's face it, ISO19115 isn't all inclusive, so how do we reference external information? This posting is all questions. Hopefully tomorrow I'll provide answers!

The NASA DIF does this via the related_url keyword, which has three attributes: a textual description, the url itself, and a content_type from a controlled vocabulary. The related_url simply exists as a first order child of a DIF document.

In ISO19115 we need to use the CI_OnlineResource (Figure A.16). We find it has attributes of linkage (url), protocol, applicationProfile, name, description (all character strings), and function, which is drawn from an enumeration CI_OnLineFunctionCode. The latter has download, information, offlineAccess, order and search as values.
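A minimal sketch of that structure in Python may help; the attribute names follow the spec, but the class itself and the validation behaviour are my own illustration, not any real binding:

```python
from dataclasses import dataclass
from typing import Optional

# The CI_OnLineFunctionCode enumeration values from ISO19115.
CI_ONLINE_FUNCTION_CODE = {"download", "information", "offlineAccess",
                           "order", "search"}

@dataclass
class CIOnlineResource:
    """Illustrative sketch of ISO19115's CI_OnlineResource."""
    linkage: str                           # the URL (mandatory)
    protocol: Optional[str] = None         # e.g. "http", "ftp"
    applicationProfile: Optional[str] = None
    name: Optional[str] = None
    description: Optional[str] = None
    function: Optional[str] = None         # drawn from CI_OnLineFunctionCode

    def __post_init__(self):
        # Reject values outside the (unextended) codeList.
        if self.function is not None and self.function not in CI_ONLINE_FUNCTION_CODE:
            raise ValueError("not in CI_OnLineFunctionCode: " + self.function)

r = CIOnlineResource(linkage="http://example.org/data/", function="download")
```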

Clearly this codeList can and should be extended. However questions remain:

1. Is this the right way to do it, and

2. How should we use CI_OnlineResource anyway?

#### 1. Extending the CodeLists

Option one is to extend the CI_OnlineFunctionCode codeList. Ted Haberman on the metadata list1 suggested extending it with the GCMD controlled vocabulary2 which is of course what I initially thought.

John Hockaday replied with a suggestion about using the protocol attribute, and inserting a codeList there (and suggested that GCMD should have their own codeLists available - but given their aversion to public version control, we wouldn't rely on that). Also, I'm not so sure that simply using the GCMD codelist in either place is right, because the GCMD valids for the related_url mix the semantics of services and urls, even if they do have a sensible list, and the protocol is probably more about things like ftp and http than what we layer over either. I'm also concerned that we include the semantics to support an external online bibliographic citation.

#### 2. How do we actually use CI_OnlineResource anyway?

There are multiple places:

1. The online resource which is the best place to download the data itself. That's via MD_Metadata>MD_Distribution>MD_DigitalTransferOptions which has an attribute of onLine which can have 0..* multiplicity. Arguably we can add services there too 3

2. For bibliographic citations from the data to documentation in scientific papers, we could use something in LI_Lineage.

3. For citations from papers to the datasets it's not so obvious.

The latter is what brought me to this. In MOLES we want to have links between the various entities and between MOLES documents and other documents (particularly NumSim and SensorML documents, i.e. documents which describe productionTools). (In truth, the relationships we want to encode would be far more sensibly serialised in RDF than XML Schema.) We also want to have entities we call data granules which are part of data entities and we expect those data granules to each individually have different services available 4.

#### 3. Services

Our problem is that we don't want the data granules independently discoverable for fear of overwhelming search engines (including our own), but we might want to expose in an overarching metadata document for the data entity that services of a particular class exist. The complication is that we might want this in two different syntaxes: a browser should be able to produce something consumable by a human (VIEW EXTENDED METADATA) and a service client might want simply to obtain a list of granules with names, service types, and service addresses. Of course we could handle this by having the latter accessible via cunning use of microformats or tags, but there are both performance and maintenance reasons why that won't wash.
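The two-syntax requirement is essentially content negotiation. A toy sketch of the idea (the function, tuple layout, and media types are assumptions for illustration, not our actual implementation):

```python
import json

def granule_listing(granules, accept="text/html"):
    """Render the same granule/service information two ways.

    `granules` is a list of (name, service_type, address) tuples; which
    serialisation you get depends on the (hypothetical) Accept header.
    """
    if accept == "application/json":
        # machine-consumable: a bare list of granules and their services
        return json.dumps([{"name": n, "service": s, "address": a}
                           for n, s, a in granules])
    # human-consumable fallback
    rows = "".join("<li>%s: <a href='%s'>%s</a></li>" % (n, a, s)
                   for n, s, a in granules)
    return "<ul>%s</ul>" % rows

html = granule_listing([("g1", "WMS", "http://example.org/wms/g1")])
```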

1: for some reason not all messages, including ones relevant to this posting, appear in the metadata list archive (ret).
2: which is currently lurking at here - but note that GCMD and version control don't get on, we prefer our version controlled lists, eg. this one (ret).
3: Although obviously we believe in late binding and separating the service and data metadata! (ret).
4: Yes, I know, late binding again, but it's not here now! (ret).

## More quotations - Static Maps, Public Trusts, and Bad Processes

I'm so short of time at the moment that I'm limited to quoting, but the three snippets I've got today resonate as much as the last ones ...

Firstly, Sean is out there leading us to a brave new world beyond OGC protocols, and he was the first one to bring google static maps to my attention. Interesting.

Secondly, I used to never get around to reading All Points because it seemed to be high traffic, and I was low bandwidth. Well that's all true, but I have been making an effort lately to at least page through my feeds regularly and the signal seems to be getting pretty interesting. Joe Francica posted two things to remember today. The second was a quote from Jim Geringer, former governor of Wyoming and ESRI's director of policy and public sector:

I want to remember that one in the same context as finding is not enough.

The other thing Joe posted was a good deal longer, but poses a really interesting question: Is Google Earth a Public Trust? in the sense that public bodies are starting to rely on privately funded base map and imagery databases for mission-critical applications in emergency management. He makes the point that they already rely on the computing infrastructure, but that external bodies managing public information is quite a step further. It's certainly something to think about.

by Bryan Lawrence : 2008/02/22 : Categories ndg environment computing : 0 trackbacks : 2 comments (permalink)

## authkit and pylons don't quite fit

Background - I'm using genshi as my templating engine in pylons 0.9.6.1 and I want to authkit to do access control and authentication. This is in the context of pyleo.

I'm following the guidance in the draft pylons book.

Problem of the day: wrapping the signin template in the site look-n-feel. This is slightly less than trivial because the signin template is produced by authkit, but it doesn't have easy direct access to the pylons templating system because pylons is yet to be instantiated (in the middleware stack).

The recommended way of doing it is to create a file called "template" (in pyleo's lib directory) which loads what is needed to control the signin template (in a function called "make_template"), and point to that using

authkit.form.template.obj = pyleo.lib.template:make_template


in the development.ini file so that authkit can render a nice sign-in page.

There are a few problems in the current version of the guidance:

1. The current version of the doc wrongly has "template" instead of "make_template" after the colon in the development.ini config file.

2. For genshi, we don't want to call a template called "/signin.mako" we want to call "signin",

3. if your site banner wants to look at the c or g variable, you have to do rather better than the State variable pretending to be c in the example template file. At the very least you need to add a __getitem__ method so that calls to c.something in your site templating code won't break, even if they don't work ... You might also need access to the pylons globals ...

At this stage, my template.py which provides the render function at the authkit level looks like this:

import pylons
from pylons.templating import Buffet
from pylons import config
import pyleo.lib.helpers as h
from pyleo.lib.app_globals import Globals

class MyBuffet(Buffet):
    def _update_names(self, ns):
        return ns

def_eng = config['buffet.template_engines'][0]

buffet = MyBuffet(
    def_eng['engine'],
    template_root=def_eng['template_root'],
    **def_eng['template_options']
)

for e in config['buffet.template_engines'][1:]:
    buffet.prepare(
        e['engine'],
        template_root=e['template_root'],
        alias=e['alias'],
        **e['template_options']
    )

class State:
    def __getitem__(self, v):
        return ''

c = State()

g = Globals()

def make_template():
    '''In the following call, namespace is a dictionary of stuff for the templating
    engine ... which is why c is a (nearly) empty class, and h is the normal helper'''
    return buffet.render(
        template_name="signin",
        namespace=dict(h=h, g=g, c=State())
    ).replace("%", "%%").replace("FORM_ACTION", "%s")


Now it nearly works properly, but the pyleo site template currently uses the pylons c variable to produce a menu which is data dependent and obviously that doesn't work properly. We need to work out some way to get at that from "outside" pylons (which is where authkit lives). While that's a problem that can wait, it's a problem that needs solving ...

## Finding is not enough

It seems to be my day for wanting to quote folk:

On why finding something is not enough (via Bill de hOra) Samuel Johnson (1753):

I saw that one enquiry only gave occasion to another, that book referred to book, that to search was not always to find, and to find was not always to be informed.

This is useful for me in the context of, yet again, having to describe search behaviour to yet another audience.

## on global warming urgency

Steven Sherwood (in a Science letter), with respect to the urgency for action in response to global warming:

Greater urgency comes from the rapid growth rate (especially in the developing world) of the very infrastructure that is so problematic. Mitigating climate change is often compared to turning the Titanic away from an iceberg. But this "Titanic" is getting bigger and less maneuverable as we wait--and that causes prospects to deteriorate nonlinearly, and on a time scale potentially much shorter than the time scale on which the system itself responds.

## Data Publication

As part of the claddier project, we have been working on a number of issues associated with how one publishes data (as opposed to journal articles which may or may not describe data).

We're about to submit this paper to a reputable journal. Some of you will recognise concepts (and maybe even some text) from various blog postings over the years. Any constructive comments would be gratefully received!

This paper presents a discussion of the issues associated with formally publishing data in academia. We begin by motivating the reasons why formal publication of data is necessary, which range from simple fact that it is possible, to the changing disciplinary requirements. We then discuss the meaning of publication and peer review in the context of data, provide a detailed description of the activities one expects to see in the peer review of data, and present a simple taxonomy of data publication methodologies. Finally, we introduce the issues of dataset granularity, transience and semantics in the context of a discussion of how to cite datasets, before presenting recommended citation syntax.

## Moving Modelling Forward ... in small steps

I'm in the midst of a series of "interesting" meetings about technology, modelling, computing, and collaboration ... Confucian times indeed.

Last week, we had a meeting to try and elaborate on the short and medium-term NERC strategy for informatics and data. For some reason, NERC uses the phrase "informatics" to mean "model development" (it ought to be more inclusive of other activities, and perhaps it is, but it's not obvious that all involved think that way). As it happens, we didn't spend much time discussing data, in part because from the point of view of the research programme in technology, the main issue at the moment is to improve the national capability in that area (i.e. through improvements and extensions to the NERC DataGrid and other similar programmes).

Anyway, in terms of "informatics" strategy we came up with three goals:

• In terms of general informatics, to avoid losing the impetus given to environmental informatics by the e-Science programme,

• To try and increase the number of smart folk in our community who are capable of both leading and carrying out "numerically-rich" research programmes (i.e. more people who can carry our model development forward). We thought an initial approach of more graduate students in this area followed by a targeted programme might make a big difference.

• To try and identify some criteria by which we could evaluate improvement in model codes (in particular, if we want adaptive meshes etc, which ones, and how should we decide?). (Michael you ought to like that one :-)

This was in the context of trying to ensure that NERC improves the flexibility and agility (and performance) of its modelling framework so it can start to answer interesting questions about regional climate change. Doing so will undoubtedly stretch our existing modelling paradigms, particularly as we try and take advantage of new computer hardware.

During the meeting we all had our list of issues contributing to the discussion. This was my list of things to concentrate on:

• Improving our high resolution modelling (learning from and exploiting HIGEM).

• Improving our (the UK research community outside the Met Office) ability to contribute to AR5 simulations.

• Improving our ability to work with international projects like Earth System Grid (data handling) and PRISM (model coupling). (We - the UK - are involved with both, but not enough).

• Data handling for irregular grids.

• Model metadata (a la NumSim, PRISM, METAFOR).

• Future Computing Issues in general, but in particular:

• Massive parallelism on chip ... where we might expect memory issues: "Shared memory systems simply won't survive the exponential rise in core counts." (steve dekorte via Patrick Logan.)

• Better dynamic cores

• Better use of cluster grids and university supercomputing (not just the national services; this will require much more portable code than we have now, and not a little validation of the models on each and every new architecture).

• i.e. better coding standards ...

• Better ensemble management and error reporting (Michael's bad experience is not dissimilar to folk here with the Unified Model).

• Learning the lessons of the GENIE project(s).

• Handling massive increases in data volumes.

• With consequential issues for transport and archival

• and the requirement to better exploit server-side data services

• Much better model componentisation and coupler(s).

And like everyone else, I wanted to know where are the smart folk to do all this?

Then today, we had an initial discussion about procuring a new computing resource with the Met Office (which, by the way, doesn't preclude our involvement in other national computing services, far from it). There isn't much I can say about this discussion, as much of it was in confidence, but suffice to say, it was all about how we can exploit a shared system on which we would be running the met office models for joint programmes ... of course it's that very same model which most certainly needs a technology refresh :-)

On Friday, we'll be discussing the new NERC-Met Office joint climate research programme ... (which will be one of the programmes exploiting the new system).

## Using more computer power, revisited.

In the comments to my post on why climate modelling is so hard, Michael Tobis made a few points that need a more elaborate response (in time and text) than was appropriate for the comments section, so this is my attempt to deal with them. But before I do, let me reiterate that I don't disagree that there are substantial things that could and should be done to improve the way we do climate modelling. Where the contention lies may be in our expectations of what improvements are reasonable, and hence perhaps in our differing definitions of what might count as impressive further progress.

Before I get into the details of my response, I'm going to ask you to read an old post of mine. Way back in January 2005, I tried to summarise the issues associated with where best to put the effort on improving models: into resolution, ensembles or physics?

Ok, now you've read that. Three years on, would I update that blog entry? I don't think so. I don't think changing the modelling paradigm (coding methods etc) would change the fundamentals of the time taken to do the integrations, although it might well change our ability to assess changes and improve them; but I've already said I think that's a few percent advantage. So, in practice, we can change the paradigm, but the questions still remain: ensembles, resolution or physics? Where to put the effort?
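The arithmetic underlying that trade-off is worth making explicit. Roughly, doubling horizontal resolution doubles both horizontal dimensions and (via the CFL condition) halves the stable timestep, so the cost goes up by about a factor of eight, while ensemble members and vertical levels scale linearly. A back-of-envelope sketch (the scaling is deliberately crude; real models have overheads this ignores):

```python
def relative_cost(res_factor, ensemble_size=1, vertical_factor=1.0):
    """Very rough cost scaling for a grid-point model.

    Increasing horizontal resolution by res_factor multiplies both
    horizontal dimensions by res_factor and, via the CFL condition,
    divides the stable timestep by res_factor: a factor of res_factor**3.
    Ensemble members and vertical levels scale linearly.
    """
    return res_factor ** 3 * vertical_factor * ensemble_size

# Doubling resolution costs about the same as an 8-member ensemble
# at the original resolution: hence the competition for effort.
assert relative_cost(2) == relative_cost(1, ensemble_size=8)
```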

Ok, now to Michael's points:

Do you think existing codes are validated? In what sense and by what method?

In the models with which I am familiar, I would expect that every code module which can be tested physically against inputs and outputs has been tested for a reasonable range of inputs. That is to say, someone has used some test cases (not a complete set; in some cases the set of possible inputs is a large proportion of the entire domain of all possible model states, i.e. the module can't be formally validated!) and tested the output for physical consistency and maybe even conservation of some relevant properties. There is no doubt in my mind that this procedure can be improved by better use of unit testing (Why is that if statement there? What do you expect it to do? Can we produce a unit test?), but in the final analysis, most code modules are physically validated, not computationally or mathematically validated. In most physical parameterisations, I suspect that's simply going to remain the case ...

Then, the parameterisation has been tested against real cases. Ideally in the same parameter space in which it should have been used. For an example of how I think this should be done, you can see Dean et al, 2007, where we have nudged a climate model to follow real events so we can test a new parameterisation. This example shows the good and bad: the right thing to do, and the limits of how well the parameterisation performed. It's obviously better, but not yet good enough ... there is much opportunity for Kaizen available in climate models, and this sort of procedure is where hard yards need to be won ... (but it clearly isn't a formal validation, and we will find cases where it's broken and needs fixing, but we'll only find those when the model explores that parameter space for us ... we'll come back to that).

(For the record, I think this sort of nudging is really important, which is why I recently had a doctoral student at Oxford working on this. With more time, I'd return to it).

It might be possible to write terser code (maybe by two orders of magnitude, i.e. 10K lines of code instead of 1M lines of code).

While I think this is desirable, I think the parameterisation development and evaluation wouldn't have been much improved (although there is no doubt it would have helped Jonathan, the doctoral student, if the nudging code could have gone into a tidier model).

The value of generalisation and abstraction is unappreciated, and the potential value of systematic explorations of model space is somehow almost invisible, or occasionally pursued in a naive and unsophisticated way.

I don't think that the value is unappreciated. There are two classes of problem: exploring the (input and knob-type) parameters within a parameterisation, and exploring the interaction of the parameterisations (and those knobs). The former we do as well as is practicable, and I certainly don't think the latter is invisible (e.g. Stainforth et al, 2004 from ClimatePrediction.net and Murphy et al, 2004 from the Met Office Hadley Centre QUMP project). You might argue that one or both of those are naive and unsophisticated. I would ask for a concrete example of how else we would do this. Leaving aside the issue of code per se, we are stuck with core plus parameterisations - plural - aren't we?

(if) there is no way forward that is substantially better than what we have ... I think the earth system modeling enterprise has reached a point of diminishing returns where further progress is likely to be unimpressive and expensive ...

I'm not convinced that what we have is so bad. We need to cast the question in terms of what goals are we going to miss, that another approach will allow us to hit?

Which brings us to your point

... If regional predictions cannot be improved, global projections will remain messy,

True.

... time to fold up the tent and move on to doing something else... the existing software base can be cleaned up and better documented, and then the climate modeling enterprise should then be shut down in favor of more productive pursuits.

I think we're a long way from having to do this! There is much that can and will be done from where we are now.

I have very serious doubts about the utility of ESMs built on the principles of CGCMs. We are looking at platforms five or six orders of magnitude more powerful than today's in the foreseeable future. If we simply throw a mess of code that wastes those orders of magnitude on unconstrained degrees of freedom, we will have nothing but a waste of electricity to show for our efforts.

I don't think anyone is planning on wasting the extra computational power, and I think my original blog entry shows at least one community was thinking, and I know (since I'm off to yet another procurement meeting next week) continues to think, very seriously about how to exploit improving computer power.

On what grounds do you think improving the models, and their coupling, will not result in utility?

## Whither service descriptions

(Warning, this is long ...)

Last week I submitted an abstract to the EGU meeting in April, in The Service Oriented Architecture approach for Earth and Space Sciences (ESSI10) session. I'd been asked to submit something, but I fear I may be a bit of a cuckoo in the SOA nest ... (if by SOA, we take a traditional definition of SOA=SOAP+WS-*).

The abstract can be summarised even more briefly in two sentences:

• there is a long tail of activities for which the ability of web services to open up interoperability is being hindered by the difficulty of both service and data description, so

• there is a requirement for both more sophisticated and more easily understandable data and service descriptions.

Some might argue that the latter is a heresy. Amusingly, I wrote that before I opened up my feed aggregator this week to find another raft of postings about service description languages. Probably the best, and easily most relevant, from my point of view, was Mark Nottingham, who wrote by far the most sage stuff. I'll quote some of it below. It looks like there was an ongoing discussion that hit the big time when Ryan Tomayko wrote an amusing summary which was picked up by Sam Ruby and Tim Bray.

It's hard to provide a summary that is more succinct than the material in those links and adds value (to me, remember these are my notes, I don't care about you1) so I won't! But I will write this, because I don't particularly want to wade through them all again.

#### On Resources

The first point to note is that most of the proponents of service description languages (particularly those from a RESTful heritage) are finally realising that it's not just about the verbs, the nouns matter too! It's fine to argue that you don't need a service description language because we should all use REST, but the resources themselves can be far more complicated beasts than standard mime-types, and so they need description too.

Mark Nottingham said it best in his summary:

Coming from the other direction, another RESTful constraint is to have a limited set of media types. This worked really well for the browser Web, and perhaps in time we'll come up with a few document formats that describe the bulk of the data on the planet, as well as what you can do with it.

However, I don't mean "XML" or even "RDF" by "format" in that sentence; those are the easy parts, because they're just meta-formats. The hard part is agreeing upon the semantics of their contents, and judging by the amount of effort it's taken for things like UBL, I'd say we're in for a long wait.

I've wittered on about the importance of this before, again and again. However, there are fundamental problems with using XML to describe resources. I've alluded to this issue too, but along with the summary by James Clark, I liked the way that it was put here:

One of the big problems with XML is that it is a horrid match with modern data structures. You see, it is not that it isn't trivial to figure a way to serialize your data to XML; it is just that, left to their own devices, everyone would end up doing it slightly differently. There is no one-true-serialization. So, eventually, you end up having to write code to build your data structures from the XML directly. The problem there is that virtually all XML APIs are horrible for this kind of code. They are all designed from the XML perspective, not from the data serialization perspective.

It gets worse. XML is one of those things that looks really easy, but is actually full of nasty surprises that don't show up until either the week before you ship (or worse, a few weeks after). Things like character encoding issues, XML Namespaces, XSD Wildcards. It is really hard for your average developer (who makes no pretenses at XML guru-hood) to write good XML serialization/hydration code. Everything is stacked against him: XML APIs, XML-Lang itself, XSD.

At one time, I think I understood what it meant "Share schema not type", but now I don't ...
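
The "no one-true-serialization" point in that quote is easy to demonstrate with the standard library. Here's the same (made-up) record serialized to XML in two equally defensible ways; code consuming one form breaks on the other:

```python
# Two defensible serializations of the same record (the record itself is
# invented for illustration): fields as attributes versus fields as child
# elements. Neither is "the" XML form of the data.
import xml.etree.ElementTree as ET

record = {"name": "eXist", "version": "1.1"}

# Style 1: fields as attributes on a single element
attr_style = ET.Element("database", record)

# Style 2: fields as child elements
elem_style = ET.Element("database")
for key, value in record.items():
    ET.SubElement(elem_style, key).text = value

print(ET.tostring(attr_style, encoding="unicode"))
print(ET.tostring(elem_style, encoding="unicode"))
```

Both documents carry identical information, which is exactly why agreeing the semantics (which form, which names, which types) is the hard part, not the syntax.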

#### On the Service Description Language itself

Well, I've tried to review this sort of thing before. Since, then WADL has hit the big time. From a semantic point of view, I can't say I understand the big differences between WSDL and WADL, although I can appreciate that the WADL syntax is much simpler (and so it's a good thing).

Some folk, sadly including Joe Gregorio (whose work I mostly admire), have made a big deal out of the fact that there is no point generating code from WSDL (or WADL or any service description language), because if you do, when the service changes the WSDL (should) change, and so your code will need regeneration or otherwise your service will break. I think that's tosh (it's true, but still tosh, best put by Robert Sayre in a comment):

HTML forms are service declarations.

No one is arguing that we don't need HTML forms! The fact is that clients will break when services change! Sure, some changes won't break the clients, but some will! The issue really comes down to "How well coupled are your services and clients?" and, if they are strongly coupled, "Will you know when the service changes, and can you fix your client if it does?" From experience of using WSDL and SOAP (yuck), I know I'd MUCH rather simply get the new WSDL and regenerate the interface types ... than muck around at a deep level. (That said, I'm not arguing in favour of SOAP per se! Today's war story about SOAP and WSDL is one set of new discovery client developers complaining about our "inconsistent use of camelcase" in our WSDL ... it seems that they're hand crafting to the WSDL, and they want us to break all the other clients to fit their coding standards).

Of course, me wanting to use a service description language presupposes I've used my human ability to read the documentation (if it exists, or the WSDL if I really have to), to decide whether such a solution is the "right thing to do".

#### What does this mean to me?

At the moment we use WSDL and SOAP in our discovery service. I'd much rather we didn't (see above). It could be RESTful POX, which is how we've implemented our vocabulary service (but inconsistent camel case would still break things). It probably will change one day. More importantly, for the data handling services we're currently using OGC services, where the "service description language" is the GetCapabilities document. One thing I do know (and here I violently disagree with Joe G) is that it would be much easier to use a generic service description language than the hodgepodge of GetCapabilities documents we deal with. I think OGC GetCapabilities is an existence proof that a generic service description language would be a Good Thing (TM)! In the final analysis, that's probably what I'll say in April (as well as "SOAP sucks" and "You need both GML and data modelling").

1: I do really :-) (ret).

by Bryan Lawrence : 2008/01/22 : Categories ndg computing xml : 1 trackback : 6 comments (permalink)

## from every direction it's mitigation and acknowledgement

Most mornings now I get between half an hour and an hour of time to myself: between feeding my baby boy, who wakes up around 5.30 to 6 am, and getting my daughter up around 7 to 7.30 am ... I mostly spend the time reading, coding (for pleasure ... I have made some significant progress on the new leonardo), and just cogitating.

This morning it was reading; and it seemed like from every direction we have climate change adaptation and mitigation issues:

• Paul Ramsey on climate change, peak oil and the deep ocean (I read Paul for his commentary on GIS and Postgres ...)

• New Scientist noting that we may be near peak coal (full article at author's website) (who needs to explain why they read the new scientist?)

• The Observer reporting that the Severn Tidal power scheme takes another step towards actuality ... (when in the UK you do have to read a Sunday newspaper, they are the best in the world ...).

• Joe Gregorio on batteries (Joe writes sage stuff on python, web services and much else).

Now the thing is, I found all of them in my thirty minutes this morning. It's an eclectic bunch of sources, but that's my point! (... and yes, it is unusual for me to read things from the NS, the Observer and my akregator all within thirty minutes ... and none of the readings from the latter were from my environmental folder).

## Why is climate modelling stuck?

Why is climate modelling stuck? Well, I would argue it's not stuck, so a better question might be: "Why is climate modelling so hard?". Michael Tobis is arguing that a modern programming language and new tools will make a big difference. Me, I'm not so sure. I'm with Gavin. So here is my perspective on why it's hard. It is of necessity a bit of an abstract argument ...

1. We need to start with the modelling process itself. We have a physical system with components within it. Each physical component needs to be developed independently, checked independently ... This is a scientific, then a computational, then a diagnostic problem.

2. Each component needs to talk to other components, so there needs to be a communication infrastructure which couples components. Michael has criticised ESMF (and by implication PRISM and OASIS etc), but regardless of how you do it, you need a coupling framework. This is a computational problem. I think it's harder than Michael thinks it is. Those ESMF and PRISM folks are not stupid ...

3. All those independently checked components may behave in different ways when coupled to other components (their interactions are nonlinear). Understanding those interactions takes time. This is a scientific and diagnostic problem.

4. We need a dynamical core. It needs to be fast, efficient, mass preserving, and stable in a computational sense. Stability is a big problem, given that the various parameterisations will perturb it in ways that are quite instability inducing. This is both a mathematical and a computational problem.

5. We need to worry about memory. We need to worry a lot about memory actually. If in our discussion we're going to get excited about scalability in multi-core environments, then yes, I can have 80 (pick a number) cores on my chip, but can I have enough memory and memory bandwidth to exploit them? How do we distribute our memory around our cores?

6. What about I/O bandwidth? Without great care, the really big memory-hungry climate models can end up spinning empty CPU cycles waiting for I/O. This is a computational problem.

Every time we add a new process, we require more memory. The pinch points change and are very architecture dependent. Every time we change the resolution, nearly every component needs to be re-evaluated. This takes time.

At this point, we've not really talked about code per se. All that said, the concepts of software engineering do map onto much of what is (or should be) going on. Yes, scientists should build unit tests for their parameterisations. Yes, there should be system/model wide tests. Yes, task tracking and code control would help. But, every time we change some code there may be ramifications we don't understand, not only in terms of logical (accessible in computer science terms) consequences, but from a scientific point of view, there might be some non-linear (and inherently unpredictable) consequences. Distinguishing the two takes time, and I totally agree that better use of code maintenance tools would improve things, but sadly I think it would be a few percent improvement ... since most of the things I've listed above are not about code per se, they're about the science and the systems.

So, personally I don't think it's the time taken to write lines of code that makes modelling so hard. Good programmers are productive in anything. I suspect changing to python wouldn't make a huge difference to the model development cycle. That said, anyone who writes diagnostic code in Fortran, really ought to go on a time management course: yes learning a high level language (python) takes time, but it'll save you more ... but the reason for that is we write diagnostic code over and over. Core model code isn't written over and over ... even if it's agonised over and over :-)

Someone in one of the threads on this subject mentioned XML. Given that there might be a climate modeller or two reading this: let me assure you, XML solves nothing in this space. XML provides a syntax for encoding something; the hard part of this problem is deciding what to encode. That is, the hard part of the problem is the semantic description of whatever it is you want to encode (and developing an XML language to encapsulate your model of the model: remember, XML is only a toolkit, it's not a solution). If you want to use XML in the coupler, what do you need to describe to couple two (arbitrary) components? If it's the code itself, and you plan to write a code generator, then what is it you want to describe? Is it really that much easier to write a parameterisation for gravity wave drag in a new code generation language? What would you get from having done so?

So what is the way forward? Kaizen: small continuous improvements. Taking small steps we can go a long way ... Better coupling strategies. Better diagnostic systems. Yes: Better coding standards. Yes: more use of code maintenance tools. Yes: Better understanding of software engineering, but even more importantly: better understanding of the science (more good people)! Yes: Couple code changes to task/bug trackers. Yes: formal unit tests. No: Let's not try the cathedral approach. The bazaar has got us a long way ...

(Disclosure: I was an excellent fortran programmer, and a climate modeller. I guess I'm a more than competent python programmer, and I'm sadly expert with XML too. I hope to be a modeller again one day).

## Walking the Leonardo File System in Pylons

Now that we have access to the filesystem, the next step to porting is to get a pylons controller set up that can walk the filesystem ...

Start by installing pylons (in this case, 0.9.6):

easy_install Pylons


(watch that capitalisation: don't waste time with easy_install pylons ...)!

Now create the application ... in my case in a special sandbox directory called pylons ... and get a simple controller up ...

cd ~/sandboxes/pylons
paster create --template=pylons pyleo template_engine=genshi
cd pyleo
paster controller main


At this point it's worth fixing something that will cause you grief later: we've decided on genshi, so we need an __init__.py in the pyleo/templates directory. Create one now. It can be empty. We won't need it til later ... but best get it done now!
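
For reference, the shell version of that step (assuming you're still in the pyleo directory created above; the `mkdir -p` is just belt and braces, since paster will already have made the templates directory):

```shell
# genshi template lookup will want this package marker later
mkdir -p pyleo/templates
touch pyleo/templates/__init__.py
```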

At this point if we start the paster server up with

paster serve --reload development.ini


We can see a hello world on http://localhost:5000/main and a pylons default page on http://localhost:5000

The next step is to get rid of the pylons default page, and make our main controller handle pretty much everything (we wouldn't normally do this, but we're porting leonardo not starting something new). We do this by replacing the lines after CUSTOM ROUTES HERE in config/routing.py with:

map.connect('', controller='main')
map.connect('*url', controller='main')


and removing public/index.html.

Now http://localhost:5000/anything gives us 'Hello World'.

The next step is to get hold of the path and echo it instead of 'Hello World'. We do that by accessing the pylons request object in our main controller, which we have available since in main.py we inherit from the base controller.

So instead of

    return 'Hello World'

we have

    path=request.environ['PATH_INFO']
    return path


And the next step is to pass it to a simple genshi template to echo it. We do this by

• making a simple template. Here is one (path.html which lives in the pyleo/templates directory):

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:py="http://genshi.edgewall.org/"
      xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
<body>
<div class="mainPage">
<h4> $c.path </h4>
</div>
</body>
</html>


• and call it from main.py after assigning the path value to something in the c object (itself visible to the template). Replace those two lines in main.py that we just replaced before, with:

c.path=request.environ['PATH_INFO']
return render('path')


And now we're using Genshi to show us our path.

The next step is to bring the leonardo file system into play, so we put filesystem.py into the model directory. (As an aside: Pylons is Model-View-Controller. If that doesn't ring any bells, see this). Note that the MVC stuff is inside a directory embedded one more level than one might expect. That's why we have this weird structure: sandboxes/pylons/pyleo is the source for a new (distributable) egg, and sandboxes/pylons/pyleo/pyleo is where our pylons application (with MVC) lives.

Realistically, next time I do this, I won't be repeating every identical step, so I'm bound to stuff up. I might need to get some decent debugging. Set this by editing the development.ini file so that [logger_root] has level set to DEBUG rather than INFO. Be warned; it results in verbiage on the console!

Right, back to our thread ... In this first stab, we'll use filesystem.py as our model, and we'll simply put in place a view which walks the content and then displays the text for the moment (without a wiki formatter). Nice and straightforward. We simply modify our existing template, path.html:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:py="http://genshi.edgewall.org/"
      xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
<!-- Simple genshi template for walking the leonardo file system -->
<!-- We'll be using the javascript helpers later, so let's make sure we have them -->
${Markup(h.javascript_include_tag(builtins=True))}
<body>
<div class="mainPage">
<h4> $c.path </h4>
<ol py:if="c.files!=[]">
<li py:for="f in c.files">Page:<a href="${f['relpath']}">${f['title']}</a></li>
</ol>
<ol py:if="c.dirs!=[]">
<li py:for="d in c.dirs">Directory:<a href="${d['relpath']}">${d['title']}</a></li>
</ol>
</div>
</body>
</html>


and our main controller. It's all in the main controller for now. We'll change that later. Meanwhile, our main controller now looks like this:

import logging

from pyleo.lib.base import *
from pyleo.model.filesystem import *

log = logging.getLogger(__name__)

class MainController(BaseController):

    def index(self):
        ''' Essentially we're bypassing all the Routes goodness and using this
        main controller to handle most of the Leonardo functionality '''
        c.path=request.environ['PATH_INFO']

        #Later we'll move this elsewhere so it doesn't get called every time ...
        self.lfs=LeonardoFileSystem('/home/bnl/sandboxes/pyleo/data/lfs/')

        #ok, what have we got?
        dirs,files=self.lfs.get_children(c.path.strip('/'))

        if (dirs,files)==([],[]): return self.getPage()

        c.dirs=[]
        c.files=[]
        for d in dirs:
            x={}
            x['relpath']=os.path.join(c.path,d)
            x['title']=d
            c.dirs.append(x)
        for f in files:
            x={}
            x['relpath']=os.path.join(c.path,f)
            leof=self.lfs.get(f)
            #print leof.get_properties()
            x['title']=(leof.get_property('page_title') or f)
            c.files.append(x)
        return render('path')

    def getPage(self):
        ''' Return an actual leonardo page '''
        leof=self.lfs.get(c.path)
        c.content=leof.get_content()
        if c.content is None:
            response.status_code=404
            return
        else:
            #This is a leo file instance
            c.content_type=leof.get_content_type()

            if c.content_type.startswith('wiki'):
                # for now, just return the text ... without formatting etc ...
                return c.content
            else:
                t={'png':'image/png','jpg':'image/jpg','jpeg':'image/jpeg',
                   'pdf':'application/pdf','xml':'application/xml',
                   'css':'text/css','txt':'text/plain',
                   'htm':'text/html','html':'text/html'}
                if c.content_type in t:
                    return c.content



and believe it or not, that's all we need to walk our filesystem and return the contents ... it won't be a big step to add the formatter to get html, and to start thinking about how we want the layout to look in template terms. (We'll need to make sure we use a config file to locate our lfs as well).
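
On that config point, here's one sketch of the idea, using an invented lfs_root key under [app:main] (read with the stdlib parser rather than anything Pylons-specific, so treat the details as an assumption, not the Pylons API):

```python
# Sketch only: keep the lfs location in development.ini instead of
# hard-coding it in the controller. The lfs_root key is invented; [app:main]
# is where a Pylons development.ini keeps application-level settings.
from configparser import ConfigParser

ini_text = """
[app:main]
lfs_root = /home/bnl/sandboxes/pyleo/data/lfs/
"""

config = ConfigParser()
config.read_string(ini_text)
lfs_root = config["app:main"]["lfs_root"]
# the controller would then do: self.lfs = LeonardoFileSystem(lfs_root)
print(lfs_root)
```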

The steps after that will be to add the login, upload, posting, trackback, feed generation etc ... none of which should be a big deal ... but I don't expect to do them quickly :-)

Update: I'm pleased to report that while this is a fork from the leonardo trunk, as is the django version, all three code stacks are now jointly hosted on google. The pylons code is in the pyleo_trunk and this is rev 464 (I haven't worked out the subversion revision syntax for the web interface).

## the leonardo file system

The most difficult thing about porting leonardo is interfacing with the leonardo file system (lfs). The lfs was designed to allow multiple backends through a relatively simple interface ... of course it's not properly documented anywhere, so remembering how it works was a bit difficult. The following piece of code shows the general principle:

from filesystem import LeonardoFileSystem
import sys,os.path

def WalkAndReport(leodir,inipath='/'):
    ''' Walks a leonardo filesystem and reports the contents in the same way
    as doing ls -R would do '''

    def walk(lfs,path):
        directories,files=lfs.get_children(path)
        for f in files:
            leof=lfs.get(os.path.join(path,f))
            #The following is the actual content at the path ... if it exists.
            #It's what you would feed to a presentation layer ...
            content=leof.get_content()
            print '%s (%s)'%(f,leof.get_content_type())
            for p in leof.get_properties(): print '---',p,leof.get_property(p)
        for d in directories:
            leod=os.path.join(path,d)
            print '*** %s ***  (%s)'%(d,leod)
            walk(lfs,leod)

    lfs=LeonardoFileSystem(leodir)
    walk(lfs,inipath)

if __name__=="__main__":
    lfsroot=sys.argv[1]
    if len(sys.argv)==3:
        inipath=sys.argv[2]
    else: inipath='/'
    WalkAndReport(lfsroot,inipath)



While I'm at it, I'd better document a small bug in the leonardo file system itself that manifested itself on this blog (python 2.4.3 on Suse 10) but nowhere else ... the comments came back in the wrong order. The following diff on filesystem.py fixed that:

     def enclosures(self, enctype):
+        #BNL: modified to reorder by creation date, since we can't
+        #rely on the name or operating system.
         enc_list = []
         for d in os.listdir(self.get_directory_()):
             match = re.match("__(\w+)__(\d+)", d)
             if match and enctype == match.group(1):
                 index = match.group(2)
-                enc_list.append(self.enclosure(enctype, index))
-        return enc_list
+                e = self.enclosure(enctype, index)
+                sort_key = e.get_property('creation_time')
+                enc_list.append((sort_key, e))
+        enc_list.sort()
+        return [i[1] for i in enc_list]
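
The change above is the classic decorate-sort-undecorate pattern: pair each item with its sort key, sort the pairs (tuples compare element-wise), then strip the keys off. A standalone sketch with invented comment data:

```python
# Decorate-sort-undecorate, as in the diff above. The comment data is
# invented; an index is included in each tuple so that ties on the key
# never fall through to comparing the items themselves.
comments = [
    {"text": "third", "creation_time": "2008-12-03"},
    {"text": "first", "creation_time": "2008-12-01"},
    {"text": "second", "creation_time": "2008-12-02"},
]

decorated = [(c["creation_time"], i, c) for i, c in enumerate(comments)]
decorated.sort()
ordered = [c for _, _, c in decorated]

# the modern equivalent is a one-liner with a key function:
ordered2 = sorted(comments, key=lambda c: c["creation_time"])
```

Both give the comments back in creation order; the key-function form is what you'd write today, but the tuple form is what the 2008-era diff does.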


## Playing with pylons and leonardo

I've suddenly been granted a couple of hours I didn't expect, so I thought I'd take the first steps towards forking leonardo (sorry James), so that we have a pylons version. I know James has a Django version, but I want pylons for a number of reasons:

• I've done a lot of work with pylons. I understand it.

• I have a number of extensions already to James' codebase (including full trackback inbound and outbound). (James' subversion got broken: we never resolved it, so they never got committed back).

• I want to build an egg which allows external templating (i.e. you can completely control the look and feel via a genshi template, or use the default within the egg).

• I want to do all of this so I have a nice small job which exercises and documents all the skills I've built up building in the NDG portal (in my spare time) over the last few months.

• I want to cache documents more efficiently (trivial in pylons).

• I want to be able to produce archivable versions for previous years (not so trivial). Tim Bray reminded us all that this is important!

I expect it will take months to do what I expect to be a few hours coding :-( I wonder how well this will compare with previous announcements!

I guess we'll need a new name. pyleo will do, in this case for pylons leonardo.

## virtualenv

One more thing to remember. I'm going to be building pyleo using pylons 0.9.6.1, but the ndg stuff (also on my laptop) is using pylons 0.9.5. Library incompatibility is scary. Fortunately, we have virtualenv to the rescue.

Using virtualenv, I can build a python instance that is independent of the stuff built into my main python (which is a virtual-python for historical reasons). It's better than virtual-python because I get the benefit of things in my system site-packages that I've installed since I installed my virtual python.

What to remember? (I really will be forgetting things like this when there are weeks between activity ... in particular, this way I'll know which python to use!)

Well, I built my new virtualenv instance by typing

python virtualenv.py pyleo


and I can change into it any time I like with

source ~/pyleo/bin/activate


I expect I'll be able to ensure I use this python in my (test) webserver when I get to it. It looks like I need to adjust the path to the libraries inside the outer script with

import site